# HW3
 by Mateusz Stączek

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from dalex import Explainer


In [2]:
df = pd.read_csv("australia.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56420 entries, 0 to 56419
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MinTemp        56420 non-null  float64
 1   MaxTemp        56420 non-null  float64
 2   Rainfall       56420 non-null  float64
 3   Evaporation    56420 non-null  float64
 4   Sunshine       56420 non-null  float64
 5   WindGustSpeed  56420 non-null  float64
 6   WindSpeed9am   56420 non-null  float64
 7   WindSpeed3pm   56420 non-null  float64
 8   Humidity9am    56420 non-null  float64
 9   Humidity3pm    56420 non-null  float64
 10  Pressure9am    56420 non-null  float64
 11  Pressure3pm    56420 non-null  float64
 12  Cloud9am       56420 non-null  float64
 13  Cloud3pm       56420 non-null  float64
 14  Temp9am        56420 non-null  float64
 15  Temp3pm        56420 non-null  float64
 16  RainToday      56420 non-null  int64  
 17  RainTomorrow   56420 non-null  int64  
dtypes: float64(16), int64(2)

No nulls detected, and all features are numerical (as stated in the homework description).

In [4]:
df.RainTomorrow.value_counts()

0    43993
1    12427
Name: RainTomorrow, dtype: int64
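
The counts above show that the classes are imbalanced. As a rough reference point, here is a minimal sketch of the accuracy that a trivial "always predict the majority class" rule would reach on the full dataset. The counts are copied from the output above; the variable names are only illustrative, and the exact figure on the test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 43993, 1: 12427}

total = sum(counts.values())

# The majority class is the label with the most records.
majority_class = max(counts, key=counts.get)

# Accuracy of a rule that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

# Share of records where it rains the next day (the minority class).
positive_share = counts[1] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"share of rainy days: {positive_share:.4f}")
```

Any accuracy reported later should be read against this baseline rather than against 50%.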

## Splitting data into train and test sets

In [5]:
X = df.drop(columns=["RainTomorrow"])
y = df.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=3)

About 78% of records (43993 of 56420) have target value $0$, so the classes are imbalanced. This may lower scores such as recall on the minority class for some classifiers. Let's create 3 classifiers.

## Creating and fitting 3 models

In [6]:
# Logistic regression parameters: max iterations 100 -> 1000, tolerance for stopping criteria 1e-4 -> 1e-5
lr_model = LogisticRegression(max_iter=1000, tol=1e-5, n_jobs=-1, random_state=3)
# KNeighbors parameters: n_neighbors 5 -> 8, algorithm auto -> brute
knc_model = KNeighborsClassifier(n_neighbors=8, algorithm='brute', n_jobs=-1)
# LGBM parameters: num_leaves 31 -> 10, n_estimators 100 -> 1000
lgbm_model = LGBMClassifier(num_leaves=10, n_estimators=1000, random_state=3)

In [7]:
def fitmodel(model):
    return model.fit(X_train, y_train)

lr_model = fitmodel(lr_model)
knc_model = fitmodel(knc_model)
lgbm_model = fitmodel(lgbm_model)

## Comparing classifiers

In [8]:
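def compare_models(models: list, X, y):
    """Return one row of dalex performance metrics per model, evaluated on (X, y)."""
    def get_scores(model):
        # model_performance().result is a one-row DataFrame labelled with the model's class name
        return Explainer(model, X, y, verbose=False).model_performance().result
    return pd.concat([get_scores(model) for model in models])

compare_models([lr_model, knc_model, lgbm_model], X_test, y_test)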

| model | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| LogisticRegression | 0.528594 | 0.721018 | 0.609991 | 0.849167 | 0.87378 |
| KNeighborsClassifier | 0.56672 | 0.671214 | 0.614556 | 0.841368 | 0.846966 |
| LGBMClassifier | 0.556394 | 0.733124 | 0.632648 | 0.855814 | 0.891601 |


Are the models visibly overfitted?

In [9]:
compare_models([lr_model, knc_model, lgbm_model], X_train, y_train)

| model | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| LogisticRegression | 0.52972 | 0.728927 | 0.613559 | 0.853509 | 0.882737 |
| KNeighborsClassifier | 0.646483 | 0.746272 | 0.692803 | 0.874136 | 0.926644 |
| LGBMClassifier | 0.682309 | 0.869247 | 0.764516 | 0.907723 | 0.957494 |


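LGBM might be a bit overfitted (train F1 0.765 vs. test F1 0.633), and KNN shows a smaller gap of the same kind, while logistic regression scores are nearly identical on both sets. Overall, everything looks fine.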
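
To put a number on the overfitting question, here is a small sketch that subtracts test scores from train scores. The F1 and AUC values are copied from the two result tables above. The dictionary layout and the name `gaps` are illustrative and not part of the original notebook.

```python
# F1 and AUC copied from the test-set and train-set tables above.
test_scores = {
    "LogisticRegression": {"f1": 0.609991, "auc": 0.87378},
    "KNeighborsClassifier": {"f1": 0.614556, "auc": 0.846966},
    "LGBMClassifier": {"f1": 0.632648, "auc": 0.891601},
}
train_scores = {
    "LogisticRegression": {"f1": 0.613559, "auc": 0.882737},
    "KNeighborsClassifier": {"f1": 0.692803, "auc": 0.926644},
    "LGBMClassifier": {"f1": 0.764516, "auc": 0.957494},
}

# Train-minus-test gap per model and metric; a larger gap hints at more overfitting.
gaps = {
    model: {
        metric: train_scores[model][metric] - test_scores[model][metric]
        for metric in ("f1", "auc")
    }
    for model in test_scores
}

for model, gap in gaps.items():
    print(f"{model}: f1 gap = {gap['f1']:.4f}, auc gap = {gap['auc']:.4f}")
```

A gap close to zero, as for logistic regression, suggests the model generalizes about as well as it fits. What counts as "too large" is a judgment call and is not settled by this sketch.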

## Summary

Given the imbalanced data, F1 is more informative than accuracy. The model with the highest F1 score on the test set is the LGBM classifier (0.633).
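
For reference, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the test-set table and compares the result with the reported F1 column. The helper name `f1_from` is hypothetical and only used for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported f1) copied from the test-set table above.
reported = {
    "LogisticRegression": (0.528594, 0.721018, 0.609991),
    "KNeighborsClassifier": (0.56672, 0.671214, 0.614556),
    "LGBMClassifier": (0.556394, 0.733124, 0.632648),
}

recomputed = {
    model: f1_from(precision, recall)
    for model, (recall, precision, _) in reported.items()
}

for model, value in recomputed.items():
    print(f"{model}: recomputed f1 = {value:.6f}, reported f1 = {reported[model][2]:.6f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a model cannot reach a high F1 by trading almost all recall for precision, which is why it suits the imbalanced target here.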

Fine-tuning hyperparameters would probably yield better scores for every model, but that is beyond the scope of this homework.

Overall, no model can be chosen as clearly "better" than the others, since all three have very similar test scores.