
In [0]:
# libraries
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# **Data Set**

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null

In [0]:
# drop RISK_MM column (Recommendation from data description in Kaggle)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       142193 non-null  object 
 1   MinTemp        141556 non-null  float64
 2   MaxTemp        141871 non-null  float64
 3   Rainfall       140787 non-null  float64
 4   Evaporation    81350 non-null   float64
 5   Sunshine       74377 non-null   float64
 6   WindGustDir    132863 non-null  object 
 7   WindGustSpeed  132923 non-null  float64
 8   WindDir9am     132180 non-null  object 
 9   WindDir3pm     138415 non-null  object 
 10  WindSpeed9am   140845 non-null  float64
 11  WindSpeed3pm   139563 non-null  float64
 12  Humidity9am    140419 non-null  float64
 13  Humidity3pm    138583 non-null  float64
 14  Pressure9am    128179 non-null  float64
 15  Pressure3pm    128212 non-null  float64
 16  Cloud9am       88536 non-null   float64
 17  Cloud3pm       85099 non-null

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

# **Trial 1**
- keep the outliers
- impute the missing categorical values with mode
- impute the missing numerical values with median

**Create Preprocessor : ColumnTransformer of numerical and categorical features**

In [0]:
numerical_features = [x for x in X.columns if df[x].dtype != 'object']

numeric_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='median')),
          ('scaler', MinMaxScaler())
])

categorical_features = [x for x in X.columns if df[x].dtype == 'object']

categorical_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='most_frequent')),
          ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
          ('num', numeric_transformer, numerical_features),
          ('cat', categorical_transformer, categorical_features)
    ]
)
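
To make the effect of this preprocessor concrete, here is a minimal sketch on a toy frame. The toy values and the names `toy` and `toy_preprocessor` are invented for illustration; only the structure mirrors the cell above. The numeric gap is filled with the median and scaled to [0, 1], and the categorical gap is filled with the mode before one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (values invented for illustration) with a gap in each column type.
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 20.0, 30.0],
    "WindGustDir": ["W", "E", np.nan, "W"],
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", MinMaxScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric, ["MinTemp"]),
    ("cat", categorical, ["WindGustDir"]),
])

out = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on its density; force dense here.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Column 0 is the scaled MinTemp; the remaining columns are the one-hot
# indicators for the categories E and W (in sorted order).
print(out)
```

The missing `MinTemp` becomes the median (20.0) before scaling, and the missing `WindGustDir` becomes the most frequent value (`W`).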

**Create model**

In [0]:
model = LogisticRegression(solver='sag', max_iter=500, n_jobs = -1)

**Create Pipeline**

In [0]:
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

**Cross Validation**

In [10]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.8326301  0.68994374 0.83410689 0.71158309 0.81559885 0.5278149
 0.67156621 0.6257824  0.71763134 0.83409522]
mean :  0.7260752747680471
std :  0.09830128530393102


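Note : The mean accuracy is about 0.73, but the fold scores range from 0.53 to 0.83 (std about 0.10), so performance varies widely between folds. The mean alone does not show that the model has low variance.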
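
One possible contributor to the spread (an assumption, not checked against this data set) is the fold layout. With an integer `cv`, `cross_val_score` uses stratified folds without shuffling for a classifier, so if the rows are grouped by location, each fold tests mostly on locations the model has not seen. The sketch below uses invented toy data to show how an explicit shuffled `StratifiedKFold` changes the spread; `X_toy`, `y_toy` and `toy_model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Toy data (invented): 10 "locations" stored one after another, each with its
# own baseline, mimicking a file that is sorted by location.
offsets = np.linspace(-3.0, 3.0, 10)
n_per_group = 500
X_parts, y_parts = [], []
for offset in offsets:
    x = rng.normal(offset, 1.0, size=n_per_group)
    noise = rng.normal(0.0, 0.5, size=n_per_group)
    X_parts.append(x.reshape(-1, 1))
    # The label depends on x relative to the location's own baseline.
    y_parts.append((x - offset + noise > 0).astype(int))
X_toy = np.vstack(X_parts)
y_toy = np.concatenate(y_parts)

toy_model = LogisticRegression()

# Integer cv: unshuffled stratified folds (roughly one location per fold).
unshuffled = cross_val_score(toy_model, X_toy, y_toy, cv=10)

# Explicit splitter: rows are shuffled before being assigned to folds.
shuffled_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
shuffled = cross_val_score(toy_model, X_toy, y_toy, cv=shuffled_cv)

print('unshuffled std : ', unshuffled.std())
print('shuffled std   : ', shuffled.std())
```

Whether shuffling is appropriate depends on the goal: unshuffled folds are closer to testing on unseen locations, while shuffled folds are closer to what a random `train_test_split` measures.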

**Training and Test Sets**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))





Training and Test Sets result

accuracy score :  0.8447554414712191

confusion matrix : 
 [[20840  1191]
 [ 3224  3184]]

classification report : 
               precision    recall  f1-score   support

           0       0.87      0.95      0.90     22031
           1       0.73      0.50      0.59      6408

    accuracy                           0.84     28439
   macro avg       0.80      0.72      0.75     28439
weighted avg       0.83      0.84      0.83     28439
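
The figures in the report follow directly from the confusion matrix. As a small worked check (the matrix is copied from the output above; rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[20840, 1191],
               [3224, 3184]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision_1 = tp / (tp + fp)         # of the days predicted as class 1, share that were class 1
recall_1 = tp / (tp + fn)            # of the true class 1 days, share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print('accuracy    : ', accuracy)
print('precision 1 : ', precision_1)
print('recall 1    : ', recall_1)
print('f1 1        : ', f1_1)
```

The recall line is the one to watch: about half of the class 1 days are missed even though the overall accuracy is 0.84.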



In [12]:
print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))

Training set score :  0.8489283893313642
Test set score :  0.8447554414712191


Note : The training score (0.849) and test score (0.845) are nearly equal, so the model is not overfitting.

# Trial 2 
- drop the outliers from df
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [29]:
# dropping the outliers

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_drop_outliers = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1)]
df_drop_outliers

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23
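
The filter above calls `quantile` and the comparisons on the whole frame, including the object columns. Depending on the pandas version, that may need `numeric_only=True` or may raise an error. A minimal sketch that restricts the rule to numeric columns is shown below; `drop_iqr_outliers` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_iqr_outliers(frame, k=1.5):
    """Return the rows of `frame` with no numeric value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN are False, so rows with missing values are kept.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return frame[~outside.any(axis=1)]

# Toy frame (values invented for illustration).
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "B", "C"],
    "Rainfall": [0.0, 0.2, 0.0, np.nan, 0.4, 50.0],
    "MaxTemp": [21.0, 22.0, 23.0, 24.0, 25.0, 26.0],
})
kept = drop_iqr_outliers(toy)
print(kept)
```

As in the original cell, every numeric column takes part in the rule, so `Year`, `Month` and `Day` would be included unless they are excluded beforehand.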


In [0]:
X = df_drop_outliers.drop(['RainTomorrow'], axis=1)
y = df_drop_outliers['RainTomorrow']

**Cross Validation**

In [31]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.86424265 0.76247031 0.85236616 0.78722821 0.86113649 0.72300384
 0.70847798 0.6459894  0.74666545 0.86477844]
mean :  0.781635892366505
std :  0.07339684435367744


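Note : Dropping the outliers raises the mean cross-validation accuracy from 0.726 to 0.782 and lowers the standard deviation from 0.098 to 0.073. The outlier rows are also removed from the validation folds, so the scores are not measured on the same data as in Trial 1.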

**Training and Test Sets**

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))



Training and Test Sets result

accuracy score :  0.8726932212680432

confusion matrix : 
 [[17800   568]
 [ 2219  1305]]

classification report : 
               precision    recall  f1-score   support

           0       0.89      0.97      0.93     18368
           1       0.70      0.37      0.48      3524

    accuracy                           0.87     21892
   macro avg       0.79      0.67      0.71     21892
weighted avg       0.86      0.87      0.86     21892





In [16]:
print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))

Training set score :  0.8719494786848927
Test set score :  0.8726932212680432


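Note : The training score (0.872) and test score (0.873) are nearly equal, so the model is not overfitting. Recall for the rain class, however, falls from 0.50 in Trial 1 to 0.37.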

# Trial 3
- drop the outliers from X_train
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [17]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23


In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [49]:
X_train

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day
89080,GoldCoast,18.3,19.8,4.6,,,NW,35.0,NNW,NW,17.0,17.0,99.0,83.0,1017.8,1015.0,,,18.7,17.7,Yes,2012,5,25
56495,Ballarat,7.6,25.1,0.0,,,NW,33.0,N,W,4.0,11.0,81.0,37.0,1008.5,1005.8,,4.0,14.9,23.0,No,2016,12,1
63581,MelbourneAirport,7.7,13.4,0.0,1.8,1.1,SW,39.0,WSW,WSW,19.0,11.0,75.0,83.0,1021.0,1019.0,7.0,7.0,10.1,12.2,No,2011,5,26
21452,NorfolkIsland,15.5,20.3,3.0,4.4,8.6,S,39.0,S,S,22.0,17.0,52.0,53.0,1023.9,1021.9,2.0,6.0,18.2,19.0,Yes,2011,5,16
2498,Albury,18.1,32.8,0.0,,,SE,39.0,SE,SSW,11.0,11.0,56.0,30.0,1012.1,1009.6,,1.0,24.0,30.5,No,2016,1,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17553,Newcastle,25.1,40.7,2.2,,,,,,,0.0,,44.0,,,,6.0,,32.4,,Yes,2017,1,31
38527,WaggaWagga,8.8,16.0,2.4,2.0,,N,30.0,WNW,NW,9.0,19.0,85.0,48.0,1016.2,1014.4,8.0,1.0,12.0,15.2,Yes,2016,8,10
131812,Launceston,4.4,14.5,0.8,,,ENE,13.0,,,0.0,0.0,98.0,79.0,,,8.0,8.0,7.5,13.6,No,2014,7,7
1510,Albury,1.7,20.6,0.0,,,NE,22.0,ENE,NNW,7.0,6.0,79.0,35.0,1027.4,1020.9,,,9.1,20.4,No,2013,5,3


In [50]:
# append y_train into X_train to form a dataset and to handle the outliers
X_train['RainTomorrow'] = y_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
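
The warning appears because a column is assigned in place on a frame that pandas regards as derived from another frame. One way to avoid it, sketched here on toy stand-ins (`X_part`, `y_part` and `train_df` are invented names), is to build a new combined frame instead of modifying `X_train`:

```python
import pandas as pd

# Toy stand-ins (invented) for X_train and y_train after a split.
X_part = pd.DataFrame({"MinTemp": [13.4, 7.4, 12.9]}, index=[5, 2, 9])
y_part = pd.Series(["No", "Yes", "No"], index=[5, 2, 9], name="RainTomorrow")

# join aligns on the index and returns a new frame, so X_part itself is untouched.
train_df = X_part.join(y_part)
print(train_df)
```

Calling `X_train = X_train.copy()` before the assignment is another common option.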
  


In [51]:
X_train

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
89080,GoldCoast,18.3,19.8,4.6,,,NW,35.0,NNW,NW,17.0,17.0,99.0,83.0,1017.8,1015.0,,,18.7,17.7,Yes,2012,5,25,Yes
56495,Ballarat,7.6,25.1,0.0,,,NW,33.0,N,W,4.0,11.0,81.0,37.0,1008.5,1005.8,,4.0,14.9,23.0,No,2016,12,1,No
63581,MelbourneAirport,7.7,13.4,0.0,1.8,1.1,SW,39.0,WSW,WSW,19.0,11.0,75.0,83.0,1021.0,1019.0,7.0,7.0,10.1,12.2,No,2011,5,26,Yes
21452,NorfolkIsland,15.5,20.3,3.0,4.4,8.6,S,39.0,S,S,22.0,17.0,52.0,53.0,1023.9,1021.9,2.0,6.0,18.2,19.0,Yes,2011,5,16,No
2498,Albury,18.1,32.8,0.0,,,SE,39.0,SE,SSW,11.0,11.0,56.0,30.0,1012.1,1009.6,,1.0,24.0,30.5,No,2016,1,25,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17553,Newcastle,25.1,40.7,2.2,,,,,,,0.0,,44.0,,,,6.0,,32.4,,Yes,2017,1,31,No
38527,WaggaWagga,8.8,16.0,2.4,2.0,,N,30.0,WNW,NW,9.0,19.0,85.0,48.0,1016.2,1014.4,8.0,1.0,12.0,15.2,Yes,2016,8,10,No
131812,Launceston,4.4,14.5,0.8,,,ENE,13.0,,,0.0,0.0,98.0,79.0,,,8.0,8.0,7.5,13.6,No,2014,7,7,Yes
1510,Albury,1.7,20.6,0.0,,,NE,22.0,ENE,NNW,7.0,6.0,79.0,35.0,1027.4,1020.9,,,9.1,20.4,No,2013,5,3,No


In [52]:
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1
X_train_drop_outliers = X_train[~((X_train < (Q1 - 1.5*IQR)) | (X_train > (Q3 + 1.5*IQR))).any(axis=1)]
X_train_drop_outliers

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
56495,Ballarat,7.6,25.1,0.0,,,NW,33.0,N,W,4.0,11.0,81.0,37.0,1008.5,1005.8,,4.0,14.9,23.0,No,2016,12,1,No
63581,MelbourneAirport,7.7,13.4,0.0,1.8,1.1,SW,39.0,WSW,WSW,19.0,11.0,75.0,83.0,1021.0,1019.0,7.0,7.0,10.1,12.2,No,2011,5,26,Yes
2498,Albury,18.1,32.8,0.0,,,SE,39.0,SE,SSW,11.0,11.0,56.0,30.0,1012.1,1009.6,,1.0,24.0,30.5,No,2016,1,25,No
49515,Tuggeranong,9.2,25.9,0.0,,,E,46.0,S,SW,20.0,7.0,47.0,24.0,1017.6,1014.6,,,13.2,24.4,No,2014,1,7,No
1578,Albury,5.0,14.8,0.0,,,ENE,17.0,,ESE,0.0,7.0,100.0,74.0,1029.4,1026.1,5.0,6.0,8.0,14.0,No,2013,7,13,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94608,Adelaide,10.0,13.6,1.2,2.8,6.0,WNW,57.0,WNW,WNW,24.0,28.0,64.0,69.0,1010.2,1011.4,,,12.2,12.8,Yes,2010,8,19,Yes
14441,Moree,7.3,21.9,0.0,,,NE,33.0,E,ESE,19.0,11.0,63.0,43.0,1030.3,1026.2,,1.0,15.8,21.1,No,2016,9,6,No
37139,WaggaWagga,10.2,14.3,0.6,3.0,0.0,WNW,54.0,N,NNW,9.0,22.0,72.0,78.0,1007.8,1005.0,8.0,7.0,12.7,13.8,No,2012,8,23,Yes
131812,Launceston,4.4,14.5,0.8,,,ENE,13.0,,,0.0,0.0,98.0,79.0,,,8.0,8.0,7.5,13.6,No,2014,7,7,Yes


In [0]:
X_train = X_train_drop_outliers.drop(['RainTomorrow'], axis=1)
y_train = X_train_drop_outliers['RainTomorrow']

**Cross Validation on Training set**

In [56]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.86918771 0.86884497 0.86713127 0.8694162  0.87421456 0.87170113
 0.86953045 0.86735976 0.8728436  0.87145795]
mean :  0.8701687588021532
std :  0.0021986105847968743


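Note : The fold scores are tightly clustered (mean 0.870, std 0.002), which suggests low variance. The validation folds come from the outlier-free training set, so these scores are not directly comparable with the score on the untouched test set.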

**Training and Test sets**

In [57]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8340658954252963

confusion matrix : 
 [[20046  2075]
 [ 2644  3674]]

classification report : 
               precision    recall  f1-score   support

          No       0.88      0.91      0.89     22121
         Yes       0.64      0.58      0.61      6318

    accuracy                           0.83     28439
   macro avg       0.76      0.74      0.75     28439
weighted avg       0.83      0.83      0.83     28439

Training set score :  0.8707856824595277
Test set score :  0.8340658954252963


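Note : The training score (0.871) is about 0.04 above the test score (0.834).  
The gap is modest, and the test set still contains the outliers that were removed from the training set, so this is not strong evidence of overfitting.  
Still, let's try adding regularization.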

In [63]:
# C is the inverse of the regularization strength (smaller C = stronger regularization)

# using the same preprocessor
model = LogisticRegression(solver='sag', max_iter=500, n_jobs = -1, C=0.01)
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.832905517071627

confusion matrix : 
 [[20363  1758]
 [ 2994  3324]]

classification report : 
               precision    recall  f1-score   support

          No       0.87      0.92      0.90     22121
         Yes       0.65      0.53      0.58      6318

    accuracy                           0.83     28439
   macro avg       0.76      0.72      0.74     28439
weighted avg       0.82      0.83      0.83     28439

Training set score :  0.865347484833598
Test set score :  0.832905517071627


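Note : Adding regularization (C=0.01) brings the training and test scores slightly closer (0.865 vs 0.833, compared with 0.871 vs 0.834), mainly by lowering the training score.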
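
Only one value of `C` is tried above. A minimal sketch of comparing several values with cross-validation is given below. It runs on synthetic data from `make_classification` and uses the default solver to stay fast; `demo_clf`, `X_demo` and `y_demo` are hypothetical names, and in the notebook the search would be fitted on `X_train`, `y_train` with the real pipeline instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the weather data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

demo_clf = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=500)),
])

# "model__C" addresses the C parameter of the pipeline step named "model".
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(demo_clf, param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

print('best C : ', search.best_params_["model__C"])
print('best mean cv accuracy : ', search.best_score_)
```

Because the search only sees the training data, the test set stays untouched until the final evaluation.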

# Trial 4
- cap the outliers in df
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [64]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23


In [0]:
# cap the outliers
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

features = lower_cap.index.values

for feature in features :
  df[feature] = np.where(df[feature]<lower_cap[feature],lower_cap[feature], df[feature])
  df[feature] = np.where(df[feature]>upper_cap[feature],upper_cap[feature], df[feature])
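
The same capping can be written without an explicit loop. The sketch below uses `DataFrame.clip` with per-column bounds on a toy frame (values invented; `capped` and `num_cols` are hypothetical names), as an alternative to the `np.where` loop rather than a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy frame (values invented for illustration).
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "C"],
    "Rainfall": [0.0, 0.2, 0.0, 0.4, 50.0],
    "WindGustSpeed": [30.0, 35.0, np.nan, 40.0, 45.0],
})

num_cols = toy.select_dtypes(include="number").columns
q1 = toy[num_cols].quantile(0.25)
q3 = toy[num_cols].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

capped = toy.copy()
# axis=1 aligns the bound Series with the column labels; NaN values stay NaN.
capped[num_cols] = toy[num_cols].clip(lower=lower, upper=upper, axis=1)
print(capped)
```

In the toy frame the extreme `Rainfall` value is pulled down to the upper cap, while the missing `WindGustSpeed` is left for the imputer.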

In [107]:
df.shape

(142193, 25)

In [108]:
df_drop_outliers = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1)]
df_drop_outliers.shape

(142193, 25)

Note : No rows are dropped, so the outliers were capped successfully.

In [109]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008.0,12.0,1.0
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008.0,12.0,2.0
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008.0,12.0,3.0
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008.0,12.0,4.0
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008.0,12.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017.0,6.0,20.0
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017.0,6.0,21.0
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017.0,6.0,22.0
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017.0,6.0,23.0


In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

**Cross Validation**

In [111]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.83185654 0.78769339 0.83410689 0.81876363 0.83001618 0.83128209
 0.77108095 0.81630213 0.82108446 0.84190168]
mean :  0.8184087934677446
std :  0.02115354600608366


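Note : The fold scores are more consistent (mean 0.818, std 0.021).  
The result is better than keeping the outliers (0.726) or dropping them from df (0.782). Note that clf appears to be the regularized pipeline (C=0.01) defined at the end of Trial 3, so the comparison with Trials 1 and 2 is not strictly like for like.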

**Training and Test Sets**

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('\nTraining set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8426456626463659

confusion matrix : 
 [[21002  1067]
 [ 3408  2962]]

classification report : 
               precision    recall  f1-score   support

          No       0.86      0.95      0.90     22069
         Yes       0.74      0.46      0.57      6370

    accuracy                           0.84     28439
   macro avg       0.80      0.71      0.74     28439
weighted avg       0.83      0.84      0.83     28439


Training set score :  0.8439615310230849
Test set score :  0.8426456626463659


Note : The training score (0.844) and test score (0.843) are nearly equal, so the model is not overfitting.

# Trial 5
- cap the outliers in X_train
- cap the outliers in X_test using the upper_cap and lower_cap of X_train
- impute the missing categorical values with mode
- impute the missing numerical values with median
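
The idea of learning the caps on the training data and reusing them on the test data can also be expressed as a pipeline step. The sketch below is one possible way to do that, not the approach used in this notebook: `IQRCapper` is a hypothetical custom transformer, and the toy arrays are invented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class IQRCapper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn IQR caps in fit, apply them in transform."""

    def __init__(self, k=1.5):
        self.k = k

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # nanpercentile ignores missing values when computing the quartiles.
        q1 = np.nanpercentile(X, 25, axis=0)
        q3 = np.nanpercentile(X, 75, axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # np.clip leaves NaN as NaN, so imputation can still follow.
        return np.clip(X, self.lower_, self.upper_)

numeric_transformer_capped = Pipeline(steps=[
    ("capper", IQRCapper()),
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", MinMaxScaler()),
])

# Toy training and test columns (values invented for illustration).
X_tr = np.array([[0.0], [0.2], [0.0], [0.4], [50.0], [np.nan]])
X_te = np.array([[100.0], [-5.0], [0.3]])

numeric_transformer_capped.fit(X_tr)
capper = numeric_transformer_capped.named_steps["capper"]
capped_test = capper.transform(X_te)
print('caps : ', capper.lower_, capper.upper_)
print('capped test values : ', capped_test.ravel())
```

With the capper inside the pipeline, `cross_val_score` would re-learn the caps on each training fold and apply them to the matching validation fold, which is the same train-only rule that this trial applies by hand.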

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [0]:
# drop RISK_MM column (Recommendation from data description in Kaggle)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       142193 non-null  object 
 1   MinTemp        141556 non-null  float64
 2   MaxTemp        141871 non-null  float64
 3   Rainfall       140787 non-null  float64
 4   Evaporation    81350 non-null   float64
 5   Sunshine       74377 non-null   float64
 6   WindGustDir    132863 non-null  object 
 7   WindGustSpeed  132923 non-null  float64
 8   WindDir9am     132180 non-null  object 
 9   WindDir3pm     138415 non-null  object 
 10  WindSpeed9am   140845 non-null  float64
 11  WindSpeed3pm   139563 non-null  float64
 12  Humidity9am    140419 non-null  float64
 13  Humidity3pm    138583 non-null  float64
 14  Pressure9am    128179 non-null  float64
 15  Pressure3pm    128212 non-null  float64
 16  Cloud9am       88536 non-null   float64
 17  Cloud3pm       85099 non-null

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [120]:
# append y_train into X_train to form a dataset and to handle the outliers
X_train['RainTomorrow'] = y_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [121]:
X_train

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
118889,Perth,15.5,20.9,0.2,0.6,0.0,WNW,24.0,NE,NNW,13.0,11.0,75.0,81.0,1010.7,1008.4,7.0,8.0,18.2,20.1,No,2011,10,26,Yes
124806,Walpole,13.3,27.8,0.0,,,ENE,37.0,NE,S,9.0,13.0,66.0,71.0,1016.0,1011.3,,,21.3,24.7,No,2011,11,2,No
100839,Nuriootpa,0.3,12.0,0.0,0.8,8.9,SSE,24.0,,ENE,0.0,7.0,90.0,55.0,1029.5,1026.9,2.0,2.0,7.0,11.6,No,2011,6,14,No
134988,AliceSprings,22.6,31.7,0.0,16.8,1.7,SE,31.0,SE,SE,13.0,20.0,10.0,15.0,1015.0,1011.6,7.0,7.0,25.5,28.8,No,2014,11,27,No
138997,Darwin,24.6,34.4,0.0,4.8,11.3,E,28.0,ESE,NNW,7.0,15.0,70.0,62.0,1009.7,1005.9,3.0,2.0,28.3,32.2,No,2017,3,1,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31642,Sydney,16.1,29.7,0.2,7.2,10.4,NE,31.0,W,NE,19.0,19.0,58.0,33.0,1021.0,1018.7,1.0,5.0,19.9,28.7,No,2014,3,18,No
90808,GoldCoast,16.2,21.9,21.6,,,ESE,50.0,ENE,SE,26.0,26.0,82.0,99.0,1023.6,1020.9,,,20.5,18.1,Yes,2017,6,11,Yes
114472,PearceRAAF,11.4,24.6,0.0,,12.6,WSW,43.0,E,SW,17.0,30.0,46.0,44.0,1019.0,1015.5,,,17.0,21.6,No,2016,11,7,No
36090,WaggaWagga,5.6,14.5,0.0,2.0,1.9,WSW,30.0,W,SW,11.0,17.0,86.0,62.0,1014.0,1012.1,8.0,7.0,7.6,14.0,No,2009,9,9,No


In [122]:
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

features = lower_cap.index.values

for feature in features :
  X_train[feature] = np.where(X_train[feature]<lower_cap[feature],lower_cap[feature], X_train[feature])
  X_train[feature] = np.where(X_train[feature]>upper_cap[feature],upper_cap[feature], X_train[feature])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [123]:
X_train.shape

(113754, 25)

In [124]:
X_train_drop_outliers = X_train[~((X_train < (Q1 - 1.5*IQR)) | (X_train > (Q3 + 1.5*IQR))).any(axis=1)]
X_train_drop_outliers.shape

(113754, 25)

Note : No rows are dropped, so the outliers were capped successfully.

In [0]:
# get y_train from X_train
X_train = X_train_drop_outliers.drop(['RainTomorrow'], axis=1)
y_train = X_train_drop_outliers['RainTomorrow']

**Cross Validation on Training set**

In [128]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.84379395 0.84045359 0.84194796 0.84502461 0.84483516 0.84043956
 0.84615385 0.84052747 0.84131868 0.84483516]
mean :  0.8429330002627472
std :  0.002108412164477752


Note : The fold scores are tightly clustered (mean 0.843, std 0.002), which suggests low variance.

**Training and Test sets**

In [130]:
# first, cap the outliers in the test set using the caps learned from X_train

for feature in features :
  X_test[feature] = np.where(X_test[feature]<lower_cap[feature],lower_cap[feature], X_test[feature])
  X_test[feature] = np.where(X_test[feature]>upper_cap[feature],upper_cap[feature], X_test[feature])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [131]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8425050107247091

confusion matrix : 
 [[20970  1041]
 [ 3438  2990]]

classification report : 
               precision    recall  f1-score   support

          No       0.86      0.95      0.90     22011
         Yes       0.74      0.47      0.57      6428

    accuracy                           0.84     28439
   macro avg       0.80      0.71      0.74     28439
weighted avg       0.83      0.84      0.83     28439

Training set score :  0.8440230673207096
Test set score :  0.8425050107247091


Note : There is no overfitting.
The training and test scores (0.844 vs 0.843) are closer than when the outliers were dropped from X_train (0.871 vs 0.834).