<a href="https://colab.research.google.com/github/WoradeeKongthong/raining_tomorrow_classification/blob/master/02_Raining_LogisticRegression_plus_FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# libraries
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# **Data Set**

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null

In [0]:
# drop RISK_MM column (Recommendation from data description in Kaggle)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       142193 non-null  object 
 1   MinTemp        141556 non-null  float64
 2   MaxTemp        141871 non-null  float64
 3   Rainfall       140787 non-null  float64
 4   Evaporation    81350 non-null   float64
 5   Sunshine       74377 non-null   float64
 6   WindGustDir    132863 non-null  object 
 7   WindGustSpeed  132923 non-null  float64
 8   WindDir9am     132180 non-null  object 
 9   WindDir3pm     138415 non-null  object 
 10  WindSpeed9am   140845 non-null  float64
 11  WindSpeed3pm   139563 non-null  float64
 12  Humidity9am    140419 non-null  float64
 13  Humidity3pm    138583 non-null  float64
 14  Pressure9am    128179 non-null  float64
 15  Pressure3pm    128212 non-null  float64
 16  Cloud9am       88536 non-null   float64
 17  Cloud3pm       85099 non-null

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

# **Trial 1**
- keep the outliers
- impute the missing categorical values with mode
- impute the missing numerical values with median

**Create Preprocessor : ColumnTransformer of numerical and categorical features**

In [0]:
numerical_features = [x for x in X.columns if df[x].dtype != 'object']

numeric_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='median')),
          ('scaler', MinMaxScaler())
])

categorical_features = [x for x in X.columns if df[x].dtype == 'object']

categorical_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='most_frequent')),
          ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
          ('num', numeric_transformer, numerical_features),
          ('cat', categorical_transformer, categorical_features)
    ]
)

**Create model**

In [0]:
model = LogisticRegression(solver='sag', max_iter=500, n_jobs = -1)

**Create Pipeline**

In [0]:
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

**Cross Validation**

In [10]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.8326301  0.68994374 0.83410689 0.71158309 0.81559885 0.5278149
 0.67156621 0.6257824  0.71763134 0.83409522]
mean :  0.7260752747680471
std :  0.09830128530393102


Note : It looks like the model has not high bias and variance

**Training and Test Sets**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))





Training and Test Sets result

accuracy score :  0.8493617919054819

confusion matrix : 
 [[20902  1194]
 [ 3090  3253]]

classification report : 
               precision    recall  f1-score   support

          No       0.87      0.95      0.91     22096
         Yes       0.73      0.51      0.60      6343

    accuracy                           0.85     28439
   macro avg       0.80      0.73      0.76     28439
weighted avg       0.84      0.85      0.84     28439



In [12]:
print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))

Training set score :  0.8478734813720836
Test set score :  0.8493617919054819


Note : The model is not overfitting

# Trial 2 
- drop the outliers from df
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [13]:
# drop the outliers from df

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_drop_outliers = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1)]
df_drop_outliers

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23


In [0]:
X = df_drop_outliers.drop(['RainTomorrow'], axis=1)
y = df_drop_outliers['RainTomorrow']

**Cross Validation**

In [15]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.86232414 0.71395944 0.78503563 0.70281381 0.85254888 0.68189293
 0.5458615  0.52685913 0.73488032 0.8648698 ]
mean :  0.7271045575534767
std :  0.11496545129877667


Note : Dropping the outliers improves the model mean accuracy but the variance is higher.

**Training and Test Sets**

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))





Training and Test Sets result

accuracy score :  0.8712771788781289

confusion matrix : 
 [[17795   547]
 [ 2271  1279]]

classification report : 
               precision    recall  f1-score   support

          No       0.89      0.97      0.93     18342
         Yes       0.70      0.36      0.48      3550

    accuracy                           0.87     21892
   macro avg       0.79      0.67      0.70     21892
weighted avg       0.86      0.87      0.85     21892



In [17]:
print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))

Training set score :  0.8725090502129798
Test set score :  0.8712771788781289


Note : The model is not overfitting

# Trial 3
- drop the outliers from X_train
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [18]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23


In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [21]:
# create temp_df to combine X_train and y_train and cap the outliers
temp_df = X_train
temp_df['RainTomorrow'] = y_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [22]:
temp_df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
35906,WaggaWagga,17.6,31.3,0.0,7.4,11.6,E,43.0,E,NNE,30.0,7.0,56.0,26.0,1019.8,1016.2,0.0,1.0,20.6,30.6,No,2009,3,9,No
63019,MelbourneAirport,4.4,21.4,0.0,4.8,9.1,S,46.0,N,N,28.0,26.0,54.0,30.0,1021.2,1015.7,1.0,6.0,13.7,18.5,No,2009,10,11,Yes
80764,Dartmoor,13.7,20.6,0.0,2.8,0.8,SSE,41.0,ESE,SE,22.0,22.0,100.0,76.0,1010.4,1008.7,,,16.6,19.0,No,2014,12,7,No
103951,Woomera,9.3,25.7,0.0,8.0,9.7,SSW,24.0,ESE,NW,15.0,13.0,66.0,21.0,1017.6,1013.9,0.0,4.0,14.3,24.1,No,2011,10,8,No
86105,Cairns,18.1,25.9,0.2,3.0,10.2,NE,26.0,SE,NE,13.0,13.0,73.0,61.0,1013.3,1009.4,1.0,1.0,22.1,25.1,No,2012,6,4,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107687,Albany,14.8,22.5,0.0,7.4,13.0,,,NE,E,28.0,20.0,64.0,72.0,1015.9,1011.8,0.0,0.0,20.0,21.2,No,2013,11,16,No
59447,Bendigo,3.8,16.0,0.0,,,N,13.0,SSE,NNW,2.0,6.0,89.0,62.0,1031.0,1028.5,8.0,8.0,10.6,15.6,No,2016,9,6,No
49442,Tuggeranong,15.2,19.1,4.4,,,W,52.0,W,W,28.0,30.0,59.0,55.0,1006.7,1007.6,,,16.3,17.7,Yes,2013,10,23,No
129072,Hobart,8.9,15.8,0.0,2.2,3.7,WSW,44.0,NW,SW,9.0,13.0,57.0,53.0,1018.1,1019.4,7.0,7.0,11.1,13.8,No,2015,4,27,No


In [23]:
Q1 = temp_df.quantile(0.25)
Q3 = temp_df.quantile(0.75)
IQR = Q3 - Q1
temp_df_drop_outliers = temp_df[~((temp_df < (Q1 - 1.5*IQR)) | (temp_df > (Q3 + 1.5*IQR))).any(axis=1)]
temp_df_drop_outliers

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
35906,WaggaWagga,17.6,31.3,0.0,7.4,11.6,E,43.0,E,NNE,30.0,7.0,56.0,26.0,1019.8,1016.2,0.0,1.0,20.6,30.6,No,2009,3,9,No
63019,MelbourneAirport,4.4,21.4,0.0,4.8,9.1,S,46.0,N,N,28.0,26.0,54.0,30.0,1021.2,1015.7,1.0,6.0,13.7,18.5,No,2009,10,11,Yes
80764,Dartmoor,13.7,20.6,0.0,2.8,0.8,SSE,41.0,ESE,SE,22.0,22.0,100.0,76.0,1010.4,1008.7,,,16.6,19.0,No,2014,12,7,No
103951,Woomera,9.3,25.7,0.0,8.0,9.7,SSW,24.0,ESE,NW,15.0,13.0,66.0,21.0,1017.6,1013.9,0.0,4.0,14.3,24.1,No,2011,10,8,No
86105,Cairns,18.1,25.9,0.2,3.0,10.2,NE,26.0,SE,NE,13.0,13.0,73.0,61.0,1013.3,1009.4,1.0,1.0,22.1,25.1,No,2012,6,4,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123826,SalmonGums,4.4,16.2,0.6,,,SSW,37.0,W,SW,7.0,7.0,93.0,77.0,,,,,11.8,13.6,No,2017,5,27,No
107687,Albany,14.8,22.5,0.0,7.4,13.0,,,NE,E,28.0,20.0,64.0,72.0,1015.9,1011.8,0.0,0.0,20.0,21.2,No,2013,11,16,No
59447,Bendigo,3.8,16.0,0.0,,,N,13.0,SSE,NNW,2.0,6.0,89.0,62.0,1031.0,1028.5,8.0,8.0,10.6,15.6,No,2016,9,6,No
129072,Hobart,8.9,15.8,0.0,2.2,3.7,WSW,44.0,NW,SW,9.0,13.0,57.0,53.0,1018.1,1019.4,7.0,7.0,11.1,13.8,No,2015,4,27,No


In [0]:
# retrive X_train and y_train from temp_df
X_train = temp_df_drop_outliers.drop(['RainTomorrow'], axis=1)
y_train = temp_df_drop_outliers['RainTomorrow']

**Cross Validation on Training set**

In [25]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.87290168 0.87221651 0.87016101 0.86968936 0.87380082 0.87128826
 0.87174509 0.87460027 0.86866149 0.87094564]
mean :  0.8716010132571356
std :  0.0017556257753764488


Note : The model is low bias and low variance.  
The variance is much lower than Trial 1 and 2.

**Training and Test sets**

In [26]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))





Training and Test Sets result

accuracy score :  0.8296001969126903

confusion matrix : 
 [[19826  2202]
 [ 2644  3767]]

classification report : 
               precision    recall  f1-score   support

          No       0.88      0.90      0.89     22028
         Yes       0.63      0.59      0.61      6411

    accuracy                           0.83     28439
   macro avg       0.76      0.74      0.75     28439
weighted avg       0.83      0.83      0.83     28439

Training set score :  0.872309080319313
Test set score :  0.8296001969126903


Note : Train and test scores are slightly different.  
Test score is lower because I didn't drop the outliers in the test set.  
Think of them as real data to be predicted.  
The model is not overfitting.  
But let's try adding regularization.

In [27]:
# C=inverse regularization strength

# using the same preprocessor
model = LogisticRegression(solver='sag', max_iter=500, n_jobs = -1, C=0.01)
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8268926474207954

confusion matrix : 
 [[20226  1802]
 [ 3121  3290]]

classification report : 
               precision    recall  f1-score   support

          No       0.87      0.92      0.89     22028
         Yes       0.65      0.51      0.57      6411

    accuracy                           0.83     28439
   macro avg       0.76      0.72      0.73     28439
weighted avg       0.82      0.83      0.82     28439

Training set score :  0.8659593663990498
Test set score :  0.8268926474207954


Note : adding regularization makes the scores lower.

# Trial 4
- cap the outliers in df
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [28]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008,12,1
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008,12,2
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008,12,3
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008,12,4
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017,6,20
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017,6,21
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017,6,22
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017,6,23


In [0]:
# cap the outliers
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

features = lower_cap.index.values

for feature in features :
  df[feature] = np.where(df[feature]<lower_cap[feature],lower_cap[feature], df[feature])
  df[feature] = np.where(df[feature]>upper_cap[feature],upper_cap[feature], df[feature])

In [30]:
df.shape

(142193, 25)

In [31]:
df_drop_outliers = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1)]
df_drop_outliers.shape

(142193, 25)

Note : no outliers are dropped, cap outliers successfully

In [32]:
df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,2008.0,12.0,1.0
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,2008.0,12.0,2.0
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,2008.0,12.0,3.0
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No,2008.0,12.0,4.0
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,2008.0,12.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,Uluru,3.5,21.8,0.0,,,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,No,2017.0,6.0,20.0
142189,Uluru,2.8,23.4,0.0,,,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No,2017.0,6.0,21.0
142190,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No,2017.0,6.0,22.0
142191,Uluru,5.4,26.9,0.0,,,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No,2017.0,6.0,23.0


In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

**Cross Validation**

In [34]:
accuracy = cross_val_score(clf,X,y,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.83185654 0.78769339 0.83410689 0.8186933  0.82994585 0.83128209
 0.77108095 0.81630213 0.82108446 0.84190168]
mean :  0.8183947277809877
std :  0.021149587306617656


Note :  low bias and low variance.  
The result is better than keeping or dropping the outliers in df

**Training and Test Sets**

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('\nTraining set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8465839164527585

confusion matrix : 
 [[21148  1085]
 [ 3278  2928]]

classification report : 
               precision    recall  f1-score   support

          No       0.87      0.95      0.91     22233
         Yes       0.73      0.47      0.57      6206

    accuracy                           0.85     28439
   macro avg       0.80      0.71      0.74     28439
weighted avg       0.84      0.85      0.83     28439


Training set score :  0.843047277458375
Test set score :  0.8465839164527585


Note : the model is not overfitting

# Trial 5
- cap the outliers in X_train
- cap the outliers in X_test using the upper_cap and lower_cap of X_train
- impute the missing categorical values with mode
- impute the missing numerical values with median

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [0]:
# drop RISK_MM column (Recommendation from data description in Kaggle)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       142193 non-null  object 
 1   MinTemp        141556 non-null  float64
 2   MaxTemp        141871 non-null  float64
 3   Rainfall       140787 non-null  float64
 4   Evaporation    81350 non-null   float64
 5   Sunshine       74377 non-null   float64
 6   WindGustDir    132863 non-null  object 
 7   WindGustSpeed  132923 non-null  float64
 8   WindDir9am     132180 non-null  object 
 9   WindDir3pm     138415 non-null  object 
 10  WindSpeed9am   140845 non-null  float64
 11  WindSpeed3pm   139563 non-null  float64
 12  Humidity9am    140419 non-null  float64
 13  Humidity3pm    138583 non-null  float64
 14  Pressure9am    128179 non-null  float64
 15  Pressure3pm    128212 non-null  float64
 16  Cloud9am       88536 non-null   float64
 17  Cloud3pm       85099 non-null

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [41]:
# cap the outliers in X_train

Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
IQR = Q3 - Q1

lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

features = lower_cap.index.values

for feature in features :
  X_train[feature] = np.where(X_train[feature]<lower_cap[feature],lower_cap[feature], X_train[feature])
  X_train[feature] = np.where(X_train[feature]>upper_cap[feature],upper_cap[feature], X_train[feature])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [0]:
# retrive y_train and X_train from temp_df
X_train = temp_df.drop(['RainTomorrow'], axis=1)
y_train = temp_df['RainTomorrow']

**Cross Validation on Training set**

In [43]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.84229958 0.84300281 0.84168425 0.84098101 0.83912088 0.84378022
 0.84017582 0.83956044 0.84351648 0.84131868]
mean :  0.8415440178668027
std :  0.0015380418734954025


Note : low bias and very low variance 

**Training and Test sets**

In [44]:
# first, cap the outliers in Tes set 

for feature in features :
  X_test[feature] = np.where(X_test[feature]<lower_cap[feature],lower_cap[feature], X_test[feature])
  X_test[feature] = np.where(X_test[feature]>upper_cap[feature],upper_cap[feature], X_test[feature])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [45]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8423291958226379

confusion matrix : 
 [[21154   945]
 [ 3539  2801]]

classification report : 
               precision    recall  f1-score   support

          No       0.86      0.96      0.90     22099
         Yes       0.75      0.44      0.56      6340

    accuracy                           0.84     28439
   macro avg       0.80      0.70      0.73     28439
weighted avg       0.83      0.84      0.83     28439

Training set score :  0.8424846598800921
Test set score :  0.8423291958226379


Note : There is no overfitting.
The training and test score are closer than dropping the outliers from X_train.