
This notebook builds on the feature engineering from part 02.  
Outliers : cap the outliers in X_train, then cap the outliers in X_test using the boundaries computed from X_train.  
Missing values : impute the missing values in categorical features with the 'most frequent' value,  
and impute the missing values in numerical features with the median.
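
The capping rule described above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the helper names `iqr_bounds` and `cap`, and the toy values, are assumptions made for the example. The key point is that the boundaries are computed once from the training rows and then reused unchanged on the test rows.

```python
import numpy as np
import pandas as pd

def iqr_bounds(frame, k=1.5):
    """Per-column (lower, upper) caps computed from the numeric columns only."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(frame, lower, upper):
    """Return a capped copy; missing values stay missing."""
    capped = frame.copy()
    cols = list(lower.index)
    capped[cols] = capped[cols].clip(lower=lower, upper=upper, axis=1)
    return capped

# hypothetical toy data standing in for X_train / X_test
toy_train = pd.DataFrame({"WindGustSpeed": [20.0, 30.0, 35.0, 40.0, 150.0],
                          "Location": list("ABCDE")})
toy_test = pd.DataFrame({"WindGustSpeed": [25.0, 500.0, np.nan],
                         "Location": list("ABC")})

lower, upper = iqr_bounds(toy_train)          # learned from the training rows only
train_capped = cap(toy_train, lower, upper)
test_capped = cap(toy_test, lower, upper)     # same boundaries reused on the test rows
print(lower["WindGustSpeed"], upper["WindGustSpeed"])
print(test_capped)
```

Learning the boundaries from the training rows alone keeps information about the test distribution out of the preprocessing step.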

In [0]:
# libraries
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# **Data Set**

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/WoradeeKongthong/raining_tomorrow_classification/master/weatherAUS.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null

In [0]:
# drop RISK_MM column (Recommendation from data description in Kaggle)
df.drop(['RISK_MM'], axis = 1, inplace = True)

# Extract Year, Month, Day from Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# drop Date column
df.drop(['Date'], axis = 1, inplace = True)

# select year 2015-2017 to train the model
df = df[df['Year'] > 2014]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43205 entries, 2109 to 142192
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Location       43205 non-null  object 
 1   MinTemp        43002 non-null  float64
 2   MaxTemp        43081 non-null  float64
 3   Rainfall       42799 non-null  float64
 4   Evaporation    19472 non-null  float64
 5   Sunshine       16085 non-null  float64
 6   WindGustDir    40818 non-null  object 
 7   WindGustSpeed  40837 non-null  float64
 8   WindDir9am     40515 non-null  object 
 9   WindDir3pm     41457 non-null  object 
 10  WindSpeed9am   42990 non-null  float64
 11  WindSpeed3pm   41715 non-null  float64
 12  Humidity9am    42696 non-null  float64
 13  Humidity3pm    40781 non-null  float64
 14  Pressure9am    38536 non-null  float64
 15  Pressure3pm    38533 non-null  float64
 16  Cloud9am       25312 non-null  float64
 17  Cloud3pm       22877 non-null  float64
 18  Te

In [0]:
X = df.drop(['RainTomorrow'], axis=1)
y = df['RainTomorrow']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
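
The split above is unseeded, so every run produces different rows and slightly different scores. One option, shown here on a hypothetical toy frame rather than the real data, is to pass `random_state` for reproducibility and `stratify` so that both parts keep the same share of rainy days. The seed value 42 and the toy column values are assumptions made for the sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 20 rainy days out of 100
toy = pd.DataFrame({
    "Humidity3pm": range(100),
    "RainTomorrow": ["Yes"] * 20 + ["No"] * 80,
})
X_toy = toy.drop(columns="RainTomorrow")
y_toy = toy["RainTomorrow"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixed seed: the same rows are selected on every run
    stratify=y_toy,    # keep the Yes/No ratio identical in both parts
)
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```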

In [8]:
# create temp_df to combine X_train and y_train and cap the outliers
# (work on a copy so the assignment does not hit a slice of X and raise SettingWithCopyWarning)
temp_df = X_train.copy()
temp_df['RainTomorrow'] = y_train


In [9]:
temp_df

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day,RainTomorrow
67875,Melbourne,6.5,13.0,1.2,1.6,7.0,SSW,35.0,W,SSW,13.0,13.0,85.0,63.0,1015.2,1015.7,7.0,5.0,8.1,12.2,Yes,2016,8,25,No
120610,Perth,7.0,19.2,0.0,4.0,4.0,WSW,24.0,SW,W,7.0,11.0,74.0,62.0,1025.1,1021.7,8.0,7.0,12.6,17.5,No,2016,9,9,No
138536,Darwin,28.1,35.7,0.0,11.2,8.0,ESE,72.0,W,WNW,13.0,31.0,65.0,55.0,1010.7,1007.3,2.0,7.0,31.8,34.1,No,2015,11,26,Yes
44146,Wollongong,19.9,24.0,0.0,,,S,65.0,E,S,11.0,46.0,72.0,87.0,1009.9,1008.6,,8.0,22.1,19.4,No,2016,11,23,Yes
135658,AliceSprings,14.0,24.0,3.6,0.0,,N,52.0,NNE,NNW,17.0,31.0,77.0,52.0,1015.4,1011.5,8.0,3.0,18.2,24.1,Yes,2016,9,27,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11764,CoffsHarbour,19.0,26.0,4.2,,,SE,22.0,SSW,ESE,11.0,13.0,91.0,75.0,1017.9,1014.2,8.0,,21.0,26.0,Yes,2017,3,2,No
96694,Adelaide,11.7,25.5,0.0,,,N,31.0,NNE,WNW,13.0,17.0,31.0,34.0,1022.4,1018.3,,,18.3,24.9,No,2016,10,14,No
20023,NorahHead,15.7,20.3,5.6,,,S,54.0,WSW,S,13.0,24.0,96.0,84.0,1019.9,1019.6,,,16.0,18.3,Yes,2015,11,8,Yes
116928,PerthAirport,6.6,23.0,0.0,2.0,7.7,SW,31.0,SW,SSW,2.0,13.0,78.0,46.0,1028.8,1026.2,1.0,3.0,15.1,21.6,No,2015,5,9,No


In [10]:
# cap the outliers in Training set

# numeric_only=True skips the object columns (recent pandas versions require it)
Q1 = temp_df.quantile(0.25, numeric_only=True)
Q3 = temp_df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1

lower_cap = Q1 - 1.5*IQR
upper_cap = Q3 + 1.5*IQR

# names of the numerical columns that have boundaries
features = lower_cap.index.values

for feature in features :
  # values below the lower boundary are raised to it, values above the upper boundary are lowered to it
  temp_df[feature] = np.where(temp_df[feature]<lower_cap[feature],lower_cap[feature], temp_df[feature])
  temp_df[feature] = np.where(temp_df[feature]>upper_cap[feature],upper_cap[feature], temp_df[feature])


In [0]:
# get y_train and X_train from temp_df
X_train = temp_df.drop(['RainTomorrow'], axis=1)
y_train = temp_df['RainTomorrow']

In [12]:
# cap outliers on the test set, reusing the boundaries computed from the training set
# (work on a copy to avoid SettingWithCopyWarning)
X_test = X_test.copy()
for feature in features :
  X_test[feature] = np.where(X_test[feature]<lower_cap[feature],lower_cap[feature], X_test[feature])
  X_test[feature] = np.where(X_test[feature]>upper_cap[feature],upper_cap[feature], X_test[feature])


# **Create Model Pipeline**

**Create Preprocessor : ColumnTransformer of numerical and categorical features**

In [0]:
numerical_features = [x for x in X.columns if df[x].dtype != 'object']

numeric_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='median')),
          ('scaler', MinMaxScaler())
])

categorical_features = [x for x in X.columns if df[x].dtype == 'object']

categorical_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='most_frequent')),
          ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
          ('num', numeric_transformer, numerical_features),
          ('cat', categorical_transformer, categorical_features)
    ]
)
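
To see what this preprocessor produces, the same structure can be run on a tiny hypothetical frame. The column values below are invented for the illustration, and `sparse_threshold=0` is added only so that the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Humidity3pm": [40.0, np.nan, 80.0, 60.0],
    "WindGustDir": pd.Series(["N", "S", np.nan, "N"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", MinMaxScaler())]), ["Humidity3pm"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["WindGustDir"]),
    ],
    sparse_threshold=0,  # always return a dense array (for display only)
)

encoded = toy_preprocessor.fit_transform(toy)
print(encoded)  # columns: scaled Humidity3pm, WindGustDir_N, WindGustDir_S

# a direction never seen during fit is encoded as all zeros instead of raising an error
unseen = pd.DataFrame({"Humidity3pm": [100.0],
                       "WindGustDir": pd.Series(["E"], dtype=object)})
encoded_unseen = toy_preprocessor.transform(unseen)
print(encoded_unseen)
```

The missing humidity is filled with the median (60) before scaling, and the missing direction is filled with the most frequent value ('N') before one-hot encoding.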

**Create model**

In [0]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion = 'entropy')

**Create Pipeline**

In [0]:
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

# **Cross Validation on Training set**

In [16]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.78680937 0.78247035 0.78160255 0.7856523  0.7884838  0.79021991
 0.7818287  0.77922454 0.78703704 0.79108796]
mean :  0.7854416511988558
std :  0.0037907883929400713


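Note : the fold accuracies are tightly clustered (mean ≈ 0.785, std ≈ 0.004), so the estimate is stable across folds. Cross-validation scores alone do not show whether the model overfits; that requires a comparison with the training score.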
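
One way to check for overfitting during cross-validation itself is to record the training score of every fold next to its validation score. The sketch below uses synthetic data from `make_classification` as a stand-in for the weather data, so the numbers it prints are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with 10% label noise (an assumption for the demo)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_validate(tree, X_demo, y_demo, cv=10, return_train_score=True)

train_mean = scores["train_score"].mean()
val_mean = scores["test_score"].mean()
print("mean training accuracy   :", train_mean)
print("mean validation accuracy :", val_mean)
print("gap                      :", train_mean - val_mean)
```

A large gap between the two means is the usual sign that an unconstrained tree is memorising the training folds.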

# **Training and Test sets**

In [17]:
# Make sure your X_test is capped before running this cell

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.7901863210276588

confusion matrix : 
 [[5830  943]
 [ 870  998]]

classification report : 
               precision    recall  f1-score   support

          No       0.87      0.86      0.87      6773
         Yes       0.51      0.53      0.52      1868

    accuracy                           0.79      8641
   macro avg       0.69      0.70      0.69      8641
weighted avg       0.79      0.79      0.79      8641

Training set score :  1.0
Test set score :  0.7901863210276588


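Note : Overfitting. The training accuracy is 1.0 while the test accuracy is only about 0.79, so the unconstrained tree memorises the training set.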
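
A fitted tree reports its own size, which helps explain a perfect training score. The following sketch compares an unconstrained tree with a depth-limited one on synthetic data; the data, the depth limit of 5, and the variable names are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with 10% label noise
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

deep = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unconstrained", deep), ("max_depth=5", shallow)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| train acc:", round(tree.score(X_tr, y_tr), 3),
          "| test acc:", round(tree.score(X_te, y_te), 3))
```

Without a depth limit the tree keeps splitting until every leaf is pure, which is why it can reach a training accuracy of 1.0 even on noisy labels.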

# **Improve the model with GridSearchCV**

In [18]:
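# set parameter grid
# note : for DecisionTreeClassifier, 'auto' means the same as 'sqrt';
# it was deprecated in scikit-learn 1.1 and removed in 1.3, so drop it on newer versions
param_grid = {
    'model__splitter':['best','random'],
    'model__max_depth':[5,10,15,20,25],
    'model__max_features':['auto','sqrt','log2']
}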

# GridSearchCV
search = GridSearchCV(clf, param_grid, n_jobs=-1)
search.fit(X_train, y_train)

print("Best score :",search.best_score_)
print("Best parameter :",search.best_params_)

Best score : 0.8097441255672353
Best parameter : {'model__max_depth': 10, 'model__max_features': 'auto', 'model__splitter': 'best'}
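
Besides `best_score_` and `best_params_`, the fitted search keeps the scores of every candidate in `cv_results_`, which shows how close the runners-up were. The sketch below uses synthetic data and a smaller, hypothetical grid; with no `cv` argument, `GridSearchCV` uses 5-fold cross-validation, and it is set explicitly here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data and a reduced grid (assumptions for the demo)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=4,
                                     random_state=0)
demo_grid = {"max_depth": [3, 5, 10], "splitter": ["best", "random"]}

demo_search = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=0),
                           demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# one row per parameter combination, best candidates first
results = pd.DataFrame(demo_search.cv_results_)
cols = ["param_max_depth", "param_splitter", "mean_test_score",
        "std_test_score", "rank_test_score"]
top = results[cols].sort_values("rank_test_score")
print(top)
```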


**Create Preprocessor : ColumnTransformer of numerical and categorical features**

In [0]:
numerical_features = [x for x in X.columns if df[x].dtype != 'object']

numeric_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='median')),
          ('scaler', MinMaxScaler())
])

categorical_features = [x for x in X.columns if df[x].dtype == 'object']

categorical_transformer = Pipeline(steps=[
          ('imputer', SimpleImputer(strategy='most_frequent')),
          ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
          ('num', numeric_transformer, numerical_features),
          ('cat', categorical_transformer, categorical_features)
    ]
)

**Create model**

In [0]:
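from sklearn.tree import DecisionTreeClassifier
# max_features='sqrt' is what the selected 'auto' means for a classification tree
# ('auto' was removed in scikit-learn 1.3)
model = DecisionTreeClassifier(criterion = 'entropy', max_depth=10, max_features='sqrt', splitter='best')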

**Create Pipeline**

In [0]:
clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('model', model)
])

**Cross Validation on Training set**

In [22]:
accuracy = cross_val_score(clf,X_train,y_train,cv=10)
print('accuracy : ', accuracy)
print('mean : ', accuracy.mean())
print('std : ', accuracy.std())

accuracy :  [0.80300839 0.82267862 0.81226497 0.81978594 0.80295139 0.80584491
 0.80960648 0.80989583 0.80034722 0.81828704]
mean :  0.8104670793425042
std :  0.00733012784133786


Note : the mean accuracy improved from about 0.785 to about 0.810, and the fold scores are still tightly clustered (std ≈ 0.007).

**Training and Test sets**

In [23]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print('\n\nTraining and Test Sets result')
print('\naccuracy score : ', accuracy_score(y_test,y_pred))
print('\nconfusion matrix : \n', confusion_matrix(y_test, y_pred))
print('\nclassification report : \n', classification_report(y_test,y_pred))

print('Training set score : ',clf.score(X_train,y_train))
print('Test set score : ',clf.score(X_test,y_test))



Training and Test Sets result

accuracy score :  0.8150677004976276

confusion matrix : 
 [[6398  375]
 [1223  645]]

classification report : 
               precision    recall  f1-score   support

          No       0.84      0.94      0.89      6773
         Yes       0.63      0.35      0.45      1868

    accuracy                           0.82      8641
   macro avg       0.74      0.64      0.67      8641
weighted avg       0.79      0.82      0.79      8641

Training set score :  0.8271322763569031
Test set score :  0.8150677004976276


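Note : the training score (0.827) and the test score (0.815) are now close, so the tuned tree no longer overfits. Recall for the 'Yes' class, however, dropped from 0.53 to 0.35.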
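
The figures in the two classification reports can be recomputed directly from the confusion matrices printed above. In scikit-learn's layout the rows are the true classes and the columns the predicted classes, in the order 'No', 'Yes'. The helper name `binary_metrics` is made up for this sketch.

```python
import numpy as np

def binary_metrics(cm):
    """Accuracy, precision and recall of the positive ('Yes') class."""
    cm = np.asarray(cm)
    tn, fp, fn, tp = cm.ravel()   # [[TN, FP], [FN, TP]]
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

baseline = binary_metrics([[5830, 943], [870, 998]])   # unconstrained tree
tuned = binary_metrics([[6398, 375], [1223, 645]])     # tree after GridSearchCV

for name, metrics in [("baseline", baseline), ("tuned", tuned)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The tuned tree makes fewer false alarms (375 instead of 943) but misses more rainy days (1223 instead of 870), which is why accuracy and precision rise while recall for 'Yes' falls.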