# HW 2 - Decision Tree Induction
## CIS 600
## Evan Smith

### Data Cleaning
First, we import the data from disk.

In [82]:
import pandas as pd

train = pd.read_csv("C:/Users/evana/Documents/GitHub/SyracuseMasters/CIS_600_Fund_Data_Mining/HW2/Weather Forecast Training.csv")
test = pd.read_csv("C:/Users/evana/Documents/GitHub/SyracuseMasters/CIS_600_Fund_Data_Mining/HW2/Weather Forecast Testing.csv")

We then define a function that displays the number of unique values and nulls in each column that contains at least one null. This shows us which columns to drop due to lack of variation or excessive nulls.

In [83]:
def DisplayNullRows(df):
    # Collect [column name, unique-value count, null count] for every column with nulls.
    dataCleaningCounts = []
    for col in df.columns:
        nullsCount = df[col].isnull().sum()
        if nullsCount > 0:
            dataCleaningCounts.append([col, df[col].nunique(), nullsCount])
    # display() is provided by the IPython/Jupyter environment.
    display(pd.DataFrame(dataCleaningCounts, columns=['Column', 'Unique Values', 'Nulls']))

print("Training Data Report")
DisplayNullRows(train)
print("Testing Data Report")
DisplayNullRows(test)

Training Data Report


| | Column | Unique Values | Nulls |
|---|---|---|---|
| 0 | MinTemp | 372 | 284 |
| 1 | MaxTemp | 484 | 129 |
| 2 | Rainfall | 599 | 747 |
| 3 | Evaporation | 270 | 22553 |
| 4 | Sunshine | 144 | 24875 |
| 5 | WindGustDir | 16 | 3598 |
| 6 | WindGustSpeed | 66 | 3571 |
| 7 | WindDir | 16 | 1513 |
| 8 | WindSpeed | 40 | 1024 |
| 9 | Humidity | 100 | 1429 |


Testing Data Report


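| | Column | Unique Values | Nulls |
|---|---|---|---|
| 0 | MinTemp | 350 | 47 |
| 1 | MaxTemp | 436 | 18 |
| 2 | Rainfall | 376 | 161 |
| 3 | Evaporation | 185 | 5516 |
| 4 | Sunshine | 140 | 6094 |
| 5 | WindGustDir | 16 | 929 |
| 6 | WindGustSpeed | 60 | 920 |
| 7 | WindDir | 16 | 387 |
| 8 | WindSpeed | 39 | 261 |
| 9 | Humidity | 100 | 349 |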


Based on the results, we expect every column to carry useful information, but `Evaporation`, `Sunshine`, and `Cloud` have too many nulls to be useful, so we drop those columns. From the remaining columns, we drop every row that contains a null.

In [84]:
manyNullsColumns = ["Evaporation", "Sunshine", "Cloud"]

train.drop(columns= manyNullsColumns, inplace=True)
train.dropna(inplace = True)

test.drop(columns= manyNullsColumns, inplace=True)
test.dropna(inplace = True)

display(train.columns)

Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'WindGustDir',
       'WindGustSpeed', 'WindDir', 'WindSpeed', 'Humidity', 'Pressure', 'Temp',
       'RainToday', 'RainTomorrow'],
      dtype='object')
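
The columns to drop were picked by eye from the report. As a minimal sketch of how the same choice could be made programmatically, the helper below flags columns whose share of nulls exceeds a cutoff. The function name `columns_above_null_fraction`, the 0.4 threshold, and the toy data are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

def columns_above_null_fraction(df, threshold=0.4):
    """Return the names of columns whose fraction of nulls exceeds threshold."""
    null_fraction = df.isnull().mean()  # per-column share of null values
    return null_fraction[null_fraction > threshold].index.tolist()

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "MinTemp": [10.1, 12.3, np.nan, 9.8, 11.0],
    "Sunshine": [np.nan, np.nan, np.nan, 7.2, np.nan],
    "Humidity": [55.0, 60.0, 71.0, 64.0, 58.0],
})

# Drop the sparse columns first, then drop any row that still has a null.
to_drop = columns_above_null_fraction(toy, threshold=0.4)
cleaned = toy.drop(columns=to_drop).dropna()
print(to_drop)
print(cleaned.shape)
```

Dropping the sparse columns before calling `dropna` matters: done in the other order, most rows would be discarded because of nulls in columns that are removed anyway.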

Now we must encode the non-numeric columns as integers. The affected columns are `Location`, `WindGustDir`, `WindDir`, `RainToday`, and `RainTomorrow`.

In [85]:
from sklearn.preprocessing import LabelEncoder

def NumerizeObjectColumns(df):

    le = LabelEncoder()
    
    cols_to_encode = df.select_dtypes(['object']).columns
    df[cols_to_encode] = df[cols_to_encode].apply(le.fit_transform)
    return

NumerizeObjectColumns(train)
NumerizeObjectColumns(test)
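
Because `NumerizeObjectColumns` fits a fresh `LabelEncoder` on each frame, the integer codes only agree between `train` and `test` when both frames contain exactly the same set of category values. A hedged sketch of one alternative is shown below: fit each encoder on the training column and reuse it on the test column. The helper name `encode_with_shared_mapping` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_with_shared_mapping(train_df, test_df, columns):
    """Fit one LabelEncoder per column on train_df and reuse it on test_df."""
    train_out, test_out = train_df.copy(), test_df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder().fit(train_df[col])
        train_out[col] = le.transform(train_df[col])
        if col in test_df.columns:
            # transform() raises ValueError if test holds a label unseen in train.
            test_out[col] = le.transform(test_df[col])
        encoders[col] = le
    return train_out, test_out, encoders

# Toy frames (made-up values); the test frame lacks the "N" direction.
toy_train = pd.DataFrame({"WindDir": ["N", "S", "E", "S"],
                          "RainToday": ["No", "Yes", "No", "No"]})
toy_test = pd.DataFrame({"WindDir": ["S", "E"],
                         "RainToday": ["Yes", "No"]})

train_enc, test_enc, encoders = encode_with_shared_mapping(
    toy_train, toy_test, ["WindDir", "RainToday"])
print(test_enc)
```

In this toy case, "S" is encoded as 2 in both frames. Fitting a separate encoder on `toy_test` would instead map "S" to 1, silently changing the meaning of the feature between training and prediction.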

### Decision Tree Training

We build the decision tree first, using a standard train/test split of the labeled data to produce a training set and a held-out validation set.

In [129]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
np.random.seed(1)

## Model Training

X = train.drop('RainTomorrow', axis=1)
y = train['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(f"train data size is {X_train.shape}")

clf = DecisionTreeClassifier()

train data size is (30299, 12)


Fitting the classifier with its default settings gives us a baseline accuracy on the held-out split.

In [130]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.tree_.max_depth
print(f"Accuracy: {round(metrics.accuracy_score(y_test, y_pred)*100,2)}%")

Accuracy: 72.05%


After establishing the baseline, we tune the model's hyperparameters to maximize 10-fold cross-validated accuracy on the training split. We start with a wide range of values, then narrow the grid around the best hyperparameters found at each step.

In [105]:
## Model Hyperparameter Fine Tuning

param_grid = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2, 10, 20],
              'max_depth': [5, 10, 20, 25, 30],
              'min_samples_leaf': [1, 5, 10],
              'max_leaf_nodes': [2, 5, 10, 20]}
grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 20, 25, 30],
                         'max_leaf_nodes': [2, 5, 10, 20],
                         'min_samples_leaf': [1, 5, 10],
                         'min_samples_split': [2, 10, 20]},
             scoring='accuracy')

In [107]:
## Output Best Model Hyperparameters
print(f"Accuracy: {round(grid.best_score_*100, 2)}%")
for hps, values in grid.best_params_.items():
  print(f"{hps}: {values}")

Accuracy: 76.95%
criterion: gini
max_depth: 10
max_leaf_nodes: 20
min_samples_leaf: 1
min_samples_split: 2
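
To decide where to narrow the grid, it can help to look past `best_params_` at the full table of cross-validation results. The sketch below uses synthetic data from `make_classification` and a small made-up grid; `X_demo`, `y_demo`, and `demo_grid` are illustrative names, not part of the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 5, 10], "max_leaf_nodes": [5, 10, 20]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(demo_grid.cv_results_)
top = results.sort_values("rank_test_score")[
    ["param_max_depth", "param_max_leaf_nodes", "mean_test_score", "std_test_score"]
].head(5)
print(top)
```

When the best value sits on the edge of the searched range, as `max_leaf_nodes: 20` does in the output above, that is a hint to extend the range in that direction on the next pass.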


We can now adjust the values in the hyperparameter grid to find better options. Only the final adjusted grid search is shown below.

In [108]:
## Model Hyperparameter Fine Tuning

param_grid = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2],
              'max_depth': [24, 25, 26, 27, 28],
              'min_samples_leaf': [1],
              'max_leaf_nodes': [32, 33, 34]}
grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)

## Output Best Model Hyperparameters
print(f"Accuracy: {round(grid.best_score_*100, 2)}%")
for hps, values in grid.best_params_.items():
  print(f"{hps}: {values}")

Accuracy: 77.23%
criterion: gini
max_depth: 24
max_leaf_nodes: 32
min_samples_leaf: 1
min_samples_split: 2


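Tuning raises accuracy by about 5.18 percentage points, from 72.05% to 77.23%. Note that the baseline figure was measured on the held-out split, while the tuned figure is the mean cross-validation score on the training split, so the two are not strictly comparable.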
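
For a like-for-like comparison, both models can be scored on the same held-out split. The following is a minimal sketch on synthetic data, with a hypothetical one-parameter grid. In the notebook, the equivalent would be calling `grid.predict(X_test)` and passing the result to `metrics.accuracy_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Baseline: default tree. Tuned: best tree found by a small grid search.
baseline = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_leaf_nodes": [4, 8, 16, 32]},
    cv=5,
    scoring="accuracy",
).fit(X_tr, y_tr)

# GridSearchCV.predict uses the best estimator refit on all of X_tr.
baseline_acc = accuracy_score(y_val, baseline.predict(X_val))
tuned_acc = accuracy_score(y_val, search.predict(X_val))
print(f"Baseline held-out accuracy: {baseline_acc:.4f}")
print(f"Tuned held-out accuracy:    {tuned_acc:.4f}")
```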

### Examining The Tree
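We can now print the top levels of a fitted tree and discuss some basic takeaways of the exercise. The cell below exports `clf`, the classifier fitted with default settings, which is why the truncated branches are much deeper than the tuned model's limit of 32 leaf nodes would allow.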

In [128]:
from sklearn.tree import export_text

r = export_text(clf, feature_names = list(X_train.columns), max_depth = 3)
print(r)

|--- Humidity <= 63.50
|   |--- WindGustSpeed <= 51.00
|   |   |--- Humidity <= 52.50
|   |   |   |--- Pressure <= 1011.65
|   |   |   |   |--- truncated branch of depth 21
|   |   |   |--- Pressure >  1011.65
|   |   |   |   |--- truncated branch of depth 25
|   |   |--- Humidity >  52.50
|   |   |   |--- Pressure <= 1014.35
|   |   |   |   |--- truncated branch of depth 19
|   |   |   |--- Pressure >  1014.35
|   |   |   |   |--- truncated branch of depth 24
|   |--- WindGustSpeed >  51.00
|   |   |--- Humidity <= 44.50
|   |   |   |--- Pressure <= 1008.85
|   |   |   |   |--- truncated branch of depth 24
|   |   |   |--- Pressure >  1008.85
|   |   |   |   |--- truncated branch of depth 20
|   |   |--- Humidity >  44.50
|   |   |   |--- Pressure <= 1013.75
|   |   |   |   |--- truncated branch of depth 21
|   |   |   |--- Pressure >  1013.75
|   |   |   |   |--- truncated branch of depth 20
|--- Humidity >  63.50
|   |--- Humidity <= 75.50
|   |   |--- WindGustSpeed <= 36.00
|   |  

A great benefit of decision trees is that the model's decision-making process is transparent to human readers. In the output shown above, the highest-level splits are on `Humidity`, `WindGustSpeed`, and `Pressure`. These are intuitive, and are likely similar to the paths a human might take when approaching the problem.
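
The printout above comes from `clf`. To inspect the tuned tree instead, one option is to export `grid.best_estimator_` and rank features by `feature_importances_`. The sketch below shows the pattern on synthetic data; the feature names and the small grid are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the weather data, with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_leaf_nodes": [4, 8]},
    cv=5,
).fit(X_demo, y_demo)

# best_estimator_ is the tree refit with the winning hyperparameters.
best_tree = search.best_estimator_
tree_text = export_text(best_tree, feature_names=feature_names, max_depth=3)
print(tree_text)

# Impurity-based importances, normalized to sum to 1 across features.
importances = pd.Series(
    best_tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

The importance ranking summarizes the whole tree, so it complements reading the first few splits by hand.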

### Producing Output Set

We now run the tuned model against the test set, which has no known correct classifications. The predictions are saved to disk.

In [146]:
test_data = test.drop('ID', axis=1)
pred = grid.predict(test_data)

frame = {'ID': test['ID'], 'DT' : pred}
outputData = pd.DataFrame(frame)
outputData.to_csv('output_predictions.csv', index=False)