# General Overview - Machine Learning

The goal of our machine learning model is to correctly predict a tree's health from the independent variables. This is a classification problem: the target is categorical with three classes (Good, Fair, and Poor), which we treat as nominal, meaning that we do not assume any intrinsic order among them, unlike ordinal variables. To measure our model's success, we rely on a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html), which shows the main classification metrics: precision, recall, and F1-score.
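As a reminder of what the report measures, here is a minimal sketch that computes precision, recall, and F1 for a single class from raw counts. The helper name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration; they are not taken from the tree data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    # precision: of everything predicted as this class, how much was right
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: of everything that truly is this class, how much was found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one class (illustrative numbers only)
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=70)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.3 0.43
```

A class can therefore have decent precision and still very low recall, which is the pattern to watch for in the reports that follow.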

Our algorithms of choice are: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), and [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes). To prepare our data, we separate our target variable, y (tree health), from the independent variables, X. Next, we split X and y into training and test sets. In our case, we train on 70% of the data and test on the remaining 30%. After splitting, we are ready to begin training and evaluating the models.

It's important to note that, because the classes in our data are heavily imbalanced, we are incorporating under sampling and over sampling methods in an effort to improve our precision and recall scores and find the best possible model. There are two separate notebooks for under and over sampling techniques due to the number of methods used and the length of the notebooks.

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn import datasets
from sklearn import metrics
from collections import Counter

from sklearn.model_selection import (StratifiedKFold, cross_val_score, GridSearchCV, train_test_split)
from sklearn.metrics import (classification_report, confusion_matrix)

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# under sampling
import imblearn
from imblearn.under_sampling import (RandomUnderSampler, EditedNearestNeighbours)



In [2]:
np.random.seed(42)

In [3]:
data = pd.read_csv('tree_ml.csv', index_col=0) # import data
tree = data.copy() # save a copy of data as tree

In [4]:
tree.head()

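|   | health | num_problems | tree_dbh | root_stone_l | root_grate_l | root_other_l | trunk_wire_l | trnk_light_l | trnk_other_l | brch_light_l | ... | OnCurb | Harmful | Helpful | Unsure | Damage | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Fair | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Fair | 1 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | Good | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | Good | 1 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | Good | 1 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |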


In [5]:
tree.shape

(651535, 25)

# Machine Learning Models

We start by splitting our data into training and testing sets in a stratified fashion so that the resulting sets have the same class proportions as the original data. 70% of our data is used to train the models while the remaining 30% is used for testing (`test_size=0.3`).

In [6]:
y = tree['health'].values # target variable
X = tree.drop('health', axis=1).values # feature variables

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(456074, 24) (456074,)
(195461, 24) (195461,)
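To see what `stratify=y` buys us, the sketch below runs the same kind of split on a small synthetic label array and compares class shares in the two parts. The data, the variable names (`X_demo`, `y_demo`), and the helper `class_shares` are hypothetical stand-ins, not the tree data.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for tree health (hypothetical data)
rng = np.random.default_rng(42)
y_demo = np.array(['Good'] * 800 + ['Fair'] * 150 + ['Poor'] * 50)
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Map each class to its share of the given labels."""
    counts = Counter(labels)
    return {str(k): counts[k] / len(labels) for k in sorted(counts)}

print('train:', class_shares(y_tr))
print('test: ', class_shares(y_te))
```

Both parts should show roughly the same shares for each class, which is why the support column in the reports below mirrors the class balance of the full data set.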


## Logistic Regression

In [7]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set: ', logreg.score(X_train, y_train))
print('Accuracy Score, Test Set: ', logreg.score(X_test, y_test))

# classification report
print('Classification Report \n')
print(classification_report(y_test, logreg_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy Score, Training Set:  0.810708349960752
Accuracy Score, Test Set:  0.8108062477936775
Classification Report 

              precision    recall  f1-score   support

        Fair       0.36      0.02      0.04     28928
        Good       0.81      1.00      0.90    158499
        Poor       0.40      0.00      0.00      8034

    accuracy                           0.81    195461
   macro avg       0.52      0.34      0.31    195461
weighted avg       0.73      0.81      0.73    195461
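The convergence warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of doing both in one pipeline follows; it runs on synthetic data from `make_classification`, and `max_iter=1000` is an assumed value rather than a tuned one, so the scores it prints say nothing about the tree data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class, imbalanced data (hypothetical stand-in for X, y)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fit on the training data only, then applied to the test data
logreg_scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
logreg_scaled.fit(X_tr, y_tr)
print('Test accuracy:', logreg_scaled.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps the test set out of the scaling statistics, which matters as soon as the model is cross-validated.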



## KNN Classifier

This one takes a long time to run: the grid search below took about 31 minutes on 8 workers.

In [8]:
# GridSearch
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [4,15]}

clf = GridSearchCV(knn, parameters, cv=5, verbose=1, n_jobs=-1)
clf.fit(X, y).best_params_

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed: 23.2min remaining: 15.5min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 31.1min finished


{'n_neighbors': 15}
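Note that the search above is fit on the full `X` and `y`, so rows that later end up in the test set also influence the choice of `n_neighbors`. One possible alternative, sketched here on synthetic data, is to run the search on the training split only and to score by macro-averaged F1 so that the minority classes count as much as the majority class. The data and the `f1_macro` choice are assumptions of this sketch, not what the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class, imbalanced data (hypothetical stand-in for X, y)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [4, 15]},
    cv=5,
    scoring='f1_macro',  # assumption: weigh all classes equally
)
search.fit(X_tr, y_tr)  # the test split is never seen during the search

print(search.best_params_)
# With scoring set, .score() reports macro F1, not accuracy
print('Held-out macro F1:', search.score(X_te, y_te))
```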

In [10]:
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# accuracy scoring
print('Accuracy Score, Training Set: ', knn.score(X_train, y_train))
print('Accuracy Score, Test Set: ', knn.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, knn_pred)))
print()

Accuracy Score, Training Set:  0.8125348079478331
Accuracy Score, Test Set:  0.8088877064989947
Classification Report 

               precision    recall  f1-score   support

        Fair       0.36      0.04      0.08     28928
        Good       0.82      0.99      0.90    158499
        Poor       0.42      0.02      0.04      8034

    accuracy                           0.81    195461
   macro avg       0.53      0.35      0.34    195461
weighted avg       0.73      0.81      0.74    195461




## Decision Tree Classifier

In [11]:
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
decision_tree_pred = decision_tree.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', decision_tree.score(X_train, y_train))
print('Accuracy Score, Test Set:', decision_tree.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, decision_tree_pred)))

Accuracy Score, Training Set: 0.8259317566886075
Accuracy Score, Test Set: 0.8012647024214549
Classification Report 

               precision    recall  f1-score   support

        Fair       0.30      0.06      0.10     28928
        Good       0.82      0.98      0.89    158499
        Poor       0.25      0.03      0.06      8034

    accuracy                           0.80    195461
   macro avg       0.46      0.36      0.35    195461
weighted avg       0.72      0.80      0.74    195461



## Random Forest Classifier

In [12]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', forest.score(X_train, y_train))
print('Accuracy Score, Test Set:', forest.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, y_pred)))

Accuracy Score, Training Set: 0.8259273714353372
Accuracy Score, Test Set: 0.8053678227370166
Classification Report 

               precision    recall  f1-score   support

        Fair       0.33      0.05      0.08     28928
        Good       0.82      0.98      0.89    158499
        Poor       0.27      0.04      0.07      8034

    accuracy                           0.81    195461
   macro avg       0.47      0.36      0.35    195461
weighted avg       0.72      0.81      0.74    195461



## Gaussian Naive Bayes

In [13]:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
gaussian_pred = gaussian.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', gaussian.score(X_train, y_train))
print('Accuracy Score, Test Set:', gaussian.score(X_test, y_test))

# classification report
print('Classification Report \n')
print(classification_report(y_test, gaussian_pred))

Accuracy Score, Training Set: 0.7371084516986278
Accuracy Score, Test Set: 0.7380091169082323
Classification Report 

              precision    recall  f1-score   support

        Fair       0.21      0.11      0.14     28928
        Good       0.83      0.88      0.86    158499
        Poor       0.12      0.19      0.15      8034

    accuracy                           0.74    195461
   macro avg       0.39      0.39      0.38    195461
weighted avg       0.71      0.74      0.72    195461



Even though accuracy scores are relatively high for our models, they consistently under predict the number of fair and poor trees. Good trees are the vast majority, so our next step is under sampling: selectively removing samples so that the machine learning models can better learn to identify each type of tree. Our under sampling method is edited nearest neighbors, which removes a sample when its class disagrees with the classes of its nearest neighbors. Note that this method cleans up the boundaries between classes rather than equalizing class counts, so the resampled classes remain imbalanced, as the resampled counts printed below show.
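To put the accuracy scores in context, the short calculation below uses the test-set support values from the classification reports above to compute the accuracy of a trivial model that always predicts the majority class.

```python
# Test-set support per class, copied from the classification reports above
support = {'Fair': 28928, 'Good': 158499, 'Poor': 8034}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # Good 0.8109
```

Always answering "Good" would score about 0.81, which is essentially the test accuracy of the logistic regression, KNN, and random forest models above. This is why we look at per-class precision and recall rather than accuracy alone.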

# Under Sampling

The under sampling method we are using is edited nearest neighbors.
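The idea behind edited nearest neighbors can be sketched in a few lines with scikit-learn's `NearestNeighbors`. This is a simplified illustration, not imblearn's implementation: the function name `edited_nearest_neighbours_sketch`, the `classes_to_edit` argument, the choice of 3 neighbours, and the rule that all neighbours must agree are assumptions of this sketch.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def edited_nearest_neighbours_sketch(X, y, classes_to_edit, n_neighbors=3):
    """Drop samples of the given classes whose neighbours do not all share their label."""
    # Ask for one extra neighbour: each point is its own nearest neighbour
    # (assumes no duplicate points)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]  # skip the point itself
    agrees = (neighbour_labels == y[:, None]).all(axis=1)
    # Keep a sample if its neighbours agree, or if its class is not being edited
    keep = agrees | ~np.isin(y, classes_to_edit)
    return X[keep], y[keep]

# Toy 1-D data: one 'Good' point sits inside the 'Poor' cluster
X_demo = np.array([[0.0], [0.1], [0.2], [0.3], [5.05], [5.0], [5.1], [5.2], [5.3]])
y_demo = np.array(['Good'] * 5 + ['Poor'] * 4)

X_res, y_res = edited_nearest_neighbours_sketch(X_demo, y_demo, classes_to_edit=['Good'])
print(Counter(y_res.tolist()))  # Counter({'Good': 4, 'Poor': 4})
```

Only the 'Good' point surrounded by 'Poor' neighbours is removed. Samples are dropped because of where they sit, not to hit a target class count.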

## Edited Nearest Neighbors

In [14]:
# initialize
enn = EditedNearestNeighbours()
X_enn, y_enn = enn.fit_resample(X, y)

print('Resampled dataset shape: {}'.format(Counter(y_enn)))

# train test split
X_train_enn, X_test_enn, y_train_enn, y_test_enn = train_test_split(X_enn, y_enn, test_size=0.3, random_state=42)

print(X_train_enn.shape, y_train_enn.shape)
print(X_test_enn.shape, y_test_enn.shape)

Resampled dataset shape: Counter({'Good': 312229, 'Poor': 26781, 'Fair': 1553})
(238394, 24) (238394,)
(102169, 24) (102169,)


### Logistic Regression

In [15]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_enn, y_train_enn)
logreg_pred = logreg.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set: ', logreg.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set: ', logreg.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, logreg_pred)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy Score, Training Set:  0.9209040495985638
Accuracy Score, Test Set:  0.921825602677916
Classification Report 

               precision    recall  f1-score   support

        Fair       0.08      0.00      0.00       462
        Good       0.93      1.00      0.96     93715
        Poor       0.64      0.10      0.17      7992

    accuracy                           0.92    102169
   macro avg       0.55      0.37      0.38    102169
weighted avg       0.90      0.92      0.89    102169



### KNN Classifier

In [16]:
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train_enn, y_train_enn)
knn_pred = knn.predict(X_test_enn)

# accuracy scoring
print('Accuracy Score, Training Set: ', knn.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set: ', knn.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, knn_pred)))

Accuracy Score, Training Set:  0.9459382367005881
Accuracy Score, Test Set:  0.9449050103260285
Classification Report 

               precision    recall  f1-score   support

        Fair       0.61      0.44      0.51       462
        Good       0.95      1.00      0.97     93715
        Poor       0.85      0.38      0.53      7992

    accuracy                           0.94    102169
   macro avg       0.80      0.61      0.67    102169
weighted avg       0.94      0.94      0.94    102169



### Decision Tree Classifier

In [17]:
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train_enn, y_train_enn)
decision_tree_pred = decision_tree.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set:', decision_tree.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set:', decision_tree.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, decision_tree_pred)))

Accuracy Score, Training Set: 0.9653179190751445
Accuracy Score, Test Set: 0.9557889379361646
Classification Report 

               precision    recall  f1-score   support

        Fair       0.69      0.85      0.76       462
        Good       0.96      1.00      0.98     93715
        Poor       0.91      0.49      0.64      7992

    accuracy                           0.96    102169
   macro avg       0.85      0.78      0.79    102169
weighted avg       0.95      0.96      0.95    102169



This model has very high accuracy scores for both the training and test sets. It has high precision for good and poor trees and strong recall for fair and good trees, although recall for poor trees is only 0.49. The two top performing models both have lower precision for fair trees. Since the accuracy scores are already in the mid-90s, there is little room left for hyperparameter tuning. Instead, we use cross validation to check that the model's performance holds up across different splits of the data.

In [18]:
# stratified KFold
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# cross validation
tree_score = cross_val_score(decision_tree, X_enn, y_enn, cv=kf)

print('Scores: ', tree_score)
print("Average 5-Fold Scores: {}".format(np.mean(tree_score)))

Scores:  [0.95560319 0.95494252 0.95580873 0.95551445 0.95607235]
Average 5-Fold Scores: 0.9555882477259795
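`cross_val_score` reports accuracy by default for classifiers, which is dominated by the majority class. A possible extension, sketched here on synthetic data, is `cross_validate` with several scorers, so that macro-averaged F1 and recall are tracked per fold as well. The data and the scorer list are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class, imbalanced data (hypothetical stand-in for X_enn, y_enn)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ['accuracy', 'f1_macro', 'recall_macro']

scores = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_demo, y_demo, cv=kf, scoring=metric_names,
)

# Results are stored under 'test_<scorer name>', one value per fold
for name in metric_names:
    print(name, round(float(scores['test_' + name].mean()), 3))
```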


### Random Forest Classifier

In [19]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train_enn, y_train_enn)
forest_pred = forest.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set:', forest.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set:', forest.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, forest_pred)))

Accuracy Score, Training Set: 0.9653179190751445
Accuracy Score, Test Set: 0.9571298534780609
Classification Report 

               precision    recall  f1-score   support

        Fair       0.77      0.84      0.80       462
        Good       0.96      1.00      0.98     93715
        Poor       0.94      0.49      0.64      7992

    accuracy                           0.96    102169
   macro avg       0.89      0.78      0.81    102169
weighted avg       0.96      0.96      0.95    102169



This model has high precision scores overall and strong recall scores for good and fair trees; recall for poor trees is weaker at 0.49. Because precision and recall typically trade off against each other, we prioritize high precision. The overall test accuracy is about 96%, and the difference between the training and test set scores is less than one percentage point, so our model is not grossly overfitted.
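The trade-off between precision and recall is easiest to see in a binary setting, where raising the decision threshold makes the model more selective. The labels and probabilities below are made up for illustration, and the helper `precision_recall_at` is hypothetical; the multi-class reports in this notebook give the per-class equivalents.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for y_prob >= threshold."""
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for t in (0.25, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(t, round(p, 2), round(r, 2))
# 0.25 0.67 1.0
# 0.5 0.75 0.75
# 0.7 1.0 0.5
```

As the threshold rises, precision climbs from 0.67 to 1.0 while recall falls from 1.0 to 0.5.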

To evaluate this model, we are using cross validation.

In [20]:
# stratified KFold
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# cross validation
forest_score = cross_val_score(forest, X_enn, y_enn, cv=kf)

print('Scores: ', forest_score)
print("Average 5-Fold Scores: {}".format(np.mean(forest_score)))

Scores:  [0.95761455 0.9567043  0.95752646 0.95768734 0.95773138]
Average 5-Fold Scores: 0.9574528075953843


### Gaussian Naive Bayes

In [21]:
gaussian = GaussianNB()
gaussian.fit(X_train_enn, y_train_enn)
gaussian_pred = gaussian.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set:', gaussian.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set:', gaussian.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, gaussian_pred)))

Accuracy Score, Training Set: 0.8436915358608018
Accuracy Score, Test Set: 0.8435924791277197
Classification Report 

               precision    recall  f1-score   support

        Fair       0.03      0.31      0.06       462
        Good       0.94      0.90      0.92     93715
        Poor       0.21      0.19      0.20      7992

    accuracy                           0.84    102169
   macro avg       0.39      0.47      0.39    102169
weighted avg       0.88      0.84      0.86    102169

