# General Overview - Machine Learning

The goal of our machine learning model is to predict a tree's health from a set of independent variables. The target is categorical, with three classes: Good, Fair, and Poor. We treat these classes as nominal, meaning that we do not model any intrinsic order among them, unlike ordinal variables. To measure our model's success, we rely on a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html), which shows the main classification metrics: precision, recall, and f1-score.
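As a quick illustration of what the classification report measures, here is a minimal sketch on eight made-up labels (the `y_true` and `y_pred` lists are hypothetical and not taken from the tree data). It computes per-class precision, recall, and f1-score directly and then prints the same numbers as a report.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical labels for eight trees (illustration only, not from the dataset).
y_true = ['Good', 'Good', 'Good', 'Good', 'Fair', 'Fair', 'Poor', 'Poor']
y_pred = ['Good', 'Good', 'Good', 'Good', 'Good', 'Fair', 'Good', 'Poor']

labels = ['Fair', 'Good', 'Poor']

# precision = correct predictions of a class / all predictions of that class
# recall    = correct predictions of a class / all true members of that class
# f1        = harmonic mean of precision and recall
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}')

# The report bundles the same per-class metrics with accuracy and averages.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

In this toy case the model finds every Good tree but only half of the Fair and Poor trees, so recall is 1.00 for Good and 0.50 for the two smaller classes even though overall accuracy is 0.75.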

We compare four classifiers: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We start by separating the target variable y (tree health) from the feature variables X. Next, we split X and y into a training set (70%) and a test set (30%). We fit each model on the training data and evaluate it on the test data using accuracy scores and a classification report.

Since the classes are imbalanced, we use undersampling to reduce the number of majority-class samples so that the classes are represented as evenly as possible. The technique we use is edited nearest neighbors.
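To make the idea behind edited nearest neighbors concrete, here is a simplified sketch written with scikit-learn's `NearestNeighbors` only. The function `enn_sketch` and the toy data are hypothetical and are not imbalanced-learn's implementation, which offers more options. The sketch assumes the strictest rule: a sample from a targeted class is kept only if all of its nearest neighbors share its label.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import NearestNeighbors


def enn_sketch(X, y, target_classes, n_neighbors=3):
    """Drop samples of target_classes whose neighbours do not all share their label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    # With no query argument, kneighbors() leaves each point out of its own neighbour list.
    neighbour_idx = nn.kneighbors(return_distance=False)
    # True where every neighbour carries the same label as the sample itself.
    agrees = (y[neighbour_idx] == y[:, None]).all(axis=1)
    # Samples outside the targeted classes are always kept.
    keep = agrees | ~np.isin(y, target_classes)
    return X[keep], y[keep]


# Toy 1-D data: five Good points near 0, one Good point sitting among three Poor points.
X_toy = [[0.0], [0.1], [0.2], [0.3], [0.4], [5.05], [5.0], [5.1], [5.2]]
y_toy = ['Good'] * 6 + ['Poor'] * 3

X_res, y_res = enn_sketch(X_toy, y_toy, target_classes=['Good'])
print('Before:', Counter(y_toy))
print('After: ', Counter(y_res.tolist()))
```

Only the Good point at 5.05 is removed, because its three nearest neighbors are all Poor. The method cleans up class boundaries rather than cutting the majority class to a fixed size.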

In [1]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.7.0 imblearn-0.0


In [2]:
import numpy as np
import pandas as pd
import sklearn
from sklearn import datasets
from sklearn import metrics
from collections import Counter

from sklearn.model_selection import (StratifiedKFold, cross_val_score, GridSearchCV, train_test_split)
from sklearn.metrics import (classification_report, confusion_matrix)

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# under sampling
import imblearn
from imblearn.under_sampling import (RandomUnderSampler, EditedNearestNeighbours)

In [3]:
np.random.seed(42)

In [4]:
# import data
data = pd.read_csv('tree_ml.csv', index_col=0)

tree = data.copy()

In [5]:
tree.head()

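| Unnamed: 0 | tree_dbh | curb_loc | health | sidewalk | root_stone | root_grate | root_other | trunk_wire | trnk_light | trnk_other | ... | Stew_N | Guard_N | Harmful | Helpful | Unsure | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | Fair | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 21 | 1 | Fair | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 1 | Good | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 10 | 1 | Good | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 21 | 1 | Good | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |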


In [6]:
tree.shape

(651535, 29)

# Modeling

We split the data into training and test sets in a stratified fashion so that both sets keep the same class proportions as the full dataset.
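A small sketch of what stratification preserves, using 200 made-up labels rather than the tree data (the 80/15/5 class mix is an assumption chosen for illustration):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% Good, 15% Fair, 5% Poor.
y_toy = ['Good'] * 160 + ['Fair'] * 30 + ['Poor'] * 10
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)

# Each split keeps the 80/15/5 mix of the full label list.
for name in ['Good', 'Fair', 'Poor']:
    print(name, train_counts[name] / len(y_tr), test_counts[name] / len(y_te))
```

Without `stratify`, a rare class such as Poor could end up over- or under-represented in the test set purely by chance.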

In [7]:
X = tree.drop('health', axis=1)
y = tree['health']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(456074, 28) (456074,)
(195461, 28) (195461,)


## Logistic Regression

In [8]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set: ', logreg.score(X_train, y_train))
print('Accuracy Score, Test Set: ', logreg.score(X_test, y_test))

# classification report
print('Classification Report \n')
print(classification_report(y_test, logreg_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy Score, Training Set:  0.8106623048014138
Accuracy Score, Test Set:  0.8106936933710561
Classification Report 





              precision    recall  f1-score   support

        Fair       0.35      0.02      0.04     28928
        Good       0.81      1.00      0.90    158499
        Poor       0.00      0.00      0.00      8034

    accuracy                           0.81    195461
   macro avg       0.39      0.34      0.31    195461
weighted avg       0.71      0.81      0.73    195461
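The convergence warning in the output above suggests either scaling the data or raising `max_iter`. The following is a minimal sketch of that suggestion on synthetic data. `make_classification` stands in for the tree data, and the names `X_demo`, `y_demo`, and `scaled_logreg` are illustrative. It is not a rerun of the notebook, and the scores it prints say nothing about the tree dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data with an 80/15/5 class mix (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)
# Put one feature on a much larger scale, similar to mixing a measurement with 0/1 flags.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied before the classifier.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_logreg.fit(X_tr, y_tr)

n_iter = scaled_logreg.named_steps['logisticregression'].n_iter_
print('Iterations used:', n_iter)
print('Test accuracy:', scaled_logreg.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps the test set out of the scaling statistics, which matters if the same pipeline is later passed to cross-validation.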



## KNN Classifier

In [9]:
# GridSearch - this takes a while to run
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [4,15]}

clf = GridSearchCV(knn, parameters, cv=5, verbose=1, n_jobs=-1)
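# note: this searches over the full X and y, so the held-out test rows also inform the choice of n_neighbors
clf.fit(X, y).best_params_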

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 18.4min finished


{'n_neighbors': 15}
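The grid search above is fitted on the full `X` and `y`. A variant that tunes `n_neighbors` on the training split only, and scores with macro-averaged f1 so the smaller classes count as much as the large one, could look like the sketch below. It runs on synthetic data, and the names `X_demo`, `y_demo`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the tree features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Same candidate values as the notebook's grid.
search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [4, 15]},
    cv=5,
    scoring='f1_macro',
)
# Only the training split is used for tuning; the test split stays unseen.
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)
print('Best parameters:', search.best_params_)
print('Macro f1 on the held-out test split:', test_score)
```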

In [10]:
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# accuracy scoring
print('Accuracy Score, Training Set: ', knn.score(X_train, y_train))
print('Accuracy Score, Test Set: ', knn.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, knn_pred)))
print()

Accuracy Score, Training Set:  0.8172314142003272
Accuracy Score, Test Set:  0.8088109648472074
Classification Report 

               precision    recall  f1-score   support

        Fair       0.38      0.07      0.11     28928
        Good       0.82      0.98      0.90    158499
        Poor       0.43      0.03      0.06      8034

    accuracy                           0.81    195461
   macro avg       0.54      0.36      0.36    195461
weighted avg       0.74      0.81      0.75    195461




## Decision Tree Classifier

In [11]:
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
decision_tree_pred = decision_tree.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', decision_tree.score(X_train, y_train))
print('Accuracy Score, Test Set:', decision_tree.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, decision_tree_pred)))

Accuracy Score, Training Set: 0.9999824589869188
Accuracy Score, Test Set: 0.7409252996761503
Classification Report 

               precision    recall  f1-score   support

        Fair       0.29      0.30      0.30     28928
        Good       0.86      0.85      0.85    158499
        Poor       0.18      0.19      0.18      8034

    accuracy                           0.74    195461
   macro avg       0.44      0.45      0.45    195461
weighted avg       0.75      0.74      0.74    195461



## Random Forest Classifier

In [12]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', forest.score(X_train, y_train))
print('Accuracy Score, Test Set:', forest.score(X_test, y_test))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test, forest_pred)))

Accuracy Score, Training Set: 0.9999298359476751
Accuracy Score, Test Set: 0.8061198909245323
Classification Report 

               precision    recall  f1-score   support

        Fair       0.41      0.19      0.26     28928
        Good       0.84      0.95      0.89    158499
        Poor       0.34      0.12      0.17      8034

    accuracy                           0.81    195461
   macro avg       0.53      0.42      0.44    195461
weighted avg       0.76      0.81      0.77    195461



Even though accuracy scores are relatively high for our models, they consistently underpredict the number of fair and poor trees: recall for those two classes is at most 0.30 in every report above. Good trees are the vast majority (about 81% of the test set), so our next step is to selectively remove samples and bring the class sizes closer together. By undersampling, we aim to balance the classes so that the models can learn to correctly identify each type of tree. We use edited nearest neighbors, which removes a sample when its class label disagrees with the labels of its nearest neighbors. Unlike random undersampling, this method does not force the classes to equal sizes.

# Edited Nearest Neighbors

In [13]:
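# initialize
enn = EditedNearestNeighbours()
# note: resampling the full dataset before splitting means the test set below is also edited
X_enn, y_enn = enn.fit_resample(X, y)

print('Resampled dataset:', Counter(y_enn))

# note: unlike the earlier split, this one is not stratified
X_train_enn, X_test_enn, y_train_enn, y_test_enn = train_test_split(X_enn, y_enn, test_size=0.3, random_state=42)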

print(X_train_enn.shape, y_train_enn.shape)
print(X_test_enn.shape, y_test_enn.shape)

Resampled dataset: Counter({'Good': 339374, 'Poor': 26781, 'Fair': 3713})
(258907, 28) (258907,)
(110961, 28) (110961,)


## Logistic Regression

In [14]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_enn, y_train_enn)
logreg_pred = logreg.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set: ', logreg.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set: ', logreg.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, logreg_pred)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy Score, Training Set:  0.9213810364339319
Accuracy Score, Test Set:  0.9215039518389344




Classification Report 

               precision    recall  f1-score   support

        Fair       0.00      0.00      0.00      1128
        Good       0.92      1.00      0.96    101806
        Poor       0.68      0.09      0.15      8027

    accuracy                           0.92    110961
   macro avg       0.53      0.36      0.37    110961
weighted avg       0.90      0.92      0.89    110961



## KNN Classifier

In [15]:
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train_enn, y_train_enn)
knn_pred = knn.predict(X_test_enn)

# accuracy scoring
print('Accuracy Score, Training Set: ', knn.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set: ', knn.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, knn_pred)))

Accuracy Score, Training Set:  0.9266377502346402
Accuracy Score, Test Set:  0.9241174827191536
Classification Report 

               precision    recall  f1-score   support

        Fair       0.51      0.11      0.18      1128
        Good       0.93      1.00      0.96    101806
        Poor       0.68      0.12      0.21      8027

    accuracy                           0.92    110961
   macro avg       0.71      0.41      0.45    110961
weighted avg       0.91      0.92      0.90    110961



## Decision Tree Classifier

In [16]:
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train_enn, y_train_enn)
decision_tree_pred = decision_tree.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set:', decision_tree.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set:', decision_tree.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, decision_tree_pred)))

Accuracy Score, Training Set: 1.0
Accuracy Score, Test Set: 0.9088418453330449
Classification Report 

               precision    recall  f1-score   support

        Fair       0.56      0.61      0.58      1128
        Good       0.95      0.95      0.95    101806
        Poor       0.40      0.39      0.39      8027

    accuracy                           0.91    110961
   macro avg       0.64      0.65      0.64    110961
weighted avg       0.91      0.91      0.91    110961



In [17]:
# stratified KFold
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# cross validation
tree_score = cross_val_score(decision_tree, X_enn, y_enn, cv=kf)

print('Scores: ', tree_score)
print("Average 5-Fold Scores: {}".format(np.mean(tree_score)))

Scores:  [0.91060373 0.91168519 0.91184741 0.91062955 0.91226529]
Average 5-Fold Scores: 0.9114062316350069


## Random Forest Classifier

In [18]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train_enn, y_train_enn)
forest_pred = forest.predict(X_test_enn)

# accuracy scores
print('Accuracy Score, Training Set:', forest.score(X_train_enn, y_train_enn))
print('Accuracy Score, Test Set:', forest.score(X_test_enn, y_test_enn))

# classification report
print('Classification Report \n\n {}'.format(classification_report(y_test_enn, forest_pred)))

Accuracy Score, Training Set: 0.9999613760925737
Accuracy Score, Test Set: 0.9415740665639278
Classification Report 

               precision    recall  f1-score   support

        Fair       0.89      0.65      0.75      1128
        Good       0.94      1.00      0.97    101806
        Poor       0.83      0.29      0.43      8027

    accuracy                           0.94    110961
   macro avg       0.89      0.64      0.72    110961
weighted avg       0.94      0.94      0.93    110961



In [19]:
# stratified KFold
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# cross validation
forest_score = cross_val_score(forest, X_enn, y_enn, cv=kf)

print('Scores: ', forest_score)
print("Average 5-Fold Scores: {}".format(np.mean(forest_score)))

Scores:  [0.94299348 0.94354773 0.9435207  0.94381734 0.94356049]
Average 5-Fold Scores: 0.9434879481380675
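`cross_val_score` reports accuracy by default for classifiers, and accuracy can stay high when one class dominates. A sketch of cross-validation that also tracks macro-averaged recall and f1 is shown below, using scikit-learn's `cross_validate` on synthetic data. The names `X_demo`, `y_demo`, and `cv_scores` are illustrative, and the results do not describe the tree data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the tree features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=5, n_classes=3,
    weights=[0.85, 0.1, 0.05], random_state=42,
)

kf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Several metrics are computed on the same folds in one pass.
cv_scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo,
    cv=kf_demo,
    scoring=['accuracy', 'recall_macro', 'f1_macro'],
)

for metric in ['accuracy', 'recall_macro', 'f1_macro']:
    fold_scores = cv_scores['test_' + metric]
    print('{}: mean {:.3f}, folds {}'.format(metric, np.mean(fold_scores), np.round(fold_scores, 3)))
```

Macro averages weight every class equally, so a model that misses most of the small classes scores low on them even when its accuracy looks strong.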
