# Nonparametric ML Models - Cumulative Lab

## Introduction

In this cumulative lab, you will apply two nonparametric models you have just learned — k-nearest neighbors and decision trees — to the forest cover dataset.

## Objectives

* Practice identifying and applying appropriate preprocessing steps
* Perform an iterative modeling process, starting from a baseline model
* Explore multiple model algorithms, and tune their hyperparameters
* Practice choosing a final model across multiple model algorithms and evaluating its performance

## Your Task: Complete an End-to-End ML Process with Nonparametric Models on the Forest Cover Dataset

![line of pine trees](images/trees.jpg)

Photo by <a href="https://unsplash.com/@michaelbenz?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Michael Benz</a> on <a href="https://unsplash.com/s/photos/forest?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

### Business and Data Understanding

To repeat the previous description:

> Here we will be using an adapted version of the forest cover dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/covertype). Each record represents a 30 x 30 meter cell of land within Roosevelt National Forest in northern Colorado, which has been labeled as `Cover_Type` 1 for "Cottonwood/Willow" and `Cover_Type` 0 for "Ponderosa Pine". (The original dataset contained 7 cover types but we have simplified it.)

The task is to predict the `Cover_Type` based on the available cartographic variables:

In [9]:
# Run this cell without changes
import pandas as pd

df = pd.read_csv('data/forest_cover.csv')
df

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,Cover_Type
0,2553,235,17,351,95,780,188,253,199,1410,...,0,0,0,0,0,0,0,0,0,0
1,2011,344,17,313,29,404,183,211,164,300,...,0,0,0,0,0,0,0,0,0,0
2,2022,24,13,391,42,509,212,212,134,421,...,0,0,0,0,0,0,0,0,0,0
3,2038,50,17,408,71,474,226,200,102,283,...,0,0,0,0,0,0,0,0,0,0
4,2018,341,27,351,34,390,152,188,168,190,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38496,2396,153,20,85,17,108,240,237,118,837,...,0,0,0,0,0,0,0,0,0,0
38497,2391,152,19,67,12,95,240,237,119,845,...,0,0,0,0,0,0,0,0,0,0
38498,2386,159,17,60,7,90,236,241,130,854,...,0,0,0,0,0,0,0,0,0,0
38499,2384,170,15,60,5,90,230,245,143,864,...,0,0,0,0,0,0,0,0,0,0


> As you can see, we have over 38,000 rows, each with 52 feature columns and 1 target column:

> * `Elevation`: Elevation in meters
> * `Aspect`: Aspect in degrees azimuth
> * `Slope`: Slope in degrees
> * `Horizontal_Distance_To_Hydrology`: Horizontal dist to nearest surface water features in meters
> * `Vertical_Distance_To_Hydrology`: Vertical dist to nearest surface water features in meters
> * `Horizontal_Distance_To_Roadways`: Horizontal dist to nearest roadway in meters
> * `Hillshade_9am`: Hillshade index at 9am, summer solstice
> * `Hillshade_Noon`: Hillshade index at noon, summer solstice
> * `Hillshade_3pm`: Hillshade index at 3pm, summer solstice
> * `Horizontal_Distance_To_Fire_Points`: Horizontal dist to nearest wildfire ignition points, meters
> * `Wilderness_Area_x`: Wilderness area designation (3 columns)
> * `Soil_Type_x`: Soil Type designation (39 columns)
> * `Cover_Type`: 1 for cottonwood/willow, 0 for ponderosa pine

This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest:

In [10]:
# Run this cell without changes
print("Raw Counts")
print(df["Cover_Type"].value_counts())
print()
print("Percentages")
print(df["Cover_Type"].value_counts(normalize=True))

Raw Counts
0    35754
1     2747
Name: Cover_Type, dtype: int64

Percentages
0    0.928651
1    0.071349
Name: Cover_Type, dtype: float64


Thus, a baseline model that always chose the majority class would have an accuracy of over 92%. Therefore we will want to report additional metrics at the end.
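
To make the point concrete, the majority-class baseline can be scored by hand from the counts printed above. This is a minimal sketch in plain Python; the variable names are illustrative and are not part of the lab's starter code.

```python
# Class counts copied from the value_counts() output above
n_pine = 35754        # Cover_Type 0 (majority class)
n_cottonwood = 2747   # Cover_Type 1 (minority class)
total = n_pine + n_cottonwood

# A model that always predicts 0 is right on every pine cell
# and wrong on every cottonwood/willow cell
baseline_accuracy = n_pine / total

# Recall for class 1 = true positives / actual positives;
# this model never predicts 1, so it has zero true positives
baseline_recall = 0 / n_cottonwood

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"recall:   {baseline_recall:.4f}")
```

The accuracy looks strong even though the model never finds a single cottonwood/willow cell, which is why accuracy alone is not enough here.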

### Previous Best Model

In a previous lab, we used SMOTE to create additional synthetic data, then tuned the hyperparameters of a logistic regression model to get the following final model metrics:

* **Log loss:** 0.13031294393913376
* **Accuracy:** 0.9456679825472678
* **Precision:** 0.6659919028340081
* **Recall:** 0.47889374090247455

In this lab, you will try to beat those scores using more-complex, nonparametric models.

### Modeling

Although you may be aware of some additional model algorithms available from scikit-learn, for this lab you will be focusing on two of them: k-nearest neighbors and decision trees. Here are some reminders about these models:

#### kNN - [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

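Unlike linear models or tree-based models, this algorithm does not emphasize learning the relationship between the features and the target. Instead, for a given test record, it finds the most similar records in the training set and combines their target values: for classification, the predicted class is the majority vote of those neighbors, and the predicted probability of a class is the share of neighbors that belong to it (with the default uniform weighting).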

* **Training speed:** Fast. In theory it's just saving the training data for later, although the scikit-learn implementation has some additional logic "under the hood" to make prediction faster.
* **Prediction speed:** Very slow. The model has to look at every record in the training set to find the k closest to the new record.
* **Requires scaling:** Yes. The algorithm to find the nearest records is distance-based, so it matters that distances are all on the same scale.
* **Key hyperparameters:** `n_neighbors` (how many nearest neighbors to find; too few neighbors leads to overfitting, too many leads to underfitting), `p` and `metric` (what kind of distance to use in defining "nearest" neighbors)
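
The ideas in the list above can be sketched in a few lines of plain Python. This is a brute-force illustration with uniform weights and made-up toy data, not scikit-learn's implementation; the function names `minkowski_distance` and `knn_predict_proba` are hypothetical.

```python
def minkowski_distance(a, b, p=2):
    """Distance between two points; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict_proba(X_train, y_train, query, n_neighbors=5, p=2):
    """Fraction of the n_neighbors closest training records labeled 1."""
    # Rank every training record by its distance to the query point
    ranked = sorted(
        zip(X_train, y_train),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    nearest_labels = [label for _, label in ranked[:n_neighbors]]
    return sum(nearest_labels) / n_neighbors


# Toy, made-up data: two features, binary target
X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
y_toy = [0, 0, 0, 1, 1]

prob_1 = knn_predict_proba(X_toy, y_toy, query=(0.8, 0.9), n_neighbors=3, p=1)
print(prob_1)
```

Because every prediction ranks the whole training set, this version makes the prediction cost visible. It also shows why scaling matters: a feature measured in large units would dominate the sum inside `minkowski_distance`.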

#### Decision Trees - [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

Similar to linear models (and unlike kNN), this algorithm emphasizes learning the relationship between the features and the target. However, unlike a linear model that tries to find linear relationships between each of the features and the target, decision trees look for ways to split the data based on features to decrease the impurity of the target in each split. scikit-learn measures impurity with the Gini index by default, and entropy is available as an alternative.

* **Training speed:** Slow. At each node the model considers splits on as many as all of the available features, and it can split on the same feature multiple times, so the training cost grows with both the number of columns and the number of rows.
* **Prediction speed:** Medium fast. Producing a prediction with a decision tree means applying several conditional statements, which is slower than something like logistic regression but faster than kNN.
* **Requires scaling:** No. This model is not distance-based. You can also use an `OrdinalEncoder` rather than a `OneHotEncoder` for categorical features, since this algorithm doesn't necessarily assume that the distance between `1` and `2` is the same as the distance between `2` and `3`.
* **Key hyperparameters:** Many hyperparameters relate to "pruning" the tree. By default they are set so the tree can overfit, and by setting them higher or lower (depending on the hyperparameter) you can reduce overfitting, but too much will lead to underfitting. These are: `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`, `max_features`, `max_leaf_nodes`, and `min_impurity_decrease`. You can also try changing the `criterion` to "entropy" or the `splitter` to "random" if you want to change the splitting logic.
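
To illustrate what "decreasing the entropy of the target in each split" means, here is a minimal sketch in plain Python. The labels are made up, and the helper names `entropy` and `information_gain` are hypothetical rather than scikit-learn functions.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy, made-up labels: a split that separates the classes perfectly
parent = [0, 0, 0, 0, 1, 1, 1, 1]
gain_perfect = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split that leaves both children as mixed as the parent
gain_useless = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(gain_perfect, gain_useless)
```

A tree-building algorithm would compare candidate splits by a quantity like this and keep the one with the largest gain.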

### Requirements

#### 1. Prepare the Data for Modeling

#### 2. Build a Baseline kNN Model

#### 3. Build Iterative Models to Find the Best kNN Model

#### 4. Build a Baseline Decision Tree Model

#### 5. Build Iterative Models to Find the Best Decision Tree Model

#### 6. Choose and Evaluate an Overall Best Model

## 1. Prepare the Data for Modeling

The target is `Cover_Type`. In the cell below, split `df` into `X` and `y`, then perform a train-test split with `random_state=42` and `stratify=y` to create variables with the standard `X_train`, `X_test`, `y_train`, `y_test` names.

Include the relevant imports as you go.

In [11]:
# Your code here
X=df.drop("Cover_Type", axis=1)
y=df["Cover_Type"]

from sklearn.model_selection import train_test_split

# Split the data into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
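
The subset sizes that this split should produce can be worked out by hand. The sketch below assumes `train_test_split` uses its default test fraction of 0.25 (no `test_size` is passed in the cell above) and rounds the test size up; the variable names are illustrative.

```python
from math import ceil

n_samples = 38501       # rows in df, from the output earlier in the lab
test_fraction = 0.25    # assumed default, since test_size is not passed

# The test subset size is rounded up; the training subset takes the remainder
n_test = ceil(test_fraction * n_samples)
n_train = n_samples - n_test

# With stratify=y, each subset keeps roughly the overall share of class 1
positive_share = 2747 / n_samples
approx_test_positives = round(positive_share * n_test)

print(n_train, n_test, approx_test_positives)
```

These sizes agree with the shapes that the lab's assertion cell checks for `X_train` and `X_test`.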

Now, instantiate a `StandardScaler`, fit it on `X_train`, and create new variables `X_train_scaled` and `X_test_scaled` containing values transformed with the scaler.

In [12]:
# Your code here
from sklearn.preprocessing import StandardScaler 

# Instantiate StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
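
As a rough illustration of what fitting on `X_train` and then transforming both subsets does, here is a plain-Python sketch for a single column. The elevations are made up, and the population standard deviation is used to mirror how `StandardScaler` computes its scale.

```python
from statistics import mean, pstdev

# Toy, made-up elevations in meters (not rows from the lab's dataset)
train_elevation = [2000.0, 2200.0, 2400.0, 2600.0]
test_elevation = [2300.0, 2800.0]

# "fit": learn the mean and population standard deviation from training data only
mu = mean(train_elevation)
sigma = pstdev(train_elevation)

# "transform": apply the same training statistics to both subsets
train_scaled = [(x - mu) / sigma for x in train_elevation]
test_scaled = [(x - mu) / sigma for x in test_elevation]

print(mean(train_scaled), pstdev(train_scaled))
print(test_scaled)
```

The test values are scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.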

The following code checks that everything is set up correctly:

In [13]:
# Run this cell without changes

# Checking that df was separated into correct X and y
assert type(X) == pd.DataFrame and X.shape == (38501, 52)
assert type(y) == pd.Series and y.shape == (38501,)

# Checking the train-test split
assert type(X_train) == pd.DataFrame and X_train.shape == (28875, 52)
assert type(X_test) == pd.DataFrame and X_test.shape == (9626, 52)
assert type(y_train) == pd.Series and y_train.shape == (28875,)
assert type(y_test) == pd.Series and y_test.shape == (9626,)

# Checking the scaling
assert X_train_scaled.shape == X_train.shape
assert round(X_train_scaled[0][0], 3) == -0.636
assert X_test_scaled.shape == X_test.shape
assert round(X_test_scaled[0][0], 3) == -1.370

## 2. Build a Baseline kNN Model

Build a scikit-learn kNN model with default hyperparameters. Then use `cross_val_score` with `scoring="neg_log_loss"` to find the mean log loss for this model (passing in `X_train_scaled` and `y_train` to `cross_val_score`). You'll need to find the mean of the cross-validated scores, and negate the value (either put a `-` at the beginning or multiply by `-1`) so that your answer is a log loss rather than a negative log loss.

Call the resulting score `knn_baseline_log_loss`.

Your code might take a minute or more to run.

In [14]:
# Replace None with appropriate code

# Relevant imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Creating the model
knn_baseline_model = KNeighborsClassifier().fit(X_train_scaled, y_train)

# Perform cross-validation
knn_baseline_log_loss = cross_val_score(knn_baseline_model, X_train_scaled, y_train, scoring="neg_log_loss")

# Display the per-fold scores (negative log losses)
knn_baseline_log_loss

array([-0.11101314, -0.09149019, -0.13059783, -0.13916178, -0.1553815 ])

In [16]:
-(knn_baseline_log_loss.mean())

0.1255288892455634

Our best logistic regression model had a log loss of 0.13031294393913376

Is this model better? Compare it in terms of metrics and speed.

In [None]:
# Replace None with appropriate text
"""
It has a lower log loss (about 0.126 vs. 0.130), so it should be better on that metric. Its runtime was pretty good, a minute or so, compared to the logistic regression, which took a while.

"""

## 3. Build Iterative Models to Find the Best kNN Model

Build and evaluate at least two more kNN models to find the best one. Explain why you are changing the hyperparameters you are changing as you go. These models will be *slow* to run, so be thinking about what you might try next as you run them.

In [17]:
# Your code here (add more cells as needed)
knn_model_2 = KNeighborsClassifier(n_neighbors=6).fit(X_train_scaled, y_train)

# Perform cross-validation
knn_model_2_log_loss = cross_val_score(knn_model_2, X_train_scaled, y_train, scoring="neg_log_loss")

knn_model_2_log_loss
-(knn_model_2_log_loss.mean())

0.10487348038090896

In [None]:
# First I want to try adding more neighbors (and then reducing them).
# Increasing to n_neighbors=6 was an improvement, so we could check even higher values.

In [18]:
# Your code here (add more cells as needed)
knn_model_2b = KNeighborsClassifier(n_neighbors=4).fit(X_train_scaled, y_train)

# Perform cross-validation
knn_model_2b_log_loss = cross_val_score(knn_model_2b, X_train_scaled, y_train, scoring="neg_log_loss")

knn_model_2b_log_loss
-(knn_model_2b_log_loss.mean())

0.14445909301128243

In [None]:
# Next I tried lowering n_neighbors to 4, which made performance worse.

In [19]:
# Your code here (add more cells as needed)
knn_model_3 = KNeighborsClassifier(n_neighbors=5, p=1).fit(X_train_scaled, y_train)

# Perform cross-validation
knn_model_3_log_loss = cross_val_score(knn_model_3, X_train_scaled, y_train, scoring="neg_log_loss")

knn_model_3_log_loss
-(knn_model_3_log_loss.mean())

0.1144531688354272

In [None]:
# Then I switched the distance metric from Euclidean to Manhattan (p=1), which was also an improvement over the baseline.
# So we could explore higher n_neighbors together with p=1.

In [20]:
# Your code here (add more cells as needed)
knn_model_4 = KNeighborsClassifier(n_neighbors=7, p=1).fit(X_train_scaled, y_train)

# Perform cross-validation
knn_model_4_log_loss = cross_val_score(knn_model_4, X_train_scaled, y_train, scoring="neg_log_loss")

knn_model_4_log_loss
-(knn_model_4_log_loss.mean())

0.08441299087246057

In [21]:
# Wow, that was even better. Instead of picking values by hand, I can loop over a range of n_neighbors.
for i in range(8, 15):
    knn_model = KNeighborsClassifier(n_neighbors=i, p=1).fit(X_train_scaled, y_train)

    # Perform cross-validation
    knn_model_log_loss = cross_val_score(knn_model, X_train_scaled, y_train, scoring="neg_log_loss")

    knn_model_log_loss
    print(i, -(knn_model_log_loss.mean()))
    

8 0.07018235128351437
9 0.06374575533465694
10 0.06254007184711532
11 0.06172556244896841
12 0.06177497736710367
13 0.0630702704169221
14 0.06291417506796734


In [None]:
# It looks like n_neighbors=11 gives the lowest log loss, with 12 nearly tied.
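
Reading the best value off a printed sweep can also be done programmatically. The sketch below copies the results from the loop output above into a dictionary; the names `results` and `best_k` are illustrative.

```python
# Mean cross-validated log loss by n_neighbors (with p=1), copied from the output above
results = {
    8: 0.07018235128351437,
    9: 0.06374575533465694,
    10: 0.06254007184711532,
    11: 0.06172556244896841,
    12: 0.06177497736710367,
    13: 0.0630702704169221,
    14: 0.06291417506796734,
}

# Lower log loss is better, so pick the key with the smallest value
best_k = min(results, key=results.get)

# How far behind is the second-best setting?
gap_to_runner_up = sorted(results.values())[1] - results[best_k]

print(best_k, gap_to_runner_up)
```

The gap between the top two settings is tiny, so the choice between them is probably not meaningful; it may be worth treating nearby values of `n_neighbors` as roughly equivalent.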

## 4. Build a Baseline Decision Tree Model

Now that you have chosen your best kNN model, start investigating decision tree models. First, build and evaluate a baseline decision tree model, using default hyperparameters (with the exception of `random_state=42` for reproducibility).

(Use cross-validated log loss, just like with the previous models.)

In [22]:
# Your code here
from sklearn.tree import DecisionTreeClassifier 
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train,y_train)

dt_model_log_loss = cross_val_score(dt_model, X_train_scaled, y_train, scoring="neg_log_loss")

#dt_model_log_loss
print( -(dt_model_log_loss.mean()))

0.7057351605151588


Interpret this score. How does this compare to the log loss from our best logistic regression and best kNN models? Any guesses about why?

In [None]:
# Replace None with appropriate text
"""
The baseline decision tree is much worse than kNN (and worse than the logistic regression) on log loss. One reason is that we haven't tuned the decision tree's hyperparameters yet.
There may also be an inherent reason: spatial relationships like those in the forest cover data
may lend themselves better to a kNN model than to a decision tree.
There are surely other reasons too.
"""

## 5. Build Iterative Models to Find the Best Decision Tree Model

Build and evaluate at least two more decision tree models to find the best one. Explain why you are changing the hyperparameters you are changing as you go.
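
One reason pruning hyperparameters can matter so much for log loss, offered here as an assumption rather than a finding of this lab: a fully grown tree typically ends in pure leaves, so its predicted probabilities are 0 or 1, and log loss penalizes a confident mistake heavily. The sketch below uses plain Python and made-up labels; `binary_log_loss` is a hypothetical helper, and the clipping value `eps` is an illustrative choice.

```python
from math import log


def binary_log_loss(y_true, prob_1, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, prob_1):
        # Clip so that log(0) never occurs; eps is an illustrative choice
        p = min(max(p, eps), 1 - eps)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


# Toy, made-up labels; both sets of predictions get only the last record wrong
y_true = [0, 0, 0, 0, 1]
hard_probs = [0.0, 0.0, 0.0, 0.0, 0.0]   # fully confident, as pure leaves would be
soft_probs = [0.1, 0.1, 0.1, 0.1, 0.3]   # hedged estimates

loss_hard = binary_log_loss(y_true, hard_probs)
loss_soft = binary_log_loss(y_true, soft_probs)
print(loss_hard, loss_soft)
```

Both sets of predictions would have the same accuracy at a 0.5 threshold, yet the fully confident one has a far larger log loss. Limiting tree growth leaves mixed leaves, which produce softer probabilities.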

In [23]:
# Your code here (add more cells as needed)
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt_model.fit(X_train,y_train)

dt_model_log_loss = cross_val_score(dt_model, X_train_scaled, y_train, scoring="neg_log_loss")

#dt_model_log_loss
print( -(dt_model_log_loss.mean()))

# Changing the split criterion from gini to entropy helped.

0.653103979503405


In [25]:
# Your code here (add more cells as needed)
for md in range(1,20):
    dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42, max_depth=md)
    dt_model.fit(X_train,y_train)

    dt_model_log_loss = cross_val_score(dt_model, X_train_scaled, y_train, scoring="neg_log_loss")

    #dt_model_log_loss
    print(md,  -(dt_model_log_loss.mean()))
    
#next, I wanted to address max_depth, because this is likely a culprit for overfitting.  And wow, was it ever.
#changing max_depth to 6 helps immensely.

1 0.21122323200787632
2 0.17110532803415182
3 0.14446509918733946
4 0.13234185089154932
5 0.11994764825821917
6 0.1191790711666659
7 0.1298679201964616
8 0.1775996337258546
9 0.24865257046826192
10 0.3207542010693237
11 0.3778680279243406
12 0.4665253451875496
13 0.5351015780065211
14 0.5951995281147878
15 0.6038151378766712
16 0.6503609729571844
17 0.6482489490972803
18 0.6501644194478378
19 0.6329568627225856


In [26]:
# Your code here (add more cells as needed)
for max_feat in range(1,50):
    dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42, max_features=max_feat)
    dt_model.fit(X_train,y_train)

    dt_model_log_loss = cross_val_score(dt_model, X_train_scaled, y_train, scoring="neg_log_loss")

    #dt_model_log_loss
    print(max_feat,  -(dt_model_log_loss.mean()))
    
# Next, I wanted to try max_features, which limits how many features are considered at each split and might reduce overfitting to extraneous columns.
# Unfortunately, nothing helps much here.

1 1.1937688745318766
2 1.2978352546052432
3 1.2081234271021617
4 1.150707625997695
5 1.1841994681963364
6 0.9318093114151649
7 1.0705639587042883
8 1.0310908775540182
9 0.8528628721977647
10 0.9174552849869121
11 0.8839626951127517
12 0.8743942302945319
13 0.8456860666712827
14 0.8169778753563474
15 0.8528629275811366
16 0.7751124980055636
17 0.752385794867431
18 0.7942503414676378
19 0.8791785457786995
20 0.7894668013506753
21 0.7284619190366636
22 0.747600787091116
23 0.7703281548297102
24 0.7918584052590412
25 0.7332465391293759
26 0.7380311592220885
27 0.8002314973442095
28 0.7344425487712031
29 0.741619935823089
30 0.7296582609787213
31 0.7380322391978382
32 0.7392275288558322
33 0.7224811508436944
34 0.7523852410337131
35 0.7487967967329432
36 0.7356393060885494
37 0.7212850581268094
38 0.7332466222044336
39 0.6949695230043043
40 0.7272663801534964
41 0.7200892423267835
42 0.7057350774401012
43 0.7212852242769248
44 0.6854003658939369
45 0.7260700935948103
46 0.698558022688446
47

In [31]:
for minss in range(15,40):
    dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42, min_samples_split =minss)
    dt_model.fit(X_train,y_train)

    dt_model_log_loss = cross_val_score(dt_model, X_train_scaled, y_train, scoring="neg_log_loss")

    #dt_model_log_loss
    print(minss,  -(dt_model_log_loss.mean()))
    
#next, I wanted to address min_samples_split, even though this could have similar effect to max_depth.


15 0.3858747199360895
16 0.37851811903707877
17 0.37865986461523704
18 0.3714917316073346
19 0.3582000082328332
20 0.3542099976991659
21 0.34109329895175644
22 0.34049125748522296
23 0.32075097579713724
24 0.30735329413667306
25 0.30676100686760976
26 0.2988039504382948
27 0.2931039583554921
28 0.29337527474145214
29 0.2905360244569308
30 0.2888192264039214
31 0.28489779980417895
32 0.2801200929860136
33 0.27332962568944275
34 0.2721657485870345
35 0.27235271422087426
36 0.27518050091033214
37 0.26742528056684967
38 0.26432023730911036
39 0.26462889517459387


In [36]:
final_DT_model = DecisionTreeClassifier(criterion='entropy', random_state=42, min_samples_split =38, max_depth=6)

# Fit the model on the full training data
# (scaled or unscaled depending on the model)
final_DT_model.fit(X_train,y_train)

final_dt_model_log_loss = cross_val_score(final_DT_model, X_train_scaled, y_train, scoring="neg_log_loss")

#dt_model_log_loss
print( -(final_dt_model_log_loss.mean()))

# so it does a lot better than we started with, but still not as well as kNN.

0.11796060834002864


## 6. Choose and Evaluate an Overall Best Model

Which model had the best performance? What type of model was it?

Instantiate a variable `final_model` using your best model with the best hyperparameters.

In [38]:
# Replace None with appropriate code
final_model = KNeighborsClassifier(n_neighbors=11, p=1)

# Fit the model on the full training data
# (scaled or unscaled depending on the model)
final_model.fit(X_train_scaled, y_train)

final_model_log_loss = cross_val_score(final_model, X_train_scaled, y_train, scoring="neg_log_loss")

#dt_model_log_loss
print( -(final_model_log_loss.mean()))

0.12685488982085605


Now, evaluate the log loss, accuracy, precision, and recall. This code is mostly filled in for you, but you need to replace `None` with either `X_test` or `X_test_scaled` depending on the model you chose.

In [None]:
# Replace None with appropriate code
from sklearn.metrics import accuracy_score, log_loss, precision_score, recall_score

preds = final_model.predict(X_test_scaled)
probs = final_model.predict_proba(X_test_scaled)

print("log loss: ", log_loss(y_test, probs))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))
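
For reference, the three classification metrics printed above can be derived from confusion-matrix counts. This is a plain-Python sketch; the counts are hypothetical and are not results from this lab, and `classification_metrics` is an illustrative helper rather than a scikit-learn function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: of the records predicted as class 1, how many really are class 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the records that really are class 1, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for illustration only (not results from this lab)
accuracy, precision, recall = classification_metrics(tp=60, fp=20, fn=40, tn=880)
print(accuracy, precision, recall)
```

With an imbalanced target, the large `tn` count keeps accuracy high even when many positives are missed, so precision and recall tell more about how well the rare class is handled.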

Interpret your model performance. How would it perform on different kinds of tasks? How much better is it than a "dummy" model that always chooses the majority class, or the logistic regression described at the start of the lab?

In [None]:
# Replace None with appropriate text
"""
A dummy model would have about 93% accuracy. Our logistic regression had about 95%;
this model does even better, with 98% accuracy. Awesome!
"""

## Conclusion

In this lab, you practiced the end-to-end machine learning process with multiple model algorithms, including tuning the hyperparameters for those different algorithms. You saw how nonparametric models can be more flexible than linear models, potentially leading to overfitting but also potentially reducing underfitting by being able to learn non-linear relationships between variables. You also likely saw how there can be a tradeoff between speed and performance, with good metrics correlating with slow speeds.