<font color="#de3023"><h1><b>REMINDER MAKE A COPY OF THIS NOTEBOOK, DO NOT EDIT</b></h1></font>

# Goals
In this notebook you will:

*   Extend your decision tree model into a random forest regression model.
*   Discuss what *overfitting* means and determine if your model is overfitting.
*   Learn about different hyperparameters you can change in your random forest model to reduce overfitting.
*   Implement grid search and cross validation to find the best set of hyperparameters for your random forest model.


In [None]:
#@title ###Setup notebook.

# Sample metadata
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/sample_metadata.tsv"

# bacteria counts lognorm
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/bacteria_counts_lognorm.csv"

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from sklearn.tree import plot_tree
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV


metadata = pd.read_table('sample_metadata.tsv')
metadata.index = ['farm_%i' % i for i in range(len(metadata))]

bacteria_counts_lognorm = pd.read_csv('bacteria_counts_lognorm.csv', index_col=0)

from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score, mean_absolute_error


# Let's take a vote!

Let's go back to the decision tree model you built in the previous notebooks and learn how to build an ***ensemble model*** from it.

This analogy might date the creator of this notebook a little, but there used to be a trivia game show called "Who Wants to Be A Millionaire". In this show, if a contestant didn't know the answer to a question, they could either "phone-a-friend" or "ask-the-audience". If asking a machine learning model for a prediction is analogous to "phone-a-friend", then maybe we can also "ask-the-audience" by asking many different machine learning models to vote on their prediction!



<img src="https://live.staticflickr.com/3267/3156507938_2258fe3ae0.jpg"/>



This idea of building multiple machine learning models to vote on their prediction for a label is known as **ensemble learning**.  We train a bunch of different models (with slightly different sets of features), and ask them to vote on the best prediction. From there, we either take a majority vote (if doing classification) or an average of the votes (if doing regression).
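To make the two voting rules concrete, here is a minimal sketch using made-up predictions from three hypothetical models. The numbers, labels, and variable names are invented for illustration and do not come from our dataset.

```python
import numpy as np
from collections import Counter

# Hypothetical predictions from three models for the same four samples.
# Regression: each model predicts a number (for example, a crop yield).
regression_votes = np.array([
    [2.0, 3.5, 1.0, 4.0],   # model A
    [2.4, 3.1, 1.2, 4.4],   # model B
    [1.6, 3.3, 0.8, 3.6],   # model C
])
# The ensemble prediction is the average vote for each sample.
ensemble_regression = regression_votes.mean(axis=0)

# Classification: each model predicts a label.
classification_votes = [
    ['high', 'low', 'low', 'high'],   # model A
    ['high', 'high', 'low', 'high'],  # model B
    ['low', 'low', 'low', 'high'],    # model C
]
# The ensemble prediction is the most common label for each sample.
ensemble_classification = [
    Counter(sample_votes).most_common(1)[0][0]
    for sample_votes in zip(*classification_votes)
]

print(ensemble_regression)
print(ensemble_classification)
```

Notice that no single model has to be right every time; the vote smooths out individual mistakes.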

# From Trees to Forests

What do you get if you put a bunch of trees together?  You get a forest!
An ensemble machine learning model built using many *decision trees* is called a **random forest**, which is what we are going to use today in order to improve our crop-yield-from-bacteria model.  The image below illustrates how a random forest works.  ```scikit-learn``` has a random forest model ```RandomForestRegressor``` that you can initialize, train, and test using the same syntax we've been practicing with.


<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/rfc_vs_dt1.png" width=500/>
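If you'd like to see the "vote" happen, the sketch below trains a small forest on synthetic stand-in data (generated with `make_regression`, not our bacteria dataset) and averages the individual trees' predictions by hand. The `_demo` names are hypothetical and chosen so they don't overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the bacteria dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=20,
                                 noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree lives in forest.estimators_; ask every tree for its vote.
tree_votes = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the votes by hand should match what forest.predict returns.
manual_average = tree_votes.mean(axis=0)
print(tree_votes.shape)
print(np.allclose(manual_average, forest.predict(X_demo)))
```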

#### **Exercise: Using ```RandomForestRegressor```, train and test a random forest model. Use ```n_estimators=10``` (the number of decision trees in your random forest).**

In [None]:
# We helped you define your X and y data here.
X = bacteria_counts_lognorm
y = metadata['crop_yield']


# Split your data into testing and training.
### FILL IN ###

# Now, initialize your model with n_estimators=10 (leave the other settings at their defaults for now!)
### FILL IN ###

# Train your model with the training data.
### FILL IN ###

# Make predictions on your test data. (Don't try to compute accuracy just yet...)
### FILL IN ###

# Plot your predictions against the true crop yields of the test data
### FILL IN ###
plt.xlabel('True crop yields')
plt.ylabel('Predicted crop yields')
plt.show()

# Compute your r2_score using r2_score(y_test, preds)
print('r2_score:', ) ## FILL IN ###

In [None]:
#@title Example Solution
# We helped you define your X and y data here.
X = bacteria_counts_lognorm
y = metadata['crop_yield']

# Split your data into testing and training.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Now, initialize your model with n_estimators=10 (leave the other settings at their defaults for now!)
model = RandomForestRegressor(n_estimators=10)

# Train your model with the training data.
model.fit(X_train, y_train)

# Make predictions on your test data. (Don't try to compute accuracy just yet...)
preds = model.predict(X_test)

# Plot your predictions against the true crop yields of the test data
plt.plot(y_test, preds, '.')
plt.xlabel('True crop yields')
plt.ylabel('Predicted crop yields')
plt.show()

print('r2_score:', r2_score(y_test, preds))

# Feature-reduced Models

We talked about feature importances in the second notebook, and we will use them once again to build a *feature-reduced* model.  To review, recall the equation that linear regression is fitting:

$
Y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ...
$

The $\beta$s correspond to the weights of the features, or how important each feature is to the model. Decision trees and random forests don't fit a linear equation, so we don't have feature weight coefficients per se, but we can calculate something called **feature importances**. Feature importances are similar to weights in that they measure how much a certain feature helped with the prediction.

We can access the values of the feature importances after training with ```model.feature_importances_```.

After training, if any value in ```feature_importances_``` equals 0, that means the corresponding feature *did not get used in the model*.  ***The number of features used in the model equals the number of non-zero feature importances.***



### **Exercise: Recall the Random Forest Model you just trained. How many *features* did we end up using in the model?**

**Hint:** you can sum based on conditionals! If `feature_importances` is a NumPy array and I wanted to see how many of its items are `> 5`, I could write a line of code like this:

```
num_features_greater_than_five = sum(feature_importances>5)
```
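Here is a small sketch of why summing a condition works, using a made-up array of six importances (the values and names are invented for illustration):

```python
import numpy as np

# Made-up importances for six features (illustration only).
toy_importances = np.array([0.0, 0.4, 0.0, 0.35, 0.25, 0.0])

# Comparing a NumPy array to a number gives an array of True/False values.
is_used = toy_importances != 0

# True counts as 1 and False counts as 0, so summing counts the True entries.
n_used = int(is_used.sum())

print(is_used)
print(n_used)
```

This trick relies on the data being a NumPy array; comparing a plain Python list to a number raises an error instead.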
Fill in the code below to see how many features were used in our model!


In [None]:
feature_importances = [] #### FILL IN ######
n_possible_features = len(feature_importances)
n_features_used = 0 ##### FILL IN ######
print("The random forest model used %i out of a possible %i features" %
      (n_features_used, n_possible_features))

In [None]:
#@title #### Example Solution
feature_importances = model.feature_importances_
n_possible_features = len(feature_importances)
n_features_used = sum(feature_importances!=0)
print("The random forest model used %i out of a possible %i features" %
      (n_features_used, n_possible_features))

Thousands of features is a lot of features! In feature-reduced models, we penalize the model for having more features. Hopefully then the model will pick only a few very important features to use.

In [None]:
#@title ### **Exercise: Why might we want to try to *reduce* the number of features we use in a model?**
_1_ = "" #@param {type:"string"}
_2_ = "" #@param {type:"string"}

print("1. To prevent overfitting. Especially in models with more features than samples, ")
print("we can end up overfitting on the training set.\n")
print("2. In order to determine which features are the most important.")
print("In the case of crop yield, we can use the important features to ")
print("identify which bacteria matter most for predicting crop yield.")

**Optional:** If you'd like to learn more about feature reduction and dimensionality in machine learning, take a look at [this article!](https://venturebeat.com/2021/05/16/understanding-dimensionality-reduction-in-machine-learning-models/)

#### **Exercise: Compute training and testing $R^2$ to see if your original model was overfitting.**  Remember the ```r2_score``` function!

What does it mean if your training accuracy is significantly higher than your testing accuracy?
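As a rough illustration of what a train/test gap looks like, the sketch below fits a fully grown decision tree to synthetic stand-in data with more features than samples. The data come from `make_regression` rather than our bacteria counts, and all names are hypothetical, so the exact scores will differ from what you see in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data: 80 samples, 200 features, plus some noise.
X_demo, y_demo = make_regression(n_samples=80, n_features=200,
                                 n_informative=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# A fully grown tree can memorize its training data.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, tree.predict(X_tr))
test_r2 = r2_score(y_te, tree.predict(X_te))
print("Training r2:", train_r2)
print("Testing r2:", test_r2)
```

A training score near 1 paired with a much lower testing score is the usual sign that a model has memorized its training data rather than learned a pattern that carries over to new samples.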

In [None]:
# Predict on the training data.
y_pred_train = None ## FILL IN ###
training_r2 = None ## FILL IN ###

# Predict on the testing data.
y_pred_test = None ## FILL IN ###
# Compare the predictions on the testing data against the true crop yields.
testing_r2 = None ## FILL IN ###

print("Training r2:", training_r2)
print("Testing r2:", testing_r2)

In [None]:
#@title ####Example Solution
# Predict on the training data.
y_pred_train = model.predict(X_train)
training_r2 = r2_score(y_train, y_pred_train)

# Predict on the testing data.
y_pred_test = model.predict(X_test)
# Compare the predictions on the testing data against the true crop yields.
testing_r2 = r2_score(y_test, y_pred_test)

print("Training r2:", training_r2)
print("Testing r2:", testing_r2)

## Build a Feature-Reduced Model

We can use something called **pruning** to incentivize our model to use fewer features.  In **pruning**, the decision tree or random forest "prunes," or removes, branches in its trees in order to decrease the number of features being used, even if it sacrifices a bit of training accuracy.  We can specify how much pruning we want by adding the argument ```ccp_alpha``` (ccp stands for cost-complexity pruning) to ```RandomForestRegressor```. A higher value of ```ccp_alpha``` means the model will perform more pruning.
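To get a feel for what pruning does, the sketch below fits a single decision tree several times with increasing ```ccp_alpha``` and reports how many nodes and features survive. It uses synthetic stand-in data, and the alpha values are assumptions picked to suit that data's scale; they are not the right range for our crop yield data.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the bacteria dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=20,
                                 noise=10.0, random_state=0)

# Alpha values chosen for this synthetic data's scale (an assumption).
node_counts = []
for alpha in [0.0, 10.0, 100.0, 1000.0]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    tree.fit(X_demo, y_demo)
    # Count the features that the pruned tree still splits on.
    n_features_used = int((tree.feature_importances_ != 0).sum())
    node_counts.append(tree.tree_.node_count)
    print("ccp_alpha=%g: %i nodes, %i features used"
          % (alpha, tree.tree_.node_count, n_features_used))
```

The useful range of ```ccp_alpha``` depends on the scale of the values you are predicting, which is why the numbers here differ so much from the slider below.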


### **Exercise: Investigate how different values of ```ccp_alpha``` affect the number of features in a model and its accuracy.  Experiment with this to try to get the best accuracy you can.**


In [None]:
ccp_alpha = 0.00003 #@param {type:"slider", min:0.00001, max:0.0001, step:0.00001}
model = RandomForestRegressor(n_estimators=10, ccp_alpha=ccp_alpha)
model.fit(X_train, y_train)
print("Using ccp_alpha=", ccp_alpha)
print("Pruned Model Training R2:", r2_score(y_train,model.predict(X_train)))
print("Pruned Model Testing R2:", r2_score(y_test,model.predict(X_test)))
print("Number of non-zero feature importances in pruned model:", sum(model.feature_importances_!=0))

# Testing vs. Validation

Imagine this: I tell everyone to go home and build the best model you can for this project. Whoever builds the best model wins a prize. Note that this is just a pretend scenario -- in reality, you all should be working together :)
<img src="https://cdn.pixabay.com/photo/2016/08/23/17/30/cup-1615074_960_720.png" width="500">

#### **Exercise: While you are at home experimenting with different models, how would you decide if you built a good model?**



#### **Exercise: When you all convene once again, how should I decide whose model is the best?**


In [None]:
_explanation_ = "" #@param {type:"string"}

It's not really fair for you all to be evaluating your model on a test dataset, and then getting scored on the same test dataset you had access to all along!  Instead I should hold out *another* test dataset that I will use to evaluate the final performance of all of your models.

We sometimes refer to the test dataset that you have access to during model tuning as the validation dataset, and the final data used to figure out whose model performed the best is the test dataset. And confusingly, sometimes we just call them both "test datasets."
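One common way to set this up is to call ```train_test_split``` twice. The sketch below uses placeholder data, and the split fractions and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 5 features each.
X_demo = np.arange(500).reshape(100, 5)
y_demo = np.arange(100)

# First, hold out the final test set that nobody touches while tuning.
X_rest, X_final_test, y_rest, y_final_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Then split what is left into training and validation sets.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_final_test))
```

You would tune your model using the training and validation sets, and only score it on the final test set once, at the very end.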

# Cross Validation
We can use a method called cross validation to decide what the best value of ```ccp_alpha``` is. In cross validation, we split the training set into a smaller training set and a validation set. We train the model using different values of ```ccp_alpha``` and evaluate the accuracy using the validation set. We repeat this for $K$ different training/validation splits. Then we choose the value of ```ccp_alpha``` that has the best average accuracy across those splits. K-fold cross validation also gives a more reliable estimate of a model's accuracy, since every sample is used for training in some iterations and for validation in another.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1280px-K-fold_cross_validation_EN.svg.png" alt="drawing" width="500"/>
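The splitting itself can be sketched with scikit-learn's ```KFold``` on ten placeholder samples. The data and names below are made up; the point is only to show which samples land in the validation set on each iteration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples with two features each.
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Track how many times each sample is used for validation.
times_in_validation = np.zeros(len(X_demo), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo)):
    times_in_validation[val_idx] += 1
    print("Fold %i: train on %s, validate on %s" % (fold, train_idx, val_idx))

print(times_in_validation)
```

Each sample is used for validation exactly once and for training in the other four folds.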

Instead of specifying ```ccp_alpha=``` in our arguments to the ```RandomForestRegressor``` class, we can pass our ```RandomForestRegressor``` to the ```GridSearchCV``` class and tell it which values of ```ccp_alpha``` we want it to try.

```
model = RandomForestRegressor(n_estimators=10)
model_cv = GridSearchCV(model, param_grid={'ccp_alpha': [.00001, .00002, .00003, .00004, .00005]})
```

Then, we can run our regular ```.fit()``` and ```.predict()``` functions on ```model_cv```.
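After fitting, a ```GridSearchCV``` object also records how each candidate value scored. The sketch below shows one way to inspect that on synthetic stand-in data; the alpha values, ```cv=3```, and the variable names are assumptions for illustration, not recommendations for our dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the bacteria dataset).
X_demo, y_demo = make_regression(n_samples=60, n_features=10,
                                 noise=10.0, random_state=0)

# Candidate values picked for this synthetic data's scale (an assumption).
alphas = [0.0, 10.0, 100.0]
search = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=0),
                      param_grid={'ccp_alpha': alphas}, cv=3)
search.fit(X_demo, y_demo)

# best_params_ holds the winning setting; cv_results_ holds every score.
print("Best ccp_alpha:", search.best_params_['ccp_alpha'])
mean_scores = search.cv_results_['mean_test_score']
for alpha, score in zip(alphas, mean_scores):
    print("ccp_alpha=%g: mean validation R2 = %.3f" % (alpha, score))
```

For a regressor, the default score that ```GridSearchCV``` reports is $R^2$, averaged over the validation folds.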

### **Exercise: Implement cross-validation for your random forest model and compute $R^2$.**

Note: Sometimes you can get unlucky with the randomly selected validation sets, and cross validation won't improve your model.

In [None]:
model = RandomForestRegressor(n_estimators=10)

# Pass model to GridSearchCV as well as the values of ccp_alpha you wish to try out.
model_cv = GridSearchCV(model, param_grid={'ccp_alpha': [0]})  ### FILL IN: replace [0] with the values you want to try ###


model_cv.fit(X_train, y_train)
print("Pruned Model Training R2:", r2_score(y_train,model_cv.predict(X_train)))
print("Pruned Model Testing R2:", r2_score(y_test,model_cv.predict(X_test)))
print("Best value of ccp_alpha:", model_cv.best_estimator_.ccp_alpha)
print("Number of non-zero feature importances in best model:", sum( model_cv.best_estimator_.feature_importances_!=0))

In [None]:
#@title ####Example Solution
model = RandomForestRegressor(n_estimators=10)

# Pass model to GridSearchCV as well as the values of ccp_alpha you wish to try out.
model_cv = GridSearchCV(model, param_grid={'ccp_alpha': [.00001, .00002, .00003, .00004, .00005]})


model_cv.fit(X_train, y_train)
print("Pruned Model Training R2:", r2_score(y_train,model_cv.predict(X_train)))
print("Pruned Model Testing R2:", r2_score(y_test,model_cv.predict(X_test)))
print("Best value of ccp_alpha:", model_cv.best_estimator_.ccp_alpha)
print("Number of non-zero feature importances in best model:", sum( model_cv.best_estimator_.feature_importances_!=0))

**Amazing job!!**  Much of what we went over today is normally taught in a sophomore level college course!

<img src="https://i.imgflip.com/5ez0wt.jpg"/>


To synthesize what we learned today, complete the following exercise.


In [None]:
#@title ### **Exercise: What other machine learning applications that we learned about in this course do you think could benefit from ensemble models or feature reduction techniques?**
_1_ = "" #@param {type:"string"}
_2_ = "" #@param {type:"string"}


# Wrapping Up

## Generalizability

Recall from the very first notebook what our metadata looked like. What type of plant farms did this data come from? What country did the data come from?

<img src="https://media.istockphoto.com/vectors/cereal-plant-grain-and-seed-isolated-sketches-vector-id1215324634?b=1&k=6&m=1215324634&s=170667a&w=0&h=jJ2-la_V5A3tAjAOdxeoP4uOOnNQGTkn3qcuOkLLAAA=" width=400>

#### **Exercise: Do you think your model will *generalize* well to new data? Why or why not? What types of datasets would it work best on?**




In [None]:
answer = "" #@param {type:"string"}


If you said that our model might not generalize well to new types of data, answer the following:

#### **Exercise: What are the dangers of a model not generalizing well to future data?  How could we improve our model to mitigate these dangers?**

In [None]:
answer = "" #@param {type:"string"}


#### **Final Exercise: Write a 3-sentence announcement to the National Farmers Association explaining how AI can be used for sustainable farming in the face of climate change.**

<img src="https://cdn.pixabay.com/photo/2017/04/25/19/36/farmer-2260636__340.jpg" width=400>

Keep in mind that farmers may dislike the idea of a black box choosing what land to allocate for farming, and that farmers might fear losing their jobs to automation.


In [None]:
Dear_Farmers = "" #@param {type:"string"}
