<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: The random forest
© ExploreAI Academy

In this exercise, we build, evaluate and compare random forest regression models.

## Learning objectives

By the end of this train, you should be able to:
* Build a random forest regression model in Python.
* Experiment with different numbers of trees.
* Evaluate feature importance using a random forest. 

## Exercises

In this exercise, we use the `Crop_yield` dataset, which contains various factors that could influence the yield of a particular crop across different regions.

### Import libraries and dataset

In [8]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [9]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(3)

### Preparing the dataset

In the code below, we prepare our dataset for modeling by encoding categorical variables to convert them to a numeric format.

In [None]:
# Dummy Variable Encoding for categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)
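
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a small made-up DataFrame. The values and the `toy` and `toy_encoded` names are hypothetical and are not taken from `Crop_yield`. It illustrates that a categorical column with k levels becomes k-1 indicator columns, with the first level acting as the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the real Crop_yield dataset)
toy = pd.DataFrame({
    "Rainfall": [120.0, 85.5, 101.2],
    "Soil_Type": ["Clay", "Loamy", "Sandy"],
})

# Numeric columns pass through unchanged; the categorical column is expanded
# into indicator columns, and its first level ("Clay") is dropped as the baseline
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(toy_encoded.columns.tolist())
# ['Rainfall', 'Soil_Type_Loamy', 'Soil_Type_Sandy']
```

Dropping one level per category avoids carrying a redundant column, since the dropped level is implied when all of its sibling indicators are zero.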

### Exercise 1

Create a function named `train_rf_model` to train and evaluate a random forest regression model on the encoded dataset. 

The function should take in 3 parameters:
- DataFrame containing the encoded features
- A string containing the name of the target variable
- The number of estimators for the random forest 

It then returns: 
- The trained model object 
- A dictionary containing the MSE and R<sup>2</sup> scores of the model's performance on the test set.

In [11]:
def train_rf_model(data, target_variable, n_estimators):

    # Splitting the dataset into features and target variable
    X = data.drop(target_variable, axis=1)  # Features
    y = data[target_variable]  # Target variable

    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the RandomForestRegressor with n_estimators
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Training the model on the training set
    rf_model.fit(X_train, y_train)

    # Making predictions on the test set
    y_pred = rf_model.predict(X_test)

    # Evaluating the model
    mse = metrics.mean_squared_error(y_test, y_pred)  # Mean squared error; np.sqrt(mse) gives the RMSE
    r2 = metrics.r2_score(y_test, y_pred)
    
    # Return the trained model and its performance metrics
    return rf_model, {'MSE': mse, 'R2': r2}


### Exercise 2

Use the function you defined in **Exercise 1** to train and evaluate three random forest regression models with `50`, `100`, and `200` estimators respectively. Store the results in a dictionary.

In [12]:
# Number of estimators to evaluate
estimators_list = [50, 100, 200]

# Dictionary to store results
results = {}

# Train and evaluate models with different numbers of estimators
for n in estimators_list:
    # Train a model with n trees and store its metrics dictionary under a descriptive key
    model, metric = train_rf_model(df_encoded, 'Yield', n)
    results[f"{n} trees"] = metric
    
results

{'50 trees': {'MSE': 0.739261264251345, 'R2': 0.9920180175887953},
 '100 trees': {'MSE': 0.7288864859605081, 'R2': 0.9921300365756436},
 '200 trees': {'MSE': 0.7200078994393476, 'R2': 0.9922259008186051}}

### Exercise 3

Say we wish to understand which features have the most impact on crop yield predictions.

Use the `feature_importances_` attribute of the last random forest model trained in **Exercise 2** (the one with 200 trees) to return a Series containing the importance score of each feature in our dataset, sorted in descending order.

In [13]:
# Extract feature importances from the model
feature_importances = model.feature_importances_

# Get the names of the features, excluding the target variable 'Yield'
feature_names = df_encoded.drop('Yield', axis=1).columns

# Create a pandas Series 
importances = pd.Series(feature_importances, index=feature_names)

# Sort the feature importances in descending order
sorted_importances = importances.sort_values(ascending=False)
sorted_importances

Rainfall                  0.978910
Fertilizer_Usage          0.016670
Temperature               0.001971
Pesticide_Usage           0.001102
Irrigation                0.000251
Crop_Variety_Variety B    0.000202
Region_West               0.000194
Soil_Type_Loamy           0.000161
Soil_Type_Sandy           0.000158
Crop_Variety_Variety C    0.000143
Region_North              0.000120
Region_South              0.000118
dtype: float64

## Solutions

### Exercise 1

In [None]:
def train_rf_model(data, target_variable, n_estimators):

    # Splitting the dataset into features and target variable
    X = data.drop(target_variable, axis=1)  # Features
    y = data[target_variable]  # Target variable

    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the RandomForestRegressor with n_estimators
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Training the model on the training set
    rf_model.fit(X_train, y_train)

    # Making predictions on the test set
    y_pred = rf_model.predict(X_test)

    # Evaluating the model
    mse = metrics.mean_squared_error(y_test, y_pred)  # Mean squared error; np.sqrt(mse) gives the RMSE
    r2 = metrics.r2_score(y_test, y_pred)
    
    # Return the trained model and its performance metrics
    return rf_model, {'MSE': mse, 'R2': r2}


The function `train_rf_model` is designed to train and evaluate a random forest regression model. 

It takes three parameters: `data`, `target_variable`, and `n_estimators`.

The function returns two items: the trained random forest model `rf_model` and a dictionary of evaluation metrics with the keys `'MSE'` and `'R2'`, both computed on the test set.
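
If the RMSE is preferred over the MSE, it can be derived by taking the square root of the MSE. The sketch below shows this on a few made-up values; `y_true` and `y_hat` are hypothetical and stand in for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# MSE is the mean of the squared errors: (1 + 0 + 4) / 3
mse = metrics.mean_squared_error(y_true, y_hat)

# RMSE is its square root, expressed in the same units as the target
rmse = np.sqrt(mse)

print(round(mse, 4), round(rmse, 4))
```

Recent scikit-learn releases also provide `metrics.root_mean_squared_error`, but taking the square root yourself works regardless of the installed version.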

### Exercise 2

In [None]:
# Number of estimators to evaluate
estimators_list = [50, 100, 200]

# Dictionary to store results
results = {}

# Train and evaluate models with different numbers of estimators
for n in estimators_list:
    # Train a model with n trees and store its metrics dictionary under a descriptive key
    model, metric = train_rf_model(df_encoded, 'Yield', n)
    results[f"{n} trees"] = metric
    
results

{'50 trees': {'MSE': 0.739261264251345, 'R2': 0.9920180175887953},
 '100 trees': {'MSE': 0.7288864859605081, 'R2': 0.9921300365756436},
 '200 trees': {'MSE': 0.7200078994393476, 'R2': 0.9922259008186051}}

In the code above, we use the previously created function to train and evaluate multiple random forest models, each with a different number of trees (estimators). 

The `for` loop iterates over each value in `estimators_list` and calls `train_rf_model()` with the encoded dataset, the target name `'Yield'`, and the current number of estimators `n`.

The two items returned by the function are stored in separate variables, `model` and `metric`. Because `model` is overwritten on every iteration, it refers to the 200-tree model once the loop finishes.

The `results` dictionary is then used to store the evaluation metrics for each model trained with a different number of trees. The keys are strings indicating the number of trees, and the values are the dictionary of metrics returned by the function.
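
A nested dictionary like this can be easier to compare as a table. The following is one possible sketch, not part of the original solution: it rebuilds `results` from the values printed above and converts it to a DataFrame, where `results_df` and `best` are names introduced here for illustration.

```python
import pandas as pd

# Metrics copied from the printed output above
results = {
    "50 trees": {"MSE": 0.739261264251345, "R2": 0.9920180175887953},
    "100 trees": {"MSE": 0.7288864859605081, "R2": 0.9921300365756436},
    "200 trees": {"MSE": 0.7200078994393476, "R2": 0.9922259008186051},
}

# One row per model, one column per metric
results_df = pd.DataFrame.from_dict(results, orient="index")

# Label of the model with the lowest test MSE
best = results_df["MSE"].idxmin()

print(results_df)
print(best)  # 200 trees
```

On these numbers the improvement from 50 to 200 trees is small, so the extra training cost may or may not be worthwhile depending on the use case.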

### Exercise 3

In [None]:
# Extract feature importances from the model
feature_importances = model.feature_importances_

# Get the names of the features, excluding the target variable 'Yield'
feature_names = df_encoded.drop('Yield', axis=1).columns

# Create a pandas Series 
importances = pd.Series(feature_importances, index=feature_names)

# Sort the feature importances in descending order
sorted_importances = importances.sort_values(ascending=False)
sorted_importances

Rainfall                  0.978910
Fertilizer_Usage          0.016670
Temperature               0.001971
Pesticide_Usage           0.001102
Irrigation                0.000251
Crop_Variety_Variety B    0.000202
Region_West               0.000194
Soil_Type_Loamy           0.000161
Soil_Type_Sandy           0.000158
Crop_Variety_Variety C    0.000143
Region_North              0.000120
Region_South              0.000118
dtype: float64

In the code above, we use the `feature_importances_` attribute of the trained random forest model to extract the importance scores for each feature. 

The variable `feature_names` stores the names of the features used to train the model, that is, the columns of `df_encoded` without `Yield`, in the same order as the importance scores. This lets us map each importance score to its corresponding feature name.

`importances` is a pandas Series in which each feature's importance score is indexed by the feature's name.

In `sorted_importances`, we get the importances sorted in descending order to get a quick view of the features considered most important by the model.
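
As a hedged follow-up, the cumulative sum of the sorted scores shows how much of the total importance the leading features account for. The sketch below rebuilds the Series from the values printed above; `cumulative` is a name introduced here for illustration.

```python
import pandas as pd

# Importance scores copied from the printed output above
sorted_importances = pd.Series({
    "Rainfall": 0.978910,
    "Fertilizer_Usage": 0.016670,
    "Temperature": 0.001971,
    "Pesticide_Usage": 0.001102,
    "Irrigation": 0.000251,
    "Crop_Variety_Variety B": 0.000202,
    "Region_West": 0.000194,
    "Soil_Type_Loamy": 0.000161,
    "Soil_Type_Sandy": 0.000158,
    "Crop_Variety_Variety C": 0.000143,
    "Region_North": 0.000120,
    "Region_South": 0.000118,
})

# Running total of importance, from the most to the least important feature
cumulative = sorted_importances.cumsum()

# Share of total importance covered by the two highest-ranked features
print(round(cumulative.iloc[1], 5))  # 0.99558
```

The scores of a fitted scikit-learn random forest are normalised to sum to 1, so each cumulative value can be read directly as a share of the total.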

> Which top 2 features contribute the most to the model's predictive ability?

Understanding feature importance and the contribution of each variable to the model's predictions offers us an opportunity to streamline our models. This understanding enables us to focus on the most influential features, thereby reducing model complexity without significantly sacrificing performance.

In refining your model, you should consider an experiment: retrain the model using only the subset of features that have demonstrated the highest importance scores. This encourages an exploration into how much we can reduce complexity while maintaining, or even potentially improving, model accuracy.
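
The experiment described above could be sketched as follows. To keep the example self-contained it uses synthetic data rather than `df_encoded`: the columns `x0` to `x4`, the coefficients, and all variable names are assumptions made for illustration, with only `x0` and `x1` actually driving the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in for the encoded dataset: only x0 and x1 influence y
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
y = 10 * X["x0"] + 3 * X["x1"] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on all features and rank them by importance
full_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)
ranked = pd.Series(full_model.feature_importances_, index=X.columns).sort_values(ascending=False)
top_features = ranked.index[:2].tolist()

# Refit using only the two highest-ranked features
small_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train[top_features], y_train)

# Compare test-set R2 of the full and the reduced model
r2_full = metrics.r2_score(y_test, full_model.predict(X_test))
r2_small = metrics.r2_score(y_test, small_model.predict(X_test[top_features]))
print(top_features, round(r2_full, 3), round(r2_small, 3))
```

On the real dataset, the same pattern would apply with `df_encoded` in place of the synthetic frame and the top entries of `sorted_importances` as the selected columns. Whether the reduced model holds up there is something to check empirically.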

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>