                                Prepared by: Amon Melly || Email: amon.kmelly@gmail.com
## Decision Tree & Random Forest

### Import Required Libraries

In [79]:
# Data Processing
import pandas as pd

# Modelling
from sklearn.model_selection import train_test_split,RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from scipy.stats import randint
from skopt import BayesSearchCV


# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image, display
import graphviz

### Preprocessing Data

In [99]:
data = pd.read_csv('Maji_Ndogo_agric_survey_data_small.csv')

# Convert categorical variable 'Soil_type' into dummy variables
dummies = pd.get_dummies(data['Soil_type'], prefix='Soil_type')

# Concatenate the dummy variables with the original DataFrame excluding 'Soil_type'
data_with_dummies = pd.concat([data.drop(['Soil_type'], axis =1), dummies], axis=1)

# Convert Boolean columns to integer
for col in data_with_dummies.columns:
    if data_with_dummies[col].dtype == 'bool':
        data_with_dummies[col] = data_with_dummies[col].astype(int)

In [100]:
data_with_dummies.head()

### Splitting the Data

In [101]:
# Split the data into features (X) and target (y)
X = data_with_dummies.drop('Standard_yield', axis=1)
y = data_with_dummies['Standard_yield']

# Split the data into training and test sets (fixed seed so the scores below are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 1. Decision Tree

In [102]:
model = DecisionTreeRegressor(random_state=42, max_depth = 4)
model.fit(X_train, y_train)
predicted = model.predict(X_test)

#R Squared Score
print(f'R_Squared Score: {r2_score(y_test,predicted)}')

### Visualizing your Decision Tree

In [103]:
XX = data_with_dummies[['Pollution_level','Rainfall','pH']]
yy = data_with_dummies['Standard_yield']

model_vis = DecisionTreeRegressor(random_state=42, max_depth = 2)
model_vis.fit(XX,yy)

dot_data = export_graphviz(model_vis,
                           feature_names=XX.columns,
                           filled=True,
                           max_depth=2,
                           impurity=False,
                           proportion=True)
graph = graphviz.Source(dot_data)
display(graph)

### The concept of overfitting the model

In [105]:
#Adjusting max_depth between 2 & 15

print('\nR_Squared Score Analysis\n')

for max_depth in range(2,16):
    model_over = DecisionTreeRegressor(random_state=42, max_depth = max_depth)
    model_over.fit(X_train, y_train)
    test_prediction = model_over.predict(X_test)
    train_prediction = model_over.predict (X_train)
    print(f'New unseen data:={round(r2_score(y_test,test_prediction),4)}, Train set:={round(r2_score(y_train,train_prediction),4)}, Maximum depth={max_depth}')

### Pruning the tree
**Types of Pruning**
- **`Pre-Pruning (Early Stopping):`** Stop growing the tree early during training, for example by capping its maximum depth or by requiring a minimum number of samples per split or leaf.
- **`Post-Pruning:`** Grow the tree to its full extent during training, then selectively remove branches that contribute little to model performance. Reduced error pruning and cost complexity pruning are both post-pruning methods.
<br></br>
We will use Cost Complexity Pruning (CCP), a post-pruning technique that reduces a decision tree's complexity so that it generalizes better.
<br></br>
**What is Cost Complexity Pruning?**
Cost Complexity Pruning is a method for reducing the complexity of decision trees by systematically removing nodes based on a cost complexity measure.
It involves finding the optimal trade-off between tree complexity and predictive accuracy.
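
As a reference point, the trade-off is usually written as the following cost-complexity measure (this is the formulation given in the scikit-learn documentation; the symbols are introduced here and do not appear elsewhere in this notebook):

$$
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
$$

Here $R(T)$ is the total impurity of the leaves of tree $T$ (for regression, the sample-weighted mean squared error), $|\widetilde{T}|$ is the number of leaves, and $\alpha \ge 0$ is the complexity parameter. The effective alpha of an internal node $t$, with $T_t$ the subtree rooted at $t$, is

$$
\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T_t}| - 1}
$$

The node with the smallest $\alpha_{\text{eff}}$ is the "weakest link" and is pruned first.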
<br></br>
**Understanding Alphas**
- Alphas represent the complexity parameter used in Cost Complexity Pruning.
- Higher alphas result in more aggressive pruning, leading to simpler trees with fewer nodes.
- Lower alphas lead to less pruning, resulting in more complex trees with more nodes.
<br></br>
**How Cost Complexity Pruning Works**
- The Cost Complexity Pruning Path computes the effective alphas of decision tree nodes during pruning.
- It identifies which nodes to prune based on their cost complexity measures.
- The path provides insights into how the complexity of the tree changes as nodes are pruned.
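
The relationship between alpha and tree size described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression` (an assumption made for the example, not the survey dataset); the names `X_demo`, `y_demo`, and `node_counts` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the real survey data)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Effective alphas at which successive subtrees of a fully grown tree are pruned
full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

# Take a handful of alphas spread across the path; clip guards against
# tiny negative values caused by floating-point error
step = max(1, len(path.ccp_alphas) // 5)
alphas = np.clip(path.ccp_alphas[::step], 0.0, None)

node_counts = []
for alpha in alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    node_counts.append(tree.tree_.node_count)
    print(f"ccp_alpha={alpha:.3f}, nodes={tree.tree_.node_count}")
```

As alpha increases along the path, the printed node count should shrink, which is the "more aggressive pruning" behaviour listed under Understanding Alphas.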

In [106]:
# Perform post-pruning (cost complexity pruning)
path = model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
regressors = []

for ccp_alpha in ccp_alphas:
    regressor = DecisionTreeRegressor(random_state=42, ccp_alpha=ccp_alpha)
    regressor.fit(X_train, y_train)
    regressors.append(regressor)

# Find the optimal subtree by selecting the model with the highest R squared.
# Note: selecting on the test set makes the reported score optimistic;
# a separate validation set or cross-validation would give a fairer estimate.
r2 = [r2_score(y_test, regressor.predict(X_test)) for regressor in regressors]
best_index = r2.index(max(r2))
best_regressor = regressors[best_index]

# Evaluate the pruned regressor on the testing set
y_pred_pruned = best_regressor.predict(X_test)
r2_ = r2_score(y_test, y_pred_pruned)
print("R_Squared after pruning:", r2_)

### Feature Importances

In [107]:
importances = pd.DataFrame(best_regressor.feature_importances_, columns = ['Importance_Score'],
                           index=X_train.columns).sort_values(by='Importance_Score', ascending = False)
importances

In [108]:
importances['Importance_Score'].sum()

### 2. Random Forest

In [109]:
rf = RandomForestRegressor(random_state=42, max_depth = 5,n_estimators=100)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)

#R Squared Score
print(f'R_Squared Score: {r2_score(y_test,predicted)}')

### Tuning Hyperparameters
Hyperparameter tuning is the process of selecting the set of hyperparameters that gives a machine learning model its best performance.
### 1. Manual Search
Hyperparameters are adjusted by hand, guided by domain knowledge and experimentation.

In [110]:
#Adjusting max_depth between 3 & 10 while keeping n_estimators constant

print('\nR_Squared Score Analysis\n')

for max_depth in range(3,11):
    rf_ = RandomForestRegressor(random_state=42, max_depth = max_depth, n_estimators=100)
    rf_.fit(X_train, y_train)
    test_prediction = rf_.predict(X_test)
    train_prediction = rf_.predict (X_train)
    print(f'New unseen data:={round(r2_score(y_test,test_prediction),4)}, Train set:={round(r2_score(y_train,train_prediction),4)}, Maximum depth={max_depth}')

In [111]:
#Adjusting n_estimators between 50 & 200 while keeping max_depth constant

print('\nR_Squared Score Analysis\n')

for n in range(50,201,50):
    rf_ = RandomForestRegressor(random_state=42, max_depth = 8, n_estimators=n)
    rf_.fit(X_train, y_train)
    test_prediction = rf_.predict(X_test)
    train_prediction = rf_.predict (X_train)
    print(f'New unseen data:={round(r2_score(y_test,test_prediction),4)}, Train set:={round(r2_score(y_train,train_prediction),4)}, Number of estimators={n}')

### 2. RandomizedSearchCV
`RandomizedSearchCV` randomly samples a fixed number of hyperparameter settings (`n_iter`) from predefined distributions and scores each one with cross-validation.

In [112]:
param_dist = {'n_estimators': randint(50,200),
              'max_depth': randint(3,15)}

# Create a random forest regressor (fixed seed for reproducibility)
rf_ = RandomForestRegressor(random_state=42)

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf_, 
                                 param_distributions = param_dist, 
                                 n_iter=5, 
                                 cv=5,
                                 random_state=42)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Create a variable for the best model
best_rf = rand_search.best_estimator_

# Print the best hyperparameters
print('Best hyperparameters:',  rand_search.best_params_)

In [85]:
#Checking score of the tuned parameters
parameters = rand_search.best_params_

rf_ = RandomForestRegressor(random_state=42, max_depth = parameters['max_depth'], n_estimators=parameters['n_estimators'])
rf_.fit(X_train, y_train)
predicted = rf_.predict(X_test)

#R Squared Score
print(f'R_Squared Score: {r2_score(y_test,predicted)}')

R_Squared Score: 0.5372439684958958


### 3. Bayesian Optimization
- Bayesian optimization is an iterative model-based optimization technique that uses probabilistic models to find the optimal set of hyperparameters.
- It efficiently explores the hyperparameter space by selecting the next set of hyperparameters based on the results of previous evaluations.

In [113]:
# Define the Random Forest Regressor model
rf_regressor = RandomForestRegressor()

# Define the search space for hyperparameters
search_space = {
    'n_estimators': (3, 200),          # Number of trees in the forest
    'max_depth': (1, 15),                # Maximum depth of each tree
    'min_samples_split': (2, 20),        # Minimum number of samples required to split an internal node
    'min_samples_leaf': (1, 10),         # Minimum number of samples required to be at a leaf node
    'max_features': (0.1, 1.0)           # Fraction of features to consider when looking for the best split
}

# Perform Bayesian Optimization
bayes_search = BayesSearchCV(
    estimator=rf_regressor,
    search_spaces=search_space,
    scoring='neg_mean_squared_error',
    cv=5,
    n_iter=5,
    random_state=42,
    n_jobs=-1
)

# Fit the Bayesian Optimization model
bayes_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters:", bayes_search.best_params_)

In [114]:
#Checking score of the tuned parameters
parameters = bayes_search.best_params_

rf_ = RandomForestRegressor(random_state=42,
                            max_depth = parameters['max_depth'],
                            n_estimators = parameters['n_estimators'],
                            max_features = parameters['max_features'],
                            min_samples_leaf = parameters['min_samples_leaf'],
                            min_samples_split = parameters['min_samples_split']
                           )
rf_.fit(X_train, y_train)
predicted = rf_.predict(X_test)

#R Squared Score
print(f'R_Squared Score: {r2_score(y_test,predicted)}')

### Out-of-bag evaluation
The out-of-bag (OOB) error in random forests is an estimate of the model's performance on unseen data without the need for a separate validation set. It is a useful metric for assessing the generalization ability of a random forest model.

In [115]:
rf_regressor = RandomForestRegressor(random_state=42,
                                     max_depth = parameters['max_depth'],
                                     n_estimators = parameters['n_estimators'],
                                     max_features = parameters['max_features'],
                                     min_samples_leaf = parameters['min_samples_leaf'],
                                     min_samples_split = parameters['min_samples_split'],
                                     oob_score=True
                                    )
rf_regressor.fit(X_train, y_train)

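# Compute the OOB error. For a regressor, oob_score_ is the R squared
# measured on out-of-bag samples, so the "error" here is 1 - R squared.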
oob_error = 1 - rf_regressor.oob_score_
print("Out-of-Bag Error:", oob_error)
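
To see how the out-of-bag estimate relates to a held-out score, here is a minimal sketch on synthetic data from `make_regression` (an assumption for illustration, not the survey dataset). The names `X_demo`, `forest`, `oob_r2`, and `test_r2` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the real survey data)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not include it
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)

oob_r2 = forest.oob_score_                      # R squared on out-of-bag samples
test_r2 = r2_score(y_te, forest.predict(X_te))  # R squared on the held-out split
print(f"OOB R squared: {oob_r2:.4f}, held-out R squared: {test_r2:.4f}")
```

If the two numbers are close, the OOB score is behaving as a reasonable stand-in for a separate validation set.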

### Question
#### For the decision tree, we visualized the tree structure. Is it possible to do the same for a random forest?
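
One possible approach, sketched under assumptions: a fitted forest exposes its individual trees through the `estimators_` attribute, so a single tree can be exported with `export_graphviz` in the same way as a standalone decision tree. The synthetic data and the names `X_demo`, `forest`, `first_tree`, and `feature_names` are hypothetical and used only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic regression data (assumption: stands in for the real survey data)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2"]  # hypothetical column names

forest = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# Each element of estimators_ is a fitted DecisionTreeRegressor
first_tree = forest.estimators_[0]

# With out_file=None, export_graphviz returns the DOT source as a string
dot_data = export_graphviz(first_tree,
                           out_file=None,
                           feature_names=feature_names,
                           filled=True,
                           impurity=False,
                           proportion=True)
print(dot_data[:200])

# In a notebook with the graphviz package installed, the tree could be
# rendered with: graphviz.Source(dot_data)
```

A forest has no single tree to draw, so this only shows one member at a time; looping over `forest.estimators_` would render the others.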