<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_8/Section_8_Python_Example__Implementing_Random_Forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 8: Python example - implementing random forests

Random Forests are a popular ensemble learning method that builds on bagging (bootstrap aggregating). The method trains many decision trees, each on a bootstrap sample of the training data and with only a random subset of the features considered at each split, and then combines their outputs: the most common class across the trees for classification, or the mean of the trees' predictions for regression. Averaging many decorrelated trees reduces the variance, and hence the overfitting, of a single deep decision tree without significantly increasing the error due to bias. This section provides a step-by-step guide to implementing a Random Forest model in Python using scikit-learn.
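To make the aggregation step concrete, the sketch below rebuilds a forest's prediction from its individual trees. One detail worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities and picks the class with the highest mean, rather than counting hard votes. The variable names (`forest`, `tree_probas`, `manual_pred`) and the choice of 25 trees are illustrative assumptions, not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A small forest, purely for illustration (25 trees is an arbitrary choice)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Every fitted tree is exposed through the estimators_ attribute
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities, then take the most probable class
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

# Fraction of samples where the manual aggregation matches forest.predict
agreement = np.mean(manual_pred == forest.predict(X))
print(f'Agreement with forest.predict: {agreement:.2%}')
```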

1. Setting Up the Environment:

To implement a Random Forest model, ensure your Python setup includes scikit-learn, a comprehensive library that offers robust tools for machine learning, including the Random Forest algorithm. If it’s not installed, you can add it via pip:

In [None]:
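# %pip installs into the environment of the running notebook kernel;
# pandas and matplotlib are included because later cells use them
%pip install scikit-learn pandas matplotlib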

2. Importing Required Libraries:

We will need scikit-learn for modeling, data splitting, and evaluation metrics, pandas for data manipulation, and NumPy for numerical operations:

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

3. Preparing the Data:

For this example, let's use a dataset from scikit-learn's `datasets` module. We'll use the Iris dataset, a simple, widely used dataset for classification tasks. It contains 150 flowers from three species, each described by four measurements:

In [None]:
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Convert to DataFrame for easier handling
df = pd.DataFrame(data=X, columns=data.feature_names)
df['target'] = y
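Before modeling, it can help to confirm that the DataFrame looks as expected. The following is a minimal sketch of such a check, rebuilding `df` so that it runs on its own; the `species` column is a hypothetical helper added here for readability and is not used elsewhere in this section.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target

# Map the integer labels to species names (illustrative helper column)
species = df['target'].map(dict(enumerate(data.target_names)))

print(df.shape)                # number of rows and columns
print(species.value_counts())  # samples per class
print(df.isna().sum().sum())   # total number of missing values
```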

4. Splitting the Data:

Hold out 30% of the data as a test set so that the Random Forest model is evaluated on samples it did not see during training. Fixing `random_state` makes the split reproducible:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[data.feature_names], df['target'], test_size=0.3, random_state=42)
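With a small dataset, a purely random split can leave the classes unevenly represented in the test set. One common option, shown here as a sketch rather than as a change to the split above, is to pass `stratify` to `train_test_split` so that each class keeps roughly the same proportion in both subsets. The `_s` suffixed names are hypothetical, chosen to avoid overwriting the variables used in the rest of the walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for class proportions to be preserved in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print('Train class counts:', train_counts)
print('Test class counts: ', test_counts)
```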

5. Building the Random Forest Model:

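Instantiate and train a Random Forest classifier. We'll use 100 trees in the forest and fix `random_state` so that the bootstrap samples and feature subsets, and therefore the results, are reproducible: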

In [None]:
# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf.fit(X_train, y_train)
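Because each tree is trained on a bootstrap sample, some training samples are left out of any given tree. Those "out-of-bag" samples can be used to estimate accuracy without touching the test set. The sketch below enables this through the `oob_score` parameter; `rf_oob` is a hypothetical name, and the split is repeated only so that the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)

print(f'Out-of-bag accuracy estimate: {rf_oob.oob_score_:.3f}')
print(f'Test set accuracy:            {rf_oob.score(X_test, y_test):.3f}')
```

The two numbers will not match exactly, but a large gap between them would be a reason to look more closely at the split or the model settings.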

6. Evaluating the Model:

After training the model, evaluate its performance on the test data:

In [None]:
# Predict on the test set
y_pred = rf.predict(X_test)
# Evaluate the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the Random Forest model: {accuracy:.2f}')
print(classification_report(y_test, y_pred))
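A single 45-sample test set gives a fairly noisy accuracy estimate. As a complementary check, the sketch below uses 5-fold cross-validation via `cross_val_score`; the choice of five folds and the names `model` and `scores` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Same hyperparameters as the classifier above
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each fold is held out once while the model is trained on the other four
scores = cross_val_score(model, X, y, cv=5)

print('Accuracy per fold:', scores.round(3))
print(f'Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```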

7. Feature Importance:

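One of the advantages of Random Forests is that they provide an estimate of how much each feature contributes to the classification. The `feature_importances_` attribute holds impurity-based importances: the total reduction in impurity contributed by each feature, averaged over the trees and normalized so that the values sum to 1. These values are computed from the training data and can be biased towards features with many distinct values, so treat them as a guide rather than a definitive ranking: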

In [None]:
# Get feature importances
importances = rf.feature_importances_
# Print the name and importance of each feature
for feature, importance in zip(data.feature_names, importances):
    print(f'{feature}: {importance:.3f}')
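Impurity-based importances are derived from the training data, so it can be useful to cross-check them with a different technique. The sketch below uses permutation importance from `sklearn.inspection`, which measures how much the test accuracy drops when one feature's values are shuffled. This is a swapped-in alternative and not the method used above; `rf_perm`, `result`, and `n_repeats=10` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf_perm = RandomForestClassifier(n_estimators=100, random_state=42)
rf_perm.fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out data and record the score drop
result = permutation_importance(
    rf_perm, X_test, y_test, n_repeats=10, random_state=42
)

for name, mean, std in zip(
    data.feature_names, result.importances_mean, result.importances_std
):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```

Unlike the impurity-based values, permutation importances do not sum to 1 and can be slightly negative when shuffling a feature happens to improve the score.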

8. Visualization of Feature Importance:

Visualize the feature importances to better understand which features are driving the model predictions:

In [None]:
import matplotlib.pyplot as plt
# Sorting the features by importance
indices = np.argsort(importances)[::-1]
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), np.array(data.feature_names)[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

9. Conclusion:

Implementing a Random Forest in Python using scikit-learn is straightforward and provides a powerful tool for both classification and regression tasks. Besides making predictions, the model reports the relative importance of each feature, which helps when interpreting complex datasets. Because the forest averages many trees trained on different bootstrap samples, it is much less prone to overfitting than a single decision tree while remaining accurate, which makes Random Forests a strong default choice for many practical machine learning applications.
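The walkthrough above covers classification only. For the regression case, a minimal sketch might look like the following. It uses `RandomForestRegressor` on scikit-learn's bundled diabetes dataset, which is an assumption made for illustration and not a dataset used elsewhere in this section, and it checks that the forest's prediction is the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative regression dataset shipped with scikit-learn
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
reg_pred = reg.predict(X_test)

# Rebuild the forest prediction by averaging the individual trees
tree_mean_pred = np.mean([tree.predict(X_test) for tree in reg.estimators_], axis=0)

print('Forest prediction equals mean of trees:', np.allclose(reg_pred, tree_mean_pred))
print(f'R^2 on the test set: {r2_score(y_test, reg_pred):.3f}')
```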