In [1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Random Forest: What Is It?

Random forest is an ensemble model used for both classification and regression. It is built on top of decision trees and often reduces overfitting and improves accuracy compared with a single tree. It does so using a process called bagging.

# Important Random Forest Concepts

### 1. Ensemble Learning:
- Random Forest is an ensemble method. An ensemble uses multiple models, rather than just one, to make the same decision.
- In the case of Random Forest, instead of using one decision tree, we train many and combine their results.

### 2. Bagging
- Each tree is trained on a different subset of the dataset, created by randomly sampling with replacement. Such a subset is known as a bootstrap sample. Training this way makes Random Forest less prone to overfitting.
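To make the bootstrap idea concrete, here is a minimal NumPy sketch. The sample count and seed are arbitrary choices for illustration. It draws one bootstrap sample and measures how many distinct rows end up in it; for large datasets this fraction approaches 1 - 1/e (about 63%).

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable
n_samples = 10_000               # arbitrary size chosen for illustration

# Draw n_samples row indices with replacement: one bootstrap sample.
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Rows drawn at least once are "in-bag"; rows never drawn are "out-of-bag".
in_bag_fraction = np.unique(indices).size / n_samples
out_of_bag_fraction = 1.0 - in_bag_fraction

print(f"In-bag fraction:     {in_bag_fraction:.3f}")
print(f"Out-of-bag fraction: {out_of_bag_fraction:.3f}")
print(f"Theory (1 - 1/e):    {1 - np.exp(-1):.3f}")
```

Because every tree sees a slightly different mix of rows (some repeated, some missing), the trees make different mistakes, which is what the later averaging step relies on.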

### 3. Random Subset of Features:
- Random Forest randomly selects a subset of features at each split point. This also helps avoid overfitting, as it prevents a few dominant features from controlling the prediction.
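The per-split feature sampling can be sketched in a few lines of NumPy. The feature count of 16 and the square-root rule are assumptions for illustration (the square root is a common heuristic for classification), and this is not how any particular library implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 16  # hypothetical dataset width

# Common heuristic: consider about sqrt(n_features) candidates per split.
k = max(1, int(np.sqrt(n_features)))

candidate_sets = []
for split in range(3):
    # Sample k distinct feature indices; only these compete for this split.
    candidates = np.sort(rng.choice(n_features, size=k, replace=False))
    candidate_sets.append(candidates)
    print(f"split {split}: candidate features {candidates.tolist()}")
```

In scikit-learn this behavior is controlled by the `max_features` hyperparameter described further down.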

### 4. Voting (Classification) / Averaging (Regression):
- When performing a classification task, each tree votes for a class. The final decision is the majority vote across all trees.
- When performing regression, each tree predicts a numeric value and the final decision is an average of all predictions.
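The two aggregation rules can be sketched with plain NumPy. The per-tree outputs below are made-up numbers for three hypothetical trees, not results from a fitted model.

```python
import numpy as np

# Hypothetical per-tree outputs: rows are trees, columns are samples.
class_votes = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 0],
])
tree_values = np.array([
    [2.0, 3.5, 1.0],
    [2.4, 3.1, 1.2],
    [1.6, 3.9, 0.8],
])
n_classes = 3

# Classification: count the votes each class receives per sample,
# then pick the class with the most votes.
vote_counts = np.stack([(class_votes == c).sum(axis=0) for c in range(n_classes)])
majority_class = vote_counts.argmax(axis=0)

# Regression: average the per-tree predictions for each sample.
mean_prediction = tree_values.mean(axis=0)

print("Majority vote:", majority_class.tolist())
print("Averaged prediction:", np.round(mean_prediction, 3).tolist())
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its result can occasionally differ from a strict majority vote.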

### 5. Feature Importance:
- Random Forest provides a measure of feature importance by calculating how much each feature reduces impurity.
- Features that consistently reduce impurity across many trees are reported as more important.
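As a rough illustration of where the forest-level numbers come from, the sketch below averages the impurity-based importances of the individual trees and compares the result with `feature_importances_`. The variable names (`iris`, `forest`, `per_tree`) and the choice of 50 trees are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importance of each feature inside each individual tree.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over trees (and renormalizing to sum to 1) gives the forest-level view.
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

for name, value in zip(iris.feature_names, averaged):
    print(f"{name}: {value:.4f}")

print("Matches forest.feature_importances_:",
      np.allclose(averaged, forest.feature_importances_))
```

Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure.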


## Hyperparameters in Random Forest:

There are several important hyperparameters to tune when working with a Random Forest model:

1. **n_estimators**:  
   - The number of decision trees in the forest. More trees generally improve accuracy but increase computation time.

2. **max_depth**: 
   - The maximum depth of each tree. A deeper tree might capture more patterns but risks overfitting.

3. **min_samples_split**: 
   - The minimum number of samples required to split a node. Increasing this value can reduce overfitting by making trees less complex.

4. **min_samples_leaf**: 
   - The minimum number of samples required to be at a leaf node. A larger number helps to smooth predictions.

5. **max_features**: 
   - The maximum number of features to consider when looking for the best split. This can be set as a proportion of the total number of features or a fixed number.

6. **bootstrap**: 
   - Whether to use bootstrap samples when building trees (default is `True`).

7. **criterion**: 
   - The function to measure the quality of a split. For classification it can be "gini" (default) or "entropy". For regression it can be "squared_error" (mean squared error, the default) or "absolute_error" (mean absolute error). Older scikit-learn versions called the regression criteria "mse" and "mae".
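One way to tune several of these hyperparameters together is a small cross-validated grid search. The sketch below uses `GridSearchCV` on the iris data. The grid values are illustrative assumptions rather than recommended settings, and sensible ranges depend on the dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_iris, y_iris = load_iris(return_X_y=True)

# Illustrative grid only; these values are assumptions for the example.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 3],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_iris, y_iris)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

The cost grows with the product of the grid sizes, so for larger grids a randomized search is often a cheaper starting point.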


## Sklearn Random Forest

In [2]:
data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 1.0000


In [3]:
importances = rf.feature_importances_

sorted_features = sorted(zip(data.feature_names, importances), key=lambda x: x[1], reverse=True)

for feature, importance in sorted_features:
    print(f"{feature}: {importance:.4f}")

petal width (cm): 0.4340
petal length (cm): 0.4173
sepal length (cm): 0.1041
sepal width (cm): 0.0446



## Random Forest Use Cases:

1. **Classification**: Medical diagnosis, fraud detection, document classification.
2. **Regression**: Predicting house prices, stock prices, or any continuous output.
3. **Feature Selection**: As Random Forest provides feature importance, it can be used to select the most important features before applying other models.
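The feature-selection use case can be sketched with scikit-learn's `SelectFromModel`, which keeps the features whose importance meets a threshold. The `"median"` threshold and the iris data are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Keep features whose importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
X_selected = selector.fit_transform(iris.data, iris.target)

# get_support() returns a boolean mask over the original feature columns.
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print("Kept features:", kept)
print("Shape before/after:", iris.data.shape, X_selected.shape)
```

The reduced matrix `X_selected` can then be passed to any other model.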

Random Forest is a solid option for both classification and regression tasks. It offers a nice balance between complexity and performance, and it is reasonably powerful in many scenarios. However, it is not always the best fit. Its complexity can make it hard to interpret, so if you need clear explanations or transparency, it might not be the ideal choice.

# Random Forest From Half Scratch

In [5]:
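class RandomForestRegressorScratch:
    """Bagged ensemble of scikit-learn decision trees, averaged for regression."""

    def __init__(self, n_trees=100, max_features=None, max_depth=None, min_samples_split=2):
        self.n_trees = n_trees
        self.max_features = max_features  # features considered per split (None = all)
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def _bootstrap_sample(self, X, y):
        # Draw n_samples rows with replacement (a bootstrap sample)
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, size=n_samples, replace=True)
        return X[indices], y[indices]

    def _fit_tree(self, X, y):
        # Pass max_features through so each split can use a random feature subset
        tree = DecisionTreeRegressor(
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            max_features=self.max_features,
        )
        tree.fit(X, y)
        return tree

    def fit(self, X, y):
        self.trees = []  # reset so repeated calls to fit do not accumulate trees
        for _ in range(self.n_trees):
            X_sample, y_sample = self._bootstrap_sample(X, y)
            tree = self._fit_tree(X_sample, y_sample)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Average the predictions of all trees (rows: trees, columns: samples)
        tree_preds = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(tree_preds, axis=0)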

In [6]:
housing_data = fetch_california_housing()
X, y = housing_data.data, housing_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_scratch = RandomForestRegressorScratch(n_trees=10, max_depth=10)
rf_scratch.fit(X_train, y_train)
y_pred_rf_scratch = rf_scratch.predict(X_test)

mse_rf_scratch = mean_squared_error(y_test, y_pred_rf_scratch)
print(f'Mean Squared Error of the Random Forest from scratch: {mse_rf_scratch}')

Mean Squared Error of the Random Forest from scratch: 0.304552296345607


# Random Forest vs Decision Tree

In [4]:
housing_data = fetch_california_housing()
X, y = housing_data.data, housing_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

decision_tree = DecisionTreeRegressor(random_state=42)
random_forest = RandomForestRegressor(random_state=42)

decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)

y_pred_tree = decision_tree.predict(X_test)
y_pred_forest = random_forest.predict(X_test)

mse_tree = mean_squared_error(y_test, y_pred_tree)
mse_forest = mean_squared_error(y_test, y_pred_forest)

mse_tree, mse_forest

(0.526138102652923, 0.2571833084850122)

**Random Forest performs better than a single Decision Tree for several key reasons:**

1. Ensemble Averaging: Random Forest is an ensemble of multiple Decision Trees. Each tree is trained on a different subset of the data (using bootstrapping) and a random subset of features, which leads to a more diverse set of trees. The final prediction is the average (for regression) or majority vote (for classification) across all trees, reducing the variance in the predictions.

2. Overfitting Prevention: Decision Trees tend to overfit on noisy or complex datasets, capturing even minor details in the training data that don't generalize well to unseen data. Random Forest mitigates this by averaging the predictions from many different trees, which smooths out the noise and reduces overfitting.

3. Feature Randomization: In Random Forest, at each split, only a random subset of features is considered, preventing any single feature from dominating the model and leading to better generalization. Decision Trees, on the other hand, may focus too heavily on a few specific features, leading to overfitting.

4. Reduction of Model Variance: Single Decision Trees can have high variance—meaning that small changes in the training data can result in a significantly different tree. Random Forest reduces this variance by averaging the results of many trees, leading to more stable and reliable predictions.
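The variance argument can be illustrated with a toy simulation. It does not train any trees: each "tree" is modeled as the true value plus noise with standard deviation `sigma`, and any two trees share a correlation `rho`. Both values, and the model itself, are assumptions made for this sketch. Under that model, the variance of the average of `n_trees` predictions is `rho * sigma**2 + (1 - rho) * sigma**2 / n_trees`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_trees = 20_000, 50
sigma, rho = 1.0, 0.3  # assumed per-tree noise std and pairwise correlation

# Correlated errors: a component shared by all trees plus a tree-specific one.
shared = rng.normal(size=(n_trials, 1))
individual = rng.normal(size=(n_trials, n_trees))
errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * individual)

single_tree_var = errors[:, 0].var()      # variance of one tree's error
forest_var = errors.mean(axis=1).var()    # variance after averaging all trees
theory = rho * sigma**2 + (1 - rho) * sigma**2 / n_trees

print(f"Single tree variance:  {single_tree_var:.3f}")
print(f"Averaged variance:     {forest_var:.3f}")
print(f"Theoretical variance:  {theory:.3f}")
```

Adding more trees shrinks only the second term, so the correlation between trees sets a floor on the variance. This is one way to see why bootstrapping and feature randomization, which make the trees less correlated, matter as much as the number of trees.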

## Why Do We Like Random Forest

- Less Likely to Overfit: Random Forests average multiple decision trees, which helps prevent the common issue of overfitting that often happens with single decision trees.

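- Can Deal with Missing Data: Some implementations handle missing values in both training and test data without losing much performance. In scikit-learn, native support for missing values in random forests was added only in recent versions; with older versions, impute missing values first.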

- Handles Big Data: They work well with large datasets, even when there are many features.

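- Can Work with Imbalanced Classes: Random Forests can cope reasonably well with class imbalance, especially when combined with class weighting (for example `class_weight="balanced"` in scikit-learn) or resampling, although the majority class can still dominate the trees' votes.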

- Feature Importance Insight: Random Forests naturally rank the importance of features, helping you understand which variables matter most.

## Why Don't We Like Random Forest


- Harder to Interpret: Single decision trees are easy to understand, but with a Random Forest combining lots of trees, it becomes more complex to interpret the overall model.

- Training Can Be Slow: Building multiple trees takes more time compared to training just one, especially as the number of trees increases.

- Slower Predictions: Making predictions can take longer because every tree in the forest needs to be evaluated.

- Higher Memory Usage: Random Forests need more memory since they store many trees compared to just one in simpler models.