# Random Forest  

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. It extends Bagging with additional randomness: at each split, a tree considers only a random subset of the features.  

## How It Works:  
- **Bootstrap Sampling**: Each tree is trained on a different random subset of the training data (with replacement).  
- **Feature Randomness**: Instead of considering all features at each split, only a random subset is used, making trees more diverse.  
- **Parallel Training**: Trees are trained independently, allowing efficient computation.  
- **Aggregation**: Predictions from all trees are combined using majority voting (for classification) or averaging (for regression).  
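
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements `RandomForestClassifier` internally; the helper names `fit_tiny_forest` and `predict_majority`, the synthetic dataset, and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_tiny_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sampling: draw n_samples row indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # Feature randomness: max_features="sqrt" makes the tree consider
        # a random subset of the features at every split.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        # Each tree is independent of the others, so this loop could run in parallel.
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Aggregate by majority vote (assumes integer class labels 0..K-1)."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # For each sample (one column of votes), pick the most frequent label.
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = fit_tiny_forest(X, y)
y_pred = predict_majority(forest, X)
```

For regression, the vote would be replaced by the mean of the trees' predictions. Note that scikit-learn's classifier averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.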

## Advantages:  
✅ **Reduces Overfitting**: Averaging many trees trained on different bootstrap samples and feature subsets lowers variance, so the ensemble overfits less than a single deep tree.  
✅ **Improves Accuracy**: Typically achieves higher accuracy than individual decision trees.  
✅ **Handles High-Dimensional Data**: Works well with many features and avoids over-relying on any one feature.  
✅ **Can Handle Missing Data**: Some implementations handle missing values natively; others require imputing them before training.  

## Disadvantages:  
❌ **Increased Computational Cost**: Training multiple trees requires more computation and memory.  
❌ **Less Interpretability**: A single decision tree is easier to interpret than a forest of trees.  
❌ **Can Be Slow for Real-Time Predictions**: Every tree must be evaluated for each prediction, so large forests increase inference latency.  


### Random Forest using Baseline Predictors (refer to /Data/Data_Formatting.ipynb)

In [1]:
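import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

# Function to make yearly predictions using Random Forest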
def make_yearly_predictions_rf(Train, Test):
    best_alpha = find_optimal_alpha_base(Train)

    # Define static predictors
    static_predictors =  parameters_base(Train,Test)
    
    # Train a Random Forest Classifier. Cost-complexity pruning is applied
    # while each tree is fitted, so ccp_alpha is passed to the constructor;
    # setting it on already-fitted trees would leave them unpruned.
    rf_clf = RandomForestClassifier(n_estimators=50, max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1, n_jobs=-1)
    rf_clf.fit(Train[static_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:  
            # Predict on this year's test data
            preds = rf_clf.predict(test_year[static_predictors])

            # Calculate weighted precision and accuracy for the year
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Random Forest",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
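
In scikit-learn, cost-complexity pruning takes effect while a tree is being fitted, so `ccp_alpha` has to be set before `fit` is called. The sketch below checks this by comparing total node counts. The synthetic data, the alpha value of 0.01, and the helper `total_nodes` are assumptions chosen for illustration, not values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Noisy synthetic data so that unpruned trees grow large (illustrative only).
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.2, random_state=0)


def total_nodes(forest):
    """Sum the node counts of all fitted trees in the forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# 1) Fit without pruning.
unpruned = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)
nodes_before = total_nodes(unpruned)

# 2) Change ccp_alpha on the already-fitted trees: this only updates the
#    stored parameter, the fitted tree structure is not rebuilt.
for tree in unpruned.estimators_:
    tree.set_params(ccp_alpha=0.01)
nodes_after_set_params = total_nodes(unpruned)

# 3) Pass ccp_alpha to the constructor so pruning happens during fit.
pruned = RandomForestClassifier(n_estimators=10, ccp_alpha=0.01, random_state=1).fit(X, y)
nodes_pruned = total_nodes(pruned)

print(nodes_before, nodes_after_set_params, nodes_pruned)
```

The first two counts should match, while the third should be smaller, which is why the alpha belongs in the constructor call.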

### Random Forest using Baseline Predictors + Rolling Predictors (refer to /Data/Data_Formatting.ipynb)

In [2]:
def make_yearly_predictions_rf_rolling(Train, Test):
    best_alpha = find_optimal_alpha_roll(Train)

    all_predictors = parameters_roll(Train,Test)
    Train = roll(Train)
    Test  = roll(Test)

    # Train a Random Forest Classifier
    rf_clf = RandomForestClassifier(n_estimators=50, max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1, n_jobs=-1)
    rf_clf.fit(Train[ all_predictors], Train["Target"])
    
    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = rf_clf.predict(test_year[all_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Random Forest",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df

### Random Forest using Full Feature Set (refer to /Data/Data_Formatting.ipynb)

In [3]:
def make_yearly_predictions_rf_full(Train, Test):
    best_alpha = find_optimal_alpha_full(Train)

    all_predictors = parameters_full(Train,Test)
    Train = roll(Train)
    Test  = roll(Test)
    
    # Train a Random Forest Classifier
    rf_clf = RandomForestClassifier(n_estimators=50, max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1, n_jobs=-1)
    rf_clf.fit(Train[ all_predictors], Train["Target"])
    
    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            preds = rf_clf.predict(test_year[ all_predictors])
            
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)
            
            # Append results to list
            results.append({
                "Model": "Random Forest",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
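
The three functions above share the same per-year evaluation loop. One possible way to factor it out is sketched below. The helper name `evaluate_by_year`, the synthetic data, and the feature names `f1` and `f2` are hypothetical and stand in for the project's own predictors; `zero_division=0` is an added assumption that silences the warning raised when a class is never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score


def evaluate_by_year(model, test, predictors, model_name):
    """Score a fitted classifier separately on each calendar year of `test`."""
    rows = []
    # groupby only yields years that actually occur, so no empty-year check is needed.
    for year, test_year in test.groupby(test["Date"].dt.year):
        preds = model.predict(test_year[predictors])
        rows.append({
            "Model": model_name,
            "Year": int(year),
            "Precision": precision_score(
                test_year["Target"], preds, average="weighted", zero_division=0
            ),
            "Accuracy": accuracy_score(test_year["Target"], preds),
        })
    return pd.DataFrame(rows)


# Synthetic daily data with a binary target (illustrative only).
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2021-12-31", freq="D")
data = pd.DataFrame({
    "Date": dates,
    "f1": rng.normal(size=len(dates)),
    "f2": rng.normal(size=len(dates)),
})
data["Target"] = (data["f1"] + 0.5 * rng.normal(size=len(dates)) > 0).astype(int)

train = data[data["Date"] < "2020-01-01"]
test = data[data["Date"] >= "2020-01-01"]

model = RandomForestClassifier(
    n_estimators=50, max_depth=10, min_samples_split=10, random_state=1
)
model.fit(train[["f1", "f2"]], train["Target"])

yearly = evaluate_by_year(model, test, ["f1", "f2"], "Random Forest (synthetic demo)")
print(yearly)
```

Passing a distinct `model_name` per feature set would also make the three result tables distinguishable once they are concatenated.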
 
