# Random Forest with Balanced Weighting - Scalar

This notebook explores a random forest cross-validation pipeline that uses scikit-learn's `class_weight` argument to counter our imbalanced classes. The pipeline closely follows the one in the 3_svm.ipynb notebook, which describes the grid search object in more detail. Here it is used to find the most promising hyper-parameters for our scalar-feature TRAINING dataset. The main output of this notebook is a set of parameters, which are then used to instantiate the weighted random forest model in 3_final_models.ipynb, where the model is run on the training and test sets as a whole.
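As a rough illustration of what the `'balanced'` option does, the sketch below reproduces the formula given in the scikit-learn documentation, `n_samples / (n_classes * class_count)`, in plain Python. The helper name `balanced_class_weights` and the toy labels are made up for this example and are not part of the notebook. The `'balanced_subsample'` option applies the same formula, but to the bootstrap sample drawn for each tree.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, made-up labels: 6 samples of class 0, 3 of class 1, 1 of class 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 1

weights = balanced_class_weights(toy_labels)
for cls in sorted(weights):
    # Rarer classes receive proportionally larger weights
    print(cls, round(weights[cls], 3))
```

The rarest class ends up with the largest weight, so its misclassifications count for more when each tree chooses its splits.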

In [None]:
# IMPORTS 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report


# CROSS-VALIDATION PIPELINE FUNCTION

In [3]:
def rf_weighted_grid_search_cv(X, y, cv_folds=5):
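    """
    Run a grid search with cross-validation for a class-weighted Random Forest classifier.

    An 80/20 train/holdout split is made internally. The grid search is fit on
    the 80% portion, and the best model is then scored on the 20% holdout.

    Parameters:
    - X: Feature matrix.
    - y: Target labels.
    - cv_folds: Number of folds for cross-validation.

    Returns:
    - grid_search: The fitted GridSearchCV object.
    """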
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
    
    # Define a pipeline with the classifier
    pipeline = Pipeline([
        ('rf', RandomForestClassifier(random_state=23, class_weight='balanced'))  # Weighted Random Forest; class_weight is also searched via rf__class_weight in the grid below
    ])
    
    param_grid = {
        'rf__n_estimators': [50, 150, 200],  # Number of trees in the forest
        'rf__max_features': ['sqrt', 'log2', None],  # The number of features to consider when looking for the best split
        'rf__max_depth': [None, 10, 20, 30],  # Maximum number of levels in each decision tree
        'rf__min_samples_split': [2, 5, 10],  # Minimum number of data points placed in a node before the node is split
        'rf__class_weight': ['balanced', 'balanced_subsample']  # Adjust weights inversely proportional to class frequencies
    }
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=cv_folds, scoring='accuracy', verbose=3, n_jobs=-1)
    
    # Perform grid search cross-validation
    grid_search.fit(X_train, y_train)
    
    print("Best parameters found: ", grid_search.best_params_)
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
    
    # Evaluate the best model on the held-out 20% split
    test_score = grid_search.score(X_test, y_test)
    print("Test set score: {:.2f}".format(test_score))
    
    return grid_search
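Because the classifier sits inside a `Pipeline` step named `rf`, the keys of `grid_search.best_params_` carry an `rf__` prefix. The sketch below shows one possible way to strip that prefix so the values can be reused when instantiating a standalone model. The helper `strip_step_prefix` and the parameter values are hypothetical and shown for illustration only; they are not the result reported by this notebook.

```python
def strip_step_prefix(best_params, step="rf"):
    """Drop the '<step>__' prefix that Pipeline adds to parameter names."""
    prefix = step + "__"
    return {
        name[len(prefix):]: value
        for name, value in best_params.items()
        if name.startswith(prefix)
    }


# Hypothetical values for illustration only; NOT the notebook's actual result
example_best_params = {
    "rf__class_weight": "balanced_subsample",
    "rf__max_depth": 20,
    "rf__max_features": "sqrt",
    "rf__min_samples_split": 5,
    "rf__n_estimators": 150,
}

rf_kwargs = strip_step_prefix(example_best_params)
print(rf_kwargs)
```

The resulting dictionary could then be unpacked as `RandomForestClassifier(random_state=23, **rf_kwargs)`. With the default `refit=True`, `grid_search.best_estimator_` also holds a pipeline already refit with the best parameters.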

# Import training dataset

In [4]:
train_df_scalar = pd.read_pickle('/Users/erin/Documents/comp-viz/final-project/fabric/pkls/train_0406_scalar_non-aug.pkl')
train_df_scalar.head()

| | label | category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Blended | 3441.793288 | 20285310.0 | 1844.05677 | 0.150226 | 346.120911 | 0.00521 | 9.3e-05 | 6.3e-05 | ... | 0.027344 | 0.054688 | 0.035156 | 0.039062 | 0.03125 | 0.027344 | 0.046875 | 0.0625 | 0.050781 | 0.050781 |
| 1 | 1 | Denim | 7211.992783 | 7747671.0 | 2271.840993 | 0.151475 | 348.99826 | 0.004833 | -6.7e-05 | 3.8e-05 | ... | 0.1 | 0.048649 | 0.021622 | 0.024324 | 0.013514 | 0.021622 | 0.024324 | 0.043243 | 0.016216 | 0.07027 |
| 2 | 2 | Polyester | 8856.756862 | 5854463.0 | 1967.259618 | 0.160454 | 369.684998 | 0.002032 | -2e-06 | 8e-06 | ... | 0.027778 | 0.00463 | 0.00463 | 0.25 | 0.12963 | 0.194444 | 0.018519 | 0.00463 | 0.027778 | 0.00463 |
| 3 | 0 | Blended | 7018.112788 | 7817569.0 | 1953.124972 | 0.1523 | 350.8992 | 0.004582 | 1.1e-05 | 3.4e-05 | ... | 0.052239 | 0.063433 | 0.052239 | 0.007463 | 0.022388 | 0.044776 | 0.085821 | 0.074627 | 0.033582 | 0.029851 |
| 4 | 3 | Cotton | 7932.263905 | 6971318.0 | 2053.412469 | 0.157971 | 363.965454 | 0.002823 | -3e-06 | 1.7e-05 | ... | 0.030612 | 0.047619 | 0.054422 | 0.047619 | 0.05102 | 0.081633 | 0.013605 | 0.044218 | 0.047619 | 0.054422 |


# Run the random forest model with balanced class weights on the scalar features

In [5]:
X_sc = train_df_scalar.iloc[:,2:]
y_sc = train_df_scalar.iloc[:,0]

cv_params_scalar = rf_weighted_grid_search_cv(X_sc, y_sc)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
[CV 4/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=50;, score=0.691 total time=   2.8s
[CV 1/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=50;, score=0.675 total time=   2.9s
[CV 3/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=50;, score=0.689 total time=   2.9s
[CV 2/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=50;, score=0.701 total time=   3.0s
[CV 5/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=50;, score=0.677 total time=   3.2s
[CV 1/5] END rf__class_weight=balanced, rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=5, rf__n_estimators=50;, s
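The first line of the log reports 216 candidates and 1080 fits. As a quick sanity check, the sketch below recomputes those figures from the value lists in `param_grid` (copied here from `rf_weighted_grid_search_cv`) and the default of 5 folds. The variable names are chosen for this example only.

```python
from math import prod

# Value lists copied from param_grid in rf_weighted_grid_search_cv
param_grid = {
    "rf__n_estimators": [50, 150, 200],
    "rf__max_features": ["sqrt", "log2", None],
    "rf__max_depth": [None, 10, 20, 30],
    "rf__min_samples_split": [2, 5, 10],
    "rf__class_weight": ["balanced", "balanced_subsample"],
}
cv_folds = 5

# A full grid search tries every combination: the product of the list lengths
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fit once per cross-validation fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.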

NOTES ON CV:
* This model does not reach the highs seen with the XGBoost model, but its scores appear much more stable across the folds than that model's. This suggests it could be the most generalizable of the models.
* Again, the holdout set outperforms the cross-validation average, which is interesting given that RF models tend to overfit the data when no regularization technique is employed.
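One way to put a number on the fold stability mentioned above is the spread of the per-fold scores. The sketch below does this for the five fold scores of the first candidate shown in the log (the only complete set visible there), so it describes that one candidate rather than the best one. For the full search, the fitted object's `cv_results_` exposes `mean_test_score` and `std_test_score` for every candidate.

```python
from statistics import mean, pstdev

# Fold scores copied from the log for the first candidate
# (class_weight=balanced, max_depth=None, max_features=sqrt,
#  min_samples_split=2, n_estimators=50), listed in fold order 1-5
fold_scores = [0.675, 0.701, 0.689, 0.691, 0.677]

fold_mean = mean(fold_scores)
# Population standard deviation, the same convention as NumPy's default np.std
fold_std = pstdev(fold_scores)

print(f"mean={fold_mean:.3f}, std={fold_std:.3f}")
```

For this candidate the mean is about 0.687 with a spread of roughly 0.01 across folds.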