# Unweighted Random Forest

Similar to the 3_RF-BalancedWeighting.ipynb notebook, the attached notebook explores a random forest cross-validation pipeline, but it does not use the class-weight argument to counter our imbalanced classes. This gives us a baseline for determining how much the imbalance-based weighting actually improved model performance.
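
For context on what this baseline leaves out: scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that heuristic in plain Python on hypothetical label counts (not our dataset), only to illustrate how strongly a rare class would be up-weighted in the weighted notebook. The function name `balanced_class_weights` is an assumption made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, deliberately imbalanced labels (illustration only, not our data)
example_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(example_labels)
print(weights)
```

The unweighted model in this notebook corresponds to leaving `class_weight` at its default of `None`, in which case every class has weight one.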

This pipeline is very similar to the one instantiated in the 3_svm.ipynb notebook, which contains more details on the grid search object if further information is needed. It is used to extract the most promising hyperparameters from our scalar-feature TRAINING dataset. The key output of this notebook is a set of parameters, which are then used to instantiate the standard random forest model in 3_final_models.ipynb, where it is run on the full training and test sets.
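
Because the grid search runs over a `Pipeline` step named `rf`, the best parameters come back with an `rf__` prefix. A minimal sketch of how they might be converted into plain estimator keyword arguments follows. The helper `strip_step_prefix` and the parameter values are hypothetical stand-ins, not the values this notebook found.

```python
def strip_step_prefix(best_params, step="rf"):
    """Drop the '<step>__' prefix that Pipeline adds to parameter names."""
    prefix = step + "__"
    return {
        key[len(prefix):]: value
        for key, value in best_params.items()
        if key.startswith(prefix)
    }


# Hypothetical values standing in for grid_search.best_params_
example_best_params = {
    "rf__max_depth": 20,
    "rf__max_features": "sqrt",
    "rf__min_samples_split": 2,
    "rf__n_estimators": 300,
}

final_params = strip_step_prefix(example_best_params)
print(final_params)
```

The resulting dict could then be unpacked into the estimator, for example `RandomForestClassifier(random_state=23, **final_params)`.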

We also tried to run this model on the reduced-dimensionality vectorized dataset, which contained close to 7000 features post-reduction. It was left running for 18 hours and only made it through ~20% of the CV job, at which point the job was halted because the folds kept taking longer and longer. The RF model tends to use a lot of memory, so it is not surprising that processing time increased gradually as memory was depleted. The scores were also poor (roughly 0.35 to 0.39 accuracy in the folds that completed), so we chose not to proceed with this approach on the vectorized dataset.
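
As a rough back-of-the-envelope check on that decision, the 18 hours and ~20% figures can be extrapolated linearly. This assumes a constant pace, which the run did not have (folds were slowing down), so the result should be read as a lower bound rather than an estimate.

```python
hours_elapsed = 18     # from the run described above
fraction_done = 0.20   # approximate share of the CV job completed

# Linear extrapolation: assumes every remaining fit costs the same as those already done
projected_total_hours = hours_elapsed / fraction_done
projected_remaining_hours = projected_total_hours - hours_elapsed
print(projected_total_hours, projected_remaining_hours)
```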

In [1]:
# IMPORTS 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Define grid search function for the random forest model

In [13]:
def rf_grid_search_cv(X, y, cv_folds=5):
    """
    Perform grid search cross-validation for Random Forest classifier on the given data.
    
    Parameters:
    - X: Features dataset.
    - y: Target variable dataset.
    - cv_folds: Number of folds for cross-validation.
    
    Returns:
    - grid_search: The fitted GridSearchCV object.
    """
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
    
    # Define a pipeline with the classifier
    pipeline = Pipeline([
        ('rf', RandomForestClassifier(random_state=23))  # Random Forest classifier
    ])
    
    param_grid = {
        'rf__n_estimators': [100, 200, 300],  # Number of trees in the forest
        'rf__max_features': ['sqrt', 'log2', None],  # The number of features to consider when looking for the best split
        'rf__max_depth': [None, 10, 20, 30],  # Maximum number of levels in each decision tree
        'rf__min_samples_split': [2, 5, 10]  # Minimum number of data points placed in a node before the node is split
    }
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=cv_folds, scoring='accuracy', verbose=3, n_jobs=-1)
    
    # Perform grid search cross-validation
    grid_search.fit(X_train, y_train)
    
    print("Best parameters found: ", grid_search.best_params_)
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
    
    # Optionally, evaluate on the test set
    test_score = grid_search.score(X_test, y_test)
    print("Test set score: {:.2f}".format(test_score))
    
    return grid_search
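
The size of this search follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fit once per fold. The sketch below repeats the grid from `rf_grid_search_cv` and does that arithmetic. It is a sanity check, not part of the original pipeline.

```python
import math

# Same grid as in rf_grid_search_cv
param_grid = {
    "rf__n_estimators": [100, 200, 300],
    "rf__max_features": ["sqrt", "log2", None],
    "rf__max_depth": [None, 10, 20, 30],
    "rf__min_samples_split": [2, 5, 10],
}
cv_folds = 5

n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 108 540
```

This matches the "Fitting 5 folds for each of 108 candidates, totalling 540 fits" line that `GridSearchCV` prints when the search starts.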


# Import Scalar dataset

In [8]:
train_df_scalar = pd.read_pickle('/Users/erin/Documents/comp-viz/final-project/fabric/pkls/train_0406_scalar_non-aug.pkl')
train_df_scalar.head()

| Unnamed: 0 | label | category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Blended | 3441.793288 | 20285310.0 | 1844.05677 | 0.150226 | 346.120911 | 0.00521 | 9.3e-05 | 6.3e-05 | ... | 0.027344 | 0.054688 | 0.035156 | 0.039062 | 0.03125 | 0.027344 | 0.046875 | 0.0625 | 0.050781 | 0.050781 |
| 1 | 1 | Denim | 7211.992783 | 7747671.0 | 2271.840993 | 0.151475 | 348.99826 | 0.004833 | -6.7e-05 | 3.8e-05 | ... | 0.1 | 0.048649 | 0.021622 | 0.024324 | 0.013514 | 0.021622 | 0.024324 | 0.043243 | 0.016216 | 0.07027 |
| 2 | 2 | Polyester | 8856.756862 | 5854463.0 | 1967.259618 | 0.160454 | 369.684998 | 0.002032 | -2e-06 | 8e-06 | ... | 0.027778 | 0.00463 | 0.00463 | 0.25 | 0.12963 | 0.194444 | 0.018519 | 0.00463 | 0.027778 | 0.00463 |
| 3 | 0 | Blended | 7018.112788 | 7817569.0 | 1953.124972 | 0.1523 | 350.8992 | 0.004582 | 1.1e-05 | 3.4e-05 | ... | 0.052239 | 0.063433 | 0.052239 | 0.007463 | 0.022388 | 0.044776 | 0.085821 | 0.074627 | 0.033582 | 0.029851 |
| 4 | 3 | Cotton | 7932.263905 | 6971318.0 | 2053.412469 | 0.157971 | 363.965454 | 0.002823 | -3e-06 | 1.7e-05 | ... | 0.030612 | 0.047619 | 0.054422 | 0.047619 | 0.05102 | 0.081633 | 0.013605 | 0.044218 | 0.047619 | 0.054422 |


# Run the grid-search

In [9]:
X_sc = train_df_scalar.iloc[:,2:]
y_sc = train_df_scalar.iloc[:,0]

cv_params_scalar = rf_grid_search_cv(X_sc, y_sc)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV 4/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.698 total time=   6.0s
[CV 5/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.695 total time=   6.0s
[CV 3/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.709 total time=   6.1s
[CV 1/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.697 total time=   6.1s
[CV 2/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.709 total time=   6.4s
[CV 2/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=200;, score=0.717 total time=  11.7s
[CV 5/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=200;, score=0.703 total time=

NOTES ON GRID SEARCH:
* There does not seem to be a significant difference between the fold scores of this model and those of the weighted model, which leads us to question the impact of the weighting
* The holdout set outperforms the cross-validation average, which points to a model that is potentially generalizable
* Overall, it is interesting that both random forest models, which generally bear a high likelihood of overfitting, performed better on the holdout set
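
One way to make these comparisons more concrete is to summarize the per-fold scores by their mean and standard deviation, and to express any holdout score as a distance from the CV mean in units of that spread. The sketch below uses the five fold scores logged above for the first candidate. The helper names are assumptions for illustration, and no holdout value is filled in because the printed test score is not shown in the output above.

```python
import statistics

# Fold scores copied from the scalar-dataset log above, folds 1 to 5
# (max_depth=None, max_features=sqrt, min_samples_split=2, n_estimators=100)
fold_scores = [0.697, 0.709, 0.709, 0.698, 0.695]


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of the fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_in_stdevs(holdout_score, scores):
    """How far a holdout score sits from the CV mean, in fold standard deviations."""
    mean, spread = summarize_folds(scores)
    return (holdout_score - mean) / spread


mean_score, spread = summarize_folds(fold_scores)
print(f"CV mean={mean_score:.4f}, stdev={spread:.4f}")
```

With only five folds the spread is a coarse estimate, so a gap of one or two standard deviations would not by itself indicate a meaningful difference between the weighted and unweighted models.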

# Vectorized Feature Dataset Cross Validation -- ABANDONED

In [10]:
train_df_vectorized = pd.read_pickle('/Users/erin/Documents/comp-viz/final-project/fabric/pkls/train_0410_vectorized_non-aug.pkl')
train_df_vectorized.head()

| Unnamed: 0 | label | category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 6822 | 6823 | 6824 | 6825 | 6826 | 6827 | 6828 | 6829 | 6830 | 6831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Blended | 216.875773 | -21.220328 | 31.355705 | 113.803698 | 67.469517 | 51.739112 | 32.134906 | -34.037195 | ... | -0.66434 | -1.337023 | -0.017352 | -0.616874 | -0.858821 | -0.282316 | 1.121574 | 0.521872 | 0.946164 | -0.098814 |
| 1 | 1 | Denim | -6.478218 | -44.845392 | 9.334029 | 0.559372 | -5.535237 | 2.069803 | 1.66805 | -2.671548 | ... | 0.196028 | 0.417297 | -0.643817 | 1.078271 | 0.301808 | -1.118499 | -0.406229 | 0.275195 | -0.698774 | 0.043567 |
| 2 | 2 | Polyester | -21.765612 | 15.511845 | 0.440996 | 7.523796 | -23.242091 | -0.555019 | -0.387093 | 2.955491 | ... | 0.123217 | 0.683356 | 0.218428 | 0.273425 | -0.274362 | 0.532576 | 0.547632 | -0.846333 | 0.320085 | -1.747008 |
| 3 | 0 | Blended | 1.031767 | -13.747633 | -35.65789 | 10.977205 | 4.696071 | 1.991713 | 0.73496 | 4.531688 | ... | -0.123557 | -0.207327 | -0.15323 | 1.0455 | 0.740585 | -0.987832 | -1.463705 | 1.176062 | 0.128055 | 0.676635 |
| 4 | 3 | Cotton | -9.98767 | -17.112746 | -15.962529 | -15.739822 | -1.963229 | -4.441544 | 0.110439 | -1.643463 | ... | 0.713136 | 0.477264 | -0.658661 | -0.69228 | -0.230699 | 1.451385 | -1.054964 | -0.210731 | -0.705949 | -0.167371 |


In [11]:
X_vec = train_df_vectorized.iloc[:,2:]
y_vec = train_df_vectorized.iloc[:,0]

cv_params_vector = rf_grid_search_cv(X_vec, y_vec)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV 1/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.380 total time= 2.5min
[CV 2/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.384 total time= 2.6min
[CV 3/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.387 total time= 2.6min
[CV 4/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.386 total time= 2.6min
[CV 5/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=2, rf__n_estimators=100;, score=0.379 total time= 2.6min
[CV 1/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=5, rf__n_estimators=100;, score=0.385 total time= 2.4min
[CV 2/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=5, rf__n_estimators=100;, score=0.384 total time=



[CV 5/5] END rf__max_depth=None, rf__max_features=log2, rf__min_samples_split=2, rf__n_estimators=100;, score=0.358 total time=  25.7s
[CV 1/5] END rf__max_depth=None, rf__max_features=log2, rf__min_samples_split=2, rf__n_estimators=200;, score=0.352 total time=  52.9s
[CV 2/5] END rf__max_depth=None, rf__max_features=log2, rf__min_samples_split=2, rf__n_estimators=200;, score=0.357 total time=  53.0s
[CV 3/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=10, rf__n_estimators=200;, score=0.376 total time=21.1min
[CV 2/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=10, rf__n_estimators=200;, score=0.381 total time=21.1min
[CV 4/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=10, rf__n_estimators=200;, score=0.384 total time=21.2min
[CV 5/5] END rf__max_depth=None, rf__max_features=sqrt, rf__min_samples_split=10, rf__n_estimators=200;, score=0.384 total time=21.1min
[CV 5/5] END rf__max_depth=None, rf__max_features=s

KeyboardInterrupt: 

This would truly need to be run in a parallelized fashion, with the grid broken up across multiple kernels, and likely in the cloud with paid-for memory. Beyond that, the scores were not promising *at all*. I think that to make vectorized features worth it in this case, we may need to add more features and computational power.
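
As a sketch of what "breaking the grid up" could look like, the function below splits the 108 candidates into a chosen number of chunks using only the standard library. `split_grid` and the choice of four chunks are assumptions for illustration. This was not run as part of the project.

```python
from itertools import product


def split_grid(param_grid, n_chunks):
    """Split a parameter grid into n_chunks lists of single-candidate grids."""
    keys = sorted(param_grid)
    chunks = [[] for _ in range(n_chunks)]
    for i, values in enumerate(product(*(param_grid[k] for k in keys))):
        # Wrap each value in a list so every dict is itself a valid one-point grid
        candidate = {k: [v] for k, v in zip(keys, values)}
        # Round-robin assignment keeps the chunk sizes as even as possible
        chunks[i % n_chunks].append(candidate)
    return chunks


# Same grid as in rf_grid_search_cv
param_grid = {
    "rf__n_estimators": [100, 200, 300],
    "rf__max_features": ["sqrt", "log2", None],
    "rf__max_depth": [None, 10, 20, 30],
    "rf__min_samples_split": [2, 5, 10],
}

chunks = split_grid(param_grid, n_chunks=4)
print([len(chunk) for chunk in chunks])  # [27, 27, 27, 27]
```

Each chunk is a list of dicts, which is a form `GridSearchCV` accepts for `param_grid`, so each kernel or machine could run its own search over one chunk and the resulting `cv_results_` could be compared afterwards. Lowering `n_jobs` within each search may also help with memory, at the cost of wall-clock time.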