# Random Forest Classifier

I chose a Random Forest model because I believe it will predict well on my particular data. A Random Forest builds many decision trees, each trained on a bootstrap sample of the rows and considering a random subset of features at each split. Each tree votes on the class of an observation, and the majority vote becomes the prediction. I think this model will perform well on my data because of how many features I have, in particular my genres. If I keep `max_depth` at None, the trees can grow deep enough to account for each individual genre.
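
The voting idea can be checked on a small synthetic dataset. The sketch below is not part of the original analysis: the data from `make_classification`, the forest size of 25 trees, and every `*_demo` name are assumptions for illustration. Note that scikit-learn's `RandomForestClassifier` averages each tree's predicted class probabilities rather than counting hard votes; with fully grown trees the two approaches usually agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data (illustrative only, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# An odd number of trees avoids tied votes
rf_demo = RandomForestClassifier(n_estimators=25, max_depth=None, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree predicts a class (0 or 1) for every row
tree_votes = np.array([tree.predict(X_demo) for tree in rf_demo.estimators_])

# Majority vote: class 1 wins when more than half of the trees predict it
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

# Fraction of rows where the manual vote matches the forest's own prediction
agreement = (majority == rf_demo.predict(X_demo)).mean()
print(tree_votes.shape, agreement)
```

`tree_votes` has one row per tree and one column per observation, so the column mean is the share of trees voting for the positive class.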

In [5]:
import pandas as pd
import numpy as np
import pickle
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)  # seed NumPy's global RNG; a bare RandomState(42) object is discarded and has no effect
import matplotlib.pyplot as plt

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

from sklearn.metrics import confusion_matrix

In [6]:
df = pd.read_csv('../00_Data/df_complete.csv')

In [7]:
df.fillna(value = 0 , inplace  = True)
df.drop_duplicates(inplace  = True )
df.reset_index(drop = True , inplace = True)

### Importing X & y

In [8]:
X = pd.read_csv('../00_Data/X.csv')
y = pd.read_csv('../00_Data/y.csv')

## Train Test Split

 - This is done so that we can later check that our model predicts accurately on unseen data.
 - More training data generally helps, so in this case I am holding out only 30% of my data for testing.
 - Shuffle = True: This randomly assigns rows from our data frame to either the train set or the test set. I am doing this so that the split is random and my model won't be overfit to a specific run of consecutive rows.
 
##### Unbalanced Data:   
   - Negative Class = 96%
   - Positive Class = 4%

##### Stratify = y :
   - Since my classes are unbalanced and rows are assigned to the train or test set at random, it is possible that most of my positive class could end up in the test set. My model would then do poorly because it would have very little positive-class data to train on. Stratifying on y makes sure that the training and testing sets both contain the same proportion of positive-class rows.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .3, random_state=42, 
                                                    shuffle = True , stratify = y)
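
The effect of `stratify` can be seen on a toy example. This is a minimal sketch with made-up labels that mimic the 96% / 4% imbalance; the `*_demo`, `*_plain`, and `*_strat` names are hypothetical and not part of the analysis above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same 96% / 4% imbalance described above
y_demo = np.array([1] * 40 + [0] * 960)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Plain random split: the positive share in each set can drift
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True)

# Stratified split: the positive share is preserved in both sets
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, shuffle=True, stratify=y_demo)

print("plain:     ", y_tr_plain.mean(), y_te_plain.mean())
print("stratified:", y_tr_strat.mean(), y_te_strat.mean())
```

With stratification, both printed shares should sit at the overall 4% positive rate, while the plain split is free to deviate from it.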

In [10]:
y_ravel = y_train.values.ravel()

### Pipeline:
Pipelines are a way to run multiple processing steps in the order that they are listed. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters, so that the scaler is fit only on the training portion of each fold. In the pipeline below, the steps are Standard Scaler and Random Forest Classifier.

In [11]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

### Hyper Parameters:
 These are the parameters that we want to tune. Feeding them through a grid search trains a model on every combination of parameter values and returns the best one.

> #### Number of Estimators (n_estimators): 
 - This is the number of decision trees that my Random Forest model will use. The default is 100 in scikit-learn 0.22 and later (it was 10 in earlier versions). Usually, the more decision trees a Random Forest uses the better it does, but the gains level off while training time keeps growing.
> #### Class Weight: 
 - 'balanced' weights each class inversely to its frequency, so my underrepresented positive class counts for more during training.
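
To make the 'balanced' option concrete, the sketch below computes the weights for a hypothetical label vector with a 96 / 4 split. `y_demo` is made up for illustration; the formula `n_samples / (n_classes * np.bincount(y))` is the one scikit-learn documents for the 'balanced' mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 96 negatives and 4 positives
y_demo = np.array([0] * 96 + [1] * 4)
classes = np.unique(y_demo)

# Weights scikit-learn assigns under class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

# The same weights from the documented formula
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes, weights)))
```

Here each positive row carries a weight of 12.5 against roughly 0.52 for each negative row, so both classes contribute equally in total.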

In [12]:
params= {
    'rf__n_estimators': [100,150,200],
    'rf__class_weight': ['balanced']
}

### Grid Search:
- Grid search is a parameter-tuning procedure: it selects the values of a model’s hyperparameters that maximize the chosen cross-validated score. It does this by fitting every combination of parameters and keeping the combination whose model had the best score.
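
As a rough illustration of how much work the search does, the sketch below enumerates the combinations with `ParameterGrid`. `params_demo` repeats the grid from this notebook, and `n_folds = 3` is assumed to match the `cv = 3` setting used in the search; the extra fit comes from `GridSearchCV` refitting the best combination on the full training set by default.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, repeated here so the sketch is self-contained
params_demo = {
    'rf__n_estimators': [100, 150, 200],
    'rf__class_weight': ['balanced'],
}

# Every combination the grid search will try
combos = list(ParameterGrid(params_demo))

n_folds = 3  # assumed to match cv = 3
total_fits = len(combos) * n_folds + 1  # +1 for the final refit on all training data

for combo in combos:
    print(combo)
print("total fits:", total_fits)
```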


#### Scoring:
- I am scoring with 'roc_auc', which stands for "Receiver Operating Characteristic, Area Under the Curve". The reason I am scoring with this instead of accuracy is how unbalanced my classes are. We could predict zero for every data point and still have a 96% accuracy score because 96% of the data is in our negative class. ROC AUC instead measures how well the predicted probabilities rank positive rows above negative rows, summarizing the trade-off between the true positive rate and the false positive rate across all thresholds.
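
The accuracy trap described above is easy to reproduce. In this minimal sketch, `y_true` is a made-up label vector with a 96 / 4 split and `always_zero` is a hypothetical model that predicts the negative class for every row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 96 negatives and 4 positives
y_true = np.array([0] * 96 + [1] * 4)

# A "model" that predicts the negative class (score 0) for every row
always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_zero)
auc = roc_auc_score(y_true, always_zero.astype(float))

print("accuracy:", acc)
print("roc_auc: ", auc)
```

Accuracy rewards the useless model with 0.96, while ROC AUC gives it 0.5, the same as random guessing.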

In [13]:
RF_gs = GridSearchCV(pipe,param_grid=params ,scoring = 'roc_auc', cv = 3)
RF_gs.fit(X_train, y_ravel);

### Train & Test Score

In [14]:
RF_gs.score(X_train , y_ravel)

0.9999982975162349

In [15]:
RF_gs.score(X_test , y_test)

0.849282793792993

### Best parameters

In [16]:
RF_gs.best_params_

{'rf__class_weight': 'balanced', 'rf__n_estimators': 200}

### Predictions & Probability of Prediction

In [17]:
predictions = RF_gs.predict(X_test)

In [18]:
predict_prob = RF_gs.predict_proba(X_test)
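
For reference, here is a sketch of how `predict` and `predict_proba` relate, using synthetic data rather than the notebook's. The `*_demo` names and the 0.3 threshold are assumptions for illustration only. The columns of `predict_proba` follow the order of `classes_`, and `predict` returns the class with the highest probability, which for two classes amounts to a 0.5 cutoff.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

proba = rf_demo.predict_proba(X_demo)   # one column per class, ordered as rf_demo.classes_
labels = rf_demo.predict(X_demo)

# predict() picks the class with the highest probability
from_proba = rf_demo.classes_[np.argmax(proba, axis=1)]

# A lower, hypothetical threshold flags more rows as positive
threshold = 0.3
flagged = (proba[:, 1] >= threshold).astype(int)

print(labels.sum(), flagged.sum())
```

Keeping the probabilities around makes it possible to trade precision for recall later by moving the threshold, without refitting the model.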

### Confusion Matrix
   
##### True Positive:
   - A True Positive is a case where my model predicts the positive class and is correct.

##### True Negative:
   - A True Negative is a case where my model predicts the negative class and is correct.

*We want to maximize these predictions because they mean my model can really tell the difference between the classes.



In [19]:
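pd.DataFrame(confusion_matrix(y_test, predictions),
             columns=['Predicted Negative', 'Predicted Positive'],  # columns are the predicted classes
             index=['Actual Negative', 'Actual Positive'])          # rows are the true classes

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | 24597 | 40 |
| Actual Positive | 892 | 52 |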
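
Precision and recall follow directly from the four counts in the confusion matrix above. The sketch below only redoes that arithmetic; the variable names are mine and the numbers are copied from the table.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 24597, 40, 892, 52

precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all rows classified correctly

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"accuracy:  {accuracy:.3f}")
```

At the default threshold the model is right a little more than half the time when it predicts the positive class, but it finds only a small share of the actual positives.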


### Top 25 features that were most important when predicting classes

In [20]:
top_25_feat = pd.DataFrame(RF_gs.best_estimator_.named_steps['rf'].feature_importances_ ,
             X.columns).sort_values(by = 0 , ascending = False).head(25)
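
An equivalent way to build a ranked importance table is a named `pd.Series`, sketched here on synthetic data. The `feat_*` column names and `*_demo` variables are hypothetical. Impurity-based importances from a Random Forest sum to 1, so each value can be read as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (illustrative only)
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(rf_demo.feature_importances_,
                        index=X_demo.columns,
                        name="importance").sort_values(ascending=False)

top_3 = importances.head(3)
print(top_3)
```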

In [21]:
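# Index the predictions by X_test's row labels so they line up with the original rows
probs_df = pd.DataFrame(predict_prob, columns=['prob_no_award', 'prob_award'], index=X_test.index)
predict = pd.DataFrame(predictions, columns=['predicted'], index=X_test.index)
# Keep only the test rows; this assumes df, X, and y share the same row order
test_rows = df.reindex(X_test.index)
pred_df = pd.concat([probs_df, y_test, predict, test_rows.award, test_rows.track, test_rows.artist,
                     test_rows.release_date, test_rows[top_25_feat.index], test_rows.explicit], axis=1)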

### Export to CSV

In [28]:
pred_df.to_csv('../00_Data/pred_df.csv', index = False)

## Summary:
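- My Random Forest model performed only moderately at the default 0.5 threshold, correctly predicting 52 of the 944 positive-class rows in the test set. My best parameters were 200 decision tree estimators and 'balanced' class weight. My roc_auc score was about 1.00 on the training set and 0.85 on the test set, a gap that suggests the trees overfit the training data. From the confusion matrix, precision was about 57% (52 of 92 predicted positives) and recall about 6% (52 of 944 actual positives). My testing score was better than my Logistic Regression's. This is my best model and will be my production model.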