# Classification Models
***

In [1]:
%load_ext autoreload
%autoreload 2
from functions import *

In [None]:
# Import necessary libraries/packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import itertools
from sklearn import metrics

from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     RandomizedSearchCV, cross_val_score)

from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
                              
from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              RandomForestRegressor)

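# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay is its replacement
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, precision_score,
                             accuracy_score, recall_score, f1_score, roc_curve,
                             auc)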

from scipy.special import logit

plt.style.use('seaborn')

import shap
shap.initjs()

from alibi.explainers import KernelShap

from sklearn.feature_extraction.text import TfidfVectorizer

***
# Preprocessing data
***

In [None]:
# Load dataset and create pd dataframe
new_df = pd.read_csv('../DATA/1950up_df.csv')

In [None]:
new_df.describe().round(2)

In [None]:
new_df.shape

## Popularity Distribution
Using classification models, I want to predict the popularity of a song from the features in the data set. The data set includes a song popularity column, which ranges from 0 to 100, with 100 being the most popular. Below I plot the distribution of these scores; the vertical line marks a popularity of 35.

In [None]:
fig = plt.figure(figsize=(10,5))
sns.set(style="darkgrid") 
sns.distplot(new_df['popularity'], label="Popularity", bins='auto')
plt.xlabel("Popularity")
plt.ylabel("Density")
plt.title("Distribution of Popularity Scores")
plt.axvline(35)
plt.show()

## Create categorical (binary) target
To create a target for this classification analysis, I use a popularity of 35 as the threshold. In this step, I create a new binary column named "popular": if a song's popularity is greater than or equal to 35, it is classified as popular (1); otherwise, it is not popular (0). I will also build models with different threshold values and compare their performance.

In [None]:
new_df['popular'] = (new_df['popularity'] >= 35).astype('int')
# Share of each class (normalize=True returns proportions instead of counts)
new_df['popular'].value_counts(normalize=True)
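
Because the cutoff determines the class balance, it can help to see how the share of "popular" songs moves with the threshold. The following is a minimal sketch on a small made-up series of scores (not the notebook's data); `label_popular` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def label_popular(popularity, threshold):
    """Return 1 where popularity >= threshold, else 0."""
    return (popularity >= threshold).astype(int)

# Made-up popularity scores, for illustration only
scores = pd.Series([0, 10, 20, 34, 35, 36, 50, 60, 80, 100])

# Share of songs labeled popular at each candidate threshold
shares = {t: label_popular(scores, t).mean() for t in (25, 35, 50)}
print(shares)  # shares are 0.7, 0.6 and 0.4 for thresholds 25, 35 and 50
```

A lower threshold labels more songs as popular, which changes the baseline accuracy a model has to beat.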

In [None]:
new_df.head()

In [None]:
# Save raw dataframe with the 'popular' column as a csv file in the DATA folder
new_df.to_csv('../DATA/new1950_df.csv') 

## Make a new dataframe with necessary information

In [None]:
df = new_df[['valence', 'year', 'acousticness', 'danceability', 'duration_ms',
             'energy', 'instrumentalness', 'liveness', 'loudness',
             'speechiness', 'tempo', 'key', 'popular']]

In [None]:
df.head()

In [None]:
df.info()

# Logistic Regression Models
***

## LR Model 1: Baseline model

### Define X and y

In [None]:
X = df[['valence', 'acousticness', 'danceability', 'duration_ms',
        'energy', 'instrumentalness', 'liveness', 'loudness',
        'speechiness', 'tempo', 'key']]

y = df['popular']

### Train Test Split

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
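
When the two classes are imbalanced, a plain random split can give the train and test sets slightly different class proportions. As a sketch of one option (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20% positive, 80% negative
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 20/80 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```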

### Standardize train and test sets

In [None]:
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns,
                       index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns,
                      index=X_test.index)

In [None]:
X_train.describe()
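
The scaler is fit on the training set only, and the test set is transformed with the training mean and standard deviation so that no test information leaks into preprocessing. A small sketch on made-up random data shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the train and test features
rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(200, 3))
test = rng.normal(loc=50, scale=10, size=(50, 3))

scaler_demo = StandardScaler().fit(train)      # statistics come from train only
train_scaled = scaler_demo.transform(train)
test_scaled = scaler_demo.transform(test)      # reuses the train statistics

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1, while the test columns are only approximately centered. Note that `describe()` reports the sample standard deviation (`ddof=1`), so it shows values slightly different from 1 even on the training set.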

### Instantiate classifier and fit

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

### Predict

In [None]:
pred = logreg.predict(X_test)

### SHAP summary plot: mean absolute SHAP value of each feature

In [None]:
plot_shap(logreg, X_train)

X-axis: whether a feature pushes the prediction toward the positive outcome (popular) or the negative outcome (not popular).

Newer songs tend to be more popular.

### Model coefficients
"Generally, positive coefficients make the event more likely and negative coefficients make the event less likely. An estimated coefficient near 0 implies that the effect of the predictor is small."

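“For every one-unit increase in [X variable], the odds that the observation is in (y class) are [coefficient] times as large as the odds that the observation is not in (y class) when all other variables are held constant.”

Note that the multiplier in this statement is the exponentiated coefficient (the odds ratio), not the raw coefficient. Because the features here are standardized, a one-unit increase corresponds to one standard deviation of the original feature.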
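
To make the odds interpretation concrete, the sketch below fits a logistic regression on synthetic data (not the song data) and converts each coefficient to an odds ratio with `np.exp`. The feature names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix
X_demo, y_demo = make_classification(n_samples=500, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
model = LogisticRegression().fit(X_demo, y_demo)

# Coefficients are changes in log-odds; exponentiate to get odds ratios
coefs = pd.Series(model.coef_[0], index=[f"feature_{i}" for i in range(4)])
odds_ratios = np.exp(coefs)
print(pd.DataFrame({"coefficient": coefs, "odds_ratio": odds_ratios}).round(3))
```

An odds ratio above 1 means the feature raises the odds of the positive class, and a value below 1 means it lowers them.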

In [None]:
find_coeffs(logreg, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(logreg, X_train, X_test, y_train, y_test, pred)

Notes:
* The model has an accuracy of 84%.
* It struggles to predict popular songs: there are too many false negatives and false positives.
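
Accuracy alone can hide this kind of problem, so it helps to read the confusion matrix alongside precision and recall. A minimal sketch on made-up labels (not the model's actual predictions):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up true labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)                      # 5 1 2 2
print(accuracy_score(y_true, y_hat))       # 0.7
print(precision_score(y_true, y_hat))      # 2 / 3
print(recall_score(y_true, y_hat))         # 0.5
```

Here the accuracy is 70%, yet only half of the positive examples are recovered.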

### ROC Curve and AUC
"ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s."

In [None]:
roc_auc(logreg, X_train, X_test, y_train, y_test)

The AUC looks fairly good but could be better. Ideally, the ROC curve would rise more steeply and hug the top-left corner of the plot.
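
As a small worked example of how the curve and its area are computed, the sketch below uses four made-up scores. It relies only on `roc_curve` and `auc`, which are imported at the top of the notebook; it is not the implementation of the `roc_auc` helper.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on y_score gives one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_area = auc(fpr, tpr)
print(roc_area)  # 0.75
```

A perfect classifier would reach a true positive rate of 1 at a false positive rate of 0, giving an area of 1; random guessing gives about 0.5.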

***
## LR Model 2: LogisticRegressionCV

### Instantiate classifier and fit model

In [None]:
logregcv = LogisticRegressionCV()
logregcv.fit(X_train, y_train)
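
`LogisticRegressionCV` differs from the baseline in that it tries a grid of regularization strengths and keeps the one with the best cross-validated score. With the default `Cs=10`, it evaluates ten values of `C` on a log scale. A sketch on synthetic data showing how to inspect the result:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

cv_model = LogisticRegressionCV(cv=3, max_iter=1000).fit(X_demo, y_demo)

candidate_Cs = cv_model.Cs_              # the C values that were tried
chosen_C = np.ravel(cv_model.C_)[0]      # the C selected by cross-validation
print(candidate_Cs)
print(chosen_C)
```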

### Predict

In [None]:
pred = logregcv.predict(X_test)

### SHAP summary plot and mean absolute SHAP values

In [None]:
plot_shap(logregcv, X_train)

### Model coefficients

In [None]:
find_coeffs(logregcv, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(logregcv, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
roc_auc(logregcv, X_train, X_test, y_train, y_test)

***
## LR Model 3: GridSearchCV

### Instantiate classifier

In [None]:
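# The default 'lbfgs' solver does not support the 'l1' penalty used in the grid;
# 'liblinear' supports both 'l1' and 'l2'
logreg = LogisticRegression(solver='liblinear')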

### Create Parameter Grid 

In [None]:
log_param_grid = {
    'penalty' : ['l1', 'l2'],
    'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

### Instantiate GridSearchCV and fit

In [None]:
gs_log = GridSearchCV(logreg, log_param_grid, cv=3, return_train_score=True,
                      n_jobs=-1)

In [None]:
gs_log.fit(X_train, y_train)

### Best parameters

In [None]:
print("Best Parameter Combination Found During Grid Search:")
gs_log.best_params_
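
Beyond the single best combination, `cv_results_` holds the cross-validated score of every candidate, which shows how sensitive the model is to each parameter. A sketch on synthetic data with a reduced grid, assuming the `liblinear` solver so that both penalties are valid:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_grid = {"penalty": ["l1", "l2"], "C": [0.01, 1, 100]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), demo_grid, cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns = ["param_penalty", "param_C", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```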

### Predict

In [None]:
pred = gs_log.predict(X_test)

### SHAP summary plot: mean absolute SHAP value of each feature

In [None]:
#plot_shap(gs_log, X_train)

### Model coefficients

In [None]:
find_coeffs(gs_log, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(gs_log, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
roc_auc(gs_log, X_train, X_test, y_train, y_test)

***
# Decision Trees Models
***

## DT Model 1: Baseline DecisionTree Model

### Instantiate classifier and fit model

In [None]:
dtree_clf = DecisionTreeClassifier() 
dtree_clf.fit(X_train, y_train)

### Predict

In [None]:
pred = dtree_clf.predict(X_test)

### Model coefficients

In [None]:
find_coeffs(dtree_clf, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(dtree_clf, X_train, X_test, y_train, y_test, pred)

### Feature Importances

In [None]:
plot_feature_importances(dtree_clf, X_train, X)
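
A fitted tree exposes its impurity-based importances through the `feature_importances_` attribute. The sketch below shows the raw values on synthetic data; it is not the implementation of `plot_feature_importances`, and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
names = [f"feature_{i}" for i in range(5)]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances are normalized, so they sum to 1 across features
importances = pd.Series(tree.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```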

### ROC Curve and AUC

In [None]:
label = 'Baseline DecisionTrees Model'

roc_dt_rf(y_test, pred, label=label)
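
One thing to keep in mind: `pred` contains hard 0/1 labels, and a ROC curve built from hard labels has only a single operating point between (0, 0) and (1, 1). Passing predicted probabilities instead traces the full curve. The sketch below contrasts the two on synthetic data with a random forest; how `roc_dt_rf` handles its inputs internally is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song data
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=42)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# ROC from hard labels: at most three points
fpr_hard, tpr_hard, _ = roc_curve(y_te, clf.predict(X_te))
# ROC from class-1 probabilities: one point per distinct threshold
fpr_prob, tpr_prob, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])

print(len(fpr_hard), auc(fpr_hard, tpr_hard))
print(len(fpr_prob), auc(fpr_prob, tpr_prob))
```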

***
## DT Model 2: Bagged DecisionTree

### Instantiate classifier and fit

In [None]:
bagged_tree = BaggingClassifier(DecisionTreeClassifier())
bagged_tree.fit(X_train, y_train)
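
Bagging trains each tree on a bootstrap sample of the training data, so every tree leaves some rows unseen. Those out-of-bag rows give a validation estimate without touching the test set. A sketch on synthetic data, with `n_estimators=50` chosen here only so that most rows get an out-of-bag prediction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        oob_score=True, random_state=0)
bag.fit(X_demo, y_demo)

# Accuracy estimated on the rows each tree did not see during training
print(bag.oob_score_)
```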

### Predict

In [None]:
pred = bagged_tree.predict(X_test)

### Model coefficients

In [None]:
find_coeffs(bagged_tree, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(bagged_tree, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'Bagged DecisionTrees Model'

roc_dt_rf(y_test, pred, label=label)

***
## DT Model 3: DecisionTree GridSearch 

### Instantiate classifier

In [None]:
dtree_model = DecisionTreeClassifier() 

### Create Parameter Grid

In [None]:
dt_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 6]
}
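
The size of this grid determines the cost of the search: 2 × 4 × 3 × 3 = 72 combinations, each fit once per fold. `ParameterGrid` can be used to count them; the sketch below assumes the 3-fold cross-validation used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

demo_dt_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 6]
}

n_combinations = len(ParameterGrid(demo_dt_grid))
n_fits = n_combinations * 3  # 3-fold cross-validation
print(n_combinations, n_fits)  # 72 216
```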

### Instantiate GridSearchCV and fit

In [None]:
dt_grid_search = GridSearchCV(dtree_model, dt_param_grid, cv=3,
                              return_train_score=True, n_jobs=-1)

In [None]:
dt_grid_search.fit(X_train, y_train)

### Best parameters

In [None]:
print("Best Parameter Combination Found During Grid Search:")
dt_grid_search.best_params_

### Predict

In [None]:
pred = dt_grid_search.predict(X_test)

### Model coefficients

In [None]:
find_coeffs(dt_grid_search, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(dt_grid_search, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'DecisionTrees GridSearchCV Model'

roc_dt_rf(y_test, pred, label=label)

***
# Random Forests Models
***

## RF Model 1: Baseline Model

### Instantiate classifier and fit

In [None]:
forest = RandomForestClassifier()
forest.fit(X_train, y_train)

### Predict

In [None]:
pred = forest.predict(X_test)

### SHAP summary plot: mean absolute SHAP value of each feature

In [None]:
plot_shap(forest, X_train)

### Model coefficients

In [None]:
find_coeffs(forest, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(forest, X_train, X_test, y_train, y_test, pred)

Notes:
* The model has an accuracy of 85%.
* It struggles to predict popular songs: there are too many false negatives and false positives.

### ROC Curve and AUC

In [None]:
label = 'Baseline RandomForests Model'
roc_dt_rf(y_test, pred, label=label)

***
## RF Model 2: GridSearchCV Model

### Instantiate classifier

In [None]:
rforest_model = RandomForestClassifier()

### Create Parameter Grid

In [None]:
rf_param_grid = {
    'n_estimators': [10, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 5],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [3, 6]
}
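
This grid has 2 × 2 × 3 × 2 × 2 = 48 combinations, and random forests are slower to fit than single trees. As a sketch of a cheaper alternative (not what this notebook runs), `RandomizedSearchCV`, which is imported at the top, samples a fixed number of combinations instead of trying all of them. The data is synthetic and `n_iter=5` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_rf_grid = {
    'n_estimators': [10, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 5],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [3, 6]
}

# Sample 5 of the 48 combinations rather than fitting every one
rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                 demo_rf_grid, n_iter=5, cv=3, random_state=0)
rand_search.fit(X_demo, y_demo)
print(rand_search.best_params_)
```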

### Instantiate GridSearchCV and fit

In [None]:
rf_grid_search = GridSearchCV(rforest_model, rf_param_grid, cv=3,
                            return_train_score=True, n_jobs=-1)

In [None]:
rf_grid_search.fit(X_train, y_train)

### Best parameters

In [None]:
print("Best Parameter Combination Found During Grid Search:")
rf_grid_search.best_params_

### Predict

In [None]:
pred = rf_grid_search.predict(X_test)

### Model coefficients

In [None]:
find_coeffs(rf_grid_search, X_train, X).style.background_gradient(cmap='coolwarm')

### Model Performance

In [None]:
model_performance(rf_grid_search, X_train, X_test, y_train, y_test, pred)

### ROC Curve and AUC

In [None]:
label = 'RandomForests GridSearchCV Model'

roc_dt_rf(y_test, pred, label=label)