# Logistic Regression

Logistic regression is a classification model that takes continuous variables as input and assigns each observation to a discrete group. It can handle two or more classes; here I am classifying tracks into two groups, Award Winning and Not Award Winning. To process my data, I will use a pipeline consisting of standard scaling followed by logistic regression. I will then grid search over the pipeline's hyperparameters to find the best model.
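As a rough illustration of how logistic regression turns continuous inputs into a class, the sketch below applies the logistic (sigmoid) function to a weighted sum of features and thresholds the result at 0.5. The weights, intercept, and feature values are made-up numbers for illustration, not values from this project.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept (not fitted to the award data)
weights = np.array([0.8, -1.2])
intercept = -0.5

# Two hypothetical rows of already-scaled features
rows = np.array([[2.0, 0.5],
                 [-1.0, 1.0]])

scores = rows @ weights + intercept          # linear score per row
probabilities = sigmoid(scores)              # probability of the positive class
labels = (probabilities > 0.5).astype(int)   # 1 = positive class, 0 = negative class
```

A positive score gives a probability above 0.5 and therefore a positive prediction; a negative score gives the opposite.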

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix

np.random.seed(42)

In [2]:
df = pd.read_csv('../00_Data/df_complete.csv')

In [3]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num = df.select_dtypes(include= numerics)

In [4]:
X = df_num.drop(columns = ['award_binary','award'])
y = df_num.award_binary

### Exporting Data

In [5]:
X.to_csv('../00_Data/X.csv', index = False )
y.to_csv('../00_Data/y.csv', index = False, header=True)

## Train Test Split

 - This is done so that we can later check that our model predicts accurately on unseen data.
 - More training data generally helps the model, so I am holding out only 30% of my data for testing.
 - Shuffle = True: This randomly assigns rows from our data frame to either the train set or the test set. I am doing this so that neither set depends on the order of the rows in the file and my model won't be overfit to one particular stretch of the data.
 
##### Unbalanced Data:   
   - Negative Class = 96%
   - Positive Class = 4%

##### Stratify = y :
   - Since my classes are unbalanced and rows are randomly assigned to the train or test set, it is possible that most of my positive class could end up in the test set, and my model would perform badly because it would have very little positive-class data to train on. Stratifying on y makes sure that the training and testing sets both keep the same proportion of positive-class rows (about 4%).

In [6]:
X_train, X_test , y_train , y_test = train_test_split(X,y , random_state = 42 , 
                                                      test_size = .3,shuffle = True , stratify = y)
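To make the effect of stratifying concrete, here is a small sketch on synthetic data with the same 96% / 4% class balance. The arrays `X_demo` and `y_demo` are invented for this illustration and are not the award data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in: 1000 rows, 4% positive class
y_demo = np.array([1] * 40 + [0] * 960)
X_demo = rng.normal(size=(1000, 3))

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                    random_state=42, shuffle=True,
                                    stratify=y_demo)

# With stratify, both rates stay close to the overall 4%
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Without `stratify`, the two rates can drift apart by chance, which matters more the rarer the positive class is.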

### Pipeline:
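Pipelines are a way to run multiple processing steps in the order that they are listed. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Because the scaler is part of the pipeline, it is refit on the training portion of each cross-validation fold, so no information from the validation fold leaks into the scaling. In the pipeline below, the steps are Standard Scaler followed by Logistic Regression.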

In [7]:
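pipe = Pipeline([
       ('ss',StandardScaler()),
       # liblinear supports both the l1 and l2 penalties; newer scikit-learn defaults to lbfgs, which rejects l1
       ('LR',LogisticRegression(solver='liblinear'))
])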
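As a sanity check on what a pipeline does, the sketch below compares a fitted pipeline against running the same two steps by hand. It uses a small synthetic dataset from `make_classification` (an assumption for illustration, not the award data) and the `liblinear` solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Pipeline: scale, then fit the classifier on the scaled features
pipe_demo = Pipeline([
    ('ss', StandardScaler()),
    ('LR', LogisticRegression(solver='liblinear')),
])
pipe_demo.fit(X_demo, y_demo)

# The same two steps done manually
scaler = StandardScaler().fit(X_demo)
manual = LogisticRegression(solver='liblinear').fit(scaler.transform(X_demo), y_demo)

# The learned coefficients should agree
same_coefs = np.allclose(pipe_demo.named_steps['LR'].coef_, manual.coef_)
```

The advantage of the pipeline form is that a grid search can repeat both steps inside every cross-validation fold without extra code.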

### Hyper Parameters:
 These are the settings that are fixed before training rather than learned from the data. Feeding a grid of candidate values through a grid search trains a model on every combination of them and returns the best one.

> #### Penalty: 
 - Determines which type of regularization is applied. l1 is lasso regularization, which minimizes the log loss plus the sum of the absolute values of the coefficients. l2 is ridge regularization, which minimizes the log loss plus the sum of the squared coefficients.
> #### C: 
 - This parameter is the inverse of the regularization strength. The smaller C is, the stronger the regularization.

In [8]:
params = {
   'LR__penalty':['l1','l2'],
   'LR__C':[.25,.5,.75,1]
}
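To show how the penalty and C interact, the sketch below fits an l1-penalized model at a small and a large value of C and counts how many coefficients survive. The dataset is synthetic (`make_classification` with 3 informative features out of 20), chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry signal
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def count_nonzero_coefs(C):
    """Fit an l1-penalized model and count coefficients that are not exactly zero."""
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear', random_state=0)
    model.fit(X_demo, y_demo)
    return int(np.count_nonzero(model.coef_))

n_strong = count_nonzero_coefs(0.01)  # small C = strong regularization
n_weak = count_nonzero_coefs(1.0)     # large C = weak regularization
```

With the l1 penalty, stronger regularization pushes more coefficients to exactly zero, so it also acts as a form of feature selection. The l2 penalty shrinks coefficients toward zero but rarely makes them exactly zero.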

### Grid Search:
- Grid search is a method of hyperparameter tuning, the process of selecting the hyperparameter values that maximize the model's score. GridSearchCV does this by fitting the model on every combination of the parameters with cross-validation and selecting the combination with the best average validation score.
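For a sense of how many candidates the search covers, scikit-learn's `ParameterGrid` can enumerate the combinations for a grid like the one defined in `params`. The dictionary is repeated here under the name `params_demo` so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

params_demo = {
    'LR__penalty': ['l1', 'l2'],
    'LR__C': [.25, .5, .75, 1],
}

combinations = list(ParameterGrid(params_demo))
n_combinations = len(combinations)  # 2 penalties x 4 values of C
```

Each of these candidates is fit once per cross-validation fold, so the total number of fits is the number of combinations times the number of folds.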


#### Scoring:
- I am scoring with 'roc_auc', which stands for "Receiver Operating Characteristic, Area Under the Curve". The reason I am scoring with this instead of accuracy is how unbalanced my classes are. We could predict zero for every data point and still get a 96% accuracy score, because 96% of the data is in our negative class. roc_auc instead measures how well the predicted probabilities rank positive rows above negative rows, by tracing the true positive rate against the false positive rate across every cutoff.
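The accuracy problem can be reproduced in a few lines. The labels below are a toy array with the same 96% / 4% split, not the real targets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels: 96 negatives, 4 positives
y_true = np.array([0] * 96 + [1] * 4)

# A "model" that always predicts the negative class
always_negative = np.zeros(100, dtype=int)
# The same model expressed as scores: every row gets an identical score
constant_scores = np.zeros(100)

accuracy = accuracy_score(y_true, always_negative)  # looks good despite learning nothing
auc = roc_auc_score(y_true, constant_scores)        # no better than random ranking
```

The accuracy comes out at 0.96 while the roc_auc is 0.5, which is the score of random guessing.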

In [9]:
LogReg = GridSearchCV(pipe,param_grid=params,scoring='roc_auc')
LogReg.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('LR', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'LR__penalty': ['l1', 'l2'], 'LR__C': [0.25, 0.5, 0.75, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

### Train & Test Score

In [10]:
LogReg.score(X_train , y_train)

0.8716995791934142

In [11]:
LogReg.score(X_test , y_test)

0.8414117477295757

### Best parameters

In [12]:
LogReg.best_params_

{'LR__C': 0.25, 'LR__penalty': 'l1'}

#### Predictions on every row in X_test

In [15]:
predictions = LogReg.predict(X_test)

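#### The probability of predictions
- The cutoff is 50%: a row is predicted positive when its predicted probability for the positive class is above 0.5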

In [14]:
predict_prob = LogReg.predict_proba(X_test)
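The relationship between the predicted probabilities and the 50% cutoff can be sketched as follows. This uses a synthetic imbalanced dataset and a plain `LogisticRegression` (both assumptions for illustration), and the lower cutoff of 0.2 is an arbitrary example value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: roughly 10% positive class
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of the positive class
proba_positive = model.predict_proba(X_demo)[:, 1]

# Applying the 0.5 cutoff by hand reproduces predict()
default_preds = (proba_positive > 0.5).astype(int)
matches_predict = np.array_equal(default_preds, model.predict(X_demo))

# A lower cutoff flags more rows as positive, so recall cannot go down
lower_preds = (proba_positive > 0.2).astype(int)
recall_default = recall_score(y_demo, default_preds)
recall_lower = recall_score(y_demo, lower_preds)
```

Lowering the cutoff trades precision for recall, which can be worth considering when the positive class is rare.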

In [43]:
predictions.sum()

271.0

### Confusion Matrix
   
##### True Positive:
   - A True Positive is a case where my model predicts the positive class and is correct on that prediction
   
##### True Negative:
  - A True Negative is a case where my model predicts the negative class and is correct on that prediction

We want to maximize both of these counts, because that means my model can really tell the difference between the classes.

In [17]:
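# Rows of confusion_matrix are the actual classes; columns are the predicted classes
pd.DataFrame(confusion_matrix(y_test , predictions) ,
             columns = ['Predicted Negative','Predicted Positive'] ,
             index = ['Actual Negative','Actual Positive'])

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | 24606 | 31 |
| Actual Positive | 899 | 45 |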
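The four counts in the confusion matrix above are enough to work out precision, recall, and accuracy by hand. A short sketch, with the counts copied from that output:

```python
# Counts copied from the confusion matrix output
tn, fp = 24606, 31   # actual negatives: predicted negative / predicted positive
fn, tp = 899, 45     # actual positives: predicted negative / predicted positive

precision = tp / (tp + fp)                   # share of positive predictions that were right, about 0.59
recall = tp / (tp + fn)                      # share of actual positives that were found, about 0.05
accuracy = (tp + tn) / (tp + tn + fp + fn)   # about 0.96, dominated by the negative class
```

The high accuracy next to the very low recall is the imbalance problem in practice: the model is right most of the time mainly because it rarely predicts the positive class.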


#### Result Visualization

In [72]:
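# predictions follow the shuffled row order of X_test, so align them with df on X_test's index
pred_df_log = pd.concat([pd.Series(predictions , index = X_test.index , name = 'predictions') , df.loc[X_test.index , ['track', 'artist' , 'award_binary' ]]] , axis = 1)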

### Top 20 features that had the most weight in predicting classes

In [45]:
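# Exponentiate the coefficients to get odds ratios (per one standard deviation, since the features were scaled),
# then keep the 20 features with the largest ratios, i.e. those that most raise the odds of the positive class
top_20_feat_logreg = pd.DataFrame(np.exp(LogReg.best_estimator_.named_steps['LR'].coef_.T),
             X.columns).sort_values(by = 0, ascending = False).head(20)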
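As a reading aid for exponentiated coefficients, the sketch below uses three hypothetical features with made-up coefficients (not values from the fitted model). It also shows an alternative ranking by absolute coefficient, which would surface strongly negative features as well.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and coefficients, for illustration only
coefs = pd.Series({'feature_a': 0.7, 'feature_b': -1.1, 'feature_c': 0.0})

# Odds ratio: above 1 raises the odds of the positive class,
# below 1 lowers them, exactly 1 means no effect
odds_ratios = np.exp(coefs)

# Ranking by absolute size treats positive and negative effects alike
by_magnitude = coefs.abs().sort_values(ascending=False)
```

Sorting the odds ratios in descending order, as in the cell above, lists only the features that push predictions toward the positive class. A feature that the l1 penalty zeroed out has an odds ratio of exactly 1.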

## Summary:

- My best parameters were an l1 (lasso) penalty and C = .25. This model did not do well on recall for my positive class. It predicted only 76 tracks as award-worthy, and only 45 of those were correct, while the test set contains 944 actual award winners (a recall of about 5%). My roc_auc score was .871 on my training set and .841 on my testing set, which suggests slight but not severe overfitting.