# Problem Statement
Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the 2013 crime rates dataset has many variables that could be turned into a binary outcome suitable for modeling.

Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn `LogisticRegression` class has a `penalty` argument that takes either `'l1'` or `'l2'` as a value.
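
As a minimal sketch of that hint, the block below fits the same logistic regression twice, once with each penalty. The synthetic data from `make_classification`, the value `C=0.1`, and the names `X_demo`, `y_demo`, `ridge_lr`, and `lasso_lr` are illustrative assumptions, not part of the assignment. The `liblinear` solver is used because it accepts both penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# C is the inverse regularization strength: smaller C means a stronger penalty.
ridge_lr = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_demo, y_demo)
lasso_lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)

# L1 can set coefficients exactly to zero; L2 only shrinks them toward zero.
n_zero_l2 = int((ridge_lr.coef_ == 0).sum())
n_zero_l1 = int((lasso_lr.coef_ == 0).sum())
print("zero coefficients  l2:", n_zero_l2, " l1:", n_zero_l1)
```

Counting exact zeros is one quick way to see the practical difference between the two penalties on a given dataset.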

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
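
One possible way to make the regularization parameter selection explicit is a cross-validated grid search over `C`. This is only a sketch under assumptions: the data is synthetic, and the grid values, the accuracy criterion, and the names `param_grid` and `search` are hypothetical choices rather than anything the assignment prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=15, random_state=0)

# Candidate values of C (inverse regularization strength); the grid is illustrative.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation scores every candidate by mean accuracy.
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("mean CV accuracy: %0.4f" % search.best_score_)
```

Swapping `scoring` for another metric (for example `"roc_auc"`) changes the model evaluation criterion without touching the rest of the search.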

### Outline
1. [Data](#data)
2. [Vanilla](#vanilla)
3. [Ridge](#ridge)
4. [Lasso](#lasso)
5. [Write-up](#write_up)

<a id = 'data'></a>
# Data

In [1]:
import pandas as pd
import numpy as np
# fix the NumPy seed so reruns are reproducible (0 is an arbitrary choice)
np.random.seed(0)
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import cross_val_score

In [2]:
data_path = '../../datasets/fma/echonest_cols.csv'
df = pd.read_csv(data_path)
df_feat = df.copy()
print ("Shape of df: {}".format(df_feat.shape))

Shape of df: (13129, 250)


In [3]:
# unhelpful columns
df_feat.drop(columns = ['track_id', 'artist_name', 'release'], inplace = True)
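# collinearity: flag temporal columns that correlate strongly with another column
temporal_cols = df_feat.columns[-224:]
feat_corr = df_feat.corr()
feat_corr = feat_corr.dropna(axis = 0)
corr_col = []
for col in temporal_cols:
    corrs = sorted(abs(feat_corr[col]), reverse = True)
    # corrs[0] is normally the column's self-correlation (1.0), so a score
    # above 1.5 means its strongest correlation with another column exceeds 0.5
    corr_score = corrs[0]+corrs[1]
    if corr_score > 1.5:
        corr_col.append(col)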
# drop correlated columns
df_feat.drop(columns = corr_col, inplace = True)

# nans
df_nans = df_feat.isnull().sum()
nans_cols = list(df_nans[df_nans > 100].index)
df_feat.drop(columns = nans_cols, inplace = True)
df_feat.fillna(value = df_feat.median(), inplace = True)
# zeros
df_zeros = (df_feat == 0).sum()
zeros_cols = list(df_zeros[df_zeros > 100].index)
df_feat.drop(columns = zeros_cols, inplace = True)
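
To illustrate what the collinearity filter in the cell above does, here is a small sketch on a toy frame. The columns `a`, `b`, and `c` and the name `flagged` are made up for the example; only the scoring rule (sum of the two largest absolute correlations compared against 1.5) is taken from the cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),            # independent noise
})

toy_corr = toy.corr().abs()
flagged = []
for col in toy.columns:
    top = sorted(toy_corr[col], reverse=True)
    # top[0] is the column's correlation with itself (1.0),
    # so the rule amounts to "strongest other correlation > 0.5".
    if top[0] + top[1] > 1.5:
        flagged.append(col)

print(flagged)
```

Note that the rule flags both members of a correlated pair, so dropping every flagged column removes the pair entirely rather than keeping one representative.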

In [4]:
# choose target and input
# drop the target by name so the remaining columns keep a stable order
X = df_feat.drop(columns = ['artist_familiarity'], errors = 'ignore')
print("Shape of X: {}".format(X.shape))
# y >= 0.3 is familiar (1), y < 0.3 is unfamiliar (0)
# building a new Series avoids assigning into a slice of df
y = (df['artist_familiarity'] >= 0.3).astype(int)

Shape of X: (13129, 25)



<a id = 'vanilla'></a>
# Vanilla

In [5]:
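def evaluate_model(model, X, y):
    # Fit on the full data set to obtain coefficients for inspection.
    fit = model.fit(X, y)
    # Score separately with 5-fold cross-validation (a clone is refit per fold).
    return cross_val_score(model, X, y, cv = 5), fit.coef_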

In [6]:
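# a very large C makes the default L2 penalty negligible (effectively unregularized)
vanilla_model = linear_model.LogisticRegression(C=1e9)
# note: axis = 1 z-scores each row across its features;
# axis = 0 would standardize each feature column instead
vanilla_scores, vanilla_coefs = evaluate_model(vanilla_model, X.apply(stats.zscore, axis = 1), y)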
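
The cell above z-scores the inputs before fitting. An alternative worth considering, sketched here on synthetic data, is to standardize each feature column inside a scikit-learn pipeline so the scaling statistics are learned from the training folds only. The names `X_demo`, `y_demo`, `scaled_model`, and `scaled_scores` are assumptions for the example, and `max_iter=1000` is an arbitrary allowance for the solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=15, random_state=0)

# StandardScaler z-scores each feature column; inside the pipeline it is
# refit on the training part of every fold during cross-validation.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e9, max_iter=1000),
)
scaled_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5)

print("accuracy: %0.4f (+/- %0.4f)" % (scaled_scores.mean(), scaled_scores.std()))
```

The same pipeline pattern would apply to the ridge and lasso models, which keeps the three comparisons on identically scaled inputs.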

<a id = 'ridge'></a>
# Ridge

In [7]:
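# note: RidgeClassifier fits ridge regression (squared loss) to -1/+1 targets;
# it is not an L2-penalized logistic regression
ridge_model = linear_model.RidgeClassifier(alpha = 1e-4)
ridge_scores, ridge_coefs = evaluate_model(ridge_model, X, y)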
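
`RidgeClassifier` and an L2-penalized `LogisticRegression` are different models, which matters when the goal is ridge logistic regression. The sketch below fits both on synthetic data to show the contrast; the data, the parameter values, and the names `ridge_clf`, `ridge_logit`, and `agreement` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=15, random_state=0)

# Ridge regression on -1/+1 targets: squared loss with an L2 penalty.
ridge_clf = RidgeClassifier(alpha=1.0).fit(X_demo, y_demo)

# Logistic regression: log loss; the default penalty is L2.
ridge_logit = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)

# Only the logistic model exposes class probabilities.
print("RidgeClassifier has predict_proba:", hasattr(ridge_clf, "predict_proba"))
print("LogisticRegression has predict_proba:", hasattr(ridge_logit, "predict_proba"))

# Fraction of samples on which the two models predict the same class.
agreement = (ridge_clf.predict(X_demo) == ridge_logit.predict(X_demo)).mean()
print("prediction agreement: %0.4f" % agreement)
```

The two often agree on most predictions, but their coefficients live on different scales, so they should not be compared directly.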

<a id = 'lasso'></a>
# Lasso

In [8]:
# the 'l1' penalty needs a solver that supports it, such as liblinear
lasso_model = linear_model.LogisticRegression(penalty = 'l1', solver = 'liblinear')
lasso_scores, lasso_coefs = evaluate_model(lasso_model, X, y)

<a id = 'write_up'></a>
# Write-up

In [9]:
# display accuracy
print ("\t\tMean Accuracy\nVanilla: \t%0.4f (+/- %0.4f)" % (vanilla_scores.mean(), vanilla_scores.std()))
print ("Ridge: \t\t%0.4f (+/- %0.4f)" % (ridge_scores.mean(), ridge_scores.std()))
print ("Lasso: \t\t%0.4f (+/- %0.4f)" % (lasso_scores.mean(), lasso_scores.std()))

		Mean Accuracy
Vanilla: 	0.8720 (+/- 0.0216)
Ridge: 		0.8715 (+/- 0.0206)
Lasso: 		0.8740 (+/- 0.0207)


In [10]:
# display coefficients
coefs = np.transpose([vanilla_coefs[0], np.transpose(ridge_coefs[0]), lasso_coefs[0]])
df_coefs = pd.DataFrame(data = coefs, index = X.columns, columns = ['vanilla', 'ridge', 'lasso'])
display(df_coefs)

| | vanilla | ridge | lasso |
|---|---|---|---|
| 170 | 0.48739 | -0.000615 | 0.01029 |
| 211 | 2.966437 | -0.001772 | -0.026166 |
| 172 | 0.794032 | 0.010658 | 0.044994 |
| 179 | -1.18586 | -0.010792 | -0.007379 |
| valence | 11.231501 | 0.097486 | 0.198311 |
| acousticness | 4.370769 | 0.099916 | 0.274178 |
| 174 | -3.381389 | -0.042412 | -0.11286 |
| liveness | -22.268186 | -0.092163 | -0.423843 |
| 169 | 0.698824 | -0.00084 | -0.0214 |
| 181 | -0.306313 | -0.0024 | -0.012609 |
