# Challenge

Pick a dataset of your choice with a binary outcome and the potential for at least 15 features.

Engineer your features, then create 3 models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach):

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value.
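As a minimal sketch of that hint, the cell below fits one L2-penalized and one L1-penalized logistic regression. The data come from `make_classification` as a synthetic stand-in (an assumption, not the dataset used in this notebook), and the `C` values are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 500 rows, 15 features, binary target.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

# Ridge-style (L2) penalty. C is the inverse of the regularization strength.
ridge_logit = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Lasso-style (L1) penalty. It needs a solver that supports l1, such as liblinear or saga.
lasso_logit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("L2 coefficients equal to zero:", int((ridge_logit.coef_ == 0).sum()))
print("L1 coefficients equal to zero:", int((lasso_logit.coef_ == 0).sum()))
```

Note that `C` works in the opposite direction to the `alpha` of `Ridge` and `Lasso`: a smaller `C` means a stronger penalty.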

In [23]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import preprocessing
%matplotlib inline

df = pd.read_csv('./data/abalone.csv', names=["sex", "length", "diameter", "height", "whole weight", "shucked weight", "viscera weight", "shell weight", "rings"])
# all_sexes = df["sex"]

# Standardize data
# names = list(df.columns)
# names.remove("sex")
# df = pd.DataFrame(preprocessing.scale(df[names]), columns=names)
# df["sex"] = all_sexes

df.head()

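|   | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |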
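The challenge asks for a binary outcome, while `rings` is a count. One possible way to get there, sketched on the five rows printed above, is to threshold `rings` and one-hot encode `sex`. The cutoff `RING_THRESHOLD` and the column name `is_old` are hypothetical choices made for illustration only.

```python
import pandas as pd

# The five rows shown in the df.head() output (only a few columns, for brevity).
sample = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I"],
    "length": [0.455, 0.35, 0.53, 0.44, 0.33],
    "rings": [15, 7, 9, 10, 7],
})

# Hypothetical cutoff: treat abalone with 10 or more rings as "old".
RING_THRESHOLD = 10
sample["is_old"] = (sample["rings"] >= RING_THRESHOLD).astype(int)

# One-hot encode the categorical sex column so it can be used as a feature.
features = pd.get_dummies(sample[["sex", "length"]], columns=["sex"])

print(sample[["rings", "is_old"]])
print(features.columns.tolist())
```

Dummy columns like these also help toward the 15-feature target, since the raw file has only eight predictors.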


In [29]:
num_rows = df.shape[0]
trainsize = int(num_rows / 2)
df_test = df.iloc[trainsize:, :].copy()
df_train = df.iloc[:trainsize, :].copy()

# TRAINING
X_train = df_train.loc[:, ~(df_train.columns).isin(['rings', 'sex'])]
Y_train = df_train['rings'].values.reshape(-1, 1)

# TESTING
X_test = df_test.loc[:, ~(df_test.columns).isin(['rings', 'sex'])]
Y_test = df_test['rings'].values.reshape(-1, 1)
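The split above takes the first half of the file for training and the second half for testing, so it depends on row order. A hedged alternative is a shuffled, stratified split with `train_test_split`. The arrays below are random placeholders (an assumption), not the abalone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 200 rows, 7 features, roughly 30% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = (rng.random(200) < 0.3).astype(int)

# Shuffle before splitting, and keep the class balance the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

print("train positives:", y_train.mean(), "test positives:", y_test.mean())
```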

## Vanilla Logistic Regression

In [30]:
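# Note: rings is a count, not a binary label, so this cell fits an ordinary
# least-squares LinearRegression and score() reports R², not accuracy.
vanilla = linear_model.LinearRegression()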
vanilla.fit(X_train, Y_train)
print('\nR²:', vanilla.score(X_train, Y_train))
print('\nCoefficients:', vanilla.coef_)
print('\nIntercept:', vanilla.intercept_)


R²: 0.5287212733800659

Coefficients: [[ -3.53448893  17.65380803   7.03000173  11.81187581 -22.71063649
  -11.89651271   5.03184816]]

Intercept: [2.92780683]


In [31]:
print('\nR²:', vanilla.score(X_test, Y_test))


R²: 0.5200804761044662
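For comparison with the linear fit above, here is a sketch of what a "vanilla" logistic regression could look like once a binary target exists. The data are synthetic (an assumption). Because `LogisticRegression` applies an L2 penalty by default, a very large `C` is used here to make that penalty negligible; this is an approximation of an unpenalized fit, not an exact one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): binary target, 15 features.
X, y = make_classification(n_samples=600, n_features=15, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1
)

# Very large C means a very weak penalty, approximating an unregularized model.
vanilla_logit = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)

# For classifiers, score() returns mean accuracy rather than R².
train_acc = vanilla_logit.score(X_train, y_train)
test_acc = vanilla_logit.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```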


## Ridge Logistic Regression

In [None]:
# ridge = linear_model.LogisticRegression(penalty="l2")

In [32]:
# ridge shrinks coefficients toward zero but does not set them exactly to zero
# in the beginning, it is good to go with lasso because it shows which variables are worth keeping

# coefficient sizes depend on the scale of the data
# rule of thumb: if coefficients are in the 1-10 range, a good alpha might be around 5
# if coefficients are less than 1, smaller alphas will be sufficient
ridge = linear_model.Ridge(alpha=0.5, fit_intercept=False)
ridgefit = ridge.fit(X_train, Y_train)

print('R²:', ridge.score(X_train, Y_train))
print('\nCoefficients:', ridgefit.coef_)
print('\nIntercept:', ridgefit.intercept_)

# try also with different data? good for logistic regression

R²: 0.5112154373106311

Coefficients: [[  8.01999961  12.76564296   7.70335814   8.02944719 -20.10108544
   -9.01806838   6.69822391]]

Intercept: 0.0


In [33]:
print('\nR²:', ridge.score(X_test, Y_test))


R²: 0.5085877627775135
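The ridge notes above (coefficients shrink but do not reach zero, and the useful alpha depends on scale) can be checked with a small sketch. The data are generated with `make_regression` (an assumption, not the abalone file), and the alpha grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption): 200 rows, 7 features.
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

alphas = [0.01, 0.5, 5.0, 50.0, 500.0]
norms = []        # overall size of the coefficient vector at each alpha
zero_counts = []  # how many coefficients are exactly zero at each alpha

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    zero_counts.append(int((model.coef_ == 0).sum()))
    print(f"alpha={alpha:>6}: coefficient norm={norms[-1]:.2f}, exact zeros={zero_counts[-1]}")
```

The coefficient norm falls as alpha grows, while the count of exact zeros stays at zero, which is the behavior the notes describe.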


## Lasso Logistic Regression

In [34]:
# lasso = linear_model.LogisticRegression(penalty="l1", solver="liblinear")

In [35]:
# Larger alpha --> stronger penalty, so more coefficients are set exactly to zero
lasso = linear_model.Lasso(alpha=.001)
lassofit = lasso.fit(X_train, Y_train)

print('R²:', lasso.score(X_train, Y_train))
print('\nCoefficients:', lassofit.coef_)
print('\nIntercept:', lassofit.intercept_)

R²: 0.5278667911499331

Coefficients: [ -0.          13.20574399   6.01743935  10.97734904 -21.98380604
 -10.1062783    5.88822553]

Intercept: [2.93493665]


In [36]:
print('\nR²:', lasso.score(X_test, Y_test))


R²: 0.5213568146382644
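To illustrate the note that a larger alpha zeroes out more features, the sketch below refits `Lasso` over a few alphas on synthetic data (an assumption). The helper value `alpha_max` is the smallest alpha at which all lasso coefficients are zero under scikit-learn's objective; the fractions of it are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 7 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=7, n_informative=3, noise=10.0, random_state=0
)

# Lasso fits an intercept, so center X and y before computing alpha_max.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = float(np.abs(Xc.T @ yc).max() / len(y))

zero_counts = []
for fraction in [0.001, 0.1, 0.5, 1.01]:
    model = Lasso(alpha=fraction * alpha_max).fit(X, y)
    zero_counts.append(int((model.coef_ == 0).sum()))
    print(f"alpha={fraction * alpha_max:8.3f}: zero coefficients={zero_counts[-1]} of 7")
```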


In [37]:
# PROCESS / TIMELINE

# do lasso
# get coefficients to see which are zero
# make sure R-squared values for train and test are similar
# I do not have to take features out manually because lasso sets the coefficients of unimportant features to zero

# remove those features before ridge, because ridge will not make unimportant features zero
# do ridge, which will help prevent overfitting

# REPEAT (trial and error) and keep looking at the data
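The process above can be sketched end to end as follows. This is only one reading of the notes: lasso picks the features, ridge is refit on the survivors, and train and test R² are compared. The data are synthetic and the alpha values are arbitrary (both assumptions).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 15 features, 5 informative.
X, y = make_regression(
    n_samples=400, n_features=15, n_informative=5, noise=15.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: fit lasso and keep the features whose coefficients are not zero.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
kept = np.flatnonzero(lasso.coef_)
print("features kept by lasso:", kept.tolist())

# Step 2: fit ridge on the kept features only.
ridge = Ridge(alpha=0.5).fit(X_train[:, kept], y_train)

# Step 3: similar train and test R² values suggest the model is not overfitting.
train_r2 = ridge.score(X_train[:, kept], y_train)
test_r2 = ridge.score(X_test[:, kept], y_test)
print("train R²:", round(train_r2, 3), "test R²:", round(test_r2, 3))
```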

## Evaluate Models

Evaluate all 3 models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
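One hedged way to make the regularization choice less ad hoc is a cross-validated grid search over the penalty type and strength. The sketch below uses synthetic data and an arbitrary grid of `C` values (both assumptions). The `liblinear` solver is chosen because it supports both 'l1' and 'l2'.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): binary target, 15 features.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

# Candidate settings: lasso-style vs. ridge-style penalty at several strengths.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, scored by accuracy on the held-out fold.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print("best settings:", search.best_params_)
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

This also covers the folds approach mentioned in the challenge, since every candidate is scored on several held-out test sets.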

In [None]:
# keep features same
# adapt alphas to be different for ridge vs. lasso

# Evaluation criteria – R-squared for test and train

# Linear regression assumes the residuals (not the variables themselves) are approximately normally distributed
# If the residuals are far from normal, p-values and confidence intervals become unreliable, and heavy skew or outliers can distort the fit
# Random forest / KNN make no such distributional assumption and may work better in those scenarios

# RANDOM FOREST
# random forest is a very common default model to use
# add: gradient boosting machines, which often improve on random forest
# nice because you don't have to make variables normally distributed
# scaling is not strictly necessary for random forest
# outlier removal, dummy encoding, and cleaning are still necessary for random forest

# LOGISTIC REGRESSION
# more commonly used than linear regression
# easy to explain 

# kaggle xgboost (https://www.kaggle.com/dansbecker/learning-to-use-xgboost)
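To make the comparison in these notes concrete, the sketch below scores a scaled logistic regression and a random forest with the same 5-fold cross-validation. The data are synthetic and the model settings are untuned (both assumptions), so the numbers say nothing about the abalone data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): binary target, 15 features.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

models = {
    # Penalized logistic regression is sensitive to feature scale, so scale first.
    "logistic regression (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tree-based models split on thresholds, so scaling is left out here.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model.
scores = {name: float(cross_val_score(model, X, y, cv=5).mean()) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```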