<a href="https://colab.research.google.com/github/hargettc2015/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/Unit_2_Sprint_1_Module_4_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [54]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
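# Replace every missing value with 0.
# Missing ratings (e.g. Yelp, Google) therefore show up as 0 below.
df = df.fillna(0)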

In [0]:
# Drop the categorical Burrito column and several 'x'-flag columns
df = df.drop(columns=['Burrito', 'Chips', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Pork'])

In [0]:
# These imports are not used below; pd.to_datetime handles the date parsing.
from datetime import date, datetime, time

In [0]:
# Keep only the review year, which is all the train/validate/test split needs
df['Date'] = pd.to_datetime(df['Date']).dt.year

In [74]:
df.head()

Unnamed: 0,Date,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Sour cream,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,2016,3.5,4.2,6.49,3.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
1,2016,3.5,3.3,5.45,3.5,0.0,0.0,0.0,0.0,0.0,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
2,2016,0.0,0.0,4.85,1.5,0.0,0.0,0.0,0.0,0.0,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
3,2016,0.0,0.0,5.25,2.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
4,2016,4.0,3.8,6.59,4.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,1


In [0]:
# Convert the boolean target to 0/1
df['Great'] = df['Great'].astype(int)

In [77]:
df.head()

Unnamed: 0,Date,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Sour cream,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,2016,3.5,4.2,6.49,3.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
1,2016,3.5,3.3,5.45,3.5,0.0,0.0,0.0,0.0,0.0,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
2,2016,0.0,0.0,4.85,1.5,0.0,0.0,0.0,0.0,0.0,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
3,2016,0.0,0.0,5.25,2.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,0
4,2016,4.0,3.8,6.59,4.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0,1


In [0]:
# Time-based split: train on 2016 & earlier, validate on 2017, test on 2018 & later
train = df[df['Date'] <= 2016]
validate = df[df['Date'] == 2017]
test = df[df['Date'] >= 2018]

In [80]:
train.shape, validate.shape, test.shape

((298, 51), (85, 51), (38, 51))
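The year-based split above can be illustrated on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its dates are made up for illustration and are not rows from the burrito dataset.

```python
import pandas as pd

# Hypothetical toy data (not from burritos.csv)
toy = pd.DataFrame({
    'Date': ['5/1/2015', '1/18/2016', '3/2/2017', '7/9/2018', '2/4/2019'],
    'Great': [0, 1, 0, 1, 1],
})

# Extract the year, then split on it
years = pd.to_datetime(toy['Date'], format='%m/%d/%Y').dt.year
toy_train = toy[years <= 2016]
toy_val = toy[years == 2017]
toy_test = toy[years >= 2018]

# The three masks are mutually exclusive and cover every year,
# so each row lands in exactly one split.
print(toy_train.shape, toy_val.shape, toy_test.shape)
```

Splitting on time rather than at random keeps later reviews out of training, which is closer to how the model would be used on future burritos.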

In [82]:
# About 59% of training burritos were not rated 'Great' (overall rating of 4 or higher).
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

0    0.590604
1    0.409396
Name: Great, dtype: float64

In [0]:
# Majority-class baseline: always predict the most common training label
majority_rating = y_train.mode()[0]
y_pred = [majority_rating] * len(y_train)

In [86]:
from sklearn.metrics import accuracy_score
# Baseline accuracy on the training set
accuracy_score(y_train, y_pred)

0.5906040268456376

In [0]:
val = validate

In [91]:
# Baseline accuracy on the validation set
y_val = val[target]
y_pred = [majority_rating] * len(y_val)
accuracy_score(y_val, y_pred)

0.5529411764705883
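The same majority-class baseline can also be produced with scikit-learn's `DummyClassifier`, which may be convenient once the rest of the workflow uses estimators. The sketch below uses small made-up label arrays (`y_train_toy`, `y_val_toy`) rather than the burrito data, so the numbers it prints are not the baseline figures above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels standing in for y_train / y_val
y_train_toy = np.array([0, 0, 0, 1, 1])
y_val_toy = np.array([0, 1, 1, 0])

# DummyClassifier ignores the features, so a placeholder column is enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# 'most_frequent' always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_acc = accuracy_score(y_train_toy, baseline.predict(X_train_toy))
val_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(train_acc, val_acc)
```

Note that the majority class is learned from the training labels only and then reused on the validation labels, matching the manual approach.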

In [92]:
# Logistic regression: start with summary statistics for the training set
train.describe()

Unnamed: 0,Date,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso,Great
count,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0
mean,2015.979866,0.928523,0.986913,6.757919,3.433725,0.0,0.0,11.645067,12.870302,0.450134,3.472315,3.519799,3.432047,3.507215,3.457819,3.373154,3.10151,3.516443,3.928523,0.0,0.409396
std,0.295187,1.679213,1.776823,1.542546,0.873812,0.0,0.0,9.908151,10.958878,0.394904,0.797606,1.262157,1.068136,0.873048,1.143324,1.120343,1.254644,0.963831,1.207535,0.0,0.49255
min,2011.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2016.0,0.0,0.0,6.25,3.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,0.0,0.0
50%,2016.0,0.0,0.0,6.725,3.5,0.0,0.0,17.89,20.5,0.64,3.5,4.0,3.5,3.5,4.0,3.5,3.0,3.725,4.0,0.0,0.0
75%,2016.0,0.0,0.0,7.5,4.0,0.0,0.0,20.0,22.0,0.77,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,1.0
max,2016.0,4.5,4.9,11.95,5.0,0.0,0.0,26.0,27.0,1.24,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.0,1.0


In [93]:
# Fit a linear regression first, for comparison with logistic regression

# Step 1: Import the estimator class
from sklearn.linear_model import LinearRegression

# Step 2: Instantiate the class
linear_reg = LinearRegression()

# Step 3: Arrange the X feature matrices
features = ['Wrap', 'Meat', 'Tortilla']
X_train = train[features]
X_val = val[features]

# Impute missing values (mean imputation is the SimpleImputer default)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

# Step 4: Fit the model
linear_reg.fit(X_train_imputed, y_train)

# Step 5: Apply the model to new data. The predictions are continuous
# and can fall outside [0, 1], so they are not class probabilities.
linear_reg.predict(X_val_imputed)

array([ 0.62793003, -0.04180894,  0.45455856,  0.53398047,  0.14002791,
        0.33193944,  0.66258266,  0.59820814,  0.56462069,  0.70921772,
        0.71533062, -0.06201312,  0.31062812,  0.33953656,  0.92898537,
        0.35219843,  0.44466795,  0.43803112,  0.48481304,  0.98896174,
        0.51372148,  0.45374511,  0.46999169,  0.43749854,  0.63911186,
        0.71323014,  0.72589201,  0.60260629,  0.4121748 ,  0.60809875,
        0.70801854,  0.53904522,  0.2690578 ,  0.6405919 ,  0.26303276,
        0.33953656,  0.68121057,  0.67829665,  0.41962507,  0.37642787,
        0.6405919 ,  0.42842138,  0.38163947,  0.61526816,  0.85248157,
        0.7117501 ,  0.30513566,  0.28672961,  0.51372148,  0.04565064,
        0.43749854,  0.45374511,  0.47974829,  0.43749854,  0.35219843,
        0.17826524,  0.33953656,  0.06738967,  0.58994443, -0.16867936,
       -0.36053609,  0.31684589,  0.07097438,  0.5552918 ,  0.41934422,
        0.55995624,  0.52996806,  0.77320651,  0.4121748 ,  0.36

In [94]:
# Get the linear regression coefficients, labeled by feature
pd.Series(linear_reg.coef_, features)

Wrap        0.025324
Meat        0.170600
Tortilla    0.203093
dtype: float64

In [95]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)
print('Validation Accuracy', log_reg.score(X_val_imputed, y_val))

Validation Accuracy 0.7058823529411765


In [96]:
log_reg.predict(X_val_imputed)

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1])

In [0]:
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))
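To connect the sigmoid to the fitted model: for a binary `LogisticRegression`, the predicted probability of the positive class is the sigmoid of the linear score (intercept plus coefficients times features). The sketch below checks this on synthetic data; `X_toy`, `y_toy`, and `toy_model` are hypothetical names for illustration and have nothing to do with the burrito ratings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear score for each row: intercept + dot(coefficients, features)
scores = toy_model.intercept_[0] + X_toy @ toy_model.coef_[0]

# Passing the score through the sigmoid gives P(class 1)
manual_proba = sigmoid(scores)
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# predict() corresponds to thresholding that probability at 0.5
manual_pred = (manual_proba > 0.5).astype(int)
print(np.array_equal(manual_pred, toy_model.predict(X_toy)))
```

This is why logistic regression outputs stay between 0 and 1, unlike the linear regression predictions, which can go negative.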

In [0]:
# Build a larger feature set and rebuild the train/validation matrices
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

features = ['Wrap', 'Meat', 'Tortilla', 'Cost', 'Fillings', 'Meat:filling', 'Salsa']
target = 'Great'
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

In [103]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((298, 7), (298,), (85, 7), (85,))

In [0]:
# One-hot encoding (all seven features are numeric, so they pass through unchanged)
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

# Simple imputer: fit on train, apply to validation
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

# Standard scaler: fit on train, apply to validation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [110]:
# Logistic regression with 5-fold cross-validated regularization strength
model = LogisticRegressionCV(cv=5)
model.fit(X_train_scaled, y_train)
print('Validation Accuracy', model.score(X_val_scaled, y_val))

Validation Accuracy 0.8470588235294118


In [112]:
coefficients = pd.Series(model.coef_[0], X_train_encoded.columns)
coefficients.sort_values()

Cost           -0.071904
Wrap            0.122496
Salsa           0.363992
Meat            0.497127
Tortilla        0.672148
Meat:filling    0.831926
Fillings        1.410434
dtype: float64
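The impute, scale, and fit steps above can be chained with a scikit-learn pipeline, which also makes the final one-time test score a single call. This is a minimal sketch on synthetic data: `X_all`, `y_all`, and the 200/50/50 split are assumptions for illustration, so the accuracies it prints say nothing about the burrito model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (hypothetical, not the burrito data)
rng = np.random.default_rng(42)
X_all = rng.normal(size=(300, 4))
y_all = (X_all[:, 0] - X_all[:, 1] > 0).astype(int)

# Blank out roughly 5% of values so the imputer has something to fill
X_all[rng.random(X_all.shape) < 0.05] = np.nan

# Ordered split standing in for train / validate / test
X_tr, y_tr = X_all[:200], y_all[:200]
X_va, y_va = X_all[200:250], y_all[200:250]
X_te, y_te = X_all[250:], y_all[250:]

# Each step is fit on the training data only, then reused on new data
pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegressionCV(cv=5),
)
pipeline.fit(X_tr, y_tr)

# Validation accuracy can be checked as often as needed while iterating
val_accuracy = pipeline.score(X_va, y_va)
print('Validation Accuracy', val_accuracy)

# Test accuracy is computed once, at the end
test_accuracy = pipeline.score(X_te, y_te)
print('Test Accuracy', test_accuracy)
```

Applied to this notebook, the same pattern would be fit on `X_train` and scored on the `test` rows for the chosen features, once the model choice is final.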