
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/burritos/burritos.csv')

In [5]:
df.shape

(423, 66)

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [7]:
df.shape

(421, 67)

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [16]:
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.


In [0]:
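# Parse the month/day/year strings into datetimes.
# (infer_datetime_format is deprecated in pandas 2.x, so it is omitted here.)
df['Date'] = pd.to_datetime(df['Date'])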

In [21]:
cutoff = pd.to_datetime('1/1/2018')
val_cut = pd.to_datetime('1/1/2017')

train = df[df.Date < val_cut]
val = df[(df.Date >= val_cut) & (df.Date < cutoff)]
test = df[df.Date >= cutoff]

train.shape, val.shape, test.shape

((298, 59), (85, 59), (38, 59))
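The same split can also be written in terms of the review year, which reads closer to the task statement ("2016 & earlier", "2017", "2018 & later"). The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the burrito data) and assumes `Date` has already been parsed to datetimes. Rows with a missing date would fall into none of the three splits, which is also true of the comparison-based version above.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (not the burrito data).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-08-15', '2018-01-01', '2019-03-02']),
    'Great': [True, False, True, False, True, False],
})

# Split on the calendar year of each review.
year = toy['Date'].dt.year
toy_train = toy[year <= 2016]
toy_val = toy[year == 2017]
toy_test = toy[year >= 2018]

# Sanity check: the three splits together account for every row.
n_split_rows = len(toy_train) + len(toy_val) + len(toy_test)
print(toy_train.shape, toy_val.shape, toy_test.shape, n_split_rows == len(toy))
```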

Begin with baselines for classification.
 

In [22]:
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
# mode() returns a Series; take its first value to get a single scalar label
majority_guess = y_train.mode()[0]
y_pred = [majority_guess] * len(y_train)

In [28]:
from sklearn.metrics import accuracy_score
print('Baseline Accuracy, guessing the mode every time', accuracy_score(y_train, y_pred))

Baseline Accuracy, guessing the mode every time 0.5906040268456376


In [30]:
y_val = val[target]
y_pred = [majority_guess] * len(y_val)
print('Baseline Accuracy for validation set:', accuracy_score(y_val, y_pred))

Baseline Accuracy for validation set: 0.5529411764705883
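The majority-class baseline can also be computed with scikit-learn's `DummyClassifier`, which keeps the baseline behind the same `fit`/`predict` interface as a real model. This is a minimal sketch on made-up labels (`y_toy_train`, `y_toy_val`, and the placeholder feature arrays are hypothetical stand-ins for `y_train` and `y_val`), not the notebook's actual numbers.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels standing in for y_train / y_val (not the burrito data).
y_toy_train = pd.Series([False, False, False, True, True])
y_toy_val = pd.Series([False, True, True, False])

# DummyClassifier ignores the features, so a placeholder column of zeros is enough.
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_val = np.zeros((len(y_toy_val), 1))

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy_train, y_toy_train)

train_baseline_acc = accuracy_score(y_toy_train, baseline.predict(X_toy_train))
val_baseline_acc = accuracy_score(y_toy_val, baseline.predict(X_toy_val))
print('Toy train baseline:', train_baseline_acc)
print('Toy validation baseline:', val_baseline_acc)
```

Note that the baseline is fit on the training labels only, so the validation score reflects how well the training majority class carries over to later reviews.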


Use scikit-learn for logistic regression.
 

In [35]:
train.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

features = ['Burrito', 'Yelp', 'Google', 'Cost', 'Hunger', 'Fillings', 'Uniformity', 'Salsa', 'Synergy', 'Temp']
target = 'Great'
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [48]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape

((298, 10), (298,), (85, 10), (85,), (38, 10), (38,))

In [49]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)
X_test_encoded = encoder.transform(X_test)

X_train_encoded.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Yelp,Google,Cost,Hunger,Fillings,Uniformity,Salsa,Synergy,Temp
0,1,0,0,0,0,3.5,4.2,6.49,3.0,3.5,4.0,4.0,4.0,5.0
1,1,0,0,0,0,3.5,3.3,5.45,3.5,2.5,4.0,3.5,2.5,3.5
2,0,1,0,0,0,,,4.85,1.5,3.0,4.0,3.0,3.0,2.0
3,0,0,1,0,0,,,5.25,2.0,3.0,5.0,4.0,4.0,2.0
4,1,0,0,0,0,4.0,3.8,6.59,4.0,3.5,5.0,2.5,4.5,5.0


In [0]:
# Fill missing values with each column's training-set mean (SimpleImputer's default strategy)
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

In [51]:
# Standardize each column to zero mean and unit variance, using training-set statistics only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

X_train_scaled[:5]

array([[ 1.23508045e+00, -2.22026518e-01, -3.64801107e-01,
        -7.64922469e-01, -2.75340288e-01, -1.71200227e+00,
         3.20514492e-01, -3.39805235e-01, -5.24306577e-01,
        -2.24469694e-02,  5.57477567e-01,  7.21245234e-01,
         5.00993379e-01,  1.34069563e+00],
       [ 1.23508045e+00, -2.22026518e-01, -3.64801107e-01,
        -7.64922469e-01, -2.75340288e-01, -1.71200227e+00,
        -4.67482113e+00, -1.20857147e+00,  6.44233034e-02,
        -1.20240271e+00,  5.57477567e-01,  1.87274062e-01,
        -1.13340089e+00, -2.13866771e-01],
       [-8.09663853e-01,  4.50396651e+00, -3.64801107e-01,
        -7.64922469e-01, -2.75340288e-01,  1.91418451e-15,
        -4.92972144e-15, -1.70978276e+00, -2.29049622e+00,
        -6.12424838e-01,  5.57477567e-01, -3.46697110e-01,
        -5.88602801e-01, -1.76842918e+00],
       [-8.09663853e-01, -2.22026518e-01,  2.74121975e+00,
        -7.64922469e-01, -2.75340288e-01,  1.91418451e-15,
        -4.92972144e-15, -1.37564190e+00, -1.
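The encode, impute, and scale steps above can be chained into a single scikit-learn `Pipeline`, which is one way to approach the pipelines stretch goal. The sketch below is an illustration under several assumptions, not the notebook's method: it swaps in scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` in place of `category_encoders`, scales only the numeric columns (the cells above also scale the one-hot columns), uses plain `LogisticRegression` because the made-up data is too small for cross-validation, and runs on a hypothetical toy frame (`X_toy`, `y_toy`).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one categorical and two numeric columns.
X_toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other', 'California', 'Asada', 'Other'],
    'Cost': [6.5, 5.25, np.nan, 7.0, 5.0, 6.0],
    'Synergy': [4.5, 3.0, 2.5, 5.0, np.nan, 3.5],
})
y_toy = pd.Series([True, False, False, True, False, False])

numeric = ['Cost', 'Synergy']
categorical = ['Burrito']

# Numeric columns: mean-impute, then standardize. Categorical column: one-hot encode.
preprocess = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])

# fit() learns the imputation means, scaling statistics, and categories from this data only.
pipeline.fit(X_toy, y_toy)
toy_predictions = pipeline.predict(X_toy)
print(toy_predictions)
```

With a pipeline, calling `pipeline.score(X_val, y_val)` on raw validation features applies the already-fitted preprocessing automatically, so there is no separate `transform` call to forget.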

Get your model's validation accuracy. (Multiple times if you try multiple iterations.)


In [52]:
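# LogisticRegressionCV chooses the regularization strength C by cross-validation on the training data
model = LogisticRegressionCV()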
model.fit(X_train_scaled, y_train)
print('Train Accuracy', model.score(X_train_scaled, y_train))
print('Validation Accuracy', model.score(X_val_scaled, y_val))

Train Accuracy 0.8657718120805369
Validation Accuracy 0.8235294117647058




Validation accuracy is about 82%, up from the 55% majority-class baseline on the validation set.
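For the "get and plot your coefficients" stretch goal, a fitted logistic regression exposes its weights through `coef_`. The sketch below shows the pattern on made-up data (`X_toy`, `y_toy`, `signal`, and `noise` are hypothetical names, not columns from the burrito data): pairing the coefficients with the feature names in a pandas Series makes them easy to sort and inspect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 'signal' determines the label, 'noise' does not.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_toy = X_toy['signal'] > 0

# Scale first so coefficient magnitudes are comparable across features.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
toy_model = LogisticRegression().fit(X_toy_scaled, y_toy)

# coef_ has shape (1, n_features) for a binary target; pair it with the column names.
coefficients = pd.Series(toy_model.coef_[0], index=X_toy.columns)
print(coefficients.sort_values())
```

With the variables from this notebook, the equivalent would presumably be `pd.Series(model.coef_[0], index=X_train_encoded.columns)`, and `.sort_values().plot.barh()` would draw the bar chart if matplotlib is installed.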

Get your model's test accuracy. (One time, at the end.)
 

In [53]:
print('Test Accuracy:', model.score(X_test_scaled, y_test))

Test Accuracy: 0.7631578947368421


Test accuracy is about 76%, still above the majority-class baselines computed earlier (59% on train, 55% on validation).

Commit your notebook to your fork of the GitHub repo.

Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.