Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [83]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score  # classification metric; mean_absolute_error is for regression
from category_encoders import OneHotEncoder



In [7]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [8]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [9]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [10]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [11]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [12]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [13]:
# Replace NaN values with zeros
df.fillna(value = 0, inplace = True)

In [None]:
# Replace 'x' values with ones
df.replace('x', 1, inplace = True)

In [None]:
# Convert the boolean 'Great' column to 1/0
df['Great'] = df['Great'].astype(int)

In [23]:
# Drop useless columns
df.drop(columns = ['Unreliable', 'Yelp', 'Google', 'Mass (g)', 'Density (g/mL)'], inplace = True)

In [25]:
df['Great'].value_counts(normalize = True)

0    0.567696
1    0.432304
Name: Great, dtype: float64

## Establish Baseline

In [106]:
# The most common class is 0 (not 'Great'), so the baseline predicts 0
# for every burrito. Its accuracy equals the majority-class share.

baseline = round(df['Great'].value_counts(normalize=True).max(), 2)
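
The majority-class baseline can also be checked with scikit-learn. The following is a minimal sketch, assuming a small made-up target (`y_toy` is hypothetical, not the burrito data) and using `DummyClassifier` with the `most_frequent` strategy:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 4 'not great' (0) and 3 'great' (1) reviews
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1], name='Great')

# Majority-class share, computed directly with pandas
majority_share = y_toy.value_counts(normalize=True).max()

# Same idea via scikit-learn: always predict the most frequent class
X_dummy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier
dummy = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_toy)
dummy_accuracy = dummy.score(X_dummy, y_toy)

print(round(majority_share, 2), round(dummy_accuracy, 2))
```

In a train/validate/test workflow, one would typically fit the dummy model on the training target and score it on the validation target, so the baseline is measured on the same rows as the real model.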

## Timeline Split


In [42]:
df['Date'] = pd.to_datetime(df['Date'])

In [45]:
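# Train on 2016 and earlier, validate on 2017-2018, test on 2019 and later.
# Note: the assignment asks for validation on 2017 only and test on 2018 and later.
train = df['Date'] <= pd.to_datetime('12/31/2016')
validate = (pd.to_datetime('12/31/2016') < df['Date']) & (df['Date'] <= pd.to_datetime('12/31/2018'))
test = df['Date'] >= pd.to_datetime('01/01/2019')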

train_df = df[train]
validate_df = df[validate]
test_df = df[test]
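
For comparison, the split as worded in the assignment (train on 2016 and earlier, validate on 2017, test on 2018 and later) could be written by year. This is a sketch on a hypothetical toy frame (`toy` and its values are made up for illustration), not a rerun of the notebook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the burrito reviews
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-03-15',
                            '2017-12-31', '2018-01-01', '2019-07-04']),
    'Great': [1, 0, 1, 0, 1, 0],
})

# Split on the calendar year of each review
year = toy['Date'].dt.year
toy_train = toy[year <= 2016]
toy_val = toy[year == 2017]
toy_test = toy[year >= 2018]

print(len(toy_train), len(toy_val), len(toy_test))
```

Splitting on `.dt.year` avoids off-by-one mistakes at the year boundaries, and the three masks cover every row exactly once.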

## Building a Model

In [108]:
pipe = make_pipeline(
    OneHotEncoder(use_cat_names = True),
    StandardScaler(),
    LogisticRegression())

## Preparing Data for the Model

In [93]:
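# Features: every column except the target ('Great', the last column), indexed by date.
# Target: a 1-D Series, the shape scikit-learn expects for y.
X_train = train_df.iloc[:,0:-1].set_index('Date')
y_train = train_df['Great']

X_validate = validate_df.iloc[:,0:-1].set_index('Date')
y_validate = validate_df['Great']

X_test = test_df.iloc[:,0:-1].set_index('Date')
y_test = test_df['Great']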

## Fit the Model to the Data


In [110]:
pipe.fit(X_train,y_train);



## Evaluate performance

In [111]:
print(f'Baseline accuracy: {baseline}')
print(f'Training data accuracy: {round(pipe.score(X_train, y_train),2)}')
print(f'Validation data accuracy: {round(pipe.score(X_validate, y_validate),2)}')
print(f'Testing data accuracy: {round(pipe.score(X_test, y_test),2)}')


Baseline accuracy: 0.57
Training data accuracy: 0.91
Validation data accuracy: 0.76
Testing data accuracy: 0.73
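
One of the stretch goals is to get the model's coefficients. A minimal sketch of pulling them out of a fitted pipeline follows, assuming a toy numeric dataset (`X_toy`, `y_toy`, and their values are hypothetical) and a pipeline without the one-hot encoder, so the input column names map directly onto the coefficients:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features; names and values are illustrative only
X_toy = pd.DataFrame({
    'Tortilla': [1.0, 2.0, 3.0, 4.0, 5.0, 4.5, 1.5, 2.5],
    'Cost':     [6.0, 7.5, 6.5, 7.0, 8.0, 6.0, 7.0, 6.5],
})
y_toy = pd.Series([0, 0, 1, 1, 1, 1, 0, 0], name='Great')

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
model = toy_pipe.named_steps['logisticregression']

# One coefficient per (scaled) feature for the positive class
coefficients = pd.Series(model.coef_[0], index=X_toy.columns).sort_values()
print(coefficients)
```

Because the features are standardized, coefficient magnitudes are roughly comparable across columns. With an encoder at the front of the pipeline, the labels would have to come from the encoded columns rather than from the raw input frame.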
