Lambda School Data Science

---

# Logistic Regression


## Assignment 🌯

Using a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/), build a model that predicts whether a burrito is rated `'Great'`.

## What We Want

*Associate Instructor* is a teaching position where you will work closely with students, delivering curriculum and leading question-and-answer sessions. The purpose of this assignment is to give us an idea of how you approach data science problems and whether you can explain that approach to someone who does not yet have your level of expertise. Given this, we are less interested in you building a "perfect" model and more focused on how you use this assignment to teach important concepts to a data science student.

## What You Need to Do

- [ ] Make a copy of this notebook to work on - you can download if you have a local Jupyter setup, or click `File > Save a copy in Drive` to copy and work on with Google Colab
- [ ] Import the burrito `csv` file into a `DataFrame`. Your target will be the `'Great'` column.
- [ ] Conduct exploratory data analysis (EDA) to determine how you should clean the data for your pipeline.
- [ ] Clean your data. (Note: You are not required to use all columns in your model, but justify your decisions based on your EDA.)
- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Determine what the baseline accuracy is for a naïve classification model.
- [ ] Create a `scikit-learn` pipeline with the following components:
  - A one hot encoder for categorical features.
  - A scaler.
  - A logistic regressor.
- [ ] Train your model using the training data.
- [ ] Create a visualization showing your model's coefficients.
- [ ] Get your model's validation accuracy (multiple times if you try multiple iterations).
- [ ] Get your model's test accuracy (one time, at the end).

---

### Import the burrito `csv` file

In [None]:
# Import Data

import pandas as pd
import numpy as np
df = pd.read_csv('https://drive.google.com/uc?export=download&id=1cctPq1sYeD6Y6mGg5Lpl-GLDJBwtdihg')

### Conduct exploratory data analysis (EDA)

In this section I'll take a look at the data and begin formulating ideas about which columns should be kept, deleted, or modified in some way, which I'll complete in the next section: __Clean Data__.

In [None]:
# What does the data actually look like?

df.head(5)

In [None]:
# How much data do we have?

print(f'There are {df.shape[0]} rows')

In [None]:
# What do the columns look like?

df.columns

In [None]:
# What kind of data exists in each column?

df.info(verbose=True)

print(df.dtypes)

In [None]:
# How balanced is the target variable?

df["Great"].value_counts()

### Clean data

Now I'll take what I've learned from the last section and drop columns I don't think will be useful. I'll also fill in missing values as needed.

The columns `Location`, `Burrito`, `Neighborhood`, and `Reviewer` represent discrete categories and can be dummy encoded. However, if there are too many unique values in each column, the number of resulting dummy columns can be huge. So first I'll check to see how many values each column has:

In [None]:
df["Location"].value_counts().shape

In [None]:
df["Burrito"].value_counts().shape

In [None]:
df["Neighborhood"].value_counts().shape

In [None]:
df["Reviewer"].value_counts().shape

In [None]:
df["NonSD"].value_counts()

Given the large number of unique values relative to the small number of rows, it makes sense to drop these columns from consideration. Other features could be generated from them, such as combining nearby neighborhoods or grouping burrito types (e.g. a binary feature that represents _California burrito_ vs _non-Cali_), but for time and simplicity I'll skip this. In the future we can revisit these features and work with them if needed.
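
As a rough illustration of the grouping idea, the sketch below derives a binary California flag from a burrito name. The toy names and the substring rule are assumptions for illustration, not values checked against the dataset.

```python
import pandas as pd

# Toy burrito names (hypothetical; the real column has many more variants)
burritos = pd.Series(['California', 'Carnitas', 'california everything', 'Carne asada', None])

# 1 if the name mentions "california" (case-insensitive), else 0; missing names count as 0
is_california = burritos.str.contains('california', case=False, na=False).astype(int)

print(is_california.tolist())
```

A flag like this adds one column instead of one dummy column per burrito type.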

I'm also going to __assume__ that the columns `Address` and `URL` are redundant given the name of the burrito shop, so I'll drop them as well.

The remaining columns are floats or objects. The floats represent ratings of specific burrito parts (e.g. `Meat`) or things like `Cost` or `Weight`. The columns encoded as objects represent the absence or presence of certain ingredients (e.g. `Avocado`) or whether or not the burrito is recommended (`Rec`). `Notes` is free text and could be mined for features, but I will skip that for time reasons.

Also drop the `Queso` column because it's empty.

In [None]:
# Drop columns I don't want

df = df.drop(['Location', 'Burrito', 'Neighborhood', 'Reviewer', 'Address', 'URL', 'Notes', 'NonSD', 'Queso', 'Unreliable'], axis=1)

In [None]:
# Of the float columns, check which ones have lots of NaN's

cols = ['Yelp', 'Google', 'Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)',
       'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']
        
df[cols].isna().sum()

In [None]:
# Drop the columns with over 100 NaN's

df = df.drop(['Yelp', 'Google', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume'], axis=1)

In [None]:
# Fill in the remaining NaN's with the column averages
# Alternatively: drop rows with missing values

cols = ['Cost', 'Hunger', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']
for col in cols:
    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].mean())
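
One caveat: the means above are computed over every row, so reviews from the validation and test years influence the values used to fill the training data. A minimal sketch of an alternative, assuming scikit-learn's `SimpleImputer` and two toy frames (the values and the train/validation division are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for two rating columns (values are hypothetical)
train = pd.DataFrame({'Cost': [6.0, 8.0, np.nan], 'Meat': [3.0, np.nan, 5.0]})
val = pd.DataFrame({'Cost': [np.nan, 7.5], 'Meat': [4.0, np.nan]})

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy='mean').fit(train)

# Apply those same training means to both sets
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
val_filled = pd.DataFrame(imputer.transform(val), columns=val.columns)

print(val_filled)
```

Placing the imputer inside the pipeline would achieve the same thing without a separate step.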

In [None]:
df

In [None]:
# Binary encode these columns instead of dummy encoding

cols = ['Beef', 'Pico', 'Guac', 'Cheese',
       'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice',
       'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage',
       'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham',
       'Chile relleno', 'Nopales', 'Lobster', 'Egg', 'Mushroom',
       'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini', 'Chips']

# Any marker ('X' or 'x') becomes 1; missing values become 0
d = {'X': 1, 'x': 1}

for col in cols:
    df[col] = df[col].map(d).fillna(0).astype(int)

In [None]:
df

In [None]:
# Convert to datetime object

df['Date'] = pd.to_datetime(df['Date'])

df = df.drop("Chips", axis=1)

In [None]:
df.dtypes

### Do train/validate/test split

In [None]:
# Index by date, sorted so that slicing by date strings works reliably

df = df.set_index('Date').sort_index()

In [None]:
# Define independent vs dependent variables.

X = [col for col in df.columns if col != "Great"]
y = "Great"

In [None]:
# Show what this looks like

df[X].loc['2017-01-01':'2017-12-31'][:5]

In [None]:
# Split the train/val/test sets by date

X_train = df[X].loc[:'2016-12-31']
y_train = df[y].loc[:'2016-12-31']

X_val = df[X].loc['2017-01-01':'2017-12-31']
y_val = df[y].loc['2017-01-01':'2017-12-31']

X_test = df[X].loc['2018-01-01':]
y_test = df[y].loc['2018-01-01':]

In [None]:
# Show how many rows are in each set
# These numbers aren't exactly balanced, but I will follow the requirements of this task and keep them as is

print(f'# rows in training set:   {X_train.shape[0]}')
print(f'# rows in validation set: {X_val.shape[0]}')
print(f'# rows in test set:       {X_test.shape[0]}')
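
The same split can also be written with boolean masks on the year, which does not depend on the index being sorted. A sketch on toy data (the dates and labels are invented):

```python
import pandas as pd

# Toy reviews (hypothetical dates and labels)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-05-01', '2011-02-10', '2017-03-15',
                            '2018-07-04', '2019-01-20']),
    'Great': [1, 0, 1, 0, 1],
})

year = toy['Date'].dt.year

train = toy[year <= 2016]   # 2016 and earlier
val = toy[year == 2017]     # 2017 only
test = toy[year >= 2018]    # 2018 and later

print(len(train), len(val), len(test))
```

Every row lands in exactly one of the three sets because the year conditions do not overlap.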

### Determine baseline accuracy for a naïve classification model

How often will a model that guesses "great" every time be accurate? If the distribution of Great/Not Great is very skewed (lots more greats than not greats), then we can get a high accuracy by simply classifying every burrito as "Great"! Some people just really love Mexican food...

In [None]:
df["Great"].value_counts()

In [None]:
print(f'{df["Great"].value_counts()[1]/df.shape[0]:.2}')

In [None]:
# Subtract this from 1 because the model can also simply guess the opposite

print(f'{1 - df["Great"].value_counts()[1]/df.shape[0]:.2}')

If the logistic model I create is "useful", its classification accuracy should be above _0.57_, the accuracy of always guessing the more common class.
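
scikit-learn's `DummyClassifier` gives the same kind of baseline directly by always predicting the majority class seen during training. A small sketch with made-up labels (the features are placeholders, since the dummy model ignores them):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 3 "not great" (0) and 2 "great" (1)
y_toy = np.array([0, 0, 0, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, ignored by the dummy model

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)

# Accuracy equals the share of the majority class: 3/5
baseline_acc = baseline.score(X_toy, y_toy)
print(f'{baseline_acc:.2f}')
```

Fitting the dummy model on the training labels and scoring it on the validation labels would show how the baseline holds up on later reviews.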

In [None]:
# Make a simple sklearn model and get the accuracy on the training data

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Scale data

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Show accuracy on the _training_ data

clf = LogisticRegression(random_state=0).fit(X_train, y_train)

print(f"The model's accuracy on the training data is: {clf.score(X_train, y_train):.2}")

### Create a `scikit-learn` pipeline

In [None]:
from sklearn.pipeline import Pipeline

# Optional: add dummy encoding
pipe = Pipeline([('scaler', preprocessing.StandardScaler()), 
                 ('logistic_reg', LogisticRegression())])
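
If a categorical column were kept, the one hot encoder asked for in the assignment could be added with a `ColumnTransformer`. The sketch below is an assumption about how that might look, run on a toy frame with a `Burrito` category column and invented values. It is not the pipeline used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values)
X_toy = pd.DataFrame({
    'Burrito': ['California', 'Carnitas', 'California',
                'Carne asada', 'Carnitas', 'California'],
    'Cost': [7.5, 6.0, 8.0, 6.5, 5.5, 9.0],
})
y_toy = [1, 0, 1, 0, 0, 1]

# One hot encode the categorical column; pass numeric columns through unchanged
encode = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Burrito'])],
    remainder='passthrough',
)

toy_pipe = Pipeline([
    ('encode', encode),
    # with_mean=False keeps the scaler compatible with sparse encoder output
    ('scaler', StandardScaler(with_mean=False)),
    ('logistic_reg', LogisticRegression()),
])

toy_pipe.fit(X_toy, y_toy)
print(toy_pipe.predict(X_toy))
```

With `handle_unknown='ignore'`, a category that appears only in the validation or test years is encoded as all zeros rather than raising an error.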

### Train model using training data

In [None]:
pipe.fit(X_train, y_train)

# Report training accuracy here; the test set is scored once, at the end
print(f"{pipe.score(X_train, y_train):.2}")

### Create visualization of model coefficients

Later experiments: perform variable selection

In [None]:
df.columns

In [None]:
import matplotlib.pyplot as plt

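# Label each coefficient with its feature name (X is the list of feature columns),
# rather than assuming "Great" is the last column of df
coefs = pd.DataFrame(
    clf.coef_,
    columns=X
).T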

coefs.plot(kind='barh', figsize=(15, 12))
plt.title('Model coefficient size')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)
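
Sorting the coefficients by absolute size can make a chart like this easier to read. A sketch on synthetic data, assuming a pipeline whose final step is named `logistic_reg` (the feature names are borrowed from the dataset, but the values are randomly generated):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first feature only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['Meat', 'Salsa', 'Wrap'])
y_toy = (X_toy['Meat'] > 0).astype(int)

toy_pipe = Pipeline([('scaler', StandardScaler()),
                     ('logistic_reg', LogisticRegression())]).fit(X_toy, y_toy)

# Pair each coefficient with its feature name, then order by absolute size
coef_series = pd.Series(toy_pipe.named_steps['logistic_reg'].coef_[0], index=X_toy.columns)
coef_sorted = coef_series.reindex(coef_series.abs().sort_values().index)

print(coef_sorted)
# coef_sorted.plot(kind='barh') would draw the sorted chart
```

Because the features are standardized before fitting, the coefficient sizes are comparable across features.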

### Get model's validation accuracy

In [None]:
print(f"{pipe.score(X_val, y_val):.2}")

### Get your model's test accuracy

In [None]:
print(f"{pipe.score(X_test, y_test):.2}")