BloomTech Data Science

*Unit 2, Sprint 1, Module 4*

---

In [2]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Libraries for today's project
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# Part 1: Sprint Challenge Practice

- Can I import a CSV file into a DataFrame?
- Can I create a scatter plot?
- Can I split a DataFrame into a target vector and feature matrix?
- Can I split a dataset into a training set and a test set?
- Can I establish the baseline mean absolute error for a regression problem?
- Can I combine transformers with a predictor using a pipeline?
- Can I build a linear regression model and a ridge regression model?
- Can I evaluate a model using common metrics like mean absolute error, root mean squared error, and R^2?
- If given a feature matrix, can I use my model to create a list of predictions?
- Can I create a horizontal bar chart with the coefficients from a linear model?

# Part 2: Pipelines and Plotting

Let's revise [our work from yesterday](https://tinyurl.com/yb9cz6n5).

# Part 3: Logistic Regression

## I. Wrangle Data

In [16]:
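# Build the path inline so this cell runs before `filepath` is defined below
df = pd.read_csv(DATA_PATH + 'burritos/burritos.csv')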

In [6]:
df.head()

,Location,Burrito,Date,Neighborhood,Address,URL,Yelp,Google,Chips,Cost,...,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,Donato's taco shop,California,1/18/2016,Miramar,6780 Miramar Rd,http://donatostacoshop.net/,3.5,4.2,,6.49,...,,,,,,,,,,
1,Oscar's Mexican food,California,1/24/2016,San Marcos,225 S Rancho Santa Fe Rd,http://www.yelp.com/biz/oscars-mexican-food-sa...,3.5,3.3,,5.45,...,,,,,,,,,,
2,Oscar's Mexican food,Carnitas,1/24/2016,,,,,,,4.85,...,,,,,,,,,,
3,Oscar's Mexican food,Carne asada,1/24/2016,,,,,,,5.25,...,,,,,,,,,,
4,Pollos Maria,California,1/27/2016,Carlsbad,3055 Harding St,http://pollosmaria.com/,4.0,3.8,x,6.59,...,,,,,,,,,,


In [43]:

def wrangle(filepath):
    # Import the CSV with a DatetimeIndex built from the 'Date' column
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')

    # Drop unrated burritos (rows with no 'overall' score)
    df.dropna(subset=['overall'], inplace=True)

    # Derive the binary classification target:
    # a 'Great' burrito has an overall rating
    # of 4 or higher on a 5-point scale
    df['Great'] = (df['overall'] >= 4).astype(int)

    # Drop high-cardinality categorical columns
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL'])

    # Drop columns that would leak the target
    df = df.drop(columns=['Rec', 'overall'])

    return df

filepath = DATA_PATH + 'burritos/burritos.csv'


In [44]:
filepath = DATA_PATH + 'burritos/burritos.csv'
df = wrangle(filepath)

In [45]:
df.head()

Date,Burrito,Neighborhood,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,...,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
2016-01-18,California,Miramar,3.5,4.2,,6.49,3.0,,,,...,,,,,,,,,,0
2016-01-24,California,San Marcos,3.5,3.3,,5.45,3.5,,,,...,,,,,,,,,,0
2016-01-24,Carnitas,,,,,4.85,1.5,,,,...,,,,,,,,,,0
2016-01-24,Carne asada,,,,,5.25,2.0,,,,...,,,,,,,,,,0
2016-01-27,California,Carlsbad,4.0,3.8,x,6.59,4.0,,,,...,,,,,,,,,,1


In [None]:
drop_cols = ['Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries',
       'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans',
       'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce',
       'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham',
       'Chile relleno', 'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom',
       'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini', 'Chips']

In [47]:
df.iloc[:, 22:-1].columns

Index(['Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries',
       'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans',
       'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce',
       'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham',
       'Chile relleno', 'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom',
       'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini'],
      dtype='object')

## II. Split Data

Since there's no time series component, we should do a randomized train-validation split.
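
A minimal sketch of a randomized split follows. The `split_data` helper, the 80/20 ratio, the `random_state` value, and the toy DataFrame are illustrative assumptions rather than part of the original notebook; in the notebook you would pass in the `df` returned by `wrangle`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, target='Great', test_size=0.2, random_state=42):
    """Split a DataFrame into train/validation feature matrices and target vectors.

    Hypothetical helper: the name and default values are assumptions.
    """
    X = df.drop(columns=target)   # feature matrix
    y = df[target]                # target vector
    return train_test_split(X, y, test_size=test_size,
                            random_state=random_state)


# Toy stand-in for the wrangled burrito data (values are made up).
toy_df = pd.DataFrame({
    'Cost': [6.49, 5.45, 4.85, 5.25, 6.59, 7.00, 6.25, 5.99, 7.50, 6.75],
    'Hunger': [3.0, 3.5, 1.5, 2.0, 4.0, 4.5, 3.0, 2.5, 4.0, 3.5],
    'Great': [0, 0, 0, 0, 1, 1, 0, 1, 1, 0],
})

X_train, X_val, y_train, y_val = split_data(toy_df)
print(X_train.shape, X_val.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which makes later metric comparisons easier to trust.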

## III. Establish our baseline

- This is a **classification problem**, so our baseline **accuracy score** is the share of observations that belong to the **majority class**: the accuracy we would get by always predicting that class.
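
The baseline can be sketched as below. The `y_train` values are made up for illustration; in the notebook, `y_train` would be the `Great` column of the training set.

```python
import pandas as pd

# Hypothetical training target: 1 = 'Great' burrito, 0 = not great.
y_train = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name='Great')

# Share of each class in the training target. The largest share is the
# accuracy of a model that always predicts the majority class.
class_shares = y_train.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_acc = class_shares.max()

print('Majority class:', majority_class)
print('Baseline accuracy:', baseline_acc)
```

Any model we build should beat this number on the validation set to be worth keeping.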

## IV. Build Model

- Why would regular linear regression not work here?
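
One way to explore this question is to fit both models on the same binary target and compare their outputs. The sketch below uses a made-up single feature; the point is only that linear regression output is unbounded, while logistic regression returns probabilities between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical single feature and binary target (values are made up).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Score two inputs well outside the training range.
X_new = np.array([[-5.0], [15.0]])
linear_pred = linear.predict(X_new)
logistic_prob = logistic.predict_proba(X_new)[:, 1]

print('Linear regression output:', linear_pred)        # falls outside [0, 1]
print('Logistic regression P(y=1):', logistic_prob)    # stays within [0, 1]
```

Because the linear outputs go below 0 and above 1, they cannot be read as class probabilities, which is one reason to prefer logistic regression for a binary target.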

What we need for our model pipeline:

- `OneHotEncoder`
- `SimpleImputer`
- `LogisticRegression`
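
The three steps above can be chained with `make_pipeline`, as in the sketch below. Note the assumptions: to keep the example runnable with scikit-learn alone, it swaps in scikit-learn's own `OneHotEncoder` inside a column transformer instead of the `category_encoders` version imported at the top of this notebook. The toy data, the choice of `'Burrito'` as the categorical column, the mean imputation strategy, and `max_iter=1000` are also illustrative choices.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data with one categorical column and one missing value.
X_train = pd.DataFrame({
    'Burrito': ['California', 'Carnitas', 'California', 'Carne asada',
                'California', 'Carnitas', 'California', 'Carne asada'],
    'Cost': [6.49, 4.85, None, 5.25, 6.59, 7.00, 6.25, 7.50],
    'Hunger': [3.0, 1.5, 3.5, 2.0, 4.0, 4.5, 3.0, 4.0],
})
y_train = pd.Series([0, 0, 0, 0, 1, 1, 0, 1])

# One-hot encode the categorical column and pass numeric columns through.
# sparse_threshold=0 forces a dense array for the imputer that follows.
encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    remainder='passthrough',
    sparse_threshold=0,
)

model = make_pipeline(
    encoder,                           # categorical -> indicator columns
    SimpleImputer(strategy='mean'),    # fill missing numeric values
    LogisticRegression(max_iter=1000)  # binary classifier
)

model.fit(X_train, y_train)
preds = model.predict(X_train)
print(preds)
```

Keeping the transformers and the predictor in one pipeline means the encoding and imputation learned on the training set are reapplied unchanged to the validation set.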

## V. Check Metrics
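
A minimal sketch of checking accuracy on the training and validation sets follows. It uses synthetic data from `make_classification` as a stand-in for the burrito features, so the numbers it prints say nothing about the real dataset; the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the real features).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy = share of predictions that match the true labels.
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))

# Baseline for comparison: share of the most common class in y_train.
baseline_acc = max(y_train.mean(), 1 - y_train.mean())

print('Baseline accuracy:', baseline_acc)
print('Training accuracy:', train_acc)
print('Validation accuracy:', val_acc)
```

For a classifier, `model.score(X_val, y_val)` returns the same accuracy value as `accuracy_score` here. A large gap between training and validation accuracy would suggest overfitting.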