<a href="https://colab.research.google.com/github/dondreojordan/DS-Unit-1-Build/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [33]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [105]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [106]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [107]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [37]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [108]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])


In [109]:
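# Separate data into train and test.
# Note: both frames still contain every row; the split by date happens further down.
train = df.copy()  # copy so that popping 'Great' later does not modify df
test = df.drop('Great', axis=1)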

In [110]:
#Print train and test shape
print('training:', train.shape)
print('test:', test.shape)

training: (421, 65)
test: (421, 64)


In [111]:
y = train.pop('Great') 

In [112]:
X = train
X.head()
# Remove columns with too many NaN values
# (keep only columns with at least 370 non-null entries)
X1 = X.dropna(thresh=370, axis=1)

In [113]:
y = y.loc[y.index.isin(X.index)]

**[ X ] Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.**

------------------------------------------------



In [114]:
# Date is stored as an object (string) column. Convert it to datetime.
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning.
X1 = X1.copy()
X1['Date'] = pd.to_datetime(X1['Date'])


# Train on 2016 & earlier, validate on 2017, test on 2018 & later.
training = X1[X1['Date'].dt.year <= 2016]
validation = X1[X1['Date'].dt.year == 2017]
testing = X1[X1['Date'].dt.year >= 2018]

# OR a less clean filter (note: this variant keeps only a single year per subset)

#training = X1[X1['Date'].dt.year.isin([2016])]
#validation = X1[X1['Date'].dt.year.isin([2017])]
#testing = X1[X1['Date'].dt.year.isin([2018])]
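The year-based split can be applied to the features and the target together, so that every subset keeps its matching labels. The sketch below illustrates the idea on a small made-up frame; the `toy` data and the `split_by_year` helper are hypothetical and not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data standing in for the burrito reviews.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-03-15', '2017-02-10',
                            '2017-11-30', '2018-01-05', '2019-07-04']),
    'Cost': [6.0, 6.5, 7.0, 7.5, 8.0, 8.5],
    'Great': [False, True, True, False, True, False],
})

def split_by_year(frame, date_col='Date'):
    """Return (train, val, test) frames: <= 2016, == 2017, >= 2018."""
    year = frame[date_col].dt.year
    return frame[year <= 2016], frame[year == 2017], frame[year >= 2018]

train_toy, val_toy, test_toy = split_by_year(toy)

# Separate the target only after splitting, so rows and labels stay aligned.
X_train_toy = train_toy.drop(columns='Great')
y_train_toy = train_toy['Great']

print(len(train_toy), len(val_toy), len(test_toy))  # 2 2 2
```

Splitting the full frame first and popping the target afterwards avoids having to re-align `y` with the filtered rows by index.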


**[ X ] Begin with baselines for classification.**

-------------------------------------------------

For more background, see [What does "baseline" mean in the context of machine learning?](https://datascience.stackexchange.com/questions/30912/what-does-baseline-mean-in-the-context-of-machine-learning)

In [137]:
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X_train, X_val, y_train, y_val = train_test_split(X1, y, test_size=0.2, random_state=42, stratify=y)

In [130]:
y_train.value_counts(normalize=True)

False    0.568452
True     0.431548
Name: Great, dtype: float64

In [131]:
y_val.value_counts(normalize=True)

False    0.564706
True     0.435294
Name: Great, dtype: float64
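
The class proportions above already imply a majority-class baseline: always predicting `False` would be correct for about 56% of the validation rows. A minimal sketch of computing that baseline explicitly is shown here; the `y_train_demo` and `y_val_demo` labels are made up for illustration, and in the notebook `y_train` and `y_val` would take their place.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels; in the notebook these would be y_train and y_val.
y_train_demo = pd.Series([False, False, False, True, True])
y_val_demo = pd.Series([False, True, False, True])

# The baseline always predicts the most frequent class seen in training.
majority_class = y_train_demo.mode()[0]
baseline_pred = [majority_class] * len(y_val_demo)

baseline_acc = accuracy_score(y_val_demo, baseline_pred)
print('majority class:', majority_class)
print('baseline validation accuracy:', baseline_acc)
```

Any logistic regression model should beat this number on the validation set to be considered useful.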

In [132]:
StratifiedShuffleSplit()

StratifiedShuffleSplit(n_splits=10, random_state=None, test_size=None,
            train_size=None)

**[ ] Use scikit-learn for logistic regression.**

In [139]:
# Import statements
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [140]:
#Get the data
X1.sample(5)

Unnamed: 0,Location,Burrito,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Reviewer
25,Lola's 7 Up Market & Deli,Other,2016-02-29,6.0,3.5,2.5,2.5,3.0,4.0,4.0,4.0,3.0,3.5,1.5,Emily
108,La Perla Cocina,California,2016-05-16,6.5,5.0,4.0,,4.0,4.0,5.0,4.0,4.0,4.0,5.0,Laya
66,California Burritos,California,2016-04-15,6.25,4.0,4.5,4.5,2.5,3.5,3.5,3.5,3.0,4.0,5.0,Scott
206,Lucha Libre North Park,Other,2016-08-30,7.5,3.8,3.8,4.0,4.0,4.0,4.0,4.0,4.5,4.5,4.5,Luis
191,Los Cabos,California,2016-08-27,5.99,4.0,4.0,4.5,2.5,2.5,2.5,1.5,3.0,4.0,5.0,Scott


In [None]:
# Not all of the columns are numeric, so the categorical ones
# need feature engineering (one-hot encoding) before modeling.

In [142]:
# Feature engineering (one-hot encoding)
X1_ohe = pd.get_dummies(X1, drop_first=True)
X1_ohe.sample(5)

Unnamed: 0,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,"Location_Alberto's 623 N Escondido Blvd, Escondido, CA 92025",Location_Burrito Box,Location_Burrito Factory,Location_Burros and Fries,Location_Caliente Mexican Food,Location_California Burrito Company,Location_California Burritos,Location_California burritos,Location_Cancun Mexican & Seafood,Location_Carmen's Mexican Food,Location_Chili Peppers,Location_Chipotle,Location_Colima's,Location_Colima's Mexican Food,Location_Cortez Mexican Food,Location_Cotixan,Location_Don Carlos Taco Shop,Location_Donato's Taco Shop,Location_Donato's taco shop,Location_El Cuervo,Location_El Dorado Mexican Food,Location_El Indio,Location_El Nopalito,Location_El Patron,Location_El Pollo Loco,Location_El Portal Fresh Mexican Grill,Location_El Pueblo Mexican Food,Location_El Rey Moro,...,Reviewer_Marc,Reviewer_Matt,Reviewer_Matteo,Reviewer_Max,Reviewer_Meghan,Reviewer_Melissa G,Reviewer_Melissa N,Reviewer_Mike,Reviewer_Nick G.,Reviewer_Nicole,Reviewer_Nihal,Reviewer_Nuttida,Reviewer_Nuttida.1,Reviewer_Ricardo,Reviewer_Ricardo.1,Reviewer_Richard,Reviewer_Rob,Reviewer_Rob G,Reviewer_Rob L,Reviewer_Sage,Reviewer_Sai G,Reviewer_Sam A,Reviewer_Sam H,Reviewer_Sandra,Reviewer_Sankeerth,Reviewer_Sankha G,Reviewer_Sarah,Reviewer_Scott,Reviewer_Scott.1,Reviewer_Shijia,Reviewer_Shreejoy,Reviewer_Simon,Reviewer_Sisi,Reviewer_Spencer,Reviewer_TJ,Reviewer_Tammy,Reviewer_Tara,Reviewer_Tom,Reviewer_Torben,Reviewer_Xi
61,2016-04-07,7.45,3.5,3.0,5.0,3.5,2.5,3.0,2.5,3.75,3.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
148,2016-06-08,5.25,4.0,3.5,3.0,2.8,3.5,4.0,4.0,4.0,3.5,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
183,2016-08-16,4.99,3.5,3.0,3.0,3.0,3.0,2.5,2.0,2.5,3.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
309,2017-01-12,7.49,5.0,4.0,4.0,4.0,2.0,3.0,0.0,,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
129,2016-06-20,5.25,4.0,4.0,5.0,,5.0,4.0,4.0,5.0,5.0,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
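
The output above shows that encoding every non-numeric column creates one column per location and per reviewer. One possible alternative, sketched below under the assumption that only low-cardinality categoricals are worth encoding, is to count the distinct values per column first. The `toy` frame and the `max_levels` cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 'Burrito' has few levels, 'Location' has many.
toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other', 'California'],
    'Location': ['Shop A', 'Shop B', 'Shop C', 'Shop D'],
    'Cost': [6.5, 7.0, 5.99, 6.25],
})

max_levels = 3  # assumed cutoff for "low cardinality"; tune for real data

# Count distinct values in each non-numeric column.
cardinality = toy.select_dtypes(exclude='number').nunique()
low_card = cardinality[cardinality <= max_levels].index.tolist()
high_card = cardinality[cardinality > max_levels].index.tolist()

# Drop the high-cardinality columns, then one-hot encode the rest.
toy_ohe = pd.get_dummies(toy.drop(columns=high_card), columns=low_card)
print(toy_ohe.columns.tolist())
```

With the real data, the same check would flag columns such as `Location` and `Reviewer` before they expand into hundreds of dummy columns.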


In [None]:
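# Inspect the non-numeric columns that still need encoding.
X_train.select_dtypes(include='object').head()

# Ingredient columns such as 'Guac' mark presence with 'x' or 'X' and absence with NaN.
# Convert them to 1/0. Comparing the lower-cased value to 'x' handles both cases
# and maps NaN to 0 (a plain `1 if n else 0` would map NaN to 1, because NaN is truthy).
X_train['Guac'].str.lower().eq('x').astype(int).head()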
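
The remaining checklist items (fit a logistic regression, report validation accuracy, inspect coefficients) could be approached with a scikit-learn pipeline. The following is only a sketch on synthetic data: the values in `X_demo` and `y_demo` are randomly generated, the column names are borrowed from the review dimensions purely for readability, and the choice of mean imputation plus standard scaling is an assumption rather than the required approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and labels; substitute the real train/validation data.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)),
                      columns=['Tortilla', 'Meat', 'Synergy'])
y_demo = (X_demo['Meat'] + X_demo['Synergy']
          + rng.normal(scale=0.5, size=200)) > 0
X_demo.iloc[::25, 0] = np.nan  # inject a few missing values

# Simple positional split standing in for the train/validation subsets.
X_tr, X_va = X_demo.iloc[:150], X_demo.iloc[150:]
y_tr, y_va = y_demo.iloc[:150], y_demo.iloc[150:]

# Impute missing values, scale features, then fit the classifier.
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_tr, y_tr)

val_acc = model.score(X_va, y_va)
print('validation accuracy:', val_acc)

# Coefficients are on the scaled features, so their magnitudes are comparable.
coefs = pd.Series(model.named_steps['logisticregression'].coef_[0],
                  index=X_demo.columns)
print(coefs.sort_values())
```

Keeping the imputer and scaler inside the pipeline means they are fit on the training rows only, which avoids leaking validation statistics into the model.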