### Problem solving exercise

+ Because I love burritos, we're analyzing a Chipotle dataset!

+ The goal is to string together the skills we've worked on over the last few months

#### RULES

+ I will not be giving out answers! (ok maybe some hints if you get *really* stuck) 
+ You will solve this together as a class
+ When someone figures something out, they can come to the board and present their solution to the class
    + Alternative solutions can also be presented
    
##### REWARDS:
+ There will be a happy hour after this class
+ Students with the most accurate models will be eligible to vote on the bar we go to!
    + That includes any model whose accuracy is within 5% of the most accurate model's
    + So if the best model is 82.2%, we'll also select anyone with accuracy greater than 78.1% (82.2 × 0.95 ≈ 78.1)
+ The student who presents the most solutions to the class will get a free drink! (or an alternative if you don't drink)
    + No ties! Only one student can win this!


### Outline:

#### Cleaning Data

+ We only briefly covered cleaning data
+ You'll need to rely more on Google and logic than on class notes here
+ Cleaning data is something you just need to learn by doing
+ After cleaning, we'll run a machine learning algorithm to predict the price of an order

#### Preprocessing & ML 

+ We've covered this in class, but this time you're really driving the ship
+ Get your data into the right format, then start training your algorithm! 



### That's it! GO FOR IT! 
+ I believe in all of you!

### First import your dataset

+ Hint: examine how the values are separated
+ What's the difference between a TSV and a CSV?
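
If the separator hint is unclear, here is a minimal sketch on a made-up, in-memory sample (the `sample` string and `sample_df` name are hypothetical, not part of the Chipotle file). It only illustrates that `pd.read_csv` splits on whatever `sep` you pass, so commas inside a field are left alone when the file is tab-separated.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a tab-separated file.
sample = "order_id\titem_name\titem_price\n1\tChips, large\t$2.39\n2\tIzze\t$3.39\n"

# sep="\t" tells pandas to split on tabs; the comma in "Chips, large" stays in the field.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t")
print(sample_df.shape)
print(sample_df.columns.tolist())
```

Reading the same text with the default comma separator would put each whole line into a single column, which is usually the first sign that the file is not a CSV.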


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model

In [61]:
### Code here
df = pd.read_csv('chipotle.tsv', sep='\t')
df

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


### Next, clean up "choice description"

+ What do the values look like?
+ We're going to plug this into CountVectorizer later
+ How can we clean this up?

+ Check out a package called "re" for regular expressions
+ There are multiple ways to solve this problem
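
As one possible route with `re`, here is a small sketch. The helper name `strip_brackets` and the example string are made up for illustration; plain `str.replace` works just as well.

```python
import re

def strip_brackets(text):
    """Remove every '[' and ']' character from a string."""
    # The character class [\[\]] matches either bracket.
    return re.sub(r"[\[\]]", "", text)

example = "[Fresh Tomato Salsa, [Rice, Black Beans]]"
cleaned = strip_brackets(example)
print(cleaned)  # Fresh Tomato Salsa, Rice, Black Beans
```

To apply something like this to a whole column, the values all need to be strings first, so missing values have to be handled before calling it.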

In [62]:
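### code here
# Missing descriptions load as NaN (a float). Convert them to strings so that
# string methods such as .replace() work on every row. Note that str(NaN) is the
# literal text 'nan', which later shows up as a token in the vectorizer vocabulary.
df['choice_description'] = df['choice_description'].apply(
    lambda x: str(x) if isinstance(x, float) else x
)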


In [65]:
# Strip the square brackets so only the comma-separated choices remain
df['choice_description'] = df['choice_description'].apply(lambda x: x.replace("[", "").replace("]", ""))

### Next, clean up "item price" 

+ What can you do here? 
+ This will be our outcome variable
+ How can we make this easier to read? 
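
One way to think about it: the prices are text because of the dollar sign, so the goal is a numeric column. A minimal sketch on made-up values (the `prices` Series is hypothetical), using pandas string methods instead of a row-by-row `apply`:

```python
import pandas as pd

prices = pd.Series(["$2.39", "$16.98", "$9.25"])  # hypothetical sample values

# regex=False treats "$" as a literal character rather than a regex anchor.
numeric = prices.str.replace("$", "", regex=False).astype(float)
print(numeric.dtype)
```

Once the column is numeric you can sort it, plot it, and use it as the outcome variable.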

In [69]:
### code here

df['item_price'] = df['item_price'].apply(lambda x: float(x.replace("$", "")))

In [53]:
### code here
# df['item_price'] = df['item_price'].apply(remove_cash)

In [70]:
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,Clementine,3.39
2,1,1,Nantucket Nectar,Apple,3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",16.98


### Now Preprocess your data! 

+ Use a vectorizer of your choice!

+ Consider a dimension reduction technique! 

    + PCA? SVD? LDA?
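
To make the vectorize-then-reduce idea concrete, here is a rough sketch on four made-up descriptions (the `docs` list is hypothetical). It assumes `TfidfVectorizer` followed by `TruncatedSVD`, which is only one of the options listed above; `TruncatedSVD` is used here because it accepts the sparse matrix a vectorizer returns.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical cleaned descriptions
    "Fresh Tomato Salsa, Rice, Black Beans, Cheese",
    "Tomatillo Red Chili Salsa, Fajita Vegetables, Rice",
    "Clementine",
    "Fresh Tomato Salsa, Rice, Pinto Beans, Sour Cream",
]

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs)       # sparse document-term matrix

# Keep 2 components just for illustration; the right number depends on your data.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)    # dense array with 2 columns
print(X_sparse.shape, X_reduced.shape)
```

Whether reduction helps here is something to test rather than assume, since the vocabulary in this dataset is already fairly small.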

In [113]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction import DictVectorizer

In [137]:
descriptions = df['choice_description'].fillna('')
items = df['item_name'].fillna('')

y = df['item_price']

In [166]:
cv = CountVectorizer(ngram_range=(1, 1), 
                             stop_words='english',
                             binary=False)

In [159]:
choice_descriptions = cv.fit_transform(descriptions)

In [160]:
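# get_feature_names() was removed in scikit-learn 1.2; on newer versions use
# cv.get_feature_names_out() instead (it returns a NumPy array, not a list).
cv.get_feature_names()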

[u'adobo',
 u'apple',
 u'banana',
 u'barbacoa',
 u'beans',
 u'black',
 u'blackberry',
 u'braised',
 u'brown',
 u'carnitas',
 u'cheese',
 u'cherry',
 u'chicken',
 u'chili',
 u'cilantro',
 u'clementine',
 u'coca',
 u'coke',
 u'cola',
 u'corn',
 u'cream',
 u'dew',
 u'diet',
 u'dr',
 u'fajita',
 u'fresh',
 u'grapefruit',
 u'green',
 u'grilled',
 u'guacamole',
 u'hot',
 u'lemonade',
 u'lettuce',
 u'lime',
 u'marinated',
 u'medium',
 u'mild',
 u'mountain',
 u'nan',
 u'nestea',
 u'orange',
 u'peach',
 u'pepper',
 u'pineapple',
 u'pinto',
 u'pomegranate',
 u'red',
 u'rice',
 u'roasted',
 u'salsa',
 u'sour',
 u'sprite',
 u'steak',
 u'tomatillo',
 u'tomato',
 u'vegetables',
 u'vegetarian',
 u'veggies',
 u'white']

In [146]:
# .toarray() is the explicit equivalent of the .A shorthand
choice_descriptions.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [164]:
desc_df = pd.DataFrame(choice_descriptions.toarray(), columns=cv.get_feature_names())

In [172]:
desc_df.head()

Unnamed: 0,adobo,apple,banana,barbacoa,beans,black,blackberry,braised,brown,carnitas,...,salsa,sour,sprite,steak,tomatillo,tomato,vegetables,vegetarian,veggies,white
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,1,0,0,0,0,...,1,1,0,0,1,0,0,0,0,0


In [167]:
item_descriptions = cv.fit_transform(items)


In [168]:
cv.get_feature_names()

[u'barbacoa',
 u'bottled',
 u'bowl',
 u'burrito',
 u'canned',
 u'carnitas',
 u'chicken',
 u'chili',
 u'chips',
 u'corn',
 u'crispy',
 u'drink',
 u'fresh',
 u'green',
 u'guacamole',
 u'izze',
 u'mild',
 u'nantucket',
 u'nectar',
 u'pack',
 u'red',
 u'roasted',
 u'salad',
 u'salsa',
 u'soda',
 u'soft',
 u'steak',
 u'tacos',
 u'tomatillo',
 u'tomato',
 u'veggie',
 u'water']

In [155]:
# .toarray() is the explicit equivalent of the .A shorthand
item_descriptions.toarray()

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [169]:
items_desc = pd.DataFrame(item_descriptions.toarray(), columns=cv.get_feature_names())

In [171]:
items_desc.head()

Unnamed: 0,barbacoa,bottled,bowl,burrito,canned,carnitas,chicken,chili,chips,corn,...,salad,salsa,soda,soft,steak,tacos,tomatillo,tomato,veggie,water
0,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,1,0,0,0
4,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [124]:
# # Use `fit` to learn the vocabulary of the titles
# cv.fit(descriptions)

# # Use `transform` to generate the sample X word matrix - one column per feature (word or n-grams)
# X = cv.transform(descriptions)

# cv.get_feature_names()
# # X.A
# # X.A.get_feature_names()

In [173]:
result = pd.concat([df, desc_df, items_desc], axis=1)

In [176]:
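# 'cd_vectorized' is not created in any cell shown here; errors='ignore'
# keeps this cell from raising a KeyError when the column is absent.
result.drop('cd_vectorized', axis=1, inplace=True, errors='ignore')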
result.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,adobo,apple,banana,barbacoa,beans,...,salad,salsa,soda,soft,steak,tacos,tomatillo,tomato,veggie,water
0,1,1,Chips and Fresh Tomato Salsa,,2.39,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,1,1,Izze,Clementine,3.39,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,Nantucket Nectar,Apple,3.39,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",16.98,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
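
One thing to watch in the concatenated frame above: several tokens (for example `chicken`, `salsa`, and `steak`) appear in both vocabularies, so the result ends up with duplicate column names. A possible workaround, sketched on two tiny made-up frames (`desc_tokens` and `item_tokens` are hypothetical stand-ins for the two count tables), is to prefix each set of columns before concatenating:

```python
import pandas as pd

desc_tokens = pd.DataFrame({"chicken": [0, 1], "salsa": [1, 1]})  # hypothetical choice-description counts
item_tokens = pd.DataFrame({"chicken": [1, 0], "bowl": [1, 0]})   # hypothetical item-name counts

# add_prefix renames every column, so the two "chicken" columns stay distinguishable.
combined = pd.concat(
    [desc_tokens.add_prefix("desc_"), item_tokens.add_prefix("item_")],
    axis=1,
)
print(combined.columns.tolist())
```

With unique names, selecting a single column returns one Series instead of a two-column frame, which makes later feature selection less surprising.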


In [None]:
### code here

### Now train your model! 
+ Which model you select is up to you
+ Check out the sklearn documentation!
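
If you want a starting point, here is a minimal sketch of the fit step. It assumes a regressor (`Ridge`) because `item_price` is a number, and it runs on synthetic data: `X_toy`, `y_toy`, and `true_weights` are made up and only stand in for your token counts and prices. It is not the intended answer, just the shape of the workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 2, size=(200, 5))              # hypothetical 0/1 token features
true_weights = np.array([2.0, 0.5, 3.0, 1.0, 0.0])    # made-up price contribution per token
y_toy = X_toy @ true_weights + rng.normal(0, 0.1, size=200)

# Hold out a quarter of the rows so the model can be checked on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print(model.coef_.round(2))
```

Swapping in a different estimator only changes the `Ridge(...)` line, since sklearn models share the same `fit` and `predict` interface.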

In [None]:
### code here

### Now test your model!
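
A rough sketch of what testing can look like for a price model, again on synthetic data (`X_toy` and `y_toy` are made up). It assumes a plain `LinearRegression` and reports mean absolute error and R² on held-out rows, plus a 5-fold cross-validated R². Which metric counts as "accuracy" for this exercise is your call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(1)
X_toy = rng.randint(0, 2, size=(120, 4)).astype(float)   # hypothetical features
y_toy = 1.5 + X_toy @ np.array([4.0, 2.0, 0.5, 1.0]) + rng.normal(0, 0.2, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)   # average error in price units
r2 = r2_score(y_test, pred)               # share of variance explained on held-out rows
cv_r2 = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")
print(f"MAE={mae:.2f}  R^2={r2:.3f}  CV R^2 mean={cv_r2.mean():.3f}")
```

Scoring on rows the model never saw during fitting is the important part; a score computed on the training rows will look better than it should.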

In [None]:
### code here

In [None]:
### code here