### Problem solving exercise

+ Because I love burritos, we're analyzing a Chipotle dataset!

+ The goal is to string together the skills we've worked on over the last few months.

#### RULES

+ I will not be giving out answers! (ok maybe some hints if you get *really* stuck) 
+ You will solve this together as a class
+ When someone figures something out, they can come to the board and present their solution to the class
    + Alternative solutions can also be presented
    
#### REWARDS
+ There will be a happy hour after this class
+ Students with the most accurate models will be eligible to vote on the bar we go to!
    + Any model with accuracy within 5% (relative) of the most accurate model qualifies
    + So if the best model is 82.2%, we'll also select anyone with accuracy greater than 78.1%
+ The student who presents the most solutions to the class will get a free drink! (or an alternative if you don't drink)
    + No ties! Only one student can win this!


### Outline:

#### Cleaning Data

+ We only briefly covered cleaning data
+ You'll need to rely more on Google and logic than on class notes here
+ Cleaning data is something you just need to learn by doing
+ After cleaning, we'll run a machine learning algorithm to predict the price of an order

#### Preprocessing & ML 

+ We've covered this in class, but this time you're really driving the ship
+ Get your data into the right format, then start training your algorithm! 



### That's it! GO FOR IT! 
+ I believe in all of you!

In [1]:
import pandas as pd
import re

### First import your dataset

+ hint - examine how the values are separated 
+ What's the difference between a tsv and csv?
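The difference comes down to the delimiter: a CSV separates fields with commas, a TSV with tab characters. A minimal sketch, assuming a small made-up sample (the `sample` string is hypothetical, not the real file):

```python
import io
import pandas as pd

# Hypothetical two-row sample; fields are separated by tabs, not commas
sample = "order_id\titem_name\titem_price\n1\tIzze\t$3.39\n2\tChicken Bowl\t$16.98\n"

# sep='\t' tells read_csv to split each line on tab characters
tsv_df = pd.read_csv(io.StringIO(sample), sep='\t')

# With the default comma separator nothing is split, so every line lands in one column
csv_df = pd.read_csv(io.StringIO(sample))

print(tsv_df.shape)
print(csv_df.shape)
```

If the parsed frame comes back with a single wide column, the separator is the first thing to check.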


In [2]:
### Code here
path = '../../DS-SF-32/lessons/lesson-18/chipotle.tsv'
df = pd.read_csv(path,sep='\t')

In [3]:
df.head(2)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39


### Next, clean up "choice description"

+ What do the values look like? 
+ We're going to plug this into CountVectorizer later
+ How can we clean this up?

+ Check out a package called "re" for regular expressions
+ There are multiple ways to solve this problem
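As one illustration of "multiple ways", here is a minimal sketch that strips the square brackets twice, once with `re` and once with plain string methods, and checks that both agree. The helper names `strip_with_regex` and `strip_with_replace` are made up for this example.

```python
import re

def strip_with_regex(text):
    # The character class [\[\]] matches either '[' or ']'
    return re.sub(r'[\[\]]', '', text)

def strip_with_replace(text):
    # Same result without regular expressions: remove each bracket in turn
    return text.replace('[', '').replace(']', '')

example = '[Fresh Tomato Salsa, [Rice, Black Beans, Cheese]]'
cleaned = strip_with_regex(example)
print(cleaned)
print(cleaned == strip_with_replace(example))
```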

In [4]:
df['choice_description'].value_counts(dropna=False)

NaN                                                                                                                    1246
[Diet Coke]                                                                                                             134
[Coke]                                                                                                                  123
[Sprite]                                                                                                                 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                                                   42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]                                        40
[Fresh Tomato Salsa (Mild), [Pinto Beans, Rice, Cheese, Sour Cream]]                                                     36
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream]]                                                            33
[Lemonad

In [5]:
df['choice_description'].unique()

array([nan, '[Clementine]', '[Apple]', ...,
       '[Roasted Chili Corn Salsa, [Pinto Beans, Sour Cream, Cheese, Lettuce, Guacamole]]',
       '[Tomatillo Green Chili Salsa, [Rice, Black Beans]]',
       '[Tomatillo Green Chili Salsa, [Rice, Fajita Vegetables, Black Beans, Guacamole]]'], dtype=object)

In [6]:
### code here
choice = '[Fresh Tomato Salsa, [Rice, Black Beans, Cheese]]'
def clean_choice(c):
    # missing descriptions are float NaN; return them unchanged
    if type(c) == float:
        return c
    return get_choice_arr(c)

def get_choice_arr(c):
    # strip every '[' and ']' so nested lists flatten into one comma-separated string
    return re.sub(r'[\[\]]', '', c)

cl = clean_choice(choice)
print(type(cl))
print(cl)


<type 'str'>
Fresh Tomato Salsa, Rice, Black Beans, Cheese


In [7]:
d2 = df.head(5).copy()
d2.choice_description = d2.choice_description.map(lambda x: clean_choice(x))
d2

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,Clementine,$3.39
2,1,1,Nantucket Nectar,Apple,$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",$16.98


In [8]:
# create an array that includes all choices
import numpy as np
cd = 'choice_description'
u = d2[cd].unique()
l = [np.nan]
for s in u:
    # skip NaN (a float); split each cleaned string into its individual choices
    if type(s) == str:
        l = l + s.split(', ')
l

[nan,
 'Clementine',
 'Apple',
 'Tomatillo-Red Chili Salsa (Hot)',
 'Black Beans',
 'Rice',
 'Cheese',
 'Sour Cream']

In [9]:
def get_strlist_col_unique(data, cd):
    # create an array that includes all choices from strings
    uniq_strs = data[cd].unique()
    uniq = [np.nan]
    for s in uniq_strs:
        # skip NaN (a float); split each string into its individual choices
        if type(s) == str:
            uniq = uniq + s.split(', ')
    return uniq


# df.choice_description = df.choice_description.map(lambda x: clean_choice(x))
# get_strlist_col_unique(df, cd)

In [10]:
# d2['Apple'] = 0
# def set_choice(row):
#     if type(row['choice_description']) == float:
#         return 0
#     elif 'Apple' in row['choice_description']:
#         return 1
#     else:
#         return 0
# d2['Apple'] = d2.apply(set_choice, axis=1)
# d2['Apple']        
# d2['choice_description']

In [11]:
d2['choice_description']

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

In [12]:
def set_row_for_choice(data, choice):
    def set_choice(row):
        if type(row['choice_description']) == float:
            return 0
        elif choice in row['choice_description']:
            return 1
        else:
            return 0
    col = "choice_" + choice
    data[col] = 0
    data[col] = data.apply(set_choice, axis=1)
    return data[col] 
set_row_for_choice(d2, 'Apple')
d2[['choice_description','choice_Apple']]

Unnamed: 0,choice_description,choice_Apple
0,,0
1,Clementine,0
2,Apple,1
3,,0
4,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",0


In [13]:
set_row_for_choice(d2, 'Black Beans')
d2

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,choice_Apple,choice_Black Beans
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,0,0
1,1,1,Izze,Clementine,$3.39,0,0
2,1,1,Nantucket Nectar,Apple,$3.39,1,0
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,0,0
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",$16.98,0,1


In [14]:
for ch in l:
    if type(ch) == str:
        print(ch)
        set_row_for_choice(d2, ch)
d2

Clementine
Apple
Tomatillo-Red Chili Salsa (Hot)
Black Beans
Rice
Cheese
Sour Cream


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,choice_Apple,choice_Black Beans,choice_Clementine,choice_Tomatillo-Red Chili Salsa (Hot),choice_Rice,choice_Cheese,choice_Sour Cream
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,0,0,0,0,0,0,0
1,1,1,Izze,Clementine,$3.39,0,0,1,0,0,0,0
2,1,1,Nantucket Nectar,Apple,$3.39,1,0,0,0,0,0,0
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,0,0,0,0,0,0,0
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",$16.98,0,1,0,1,1,1,1
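The one-column-per-choice loop can also be sketched with pandas' `Series.str.get_dummies`, which splits on a separator and matches whole tokens. That matters because `choice in row['choice_description']` is a substring test, so a short choice name would also match inside a longer one. The `toy` series below is made-up data, not the real column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the cleaned choice_description column
toy = pd.Series([np.nan, 'Clementine', 'Apple', 'Rice, Black Beans'],
                name='choice_description')

# One 0/1 column per distinct choice; NaN rows become all zeros
dummies = toy.str.get_dummies(sep=', ').add_prefix('choice_')

print(dummies)
```

The resulting frame could then be joined back onto the original with `pd.concat([frame, dummies], axis=1)`.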


In [15]:
df.choice_description = df.choice_description.map(lambda x: clean_choice(x))

In [16]:
unique_choices = get_strlist_col_unique(df, 'choice_description')
print(len(unique_choices))

5547
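Note that 5547 counts repeats: the helper concatenates every split list, so a choice such as `Rice` is added once for each distinct description string it appears in. A minimal sketch of de-duplicating with a set, using a hypothetical helper name `unique_choice_tokens` and made-up input:

```python
def unique_choice_tokens(strings):
    """Return the sorted distinct choices found across comma-separated strings."""
    seen = set()
    for s in strings:
        # NaN is a float, so anything that is not a string is skipped
        if isinstance(s, str):
            seen.update(s.split(', '))
    return sorted(seen)

# Made-up descriptions; 'Fresh Tomato Salsa' and 'Rice' each appear twice
descriptions = [
    float('nan'),
    'Clementine',
    'Fresh Tomato Salsa, Rice, Cheese',
    'Fresh Tomato Salsa, Rice',
]
tokens = unique_choice_tokens(descriptions)
print(len(tokens))
print(tokens)
```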


In [17]:

# for ch in unique_choices:
#     if type(ch) == str:
# #         print ch
#         set_row_for_choice(df, ch)
# df.head(3)

### Next, clean up "item price" 

+ What can you do here? 
+ This will be our outcome variable
+ How can we make this easier to read? 

In [18]:
### code here
def dollar_to_float(string):
    return float(re.sub(r'\$','', str(string)))

# print dollar_to_float('$2.99')

df.item_price = df.item_price.map(lambda x: dollar_to_float(x))
df.item_price.head(2)

0    2.39
1    3.39
Name: item_price, dtype: float64
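An alternative sketch of the same conversion using pandas string methods on the whole column at once. The `prices` series is made-up sample data, and the `regex` keyword assumes a reasonably recent pandas.

```python
import pandas as pd

# Made-up sample of the raw item_price column
prices = pd.Series(['$2.39', '$3.39', '$16.98'], name='item_price')

# regex=False treats '$' as a literal character rather than the end-of-string anchor
as_float = prices.str.replace('$', '', regex=False).astype(float)

print(as_float)
```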

In [19]:
### code here

### Now Preprocess your data! 

+ Use a vectorizer of your choice!

+ Consider a dimension reduction technique! 

    + PCA? SVD? LDA?
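A minimal sketch of one possible combination: `CountVectorizer` with a custom tokenizer that keeps each comma-separated choice whole, followed by truncated SVD. The documents, the helper `split_choices`, and the choice of two components are all assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned descriptions
docs = [
    'Clementine',
    'Fresh Tomato Salsa, Rice, Black Beans',
    'Fresh Tomato Salsa, Rice, Cheese',
    'Tomatillo Green Chili Salsa, Rice, Black Beans, Cheese',
]

def split_choices(text):
    # Treat each comma-separated choice as one token, e.g. 'Black Beans'
    return text.split(', ')

# token_pattern=None because the custom tokenizer replaces the default pattern
vectorizer = CountVectorizer(tokenizer=split_choices, token_pattern=None,
                             lowercase=False)
counts = vectorizer.fit_transform(docs)

# Truncated SVD works directly on the sparse count matrix
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

print(sorted(vectorizer.vocabulary_))
print(counts.shape, reduced.shape)
```

With the default tokenizer, `Black Beans` would instead be split into `black` and `beans`; which behaviour is better for the model is a judgment call.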

In [20]:
## code here
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))
# keep only the rows that have a choice description
d3 = df[df[cd].notnull()].copy()
choices = d3[cd]
cv.fit(choices)
X = cv.transform(choices)
print(cv.vocabulary_)
X

{u'cheese': 28, u'hot tomatillo': 100, u'mild fajita': 129, u'lettuce cheese': 104, u'mild': 126, u'roasted': 163, u'chicken': 38, u'tomatillo green': 185, u'cola': 53, u'cilantro': 47, u'beans black': 10, u'salsa hot': 170, u'medium sour': 124, u'cherry': 37, u'rice guacamole': 158, u'veggies cheese': 202, u'black': 18, u'vegetables sour': 197, u'rice tomatillo': 162, u'rice': 153, u'mild cheese': 128, u'beans lettuce': 14, u'salsa sour': 176, u'lettuce guacamole': 106, u'medium black': 115, u'mild pinto': 132, u'rice sour': 161, u'beans pinto': 15, u'carnitas pinto': 27, u'braised': 21, u'dr': 69, u'hot black': 90, u'mild lettuce': 131, u'beans sour': 17, u'fajita': 71, u'medium fresh': 118, u'and grilled': 3, u'blackberry': 20, u'hot lettuce': 95, u'fajita veggies': 73, u'sour cream': 178, u'cream black': 58, u'beans rice': 16, u'cream lettuce': 62, u'chili corn': 43, u'medium guacamole': 119, u'cheese cilantro': 30, u'grilled chicken': 80, u'peach orange': 143, u'adobo': 0, u'mild 

<3376x210 sparse matrix of type '<type 'numpy.int64'>'
	with 55213 stored elements in Compressed Sparse Row format>

In [21]:
### code here
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_topics=2)
m = lda.fit_transform(X)



In [22]:
for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)

Topic #0:
Topic #1:
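The loop above prints only the topic headers. A minimal sketch of also listing the highest-weighted terms per topic, on a made-up corpus. It assumes a recent scikit-learn, where the number of topics is passed as `n_components`; the `_toy` names are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned descriptions
docs = [
    'Fresh Tomato Salsa, Rice, Black Beans',
    'Fresh Tomato Salsa, Rice, Cheese',
    'Diet Coke',
    'Coke',
    'Sprite',
]

cv_toy = CountVectorizer()
counts_toy = cv_toy.fit_transform(docs)

# Assumption: the installed scikit-learn uses the n_components parameter name
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
lda_toy.fit(counts_toy)

# Column index -> term lookup built from the fitted vocabulary
terms = sorted(cv_toy.vocabulary_, key=cv_toy.vocabulary_.get)

top_terms = []
for topic_idx, topic in enumerate(lda_toy.components_):
    # argsort is ascending, so take the last three indices and reverse them
    best = [terms[i] for i in topic.argsort()[-3:][::-1]]
    top_terms.append(best)
    print("Topic #%d: %s" % (topic_idx, ", ".join(best)))
```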


In [23]:
lda.components_

array([[  7.83567215e+00,   7.83568093e+00,   7.83568289e+00,
          7.83567546e+00,   5.89653012e+00,   5.10702232e-01,
          2.08541667e+00,   1.29230476e+00,   1.29243451e+00,
          4.76460745e+02,   9.48969003e+00,   2.03993954e+02,
          2.14925491e+01,   2.35706045e+01,   1.51033146e+01,
          1.65285186e+01,   1.13821100e+02,   7.74287771e+01,
          3.38735762e+02,   3.38738160e+02,   7.71024159e+00,
          2.86276364e+00,   2.08541664e+00,   1.27642944e+00,
          2.59201370e+00,   2.59201607e+00,   1.27642983e+00,
          1.27642827e+00,   5.39168908e+02,   3.65242147e+00,
          6.05129524e+00,   8.33757644e-01,   5.58196099e+01,
          1.26386769e+02,   5.05423010e-01,   2.22618695e+00,
          2.81644503e+02,   5.16903750e-01,   4.63124214e+00,
          1.29258845e+00,   1.29254938e+00,   2.25200121e+00,
          8.37525335e+02,   7.18652043e+02,   5.24585491e-01,
          2.48928088e+00,   7.01684218e-01,   6.05130434e+00,
        

In [24]:
X

<3376x210 sparse matrix of type '<type 'numpy.int64'>'
	with 55213 stored elements in Compressed Sparse Row format>

In [31]:
cv.get_feature_names()
pd.DataFrame(X.A, columns=cv.get_feature_names())

Unnamed: 0,adobo,adobo marinated,and,and grilled,apple,banana,barbacoa,barbacoa pinto,barbacoa vegetarian,beans,...,veggies,veggies black,veggies cheese,veggies guacamole,veggies lettuce,veggies pinto,veggies rice,veggies sour,white,white rice
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Now train your model! 
+ What model you select is up to you
+ Check out the sklearn documentation!

Model notes:

+ Regression (continuous output)
+ Supervised (we know the correct results for some data)


In [25]:
X


<3376x210 sparse matrix of type '<type 'numpy.int64'>'
	with 55213 stored elements in Compressed Sparse Row format>

In [26]:
y = df['item_price']
y.head(3)

0    2.39
1    3.39
2    3.39
Name: item_price, dtype: float64
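One thing to watch: `X` was built from `d3`, the 3376 rows with a non-null choice description, while `y` here comes from the full `df`, which also holds the 1246 NaN rows. scikit-learn estimators expect `X` and `y` to have the same number of rows. A minimal sketch of keeping the two aligned, on made-up data and with `Ridge` as an arbitrary stand-in regressor (an assumption for illustration, not a recommendation from these notes):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Made-up frame with the same two columns; the prices are invented
toy = pd.DataFrame({
    'choice_description': [np.nan, 'Clementine', 'Apple',
                           'Rice, Black Beans', 'Rice, Cheese'],
    'item_price': [2.39, 3.39, 3.39, 8.75, 8.75],
})

# Filter once, then take both the features and the target from the filtered frame
kept = toy[toy['choice_description'].notnull()]
X_toy = CountVectorizer().fit_transform(kept['choice_description'])
y_toy = kept['item_price']

# Same number of rows on both sides, so fit() accepts them
print(X_toy.shape[0], len(y_toy))

model = Ridge(alpha=1.0).fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions.shape)
```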

### Now test your model!
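For a continuous target like price, model quality is usually reported with an error metric or R² on held-out data rather than classification accuracy. A minimal sketch, assuming synthetic features and prices and `LinearRegression` as a placeholder model; every number below is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic binary features and prices, invented for this example
X_demo = rng.randint(0, 2, size=(40, 5)).astype(float)
true_weights = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
y_demo = X_demo.dot(true_weights) + 2.0 + rng.normal(scale=0.1, size=40)

# Hold out 25% of the rows so the score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model_demo = LinearRegression().fit(X_train, y_train)

mae = mean_absolute_error(y_test, model_demo.predict(X_test))  # average absolute error in price units
r2 = model_demo.score(X_test, y_test)                          # R^2 on the held-out rows
print(mae, r2)
```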

In [27]:
### code here

In [28]:
### code here