# Algorithms: Rules of Play

1. Name of the algorithm
2. What it's used for (classification, clustering, maybe other things?)
3. Why is it better/worse than other classification/clustering/etc algorithms
4. How to get our data into a format that is good for that algorithm
5. REALISTIC data sets
6. What the output means technically
7. What the output means in like real life language and practically speaking
8. What kind of datasets you use this algorithm for
9. Examples of when it was used in journalism OR maybe could have been used
10. Examples of when it was used period
11. Pitfalls
12. Maybe maybe maybe a little bit of math
13. How to ground them for a less technical audience and to help engage them in what the algorithm is doing

# Naive Bayes

Download and extract `recipes.csv.zip` from `#algorithms` and start a new Jupyter Notebook!!!!

**Classification algorithm** - spam filter

The more spammy words that are in an email, the more likely it is to be spam

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv("recipes.csv")
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


# QUESTION ONE: What are we doing and why are we using Naive Bayes?

We have a bunch of recipes in categories. Maybe someone sends us new recipes, what category do the new recipes belong in?

We're going to train a classifier to recognize italian food, so that if someone sends us new recipes, we know if it's italian because we love italian food and we only want to eat italian food.

RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.

**For clustering**

1. You'll get a lot of documents
2. You feed it to an algorithm, tell it create `x` number of categories
3. The machine gives you back categories whether they make sense or not

**For classification (which we are doing now)**

1. You'll get a lot of documents
2. You'll classify some of them into categories that you know and love
3. You'll ask the algorithm what categories a new bunch of unlabeled documents end up in

All mean the same thing: CATEGORY = CLASS = LABEL

The reason why you use machine learning is to not do things manually. So if you can do things manually, do it. Otherwise just try different algorithms until one works well (but you might need to know some upsides or downsides of each to interpret that).

## How does Naive Bayes work?

NAIVE BAYES WORKS WITH TEXT (kind of)

**Bayes Theorem (kind of)**

* If you see a word that is normally in a spam email, there's a higher chance it's spam
* If you see a word that is normally in a non-spam email, there's a higher chance it's not spam

**Naive:** every word/ingredient/etc is independent of any other word

FOR US: If you see ingredients that are normally in italian food, it's probably italian
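
A rough sketch of what "naive" buys you: because every ingredient is treated as independent, you can just multiply the per-ingredient chances together. Every probability below is made up for illustration, none of them come from `recipes.csv`.

```python
# Hypothetical numbers, invented purely for illustration
p_italian = 0.2
p_not_italian = 0.8
p_word_given_italian = {"spaghetti": 0.30, "basil": 0.25}
p_word_given_not_italian = {"spaghetti": 0.02, "basil": 0.05}

recipe = ["spaghetti", "basil"]

# Start with how common each label is, then multiply in each ingredient
score_italian = p_italian
score_not_italian = p_not_italian
for word in recipe:
    score_italian *= p_word_given_italian[word]
    score_not_italian *= p_word_given_not_italian[word]

# Rescale the two scores so they add up to 1
prob_italian = score_italian / (score_italian + score_not_italian)
print(round(prob_italian, 3))
```

Whichever label ends up with the bigger score wins. Here the italian score is much bigger, so this pretend recipe gets called italian.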

Secret trick: you can't just use text, you have to convert into numbers

## Types of Naive Bayes

Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.

**Multinomial Naive Bayes - (multiple numbers)**: You count the words. You care about whether a word appears once or twice or three times or ten times. *This is better for long passages*

**Bernoulli Naive Bayes - True/False Bayes:** You only care if the word shows up (`True`) or it doesn't show up (`False`) - *this is better for short passages*
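
To make the difference concrete, here is a small sketch of the two ways of turning the same text into numbers. The example text is made up, and this only shows the shape of the features, not the classifiers themselves.

```python
from collections import Counter

text = "salt pepper salt olive oil salt"
words = text.split()

# Multinomial-style features: how many times each word appears
counts = Counter(words)

# Bernoulli-style features: only whether each word appears at all
presence = {word: True for word in counts}

print(dict(counts))
print(presence)
```

With counts, "salt" showing up three times matters. With True/False, it only matters that "salt" showed up.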


# STEP ONE: Let's convert our text data into numerical data

In [3]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


**Our problem:** Everything is text - cuisine is text, ingredient list is text, id is a number but it doesn't matter

**Two things to convert into numbers:**

* Our labels (a.k.a. the categories everything belongs in)
* Our features

## Converting our labels into numbers

We have two labels

* italian = `1`
* not italian = `0`

In [4]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


In [5]:
def make_label(cuisine):
    if cuisine == "italian":
        return 1
    else:
        return 0

In [6]:
df['label'] = df['cuisine'].apply(make_label)
df.head(10)

Unnamed: 0,cuisine,id,ingredient_list,label
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0
3,indian,22213,"water, vegetable oil, wheat, salt",0
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork...",0
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1
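
If writing a whole function feels like a lot, the same 0/1 label can probably be built in one line by comparing the column directly. A small sketch on a made-up dataframe (`toy` is a stand-in, not the real recipes data):

```python
import pandas as pd

# Made-up stand-in for the recipes dataframe
toy = pd.DataFrame({"cuisine": ["greek", "italian", "indian", "italian"]})

# Comparing gives True/False, and .astype(int) turns that into 1/0
toy["label"] = (toy["cuisine"] == "italian").astype(int)
print(toy)
```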


## Converting our features into numbers

**Feature selection:** The process of selecting the features that matter, in this case - what ingredients do we want to look at?

Our features are going to be: whether it has spaghetti or not, and whether it has curry powder or not

In [7]:
df['has_spaghetti'] = df['ingredient_list'].str.contains("spaghetti")
df['has_curry_powder'] = df['ingredient_list'].str.contains("curry powder")
df.head(10)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0,False,False
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0,False,False
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1,False,False
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork...",0,False,False
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1,False,False


## Let's run our tests

Let's feed our labels and our features to a machine that likes to learn and then see how well it learns!!!!

### Looking at our labels

We stored it in `label`, and if it's `0` it's not italian, if it's `1` it is Italian

In [8]:
df['label'].head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

### Look at our features

We have two features `has_spaghetti` and `has_curry_powder`.

In [9]:
df[['has_spaghetti', 'has_curry_powder']].head()

Unnamed: 0,has_spaghetti,has_curry_powder
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False


### Now let's finally do this

In [10]:
# We need to split into training and testing data
# (older scikit-learn versions kept this in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

In [11]:
# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1, not italian/italian)
    test_size=0.2) # 80% training, 20% testing

In [12]:
# Oh hey, it's just our features from the dataframe
X_train

Unnamed: 0,has_spaghetti,has_curry_powder
6785,False,False
14035,False,False
16477,False,False
2541,False,False
10704,False,False
18965,False,False
39451,False,False
6913,False,False
36372,False,False
4385,False,False


In [13]:
# X is always the features, whether it's for training or for testing
X_test

Unnamed: 0,has_spaghetti,has_curry_powder
38315,False,False
14214,False,False
31753,False,False
34806,False,False
7888,False,False
32614,True,False
10982,False,False
34086,False,False
4793,False,False
8442,False,False


In [14]:
len(X_train)

31819

In [15]:
len(X_test)

7955

In [16]:
# We're testing on ~8000 and training on ~32000

In [17]:
# y_train is our labels that we are training on
y_train

6785     0
14035    0
16477    0
2541     0
10704    0
18965    0
39451    1
6913     0
36372    0
4385     0
32874    0
18361    0
1725     1
123      0
36028    0
15923    0
19726    0
35670    0
17921    0
15414    0
26231    0
2008     0
25360    0
20617    0
29960    0
28484    1
15197    0
30300    1
706      0
13480    0
        ..
9593     0
20304    0
38450    0
16247    0
37042    1
30595    0
6986     1
27071    1
3585     1
20144    0
789      0
15513    0
3956     0
29346    0
5314     1
9751     0
33744    1
26038    0
13668    0
23003    1
16605    0
10788    0
3116     0
28326    0
17073    0
12889    0
32912    1
8860     0
7378     0
38400    0
Name: label, dtype: int64

In [18]:
# And y_test is the labels we're testing on
y_test

38315    0
14214    1
31753    0
34806    0
7888     0
32614    1
10982    0
34086    0
4793     0
8442     0
9780     0
37472    1
33246    1
38113    0
29007    0
21716    0
17892    0
22053    0
24356    1
23960    1
3214     1
22711    0
12024    1
26047    1
36170    1
33420    0
27584    1
35049    1
35618    0
37116    0
        ..
18871    1
28911    0
14716    0
33276    0
11299    0
19369    0
39447    0
4927     0
38551    1
19530    0
26596    0
3274     0
35779    1
13980    0
29929    0
4477     0
35053    0
18077    0
24650    1
18442    0
10688    0
26449    0
3418     0
31726    0
27098    0
4067     0
12581    0
19426    0
16433    0
2050     1
Name: label, dtype: int64

In [19]:
print("Length of training labels:", len(y_train))
print("Length of testing labels:", len(y_test))
print("Length of training features:", len(X_train))
print("Length of testing features:", len(X_test))

Length of training labels: 31819
Length of testing labels: 7955
Length of training features: 31819
Length of testing features: 7955


Basically all that happened was `train_test_split` took us from having a nice dataframe where everything was together and split it into two groups of two - separated our labels vs. our features, and our training data vs our testing data.
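
`train_test_split` shuffles randomly, so every run gives a different split (and slightly different scores). Here is a small sketch, on made-up toy data, of pinning the shuffle with `random_state` so the results are repeatable. The import path assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection`.

```python
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 rows, 2 True/False features
features = [[i % 2 == 0, i % 3 == 0] for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# random_state pins the shuffle so the same rows land in the same group every time
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Running it again with the same random_state gives the identical split
X_train_again, X_test_again, y_train_again, y_test_again = train_test_split(
    features, labels, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print(y_test == y_test_again)
```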

# Back to actually doing our fitting etc

In [20]:
# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_curry_powder']], # the first is our FEATURES
    df['label'], # the second parameter is the LABEL (this is 0/1, not italian/italian)
    test_size=0.2) # 80% training, 20% testing

In [21]:
# Import naive_bayes to get access to ALL kinds of naive bayes classifiers
# But REMEMBER we're using Bernoulli because it's for true/false which is fine
# for small passages
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Feed the classifier two things:
#   * our training features (X_train)
#   * our training labels (y_train)
# To help it study for the exam later when we test it
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [22]:
# This looks ugly but in theory it's what every recipe is
# All those zeroes = not italian
# We know the first three aren't italian and the last three aren't italian
clf.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [23]:
# Naive Bayes rarely overfits badly
# It doesn't "study too hard" or "memorize the questions"
# (a decision tree can)
# So if we give it the training data back it will still get some wrong
clf.score(X_train, y_train)

0.80904491027373582

In [24]:
clf.score(X_test, y_test)

0.81621621621621621

In [25]:
df['cuisine'].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [26]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False


# Wow, we did a really great job! Let's try another cuisine

## Step 1: Preparing our data

### Creating labels that scikit-learn can use

Our cuisine is brazilian, so we'll do `0` and `1` as to whether it's that cuisine or not

In [27]:
def make_label(cuisine):
    if cuisine == "brazilian":
        return 1
    else:
        return 0

df['is_brazilian'] = df['cuisine'].apply(make_label)

In [28]:
df.head(2)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0


### Creating features that scikit-learn can use

It's Bernoulli Naive Bayes, so it's `True` and `False`

In [29]:
df['has_water'] = df['ingredient_list'].str.contains('water')
df['has_salt'] = df['ingredient_list'].str.contains('salt')

In [30]:
df.head(2)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian,has_water,has_salt
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0,False,True
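
One thing to watch with `.str.contains`: it matches substrings, so `'water'` also fires on "watermelon" and `'salt'` also fires on "unsalted butter". Below is a sketch on made-up rows of a stricter check. `has_ingredient` is a hypothetical helper written for this example, not something that comes with pandas, and it assumes ingredients are separated by commas like in our `ingredient_list` column.

```python
import pandas as pd

# Made-up rows for illustration
recipes = pd.DataFrame({
    "ingredient_list": [
        "water, vegetable oil, wheat, salt",
        "watermelon, lime juice, sugar",
        "unsalted butter, flour",
    ]
})

# Substring match: also fires on "watermelon"
recipes["has_water_loose"] = recipes["ingredient_list"].str.contains("water")

def has_ingredient(ingredient_list, name):
    # Split on commas and compare whole ingredient names
    ingredients = [item.strip() for item in ingredient_list.split(",")]
    return name in ingredients

recipes["has_water_strict"] = recipes["ingredient_list"].apply(
    lambda text: has_ingredient(text, "water"))
print(recipes)
```

The strict version has its own downside: it would miss something like "warm water". Which one you want depends on the ingredient.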


## Step 2: Create the test/train split

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['is_brazilian'], # the second parameter is the LABEL (this is 0/1, not brazilian/brazilian)
    test_size=0.2) # 80% training, 20% testing

## Step 3: Create classifier, train and test

In [32]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Fit with our training data
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [33]:
clf.score(X_train, y_train)

0.9883402998208618

In [34]:
clf.score(X_test, y_test)

0.9879321181646763

# We just got destroyed by math: let's actually understand Naive Bayes

Naive Bayes gives you back a probability for each possible label - so, % chance that it's brazilian vs. the % chance that it is not brazilian. We'll use this to see what went wrong.
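
Here is a sketch of how you might look at those probabilities with `predict_proba`. The data is a tiny made-up set (one `has_water` feature, with only one "brazilian" row out of eight) built to mimic the lopsided situation we're in. It is not the real recipes data.

```python
from sklearn.naive_bayes import BernoulliNB

# Toy, made-up data: one feature (has_water as 1/0), label 1 = brazilian
X = [[1], [1], [0], [0], [1], [0], [1], [0]]
y = [0, 0, 0, 0, 0, 0, 0, 1]

clf = BernoulliNB()
clf.fit(X, y)

# One row per recipe, one column per label: [chance of 0, chance of 1]
probabilities = clf.predict_proba([[1], [0]])
print(probabilities)
print(clf.predict([[1], [0]]))
```

Water or no water, the "not brazilian" column wins, because there are so few brazilian rows to begin with.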

**Math stuff**

Naive Bayes is all about calculating the probability of "B given A", a.k.a., the chance of B being true if A is true.

* **Bayes' Theorem (conditional-probability form):** `P(B|A) = P(A and B)/P(A)`, where `P(A and B)` can also be written as `P(A|B) * P(B)`

* `P(A)` means "what is the probability of A being true?"
* `P(B|A)` means "if A is true, what is the probability of B being true?"
* `P(A and B)` means "what is the probability of both A and B being true?"

## Example: We have a recipe and it has water in it. Is it brazilian?

**Hypothesis one: the recipe is brazilian**

* `P(B|A)` would be "if it contains water, what is the chance that it is brazilian cuisine?"
* `P(A and B)` would be "what is the chance that it contains both water and is brazilian?"
* `P(A)` would be "what is the chance that this contains water?"

In [35]:
# P(B|A) = P(A and B)/P(A)

In [36]:
# P(A and B)
# Probability that a recipe has water and is brazilian

# How many recipes have water AND are brazilian?
len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])

109

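In [37]:
# P(A)
# How many recipes have water? (Count the True values, not every row)
df['has_water'].sum()

9494

In [38]:
# P(B|A)
# The chance that a recipe is brazilian if it has water in it
# (dividing counts gives the same answer as dividing probabilities)
round(109/9494, 4)

0.0115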

**Hypothesis two: the recipe is NOT brazilian**

* `P(B|A)` would be "if it contains water, what is the chance that it is NOT brazilian cuisine?"
* `P(A and B)` would be "what is the chance that it contains both water and is NOT brazilian?"
* `P(A)` would be "what is the chance that this contains water?"

In [39]:
# P(A and B)
# Probability that a recipe has water and is NOT brazilian

# How many recipes have water AND are NOT brazilian?
len(df[(df['has_water']) & (df['cuisine'] != 'brazilian')])

9385

In [40]:
# P(A)
# How many recipes have water? (Count the True values, not every row)
df['has_water'].sum()

9494

In [41]:
# P(B|A)
# The chance that a recipe is NOT brazilian if it has water in it
round(9385/9494, 4)

0.9885

## What this boils down to

No matter what, pretty much no recipe is ever brazilian. Does it have water in it? Does it not have water in it? Doesn't really matter, it's probably not brazilian.


In [42]:
len(df[df['cuisine'] == 'brazilian'])

467

In [43]:
len(df)

39774

In [44]:
# Only a little bit over 1% of our recipes are brazilian
# so even though it ALWAYS says "not brazilian", it's usually right
467/39774

0.011741338562880274

In [45]:
1 - 467/39774

0.9882586614371197

# Let's fix up our labels

Before we had this:

    def make_label(cuisine):
        if cuisine == "brazilian":
            return 1
        else:
            return 0

which does not scale well. If we wanted to add in more different cuisines, we'd need to keep adding in else ifs again and again and again until our fingers fell off. And we'd probably misspell something. And if we're anything, it's LAZY.

## LabelEncoder to the rescue: Converts categories into numeric labels

In [46]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

In [47]:
# LabelEncoder has two parts: FIT and TRANSFORM
# FIT learns all of the possible labels
# TRANSFORM takes a list of categories and converts them into numbers

In [48]:
# Teach the label encoder all of the possible labels
# It doesn't care about duplicates 
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

LabelEncoder()

In [49]:
# Get the labels out as numbers
le.transform(['orange', 'blue', 'yellow'])

array([1, 0, 3])
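
If you're curious which number goes with which label, the fitted encoder keeps its labels (sorted) in `classes_`. A small sketch with the same colors:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['orange', 'red', 'red', 'red', 'yellow', 'blue'])

# classes_ holds the unique labels in sorted order; the position is the number
mapping = {str(label): number for number, label in enumerate(le.classes_)}
print(mapping)
```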

In [50]:
# Send the label encoder each and every cuisine
le.fit(df['cuisine'])

LabelEncoder()

In [51]:
le.transform(df['cuisine'])

array([ 6, 16,  4, ...,  8,  3, 13])

In [52]:
df['cuisine_label'] = le.transform(df['cuisine'])
df.head(3)

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,is_brazilian,has_water,has_salt,cuisine_label
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,0,False,False,6
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,0,False,True,16
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False,0,False,True,4


## Let's train and test with our new labels

In [53]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_water', 'has_salt']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19, southern us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing

In [54]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to water and salt
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [55]:
clf.score(X_train, y_train)

0.19702064804047895

In [56]:
clf.score(X_test, y_test)

0.19723444374607166
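
A score of about 0.197 is almost exactly the share of italian recipes (7838 out of 39774), which suggests that with only water and salt the classifier is probably just guessing the most common cuisine every time. One way to get that "always guess the most common label" score to compare against is scikit-learn's `DummyClassifier`. A sketch on made-up toy data:

```python
from sklearn.dummy import DummyClassifier

# Toy data, invented for illustration. The features are ignored by the dummy.
X = [[0], [0], [0], [0], [0]]
y = ["italian", "italian", "italian", "mexican", "greek"]

# Always predicts whichever label was most common in the training data
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_score = baseline.score(X, y)
print(baseline.predict([[0]]))
print(baseline_score)
```

If a real classifier can't beat this number, the features aren't telling it anything.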

# Let's add some more features to see if we can do a better job

Right now we're only looking at water and salt, which doesn't tell you much. Ingredients like tortillas or cumin or soy sauce tell you a little bit more.

In [57]:
df['has_miso'] = df['ingredient_list'].str.contains("miso")
df['has_soy_sauce'] = df['ingredient_list'].str.contains("soy sauce")
df['has_cilantro'] = df['ingredient_list'].str.contains("cilantro")
df['has_black_olives'] = df['ingredient_list'].str.contains("black olives")
df['has_tortillas'] = df['ingredient_list'].str.contains("tortillas")
df['has_turmeric'] = df['ingredient_list'].str.contains("turmeric")
df['has_pistachios'] = df['ingredient_list'].str.contains("pistachios")
df['has_lemongrass'] = df['ingredient_list'].str.contains("lemongrass")

Our new feature set is!!! `df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']]`

In [58]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['has_spaghetti', 'has_miso', 'has_soy_sauce', 'has_cilantro','has_black_olives','has_tortillas','has_turmeric', 'has_pistachios','has_lemongrass']], # the first is our FEATURES
    df['cuisine_label'], # the second parameter is the LABEL (0-19, southern us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing

In [59]:
from sklearn import naive_bayes

# Create a Bernoulli Naive Bayes classifier
clf = naive_bayes.BernoulliNB()

# Learn how related every cuisine is to our new ingredient features
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [60]:
clf.score(X_train, y_train)

0.36977906282409884

In [61]:
clf.score(X_test, y_test)

0.37372721558768068

# This is taking forever, please let there be an automatic way to pick out all of the words

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

# STEP ONE: .fit to learn all of the words
# STEP TWO: .transform to turn a sentence into numbers

#vectorizer = CountVectorizer()
# So now 'olive' and 'oil' and 'olive oil' instead of just 'olive' and 'oil'
# Only pick the top 3000 most frequent ngrams
vectorizer = CountVectorizer(ngram_range=(1,2), max_features=3000)

In [63]:
# We have some sentences
# We're going to feed it to the vectorizer
# and it's going to learn all of the words
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=3000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [64]:
# We're going to take some sentences and feed it to the vectorizer
# and it's going to convert it into numbers
vectorizer.transform(sentences)

<2x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [65]:
# But it looks bad to look at so I'll use .toarray()
vectorizer.transform(sentences).toarray()

array([[1, 1, 1, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 1]])
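
Those seven columns are the words and word pairs the vectorizer learned. A sketch of how to see which column is which, using the fitted vectorizer's `vocabulary_` (a dictionary of word to column number):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=3000)
sentences = [
    "cats are cool",
    "dogs are cool"
]
vectorizer.fit(sentences)

# Sort the learned words and word pairs by their column number
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
```

So the first row, `[1, 1, 1, 1, 1, 0, 0]`, is "cats are cool": it has everything except the two dog columns at the end.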

In [66]:
# In our case, our text is the list of ingredients. We can get it through
df['ingredient_list'].head()

0    romaine lettuce, black olives, grape tomatoes,...
1    plain flour, ground pepper, salt, tomatoes, gr...
2    eggs, pepper, salt, mayonaise, cooking oil, gr...
3                    water, vegetable oil, wheat, salt
4    black pepper, shallots, cornflour, cayenne pep...
Name: ingredient_list, dtype: object

In [67]:
# Dear vectorizer, please learn all of these words
vectorizer.fit(df['ingredient_list'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=3000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [68]:
# Dear vectorizer, please convert ingredient_list into features
# That we can do machine learning on

every_single_word_features = vectorizer.transform(df['ingredient_list'])
every_single_word_features

<39774x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 1243216 stored elements in Compressed Sparse Row format>

# Now let's try with our new complete labels and our new complete features that includes every single word

In [69]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    every_single_word_features,
    df['cuisine_label'], # the second parameter is the LABEL (0-19, southern us, brazilian, anything really)
    test_size=0.2) # 80% training, 20% testing

In [70]:
print("This is Naive Bayes")

from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
%time clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", clf.score(X_test, y_test))

This is Naive Bayes
CPU times: user 48.6 ms, sys: 14.5 ms, total: 63.1 ms
Wall time: 64.7 ms
Training score: (stuff it already knows) 0.715515886734
Testing score: (stuff it hasn't seen before): 0.689126335638


In [71]:
print("This is a Decision Tree")

from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()

%time tree_clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))

This is a Decision Tree
CPU times: user 12.5 s, sys: 88.6 ms, total: 12.6 s
Wall time: 12.9 s
Training score: (stuff it already knows) 0.99981143342
Testing score: (stuff it hasn't seen before): 0.640603394092


In [72]:
from sklearn.ensemble import RandomForestClassifier

print("This is a Random Forest")

tree_clf = RandomForestClassifier()

%time tree_clf.fit(X_train, y_train)

# How does it do on the training data?
print("Training score: (stuff it already knows)", tree_clf.score(X_train, y_train))

# How does it do on the testing data?
print("Testing score: (stuff it hasn't seen before):", tree_clf.score(X_test, y_test))

This is a Random Forest
CPU times: user 7.89 s, sys: 109 ms, total: 8 s
Wall time: 8.14 s
Training score: (stuff it already knows) 0.992677331154
Testing score: (stuff it hasn't seen before): 0.702199874293
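
One way to read these three results side by side is the gap between the training score and the testing score. A big gap is a hint that the classifier memorized the training data. A quick sketch using the scores printed above:

```python
# (training score, testing score) copied from the outputs above
scores = {
    "Naive Bayes": (0.715515886734, 0.689126335638),
    "Decision Tree": (0.99981143342, 0.640603394092),
    "Random Forest": (0.992677331154, 0.702199874293),
}

# Gap = how much worse it does on recipes it hasn't seen before
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(name, gap)
```

Naive Bayes barely drops, while the decision tree aces the training data and then loses about a third of its score on new recipes.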


# How do you do this in the real world with new data?

In [73]:
every_single_word_features = vectorizer.transform(df['ingredient_list'])


In [74]:
# Import the Naive bayes thing
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()

# Give the classifier EVERYTHING we know, not holding back anything
clf.fit(every_single_word_features, df['cuisine_label'])

# We have some new stuff we have not categorized
incoming_recipes = [
    "spaghetti tomato sauce garlic onion water",
    "soy sauce ginger sugar butter",
    "green papaya thai chilies palm sugar",
    "butter oil salt black pepper water milk bubblegumpie"
]

features_for_new_recipes = vectorizer.transform(incoming_recipes)
features_for_new_recipes

<4x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [75]:
predictions = clf.predict(features_for_new_recipes)
predictions

array([ 4, 11,  4, 16])

In [76]:
# The predictions are all categories that the labelencoder decided on
# Let's convert those numeric ones back into real fun cuisine words
le.inverse_transform(predictions)

array(['filipino', 'japanese', 'filipino', 'southern_us'], dtype=object)
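
A possible way to tidy this up is to glue the vectorizer and the classifier together with scikit-learn's `make_pipeline`, so new text goes in and cuisine names come out in one step. This is a sketch on a tiny made-up training set, only to show the mechanics. It is not the real recipes data, and it skips the LabelEncoder by training on the cuisine names directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, only to show the mechanics
train_texts = [
    "spaghetti, tomato sauce, garlic, basil",
    "penne, olive oil, garlic, parmesan",
    "soy sauce, ginger, rice, sesame oil",
    "miso, tofu, soy sauce, rice",
]
train_cuisines = ["italian", "italian", "japanese", "japanese"]

# The pipeline vectorizes the text, then hands the numbers to Naive Bayes
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(train_texts, train_cuisines)

prediction = model.predict(["garlic, basil, spaghetti"])
print(prediction)
```

The nice part is you can't forget to run new recipes through the same vectorizer, because the pipeline does it for you.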