In this notebook, we'll turn again to a problem that arose when I was working on the Kaggle What's Cooking challenge (see my previous posts on the subject [here](http://flothesof.github.io/kaggle-whatscooking-bokeh-plots.html) and [here](http://flothesof.github.io/kaggle-whats-cooking-machine-learning.html)). In this challenge, the data, which consists of recipes, contained quite a few quirks that made some of it useless.
For example, the data may contain brand names that are not useful to a human working on this problem ("knorr garlic minicub"), or mentions of weights and quantities that should have been cleaned up beforehand ("(10 oz.) frozen chopped spinach, thawed and squeezed dry").

The goal of this notebook is therefore to develop a probabilistic model for finding better ingredient names in an automatic way.

We will first look at some overall statistics about the data and then develop a probabilistic model inspired by the principle behind spellcheckers and Google auto-correct, as explained by [Peter Norvig](http://norvig.com/spell-correct.html).

# Statistics about the data, at ingredient and word level

First, let's load the data.

In [1]:
import pandas as pd

In [2]:
df_train = pd.read_json('train.json')

We will build a single list with all the ingredients found in the dataset.

In [3]:
all_ingredients_text = []
for ingredient_list in df_train.ingredients:
    all_ingredients_text += [ing.lower() for ing in ingredient_list]

Let's have a look at the stats. We have the following number of ingredients in our recipes:

In [4]:
len(all_ingredients_text)

428275

Among those, the number of unique ingredients is:

In [5]:
len(set(all_ingredients_text))

6703

How about looking at the word level? We can split each ingredient at a word boundary using a regexp:

In [6]:
import re
re.split(re.compile('[,. ]+'), 'KRAFT Shredded Pepper Jack Cheese with a TOUCH OF PHILADELPHIA')

['KRAFT',
 'Shredded',
 'Pepper',
 'Jack',
 'Cheese',
 'with',
 'a',
 'TOUCH',
 'OF',
 'PHILADELPHIA']
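This pattern has two quirks that matter for the rest of the notebook. The sketch below, run on strings taken from this dataset, shows that parentheses are not treated as separators and that a trailing separator leaves an empty string in the result. The helper `split_words` is hypothetical and not used elsewhere in this notebook.

```python
import re

# Same separator pattern as above: runs of commas, periods and spaces.
splitter = re.compile('[,. ]+')

# Parentheses are not separators, so they stay attached to (or become) tokens.
tokens = splitter.split('(10 oz.) frozen chopped spinach, thawed and squeezed dry')
print(tokens)

# A trailing separator leaves an empty string at the end of the result.
print(splitter.split('garlic.'))

def split_words(text):
    "Hypothetical helper: same split, with empty tokens dropped."
    return [word for word in splitter.split(text) if word]

print(split_words('garlic.'))
```

Filtering out empty tokens is optional here, since it only affects the few ingredients that end with punctuation.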

Let's do that and compute the number of words and then the unique words:

In [7]:
splitter = re.compile('[,. ]+')
all_words = []
for ingredient in all_ingredients_text:
    all_words += re.split(splitter, ingredient)

In [8]:
len(all_words)

807802

How about unique ones?

In [9]:
len(set(all_words))

3152

So to conclude, our dataset consists of:

- 428275 ingredients
- among which 6703 are unique
- all these ingredients are made of 807802 words
- among which 3152 are unique


Let's now turn to the problem found within the ingredients.

# An example of why ingredients are not always the right ones 

The 50 longest ingredient names found in the dataset are:

In [10]:
sorted(set(all_ingredients_text), key=len, reverse=True)[:50]

['pillsburyâ„¢ crescent recipe creationsâ® refrigerated seamless dough sheet',
 'kraft mexican style shredded four cheese with a touch of philadelphia',
 'bertolli vineyard premium collect marinara with burgundi wine sauc',
 'hidden valleyâ® original ranch saladâ® dressing & seasoning mix',
 'hidden valleyâ® farmhouse originals italian with herbs dressing',
 'hellmannã¢â‚¬â„¢ or best food canola cholesterol free mayonnais',
 'kraft shredded pepper jack cheese with a touch of philadelphia',
 'condensed reduced fat reduced sodium cream of mushroom soup',
 'condensed reduced fat reduced sodium cream of chicken soup',
 "i can't believ it' not butter! made with olive oil spread",
 '(10 oz.) frozen chopped spinach, thawed and squeezed dry',
 'wish-bone light asian sesame ginger vinaigrette dressing',
 'kraft shredded low-moisture part-skim mozzarella cheese',
 'kraft mexican style 2% milk finely shredded four cheese',
 'reduced fat reduced sodium tomato and herb pasta sauce',
 'hurst family 

What's going on here? The longest names are often not names a human would choose to describe the ingredients properly, because:

- they feature brand names (for instance KRAFT, Pillsbury, Hidden Valley)
- they are a sentence of English words instead of an ingredient name ("i can't believ it' not butter! made with olive oil spread")

Ideally, we want a function that returns the human version of these ingredients. For example:

```python 
>>> correct_ingredient('Pillsbury™ Crescent Recipe Creations® refrigerated seamless dough sheet')
'Refrigerated seamless dough sheet'
```

Or

```python 
>>> correct_ingredient('reduced fat reduced sodium tomato and herb pasta sauce')
'tomato and herb pasta sauce'
```

# Building a simple model 

## Theory 

Judging whether a part of an ingredient is useful or not is difficult. We basically want to build a language model. We can take inspiration from [Peter Norvig's approach to spelling correction](http://norvig.com/spell-correct.html). I also found these good slides from Cornell University explaining the principle behind this approach, called Naive Bayes: <https://courses.cit.cornell.edu/info4300_2011fa/slides/25.pdf>.

We want our correction to have the biggest probability given our input ingredient:

$$
\DeclareMathOperator*{\argmax}{arg\,max}
\argmax_{\text{c}}P\left({\text{c}\mid\text{ingredient}}\right)
$$

Following Peter Norvig, we rewrite this expression with Bayes' rule. The denominator $P(\text{ingredient})$ is the same for every candidate c, so it can be dropped without changing the argmax:

$$
\DeclareMathOperator*{\argmax}{arg\,max}
\argmax_{\text{c}}P\left({\text{ingredient}\mid\text{c}}\right) P(\text{c})
$$

This leaves us with three things:

- the prior $P(c)$, the probability that the candidate c occurs in our data (this will favor commonly used ingredients)
- the error model $P(\text{ingredient} \mid c)$, which says how likely it is that the ingredient is a modified version of the correct c
- the argmax, our control mechanism, which chooses the c that gives the best combined probability score


Our error model will be that we can only delete words from our ingredient. So what we need is to generate a list of possible ingredients from our original ingredient by removing words. We will also assume that word order doesn't matter, so we can represent an ingredient by a set.
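To make the decomposition concrete, here is a minimal sketch on made-up counts. It assumes that the error model gives the same likelihood to every candidate obtained by deleting words, so that the argmax is decided by $P(c)$ alone. The counts and the helper names `prior`, `likelihood` and `subsets` are hypothetical.

```python
from itertools import combinations

# Hypothetical counts of ingredients seen in a corpus (made-up numbers).
counts = {
    frozenset({'tomato', 'sauce'}): 90,
    frozenset({'pasta'}): 30,
    frozenset({'sauce'}): 5,
}
total = sum(counts.values())

def prior(c):
    "P(c): relative frequency of candidate c in the corpus (0 if unseen)."
    return counts.get(c, 0) / total

def likelihood(ingredient, c):
    "P(ingredient | c): constant for any c obtained by deleting words, else 0."
    return 1.0 if c <= ingredient else 0.0

def subsets(ingredient):
    "All non-empty subsets of the words of an ingredient."
    words = sorted(ingredient)
    return [frozenset(combo)
            for size in range(1, len(words) + 1)
            for combo in combinations(words, size)]

ingredient = frozenset('tomato and herb pasta sauce'.split())
best = max(subsets(ingredient), key=lambda c: likelihood(ingredient, c) * prior(c))
print(sorted(best))
```

With a constant likelihood, the most frequent sub-ingredient always wins, which is why a richer error model (for example one that penalizes each deleted word) can change the outcome.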

## Modelling objects 

**Ingredients** contain a fixed number of words. We will therefore model them as **frozensets** of **words**. This will make them easier to manipulate in the rest of this document.
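The short sketch below illustrates why a frozenset is a convenient choice here, using example strings only: it ignores word order, it is hashable (so it can serve as a dictionary or `Counter` key, unlike a plain `set`), and it collapses repeated words.

```python
from collections import Counter

# Word order does not matter for a frozenset.
a = frozenset('olive oil'.split())
b = frozenset('oil olive'.split())
print(a == b)

# Unlike a plain set, a frozenset is hashable, so it can be a dict or Counter key.
counts = Counter([a, b, frozenset(['salt'])])
print(counts[frozenset(['oil', 'olive'])])

# The trade-off: repeated words and the original word order are lost.
repeated = frozenset('ranch spicy ranch dressing'.split())
print(len(repeated))
```

The loss of word order is the reason the simplified names printed later in this notebook appear with their words shuffled.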

Let's define a function that creates an ingredient from a text string:

In [11]:
import re

def to_ingredient(text):
    "Transforms text into an ingredient."
    return frozenset(re.split(re.compile('[,. ]+'), text))

Let's build a list of all our ingredients using this function:

In [12]:
all_ingredients = [to_ingredient(text) for text in all_ingredients_text]

In [13]:
all_ingredients[:10]

[frozenset({'lettuce', 'romaine'}),
 frozenset({'black', 'olives'}),
 frozenset({'grape', 'tomatoes'}),
 frozenset({'garlic'}),
 frozenset({'pepper'}),
 frozenset({'onion', 'purple'}),
 frozenset({'seasoning'}),
 frozenset({'beans', 'garbanzo'}),
 frozenset({'cheese', 'crumbles', 'feta'}),
 frozenset({'flour', 'plain'})]

Let's now implement our algorithm given our model of an ingredient.

## Implementation 

We can now write a function that generates all possible candidate ingredients that can be built from a starting ingredient, using combinations.

In [14]:
import itertools

def candidates(ingredient):
    "Returns a list of candidate ingredients obtained from the original ingredient by keeping at least one of its words."
    n = len(ingredient)
    possible = []
    for i in range(1, n + 1):
        possible += [frozenset(combi) for combi in itertools.combinations(ingredient, i)]
    return possible

Let's see how this works on examples:

In [15]:
candidates(to_ingredient("tomato and herb pasta sauce"))

[frozenset({'and'}),
 frozenset({'tomato'}),
 frozenset({'herb'}),
 frozenset({'sauce'}),
 frozenset({'pasta'}),
 frozenset({'and', 'tomato'}),
 frozenset({'and', 'herb'}),
 frozenset({'and', 'sauce'}),
 frozenset({'and', 'pasta'}),
 frozenset({'herb', 'tomato'}),
 frozenset({'sauce', 'tomato'}),
 frozenset({'pasta', 'tomato'}),
 frozenset({'herb', 'sauce'}),
 frozenset({'herb', 'pasta'}),
 frozenset({'pasta', 'sauce'}),
 frozenset({'and', 'herb', 'tomato'}),
 frozenset({'and', 'sauce', 'tomato'}),
 frozenset({'and', 'pasta', 'tomato'}),
 frozenset({'and', 'herb', 'sauce'}),
 frozenset({'and', 'herb', 'pasta'}),
 frozenset({'and', 'pasta', 'sauce'}),
 frozenset({'herb', 'sauce', 'tomato'}),
 frozenset({'herb', 'pasta', 'tomato'}),
 frozenset({'pasta', 'sauce', 'tomato'}),
 frozenset({'herb', 'pasta', 'sauce'}),
 frozenset({'and', 'herb', 'sauce', 'tomato'}),
 frozenset({'and', 'herb', 'pasta', 'tomato'}),
 frozenset({'and', 'pasta', 'sauce', 'tomato'}),
 frozenset({'and', 'herb', 'past

In [16]:
candidates(to_ingredient('knorr chicken flavor bouillon cube'))

[frozenset({'knorr'}),
 frozenset({'flavor'}),
 frozenset({'bouillon'}),
 frozenset({'chicken'}),
 frozenset({'cube'}),
 frozenset({'flavor', 'knorr'}),
 frozenset({'bouillon', 'knorr'}),
 frozenset({'chicken', 'knorr'}),
 frozenset({'cube', 'knorr'}),
 frozenset({'bouillon', 'flavor'}),
 frozenset({'chicken', 'flavor'}),
 frozenset({'cube', 'flavor'}),
 frozenset({'bouillon', 'chicken'}),
 frozenset({'bouillon', 'cube'}),
 frozenset({'chicken', 'cube'}),
 frozenset({'bouillon', 'flavor', 'knorr'}),
 frozenset({'chicken', 'flavor', 'knorr'}),
 frozenset({'cube', 'flavor', 'knorr'}),
 frozenset({'bouillon', 'chicken', 'knorr'}),
 frozenset({'bouillon', 'cube', 'knorr'}),
 frozenset({'chicken', 'cube', 'knorr'}),
 frozenset({'bouillon', 'chicken', 'flavor'}),
 frozenset({'bouillon', 'cube', 'flavor'}),
 frozenset({'chicken', 'cube', 'flavor'}),
 frozenset({'bouillon', 'chicken', 'cube'}),
 frozenset({'bouillon', 'chicken', 'flavor', 'knorr'}),
 frozenset({'bouillon', 'cube', 'flavor', 'k
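The number of candidates grows quickly with the length of the ingredient: an ingredient made of n distinct words has 2^n - 1 non-empty subsets. The sketch below checks this on a few names from the dataset; `count_candidates` is a hypothetical helper that mirrors the enumeration done by `candidates`.

```python
from itertools import combinations

def count_candidates(words):
    "Counts the non-empty subsets of a collection of distinct words."
    n = len(words)
    return sum(1 for size in range(1, n + 1) for _ in combinations(words, size))

for text in ['tomato sauce',
             'tomato and herb pasta sauce',
             'kraft mexican style shredded four cheese with a touch of philadelphia']:
    words = frozenset(text.split())
    # The enumeration matches the closed form 2**n - 1.
    print(len(words), count_candidates(words), 2 ** len(words) - 1)
```

This stays manageable for ingredient names of around ten words, but it is worth keeping in mind before applying the same idea to longer texts.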

The final step is to compute probabilities of candidate ingredients. This is done using a counter of ingredients: the higher the count of an ingredient, the higher its probability.

In [17]:
from collections import Counter

c = Counter(all_ingredients)

c.most_common(10)

[(frozenset({'salt'}), 18049),
 (frozenset({'oil', 'olive'}), 7972),
 (frozenset({'onions'}), 7972),
 (frozenset({'water'}), 7457),
 (frozenset({'garlic'}), 7380),
 (frozenset({'sugar'}), 6434),
 (frozenset({'cloves', 'garlic'}), 6237),
 (frozenset({'butter'}), 4848),
 (frozenset({'black', 'ground', 'pepper'}), 4785),
 (frozenset({'all-purpose', 'flour'}), 4632)]

Now we're all set to compute the best candidate for a given input.

First, let's build the probability evaluation for a possible ingredient using a defaultdict (so that asking for an ingredient that doesn't exist returns a default value instead of raising an error):

In [18]:
from collections import defaultdict
probability = defaultdict(lambda: 1, c.most_common())

Let's test the probability:

In [19]:
probability[to_ingredient('pasta and herb')]

1

In [20]:
probability[to_ingredient('tomato sauce')]

867

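Seems like what we expect: tomato sauce has a higher score than pasta and herb, which doesn't appear among our initial ingredients. Note that these values are raw counts rather than normalized probabilities, which makes no difference when picking the maximum.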
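Two details of this lookup are worth keeping in mind. Dividing the counts by their total would give actual relative frequencies without changing the ranking, and reading a missing key from a `defaultdict` stores that key as a side effect. The sketch below illustrates both on made-up counts; `relative_frequency` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Made-up counts standing in for the ingredient counter.
counts = Counter({frozenset({'tomato', 'sauce'}): 867, frozenset({'pasta'}): 133})

# Reading a missing key from a defaultdict inserts it with the default value.
probability = defaultdict(lambda: 1, counts)
size_before = len(probability)
probability[frozenset({'pasta', 'and', 'herb'})]
size_after = len(probability)
print(size_before, size_after)

# A Counter returns 0 for missing keys without storing them.
counts[frozenset({'pasta', 'and', 'herb'})]
print(len(counts))

def relative_frequency(ingredient):
    "Count divided by the total number of observations (hypothetical helper)."
    return counts[ingredient] / sum(counts.values())

print(relative_frequency(frozenset({'tomato', 'sauce'})))
```

For the argmax used here the side effect is harmless, but it does make the dictionary grow with every unseen candidate that gets looked up.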

Let's now write the function that yields the most probable replacement ingredient among all possible candidates.

In [21]:
def best_replacement(ingredient):
    "Computes best replacement ingredient for a given input."
    return max(candidates(ingredient), key=lambda c: probability[c])

Let's now look at some examples:

In [22]:
best_replacement(to_ingredient("tomato sauce"))

frozenset({'sauce', 'tomato'})

In [23]:
best_replacement(to_ingredient("pasta and herb"))

frozenset({'pasta'})

In [24]:
best_replacement(to_ingredient("kraft mexican style shredded four cheese with a touch of philadelphia"))

frozenset({'cheese'})

These examples all look good. What about the less frequent ingredients we had in our data?

In [25]:
def frozen_repr(fs):
    "Better representation for a frozenset of strings."
    return " ".join(fs)

In [26]:
for text in sorted(set(all_ingredients_text), key=len, reverse=True)[:50]:
    print("original: {}, better: {}".format(text, " ".join(best_replacement(to_ingredient(text)))))

original: pillsburyâ„¢ crescent recipe creationsâ® refrigerated seamless dough sheet, better: dough
original: kraft mexican style shredded four cheese with a touch of philadelphia, better: cheese
original: bertolli vineyard premium collect marinara with burgundi wine sauc, better: wine
original: hidden valleyâ® original ranch saladâ® dressing & seasoning mix, better: seasoning
original: hidden valleyâ® farmhouse originals italian with herbs dressing, better: herbs
original: hellmannã¢â‚¬â„¢ or best food canola cholesterol free mayonnais, better: canola
original: kraft shredded pepper jack cheese with a touch of philadelphia, better: pepper
original: condensed reduced fat reduced sodium cream of mushroom soup, better: cream
original: condensed reduced fat reduced sodium cream of chicken soup, better: chicken
original: i can't believ it' not butter! made with olive oil spread, better: olive oil
original: (10 oz.) frozen chopped spinach, thawed and squeezed dry, better: spinach
original: 

This is interesting: a lot of the ingredients get simplified nicely, which is exactly what we want.

However, this model fails to take into account one thing: if we leave out too many words in our replacement ingredient, the distance to the original ingredient increases. This is analogous to Peter Norvig's spell checker, where he first considers words that have only one modification, then two modifications, then three compared to the original word. Obviously this approach also has flaws, but let's try it on our ingredients.

# Building a slightly more elaborate model

The only thing we have to change is the way our candidates function works. Instead of generating all possibilities, it should return only the ones that exist in our vocabulary of recipes, with the smallest possible number of modifications. Here a modification can be thought of as leaving out one word.

In [27]:
def candidates_sorted(ingredient, vocabulary):
    "Returns candidate ingredients obtained from the original ingredient by removing words, largest number of words first."
    n = len(ingredient)
    for i in range(n - 1, 1, -1):
        possible = [frozenset(combi) for combi in itertools.combinations(ingredient, i) 
                    if frozenset(combi) in vocabulary]
        if len(possible) > 0:
            return possible
    return [ingredient]

We will define our vocabulary by the already existing counted ingredients:

In [28]:
vocabulary = dict(c.most_common())

In [29]:
list(vocabulary.keys())[:10]

[frozenset({'cheddar', 'mature'}),
 frozenset({'halibut'}),
 frozenset({'jeera', 'shahi'}),
 frozenset({'fat', 'low', 'natural', 'yoghurt'}),
 frozenset({'salt', 'seasoning'}),
 frozenset({'comfort', 'southern'}),
 frozenset({'grape', 'leaves'}),
 frozenset({'pasta', 'shells', 'wheat', 'whole'}),
 frozenset({'grain', 'mustard', 'whole'}),
 frozenset({'knorr', 'leek', 'mix', 'recip'})]

Let's test our function on a few examples:

In [30]:
candidates_sorted(to_ingredient("bottled clam juice"), vocabulary)

[frozenset({'clam', 'juice'})]

In [31]:
candidates_sorted(to_ingredient('knorr reduc sodium chicken flavor bouillon cube'), vocabulary)

[frozenset({'bouillon', 'chicken', 'flavor', 'knorr', 'reduc', 'sodium'})]

In [32]:
candidates_sorted(to_ingredient('hidden valley original ranch spicy ranch dressing'), vocabulary)

[frozenset({'dressing', 'ranch'})]

As we see, there are a couple of improvements: some words have been discarded, while keeping as many words as possible.

Let's now write a new function for computing the best replacement:

In [33]:
def best_replacement_sorted(ingredient, vocabulary):
    "Computes best replacement ingredient for a given input."
    return max(candidates_sorted(ingredient, vocabulary), key=lambda w: vocabulary[w])

And let's apply this to our lesser used ingredients:

In [34]:
for text in sorted(set(all_ingredients_text), key=len, reverse=True)[:50]:
    print("original: {}, better: {}".format(text, " ".join(best_replacement_sorted(to_ingredient(text), vocabulary))))

original: pillsburyâ„¢ crescent recipe creationsâ® refrigerated seamless dough sheet, better: refrigerated seamless crescent dough
original: kraft mexican style shredded four cheese with a touch of philadelphia, better: cheese shredded
original: bertolli vineyard premium collect marinara with burgundi wine sauc, better: marinara premium vineyard burgundi collect bertolli wine sauc with
original: hidden valleyâ® original ranch saladâ® dressing & seasoning mix, better: ranch dressing
original: hidden valleyâ® farmhouse originals italian with herbs dressing, better: herbs italian
original: hellmannã¢â‚¬â„¢ or best food canola cholesterol free mayonnais, better: mayonnais canola free food hellmannã¢â‚¬â„¢ or cholesterol best
original: kraft shredded pepper jack cheese with a touch of philadelphia, better: kraft shredded pepper cheese jack
original: condensed reduced fat reduced sodium cream of mushroom soup, better: cream sodium soup mushroom of fat reduced
original: condensed reduced fat 

This result is interesting: several of the refined labels are much more readable than the originals, while names with no known shorter form are left unchanged. Hopefully, this gives us better ingredient names, and in turn better classification. To check this, we'll train again on the dataset and see if we get some improvement compared to our previous model.

# Training a machine learning model

First, let's train the [previous model](http://flothesof.github.io/kaggle-whats-cooking-machine-learning.html):

In [35]:
df_train['all_ingredients'] = df_train['ingredients'].map(";".join)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(df_train['all_ingredients'].values)
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df_train.cuisine)
# sklearn.model_selection replaces sklearn.cross_validation, which was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Let's evaluate the accuracy of our model by doing a K-fold cross validation (training the model several times and checking the accuracy each time).

In [36]:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem
import numpy as np
def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator (shuffled, with a fixed seed)
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    # report the mean accuracy and its standard error over the K folds
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))

In [37]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
evaluate_cross_validation(logistic, X, y, 5)

[ 0.79145192  0.78868636  0.78252671  0.79044626  0.78124214]
Mean score: 0.787 (+/-0.002)
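For readers unfamiliar with K-fold cross validation, the sketch below shows the idea with the standard library only: the shuffled samples are split into K folds, and each fold is held out once for testing while the model is trained on the rest. This illustrates the principle and is not how scikit-learn implements `KFold`; the function `kfold_indices` is hypothetical.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    "Yields (train, test) index lists; each sample is in exactly one test fold."
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test = indices[fold::n_folds]          # every n_folds-th shuffled index
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

folds = list(kfold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print(sorted(test), len(train))
```

With 5 folds, each model is trained on 80% of the recipes and scored on the remaining 20%, which is what the five accuracies printed above correspond to.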


Now that we have an initial evaluation of our algorithm, let's feed it our better vectors. All we have to do is rebuild the X matrix using our new feature ingredients:

In [47]:
def improve_ingredients(ingredient_list):
    "Improves the list of ingredients given as input."
    better_ingredients = []
    for ingredient in ingredient_list:
        better_ingredients.append(" ".join(best_replacement_sorted(to_ingredient(ingredient.lower()), vocabulary)))
    return ";".join(better_ingredients)

In [48]:
df_train['better_ingredients'] = df_train['ingredients'].map(improve_ingredients)

In [53]:
df_train.head(10)

| | cuisine | id | ingredients | all_ingredients | better_ingredients |
|---|---|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce;black olives;grape tomatoes;ga... | romaine lettuce;olives black;grape tomatoes;ga... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour;ground pepper;salt;tomatoes;ground... | flour plain;pepper ground;salt;tomatoes;pepper... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs;pepper;salt;mayonaise;cooking oil;green c... | eggs;pepper;salt;mayonaise;oil cooking;green c... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water;vegetable oil;wheat;salt | water;oil vegetable;wheat;salt |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper;shallots;cornflour;cayenne pepper... | pepper black;shallots;cornflour;cayenne pepper... |
| 5 | jamaican | 6602 | [plain flour, sugar, butter, eggs, fresh ginge... | plain flour;sugar;butter;eggs;fresh ginger roo... | flour plain;sugar;butter;eggs;fresh ginger;sal... |
| 6 | spanish | 42779 | [olive oil, salt, medium shrimp, pepper, garli... | olive oil;salt;medium shrimp;pepper;garlic;cho... | olive oil;salt;shrimp medium;pepper;garlic;cho... |
| 7 | italian | 3735 | [sugar, pistachio nuts, white almond bark, flo... | sugar;pistachio nuts;white almond bark;flour;v... | sugar;nuts pistachio;bark almond white;flour;e... |
| 8 | mexican | 16903 | [olive oil, purple onion, fresh pineapple, por... | olive oil;purple onion;fresh pineapple;pork;po... | olive oil;purple onion;fresh pineapple;pork;pe... |
| 9 | italian | 12734 | [chopped tomatoes, fresh basil, garlic, extra-... | chopped tomatoes;fresh basil;garlic;extra-virg... | chopped tomatoes;fresh basil;garlic;olive oil;... |
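The shuffled word order in `better_ingredients` ("olives black" instead of "black olives") is harmless for the classifier, because a bag-of-words representation only counts tokens. The sketch below mimics this with the standard library. It assumes the default `CountVectorizer` tokenization (lowercasing, tokens of two or more word characters), and the helper `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

# Token pattern assumed to match CountVectorizer's default: words of 2+ characters.
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def bag_of_words(text):
    "Counts the tokens of a string, ignoring their order (hypothetical helper)."
    return Counter(token_pattern.findall(text.lower()))

original = 'romaine lettuce;black olives;grape tomatoes'
reordered = 'romaine lettuce;olives black;grape tomatoes'
print(bag_of_words(original) == bag_of_words(reordered))

# Dropping words, on the other hand, changes the features.
removed = bag_of_words('extra-virgin olive oil') - bag_of_words('olive oil')
print(removed)
```

In other words, the only effect of the simplification on the feature matrix is the removal of some word counts, such as "extra" and "virgin" in the last row of the table.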


Let's now see the performance of this new version of the ingredients.

In [63]:
X = cv.fit_transform(df_train['better_ingredients'].values)

In [52]:
evaluate_cross_validation(logistic, X, y, 5)

[ 0.78680075  0.7859208   0.7840352   0.78378378  0.7760875 ]
Mean score: 0.783 (+/-0.002)
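To compare the two runs, the sketch below recomputes the summary line from the fold accuracies printed above and looks at the fold-by-fold differences. It assumes both runs were evaluated on the same folds, since the iterator is built with the same `random_state` each time; `standard_error` is a hypothetical helper that matches the default behaviour of `scipy.stats.sem`.

```python
from math import sqrt
from statistics import mean, stdev

# Fold accuracies copied from the two cross-validation runs.
baseline = [0.79145192, 0.78868636, 0.78252671, 0.79044626, 0.78124214]
simplified = [0.78680075, 0.7859208, 0.7840352, 0.78378378, 0.7760875]

def standard_error(scores):
    "Standard error of the mean, using the sample standard deviation."
    return stdev(scores) / sqrt(len(scores))

for name, scores in [('baseline', baseline), ('simplified', simplified)]:
    print("{}: {:.3f} (+/-{:.3f})".format(name, mean(scores), standard_error(scores)))

# Assuming identical folds in both runs, scores can be compared pairwise.
differences = [b - s for b, s in zip(baseline, simplified)]
print("mean difference: {:.4f}".format(mean(differences)))
print("folds where the baseline wins:", sum(d > 0 for d in differences))
```

The gap between the two means is small, of the same order as the standard errors, so this comparison alone does not show a large effect in either direction.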


Well that's too bad... Our "clever" ingredients don't improve our cross-validation score (0.783 compared to 0.787 before). Still, let's try to submit a solution, given that the test data might have some other issues (unknown ingredients that our method can simplify...).

In [54]:
df_test = pd.read_json('test.json')

In [55]:
df_test['better_ingredients'] = df_test['ingredients'].map(improve_ingredients)

KeyError: frozenset({'japanese', 'greens'})

We get a KeyError, due to the absence of an item in our vocabulary. No problem, let's build a better vocabulary by also using the test dataset.
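The error comes from the fallback branch: when no shorter candidate is in the vocabulary, `candidates_sorted` returns the unseen ingredient itself, and `vocabulary[w]` then fails. An alternative that does not require the test data is to give unseen ingredients a count of zero. The sketch below is self-contained and uses a made-up vocabulary; the name `best_replacement_safe` is hypothetical.

```python
import itertools

def candidates_sorted(ingredient, vocabulary):
    "Same logic as above: largest known sub-ingredients first, else the input."
    for size in range(len(ingredient) - 1, 1, -1):
        possible = [frozenset(combi)
                    for combi in itertools.combinations(ingredient, size)
                    if frozenset(combi) in vocabulary]
        if possible:
            return possible
    return [ingredient]

def best_replacement_safe(ingredient, vocabulary):
    "Like best_replacement_sorted, but unseen ingredients count as 0."
    return max(candidates_sorted(ingredient, vocabulary),
               key=lambda w: vocabulary.get(w, 0))

# Made-up vocabulary that does not contain 'japanese greens'.
vocabulary = {frozenset({'clam', 'juice'}): 12, frozenset({'greens'}): 40}

unseen = best_replacement_safe(frozenset({'japanese', 'greens'}), vocabulary)
known = best_replacement_safe(frozenset({'bottled', 'clam', 'juice'}), vocabulary)
print(sorted(unseen))
print(sorted(known))
```

With this variant an unseen two-word ingredient is returned unchanged, which is also what happens once the test ingredients are added to the vocabulary.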

In [57]:
all_ingredients_text = []
for df in [df_train, df_test]:
    for ingredient_list in df.ingredients:
        all_ingredients_text += [ing.lower() for ing in ingredient_list]
all_ingredients = [to_ingredient(text) for text in all_ingredients_text]
c = Counter(all_ingredients)
vocabulary = dict(c.most_common())

Let's run the above line again to build our test set:

In [58]:
df_test['better_ingredients'] = df_test['ingredients'].map(improve_ingredients)

And now, let's build our feature matrix:

In [64]:
X_submit = cv.transform(df_test['better_ingredients'].values)

In [65]:
logistic.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [66]:
y_submit = logistic.predict(X_submit)

In [67]:
cuisines = [u'greek', u'southern_us', u'filipino', u'indian', u'jamaican',
       u'spanish', u'italian', u'mexican', u'chinese', u'british', u'thai',
       u'vietnamese', u'cajun_creole', u'brazilian', u'french',
       u'japanese', u'irish', u'korean', u'moroccan', u'russian']

In [72]:
with open("submission_better_words.csv", 'w') as f:
    f.write("id,cuisine\n")
    for i, cuisine in zip(df_test.id, enc.inverse_transform(y_submit)):
        f.write("{},{}\n".format(i, cuisine))

# Conclusion 

My submission (to the already closed contest) achieves a ranking of 618th, nothing to write home about. That's too bad: I had hoped it would somehow improve my score, even though the K-fold cross validation had shown slightly worse performance. Nevertheless, I think what is valuable here is that we have devised a way to simplify a text feature using simple probability. This technique might be useful in other machine learning settings where the data can't be fully trusted. As shown on the longest ingredient names found in the dataset, this method can turn a convoluted ingredient name into something shorter that still makes sense.