**Name:** WANG XI

**EID:** xwang258

**Name:** Yu Mingjie

**EID:** mingjieyu2

**Kaggle Competition:** What's cooking?

**Kaggle Team Name:** Group2

# CS4487 - Course Project
Due date: Nov 27, 2015 11:59pm HKT

## Goal
You can select your course project as _one_ of the following Kaggle competitions:
  1. [What's cooking?](https://www.kaggle.com/c/whats-cooking) - cooking ingredients to classify the type of cuisine.
  2. [Trip type classification](https://www.kaggle.com/c/walmart-recruiting-trip-type-classification) - from a shopping list, predict what type of shopping trip the customer is on (e.g., weekly groceries, clothes shopping).

## Groups
Group projects with at most 2 students are allowed.  To sign up for a group, go to Canvas and under "People", join one of the existing "Project Groups".  _For group projects, the project report must state the percentage contribution from each project member._

## Methodology
You are free to choose the methodology to solve the task.  In machine learning, it is important to use domain knowledge to help solve the problem.  Hence, instead of blindly applying the algorithms to the data you need to think about how to represent the data in a way that makes sense for the algorithm to solve the task. 


## Evaluation on Kaggle

Besides evaluating on the validation set, you need to submit your test results to Kaggle for evaluation.

## Project Presentation

Each project group needs to give a presentation at the end of the semester.  The presentation time is 10 minutes.  You _must_ give a presentation.

## What to hand in
You need to turn in the following things:

1. This ipynb file with your source code and documentation.
2. Your final submission file to Kaggle.
3. Presentation slides.

Files should be uploaded to "Course Project" on Canvas.


## Grading
The marks of the assignment are distributed as follows:
- 50% - Results using various feature representations, dimensionality reduction methods, classifiers, etc.
- 20% - Trying out feature representations (e.g. adding additional features, combining features from different sources) or methods not used in the tutorials.
- 15% - Quality of the written report.  More points for insightful observations and analysis.
- 15% - Project presentation.
<hr>

Contribution from each project member: WANG XI 50% and Yu Mingjie 50%.

In [36]:
%matplotlib inline
import IPython.core.display         
# setup output image format (Chrome works best)
IPython.core.display.set_matplotlib_formats("svg")
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from scipy import stats
import IPython.utils.warn as warn
random.seed(100)
import json
import csv
import time
import nltk
from nltk import word_tokenize   
from nltk.stem import WordNetLemmatizer

In [2]:
def load_cooking(fname):
    # load the cooking dataset
    # returns dictionary with data, target (cuisine type), item ids
    
    # the with-block closes the file even if parsing fails
    with open(fname, "r") as fp:
        dataj = json.load(fp)
    
    data = []
    ids  = []
    target = []
    for o in dataj:
        if 'cuisine' in o:   # test.json has no 'cuisine' field
            target.append(o['cuisine'])
        ids.append(o['id'])
        data.append(o['ingredients'])

    out = {"data": data, "id":ids}
    if len(target) > 0:
        out["target"] =  target
    return out


# write a kaggle submission file for "whats cooking"
def write_csv_kaggle_cooking(fname, ids, target):
    # header
    tmp = [['id', 'cuisine']]
    
    for i in range(len(ids)):
        # add a row (id and class prediction)
        tmp.append([ids[i], target[i]])
        
    # write CSV file
    f = open(fname, 'wb')
    writer = csv.writer(f)
    writer.writerows(tmp)
    f.close()
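
The helper above opens the file in 'wb' mode, which is what the Python 2 csv module expects. Under Python 3 the csv module instead wants a text-mode file opened with newline=''. A minimal sketch of the same writer for Python 3 follows; the name write_csv_kaggle_cooking_py3 and the two sample predictions are ours, not part of the original notebook.

```python
import csv
import os
import tempfile

def write_csv_kaggle_cooking_py3(fname, ids, target):
    # Python 3 variant: text mode with newline='' so that the csv module
    # controls the line endings itself
    with open(fname, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "cuisine"])   # header
        writer.writerows(zip(ids, target))   # one row per (id, prediction)

# usage: write two made-up predictions to a temporary file and read them back
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_csv_kaggle_cooking_py3(path, [101, 102], ["greek", "thai"])
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```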

In [3]:
# load the data
traindata = load_cooking("train.json")
testdata  = load_cooking("test.json")
print len(traindata['data'])
print len(testdata['data'])

39774
9944


In [4]:
# first training example
print traindata['data'][0]
# data is a list of ingredients
print traindata['target'][0]
# target is the cuisine

[u'romaine lettuce', u'black olives', u'grape tomatoes', u'garlic', u'pepper', u'purple onion', u'seasoning', u'garbanzo beans', u'feta cheese crumbles']
greek


In [5]:
# list of cuisines
classes = unique(traindata['target'])
print len(classes)
print classes

20
[u'brazilian' u'british' u'cajun_creole' u'chinese' u'filipino' u'french'
 u'greek' u'indian' u'irish' u'italian' u'jamaican' u'japanese' u'korean'
 u'mexican' u'moroccan' u'russian' u'southern_us' u'spanish' u'thai'
 u'vietnamese']


In [6]:
trainY = traindata['target']

We first transform traindata['data'] and testdata['data'] so that each recipe is a single string instead of a list of ingredients, which is the input format needed for further vectorization.

In [7]:
trainRecipe = []
for li in traindata['data']:
    s=""
    for x in li:
        s += " "+x
    trainRecipe.append(s)

In [8]:
testRecipe = []
for li in testdata['data']:
    s=""
    for x in li:
        s += " "+x
    testRecipe.append(s)

First we try CountVectorizer for feature extraction. With the default tokenizer alone, there are 3010 features.

In [9]:
tmp = feature_extraction.text.CountVectorizer()
trainXtmp = tmp.fit_transform(trainRecipe)
print len(tmp.get_feature_names())

3010


After removing English stop words (stop_words='english'), the number of features decreases to 2970.

In [10]:
tmp = feature_extraction.text.CountVectorizer(stop_words='english')
trainXtmp = tmp.fit_transform(trainRecipe)
testXtmp = tmp.transform(testRecipe)
print len(tmp.get_feature_names())

2970


We train a Bernoulli Naive Bayes model on these counts, choosing the smoothing parameter alpha by 5-fold cross-validation, and use it for prediction.

In [12]:
alphasb = logspace(-4,0,10)
avgscoresb = empty(len(alphasb))

for i,al in enumerate(alphasb):       
        bmodel = naive_bayes.BernoulliNB(alpha=al)
        myscoreb = cross_validation.cross_val_score(bmodel, trainXtmp, trainY, cv=5)
        avgscoresb[i] = mean(myscoreb)

In [13]:

tStart = time.time()
time.sleep(2)

bestib = argmax(avgscoresb)
bestab = alphasb[bestib]

print "max acc of cross-validation =", avgscoresb[bestib]

bmodel = naive_bayes.BernoulliNB(alpha=bestab)
bmodel.fit(trainXtmp, trainY)
tEnd = time.time()

print "It takes %f sec" % (tEnd - tStart)

predTrainYb = bmodel.predict(trainXtmp)
predTestYb = bmodel.predict(testXtmp)

print "acc of bernoulli = ", mean(predTrainYb == trainY)

max acc of cross-validation = 0.718635892965
It takes 2.203000 sec
acc of bernoulli =  0.740760295671


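The Bernoulli model takes about 2.2 seconds, of which 2 seconds are the time.sleep(2) call inside the timed region. Its accuracy on the training data is 0.74076 (cross-validation accuracy 0.71864), and the testing accuracy on Kaggle is 0.68142.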
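
The timed region in the cell above contains a time.sleep(2) call, so about 2 of the reported seconds are not training time. A small sketch of a timing helper that measures only the wrapped call is given here. The helper name timed is hypothetical, and the workload is a stand-in.

```python
import time

def timed(func, *args, **kwargs):
    # run func(*args, **kwargs) and return (result, elapsed seconds)
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start

# usage with a stand-in workload; in the notebook this could be
# timed(bmodel.fit, trainXtmp, trainY)
total, seconds = timed(sum, range(1000000))
print("sum = %d, took %f sec" % (total, seconds))
```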

Then, we try to use TF-IDF to extract features and other models to classify.
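
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on toy recipes. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as the TfidfVectorizer default. The function name tfidf and the three recipes are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))       # document frequency
    idf = {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}
    out = []
    for doc in docs:
        w = {t: c * idf[t] for t, c in Counter(doc).items()}   # tf * idf
        norm = math.sqrt(sum(v * v for v in w.values()))        # L2 norm of the row
        out.append({t: v / norm for t, v in w.items()})
    return out

# toy recipes (made up for illustration)
docs = [["garlic", "feta", "olives"],
        ["garlic", "soy", "ginger"],
        ["garlic", "tomato", "basil"]]
weights = tfidf(docs)
print(sorted(weights[0].items()))
```

Since "garlic" occurs in every toy recipe, it receives a smaller weight than "feta" or "olives", which occur in only one.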

In [11]:
# TF-IDF representation
# (For TF, pass use_idf=False)
from sklearn.feature_extraction.text import TfidfVectorizer
tf_trans = feature_extraction.text.TfidfVectorizer(use_idf=True, stop_words='english')
# setup the TF-IDF representation, and transform the training set
trainX = tf_trans.fit_transform(trainRecipe) # transform the test set
testX = tf_trans.transform(testRecipe)

In [13]:
print trainX.shape

(39774, 2970)


In [14]:
trainXrf, testXrf, trainYrf, testYrf = \
            cross_validation.train_test_split(trainX, trainY,
            train_size=0.5, test_size=0.5, random_state=4487)
    
print trainXrf.shape 
print testXrf.shape

(19887, 2970)
(19887, 2970)


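Some ingredients are more informative than others, so we weight the terms with TfidfVectorizer and use a Multinomial Naive Bayes model. The grid search below sets use_idf=False (plain TF) and tunes alpha and the vocabulary size (max_features) by 5-fold cross-validation.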

In [15]:
alphas = logspace(-1,0,30)
vocas = range(1000,3000,200)
avgscores = empty((len(alphas), len(vocas)))

for i,al in enumerate(alphas):
    for j,voca in enumerate(vocas):     
        tfvect = feature_extraction.text.TfidfVectorizer(use_idf=False, stop_words = 'english', max_features=voca)
        trainXtf = tfvect.fit_transform(trainRecipe)
        mmodel_tf = naive_bayes.MultinomialNB(alpha=al)      
        myscore = cross_validation.cross_val_score(mmodel_tf, trainXtf, trainY, cv=5)
        avgscores[i,j] = mean(myscore)

In [17]:
import time
tStart = time.time()

time.sleep(2)

besti = argmax(avgscores)

(bestia, bestiv) = unravel_index(besti, avgscores.shape)
besta = alphas[bestia]
bestv = vocas[bestiv]
print "vocabulary size = ", bestv
print "max acc of cross-validation = ", avgscores[bestia,bestiv]

tfvect = feature_extraction.text.TfidfVectorizer(use_idf=False, max_features=bestv)
trainXtf = tfvect.fit_transform(trainRecipe)
    
mmodel_tf = naive_bayes.MultinomialNB(alpha=besta)
mmodel_tf.fit(trainXrf, trainYrf)

tEnd = time.time()

print "It cost %f sec" % (tEnd - tStart)
print tEnd - tStart

predtrainYtf = mmodel_tf.predict(testXrf)


print "acc of tf-idf = ", mean(predtrainYtf==testYrf)

vocabulary size =  1800
max acc of cross-validation =  0.710112816369
It cost 3.131339 sec
3.13133907318
acc of tf-idf =  0.727962990899


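The Multinomial NB model takes only about 3.1 seconds (including the 2-second sleep), much less than the random forest classifier tried below. Its accuracy on the held-out half of the training data is not high (72.80%), and the situation on Kaggle is similar: the accuracy is 70.746%. Note that the final model is fit on the TF-IDF split (trainXrf), not on the re-fitted trainXtf, so the selected vocabulary size does not affect this held-out accuracy.
We go on to try other classifiers.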

To reduce the dimensionality, we try NMF to see whether it gives higher accuracy.

In [19]:
nmf = decomposition.NMF(n_components=50)
W = nmf.fit_transform(trainXrf)
Wt = nmf.transform(testXrf)

In [20]:
mmodel_tf2 = naive_bayes.MultinomialNB(alpha=besta)
mmodel_tf2.fit(W, trainYrf)

predtestYtf = mmodel_tf2.predict(Wt)
print "acc of tf-idf = ", mean(predtestYtf==testYrf)

acc of tf-idf =  0.279780761301


With NMF, the accuracy drops to only about 28%, compared with 72.8% for the Multinomial NB classifier without dimensionality reduction above. Important information may have been lost in the reduction, so we gave up on dimensionality reduction.

In [18]:
import time
tStart = time.time()

time.sleep(2)

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 1000) 
result = forest.fit(trainXrf, trainYrf)

tEnd = time.time()
print "It cost %f sec" % (tEnd - tStart)
print tEnd - tStart

predTrainRF = forest.predict(testXrf)
print "acc of Random Forest = ", mean(predTrainRF == testYrf)

It cost 347.799201 sec
347.799201012
acc of Random Forest =  0.738522652989


The Random Forest classifier takes about 348 seconds to train on half of the training data, which is time-consuming. Training on the whole training set took about 100 seconds more. Its accuracy on Kaggle is about 76.78%. Although the result is not bad, the efficiency of the Random Forest classifier is barely satisfactory.

In [28]:
import time
tStart = time.time()

time.sleep(2)

clf = svm.LinearSVC(C=0.8)
clf.fit(trainXrf, trainYrf)

tEnd = time.time()

print "It cost %f sec" % (tEnd - tStart)
print tEnd - tStart

predtrainYtf = clf.predict(testXrf)

print "acc of tf-idf = ", mean(predtrainYtf==testYrf)

It cost 3.388000 sec
3.38800001144
acc of tf-idf =  0.777140845779


Linear SVC is also fast (about 3.4 seconds including the 2-second sleep), comparable to Multinomial NB. Its accuracy on the held-out half is 77.71%, and after training LinearSVC on the trainXtf data the accuracy on Kaggle is 78.268%. This is higher than the other classifiers tried here, while needing only about 3 seconds.

In [29]:
import time
tStart = time.time()

time.sleep(2)

clf = tree.DecisionTreeClassifier()
clf.fit(trainXrf, trainYrf)

tEnd = time.time()

print "It cost %f sec" % (tEnd - tStart)
print tEnd - tStart

predtestYt=clf.predict(testXrf)
print "acc of decision tree = ", mean(predtestYt == testYrf)

It cost 13.354000 sec
13.3539998531
acc of decision tree =  0.594458691608


With the Decision Tree classifier we got an accuracy of 59% on Kaggle, similar to the held-out accuracy here (59.45%). This is much lower than the other classifiers, and training is slower (about 13.35 seconds) than for the Multinomial NB classifier or Linear SVC.

In [26]:
import time
tStart = time.time()

time.sleep(2)

logreg = linear_model.LogisticRegression(C=100) 
logreg.fit(trainXrf, trainYrf)

tEnd = time.time()

print "It cost %f sec" % (tEnd - tStart)
print tEnd - tStart

# predict from the model
predY = logreg.predict(testXrf)
# calculate accuracy
Ncorrect = sum(testYrf==predY) 
acc = mean(testYrf==predY) 
print "accuracy=", acc

It cost 12.666446 sec
12.6664459705
accuracy= 0.764620103585


Logistic regression reaches 76.46% on the held-out half in about 12.7 seconds. To sum up, we choose to focus on logistic regression for further exploration.

Noticing that the ingredients contain some French (accented) characters, digits, and punctuation, we rewrite the generation of the recipe strings to remove or replace them.

In [28]:
import re
trainRecipe = []
for li in traindata['data']:
    s=""
    for x in li:
        x=re.sub('-','_',x)
        s += ' '+re.sub(r'[^a-zA-Z_ ]','',x) # keep the underscore
    trainRecipe.append(s)

In [29]:
testRecipe = []
for li in testdata['data']:
    s=""
    for x in li:
        x=re.sub('-','_',x)
        s += ' '+re.sub(r'[^a-zA-Z_ ]','',x) # keep the underscore
    testRecipe.append(s)
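
The effect of the two substitutions can be checked on a few sample ingredient strings. The sketch below wraps them in a helper called clean (our name). It also shows a hypothetical variant, clean_keep_accents, that maps accented letters to ASCII before filtering. The original notebook does not do this, and the sample strings are made up.

```python
import re
import unicodedata

def clean(ingredient):
    # same two substitutions as in the cells above
    x = re.sub('-', '_', ingredient)
    return re.sub(r'[^a-zA-Z_ ]', '', x)

def clean_keep_accents(ingredient):
    # hypothetical variant: decompose accented letters and drop the accents,
    # so that the base letters survive the filter
    x = unicodedata.normalize('NFKD', ingredient)
    x = x.encode('ascii', 'ignore').decode('ascii')
    return clean(x)

samples = [u"all-purpose flour", u"1% low-fat milk", u"cr\u00e8me fra\u00eeche"]
for s in samples:
    print("%r -> %r / %r" % (s, clean(s), clean_keep_accents(s)))
```

With the original substitutions an accented letter is removed entirely rather than replaced by its base letter, so the same ingredient spelled with and without accents ends up as two different tokens.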

We use TfidfVectorizer to vectorize the clean trainRecipe data and logistic regression to train it.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfvect = TfidfVectorizer(stop_words='english')
trainXtf = tfvect.fit_transform(trainRecipe)
testXtf = tfvect.transform(testRecipe)
logreg = linear_model.LogisticRegressionCV(Cs=logspace(1,10,10), cv=5) 

In [32]:
logreg.fit(trainXtf, trainY)

predTrainYlr = logreg.predict(trainXtf)
predTestYlr = logreg.predict(testXtf)

print "acc of logistic regression = ", mean(predTrainYlr == trainY)

acc of logistic regression =  0.858475385931


The training accuracy is 85.85%. The final accuracy on Kaggle is 78.872%.

In [33]:
print tfvect.get_feature_names()


[u'10', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'20', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'38', u'40', u'43', u'49', u'52', u'59', u'65', u'aai', u'abalone', u'abbamele', u'absinthe', u'abura', u'acai', u'accent', u'accompaniment', u'achiote', u'acid', u'acini', u'ackee', u'acorn', u'active', u'added', u'adobo', u'adzuki', u'agar', u'agave', u'age', u'aged', u'ahi', u'aioli', u'ajinomoto', u'ajwain', u'aka', u'alaskan', u'albacore', u'alcohol', u'ale', u'aleppo', u'alexia', u'alfalfa', u'alfredo', u'all_purpose', u'allspice', u'almond', u'almondmilk', u'almonds', u'aloe', u'alphabet', u'alum', u'amaranth', u'amarena', u'amaretti', u'amaretto', u'amba', u'amber', u'amberjack', u'amchur', u'america', u'american', u'aminos', u'ammonium', u'amontillado', u'ampalaya', u'anaheim', u'anasazi', u'ancho', u'anchovies', u'anchovy', u'andouille', u'anejo', u'angel', u'anglaise', u'angled', u'angos

We can see that singular and plural forms such as "banana" and "bananas" appear as separate features, so we use lemmatization to merge such redundant features.

In [34]:
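_lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def tokenize_and_lemmatizer(text):
    # split the recipe string into tokens, then reduce each token to its lemma
    tokens = word_tokenize(text)
    return [_lemmatizer.lemmatize(t) for t in tokens]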

In [37]:
tfvect = TfidfVectorizer(stop_words='english',tokenizer=tokenize_and_lemmatizer)
trainXtf = tfvect.fit_transform(trainRecipe)
testXtf = tfvect.transform(testRecipe)
logreg = linear_model.LogisticRegressionCV(Cs=logspace(1,10,10), cv=5) 

In [38]:
logreg.fit(trainXtf, trainY)

predTrainYlr = logreg.predict(trainXtf)
predTestYlr = logreg.predict(testXtf)

print "acc of logistic regression = ", mean(predTrainYlr == trainY)

acc of logistic regression =  0.853396691306


The training accuracy is slightly lower (85.34%), but the testing accuracy on Kaggle increases slightly to 78.982%.

So far we have used single words as features, but an ingredient may consist of several words. We therefore modify the preprocessing so that each ingredient becomes a single token, and check the result.

In [40]:
import re
trainRecipe = []
for li in traindata['data']:
    s=""
    for x in li:
        x=re.sub(' ','',x)
        s += ' '+re.sub(r'[^a-zA-Z_ ]','',x) # keep the underscore
    trainRecipe.append(s)

In [46]:
import re
testRecipe = []
for li in testdata['data']:
    s=""
    for x in li:
        x=re.sub(' ','',x)
        s += ' '+re.sub(r'[^a-zA-Z_ ]','',x) # keep the underscore
    testRecipe.append(s)

In [41]:
print trainRecipe[0]

 romainelettuce blackolives grapetomatoes garlic pepper purpleonion seasoning garbanzobeans fetacheesecrumbles


The output shows that the words of each ingredient are concatenated, so the tokenizer will not split them.
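
The trade-off between the two representations can be seen on two toy recipes. In the sketch below, word_tokens is a simplified stand-in for the earlier word-level features (plain whitespace splitting, not the vectorizer's tokenizer), and ingredient_tokens repeats the substitutions from the cells above. Both function names and both recipes are made up for illustration.

```python
import re

def word_tokens(ingredients):
    # one token per word (simplified: split on whitespace)
    return [w for ing in ingredients for w in ing.split()]

def ingredient_tokens(ingredients):
    # one token per ingredient: drop spaces, then drop everything
    # except letters and underscores
    return [re.sub(r'[^a-zA-Z_ ]', '', re.sub(' ', '', ing))
            for ing in ingredients]

a = ["black olives", "feta cheese crumbles"]
b = ["black pepper", "soy sauce"]
shared_words = set(word_tokens(a)) & set(word_tokens(b))
shared_ingredients = set(ingredient_tokens(a)) & set(ingredient_tokens(b))
print(sorted(shared_words))
print(sorted(shared_ingredients))
```

Word-level features let unrelated ingredients share a token such as "black". Whole-ingredient tokens avoid that, but they no longer share anything between related ingredients such as "feta cheese" and "feta cheese crumbles".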

In [42]:
tfvect = TfidfVectorizer(stop_words='english',tokenizer=tokenize_and_lemmatizer)
trainXtf = tfvect.fit_transform(trainRecipe)
testXtf = tfvect.transform(testRecipe)
logreg = linear_model.LogisticRegressionCV(Cs=logspace(1,10,10), cv=5) 

In [43]:
logreg.fit(trainXtf, trainY)

predTrainYlr = logreg.predict(trainXtf)
predTestYlr = logreg.predict(testXtf)

print "acc of logistic regression = ", mean(predTrainYlr == trainY)

acc of logistic regression =  0.900060340926


The training accuracy rises to 90.01%, but the accuracy on Kaggle is 78.831%, slightly lower than before.

We have also tried different feature extraction settings by varying min_df and ngram_range. min_df removes features that occur in too few recipes. Since ingredients consist of different numbers of words, combinations of adjacent words can be meaningful, so we also try different values of ngram_range.

(1). The testing accuracy on Kaggle is 0.76016.

tfvect = TfidfVectorizer(stop_words='english',ngram_range=(1,6), min_df = 0.00067, tokenizer=tokenize_and_lemmatizer)

(2). Setting only min_df=0.001, the result on Kaggle is 0.78459.

tfvect = TfidfVectorizer(stop_words='english',min_df = 0.001, tokenizer=tokenize_and_lemmatizer)

(3). Setting only ngram_range=(1,2), the result on Kaggle is 0.78600.

tfvect = TfidfVectorizer(stop_words='english',ngram_range=(1,2), tokenizer=tokenize_and_lemmatizer)

The best result (78.982% on Kaggle) is still from logistic regression using TF-IDF with tokenize_and_lemmatizer and the default min_df and ngram_range.
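
As a rough illustration of what these two parameters do, the sketch below builds word n-grams by hand and converts a fractional min_df into a document count. It assumes scikit-learn's documented behavior that a float min_df is a proportion of documents and that terms with a strictly lower document frequency are dropped. The helper names word_ngrams and min_doc_count are ours.

```python
import math

def word_ngrams(tokens, lo, hi):
    # all contiguous word n-grams with lo <= n <= hi, joined by spaces
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def min_doc_count(min_df, n_docs):
    # smallest number of documents a term must appear in to be kept
    return int(math.ceil(min_df * n_docs))

bigrams = word_ngrams(["feta", "cheese", "crumbles"], 1, 2)
count_a = min_doc_count(0.001, 39774)     # setting (2), 39774 training recipes
count_b = min_doc_count(0.00067, 39774)   # setting (1)
print(bigrams)
print("%d %d" % (count_a, count_b))
```

In the notebook the n-grams are taken over the whole recipe string, so they can also span two neighbouring ingredients.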


In [48]:
write_csv_kaggle_cooking("kaggle_cooking_test.csv", testdata["id"], predTestYlr)

FURTHER THINKING

● Insight: some special ingredients can directly determine the cuisine.

           E.g. Shanghai -> China, Kimchi -> Korea

● Each cuisine has its own specific set of ingredients.

● How can we predict each cuisine separately and merge the predictions at the end?

● Feature union and averaging model predictions.
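
One possible form of averaging model predictions is sketched here: average the class probabilities of several models and pick the class with the highest mean. This was not run in the notebook. The function name average_predictions, the two models, and all probabilities are made up. In practice the matrices could come from predict_proba of classifiers that provide it.

```python
def average_predictions(prob_list, classes):
    # prob_list: one matrix per model, each a list of rows of class probabilities
    n_models = float(len(prob_list))
    labels = []
    for rows in zip(*prob_list):                       # same recipe across models
        avg = [sum(col) / n_models for col in zip(*rows)]
        labels.append(classes[avg.index(max(avg))])    # class with highest mean
    return labels

classes = ["greek", "italian", "thai"]
# made-up probabilities from two hypothetical models for two recipes
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
merged = average_predictions([model_a, model_b], classes)
print(merged)
```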