# What's Cooking - 1


aka first attempt at the Kaggle 'What's Cooking?' competition. 

## JSON
First I need to learn to read in the data, which is in JSON format. What does that mean? 

JSON stands for JavaScript Object Notation.

JSON is an open-standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs.

In [176]:
import json as json

In [177]:
data = json.loads({'train.json'})

TypeError: expected string or buffer

In [178]:
help(json)

Help on package json:

NAME
    json

FILE
    /Users/Elizabeth/anaconda/lib/python2.7/json/__init__.py

MODULE DOCS
    http://docs.python.org/library/json

DESCRIPTION
    JSON (JavaScript Object Notation) <http://json.org> is a subset of
    JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data
    interchange format.
    
    :mod:`json` exposes an API familiar to users of the standard library
    :mod:`marshal` and :mod:`pickle` modules. It is the externally maintained
    version of the :mod:`json` library contained in Python 2.6, but maintains
    compatibility with Python 2.4 and Python 2.5 and (currently) has
    significant performance advantages, even without using the optional C
    extension for speedups.
    
    Encoding basic Python object hierarchies::
    
        >>> import json
        >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
        '["foo", {"bar": ["baz", null, 1.0, 2]}]'
        >>> print json.dumps("\"foo\bar")
        "\"foo\bar"
 

In [179]:
class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, complex):
            return [obj.real, obj.imag]
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)

json.dumps(2 + 1j, cls=ComplexEncoder)
ComplexEncoder().encode(2 + 1j)
list(ComplexEncoder().iterencode(2 + 1j))

# the following line I tried, but it didn't work
# data = json.loads({'train.json'})

['[2.0', ', 1.0', ']']

In [180]:
data = json.load(open('train.json'))
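
For the record, the earlier `TypeError` came from handing `json.loads` something that wasn't a string: `loads` parses JSON text, while `load` reads from an open file object. A minimal sketch of the difference, using a made-up one-recipe sample (not the real `train.json`) and an in-memory `io.StringIO` standing in for the file:

```python
import io
import json

# Hypothetical one-recipe sample, shaped like the entries in train.json
sample_text = u'[{"id": 1, "cuisine": "greek", "ingredients": ["feta", "olives"]}]'

# json.loads parses a *string* of JSON text
from_string = json.loads(sample_text)

# json.load reads from a file-like object; StringIO stands in for open('train.json')
from_file = json.load(io.StringIO(sample_text))

print(from_string == from_file)   # True
print(from_string[0]['cuisine'])  # greek
```

With a real file, `with open('train.json') as f: data = json.load(f)` does the same job and also closes the file afterwards.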

In [181]:
print data[0]

{u'cuisine': u'greek', u'id': 10259, u'ingredients': [u'romaine lettuce', u'black olives', u'grape tomatoes', u'garlic', u'pepper', u'purple onion', u'seasoning', u'garbanzo beans', u'feta cheese crumbles']}


In [182]:
print data[0:2]

[{u'cuisine': u'greek', u'id': 10259, u'ingredients': [u'romaine lettuce', u'black olives', u'grape tomatoes', u'garlic', u'pepper', u'purple onion', u'seasoning', u'garbanzo beans', u'feta cheese crumbles']}, {u'cuisine': u'southern_us', u'id': 25693, u'ingredients': [u'plain flour', u'ground pepper', u'salt', u'tomatoes', u'ground black pepper', u'thyme', u'eggs', u'green tomatoes', u'yellow corn meal', u'milk', u'vegetable oil']}]


In [183]:
print data[0]['cuisine']

greek


In [184]:
print data[0]['id']

10259


In [185]:
print data[0]['ingredients']

[u'romaine lettuce', u'black olives', u'grape tomatoes', u'garlic', u'pepper', u'purple onion', u'seasoning', u'garbanzo beans', u'feta cheese crumbles']


In [186]:
print data[0]['ingredients'][0]

romaine lettuce


In [187]:
print data['cuisine']

TypeError: list indices must be integers, not str

In [188]:
print data[0::]['cuisine']

TypeError: list indices must be integers, not str

In [189]:
print data[0:2]['cuisine']

TypeError: list indices must be integers, not str

In [190]:
type(data)

list

In [191]:
type(data[0])

dict

In [192]:
print type(data[0]['ingredients'])

len(data)

<type 'list'>


39774
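
Those three `TypeError`s all have the same cause: `data` is a list, so it only accepts integer indices or slices, and slicing a list just gives another list. To get a field out of every recipe, each dict has to be visited in turn. A small sketch with two invented records shaped like the real ones:

```python
# Two made-up recipes in the same shape as the entries of `data`
sample = [
    {'id': 1, 'cuisine': 'greek', 'ingredients': ['feta', 'olives']},
    {'id': 2, 'cuisine': 'thai', 'ingredients': ['lime', 'fish sauce']},
]

# Visit each dict and pull out one field
cuisines = [recipe['cuisine'] for recipe in sample]
print(cuisines)  # ['greek', 'thai']
```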

## Ok so we have a list of dicts of lists ...

I'm going to create a for loop, iterate through each item of the list, and gather a list of all of the ingredients. 

In [193]:
# my research reveals that sets (like set()) filter out duplicates, and lists don't. I'll make a set
full_ingredient_set = set()

count = 0
for x in xrange (0,len(data)):
    for y in xrange (0,len(data[x]['ingredients'])):
        full_ingredient_set.add(data[x]['ingredients'][y])
        count += 1

In [194]:
print len(full_ingredient_set)
print count

6714
428275
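
The same tally can be written without any index arithmetic by looping over the recipes directly and letting `set.update` add a whole ingredient list at once. A sketch on a toy list (the two recipes here are made up):

```python
# Made-up recipes; only the 'ingredients' key matters for this sketch
sample = [
    {'ingredients': ['salt', 'garlic']},
    {'ingredients': ['salt', 'feta', 'olives']},
]

ingredient_set = set()
count = 0
for recipe in sample:
    # update() adds every item of the list, dropping duplicates
    ingredient_set.update(recipe['ingredients'])
    count += len(recipe['ingredients'])

print(len(ingredient_set))  # 4
print(count)                # 5
```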


# Great! Now we have a preliminary ingredient set

In [195]:
print full_ingredient_set[0:20]

TypeError: 'set' object has no attribute '__getitem__'

In [196]:
full_ingredient_list = list(full_ingredient_set)

In [197]:
print full_ingredient_list[0:20]

[u'low-sodium fat-free chicken broth', u'sweetened coconut', u'baking chocolate', u'egg roll wrappers', u'bottled low sodium salsa', u'vegan parmesan cheese', u'clam sauce', u'mahlab', u'(10 oz.) frozen chopped spinach, thawed and squeezed dry', u'figs', u'caramels', u'broiler', u'jalapeno chilies', u'(15 oz.) refried beans', u'brioche buns', u'broccoli romanesco', u'flaked oats', u'anise extract', u'whole wheat pastry flour', u'ravva']
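
Sets have no order, which is why slicing one fails and why the list version above comes out in an arbitrary order. If a repeatable preview is wanted, sorting first is one option. A quick sketch on a made-up set:

```python
# Made-up ingredient set
ingredients = set(['salt', 'garlic', 'feta', 'olives'])

# sorted() returns a list, which can be sliced
preview = sorted(ingredients)[:2]
print(preview)  # ['feta', 'garlic']
```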


## Next let's get a list of the cuisine types

In [198]:
cuisine_set = set()

In [199]:
for x in xrange (0,len(data)):
    cuisine_set.add(data[x]['cuisine'])
    
print len(cuisine_set)

20


In [200]:
print cuisine_set

set([u'irish', u'mexican', u'chinese', u'filipino', u'vietnamese', u'moroccan', u'brazilian', u'japanese', u'british', u'greek', u'indian', u'jamaican', u'french', u'spanish', u'russian', u'cajun_creole', u'thai', u'southern_us', u'korean', u'italian'])


In [201]:
cuisine_list = list(cuisine_set)

In [202]:
print cuisine_list

[u'irish', u'mexican', u'chinese', u'filipino', u'vietnamese', u'moroccan', u'brazilian', u'japanese', u'british', u'greek', u'indian', u'jamaican', u'french', u'spanish', u'russian', u'cajun_creole', u'thai', u'southern_us', u'korean', u'italian']


### What are the 'u's for?



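Not sure at first. It turns out the `u` prefix marks a unicode string in Python 2, and `json` returns all text as unicode. It doesn't show up when a single string is printed, only when a list or dict is printed.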

## General data exploration: unique words

I want to see first if there are any words that don't overlap with words in any other list for another cuisine type, and then look at those words to see if they make sense

In [203]:
sets_of_ingredients = list(range(len(cuisine_list)))
count = 0
for x in xrange (0,len(data)):
    for y in xrange (0, len(cuisine_list)):
        sets_of_ingredients[y] = set()
        for z in xrange (0,len(data[x]['ingredients'])):
            sets_of_ingredients[y].add(data[x]['ingredients'][z])

In [204]:
unique_sets_ingredients = sets_of_ingredients.copy()

AttributeError: 'list' object has no attribute 'copy'

Well, that didn't work. Let's try this copy.deepcopy() function. Do we need to import copy first? 

In [205]:
unique_sets_ingredients = copy.deepcopy(sets_of_ingredients)

In [206]:
import copy as copy #seems like yes

In [207]:
unique_sets_ingredients = copy.deepcopy(sets_of_ingredients)
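
Why `deepcopy` and not a plain copy: a shallow copy of the outer list (for example `list(x)` or `x[:]`, which is what Python 2 offers in place of `list.copy()`) would still share the inner sets with the original. A small sketch of the difference, using toy sets:

```python
import copy

# Toy stand-in for a list of ingredient sets
original = [set(['a']), set(['b'])]

shallow = list(original)        # new outer list, same inner sets
deep = copy.deepcopy(original)  # new outer list and new inner sets

original[0].add('z')

print('z' in shallow[0])  # True: the inner set is shared
print('z' in deep[0])     # False: the deep copy is independent
```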

In [208]:
# So yeah, my plan is to go through each list, then each other list, 
# and if an item from one is in another, then delete it from both
#
# I'll use s.difference(t), which gives the elements of s that aren't in t. 

# also I'm tired of all of these calling len deals, so I'll come up with a new 
# variable for the number of cuisines

n_cuisine = len(cuisine_set)

for x in range (0,n_cuisine):
    for y in range (0, n_cuisine):
        if y != x:
            unique_sets_ingredients[x] = sets_of_ingredients[x].difference(sets_of_ingredients[y])

In [209]:
# Check if it worked (do we have smaller ingredient sets for each cuisine?)
for x in range (0, n_cuisine):
    print len(unique_sets_ingredients[x])
    print cuisine_list[x]

0
irish
0
mexican
0
chinese
0
filipino
0
vietnamese
0
moroccan
0
brazilian
0
japanese
0
british
0
greek
0
indian
0
jamaican
0
french
0
spanish
0
russian
0
cajun_creole
0
thai
0
southern_us
0
korean
0
italian


## Ok, it's hard to say if it worked or not. 
Maybe every ingredient was on there at least once? Or I easily could have messed up

In [210]:
# Test by adding in a nonsensical string as an ingredient to test
unique_sets_ingredients_test = copy.deepcopy(sets_of_ingredients)
unique_sets_ingredients_test[0].add('thisisatest')

for x in range (0,n_cuisine):
    for y in range (0, n_cuisine):
        if y != x:
            unique_sets_ingredients_test[x] = \
                unique_sets_ingredients_test[x].difference(unique_sets_ingredients_test[y])

In [211]:
# Check if it worked (is my test string in there?)
unique_sets_ingredients_test[0]

{'thisisatest'}

ok, seems to have worked

## Unique ingredients sets will not work, then. Bummer. New strategy!
Let's look at the lists individually and see what we're working with

In [212]:
print sets_of_ingredients[0]

set([u'ground black pepper', u'salt', u'dried oregano', u'celery', u'garlic', u'chopped cilantro fresh', u'white sugar', u'jalapeno chilies', u'green chile', u'green bell pepper', u'onions', u'roma tomatoes'])


## Wait - I think I messed up

The above looks like the ingredient list of just one recipe. I think I just looked if the first twenty recipes had any ingredients that the other ones didn't have. I don't know if I ever consolidated the lists for each cuisine type. In fact, I know I didn't. 

In [213]:
print sets_of_ingredients[1]

set([u'ground black pepper', u'salt', u'dried oregano', u'celery', u'garlic', u'chopped cilantro fresh', u'white sugar', u'jalapeno chilies', u'green chile', u'green bell pepper', u'onions', u'roma tomatoes'])


In [214]:
# What is this list even? 
print sets_of_ingredients[2]

set([u'ground black pepper', u'salt', u'dried oregano', u'celery', u'garlic', u'chopped cilantro fresh', u'white sugar', u'jalapeno chilies', u'green chile', u'green bell pepper', u'onions', u'roma tomatoes'])


## Just a repeat of the same list
I obviously messed up somewhere. Just need to find where. 

Ha, just. As if. Step 1 is find where, then I'll be back at square 1 essentially. 

In [215]:
print sets_of_ingredients[20]

IndexError: list index out of range

Yeah, this list is messed up. 

When did I create it? - started with sets_of_ingredients

In [216]:
# The following code is what I originally ran to create sets_of_ingredients

#
# sets_of_ingredients = list(range(len(cuisine_list)))
# count = 0
# for x in xrange (0,len(data)):
#     for y in xrange (0, len(cuisine_list)):
#         sets_of_ingredients[y] = set()
#         for z in xrange (0,len(data[x]['ingredients'])):
#             sets_of_ingredients[y].add(data[x]['ingredients'][z])
#            
len(data)

39774

### What is the error above?
Let's find it together!

In [217]:
# first line - looks good, one set for each cuisine type
sets_of_ingredients = list(range(len(cuisine_list)))

# Set up outer loop - looks good, will go through each recipe
for x in xrange (0,len(data)):
    # Set up a second loop to go through cuisine types, I think...
    for y in xrange (0, len(cuisine_list)):
        # New stuff. Check if cuisine matches recipe
        if cuisine_list[y] == data[x]['cuisine']:
            sets_of_ingredients[y] = sets_of_ingredients[y].union(set(data[x]['ingredients']))

AttributeError: 'int' object has no attribute 'union'

In [218]:
# Never mind, that won't work because I couldn't figure out how to initialize 
# the items of the list as sets.

In [219]:
n_cuisine

20

In [220]:
sets_of_ingredients = list(range(0, n_cuisine))

# Initialize each item in list as a set
for x in xrange (0, n_cuisine):
    sets_of_ingredients[x] = set()
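
A list comprehension does the same initialization in one line. One trap worth knowing about: `[set()] * n` looks equivalent but repeats a reference to a single set, so every slot changes together. A sketch showing both (the names are just for illustration):

```python
n = 3

# One fresh set per slot
independent = [set() for _ in range(n)]
independent[0].add('salt')

# The same single set referenced n times
aliased = [set()] * n
aliased[0].add('salt')

print(len(independent[1]))  # 0: the other slots are untouched
print(len(aliased[1]))      # 1: every slot is the same set
```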
    

In [221]:
# Set up outer loop to go through each recipe
for x in xrange (0,len(data)):
    # Get the cuisine index
    index = cuisine_list.index(data[x]['cuisine'])
    # Add new ingredients to existing set
    sets_of_ingredients[index] = sets_of_ingredients[index].union(set(data[x]['ingredients']))

Ok so the above lines of code took a few seconds to run, so that's a good sign
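
Another way to group ingredients by cuisine, sketched here as an alternative and not what this notebook does, is a `collections.defaultdict(set)` keyed by cuisine name. That avoids the `cuisine_list.index(...)` lookup and the separate initialization step. The three recipes below are invented:

```python
from collections import defaultdict

# Made-up recipes in the same shape as the entries of `data`
sample = [
    {'cuisine': 'greek', 'ingredients': ['feta', 'olives']},
    {'cuisine': 'thai', 'ingredients': ['lime', 'fish sauce']},
    {'cuisine': 'greek', 'ingredients': ['feta', 'oregano']},
]

# Missing keys start out as an empty set automatically
ingredients_by_cuisine = defaultdict(set)
for recipe in sample:
    ingredients_by_cuisine[recipe['cuisine']].update(recipe['ingredients'])

print(sorted(ingredients_by_cuisine['greek']))  # ['feta', 'olives', 'oregano']
```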

In [222]:
# Now run those same lines from before to see if there is any cuisine
# ingredient overlap

unique_sets_ingredients = copy.deepcopy(sets_of_ingredients)

for x in range (0,n_cuisine):
    for y in range (0, n_cuisine):
        if y != x:
            unique_sets_ingredients[x] = sets_of_ingredients[x].difference(sets_of_ingredients[y])

In [223]:
print unique_sets_ingredients[0]

set([u'banger', u'candied peel', u'oat bran', u'sweetened coconut', u'ale', u'marshmallow creme', u'chopped potatoes', u'pico de gallo', u'fat-free buttermilk', u'caster', u'dried lavender', u'kiwi fruits', u'maraschino', u'self raising flour', u'gala apples', u'cranberry juice cocktail', u'raw honey', u'treacle', u'raspberry preserves', u'tartar sauce', u'split peas', u'Jameson Irish Whiskey', u'steak sauce', u'sparkling sugar', u'hot tea', u'whole cloves', u'maple sugar', u'wholemeal flour', u'caul fat', u'ginger ale', u'irish oats', u'pork butt', u'Baileys Irish Cream Liqueur', u'suet', u'cardamom seeds', u'chutney', u'coriander', u'biscuit mix', u'tenderloin roast', u'instant pudding mix', u'xanthan gum', u'teas', u'cider', u'scones', u'dried plum', u'hot mustard', u"pig's trotters", u'condensed cream', u'vanilla wafer crumbs', u'lager', u'lamb loin', u'pickle relish', u'sultana', u'mint chocolate chip ice cream', u'liquor', u'virginia ham', u'meringue powder', u'dried pear', u'pas

In [224]:
print len(sets_of_ingredients[0])
print len(unique_sets_ingredients[0])

999
227


In [225]:
unique_ingredients_total = 0

# Check if it worked (do we have smaller ingredient sets for each cuisine?)
for x in range (0, n_cuisine):
    print cuisine_list[x]
    print len(unique_sets_ingredients[x]), len(sets_of_ingredients[x])
    unique_ingredients_total += len(unique_sets_ingredients[x])
print unique_ingredients_total

irish
227 999
mexican
1055 2684
chinese
748 1792
filipino
311 947
vietnamese
404 1108
moroccan
187 974
brazilian
203 853
japanese
568 1439
british
320 1166
greek
207 1198
indian
607 1664
jamaican
215 877
french
564 2102
spanish
223 1263
russian
180 872
cajun_creole
396 1576
thai
498 1376
southern_us
911 2462
korean
313 898
italian
2344 2929
10481


The sum of the per-cuisine unique counts (10481) is more than the total number of distinct ingredients I calculated earlier (6714), so something is definitely wrong, but I'm not sure what. 

In [226]:
type(unique_sets_ingredients[0])

set

In [227]:
print unique_sets_ingredients[0].intersection(unique_sets_ingredients[1])

set([u'lager', u'oat bran', u'marshmallow creme', u'stew meat', u'pickle relish', u'muscovado sugar', u'tenderloin roast', u'malt vinegar', u'slaw', u'ale', u'decorating sugars', u'milk chocolate chips', u'french fries', u'biscuit mix', u'Country Crock\xae Spread', u'pico de gallo', u'fat-free buttermilk', u'rye whiskey', u'beef sirloin', u'organic chicken broth', u'bacon drippings', u'pickles', u'kiwi fruits', u'ginger beer', u'lean steak', u'brisket', u'whole allspice', u'mashed potatoes', u'fudge brownie mix', u'shortcrust pastry', u'steak sauce', u'whole cloves', u'gravy', u'papaya', u'pumpernickel bread', u'cream sauce', u'extra sharp white cheddar cheese', u'single crust pie', u'pot roast', u'black tea', u'gluten-free flour', u'pork butt', u'fine salt', u'pickled jalapenos', u'frozen orange juice concentrate', u'mixed spice', u'stew', u'flax seed meal', u'potato flakes', u'dark molasses', u'chutney', u'rolled oats', u'coriander', u'fresh cranberries', u'fresh ginger root', u'xant

# One more try


In [228]:
# Now run those same lines from before to see if there is any cuisine
# ingredient overlap

unique_sets_ingredients = copy.deepcopy(sets_of_ingredients)

for x in range (0,n_cuisine):
    for y in range (0, n_cuisine):
        if y != x:
            unique_sets_ingredients[x] = unique_sets_ingredients[x].difference(unique_sets_ingredients[y])

In [229]:
print len(sets_of_ingredients[0])
print len(unique_sets_ingredients[0])

999
36


In [230]:
# Check to see if each list is much smaller

# Total up the num of ingredients just in one list
unique_ingredients_total_2 = 0

for x in range (0, n_cuisine):
    print cuisine_list[x]
    print len(unique_sets_ingredients[x]), len(sets_of_ingredients[x])
    unique_ingredients_total_2 += len(unique_sets_ingredients[x])
print unique_ingredients_total_2

irish
36 999
mexican
457 2684
chinese
209 1792
filipino
67 947
vietnamese
78 1108
moroccan
39 974
brazilian
71 853
japanese
188 1439
british
97 1166
greek
90 1198
indian
256 1664
jamaican
72 877
french
288 2102
spanish
103 1263
russian
95 872
cajun_creole
199 1576
thai
290 1376
southern_us
837 2462
korean
313 898
italian
2929 2929
6714


# Well, something is still wrong, since we have 6714 as the total, which was the original total (and italian says none of its ingredients overlap with others)

### Third time's a charm
Last time I was subtracting the other *unique* sets, which were themselves growing smaller and smaller as the loop went on, instead of subtracting the other cuisines' full ingredient sets. 

In [231]:
# Now run those same lines from before to see if there is any cuisine
# ingredient overlap

unique_sets_ingredients = copy.deepcopy(sets_of_ingredients)

for x in range (0,n_cuisine):
    for y in range (0, n_cuisine):
        if y != x:
            unique_sets_ingredients[x] = unique_sets_ingredients[x].difference(sets_of_ingredients[y])
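
To see why only this third version gives the right answer, it helps to run all three attempts on a tiny example. The sketch below uses three made-up "cuisines" of one-letter "ingredients"; `p`, `q` and `r` are the only letters that belong to a single set. The loop variables are renamed `i` and `j` purely for readability.

```python
import copy

# Toy ingredient sets: a, b, c are shared; p, q, r are unique to one set each
full_sets = [set('abp'), set('bcq'), set('acr')]
n = len(full_sets)

# Attempt 1: each pass overwrites the result, so only the last j counts
attempt_1 = copy.deepcopy(full_sets)
for i in range(n):
    for j in range(n):
        if j != i:
            attempt_1[i] = full_sets[i].difference(full_sets[j])

# Attempt 2: subtracts the other *shrinking* sets
attempt_2 = copy.deepcopy(full_sets)
for i in range(n):
    for j in range(n):
        if j != i:
            attempt_2[i] = attempt_2[i].difference(attempt_2[j])

# Attempt 3: keeps shrinking set i, but always subtracts the full sets
attempt_3 = copy.deepcopy(full_sets)
for i in range(n):
    for j in range(n):
        if j != i:
            attempt_3[i] = attempt_3[i].difference(full_sets[j])

print([sorted(s) for s in attempt_1])  # [['b', 'p'], ['b', 'q'], ['a', 'r']]
print([sorted(s) for s in attempt_2])  # [['p'], ['b', 'q'], ['a', 'c', 'r']]
print([sorted(s) for s in attempt_3])  # [['p'], ['q'], ['r']]
```

In attempt 2 the sizes add up to 6, the number of distinct letters, just like the 6714 total earlier: each shared ingredient ends up credited to the last set that contains it, which is why the final cuisine in the list kept everything.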

In [232]:
# Check to see if each list is much smaller

# Total up the num of ingredients just in one list
unique_ingredients_total_3 = 0

for x in range (0, n_cuisine):
    print cuisine_list[x]
    print len(unique_sets_ingredients[x]), len(sets_of_ingredients[x])
    unique_ingredients_total_3 += len(unique_sets_ingredients[x])
print unique_ingredients_total_3

irish
36 999
mexican
456 2684
chinese
197 1792
filipino
51 947
vietnamese
46 1108
moroccan
34 974
brazilian
50 853
japanese
139 1439
british
76 1166
greek
65 1198
indian
143 1664
jamaican
36 877
french
193 2102
spanish
54 1263
russian
49 872
cajun_creole
105 1576
thai
74 1376
southern_us
272 2462
korean
41 898
italian
480 2929
2597


# (Presumed) Success!! 

### 2597 of the original 6714 ingredients are unique to one particular type of cuisine. Can I use that to characterize the new ones? It seems like a really dumb way to do it, but why not try? 

In [233]:
# first look at examples of these unique food lists
# Keep in mind could be misspellings of words, brands, duplicates because of capitalization

print unique_sets_ingredients[9]

set([u'Hidden Valley\xae Greek Yogurt Original Ranch\xae Dip Mix', u'honey-flavored greek style yogurt', u'grated kefalotiri', u'Cavenders Greek Seasoning', u'pork tenderloin medallions', u'sunflower kernels', u'ammonium bicarbonate', u'tarama', u'Greek dressing', u'mahimahi fillet', u'goat milk feta', u'low-fat caesar dressing', u'pocket bread', u'lamb steaks', u'red wine vinaigrette', u'boneless skinless chicken thigh fillets', u'kasseri', u'wish bone red wine vinaigrett dress', u'grape vine leaves', u'dillweed', u'yogurt low fat', u'Greek feta', u'Stonefire Tandoori Garlic Naan', u'aleppo', u'lowfat plain greekstyl yogurt', u'myzithra', u'Mezzetta Sliced Greek Kalamata Olives', u'Greek black olives', u'lean minced lamb', u'Yoplait\xae Greek 2% caramel yogurt', u'Homemade Yogurt', u'pointed peppers', u'raki', u'kefalotyri', u'whole wheat pita pockets', u'low-fat feta', u'fresh spinach leaves, rins and pat dry', u'mahlab', u'pita wedges', u'pita wraps', u'curly leaf spinach', u'manour

Without even looking at the name, it definitely looks like Greek food. 

# Load test data!


Let's see if I remember how to load this data. 

In [234]:
# Load in test data
test_data = json.load(open('test.json'))

No error message, so far so good. 

In [235]:
# Look at first entry in test data and data type
print type(test_data)
print test_data[0]

<type 'list'>
{u'id': 18009, u'ingredients': [u'baking powder', u'eggs', u'all-purpose flour', u'raisins', u'milk', u'white sugar']}


Great, probably the same format as the train data, a list of dictionaries of lists

In [236]:
print type(test_data[0])
print type(test_data[0]['ingredients'])

<type 'dict'>
<type 'list'>


Yep, it's the same, just no cuisine type label. 

Not entirely sure how we're going to go about labeling this data, but let's give it a go. 

Remember, if we label everything as entirely italian, we know we'll get something like 20% accuracy, so we want to beat that. And if we fail to label anything, we can call it 'italian' and be done with it. 
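
That roughly-20% figure is just the share of the most common label. A sketch of how one might check it, on a made-up list of ten labels (on the real data the labels would come from the `cuisine` field of each recipe in `data`):

```python
from collections import Counter

# Made-up labels: 'italian' appears twice, everything else once
labels = ['italian', 'mexican', 'italian', 'greek', 'thai',
          'indian', 'chinese', 'french', 'korean', 'irish']

label_counts = Counter(labels)
top_label, top_count = label_counts.most_common(1)[0]

# Accuracy of always guessing the most common label
baseline = top_count / float(len(labels))

print(top_label)  # italian
print(baseline)   # 0.2
```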

# Labeling the test data

First of all, we will want to make a CSV file that has the id number and the label, I presume. Let's load the example submission file to make sure. 

In [237]:
import csv as csv

In [238]:
sample_submission = csv.load('sample_submission.csv')

AttributeError: 'module' object has no attribute 'load'

Ok, that's not a thing. I'll go back and look at my Titanic solution and see how we load csv files into Python. BRB. 


In [239]:
import numpy as np

csv_reader = csv.reader(open('sample_submission.csv', 'rb'))
header = csv_reader.next()

# Create variable 'data'
# Go through each row of csv file, add to data
sample_submission = []
for row in csv_reader: 
    sample_submission.append(row)
    
# convert from list to array
sample_submission = np.array(sample_submission)

# each item is currently a string format
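
The same read can be tried without a file on disk. A sketch, assuming a two-row stand-in for `sample_submission.csv` held in an `io.StringIO`; `next(reader)` is the built-in form of `csv_reader.next()`:

```python
import csv
import io

# Made-up stand-in for sample_submission.csv
fake_file = io.StringIO(u"id,cuisine\n35203,italian\n17600,italian\n")

reader = csv.reader(fake_file)
header = next(reader)  # first row holds the column names
rows = list(reader)    # remaining rows, each a list of strings

print(header)  # ['id', 'cuisine']
print(rows)    # [['35203', 'italian'], ['17600', 'italian']]
```

Note that the ids come back as strings, so they would need `int(...)` before any numeric comparison.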

In [240]:
sample_submission

array([['35203', 'italian'],
       ['17600', 'italian'],
       ['35200', 'italian'],
       ..., 
       ['15430', 'italian'],
       ['46530', 'italian'],
       ['30849', 'italian']], 
      dtype='|S7')

Yep, it is as I expected

In [241]:
print type(sample_submission)

<type 'numpy.ndarray'>


In [242]:
print type(sample_submission[0])

<type 'numpy.ndarray'>


It's a 2-D numpy array, so each row is itself a numpy array. I could make one of those for the test data pretty easily. Just have to iterate through it.

# Plan of attack
So my plan of attack is to get the length of the test data, iterate through it, check to see if any of the words in the unique words lists are in there, and if they are, pick the cuisine type of the most common unique word. And if there aren't any, choose italian. Then build a numpy array out of it. 

In [243]:
# Get length of test data set
n_test = len(test_data)
print n_test

9944


Great, step 1 complete. Next, figure out the best way to keep track of the word counts. I can think of 3 ways this might work. One is to count the most common of the unique words and pick that word's type. Or we could pick the cuisine type with the most of the unique words. Or the cuisine type with the most instances of the unique words (like total number of times for any unique words). I wonder how different those outcomes would be? 

## First try - cuisine type of most common unique word

Wait, I just realized I'm being an idiot. None of the words in a single recipe are going to show up more than once. 

Ok so new plan - for each recipe, make a count of the number of unique words from each cuisine that are being considered, and pick the one with the highest count. If none are included, pick italian. 

In [244]:
submission_data = []

test_count = 0
test_count_2 = 0
# Iterate through test data
for x in xrange(0, n_test):
    # Make list of unique word counts, one slot per cuisine
    counts = [0] * n_cuisine
    
    # iterate through each ingredient in the recipe - not sure if this will work
    for ingredient in test_data[x]['ingredients']:
        
        # iterate through each cuisine type
        for y in xrange(0, n_cuisine):
        
            # if the ingredient is in one of the unique words lists, add one to counts[y]
            if ingredient in unique_sets_ingredients[y]:
                counts[y] += 1 
                test_count += 1
                
    # Find the max of counts
    max_count = max(counts)
    
    if max_count:
        label = cuisine_list[counts.index(max_count)]
        test_count_2 += 1
    else:
        label = 'italian'
        
    # Add results to the array
    temp_array = [test_data[x]['id'], label]
    submission_data.append(temp_array)
    
print submission_data[0]
print test_count
print test_count_2

[18009, 'italian']
1475
1281
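
The same voting rule can be packaged as a small function, which makes it easy to try on hand-made inputs. This is only a sketch: `label_recipe`, `unique_by_cuisine` and the toy sets are hypothetical names and data, and it uses a dict keyed by cuisine where the loop above uses parallel lists.

```python
def label_recipe(ingredients, unique_by_cuisine, default='italian'):
    """Return the cuisine whose unique-ingredient set matches the most ingredients."""
    best_cuisine, best_count = default, 0
    for cuisine, unique_ingredients in unique_by_cuisine.items():
        count = sum(1 for item in ingredients if item in unique_ingredients)
        # Strictly greater, so a tie keeps whichever cuisine was seen first
        if count > best_count:
            best_cuisine, best_count = cuisine, count
    return best_cuisine


# Toy unique-ingredient sets, far smaller than the real ones
toy_unique = {
    'greek': set(['kasseri', 'tarama']),
    'mexican': set(['pico de gallo']),
}

print(label_recipe(['tarama', 'kasseri', 'salt'], toy_unique))  # greek
print(label_recipe(['salt', 'eggs'], toy_unique))               # italian
```

A recipe with no unique ingredients falls through to the default, which is what happens to most of the test set here.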


In [245]:
print submission_data[1]

[28583, 'italian']


In [246]:
print submission_data[0:40]

[[18009, 'italian'], [28583, 'italian'], [41580, 'italian'], [29752, 'italian'], [35687, 'italian'], [38527, 'italian'], [19666, u'italian'], [41217, 'italian'], [28753, u'mexican'], [22659, 'italian'], [21749, 'italian'], [44967, u'southern_us'], [42969, 'italian'], [44883, 'italian'], [20827, 'italian'], [23196, 'italian'], [35387, 'italian'], [33780, 'italian'], [19001, 'italian'], [16526, 'italian'], [42455, 'italian'], [47453, 'italian'], [42478, 'italian'], [11885, u'vietnamese'], [16585, 'italian'], [29639, 'italian'], [26245, 'italian'], [38516, 'italian'], [47520, 'italian'], [26212, u'cajun_creole'], [23696, 'italian'], [14926, 'italian'], [13292, 'italian'], [27346, 'italian'], [1384, 'italian'], [15959, 'italian'], [42297, 'italian'], [46235, 'italian'], [21181, 'italian'], [9809, 'italian']]


Well obviously that failed, but at least I have a set where they are all italian and I can at least see what the result is and submit it. 

In [247]:
# are any of them not italian? My counts above claim that 1281 
# weren't arbitrarily assigned

print submission_data[n_test-2][1]

for x in range(0, n_test):
    if submission_data[x][1] != 'italian':
        print submission_data[x][1]

italian
mexican
southern_us
vietnamese
cajun_creole
japanese
chinese
filipino
french
indian
mexican
french
mexican
thai
vietnamese
vietnamese
southern_us
mexican
mexican
indian
brazilian
southern_us
chinese
mexican
japanese
moroccan
mexican
mexican
southern_us
korean
mexican
mexican
irish
mexican
french
korean
british
indian
mexican
indian
southern_us
mexican
thai
japanese
vietnamese
southern_us
mexican
mexican
mexican
mexican
moroccan
mexican
russian
filipino
british
mexican
chinese
mexican
korean
brazilian
korean
mexican
chinese
mexican
southern_us
indian
mexican
mexican
indian
mexican
mexican
mexican
indian
chinese
indian
japanese
mexican
french
mexican
mexican
southern_us
mexican
mexican
mexican
mexican
chinese
mexican
mexican
mexican
japanese
chinese
british
indian
japanese
mexican
vietnamese
mexican
brazilian
mexican
southern_us
mexican
mexican
french
korean
southern_us
mexican
chinese
mexican
southern_us
chinese
russian
chinese
chinese
british
greek
mexican
mexican
mexican
mexic

Great, should be better than nothing...

# CSV file creation

In [252]:

submission_1 = open('erh_submission_1_WC.csv', 'wb')
submission_object = csv.writer(submission_1)
submission_object.writerow(['id', 'cuisine'])

for row in submission_data:
    submission_object.writerow(submission_data[x])


submission_1.close()


# good news, bad news

The good news is there seems to have been something wrong with my code (surprise, surprise!) rather than with the predictions: when I looked at the CSV file, every single row was the same entry. That is probably because the loop above writes `submission_data[x]`, with `x` left over from an earlier cell, when it should write `row`.


In [253]:
# Redo the above code
submission_1 = open('erh_submission_1_WC.csv', 'wb')
submission_object = csv.writer(submission_1)
submission_object.writerow(['id', 'cuisine'])

for row in submission_data:
    submission_object.writerow(row)


submission_1.close()
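
A way to sanity-check the writer before opening the file by hand is to write into an in-memory buffer and look at the text. A sketch with two made-up rows; `writerows` is the built-in shortcut for the `for row in ...` loop:

```python
import csv
import io

# Two made-up predictions in the same [id, label] shape as submission_data
rows = [[18009, 'italian'], [28753, 'mexican']]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'cuisine'])  # header
writer.writerows(rows)              # one line per prediction

lines = buffer.getvalue().splitlines()
print(lines)  # ['id,cuisine', '18009,italian', '28753,mexican']
```

On Python 3 a real file would be opened with `open(path, 'w', newline='')` in place of the `'wb'` mode used in this notebook, and a `with` block would close it automatically.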


Now I look at the csv file again and cross my fingers...

# Moderate success! 
Not every entry is Italian, so I am going to submit the sample where every one is italian, and then mine and see how much better I got. 

# All Italian model - baseline

Score would have been 0.19268, place # 1367 (at the All Italian Benchmark)


# My baseline - submission # 1

Score would have been 0.26740, place # 1331. I guess that's my new baseline.