# Reading JSON

Load the What's Cooking? JSON files and organize them into one big spreadsheet.

In [None]:
# starting up a console attached to this kernel
%qtconsole
import os

# importing base code
os.chdir('/your-path/whats-cooking/code')
from base import *

# changing to competition dir
os.chdir('/your-path/whats-cooking/')

In [None]:
# reading files
# also using stemmed data
path = './raw-data'
train = pd.read_json(path + '/train.json')
test = pd.read_json(path + '/test.json')

In [None]:
print(train.iloc[0:10, :])

Files are organized as {label, id, [list of ingredients]}.
We want to transform this into {id, label, ing1, ing2, ing3, ...}, similar to a one-hot encoding.
First, let us build a dictionary of ingredients (so we can use tf or tf-idf later) and record the frequency of each ingredient. This dictionary must be built from the training data only: an ingredient that appears only in the test set would produce a column that is all zeros in training, i.e. a 0-variance feature.

In [None]:
# transform to big spreadsheet
import operator

count = 0       # next free column index
lengths = []    # vocabulary size before each recipe is processed
ing_dict = {}   # ingredient -> column index, in order of first appearance
freq_dict = {}  # ingredient -> number of occurrences in the training set

# writing one column per ingredient
# iterating over rows is bad practice, but this is a small dataset
for row, data in train.iterrows():
    lengths.append(len(ing_dict))
    for ingredient in data['ingredients']:
        if ingredient in ing_dict:
            freq_dict[ingredient] += 1
        else:
            ing_dict[ingredient] = count
            freq_dict[ingredient] = 1
            count += 1

# sorting ingredients by frequency, most frequent first
sorted_freqs = sorted(freq_dict.items(), key=operator.itemgetter(1), reverse=True)
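
As a rough sketch of the same bookkeeping on toy data, the frequency counts can also be collected with collections.Counter. The recipes and the names toy_recipes, toy_freq, toy_index and toy_sorted are invented for illustration and are not part of the notebook.

```python
from collections import Counter

# Toy stand-in for train['ingredients']; the recipes are invented.
toy_recipes = [
    ['salt', 'pepper', 'olive oil'],
    ['salt', 'garlic'],
    ['garlic', 'olive oil', 'salt'],
]

# Frequency of each ingredient, counting every occurrence.
toy_freq = Counter(ing for recipe in toy_recipes for ing in recipe)

# Column index for each ingredient, assigned in order of first appearance.
toy_index = {}
for recipe in toy_recipes:
    for ing in recipe:
        toy_index.setdefault(ing, len(toy_index))

# Ingredients sorted by descending frequency.
toy_sorted = toy_freq.most_common()

print(toy_index)
print(toy_sorted)
```

Counter.most_common returns (ingredient, count) pairs sorted by descending count, the same shape as sorted_freqs.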

In [None]:
# visualize output:
print(list(ing_dict.keys())[0:10])
print('number of ingredients: {}'.format(len(ing_dict)))
print('Top 25 ingredients:')
print(sorted_freqs[0:25])

Note: 6714 distinct ingredients in the raw data is a lot. There must be some overlap (modifiers, typos, etc.).
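
One way to check for this kind of overlap is to group raw names by a normalized form. The sketch below is only illustrative: the ingredient names and the MODIFIERS set are hypothetical, and a real modifier list would have to come from inspecting the data.

```python
from collections import defaultdict

# Invented ingredient names showing the kind of overlap suspected above.
toy_names = ['Garlic', 'garlic', 'chopped garlic', 'fresh basil', 'basil']

# Hypothetical modifier words (an assumption, not taken from the dataset).
MODIFIERS = {'chopped', 'fresh'}

def normalize(name):
    """Lowercase a name and drop modifier words."""
    words = [w for w in name.lower().split() if w not in MODIFIERS]
    return ' '.join(words)

# Group raw names by their normalized form.
groups = defaultdict(list)
for name in toy_names:
    groups[normalize(name)].append(name)

# Keep only normalized names that several raw names map to.
overlaps = {key: names for key, names in groups.items() if len(names) > 1}
print(overlaps)
```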

Now, let us save our dictionaries:

In [None]:
# saving data
with open(path + '/ing_dict.txt', 'w') as f:
    f.write(str(ing_dict))
    
with open(path + '/freq_dict.txt', 'w') as f:
    f.write(str(freq_dict))
    
with open(path + '/freq_sorted.txt', 'w') as f:
    f.write(str(sorted_freqs))

The dictionaries are saved as Python literals, which may not be the best practice. To read one back, use literal_eval from the ast module.

In [None]:
# read dict literal
import ast

with open(path + '/ing_dict.txt', 'r') as f:
    ing_dict = ast.literal_eval(f.read())
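
If saving Python literals feels fragile, one possible alternative is the standard json module. This is a minimal sketch on a toy dictionary, not what the notebook does. The file name ing_dict.json and the temporary directory are assumptions made so the example does not touch ./raw-data.

```python
import json
import os
import tempfile

# Toy dictionary standing in for ing_dict; the entries are invented.
toy_dict = {'salt': 0, 'olive oil': 1}

# Write to a temporary directory (assumption: keeps the sketch self-contained).
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, 'ing_dict.json')  # hypothetical file name

with open(json_path, 'w') as f:
    json.dump(toy_dict, f)

with open(json_path, 'r') as f:
    restored = json.load(f)

print(restored == toy_dict)
```

JSON has no tuple type, so a list of (ingredient, count) tuples such as sorted_freqs would come back as a list of two-element lists.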

Now, we can use feature extraction tools from sklearn to build sparse features from our rows.

In [None]:
# build a spreadsheet where columns are frequency counts
from sklearn.feature_extraction.text import CountVectorizer

# dummy function, as we want to override the sklearn analyzer
# and use what is inside the existing lists as tokens
def do_nothing(x):
    return x

# this instance will count each ingredient's frequency per recipe;
# with a fixed vocabulary, ingredients seen only in the test set are ignored
cvect = CountVectorizer(analyzer=do_nothing,
                        vocabulary=ing_dict)
# getting corpus (training rows first, then test rows)
combi = pd.concat([train, test])
corpus = combi['ingredients']

# build count matrix
counts = cvect.transform(corpus)

# turn the scipy sparse matrix into a pd.DataFrame
counts_df = pd.DataFrame(counts.toarray())
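
To make concrete what the count matrix contains, here is a pure-Python sketch of the same idea on toy data. It is not the sklearn implementation. The names toy_vocab, toy_corpus and count_rows are invented for illustration, and 'saffron' plays the role of an ingredient that appears only in the test set.

```python
# Toy vocabulary standing in for ing_dict (ingredient -> column index).
toy_vocab = {'salt': 0, 'garlic': 1, 'olive oil': 2}

# Toy corpus standing in for combi['ingredients'].
toy_corpus = [
    ['salt', 'garlic'],
    ['olive oil', 'salt', 'saffron'],
]

def count_rows(corpus, vocab):
    """Return one list of counts per recipe, one column per vocabulary entry."""
    rows = []
    for recipe in corpus:
        row = [0] * len(vocab)
        for ing in recipe:
            col = vocab.get(ing)
            if col is not None:  # ingredients outside the vocabulary are skipped
                row[col] += 1
        rows.append(row)
    return rows

toy_counts = count_rows(toy_corpus, toy_vocab)
print(toy_counts)
```

Each row has exactly one column per dictionary entry, so the out-of-vocabulary ingredient does not add a column.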

Moving on: feature extraction, exploratory analysis and feature engineering