# Create Embedding Dictionary

Before defining and training our model, we need suitable embeddings. We base all our work on Google's pre-trained word2vec model. It is a large embedding dictionary containing about 3 million words and phrases (trained on Google News), with a 300-dimensional embedding for each entry. With a vocabulary this large, we can find almost any food item.

You can find the pre-trained embeddings [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

However, we need a little bit of pre-processing.

1. We need a function that maps labels to an appropriate list of words. Each element in this list should appear in our embedding dictionary. word2vec usually contains both lower-case and capitalized versions of a word. We generally want the lower-case version (for food, 'apple' instead of 'Apple', as the latter can also refer to the company), except for proper nouns (for example 'Babybel').
2. We don't need most of the roughly 3 million entries in word2vec, and loading all of them takes up too much memory (around 4-5 GB). Instead, we create a subset based on the food items we are interested in.
3. We create the embedding dictionaries for both our full set of labels and the reduced set of labels.
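
The case preference from step 1 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of `utils.get_lab_list`, which is not shown in this notebook. The function names `pick_vocab_form` and `label_to_words` and the toy vocabulary are hypothetical.

```python
def pick_vocab_form(word, vocab):
    """Return the preferred spelling of `word` that exists in `vocab`, or None.

    Preference order (an assumption for this sketch): lower case first,
    then the word as written, then the capitalized form for proper nouns.
    """
    for candidate in (word.lower(), word, word.capitalize()):
        if candidate in vocab:
            return candidate
    return None


def label_to_words(label, vocab):
    """Split a label into words and keep only those found in `vocab`."""
    words = []
    for raw in label.replace('_', ' ').replace('-', ' ').split():
        form = pick_vocab_form(raw, vocab)
        if form is not None:
            words.append(form)
    return words


# Toy stand-in for the word2vec vocabulary (hypothetical).
toy_vocab = {'apple', 'Apple', 'Babybel', 'pie', 'cheese'}

print(label_to_words('Apple pie', toy_vocab))       # ['apple', 'pie']
print(label_to_words('babybel cheese', toy_vocab))  # ['Babybel', 'cheese']
```

With the real model, the membership test would run against the loaded word2vec vocabulary instead of a Python set.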

## Import data 

In [40]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

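# Explicit imports for the names used below, so the cells do not depend
# on what the fastai star import happens to re-export.
import pickle
from pathlib import Path

import pandas as pd
from fastai.vision import *
import gensim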

from ImageEmbedModel import utils
from tqdm import tqdm
from gensim.models.phrases import Phrases, Phraser

In [41]:
path = Path('/home/jupyter/data')

labels = pd.read_csv(path/'food_label_concat_new.csv')
labels_red = pd.read_csv(path/'food_label_concat_new_red.csv')
labels['label'] = labels['label'].apply(str)
labels_red['label'] = labels_red['label'].apply(str)

extra_labels = ['salty', 'sweet', 'sour', 'hot', 'bitter', 'large', 'small', 'liquid', 'roasted', 'sauteed', 'boiled',
                'pepper', 'salt', 'fruity']

## Load word2vec (takes a while)

In [42]:
model = gensim.models.KeyedVectors.load_word2vec_format(path/'GoogleNews-vectors-negative300.bin', binary=True)

## Get classes for each dataset

In [43]:
all_classes = utils.get_labels(labels['label'])
all_classes_red = utils.get_labels(labels_red['label'])
all_classes = all_classes+extra_labels
all_classes_red = all_classes_red+extra_labels
print("total classes: {}\n".format(len(all_classes)))
print("total classes reduced: {}".format(len(all_classes_red)))

total classes: 304

total classes reduced: 201


## Create label dictionaries

In [44]:
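def create_label_dict(all_classes) -> tuple:
    """Map every word of every label to its word2vec vector.

    Returns the word -> vector dict and the list of labels
    that contain at least one word missing from word2vec.
    """
    res = {}
    errors = []
    for label in all_classes:
        for word in utils.get_lab_list(label):
            try:
                res[word] = model.get_vector(word)
            except KeyError:
                # The word is not in the word2vec vocabulary.
                errors.append(label)
                print("error: {}".format(label))
    return res, errors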

def dump_dict(class_dict, path):
    # Use a context manager so the file handle is closed after writing.
    with open(path, 'wb') as f:
        pickle.dump(class_dict, f)

def remove_errors(df,errors):
    df2 = df[df['label'].apply(lambda x: x not in errors)]
    removed_count = len(df) - len(df2)
    return df2, removed_count



In [45]:
total_dict, err_total = create_label_dict(all_classes)
total_dict_red, err_total_red = create_label_dict(all_classes_red)

error: slad


In [46]:
out_fn_red = path/'food_dict_red.pkl'
out_fn_new = path/'food_dict_new.pkl'
dump_dict(total_dict,out_fn_new)
dump_dict(total_dict_red,out_fn_red)
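
As a quick sanity check, a pickled dictionary can be read back and compared with the original. The sketch below uses a toy dictionary with plain lists in place of the 300-dimensional vectors and writes to a temporary directory, so it does not touch the files above. The names `toy_dict` and `toy_dict.pkl` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for a word -> vector dictionary (hypothetical values).
toy_dict = {'apple': [0.1, 0.2, 0.3], 'pie': [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp:
    out_fn = Path(tmp) / 'toy_dict.pkl'

    # Write the dictionary, closing the file before reading it again.
    with open(out_fn, 'wb') as f:
        pickle.dump(toy_dict, f)

    # Read it back and keep the result for comparison.
    with open(out_fn, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_dict)  # True
print(sorted(restored))      # ['apple', 'pie']
```

With real NumPy vectors, `==` on the whole dictionary is ambiguous, so comparing the key sets and checking a few vectors individually is probably the safer test.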

## Subset labels so that they don't contain errors

In [47]:
labels_clean, removed_count = remove_errors(labels,err_total)
labels_red_clean, removed_count_red = remove_errors(labels_red,err_total_red)
print("labels removed: {}\n".format(removed_count))
print("labels reduced removed: {}".format(removed_count_red))

labels removed: 0

labels reduced removed: 1534
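
The filtering step can be tried on a toy frame to see how one bad label can remove many rows. This sketch uses `Series.isin` as a vectorized alternative to the `apply` call above. It assumes the error entries match the values in the `label` column exactly. The function name `remove_errors_isin`, the `img` column, and the toy data are hypothetical.

```python
import pandas as pd


def remove_errors_isin(df, errors):
    """Drop rows whose 'label' is in `errors`.

    Returns the filtered frame and the number of rows removed.
    """
    mask = df['label'].isin(errors)
    return df[~mask], int(mask.sum())


# Toy data: the misspelled label 'slad' appears on two rows.
toy = pd.DataFrame({
    'label': ['apple', 'slad', 'pie', 'slad'],
    'img': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
})

clean, removed = remove_errors_isin(toy, ['slad'])
print(removed)                 # 2
print(clean['label'].tolist()) # ['apple', 'pie']
```

The count is in rows, not distinct labels, so a single unknown label attached to many images produces a large number.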


## Write to CSV

In [48]:
labels_clean.to_csv(path/'all_food_processed_new.csv', index=False)
labels_red_clean.to_csv(path/'all_food_processed_red.csv', index=False)