# Knowledge Based Recommendation System of Ingredients

## Notebook 2: Word Embeddings using Word2Vec, FastText
### Project Breakdown
    1: Exploratory Data Analysis and Preprocessing
    2: Build Word Embeddings using Word2Vec, FastText
    3: Recommend Recipes based on ingredients
    4: Build and Visualize Interactive Knowledge Graph of Ingredients


## Word2Vec with Gensim
The original Word2Vec papers can be found [here](https://arxiv.org/pdf/1301.3781.pdf) and [here](https://arxiv.org/pdf/1310.4546.pdf), and the documentation for the Gensim implementation is [here](https://radimrehurek.com/gensim/models/word2vec.html).

![Word2Vec architecture](https://www.researchgate.net/profile/Giuseppe-Futia/publication/328373466/figure/fig3/AS:701226521997316@1544196839385/Architecture-of-Word2Vec-models-CBOW-and-Skip-Gram.ppm)
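
To make the two architectures in the figure concrete, here is a minimal sketch of how training examples could be drawn from one tokenized recipe. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and are not part of Gensim. The sketch uses a fixed window and ignores details of the real implementation such as random window shrinking, subsampling and negative sampling.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram view: each center word predicts every word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW view: the surrounding words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


# Illustrative tokens taken from the first recipe in train_data
recipe = ["place", "chicken", "butter", "soup"]
print(skipgram_pairs(recipe, window=1))
print(cbow_examples(recipe, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position.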

In [1]:
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
from tqdm import tqdm
import pandas as pd
import numpy as np
import pickle

In [2]:
# download train_data.pkl into the data folder, then load it

!mkdir -p data
!gdown --id 1-IzMRZLYH4OZR8_Za0ZM-6vt2jGIb0dz  -O data/train_data.pkl

with open('data/train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

Downloading...
From: https://drive.google.com/uc?id=1-IzMRZLYH4OZR8_Za0ZM-6vt2jGIb0dz
To: /content/data/train_data.pkl
100% 246M/246M [00:01<00:00, 197MB/s]


In [3]:
train_data[:1]

[['place',
  'chicken',
  'butter',
  'soup',
  'onion',
  'slow',
  'cooker',
  'water',
  'covercover',
  'cook',
  'hour',
  'high',
  'minute',
  'serving',
  'place',
  'torn',
  'biscuit',
  'dough',
  'slow',
  'cooker',
  'cook',
  'dough',
  'longer',
  'raw',
  'center']]

In [4]:
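# 300-dimensional vectors; `size` is the gensim 3.x name (renamed `vector_size` in gensim 4)
wv_model = Word2Vec(size=300)
# scan the corpus once to build the vocabulary before training
wv_model.build_vocab(train_data)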

In [6]:
%%time
wv_model.train(
    train_data, 
    total_examples=wv_model.corpus_count,
    epochs=50,
    compute_loss=True
)

CPU times: user 36min 20s, sys: 7.23 s, total: 36min 28s
Wall time: 12min 27s


(647573309, 784110650)

In [7]:
wv_model.wv.most_similar(['orange'], topn=10)

[('lemon', 0.772682785987854),
 ('tangerine', 0.7122073173522949),
 ('lime', 0.6782360672950745),
 ('citrus', 0.6447024941444397),
 ('grapefruit', 0.633366584777832),
 ('clementine', 0.5394362807273865),
 ('pineapple', 0.4848083257675171),
 ('pomegranate', 0.4140113592147827),
 ('satsuma', 0.3966277241706848),
 ('cranberry', 0.38645362854003906)]
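
The scores above are cosine similarities between word vectors. As a rough illustration of that ranking, here is a pure-Python sketch over a toy dictionary of made-up 3-dimensional vectors. The `toy` values and the helper names are assumptions for illustration only; the real model uses 300-dimensional learned vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(word, cosine(q, vec)) for word, vec in vectors.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, not taken from the trained model
toy = {
    "orange": [1.0, 0.9, 0.0],
    "lemon": [0.9, 1.0, 0.1],
    "onion": [0.0, 0.1, 1.0],
}
print(rank_neighbours("orange", toy))
```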

In [8]:
!mkdir -p models
wv_model.save('models/word2vec.model')

## Facebook AI's FastText Model

“fastText is a library for efficient learning of word representations and sentence classification, proposed by Facebook AI Research Center. It’s a new approach based on Mikolov’s CBOW and skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words being represented as the sum of these representations. It is a faster method, allowing to train models on large corpora quickly and allows researchers to compute word representations for words that did not appear in the training data.”

Paper title: Enriching Word Vectors with Subword Information
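
The "bag of character n-grams" idea can be sketched in a few lines. The helper `char_ngrams` below is hypothetical and only illustrates the decomposition described in the paper: the word is wrapped in `<` and `>` boundary markers and split into n-grams of length 3 to 6. The real library additionally hashes the n-grams into a fixed number of buckets and keeps a vector for the whole word, neither of which is shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word can still get a vector because it shares n-grams with seen words
shared = set(char_ngrams("patty")) & set(char_ngrams("patties"))
print(sorted(shared))
```

Because a word vector is the sum of its n-gram vectors, words with overlapping spellings tend to end up close together.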

In [11]:
# source: https://github.com/facebookresearch/fastText#building-fasttext-for-python

!git clone https://github.com/facebookresearch/fastText.git 

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.22 MiB | 14.23 MiB/s, done.
Resolving deltas: 100% (2417/2417), done.


In [12]:
# compile libraries and install required python files
%cd fastText
!make
!pip install .

/content/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/vector.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/model.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/utils.cc
c++ -pthread -std=c++11 -march=native -O3 -funrol

In [13]:
# change dir back to parent folder
%cd ..

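# build a text file with one recipe per line for the fastText CLI
# ('w' rather than 'a', so re-running the cell does not duplicate the corpus)
with open('data/train_data.txt', 'w') as f:
    text = '\n'.join([' '.join(data) for data in train_data])
    f.write(text)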

!head -5 data/train_data.txt

/content
place chicken butter soup onion slow cooker water covercover cook hour high minute serving place torn biscuit dough slow cooker cook dough longer raw center
slow cooker mix cream mushroom soup dry onion soup mix water place pot roast slow cooker coat soup mixturecook high setting hour low setting hour
preheat oven degree degree lightly grease inch loaf panpress brown sugar prepared loaf pan spread ketchup sugarin mixing bowl mix thoroughly remaining ingredient shape loaf place ketchupbake preheated oven hour juice clear
preheat oven degree degree ccream butter white sugar brown sugar smooth beat egg time stir vanilla dissolve baking soda hot water add batter salt stir flour chocolate chip nut drop large spoonful ungreased pansbake minute preheated oven edge nicely browned
preheat oven degree line quart casserole dish reynolds wrapr pan lining paper parchment need grease dishcook pasta large saucepan according package direction adding broccoli minute cooking drain return saucep

In [14]:
# train FastText model with train_data.txt
!mkdir -p models
!fastText/fasttext skipgram -dim 300  -ws 5 -epoch 100 -input data/train_data.txt -output models/ft_model

Read 15M words
Number of words:  21795
Number of labels: 0
tcmalloc: large alloc 2426159104 bytes == 0x55ccd6bd6000 @  0x7feb7d66e887 0x55cccd23dfed 0x55cccd24c71e 0x55cccd2544fc 0x55cccd25bffc 0x55cccd211887 0x7feb7c70bbf7 0x55cccd211b4a
Progress: 100.0% words/sec/thread:   18165 lr:  0.000000 avg.loss:  0.439929 ETA:   0h 0m 0s


In [15]:
# Load word embeddings from FastText
ft_model = KeyedVectors.load_word2vec_format('models/ft_model.vec', binary=False)
print(ft_model)

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f9fb0c4b250>
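
For reference, the `.vec` file is plain text in the word2vec format: a header line with the vocabulary size and vector dimension, then one line per word with its values separated by spaces. The reader below is a minimal sketch of that layout, not how Gensim parses it. `read_vec` is a hypothetical helper and the sample content is made up.

```python
import io


def read_vec(handle):
    """Parse word2vec text format into a {word: [floats]} dictionary."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        # rstrip() also drops the trailing space fastText writes on each line
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} words, found {len(vectors)}")
    return vectors


# Made-up two-word file with 3-dimensional vectors
sample = io.StringIO("2 3\nburger 0.1 0.2 0.3 \ncheese 0.4 0.5 0.6 \n")
vectors = read_vec(sample)
print(vectors["cheese"])
```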


In [16]:
ft_model.most_similar(['burger', 'cheese'])

[('cheddar', 0.5891498327255249),
 ('patty', 0.5787400007247925),
 ('hamburger', 0.5557500720024109),
 ('bun', 0.5498818755149841),
 ('mozzarella', 0.535476565361023),
 ('burgersplace', 0.5248774290084839),
 ('pattiesoil', 0.5143989324569702),
 ('pattiescook', 0.5099775791168213),
 ('pattieslightly', 0.5054371356964111),
 ('pattiesgrill', 0.5029280781745911)]
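
When several words are passed, Gensim averages their unit-normalized vectors and ranks the rest of the vocabulary by cosine similarity to that average, excluding the query words themselves. The sketch below mimics that behaviour with made-up 3-dimensional vectors. The helper names and `toy` values are assumptions for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar_multi(positive, vectors, topn=2):
    """Rank words by cosine similarity to the mean of the query vectors."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][k] for w in positive) / len(positive) for k in range(dim)])
    scores = [
        (word, sum(a * b for a, b in zip(mean, vec)))
        for word, vec in normed.items()
        if word not in positive  # query words are excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, not taken from the trained model
toy = {
    "burger": [1.0, 0.0, 0.0],
    "cheese": [0.0, 1.0, 0.0],
    "cheddar": [0.6, 0.8, 0.0],
    "water": [0.0, 0.0, 1.0],
}
result = most_similar_multi(["burger", "cheese"], toy)
print(result)
```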

In [20]:
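# KeyedVectors.save writes gensim's native format (reload with KeyedVectors.load),
# not the text .vec format that the file extension suggests
ft_model.save('models/fasttext.vec')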

In [22]:
from google.colab import drive
drive.mount('/content/gdrive')
!cp -r models/* /content/gdrive/MyDrive/colab/xlabs/

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
