# HyperFoods: Machine intelligent mapping of cancer-beating molecules in foods

## Recipe Retrieval w/ Higher Number Anti-Cancer Molecules

Each recipe had all the ingredients concatenated in single string. It was used the ingredients vocabulary of the dataset
to filter what were and what weren't ingredient names in each string. Finally, it was calculated the sum of the number
of anti-cancer molecules present in each recipe using the table food_compound.csv. A DataFrame object was created so that
it not ony shows us the ID of each recipe, but also the number of anti-cancer molecules, along with an URL to the recipe's
location online.

## Importing Modules

Importing libraries installed using PyPI and functions present in scripts created in for this project.

In [3]:
# ---------------------------- Data Management ----------------------------
# pandas is an open source library providing high-performance, easy-to-use data structures and data 
# analysis tools for the Python programming language.

import pandas

# ---------------------------- Scientific Operations ----------------------------
# NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful 
# N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, 
# useful linear algebra, Fourier transform, and random number capabilities.

import numpy

# ---------------------------- Write & Read JSON Files ----------------------------
# Python has a built-in package which can be used to work with JSON data.

import json

# ---------------------------- Pickling ----------------------------
# The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling”
# is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse 
# operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.

import pickle

# ------------------------------------- Word2Vec -------------------------------------
# Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural
# networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of
# text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being
# assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that
# share common contexts in the corpus are located close to one another in the space.
# Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target
# audience is the natural language processing (NLP) and information retrieval (IR) community.

import gensim
from gensim.models import Word2Vec

# -------------------------- Dimensionality Reduction Tools --------------------------
# Scikit-learn (also known as sklearn) is a free software machine learning library for the
# Python programming language.It features various classification, regression and clustering algorithms including 
# support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with
# the Python numerical and scientific libraries NumPy and SciPy.
# Principal component analysis (PCA) - Linear dimensionality reduction using Singular Value Decomposition of the data to
# project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying 
# the SVD.
# t-distributed Stochastic Neighbor Embedding (t-SNE) - It is a tool to visualize high-dimensional data. It converts 
# similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between
# the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that
# is not convex, i.e. with different initializations we can get different results.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# ------------------------------ Check File Existance -------------------------------
# The main purpose of the OS module is to interact with the operating system. Its primary use consists in 
# creating folders, removing folders, moving folders, and sometimes changing the working directory.

import os
from os import path

# ------------------------ Designed Visualization Functions -------------------------
# Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats
# and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython
# shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
# Plotly's Python graphing library makes interactive, publication-quality graphs. You can use it to make line plots, 
# scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar 
# charts, and bubble charts.
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing
# attractive and informative statistical graphics.

from algorithms.view.matplotlib_designed import matplotlib_function
from algorithms.view.plotly_designed import plotly_function
from algorithms.view.seaborn_designed import seaborn_function

# ------------------------ Retrieving Ingredients, Units and Quantities -------------------------

from algorithms.parsing.ingredient_quantities import ingredient_quantities

# ------------------------ Create Distance Matrix -------------------------
# SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for
# optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE
# solvers and other tasks common in science and engineering.
# distance_matrix returns the matrix of all pair-wise distances.

from scipy.spatial import distance_matrix

# ------------------------ Unsupervised Learning -------------------------
#

from clustering.infomapAlgorithm import infomap_function # Infomap algorithm detects communities in large networks with the map equation framework.
from sklearn.cluster import DBSCAN # DBSCAN
from sklearn.cluster import MeanShift # Meanshift
import community # Louvain

# ------------------------ Supervised Learning -------------------------

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeaveOneOut

# ------------------------ Jupyter Notebook Widgets -------------------------
# Interactive HTML widgets for Jupyter notebooks and the IPython kernel.

import ipywidgets as w
from IPython.core.display import display
from IPython.display import Image

# ------------------------ IoU Score -------------------------
# The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the
# French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity 
# of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of 
# the intersection divided by the size of the union of the sample sets.
# Function implemented during this project.

from benchmark.iou_designed import iou_function

# ------------------------ F1 Score -------------------------
# The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best
# value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The 
# formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall)

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

# ------------------------ API Requests -------------------------
# The requests library is the de facto standard for making HTTP requests in Python. It abstracts the complexities of
# making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data 
# in your application.

import requests

# ------------------------ RegEx -------------------------
# A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
# RegEx can be used to check if a string contains the specified search pattern.
# Python has a built-in package called re, which can be used to work with Regular Expressions.

import re

# ------------------------ Inflect -------------------------
# Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words.

import inflect

# ------------------------ Parse URLs -------------------------
# This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing
# scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL”
# to an absolute URL given a “base URL.”

from urllib.parse import urlparse

# ------------------------ Embedding HTML -------------------------
# Public API for display tools in IPython.

from IPython.display import HTML

# ------------------------ Creating Graph -------------------------
# NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of 
# complex networks.

import networkx

# ------------------------ Language Detectors -------------------------
# TextBlob requires API connnection to Google translating tool (low limit on the number of requests). langdetect is an offline detector.

from textblob import TextBlob
from langdetect import detect

# ------------------------ Language Detectors -------------------------
# In Python, string.punctuation will give the all sets of punctuation: !"#$%&'()*+, -./:;<=>?@[\]^_`{|}~

import string

# ------------------------ CSV Reader -------------------------
# CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.

import csv

# ------------------------ Natural Language Processing -------------------------
# 

import nltk
#nltk.download()
from nltk.corpus import stopwords, wordnet
import webcolors
from nltk.corpus import wordnet



## Recipe1M+ Dataset

In [34]:
# ---------------------------- Importing Recipe1M+ Dataset ----------------------------

f = open('./data/recipe1M+/layer1.json')
recipes_data = (json.load(f))[0:10000] # Regular computer able to read Recipe1M+ full dataset.
f.close()

id_ingredients = {}
#id_url = {}
   
for recipe in recipes_data:

    id_ingredients[recipe["id"]] = []
    #id_url[recipe["id"]] = recipe["url"]
    
    for index, ingredient in enumerate(recipe["ingredients"]):
        id_ingredients[recipe["id"]].append({"id": index, "ingredient": (ingredient["text"]).lower()})
        

In [163]:
# ---------------------------- Details Recipe1M+ ----------------------------

# Online websites parsed to retrieve recipes.

recipe_databases = []

for key, value in id_url.items():
    
    parsed_uri = urlparse(value)
    result = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)
    
    recipe_databases.append(result)

list(set(recipe_databases)) # The common approach to get a unique collection of items is to use a set. Sets are 
# unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in
# set() function. If you later need a real list again, you can similarly pass the set to the list() function.

with open('./data/allRecipeDatabases.txt', 'w') as f:
    for item in list(set(recipe_databases)):
        f.write("%s\n" % item)

### Recipe1M+ Dataset Errors Corrected

In [38]:
# ---------------------------- Deleting Non-English/Empty Recipes ----------------------------

true_recipes_positions = []

for key, recipe in enumerate(recipes_data):
    
    joint_ingredients = ""
    
    for key2, ingredient in enumerate(recipe["ingredients"]):
                                                                
        #b = TextBlob(modified_recipes_data[key]["instructions"][0]["text"])
        #print(detect(ingredient["text"] + "a"))
        
        #joint_ingredients = joint_ingredients + " " + ingredient["text"]
        
        #print(joint_ingredients)
            
        if len(ingredient["text"].split(" ")) > 1 and detect(ingredient["text"] + "a") == "en":
        #if b.detect_language() == "en":
            #print("en")
        
            true_recipes_positions.append(key)
            break
            
        if key2 == len(recipe["ingredients"]) - 1 and TextBlob(ingredient["text"]).detect_language() == "en":
            
            true_recipes_positions.append(key)
            print(str(key) + "normal")
            break
            
        elif key2 == len(recipe["ingredients"]) - 1:
            
            print(str(key) + "fuck")
            
            
#print(recipes_data[399])
#print(true_recipes_positions)

82normal
221normal
257normal
399normal
406normal
419normal
473normal
506normal
635normal
654normal
681normal
727normal
825normal
930normal
934normal
955normal
979normal
992normal
1021normal
1136normal
1200normal
1325normal
1351fuck
1367normal
1377normal
1401normal
1421normal
1424fuck
1429normal
1432normal
1591normal
1632normal
1647normal
1664normal
2037normal
2105normal
2157normal
2180fuck
2283normal
2345normal
2404normal
2459fuck
2636normal
2671normal
2750normal
2808normal
2900normal
3053normal
3097normal
3117normal
3315normal
3482normal
3502normal
4117normal
4214normal
4220normal
4497normal
4622normal
4668normal
4749normal
4827normal
4862normal
4999normal
5107normal
5125normal
5172normal
5326normal
5351normal
5355normal
5580normal
5584normal
5655normal
5806normal
5831normal
6202normal
6235normal
6462normal
6494normal
6738normal
6745normal
6750normal
6919normal
7164normal
7205normal
7274normal
7544normal
7649normal
7692normal
7858normal
7862normal
7873normal
7890normal
8156normal
8305

In [42]:
for key, recipe in enumerate(recipes_data):
    
    if key == 1351 or key == 1424 or key == 2180 or key == 2459:
        print(recipe)
print(true_recipes_positions)

{'ingredients': [{'text': '1 can chipotles en adobo'}], 'url': 'http://www.cookstr.com/recipes/chipotle-adobo-puree', 'partition': 'train', 'title': 'Chipotle Adobo Puree', 'id': '0058d9df49', 'instructions': [{'text': 'Scrape the contents of a can of chipotles en adobo into a blender and blend at low speed until smooth.'}, {'text': 'Store in a small glass jar in the refrigerator for up to 2 months.'}]}
{'ingredients': [{'text': '2 fluid ounces tequila'}, {'text': '1 fluid ounce Cointreau liqueur'}, {'text': '1 -2 fluid ounce Sprite'}, {'text': '1 fluid ounce orange juice'}, {'text': '12 lime, juice of'}], 'url': 'http://www.food.com/recipe/trudys-mexican-martini-95987', 'partition': 'val', 'title': "Trudy's Mexican Martini", 'id': '005d42774d', 'instructions': [{'text': 'Shake all ingredients and strain into glass rimmed with salt; add stuffed olives.'}]}
{'ingredients': [{'text': '2 oz Frangelico'}, {'text': '2 oz vodka'}], 'url': 'https://cookpad.com/us/recipes/356173-almondtini', '

In [None]:
# ---------------------------- Correcting Fractions in Food.com ----------------------------

relative_units = {"cup": 240, "cups": 240, "c.": 240, "tablespoon": 15, "tablespoons": 15, "bar": 150, "bars": 150, "lump": 5, "lumps": 5, "piece": 25, "pieces": 25, "portion": 100, "portions": 100, "slice": 10, "slices": 10, "teaspoon": 5, "teaspoons": 5, "tbls": 15, "tsp": 5, "jar": 250, "jars": 250, "pinch": 1, "pinches": 1, "dash": 1, "can": 330, "box": 250, "boxes": 250, "small": 250, "medium": 500, "large": 750, "big": 750, "sprig": 0.1, "sprigs": 0.1, "bunch": 100, "bunches": 100, "leaves": 0.1, "packs": 100, "packages": 100, "pck": 100, "pcks": 100, "stalk": 0.1}

modified_recipes_data = original_recipes_data

#print(original_recipes_data)

for key, recipe in enumerate(original_recipes_data):
    
    if (".food.com" or "/food.com") in recipe["url"]:
        
        for key2, ingredient in enumerate(recipe["ingredients"]):
            
            if re.search(r"[1-5][1-9]", ingredient["text"]):
                
                number = re.search(r"[1-5][1-9]", ingredient["text"]).group()
                
                split_ingredient_list = (ingredient["text"].split(" "))
                
                for index in range(len(split_ingredient_list) - 1):
                                        
                    if split_ingredient_list[index] == number and split_ingredient_list[index + 1] in list(relative_units.keys()):
                        
                        split_ingredient = split_ingredient_list[index][0] + "/" + split_ingredient_list[index][1]
                        split_ingredient = "".join(split_ingredient)
                        
                        split_ingredient_list[index] = split_ingredient
                        split_ingredient_list = " ".join(split_ingredient_list)
                        
                        modified_recipes_data[key]["ingredients"][key2]["text"] = split_ingredient_list
                        

In [None]:
# ---------------------------- Exporting Corrected Recipe Dataset ----------------------------

with open('./data/recipe1M+/modified_modified_recipes_data_mod.json', 'w') as json_file:
    
    json.dump(modified_modified_recipes_data, json_file)
    

In [None]:
try:
    print(detect("m 1, 5 . . ( )"))
    
except ValueError:
    print("wrong")

## Natural Language Processing

### Creating Units Vocabulary

In [None]:
p = inflect.engine()

with open('./vocabulary/ingr_vocab.pkl', 'rb') as f: # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f)

f = open('./data/recipe1M+/layer11.json')
original_recipes_data = (json.load(f))#[0:100000]
f.close()

units_list_temp = set()

def get_units(ingredient_text_input, number_input):
    
    split_ingredient_list2 = ingredient_text_input.replace("/", " ").replace("-", " ").translate({ord(ii): None for ii in string.punctuation.replace(".", "")}).lower().split(" ")
    print(split_ingredient_list2)
    
    for number_input_it in number_input:
        
        for iji in range(len(split_ingredient_list2) - 1):
                                            
            if split_ingredient_list2[iji] == number_input_it and re.search(r"[0-9]", split_ingredient_list2[iji + 1]) is None and re.search(r".\b", split_ingredient_list2[iji + 1]) is None:
                            
                units_list_temp.add(split_ingredient_list2[iji + 1])
                break

for original_recipes_data_it in original_recipes_data:
        
        for ingredient_it in original_recipes_data_it["ingredients"]:
                        
            # search_number = re.search(r"\d", ingredient_text)
            
            number_array = re.findall(r"\d", ingredient_it["text"])
            
            if number_array:
                
                # search_number.group() # [0-9]|[0-9][0-9]|[0-9][0-9][0-9]|[0-9][0-9][0-9][0-9]                
                get_units(ingredient_it["text"], number_array)
                
units_list = list(units_list_temp)
units_list.sort()

print(units_list)

# Save a dictionary into a txt file.
with open('./vocabulary/units_list.txt', 'w') as f:
    for item in units_list:
        if item != "<end>" and item != "<pad>":
            f.write("%s\n" % item)
            
#for jj, ingredients_list_it in enumerate(ingredients_list):
                    
                    #if predicted_unit in ingredients_list_it or predicted_unit in p.plural(ingredients_list_it):
                
                        #break
                
                    #elif jj == len(ingredients_list) - 1:

In [None]:
hey = [0, 4, 1, 4, 9]

print(set(hey))

print(0 in set(hey))

for e in set(hey):
    print(e)

In [None]:
p = inflect.engine()

with open('./vocabulary/ingr_vocab.pkl', 'rb') as f: # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f)  

lineList = [line.rstrip('\n') for line in open('./vocabulary/units_list.txt')]
  
print(lineList)

In [None]:
final_units = []

for unit in lineList:

    for index, ingredients_list_it in enumerate(ingredients_list):
                        
        if unit == ingredients_list_it or unit == p.plural(ingredients_list_it):
                    
            break
                    
        elif index == len(ingredients_list) - 1:
            
            final_units.append(unit)

print(len(final_units))

In [None]:
# Save a dictionary into a txt file.
with open('./vocabulary/units_list_final.txt', 'w') as f:
    for item in final_units:
        if item != "<end>" and item != "<pad>":
            f.write("%s\n" % item)
            

In [None]:
food = wordnet.synset('food.n.02')

print("red" in webcolors.CSS3_NAMES_TO_HEX)

with open("./vocabulary/units_list_final - cópia.txt") as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
lines = [x.strip() for x in content] 

filtered_stopwords = [word for word in lines if word not in stopwords.words('english')]
filtered_verbs_adjectives_adverbs = []

for w in filtered_stopwords:
    if wordnet.synsets(w) and wordnet.synsets(w)[0].pos() != "v" and wordnet.synsets(w)[0].pos() != "a" and wordnet.synsets(w)[0].pos() != "r" and w not in webcolors.CSS3_NAMES_TO_HEX and w not in list(set([w for s in food.closure(lambda s:s.hyponyms()) for w in s.lemma_names()])):
        filtered_verbs_adjectives_adverbs.append(w)
    elif wordnet.synsets(w) == []:
        filtered_verbs_adjectives_adverbs.append(w)

print(filtered_stopwords)
print(len(lines))
print(len(filtered_stopwords))
print(len(filtered_verbs_adjectives_adverbs))

# Save a dictionary into a txt file.
with open('./vocabulary/units_list_final_filtered.txt', 'w') as f:
    for item in filtered_verbs_adjectives_adverbs:
        if item != "<end>" and item != "<pad>":
            f.write("%s\n" % item)
            

In [None]:
food = wordnet.synset('food.n.02')
len(list(set([w for s in food.closure(lambda s:s.hyponyms()) for w in s.lemma_names()])))
list(set([w for s in food.closure(lambda s:s.hyponyms()) for w in s.lemma_names()]))

In [None]:
b = TextBlob("En una procesadora o batidora mescla el queso crema, pistachos, y 1 de los ajos")
b.detect_language()

### Retrieving Ingredients, Units and Quantities from Recipe1M+

In [164]:
# ---------------------------- Creating Vocabulary to Import Units ----------------------------

absolute_units = {"litre": 1000, "litres": 1000, "ounce": 28, "ounces": 28, "gram": 1, "grams": 1, "grm": 1, "kg": 1000, "kilograms": 1000, "ml": 1, "millilitres": 1, "oz": 28, "l": 1000, "g": 1, "lbs": 454, "pint": 568, "pints": 568, "lb": 454, "gallon": 4546, "gal": 4546, "quart": 1137, "quarts": 1137}
relative_units = {"cup": 240, "cups": 240, "c.": 240, "tablespoon": 15, "tablespoons": 15, "bar": 150, "bars": 150, "lump": 5, "lumps": 5, "piece": 25, "pieces": 25, "portion": 100, "portions": 100, "slice": 10, "slices": 10, "teaspoon": 5, "teaspoons": 5, "tbls": 15, "tsp": 5, "jar": 250, "jars": 250, "pinch": 1, "pinches": 1, "dash": 1, "can": 330, "box": 250, "boxes": 250, "small": 250, "medium": 500, "large": 750, "big": 750, "sprig": 0.1, "sprigs": 0.1, "bunch": 100, "bunches": 100, "leaves": 0.1, "packs": 100, "packages": 100, "pck": 100, "pcks": 100, "stalk": 0.1}

# ---------------------------- Save a dictionary into a txt file ----------------------------

with open('./vocabulary/absolute_units.json', 'w') as json_file:
    json.dump(absolute_units, json_file)
    
with open('./vocabulary/relative_units.json', 'w') as json_file:
    json.dump(relative_units, json_file)
    
# ---------------------------- Importing and Exporting as Text File Ingredient's Vocabulary ----------------------------

# Reading ingredients vocabulary.
# with open('./vocabulary/instr_vocab.pkl', 'rb') as f: # Includes every ingredient, cooking vocabulary and punctuation signals necessary to describe a recipe in the dataset.
with open('./vocabulary/ingr_vocab.pkl', 'rb') as f: # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f) # Using vocabulary ingredients to retrieve the ones present in the recipes.
    
# Save a dictionary into a txt file.
with open('./vocabulary/ingr_vocab.txt', 'w') as f:
    for item in ingredients_list:
        if item != "<end>" and item != "<pad>":
            f.write("%s\n" % item)
    
# ---------------------------- Importing Ingredients, Units and Quantities ----------------------------
    
relative_units.update(absolute_units)
units_list_dict = relative_units

ingrs_quants_units_final = {}

for recipe in recipes_data:
    
    ingrs_quants_units_final[recipe["id"]] = ingredient_quantities(recipe, ingredients_list, units_list_dict)
        
# Exporting data for testing
#with open('./data/test/new_id_ingredients_tokenized_position.json', 'w') as json_file:
    #json.dump(new_id_ingredients_tokenized_position, json_file)
    
#with open('./data/test/id_ingredients.json', 'w') as json_file:
    #json.dump(id_ingredients, json_file)

In [165]:
new_id_ingredients_tokenized = {}

for key, value in ingrs_quants_units_final.items():
    
    new_id_ingredients_tokenized[key] = []
    
    for value2 in value:
        
        new_id_ingredients_tokenized[key].append(value2["ingredient"])
        
print(new_id_ingredients_tokenized)

{'000018c8a5': ['penne', 'cheese', 'cheese', 'gruyere', 'chili', 'butter', 'stick', 'flour', 'milk', 'cheese', 'cheese', 'salt', 'chili', 'garlic'], '000033e39b': ['macaroni', 'cheese', 'celery', 'pepper', 'greens', 'pimentos', 'mayonnaise', 'salad dressing', 'vinegar', 'salt', 'dill'], '000035f7ed': ['tomato', 'salt', 'onion', 'pepper', 'greens', 'pepper', 'pepper', 'cucumber', 'oil', 'olive', 'basil'], '00003a70b1': ['milk', 'water', 'butter', 'potato', 'corn', 'cheese', 'onion'], '00004320bb': ['gelatin', 'watermelon', 'water', 'cool whip', 'watermelon', 'cracker'], '0000631d90': ['coconut', 'beef', 'garlic', 'salt', 'pepper', 'juice', 'lemon', 'soy sauce', 'cornstarch', 'pineapple', 'liquid', 'orange', 'liquid', 'nuts', 'cashews'], '000075604a': ['chicken', 'tea', 'kombu', 'pepper'], '00007bfd16': ['rhubarb', 'rhubarb', 'sugar', 'gelatin', 'strawberry', 'strawberries', 'cake', 'water', 'butter', 'margarine'], '000095fc1d': ['vanilla', 'fat', 'yogurt', 'strawberry', 'strawberries', 

### Retrieving Cooking Processes from Recipe1M+

### Ingredients -> Vector (Word2Vec)

Converting ingredients into 50 dimensional vectors to facilitate 

In [None]:
# Ingredients are converted into vectors and, by averaging the ones belonging to the same recipe, a vector for the
# recipe is obtained.

if path.exists("./trained_models/model.bin"):
    
    corpus = new_id_ingredients_tokenized.values()
    
    model = Word2Vec(corpus, min_count=1,size= 50,workers=3, window =10, sg = 0)

    words = list(model.wv.vocab)

# By default, the model is saved in a binary format to save space.
    model.wv.save_word2vec_format('./trained_models/model.bin')

# Save the learned model in ASCII format and review the contents
    model.wv.save_word2vec_format('./trained_models/model.txt', binary=False)

else:
    
    model = gensim.models.KeyedVectors.load_word2vec_format('./trained_models/model.bin', binary=True) # Saved model can then be loaded again by calling the Word2Vec.load() function.
    

### Ingredients -> Vector (Every vector component corresponds to a word)

### Recipes -> Vector (Word2Vec)

Representing recipes in their vectorized way by taking the average of the vectors of the ingredients present.

In [None]:
new_id_ingredients_tokenized_keys = new_id_ingredients_tokenized.keys()

id_ingreVectorized = {}
id_recipe = {}

for recipe_id in new_id_ingredients_tokenized_keys:
    
    id_ingreVectorized[recipe_id] = []
    
    for recipe_ingr in new_id_ingredients_tokenized[recipe_id]:
        
        id_ingreVectorized[recipe_id].append(model[recipe_ingr])

    id_recipe[recipe_id] = sum(id_ingreVectorized[recipe_id])/len(new_id_ingredients_tokenized[recipe_id])
    

### Recipes -> Vector (Every vector component corresponds to a word)

### Dimensionality Reduction (Ingredients)

PCA and T-SNE intended to decrease the dimensionality (50) of the vectors representing ingredients, so that they can be 
plotted in visualizable way.

In [None]:
X_ingredients = model[model.wv.vocab]

print(X_ingredients)

# ---------------------------- PCA ----------------------------
X_ingredients_embedded1 = PCA(n_components=2).fit_transform(X_ingredients)

# ---------------------------- T-SNE ----------------------------
X_ingredients_embedded2 = TSNE(n_components=2).fit_transform(X_ingredients)

### Clustering Ingredients

Finding groups of ingredients that most often co-occur in the same recipes.

In [None]:
# ---------------------------- Build Distance Dataframe & Networkx Graph ----------------------------

data = list(X_ingredients_embedded1) # list(X_ingredients_embedded1) / model[model.wv.vocab]
ctys = list(model.wv.vocab)
df = pandas.DataFrame(data, index=ctys)

distances = (pandas.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)).rdiv(1) # Creating dataframe from distance matrix between ingredient vectors.
# G = networkx.from_pandas_adjacency(distances) # Creating networkx graph from pandas dataframe.
X = numpy.array(df.values) # Creating numpy array from pandas dataframe.

# ---------------------------- Clustering ----------------------------

# Mean Shift

#  ingredientModule = MeanShift().fit(X).labels_

# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

# ingredientModule = DBSCAN(eps=0.3, min_samples=2).fit(X).labels_ # Noisy samples are given the label -1.

# Louvain

# ingredientModule = list((community.best_partition(G)).values())

# Infomap

ingredientModule = infomap_function(distances, ctys)

### Number of Times Ingredients are used in Recipes

Retrieving how often different ingredients are used across the recipe dataset.

In [None]:
ingredients_count = {}

for ingredient in ingredients_list:

    if "_" in ingredient:
        ingredients_count[ingredient.replace("_", " ")] = 0
        continue

    ingredients_count[ingredient] = 0 # In case there is no _

for recipe in recipes_data:
        
    for recipe_standardized in ingrs_quants_units_final[recipe["id"]]:
        
        ingredients_count[recipe_standardized["ingredient"]] = ingredients_count[recipe_standardized["ingredient"]] + recipe_standardized["quantity"]

# -------------------------------

ingredientSize = {}
markerSizeConstant = 1

for ingredient_vocabulary in list(model.wv.vocab):
    
    ingredientSize[ingredient_vocabulary] = markerSizeConstant*ingredients_count[ingredient_vocabulary]
    
ingredientSize = list(ingredientSize.values())

print(ingredientSize)

###  PCA & T-SNE Visualization (Ingredients)

Although some informamation was inevitably lost, a pair of the most variable components was used. <br>
Size of each marker is proportional to the number of times the ingredient is used in the recipe dataset. <br>
Markers with a similar color group ingredients that are usually used together in the recipe dataset.

In [None]:
# ---------------------------- Matplotlib ----------------------------
matplotlib_function(X_ingredients_embedded1, X_ingredients_embedded2, list(model.wv.vocab), ingredientModule, ingredientSize, "Ingredients")

In [None]:
# ---------------------------- Plotly ----------------------------
plotly_function(X_ingredients_embedded1, X_ingredients_embedded2, list(model.wv.vocab), ingredientModule, ingredientSize, "true", "Ingredients")

# Toggle Button for Labels
toggle = w.ToggleButton(description='No Labels')
out = w.Output(layout=w.Layout(border = '1px solid black'))

def fun(obj):
    with out:
        if obj['new']:  
            plotly_function(X_ingredients_embedded1, X_ingredients_embedded2, list(model.wv.vocab), ingredientModule, ingredientSize, "false")
        else:
            plotly_function(X_ingredients_embedded1, X_ingredients_embedded2, list(model.wv.vocab), ingredientModule, ingredientSize, "true")

toggle.observe(fun, 'value')
display(toggle)
display(out)

# (Run in localhost to visualize it)

In [None]:
# ---------------------------- Seaborn ----------------------------
seaborn_function(X_ingredients_embedded1, X_ingredients_embedded2, list(model.wv.vocab), ingredientModule, ingredientSize)

### Dimensionality Reduction (Recipes)

PCA and T-SNE intended to decrease the dimensionality (50) of the vectors representing recipes, so that they can be 
plotted in visualizable way. Although some informamation was inevitably lost, a pair of the most variale components was used.

In [None]:
# ---------------------------- PCA ----------------------------
X_recipes_embedded1 = PCA(n_components=2).fit_transform(list(id_recipe.values()))

# ---------------------------- T-SNE ----------------------------
X_recipes_embedded2 = TSNE(n_components=2).fit_transform(list(id_recipe.values()))

### Clustering Recipes

Finding groups of recipes that most correspond to different types of cuisine.

In [None]:
# ---------------------------- Build Distance Dataframe & Networkx Graph ----------------------------

data = list(X_recipes_embedded1) # list(X_recipes_embedded1) / id_recipe.values()
ctys = id_recipe.keys()
df = pandas.DataFrame(data, index=ctys)

distances = (pandas.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)).rdiv(1)
# G = networkx.from_pandas_adjacency(distances) # Creating networkx graph from pandas dataframe.
X = numpy.array(df.values) # Creating numpy array from pandas dataframe.

# ---------------------------- Clustering ----------------------------

# Mean Shift

recipeModules = MeanShift().fit(X).labels_

# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

# recipeModules = DBSCAN(eps=0.3, min_samples=2).fit(X).labels_ # Noisy samples are given the label -1.

# Louvain

# recipeModules = list((community.best_partition(G)).values())

# Infomap

# recipeModules = infomap_function(1./distances, ctys)

### Number of Ingredients in each Recipe

Calculated so that the size of each recipe marker could be proportional to the number of ingredients present.

In [None]:
numberIngredients = []
markerSizeConstant = 1

for key, value in new_id_ingredients_tokenized.items():
    
    numberIngredients.append(markerSizeConstant*len(value))

print(numberIngredients)

### PCA & T-SNE Visualization

Size of each marker is proportional to the number of ingredients a given recipe contains. <br>
Markers with a similar color group recipes that contain the higher number of common ingredients.

In [None]:
# ---------------------------- Matplotlib ----------------------------
matplotlib_function(X_recipes_embedded1, X_recipes_embedded2, list(id_recipe.keys()), recipeModules, numberIngredients, "Recipes")

In [None]:
# ---------------------------- Plotly ----------------------------
plotly_function(X_recipes_embedded1, X_recipes_embedded2, list(id_recipe.keys()), recipeModules, numberIngredients, "true", "Recipes")

toggle = w.ToggleButton(description='No Labels')

out = w.Output(layout=w.Layout(border = '1px solid black'))

def fun(obj):
    with out:
        if obj['new']:  
            plotly_function(X_recipes_embedded1, X_recipes_embedded2, list(id_recipe.keys()), recipeModules, numberIngredients, "false")
        else:
            plotly_function(X_recipes_embedded1, X_recipes_embedded2, list(id_recipe.keys()), recipeModules, numberIngredients, "true")

toggle.observe(fun, 'value')
display(toggle)
display(out)

# (Run in localhost to be able to visualize it)

In [None]:
# ---------------------------- Seaborn ----------------------------
seaborn_function(X_recipes_embedded1, X_recipes_embedded2, list(id_recipe.keys()), recipeModules, numberIngredients)

### Importing Anticancer Ingredients

Getting the anticancer ingredients and the number of anticancer molecules each one contain. Further data processing to 
facilitate analysis.

In [None]:
ac_data = pandas.read_csv("./data/food_compound.csv", delimiter = ",")
ac_data.head()

# Selecting Useful Anti-Cancer Ingredients Columns

ac_data_mod = ac_data[['Common Name', 'Number of CBMs']]
ac_data_mod

#  Dropping Nan Rows from Anti-Cancer Ingredients Table

ac_data_mod.replace("", numpy.nan)
ac_data_mod = ac_data_mod.dropna()
ac_data_mod

# Converting DataFrame to Dictionary

ingredient_anticancer = {}

for index, row in ac_data_mod.iterrows():
    
    ingredient_anticancer[row['Common Name'].lower()] = row['Number of CBMs']

### Recipes -> Score

Calculating the score of each recipe taking into account the number of cancer-beating molecules. <br>
Data Source: Veselkov, K., Gonzalez, G., Aljifri, S. et al. HyperFoods: Machine intelligent mapping of cancer-beating molecules in foods. Sci Rep 9, 9237 (2019) doi:10.1038/s41598-019-45349-y

In [None]:
recipe_cancerscore = {}
recipe_weight = {}

for key, value in ingrs_quants_units_final.items():
    
    recipe_weight[key] = 0
    
    for recipe_standardized in value:
        
        recipe_weight[key] = recipe_weight[key] + recipe_standardized["quantity (ml)"]
        
recipe_weight

# ----------------------

recipe_cancerscore = {}
ingredient_anticancer_keys = list(ingredient_anticancer.keys())

for key, value in ingrs_quants_units_final.items():
    
    recipe_cancerscore[key] = 0
    
    for recipe_standardized in value:
        
        for ingredient_anticancer_iterable in ingredient_anticancer_keys:
            
            if recipe_standardized["ingredient"] in ingredient_anticancer_iterable:
        
                recipe_cancerscore[key] = recipe_cancerscore[key] + ingredient_anticancer[ingredient_anticancer_iterable]*(recipe_standardized["quantity (ml)"])/(recipe_weight[key])
                
                break

### Best Recipes Decreasing Order

Printing, in a decreasing order, the recipes with a bigger number of cancer-beating molecules. 

In [None]:
res1 = pandas.DataFrame.from_dict(recipe_cancerscore, orient='index', columns=['Anticancer Molecules/Number Ingredients'])
res2 = pandas.DataFrame.from_dict(id_url, orient='index', columns=['Recipe URL'])

pandas.set_option('display.max_colwidth', 1000)

pandas.concat([res1, res2], axis=1).reindex(res1.index).sort_values(by=['Anticancer Molecules/Number Ingredients'], ascending=False).head()

# Creating a dataframe object from listoftuples
# pandas.DataFrame(recipe_cancerscore_dataframe)

## Recipes -> Nutritional Information

Retrieving nutritional information for each ingredient present in the recipe dataset. <br>
Overall recipe score will be calculated taking into account not only the number of cancer-beating molecules, but also
nutrtional content. <br>
Data Source: U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.

In [None]:
with open('./vocabulary/ingr_vocab.pkl', 'rb') as f: # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f)[1:-1]
    
print(len(ingredients_list))

In [None]:
# -------------------------------- Extracting Ingredients

new_ingredients_list = [] # List of ingredients from the vocabulary with spaces instead of underscores.

for i in range(0, len(ingredients_list)):

    if "_" in ingredients_list[i]:
        new_ingredients_list.append(ingredients_list[i].replace("_", " "))
        continue

    new_ingredients_list.append(ingredients_list[i]) # In case there is no _
print(len(new_ingredients_list))

In [None]:
# ---------------------------- Get FoodData Central IDs for Each Ingredient from Vocab ----------------------------

if os.path.exists('./vocabulary/ingredient_fdcIds.json'):
    
    f = open('./vocabulary/ingredient_fdcIds.json')
    ingredient_fdcIds = (json.load(f))# [0:100]
    f.close()
                
else:
    
    API_Key = "BslmyYzNnRTysPWT3DDQfNv5lrmfgbmYby3SVsHw"
    URL = "https://api.nal.usda.gov/fdc/v1/search?api_key=" + API_Key
    
    ingredient_fdcIds = {}
    
    for value in new_ingredients_list:
        
        ingredient_fdcIds[value] = {}
        ingredient_fdcIds[value]["fdcIds"] = []
        ingredient_fdcIds[value]["descriptions"] = []
        
        # ------------------------------------------ ADDING RAW
        PARAMS2 = {'generalSearchInput': value + " raw"}
        r2 = requests.get(url = URL, params = PARAMS2)
        data2 = r2.json()
        
        raw = False
        
        if "foods" in data2 and value + " raw" in (data2["foods"][0]["description"]).lower().replace(",", ""):
            
            raw_id = data2["foods"][0]["fdcId"]
            raw_description = data2["foods"][0]["description"]
            
            ingredient_fdcIds[value]["fdcIds"].append(raw_id)
            ingredient_fdcIds[value]["descriptions"].append(raw_description)
            
            raw = True
        
        # id_nutritionalInfo[value] = []
        
        # for i in range(len(value)):
        # Defining a params dict for the parameters to be sent to the API 
        PARAMS = {'generalSearchInput': value} 
        
        # Sending get request and saving the response as response object 
        r = requests.get(url = URL, params = PARAMS)
        
        # Extracting data in json format 
        data = r.json() 
        
        if "foods" in data:
            
            numberMatches = len(data["foods"])
            
            if numberMatches > 10 and raw == True:
                numberMatches = 9
            elif numberMatches > 10 and raw == False:
                numberMatches = 10
            
            for i in range(numberMatches):
                
                ingredient_fdcIds[value]["fdcIds"].append(data["foods"][i]["fdcId"])
                ingredient_fdcIds[value]["descriptions"].append(data["foods"][i]["description"])
                
#print(ingredient_fdcIds)

In [None]:
# ---------------------------- Get All Nutritional Info from Vocab ----------------------------

if os.path.exists('./vocabulary/ingredient_nutritionalInfo.json'):
    
    f = open('./vocabulary/ingredient_nutritionalInfo.json')
    ingredient_nutritionalInfo = (json.load(f))# [0:100]
    f.close()
                    
else:

    API_Key = "BslmyYzNnRTysPWT3DDQfNv5lrmfgbmYby3SVsHw"
    
    ingredient_nutritionalInfo = {}
    
    for key, value in ingredient_fdcIds.items():
        
        if value["fdcIds"]:
        
            URL = "https://api.nal.usda.gov/fdc/v1/" + str(value["fdcIds"][0]) + "?api_key=" + API_Key

            # Sending get request and saving the response as response object 
            r = requests.get(url = URL)

            ingredient_nutritionalInfo[key] = {}
            ingredient_nutritionalInfo[key]["fdcId"] = value["fdcIds"][0]
            ingredient_nutritionalInfo[key]["description"] = value["descriptions"][0]
            ingredient_nutritionalInfo[key]["nutrients"] = {}

            for foodNutrient in r.json()["foodNutrients"]:

                if "amount" in foodNutrient.keys():

                    ingredient_nutritionalInfo[key]["nutrients"][foodNutrient["nutrient"]["name"]] = [foodNutrient["amount"], foodNutrient["nutrient"]["unitName"]]

                else:

                    ingredient_nutritionalInfo[key]["nutrients"][foodNutrient["nutrient"]["name"]] = "NA"
                    
        else:
            
            ingredient_nutritionalInfo[key] = {}
            

In [None]:
# ---------------------------- Correcting Units in JSON with Nutritional Info ----------------------------

if os.path.exists('./vocabulary/ingredient_nutritionalInfo_corrected.json'):
    
    f = open('./vocabulary/ingredient_nutritionalInfo_corrected.json')
    ingredient_nutritionalInfo_modified = (json.load(f))# [0:100]
    f.close()
                    
else:

    ingredient_nutritionalInfo_modified = ingredient_nutritionalInfo

    for nutrient, dictionary in ingredient_nutritionalInfo.items():
        
        if "nutrients" in dictionary:

            for molecule, quantity in dictionary["nutrients"].items():

                if quantity != "NA":

                    if quantity[1] == "mg":

                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]/1000
                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'

                    elif quantity[1] == "\u00b5g":

                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]/1000000
                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'

                    elif quantity[1] == "kJ":

                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]/4.182
                        ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'kcal'

                    elif quantity[1] == "IU":

                        if "Vitamin A" in molecule:

                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]*0.45/1000000
                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'

                        elif "Vitamin C" in molecule:

                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]*50/1000000
                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'

                        elif "Vitamin D" in molecule:

                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]*40/1000000
                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'

                        elif "Vitamin E" in molecule:

                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][0] = quantity[0]*0.8/1000
                            ingredient_nutritionalInfo_modified[nutrient]["nutrients"][molecule][1] = 'g'
                        

In [None]:
# ---------------------------- Get Medium Sizes for each Ingredient in Vocab ----------------------------

f = open('./vocabulary/ingredient_fdcIds.json')
ingredient_fdcIds = (json.load(f))#[0:10]
f.close()

API_Key = "BslmyYzNnRTysPWT3DDQfNv5lrmfgbmYby3SVsHw"
    
ingredient_mediumSize = {}
    
for key, value in ingredient_fdcIds.items():
    
    aux = True
    
    for id_key, fdcId in enumerate(value["fdcIds"][0:5]):
        
        if not aux:
            break
        
        URL = "https://api.nal.usda.gov/fdc/v1/" + str(fdcId) + "?api_key=" + API_Key
        
        # Sending get request and saving the response as response object 
        r = requests.get(url = URL)
        
        foodPortions = r.json()["foodPortions"]
        i = 0
        first_cycle = True
        second_cycle = False
        third_cycle = False
                
        while i < len(foodPortions):
            
            if "portionDescription" in foodPortions[i]:
                
                if "medium" in foodPortions[i]["portionDescription"] and first_cycle:
                            
                    ingredient_mediumSize[key] = {"fdcId": fdcId, "description": value["descriptions"][id_key], "weight": foodPortions[i]["gramWeight"]}
                    aux = False
                    break
                            
                elif i == len(foodPortions) - 1 and first_cycle:
                    i = -1
                    first_cycle = False
                    second_cycle = True
                    third_cycle = False
                            
                elif "Quantity not specified" in foodPortions[i]["portionDescription"] and second_cycle:
                            
                    ingredient_mediumSize[key] = {"fdcId": fdcId, "description": value["descriptions"][id_key], "weight": foodPortions[i]["gramWeight"]}
                    aux = False
                    #print("Quantity not specified" + key)
                    break
                            
                elif i == len(foodPortions) - 1 and second_cycle:
                    i = -1
                    first_cycle = False
                    second_cycle = False
                    third_cycle = True
                            
                elif key in foodPortions[i]["portionDescription"] and third_cycle:
               
                    ingredient_mediumSize[key] = {"fdcId": fdcId, "description": value["descriptions"][id_key], "weight": foodPortions[i]["gramWeight"]}
                    aux = False
                    #print(key)
                    break
                            
                elif i == len(foodPortions) - 1 and third_cycle:
                    i = -1 
                    ingredient_mediumSize[key] = {"fdcId": "NA", "description": "NA", "weight": "NA"}
                    first_cycle = False
                    second_cycle = False
                    third_cycle = False
                    break
            else:
                
                break
                
            i = i + 1    
                
#print(ingredient_mediumSize)

In [None]:
# ---------------------------- Save JSON File with Nutritional Info ----------------------------

with open('./vocabulary/id_ingredients_cuisine.json', 'w') as json_file:
    json.dump(id_ingredients_cuisine, json_file)

## Recipes -> Cuisines

### Importing Kaggle and Nature Dataset

In [81]:
#data = pandas.read_csv("./data/jaan/kaggle_and_nature.csv", skiprows=5)
#pandas.read_table('./data/jaan/kaggle_and_nature.csv')
#data.head()

id_ingredients_cuisine = []
cuisines = []
        
with open('./data/jaan/kaggle_and_nature.csv', newline = '') as games:      
    
    game_reader = csv.reader(games, delimiter='\t')
    
    i = 0
    
    for game in game_reader:
                
        id_ingredients_cuisine.append({"id": i, "ingredients": [ingredient.replace("_", " ") for ingredient in game[0].split(",")[1:]], "cuisine": game[0].split(",")[0]})
        cuisines.append(game[0].split(",")[0])
        
        i = i + 1
        
print(len(cuisines))

96250


### Creating Synonymous Vocabulary

In [92]:
# ---------------------------- Importing Recipe1M+ Vocabulary ----------------------------

with open('./vocabulary/ingr_vocab.pkl', 'rb') as f: # Includes every ingredient present in the dataset.
    ingredients_list = pickle.load(f)
    
#print(len(ingredients_list))

# ---------------------------- Creating Vocabulary to Kaggle and Nature Dataset----------------------------

vocabulary = set()

for recipe in id_ingredients_cuisine:
    
    for ingredient in recipe["ingredients"]:
        
        vocabulary.add(ingredient.replace(" ", "_"))

#print(vocabulary)
print(len(vocabulary))
print(len(ingredients_list))

synonymous = {}

for ingredient2 in list(vocabulary):
    
    synonymous[ingredient2] = "new"

aux = 0

for ingredient2 in list(vocabulary):

    for ingredient1 in ingredients_list:
       
        if ingredient1 == ingredient2:
            #print(ingredient2 + " " + ingredient1)
            synonymous[ingredient2] = ingredient1
            break
            
        elif ingredient1 in ingredient2:
            
            synonymous[ingredient2] = ingredient1
            
    if synonymous[ingredient2] == "new":
        aux = aux + 1
        
print(len(synonymous))

new_id_ingredients_cuisine = id_ingredients_cuisine
            
for key1, recipe in enumerate(id_ingredients_cuisine):
    
    for key2, ingredient in enumerate(recipe["ingredients"]):
        
        if synonymous[id_ingredients_cuisine[key1]["ingredients"][key2].replace(" ", "_")] == "new":
            
            new_id_ingredients_cuisine[key1]["ingredients"].remove(id_ingredients_cuisine[key1]["ingredients"][key2])
            continue
        
        new_id_ingredients_cuisine[key1]["ingredients"][key2] = synonymous[id_ingredients_cuisine[key1]["ingredients"][key2].replace(" ", "_")]   
        
    if len(id_ingredients_cuisine[key1]["ingredients"]) < 2:
        new_id_ingredients_cuisine.remove(id_ingredients_cuisine[key1])
#print(len(synonymous))

881
1488
881


### Ingredients and Recipes to Vectors

In [108]:
# ---------------------------- Converting Ingredients to Vectors ----------------------------

#ingredients = set()

#for key, recipe in enumerate(new_id_ingredients_cuisine):
    
    #for key2, ingredient in enumerate(recipe["ingredients"]):
    
        #ingredients.add(recipe["ingredients"][key2])
    
#ingredient_list = ingredients

ingredient_list = ingredients_list

print(len(ingredient_list))

ingredient_vector = {}

for key, value in enumerate(ingredient_list):
    
    ingredient_vector[value] = [0] * len(ingredient_list)
    ingredient_vector[value][key] = 1
    
#print(ingredient_vector["cinnamon"])

# ---------------------------- Converting Recipes to Vectors ----------------------------

id_ingredients_cuisine_vectorized = {}

# print(len(id_ingredients_cuisine))

for key1, recipe in enumerate(new_id_ingredients_cuisine[0:20000]):
    
    id_ingredients_cuisine_vectorized[key1] = []
    
    for ingredient in recipe["ingredients"]:
        
        id_ingredients_cuisine_vectorized[key1].append(ingredient_vector[ingredient])
        
    id_ingredients_cuisine_vectorized[key1] = numpy.sum(numpy.array(id_ingredients_cuisine_vectorized[key1]), 0)
    
#print(id_ingredients_cuisine_vectorized)

1488


### Support Vector Classifier (Linear)

In [109]:
# ---------------------------- Importing Data ----------------------------

X = list(id_ingredients_cuisine_vectorized.values())
y = cuisines[0:20000]

#for vector in list(id_ingredients_cuisine_vectorized.values()):
    #print(len(vector))

# ---------------------------- Creating Training & Testing Sets ----------------------------

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#print(X_train[0:10])
#print(y_train[0:10])

clf = svm.LinearSVC(max_iter = 5000)
clf.fit(X_train, y_train)

# ---------------------------- Save Model ----------------------------

#filename = './trained_models/finalized_model2.sav'
#pickle.dump(clf, open(filename, 'wb'))

# ---------------------------- Load Model ----------------------------

#loaded_model = pickle.load(open(filename, 'rb'))
# result = loaded_model.score(X_test, Y_test)

#print(id_ingredients_cuisine_vectorized["10"])

#print(clf.predict([id_ingredients_cuisine_vectorized[430]]))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=5000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

### Random Forest Classifier

In [None]:
# ---------------------------- Importing Data ----------------------------

X = list(id_ingredients_cuisine_vectorized.values())
y = cuisines[0:20000]

# ---------------------------- Creating Training & Testing Sets ----------------------------

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# ---------------------------- Save Model ----------------------------

filename = './trained_models/randomForestClassifier.sav'
pickle.dump(clf, open(filename, 'wb'))

# ---------------------------- Load Model ----------------------------

#loaded_model = pickle.load(open(filename, 'rb'))
# result = loaded_model.score(X_test, Y_test)

#print(id_ingredients_cuisine_vectorized["10"])

#print(clf.predict([id_ingredients_cuisine_vectorized[430]]))

#loaded_model = pickle.load(open(filename, 'rb'))
print(clf.predict([id_ingredients_cuisine_vectorized[430]]))

### Validating Model

In [None]:
# Upsides: intuitive and easy to perform.
# Downsides: drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

print(clf.score(X_test, y_test))

#### Stratified K-Fold Cross Validation

In [110]:
cv = StratifiedKFold(n_splits=5)

scores = cross_val_score(clf, X_test, y_test, cv=cv)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.80 (+/- 0.01)


#### Leave One Out Cross Validation (LOOCV)

In [101]:
# LOO is more computationally expensive than k-fold cross validation.

cv = LeaveOneOut()

scores = cross_val_score(clf, X_test, y_test, cv=cv)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

KeyboardInterrupt: 

### Adding Cuisine to Recipe1M+ Database

In [166]:
# ---------------------------- Importing Dataset ----------------------------

f = open('./data/recipe1M+/layer11.json')
recipes_data = (json.load(f))#[0:100000]
f.close()

# ---------------------------- Converting Ingredients to Vectors ----------------------------
modified_recipes_data = {}

#print(new_id_ingredients_tokenized)
   
for key1, list_ingredients in new_id_ingredients_tokenized.items():
    
    modified_recipes_data[key1] = []
    
    for key2, ingredient in enumerate(list_ingredients):

        modified_recipes_data[key1].append(ingredient_vector[ingredient.replace(" ", "_")])
        
# ---------------------------- Converting Recipes to Vectors ----------------------------

id_ingredients_cuisine_vectorized = {}
cuisines_recipe1m = []

for key1, recipe in modified_recipes_data.items():
            
    id_ingredients_cuisine_vectorized[key1] = numpy.sum(numpy.array(modified_recipes_data[key1]), 0)
    
    cuisines_recipe1m.append((clf.predict([id_ingredients_cuisine_vectorized[key1]]))[0])
    
# ---------------------------- Adding Cuisines to Recipe1M+ Dataset ----------------------------
        
modified_modified_recipes_data = recipes_data
    
for key, recipe in enumerate(recipes_data):
    
    modified_modified_recipes_data[key]["cuisine"] = cuisines_recipe1m[key]
    
# ---------------------------- Generating New Recipe1M+ w/ Cuisines File ----------------------------
    
file = open('./data/layer11_modified_cuisines.txt','w') 
file.write(str(modified_modified_recipes_data))
    

69721

### Dimensionality Reduction

In [None]:
X_ingredients = list(id_ingredients_cuisine_vectorized.values())

#print(X_ingredients)

# ---------------------------- PCA ----------------------------
X_ingredients_embedded1 = PCA(n_components=2).fit_transform(X_ingredients)

# ---------------------------- T-SNE ----------------------------
# X_ingredients_embedded2 = TSNE(n_components=2).fit_transform(X_ingredients)

### Calculating Amount of Ingredients & Identifying Recipes' Cuisines

In [None]:
#recipeModules = [0] * len(list(id_ingredients_cuisine_vectorized.keys()))

cuisine_number = {}
cuisine_numberized = []
index = 0

cuisine_number["African"] = 0

for key, cuisine in enumerate(cuisines):
    
    if cuisine not in list(cuisine_number.keys()):
    
        index = index + 1
    
        cuisine_number[cuisine] = index

for key, cuisine in enumerate(cuisines):
    
    cuisine_numberized.append(cuisine_number[cuisine])

recipeModules = cuisine_numberized

print(recipeModules)

numberIngredients = [5] * len(list(id_ingredients_cuisine_vectorized.keys()))

###  PCA & T-SNE Visualization

In [None]:
# ---------------------------- Matplotlib ----------------------------
matplotlib_function(X_ingredients_embedded1, X_ingredients_embedded1, list(id_ingredients_cuisine_vectorized.keys()), recipeModules, numberIngredients, "Recipes")

## Benchmark Facebook Recipe Retrieval Algorithm

It was created a dictionary object (id_url.json) that matches recipes IDs (layer1.json) with the URLs of images available in layer2.json. While
some recipes do not contain images, others contain more than 1. This matching between different files was possible once layer2.json
also contain the recipe ID present in layer1.json. 

Then, by manipulating Facebook's algorithm and its repository, the recipe retrieval algorithm is able to convert the JSON file id_url.json into
an array of strings of URLs. Along with this, it creates a parallel array of strings of the IDs of the recipes, so that in each position there is
correspondence between ID in this object with an image URL in the previous.

Finally, Facebook's algorithm was run and the ingredients list for each image URL was obtained. The number of correct elements over the total
number of elements in the ground-truth recipe gives us the accuracy of the algorithm. The ingredients present in each ground-truth recipe 
were retrieved using the method above - "Recipe Retrieval w/ Higher Number Anti-Cancer Molecules".

### Writing Input File w/ Images to Facebook's Algorithm

A JSON file (id_url.json) was created to be input in the Facebook's recipe retrieval algorithm, so that it could generate a prediction of the ingredients 
present in every recipe from the dataset (with, at least, 1 image available). <br>
Ground-truth ingredients for each recipe can be found in layer1.json. The respective images are present in the layer2.json.
Both files are in the data directory.

In [None]:
ids = []

for recipe in recipes_data:
                        
    ids.append(recipe["id"])

f = open('./data/recipe1M+/layer2.json')
recipes_images_data = (json.load(f))# [0:100]
f.close()

id_images = {}

for recipe in recipes_data:

    id_images[recipe["id"]] = []
    
    for recipe_image in recipes_images_data:
        
        for image in recipe_image["images"]:
        
            if recipe["id"] == recipe_image["id"]:
            
                id_images[recipe["id"]].append(image["url"])
                    
# Writing text file with IDs of each recipe and respective URLs for 1 or more online images.
with open('./data/id_url.json', 'w') as json_file:
    json.dump(id_images, json_file)
    

### Executing Inverse Cooking Algorithm

Recipe Generation from Food Images. </br>
https://github.com/facebookresearch/inversecooking

In [None]:
'''
from demo import demo_func

f = open('./data/recipe1M+/id_url.json')
id_url = (json.load(f))# [0:100]
f.close()

urls_output = []
ids_output = []

for id, urls in id_url.items():

    for url in urls:

        urls_output.append(url)

        if url:

            ids_output.append(id)

print(id_url)
print(urls_output)
print(ids_output)

demo_func(urls_output, ids_output)

'''

### Comparing Ingredient Prediction w/ Ground Truth

IoU and F1 scores are used to compare the prediction of the ingredients made by the Facebook's algorithm with the ones present
in the dataset. <br>
First, a JSON file with the prediction for each recipe is read. Then, the 2 scores are calculated. Finally, a comparison between 
the benchmark performed by the algorithm's team and ours is made.

In [None]:
f = open('./data/id_predictedIngredients.json')
id_predictedIngredients = (json.load(f))# [0:100]
f.close()

# ---------------------------- Intersection over Union (IoU) Score / Jaccard Index ----------------------------

iou_list = []

recipe_ids = id_predictedIngredients.keys

for key, value in id_predictedIngredients.items():
    
    iou_list.append(iou_function(new_id_ingredients_tokenized[key], value))

iou = sum(iou_list)/len(iou_list)

# ---------------------------- F1 Score ----------------------------

f1_list = []

for key, value in id_predictedIngredients.items():
    
    y_true = [new_id_ingredients_tokenized[key]]
    y_pred = [value]
    
    binarizer = MultiLabelBinarizer()
    
    # In this case, I am considering only the given labels.
    binarizer.fit(y_true)
    
    f1_list.append(f1_score(binarizer.transform(y_true), binarizer.transform(y_pred), average='macro'))
    
f1 = sum(f1_list)/len(f1_list)

# Benchmark Tests Comparison

benchmark = {'Method': ["Ours", "Facebook Group"],
        'IoU': [iou, 0.3252],
        'F1': [f1, 0.4908]
        }

df = pandas.DataFrame(benchmark, columns = ['Method', 'IoU', 'F1'])
print(df)

# Data obtained by the Facebook Research group comparing how their algorithm, a retrieval system and a human perform when 
# predicting the ingredients present in the food. 

Image("img/iou&f1.png")

### Annotations

List Jupyter running sessions: 
```console
jupyter notebook list
```

Exit Jupyter notebooks:
```
jupyter notebook stop (8889)
```

Plot using Matplotlib:
https://medium.com/incedge/data-visualization-using-matplotlib-50ffc12f6af2

Add large files to github repo:
https://git-lfs.github.com/

Removing large file from commit:
https://help.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository
https://rtyley.github.io/bfg-repo-cleaner/
https://towardsdatascience.com/uploading-large-files-to-github-dbef518fa1a
$ bfg --delete-files YOUR-FILE-WITH-SENSITIVE-DATA
bfg is an alias for:
java -jar bfg.jar

Initialize github repo:
git init
git remote add origin https://gitlab.com/Harmelodic/MyNewProject.git



In [None]:
HTML('<iframe src=http://fperez.org/papers/ipython07_pe-gr_cise.pdf width=700 height=350></iframe>')

# embbeddidng projector