# Prepare the NLP exploration of the "7tomorrow" database (executed on Google Colab)

Short preliminary notebook to translate the 7tomorrow database into French (kept separate because the translation takes quite long). We do not keep all the information in French, only the recipe names and, for each recipe, the list of ingredients. Each ingredient is reduced to the part of the column 'ingredient_clean_comma' that comes before the first comma.


In [None]:
import pandas as pd
import numpy as np
import re
from textblob import TextBlob
from time import time

Show the GPU info and change the runtime if needed in the "Execution" panel.   
Using a GPU is not required here. I moved to Google Colab because my computer is slow.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Jan 30 16:02:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

## Load complete recipes from "projet7tomorrow"  
The recipe database of "projet7tomorrow" contains almost 100000 recipes, but some of them contain ingredients that cannot be matched with the Agribalyse ingredient database (more recipes could perhaps be matched after a deeper investigation, which has not been done yet). From here on, we only consider complete recipes, i.e. recipes fully matched with Agribalyse, which are listed in the file "recettes_completes_7tomorrow.xlsx".

Import from Google Drive.  
Here the folder 'carbondiet4GD' only contains the folder data/ with subfolders Recipes/ and Tools/ (others are not used).

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import zipfile
with zipfile.ZipFile('/content/drive/MyDrive/carbondiet4GD.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')     # create a copy directly in colab for efficiency reasons

Mounted at /content/drive


**Indicate the path where the folder data/ is located.**

In [None]:
#data_path = '../'
data_path = 'carbondiet4GD/'

Extract the recipes

In [None]:
seven_path = data_path + 'data/Recipes/recettes_completes_7tomorrow.xlsx'
seven_data = pd.read_excel(seven_path, header = [0])
seven_data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title_raw,id,url,unit,quantity_raw,ingredient_raw,weight_per_ingr,ingredient_clean_comma,matched_ingredient,food_clean_comma,est_impact,title_clean,source_website
0,0,36,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,12,"cocoa, dry powder, unsweetened",1032.0,"cocoa , dry powder , unsweetened","cocoa powder , sugar , powder , instant","cocoa powder , sugar , powder , instant",933.0,natural peanut butter chocolate bon bons,http://www.food.com
1,1,37,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,12,honey,4068.0,honey,honey,honey,38.0,natural peanut butter chocolate bon bons,http://www.food.com
2,2,38,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,1,"peanut butter, smooth style, without salt",256.0,"peanut butter , style , salt",peanut butter peanut paste,peanut butter peanut paste,316.7,natural peanut butter chocolate bon bons,http://www.food.com
3,3,170,cilantro-mustard mayo (dip or sauce),00185a60a8,http://www.food.com/recipe/cilantro-mustard-ma...,cup,12,"mustard, prepared, yellow",2880.0,"mustard , , yellow",mustard,mustard,120.0,cilantro mustard mayo,http://www.food.com
4,4,171,cilantro-mustard mayo (dip or sauce),00185a60a8,http://www.food.com/recipe/cilantro-mustard-ma...,cup,12,"salad dressing, mayonnaise, regular",2649.6,"salad dressing , mayonnaise ,","salad dressing ,","salad dressing ,",120.0,cilantro mustard mayo,http://www.food.com


In [None]:
seven_data['title_raw'].unique().shape

(7418,)

-> There are 7418 recipes in this database.

In [None]:
seven_data['matched_ingredient'].equals(seven_data['food_clean_comma']) 

True

-> the columns 'matched_ingredient' and 'food_clean_comma' are equal.



In [None]:
seven_data[seven_data['title_raw']!=seven_data['title_clean']][['title_raw', 'title_clean']]

Unnamed: 0,title_raw,title_clean
3,cilantro-mustard mayo (dip or sauce),cilantro mustard mayo
4,cilantro-mustard mayo (dip or sauce),cilantro mustard mayo
5,cilantro-mustard mayo (dip or sauce),cilantro mustard mayo
6,cilantro-mustard mayo (dip or sauce),cilantro mustard mayo
7,cilantro-mustard mayo (dip or sauce),cilantro mustard mayo
...,...,...
35283,maple-cream-cinnamon smoothie,maple cream cinnamon smoothie
35284,maple-cream-cinnamon smoothie,maple cream cinnamon smoothie
35285,maple-cream-cinnamon smoothie,maple cream cinnamon smoothie
35286,maple-cream-cinnamon smoothie,maple cream cinnamon smoothie


After a quick look, it seems better to use the column 'title_raw' than 'title_clean'. We don't know how the 7tomorrow team derived one column from the other, so let's stick to the raw data.

Let's add a new column built from 'ingredient_clean_comma' by keeping only the text before the first comma (or, when the string starts with a comma, the text between the first and second commas).

In [None]:
def keep_first_ing(s):
    if s.startswith(','):         # if the first character is a comma ...
        res = s.split(',')[1]     # ... keep the ingredient after the first comma only
    else:          
        res = s.split(',')[0]    # keep only the first ingredient
    res = res.strip()     # remove leading and trailing whitespaces
    return res

# Check
#print(keep_first_ing(', vegetable , household , '))
#print(keep_first_ing('cocoa , dry powder , unsweetened'))

seven_data['ingredient_basics'] = seven_data['ingredient_clean_comma'].apply(keep_first_ing)
seven_data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title_raw,id,url,unit,quantity_raw,ingredient_raw,weight_per_ingr,ingredient_clean_comma,matched_ingredient,food_clean_comma,est_impact,title_clean,source_website,ingredient_basics
0,0,36,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,12,"cocoa, dry powder, unsweetened",1032.0,"cocoa , dry powder , unsweetened","cocoa powder , sugar , powder , instant","cocoa powder , sugar , powder , instant",933.0,natural peanut butter chocolate bon bons,http://www.food.com,cocoa
1,1,37,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,12,honey,4068.0,honey,honey,honey,38.0,natural peanut butter chocolate bon bons,http://www.food.com,honey
2,2,38,natural peanut butter chocolate bon bons,0006ca31f4,http://www.food.com/recipe/natural-peanut-butt...,cup,1,"peanut butter, smooth style, without salt",256.0,"peanut butter , style , salt",peanut butter peanut paste,peanut butter peanut paste,316.7,natural peanut butter chocolate bon bons,http://www.food.com,peanut butter
3,3,170,cilantro-mustard mayo (dip or sauce),00185a60a8,http://www.food.com/recipe/cilantro-mustard-ma...,cup,12,"mustard, prepared, yellow",2880.0,"mustard , , yellow",mustard,mustard,120.0,cilantro mustard mayo,http://www.food.com,mustard
4,4,171,cilantro-mustard mayo (dip or sauce),00185a60a8,http://www.food.com/recipe/cilantro-mustard-ma...,cup,12,"salad dressing, mayonnaise, regular",2649.6,"salad dressing , mayonnaise ,","salad dressing ,","salad dressing ,",120.0,cilantro mustard mayo,http://www.food.com,salad dressing
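
The rule above only looks at the first chunk, or at the second one when the string starts with a comma. As a possible generalization (a minimal sketch, not what this notebook uses), the hypothetical helper `first_nonempty_chunk` below returns the first non-blank comma-separated chunk, which also covers strings that start with several empty chunks. It assumes its input is a string.

```python
def first_nonempty_chunk(s):
    """Return the first non-blank comma-separated chunk of s, stripped ('' if none)."""
    for chunk in s.split(','):
        chunk = chunk.strip()      # remove leading and trailing whitespace
        if chunk:                  # skip empty chunks such as the one before a leading comma
            return chunk
    return ''

print(first_nonempty_chunk(', vegetable , household , '))
print(first_nonempty_chunk('cocoa , dry powder , unsweetened'))
```

On the two check strings used earlier, this gives the same result as `keep_first_ing`.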


Let's translate the column 'ingredient_basics' (which will be used later for NLP) into French.

In [None]:
def translate(s):
    try:
        res = str(TextBlob(s).translate(from_lang = 'en', to = 'fr'))   # Translation is done using Google Translate API
    except Exception:    # e.g. TextBlob raises NotTranslated when the API returns the input string unchanged
        res = s          # fall back to the untranslated string
    return res

# Commented to avoid unwanted (long) runs
'''
# Avoid pd.Series.apply because it is slow 
ingredient_basics_fr = [translate(s) for s in list(seven_data['ingredient_basics'].values)]
seven_data['ingredient_basics_fr'] = pd.Series(ingredient_basics_fr)
'''
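
Many rows share the same ingredient, so one way to shorten the run might be to translate each distinct value once and map the results back onto the column. The sketch below illustrates the idea; `translate_unique` and `fake_translate` are hypothetical names, and the stand-in translator only upper-cases its input so that the example runs without calling any translation API.

```python
import pandas as pd

def translate_unique(series, translate_fn):
    """Translate each distinct value of `series` once, then map the results back."""
    mapping = {value: translate_fn(value) for value in series.dropna().unique()}
    return series.map(mapping)

# Stand-in translator (hypothetical): records its calls and upper-cases the input
calls = []
def fake_translate(s):
    calls.append(s)
    return s.upper()

toy = pd.Series(['honey', 'milk', 'honey', 'honey'])
toy_fr = translate_unique(toy, fake_translate)
print(toy_fr.tolist(), len(calls))   # 4 rows, but only 2 calls to the translator
```

With the real data, the call would presumably look like `translate_unique(seven_data['ingredient_basics'], translate)`; the actual time saved depends on how many distinct ingredients the column contains.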

Aside: is pd.Series.apply really slower?

In [None]:
t0 = time()
test = [translate(s) for s in list(seven_data['ingredient_basics'].values)[:2000]]
df_test = pd.Series(test)
t1 = time()
print("Time with list comprehension = ", t1 - t0)

t0 = time()
df_test = seven_data['ingredient_basics'].iloc[:2000].apply(translate)   # .iloc[:2000] selects exactly 2000 rows (.loc[:2000] is inclusive and selects 2001)
t1 = time()
print("Time with apply = ", t1 - t0)

Time with list comprehension =  68.09938907623291
Time with apply =  65.37564992904663


-> Both approaches take about the same time, so pd.Series.apply is not noticeably slower here.

Translating from English to French with TextBlob takes a long time (18 minutes on Google Colab) and the result is far from good, as can be seen below (e.g. 'honey' is translated as 'chéri'):

In [None]:
seven_data[['ingredient_basics', 'ingredient_basics_fr']]

Unnamed: 0,ingredient_basics,ingredient_basics_fr
0,cocoa,cacao
1,honey,chéri
2,peanut butter,beurre d'arachide
3,mustard,moutarde
4,salad dressing,vinaigrette
...,...,...
35283,yogurt,yaourt
35284,cream,crème
35285,milk,lait
35286,syrup,sirop


Let's apply a few corrections (more are needed).

In [None]:
seven_data['ingredient_basics_fr'] = seven_data['ingredient_basics_fr'].replace(
    to_replace = ['chéri', 'pimenter', 'pétrole', 'le beurre', 'planter'],
    value = ['miel', 'épices', 'huile', 'beurre', 'graines'])
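
Note that, without `regex=True`, `Series.replace` only substitutes cells whose entire value equals one of the entries in `to_replace`; a longer string that merely contains such an entry is left untouched. The toy series below (made-up values, not taken from the database) illustrates this, together with a regex-based variant that strips a leading article.

```python
import pandas as pd

toy = pd.Series(['chéri', 'le beurre', 'le beurre salé'])

# Whole-value replacement: the third entry is not changed
fixed = toy.replace(to_replace=['chéri', 'le beurre'], value=['miel', 'beurre'])
print(fixed.tolist())

# Regex replacement: removes a leading "le " wherever it occurs
no_article = toy.replace(to_replace=r'^le\s+', value='', regex=True)
print(no_article.tolist())
```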

Such a basic translation does not work well. What other techniques could we use?
- Use the column 'matched_ingredient', since each entry is supposed to be the English translation of an Agribalyse ingredient, and then look up the French name in the Agribalyse database. This should work if Agribalyse has performed the fr<->en translation correctly. **Problem:** when I randomly pick some ingredients in the column 'matched_ingredient', I cannot easily find the same ingredients (with the same names) in the file 'AGRIBALYSE3.1_produits alimentaires_2.xlsm'. For instance, 'peanut butter peanut paste' corresponds to 'Peanut butter or peanut paste', which is similar but not identical, so this would require an additional fuzzy-matching step (e.g. with fuzzywuzzy).
- Use other translation tools from Hugging Face.
- Give some context! E.g. aggregate the recipe name and the list of ingredients to make it clear that we are talking about food.

-> I have no time to do it now. So let's continue with this TextBlob translation and see how good or bad the results are for the final NLP similarity task.
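
For reference, the fuzzy-matching step mentioned in the first option could look roughly like the sketch below. It uses `difflib` from the standard library instead of fuzzywuzzy, and `closest_name` is a hypothetical helper. The candidate list is a made-up stand-in for the Agribalyse English names; only 'Peanut butter or peanut paste' comes from the example above.

```python
import difflib

def closest_name(query, candidates, cutoff=0.6):
    """Return the candidate most similar to `query` (case-insensitive), or None."""
    lowered = {c.lower(): c for c in candidates}      # compare in lower case
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# Hypothetical stand-in for the list of Agribalyse English names
agribalyse_names = ['Peanut butter or peanut paste', 'Honey', 'Mustard']

print(closest_name('peanut butter peanut paste', agribalyse_names))
```

The `cutoff` value of 0.6 is simply the `difflib` default; it would need tuning on the real names to avoid wrong matches.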

Let's now group by recipe and concatenate the list of ingredients

In [None]:
agg_dict = {'ingredient_basics': lambda x : ', '.join(x), \
        'ingredient_basics_fr': lambda x : ', '.join(x)}

seven_data2 = seven_data[['title_raw', 'ingredient_basics', 'ingredient_basics_fr']]\
                    .groupby('title_raw').agg(agg_dict).reset_index()

In [None]:
seven_data2.head()

Unnamed: 0,title_raw,ingredient_basics,ingredient_basics_fr
0,""" child's play"" sour pops","lime juice, lemon juice","jus de citron vert, jus de citron"
1,""" fried egg sundaes""","ice cream, cream, spice","glace, crème, épices"
2,""" world's best ""( and easiest ) teriyaki chick...","chicken, soy sauce soy, sugar","poulet, sauce de soja, du sucre"
3,"""apple crisp"" peanut butter snack bites","peanut butter, honey, oat, nut, apple, spice","beurre d'arachide, miel, avoine, noix, Pomme, ..."
4,"""berry good"" smoothie","apple juice, strawberry, raspberry, blackberry...","jus de pomme, fraise, framboise, la mûre, myrt..."


Let's translate the column 'title_raw' into French (it takes 6 minutes).

In [None]:
# Commented to avoid unwanted (long) runs
'''
title_raw_fr = [translate(s) for s in list(seven_data2['title_raw'].values)]
seven_data2['title_raw_fr'] = pd.Series(title_raw_fr)
'''

In [None]:
seven_data2[['title_raw', 'title_raw_fr']][:100]

Unnamed: 0,title_raw,title_raw_fr
0,""" child's play"" sour pops","""Play de l'enfant"" Sour Pops"
1,""" fried egg sundaes""","""Sundaes aux œufs au plat"""
2,""" world's best ""( and easiest ) teriyaki chick...","""Best du monde"" (et les plus faciles) Ailes de..."
3,"""apple crisp"" peanut butter snack bites","Bites de collations au beurre d'arachide ""pomm..."
4,"""berry good"" smoothie","Smoothie ""Berry Good"""
...,...,...
95,all-natural no-bake cookies,biscuits entièrement naturels sans cuisson
96,all-natural wood polish,Polon de bois entièrement naturel
97,all-purpose curry powder,Poudre de currie tout usage
98,alla's cranberry scones (raw foods),Scones de canneberge d'Alla (aliments crus)


Let's save the result to an Excel file to avoid rerunning this long first part of the notebook next time.

In [None]:
seven_data2.to_excel("recettes_completes_7tomorrow_preprocessed.xlsx")