## Imports

In [44]:
import pandas as pd

## RAW_recipes.csv

This file contains the raw recipe data, one recipe per row, with the following columns:
 - **name**: name of the recipe;
 - **id**: ID of the recipe;
 - **minutes**: time needed to prepare the recipe, in minutes;
 - **contributor_id**: ID of the user who submitted the recipe;
 - **submitted**: date when the recipe was submitted;
 - **tags**: keywords describing the recipe (for example its course, preparation time, and main ingredient);
 - **nutrition**: list of dietary metrics in the order ['calories', 'total fat', 'sugar', 'sodium', 'protein', 'saturated fat', 'carbohydrates']. Calories are measured in kcal; all other metrics are given in PDV (Percentage Daily Value), i.e. the percentage of the suggested daily intake of that nutrient contained in the recipe;
 - **n_steps**: number of steps to complete the recipe;
 - **steps**: list of actual steps to perform;
 - **description**: free-text description written by the user who submitted the recipe;
 - **ingredients**: list of ingredients for this recipe;
 - **n_ingredients**: number of ingredients.
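
When the CSV is read with pandas, list-valued columns such as **nutrition** arrive as plain strings (the merge step later in this notebook has to parse them). The following is a minimal sketch of how one such string could be turned into labelled values; `parse_nutrition` and `NUTRITION_FIELDS` are illustrative names, not part of the dataset or of pandas.

```python
import ast

# Order of the values, as stated in the column description above.
NUTRITION_FIELDS = ['calories', 'total fat', 'sugar', 'sodium',
                    'protein', 'saturated fat', 'carbohydrates']

def parse_nutrition(raw):
    """Convert the text form of a nutrition list into a labelled dict."""
    values = ast.literal_eval(raw)  # parses the literal without running code
    if len(values) != len(NUTRITION_FIELDS):
        raise ValueError(
            f"expected {len(NUTRITION_FIELDS)} values, got {len(values)}"
        )
    return dict(zip(NUTRITION_FIELDS, values))

# Nutrition string of the first recipe in the preview below.
first_recipe = parse_nutrition("[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]")
print(first_recipe['calories'], first_recipe['sugar'])
```

The length check is a defensive assumption: it fails loudly if a row does not carry exactly seven values.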

In [45]:
rec = pd.read_csv('RAW_recipes.csv')
rec.head(10)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8
5,apple a day milk shake,5289,0,1533,1999-12-06,"['15-minutes-or-less', 'time-to-make', 'course...","[160.2, 10.0, 55.0, 3.0, 9.0, 20.0, 7.0]",4,"['combine ingredients in blender', 'cover and ...",,"['milk', 'vanilla ice cream', 'frozen apple ju...",4
6,aww marinated olives,25274,15,21730,2002-04-14,"['15-minutes-or-less', 'time-to-make', 'course...","[380.7, 53.0, 7.0, 24.0, 6.0, 24.0, 6.0]",4,['toast the fennel seeds and lightly crush the...,my italian mil was thoroughly impressed by my ...,"['fennel seeds', 'green olives', 'ripe olives'...",9
7,backyard style barbecued ribs,67888,120,10404,2003-07-30,"['weeknight', 'time-to-make', 'course', 'main-...","[1109.5, 83.0, 378.0, 275.0, 96.0, 86.0, 36.0]",10,['in a medium saucepan combine all the ingredi...,this recipe is posted by request and was origi...,"['pork spareribs', 'soy sauce', 'fresh garlic'...",22
8,bananas 4 ice cream pie,70971,180,102353,2003-09-10,"['weeknight', 'time-to-make', 'course', 'main-...","[4270.8, 254.0, 1306.0, 111.0, 127.0, 431.0, 2...",8,"['crumble cookies into a 9-inch pie plate , or...",,"['chocolate sandwich style cookies', 'chocolat...",6
9,beat this banana bread,75452,70,15892,2003-11-04,"['weeknight', 'time-to-make', 'course', 'main-...","[2669.3, 160.0, 976.0, 107.0, 62.0, 310.0, 138.0]",12,"['preheat oven to 350 degrees', 'butter two 9x...",from ann hodgman's,"['sugar', 'unsalted butter', 'bananas', 'eggs'...",9


In [46]:
rec.shape

(231637, 12)

## ingr_map.pkl

To simplify processing, the data authors mapped ingredients listed under different names to a single unified name. For instance, all variations of lettuce are grouped under the single ingredient "lettuce", as the first rows of the file show.

The key columns we focus on are the following:
 - **raw_ingr**: the raw ingredient name;
 - **replaced**: mapped ingredient name;
 - **id**: ID for each term in the "replaced" column.
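
As an illustration of how these three columns relate, the sketch below builds two lookup tables from a small toy frame. The two lettuce rows mirror rows of the preview that follows; the third row is made up for the example, and `toy_map`, `raw_to_name` and `id_to_name` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for ingr_map.pkl, restricted to the three columns of interest.
toy_map = pd.DataFrame({
    'raw_ingr': ['romaine lettuce leaf', 'head of lettuce', 'made-up raw name'],
    'replaced': ['lettuce', 'lettuce', 'made-up ingredient'],
    'id': [4308, 4308, -1],  # -1 is a placeholder ID, not a real one
})

# Raw spelling -> unified ingredient name.
raw_to_name = dict(zip(toy_map['raw_ingr'], toy_map['replaced']))

# Ingredient ID -> unified name (one entry per distinct ID).
id_to_name = (
    toy_map.drop_duplicates('id')
           .set_index('id')['replaced']
           .to_dict()
)

print(raw_to_name['head of lettuce'], id_to_name[4308])
```

Several raw names share one ID, so `drop_duplicates('id')` is applied before building the ID lookup.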

In [47]:
ingredients_dataset = pd.read_pickle('ingr_map.pkl')       
ingredients_dataset.head(10)

Unnamed: 0,raw_ingr,raw_words,processed,len_proc,replaced,count,id
0,"medium heads bibb or red leaf lettuce, washed,...",13,"medium heads bibb or red leaf lettuce, washed,...",73,lettuce,4507,4308
1,mixed baby lettuces and spring greens,6,mixed baby lettuces and spring green,36,lettuce,4507,4308
2,romaine lettuce leaf,3,romaine lettuce leaf,20,lettuce,4507,4308
3,iceberg lettuce leaf,3,iceberg lettuce leaf,20,lettuce,4507,4308
4,red romaine lettuce,3,red romaine lettuce,19,lettuce,4507,4308
5,head romaine lettuce,3,head romaine lettuce,20,lettuce,4507,4308
6,curly endive lettuce,3,curly endive lettuce,20,lettuce,4507,4308
7,romaine lettuce hearts,3,romaine lettuce heart,21,lettuce,4507,4308
8,baby leaf lettuce,3,baby leaf lettuce,17,lettuce,4507,4308
9,head of lettuce,3,head of lettuce,15,lettuce,4507,4308


In [48]:
ingredients_dataset.shape

(11659, 7)

In [49]:
# 8023 ingredients in total
len(ingredients_dataset['replaced'].unique())

8023

## PP_recipes.csv

The RAW_recipes.csv dataset has been preprocessed and tokenized with a GPT subword tokenizer. All elements of the dataset have been translated into IDs, and the main resulting columns are:
 - **id**: ID of the recipe;
 - **name_tokens**: tokenized recipe name;
 - **ingredient_tokens**: list of token-ID lists, one per ingredient;
 - **steps_tokens**: list of token IDs for the recipe steps;
 - **techniques**: sparse binary vector indicating which techniques are used;
 - **calorie_level**: either 0, 1, or 2 (the higher the number, the higher the calories of that recipe);
 - **ingredient_ids**: IDs assigned to each ingredient, as in the file ingr_map.pkl.
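
To make the sparse **techniques** vector more concrete, here is a small sketch that lists the positions set to 1. The mapping from positions to technique names is not given in this notebook, so only positions are returned; `active_positions` is a hypothetical helper and the vector is a shortened toy example.

```python
import ast

def active_positions(raw_vector):
    """Return the 0-based positions whose flag is 1 in a techniques vector."""
    # The CSV stores the vector as text, so parse it first if needed.
    flags = ast.literal_eval(raw_vector) if isinstance(raw_vector, str) else raw_vector
    return [pos for pos, flag in enumerate(flags) if flag == 1]

toy_vector = "[0, 1, 0, 0, 1, 0]"  # shortened toy vector, not a real row
used = active_positions(toy_vector)
print(used)
```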

In [50]:
PP = pd.read_csv('PP_recipes.csv')
PP.head(10)

Unnamed: 0,id,i,name_tokens,ingredient_tokens,steps_tokens,techniques,calorie_level,ingredient_ids
0,424415,23,"[40480, 37229, 2911, 1019, 249, 6878, 6878, 28...","[[2911, 1019, 249, 6878], [1353], [6953], [153...","[40480, 40482, 21662, 481, 6878, 500, 246, 161...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",0,"[389, 7655, 6270, 1527, 3406]"
1,146223,96900,"[40480, 18376, 7056, 246, 1531, 2032, 40481]","[[17918], [25916], [2507, 6444], [8467, 1179],...","[40480, 40482, 729, 2525, 10906, 485, 43, 8393...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",0,"[2683, 4969, 800, 5298, 840, 2499, 6632, 7022,..."
2,312329,120056,"[40480, 21044, 16954, 8294, 556, 10837, 40481]","[[5867, 24176], [1353], [6953], [1301, 11332],...","[40480, 40482, 8240, 481, 24176, 296, 1353, 66...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",1,"[1257, 7655, 6270, 590, 5024, 1119, 4883, 6696..."
3,74301,168258,"[40480, 10025, 31156, 40481]","[[1270, 1645, 28447], [21601], [27952, 29471, ...","[40480, 40482, 5539, 21601, 1073, 903, 2324, 4...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,"[7940, 3609, 7060, 6265, 1170, 6654, 5003, 3561]"
4,76272,109030,"[40480, 17841, 252, 782, 2373, 1641, 2373, 252...","[[1430, 11434], [1430, 17027], [1615, 23, 695,...","[40480, 40482, 14046, 1430, 11434, 488, 17027,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",0,"[3484, 6324, 7594, 243]"
5,465171,111231,"[40480, 3390, 829, 35873, 7047, 13731, 2640, 1...","[[13731, 30684, 260, 245, 17843, 25592, 10601]...","[40480, 40482, 7087, 13731, 30684, 260, 245, 5...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,"[6861, 7655, 6846, 6906, 1789, 131, 6863, 1833..."
6,163861,85356,"[40480, 1966, 488, 5218, 252, 5867, 10994, 118...","[[31801, 12395, 25808], [17918], [6953], [1133...","[40480, 40482, 604, 704, 15110, 244, 15684, 24...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,"[5574, 2683, 6270, 5319, 2499, 869, 1278, 4987..."
7,186383,105140,"[40480, 5317, 7, 491, 11274, 5639, 40481]","[[17918], [25916], [15473, 8361], [15473, 1016...","[40480, 40482, 729, 2525, 10906, 485, 43, 8393...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,"[2683, 4969, 332, 335, 6270, 800, 4987, 7470, ..."
8,116395,8671,"[40480, 16190, 13249, 4914, 5639, 40481]","[[17918], [36374, 3388, 650, 256, 6444], [2361...","[40480, 40482, 19093, 271, 40478, 40482, 23667...","[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",1,"[2683, 1689, 5687, 1098, 840, 7782, 7011, 1910..."
9,303460,160334,"[40480, 1287, 7912, 504, 22118, 19276, 831, 47...","[[559, 1164, 6020], [511, 532, 543, 241], [664...","[40480, 40482, 14259, 1055, 11, 4364, 488, 827...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,"[6413, 7997, 3148, 3710, 1799, 2007, 3203, 265..."


In [51]:
PP.shape

(178265, 8)

Note that PP_recipes.csv contains fewer recipes than the RAW version (178,265 versus 231,637). We will use this to filter out unwanted recipes.
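
One way to see which raw recipes such a filter would keep is a membership test on the recipe IDs. The sketch below uses tiny made-up frames (`raw_toy`, `pp_toy`) rather than the real files, so the IDs and counts are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for RAW_recipes.csv and PP_recipes.csv.
raw_toy = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['a', 'b', 'c', 'd']})
pp_toy = pd.DataFrame({'id': [3, 1]})

# True for raw recipes whose ID also appears in the preprocessed file.
in_pp = raw_toy['id'].isin(pp_toy['id'])

kept = raw_toy[in_pp]
n_dropped = int((~in_pp).sum())
print(len(kept), n_dropped)
```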

## Recipe_final.csv

We transform and merge RAW_recipes.csv and PP_recipes.csv into a single file, which will be used by our bot.

In [52]:
PP = pd.read_csv('PP_recipes.csv')
PP.set_index('id',inplace=True)
rec = pd.read_csv('RAW_recipes.csv')
rec.set_index('id',inplace=True)
rec_final = rec.copy()

################
#  DATA CONV
################

pp_indices = PP.index
rec_indices = rec.index

# identify indices in rec that are not in PP
remaining_indices = rec_indices.difference(pp_indices)

# drop them from rec as they will never be chosen
rec_final.drop(remaining_indices, axis = 0, inplace=True)

# re-index rec_final so its rows follow the same order as PP
rec_final = rec_final.reindex(pp_indices)

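# unpack the nutrition metrics into separate columns and derive the difficulty from the tags
import ast  # standard library; literal_eval parses the stored list without executing code
new_col = ['calories','total fat', 'sugar',  'sodium','protein','saturated fat','carbohydrates']
rec_final[new_col] = rec_final['nutrition'].apply(ast.literal_eval).apply(pd.Series)
# 'tags' is still a string at this point, so this is a substring test: 0 if "easy" appears anywhere in it, else 1
rec_final['difficulty'] = rec_final['tags'].apply(lambda x: 0 if "easy" in x else 1)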

# drop unwanted columns
rec_final.drop(['contributor_id', 'submitted', 'nutrition', 'n_steps', 'n_ingredients'], inplace = True, axis = 1)

# reorder the columns so that the last eight are the dietary and time metrics
rec_final = rec_final[['name', 'tags', 'steps', 'description', 'ingredients', 'difficulty', 'calories', 'total fat', 'sugar', 'sodium', 'protein', 'saturated fat', 'carbohydrates', 'minutes']]

# export to csv
rec_final.to_csv('Recipe_final.csv')
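
The filtering and re-ordering logic of the cell above can be followed on a toy example. This is only a sketch with made-up IDs; `rec_toy` and `pp_index` stand in for `rec` and `PP.index`.

```python
import pandas as pd

# Toy raw recipes indexed by ID, and a toy PP index in a different order.
rec_toy = pd.DataFrame(
    {'name': ['a', 'b', 'c']},
    index=pd.Index([10, 20, 30], name='id'),
)
pp_index = pd.Index([30, 10], name='id')

# IDs present in the raw data but absent from PP.
missing = rec_toy.index.difference(pp_index)

# Drop them, then reorder the remaining rows to follow PP.
aligned = rec_toy.drop(missing).reindex(pp_index)
print(aligned)
```

On this toy data, `rec_toy.reindex(pp_index)` alone gives the same frame, because every ID in `pp_index` exists in `rec_toy`; the explicit drop keeps the intent visible.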

In [53]:
rec_final = pd.read_csv('Recipe_final.csv')
rec_final.head(10)

Unnamed: 0,id,name,tags,steps,description,ingredients,difficulty,calories,total fat,sugar,sodium,protein,saturated fat,carbohydrates,minutes
0,424415,aromatic basmati rice rice cooker,"['weeknight', 'time-to-make', 'course', 'main-...","['rinse the rice in a fine strainer , then dra...",from the ultimate rice cooker cookbook. the a...,"['basmati rice', 'water', 'salt', 'cinnamon st...",0,228.2,2.0,2.0,8.0,9.0,1.0,15.0,61
1,146223,pumpkin pie a la easy,"['60-minutes-or-less', 'time-to-make', 'course...","['preheat oven to 350', 'combine flour , oats ...",this is a pampered chef recipe for their stone...,"['flour', 'oats', 'brown sugar', 'pecans', 'bu...",0,249.4,16.0,92.0,8.0,11.0,27.0,11.0,55
2,312329,cheesy tomato soup with potatoes,"['30-minutes-or-less', 'time-to-make', 'course...","['pour the broth & water into a large pot', 'a...",after modifying another recipe i came up with ...,"['chicken broth', 'water', 'salt', 'black pepp...",0,351.3,34.0,15.0,50.0,25.0,70.0,8.0,25
3,74301,mini tacos,"['15-minutes-or-less', 'time-to-make', 'course...","['cook hamburger until browned', 'drain the fa...",these can be a easy appetizer or a light dinne...,"['wonton wrappers', 'hamburger', 'taco seasoni...",1,79.7,5.0,2.0,11.0,11.0,7.0,2.0,15
4,76272,rosemary s hanky panky s,"['30-minutes-or-less', 'time-to-make', 'course...","['fry ground beef and sausage until browned', ...",my girlfriend rosemary gave me this wonderfull...,"['ground beef', 'ground sausage', 'velveeta ch...",0,240.7,29.0,9.0,28.0,27.0,42.0,0.0,20
5,465171,pink bavarian crown strawberry dream supreme,"['course', 'gelatin', 'desserts']",['mix strawberry jell-o with boiling water the...,"made with dream whip, strawberry jell-o and an...","['strawberry jell-o gelatin dessert', 'water',...",1,548.9,20.0,282.0,25.0,16.0,52.0,34.0,250
6,163861,tom and kelly s chicken fried steak,"['30-minutes-or-less', 'time-to-make', 'course...","['have your butcher ""cube"" 2 lean boneless por...","you'll need your butcher's help with this, but...","['boneless pork chops', 'flour', 'salt', 'pepp...",1,781.6,82.0,6.0,12.0,94.0,95.0,8.0,18
7,186383,chocolate oat cookie bars,"['60-minutes-or-less', 'time-to-make', 'course...","['preheat oven to 350 degrees', 'whisk togethe...",these are made with oil instead of butter/marg...,"['flour', 'oats', 'baking powder', 'baking sod...",1,198.7,13.0,62.0,2.0,5.0,11.0,9.0,40
8,116395,tropical lemon cream bars,"['60-minutes-or-less', 'time-to-make', 'course...","['crust:', 'combine flour , confectioners suga...",a unique blend of ingredients makes this a won...,"['flour', ""confectioners' sugar"", 'powdered mi...",1,413.9,32.0,138.0,8.0,16.0,53.0,16.0,50
9,303460,jeera on cubes barbecue marinade,"['30-minutes-or-less', 'time-to-make', 'course...",['roast coriander and cumin and ground afterwa...,if you love jeera (cumin) you will love this o...,"['sesame oil', 'yoghurt', 'fruit vinegar', 'ho...",0,142.3,16.0,19.0,27.0,4.0,9.0,3.0,30
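
Because the file is stored as CSV, list-valued columns such as **tags**, **steps** and **ingredients** come back as strings when it is read again. If real lists are needed, one option is the `converters` argument of `pd.read_csv`. The sketch below demonstrates this on an in-memory toy file; the single row is made up.

```python
import ast
import io

import pandas as pd

# Write a one-row toy frame to an in-memory CSV; the tag list is stored as text.
toy = pd.DataFrame({
    'id': [1],
    'name': ['toy recipe'],
    'tags': [str(['easy', 'desserts'])],
})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the tags cell is a plain string.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read with a converter: the tags cell becomes a Python list.
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={'tags': ast.literal_eval})

print(type(as_text.loc[0, 'tags']).__name__, as_lists.loc[0, 'tags'])
```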
