## Intro and what's going on

We need to:
* Consolidate/simplify ingredients (`ingr_map.pkl`)
* Filter out recipes that don't contain at least one of our predetermined ingredients
* Simplify user recipe ratings


### Steps

Download the Food.com Recipes and Interactions dataset from [Kaggle](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions) and save the zip file as `raw_interactions.zip` in this directory.

In [9]:
# Extract the zip file
!unzip raw_interactions.zip -d kaggle_food/
!mkdir -p dataset

Archive:  raw_interactions.zip
  inflating: dataset/PP_recipes.csv  
  inflating: dataset/PP_users.csv    
  inflating: dataset/RAW_interactions.csv  
  inflating: dataset/RAW_recipes.csv  
  inflating: dataset/ingr_map.pkl    
  inflating: dataset/interactions_test.csv  
  inflating: dataset/interactions_train.csv  
  inflating: dataset/interactions_validation.csv  


In [31]:
import pandas as pd
import ast
import matplotlib.pyplot as plt

In [2]:
ingredient_set = {
    "beef",
    "salmon",
    "chicken",
    "broccoli",
    "cabbage",
    "carrot",
    "celery",
    "corn",
    "cucumber",
    "eggplant",
    "green bean",
    "bell pepper",
    "olive",
    "onion",
    "potato",
    "spinach",
    "tomato",
    "lettuce",
    "apple",
    "avocado",
    "banana",
    "lemon",
    "bread",
    "cheese",
    "mushroom",
    "egg",
    "pasta",
    "rice",
}

## Finding ingredient mapping


In [3]:
ingr_map = pd.read_pickle("kaggle_food/ingr_map.pkl")
ingr_map.head(10)

Unnamed: 0,raw_ingr,raw_words,processed,len_proc,replaced,count,id
0,"medium heads bibb or red leaf lettuce, washed,...",13,"medium heads bibb or red leaf lettuce, washed,...",73,lettuce,4507,4308
1,mixed baby lettuces and spring greens,6,mixed baby lettuces and spring green,36,lettuce,4507,4308
2,romaine lettuce leaf,3,romaine lettuce leaf,20,lettuce,4507,4308
3,iceberg lettuce leaf,3,iceberg lettuce leaf,20,lettuce,4507,4308
4,red romaine lettuce,3,red romaine lettuce,19,lettuce,4507,4308
5,head romaine lettuce,3,head romaine lettuce,20,lettuce,4507,4308
6,curly endive lettuce,3,curly endive lettuce,20,lettuce,4507,4308
7,romaine lettuce hearts,3,romaine lettuce heart,21,lettuce,4507,4308
8,baby leaf lettuce,3,baby leaf lettuce,17,lettuce,4507,4308
9,head of lettuce,3,head of lettuce,15,lettuce,4507,4308


We need to consolidate some of the `ingr_map` ingredients to fit within our `ingredient_set`. Here's what we need:
* Beef should match only plain "beef"
* Bell peppers should match all colors
* The many types of pasta should all just be "pasta"
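
To make the "beef matches only beef" rule concrete, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from `ingr_map`). It contrasts an exact membership test with a substring test; the final selection in this notebook relies on exact matches against `ingredient_set`, which is why variants such as bell pepper colors have to be renamed first.

```python
import pandas as pd

# Hypothetical values standing in for the "replaced" column.
toy = pd.Series(["beef", "ground beef", "beef broth", "red bell pepper"])

# Exact membership: only rows whose value equals a set member are kept.
exact = toy.isin({"beef", "bell pepper"})

# Substring search: any row that merely contains the text is kept.
substring = toy.str.contains("beef|bell pepper")

print(exact.tolist())
print(substring.tolist())
```

With exact matching, "ground beef" and "beef broth" are left out without any extra work, while "red bell pepper" is also left out until it is renamed to "bell pepper".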


**Bell Peppers**

In [4]:
ingr_map.loc[ingr_map["replaced"].str.match(".*bell pepper")].head(10)

Unnamed: 0,raw_ingr,raw_words,processed,len_proc,replaced,count,id
933,frozen bell peppers onions and celery,6,frozen bell peppers onions and celery,37,frozen bell peppers onions and celery,2,2958
2172,campbell southwest-style pepper jack soup,5,campbell pepper jack soup,25,campbell pepper jack soup,9,933
2367,raw red bell pepper,4,raw red bell pepper,19,raw red bell pepper,2,5888
2418,green bell pepper flakes,4,green bell pepper flake,23,green bell pepper flake,2,3400
2894,orange sweet bell pepper,4,orange sweet bell pepper,24,orange sweet bell pepper,9,5084
3269,red sweet bell peppers,4,red sweet bell pepper,21,red sweet bell pepper,83,6002
3270,red sweet bell pepper,4,red sweet bell pepper,21,red sweet bell pepper,83,6002
3320,roasted bell pepper hummus,4,roasted bell pepper hummu,25,roasted bell pepper hummu,2,6109
3434,yellow sweet bell pepper,4,yellow sweet bell pepper,24,yellow sweet bell pepper,7,7988
4118,sweet bell peppers,3,sweet bell pepper,17,sweet bell pepper,37,6959


In [5]:
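# Collapse bell pepper variants, skipping compound products and "campbell" soup.
is_bell_pepper = ingr_map["replaced"].str.match(".*bell pepper")
is_excluded = ingr_map["replaced"].str.contains("celery|soup|flake|hummu")
ingr_map.loc[is_bell_pepper & ~is_excluded, "replaced"] = "bell pepper"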
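
Note that the pattern `.*bell pepper` also matches "campbell pepper jack soup" (row 2172 in the output above), which is only filtered out because of the `soup` exclusion. As a possible alternative, and not what this notebook does, a word boundary could reject that row at the pattern level. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; the middle one mimics the "campbell" false positive.
toy = pd.Series([
    "red bell pepper",
    "campbell pepper jack soup",
    "sweet bell pepper",
])

# Pattern used in the notebook: "bell pepper" anywhere after any prefix.
loose = toy.str.match(".*bell pepper")

# Assumed alternative: require a word boundary before "bell".
strict = toy.str.contains(r"\bbell pepper")

print(loose.tolist())
print(strict.tolist())
```

The exclusion list would still be needed for rows such as "roasted bell pepper hummu", which contain the phrase as a proper word.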

**Pasta**

In [6]:
types_of_pasta = ["pasta", "spaghetti", "rotini", "fettuccine", "angel hair", 
                  "linguine", "fusilli", "elbows", "farfalle", "penne",
                  "rotelli", "rigatoni", "ziti", "conchiglie"]
ingr_map.loc[ingr_map["replaced"].str.contains("|".join(types_of_pasta))].head(10)

Unnamed: 0,raw_ingr,raw_words,processed,len_proc,replaced,count,id
172,prego extra chunky mushroom & diced tomato spa...,9,prego chunky mushroom & diced tomato spaghetti...,52,pasta sauce,441,5205
173,prego pasta sauce with tomato basil and garlic,8,prego pasta sauce,17,pasta sauce,441,5205
174,low fat prepared pasta sauce,5,pasta sauce,11,pasta sauce,441,5205
175,prego ricotta and parmesan sauce,5,prego ricotta and parmesan sauce,32,pasta sauce,441,5205
176,pasta sauce with meat,4,pasta sauce,11,pasta sauce,441,5205
177,pasta sauce with vegetables,4,pasta sauce,11,pasta sauce,441,5205
178,fat free pasta sauce,4,pasta sauce,11,pasta sauce,441,5205
179,pasta sauce with mushrooms,4,pasta sauce,11,pasta sauce,441,5205
180,low-carb pasta sauce,3,pasta sauce,11,pasta sauce,441,5205
181,prego spaghetti sauce,3,prego spaghetti sauce,21,pasta sauce,441,5205


In [7]:
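# Collapse pasta shapes into "pasta", skipping sauces, salads, and other non-pasta items.
is_pasta = ingr_map["replaced"].str.contains("|".join(types_of_pasta))
is_excluded = ingr_map["replaced"].str.contains("sauce|salad|soup|dough|squash|spinach")
ingr_map.loc[is_pasta & ~is_excluded, "replaced"] = "pasta"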

In [8]:
ingr_map.loc[ingr_map["replaced"] == "pasta", "count"].sum()

6360

Now let's look at our final ingredients.

In [9]:
ingr_map.loc[ingr_map["replaced"].isin(ingredient_set)]

Unnamed: 0,raw_ingr,raw_words,processed,len_proc,replaced,count,id
0,"medium heads bibb or red leaf lettuce, washed,...",13,"medium heads bibb or red leaf lettuce, washed,...",73,lettuce,4507,4308
1,mixed baby lettuces and spring greens,6,mixed baby lettuces and spring green,36,lettuce,4507,4308
2,romaine lettuce leaf,3,romaine lettuce leaf,20,lettuce,4507,4308
3,iceberg lettuce leaf,3,iceberg lettuce leaf,20,lettuce,4507,4308
4,red romaine lettuce,3,red romaine lettuce,19,lettuce,4507,4308
...,...,...,...,...,...,...,...
11170,cabbage,1,cabbage,7,cabbage,1593,893
11220,olive,1,olive,5,olive,301,5003
11368,spaghettini,1,spaghettini,11,pasta,35,6714
11474,fusilli,1,fusilli,7,pasta,121,3167


In [10]:
required_ingredient_ids_set = set(ingr_map.loc[ingr_map["replaced"].isin(ingredient_set), "id"].unique())
len(required_ingredient_ids_set)

89

## Filter the recipes

In [88]:
recipes = pd.read_csv("kaggle_food/RAW_recipes.csv")
recipes.rename(columns={"id": "recipe_id"}, inplace=True)
recipes["ingredients"] = recipes["ingredients"].apply(ast.literal_eval)
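
The `ingredients` column is stored in the CSV as the text of a Python list, so it has to be parsed before it can be iterated. A minimal sketch of what `ast.literal_eval` does to one such cell (the string here is made up to resemble a row):

```python
import ast

# Hypothetical cell value as it would appear after pd.read_csv.
raw = "['winter squash', 'mexican seasoning']"

# literal_eval parses Python literals only; it does not execute code.
parsed = ast.literal_eval(raw)

print(type(parsed).__name__, parsed)
```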

In [89]:
# Map each raw ingredient string to its ingredient id
ingr_dict = ingr_map.set_index("raw_ingr")["id"].to_dict()

In [90]:
recipes["ingredient_ids"] = (
    recipes["ingredients"].apply(
        lambda l: [ingr_dict[i] for i in l
        if i in ingr_dict]
    )
)

In [91]:
recipes.head(10)

Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,ingredient_ids
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"[winter squash, mexican seasoning, mixed spice...",7,"[7933, 4694, 4795, 3723, 840, 5006, 6270]"
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"[prepared pizza crust, sausage patty, eggs, mi...",6,"[5481, 6324, 2499, 4717, 6276, 1170]"
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"[ground beef, yellow onions, diced tomatoes, t...",13,"[3484, 7979, 2131, 7229, 7235, 6189, 4062, 765..."
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","[spreadable cheese with garlic and herbs, new ...",11,"[1170, 4918, 6426, 5185, 7099, 5006, 6009, 627..."
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"[tomato juice, apple cider vinegar, sugar, sal...",8,"[7227, 155, 6906, 6270, 5319, 1564, 1521, 2430]"
5,apple a day milk shake,5289,0,1533,1999-12-06,"['15-minutes-or-less', 'time-to-make', 'course...","[160.2, 10.0, 55.0, 3.0, 9.0, 20.0, 7.0]",4,"['combine ingredients in blender', 'cover and ...",,"[milk, vanilla ice cream, frozen apple juice c...",4,"[4717, 7474, 2946, 150]"
6,aww marinated olives,25274,15,21730,2002-04-14,"['15-minutes-or-less', 'time-to-make', 'course...","[380.7, 53.0, 7.0, 24.0, 6.0, 24.0, 6.0]",4,['toast the fennel seeds and lightly crush the...,my italian mil was thoroughly impressed by my ...,"[fennel seeds, green olives, ripe olives, garl...",9,"[2587, 3437, 5002, 3184, 5324, 5068, 5058, 131..."
7,backyard style barbecued ribs,67888,120,10404,2003-07-30,"['weeknight', 'time-to-make', 'course', 'main-...","[1109.5, 83.0, 378.0, 275.0, 96.0, 86.0, 36.0]",10,['in a medium saucepan combine all the ingredi...,this recipe is posted by request and was origi...,"[pork spareribs, soy sauce, fresh garlic, fres...",22,"[5622, 6696, 2807, 2809, 1329, 2780, 6270, 277..."
8,bananas 4 ice cream pie,70971,180,102353,2003-09-10,"['weeknight', 'time-to-make', 'course', 'main-...","[4270.8, 254.0, 1306.0, 111.0, 127.0, 431.0, 2...",8,"['crumble cookies into a 9-inch pie plate , or...",,"[chocolate sandwich style cookies, chocolate s...",6,"[1397, 1447, 7474, 342, 6858, 7702]"
9,beat this banana bread,75452,70,15892,2003-11-04,"['weeknight', 'time-to-make', 'course', 'main-...","[2669.3, 160.0, 976.0, 107.0, 62.0, 310.0, 138.0]",12,"['preheat oven to 350 degrees', 'butter two 9x...",from ann hodgman's,"[sugar, unsalted butter, bananas, eggs, fresh ...",9,"[6906, 7367, 342, 2499, 2832, 5068, 911, 335, ..."


In [92]:
len(recipes)

231637

In [93]:
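# Keep recipes with at least one required ingredient; .copy() avoids
# SettingWithCopyWarning when columns are reassigned later.
recipes = recipes[recipes["ingredient_ids"].apply(lambda l: bool(set(l) & required_ingredient_ids_set))].copy()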
len(recipes)

133044
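
The filtering step can be sketched on toy data as follows. All names and ids here (`toy_ingr_dict`, `required_ids`, the recipe rows) are hypothetical stand-ins for `ingr_dict`, `required_ingredient_ids_set`, and `recipes`; the logic is the same name-to-id lookup followed by a set-overlap test.

```python
import pandas as pd

# Hypothetical name -> id lookup and required id set.
toy_ingr_dict = {"ground beef": 1, "yellow onions": 2, "sugar": 3}
required_ids = {2}

toy_recipes = pd.DataFrame({
    "recipe_id": [10, 11, 12],
    "ingredients": [
        ["ground beef", "yellow onions"],
        ["sugar"],
        ["sugar", "unknown spice"],  # unknown names are silently dropped
    ],
})

# Translate ingredient names to ids, skipping names missing from the lookup.
toy_recipes["ingredient_ids"] = toy_recipes["ingredients"].apply(
    lambda names: [toy_ingr_dict[n] for n in names if n in toy_ingr_dict]
)

# Keep a recipe when its ids overlap the required set.
keep = toy_recipes["ingredient_ids"].apply(
    lambda ids: not required_ids.isdisjoint(ids)
)
kept_ids = toy_recipes.loc[keep, "recipe_id"].tolist()
print(kept_ids)
```

Only recipe 10 survives, since it is the only one containing an id from the required set.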

In [94]:
# These columns are also stored as stringified lists; parse them into Python lists.
for col in ["tags", "steps", "nutrition"]:
    recipes[col] = recipes[col].apply(ast.literal_eval)

## Interactions

In [95]:
interactions_train = pd.read_csv("kaggle_food/interactions_train.csv")
interactions_test = pd.read_csv("kaggle_food/interactions_test.csv")
interactions_validation = pd.read_csv("kaggle_food/interactions_validation.csv")
interactions_train.head(10)

Unnamed: 0,user_id,recipe_id,date,rating,u,i
0,2046,4684,2000-02-25,5.0,22095,44367
1,2046,517,2000-02-25,5.0,22095,87844
2,1773,7435,2000-03-13,5.0,24732,138181
3,1773,278,2000-03-13,4.0,24732,93054
4,2046,3431,2000-04-07,5.0,22095,101723
5,2046,13307,2000-05-21,5.0,22095,134551
6,2312,780,2000-09-12,5.0,1674,127175
7,2312,51964,2000-09-26,5.0,1674,151793
8,2312,1232,2000-10-17,4.0,1674,15498
9,2312,4397,2000-10-17,5.0,1674,14380


In [98]:
# An inner merge keeps only interactions whose recipe survived the filter above
interactions = pd.concat([
    interactions_train.merge(recipes[["recipe_id"]], how="inner", on="recipe_id"),
    interactions_test.merge(recipes[["recipe_id"]], how="inner", on="recipe_id"),
    interactions_validation.merge(recipes[["recipe_id"]], how="inner", on="recipe_id")
]).drop(columns="i")

In [99]:
len(interactions)

420360
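
To illustrate why the inner merge acts as a filter, here is a small sketch with made-up frames (`toy_recipes`, `toy_train`, `toy_test` are hypothetical). An interaction row is kept only when its `recipe_id` appears in the filtered recipe list. Passing `ignore_index=True` is an addition in this sketch to get a clean 0..n-1 index; the cell above does not use it.

```python
import pandas as pd

# Hypothetical filtered recipes and two interaction splits.
toy_recipes = pd.DataFrame({"recipe_id": [10, 12]})
toy_train = pd.DataFrame({
    "user_id": [1, 2], "recipe_id": [10, 11], "rating": [5.0, 4.0], "i": [0, 1],
})
toy_test = pd.DataFrame({
    "user_id": [3], "recipe_id": [12], "rating": [3.0], "i": [2],
})

# Inner merge drops interactions for recipes that were filtered out (id 11).
toy_interactions = pd.concat(
    [split.merge(toy_recipes, how="inner", on="recipe_id")
     for split in (toy_train, toy_test)],
    ignore_index=True,
).drop(columns="i")

print(toy_interactions)
```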

## Serialize the data

In [100]:
ingr_map.to_pickle("dataset/our_ingr_map.pkl")
recipes.to_pickle("dataset/our_recipes.pkl")
interactions.to_pickle("dataset/our_interactions.pkl")