# Data Preprocessing
## Overview
1. Read recipe data that was scraped from allrecipes.com
2. Combine the separate scraped datasets into a single file to be loaded into the database
3. Clean the ingredient data, keeping only the ingredient names to be passed to TF-IDF, etc.

In [1]:
import pandas as pd
import numpy as np

import os
import re
import glob

## 1. Aggregate scraped data

### Load in data

Point to the directory where the recipe CSVs are stored

In [2]:
DATADIR = '../data/'
# glob does not guarantee an order, so sort to read the files the same way every run
recipe_filenames = sorted(glob.glob(DATADIR + 'recipe_info_dataset*'))

In [3]:
recipe_filenames

['../data/recipe_info_dataset2.csv',
 '../data/recipe_info_dataset3.csv',
 '../data/recipe_info_dataset4.csv',
 '../data/recipe_info_dataset5.csv',
 '../data/recipe_info_dataset6.csv']

In [4]:
# Read every CSV and stack the rows into one DataFrame.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
frames = [pd.read_csv(filename, header=0) for filename in recipe_filenames]
df = pd.concat(frames, ignore_index=True)
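
To see what the row-wise merge does on a small scale, here is a minimal sketch using `pd.concat`. The two toy frames and their values are made up for illustration and are not taken from the scraped files.

```python
import pandas as pd

# Two toy frames standing in for separate scraped CSVs (hypothetical values)
part_a = pd.DataFrame({"recipe_id": [1, 2], "recipe_name": ["a", "b"]})
part_b = pd.DataFrame({"recipe_id": [3], "recipe_name": ["c"]})

# Stack the rows; ignore_index=True discards each frame's own index
# and numbers the merged rows 0..n-1
merged = pd.concat([part_a, part_b], ignore_index=True)

print(merged.shape)
print(list(merged.index))
```

Without `ignore_index=True`, the row labels from each file would repeat in the merged table.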

In [5]:
df.shape

(17777, 8)

### Sample the data
- We don't need this much data for a demonstration, so let's take a random sample of 1,000 recipes (the fixed `random_state` keeps it reproducible)

In [6]:
df = df.sample(n=1000, random_state=41234)
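
The effect of fixing `random_state` can be checked with a small sketch. The toy frame below is hypothetical; the point is only that two calls with the same seed select the same rows, and that the sampled rows keep their original index labels.

```python
import pandas as pd

# Toy frame with 100 rows (made-up data)
toy = pd.DataFrame({"recipe_id": range(100)})

# Same n and same seed on the same frame
first = toy.sample(n=10, random_state=41234)
second = toy.sample(n=10, random_state=41234)

# True: both samples contain the same rows in the same order
print(first.index.equals(second.index))
```

Because `sample` keeps the original row labels, the index of the sampled table is no longer sequential.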

Take a look at what came out:

In [7]:
df.head()

| | recipe_id | recipe_name | recipe_url | ingredients | cook_time | calorie_count | review_count | overall_rating |
|---|---|---|---|---|---|---|---|---|
| 14932 | 237052 | homemade-coffee-creamer-chocolate-caramel | https://www.allrecipes.com/recipe/237052/homem... | ['1 (14 ounce) can sweetened condensed milk', ... | 10 m | 243 cals | 4 reviews | 3.25 |
| 15237 | 237363 | double-chocolate-granola | https://www.allrecipes.com/recipe/237363/doubl... | ['3 cups rolled oats', '1 cup chopped cashews'... | 55 m | 190 cals | 1 reviews | 5.0 |
| 10018 | 231960 | swedish-nutballs | https://www.allrecipes.com/recipe/231960/swedi... | ['2 cups all-purpose flour', '1 cup butter, so... | 55 m | 297 cals | 1 reviews | 5.0 |
| 14611 | 236729 | mediterranean-bulgur-wheat | https://www.allrecipes.com/recipe/236729/medit... | ['2 cups chicken broth', '1 cup bulgur', '1 ta... | 30 m | 213 cals | 0 reviews | 0.0 |
| 5157 | 222411 | sea-breeze-cocktail | https://www.allrecipes.com/recipe/222411/sea-b... | ['1 1/2 fluid ounces vodka', '4 fluid ounces c... | 5 m | 196 cals | 10 reviews | 4.588235 |


### Save this merged table to be read into the database

In [8]:
df.to_csv('../FoodFlix/static/data/recipes_all_data.csv')

## 2. Prepare data to be used in recommendation system
- The recommender works from the ingredients, keyed by recipe id; the other columns are kept alongside them

In [9]:
df = df[['recipe_id', 'ingredients', 'recipe_name', 'calorie_count', 'cook_time', 'recipe_url', 'review_count', 'overall_rating']]
df.set_index('recipe_id', inplace=True)

### Remove all non-alphabetical characters

In [10]:
def clean_ingredients(ingredient):
    # Replace every non-letter character (digits, punctuation, brackets) with a space, then lowercase
    return re.sub(r"[^a-zA-Z]", " ", ingredient).strip().lower()

In [11]:
df['ingredients'] = df['ingredients'].apply(clean_ingredients)
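
As a quick check of what this step does, the sketch below applies the same regular expression to a sample string built from two ingredient entries shown in the table above. The variable names are only for illustration.

```python
import re

def clean_ingredients(ingredient):
    # Replace every non-letter character with a space, then lowercase
    return re.sub(r"[^a-zA-Z]", " ", ingredient).strip().lower()

raw = "['1 (14 ounce) can sweetened condensed milk', '2 cups all-purpose flour']"
cleaned = clean_ingredients(raw)

# Each removed character leaves a space behind, so split() is used
# here to show the surviving words without the extra whitespace
print(cleaned.split())
```

Quantities, brackets, and quotes disappear, and hyphenated words such as `all-purpose` are split into two words. Units like `ounce` and `cups` remain at this stage.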

### Remove stop words

In [12]:
with open('../data/stopwords.txt', 'r') as infile:
    # A set gives fast membership tests; split() also drops blank entries
    stop_words = set(infile.read().split())

In [13]:
def remove_stop_words(ingredients):
    clean = [k for k in ingredients.split() if k not in stop_words]
    clean = ' '.join(clean).strip()
    return clean

In [14]:
df['ingredients'] = df['ingredients'].apply(remove_stop_words)
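
The filtering step can be illustrated on a single string. The stop words below are an assumed, hypothetical subset chosen for the example; the actual list is whatever `../data/stopwords.txt` contains.

```python
# Hypothetical stop words for illustration only;
# the real list is read from ../data/stopwords.txt
stop_words = {"cup", "cups", "ounce", "can", "chopped"}

def remove_stop_words(ingredients):
    # Keep only the words that are not in the stop-word set
    kept = [word for word in ingredients.split() if word not in stop_words]
    return " ".join(kept)

cleaned = remove_stop_words("cups rolled oats   cup chopped cashews")
print(cleaned)  # rolled oats cashews
```

Since `split()` with no arguments collapses runs of whitespace, the extra spaces left by the previous cleaning step are removed here as well.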

### Convert calories and review counts into numbers

In [15]:
df['calorie_count'] = pd.to_numeric( df['calorie_count'].str.replace("cals","") )

In [16]:
df['review_count'] = pd.to_numeric( df['review_count'].str.replace("reviews","") )
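
A more defensive variant of this conversion, shown as a sketch rather than as the notebook's approach, extracts the leading digits with a regular expression and coerces anything unparseable to `NaN`. The sample values are made up, apart from the `"243 cals"` format seen in the table.

```python
import pandas as pd

# Toy column in the same "<number> cals" format, with one missing entry
raw = pd.Series(["243 cals", "190 cals", None])

# Pull out the first run of digits, then convert to numbers;
# rows with no match (or missing values) become NaN
digits = raw.str.extract(r"(\d+)", expand=False)
calories = pd.to_numeric(digits, errors="coerce")

print(calories.tolist())
```

Rows that end up as `NaN` can then be dropped with `dropna`, as in the next step.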

### Remove recipes with no calorie count

In [17]:
df.dropna(subset=['calorie_count'], inplace=True)
df.loc[df['recipe_name']=='date-almond-and-yogurt-bread']

| recipe_id | ingredients | recipe_name | calorie_count | cook_time | recipe_url | review_count | overall_rating |
|---|---|---|---|---|---|---|---|
| 257432 | slivered almonds white sugar butter milk yogur... | date-almond-and-yogurt-bread | 470.0 | 1 h 10 m | https://www.allrecipes.com/recipe/257432/date-... | 1 | 4.0 |


### Check if the recipe names are unique:

In [18]:
len(df)

992

In [19]:
len(df['recipe_name'].unique())

992

Both counts match, so this sample has no duplicate names. The filter below is kept as a safeguard and drops duplicates if any appear:

In [20]:
df = df[~df['recipe_name'].duplicated()]
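
To make the mask explicit, here is a small sketch on a toy frame with made-up names. `duplicated()` marks every repeat after the first occurrence, and `~` inverts the mask so that only first occurrences are kept.

```python
import pandas as pd

# Toy frame where "granola" appears twice (hypothetical data)
toy = pd.DataFrame({"recipe_name": ["granola", "nutballs", "granola"]})

# True for each row whose name was already seen earlier
mask = toy["recipe_name"].duplicated()
print(mask.tolist())  # [False, False, True]

# Keep only the first occurrence of each name
deduped = toy[~mask]
print(len(deduped))  # 2
```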

In [21]:
df.shape

(992, 7)

### Remove empty ingredient lists

In [22]:
df = df[df['ingredients'] != '']
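
An ingredient string can end up empty when every word in it was a number, punctuation, or a stop word. The sketch below shows the filter on a toy frame with made-up rows; it uses a string-length comparison, which is one of several equivalent ways to express the check.

```python
import pandas as pd

# Toy frame with one empty ingredient string (hypothetical data)
toy = pd.DataFrame(
    {"ingredients": ["rolled oats cashews", "", "vodka cranberry juice"]}
)

# Keep rows whose ingredient string has at least one character
non_empty = toy[toy["ingredients"].str.len() > 0]
print(len(non_empty))  # 2
```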

In [23]:
df.shape

(992, 7)

In [24]:
df.head()

| recipe_id | ingredients | recipe_name | calorie_count | cook_time | recipe_url | review_count | overall_rating |
|---|---|---|---|---|---|---|---|
| 237052 | sweetened condensed milk heavy whipping cream ... | homemade-coffee-creamer-chocolate-caramel | 243.0 | 10 m | https://www.allrecipes.com/recipe/237052/homem... | 4 | 3.25 |
| 237363 | rolled oats cashews shredded coconut dried mix... | double-chocolate-granola | 190.0 | 55 m | https://www.allrecipes.com/recipe/237363/doubl... | 1 | 5.0 |
| 231960 | flour butter confectioners sugar walnuts confe... | swedish-nutballs | 297.0 | 55 m | https://www.allrecipes.com/recipe/231960/swedi... | 1 | 5.0 |
| 236729 | chicken broth bulgur olive oil onion stalks ce... | mediterranean-bulgur-wheat | 213.0 | 30 m | https://www.allrecipes.com/recipe/236729/medit... | 0 | 0.0 |
| 222411 | vodka cranberry juice grapefruit juice ice cub... | sea-breeze-cocktail | 196.0 | 5 m | https://www.allrecipes.com/recipe/222411/sea-b... | 10 | 4.588235 |


### Save the cleaned ingredients to file

In [25]:
df.to_csv('../FoodFlix/static/data/clean_ingredients.csv')