# Cooking Recipes dataset

The cooking recipes dataset is a collection of HTML webpages. On the webpage that hosts the dataset, we found two Perl scripts: one that extracts the ingredients from the recipes and another that extracts the nutrients from a recipe.
To use those scripts, we clean the log file that came with the dataset to turn it into a list of HTTP links. We then execute the scripts with this cleaned file as input and redirect their output into another file, which we format so that we can use it.

## Cleaning msg.txt

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
log = open("data/msg.txt", mode='r')
clean_log = open("data/msg_clean.txt", mode='w')

In [3]:
content = log.readlines()

In [4]:
for line in content:
    # Fields are separated by spaces on most lines and by tabs on the others
    fields = line.split(' ')
    if len(fields) == 1:
        url = line.split('\t')[1]
    else:
        url = fields[2]
    # Drop the first 4 characters that precede the link itself
    clean_log.write(url[4:])
    clean_log.write('\n')

# Flush the cleaned links to disk and release both file handles
clean_log.close()
log.close()
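
As an alternative to splitting on a fixed separator and slicing off a fixed-length prefix, the link can be pulled out of each line with a regular expression. This is only a sketch: the two sample lines and the `url=` prefix are hypothetical, since the real layout of msg.txt is not shown here, and `extract_url` is a helper name introduced for illustration.

```python
import re

# Matches an http or https link up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")

def extract_url(line):
    """Return the first link found in a log line, or None if there is none."""
    match = URL_PATTERN.search(line)
    return match.group(0) if match else None

# Hypothetical log lines: one space-separated, one tab-separated
sample_lines = [
    "123 200 url=http://example.com/a.html\n",
    "123\turl=http://example.com/b.html\n",
]
links = [extract_url(line) for line in sample_lines]
print(links)
```

Because the pattern stops at whitespace, the trailing newline is never part of the extracted link, and the same code handles both separators.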

## Cleaning nutrients.txt

The nutrients.txt file is the output of the extractNutrientsFromRecipes Perl script when executed on the cleaned msg.txt file. It contains the nutrients of each recipe (in mg). Recipes are identified by an MD5 hash of their HTML. Their ids are therefore the same as the ones found in ingredients.txt.
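
To illustrate what such an identifier looks like, here is a minimal sketch that derives an id from a page with Python's `hashlib`. The function name `recipe_id` is hypothetical, and which exact bytes the Perl script hashes (raw file, decoded text, with or without headers) is an assumption, so ids computed this way are not guaranteed to match the ones in nutrients.txt.

```python
import hashlib

def recipe_id(html: bytes) -> str:
    """Return the MD5 digest of a page as a 32-character hex string."""
    return hashlib.md5(html).hexdigest()

# Hypothetical page content, for illustration only
page = b"<html><body>Chocolate chip cookies</body></html>"
print(recipe_id(page))
```

The same bytes always give the same id, which is why the ids can be used to join nutrients.txt with ingredients.txt.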

In [5]:
# Open the file
nutrients_txt = open("data/recipeClean/nutrients.txt", mode='r', encoding="ISO-8859-1")

In [6]:
# Get all the lines
content = nutrients_txt.readlines()

In [7]:
# Retrieve each nutrient and the id of the recipes
nutrients = []
ids = np.zeros(len(content), dtype=object)

def check_nan(nutrient):
    # '?' and '-' mark a missing value in nutrients.txt
    if nutrient in ('?', '-'):
        return np.nan
    return float(nutrient)

for i in range(len(content)):
    line = content[i].split('\t')
    nutrients.append((check_nan(line[0]),
                     check_nan(line[1]),
                     check_nan(line[2]),
                     check_nan(line[3]),
                     check_nan(line[4]),
                     check_nan(line[5])))
    ids[i] = line[6]

In [8]:
# Construct a DataFrame where nutrients are indexed by the ids of the recipes
nutrients = np.array(nutrients)
nutrients_df = pd.DataFrame(nutrients, columns= ('kcal', 'carb', 'fat', 'prot', 'sodium', 'chol'), index=ids)
#nutrients_df.to_json("nutrients_cookies_recipes.json")
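
The manual parsing above can also be expressed with `pd.read_csv`, which accepts a list of strings to treat as missing values. The sketch below assumes the layout described above (six tab-separated nutrient columns followed by the recipe id) and uses two made-up rows instead of the real file; for the real file one would pass the path together with `encoding="ISO-8859-1"`.

```python
import io
import pandas as pd

# Two made-up rows in the assumed layout: six nutrients, then the recipe id
sample = (
    "100.0\t20.5\t3.0\t4.0\t150.0\t10.0\tid_a\n"
    "?\t-\t1.0\t2.0\t?\t0.0\tid_b\n"
)

columns = ['kcal', 'carb', 'fat', 'prot', 'sodium', 'chol', 'id']
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    header=None,           # the file has no header row
    names=columns,
    na_values=['?', '-'],  # both markers are read as NaN
    index_col='id',
)
print(sample_df)
```

Letting pandas handle the missing-value markers removes the need for a helper such as `check_nan` and yields float columns directly.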

### Light Analysis of the nutrients dataframe

In [9]:
nutrients_df.describe()

| | kcal | carb | fat | prot | sodium | chol |
|---|---|---|---|---|---|---|
| count | 34675.0 | 34634.0 | 34623.0 | 34630.0 | 33824.0 | 33631.0 |
| mean | 347.626311 | 140.007216 | 154.379675 | 56.039255 | 614.618764 | 63.253001 |
| std | 438.068277 | 233.083793 | 246.40588 | 73.188498 | 1937.704896 | 105.702694 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 158.0 | 48.5 | 45.45 | 12.0 | 147.0 | 4.0 |
| 50% | 270.0 | 102.4 | 102.6 | 28.8 | 343.0 | 35.0 |
| 75% | 422.05 | 172.0 | 193.5 | 84.8 | 714.0 | 84.6 |
| max | 14082.0 | 11342.8 | 11222.1 | 2770.68 | 113226.6 | 4149.4 |


In [10]:
# A row where every nutrient is 0 is really a row with missing data, so replace it with NaN
nutrients_df[(nutrients_df['kcal'] == 0) & 
             (nutrients_df['carb'] == 0) & 
             (nutrients_df['fat'] == 0) & 
             (nutrients_df['prot'] == 0) &
             (nutrients_df['sodium'] == 0) &
             (nutrients_df['chol'] == 0)] = np.nan
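
The same all-zero test can be written without listing every column, by comparing the whole frame to 0 and reducing along the rows. The sketch below uses a small made-up frame named `toy` rather than `nutrients_df`, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: row 'a' is all zeros, row 'c' has a single zero
toy = pd.DataFrame(
    {'kcal': [0.0, 120.0, 0.0], 'fat': [0.0, 5.0, 2.0]},
    index=['a', 'b', 'c'],
)

# True only for rows in which every column equals 0
all_zero = (toy == 0).all(axis=1)

# Replace those rows with NaN; rows with only some zeros are left untouched
toy.loc[all_zero] = np.nan
print(toy)
```

Only row `a` is blanked here: a lone zero, as in row `c`, is kept because it can be a genuine value.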

In [11]:
nutrients_df.describe()

| | kcal | carb | fat | prot | sodium | chol |
|---|---|---|---|---|---|---|
| count | 34637.0 | 34596.0 | 34585.0 | 34592.0 | 33786.0 | 33593.0 |
| mean | 348.007689 | 140.160999 | 154.549299 | 56.100815 | 615.310041 | 63.324552 |
| std | 438.157086 | 233.165552 | 246.488042 | 73.205102 | 1938.684615 | 105.741041 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 158.3 | 48.8 | 45.9 | 12.0 | 148.0 | 4.1 |
| 50% | 270.0 | 102.4 | 102.6 | 29.2 | 344.0 | 35.0 |
| 75% | 422.5 | 172.0 | 193.5 | 84.8 | 715.0 | 85.0 |
| max | 14082.0 | 11342.8 | 11222.1 | 2770.68 | 113226.6 | 4149.4 |


In [12]:
nutrients_df.isnull().sum()

kcal      29612
carb      29653
fat       29664
prot      29657
sodium    30463
chol      30656
dtype: int64

If we take the smallest NaN count among the nutrients (29612, for kcal) and divide it by the total number of rows, we get 29612/64249 = 0.46. This is close to the 47% of missing values reported [here](http://infolab.stanford.edu/~west1/from-cookies-to-cooks/).
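
The arithmetic can be checked from the counts printed above: the total number of rows is the number of non-null kcal values plus the number of NaN kcal values. The sketch below hard-codes those counts; on the real data, `nutrients_df.isnull().mean()` should give the same per-column fractions directly.

```python
# Counts copied from the describe() and isnull().sum() outputs above
non_null_kcal = 34637
nan_counts = {
    'kcal': 29612, 'carb': 29653, 'fat': 29664,
    'prot': 29657, 'sodium': 30463, 'chol': 30656,
}

# Every row is either null or non-null for kcal
total_rows = non_null_kcal + nan_counts['kcal']

# Fraction of missing values per nutrient
fractions = {name: count / total_rows for name, count in nan_counts.items()}
lowest = min(fractions.values())
highest = max(fractions.values())

print(total_rows)
print(round(lowest, 2), round(highest, 2))
```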