# Meet the meat

## Abstract

With increasingly dire climate change forecasts, concerned individuals are asking how they can minimize their carbon footprint. Recent research suggests that reducing one's consumption of meat, in particular beef, is one of the highest impact actions an individual can take. To examine this topic, we will explore the popularity and prevalence of meat in recipes. Specifically, we plan to extract the ingredients from a recipe database and calculate the carbon footprint of recipes

Finally, we hope to directly relate this data to the issue of climate change by estimating a rating reflecting the carbon footprint of meat in recipes and the environmental impact of consumers' diets.

### Imports and libraries

In [1]:
# Import libraries
import re
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import os, os.path as osp

In [2]:
DATA_FOLDER='data'
SAMPLE_DATA_FOLDER = DATA_FOLDER + '/sample_400/'

## Data extraction and cleaning

Our recipe dataset contains recipes from the [From Cookies to Cooks](http://infolab.stanford.edu/~west1/from-cookies-to-cooks/), combining recipes from 14 high-traffic websites. We start by extracting all the information we want from the HTML files, that is: title, ingredients and meat or animal protein ingredients, tags, ratings in order to explore the recipes in more detail.


#### Recipe webpage scraping
The websites' HTML sources are rich in information. However, the information we wantfrom these pages is rather limited. We extract the information we need from the websites, clean and pre-process the data and save it as a CSV file for easy retrieval in further processing.

In [59]:
def find_website(soup):
    """
    Finds if the page is a recipe and which website it comes from
    """
    is_recipe = True
    
    if 'allrecipes' in soup.title.string.casefold():
        website = 'allrecipes'               
              
    elif 'epicurious' in soup.title.string.casefold():
        website = 'epicurious'
    
    elif 'food network' in soup.title.string.casefold():
        website = 'food_network'
        
    elif 'food.com' in soup.title.string.casefold():
        website == 'food_com'
    
    elif 'betty crocker' in soup.title.string.casefold() or 'bettycrocker' in soup.title.string.casefold():
        website = 'betty_crocker'
               
    elif 'myrecipes' in soup.title.string.casefold():
        website = 'my_recipes'
    
    elif 'taste.com' in soup.title.string.casefold():
        website = 'taste'

    else:
        website = 'not found'
        is_recipe = False
        
    return is_recipe, website

In [352]:
def analyse_page(soup, page):
    """
    Input: 
        soup
        page: 'allrecipes', 'epicurious', 'food_network', 'food_com', 'betty_crocker', 'my_recipes' , others not implemented yet
    
    Output:
        tags = list of tags assigned to the recipe
        ings = list of ingredients
    """
    ings = []
    tags = []
    rate = 0
    serv = 0
    if page == 'allrecipes':
        # Extract tags
        tag_wrappers = soup.find_all(itemprop="recipeCategory")
        for tag in tag_wrappers:
            tags.append(tag['content'])           
        # Extract ingredients
        ing_wrap=soup.find_all('li', class_="plaincharacterwrap ingredient")
        if ing_wrap:
            for ing in ing_wrap:
                ings.append(ing.getText())
        else:##?
            ing_wrap=soup.find_all(itemprop="recipeIngredient")

            for ing in ing_wrap:
                ings.append(ing.getText())
        # Extract Ratings
        #rate = soup.find('p', class_="reviewP")
        
        # Extract Number of servings
        serv=soup.find('span', class_="yield yieldform").text
        
        
    elif page == 'epicurious':       
        # Extract tags

        tag_wrappers = soup.find_all(itemprop="recipeCuisine")
        for tag in tag_wrappers:
            tags.append(tag.getText())    
        tag_wrappers = soup.find_all(itemprop="recipeCategory")
        for tag in tag_wrappers:
            tags.append(tag.getText())        
        # Extract ingredients
        ing_wrap=soup.find('div', id="ingredients")
        if ing_wrap:
            for ing in ing_wrap:
                ings.append(ing.string)
        if None in ings:
            ings=[]
            ing_wrap=soup.find_all('li', class_="ingredient")
            for ing in ing_wrap:
                ings.append(ing.string)
        #extract serving size        
        serv=soup.find('span',class_='yield').text
    
    elif page == 'food_network':  
        # Extract tags
        tag_wrappers = soup.find_all(class_="btn grey-tags")        
        for tag in tag_wrappers:
            tags.append(tag.getText())      
        # Extract ingredients
        ing_wrap=soup.find_all('li',class_='ingredient')
        for ing in ing_wrap:
            ings.append(ing.text)
        #extract serving size
        serv0=soup.find('div', id='recipe-meta')
        serv1=serv0.find_all('dd', class_="clrfix")
        for ser in serv1:
            if any(char.isdigit() for char in ser.get_text()):
            #'serving' in ser.get_text() or 'cup' in ser.get_text():  ##NEED A BETTER WAY TO DO THIS (00a1cb2c972a31e50718971edd070a50)
                serv=ser.get_text()
    elif page == 'food_com':      
        # Extract tags
            #not found          
        # Extract ingredients
        ing_wrap=soup.find_all('li', class_="ingredient")
        if ing_wrap:
            for ing in ing_wrap:
                ings.append((ing.find('span',class_='value').text+ ' '+ing.find('span',class_='type').text + ' ' + ing.find('span', class_='name').text))
        else:
            ing_wrap=soup.find_all(class_="name")
            for ing in ing_wrap:
                ings.append(ing.getText())
    
    elif page == 'betty_crocker':   
        # Extract tags
            #not found    
        # Extract ingredients
        ing_wrap=soup.find_all('dl', class_='ingredient')
        for ing in ing_wrap:
            ings.append(ing.getText())
            
        #get serving size
        serv=(soup.find('meta', itemprop='recipeYield'))['content'] #tag attribute 'content'
    #########################################################################3
    
    elif page == 'my_recipes':
        # Extract tags
        tag_wrappers = soup.find_all(itemprop="recipeType")
        for tag in tag_wrappers:
            tags.append(tag.getText())  
        # Extract ingredients
        ing_wrap=soup.find_all(itemprop="ingredient")
        for ing in ing_wrap:
            ings.append(ing.text)
            
        #extract serving size
        serv=soup.find('span', itemprop='yield').text
            
    elif page == 'taste':
        print('taste')
        ing_wrap = soup.find('div', class_="module recipe-ingredients")
        ing_wrap=ing_wrap.find_all('li')
        #print(ing_wrap)
        for ing in ing_wrap:
            ings.append(ing.text)
        serv=soup.find('h2', class_="ingredients").text
    #other websites    
        # Extract tags   
        # Extract ingredients 
        
    if not ing_wrap:  #return warning if website is recognized but format/data extraction is not successful
        print('Not a recipe format')  
        print('*******')
    
    #if not tags:
        #print('no tags found :( ')
        
    return tags, ings, rate, serv


In [353]:
#code to test individual files
with open(SAMPLE_DATA_FOLDER+'00c91ad77e6a57b1c83c37a6798ebe61.html') as f:
    page = f.read()
    soup = BeautifulSoup(page, 'html.parser')
            
            #check webpage and extract ingredients if recognised as recipe
    is_recipe, website = find_website(soup)
    print('Recipe Analysed: '+soup.title.string)
    print('filename: '+filename)
    #print(soup.prettify())
    
    #tags, ingredients = analyse_page(soup, website)
    tags, ingredients, rating, servings= analyse_page(soup, website)
    has_meat, meat_ingredients, meat_ingredients_str = contains_meat_ingredients(ingredients, meat_products)
    #Extract meat ingredients and quantities in kg
    #ingredient_quant_kg = extract_meat(meat_ingredients_str)
    print(servings)
    print('contains meat:'+str(has_meat))
    print(meat_ingredients_str)                  
    print('{0} Ingredients: '.format(len(ingredients)))
    print(ingredients)
    print('{0} tags:'.format(len(tags)))
    print(tags)
    print('does this recipe contain meat? ', has_meat)
    print('meat ingredients=', meat_ingredients_str)
    print('ingredient_quantity (kg)= ',ingredient_quant_kg)


Recipe Analysed: 









Grilled Rib-Eye Steaks with Parsley-Garlic Butter Recipe
 at Epicurious.com
filename: 00e0cea8855936712d9d52c21b705128.html
 Makes 6 servings
contains meat:True
['3 1 1/2-inch-thick rib-eye steaks (about 1 pound each)']
9 Ingredients: 
['For the Parsley-Garlic Butter, mix together in small bowl, then cover and chill:', '1/2 cup (1 stick) butter, softened', '1 tablespoon finely chopped fresh parsley', '1 tablespoon chopped fresh chives', '1 garlic clove, pressed', '2 teaspoons Cognac', 'Salt and pepper', 'Prepare barbecue (medium-high heat). Rub with generous amounts of salt and pepper:', '3 1 1/2-inch-thick rib-eye steaks (about 1 pound each)']
0 tags:
[]
does this recipe contain meat?  True
meat ingredients= ['3 1 1/2-inch-thick rib-eye steaks (about 1 pound each)']
ingredient_quantity (kg)=  [0.012]


#### Quantity extraction and conversion
The amounts of each ingredients are expressed in many different units (imperial or metric) depending on the websites, and even on the recipes. Once we have extracted the ingredients and amounts, we need to convert all different quantities to one single weight unit (fixed to kilograms) in order to process the carbon footprint of selected ingredients.

These functions are currently not fully implemented in the data structure but have been successfully tested on a subset of files.

In [363]:
check_quantity('5 dozen skinless')



('5 x12 skinless', ['5', '12'], 60.0)

In [370]:
def check_quantity(quant_str):
    """
    Cleans input string and extracts numerical values
    Outputs cleaned string, array of numerical values and sum of numerical values
    """
    quant_str=quant_str.replace("½",".5")
    quant_str=quant_str.replace("1/2",".5")
    quant_str=quant_str.replace("1/3", '.33')
    quant_str=quant_str.replace('1/4','.25')
    quant_str=quant_str.replace('3/4','.75')
    

    if 'dozen' in quant_str:
        quant_str=quant_str.replace('dozen', 'x12')
        quant_vals=re.findall(r"[+]?\d*\.\d+|\d+", quant_str)
        total_quant=np.prod([float(i) for i in quant_vals])
    else:
        quant_vals=re.findall(r"[+]?\d*\.\d+|\d+", quant_str)
        total_quant=np.sum([float(i) for i in quant_vals])    

    #matches positive decimals or whole numbers
    return quant_str, quant_vals, total_quant


def convert_to_kg(quant, unit):
    """
    Converts any input unit (kg, lb, grams, ounces) to kilograms
    """
    
    if (unit=='kilogram') or (unit=='kg'):
        amnt_kg=quant
        #print(quant,'kg')
    elif (unit=='pound') or (unit=='lb') or (unit=='lbs') or (unit=='pounds'):
        amnt_kg=quant/2.205
        #print(amnt_kg,'kg')
    elif(unit=='g') or (unit=='gram') or (unit =='grams'):
        amnt_kg=quant/1000
        #print(amnt_kg,'kg')     
    elif(unit=='oz') or (unit=='ounce'):
        amnt_kg=quant/35.274
        #print(amnt_kg, 'kg')
    elif(unit=='egg'):
        amnt_kg=quant*0.006 #1 egg weighs aproximately 60g
    else:
        print('unit not recognized')
        
    return(amnt_kg)

def contains_meat_ingredients(ings_in, meat_products_in):
    contains_meat=False
    meat_ingredients=[]
    meat_ingredients_str=[]
    for i in ings_in:
        for meat_product in meat_products_in:
            if i != None:
                if meat_product in i.casefold(): 
                    contains_meat=True
                    meat_ingredients.append(meat_product) 
                    meat_ingredients_str.append(i)
                    
    return contains_meat, meat_ingredients, meat_ingredients_str

def extract_meat(meat_ings_str):
    """
    Inputs: 
    ings_in= list of ingredients (and quantities)
     
    Outputs:
    ing_amnt_out = list of corresponding quanities of meat ingredients in kg (=0 if unit not recognized)
    
    """

    ing_amnt_out=[]
    #extract amount from string and convert to kg
    for meat_i in meat_ings_str:
        meat_i_quant_kg=0
        meat_i, quantity_vals, total_quantity=check_quantity(meat_i) #pass string, return cleaned string and total quantity
        #find appropriate units and convert to kg
        for u in units:
            if u in meat_i.casefold():
                meat_i_quant_kg = convert_to_kg(total_quantity,u)
        ing_amnt_out.append(meat_i_quant_kg)
        #if meat_i_quant_kg==0:
        #    print('Units not recognized for: '+meat_i)
                    
    return ing_amnt_out

def normalize_servings(ing_amount, servs):
    #input is ingredient quantity in kg
    #interpret servings\
    ing_norm=[]
    str_servs, val_servs, tot_servs = check_quantity(servs)
    for ing in ing_amount:
        ing_norm.append(ing/tot_servs)
    return ing_norm    
        
normalize_servings([5,12,72],'6 dozen servings')
        
    
    
    
    
    

72.0


[0.06944444444444445, 0.16666666666666666, 1.0]

#### Define carbon footprint of meat ingredients
Animal agriculture is one of the leading sources of the carbon-impact of a recipe. We start by assigning a carbon footprint to each meat ingredient and could later on extend it to other animal products. 
The functions below assign a carbon footprint to each meat ingredient of the recipes.

Source of data: [GreenEatz](https://www.greeneatz.com/foods-carbon-footprint.html)

In [194]:
#Load data from xls file
carbon_footprint = pd.read_excel('data/carbon_footprint_protein.xls', sheet_name='meat_dairy_eggs', index_col=0)
carbon_footprint

Unnamed: 0_level_0,Food,CO2 Kilos Equivalent,Car Miles Equivalent
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Lamb,39.2,91
2,Beef,27.0,63
3,Cheese,13.5,31
4,Pork,12.1,28
5,Turkey,10.9,25
6,Chicken,6.9,16
7,Tuna,6.1,14
8,Eggs,4.8,11


In [195]:
#List of meat ingredients
meat_products = carbon_footprint['Food'].tolist()
#same list copied without caps
meat_products = ['steak','lamb', 'beef', 'cheese', 'pork', 'turkey', 'chicken', 'tuna', 'egg']
#['lamb', 'beef', 'cheese', 'pork', 'turkey', 'chicken', 'tuna', 'egg']


In [196]:
units = ['pound','gram','oz','ounce','kg','kilogram','lb', 'egg' ]

In [197]:
# calculate carbon footprint
#input ingredients and amounts
#output carbon footprint
def carbon_fp (l):
    """
    takes a list of ingredients contributing to co2 and returns carbon footprint
    dictionary: grouping certain ingredients (steak-->beef)
    """
    c=len(l)
    return c

### Data extraction and cleaning loop
Below we extract the data from the recipes of our html dataset and save it in dataframes. Our goal here is to extract the ingredients and assign a carbon-impact rating to the highest impact ingredients (meat or animal protein) in the recipes.

To extract protein-rich ingredients from animal source in order to calculate the main carbon footprint of the recipe, we use an extra database listing the main protein sources and carbon impact. Source of data: [GreenEatz](https://www.greeneatz.com/foods-carbon-footprint.html)

In [354]:
#Loop for all recipes in folder
# data has following row structure
# RecipeName as Identifier - bool contains_meat - list of co2 ingredients - carbonFootprint - ingredients
data=[]
step=0

verbose = 1 #verbose outputs

for filename in os.listdir(SAMPLE_DATA_FOLDER):
    with open(SAMPLE_DATA_FOLDER+filename) as f:
        isTrue=False
        count_exceptions=0

        try:
            page = f.read()
            soup = BeautifulSoup(page, 'html.parser')
            
            #check webpage and extract ingredients if recognised as recipe
            is_recipe, website = find_website(soup)
            print('Recipe Analysed: '+soup.title.string)
            print('filename: '+filename)

            if is_recipe:

                #tags, ingredients = analyse_page(soup, website)
                tags, ingredients, rating, servings= analyse_page(soup, website)
                if ingredients:
                    
                    
                    has_meat, meat_ingredients, meat_ingredients_str = contains_meat_ingredients(ingredients, meat_products)

                    if has_meat:

                        #Extract meat ingredients and quantities in kg
                        ingredient_quant_kg = extract_meat(meat_ingredients_str)
                    else:
                        ingredient_quant_kg = 0

                    if verbose: 
                        #print('Recipe Analysed: '+soup.title.string)
                        
                        print('contains meat:'+str(has_meat))
                        print(meat_ingredients_str)
                        
                        print('{0} Ingredients: '.format(len(ingredients)))
                        #print(ingredients)

                        print('{0} tags:'.format(len(tags)))
                        print(tags)
                        print('number of servings= ', servings)
                        print('does this recipe contain meat? ', has_meat)
                        #print('ingredients = ',ingredients)
                        print('meat ingredients=', meat_ingredients_str)
                        print('ingredient_quantity (kg)= ',ingredient_quant_kg)
                              
                            
                    data.append([soup.title.string, has_meat, meat_ingredients, ingredient_quant_kg, tags])

            else:
                print('website not recognized')
                #print('not a recipe')
        except:
            count_exceptions=count_exceptions+1
            print('Exception')
    step=step+1
    
    if verbose: 
        print('-------------------------------------')
        
    if step>=100:
        break

column_labels=['Recipe Title', 'Has meat', 'Meat types', 'Meat quantity (kg)','Tags']#missing: 'Carbon footprint', 'Rating', 'Tags'
recipes_df = pd.DataFrame(data, columns = column_labels)

#save the data as csv for in depth analysis
#recipes_df.to_csv(DATA_FOLDER+'/recipes_data')

recipes_df

Recipe Analysed: 
	Chicken Breast Cutlets with Artichokes and Capers Recipe - Allrecipes.com

filename: 000a3333ad24828769b6be5a5e1bdb4a.html
contains meat:True
['\n                    2 pounds chicken breast tenderloins or strips', '\n                    2 cups chicken broth']
13 Ingredients: 
0 tags:
[]
number of servings=  6 servings
does this recipe contain meat?  True
meat ingredients= ['\n                    2 pounds chicken breast tenderloins or strips', '\n                    2 cups chicken broth']
ingredient_quantity (kg)=  [0.9070294784580498, 0]
-------------------------------------
Recipe Analysed: 
	Best Ever Popcorn Balls Recipe - Allrecipes.com

filename: 000b861ad15679c578d81884a87689ea.html
contains meat:False
[]
6 Ingredients: 
0 tags:
[]
number of servings=  20 popcorn balls
does this recipe contain meat?  False
meat ingredients= []
ingredient_quantity (kg)=  0
-------------------------------------
Recipe Analysed: Pumpkin Oatmeal Recipe : Aarti Sequeira : Recipes : 

Recipe Analysed: Skewer Recipes for the Grill - Recipes for Grilling Skewers and Kebobs - Delish.com
filename: 00b619d3e8be01dcb29693ece6b10045.html
website not recognized
-------------------------------------
Recipe Analysed: Oven-Fried Parmesan Chicken Strips Recipe | MyRecipes.com
filename: 00b61bfa2d072a7a213fbc7a7ecf65de.html
contains meat:True
['\n1/3 cup\n grated Parmesan cheese\n \n', '\n2 pounds\n chicken breast strips\n \n']
6 Ingredients: 
11 tags:
['Main Dishes', 'Snacks', 'Freezable', 'Kid-Friendly', 'Make-Ahead', 'Quick/Easy', '5 Ingredients or Less', 'Poultry', 'Low Calorie', 'Low Carbohydrate', 'Southern Living']
number of servings=  Makes 5 servings (serving size: 3 strips)
does this recipe contain meat?  True
meat ingredients= ['\n1/3 cup\n grated Parmesan cheese\n \n', '\n2 pounds\n chicken breast strips\n \n']
ingredient_quantity (kg)=  [0, 0.9070294784580498]
-------------------------------------
Recipe Analysed: 
	Authentic Korean Bulgogi Recipe - Allrecipes.com



Recipe Analysed: 
	Zucchini and Blue Cheese Side Recipe - Allrecipes.com

filename: 00c9dfdb9d7a1d56b66f59afad17f3ac.html
contains meat:True
['\n                    1/4 cup crumbled blue cheese']
5 Ingredients: 
0 tags:
[]
number of servings=  4 servings
does this recipe contain meat?  True
meat ingredients= ['\n                    1/4 cup crumbled blue cheese']
ingredient_quantity (kg)=  [0]
-------------------------------------
Recipe Analysed: Whole Wheat Molasses Bread Recipe
filename: 00cadb838dbd0d560d4701844ca2c8ad.html
website not recognized
-------------------------------------
Recipe Analysed: 
	Fresh Pineapple Upside Down Cake Recipe - Allrecipes.com

filename: 00cb9d45bab98391af2b7722e0d4980a.html
contains meat:True
['\n                    3 eggs']
10 Ingredients: 
0 tags:
[]
number of servings=  1 - 9 inch round cake
does this recipe contain meat?  True
meat ingredients= ['\n                    3 eggs']
ingredient_quantity (kg)=  [0.018000000000000002]
--------------------

Unnamed: 0,Recipe Title,Has meat,Meat types,Meat quantity (kg),Tags
0,Chicken Breast Cutlets with Artichokes and C...,True,"[chicken, chicken]","[0.9070294784580498, 0]",[]
1,Best Ever Popcorn Balls Recipe - Allrecipes....,False,[],0,[]
2,Pumpkin Oatmeal Recipe : Aarti Sequeira : Reci...,False,[],0,[]
3,Green Bean Casserole Recipe from Betty Croc...,False,[],0,[]
4,Orange Cream Cheese Frosting Recipe - Allrec...,True,[cheese],[0.08504847763225037],[]
5,Orange Curd Recipe : Ina Garten : Recipes : Fo...,True,[egg],[0.024],[]
6,Chocolate Chunk Cookies Recipe : Ina Garten : ...,True,[egg],[0.012],[]
7,Perfect Baked Potato Recipe - Allrecipes.com,True,[cheese],[0],[]
8,Pumpkin Oatmeal Recipe - Allrecipes.com,False,[],0,[]
9,Baked Asparagus with Balsamic Butter Sauce R...,False,[],0,[]
