# Who are the carnivores?

### Abstract

With increasingly dire climate change forecasts, concerned individuals are asking how they can minimize their carbon footprint. Recent research suggests that reducing one's consumption of meat, in particular beef, is one of the highest-impact actions an individual can take. To examine this topic, we will explore trends in meat consumption in the U.S. by analyzing the prevalence of meat in recipes viewed online. Specifically, we plan to extract the ingredients of each recipe, together with the time and location of clicks, from a recipe database. Using this information, we will explore the link between meat consumption and key factors such as time of year, rural versus urban location, average regional income, and historic events (e.g. the Paris Climate Agreement or a mad cow disease outbreak). Finally, we hope to relate this data directly to climate change by estimating a rating that reflects the carbon footprint of the meat in each recipe and the environmental impact of consumers' diets.


### Imports and libraries

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup

In [5]:
DATA_FOLDER = 'data'
SAMPLE_DATA_FOLDER = DATA_FOLDER + '/htmlSample'

## First step: data loading and cleaning

Goal: end up with a dataframe containing ingredients, clicks, and other features for each recipe.

Start with one HTML file, then scale up to 10-100 files, then the whole folder.
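
The scaling plan above could be sketched as a small helper that lists the HTML files in a folder and optionally caps how many are returned. The name `list_html_files` and its `limit` parameter are hypothetical, and the `.html` extension is an assumption based on the sample file name used later in this notebook.

```python
from pathlib import Path


def list_html_files(folder, limit=None):
    """Return the .html files in `folder`, sorted by name, at most `limit` of them."""
    files = sorted(Path(folder).glob('*.html'))
    # limit=None means "the whole folder"
    return files if limit is None else files[:limit]
```

Calling `list_html_files(SAMPLE_DATA_FOLDER, limit=1)` would give a single file to develop against; raising `limit` to 10 or 100, and finally to `None`, follows the plan above without changing any other code.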

In [None]:
Sample ingredient HTML item: 
    <li itemprop="ingredient" itemscope itemtype="http://data-vocabulary.org/RecipeIngredient">
                <span itemprop="amount">2 tablespoons</span>
                <span itemprop="name"> unsweetened cocoa</span>
                <span itemprop="preparation"> </span>
    </li>
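
As a rough sketch of how an item like this could be turned into a record, the snippet below parses the sample with BeautifulSoup and collects each `itemprop` span into a dict. The helper name `parse_ingredient` is hypothetical, and the code assumes every ingredient in the dataset follows the same `li`/`span` layout as this one sample.

```python
from bs4 import BeautifulSoup

# The sample item shown above, copied as a string.
SAMPLE_ITEM = '''
<li itemprop="ingredient" itemscope itemtype="http://data-vocabulary.org/RecipeIngredient">
    <span itemprop="amount">2 tablespoons</span>
    <span itemprop="name"> unsweetened cocoa</span>
    <span itemprop="preparation"> </span>
</li>
'''


def parse_ingredient(li):
    """Collect the itemprop spans of one ingredient <li> into a dict."""
    fields = {}
    # attrs={'itemprop': True} matches any span that carries an itemprop attribute
    for span in li.find_all('span', attrs={'itemprop': True}):
        fields[span['itemprop']] = span.get_text(strip=True)
    return fields


sample_soup = BeautifulSoup(SAMPLE_ITEM, 'html.parser')
item = sample_soup.find('li', attrs={'itemprop': 'ingredient'})
print(parse_ingredient(item))
```

One dict per ingredient maps naturally onto rows of the dataframe described in the goal.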

Below: code from [StackOverflow](https://stackoverflow.com/questions/10123929/python-requests-fetch-a-file-from-a-local-url) that's supposed to help with local files, but I couldn't get it to work yet.

In [6]:
import requests
import os, sys

if sys.version_info.major < 3:
    from urllib import url2pathname
else:
    from urllib.request import url2pathname

class LocalFileAdapter(requests.adapters.BaseAdapter):
    """Protocol Adapter to allow Requests to GET file:// URLs

    @todo: Properly handle non-empty hostname portions.
    """

    @staticmethod
    def _chkpath(method, path):
        """Return an HTTP status for the given filesystem path."""
        if method.lower() in ('put', 'delete'):
            return 501, "Not Implemented"  # TODO
        elif method.lower() not in ('get', 'head'):
            return 405, "Method Not Allowed"
        elif os.path.isdir(path):
            return 400, "Path Not A File"
        elif not os.path.isfile(path):
            return 404, "File Not Found"
        elif not os.access(path, os.R_OK):
            return 403, "Access Denied"
        else:
            return 200, "OK"

    def send(self, req, **kwargs):  # pylint: disable=unused-argument
        """Return the file specified by the given request

        @type req: C{PreparedRequest}
        @todo: Should I bother filling `response.headers` and processing
               If-Modified-Since and friends using `os.stat`?
        """
        path = os.path.normcase(os.path.normpath(url2pathname(req.path_url)))
        response = requests.Response()

        response.status_code, response.reason = self._chkpath(req.method, path)
        if response.status_code == 200 and req.method.lower() != 'head':
            try:
                response.raw = open(path, 'rb')
            except (OSError, IOError) as err:
                response.status_code = 500
                response.reason = str(err)

        if isinstance(req.url, bytes):
            response.url = req.url.decode('utf-8')
        else:
            response.url = req.url

        response.request = req
        response.connection = self

        return response

    def close(self):
        pass

In [14]:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
r = requests_session.get('file:///home/chloe/ADA/Lovelace-Project/htmlSample/ff6da2b8d426c56ae77beda595bdcfea.html')
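
Since the adapter is reported as not working yet, one simpler alternative is to skip `requests` for local files and read them with the standard library. This is only a minimal sketch: `read_local_html` is a hypothetical helper, and the UTF-8 encoding (with undecodable bytes replaced) is an assumption about the dataset.

```python
from pathlib import Path


def read_local_html(path, encoding='utf-8'):
    """Return the contents of a local HTML file as text."""
    # errors='replace' avoids a crash on bytes that are not valid in `encoding`
    return Path(path).read_text(encoding=encoding, errors='replace')
```

The returned string can be handed to `BeautifulSoup(..., 'html.parser')` in the same way as the `r.text` of a web response.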

Another approach: fetch the page from the web instead.

In [16]:
r = requests.get('https://www.allrecipes.com/recipe/234502/vegan-waffles/')

In [18]:
print('Response status code: {0}\n'.format(r.status_code))
print('Response headers: {0}\n'.format(r.headers))
print('Response body: {0}'.format(r.text))

Response status code: 200

Response headers: {'Cache-Control': 'private', 'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Set-Cookie': 'FirstImpression=False; domain=.allrecipes.com; expires=Sat, 17-Nov-2018 15:35:19 GMT; path=/; HttpOnly, ARSiteUser=1-a38cfde1-c133-46a0-9019-dafbbbd17d59; domain=.allrecipes.com; expires=Sat, 16-Nov-2019 14:11:12 GMT; path=/, ARCompressedSession=CAcAAB+LCAAAAAAABAB9VdtyozgQ/Re/MilDMMZO1T6AMQYMmHAxhjcBAsvmZolrpubflziZXGantkovOt0656i7BT9nDspK0LQYzp5mo95vw7QuEAKrktCoB21piauL6zAHumJ0SQRbml9v5OPL7XAqzry+pAJsxZqyYB8PdHtRlEztjdE3AQ4Bv3+kmN5d5tWQh6eQlYKjOqDU9xrKO+NmVNN5wLZHam1whu+Xy06D48Bcz/siDFakbRje0oM50x5d7BySZ5nsF7WiiG442Lpk2gbr2f4QXLxezthIFXj34gN+UBi7YFZOtfM4j8YXzdbU89bjFMvNII4qNlc3bu+A4GSa1GleuoTKrEju1klr7Dyxv+Hn7UYMlsVxJbN5Jz2SWyJoqeN7wmNvJPXxRRpV0gbNjRXjWtfynXSuqTChkH471NcVUhep0krhy/VC8qkYTTd/Hod+sDJ+HsAlum4ic8RsxorZFfJUcuLHR8NQpOMqz/bBjtYulq0fH4nMXR/d6JCEhcAvmMNheyz1FO

Below, we use BeautifulSoup to extract features from the HTML. I took a sample page from the web because I couldn't easily access local files.

In [25]:
URL = 'https://www.allrecipes.com/recipe/234502/vegan-waffles/'

In [26]:
r = requests.get(URL)
page_body = r.text

In [28]:
# This is how we get a beautiful soup of HTML for our recipe web page!
soup = BeautifulSoup(page_body, 'html.parser')

In [30]:
#And here is how we read the title!
soup.title.string

'Vegan Waffles Recipe - Allrecipes.com'

In [44]:
# Now let's try to extract the ingredients. This vegan recipe won't contain meat!

all_ingredients = soup.find_all('ingredient')
print('The webpage contains {0} ingredients...'.format(len(all_ingredients)))



The webpage contains 0 ingredients...


In [43]:
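# Okay, that didn't work; let's try something else...

# Example from the tutorial:
# publications_wrappers = soup.find_all('li', class_='entry')

# We have this: <li itemprop="ingredient" itemscope itemtype="http://data-vocabulary.org/RecipeIngredient">
# Only class_ takes a trailing underscore (class is a Python keyword); other
# attributes are matched by their plain name. The live page may use different
# markup than the sample item, so this can still come back empty.
ingredient_wrappers = soup.find_all(attrs={'itemprop': 'ingredient'})
# ingredient_wrappers = soup.find_all('li', itemtype_="ingredient")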

print('Total number of ingredients: {0}'.format(len(ingredient_wrappers)))

Total number of ingredients: 0
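
To see how the different query styles behave, here is a sketch on a tiny hand-written snippet modelled on the sample ingredient item. The snippet and the variable names are made up for illustration; whether the live Allrecipes page uses the same `itemprop="ingredient"` markup is not confirmed here and should be checked in the page source.

```python
from bs4 import BeautifulSoup

demo_html = '<ul><li itemprop="ingredient"><span itemprop="name">unsweetened cocoa</span></li></ul>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# A bare string is a tag name, so this looks for <ingredient> elements.
by_tag_name = demo_soup.find_all('ingredient')

# Only class_ is special-cased; this looks for an attribute literally named "itemprop_".
by_underscore = demo_soup.find_all(itemprop_='ingredient')

# Passing the attribute through attrs matches itemprop="ingredient".
by_attribute = demo_soup.find_all(attrs={'itemprop': 'ingredient'})

print(len(by_tag_name), len(by_underscore), len(by_attribute))
```

On this snippet only the last query finds the item. If it still returns nothing on a real page, the page most likely uses a different attribute value or structure than the sample.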
