# Web Scraping Example in Python

In this notebook we'll scrape food pantry data (e.g. name, address, hours, contact info) from the website [www.foodpantries.org](https://www.foodpantries.org). This example is based on a scrape done for [Hack for LA's](https://www.hackforla.org) Food Oasis project.

**Quick disclaimer**: I'm not an expert in Python or web scraping, so there are likely cleaner, more efficient ways of doing this example. Nonetheless, I hope it's useful for some folks.

## Things You'll Need to Run This Example
We'll use the following packages:
* [Requests](https://requests.readthedocs.io/en/master/)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Selenium](https://selenium-python.readthedocs.io/)
* [lxml](https://lxml.de/) (the parser we pass to Beautiful Soup below)

You can install these packages using [pip](https://pypi.org/project/pip/). If you're running the code from a Jupyter Notebook, make sure that these packages are installed in the environment used by the notebook.

In addition to the packages noted above, this example uses Google Chrome and ChromeDriver, the WebDriver for Chrome. ChromeDriver can be downloaded here: [https://chromedriver.chromium.org/](https://chromedriver.chromium.org/). Pick the ChromeDriver version that matches your installed version of Chrome.
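
If you're not sure whether the notebook's environment has everything installed, a small check like the one below can help. This is only a sketch: the helper name `missing_packages` is made up for this example, and the list uses import names (Beautiful Soup is imported as `bs4`).

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names, not pip names: Beautiful Soup installs as "bs4".
required = ["requests", "bs4", "selenium", "lxml"]

# sys.executable shows which Python (and so which environment) is running.
print("Python executable:", sys.executable)
print("Missing packages:", missing_packages(required))
```

An empty list means all four packages can be imported from the environment the notebook is using.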

In [44]:
import requests
from bs4 import BeautifulSoup 
from selenium import webdriver

import json
import collections
from datetime import date
import csv

## Browse Website and Develop a Plan

Before we start coding, we need a clear idea of what we want from the website. Some questions to ask:
* What data do we want?
* How do we navigate through the website to get the data we want?
* How is data structured on the website?
* How is data structured in the website's source code?

## Saving and Parsing Webpage Data

We'll use the `requests.get()` function to get the data located at [https://www.foodpantries.org/st/california](https://www.foodpantries.org/st/california). The function takes the URL of a page (as a string) and returns a **Response** object. This **Response** object contains the data served at that URL, including the status code, the headers, and the page body.

In [5]:
url = "https://www.foodpantries.org/st/california"

# requests.get :: str -> requests.models.Response
page = requests.get(url)
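
`requests.get()` returns a **Response** even when the server answers with an error page, so it can be worth checking the status before parsing. Here is a minimal sketch of one way to do that. The function name `fetch_html` and the 10-second timeout are my own choices, not something the original scrape used.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an exception if the request failed."""
    # timeout (in seconds) stops the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.foodpantries.org/st/california")` should give the same string as `page.text`, as long as the request succeeds.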

You can get the webpage's HTML as a string by accessing the `text` attribute.

In [27]:
print(page.text)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>California Food Pantries | Food Banks, Food Pantries, Food Assistance in California</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="keywords" content="California food pantries" />
    <meta name="description" content="California Food Pantries. Find the list of the latest Food Pantries added to our California page. Click on the city pages to see the list of food pantries by city." />
	<meta name="google-site-verification" content="D18XaR9vxr0_KGV2URyZ-jB_UWxRUTGVze2Gi_ySMfE" />
    
    <!--[if lt IE 9]>
    <script src="https://www.foodpantries.org/templates/inspira/js/html5shiv.js"></script>
    <link rel="stylesheet" href="https://www.foodpantries.org/templates/inspira/css/ie.css" type="text/css">
    <![endif]-->

    <!-- CSS Files
    <link rel="stylesheet" href="https://www.foodpantries.org/templates/inspira/css/main.css" type="text/css"

We'll now use Beautiful Soup to parse the page. We pass the document `page.text` to the **BeautifulSoup** constructor to create a **BeautifulSoup** object. The second argument, `"lxml"`, selects the parser. As its [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects) notes, "Beautiful Soup transforms a complex HTML document into a complex tree of Python objects."

In [6]:
soup = BeautifulSoup(page.text, "lxml")
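
To get a feel for what that "tree of Python objects" looks like, here is a small sketch on a made-up HTML snippet (the HTML is hypothetical, not taken from foodpantries.org). It uses Python's built-in `"html.parser"` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration.
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

# "html.parser" is Python's built-in parser, so this sketch doesn't need lxml.
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object.
title = soup.title.string

# find() returns the first matching tag; get_text() joins all text inside it.
intro = soup.find("p", {"class": "intro"}).get_text()

# find_all() returns every matching tag; get() reads an attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]

print(title)
print(intro)
print(hrefs)
```

The same `find()`, `find_all()` and `get()` calls are what we use on the real page.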

In [7]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>California Food Pantries | Food Banks, Food Pantries, Food Assistance in California</title>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="California food pantries" name="keywords"/>
<meta content="California Food Pantries. Find the list of the latest Food Pantries added to our California page. Click on the city pages to see the list of food pantries by city." name="description"/>
<meta content="D18XaR9vxr0_KGV2URyZ-jB_UWxRUTGVze2Gi_ySMfE" name="google-site-verification"/>
<!--[if lt IE 9]>
    <script src="https://www.foodpantries.org/templates/inspira/js/html5shiv.js"></script>
    <link rel="stylesheet" href="https://www.foodpantries.org/templates/inspira/css/ie.css" type="text/css">
    <![endif]-->
<!-- CSS Files
<link href="https://www.foodpantries.org/templates/inspira/css/main.css" id="main-css" rel="stylesheet" type="text/css"/>
<link href="https://www.foodpantrie

In [8]:
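# The city links sit in the table with class "table table-striped".
# attrs must be a dict (key: value), not a set.
links = soup.find("table", {"class": "table table-striped"})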

In [9]:
cities = []

for link in links.find_all("a"):
    cities.append([link.text, link.get("href")])

We can also write this as a list comprehension.

In [10]:
cities = [[link.text, link.get("href")] for link in links.find_all("a")]
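
One thing to watch for: `link.get("href")` returns `None` for an `<a>` tag without an `href`, and links can be relative paths. I haven't checked whether either happens on this site, so treat the following as a defensive sketch. The helper name `normalize_links` and the sample rows are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.foodpantries.org/st/california"


def normalize_links(pairs, base_url=BASE_URL):
    """Drop entries without an href and make the remaining URLs absolute."""
    cleaned = []
    for text, href in pairs:
        if not href:
            continue  # skip <a> tags that have no href
        # urljoin leaves absolute URLs alone and resolves relative ones.
        cleaned.append([text.strip(), urljoin(base_url, href)])
    return cleaned


# Hypothetical rows in the same [text, href] shape as `cities`.
sample = [
    ["Alameda ", "https://www.foodpantries.org/ci/ca-alameda"],
    ["Example City", "/ci/ca-example"],
    ["No link", None],
]

for name, url in normalize_links(sample):
    print(name, url)
```

In the notebook this would be used as `cities = normalize_links(cities)`.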

## Automating Scraping of a List of URLs

Now we have links to all the city pages for California food pantries. The next step is to visit each city page and grab the information on the food pantries it lists.

Instead of manually going through each of these pages with Requests and Beautiful Soup, we can have Selenium automate the browsing for us.

We will create a WebDriver object. Before we do this, we create an Options object so that we can set the options we want for our browser. Here we add an argument for the browser's window size and the argument *headless*. Headless means that Chrome runs without its UI, i.e. we won't see the browser. If you want to watch the browser go through each of the links, you can remove the line `options.add_argument('headless')`.

In [11]:
options = webdriver.ChromeOptions()
options.add_argument('window-size=800,841')  # Chrome expects width,height separated by a comma
options.add_argument('headless')
driver = webdriver.Chrome(options=options)

In [13]:
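pantries = []

# Wait up to 10 seconds when locating elements; setting this once is enough.
driver.implicitly_wait(10)

for city in cities:
    driver.get(city[1])  # each city is [name, url]
    # Grab the rendered HTML of the page body and parse it.
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, "lxml")
    # The structured data we want sits in JSON-LD <script> blocks.
    entries = soup.find_all("script", {"type": "application/ld+json"})
    for entry in entries:
        # Replace newlines and tabs so the text parses as JSON later.
        pantries.append(entry.text.replace("\n", " ").replace("\t", " "))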
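
With this many pages, a single page that fails to load will stop the whole loop, and hitting the site as fast as possible isn't very polite. Below is a sketch of a more forgiving loop. The helper `scrape_pages`, the `fetch` callable and the one-second delay are all my own assumptions. In the notebook, `fetch` could be a small function that calls `driver.get(url)` and returns `driver.page_source`.

```python
import time


def scrape_pages(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests.

    Returns (results, failures): results maps url -> fetched value,
    failures is a list of (url, error description) pairs.
    """
    results, failures = {}, []
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)  # pause before every request but the first
        try:
            results[url] = fetch(url)
        except Exception as err:  # keep going; record what went wrong
            failures.append((url, repr(err)))
    return results, failures


# Hypothetical stand-in for a real page fetch, used only for this demo.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("could not load page")
    return "<html>" + url + "</html>"


results, failures = scrape_pages(["page-1", "bad-page", "page-2"],
                                 fake_fetch, delay_seconds=0)
print(len(results), "pages fetched;", len(failures), "failed")
```

When the loop is done, calling `driver.quit()` closes the headless browser so it doesn't keep running in the background.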

In [73]:
pantry_data = []
pantry_err = []

for pantry in pantries:
    try:
        pantry_data.append(json.loads(pantry))
    except json.decoder.JSONDecodeError:
        pantry_err.append(pantry)

In [74]:
print("{} entries parsed. \n{} entries not parsed.".format(len(pantry_data), len(pantry_err)))

1841 entries parsed. 
4 entries not parsed.
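
To see why an entry failed to parse, we can keep the details from the exception. `json.JSONDecodeError` carries a message (`msg`) and a character position (`pos`). The sketch below also passes `strict=False` to `json.loads()`, which allows literal control characters such as tabs inside strings. The helper name `parse_entries` and the sample strings are made up, and I haven't checked what is wrong with the 4 entries above.

```python
import json


def parse_entries(raw_entries):
    """Split raw JSON strings into parsed objects and failure records."""
    parsed, failed = [], []
    for raw in raw_entries:
        try:
            # strict=False lets literal control characters (tabs, newlines)
            # appear inside JSON strings instead of raising an error.
            parsed.append(json.loads(raw, strict=False))
        except json.JSONDecodeError as err:
            failed.append({"reason": err.msg, "position": err.pos, "raw": raw})
    return parsed, failed


# Hypothetical sample input.
samples = [
    '{"name": "Example Pantry"}',
    '{"name": "Tab\there"}',      # literal tab inside the string
    '{"name": "Broken" "x": 1}',  # missing comma, so this one fails
]

parsed, failed = parse_entries(samples)
for item in failed:
    print(item["reason"], "at character", item["position"])
```

Looking at `raw` around the reported position usually makes the problem obvious.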


In [26]:
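from collections.abc import MutableMapping

def flatten(d, parent_key='', sep='_'):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # collections.MutableMapping was removed in Python 3.10;
        # the class now lives in collections.abc.
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # Assumes list items are dicts; later items overwrite earlier ones.
            for sub_v in v:
                items.extend(flatten(sub_v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)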
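
Here is a small, self-contained example of what flattening does. It repeats the function (using `collections.abc`) so it can run on its own. The first input is trimmed down from the kind of entry we scraped. The second input is made up to show a caveat: when a list holds several dicts with the same keys, only the last one survives.

```python
from collections.abc import MutableMapping


def flatten(d, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            for sub_v in v:
                items.extend(flatten(sub_v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


nested = {
    "@type": "LocalBusiness",
    "name": "Alameda Food Bank",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Alameda",
        "postalCode": "94501",
    },
}
flat_entry = flatten(nested)
print(flat_entry)
# {'@type': 'LocalBusiness', 'name': 'Alameda Food Bank',
#  'address_@type': 'PostalAddress', 'address_addressLocality': 'Alameda',
#  'address_postalCode': '94501'}

# Hypothetical input: two dicts in a list share the key "email".
with_list = flatten({"contactPoint": [{"email": "a"}, {"email": "b"}]})
print(with_list)
# {'contactPoint_email': 'b'}
```

Nested keys are joined with an underscore, which is where column names like `address_postalCode` come from.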

In [27]:
flat = []

for pantry in pantry_data:
    flat.append(flatten(pantry))

In [40]:
today = date.today()  # a separate name avoids shadowing the imported date class
filename = "foodpantries_scrape_" + str(today) + ".csv"

In [41]:
print(filename)

foodpantries_scrape_2020-02-07.csv


In [57]:
# These names appear to describe the site itself rather than a pantry.
site_names = ("https://www.foodpantries.org/", "FoodPantries.org")

# Keep only entries that have a name and are not site-level entries.
corrected = [
    entry for entry in flat
    if "name" in entry and entry["name"] not in site_names
]

In [58]:
corrected

[{'@context': 'https://schema.org',
  '@type': 'LocalBusiness',
  'address_@type': 'PostalAddress',
  'address_streetAddress': '1900 Thau Way',
  'address_addressLocality': 'Alameda',
  'address_addressRegion': 'CA',
  'address_postalCode': '94501',
  'name': 'Alameda Food Bank',
  'image': 'https://www.foodpantries.org/gallery/925_alameda-food-bank_fca.png',
  'description': 'This location offers a Monthly Food Program (if eligible you may attend once per month). To be eligible you must be a resident of the city of Alameda and the household income must be at or below the current requirements. You will be required to bring a picture ID and current proof of residency such as a utility bill , lease or other such documents, each time you visit the program. Hours:Monday , Wednesday and Friday 1:00pm - 4:00pm (except the Third Wednesday of the month)Tuesday 4:30pm - 6:30pmThe second ,',
  'telephone': '(510) 523-5850'},
 {'@context': 'https://schema.org',
  '@type': 'LocalBusiness',
  'addr

In [61]:
# DictWriter raises ValueError if a row contains a key that is missing from this list.
keys = ["@context", "@type", "url", "sameAs", "contactPoint", "alternateName",
        "name", "itemListElement", "address", "image", "description", "telephone",
        "contactPoint_@type", "contactPoint_contactType", "contactPoint_email",
        "contactPoint_url", "itemListElement_item_name", "itemListElement_position",
        "itemListElement_@type", "itemListElement_item_@id", "address_addressLocality",
        "address_streetAddress", "address_addressRegion", "address_postalCode",
        "address_@type"]

today = date.today()  # a separate name avoids shadowing the imported date class
filename = "foodpantries_scrape_" + str(today) + ".csv"

with open(filename, 'w', encoding="utf-8", newline="") as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(corrected)
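
Typing out the column names by hand works, but it breaks if a later scrape returns a key that isn't in the list. One alternative is to collect the keys from the data itself. This is a sketch under my own assumptions: the helper `collect_fieldnames` is made up, the second sample row is hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def collect_fieldnames(rows):
    """Return every key seen in rows, in first-seen order."""
    fieldnames = {}
    for row in rows:
        for key in row:
            fieldnames.setdefault(key, None)  # dicts keep insertion order
    return list(fieldnames)


# Sample rows; the second one is hypothetical.
rows = [
    {"name": "Alameda Food Bank", "telephone": "(510) 523-5850"},
    {"name": "Example Pantry", "address_postalCode": "94501"},
]
fieldnames = collect_fieldnames(rows)

buffer = io.StringIO()  # stands in for the opened CSV file
# restval is the value written when a row lacks one of the columns.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In the notebook, `keys = collect_fieldnames(corrected)` would replace the hand-written list, at the cost of the column order depending on the data.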