# Web Scraping Example in Python

In this notebook we'll scrape food pantry data (name, address, hours, contact info, and so on) from the website [www.foodpantries.org](https://www.foodpantries.org). This example is based on a scrape done for [Hack for LA's](https://www.hackforla.org) Food Oasis project.

**Quick disclaimer**: I'm not an expert in Python or web scraping, so there are likely cleaner, more efficient ways of doing this example. Nonetheless, I hope that it will be useful to some readers.

## Things You'll Need to Run This Example
We'll use the following Python libraries:
* [Requests](https://requests.readthedocs.io/en/master/)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Selenium](https://selenium-python.readthedocs.io/)

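You can install these packages using [pip](https://pypi.org/project/pip/). If you're running the code from a Jupyter Notebook, make sure that you have these packages installed in the environment used by the notebook. The code below also tells Beautiful Soup to use the `lxml` parser, so the `lxml` package needs to be installed as well.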

In addition to the packages noted above, this example uses Google Chrome and the Google Chrome Webdriver. The webdriver can be downloaded here: [https://chromedriver.chromium.org/](https://chromedriver.chromium.org/). 

In [1]:
import requests
from bs4 import BeautifulSoup 
from selenium import webdriver

import json
import collections.abc
from datetime import date
import csv

## Browse Website and Develop a Plan

Before we start coding, we have to get an idea of what we want to get from the website. Some questions are:
* What data do we want?
* How do we navigate through the website to get the data we want?
* How is data structured on the website?
* How is data structured in the website's source code?

## Saving and Parsing Webpage Data

We'll use the `requests.get()` function to get the data located at [https://www.foodpantries.org/st/california](https://www.foodpantries.org/st/california). The function takes in the url of a page (as a string) and returns a **Response** object. This **Response** object contains the data at that url.

In [2]:
url = "https://www.foodpantries.org/st/california"

# requests.get :: str -> requests.models.Response
page = requests.get(url)

You can get the webpage's text by accessing the response's `text` attribute.

In [3]:
# print(page.text)

We'll now use `BeautifulSoup` to parse the page. We pass the document `page.text` to the **BeautifulSoup** constructor to create a **BeautifulSoup** object. As its [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects) notes, "BeautifulSoup transforms a complex HTML document into a complex tree of Python objects."

In [4]:
soup = BeautifulSoup(page.text, "lxml")

# print(soup)

In [5]:
# Find the table that holds the city links (attrs is a dict: attribute -> value)
links = soup.find("table", {"class": "table table-striped"})

In [6]:
cities = []

for link in links.find_all("a"):
    cities.append([link.text, link.get("href")])

We can also write this as a list comprehension.

In [7]:
cities = [[link.text, link.get("href")] for link in links.find_all("a")]
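To see what the table lookup and link extraction do without touching the network, here is a minimal sketch on a made-up HTML fragment. The fragment, the `example.org` URLs, and the `sample_*` names are assumptions for illustration. It also uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the table of cities on the state page.
sample_html = """
<table class="table table-striped">
  <tr><td><a href="https://example.org/ci/ca-alhambra">Alhambra</a></td></tr>
  <tr><td><a href="https://example.org/ci/ca-burbank">Burbank</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Look the table up by its exact class attribute value.
sample_table = sample_soup.find("table", {"class": "table table-striped"})

# Collect [link text, link target] pairs, as in the cell above.
sample_cities = [[a.text, a.get("href")] for a in sample_table.find_all("a")]
print(sample_cities)
```

Each element of the result is a two-item list, which is why later cells use `city[0]` for the name and `city[1]` for the URL.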

We only want cities in Los Angeles County, so let's scrape a list of cities in Los Angeles County from Wikipedia. We're assuming that this list is complete, but this assumption, as with any assumption, should be checked.

In [8]:
url_la_cities = "https://en.wikipedia.org/wiki/List_of_cities_in_Los_Angeles_County,_California"
r = requests.get(url_la_cities)
la_soup = BeautifulSoup(r.text, "lxml")
la_tbl = la_soup.find("table", {"class": "wikitable sortable"})
la_city_links = la_tbl.find_all("a")

la_city_list = []

for city in la_city_links[1:]:
    la_city_list.append(city.text)

After we scrape the list, we should compare the names in `cities` against `la_city_list` to find entries in `cities` that are missing from `la_city_list`. In this case, there are two kinds of mismatch:

* The entry for Inglewood is misspelled.
* Several neighborhoods in LA County are listed as cities when in fact they are not cities.

Knowing this requires some domain knowledge (in this case, knowledge of local geography), which is extremely important in data analysis. We will append these cases to `la_city_list` and then create a new list called `la_cities`, which contains only the LA County links from `cities`.
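One way to do the comparison is a simple membership check. The sketch below uses small made-up lists in place of the scraped `cities` and `la_city_list`; the `sample_*` names and the `example.org` URLs are hypothetical.

```python
# Hypothetical sample data; in the notebook these come from the two scrapes.
sample_cities = [
    ["Burbank", "https://example.org/ci/ca-burbank"],
    ["Ingelwood", "https://example.org/ci/ca-ingelwood"],
    ["Fresno", "https://example.org/ci/ca-fresno"],
    ["Van Nuys", "https://example.org/ci/ca-van_nuys"],
]
sample_la_city_list = ["Burbank", "Inglewood", "Los Angeles"]

# A set makes the membership test fast and ignores duplicates.
known = set(sample_la_city_list)

# Names on the pantry site that have no exact match in the county list.
unmatched = sorted(name for name, _ in sample_cities if name not in known)
print(unmatched)  # ['Fresno', 'Ingelwood', 'Van Nuys']
```

The result still needs a manual pass: it mixes cities that really are outside LA County (Fresno here) with misspellings and neighborhoods that do belong in the list.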

In [9]:
la_city_list = la_city_list + ["Ingelwood", "North Hollywood", "Studio City", "Sun Valley", "Van Nuys", "Woodland Hills"]

In [10]:
la_cities = [city for city in cities if city[0] in la_city_list]

## Automating Scraping of a List of URLs

So now we have links to all the LA County city pages for California food pantries. The next step is to visit all these city pages and grab the information on the food pantries included in each of these pages.

Instead of manually going through each of these pages using Requests and Beautiful Soup, we can have Selenium automate this process for us.

We will create a WebDriver object. Before we do this, we create a `ChromeOptions` object so that we can configure the browser. Here we add an argument for the browser's window size and the *headless* argument, which runs Chrome without its UI (that is, we won't see the browser window). If you want to watch the browser go through each of the links, you can remove the line `options.add_argument('headless')`.

In [11]:
options = webdriver.ChromeOptions()
# Chrome expects the window size as width,height
options.add_argument('window-size=800,841')
options.add_argument('headless')
driver = webdriver.Chrome(options=options)

In [12]:
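pantries = []

# The implicit wait applies to the whole session, so set it once
driver.implicitly_wait(10)

for city in la_cities:
    driver.get(city[1])  # city is [name, url]
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, "lxml")
    # Structured data is embedded in JSON-LD <script> blocks
    entries = soup.find_all("script",{"type":"application/ld+json"})
    for entry in entries:
        pantries.append(entry.text.replace("\n", " ").replace("\t", " "))

# Close the browser now that scraping is done
driver.quit()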
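To make the inner part of that loop concrete, here is a minimal offline sketch that pulls a JSON-LD block out of a made-up page. The page source, pantry name, and phone number are assumptions for illustration. The sketch uses the built-in `html.parser` and reads the script body through `.string`, a small departure from the `.text` call in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; the pantry details are made up.
sample_page = """
<html><body>
<script type="application/ld+json">
{
	"@type": "LocalBusiness",
	"name": "Example Pantry",
	"telephone": "(555) 555-0100"
}
</script>
<script type="text/javascript">console.log("ignored");</script>
</body></html>
"""

sample_soup = BeautifulSoup(sample_page, "html.parser")

# Only <script> tags whose type is application/ld+json are selected.
blocks = sample_soup.find_all("script", {"type": "application/ld+json"})

# Same cleanup as in the loop: newlines and tabs become spaces.
raw = [b.string.replace("\n", " ").replace("\t", " ") for b in blocks]
print(len(raw))
print(raw[0])
```

The ordinary JavaScript block is skipped, so `raw` holds one cleaned string that is ready to be handed to a JSON parser.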

In [13]:
pantries[81]

'             {             "@context": "https://schema.org",             "@type": "LocalBusiness",             "address": {             "@type":"PostalAddress",             "streetAddress":"4390 Colfax Avenue",             "addressLocality":"Studio City",             "addressRegion":"CA",             "postalCode": "91604"             },             "name": "North Hollywood Interfaith Food Pantry"             ,"image": "https://www.foodpantries.org/gallery/no_photo_1.jpg"             ,"description": "Hours:Monday - Friday9:00am - 1:00pmPLEASE NOTE THAT THE ENTRANCE IS ON THE TROOST SIDE OF THE PARKING LOT ONLYFor more information, please call...."             ,"telephone":"(818) 980-1657"             }         '

In [14]:
pantry_data = []
pantry_err = []

for pantry in pantries:
    try:
        pantry_data.append(json.loads(pantry))
    except json.decoder.JSONDecodeError:
        pantry_err.append(pantry)

In [15]:
print("{} entries parsed. \n{} entries not parsed.".format(len(pantry_data), len(pantry_err)))

282 entries parsed. 
1 entries not parsed.
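When an entry fails to parse, the exception itself says what went wrong and where. The following sketch, using two made-up strings, keeps that information next to the failed entry so it can be inspected or repaired by hand. The `raw_entries`, `parsed`, and `failed` names are hypothetical.

```python
import json

# Hypothetical raw strings: the first is valid JSON, the second contains
# unescaped double quotes inside a string value.
raw_entries = [
    '{"name": "Example Pantry", "telephone": "(555) 555-0100"}',
    '{"name": "Bad "Quoted" Pantry"}',
]

parsed = []
failed = []

for raw in raw_entries:
    try:
        parsed.append(json.loads(raw))
    except json.JSONDecodeError as err:
        # err.msg describes the problem; err.pos is the character offset
        failed.append((raw, err.msg, err.pos))

print("{} parsed, {} failed".format(len(parsed), len(failed)))
for raw, msg, pos in failed:
    print("{} at position {}: {}".format(msg, pos, raw))
```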


In [16]:
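def flatten(d, parent_key='', sep='_'):
    """Flatten nested dictionaries, joining nested keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # collections.MutableMapping was removed in Python 3.10; use collections.abc
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # Assumes every list element is itself a dictionary
            for sub_v in v:
                items.extend(flatten(sub_v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)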

In [17]:
flat = []

for pantry in pantry_data:
    flat.append(flatten(pantry))
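To show what flattening does to a single record, here is a self-contained sketch applied to a shortened version of the entry printed earlier. `flatten_sketch` is a simplified stand-in written for this illustration: it handles only nested dictionaries, not lists.

```python
from collections.abc import MutableMapping


def flatten_sketch(d, parent_key="", sep="_"):
    """Simplified illustration: flatten nested dicts, joining keys with `sep`."""
    flat_items = {}
    for key, value in d.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, MutableMapping):
            # Recurse, carrying the parent key along as a prefix
            flat_items.update(flatten_sketch(value, new_key, sep=sep))
        else:
            flat_items[new_key] = value
    return flat_items


sample_entry = {
    "@type": "LocalBusiness",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Studio City",
        "postalCode": "91604",
    },
    "name": "North Hollywood Interfaith Food Pantry",
}

flat_entry = flatten_sketch(sample_entry)
for key, value in flat_entry.items():
    print(key, "->", value)
```

The nested `address` dictionary becomes the top-level keys `address_@type`, `address_addressLocality`, and `address_postalCode`, which is the shape a CSV row needs.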

If we take a look at `flat`, we can notice a couple of things. Some entries do not have a "name" key. These are not food pantry entries, as they contain no information on a food pantry. There are also some entries in which the value associated with the "name" key is either "https://www.foodpantries.org" or "FoodPantries.org". These are not food pantry entries either.

We create a new corrected list of dictionaries that does not include these entries.

In [18]:
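corrected = []

# Names that identify the site itself rather than a pantry
non_pantry_names = ("https://www.foodpantries.org/", "FoodPantries.org")

for entry in flat:
    # Skip entries with no name or with a site-level name
    if "name" not in entry or entry["name"] in non_pantry_names:
        continue
    corrected.append(entry)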
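The same filter can also be written as a list comprehension, in the spirit of the earlier `cities` example. The sketch below runs on three made-up entries; the `sample_*` names and the `example.org` URL are hypothetical, while the two excluded names are the ones checked in the cell above.

```python
# Names that describe the website itself rather than a pantry.
NON_PANTRY_NAMES = {"https://www.foodpantries.org/", "FoodPantries.org"}

sample_flat = [
    {"@type": "WebSite", "url": "https://example.org"},     # no "name" key
    {"@type": "Organization", "name": "FoodPantries.org"},  # site-level entry
    {"@type": "LocalBusiness", "name": "Example Pantry"},   # kept
]

# Keep an entry only if it has a name and that name is not a site-level one.
sample_corrected = [
    entry for entry in sample_flat
    if "name" in entry and entry["name"] not in NON_PANTRY_NAMES
]
print(sample_corrected)
```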

## Saving the List of Dictionaries as a CSV File

We use `DictWriter` from the `csv` library to save our scraped data. 

In [19]:
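# Collect every key that appears in any entry; sort for a stable column order
keys = sorted(set().union(*(d.keys() for d in corrected)))

# Use a new name so the imported `date` class is not overwritten
today = date.today()
filename = "load_food_pantries_" + str(today) + ".csv"

with open(filename, 'w', encoding="utf-8", newline="") as output_file:
    dict_writer = csv.DictWriter(output_file, keys)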
    dict_writer.writeheader()
    dict_writer.writerows(corrected)
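Because the entries do not all share the same keys, it helps to see how `DictWriter` treats missing ones. The sketch below writes two made-up rows to an in-memory buffer instead of a file. Rows that lack a column get the `restval` value, which defaults to an empty string. The row contents and the `sample_rows`, `buffer`, and `lines` names are assumptions for illustration.

```python
import csv
import io

# Hypothetical rows with different sets of keys.
sample_rows = [
    {"name": "Example Pantry A", "telephone": "(555) 555-0100"},
    {"name": "Example Pantry B", "address_postalCode": "90001"},
]

# Union of all keys, sorted so the column order is the same on every run.
fieldnames = sorted(set().union(*(row.keys() for row in sample_rows)))

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(sample_rows)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

The header row lists all three columns, and each data row leaves the column it does not have empty.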