# Week 2. Web scraping

## This week's objectives

1. Understand how to scrape web pages and other data where an API doesn't exist
2. Introduce the `BeautifulSoup` library
3. Learn how to parse unstructured text data
4. Learn how to handle errors ("exceptions") gracefully
5. More practice with `pandas`, `geopandas`, and plotting

APIs make it relatively simple to get data from the web. But sometimes an API doesn't exist, because setting one up and maintaining it takes effort on the part of the agency.

In these cases, we can still obtain data from the web. But rather than dropping it directly into a (geo)pandas `DataFrame`, we'll need to do more work to understand the structure of the webpage, and to clean and process the results. 

## Example: Land use permit data
Often, cities make their building and land use permit data available for download and/or accessible through an API. But these datasets are typically incomplete: they provide the subset of fields that are most relevant to most users (e.g., permit approval date and number of units), and exclude more esoteric fields. Parking, sadly, is one of the fields that is often excluded.

For a recent project, I looked at the impacts of TOD plans in Seattle and San Francisco on development outcomes, including parking ratios. Let's walk through the Seattle analysis.

The [Seattle land use permit dataset is here](https://data.seattle.gov/Permitting/Land-Use-Permits/ht3q-kdvx). Let's get this into a `pandas` dataframe, in the same way that we did with the LA data.

In [None]:
%config Completer.use_jedi = False
import json
import requests
import pandas as pd
url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))
print(df.head())

Notice a couple of things. First, the website seems to have more fields than are in the API version of the document. Second, parking is nowhere to be seen.

But there is a `link` field. Let's take a look at the first one. 

In [None]:
# The .loc operator gives us an extract from the dataframe. 0 is the row index, 'link' is the column

print(df.loc[0,'link'])   

Notice that each entry in this column of the pandas dataframe is a dictionary, with the web address stored under the `url` key. That's perhaps a surprise, but we know how to deal with dictionaries. 
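Since each entry of the `link` column is a dictionary, one way to pull the addresses out into their own column is sketched below. The toy DataFrame, the `example.com` addresses, and the helper name `get_url` are made up for illustration; the real Seattle data may have other keys or missing entries.

```python
import pandas as pd

# Toy data that mimics the shape of the 'link' column (values are invented)
toy = pd.DataFrame({
    "permitnum": ["A-1", "A-2", "A-3"],
    "link": [
        {"url": "https://example.com/permit/A-1"},
        {"url": "https://example.com/permit/A-2"},
        None,  # pretend this row has no link
    ],
})

def get_url(entry):
    """Return the 'url' value if the entry is a dictionary, otherwise None."""
    if isinstance(entry, dict):
        return entry.get("url")
    return None

# apply calls get_url once for each entry in the column
toy["url"] = toy["link"].apply(get_url)
print(toy[["permitnum", "url"]])
```

Checking `isinstance(entry, dict)` first means a missing link gives a missing value rather than an error.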

For now, [let's take a look at what this link looks like](https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3003094-LU). Clearly, there is a lot more information here about the specific permit!

How do we bring the information in that webpage into Python? Remember, the `requests` library is our friend in this circumstance. While we've used it to get data from an API, `requests` can retrieve pretty much anything from the web.

In [None]:
urldict = df.loc[0,'link']
permiturl = urldict['url']
# or we could do this in one step: df.loc[0,'link']['url']
r = requests.get(permiturl)

In [None]:
# Let's look at what the r object has given us. 
# Remember, the .text attribute gives us the text of what's retrieved.
print(r.text)

It looks like we've got the whole HTML webpage. The relevant information is buried in there, but how can we find it in that sea of HTML code?

This is where the `BeautifulSoup` library comes in ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)). Let's convert our text to a "soup" object.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly, otherwise bs4 guesses and warns
print(type(soup))

This soup object has a lot of attributes and functions (type `soup.` and press tab to autocomplete). We'll use the `.find` function to find the relevant text.

We can also use the `.prettify()` function to give us a better sense of what we are looking at.

In [None]:
# Not very pretty IMHO, but we can look and see where the data we want are buried
# and cross-reference that to the webpage in our browser
print(soup.prettify())

Let's suppose we want to get the project description (where the parking information might be included, since there isn't a separate parking field). 

It looks like this field is contained within a `<td>` tag. 

In [None]:
links = soup.find_all('td') # returns a "list-like" object, i.e. we can loop through it or slice it like a list

# Look at an example
print(links[5])

# More systematically, let's look for the link that we are interested in
for link in links:
    if 'Project Description' in link.text: 
        print(link) 
        break # so that we keep this link, and abort the loop

Now we are getting closer! It looks like the Project Description is contained in another `<td>` tag, nested one level down. So let's do the same thing again at this second-level link.

In [None]:
sublinks = link.find_all('td')
print(sublinks)

We've obtained a list! And the information we need is in the second element of that list.

In [None]:
description = sublinks[1]
print(description.text)
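To see why the outer cell matches first, and why the text sits in the second nested cell, it can help to run the same steps on a tiny page. The HTML below is invented to mimic the nested-table layout; the real permit page is much larger and may be laid out differently.

```python
from bs4 import BeautifulSoup

# A made-up fragment with one table nested inside a cell of another
html = """
<table>
  <tr><td>Record 3003094-LU</td></tr>
  <tr><td>
    <table>
      <tr>
        <td>Project Description:</td>
        <td>Construct a 4-story building. No parking proposed.</td>
      </tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")  # outer AND nested cells, in document order

# The outer cell's text includes the text of everything nested inside it,
# so it is the first cell to match
for cell in cells:
    if "Project Description" in cell.text:
        break

inner = cell.find_all("td")  # only the cells nested inside the match
label = inner[0].text.strip()
description = inner[1].text.strip()
print(label)
print(description)
```

The first nested cell holds the label and the second holds the value, which is why we index with `[1]`.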

Now, let's take everything we've done so far, and put it in a function.
 
The function takes a single argument: the dictionary in the `link` column of the pandas DataFrame.
 
It returns the Description text, unless that's not found, in which case it returns an empty string `''`.  

In [None]:
def getDescription(urldict):
    permiturl = urldict['url']
    # requests.get retrieves the page; BeautifulSoup parses the html text
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = soup.find_all('td')
    for link in links:
        if 'Project Description' in link.text: 
            sublinks = link.find_all('td')
            description = sublinks[1].text
            # once we find a description, we return it and exit the function
            return description 
    
    return '' # if we don't find it, return an empty string

urldict = df.loc[0,'link']
getDescription(urldict)
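The function above assumes that every request succeeds and every page has the same layout. Here is one hedged sketch of a more defensive version that handles exceptions gracefully. The names `parse_description` and `get_description_safe` and the 10-second timeout are illustrative choices, not part of the original workflow.

```python
import requests
from bs4 import BeautifulSoup

def parse_description(html):
    """Pull the description out of a page's html text, or return ''."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        if "Project Description" in cell.text:
            inner = cell.find_all("td")
            try:
                return inner[1].text.strip()
            except IndexError:
                # this cell has fewer than two nested cells; keep looking
                continue
    return ""

def get_description_safe(urldict, timeout=10):
    """Hypothetical wrapper: fetch the page, but return '' if anything goes wrong."""
    try:
        r = requests.get(urldict["url"], timeout=timeout)
        r.raise_for_status()  # raises an exception for 4xx/5xx responses
    except (requests.RequestException, KeyError, TypeError):
        # network problem, bad status code, or a missing/malformed link entry
        return ""
    return parse_description(r.text)

# A page with no matching cell gives an empty string rather than an error
print(repr(parse_description("<table><tr><td>Nothing here</td></tr></table>")))
```

Separating the fetching from the parsing also lets us test the parsing on a saved or made-up page without touching the City's website.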

The advantage of a function is that we can now apply this procedure to every row of our pandas DataFrame.

Let's do this for 5 rows (so we are nice and don't disrupt the City's website).

The `apply` method in `pandas` calls a function on each element of a column (a pandas Series), so here it runs once per row.

In [None]:
smalldf = df.iloc[:5].copy()  # create a copy, rather than a view of the original DataFrame
descriptions = smalldf['link'].apply(getDescription)  # for each row in smalldf, we pass the link entry to getDescription
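If we did want to scrape many more rows, one common courtesy is to pause between requests. The sketch below wraps any function so that each call is followed by a short delay. The `polite` helper, the `fake_scraper` stand-in, and the delay values are assumptions for illustration; an appropriate delay depends on the website.

```python
import time

def polite(func, delay=1.0):
    """Return a version of func that pauses for `delay` seconds after each call."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        time.sleep(delay)
        return result
    return wrapper

# A stand-in for a real scraping function, so this example makes no web requests
def fake_scraper(urldict):
    return "description for " + urldict["url"]

slow_scraper = polite(fake_scraper, delay=0.1)

start = time.monotonic()
results = [slow_scraper({"url": u}) for u in ["page1", "page2", "page3"]]
elapsed = time.monotonic() - start
print(results)
print(round(elapsed, 1), "seconds")
```

With the real data, the same idea would look something like `smalldf['link'].apply(polite(getDescription))`.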

In [None]:
# what's the descriptions object? It's a pandas Series (basically, a one-column DataFrame)
print(type(descriptions))
print(descriptions)

In [None]:
# So we can insert that into the dataframe as a new column
smalldf['description'] = descriptions
# we could have done this in one step: 
# smalldf['description'] = smalldf['link'].apply(getDescription) 

Now we have scraped the description for each project!

How do we get the number of parking spaces? Well, that depends on whether the city uses consistent terminology. 

For starters, let's just get a column for whether there is "no parking" in the description.

In [None]:
import numpy as np

def noparking(description):
    if isinstance(description, str):
        text = description.lower()
        if 'no parking' in text:
            return True
        elif 'parking' in text:
            return False
    return np.nan
smalldf['noparking'] = smalldf.description.apply(noparking)

In [None]:
smalldf.noparking
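To check what `noparking` does in each of its three cases, we can try it on a few hand-written descriptions. The sample sentences are invented, not taken from the Seattle data.

```python
import numpy as np

def noparking(description):
    if isinstance(description, str):
        text = description.lower()
        if 'no parking' in text:
            return True
        elif 'parking' in text:
            return False
    return np.nan

# Invented descriptions covering each branch of the function
samples = [
    "Construct a 4-story building. No parking proposed.",  # says "no parking"
    "Parking for 12 vehicles proposed.",                   # mentions parking
    "Change of use from retail to office.",                # silent on parking
    None,                                                  # missing description
    "No change to existing parking.",                      # tricky wording
]

results = [noparking(s) for s in samples]
for sample, result in zip(samples, results):
    print(result, '|', sample)
```

The last sample shows the limits of simple string matching: "No change to existing parking" contains "parking" but not the exact phrase "no parking", so it is classified as `False`.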


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you want to get the number of parking spaces for each project, what would be your next step? In principle, how might you do that?
</div>
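One possible approach, sketched here, is to search for a number that sits next to a parking-related phrase using regular expressions. The two patterns and the sample sentences are assumptions; the City's actual wording would need to be checked against a sample of real descriptions, and more patterns added as needed.

```python
import re

# Assumed phrasings: "parking for 12 vehicles" and "8 (surface) parking spaces/stalls"
PATTERNS = [
    re.compile(r"parking\s+for\s+(\d+)\s+vehicles?"),
    re.compile(r"\b(\d+)\s+(?:\w+\s+)?parking\s+(?:spaces?|stalls?)"),
]

def parking_spaces(description):
    """Return the number of parking spaces found in the text, or None if unknown."""
    if not isinstance(description, str):
        return None
    text = description.lower()
    if "no parking" in text:
        return 0
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None

# Invented examples
print(parking_spaces("Allow a 6-story building. Parking for 12 vehicles proposed."))
print(parking_spaces("Apartment building with 30 units and 8 surface parking spaces."))
print(parking_spaces("No parking proposed."))
print(parking_spaces("Change of use only."))
```

Returning `None` when nothing matches keeps "unknown" separate from "zero spaces".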

<div class="alert alert-block alert-info">
<strong>Let's generalize.</strong> What did we do here?
    
1. We obtained the URL for each page to scrape. (Here, it was given to us in the city's data file, but sometimes we'll have to reverse-engineer the composition of the URL.)
2. We examined a sample page, and identified the html tags that enclose the data we wanted to extract.
3. We wrote a function that pulled out the data for a specific page.
4. We applied that function to each URL / page. Since our URLs were in a pandas DataFrame, we could use the pandas <strong>apply</strong> method.
    
Every scraping project will pose different challenges, but normally it will involve each of these four steps.
</div>
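As an example of reverse-engineering a URL in step 1: the permit page we looked at ends in `altId=3003094-LU`. If we assume that other permits follow the same pattern (which would need to be checked), we could build the URL from a permit number ourselves. The `permit_url` helper is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the sample permit link
BASE = "https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx"

def permit_url(permit_number):
    """Build a permit page URL, assuming the altId pattern holds for every permit."""
    return BASE + "?" + urlencode({"altId": permit_number})

print(permit_url("3003094-LU"))
```

Using `urlencode` rather than gluing strings together takes care of any characters that need escaping in a URL.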