# Web scraping part 1: permits

## Lecture objectives

1. Demonstrate how to scrape web pages and other data where an API doesn't exist
2. Introduce the `BeautifulSoup` library
3. Demonstrate how to parse unstructured text data
4. More practice with `pandas`

APIs make it relatively simple to get data from the web. But sometimes an API doesn't exist, because setting one up and maintaining it takes effort on the part of the agency.

In these cases, we can still obtain data from the web. But rather than dropping it directly into a (geo)pandas `DataFrame`, we'll need to do more work to understand the structure of the webpage, and to clean and process the results. 

## Example: Land use permit data
Often, cities make their building and land use permit data available for download, and/or accessible through an API. But these are typically incomplete—they provide a subset of fields that are most relevant to most users (e.g., permit approval date and number of units), but perhaps exclude more esoteric fields. And parking, sadly, is one of the fields that is often excluded.

For a [recent project](https://www.tandfonline.com/doi/full/10.1080/01944363.2021.1873824), I looked at the impacts of TOD plans in Seattle and San Francisco on development outcomes, including parking ratios. Let's walk through how I obtained the data for the Seattle analysis.

The basic Seattle land use permit dataset [is available through the city's Socrata API](https://data.seattle.gov/Permitting/Land-Use-Permits/ht3q-kdvx). That's a good starting point for our work. Let's get this into a `pandas` dataframe, in the same way that we did with the Los Angeles data.

In [None]:
import json
import requests
import pandas as pd
url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))
print(df.head())

There are lots of columns, so the output is truncated.

But we can explore the contents of the dataframe in other ways. For example, `.info()` gives us the column names and data types. (The `object` type normally means a string or a mixed type.)

In [None]:
df.info()

Notice that there is a `link` field. Let's take a look at the first one. 

In [None]:
# The .loc operator gives us an extract from the dataframe. 0 is the index, 'link' is the column
# So this gives us the contents of the 'link' column for the row with index 0 (the first one).

print(df.loc[0, 'link'])   

Notice that the value in this column is a dictionary, not a plain string. That's perhaps a surprise, but we know how to deal with dictionaries.

For now, [let's take a look at what this link looks like](https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU). Clearly, there is a lot more information here about the specific permit than is provided via the API!

How do we bring the information in that webpage into Python? Remember, the `requests` library is our friend in this circumstance. While we've used it to get data from an API, `requests` can retrieve pretty much anything from the web.

First, let's extract the text string that gives the URL for this row.

In [None]:
urldict = df.loc[0,'link']
print(urldict)

As we saw before, it's a dictionary with a key of 'url', so let's extract the value.

In [None]:
permiturl = urldict['url']
print(permiturl)

# or we could do this in one step: permiturl = df.loc[0,'link']['url']
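To pull the URL out of every row rather than one row at a time, one option is to apply a small function to the whole `link` column. The sketch below uses a made-up two-row DataFrame in place of the API result; the `permitnum` column name and the example.com URL are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the API result: each 'link' value is a dictionary with a 'url' key.
# The 'permitnum' column and the example.com URL are made up for illustration.
toy = pd.DataFrame({
    'permitnum': ['3001212-LU', '0000000-XX'],
    'link': [
        {'url': 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU'},
        {'url': 'https://example.com/permit/0000000-XX'},
    ],
})

# apply() calls the function once for each value in the column
toy['permiturl'] = toy['link'].apply(lambda d: d['url'])
print(toy['permiturl'].tolist())
```

The same pattern should work on the real dataframe as long as every row's `link` value is a dictionary with a `url` key.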


Now, pass that URL to `requests` in the same way that we did for the API.

In [None]:
r = requests.get(permiturl)

Let's look at what requests has returned. 

Remember, the `.text` attribute gives us the text of what's retrieved.

In [None]:
print(r.text)

### Using BeautifulSoup
It looks like we've got the whole HTML webpage. The relevant information is buried in there, but how can we find it in the sea of HTML code?

This is where the `BeautifulSoup` library comes in ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)). Let's convert our text to a "soup" object.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
print(type(soup))

This soup object has a lot of attributes and functions (type `soup.` and press tab to autocomplete). 

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Look at what you can do with the <strong>soup</strong> object. Experiment. What functions seem most useful?
</div>

In [None]:
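# type soup. and press Tab to see the options, or list the public names with dir()
print([name for name in dir(soup) if not name.startswith('_')])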

Let's suppose we want to get the project description (where the parking information might be included, since there isn't a separate parking field). (In reality, the "description" field is now in the API version, but that wasn't the case originally, and it's good practice.)

Just like with the API output that we saw earlier, extracting this is a case of step-by-step detective work.

If you look at the [output](https://cosaccela.seattle.gov/portal/cap/CapDetail.aspx?type=1000&fromACA=Y&agencyCode=SEATTLE&Module=DPDPermits&capID1=05HST&capID2=00000&capID3=19806) in the developer tools of your web browser, it seems that Project Description is contained within a `<td>` tag.

We'll use the `.find_all()` function to find the relevant text.

In [None]:
tds = soup.find_all('td') # returns a "list-like" object, i.e. we can loop through it or slice it like a list

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> What is the <strong>tds</strong> object? How can you make use of it?
</div>

Let's have a look.

In [None]:
type(tds)

What on earth is a `ResultSet`? The [docs](https://tedboy.github.io/bs4_doc/generated/generated/bs4.ResultSet.html) tell us that it's a kind of list (a subclass of Python's `list`). So we can use our regular list methods on it.

In [None]:
# look at the first element
print(tds[0])
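As a quick check of the "it's a list" claim, here is a small sketch on a made-up HTML snippet (not the permit page). It assumes only that `find_all()` returns a `ResultSet`, as we saw above.

```python
from bs4 import BeautifulSoup

# A tiny made-up table, just to have some <td> tags to find
html = "<table><tr><td>A</td><td>B</td><td>C</td></tr></table>"
cells = BeautifulSoup(html, "html.parser").find_all("td")

print(isinstance(cells, list))        # ResultSet is a subclass of list
print(len(cells))                     # number of <td> tags found
print([c.text for c in cells[:2]])    # slicing works as with any list
```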

More systematically, let's loop through to find the element that has the Project Description.

In [None]:
for td in tds:
    if 'Project Description' in td.text: 
        # stop here and abort the loop
        break 
        
print(td)
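One caveat with the loop above: if no `<td>` contains "Project Description", the loop simply runs to the end, and `td` is left pointing at the last element. A sketch of one way to guard against that is to use `next()` with a default value. Plain strings stand in for `td.text` here, and `first_match` is a hypothetical helper name.

```python
def first_match(items, phrase):
    """Return the first item containing phrase, or None if there is no match."""
    return next((item for item in items if phrase in item), None)

# Made-up cell contents standing in for td.text
cells = ['Record Details', 'Project Description: Construct duplex', 'Applicant']

print(first_match(cells, 'Project Description'))
print(first_match(cells, 'Parking Spaces'))   # None, because nothing matched
```

Checking for `None` afterwards makes the "not found" case explicit rather than silently using the wrong cell.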

Now we are getting closer! It looks like the Project Description is contained in another `<td>` tag, nested one level down. So let's do the same thing again at this second level.

In [None]:
tds2 = td.find_all('td')
print(tds2)

We've obtained a list! And the information we need is in the second element of that list.

In [None]:
description = tds2[1]
print(description.text)
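Why did the first match turn out to be a wrapper rather than the description itself? The `.text` of a tag includes the text of everything nested inside it, so an outer `<td>` matches before the inner ones do. The sketch below reproduces this on a made-up snippet; the HTML, the `id` attribute and the description text are all invented, and the real page is more complex.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the nesting: a <td> that wraps a second table
html = """
<table>
  <tr><td id="outer">
    <table>
      <tr><td>Project Description:</td><td>Build 4 townhouses. Parking for 4 vehicles.</td></tr>
    </table>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")

# Both the outer cell and the label cell contain the phrase
matches = [td for td in tds if "Project Description" in td.text]
print(len(tds), len(matches))
print(matches[0].get("id"))      # the outer wrapper comes first in document order

# Searching inside the outer cell gives the label and then the value
inner = matches[0].find_all("td")
print(inner[1].text)
```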

Now, let's take everything we've done so far, and put it in a function. To help with that, let's recap.

* For each row of the `DataFrame` (`df`), we have a dictionary with the url (we called that `urldict`)
* We extracted the URL from that dictionary, and put it in the `permiturl` variable
* We requested that URL using requests, and put the response in the `r` variable
* We converted that response to a `soup` object, and called it `soup`
* We found all the content within `td` tags, and put that in the `tds` variable (a `ResultSet` or list)
* We looped over each element of `tds`, and found the one that contains "Project Description"
* We found all the content within the second level of `td` tags, and put the second element in the `description` variable
* We extracted the text from that `description` variable, using the `text` attribute

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Identify each of these steps in the code above.
</div>

That's a lot of steps! So let's write a function that allows us to apply all of these steps to each permit.
 
The function takes a single argument: the dictionary in the `link` column of the pandas DataFrame.
 
It returns the Description text, unless that's not found, in which case it returns an empty string `''`.  

In [None]:
def getDescription(urldict):
    permiturl = urldict['url']
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
    tds = soup.find_all('td')
    for td in tds:
        if 'Project Description' in td.text: 
            tds2 = td.find_all('td')
            description = tds2[1]
            # once we find a description, we return it and exit the function
            return description.text 
    
    return '' # if we don't find it, return an empty string

# Now let's apply this function to the first link in our dataframe
urldict = df.loc[0,'link']
getDescription(urldict)
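One possible variation on `getDescription` is to split the parsing from the download, so that the parsing logic can be tried out on a saved or made-up page without hitting the city's website. The sketch below is not the code used in the original project; `parse_description` is a hypothetical name, the sample HTML is invented, and the length check is an added guard for pages that lack the nested layout.

```python
from bs4 import BeautifulSoup

def parse_description(html):
    """Return the Project Description text from a permit page, or '' if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    for td in soup.find_all('td'):
        if 'Project Description' in td.text:
            inner = td.find_all('td')
            if len(inner) > 1:   # guard against pages with a different layout
                return inner[1].text.strip()
    return ''

# Made-up HTML that mimics the nested <td> structure
sample = """
<table><tr><td>
  <table><tr><td>Project Description:</td><td> Construct a duplex. </td></tr></table>
</td></tr></table>
"""
print(parse_description(sample))
print(repr(parse_description('<p>No table here</p>')))
```

With this split, the network step reduces to something like `parse_description(requests.get(permiturl).text)`.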

The advantage of a function is that we can now apply this procedure to every row of our pandas DataFrame.

Let's do this for 10 rows (so we are nice and don't disrupt the City's website).

The `apply` method in `pandas` applies a function to each element of a column (a `Series`), or to each row or column of a DataFrame. Here we use it on a single column.

In [None]:
# create a copy of the first 10 rows of the dataframe.
smalldf = df.head(10).copy()  

# for each row in smalldf, we pass the value of the link column to getDescription
# Inside the function, that value is called urldict (the name of the function's argument)
descriptions = smalldf['link'].apply(getDescription)  

In [None]:
# what does apply return? It's a pandas Series (basically, a one-column DataFrame)
print(type(descriptions))

In [None]:
print(descriptions)

In [None]:
# So we can insert that into the dataframe as a new column
smalldf['newdescription'] = descriptions
# we could have done this in one step: 
# smalldf['newdescription'] = smalldf['link'].apply(getDescription) 
smalldf
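When scaling up beyond 10 rows, one way to stay "nice" is to pause between requests. The following is a minimal sketch, assuming a fixed delay is acceptable; `with_delay` is a hypothetical helper, and a fake function stands in for `getDescription` so the example runs without touching the network.

```python
import time

def with_delay(func, seconds=1.0):
    """Wrap func so that each call is followed by a pause of `seconds`."""
    def wrapper(arg):
        result = func(arg)
        time.sleep(seconds)   # wait before the next request
        return result
    return wrapper

# Stand-in for getDescription, so that no real request is made
def fake_get_description(urldict):
    return 'description for ' + urldict['url']

polite = with_delay(fake_get_description, seconds=0.1)
print(polite({'url': 'https://example.com/a'}))
```

With the real function, this might be used as `smalldf['link'].apply(with_delay(getDescription, seconds=1))`. The one-second pause is an arbitrary choice, not a documented requirement of the city's site.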

### Parsing text
Now we have scraped the description for each project!

How do we get the number of parking spaces? Well, that depends on whether the city uses consistent terminology. 

You'll need to design a set of rules that cover different possibilities. For example, the description might say "2 parking spaces" or "TWO PARKING SPACES" or "1 uncovered and 1 covered parking space." Looking at your data is key.

For starters, let's take the simplest case. We'll add a column to our dataframe that indicates whether there is "no parking" in the project description.

In [None]:
# import the numpy library, which underlies pandas
# we'll use its nan (null) value to indicate missing data
import numpy as np

def noparking(description):
    # convert the description to lower case
    text = description.lower()
    if 'no parking' in text:
        return True
    elif 'parking' in text:
        return False
    else:
        # capture all other possibilities
        return np.nan

# Now apply our function
# (we use the scraped column, which always holds a string)
smalldf['noparking'] = smalldf['newdescription'].apply(noparking)

In [None]:
# look at the output (just the noparking column)
smalldf.noparking
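Because the function returns three possible values (`True`, `False` or `nan`), it can help to count each outcome, including the missing ones. A small sketch on a made-up Series, since the real values depend on what was scraped:

```python
import numpy as np
import pandas as pd

# Made-up results standing in for smalldf.noparking
flags = pd.Series([True, False, np.nan, False], name='noparking')

print(flags.value_counts(dropna=False))   # counts True, False and missing separately
print(flags.isna().sum())                 # descriptions that never mention parking
```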

<div class="alert alert-block alert-info">
<strong>Thought exercise:</strong> If you want to get the number of parking spaces for each project, what would be your next step? In principle, how might you do that?
</div>
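One possible first step towards an answer is a regular expression that looks for a count immediately followed by "parking space(s)". This is only a sketch of one rule, not the full rule set used in the original project; `count_parking_spaces`, the word list and the sample descriptions are assumptions for illustration.

```python
import re

# Number words handled by this sketch; extend the mapping after looking at the data
WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
         'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}

# A count (digits or a number word) directly followed by "parking space(s)"
PATTERN = re.compile(r'\b(\d+|' + '|'.join(WORDS) + r')\s+parking\s+spaces?\b')

def count_parking_spaces(description):
    """Return the first stated number of parking spaces, or None if no match."""
    match = PATTERN.search(description.lower())
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else WORDS[token]

print(count_parking_spaces('Construct duplex with 2 parking spaces.'))
print(count_parking_spaces('TWO PARKING SPACES to be provided'))
print(count_parking_spaces('1 uncovered and 1 covered parking space.'))
```

The third description is not matched by this rule, because the counts are separated from the word "parking". That is a reminder that each new phrasing in the data may need its own rule.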

<div class="alert alert-block alert-info">
<strong>Let's generalize.</strong> What did we do here?
    
1. We obtained the URL for each page to scrape. (Here, it was given to us in the city's data file, but sometimes we'll have to reverse-engineer the composition of the URL.)
2. We examined a sample page, and identified the html tags that enclose the data we wanted to extract.
3. We wrote a function that pulled out the data for a specific page.
4. We applied that function to each URL / page. Since our URLs were in a pandas DataFrame, we could use the pandas <strong>apply</strong> method.
    
Every scraping project will pose different challenges, but normally it will involve each of these four steps.
</div>
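As an illustration of reverse-engineering a URL (step 1), the record link we looked at earlier ends in `?altId=3001212-LU`. Assuming that other permits follow the same pattern, which would need to be checked against the site, a URL could be built from a permit number roughly as follows. `permit_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Base address copied from the record link shown earlier in this notebook
BASE = 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx'

def permit_url(permit_number):
    """Build a record URL from a permit number (assumes the altId pattern holds)."""
    return BASE + '?' + urlencode({'altId': permit_number})

print(permit_url('3001212-LU'))
```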