# Module 2. Web scraping
# Scraping permits

## Lecture objectives

1. Demonstrate how to scrape web pages and other data where an API doesn't exist
2. Introduce the `BeautifulSoup` library
3. Provide more practice with `pandas`

APIs make it relatively simple to get data from the web. But sometimes an API doesn't exist, since it takes effort on the part of the agency to set one up and maintain it.

In these cases, we can still obtain data from the web. But rather than dropping it directly into a (geo)pandas `DataFrame`, we'll need to do more work to understand the structure of the webpage, and to clean and process the results. 

## Example: Land use permit data
Often, cities make their building and land use permit data available for download, and/or accessible through an API. But these are typically incomplete—they provide a subset of fields that are most relevant to most users (e.g., permit approval date and number of units), but perhaps exclude more esoteric fields. And parking, sadly, is one of the fields that is often excluded.

For a [recent project](https://www.tandfonline.com/doi/full/10.1080/01944363.2021.1873824), I looked at the impacts of TOD plans in Seattle and San Francisco on development outcomes, including parking ratios. Let's walk through how I obtained the data for the Seattle analysis.

The basic Seattle land use permit dataset [is available through the city's Socrata API](https://data.seattle.gov/Permitting/Land-Use-Permits/ht3q-kdvx). That's a good starting point for our work. Let's get this into a `pandas` dataframe, in the same way that we did with the Los Angeles data.

In [1]:
import json
import requests
import pandas as pd
url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))
print(df.head())

    permitnum           permitclass permitclassmapped       permittypemapped   
0  3001064-EG            Commercial   Non-Residential  Early Design Guidance  \
1  3001064-LU            Commercial   Non-Residential      Master Use Permit   
2  3001095-LU  Single Family/Duplex       Residential      Master Use Permit   
3  3001121-LU                   N/A               N/A      Master Use Permit   
4  3001139-LU           Multifamily       Residential      Master Use Permit   

  permittypedesc                                        description   
0  Design Review  Early Design Guidance for: Land Use Applicatio...  \
1            NaN  Land Use Application to allow one, 6-story bui...   
2            NaN  Land Use Application to subdivide one parcel i...   
3            NaN                               Unit Lot Subdivision   
4            NaN  Cancel per customer request 4/15/08 log #4507\...   

  estprojectcost statuscurrent   originaladdress1 originalcity  ...   
0  16000000.0000     

There are lots of columns, so the output is truncated.

But we can explore the contents of the dataframe in other ways. For example, `.info()` gives us the column names and variable types. (`object` normally means a string or a mix of types.)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   permitnum              1000 non-null   object
 1   permitclass            1000 non-null   object
 2   permitclassmapped      1000 non-null   object
 3   permittypemapped       1000 non-null   object
 4   permittypedesc         124 non-null    object
 5   description            999 non-null    object
 6   estprojectcost         8 non-null      object
 7   statuscurrent          1000 non-null   object
 8   originaladdress1       1000 non-null   object
 9   originalcity           1000 non-null   object
 10  originalstate          1000 non-null   object
 11  originalzip            869 non-null    object
 12  link                   1000 non-null   object
 13  latitude               918 non-null    object
 14  longitude              918 non-null    object
 15  location1             
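Another way around the truncated display is to list the column names directly, or to pull out only the columns of interest. Here is a minimal sketch. It uses a tiny hand-made DataFrame (`toy`, with made-up values) as a stand-in for `df`, so that it runs without a network call.

```python
import pandas as pd

# A tiny stand-in for df (made-up values, for illustration only)
toy = pd.DataFrame({
    'permitnum': ['0000001-LU', '0000002-LU'],
    'statuscurrent': ['Completed', 'Canceled'],
    'link': [{'url': 'https://example.com/1'}, {'url': 'https://example.com/2'}],
})

# All column names as a plain Python list
column_names = toy.columns.tolist()
print(column_names)

# Select just the columns we care about (note the double brackets)
subset = toy[['permitnum', 'link']]
print(subset.head())

# Ask pandas to print every column, rather than truncating with "..."
pd.set_option('display.max_columns', None)
```

The same three calls should work unchanged on the real `df`.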

Notice that there is a `link` field. Let's take a look at the first one. 

In [3]:
# The .loc operator gives us an extract from the dataframe. 0 is the index, 'link' is the column
# So this gives us the contents of the 'link' column for the row with index 0 (the first one).

print(df.loc[0, 'link'])   

{'url': 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001064-EG'}


Notice that each entry in this column of the pandas dataframe is a dictionary. That's perhaps a surprise, but we know how to deal with dictionaries.

For now, [let's take a look at what one of these links looks like](https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU). Clearly, there is a lot more information here about the specific permit than is provided via the API!

How do we bring the information in that webpage into Python? Remember, the `requests` library is our friend in this circumstance. While we've used it to get data from an API, `requests` can retrieve pretty much anything from the web.

First, let's extract the text string that gives the URL for this row.

In [4]:
urldict = df.loc[0,'link']
print(urldict)

{'url': 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001064-EG'}


As we saw before, it's a dictionary with a key of 'url', so let's extract the value.

In [5]:
permiturl = urldict['url']
print(permiturl)

# or we could do this in one step: permiturl = df.loc[0,'link']['url']

https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001064-EG
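The same extraction can be done for a whole column at once. The sketch below is one possible approach, not the only one. The `links` Series and the helper name `get_url` are made up for illustration, and the example.com addresses are placeholders. It also shows one way to cope with a row where the link is missing.

```python
import pandas as pd

# A stand-in for df['link']: two dictionaries and one missing value
links = pd.Series([
    {'url': 'https://example.com/permit/1'},
    {'url': 'https://example.com/permit/2'},
    None,
])

def get_url(urldict):
    """Return the 'url' value, or None if the entry is not a dictionary."""
    if isinstance(urldict, dict):
        return urldict.get('url')
    return None

# apply calls get_url once per element and collects the results
urls = links.apply(get_url)
print(urls)
```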


Now, pass that URL to `requests` in the same way that we did for the API.

In [6]:
r = requests.get(permiturl)
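Web pages fail more often than APIs do, so it can be worth checking that the request succeeded before parsing anything. The helper below is a sketch under assumptions: the name `fetch_html` and the 30-second timeout are choices made for this example, not part of the lecture code.

```python
import requests

def fetch_html(url, timeout=30):
    """Request a page and return its HTML text.

    Raises requests.HTTPError if the server answers with an error status
    (4xx or 5xx), and gives up if no response arrives within `timeout` seconds.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()   # does nothing on success, raises on an error status
    return r.text

# Usage (commented out so that this cell makes no network call):
# html = fetch_html(permiturl)
```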

Let's look at what requests has returned. 

Remember, the `.text` attribute gives us the text of what's retrieved.

In [7]:
print(r.text)



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html ng-app="appAca" xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
<head id="ctl00_Head1"><link href="../App_Themes/Default/form.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/style.css" type="text/css" rel="stylesheet" /><title>
	
        Accela Citizen Access
    
</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <style type="text/css">
        body, html {
        overflow-y: visible!important;
        }
        .page-overlay-blocker {
          position: fixed;
          z-index: 999;
          top: 0;
          left: 0;
          right: 0;
          bottom: 0;
          width: 100%;
          height: 100%;
          background-color: #

### Using BeautifulSoup
It looks like we've got the whole HTML webpage. The relevant information is buried in there, but how can we find it in the sea of HTML code?

This is where the `BeautifulSoup` library comes in ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)). Let's convert our text to a "soup" object.

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, features='html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


This soup object has a lot of attributes and functions (type `soup.` and press tab to autocomplete). 

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Look at what you can do with the <strong>soup</strong> object. Experiment. What functions seem most useful?
</div>

In [14]:
soup.children

<list_iterator at 0x7fcf49e39690>
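To get a feel for the soup object, it may help to try it on a page small enough to read in full. The HTML string below is invented for this sketch, and `toy_soup` is just an illustrative name. The calls themselves (`.title`, `.find()`, `.find_all()`, `.children`) are standard `BeautifulSoup` features.

```python
from bs4 import BeautifulSoup

# A made-up page, short enough to read in full
html = """
<html><head><title>Toy permit page</title></head>
<body>
  <h1 id="heading">Permit 0000001-LU</h1>
  <a href="/portal/Login.aspx" id="login">Login</a>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</body></html>
"""
toy_soup = BeautifulSoup(html, features='html.parser')

# Tags can be reached by name; .text gives the text inside a tag
page_title = toy_soup.title.text

# .find() returns the first match; attributes are read like dictionary keys
first_link = toy_soup.find('a')
link_target = first_link['href']

# .find_all() returns every match; class_ filters on the CSS class
notes = [p.text for p in toy_soup.find_all('p', class_='note')]

# .children is an iterator, so wrap it in list() to look at its contents
body_children = list(toy_soup.body.children)

print(page_title, link_target, notes)
```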

Let's suppose we want to get information from the project description (where the parking information might be included, since there isn't a separate parking field). (In reality, the "description" field is now in the API version, but that wasn't the case originally, and it's good practice.)

Just like with the API output that we saw earlier, extracting this is a case of step-by-step detective work.

If you look at the [output](https://cosaccela.seattle.gov/portal/cap/CapDetail.aspx?type=1000&fromACA=Y&agencyCode=SEATTLE&Module=DPDPermits&capID1=05HST&capID2=00000&capID3=19806) using the developer tools in your web browser, it seems that Project Description is contained within a `<td>` tag.

We'll use the `.find_all()` function to find the relevant text.

In [15]:
tds = soup.find_all('td') # returns a "list-like" object, i.e. we can loop through it or slice it like a list

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> What is the <strong>tds</strong> object? How can you make use of it?
</div>

Let's have a look.

In [16]:
type(tds)

bs4.element.ResultSet

What on earth is a `ResultSet`? The [docs](https://tedboy.github.io/bs4_doc/generated/generated/bs4.ResultSet.html) tell us that it's a list. So we can use our regular methods to look at a list.
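A quick way to convince yourself of this is to try the usual list operations on a small example. The one-row table below is made up for the sketch.

```python
from bs4 import BeautifulSoup

# A made-up table with three cells
html = "<table><tr><td>a</td><td>b</td><td>c</td></tr></table>"
cells = BeautifulSoup(html, features='html.parser').find_all('td')

print(isinstance(cells, list))             # is a ResultSet a kind of list?
print(len(cells))                          # how many <td> tags were found
print(cells[0].text)                       # indexing works
print([cell.text for cell in cells[1:]])   # so do slicing and looping
```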

In [17]:
# look at the first element
print(tds[0])

<td>
<div id="ctl00_HeaderNavigation_beforeLogin">
<!--Login link-->
<div class="ACA_FRight">
<table border="0" cellpadding="0" cellspacing="0" role="presentation">
<tr>
<td>
<a href="/portal/Login.aspx" id="ctl00_HeaderNavigation_btnLogin">
<span class="ACA_Body_Text ACA_Body_Text_FontSize" id="ctl00_HeaderNavigation_lblLogin"><span class="ssp-login">Login</span></span>
</a>
</td>
<td class="ACA_TabRow_Line">
                                              
                                        </td>
</tr>
</table>
</div>
<!--Login link-->
<!--Report-->
<div class="ACA_FRight">
<div>
<table border="0" cellpadding="0" cellspacing="0" role="presentation">
<tr id="reportLink">
<td class="ACA_TabRow_Line">
<a class="nav_more_arro ACA_Report_Arrow NotShowLoading" href="javascript:void(0);" onclick="showReports();" title="Report List">
<span id="ctl00_HeaderNavigation_lblReports"></span>
<span class="ACA_Body_Text ACA_Body_Text_FontSize" id="ctl00_HeaderNavigation_lblAdminReports" style="di

More systematically, let's loop through to find the element that has the Project Description.

In [18]:
for td in tds:
    if 'Project Description' in td.text: 
        # stop here and abort the loop
        break 
        
print(td)

<td class="td_parent_left"><div>
<h1 style="font-size:1.4em;"><span id="ctl00_PlaceHolderMain_PermitDetailList1_per_permitDetail_label_projectl638168029218552654">Project Description</span></h1><span class="ACA_SmLabel ACA_SmLabel_FontSize"><table class="table_child" role="presentation" style="TEMPLATE_STYLE"><tr><td class="td_child_left font12px"></td><td>Early Design Guidance for: Land Use Application to allow one, 6-story building containing 102 assisted living units and 1,445 sq. ft. of retail space.  Parking for 37 vehicles to be provided below grade.  Project includes 10,000 cu. yds. of grading. Existing structure to be demolished.</td></tr></table></span>
</div></td>


Now we are getting closer! It looks like the Project Description is contained in another `<td>` tag, nested one level down. So let's do the same thing again, this time searching within the `td` that we just found.

In [19]:
tds2 = td.find_all('td')
print(tds2)

[<td class="td_child_left font12px"></td>, <td>Early Design Guidance for: Land Use Application to allow one, 6-story building containing 102 assisted living units and 1,445 sq. ft. of retail space.  Parking for 37 vehicles to be provided below grade.  Project includes 10,000 cu. yds. of grading. Existing structure to be demolished.</td>]


We've obtained a list! And the information we need is in the second element of that list.

In [20]:
description = tds2[1]
print(description.text)

Early Design Guidance for: Land Use Application to allow one, 6-story building containing 102 assisted living units and 1,445 sq. ft. of retail space.  Parking for 37 vehicles to be provided below grade.  Project includes 10,000 cu. yds. of grading. Existing structure to be demolished.
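Looping over every `td` is one route to the description. Another possible route is to find the label first and then move to the table that follows it. The sketch below is an alternative technique, not the approach used in this lecture. It runs on a cut-down, hand-written imitation of the permit page, so the shortened `id` and the sample text are assumptions.

```python
from bs4 import BeautifulSoup

# A simplified imitation of the structure printed above
html = """
<table><tr><td class="td_parent_left"><div>
<h1><span id="label">Project Description</span></h1>
<span class="ACA_SmLabel"><table class="table_child"><tr>
<td class="td_child_left"></td><td>Parking for 37 vehicles.</td>
</tr></table></span>
</div></td></tr></table>
"""
toy_soup = BeautifulSoup(html, features='html.parser')

# 1. Find the <span> whose text is exactly the label
label = toy_soup.find('span', string='Project Description')

# 2. Move forward in the document to the next <table> after the label
table = label.find_next('table')

# 3. The description sits in the last cell of that table
cells = table.find_all('td')
toy_description = cells[-1].get_text(strip=True)
print(toy_description)
```

Whether this is more robust than the loop depends on the page. It relies on the label text matching exactly, whereas the loop only needs the label to appear somewhere in the cell.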


Now, let's take everything we've done so far, and put it in a function. To help with that, let's recap.

* For each row of the `DataFrame` (`df`), we have a dictionary with the url (we called that `urldict`)
* We extracted the URL from that dictionary, and put it in the `permiturl` variable
* We requested that URL using requests, and put the response in the `r` variable
* We converted that response to a `soup` object, and called it `soup`
* We found all the content within `td` tags, and put that in the `tds` variable (a `ResultSet` or list)
* We looped over each element of `tds`, and found the one that contains "Project Description"
* We found all the content within the second level of `td` tags, and put the second element in the `description` variable
* We extracted the text from that `description` variable, using the `text` attribute

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Identify each of these steps in the code above.
</div>

That's a lot of steps! So let's write a function that allows us to apply all of these steps to each permit.
 
The function takes a single argument: the dictionary from the `link` column of the pandas DataFrame.
 
It returns the Description text, unless that's not found, in which case it returns an empty string `''`.  

In [21]:
def getDescription(urldict):
    permiturl = urldict['url']
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, features='html.parser')
    tds = soup.find_all('td')
    for td in tds:
        if 'Project Description' in td.text: 
            tds2 = td.find_all('td')
            description = tds2[1]
            # once we find a description, we return it and exit the function
            return description.text 
    
    return '' # if we don't find it, return an empty string

# Now let's apply this function to the first link in our dataframe
urldict = df.loc[0,'link']
getDescription(urldict)

'Early Design Guidance for: Land Use Application to allow one, 6-story building containing 102 assisted living units and 1,445 sq. ft. of retail space.  Parking for 37 vehicles to be provided below grade.  Project includes 10,000 cu. yds. of grading. Existing structure to be demolished.'
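When scraping many pages, some will not have the layout we expect. One option is to separate the parsing from the downloading, so that the parsing can be tried out on small HTML strings without touching the website. The function below is a sketch along those lines. The name `extract_description`, the length check, and the `.strip()` call are additions for this example, and the test strings are made up.

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the Project Description text from a page's HTML, or '' if absent."""
    soup = BeautifulSoup(html, features='html.parser')
    for td in soup.find_all('td'):
        if 'Project Description' in td.text:
            nested = td.find_all('td')
            if len(nested) > 1:   # guard against an unexpected layout
                return nested[1].text.strip()
    return ''

# Try it on two made-up snippets: one with a description, one without
with_description = (
    '<table><tr><td><h1>Project Description</h1>'
    '<table><tr><td></td><td> Parking for 37 vehicles. </td></tr></table>'
    '</td></tr></table>'
)
without_description = '<table><tr><td>Nothing here</td></tr></table>'

print(repr(extract_description(with_description)))
print(repr(extract_description(without_description)))
```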

The advantage of a function is that we can now apply this procedure to every row of our pandas DataFrame.

Let's do this for 10 rows (so we are nice and don't disrupt the City's website).

The `apply` method in `pandas` calls a function once for each element of a Series (here, each entry in the `link` column) and collects the results.

In [22]:
# create a copy of the first 10 rows of the dataframe.
smalldf = df.head(10).copy()  

# for each row in smalldf, we pass the contents of the link column to getDescription
# Inside the function, that value is called urldict (the name of the function's argument)
descriptions = smalldf['link'].apply(getDescription)  

In [23]:
# what does apply return? It's a pandas Series (basically, a one-column DataFrame)
print(type(descriptions))

<class 'pandas.core.series.Series'>


In [24]:
print(descriptions)

0    Early Design Guidance for: Land Use Applicatio...
1    Land Use Application to allow one, 6-story bui...
2    Land Use Application to subdivide one parcel i...
3                                 Unit Lot Subdivision
4    Cancel per customer request 4/15/08 log #4507\...
5    PROJECT CANCELLED 12/8/2010 -- This short plat...
6    Cancelled 10/20/2010   Council land use action...
7    Land Use Application to install a 10 ft. by 18...
8    Land use permit to subdivide 1 parcel into 3 p...
9    Land Use Permit to adjust the boundary between...
Name: link, dtype: object
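For a longer run, one common courtesy is to pause between requests. The wrapper below is a minimal sketch of that idea. The names `with_delay` and `fake_get_description` are hypothetical, and the stand-in function makes no web request, so the example can be run safely.

```python
import time
import pandas as pd

def with_delay(func, seconds=1.0):
    """Wrap func so that each call is followed by a pause of `seconds`."""
    def wrapper(arg):
        result = func(arg)
        time.sleep(seconds)   # wait before the next request is made
        return result
    return wrapper

# A stand-in for getDescription that does not touch the network
def fake_get_description(urldict):
    return 'description for ' + urldict['url']

toy_links = pd.Series([{'url': 'page1'}, {'url': 'page2'}])

# seconds=0 keeps this demonstration fast; use a real pause when scraping
toy_descriptions = toy_links.apply(with_delay(fake_get_description, seconds=0))
print(toy_descriptions.tolist())
```

With the real function, the call would look like `smalldf['link'].apply(with_delay(getDescription, seconds=1))`.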


In [25]:
# So we can insert that into the dataframe as a new column
smalldf['newdescription'] = descriptions
# we could have done this in one step: 
# smalldf['newdescription'] = smalldf['link'].apply(getDescription) 
smalldf

| | permitnum | permitclass | permitclassmapped | permittypemapped | permittypedesc | description | estprojectcost | statuscurrent | originaladdress1 | originalcity | ... | longitude | location1 | housingunitsremoved | housingunitsadded | applieddate | issueddate | expiresdate | decisiondate | contractorcompanyname | newdescription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3001064-EG | Commercial | Non-Residential | Early Design Guidance | Design Review | Early Design Guidance for: Land Use Applicatio... | 16000000.0 | Completed | 2200 E MADISON ST | SEATTLE | ... | -122.30341969 | {'latitude': '47.61885483', 'longitude': '-122... | | | | | | | | Early Design Guidance for: Land Use Applicatio... |
| 1 | 3001064-LU | Commercial | Non-Residential | Master Use Permit | | Land Use Application to allow one, 6-story bui... | 16000000.0 | Completed | 2200 E MADISON ST | SEATTLE | ... | -122.30341969 | {'latitude': '47.61885483', 'longitude': '-122... | 0.0 | 103.0 | 2011-06-21 | 2012-04-06 | 2015-02-17 | 2012-02-02 | | Land Use Application to allow one, 6-story bui... |
| 2 | 3001095-LU | Single Family/Duplex | Residential | Master Use Permit | | Land Use Application to subdivide one parcel i... | | Completed | 5414 21ST AVE SW | SEATTLE | ... | -122.35930088 | {'latitude': '47.55326217', 'longitude': '-122... | | | 2012-04-25 | 2012-09-05 | 2015-08-10 | 2012-07-26 | | Land Use Application to subdivide one parcel i... |
| 3 | 3001121-LU | | | Master Use Permit | | Unit Lot Subdivision | | Canceled | 103 30TH AVE | SEATTLE | ... | -122.29413641 | {'latitude': '47.60184595', 'longitude': '-122... | | | | | | | | Unit Lot Subdivision |
| 4 | 3001139-LU | Multifamily | Residential | Master Use Permit | | Cancel per customer request 4/15/08 log #4507\\... | | Canceled | 3649 S MORGAN ST | SEATTLE | ... | -122.28533406 | {'latitude': '47.54411404', 'longitude': '-122... | | | 2008-03-14 | | | | | Cancel per customer request 4/15/08 log #4507\\... |
| 5 | 3001212-LU | Single Family/Duplex | Residential | Master Use Permit | | PROJECT CANCELLED 12/8/2010 -- This short plat... | | Canceled | 6519 S BANGOR ST | SEATTLE | ... | -122.25172068 | {'latitude': '47.50588981', 'longitude': '-122... | | | | | | | | PROJECT CANCELLED 12/8/2010 -- This short plat... |
| 6 | 3001242-LU | Commercial | Non-Residential | Master Use Permit | | Cancelled 10/20/2010 Council land use action... | | Canceled | 1400 S DEARBORN ST | SEATTLE | ... | -122.31388861 | {'latitude': '47.59614654', 'longitude': '-122... | 0.0 | 565.0 | 2006-07-28 | | | | | Cancelled 10/20/2010 Council land use action... |
| 7 | 3001244-LU | Single Family/Duplex | Residential | Master Use Permit | | Land Use Application to install a 10 ft. by 18... | 5000.0 | Completed | 1115 BROADWAY E | SEATTLE | ... | -122.32161841 | {'latitude': '47.6291712', 'longitude': '-122.... | 0.0 | 0.0 | 2013-08-06 | 2013-12-24 | 2016-11-26 | 2013-11-12 | | Land Use Application to install a 10 ft. by 18... |
| 8 | 3001249-LU | Single Family/Duplex | Residential | Master Use Permit | | Land use permit to subdivide 1 parcel into 3 p... | | Completed | 6349 21ST AVE SW | SEATTLE | ... | -122.35999977 | {'latitude': '47.54505253', 'longitude': '-122... | | | 2006-02-09 | 2006-06-07 | 2007-12-07 | 2006-05-08 | | Land use permit to subdivide 1 parcel into 3 p... |
| 9 | 3001271-LU | Single Family/Duplex | Residential | Master Use Permit | | Land Use Permit to adjust the boundary between... | | Completed | 4226 1ST AVE NW | SEATTLE | ... | -122.35692862 | {'latitude': '47.65850007', 'longitude': '-122... | 0.0 | 0.0 | 2005-12-16 | 2006-05-15 | 2007-11-15 | 2006-05-10 | | Land Use Permit to adjust the boundary between... |


### Saving the file
Let's stop here for the moment. But in the next lecture, we'll come back to this dataset. Rather than request the data again from the server, we can save it locally as a file.

If you type `smalldf.to_` and tab complete, you'll see options for lots of different file formats. Typically, we'll use `to_csv()` (a plain text .csv format), or `to_pickle()` (which saves the `pandas` object itself). The .csv format is more reliable for longer-term storage and for sharing information, but it is slower and loses some of the properties of the DataFrame, such as the column data types.

So let's save the `pandas` object to the `scratch` folder in your repository. The `..` means "go back one level," i.e. to the enclosing folder. 

In [27]:
smalldf.to_pickle('../scratch/Seattle_permits.pandas')
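To see what "loses some of the properties" means in practice, the sketch below saves a tiny made-up DataFrame in both formats and reads each back. It writes to a temporary folder rather than `scratch`, and the names `toy`, `from_pickle` and `from_csv` are just for illustration.

```python
import os
import tempfile
import pandas as pd

# A one-row stand-in for smalldf, with a dictionary in the link column
toy = pd.DataFrame({
    'permitnum': ['0000001-LU'],
    'link': [{'url': 'https://example.com/1'}],
})

with tempfile.TemporaryDirectory() as folder:
    pickle_path = os.path.join(folder, 'toy.pkl')
    csv_path = os.path.join(folder, 'toy.csv')

    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    # Read both files back in before the temporary folder is deleted
    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

print(type(from_pickle.loc[0, 'link']))   # still a dictionary
print(type(from_csv.loc[0, 'link']))      # now a plain string
```

After the round trip through .csv, the dictionary has become its text representation, so `['url']` would no longer work on it without further parsing.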

<div class="alert alert-block alert-info">
<strong>Let's generalize.</strong> What did we do here?
    
1. We obtained the URL for each page to scrape. (Here, it was given to us in the city's data file, but sometimes we'll have to reverse-engineer the composition of the URL.)
2. We examined a sample page, and identified the html tags that enclose the data we wanted to extract.
3. We wrote a function that pulled out the data for a specific page.
4. We applied that function to each URL / page. Since our URLs were in a pandas DataFrame, we could use the pandas <strong>apply</strong> method.
    
Every scraping project will pose different challenges, but normally it will involve each of these four steps.
</div>
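As an illustration of step 1, the links in the Seattle data appear to follow a simple pattern: a fixed address followed by `altId=` and the permit number. If that pattern holds (an assumption based only on the links printed earlier), the URL could be rebuilt from the `permitnum` column alone. The function name `build_permit_url` is made up for this sketch.

```python
from urllib.parse import urlencode

# The fixed part of the address, copied from the links seen earlier
BASE_URL = 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx'

def build_permit_url(permitnum):
    """Compose the permit page URL from a permit number.

    Assumes every permit page follows the altId=<permit number> pattern.
    """
    # urlencode turns the dictionary into 'altId=...' and escapes unsafe characters
    return BASE_URL + '?' + urlencode({'altId': permitnum})

print(build_permit_url('3001064-EG'))
```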