## Module 2 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Scraping permit data
Here's the code that we saw in the video lecture that queries the City of Seattle permit website, gets a dataframe of permits (including the URL), and then digs down further into that permit-specific URL.

In [85]:
# get the permit data from the API
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))

df = df.head(5) # get the first 5 rows, so we don't overload the city's website.

# get an example link
permiturl = df.loc[0,'link']['url']
print(permiturl)

# request that page and get the soup object
r = requests.get(permiturl)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.prettify())

https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" ng-app="appAca" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head id="ctl00_Head1">
  <link href="../App_Themes/Default/_progressbar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/breadcrumb.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/Calendar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/font.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/form.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/grid.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/layout.css" rel="stylesheet" type="text/css"/>
  <link href="../App

In [86]:
# then we wrote this code to extract the project description 
links = soup.find_all('td')
for link in links:
    if 'Project Description' in link.text: 
        sublinks = link.find_all('td')
        description = sublinks[1].text
        # once we find a description, we exit
        break
    
print(description)

PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you look at the example, there is a <strong>Legal Description</strong> section. Extract that to a variable and print it.
</div>

In [87]:
# adapt the same approach to extract the legal description
links = soup.find_all('tr')
for link in links:
    if 'Legal Description' in link.text: 
        sublinks = link.find_all('td')
        description = sublinks[5].text
        # once we find a description, we exit
        break
    
print(description)



Development Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)
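
The extracted text still carries the parcel label and surrounding whitespace. As a minimal sketch, assuming the text always contains the literal label `Legal Description:`, the cleanup could look like this. The helper name `clean_legal` is hypothetical, and the sample string is copied from the output above.

```python
def clean_legal(raw_text, label="Legal Description:"):
    """Return the text that follows the label, or None if the label is absent."""
    before, found, after = raw_text.partition(label)
    if not found:
        return None
    # collapse runs of whitespace and trim the ends
    return " ".join(after.split())


# sample copied from the scraped output above
raw_text = (
    "\n\nDevelopment Site Parcel:DV0001932Legal Description:PAR A, LBA #8806752, "
    "THT POR BLK 13, KINNEARS 1ST RAINIER BEACH ADD, AND POR TRC A, BLK 2, "
    "GUTHERIES TERRACE PARK, AND POR SE 1/4 (FILE)\n\n"
)
legal = clean_legal(raw_text)
print(legal)
```

Returning `None` when the label is missing makes it easy to spot pages that do not follow the expected layout.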




<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Now turn that into a function that you can apply to each row of your dataframe. Add a new column, <strong>legal_description</strong>, to your dataframe.
</div>

In [88]:
# wrap the same logic in a function that we can apply to each row
def get_legal(urldict):
    """Return the Legal Description text for a permit, or None if it is not found."""
    permiturl = urldict['url']

    # request that page and get the soup object
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, 'html.parser')
    rows = soup.find_all('tr')
    for row in rows:
        if 'Legal Description' in row.text:
            cells = row.find_all('td')
            # once we find a description, we return it and stop looking
            return cells[5].text
    # no row on this page mentions a legal description
    return None

df['legal_description'] = df.link.apply(get_legal)

### Fixing errors
We'll do more scraping in just a moment. But first, let's do some examples of how to interpret an error message, and fix it.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Each of the cells below will generate an error. Look at the error message and see if you can figure out how to fix it. (Don't Google it until you try to figure it out based on the error message.)
</div>

In [89]:
# the housingunitsremoved and housingunitsadded give useful information
# let's create a new column with netunits
df['netunits'] = df.housingunitsadded.astype(float) - df.housingunitsremoved.astype(float)

In [90]:
# print the address of the first row
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1, df.iloc[0].permitclass))

Address of first row is 6519 S BANGOR ST. Permit type is Single Family/Duplex


In [91]:
df['housingunitsadded']=df.housingunitsadded.astype(float)

In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   permitnum              5 non-null      object 
 1   permitclass            5 non-null      object 
 2   permitclassmapped      5 non-null      object 
 3   permittypemapped       5 non-null      object 
 4   description            5 non-null      object 
 5   statuscurrent          5 non-null      object 
 6   originaladdress1       5 non-null      object 
 7   originalcity           5 non-null      object 
 8   originalstate          5 non-null      object 
 9   originalzip            5 non-null      object 
 10  link                   5 non-null      object 
 11  latitude               5 non-null      object 
 12  longitude              5 non-null      object 
 13  location1              5 non-null      object 
 14  housingunitsremoved    1 non-null      object 
 15  housinguni

In [93]:
import math

In [94]:
math.isnan(df.housingunitsadded[1])

False

In [100]:
# Convert the number of housing units to integers
# and then summarize
import numpy as np
df['housingunitsadded'] = df.housingunitsadded.astype(float)
df['unitsadded_numeric'] = None
for x in range(df.shape[0]):
    if not np.isnan(df.housingunitsadded[x]):
        # use .loc to assign to the dataframe itself rather than to a copy,
        # which avoids the SettingWithCopyWarning
        df.loc[x, 'unitsadded_numeric'] = int(df.housingunitsadded[x])
df.unitsadded_numeric.describe()

count     1
unique    1
top       0
freq      1
Name: unitsadded_numeric, dtype: int64
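
The row-by-row loop can usually be replaced by a vectorized conversion. The sketch below is one possible approach, not the method from the lecture. It uses a small made-up dataframe (`sample`) in place of the permit data and assumes the column holds numeric strings with some missing values.

```python
import pandas as pd

# hypothetical stand-in for the permit column: numeric strings with gaps
sample = pd.DataFrame({"housingunitsadded": ["0", None, "2", None, None]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
units = pd.to_numeric(sample["housingunitsadded"], errors="coerce")

# "Int64" (capital I) is the pandas nullable integer dtype,
# so missing values stay missing instead of forcing the column to float
sample["unitsadded_numeric"] = units.astype("Int64")

summary = sample["unitsadded_numeric"].describe()
print(summary)
```

Because the column is now numeric, `describe()` reports statistics such as the mean and quartiles rather than the `unique`/`top`/`freq` summary shown for object columns.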

### Scraping craigslist

In the lecture, we saw how to scrape the main page (the list of posts).

What if you want to get more information about (say) a particular apartment?

Go to the [craigslist housing page](https://losangeles.craigslist.org/search/apa#search=1~gallery~0~0) and copy the link for one of the listings. It should look something like this:
https://losangeles.craigslist.org/lgb/apa/d/long-beach-home-for-rent/7597309102.html

(It's fine to copy and paste the URL for now. A second step would be to loop over the URLs from the dataframe of postings that we created in the video lecture, but in class, we'll just focus on one example.)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> For this URL, use requests to get the content of the post. (No need to create a soup object yet.)
</div>

In [None]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

# your code here
# put the output of the request in a variable called r
# so you can access the content like this
print(r.content)

Now let's extract more information from the page. We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that will return True if Section 8 is mentioned, otherwise False.
</div>

*Hint*: the `in` operator is a simple way to do this. For example:

In [None]:
'plan' in 'urban planning'

In [None]:
'plan' in 'Urban Planning' 

In [None]:
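# your code here to return Section 8 information
description = r.text  # decoded text of the response (r.content is raw bytes)

def sec8find(text):
    """Return True if the text mentions Section 8 (case-insensitive), otherwise False."""
    return 'section 8' in text.lower()

sec8find(description)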
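
To go one step further and distinguish "No Section 8" from "Section 8 welcome", a regular expression can look for a negation shortly before the phrase. This is a rough heuristic sketch. The function name `section8_stance`, the three labels, and the 30-character window are all assumptions made for illustration, and real posts will contain phrasings that it misses.

```python
import re

# any mention, allowing for "Section8" or "section  8"
MENTION = re.compile(r"section\s*8", re.IGNORECASE)

# "no" or "not" followed, within the same sentence and at most 30 characters, by the phrase
REJECTION = re.compile(r"\b(?:no|not)\b[^.!?]{0,30}?section\s*8", re.IGNORECASE)


def section8_stance(text):
    """Classify a post as 'not mentioned', 'rejected', or 'mentioned'."""
    if not MENTION.search(text):
        return "not mentioned"
    if REJECTION.search(text):
        return "rejected"
    return "mentioned"


for example in ["No Section 8", "Section 8 welcome", "Close to the Metro"]:
    print(example, "->", section8_stance(example))
```

The `\b` word boundaries keep words such as "nothing" from being read as a negation.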


Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exceptions are (i) parking, and (ii) the geographic coordinates.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that returns True if the apartment has no parking, and that also returns the latitude and longitude of the apartment.
</div>

*Hint*: First, create a `soup` object. Then, look and see what tag and class encloses this information. Then, you can experiment with `find` and `find_all` with this tag and class.

In [None]:
# your code here
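
One way to approach this is sketched below on a small inline HTML string, so it runs without a network request. The markup is a simplified, hypothetical stand-in. It assumes the coordinates sit in `data-latitude` and `data-longitude` attributes on an element with id `map`, and that listing attributes are `<span>` tags inside an element with class `attrgroup`. Inspect the real page to confirm the actual tags and classes before relying on this.

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for a listing page; the real markup may differ
SAMPLE_HTML = """
<html><body>
  <div id="map" data-latitude="33.7701" data-longitude="-118.1937"></div>
  <p class="attrgroup"><span>2BR / 1Ba</span><span>no parking</span></p>
</body></html>
"""


def parking_and_location(html):
    """Return (no_parking, lat, lon); lat and lon are None if no coordinates are found."""
    soup = BeautifulSoup(html, "html.parser")

    # assumed: each attribute is a <span> inside an element with class "attrgroup"
    attributes = [
        span.get_text(strip=True).lower()
        for group in soup.find_all(class_="attrgroup")
        for span in group.find_all("span")
    ]
    no_parking = "no parking" in attributes

    # assumed: coordinates are data-* attributes on the element with id "map"
    map_tag = soup.find(id="map")
    if map_tag is None:
        return no_parking, None, None
    lat = map_tag.get("data-latitude")
    lon = map_tag.get("data-longitude")
    if lat is None or lon is None:
        return no_parking, None, None
    return no_parking, float(lat), float(lon)


no_parking, lat, lon = parking_and_location(SAMPLE_HTML)
print(no_parking, lat, lon)
```

With a real post, pass `r.text` to the function instead of `SAMPLE_HTML`.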

Now that you've written this code, the next step is to package it in a function that you can apply to all the URLs in your dataframe of posts (like the one we created in the video lecture).
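
A minimal sketch of that packaging step follows. The helper `scrape_posts`, its parameters, and the column name `url` are hypothetical. Fetching and parsing are passed in as functions, so the loop can be tried offline with made-up pages. With real posts, `fetch_html` could be something like `lambda url: requests.get(url, timeout=10).text`.

```python
import time

import pandas as pd


def scrape_posts(posts, parse_html, fetch_html, url_column="url", delay_seconds=1.0):
    """Fetch and parse every URL in posts[url_column]; return the results as a Series."""
    results = []
    for url in posts[url_column]:
        try:
            results.append(parse_html(fetch_html(url)))
        except Exception as error:
            # keep going if a single post fails, and record a missing value
            print(f"Skipping {url}: {error}")
            results.append(None)
        # pause between requests so we don't overload the website
        time.sleep(delay_seconds)
    return pd.Series(results, index=posts.index)


# offline stand-ins for real pages (made-up URLs and text)
fake_pages = {
    "https://example.com/a": "Section 8 welcome",
    "https://example.com/b": "No pets",
}
posts = pd.DataFrame({"url": list(fake_pages)})
posts["mentions_section8"] = scrape_posts(
    posts,
    parse_html=lambda html: "section 8" in html.lower(),
    fetch_html=fake_pages.__getitem__,
    delay_seconds=0,
)
print(posts)
```

Separating the fetch from the parse also means that the parsing function can be tested on saved HTML without hitting the site again.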

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain confidence in experimenting with code: exploring different objects, writing functions, and so on.</li>
  <li>Learn how to extract information from a scraped webpage, that is, how to do the detective work.</li>
  <li>Gain confidence in debugging errors.</li>
</ul>
</div>