## Module 2 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Scraping permit data
Here's the code that we saw in the video lecture that queries the City of Seattle permit website, gets a dataframe of permits (including the URL), and then digs down further into that permit-specific URL.

In [1]:
# get the permit data from the API
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))

df = df.head(5) # get the first 5 rows, so we don't overload the city's website.

# get an example link
permiturl = df.loc[0,'link']['url']
print(permiturl)

# request that page and get the soup object
r = requests.get(permiturl)
soup = BeautifulSoup(r.text, features='html.parser')  # name the parser explicitly

https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU
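
The `link` column holds a dictionary rather than a plain string, which is why the lookup above ends with `['url']`. Here is a minimal sketch of that structure. It uses a hand-written, abbreviated one-record stand-in for the API response (`sample_text` is illustrative, not the full record):

```python
import json
import pandas as pd

# Hand-written stand-in for the API response: one abbreviated record.
sample_text = '''[
  {"permitnum": "3001212-LU",
   "link": {"url": "https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU"}}
]'''

sample_df = pd.DataFrame(json.loads(sample_text))

# Each cell in the 'link' column is a dict, so we index it with the 'url' key.
link_cell = sample_df.loc[0, 'link']
print(type(link_cell))
print(link_cell['url'])
```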


In [2]:
#print(soup.prettify())
#class="MoreDetail_ItemCol MoreDetail_ItemCol2"

In [3]:
# then we wrote this code to extract the project description 
links = soup.find_all('td')
for link in links:
    if 'Project Description' in link.text: 
        sublinks = link.find_all('td')
        description = sublinks[1].text
        # once we find a description, we exit
        break
    
print(description)

PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you look at the example, there is a <strong>Legal Description</strong> section. Extract that to a variable and print it.
</div>

In [4]:
# your code here
for link in links:
    if 'Legal Description' in link.text: 
        sublinks = link.find_all('td')
        legal_description = sublinks[1].text
        # once we find the legal description, we exit
        break
    
print(legal_description)




GROUND DISTURBANCE

Land Disturbing Activity: 


Yes
 PERMIT APPLICATION COMMON

Where on your property are you working?: 


IMPORTED FROM MASTER TRACKER


Choose the Primary Property Use: 


Single Family/Duplex
 PERMIT TRACKING COMMON

Review Level: 


Full C





<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Now turn that into a function that you can apply to each row of your dataframe. Add a new column, <strong>legal_description</strong>, to your dataframe.
</div>

In [5]:
# your code here
def get_legal(urldict):
    permiturl = urldict['url']
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, features='html.parser')
    tds = soup.find_all('td')
    for td in tds:
        if 'Legal Description' in td.text: 
            tds2 = td.find_all('td')
            description = tds2[1]
            # once we find a description, we return it and exit the function
            return description.text 
    
    return '' # if we don't find it, return an empty string

# Now let's apply this function to the first link in our dataframe
urldict = df.loc[0,'link']
get_legal(urldict)

'\n\nGROUND DISTURBANCE\n\nLand Disturbing Activity: \n\n\nYes\n\xa0PERMIT APPLICATION COMMON\n\nWhere on your property are you working?: \n\n\nIMPORTED FROM MASTER TRACKER\n\n\nChoose the Primary Property Use: \n\n\nSingle Family/Duplex\n\xa0PERMIT TRACKING COMMON\n\nReview Level: \n\n\nFull C\n\n\n'
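
The returned string is full of newlines and non-breaking spaces (`\xa0`). One possible clean-up, sketched here on an excerpt of the output above, is to split on whitespace and re-join with single spaces. The helper name `tidy` is made up for this example:

```python
# Excerpt of the raw string returned by get_legal.
raw = '\n\nGROUND DISTURBANCE\n\nLand Disturbing Activity: \n\n\nYes\n\xa0PERMIT APPLICATION COMMON\n\n'

def tidy(text):
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines and non-breaking spaces, and drops empty pieces.
    return ' '.join(text.split())

cleaned = tidy(raw)
print(cleaned)
```

If you want this applied to every row, you could call `tidy` on the value just before `get_legal` returns it.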

In [6]:
descriptions = df['link'].apply(get_legal)
descriptions

0    \n\nGROUND DISTURBANCE\n\nLand Disturbing Acti...
1    \n\nGROUND DISTURBANCE\n\nLand Disturbing Acti...
2    \n\nTenant Relocation Assistance\n\nResidentia...
3    \n\nGROUND DISTURBANCE\n\nLand Disturbing Acti...
4    \n\nTenant Relocation Assistance\n\nResidentia...
Name: link, dtype: object
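
The exercise asks for a new column, so the Series returned by `apply` still needs to be assigned back to the dataframe. Here is a minimal sketch of that step. It uses a hypothetical stand-in function (`fake_get_legal`) in place of `get_legal`, and a made-up two-row dataframe, so that it runs without hitting the city's website:

```python
import pandas as pd

# Tiny made-up dataframe with the same nested 'link' structure.
demo = pd.DataFrame({
    'permitnum': ['A-1', 'B-2'],
    'link': [{'url': 'https://example.com/A-1'}, {'url': 'https://example.com/B-2'}],
})

def fake_get_legal(urldict):
    # Hypothetical placeholder: the real get_legal requests and parses the page.
    return 'legal text for ' + urldict['url']

# apply() calls the function once per row; assigning the result creates the column.
demo['legal_description'] = demo['link'].apply(fake_get_legal)
print(demo[['permitnum', 'legal_description']])
```

With the real data, the equivalent line would be `df['legal_description'] = df['link'].apply(get_legal)`.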

In [7]:
df.head()

|  | permitnum | permitclass | permitclassmapped | permittypemapped | description | statuscurrent | originaladdress1 | originalcity | originalstate | originalzip | ... | location1 | housingunitsremoved | housingunitsadded | applieddate | issueddate | expiresdate | decisiondate | permittypedesc | contractorcompanyname | estprojectcost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3001212-LU | Single Family/Duplex | Residential | Master Use Permit | PROJECT CANCELLED 12/8/2010 -- This short plat... | Canceled | 6519 S BANGOR ST | SEATTLE | WA | 98178 | ... | {'latitude': '47.50588981', 'longitude': '-122... |  |  |  |  |  |  |  |  |  |
| 1 | 3001271-LU | Single Family/Duplex | Residential | Master Use Permit | Land Use Permit to adjust the boundary between... | Completed | 4226 1ST AVE NW | SEATTLE | WA | 98107 | ... | {'latitude': '47.65850007', 'longitude': '-122... | 0.0 | 0.0 | 2005-12-16 | 2006-05-15 | 2007-11-15 | 2006-05-10 |  |  |  |
| 2 | 3001310-LU | Single Family/Duplex | Residential | Master Use Permit | Land use application to adjust the boundary be... | Completed | 941 23RD AVE S | SEATTLE | WA | 98144 | ... | {'latitude': '47.59337775', 'longitude': '-122... |  |  | 2007-02-14 | 2008-08-28 | 2011-08-14 | 2008-08-13 |  |  |  |
| 3 | 3001312-LU |  |  | Master Use Permit | Cancelled due to no activity for more than 9 y... | Canceled | 3131 E MADISON ST | SEATTLE | WA | 98112 | ... | {'latitude': '47.62648852', 'longitude': '-122... |  |  |  |  |  |  |  |  |  |
| 4 | 3001440-LU | Commercial | Non-Residential | Master Use Permit | PROJECT CANCELLED 5/23/2011 -- Project On Hold... | Canceled | 9030 13TH AVE NW | SEATTLE | WA | 98117 | ... | {'latitude': '47.69516506', 'longitude': '-122... |  |  | 2005-08-12 |  |  |  |  |  |  |


### Fixing errors
We'll do more scraping in just a moment. But first, let's work through some examples of how to interpret an error message and fix it.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Each of the cells below will generate an error. Look at the error message and see if you can figure out how to fix it. (Don't Google it until you try to figure it out based on the error message.)
</div>

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   permitnum              5 non-null      object
 1   permitclass            5 non-null      object
 2   permitclassmapped      5 non-null      object
 3   permittypemapped       5 non-null      object
 4   description            5 non-null      object
 5   statuscurrent          5 non-null      object
 6   originaladdress1       5 non-null      object
 7   originalcity           5 non-null      object
 8   originalstate          5 non-null      object
 9   originalzip            5 non-null      object
 10  link                   5 non-null      object
 11  latitude               5 non-null      object
 12  longitude              5 non-null      object
 13  location1              5 non-null      object
 14  housingunitsremoved    1 non-null      object
 15  housingunitsadded      1 no

In [9]:
# the housingunitsremoved and housingunitsadded give useful information
# let's create a new column with netunits
df['netunits'] = df.housingunitsadded - df.housingunitsremoved

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [10]:
df['housingunitsadded'] = df['housingunitsadded'].astype(float)
df['housingunitsremoved'] = df['housingunitsremoved'].astype(float)
df['netunits'] = df.housingunitsadded - df.housingunitsremoved

In [11]:
df['netunits']

0    NaN
1    0.0
2    NaN
3    NaN
4    NaN
Name: netunits, dtype: float64

In [12]:
# print the address of the first row
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1))

IndexError: Replacement index 1 out of range for positional args tuple

In [13]:

print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1,df.iloc[0].permittypemapped))

Address of first row is 6519 S BANGOR ST. Permit type is Master Use Permit
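
The `IndexError` in the first attempt comes from a mismatch between placeholders and arguments: the string has two `{}` slots, but `format` received only one value. Here is a short sketch of the rule using plain strings (the values are copied from the output above), plus an f-string variant that names each value inline:

```python
address = '6519 S BANGOR ST'
permit_type = 'Master Use Permit'

template = 'Address of first row is {}. Permit type is {}'

# One argument per {} placeholder: two slots, two values.
message = template.format(address, permit_type)
print(message)

# Passing too few values raises IndexError, as in the failing cell.
try:
    template.format(address)
    raised = False
except IndexError as err:
    raised = True
    print('IndexError:', err)

# An f-string puts each value directly in its slot, so the counts cannot drift apart.
message_f = f'Address of first row is {address}. Permit type is {permit_type}'
print(message_f == message)
```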


In [14]:
# Convert the number of housing units to integers
# and then summarize
df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe(

SyntaxError: incomplete input (3128679474.py, line 4)

In [15]:
# Convert the number of housing units to integers
# and then summarize
df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe()

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [None]:
# Integers can't represent NaN, so convert the housing units to floats instead
# and then summarize
df['unitsadded_numeric'] = df.housingunitsadded.astype(float)
df.unitsadded_numeric.describe()
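
Casting to `float` works because floats can represent missing values as `NaN`, while NumPy integers cannot. If whole-number output matters, pandas also offers a nullable integer dtype, `Int64` (note the capital I). Here is a minimal sketch on a made-up Series of strings, not the permit data:

```python
import pandas as pd

# Made-up values mimicking the API: numbers stored as strings, with one missing.
units = pd.Series(['0', '2', None])

# to_numeric parses the strings; the missing entry becomes NaN (dtype float64).
as_float = pd.to_numeric(units)

# Int64 is pandas' nullable integer dtype, so the missing value survives as <NA>.
as_int = as_float.astype('Int64')

print(as_float.dtype, as_int.dtype)
print(as_int)
```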

### Scraping craigslist

In the lecture, we saw how to scrape the main page (the list of posts).

What if you want to get more information about (say) a particular apartment?

Go to the [craigslist housing page](https://losangeles.craigslist.org/search/apa#search=1~gallery~0~0) and copy the link for one of the listings. It should look something like this:
https://losangeles.craigslist.org/lgb/apa/d/long-beach-home-for-rent/7597309102.html

(It's fine to copy and paste the URL for now. A second step would be to loop over the URLs from the dataframe of postings that we created in the video lecture, but in class, we'll just focus on one example.)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> For this URL, use requests to get the content of the post. (No need to create a soup object yet.)
</div>

In [16]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

# your code here
# put the output of the request in a variable called r
url_cl = "https://losangeles.craigslist.org/lac/apa/d/los-angeles-happening-studio-for-your/7609613809.html"
r = requests.get(url_cl)

# so you can access the content like this
print(r.content)

b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="A Happening STUDIO For Your Spring! S Normandie Ave - apts/housing...">\n\t<meta name="description" content="\xec\x97\xac\xeb\xb3\xb4\xec\x84\xb8\xec\x9a\x94! Our upgrades to a wonderful STUDIO will make this doubly nice for you! Enjoy all that Ktown has to offer such as The Wiltern Theater, Metro Purple and Red Lines, H Mart, California Market,...">\n\t<meta property="og:description" content="\xec\x97\xac\xeb\xb3\xb4\xec\x84\xb8\xec\x9a\x94! Our upgrades to a wonderful STUDIO will make this doubly nice for you! Enjoy all that Ktown has to offer such as The Wiltern Theater, Metro Purple and Red Lines, H Mart, California Market,...">\n\t<meta property="og:im

Now let's extract more information from the page. We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that will return True if Section 8 is mentioned, otherwise False.
</div>

*Hint*: the `in` operator is a simple way to do this. For example:

In [17]:
'plan' in 'Urban planning' 

True

In [18]:
'plan' in 'Urban Planning' 

False

In [19]:
# your code here to return Section 8 information
description = r.text  # the decoded page text

def sec8find(text):
    # lower-case first so that 'Section 8' and 'section 8' both match
    return 'section 8' in text.lower()

if sec8find(description):
    print("listing mentions section 8")
else:
    print("listing does not mention section 8")

listing does not mention section 8
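
A True/False check cannot tell "No Section 8" apart from "Section 8 welcome". One way to go a step further is a small regular expression that looks for a negation right before the phrase. This is only a sketch: the function name `classify_section8` and the list of negation phrases are assumptions made for illustration, and real posts will need more patterns.

```python
import re

def classify_section8(text):
    """Return 'not mentioned', 'rejected', or 'mentioned'."""
    lowered = text.lower()
    if 'section 8' not in lowered:
        return 'not mentioned'
    # Assumed rule: a negation phrase directly before 'section 8' means rejection.
    if re.search(r'\b(no|not accepting|do not accept)\s+section 8', lowered):
        return 'rejected'
    return 'mentioned'

for example in ['No Section 8', 'Section 8 welcome', 'Sunny studio near Metro']:
    print(example, '->', classify_section8(example))
```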


Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exceptions are (i) parking, and (ii) the geographic coordinates.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that will return True if the apartment has no parking, and also returns the lat/lon of the apartment.
</div>

*Hint*: First, create a `soup` object. Then, look and see what tag and class encloses this information. Then, you can experiment with `find` and `find_all` with this tag and class.

class="viewposting leaflet-container leaflet-retina leaflet-safari leaflet-fade-anim leaflet-grab leaflet-touch-drag
    {"longitude":"-118.300549","@context":"http://schema.org","latitude":"34.064875","numberOfBathroomsTotal":1,"name":"A Happening STUDIO For Your Spring! S Normandie Ave","address":{"streetAddress":"541 S Normandie Ave.","addressCountry":"US","addressLocality":"Los Angeles","postalCode":"90020","addressRegion":"CA","@type":"PostalAddress"},"@type":"Apartment"}

In [22]:
# your code here
soup = BeautifulSoup(r.content, features='html.parser')
#print(soup.prettify())

def find_parking(soup):
    # scan every <span> and report the first one that mentions parking
    for span in soup.find_all('span'):
        if 'parking' in span.text:
            print("Listing mentions parking as", span)
            break

In [23]:
find_parking(soup)

Listing mentions parking as <span>off-street parking</span>


In [25]:
# the lat/lon sits in a JSON <script> block; splitting on whitespace keeps only its first chunk
geo = soup.find_all('script', id="ld_posting_data")[0].contents[0].rsplit()[0]
geo

'{"longitude":"-118.300549","@context":"http://schema.org","latitude":"34.064875","numberOfBathroomsTotal":1,"name":"A'


Now that you've written this code, a next step would be to package it in a function that you can apply to all the URLs in your dataframe of posts (like the one we created in the video lecture).
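
A minimal sketch of such a function follows. It assumes that the coordinates sit in the `ld_posting_data` script block as JSON, as in the example post. The `sample_html` page is a hypothetical, heavily simplified stand-in so the code runs without a request, and `parse_listing` is a name made up for this example. It reports whether parking is mentioned at all, which is not quite the same as "has no parking".

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a posting page.
sample_html = '''
<html><body>
  <span>off-street parking</span>
  <script id="ld_posting_data">
    {"longitude": "-118.300549", "latitude": "34.064875", "@type": "Apartment"}
  </script>
</body></html>
'''

def parse_listing(html):
    """Return (mentions_parking, latitude, longitude) for one posting page."""
    soup = BeautifulSoup(html, features='html.parser')

    # True if any <span> mentions parking.
    mentions_parking = any('parking' in span.text.lower()
                           for span in soup.find_all('span'))

    # The coordinates live in a JSON block; json.loads turns it into a dict.
    script = soup.find('script', id='ld_posting_data')
    if script is None or not script.string:
        return mentions_parking, None, None
    data = json.loads(script.string)
    return mentions_parking, float(data['latitude']), float(data['longitude'])

print(parse_listing(sample_html))
```

For a real post, you would pass in `requests.get(url).text`, and then use `apply` on the column of URLs, pausing between requests so you don't overload the site.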

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain confidence in experimenting with code - exploring different objects, writing functions, and so on</li>
  <li>Learn how to extract information from a scraped webpage - how to do the detective work.</li>
  <li>Gain confidence in debugging errors.</li>
</ul>
</div>