## Module 2 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Scraping permit data
Here's the code that we saw in the video lecture that queries the City of Seattle permit website, gets a dataframe of permits (including the URL), and then digs down further into that permit-specific URL.

In [None]:
# get the permit data from the API
import json
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))

df = df.head(20) # keep only the first 20 rows, so we don't overload the city's website.

# get an example link
permiturl = df.loc[8,'link']['url']
print(permiturl)

# request that page and get the soup object
r = requests.get(permiturl)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
#print(soup.prettify())

In [None]:
# then we wrote this code to extract the project description
description = None  # stays None if no description is found
links = soup.find_all('td')
for link in links:
    if 'Project Description' in link.text:
        sublinks = link.find_all('td')
        description = sublinks[1].text
        # once we find a description, we exit
        break

print(description)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> If you look at the example, there may be a section giving information on the number of curb cuts. Extract that to a variable and print it.
</div>

In [None]:
# Hints
# Not all of the records have the curb cut field. You can see an example here:
# https://services.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001776-LU

# If you look at that webpage, you'll see that the text "Number of Curb Cuts for This Permit: "
# is within "span" tags
curbcuttext = soup.find("span", string="Number of Curb Cuts for This Permit: ")

# So to get the number of curb cuts (which is the next piece of text), 
# you can ask for the NEXT tag using find_next()
#n_curbcuts = curbcuttext.find_next()

# you'll need to add an if statement to deal with the case when this text does not exist
    
# your code here
# all I did was extract the text, and add the if statement
if curbcuttext is None:
    n_curbcuts = np.nan
else:
    n_curbcuts = int(curbcuttext.find_next().text)
print(n_curbcuts)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Now turn that into a function that you can apply to each row of your dataframe. Add a new column, <strong>n_curbcuts</strong>, to your dataframe.
</div>

In [None]:
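# your code here

# I just copied and pasted the code above
# and indented it into a function
def get_curbcuts(urldict):
    """Return the number of curb cuts for a permit, or NaN if the field is absent."""
    permiturl = urldict['url']

    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, 'html.parser')

    curbcuttext = soup.find("span", string="Number of Curb Cuts for This Permit: ")

    if curbcuttext is None:
        n_curbcuts = np.nan
    else:
        n_curbcuts = int(curbcuttext.find_next().text)

    print(n_curbcuts)
    return n_curbcuts

# test it on one row
get_curbcuts(df.loc[8,'link'])

# then apply it to every row to create the new column
# (this makes one request per row, and assumes every row has a link)
df['n_curbcuts'] = df['link'].apply(get_curbcuts)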


### Fixing errors
We'll do more scraping in just a moment. But first, let's do some examples of how to interpret an error message, and fix it.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Each of the cells below will generate an error. Look at the error message and see if you can figure out how to fix it. (Don't Google it until you try to figure it out based on the error message.)
</div>

In [None]:
# the housingunitsremoved and housingunitsadded give useful information
# let's create a new column with netunits
df['netunits'] = df.housingunitsadded - df.housingunitsremoved

In [None]:
# SOLUTION: the columns hold strings rather than numbers, so we need to convert them to floats first
df['netunits'] = df.housingunitsadded.astype(float) - df.housingunitsremoved.astype(float)
df['netunits']

In [None]:
# print the address of the first row
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1))

In [None]:
# SOLUTION: We had two placeholders {} but only one variable to insert into them
# We could delete one of the {} or add a second argument to the format()
print('Address of first row is {}. Permit type is'.format(df.iloc[0].originaladdress1))
print('Address of first row is {}. Permit type is {}'.format(df.iloc[0].originaladdress1, df.iloc[0].permitclass))
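
An f-string is another way to avoid this kind of mismatch, because each placeholder names the value it displays. The sketch below reproduces the error and then shows the alternative. The values `address` and `permit_class` are made-up stand-ins for the dataframe fields, not real permit data.

```python
# Hypothetical stand-ins for df.iloc[0].originaladdress1 and df.iloc[0].permitclass
address = '123 Example St'
permit_class = 'Residential'

# Two placeholders but only one argument: format() raises an IndexError
error_name = None
try:
    'Address of first row is {}. Permit type is {}'.format(address)
except IndexError as e:
    error_name = type(e).__name__
    print(error_name, '-', e)

# With an f-string, each placeholder names the variable it displays
message = f'Address of first row is {address}. Permit type is {permit_class}'
print(message)
```

Either style works. The f-string simply makes it harder to end up with more placeholders than values.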

In [None]:
# Convert the number of housing units to integers
# and then summarize

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe(

In [None]:
# SOLUTION: the first problem was our missing closing parenthesis
# (this cell still raises an error, which we fix in the next cell)

df['unitsadded_numeric'] = df.housingunitsadded.astype(int)
df.unitsadded_numeric.describe()

In [None]:
# our second problem was the data type. An integer type cannot hold NaN
# so we do float
df['unitsadded_numeric'] = df.housingunitsadded.astype(float)
df.unitsadded_numeric.describe()
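
The two data type errors above can be reproduced in plain Python, which may make the error messages easier to read. This is a minimal sketch with made-up values (the lists `added` and `removed` are hypothetical, not the permit data), assuming the numbers arrive as strings and a missing value is represented by NaN.

```python
import math

# Hypothetical sample values: numbers stored as strings, plus one missing value
added = ['3', '1', float('nan')]
removed = ['1', '0', '0']

# Subtracting one string from another raises a TypeError
str_error = None
try:
    added[0] - removed[0]
except TypeError as e:
    str_error = str(e)
    print('TypeError:', str_error)

# Converting to float first makes the subtraction work (NaN stays NaN)
net = [float(a) - float(r) for a, r in zip(added, removed)]
print(net)

# NaN is a float, and it has no integer equivalent
nan_error = None
try:
    int(float('nan'))
except ValueError as e:
    nan_error = str(e)
    print('ValueError:', nan_error)
```

That is why the solution uses `astype(float)` rather than `astype(int)`: a float column can hold the missing values.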

### Scraping craigslist

In the lecture, we saw how to scrape the main page (the list of posts).

What if you want to get more information about (say) a particular apartment?

Go to the [craigslist housing page](https://losangeles.craigslist.org/search/apa#search=1~gallery~0~0) and copy the link for one of the listings. It should look something like this:
https://losangeles.craigslist.org/lgb/apa/d/long-beach-home-for-rent/7597309102.html

(It's fine to copy and paste the URL for now. A second step would be to loop over the URLs from the dataframe of postings that we created in the video lecture, but in class, we'll just focus on one example.)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> For this URL, use requests to get the content of the post. (No need to create a soup object yet.)
</div>

In [None]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

# your code here
# put the output of the request in a variable called r
# so you can access the content like this
url = 'https://losangeles.craigslist.org/ant/apa/d/valencia-charming-br-ba-with-1000-off/7838334390.html'
r = requests.get(url)
print(r.content)

Now let's extract more information from the page. We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that will return True if Section 8 is mentioned, otherwise False.
</div>

*Hint*: the `in` operator is a simple way to do this. For example:

In [None]:
'plan' in 'urban planning'

In [None]:
'plan' in 'Urban Planning' 

In [None]:
# your code here to return Section 8 information

# we can use the same approach to see 
# if a string is in the text that we retrieved via requests
# note the use of lower() to avoid case sensitivity
'section 8' in r.text.lower()

In [None]:
# but be careful: this returns True if the text appears anywhere in the page,
# not just in the body of the post
'los angeles' in r.text.lower()

In [None]:
# so let's put this in a function

def sect8(url):
    r = requests.get(url)
    return 'section 8' in r.text.lower()

# test it
sect8(url)
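
To go one step further and distinguish "No Section 8" from "Section 8 welcome", one option is a regular expression that looks for a negation word shortly before the phrase. This is a rough sketch, not a validated classifier: the function name `classify_section8`, the labels, and the list of negation words are all assumptions made for illustration.

```python
import re

def classify_section8(text):
    """Return 'not mentioned', 'refused', or 'mentioned' for a post's text."""
    text = text.lower()
    if not re.search(r'section\s*8', text):
        return 'not mentioned'
    # a negation word, then up to 40 characters in the same sentence, then "section 8"
    negative = r"\b(no|not|don't|do not|doesn't|does not)\b[^.!?]{0,40}section\s*8"
    if re.search(negative, text):
        return 'refused'
    return 'mentioned'

# Made-up examples of the kind of language a post might use
for sample in ['No Section 8', 'Section 8 welcome', 'Sunny apartment near the metro']:
    print(repr(sample), '->', classify_section8(sample))
```

A pattern like this will misread some posts (for example, "no pets, Section 8 OK" would count as a refusal), so it is worth reading a sample of the matches before trusting the results.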

Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exceptions are (i) parking, and (ii) the geographic coordinates.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Write a function that returns True if the apartment has no parking, and that also returns the lat/lon of the apartment.
</div>

*Hint*: First, create a `soup` object. Then, look and see what tag and class encloses this information. Then, you can experiment with `find` and `find_all` with this tag and class.

In [None]:
# your code here

# replace with your own url
url = 'https://losangeles.craigslist.org/ant/apa/d/valencia-charming-br-ba-with-1000-off/7838334390.html'

# get a soup object
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.prettify())

In [None]:
# in my example, the webpage said "carport" in the right hand panel
# I did a CTRL-F, and this was the result

#<div class="attrgroup">
#       <div class="attr rent_period">
#        <span class="labl">
# ...
#       <div class="attr">
#        <span class="valu">
#         <a href="//losangeles.craigslist.org/search/apa?parking=1">
#          carport
#         </a>
#        </span>
#       </div>


# so it looks like we want the class valu
links = soup.find_all(class_='valu')

In [None]:
# I see that the result was a list, with the carport in a href which includes "parking"
links


In [None]:
# so I can loop over those links, and find the one that includes "parking" in href
for link in links:
    a = link.find('a')
    # some entries may not contain a link, so check before reading href
    if a is not None and 'parking' in a.get('href', ''):
        print(link.text)


# A simpler way would be to search for "no parking" in r.text! 

In [None]:
# what about lat and lon?
# I found this in <div id="map" class="viewposting"
links = soup.find_all('div', class_='viewposting')
links

In [None]:
# it's in a list of length 1
# we could get this via link = links[0]
# or use find (which gets the first instance) rather than find_all
link = soup.find('div', class_='viewposting')
link

In [None]:
# A tag's attributes can be read like the keys of a dictionary! (The values are strings.)
lat = link['data-latitude']
lon = link['data-longitude']
lat, lon

Now that you've written this code, a next step would be to package it in a function that you can apply to all the URLs in your dataframe of posts (like the one we created in the video lecture).
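
One way to structure that next step is a small helper that fetches and parses each URL in turn, pauses between requests, and records `None` when a page fails. This is only a sketch under stated assumptions: `scrape_many`, its parameters, and the example URLs are hypothetical, and the demonstration uses a dictionary of fake pages rather than real requests so that it runs offline.

```python
import time

def scrape_many(urls, fetch, parse, delay=1.0):
    """Fetch and parse each URL, pausing `delay` seconds between requests.

    fetch: function that takes a URL and returns the page text
    parse: function that takes the page text and returns the value we want
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite to the website
        try:
            results.append(parse(fetch(url)))
        except Exception as e:
            # keep going if one page fails, and record a missing value
            print(f'Skipping {url}: {e}')
            results.append(None)
    return results

# Fake pages standing in for real posts (hypothetical URLs and text)
fake_pages = {
    'https://example.com/post1': 'Carport. Section 8 welcome.',
    'https://example.com/post2': 'Street parking only.',
}
urls = list(fake_pages) + ['https://example.com/missing']

mentions = scrape_many(urls, fetch=fake_pages.__getitem__,
                       parse=lambda text: 'section 8' in text.lower(), delay=0)
print(mentions)
```

With real posts, `fetch` could be something like `lambda u: requests.get(u).text`, and the returned list could be assigned to a new dataframe column.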

## Large language models [Optional]

Large language models (LLMs) such as ChatGPT can also be accessed via an API.

The APIs are changing very rapidly, as are the pricing structures. For now, some LLMs are offered for free, for limited use. One such model with a free tier is Gemini, by Google.

If you'd like to experiment with Gemini, you need to [get an API key here](https://aistudio.google.com/u/1/apikey). No credit card is necessary, but your UCLA Google account won't work - you need a personal Google account.

[The documentation and some examples are here](https://github.com/googleapis/python-genai).

The other challenge with Gemini is that its Python library is incompatible with several other libraries, including some that we use elsewhere in the course. So you will need to create a new environment in Anaconda, using the same setup process (importing an environment) as you did at the [start of the course](https://urbandatascience.its.ucla.edu/getting-started/).

Specifically:
- In Anaconda, go to the Environments tab
- Click on Import
- Choose `google-genai-env.yml` under Local Drive, and call the environment `genai`
- Manually add the `google-genai` package (choose "Not Installed", type `google-genai` into the search bar, select the checkbox, and then click Apply at the bottom of the screen)
- Anaconda will take some time; once it is ready, click Apply again to install
- Close this notebook and open it again after you switch to your new `genai` environment

All set up? Let's look at a simple example: passing a query to the chat interface.

In [None]:
from google import genai  # if this doesn't load, you probably have the wrong environment
gemini_api_key = 'XXXX'  # fill in your key here

c = genai.Client(api_key=gemini_api_key)
chat = c.chats.create(model='gemini-2.0-flash-001')
response = chat.send_message('What do urban planners need to learn about gen AI?')
print(response.text)

How might this be used in web scraping? 

Well, perhaps you can ask it to parse the text.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Pass the text of the Craigslist post to Gemini, and ask it to return the number of parking spaces (if any). Add the result to your dataframe. NOTE: This exercise is optional (you might not want to create a Google account to get an API key).
</div>

In [None]:
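# your code here
c = genai.Client(api_key=gemini_api_key)
chat = c.chats.create(model='gemini-2.0-flash-001')
url = 'https://losangeles.craigslist.org/ant/apa/d/valencia-charming-br-ba-with-1000-off/7838334390.html'

# pass the text of the post itself: given only a URL, the model may not be able to read the page
# (this assumes requests and bs4 are available in the genai environment)
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
post_text = BeautifulSoup(r.text, 'html.parser').get_text(' ', strip=True)

response = chat.send_message('Does this Craigslist posting have parking? If so, which type? Posting text: ' + post_text)
print(response.text)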
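
A model's reply is free-form text, so it usually needs some cleaning before it can go into a dataframe column. Here is a minimal sketch, assuming the prompt asks for the number of parking spaces. The helper `parse_space_count` is hypothetical, and the sample replies are invented rather than real Gemini output.

```python
import re

def parse_space_count(reply):
    """Return the first whole number in a model's reply, or None if there is none."""
    match = re.search(r'\d+', reply)
    return int(match.group()) if match else None

# Made-up replies, to show the range of formats a model might return
samples = ['2', 'The post mentions 1 carport space.', 'No parking is mentioned.']
for reply in samples:
    print(repr(reply), '->', parse_space_count(reply))
```

Because this takes the first number it finds, a reply that mentions another number first would be misread. Asking the model to answer with a single number may make this step more reliable, but it is still worth checking for `None`.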

We won't do it here, but [you can also experiment with the MetaAI API](https://github.com/Strvm/meta-ai-api). The advantage is that it doesn't need an API key, but its capabilities are a bit more limited. The Meta AI library is also installed in the `genai` environment.

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain confidence in experimenting with code: exploring different objects, writing functions, and so on.</li>
  <li>Learn how to extract information from a scraped webpage, that is, how to do the detective work.</li>
  <li>Gain confidence in debugging errors.</li>
  <li>Learn how to integrate Large Language Models into Python.</li>
</ul>
</div>