# Week 2. Web scraping (continued)

## Example: Scraping Craigslist data
Craigslist provides a wealth of information on apartment rentals and other types of housing (as we saw in the [Boeing and Waddell paper](https://journals.sagepub.com/doi/abs/10.1177/0739456X16664789)). But short of clicking through lots of links, how do we access it?

As with any scraping project, the first step is to get an example web page, and see if we can reverse-engineer the structure.

One option is to parse each detailed post, with information on parking, desired qualities of roommates, etc. But a lot of information is actually in the [list of posts](https://losangeles.craigslist.org/search/lac/hhh). 

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://losangeles.craigslist.org/search/lac/hhh'
r = requests.get(url)

In [None]:
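# Name the parser explicitly; otherwise BeautifulSoup guesses one and prints a warning
soup = BeautifulSoup(r.content, 'html.parser')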
print(soup.prettify())

It looks like each post is in a `<li>` tag. Moreover, note that each of these tags has the `class` `result-row`. Structured data like this makes scraping much easier! The `find_all()` method takes an optional `class_` argument that filters by class.
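To see the `class_` filter in isolation, here is a minimal sketch on a hand-written HTML snippet. The snippet and its contents are made up for illustration and only mimic the structure described above; `'html.parser'` is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure of the list page
sample_html = """
<ul>
  <li class="result-row"><a class="result-title" href="/post1.html">Sunny studio</a></li>
  <li class="result-row"><a class="result-title" href="/post2.html">2br near park</a></li>
  <li class="banner">Not a post</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ has a trailing underscore because `class` is a reserved word in Python
sample_posts = sample_soup.find_all('li', class_='result-row')
print(len(sample_posts))  # the banner <li> is filtered out
```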

In [None]:
posts = soup.find_all('li', class_= 'result-row')

# Note that there are 120 results, which is the number of posts returned on the Craigslist webpage. A good thing!
print(len(posts))

# Let's look at a sample post
posts[0]

It looks like the price and the neighborhood have their own class, within the `span` tag. 
The title and URL look like they are within the `a` tag. The number of bedrooms is a bit more complicated, but it's somewhere in the housing class.

Let's test this out. Note that `find` will get the first occurrence. `find_all` will get all of them, and return a list. But in the Craigslist posts, there's either only one occurrence or they are all the same, so `find` is easier. (Try it out.)
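The difference between the two methods can be sketched on a small, made-up post (the HTML below is hypothetical, not copied from Craigslist). Note also what `find` gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical post in which the price appears twice
post_html = """
<li class="result-row">
  <span class="result-price">$1,500</span>
  <span class="result-price">$1,500</span>
</li>
"""
sample_post = BeautifulSoup(post_html, 'html.parser')

first = sample_post.find('span', class_='result-price')      # a single tag
every = sample_post.find_all('span', class_='result-price')  # a list of tags
missing = sample_post.find('span', class_='result-hood')     # no such tag here

print(first.text)
print(len(every))
print(missing)
```

When there is no match, `find` returns `None` rather than raising an error, which matters as soon as we call `.text` on the result.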

In [None]:
print('Price:')
print(posts[0].find('span', class_= 'result-price'))

print('\nNeighborhood:') # \n adds an empty line before
print(posts[0].find('span', class_= 'result-hood'))

print('\nHousing size:')
print(posts[0].find('span', class_= 'housing'))

print('\nTitle:')
print(posts[0].find('a', class_= 'result-title'))

# For all of these results, we can extract just the text
print('\nTitle  (text only):')
print(posts[0].find('a', class_= 'result-title').text)

# except the URL, which is stored in the tag's href attribute
print('\nURL:')
print(posts[0].find('a', class_= 'result-title')['href'])

Now we understand the structure of each post. So we are ready to put all of the posts in a dataframe.

`pandas` can create a dataframe from many different data structures. But one of the easiest ways is to create a list of dictionaries, and then tell `pandas` to convert that into a dataframe. Each element of the list is a row. Each row is a dictionary whose keys are the column names.
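As a minimal sketch of that idea, with made-up values rather than scraped ones:

```python
import pandas as pd

# Made-up rows: each dictionary is one row, and its keys become the column names
rows = [
    {'price': '$1,500', 'neighborhood': 'Pasadena'},
    {'price': '$2,100', 'neighborhood': 'Echo Park'},
]
toy_df = pd.DataFrame(rows)

print(toy_df.shape)
print(list(toy_df.columns))
```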

In [None]:
import pandas as pd
postList = [] # empty list that we can add to
for post in posts:
    # temporary variables
    price = post.find('span', class_= 'result-price').text
    neighborhood = post.find('span', class_= 'result-hood').text
    housingsize = post.find('span', class_= 'housing').text
    title = post.find('a', class_= 'result-title').text
    url = post.find('a', class_= 'result-title')['href']

    # now put them in the dictionary, and append to our list
    postList.append({'price': price, 'neighborhood':neighborhood, 
                     'housingsize':housingsize, 'title':title, 'url':url})
df = pd.DataFrame(postList)

<div class="alert alert-block alert-info">
We probably got an error there. Let's discuss how to fix this to be more robust to missing fields.
</div>
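The likely cause is that `find` returns `None` when a post lacks a field, so calling `.text` on it raises an `AttributeError`. One possible fix, sketched here, is a small helper that checks for `None` first. The name `get_text_or_none` and the sample post are hypothetical.

```python
from bs4 import BeautifulSoup

def get_text_or_none(post, tag, class_name):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    element = post.find(tag, class_=class_name)
    if element is None:
        return None
    return element.text.strip()

# Hypothetical post that has a price but no neighborhood
sample_post = BeautifulSoup(
    '<li class="result-row"><span class="result-price">$1,500</span></li>',
    'html.parser')

print(get_text_or_none(sample_post, 'span', 'result-price'))
print(get_text_or_none(sample_post, 'span', 'result-hood'))
```

In the loop above, each `post.find(...).text` could then become `get_text_or_none(post, ...)`, and `pandas` will treat the `None` values as missing data.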

In [None]:
print(df)

So it looks pretty good, except for the `housingsize` field. What's going on here?

In [None]:
print(df.housingsize)

print('\nThe first entry is {}'.format(df.housingsize.iloc[0]))

It looks like there is a lot of whitespace here. And sometimes, the field contains ft2, sometimes br, sometimes neither and sometimes both.

Let's use the `split()` function to split the string by the whitespace.

In [None]:
print(df.housingsize.str.split())

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Conceptually, how would you go about creating two new fields in the dataframe—bedrooms and sqft? Write some code if you can, but the most important step is to think through how you'd do it in words.
</div>
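One possible approach, once you have thought it through yourself, is to search each string for a number followed by `br` and a number followed by `ft2`. The sketch below uses regular expressions. The function name `parse_housing` and the sample strings are assumptions for illustration, and real posts may use formats this does not cover.

```python
import re

def parse_housing(text):
    """Return (bedrooms, sqft) from a housing string; either may be None."""
    if text is None:
        return None, None
    bedrooms = re.search(r'(\d+)\s*br', text)   # e.g. "2br"
    sqft = re.search(r'(\d+)\s*ft2', text)      # e.g. "900ft2"
    return (int(bedrooms.group(1)) if bedrooms else None,
            int(sqft.group(1)) if sqft else None)

# Made-up strings with the kind of whitespace seen above
print(parse_housing('\n   2br -\n   900ft2 -\n'))
print(parse_housing('\n   1br -\n'))
print(parse_housing('   '))
```

To fill a column, something like `df['sqft'] = df.housingsize.apply(lambda s: parse_housing(s)[1])` should work, and similarly with index `0` for bedrooms.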

Now let's plot the distribution of price. A box plot would be a good choice here.

In [None]:
# Oops. What went wrong?
df.boxplot('price')

In [None]:
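# The prices are strings like '$1,500', so strip the symbols before converting.
# regex=False treats '$' as a literal character (as a regular expression, it means "end of string").
df['price_numeric'] = df.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)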

In [None]:
df.boxplot('price_numeric')

In [None]:
# We can also break it out by neighborhood.
# But what's the problem here?
df.boxplot('price_numeric', by='neighborhood')

In [None]:
# What about the relationship between prices and the apartment size?
# This needs the sqft column from the exercise above.
df.plot('sqft', 'price_numeric', kind='scatter')

So now we've created a dataframe that contains all the posts on the first page!

What next?
* We only have one page, and it would be useful to get data from the subsequent pages
* Our neighborhood field is really dirty, so it's hard to do any mapping
* We don't have any information about parking

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> How might you implement one or more of these extensions? Before writing any code, sketch out the principle and sequence of steps that you would follow.
</div>
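For the first extension, one common pattern is to build the URL of each results page and then repeat the scraping steps for every page. The sketch below only builds the URLs. It assumes the site pages through results with an offset query parameter named `s` and 120 posts per page; check this by clicking "next" in your browser and looking at the address bar. The function name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode

def page_urls(base_url, n_pages, per_page=120):
    """Build one URL per results page, assuming an offset-style query parameter."""
    urls = []
    for page in range(n_pages):
        # ASSUMPTION: results are paged with an offset parameter called `s`
        query = urlencode({'s': page * per_page})
        urls.append('{}?{}'.format(base_url, query))
    return urls

for u in page_urls('https://losangeles.craigslist.org/search/lac/hhh', 3):
    print(u)
```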

Let's briefly see what it would take to get the information on a specific webpage. 

Note that we had the foresight to save the URL in the DataFrame that we created above. Let's take the first one.

In [None]:
#url = df.url.iloc[0]
url = 'https://losangeles.craigslist.org/lac/roo/d/pasadena-top-floor-townhouse-master/7300581598.html'
r = requests.get(url)
txt = r.text
print(txt)

We have a couple of strategies here. First, we could skip trying to parse the page with `BeautifulSoup`, and just see if particular bits of text are present.

For example, what transportation modes does the post emphasize? Do they mention Section 8 vouchers? Some of this might be exploratory—we can see what type of language is included, and then parse in a more structured way (e.g. distinguishing between "No Section 8" and "Section 8 welcome").

In [None]:
if 'freeway' in txt:
    print('This post mentions freeways')
if 'transit' in txt or 'train' in txt or 'bus' in txt:
    print('This post mentions transit')
if 'section 8' in txt:
    print('This post mentions Section 8')

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Write a function that takes a URL as its argument, and returns 3 boolean values for whether a post mentions freeways, transit, and Section 8. Make sure that it is not case-sensitive!
</div>
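Going one step further, the distinction between "No Section 8" and "Section 8 welcome" could be sketched with a regular expression. The rule used here (a "no" or "not" within a few characters before the phrase) is a rough assumption, not a tested classifier, and the function name `section8_stance` is hypothetical.

```python
import re

def section8_stance(text):
    """Classify a post as 'not accepted', 'mentioned', or 'not mentioned'."""
    lowered = text.lower()  # makes the match case-insensitive
    if not re.search(r'section\s*8', lowered):
        return 'not mentioned'
    # ASSUMPTION: "no" or "not" shortly before the phrase signals refusal
    if re.search(r'\b(no|not)\b[^.]{0,20}section\s*8', lowered):
        return 'not accepted'
    return 'mentioned'

print(section8_stance('No Section 8'))
print(section8_stance('Section 8 welcome'))
print(section8_stance('Close to the freeway'))
```

A rule like this will misread some posts, so it is worth printing a sample of matches and reading them before trusting the counts.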

Most of the post is free-form text. So there's not going to be much value added by `BeautifulSoup`.

The exception is the geographic coordinates, which look like they are stored as attributes of a `div` tag with the class `viewposting`.

In [None]:
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

In [None]:
latlon = soup.find('div', class_='viewposting')
lat = latlon['data-latitude']
lon = latlon['data-longitude']
print(lat, lon)
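Not every post will have a map, and the attributes come back as strings. A slightly more defensive version might look like the sketch below. The HTML snippet and its coordinates are made up, and `get_coordinates` is a hypothetical name.

```python
from bs4 import BeautifulSoup

def get_coordinates(page_soup):
    """Return (lat, lon) as floats, or (None, None) if the page has no coordinates."""
    tag = page_soup.find('div', class_='viewposting')
    if tag is None:
        return None, None
    # .get() returns None instead of raising KeyError when an attribute is absent
    lat, lon = tag.get('data-latitude'), tag.get('data-longitude')
    if lat is None or lon is None:
        return None, None
    return float(lat), float(lon)

# Hypothetical snippet with made-up coordinates
map_html = '<div class="viewposting" data-latitude="34.1478" data-longitude="-118.1445"></div>'
sample_soup = BeautifulSoup(map_html, 'html.parser')
empty_soup = BeautifulSoup('<p>No map here</p>', 'html.parser')

print(get_coordinates(sample_soup))
print(get_coordinates(empty_soup))
```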

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Scraping unstructured webpages involves more detective work and trial and error.</li>
  <li>Some will have a consistent format and helpful class names and HTML tags. Some won't.</li>
  <li>Your code will need to be robust to missing fields and other inconsistencies in page formatting.</li>
  <li>Be nice! You may need to slow the pace of your requests down.</li>
</ul>
</div>
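To make the last takeaway concrete, here is a minimal sketch of a loop that pauses between requests. The function name `fetch_politely` and the delay value are assumptions; an appropriate delay depends on the site. A stand-in fetch function is used so that the example runs without touching the network.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause so we don't hammer the server
        results.append(fetch(url))
    return results

# Stand-in for requests.get, so no real requests are made here
def fake_fetch(url):
    return 'contents of ' + url

pages = fetch_politely(['page1', 'page2'], fake_fetch, delay_seconds=0.1)
print(pages)
```

With real pages, you would pass `requests.get` as the `fetch` argument and a list of URLs such as `df.url`.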