# Scraping craigslist

## Lecture objectives

1. Explore scraping more complex web pages
2. Show how to use classes to extract web content

## Example: Scraping craigslist data
Craigslist provides a wealth of information on apartment rentals and other types of housing, as you can read about in the [Boeing and Waddell paper](https://journals.sagepub.com/doi/abs/10.1177/0739456X16664789). But short of clicking through lots of links, how do we access it?

As with any scraping project, the first step is to get an example web page, and see if we can reverse-engineer the structure.

One option is to parse each detailed post, with information on parking, desired qualities of roommates, etc. But a lot of information is actually in the [list of posts](https://losangeles.craigslist.org/search/apa#search=1~list~0~0). 

Until late 2022, it was a pretty straightforward process to use `requests` to retrieve the list of posts. However, craigslist then changed its web pages to build the list with JavaScript, which means that the method using `requests` no longer works. There are workarounds, but to keep things simple, let's just download the page to our computer using a web browser. This is a great illustration of one of the hazards of web scraping: there is no guarantee that the website owner won't suddenly change the structure on you!

In Chrome, you can save the page as a "web page, complete." That will give you an `html` document and a folder with other files. You'll just need the `html` file, which I saved to our git repository as `cl_posts.html`.

We can open the file in Python with `open()` and load its contents as a string with the `read()` method.

In [1]:
from bs4 import BeautifulSoup

# the saved page declares charset=utf-8, so read it with that encoding
with open('../data/cl_posts.html', 'r', encoding='utf-8') as f:
    saved_content = f.read()
        
soup = BeautifulSoup(saved_content, features='html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0062)https://losangeles.craigslist.org/search/apa#search=1~list~0~0 -->
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="los angeles apartments / housing for rent - craigslist" property="og:title"/>
  <meta content="los angeles apartments / housing for rent - craigslist" name="description"/>
  <meta content="los angeles apartments / housing for rent - craigslist" property="og:description"/>
  <meta content="https://losangeles.craigslist.org/search/apa" property="og:url"/>
  <title>
   los angeles apartments / housing for rent - craigslist
  </title>
  <link href="https://losangeles.craigslist.org/search/apa" rel="canonical"/>
  <style type="text/css">
   body {
        

Let's look at the output to figure out how to parse it.

Again, this takes some detective work and trial and error. As we saw before, opening up the page in the Developer mode in your web browser is often the simplest way to see the hierarchical tag structure.

It looks like each post is in a `<li>` tag. Moreover, note that each of these tags has a `class` called `cl-search-result`. Structured data like this make it much easier to scrape! The `find_all()` method takes an optional `class_` argument that can filter by class.

In [2]:
posts = soup.find_all('li', class_='cl-search-result')

# Note that there are 120 results, which is the number of posts returned on the Craigslist webpage. 
# That's a good sign!
print(len(posts))

120
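
As a small, self-contained sketch of how the class filter behaves, the cell below uses hand-written markup (hypothetical, only loosely modeled on the list of posts). It shows that `class_` matches a tag when any one of its classes equals the name you give, and that the CSS selector `li.cl-search-result` passed to `select()` is an equivalent way to ask for the same tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup loosely modeled on the list of posts.
html = """
<ol>
  <li class="cl-search-result cl-search-view-mode-list" title="Post A">A</li>
  <li class="cl-search-result cl-search-view-mode-list" title="Post B">B</li>
  <li class="some-other-class" title="Not a post">C</li>
</ol>
"""
demo_soup = BeautifulSoup(html, features='html.parser')

# class_ matches if ANY of the tag's classes equals the given name
by_class = demo_soup.find_all('li', class_='cl-search-result')

# the same query written as a CSS selector
by_selector = demo_soup.select('li.cl-search-result')

print(len(by_class))                      # 2
print(len(by_selector))                   # 2
print([li['title'] for li in by_class])   # ['Post A', 'Post B']
```

The third `<li>` is skipped in both cases because it does not carry the `cl-search-result` class.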


Let's look at a sample post.

In [3]:
posts[2]

<li class="cl-search-result cl-search-view-mode-list" title="SR- Brand new 2 bed, 2 bath  luxury unit with ROOTOP"><div class="result-node-wide"><button class="bd-button cl-favorite-button icon-only" tabindex="0" title="add to favorites list" type="button"><span class="icon icom-"></span><span class="label"></span></button><a class="titlestring" href="https://losangeles.craigslist.org/wst/apa/d/los-angeles-sr-brand-new-bed-bath/7584176475.html">SR- Brand new 2 bed, 2 bath  luxury unit with ROOTOP</a><span class="meta"><span class="separator">·</span>Hollywood<span class="separator">·</span><span title="Mon Jan 30 2023 11:59:28 GMT-0800 (Pacific Standard Time)">5 minutes ago</span><span class="separator">·</span><span class="housing-meta"><span class="post-bedrooms">2br</span><span class="post-sqft">925ft<sup>2</sup></span></span><span class="separator">·</span><span class="priceinfo">$3,295</span><span class="pic-button">pic</span><button class="bd-button cl-banish-button icon-only" ta

Again, we could use the Developer mode within a web browser to try to reverse-engineer the structure, although this is a short enough snippet that we can just look at this text.

It looks like the title is in an `<a>` tag with the class `titlestring`, along with the URL.

In [4]:
posts[2].find(class_='titlestring')

<a class="titlestring" href="https://losangeles.craigslist.org/wst/apa/d/los-angeles-sr-brand-new-bed-bath/7584176475.html">SR- Brand new 2 bed, 2 bath  luxury unit with ROOTOP</a>

So the title is in the text, and the URL is an attribute called `href`.

In [5]:
print(posts[2].find(class_='titlestring').text)
print(posts[2].find(class_='titlestring')['href'])

SR- Brand new 2 bed, 2 bath  luxury unit with ROOTOP
https://losangeles.craigslist.org/wst/apa/d/los-angeles-sr-brand-new-bed-bath/7584176475.html
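
One thing to watch for when we apply this to every post: `find()` returns `None` when nothing matches, and calling `.text` on `None` raises an error. The sketch below is one possible way to guard against that. The markup and the helper name `title_and_url` are made up for illustration, and the second post deliberately has no link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second post has no titlestring link.
html = """
<li class="cl-search-result"><a class="titlestring" href="https://example.com/post/1.html">First post</a></li>
<li class="cl-search-result"><span class="meta">no link here</span></li>
"""
demo_soup = BeautifulSoup(html, features='html.parser')

def title_and_url(post):
    """Return (title, url) for a post, or (None, None) if it has no link."""
    link = post.find(class_='titlestring')
    if link is None:
        return None, None
    # .get() returns None for a missing attribute instead of raising KeyError
    return link.text, link.get('href')

results = [title_and_url(p) for p in demo_soup.find_all('li', class_='cl-search-result')]
print(results)
# [('First post', 'https://example.com/post/1.html'), (None, None)]
```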


What about the other information? The neighborhood seems to be within a tag that has the class `meta`. Note that `find` just finds the first occurrence. `find_all` finds all of them, and returns a list.

In [6]:
posts[2].find(class_='meta').text

'·Hollywood·5 minutes ago·2br925ft2·$3,295pic'

The pieces all run together, which is a bit annoying, but they are separated by dots (`·`), so we can split on these using `str.split()`.

In [7]:
# example
'Splitting this sentence into words'.split()

['Splitting', 'this', 'sentence', 'into', 'words']

In [8]:
# example with a separator that isn't a space
'Splitting this sentence into words'.split('i')

['Spl', 'tt', 'ng th', 's sentence ', 'nto words']

Here, we want to split on the dot.

In [9]:
posts[2].find(class_='meta').text.split('·')

['', 'Hollywood', '5 minutes ago', '2br925ft2', '$3,295pic']

In [10]:
# it's the second element (the first is an empty string, because the text starts with a dot)
posts[2].find(class_='meta').text.split('·')[1]

'Hollywood'
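
To come back to the difference between `find` and `find_all`, and to tidy up the split, here is a minimal sketch. The string `meta_html` is a trimmed copy of the `meta` markup from the sample post, closed by hand so that it parses on its own; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "meta" markup from the sample post, closed by hand
# so that it parses on its own.
meta_html = (
    '<span class="meta">'
    '<span class="separator">·</span>Hollywood'
    '<span class="separator">·</span><span>5 minutes ago</span>'
    '<span class="separator">·</span>'
    '<span class="housing-meta"><span class="post-bedrooms">2br</span>'
    '<span class="post-sqft">925ft<sup>2</sup></span></span>'
    '<span class="separator">·</span>'
    '<span class="priceinfo">$3,295</span><span class="pic-button">pic</span>'
    '</span>'
)
meta = BeautifulSoup(meta_html, features='html.parser').find(class_='meta')

first_separator = meta.find(class_='separator')      # a single tag (or None)
all_separators = meta.find_all(class_='separator')   # a list of tags

print(first_separator.text)   # ·
print(len(all_separators))    # 4

# Drop empty pieces and surrounding whitespace after splitting on the dot.
parts = [piece.strip() for piece in meta.text.split('·') if piece.strip()]
print(parts)  # ['Hollywood', '5 minutes ago', '2br925ft2', '$3,295pic']
```

After filtering out the empty string, the neighborhood is the first element rather than the second.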

The price, number of bedrooms, and square footage are easier to find, as they are in their own dedicated classes.

In [11]:
print(posts[2].find(class_= 'priceinfo').text)
print(posts[2].find(class_= 'post-bedrooms').text)
print(posts[2].find(class_= 'post-sqft').text)

$3,295
2br
925ft2
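
These values are still strings. If we want to treat them as numbers later on, one possible approach (a sketch, not the only way) is to pull out the first run of digits with a regular expression. The helper `to_int` is a hypothetical name, and it accepts `None` so that a post with a missing field does not raise an error.

```python
import re

def to_int(text):
    """Return the first run of digits in text as an int (commas ignored), or None."""
    if text is None:
        return None
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

print(to_int('$3,295'))   # 3295
print(to_int('2br'))      # 2
print(to_int('925ft2'))   # 925 (the trailing 2 is the superscript in ft²)
print(to_int(None))       # None
```

Taking only the first run of digits is what keeps the superscript `2` from being read as part of the square footage.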


Now we understand the structure of each post, so we are ready to put all of the posts in a dataframe. We'll do that in the next lecture.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Scraping unstructured webpages involves more detective work and trial and error.</li>
  <li>Some pages will have a consistent format and helpful class names and HTML tags. Some won't.</li>
</ul>
</div>