# Scraping San Francisco Apt/Housing Listings on Craigslist

This notebook uses the functions in `scrape_craigslist.py` to scrape 25 pages of Craigslist apartment/housing listings, then visits each listing's individual URL to collect amenity details.

To reduce the risk of triggering web-scraping detection, I scraped in batches of roughly 5 results pages (about 600 individual post scrapes per batch). I then combined the 5 resulting dataframes and wrote the final dataframe to a CSV file for cleaning.
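
The batching idea can be sketched as follows. This is a minimal illustration, not code from `scrape_craigslist.py`: `make_batches` is a hypothetical helper, the page names are stand-ins, and the batch size of 5 only mirrors the approximate figure above (the actual batches in this notebook were sliced by hand).

```python
def make_batches(items, batch_size=5):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical stand-ins for the 25 results-page URLs
pages = [f"page-{n}" for n in range(1, 26)]
batches = make_batches(pages)

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {len(batch)} pages")
```

Scraping one batch at a time also means that a blocked or failed request costs at most one batch of work rather than the whole run.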


**Craigslist settings:** 
- `SF bay area` > `san francisco` > `housing` > `apartments / housing for rent`
- [x] `Bundle Duplicates`



In [1]:
# Imports

from bs4 import BeautifulSoup
import requests

import pandas as pd
import numpy as np

from random import randint
from time import sleep

# Project helpers used in this notebook
from scrape_craigslist import get_results_urls, full_listings_scrape

## Compiling list of URLs to scrape

In [3]:
# URL of the first results page

start_url = 'https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1'

In [4]:
# Get all URLs of listing results pages to scrape
urls = get_results_urls(start_url)

In [6]:
# Review URLs to make sure they were formatted correctly:

for page, url in enumerate(urls, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=0

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=120

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=240

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=360

5 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=480

6 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=600

7 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=720

8 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=840

9 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=960

10 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1080

11 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1200

12 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1320

13 https://s
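
Judging by the output above, each results page differs only in its `s` offset, which advances in steps of 120. A minimal sketch of how such URLs could be generated follows. `build_results_urls`, `n_pages`, and `page_size` are hypothetical names for illustration, and the real `get_results_urls` may work differently (for example, by reading the total result count from the page).

```python
def build_results_urls(start_url, n_pages=25, page_size=120):
    """Return `n_pages` results-page URLs by appending an `s` offset."""
    return [f"{start_url}&s={page * page_size}" for page in range(n_pages)]


start_url = "https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1"
example_urls = build_results_urls(start_url)

# Show the first few generated URLs
for page, url in enumerate(example_urls[:3], start=1):
    print(page, url)
```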

## First batch

In [7]:
# First batch
batch_1 = urls[:6]

In [13]:
for page, url in enumerate(batch_1, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=0

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=120

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=240

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=360

5 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=480

6 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=600



In [9]:
sf_1 = full_listings_scrape(batch_1)

Scraping page 1 of 6...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 1 of 6 scrape complete!

Scraping page 2 of 6...

Listing page scrape complete!
Number of postings scraped: 123

Individual posts scrape complete!
Number of posts scraped:  123

Page 2 of 6 scrape complete!

Scraping page 3 of 6...

Listing page scrape complete!
Number of postings scraped: 137

Individual posts scrape complete!
Number of posts scraped:  137

Page 3 of 6 scrape complete!

Scraping page 4 of 6...

Listing page scrape complete!
Number of postings scraped: 135

Individual posts scrape complete!
Number of posts scraped:  135

Page 4 of 6 scrape complete!

Scraping page 5 of 6...

Listing page scrape complete!
Number of postings scraped: 123

Individual posts scrape complete!
Number of posts scraped:  123

Page 5 of 6 scrape complete!

Scraping page 6 of 6...

Listing page scrape complete!
Number of postings scraped: 12
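
The `randint` and `sleep` imports suggest that requests are spaced out with random pauses. A minimal sketch of that pattern follows. It is an assumption about the approach, not the actual body of `full_listings_scrape`: `scrape_with_delays`, `fetch`, `pause`, and the delay bounds are all hypothetical, and `fetch` is passed in so the sketch runs without network access.

```python
from random import randint
from time import sleep


def scrape_with_delays(urls, fetch, min_delay=1, max_delay=4, pause=sleep):
    """Call `fetch` on each URL, pausing a random number of seconds in between."""
    results = []
    for position, url in enumerate(urls, start=1):
        print(f"Scraping page {position} of {len(urls)}...")
        results.append(fetch(url))
        if position < len(urls):
            # Randomized wait so requests are not evenly spaced
            pause(randint(min_delay, max_delay))
    return results


# Usage with a stand-in fetch function and no real waiting
fake_pages = scrape_with_delays(
    ["url-a", "url-b", "url-c"],
    fetch=lambda url: f"<html>{url}</html>",
    pause=lambda seconds: None,
)
print(fake_pages)
```

Injecting `fetch` and `pause` keeps the loop testable; in a real run `fetch` would wrap `requests.get` and `pause` would stay as `sleep`.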

## Second batch

In [14]:
batch_2 = urls[6:11]

In [15]:
for page, url in enumerate(batch_2, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=720

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=840

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=960

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1080

5 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1200



In [17]:
sf_2 = full_listings_scrape(batch_2)

Scraping page 1 of 5...

Listing page scrape complete!
Number of postings scraped: 127

Individual posts scrape complete!
Number of posts scraped:  127

Page 1 of 5 scrape complete!

Scraping page 2 of 5...

Listing page scrape complete!
Number of postings scraped: 124

Individual posts scrape complete!
Number of posts scraped:  124

Page 2 of 5 scrape complete!

Scraping page 3 of 5...

Listing page scrape complete!
Number of postings scraped: 127

Individual posts scrape complete!
Number of posts scraped:  127

Page 3 of 5 scrape complete!

Scraping page 4 of 5...

Listing page scrape complete!
Number of postings scraped: 122

Individual posts scrape complete!
Number of posts scraped:  122

Page 4 of 5 scrape complete!

Scraping page 5 of 5...

Listing page scrape complete!
Number of postings scraped: 126

Individual posts scrape complete!
Number of posts scraped:  126

Page 5 of 5 scrape complete!



## Third batch

In [49]:
batch_3 = urls[11:16]

In [21]:
for page, url in enumerate(batch_3, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1320

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1440

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1560

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1680

5 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1800



In [22]:
sf_3 = full_listings_scrape(batch_3)

Scraping page 1 of 5...

Listing page scrape complete!
Number of postings scraped: 128

Individual posts scrape complete!
Number of posts scraped:  128

Page 1 of 5 scrape complete!

Scraping page 2 of 5...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 2 of 5 scrape complete!

Scraping page 3 of 5...

Listing page scrape complete!
Number of postings scraped: 122

Individual posts scrape complete!
Number of posts scraped:  122

Page 3 of 5 scrape complete!

Scraping page 4 of 5...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 4 of 5 scrape complete!

Scraping page 5 of 5...

Listing page scrape complete!
Number of postings scraped: 122

Individual posts scrape complete!
Number of posts scraped:  122

Page 5 of 5 scrape complete!



## Fourth batch

In [24]:
batch_4 = urls[16:21]

In [25]:
for page, url in enumerate(batch_4, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=1920

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2040

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2160

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2280

5 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2400



In [26]:
sf_4 = full_listings_scrape(batch_4)

Scraping page 1 of 5...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 1 of 5 scrape complete!

Scraping page 2 of 5...

Listing page scrape complete!
Number of postings scraped: 122

Individual posts scrape complete!
Number of posts scraped:  122

Page 2 of 5 scrape complete!

Scraping page 3 of 5...

Listing page scrape complete!
Number of postings scraped: 120

Individual posts scrape complete!
Number of posts scraped:  120

Page 3 of 5 scrape complete!

Scraping page 4 of 5...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 4 of 5 scrape complete!

Scraping page 5 of 5...

Listing page scrape complete!
Number of postings scraped: 124

Individual posts scrape complete!
Number of posts scraped:  124

Page 5 of 5 scrape complete!



## Fifth batch

In [27]:
batch_5 = urls[21:]

In [28]:
for page, url in enumerate(batch_5, start=1):
    print(page, url)
    print("")

1 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2520

2 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2640

3 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2760

4 https://sfbay.craigslist.org/search/sfc/apa?sort=date&bundleDuplicates=1&s=2880



In [29]:
sf_5 = full_listings_scrape(batch_5)

Scraping page 1 of 4...

Listing page scrape complete!
Number of postings scraped: 121

Individual posts scrape complete!
Number of posts scraped:  121

Page 1 of 4 scrape complete!

Scraping page 2 of 4...

Listing page scrape complete!
Number of postings scraped: 122

Individual posts scrape complete!
Number of posts scraped:  122

Page 2 of 4 scrape complete!

Scraping page 3 of 4...

Listing page scrape complete!
Number of postings scraped: 126

Individual posts scrape complete!
Number of posts scraped:  126

Page 3 of 4 scrape complete!

Scraping page 4 of 4...

Listing page scrape complete!
Number of postings scraped: 120

Individual posts scrape complete!
Number of posts scraped:  120

Page 4 of 4 scrape complete!



## Compiling dataframes & export

In [32]:
df_list = [sf_1, sf_2, sf_3, sf_4, sf_5]

In [43]:
sf = pd.concat(df_list)

In [44]:
sf.head()

| | index | date | title | link | price | brs | sqft | hood | bath | amenities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Oct 1 | Lovely Newly Renovated 2BR 1BA Apartment (suns... | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2500 | 2.0 | | sunset / parkside | 1Ba | [apartment, laundry on site, no smoking] |
| 1 | 1 | Oct 1 | Boo Spooky Special! The Martin | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2680 | 1.0 | 591.0 | potrero hill | 1.5Ba | [cats are OK - purrr, dogs are OK - wooof, apa... |
| 2 | 2 | Oct 1 | Arterra condo with Salesforce Tower Skyline Vi... | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2800 | 1.0 | | SOMA / south beach | 1Ba | [cats are OK - purrr, dogs are OK - wooof, con... |
| 3 | 3 | Oct 1 | Bright Marina Studios | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2100 | | | marina / cow hollow | 1Ba | [application fee details: 20.00 Credit Check f... |
| 4 | 4 | Oct 1 | Newly Remodeled ! 2 br1ba$3195 | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 3195 | | | USF / panhandle | 1Ba | [apartment, no smoking] |
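
One detail worth noting: `pd.concat` keeps each input frame's own row labels by default, so the combined frame can carry repeated index values. A minimal sketch with toy frames (the data here is made up for illustration) shows how `ignore_index=True` renumbers the rows and how a leftover `index` column can be dropped:

```python
import pandas as pd

# Toy stand-ins for two batch dataframes (values are illustrative only)
batch_a = pd.DataFrame({"index": [0, 1], "price": [2500, 2680]})
batch_b = pd.DataFrame({"index": [0, 1], "price": [2800, 2100]})

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
combined = pd.concat([batch_a, batch_b], ignore_index=True)
combined = combined.drop(columns=["index"])

print(combined)
```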


In [46]:
# Dataframe cleanup: drop the leftover 'index' column
sf = sf.drop(columns=['index'])

In [47]:
sf.head()

| | date | title | link | price | brs | sqft | hood | bath | amenities |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Oct 1 | Lovely Newly Renovated 2BR 1BA Apartment (suns... | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2500 | 2.0 | | sunset / parkside | 1Ba | [apartment, laundry on site, no smoking] |
| 1 | Oct 1 | Boo Spooky Special! The Martin | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2680 | 1.0 | 591.0 | potrero hill | 1.5Ba | [cats are OK - purrr, dogs are OK - wooof, apa... |
| 2 | Oct 1 | Arterra condo with Salesforce Tower Skyline Vi... | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2800 | 1.0 | | SOMA / south beach | 1Ba | [cats are OK - purrr, dogs are OK - wooof, con... |
| 3 | Oct 1 | Bright Marina Studios | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 2100 | | | marina / cow hollow | 1Ba | [application fee details: 20.00 Credit Check f... |
| 4 | Oct 1 | Newly Remodeled ! 2 br1ba$3195 | https://sfbay.craigslist.org/sfc/apa/d/san-fra... | 3195 | | | USF / panhandle | 1Ba | [apartment, no smoking] |


In [48]:
# Export the combined dataframe to CSV for cleaning
sf.to_csv('sf_raw.csv', index=False)
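
Assuming the `amenities` column holds Python lists (as the display above suggests), `to_csv` writes each one as its string representation, so the lists need to be parsed again when the file is read back. A minimal sketch of that step, using only the standard library and a made-up two-row sample rather than the real `sf_raw.csv`:

```python
import ast
import csv
import io

# Made-up sample mimicking how list cells look after a CSV export
sample_csv = io.StringIO(
    "price,amenities\n"
    "2500,\"['apartment', 'laundry on site', 'no smoking']\"\n"
    "3195,\"['apartment', 'no smoking']\"\n"
)

rows = list(csv.DictReader(sample_csv))
for row in rows:
    # literal_eval safely turns the string "['a', 'b']" back into a list
    row["amenities"] = ast.literal_eval(row["amenities"])

print(rows[0]["amenities"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from scraped pages.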