---
author: "Robert Ritz"
draft: true
echo: false
---

# Data Collection

This notebook will scrape Airbnb listings for several cities around the world. The purpose is to understand the fee structure for these listings better. If other cities are requested, I can add them later. 

To view the analysis notebook, see 02 - Exploring Fees.

> You can view the written analysis that results from this code on datafantic.com where I'm building with data, one project at a time. Sign up to get notified when a new project drops!

In [1]:
import pandas as pd
import requests

from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException, ElementNotSelectableException, TimeoutException, SessionNotCreatedException
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from tqdm import tqdm
import time

Set the Selenium options. The no-sandbox flag and headless mode are needed to make Selenium work in Deepnote.

Note: This uses a good amount of RAM, so the basic machine may be unreliable. I used the Plus machine type, and it worked great.

Also note that I installed the Chrome driver in init.ipynb. Click on the gear icon under the Environment section to view that notebook.

In [3]:
options = Options()
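# Pass headless as a Chrome argument; the options.headless setter was deprecated and later removed in Selenium 4
options.add_argument("--headless")
options.add_argument("--disable-infobars")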
options.add_argument("--disable-extensions")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")

## Scrape flexible times

### Scrape listing URLs for various cities

In [5]:
cities = ['Dallas--TX--United-States', 'Austin--TX--United-States', 'Los-Angeles--United-States', 
          'New-York-City--Manhattan--United-States', 'San-Francisco--CA--United-States']

Airbnb only shows 15 pages of listings, with 20 listings per page. That is 300 listings per city without extra filtering or moving the map around. This seems like a good enough sample for our purposes, and I believe this will approximate a normal user experience when using Airbnb.

In [None]:
len(cities) * 300

Looks good. Let's scrape!

In [None]:
links = []
for city in tqdm(cities, position=0, desc="City"):
    url = f"https://www.airbnb.com/s/{city}/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&price_filter_input_type=0&date_picker_type=flexible_dates&source=structured_search_input_header&search_type=autocomplete_click&pagination_search=true"
    # Load root city listings
    driver = Chrome(options=options)
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout=10, poll_frequency=1, ignored_exceptions=[ElementNotVisibleException, ElementNotSelectableException])
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, """nav[aria-label='Search results pagination']""")))

        # Determine number of pages in listings
        pages = driver.find_elements(By.TAG_NAME, "nav")[-1].find_elements(By.TAG_NAME, "a")[-2].text
        pages = int(pages)

        # Navigate through each successive page in the listings
        for page in range(pages):
            # Grab all links on page. We can filter them later.
            for a in driver.find_elements(By.TAG_NAME, "a"):
                href = a.get_property('href')
                # Some anchors have no href, so guard against None
                if href and 'rooms' in href:
                    links.append({'city':city, 'url':href})

            # Go to the next page and give it time to load (nothing to click after the last page)
            if page < pages - 1:
                driver.find_elements(By.TAG_NAME, "nav")[-1].find_elements(By.TAG_NAME, "a")[-1].click()
                time.sleep(10)
    finally:
        # Close the browser so each city does not leave a Chrome process behind
        driver.quit()
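The search URL above packs its filters into a hand-encoded query string, where `%5B%5D` is the encoded form of `[]` and `%2F` is `/`. As a minimal sketch, the same string can be assembled from readable key/value pairs with the standard library. The helper name `build_search_url` and the constant `FLEXIBLE_PARAMS` are hypothetical; the parameter values are copied from the URL in the cell above.

```python
from urllib.parse import urlencode

# Query parameters copied from the flexible-dates search URL used in this notebook.
# urlencode turns the "[]" suffix into "%5B%5D" and "/" into "%2F".
FLEXIBLE_PARAMS = [
    ("tab_id", "home_tab"),
    ("refinement_paths[]", "/homes"),
    ("flexible_trip_lengths[]", "one_week"),
    ("price_filter_input_type", "0"),
    ("date_picker_type", "flexible_dates"),
    ("source", "structured_search_input_header"),
    ("search_type", "autocomplete_click"),
    ("pagination_search", "true"),
]


def build_search_url(city, params=FLEXIBLE_PARAMS):
    """Hypothetical helper: assemble the search URL for one city slug."""
    return f"https://www.airbnb.com/s/{city}/homes?{urlencode(params)}"
```

Calling `build_search_url(city)` inside the loop would replace the long f-string, and changing a filter becomes a one-line edit to the parameter list.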

Let's deduplicate and filter the link list, then write it to CSV for later use. Because ain't nobody got time to scrape all of that again.

In [None]:
urls_flexible = pd.DataFrame(links)

In [None]:
urls_flexible.shape

In [None]:
urls_flexible = urls_flexible.drop_duplicates().reset_index(drop=True)

In [None]:
urls_flexible

In [None]:
urls_flexible.to_csv('urls_flexible.csv', index=False)
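The cells above drop exact duplicate rows, but the scrape keeps any link whose URL contains "rooms". A stricter filter could look like the sketch below, which keeps only URLs whose path is a listing page and keeps one URL per city and listing id. The function name `filter_listing_links` and the regular expression are assumptions for illustration; the two path shapes (`/rooms/<id>` and `/rooms/plus/<id>`) are the ones that appear in this notebook's output.

```python
import re
from urllib.parse import urlparse

# Matches listing paths such as /rooms/15342315 and /rooms/plus/3563533
LISTING_PATH = re.compile(r"^/rooms/(?:plus/)?(\d+)/?$")


def filter_listing_links(links):
    """Keep the first URL per (city, listing id); drop non-listing pages."""
    seen = set()
    kept = []
    for link in links:
        match = LISTING_PATH.match(urlparse(link["url"]).path)
        if match is None:
            continue  # not a listing page, even if "rooms" appears in the URL
        key = (link["city"], match.group(1))
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return kept
```

It works on the same list of dicts that the scraping loop builds, so `pd.DataFrame(filter_listing_links(links))` could stand in for `pd.DataFrame(links)`. Keeping the first URL per listing is a design choice: for flexible dates it keeps whichever date range Airbnb showed first.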

### Scrape listing details

In [3]:
urls_flexible = pd.read_csv('urls_flexible.csv')
urls_flexible.shape

(3594, 2)

In [15]:
def extract_features(url):
    # Discounts default to '' and are only overwritten when the listing shows one
    features = {'weekly_discount': '', 'long_stay_discount': ''}
    driver = None
    try:
        driver = Chrome(options=options)
        driver.get(url)
        wait = WebDriverWait(driver, timeout=10, poll_frequency=1, ignored_exceptions=[ElementNotVisibleException, ElementNotSelectableException])

        # Wait for price list
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, """span[class='_1k4xcdh']""")))
        # Name the parser explicitly to avoid BeautifulSoup's parser-guessing warning
        soup = BeautifulSoup(driver.page_source, "html.parser")

        # Get price list
        price_list = [x.text for x in soup.find_all("div", {'class':'_1fpuhdl'})]
        price_list = [x.split("Show price breakdown") for x in price_list]
        for item in price_list:
            if item[0] == 'Cleaning fee':
                features['cleaning_fee'] = item[1]
            elif item[0] == 'Service fee':
                features['service_fee'] = item[1]
            elif item[0] == 'Weekly discount':
                features['weekly_discount'] = item[1]
            elif item[0] == 'Long stay discount':
                features['long_stay_discount'] = item[1]
        features['price_minus_fees'] = price_list[0][1]

        # Basic info
        features['title'] = soup.find('h2').text
        
        bed_baths = soup.find_all("ol")[0].text.split("·  ·")
        for item in bed_baths:
            if 'guest' in item:
                features['guest'] = item.strip().split(" ")[0]
            elif 'bedroom' in item:
                features['bedrooms'] = item.strip().split(" ")[0]
            elif ('1 bed' in item) or ('beds' in item):
                features['beds'] = item.strip().split(" ")[0]
            elif 'bath' in item:
                features['baths'] = item.strip().split(" ")[0]

        # Add url
        features['url'] = url
        return features
    # If the dates aren't available then it will time out.
    except TimeoutException:
        pass
    except SessionNotCreatedException:
        print("session error")
    except Exception:
        print("unhandled exception")
    finally:
        # Always close the browser, even when scraping fails
        if driver is not None:
            driver.quit()
    return None
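The price-parsing step inside `extract_features` can be tested without a browser if it is pulled out as a pure function. The sketch below assumes each price row's text has the shape label + "Show price breakdown" + amount, which is what the split in the function above implies. The name `parse_price_rows` and the sample rows in the usage note are hypothetical. Both discounts start as empty strings before the loop, so a row that is not a discount cannot reset a discount that was already found.

```python
def parse_price_rows(rows, separator="Show price breakdown"):
    """Turn raw price-row strings into a dict of fee and discount features."""
    labels = {
        "Cleaning fee": "cleaning_fee",
        "Service fee": "service_fee",
        "Weekly discount": "weekly_discount",
        "Long stay discount": "long_stay_discount",
    }
    # Set the defaults once, before looping over the rows
    features = {"weekly_discount": "", "long_stay_discount": ""}
    pairs = [row.split(separator) for row in rows]
    for pair in pairs:
        # Skip rows that did not split into exactly (label, amount)
        if len(pair) == 2 and pair[0] in labels:
            features[labels[pair[0]]] = pair[1]
    # Assumption: the first row holds the base price before fees
    if pairs and len(pairs[0]) == 2:
        features["price_minus_fees"] = pairs[0][1]
    return features
```

For example, with made-up rows `["$100 x 6 nightsShow price breakdown$600", "Weekly discountShow price breakdown-$60", "Cleaning feeShow price breakdown$100"]`, the weekly discount is kept even though a cleaning-fee row follows it.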

In [5]:
data = []
for i, row in tqdm(urls_flexible.iterrows(), total=urls_flexible.shape[0]):
    features = extract_features(row['url'])
    if features:
        features['location'] = row['city']
        features['check_in'] = row['url'].split("check_in=")[1].split("&")[0]
        features['check_out'] = row['url'].split("check_out=")[1].split("&")[0]
        data.append(features)

 91%|█████████ | 3269/3594 [3:11:04<11:38,  2.15s/it]unhandled exception
100%|██████████| 3594/3594 [3:31:59<00:00,  3.54s/it]
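The loop above reads `check_in` and `check_out` by splitting the URL string, which raises an `IndexError` if either parameter is missing. A minimal alternative sketch parses the query string with the standard library instead. The function name `stay_dates` is hypothetical, and the URL in the usage note is made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def stay_dates(url):
    """Return (check_in, check_out) from a listing URL; None for a missing value."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values, so take the first one
    check_in = query.get("check_in", [None])[0]
    check_out = query.get("check_out", [None])[0]
    return check_in, check_out
```

With a URL such as `https://www.airbnb.com/rooms/12345?adults=1&check_in=2022-10-30&check_out=2022-11-04`, this returns the two date strings regardless of the order in which the parameters appear.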


In [6]:
df = pd.DataFrame(data)

In [7]:
df

Unnamed: 0,weekly_discount,long_stay_discount,cleaning_fee,service_fee,price_minus_fees,title,guest,bedrooms,beds,baths,url,location,check_in,check_out
0,,,$100,$82,$600,Entire condo hosted by Kevin,2,1,1,1,https://www.airbnb.com/rooms/73539817214550286...,Dallas--TX--United-States,2022-10-28,2022-11-03
1,,,$30,$54,$350,Tiny home hosted by Grady,2,1,1,1,https://www.airbnb.com/rooms/15342315?adults=1...,Dallas--TX--United-States,2022-10-30,2022-11-04
2,,,$110,$131,$824,Entire rental unit hosted by Jen,4,1,2,1,https://www.airbnb.com/rooms/46581123?adults=1...,Dallas--TX--United-States,2022-10-19,2022-10-26
3,,,,$24,$196,Private room in home hosted by R And R Hostel,1,1,1,1,https://www.airbnb.com/rooms/73090942173103176...,Dallas--TX--United-States,2022-10-20,2022-10-25
4,,,$59,$61,$464,Entire rental unit hosted by Carpediem,2,1,1,1,https://www.airbnb.com/rooms/69420558176578361...,Dallas--TX--United-States,2022-10-24,2022-10-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3386,,,$49,$0,$953,Entire rental unit hosted by Marie,3,2,2,1,https://www.airbnb.com/rooms/65582697587414494...,Paris,2023-02-01,2023-02-06
3387,,,,$116,$822,Entire rental unit hosted by Vanessa,4,2,2,1,https://www.airbnb.com/rooms/34830963?adults=1...,Paris,2022-12-13,2022-12-18
3388,,,,$22,$157,Private room in townhouse hosted by Marie,1,1,,1.5,https://www.airbnb.com/rooms/45496636?adults=1...,Paris,2023-08-01,2023-08-06
3389,,,$39,$82,$538,Entire rental unit hosted by Margaux,2,1,1,1,https://www.airbnb.com/rooms/18545424?adults=1...,Paris,2023-01-04,2023-01-09


In [8]:
df.shape

(3391, 14)

In [9]:
df.to_csv("flexible_listings.csv", index=False)

In [10]:
del df
del data

## Scrape one-night listings

Now we will do the same thing, but for a one-night stay.
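The search that follows hard-codes `check_in` and `check_out` one day apart. As a small sketch, the check-out date can be derived from the check-in date and a number of nights, which avoids mismatched dates if the stay length changes later. The helper name `stay_window` is hypothetical.

```python
from datetime import date, timedelta


def stay_window(check_in, nights=1):
    """Return (check_in, check_out) as ISO date strings for a stay of `nights` nights."""
    start = date.fromisoformat(check_in)
    end = start + timedelta(days=nights)
    return start.isoformat(), end.isoformat()
```

`stay_window('2022-11-02')` gives the same pair of dates used in the cell below, and month or year boundaries are handled by `datetime`.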

In [6]:
links = []
for city in tqdm(cities, position=0, desc="City"):
    check_in = '2022-11-02'
    check_out = '2022-11-03'
    url = f"https://www.airbnb.com/s/{city}/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&price_filter_input_type=0&price_filter_num_nights=5&date_picker_type=calendar&checkin={check_in}&checkout={check_out}"
    # Load root city listings
    driver = Chrome(options=options)
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout=10, poll_frequency=1, ignored_exceptions=[ElementNotVisibleException, ElementNotSelectableException])
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, """nav[aria-label='Search results pagination']""")))

        # Determine number of pages in listings
        pages = driver.find_elements(By.TAG_NAME, "nav")[-1].find_elements(By.TAG_NAME, "a")[-2].text
        pages = int(pages)

        # Navigate through each successive page in the listings
        for page in range(pages):
            # Grab all links on page. We can filter them later.
            for a in driver.find_elements(By.TAG_NAME, "a"):
                href = a.get_property('href')
                # Some anchors have no href, so guard against None
                if href and 'rooms' in href:
                    links.append({'city':city, 'url':href})

            # Go to the next page and give it time to load (nothing to click after the last page)
            if page < pages - 1:
                driver.find_elements(By.TAG_NAME, "nav")[-1].find_elements(By.TAG_NAME, "a")[-1].click()
                time.sleep(10)
    finally:
        # Close the browser so each city does not leave a Chrome process behind
        driver.quit()

City: 100%|██████████| 12/12 [32:49<00:00, 164.16s/it]


In [7]:
urls_one_day = pd.DataFrame(links)

In [8]:
urls_one_day.shape

(24853, 2)

In [9]:
urls_one_day = urls_one_day.drop_duplicates().reset_index(drop=True)

In [10]:
urls_one_day.sample(10)

Unnamed: 0,city,url
1691,Brisbane--Queensland,https://www.airbnb.com/rooms/38852719?check_in...
731,Los-Angeles--United-States,https://www.airbnb.com/rooms/16303125?check_in...
650,Los-Angeles--United-States,https://www.airbnb.com/rooms/29630684?check_in...
692,Los-Angeles--United-States,https://www.airbnb.com/rooms/plus/3563533?chec...
3108,Montreal--QC,https://www.airbnb.com/rooms/3006119?check_in=...
2014,Sydney--New-South-Wales--Australia,https://www.airbnb.com/rooms/51157482?check_in...
2108,London,https://www.airbnb.com/rooms/64768646221679252...
2454,Berlin,https://www.airbnb.com/rooms/50973537?check_in...
1841,Sydney--New-South-Wales--Australia,https://www.airbnb.com/rooms/47184627?check_in...
1572,Brisbane--Queensland,https://www.airbnb.com/rooms/45904368?check_in...


In [11]:
urls_one_day.to_csv('urls_one_day.csv', index=False)

### Scrape listing details

In [12]:
urls_one_day = pd.read_csv('urls_one_day.csv')

In [16]:
data = []
for i, row in tqdm(urls_one_day.iterrows(), total=urls_one_day.shape[0]):
    features = extract_features(row['url'])
    if features:
        features['location'] = row['city']
        features['check_in'] = row['url'].split("check_in=")[1].split("&")[0]
        features['check_out'] = row['url'].split("check_out=")[1].split("&")[0]
        data.append(features)

100%|██████████| 3591/3591 [3:14:07<00:00,  3.24s/it]


In [17]:
df = pd.DataFrame(data)
df.to_csv("one_day_listings.csv", index=False)

In [18]:
df.shape

(3566, 14)

In [19]:
df.head()

Unnamed: 0,weekly_discount,long_stay_discount,cleaning_fee,service_fee,price_minus_fees,title,guest,bedrooms,beds,baths,url,location,check_in,check_out
0,,,$30,$12,$58,Entire rental unit hosted by Leonard,3,1,1,1,https://www.airbnb.com/rooms/16353509?check_in...,Dallas--TX--United-States,2022-11-02,2022-11-03
1,,,$40,$14,$58,Entire loft hosted by Simaiya,4,1,1,1,https://www.airbnb.com/rooms/67089461123550275...,Dallas--TX--United-States,2022-11-02,2022-11-03
2,,,$35,$15,$69,Entire rental unit hosted by Frontdesk,4,1,1,1,https://www.airbnb.com/rooms/50029383?check_in...,Dallas--TX--United-States,2022-11-02,2022-11-03
3,,,$30,$13,$65,Entire loft hosted by Leonard,6,1,2,1,https://www.airbnb.com/rooms/39844580?check_in...,Dallas--TX--United-States,2022-11-02,2022-11-03
4,,,,$14,$100,Entire rental unit hosted by Dante,3,1,1,1,https://www.airbnb.com/rooms/69933821236595629...,Dallas--TX--United-States,2022-11-02,2022-11-03

