# Web parsing with Python and Beautiful Soup 

# Airbnb prediction modelling project
### Objective: Build an end-to-end data science project on data I scraped myself rather than downloaded, then model the price of a listing from its features. The end goal is a Chrome extension for the Airbnb page that indicates whether a listing is overpriced or good value, given the active filters.

### 1. Get the HTML

In [3]:
#!pip install requests
import requests
#!pip install IPython

In [7]:
airbnb_url = 'https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click'

In [44]:
answer = requests.get(airbnb_url)

In [45]:
# what you can do with an answer
# note: the attribute is `url`; `answer.airbnb_url` raises
# AttributeError: 'Response' object has no attribute 'airbnb_url'
print(answer.url)
print(answer.status_code)
print(answer.reason)

In [10]:
print(answer.content)

b'<!doctype html>\n<html data-is-hyperloop="true"><script>window.sherlock_firstbyte = window.performance && window.performance.timing ? window.performance.timing.responseStart : Number(new Date());</script><script>!function(){"use strict";var n=window;const e="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",o=new RegExp(`^\\\\d{10}_[${e}]{16}$`);const t=/(?:^| )bev=(.*?)(?:;|$)/;let c=!1;function i(){if(c||"undefined"==typeof document)return null;c=!0;const n=(document.cookie||"").match(t);if(!n||2!==n.length)return null;const e=decodeURIComponent(n[1]);return function(n){return o.test(n)}(e)?e:null}!function(){try{if(n.bev=n.bev||i(),!n.bev){const o=function(){const n=[];for(let o=0;o<16;o+=1)n.push(e[Math.floor(Math.random()*e.length)]);return`${Math.floor(Date.now()/1e3)}_${n.join("")}`}();!function(n){const{hostname:e}=document.location,o="."+e.slice(e.indexOf("airbnb.")),t=new Date;t.setDate(t.getDate()+730),document.cookie=["bev="+encodeURIComponent(n),"expires=
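Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one hedged way to wrap the call. The name `fetch_html`, the 10-second timeout and the `User-Agent` string are illustrative assumptions, not part of the original notebook. The `get` argument is expected to behave like `requests.get`.

```python
def fetch_html(url, get, timeout=10):
    """Fetch a page and return its raw bytes, or raise if the request failed.

    `get` is any callable with the same interface as `requests.get`
    (hypothetical helper; pass `requests.get` when using it in the notebook).
    """
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; airbnb-notebook-example)"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

In the notebook this could be called as `fetch_html(airbnb_url, requests.get)`, and the returned bytes passed to `BeautifulSoup` exactly like `answer.content`.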

In [9]:
# We need to use something else to navigate through the HTML chunk above

### 2. Use Beautiful Soup

In [11]:
from bs4 import BeautifulSoup

In [12]:
soup = BeautifulSoup(answer.content, 'html.parser')

### 3. Scrape Airbnb Webpage

In [22]:
airbnb_url = 'https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click'

In [13]:
# Now I can parse the page behind the URL above using Beautiful Soup
soup = BeautifulSoup(requests.get(airbnb_url).content,'html.parser')

In [14]:
print(soup.prettify())

<!DOCTYPE html>
<html data-is-hyperloop="true">
 <script>
  window.sherlock_firstbyte = window.performance && window.performance.timing ? window.performance.timing.responseStart : Number(new Date());
 </script>
 <script>
  !function(){"use strict";var n=window;const e="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",o=new RegExp(`^\\d{10}_[${e}]{16}$`);const t=/(?:^| )bev=(.*?)(?:;|$)/;let c=!1;function i(){if(c||"undefined"==typeof document)return null;c=!0;const n=(document.cookie||"").match(t);if(!n||2!==n.length)return null;const e=decodeURIComponent(n[1]);return function(n){return o.test(n)}(e)?e:null}!function(){try{if(n.bev=n.bev||i(),!n.bev){const o=function(){const n=[];for(let o=0;o<16;o+=1)n.push(e[Math.floor(Math.random()*e.length)]);return`${Math.floor(Date.now()/1e3)}_${n.join("")}`}();!function(n){const{hostname:e}=document.location,o="."+e.slice(e.indexOf("airbnb.")),t=new Date;t.setDate(t.getDate()+730),document.cookie=["bev="+encodeURIComponent(n),"e

# 4. Inspect elements (that will be useful for Airbnb analytics)

In [None]:
# Press F12 on the search page in the browser to open the developer tools and inspect elements

# 5. Scrape 1 element

In [16]:
listings=soup.find_all('div','_gig1e7')

In [17]:
# I can also extract the child tag
#listings = soup.find_all('div','_8s3ctt')

In [18]:
listings[0]

<div class="_gig1e7"><div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem"><meta content="Vintage van 'Stevie' w. Mt views, food &amp; hot tub - null - Pumpenbil" itemprop="name"/><meta content="1" itemprop="position"/><meta content="www.airbnb.com.au/rooms/24088196?previous_page_section_name=1000" itemprop="url"/><div><div><div aria-labelledby="title_24088196" class="_8s3ctt" role="group"><a aria-labelledby="title_24088196" class="_mm360j" href="/rooms/24088196?previous_page_section_name=1000&amp;federated_search_id=ff0f079a-386d-4984-84a1-01bd330ee3d0" rel="noopener noreferrer" target="listing_24088196"></a><div class="_1nz9l7j"><div class="_uae3t0w"><div class="_1mx6kqf" style="background:#484848;--dls-basecard-padding-top:66.6667%"><div class="_1szwzht"><div class="_v0gz4uz" style="--dls-liteimage-padding-top:66.6667%"><div class="_4626ulj"><img alt="" aria-hidden="true" class="_91slf2a" data-original-uri="https://a0.muscache.com/im/pictures/564a09e5-a1

In [19]:
#extract the link to the first listing(anchor tag), extract the url link itself
listings[0].find_all('a')[0].get('href') 

'/rooms/24088196?previous_page_section_name=1000&federated_search_id=ff0f079a-386d-4984-84a1-01bd330ee3d0'
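The lookups used in this section can be tried offline on a small HTML string, which makes it easier to see what each call returns. The markup below is a simplified, hypothetical stand-in for the real page (the class names and room IDs are copied from the notebook outputs, the structure is reduced), and `soup_demo` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the listing printed above.
html = """
<div class="_gig1e7">
  <a class="_mm360j" href="/rooms/24088196"></a>
  <div class="_b14dlit">Campervan/RV in Pumpenbil</div>
</div>
<div class="_gig1e7">
  <a class="_mm360j" href="/rooms/36945292"></a>
  <div class="_b14dlit">Entire cabin in Kuranda</div>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# A string passed as the second positional argument filters on the CSS class.
cards = soup_demo.find_all("div", "_gig1e7")

# get('href') reads an attribute; get_text() returns the text inside the tag.
hrefs = [card.find("a").get("href") for card in cards]
headers = [card.find("div", {"class": "_b14dlit"}).get_text() for card in cards]

print(hrefs)
print(headers)
```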

In [20]:
#extract the text that the listing holds
listings[0].get_text()

"SUPERHOSTCampervan/RV in PumpenbilVintage van 'Stevie' w. Mt views, food & hot tub2 guests · 1 bedroom · 1 bed · 1 bathFree parking · Wi-Fi4.97\xa0(33 reviews)$233 AUD/ night$233 AUD per night"

# 6. Inspect all data elements on search page

In [29]:
# url: tag=a, get = href
# name: tag=div, class=_hxt6u1e, get=aria-label
# header: tag=div, class= _b14dlit
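The listing HTML printed in step 5 also contains schema.org `itemprop` meta tags (`name`, `position`, `url`). Selecting on those attributes may be less brittle than the generated class names, although that is an assumption rather than something verified in this notebook. A minimal sketch, using a trimmed copy of that markup and a hypothetical helper `extract_itemprops`:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the meta tags seen in the listing output of step 5.
html = """
<div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
  <meta content="Vintage van 'Stevie' w. Mt views, food &amp; hot tub - null - Pumpenbil" itemprop="name"/>
  <meta content="1" itemprop="position"/>
  <meta content="www.airbnb.com.au/rooms/24088196?previous_page_section_name=1000" itemprop="url"/>
</div>
"""


def extract_itemprops(listing_html):
    """Collect every <meta itemprop=... content=...> pair into a dict."""
    props = {}
    # attrs={"itemprop": True} matches any tag that has the attribute at all.
    for meta in listing_html.find_all("meta", attrs={"itemprop": True}):
        props[meta["itemprop"]] = meta.get("content")
    return props


listing = BeautifulSoup(html, "html.parser")
props = extract_itemprops(listing)
print(props["name"])
print(props["url"])
```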

# 7. Write a scraping function

In [30]:
# My first iteration 

def extract_basic_features(listing_html):
    features_dict = {}
    
    url = listing_html.find('a').get('href')
    name = listing_html.find("div",{"class":"_5kaapu"}).get_text() #get('aria-label') does not work
    header = listing_html.find("div",{"class":"_b14dlit"}).get_text()
    price = listing_html.find("span",{"class":"_olc9rf0"}).get_text()
    
    features_dict['url'] = url
    features_dict['name']= name
    features_dict['header'] = header
    features_dict['price'] = price
    
    return features_dict

In [31]:
extract_basic_features(listings[0])

{'url': '/rooms/36945292?previous_page_section_name=1000&federated_search_id=382d21b6-9a23-4063-a40e-c88ffbe85642',
 'name': 'Spring Haven Kuranda – Rainforest Garden Retreat',
 'header': 'Entire cabin in Kuranda',
 'price': '$150 AUD'}

In [32]:
# if the tag is not found? e.g.:
listings[0].find('b').get_text()

AttributeError: 'NoneType' object has no attribute 'get_text'

In [21]:
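# Second iteration (to overcome the issue above):
def extract_basic_features(listing_html):
    features_dict = {}

    # find() returns None when nothing matches, so calling a method on the
    # result raises AttributeError; fall back to 'empty' in that case.
    try:
        # 'b' is absent from the listing, so url falls back to 'empty'
        url = listing_html.find('b').get('href')
    except AttributeError:
        url = 'empty'
    try:
        name = listing_html.find("div", {"class": "_5kaapu"}).get_text()
    except AttributeError:
        name = 'empty'
    try:
        header = listing_html.find("div", {"class": "_b14dlit"}).text
    except AttributeError:
        header = 'empty'

    features_dict['url'] = url
    features_dict['name'] = name
    features_dict['header'] = header

    return features_dict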

In [23]:
# demonstrates that at least my function won't break if it cannot find a class stated
extract_basic_features(listings[0])


{'url': 'empty',
 'name': "Vintage van 'Stevie' w. Mt views, food & hot tub",
 'header': 'Campervan/RV in Pumpenbil'}

In [24]:
# too many separate extractions! create a dictionary to specify all the elements I wish to scrape
rules_search_page = {
    'url':{'tag':'a', 'get':'href'},
    'name':{'tag':'div','class':'_5kaapu'},
    'header':{'tag':'div', 'class':'_b14dlit'},
    'rooms':{'tag':'div', 'class':'_kqh46o' },
    'facilities':{'tag':'div', 'class':'_kqh46o', 'order':1},
    'rating_n_reviews':{'tag':'a', 'class':'__1jhvjuo', 'get':'href'},
    'price':{'tag':'span', 'class':'_olc9rf0'},
    'superhost':{'tag':'div','class':'_ufoy4t'}
    
    }

In [25]:
# Third iteration (rewrite the function so it understands the dictionary created above)
# 3 steps to writing the function

def extract_element(listing_html,params):

    # 1. Find the right tag
    if 'class' in params:
        elements_found = listing_html.find_all(params['tag'], params['class'])
    else:
        elements_found = listing_html.find_all(params['tag'])
        
    # 2. Extract the right element
    tag_order = params.get('order',0)
    element = elements_found[tag_order]
    
    # 3. Get text
    if 'get' in params:
        output = element.get(params['get'])
    else:
        output = element.get_text()
        
    return output


In [37]:
print(extract_element(listings[0], rules_search_page['header']))
print(extract_element(listings[0], rules_search_page['url']))

Entire cabin in Kuranda
/rooms/36945292?previous_page_section_name=1000&federated_search_id=382d21b6-9a23-4063-a40e-c88ffbe85642


In [26]:
# now iterate through each feature I want to extract
for feature in rules_search_page:
    try:
        print(f"{feature}:{extract_element(listings[0], rules_search_page[feature])}")
    except Exception:  # e.g. IndexError when no matching tag is found
        print(f"{feature}:empty")

url:/rooms/24088196?previous_page_section_name=1000&federated_search_id=ff0f079a-386d-4984-84a1-01bd330ee3d0
name:Vintage van 'Stevie' w. Mt views, food & hot tub
header:Campervan/RV in Pumpenbil
rooms:2 guests · 1 bedroom · 1 bed · 1 bath
facilities:Free parking · Wi-Fi
rating_n_reviews:empty
price:$233 AUD
superhost:SUPERHOST


# Completed extracting features from an airbnb listing (:

# 8. Explore pagination

In [27]:
# Observe how the URL changes when moving from one page of search results to the next
# Page 1: https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click
# Page 2: https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&source=structured_search_input_header&search_type=pagination&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&federated_search_session_id=5cfc0f7e-826a-41a2-8a3f-8c2e22cff802&items_offset=20&section_offset=6
# Page 3: https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&source=structured_search_input_header&search_type=pagination&ne_lat=-13.627010389820814&ne_lng=160.0967327582215&sw_lat=-31.13029290215115&sw_lng=138.97689022892462&zoom=5&search_by_map=true&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&federated_search_session_id=2ea30166-33ed-44aa-b66a-541655cc00e4&items_offset=40&section_offset=6
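The three URLs above differ mainly in the `items_offset` query parameter (20 for page 2, 40 for page 3). As an alternative to string concatenation, the parameter can be set with the standard library's `urllib.parse`, which also avoids adding it twice. This is a sketch: `with_offset` is a hypothetical helper, and `base` is a shortened version of the search URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_offset(url, offset):
    """Return `url` with its `items_offset` query parameter set to `offset`."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous items_offset, if any.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "items_offset"
    ]
    query.append(("items_offset", str(offset)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened version of the search URL, for illustration only.
base = ("https://www.airbnb.com.au/s/Queensland--Australia/homes"
        "?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes")

page_2 = with_offset(base, 20)
page_3 = with_offset(page_2, 40)  # replaces the offset instead of appending a second one
print(page_2)
print(page_3)
```

Note that `urlencode` writes spaces as `+` rather than `%20`, so a re-encoded URL can look slightly different from the one copied from the browser.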

In [28]:
airbnb_url

'https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click'

In [29]:
# Now write the function which extracts the listings
def get_listings(search_page):
    soup = BeautifulSoup(requests.get(search_page).content, 'html.parser')
    listings = soup.find_all('div','_gig1e7')
    
    return listings 

In [30]:
# check if it works
len(get_listings(airbnb_url))

20

In [31]:
# now try it on second page
url_2 = airbnb_url + '&items_offset=20'
len(get_listings(url_2))

20

In [32]:
# double check the content of url_2 if the data is there
print(extract_element(get_listings(airbnb_url)[0],rules_search_page['name']))
print(extract_element(get_listings(url_2)[0], rules_search_page['name']))

Noosa Hinterland Hideaway 1
Luxury Glamping at Kanimbia in Obi Obi


# 9. Collect all urls

In [34]:
# iterate through all 15 pages
all_listings = []
for i in range(15):
    offset = 20 * i
    # f-string so that {offset} is actually substituted into the URL
    url_2 = airbnb_url + f'&items_offset={offset}'
    new_listings = get_listings(url_2)
    all_listings.extend(new_listings)

    # check if it's scraping
    print(len(all_listings))
    
    

20
40
60
80
100
120
140
160
180
200
220
240
260
280
300


In [None]:
# if Airbnb or any other website blocks the scraper, try pausing between requests
#import time
#at the end of the loop, add
#time.sleep(2)
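The pause suggested above can be wrapped in a small helper so that every request loop gets it for free. This is a sketch under stated assumptions: `scrape_politely` is a hypothetical name, the 2 to 4 second range is an arbitrary choice, and `fetch` stands for any function that takes a URL (for example `get_listings`).

```python
import random
import time


def scrape_politely(urls, fetch, min_delay=2.0, max_delay=4.0, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomise the pause so requests are not perfectly periodic.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use the default `time.sleep` applies.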

In [35]:
# another check worth doing
print(extract_element(all_listings[113], rules_search_page['name']))

Calypso Resort Studio 444


# 10. Scrape all search pages

### 1. Build all urls
### 2. Iteratively scrape them

In [36]:
# 1. Build all urls
def build_urls(main_url, listings_per_page=20, pages_per_location=15):
    url_list = []
    for i in range(pages_per_location):
        offset = listings_per_page * i
        url_pagination = main_url + f'&items_offset={offset}'
        url_list.append(url_pagination)

    return url_list

In [37]:
# safe function to extract all features from one listing without throwing errors

def extract_page_features(soup, rules):
    features_dict = {}
    for feature in rules:
        try:
            features_dict[feature] = extract_element(soup, rules[feature])
        except Exception:  # e.g. IndexError when no matching tag is found
            features_dict[feature] = 'empty'
    return features_dict

In [38]:
# 2. Iteratively scrape pages
def process_search_pages(url_list):
    features_list = []
    for page in url_list:
        listings = get_listings(page)
        for listing in listings:
            features = extract_page_features(listing, rules_search_page)
            features_list.append(features)
            
    return features_list

In [39]:
# build a list of urls
url_list = build_urls(airbnb_url)

In [40]:
url_list

['https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click&items_offset=0',
 'https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Queensland%2C%20Australia&place_id=ChIJ_dxieiTf1GsRmb4SdiLQ8vU&source=structured_search_input_header&search_type=autocomplete_click&items_offset=20',
 'https://www.airbnb.com.au/s/Queensland--Australia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=june&flexible_trip_dates%5B%5D=may&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_t

In [41]:
# try the function for one page
base_features = process_search_pages(url_list[:1])

In [43]:
base_features

[{'url': '/rooms/17503019?previous_page_section_name=1000&federated_search_id=9ba10514-b727-404d-b035-519c3c4186f9',
  'name': 'Cabarita headland townhouse(2 bedrooms)',
  'header': 'Entire townhouse in Bogangar',
  'rooms': '5 guests · 2 bedrooms · 2 beds · 1 bath',
  'facilities': 'Kitchen · Free parking · Wi-Fi',
  'rating_n_reviews': 'empty',
  'price': '$94 AUD',
  'superhost': 'SUPERHOST'},
 {'url': '/rooms/42819014?previous_page_section_name=1000&federated_search_id=9ba10514-b727-404d-b035-519c3c4186f9',
  'name': 'COSY COMFORT - On The Noosa River',
  'header': 'Entire guest suite in Noosaville',
  'rooms': '2 guests · 1 bedroom · 1 bed · 1 bath',
  'facilities': 'Free parking · Wi-Fi',
  'rating_n_reviews': 'empty',
  'price': '$110 AUD',
  'superhost': 'SUPERHOST'},
 {'url': '/rooms/48739682?previous_page_section_name=1000&federated_search_id=9ba10514-b727-404d-b035-519c3c4186f9',
  'name': 'OtherWorld Residence Byron Bay',
  'header': 'Entire loft in Byron Bay',
  'rooms': '
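A list of dictionaries like `base_features` can be written to disk so the scrape does not have to be repeated. The sketch below uses the standard library's `csv.DictWriter`. The helper name `features_to_csv` is hypothetical, and the two sample records are shortened copies of the output above.

```python
import csv
import io


def features_to_csv(features_list, file_obj):
    """Write a list of feature dicts to `file_obj` as CSV; return the row count."""
    if not features_list:
        return 0
    # Use the keys of the first record as the column order.
    fieldnames = list(features_list[0].keys())
    writer = csv.DictWriter(
        file_obj,
        fieldnames=fieldnames,
        restval="empty",        # fill keys missing from a record
        extrasaction="ignore",  # skip keys not present in the first record
    )
    writer.writeheader()
    writer.writerows(features_list)
    return len(features_list)


# Shortened sample records, for illustration only.
sample = [
    {"url": "/rooms/17503019", "name": "Cabarita headland townhouse(2 bedrooms)", "price": "$94 AUD"},
    {"url": "/rooms/42819014", "name": "COSY COMFORT - On The Noosa River", "price": "$110 AUD"},
]

buffer = io.StringIO()
features_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object in place of `buffer`.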

# Next Steps:

## 1. Scrape data from detailed pages since search pages are dynamic
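The `url` values scraped so far are relative paths such as `/rooms/17503019?...`, so they need to be turned into absolute URLs before the detail pages can be requested. A minimal sketch with `urllib.parse.urljoin` follows. The helper name `detail_page_url` is hypothetical, and dropping the query string is an assumption (the tracking parameters may or may not be required by the site).

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.airbnb.com.au"  # domain taken from the search URL used above


def detail_page_url(relative_url, base_url=BASE_URL, keep_query=False):
    """Turn a scraped '/rooms/...' path into an absolute URL."""
    absolute = urljoin(base_url, relative_url)
    if keep_query:
        return absolute
    # Drop parameters such as federated_search_id (assumed to be optional).
    parts = urlsplit(absolute)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


print(detail_page_url("/rooms/17503019?previous_page_section_name=1000"))
```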

## 2. ML models
### + clean the data
### + build features
### + fill empty values
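As a starting point for the cleaning and feature-building steps, the scraped strings can be converted into numbers. The sketch below assumes the formats seen in the outputs above (`'$233 AUD'` and `'2 guests · 1 bedroom · 1 bed · 1 bath'`); the function names are hypothetical and other formats may need extra handling.

```python
import re


def parse_price(price_text):
    """Convert a string like '$233 AUD' to a float; return None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text or "")
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_rooms(rooms_text):
    """Convert '2 guests · 1 bedroom · 1 bed · 1 bath' into a dict of counts."""
    counts = {}
    # Each "<number> <word>" pair becomes one entry, whatever the separator is.
    for number, word in re.findall(r"(\d+(?:\.\d+)?)\s+([A-Za-z]+)", rooms_text or ""):
        key = word.lower()
        # Drop a plural 's' so 'bedrooms' and 'bedroom' share one key.
        if key.endswith("s"):
            key = key[:-1]
        counts[key] = float(number)
    return counts


print(parse_price("$233 AUD"))
print(parse_rooms("5 guests · 2 bedrooms · 2 beds · 1 bath"))
```

Values scraped as `'empty'` come back as `None` or an empty dict, which keeps the missing-value handling explicit for the later modelling step.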
