<a href="https://colab.research.google.com/github/ShreyasJothish/airbnb_pricing_DS/blob/master/SJ2%20AirBnB_Data_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping Notebook


This notebook contains code to fetch the latest listing data for cities around the world from [InsideAirBnB](http://insideairbnb.com/get-the-data.html).

### Web Scraping

Using the BeautifulSoup library, we fetch the data for each city listed on the InsideAirBnB website.

Reference: [Tutorial: Python Web Scraping Using BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python)

In [0]:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://insideairbnb.com/get-the-data.html")

if page.status_code != 200:
  print("Error: Request to InsideAirBnB is failing")

Exploring the site structure shows that the download URLs are contained in **td** tags, so we extract the links from these tags. The **td** tags also contain links to archived data from earlier years for the same city; these are ignored.

Here we are interested only in the listing information for each city, so only the URL to **listings.csv.gz** is used.
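
As a minimal sketch of this filter, the snippet below runs the same **td**-based extraction on a small inline HTML fragment instead of the live page. The fragment `SAMPLE_HTML`, the helper name `listing_urls`, and the `calendar.csv.gz` and archived `2019-04-08` links are assumptions made up for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the table on the InsideAirBnB page.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Amsterdam</td>
    <td><a href="http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-05-06/data/listings.csv.gz">listings.csv.gz</a></td>
  </tr>
  <tr>
    <td><a href="http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-05-06/data/calendar.csv.gz">calendar.csv.gz</a></td>
  </tr>
  <tr>
    <td><a href="http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-04-08/data/listings.csv.gz">listings.csv.gz</a></td>
  </tr>
</table>
"""


def listing_urls(html):
    """Return the first link of every <td> cell that points to listings.csv.gz."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for td_tag in soup.find_all("td"):
        links = [a["href"] for a in td_tag.find_all("a", href=True)]
        # Cells without links, or whose first link is another file, are skipped.
        if links and "listings.csv.gz" in links[0]:
            urls.append(links[0])
    return urls


urls = listing_urls(SAMPLE_HTML)
for url in urls:
    print(url)
```

Both the current and the archived listings links pass this filter, which is why the parsing cell further down keeps only the first URL seen for each city.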

In [0]:
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup.prettify())

td_tags = soup.find_all('td')

In [3]:
# To ensure only the latest data for a particular city is used.
city_set = set()

# To maintain city level summary for data fetched.
city_data = []

for td_tag in td_tags:
  link_list = [a['href'] for a in td_tag.find_all('a', href=True)]
  
  # Fetch only listings.csv.gz data.
  if len(link_list) > 0 and link_list[0].find('listings.csv.gz') != -1:
    
    url = link_list[0]
    
    # Summary for each city is got by parsing the url itself.
    url_split = link_list[0].split('/')
    
    # InsideAirBnB follows a particular url format which is used as reference
    # for parsing.
    if len(url_split) != 9:
      print(f"Error: URL not following the format {url}")
      
      # It is seen the data for ireland is fetched but the url format is
      # different as compared to others. So this special handling is needed.
      if url_split[3] == "ireland":
        print("Info: Special handling for Ireland")
        country = url_split[3]
        region = url_split[3]
        city = url_split[3]
        date =  url_split[4]
    
        if city not in city_set:
          city_set.add(city)
          print([country, region, city, date, url])
          city_data.append([country, region, city, date, url])
    else:
      country = url_split[3]
      region = url_split[4]
      city = url_split[5]
      date =  url_split[6]
    
      if city not in city_set:
        city_set.add(city)
        city_data.append([country, region, city, date, url])

# Check summary information of each city.
print(f"Total number of city information fetched: {len(city_data)}")
print("Info: Start summary information of each city")
for city in city_data:
  print(city)
print("Info: Completed summary information of each city ")

Error: URL not following the format http://data.insideairbnb.com/ireland/2019-05-12/data/listings.csv.gz
Info: Special handling for Ireland
['ireland', 'ireland', 'ireland', '2019-05-12', 'http://data.insideairbnb.com/ireland/2019-05-12/data/listings.csv.gz']
Error: URL not following the format http://data.insideairbnb.com/united-states/2016-04-18/data/listings.csv.gz
Total number of city information fetched: 100
Info: Start summary information of each city
['the-netherlands', 'north-holland', 'amsterdam', '2019-05-06', 'http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-05-06/data/listings.csv.gz']
['belgium', 'vlg', 'antwerp', '2019-05-23', 'http://data.insideairbnb.com/belgium/vlg/antwerp/2019-05-23/data/listings.csv.gz']
['united-states', 'nc', 'asheville', '2019-06-26', 'http://data.insideairbnb.com/united-states/nc/asheville/2019-06-26/data/listings.csv.gz']
['greece', 'attica', 'athens', '2019-06-10', 'http://data.insideairbnb.com/greece/attica/athens/2019
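
The URL parsing in the cell above can also be written as a small standalone helper, which makes the two URL layouts easier to test. This is only a sketch: the function name `parse_listing_url` and the `COUNTRY_LEVEL` set are assumptions introduced here, and the path layouts are inferred from the URLs printed above rather than from any documented InsideAirBnB format.

```python
from urllib.parse import urlparse

# Assumption: countries published as one country-level dataset (no region/city).
COUNTRY_LEVEL = {"ireland"}


def parse_listing_url(url):
    """Split a listings URL into (country, region, city, date), or return None.

    Layouts handled, as seen in the output above:
      /<country>/<region>/<city>/<date>/data/listings.csv.gz
      /<country>/<date>/data/listings.csv.gz   (only for COUNTRY_LEVEL entries)
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 6:
        country, region, city, date = parts[:4]
        return country, region, city, date
    if len(parts) == 4 and parts[0] in COUNTRY_LEVEL:
        country, date = parts[:2]
        # Country-level data: reuse the country name as region and city.
        return country, country, country, date
    return None


print(parse_listing_url(
    "http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-05-06/data/listings.csv.gz"
))
print(parse_listing_url(
    "http://data.insideairbnb.com/ireland/2019-05-12/data/listings.csv.gz"
))
print(parse_listing_url(
    "http://data.insideairbnb.com/united-states/2016-04-18/data/listings.csv.gz"
))
```

Working on the URL path rather than on the full split string avoids depending on the number of slashes in the scheme and host. As in the original cell, the short `united-states` URL is rejected because only Ireland is treated as a country-level dataset.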

### Merging Data

The data from each city is consolidated into a single dataframe, which is used for further data wrangling and modeling.
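
One possible way to do the consolidation, sketched here under stated assumptions, is to collect the per-city dataframes in a list and concatenate them once at the end, passing `sort=False` so that columns keep their original order when the files do not share exactly the same columns. The tiny in-memory files in `downloads` and the helper names `read_listings` and `gz_csv` are hypothetical stand-ins for the real downloads; this is not the approach taken in the cells that follow, which write each file to disk and concatenate inside the loop.

```python
import gzip
import io

import pandas as pd


def read_listings(raw_bytes):
    """Load a gzip-compressed CSV held in memory into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw_bytes), compression="gzip")


def gz_csv(text):
    """Demo helper: gzip-compress CSV text in memory."""
    return gzip.compress(text.encode("utf-8"))


# Hypothetical stand-ins for two downloaded files with different columns.
downloads = {
    "city_a": gz_csv("id,price\n1,100\n2,150\n"),
    "city_b": gz_csv("id,price,license\n3,80,ABC\n"),
}

frames = []
for city_name, raw_bytes in downloads.items():
    df = read_listings(raw_bytes)
    print(f"Info: Shape of data for {city_name}: {df.shape}")
    frames.append(df)

# A single concat at the end avoids re-copying the growing frame on every step.
df_all = pd.concat(frames, ignore_index=True, sort=False)
print(df_all.shape)  # (3, 3)
```

Columns missing from a file (here `license` for `city_a`) are filled with `NaN` in the combined dataframe.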

In [0]:
import time
import gzip
import pandas as pd
import os

In [5]:
# Consolidated data frame to hold data for all cities together.
df_all = pd.DataFrame()

for city in city_data:
  city_name = city[2]
  url = city[4]
  print(f"Info: Downloading data for {city_name} with url {url}")
  
  r = requests.get(url)
  
  # Skip this city if the download failed.
  if r.status_code != 200:
    print(f"Error: Request to {url} failed with status {r.status_code}")
    continue
  
  # Fetch the data locally.
  file_name = f"{city_name}_listings.csv.gz"
  with open(file_name, 'wb') as f:  
    f.write(r.content)
    
  # Unzip and load the file to data frame.
  with gzip.open(file_name) as f:
    df = pd.read_csv(f)
    
    print(f"Info: Shape of data within {file_name}: {df.shape}")
    
    if df_all.empty:
      df_all = df
    else:
      df_all = pd.concat([df_all, df])
  
  print(f"Info: Shape of concatenated dataframe: {df_all.shape}")
  
  # Remove file 
  os.remove(file_name)
  print(f"Info: Removed {file_name}!")
  
  # Sleep for short duration to ensure server is not loaded
  time.sleep(10)

Info: Downloading data for amsterdam with url http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2019-05-06/data/listings.csv.gz
Info: Shape of data within amsterdam_listings.csv.gz: (19619, 106)
Info: Shape of concatenated dataframe: (19619, 106)
Info: Removed amsterdam_listings.csv.gz!
Info: Downloading data for antwerp with url http://data.insideairbnb.com/belgium/vlg/antwerp/2019-05-23/data/listings.csv.gz
Info: Shape of data within antwerp_listings.csv.gz: (2035, 106)
Info: Shape of concatenated dataframe: (21654, 106)
Info: Removed antwerp_listings.csv.gz!
Info: Downloading data for asheville with url http://data.insideairbnb.com/united-states/nc/asheville/2019-06-26/data/listings.csv.gz
Info: Shape of data within asheville_listings.csv.gz: (2170, 106)
Info: Shape of concatenated dataframe: (23824, 106)
Info: Removed asheville_listings.csv.gz!
Info: Downloading data for athens with url http://data.insideairbnb.com/greece/attica/athens/2019-06-10/data/listings.csv.gz
Info: Shape of data within athens_listings.csv.gz: (10414, 106)
Info: Shape of concatenated dataframe: (34238, 106)
Info: Removed athens_listings.csv.gz!
Info: Downloading data for austin with url http://data.insideairbnb.com/united-states/tx/austin/2019-05-14/data/listings.csv.gz
Info: Shape of data within austin_listings.csv.gz: (11792, 106)
Info: Shape of concatenated dataframe: (46030, 106)
Info: Removed austin_listings.csv.gz!
Info: Downloading data for barcelona with url http://data.insideairbnb.com/spain/catalonia/barcelona/2019-05-14/data/listings.csv.gz
Info: Shape of data within barcelona_listings.csv.gz: (18302, 106)
Info: Shape of concatenated dataframe: (64332, 106)
Info: Removed barcelona_listings.csv.gz!
Info: Downloading data for barossa-valley with url http://data.insideairbnb.com/australia/sa/barossa-valley/2019-05-22/data/listings.csv.gz
Info: Shape of data within barossa-valley_listings.csv.gz: (240, 106)
Info: Shape of concatenated dataframe: (64572, 106)
Info: Removed barossa-valley_listings.csv.gz!
Info: Downloading data for barwon-south-west-vic with url http://data.insideairbnb.com/australia/vic/barwon-south-west-vic/2019-06-25/data/listings.csv.gz
Info: Shape of data within barwon-south-west-vic_listings.csv.gz: (5077, 106)
Info: Shape of concatenated dataframe: (69649, 106)
Info: Removed barwon-south-west-vic_listings.csv.gz!
Info: Downloading data for beijing with url http://data.insideairbnb.com/china/beijing/beijing/2019-05-21/data/listings.csv.gz
Info: Shape of data within beijing_listings.csv.gz: (31457, 106)
Info: Shape of concatenated dataframe: (101106, 106)
Info: Removed beijing_listings.csv.gz!
Info: Downloading data for belize with url http://data.insideairbnb.com/belize/bz/belize/2019-05-26/data/listings.csv.gz
Info: Shape of data within belize_listings.csv.gz: (2558, 106)
Info: Shape of concatenated dataframe: (103664, 106)
Info: Removed belize_listings.csv.gz!
Info: Downloading data for bergamo with url http://data.insideairbnb.com/italy/lombardia/bergamo/2019-05-30/data/listings.csv.gz
Info: Shape of data within bergamo_listings.csv.gz: (2412, 106)
Info: Shape of concatenated dataframe: (106076, 106)
Info: Removed bergamo_listings.csv.gz!
Info: Downloading data for berlin with url http://data.insideairbnb.com/germany/be/berlin/2019-05-14/data/listings.csv.gz
Info: Shape of data within berlin_listings.csv.gz: (23536, 106)
Info: Shape of concatenated dataframe: (129612, 106)
Info: Removed berlin_listings.csv.gz!
Info: 

Info: Shape of data within buenos-aires_listings.csv.gz: (18708, 106)
Info: Shape of concatenated dataframe: (189042, 106)
Info: Removed buenos-aires_listings.csv.gz!
Info: Downloading data for cambridge with url http://data.insideairbnb.com/united-states/ma/cambridge/2019-06-24/data/listings.csv.gz
Info: Shape of data within cambridge_listings.csv.gz: (1369, 106)
Info: Shape of concatenated dataframe: (190411, 106)
Info: Removed cambridge_listings.csv.gz!
Info: Downloading data for cape-town with url http://data.insideairbnb.com/south-africa/wc/cape-town/2019-05-23/data/listings.csv.gz
Info: Shape of data within cape-town_listings.csv.gz: (22486, 106)
Info: Shape of concatenated dataframe: (212897, 106)
Info: Removed cape-town_listings.csv.gz!
Info: Downloading data for chicago with url http://data.insideairbnb.com/united-states/il/chicago/2019-05-19/data/listings.csv.gz
Info: Shape of data within chicago_listings.csv.gz: (8169, 106)
Info: Shape of concatenated dataframe: (221066, 106)
Info: Removed chicago_listings.csv.gz!
Info: Downloading data for clark-county-nv with url http://data.insideairbnb.com/united-states/nv/clark-county-nv/2019-06-25/data/listings.csv.gz
Info: Shape of data within clark-county-nv_listings.csv.gz: (9369, 106)
Info: Shape of concatenated dataframe: (230435, 106)
Info: Removed clark-county-nv_listings.csv.gz!
Info: Downloading data for columbus with url http://data.insideairbnb.com/united-states/oh/columbus/2019-05-18/data/listings.csv.gz
Info: Shape of data within columbus_listings.csv.gz: (1363, 106)
Info: Shape of concatenated dataframe:

Info: Shape of data within florence_listings.csv.gz: (11762, 106)
Info: Shape of concatenated dataframe: (320805, 106)
Info: Removed florence_listings.csv.gz!
Info: Downloading data for geneva with url http://data.insideairbnb.com/switzerland/geneva/geneva/2019-05-25/data/listings.csv.gz
Info: Shape of data within geneva_listings.csv.gz: (2976, 106)
Info: Shape of concatenated dataframe: (323781, 106)
Info: Removed geneva_listings.csv.gz!
Info: Downloading data for ghent with url http://data.insideairbnb.com/belgium/vlg/ghent/2019-06-16/data/listings.csv.gz
Info: Shape of data within ghent_listings.csv.gz: (1265, 106)
Info: Shape of concatenated dataframe: (325046, 106)
Info: Removed ghent_listings.csv.gz!
Info: Downloading data for girona with url http://data.insideairbnb.com/spain/catalonia/girona/2019-05-29/data/listings.csv.gz
Info: Shape of data within girona_listings.csv.gz: (17478, 106)
Info: Shape of concatenated dataframe: (342524, 106)
Info: Removed girona_listings.csv.gz!
In

Info: Shape of data within hong-kong_listings.csv.gz: (12107, 106)
Info: Shape of concatenated dataframe: (384503, 106)
Info: Removed hong-kong_listings.csv.gz!
Info: Downloading data for ireland with url http://data.insideairbnb.com/ireland/2019-05-12/data/listings.csv.gz
Info: Shape of data within ireland_listings.csv.gz: (27852, 111)


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

Info: Shape of concatenated dataframe: (412355, 113)
Info: Removed ireland_listings.csv.gz!
Info: Downloading data for istanbul with url http://data.insideairbnb.com/turkey/marmara/istanbul/2019-05-29/data/listings.csv.gz
Info: Shape of data within istanbul_listings.csv.gz: (18589, 106)
Info: Shape of concatenated dataframe: (430944, 113)
Info: Removed istanbul_listings.csv.gz!
Info: Downloading data for jersey-city with url http://data.insideairbnb.com/united-states/nj/jersey-city/2019-06-29/data/listings.csv.gz
Info: Shape of data within jersey-city_listings.csv.gz: (2877, 106)
Info: Shape of concatenated dataframe: (433821, 113)
Info: Removed jersey-city_listings.csv.gz!
Info: Downloading data for lisbon with url http://data.insideairbnb.com/portugal/lisbon/lisbon/2019-06-26/data/listings.csv.gz
Info: Shape of data within lisbon_listings.csv.gz: (24423, 106)
Info: Shape of concatenated dataframe: (458244, 113)
Info: Removed lisbon_listings.csv.gz!
Info: Downloading data for london with url http://data.insideairbnb.com/united-kingdom/england/london/2019-06-05/data/listings.csv.gz
Info: Shape of data within london_listings.csv.gz: (82029, 106)
Info: Shape of concatenated dataframe: (540273, 113)
Info: Removed london_listings.csv.gz!
Info: Downloading data for los-angeles with url http://data.insideairbnb.com/united-states/ca/los-angeles/2019-05-05/data/listings.csv.gz
Info: Shape of data within los-angeles_listings.csv.gz: (43954, 106)
Info: Shape of concatenated dataframe: (584227, 113)
Info: Removed los-angeles_listings.csv.gz!
Info: Downloading data for lyon with url http://data.insideairbnb.com/france/auvergne-rhone-alpes/lyon/2019-05-23/data/listings.csv.gz
Info: Shape of data within lyon_listings.csv.gz: (11212, 106)
Info: Shape of concatenated dataframe: (595439, 113)
Inf

Info: Shape of data within mexico-city_listings.csv.gz: (18348, 106)
Info: Shape of concatenated dataframe: (682344, 113)
Info: Removed mexico-city_listings.csv.gz!
Info: Downloading data for milan with url http://data.insideairbnb.com/italy/lombardy/milan/2019-05-14/data/listings.csv.gz
Info: Shape of data within milan_listings.csv.gz: (20627, 106)
Info: Shape of concatenated dataframe: (702971, 113)
Info: Removed milan_listings.csv.gz!
Info: Downloading data for montreal with url http://data.insideairbnb.com/canada/qc/montreal/2019-06-10/data/listings.csv.gz
Info: Shape of data within montreal_listings.csv.gz: (20933, 106)
Info: Shape of concatenated dataframe: (723904, 113)
Info: Removed montreal_listings.csv.gz!
Info: Downloading data for munich with url http://data.insideairbnb.com/germany/bv/munich/2019-05-22/data/listings.csv.gz
Info: Shape of data within munich_listings.csv.gz: (9881, 106)
Info: Shape of concatenated dataframe: (733785, 113)
Info: Removed munich_listings.csv.gz

Info: Shape of data within new-york-city_listings.csv.gz: (48801, 106)
Info: Shape of concatenated dataframe: (804326, 113)
Info: Removed new-york-city_listings.csv.gz!
Info: Downloading data for northern-rivers with url http://data.insideairbnb.com/australia/nsw/northern-rivers/2019-05-29/data/listings.csv.gz
Info: Shape of data within northern-rivers_listings.csv.gz: (5750, 106)
Info: Shape of concatenated dataframe: (810076, 113)
Info: Removed northern-rivers_listings.csv.gz!
Info: Downloading data for oakland with url http://data.insideairbnb.com/united-states/ca/oakland/2019-05-18/data/listings.csv.gz
Info: Shape of data within oakland_listings.csv.gz: (3167, 106)
Info: Shape of concatenated dataframe: (813243, 113)
Info: Removed oakland_listings.csv.gz!
Info: Downloading data for oslo with url http://data.insideairbnb.com/norway/oslo/oslo/2019-05-28/data/listings.csv.gz
Info: Shape of data within oslo_listings.csv.gz: (8192, 106)
Info: Shape of concatenated dataframe: (821435, 11

Info: Shape of data within singapore_listings.csv.gz: (8325, 106)
Info: Shape of concatenated dataframe: (1146694, 113)
Info: Removed singapore_listings.csv.gz!
Info: Downloading data for south-aegean with url http://data.insideairbnb.com/greece/south-aegean/south-aegean/2019-05-28/data/listings.csv.gz
Info: Shape of data within south-aegean_listings.csv.gz: (21008, 106)
Info: Shape of concatenated dataframe: (1167702, 113)
Info: Removed south-aegean_listings.csv.gz!
Info: Downloading data for stockholm with url http://data.insideairbnb.com/sweden/stockholms-l%C3%A4n/stockholm/2019-06-28/data/listings.csv.gz
Info: Shape of data within stockholm_listings.csv.gz: (8012, 106)
Info: Shape of concatenated dataframe: (1175714, 113)
Info: Removed stockholm_listings.csv.gz!
Info: Downloading data for sydney with url http://data.insideairbnb.com/australia/nsw/sydney/2019-06-04/data/listings.csv.gz
Info: Shape of data within sydney_listings.csv.gz: (37644, 106)
Info: Shape of concatenated datafr