## Data Wrangling: Web Scraping

For this project, I am using data from http://insideairbnb.com, which publishes its datasets by city. I'm focusing on the listings.csv and reviews.csv files for each city, and I find their download links by scraping the site's data page.

In [15]:
# import relevant packages
import pandas as pd
import requests
import os
import shutil
from bs4 import BeautifulSoup

In [7]:
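# fetch the page and parse the response into a BeautifulSoup object
url = 'http://insideairbnb.com/get-the-data.html'
r = requests.get(url)
r.raise_for_status()  # stop early if the request failed
html_doc = r.text
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')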

In [8]:
# collect every href in an <a> tag that contains 'listings.csv' or 'reviews.csv' (excluding the visualisations files)
# all of these links point to gzip files
zipped_links = []
for link in soup.find_all('a'):
    link_url = link.get('href')
    if (link_url is not None) and ('listings.csv' in link_url or 'reviews.csv' in link_url) and ('visualisations' not in link_url):
        zipped_links.append(link_url)
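
The filter condition above can also be pulled out into a small predicate, which makes it easy to check against a few hand-written links without hitting the network. This is only a sketch: `is_data_link` is a hypothetical helper name, and the `example.com` URLs are made up for illustration rather than taken from the site.

```python
def is_data_link(href):
    """Return True for listings.csv / reviews.csv links outside 'visualisations'."""
    if href is None:
        return False
    wanted = 'listings.csv' in href or 'reviews.csv' in href
    return wanted and 'visualisations' not in href

# hypothetical hrefs, including a tag with no href at all (None)
sample_hrefs = [
    'http://example.com/some-city/2019-01-01/data/listings.csv.gz',
    'http://example.com/some-city/2019-01-01/data/reviews.csv.gz',
    'http://example.com/some-city/2019-01-01/visualisations/listings.csv',
    'http://example.com/some-city/2019-01-01/data/calendar.csv.gz',
    None,
]
kept = [h for h in sample_hrefs if is_data_link(h)]
print(kept)
```

With a helper like this, the loop body reduces to `if is_data_link(link_url): zipped_links.append(link_url)`.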

In [14]:
# download each link to its own file
for link in zipped_links:
    file_url_split = link.split('/')
    filename = file_url_split[-4] + '_' + file_url_split[-3] + '_' + file_url_split[-1]
    # only download files that are not already in the directory
    if not os.path.isfile(filename):
        # request first so a failed download does not leave an empty file behind
        r = requests.get(link)
        r.raise_for_status()
        with open(filename, "wb") as f:
            f.write(r.content)
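
To see what the filename logic above produces, here is a minimal sketch that applies the same indexing to a single URL. The URL is a made-up placeholder with an assumed path shape (its last four segments are a city, a date, `data`, and the file name), and `build_filename` is a hypothetical helper, not part of the notebook.

```python
def build_filename(link):
    """Join the 4th-from-last, 3rd-from-last, and last URL segments with underscores."""
    parts = link.split('/')
    return '_'.join([parts[-4], parts[-3], parts[-1]])

# hypothetical URL; the real links may be shaped differently
example_link = 'http://example.com/some-country/some-region/some-city/2019-01-01/data/listings.csv.gz'
example_name = build_filename(example_link)
print(example_name)  # some-city_2019-01-01_listings.csv.gz
```

If a link has fewer than four segments, the negative indexing raises an `IndexError`, so this approach relies on every link sharing the same layout.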

In [29]:
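# move the downloaded csv files to the 'orig_csv' folder
source = '/Users/limesncoconuts2/Springboard/airbnb_project/data_wrangling/'
dest = os.path.join(source, 'orig_csv')
os.makedirs(dest, exist_ok=True)  # without the folder, shutil.move would rename the first file to 'orig_csv'
for f in os.listdir(source):
    # skip the notebook, the destination folder, and anything that is not a downloaded csv
    if os.path.isfile(os.path.join(source, f)) and '.csv' in f:
        shutil.move(os.path.join(source, f), dest)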
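
The move step can be tried safely on a throwaway directory before running it on the real data. The sketch below uses `pathlib` and `tempfile` for that. The function name `move_csv_files`, the placeholder file names, and the rule "move any regular file whose name contains `.csv`" are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def move_csv_files(source_dir, dest_name='orig_csv'):
    """Move regular files whose names contain '.csv' into source_dir/dest_name."""
    source = Path(source_dir)
    dest = source / dest_name
    dest.mkdir(exist_ok=True)
    moved = []
    for path in sorted(source.iterdir()):
        # directories (including dest itself) and non-csv files are left in place
        if path.is_file() and '.csv' in path.name:
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved

with tempfile.TemporaryDirectory() as tmp:
    # create empty placeholder files that mimic the downloaded names
    for name in ['a_2019-01-01_listings.csv.gz', 'a_2019-01-01_reviews.csv.gz', 'web_scraping.ipynb']:
        (Path(tmp) / name).write_bytes(b'')
    moved = move_csv_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(moved)
print(remaining)
```

Filtering on the file name rather than excluding only the notebook keeps hidden files and the destination folder itself out of the move.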