## Data Wrangling: Web Scraping

For this project, I am using data from http://insideairbnb.com, which stores its datasets by city. I'm focusing on the listings.csv and reviews.csv files for each city, accessing them by scraping the webpage.

**1. Import relevant packages**

The requests and BeautifulSoup packages let me fetch the page in question and extract the information I need from it (the csv file names and content).

The os and shutil packages let me check whether the files I'm writing from the website are already in the directory, and move the files into a new folder once they have all been created. The check matters because there are so many files that the session timed out, and I had to run the script several times.

In [1]:
# import relevant packages
import requests
import os
import shutil
from bs4 import BeautifulSoup

**2. Request data**

I used requests to capture the response from the URL where all of the csv files are hosted: http://insideairbnb.com/get-the-data.html. After storing that response in a variable, I read its text into another variable and passed that text to the BeautifulSoup constructor to get a parsed object I could search.

In [2]:
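# catch the response from the url and convert to BeautifulSoup object
url = 'http://insideairbnb.com/get-the-data.html'
r = requests.get(url)
# stop early if the server returned an error status
r.raise_for_status()
html_doc = r.text
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')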

**3. Make list of target filenames**

Using BeautifulSoup, I looped over the HTML anchor tags and read each tag's 'href' attribute, which is where the csv links are found. I stored the csv URLs in a list. Since the links lead to gzipped (.csv.gz) files, I called the list _zipped\_links_.

Since I only wanted the raw data about listings and reviews, I used an if statement to choose only those files. Some of the files, however, are summary statistics of the listings and reviews data, optimized for visualization. Their URLs contain the keyword 'visualisations', so the same if statement skips any link that contains it.

In [5]:
# extract all href links in <a> tags that contain 'listings.csv' or 'reviews.csv' (but not the visualizations) and store in a list
# all links are in gzip files
zipped_links = []
for link in soup.find_all('a'):
    link_url = link.get('href')
    # only choose csv files that are listings or review data, exclude visualizations
    if (link_url is not None) and ('listings.csv' in link_url or 'reviews.csv' in link_url) and ('visualisations' not in link_url):
        zipped_links.append(link_url)
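
The filter in the loop above can also be pulled out into a small standalone predicate, which makes it easy to try on a few sample links before scraping. This is only a sketch: the function name is_target_link and the sample URLs are hypothetical and not taken from the site.

```python
def is_target_link(href):
    """Return True for raw listings/reviews csv links, False otherwise."""
    # anchor tags without an href attribute come back as None
    if href is None:
        return False
    wanted = 'listings.csv' in href or 'reviews.csv' in href
    # summary files built for visualization carry 'visualisations' in their URL
    return wanted and 'visualisations' not in href


# hypothetical sample links, for illustration only
samples = [
    'http://example.com/city/2019-01-01/data/listings.csv.gz',
    'http://example.com/city/2019-01-01/visualisations/listings.csv',
    'http://example.com/city/2019-01-01/data/calendar.csv.gz',
    None,
]
kept = [s for s in samples if is_target_link(s)]
```

With these samples, only the first link passes the filter.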

**4. Write csv.gz files locally**

I wrote the contents of the online gzipped csv files to files on my local machine. I looped through _zipped\_links_, built a custom filename for each link, and wrote the downloaded bytes to that file inside a context manager. The filename includes the city name, the date the data was collected, and the information category (listings or reviews).

In [30]:
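# function to write information from the links to their own files
def write_files(ls, directory):
    for link in ls:
        file_url_split = link.split('/')
        # filename is <city>_<date>_<listings or reviews file>
        filename = file_url_split[-4] + '_' + file_url_split[-3] + '_' + file_url_split[-1]
        # join with os.path.join so a missing trailing slash cannot corrupt the path
        path = os.path.join(directory, filename)
        # if the file doesn't exist in our directory, write to the file
        if not os.path.isfile(path):
            # download before opening the file so a failed request leaves no empty file behind
            r = requests.get(link)
            with open(path, "wb") as f:
                f.write(r.content)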
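
To make the naming scheme concrete, here is a minimal sketch of the filename step on its own. It assumes each link ends in `.../<city>/<date>/data/<file>`, which is what the negative indices in write_files imply. The helper name build_local_path and the example link are hypothetical.

```python
import os


def build_local_path(link, directory):
    """Build <city>_<date>_<file> inside directory from a download link."""
    parts = link.split('/')
    # assumes the link ends in .../<city>/<date>/data/<file>
    filename = '_'.join([parts[-4], parts[-3], parts[-1]])
    return os.path.join(directory, filename)


# hypothetical link, used only to show the resulting name
example = 'http://example.com/country/region/amsterdam/2019-03-07/data/listings.csv.gz'
path = build_local_path(example, 'web_scraped')
```

For this example link the file would be named amsterdam_2019-03-07_listings.csv.gz inside web_scraped.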

In [44]:
# implement function in script
directory = '/Users/limesncoconuts2/springboard_data/data_capstone_one/web_scraped/'

# remove files that are 0 bytes (program timed out while they were being written previously)
for file in os.listdir(directory):
    file_path = os.path.join(directory, file)
    if os.path.getsize(file_path) == 0:
        os.remove(file_path)

# check if all files have been written
# if not, run writing function again
# if so, print affirmative statement
if len(zipped_links) != len(os.listdir(directory)):
    write_files(zipped_links, directory)
print('ALL FILES WRITTEN!')

ALL FILES WRITTEN!
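
Comparing the two counts only works if the folder holds nothing but the scraped files. A slightly more robust check, sketched below, compares expected filenames against what is on disk. The helper names local_name and missing_links are hypothetical, and the naming scheme is assumed to match the one used in write_files.

```python
import os


def local_name(link):
    """Same naming scheme as write_files: <city>_<date>_<file>."""
    parts = link.split('/')
    return parts[-4] + '_' + parts[-3] + '_' + parts[-1]


def missing_links(links, directory):
    """Return the links whose target file is absent or empty in directory."""
    missing = []
    for link in links:
        path = os.path.join(directory, local_name(link))
        # a 0-byte file counts as missing, since it marks an interrupted write
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(link)
    return missing
```

Under these assumptions, missing_links(zipped_links, directory) returns an empty list once every file is in place, and otherwise lists exactly the links that still need to be fetched.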


In [45]:
# some links to files on the website are broken, so we have to exclude these from our data
# remove files that are of less than 1kb (not actually csv.gz files because of broken url)
for file in os.listdir(directory):
    file_path = os.path.join(directory, file)
    if os.path.getsize(file_path) < 1000:
        os.remove(file_path)
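
The size threshold is a heuristic. An alternative, sketched here, is to check whether each file actually starts with the gzip magic number (the bytes 0x1f 0x8b). This assumes a broken link returns some non-gzip payload such as an error page, which I have not verified for every case. The function names are hypothetical.

```python
import os


def looks_like_gzip(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, 'rb') as f:
        return f.read(2) == b'\x1f\x8b'


def remove_non_gzip(directory):
    """Delete files in directory that are not gzip data; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not looks_like_gzip(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Returning the removed names makes it possible to log which city files were dropped because of a broken link.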