## Predicting Airbnb Listing Price | Data Collection

The data for this project comes from [Inside Airbnb](http://insideairbnb.com/get-the-data.html), an independent, non-commercial project that collects public data from the travel and accommodations company Airbnb.

This notebook will show how I scraped the necessary files from the web and consolidated them into files of listings data and reviews data.

---

In [1]:
import os
import shutil
import requests
import warnings
import pandas as pd

from bs4 import BeautifulSoup

In [2]:
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

### Web Scraping

Since I used data sourced from many files found on the same webpage, I decided to write a script that scrapes that page for the URLs relevant to my project.

To complete this task, I used the `requests` and `BeautifulSoup` packages. First, I used `requests` to capture the response from the URL where all of the csv files are hosted: http://insideairbnb.com/get-the-data.html.

After catching that response in a variable, I read the text into another variable. I turned the text variable into a BeautifulSoup object using the BeautifulSoup function. 

In [3]:
url = 'http://insideairbnb.com/get-the-data.html'
r = requests.get(url)
html_doc = r.text
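# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')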

After creating the BeautifulSoup object, I looped over the HTML anchor tags and read the `href` attribute of each one, which is where the csv links are found. I stored the matching csv URLs in a list. Since the links led to gzipped (.csv.gz) files, I called the list `zipped_links`.

In [4]:
zipped_links = []
for link in soup.find_all('a'):
    link_url = link.get('href')
    if (link_url is not None) and ('listings.csv' in link_url or 'reviews.csv' in link_url) and ('los-angeles' in link_url) and ('visualisations' not in link_url):
        zipped_links.append(link_url)
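
The conditions in the loop above can also be pulled out into a small predicate, which makes them easy to check without touching the network. The following is only a sketch: `is_target_link` and the sample URLs are hypothetical and were made up for illustration; they are not part of the original notebook or the scraped page.

```python
def is_target_link(href, city='los-angeles'):
    """Return True if href points at a full listings or reviews file for city.

    Hypothetical helper that mirrors the conditions used in the scraping loop.
    """
    if href is None:
        return False
    wanted_kind = 'listings.csv' in href or 'reviews.csv' in href
    return wanted_kind and city in href and 'visualisations' not in href


# Illustrative hrefs (made up for this example)
sample_hrefs = [
    'http://example.com/los-angeles/2019-03-06/data/listings.csv.gz',
    'http://example.com/los-angeles/2019-03-06/visualisations/listings.csv',
    'http://example.com/new-york-city/2019-03-06/data/reviews.csv.gz',
    None,
]
kept = [h for h in sample_hrefs if is_target_link(h)]
```

Only the first sample URL survives the filter: the second is a summary file under `visualisations`, the third belongs to another city, and the last anchor has no `href` at all.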

Next, I wrote the information from the online csv files to local files. I looped through `zipped_links`, built a custom filename for each link, and used a context manager to write the downloaded content to that file. The filename includes the city name, the date the data was collected, and the information category (listings or reviews).

In [5]:
def write_files(ls, directory):
    '''write information from the links to their own files'''
    for link in ls:
        file_url_split = link.split('/')
        # filename pattern: <city>_<date>_<listings or reviews file>
        filename = file_url_split[-4] + '_' + file_url_split[-3] + '_' + file_url_split[-1]
        path = os.path.join(directory, filename)
        # if the file doesn't exist in our directory, write to the file
        if not os.path.isfile(path):
            # download before opening the file, so a failed request
            # does not leave an empty file behind
            r = requests.get(link)
            with open(path, "wb") as f:
                f.write(r.content)
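
To make the filename logic concrete, here is a minimal sketch of the same indexing applied to a single URL. The helper name `build_filename` and the example URL are assumptions for illustration; the only thing taken from the notebook is the use of the fourth-from-last, third-from-last, and last path segments.

```python
def build_filename(link):
    """Build '<city>_<date>_<file>' from a download URL.

    Assumes the URL ends in .../<city>/<date>/<folder>/<file>, which is the
    layout the indexing in write_files relies on.
    """
    parts = link.split('/')
    return '_'.join([parts[-4], parts[-3], parts[-1]])


# Illustrative URL with the assumed layout (not copied from the scraped page)
example_link = 'http://example.com/united-states/ca/los-angeles/2019-03-06/data/listings.csv.gz'
print(build_filename(example_link))  # los-angeles_2019-03-06_listings.csv.gz
```

If a link ever had a different number of path segments, the negative indices would silently pick the wrong pieces, so it may be worth spot-checking a few generated names.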

In [6]:
directory = '/Users/limesncoconuts2/datasets/airbnb-web/'

# remove files that are 0 bytes (program timed out while they were being written previously)
for file in os.listdir(directory):
    if os.path.getsize(directory + file) == 0:
        os.remove(directory + file)

# check if all files have been written
if len(zipped_links) != len(os.listdir(directory)):
    write_files(zipped_links, directory)
print('ALL FILES WRITTEN!')

ALL FILES WRITTEN!


Some links to files on the website are broken, so the small, unusable files they produce have to be removed from the dataset.

In [7]:
# remove files that are smaller than 1 KB (1,000 bytes)
for file in os.listdir(directory):
    if os.path.getsize(directory + file) < 1000:
        os.remove(directory + file)
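
The same cleanup can be wrapped in a function and tried on a throwaway folder before pointing it at real data. This is a sketch under assumptions: `remove_small_files`, its `min_bytes` parameter, and the two dummy files are hypothetical and only stand in for a real download and a broken one.

```python
import os
import tempfile


def remove_small_files(directory, min_bytes=1000):
    """Delete files in directory smaller than min_bytes; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            os.remove(path)
            removed.append(name)
    return removed


# Demonstrate on a temporary directory rather than the real data folder
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'good.csv.gz'), 'wb') as f:
        f.write(b'x' * 2000)   # stands in for a real download
    with open(os.path.join(tmp, 'broken.csv.gz'), 'wb') as f:
        f.write(b'x' * 10)     # stands in for a short error response
    removed = remove_small_files(tmp)
    remaining = os.listdir(tmp)
```

Returning the removed names makes it easy to log which downloads were discarded.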

### Combining Data

After the raw files were scraped, created, and stored, the data was consolidated from monthly files into a single file each for listings and reviews. Because there are hundreds of thousands of lines of data in each file, it helped to create functions that did the heavy lifting:

`consolidate_data` checks if the consolidated csv file for either listings or reviews data has been created for the designated city. If the file has not been created, it runs the `combine_files` function for that city, and then creates the csv file for that city.

`combine_files` goes through files in the directory and checks for the designated city files of the specified kind. Then it appends the names of the files of that city to a list, and passes the list and the directory name to the `concat_files` function.

`concat_files` creates a pandas dataframe for each file name in the list of files, then appends the dataframe to a list of dataframes. After all files in the list have been converted to pandas dataframes, it concatenates the dataframes together, drops duplicate rows, and resets the dataframe index.

`export_csv` checks that the desired csv file does not already exist in the current working directory, then converts the dataframe to a csv file and moves it to the desired folder in the destination directory.

Even though I started by analyzing just Los Angeles, having functions will also allow me to scale my analysis to multiple cities in the future.

In [8]:
def consolidate_data(city, directory, destination):
    """ Checks if the csv file for either listings
        or reviews data has been created for the designated
        city in the destination folder.
        If the file has not been created, runs the combine_files
        function for that city and kind, and then creates
        the csv file for that city.
    """
    
    filename = city + '_listings.csv'
    if(not os.path.isfile(destination + filename)):
        listings_df = combine_files(city, directory, 'listings')
        export_csv(filename, listings_df, destination)
    
    filename = city + '_reviews.csv'
    if(not os.path.isfile(destination + filename)):
        reviews_df = combine_files(city, directory, 'reviews')
        export_csv(filename, reviews_df, destination)

In [9]:
def combine_files(city, directory, kind):
    """ Goes through files in the directory and checks for the
        designated city files of the specified kind. Appends the names of the
        files of that city to a list, and passes the list and the directory 
        name to the concat_files function.
    """
    target_files = []
    
    for file in os.listdir(directory):
        if city in file and kind in file:
            target_files.append(file)
    return concat_files(target_files, directory) 

In [10]:
def concat_files(file_list, directory):
    """Creates a pandas dataframe for each file name in the 
       list of files. Appends the dataframe to a list of dataframes. After all files
       in the list have been converted to pandas dataframes,
       concatenate the dataframes together, drop duplicates,
       and reset the dataframe index.
    """
    all_dfs = []
        
    for file in file_list:
        df = pd.read_csv(directory + file)
        all_dfs.append(df)
    
    concat_all = pd.concat(all_dfs)    
    concat_all.drop_duplicates(inplace=True)
    concat_all.reset_index(drop=True, inplace=True)
        
    return concat_all
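
To see what the concatenate, de-duplicate, and re-index steps do, here is a minimal sketch on two tiny dataframes. The `march` and `april` snapshots and their `id` and `price` columns are made up for illustration and do not reflect the real file schema.

```python
import pandas as pd

# Two made-up monthly snapshots with overlapping listings
march = pd.DataFrame({'id': [1, 2], 'price': [100, 150]})
april = pd.DataFrame({'id': [1, 2, 3], 'price': [110, 150, 200]})

# Same three steps as concat_files: stack, drop exact duplicates, re-index
combined = pd.concat([march, april])
combined = combined.drop_duplicates().reset_index(drop=True)
```

By default `drop_duplicates` compares entire rows, so listing 2 (identical in both months) collapses to one row, while listing 1 (whose price changed) appears twice. If one row per listing were wanted instead, passing a `subset` of key columns would be one option.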

In [11]:
def export_csv(filename, df, destination):
    """ If the desired csv file does not exist in the current
        working directory, convert the dataframe to a csv file
        and move it to the desired folder in the destination directory.
    """
    # full path of the temporary csv in the current working directory
    current_path = os.path.join(os.getcwd(), filename)
    if(not os.path.isfile(current_path)):
        df.to_csv(filename, index=False)
        shutil.move(current_path, os.path.join(destination, filename))
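
A possible simplification, shown here only as a sketch, is to write the csv straight into the destination folder and skip the move. `export_csv_direct` is a hypothetical variant, not the function used in this notebook, and the small dataframe is made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def export_csv_direct(filename, df, destination):
    """Hypothetical variant: write the csv directly into destination."""
    target = os.path.join(destination, filename)
    if not os.path.isfile(target):
        df.to_csv(target, index=False)
    return target


# Try it in a temporary directory instead of the real destination
with tempfile.TemporaryDirectory() as tmp:
    df = pd.DataFrame({'id': [1, 2], 'price': [100, 150]})
    path = export_csv_direct('demo_listings.csv', df, tmp)
    exists = os.path.isfile(path)
    round_trip = pd.read_csv(path)
```

This variant checks for the file in the destination rather than the working directory, which matches the check `consolidate_data` already performs before calling the export.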

After I created the functions, I defined the directory (the path where the raw data is stored) and the destination folder (where I wanted to store the concatenated data on the computer) and ran the functions. The outcome was a single listings file and a single reviews file for Los Angeles.

In [12]:
directory = '/Users/limesncoconuts2/datasets/airbnb-web/'
destination = '/Users/limesncoconuts2/datasets/airbnb/'

In [13]:
city = 'los-angeles'
if(not os.path.isfile(destination + city + '_listings.csv') or not os.path.isfile(destination + city + '_reviews.csv')):
    consolidate_data(city, directory, destination)