## Data Wrangling: Combining Data

This notebook follows the work completed in the Data Wrangling: Web Scraping notebook. Now that the raw files have been created and stored, the data will be consolidated into two categories, listings and reviews, each held in a pandas dataframe.

Because the dataset scraped from the original webpage is very large, I've decided to use a subset of the data for the present analysis: the files from Los Angeles.

In [1]:
# import relevant packages
%matplotlib inline

import pandas as pd
import shutil
import os
import time
from datetime import datetime

import warnings
warnings.filterwarnings("ignore")

### Functions to Consolidate the Data
Because there are several raw files to process, with hundreds of thousands of lines of data between them, it helps to create functions that will do the heavy lifting for us. This heavy lifting includes:
1. Checking if the consolidated csv files we want already exist on the computer: **consolidate_data**.
2. Concatenating data of the same city and category (listings or reviews) together: **combine_listings, combine_reviews, and concat_files**.
3. Saving the concatenated data as a csv file: **export_csv**.

In [2]:
def consolidate_data(city, directory, destination):
    """ Checks if the csv file for either listings
        or reviews data has been created for the designated
        city in the destination folder.
        If the file has not been created, run the combine_listings
        or combine_reviews function for that city, and then create
        the csv file for that city.
    """
    
    filename = city + '_listings.csv'
    # if listings file for this city doesn't already exist, create listings_df and save as csv
    if not os.path.isfile(os.path.join(destination, filename)):
        listings_df = combine_listings(city, directory)
        export_csv(city, filename, listings_df, destination)
    
    filename = city + '_reviews.csv'
    # if reviews file for this city doesn't already exist, create reviews_df and save as csv
    if not os.path.isfile(os.path.join(destination, filename)):
        reviews_df = combine_reviews(city, directory)
        export_csv(city, filename, reviews_df, destination)



In [3]:
#### FUNCTION FOR LISTINGS #### 
def combine_listings(city, directory):
    """ Goes through files in the directory and checks for the
        designated city listings files. Appends the names of the
        listings files of that city to a list, and passes the list
        and the directory name to the concat_files function.
    """
    target_files = []
    
    for file in os.listdir(directory):
        # check whether the file is from the target city and holds listings data
        if city in file and 'listings' in file:
            # add to list of target files
            target_files.append(file)
            
    # concatenate files in list
    return concat_files(target_files, directory) 

In [4]:
#### FUNCTION FOR REVIEWS #### 
def combine_reviews(city, directory):
    """ Goes through files in the directory and checks for the
        designated city reviews files. Appends the names of the
        reviews files of that city to a list, and passes the list
        and the directory name to the concat_files function.
    """
    target_files = []
    
    for file in os.listdir(directory):
        # check whether the file is from the target city and holds reviews data
        if city in file and 'reviews' in file:
            # add to list of target files
            target_files.append(file)
            
    # concatenate files in list
    return concat_files(target_files, directory) 

In [5]:
def concat_files(file_list, directory):
    """Creates a pandas dataframe for each file name in the 
       list of files, then adds the date recorded as a column
       in that dataframe (taken from the file name) and drops
       duplicate rows within that file (ignoring the
       date_recorded column). Appends the dataframe to a list
       of dataframes. After all files in the list have been
       converted to pandas dataframes, concatenates the
       dataframes together and resets the dataframe index.
    """
    # TODO: make the dataframes more memory-efficient,
    # e.g. by changing datatypes
    all_dfs = []
    
    for file in file_list:
        # make into a pandas dataframe
        df = pd.read_csv(directory + file)
        
        # add column of the date
        df['date_recorded'] = file.split('_')[1]
        
        # get rid of duplicate rows within this file, ignoring new date column
        df = df.drop_duplicates(subset=df.columns.difference(['date_recorded']))
        
        # append to a list of dataframes
        all_dfs.append(df)
    
    # concatenate dataframes row-wise (stacked on top of one another)
    concat_all = pd.concat(all_dfs)

    # reset index
    concat_all.reset_index(drop=True, inplace=True)
    return concat_all
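
Note that concat_files drops duplicates one file at a time, where date_recorded is the same on every row, so rows repeated across files are kept. Excluding date_recorded from the comparison only matters once several files are combined. The following is a minimal sketch of that alternative on toy data; the column names `id` and `price` and the dates are made up for illustration and are not taken from the scraped files.

```python
import pandas as pd

# Two toy "snapshots" of the same listings (hypothetical columns and dates)
jan = pd.DataFrame({"id": [1, 2], "price": [100, 150]})
jan["date_recorded"] = "2019-01-01"
feb = pd.DataFrame({"id": [1, 3], "price": [100, 200]})
feb["date_recorded"] = "2019-02-01"

# stack the snapshots on top of one another
combined = pd.concat([jan, feb], ignore_index=True)

# compare rows on every column except date_recorded
compare_cols = combined.columns.difference(["date_recorded"]).tolist()

# keep the first occurrence of each distinct row, then renumber the index
deduped = combined.drop_duplicates(subset=compare_cols).reset_index(drop=True)

print(deduped)
```

Listing 1 appears in both snapshots with identical values, so only its earliest record survives; listings 2 and 3 are each kept once. Whether this is preferable to per-file deduplication depends on whether repeated records across dates are wanted in the analysis.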

In [6]:
def export_csv(city, filename, df, destination):
    """ If the desired csv file does not exist in the current
        working directory, convert the dataframe to a csv file
        and move it to the desired folder in the destination directory.
    """
    # full path the csv will temporarily have in the current working directory
    temp_path = os.path.join(os.getcwd(), filename)
    # export the dataframe to csv if the file doesn't already exist
    if not os.path.isfile(temp_path):
        df.to_csv(temp_path, index=False)
        # move csv to destination directory
        shutil.move(temp_path, os.path.join(destination, filename))

#### The following code uses the above functions on the Los Angeles data:

In [7]:
# identify the directory and destination folder
directory = '/Users/limesncoconuts2/springboard_data/data_capstone_one/web_scraped/'
destination = '/Users/limesncoconuts2/springboard_data/data_capstone_one/los_angeles/'

In [8]:
# run the consolidation for a single city
start_time = time.time() # timestamp
city = 'los-angeles'

# if either file hasn't been created yet, create the missing consolidated csv file(s) for that city
if(not os.path.isfile(destination + city + '_listings.csv') or not os.path.isfile(destination + city + '_reviews.csv')):
    consolidate_data(city, directory, destination)

time_to_run = (time.time() - start_time)/60 # elapsed time, converted from seconds to minutes
print('Time:',time_to_run, 'minutes')

Time: 15.199235884348552 minutes


Wrapping these steps in functions lets us re-use the code if we want to train an algorithm on files for all cities, or on a larger subset of the cities. We would simply need to build a list of the unique cities represented in the files that we scraped from the internet, and then call the consolidate_data function once for each city in that list.
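
As a rough sketch of that idea, the helper below collects the unique city names from a list of file names. It assumes file names shaped like `<city>_<date>_<category>.csv`, so that the city is the text before the first underscore; this is a guess based on how concat_files reads the date from the file name. The helper `unique_cities` and the example file names are hypothetical.

```python
import os

def unique_cities(filenames):
    """Return a sorted list of the city names found in the file names.

    Assumes (hypothetically) names shaped like '<city>_<date>_<category>.csv',
    so the city is the text before the first underscore.
    """
    return sorted({name.split('_')[0] for name in filenames if name.endswith('.csv')})

# made-up file names, for illustration only
example_files = [
    'los-angeles_2019-01-01_listings.csv',
    'los-angeles_2019-01-01_reviews.csv',
    'san-diego_2019-01-01_listings.csv',
    'notes.txt',
]

cities = unique_cities(example_files)
print(cities)

# With the real data, the loop would look something like:
# for city in unique_cities(os.listdir(directory)):
#     consolidate_data(city, directory, destination)
```

Because the destination folder used above is specific to Los Angeles, a multi-city run would probably also need a separate destination folder per city.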