# Yelp Data Merging and Cleaning

Because our search queries overlap, the scraped dataset contains duplicate business listings that must be removed. Fortunately, the Fusion API returns the `id` of each business: a unique, 22-character, case-sensitive alphanumeric string. We can remove duplicate listings with the built-in `pandas` method `drop_duplicates()` and then perform further cleaning from there.
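
As a minimal sketch of the idea, assuming a toy DataFrame whose `id` and `alias` values are invented (they are not real Yelp identifiers), deduplicating on `id` keeps one row per business:

```python
import pandas as pd

# Toy rows standing in for overlapping search results; the ids are invented.
toy = pd.DataFrame({
    'id': ['aaa', 'bbb', 'aaa', 'ccc'],
    'alias': ['cafe-one', 'taco-two', 'cafe-one', 'deli-three'],
})

# Keep the first occurrence of each id and renumber the index.
deduped = toy.drop_duplicates(subset='id', keep='first').reset_index(drop=True)
print(deduped)
```

The `subset` argument limits the comparison to the listed columns, so two rows with the same `id` count as duplicates even if other fields differ.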

In [1]:
import json, os
import pandas as pd

## Import `.csv` files and construct DataFrame

The following code handles the case where the scraped businesses are spread across multiple `.csv` files.

In [2]:
file_paths = []

for file in os.listdir('../data'):
    if 'businesses' in file:
        file_paths.append('../data/'+file)

In [3]:
master_df = {
    'id': [],
    'latitude': [],
    'longitude': [],
    'price': [],
    'review_count': [],
    'rating': [],
    'zip_code': [],
    'city': [],
    'alias': [],
    'category': [],
}

master_df = pd.DataFrame(master_df)

In [4]:
for path in file_paths:
    master_df = pd.concat([master_df, pd.read_csv(path)])

master_df.head()

| | alias | category | city | id | latitude | longitude | price | rating | review_count | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chichen-itza-restaurant-los-angeles-3 | ['mexican', 'sandwiches', 'soup'] | Los Angeles | vC_6J_nGyf4J8xt-Vu6Shw | 34.01744 | -118.2783 | $$ | 4.5 | 1190.0 | 90007.0 |
| 1 | | ['childrensmuseums'] | | | | | | | | |
| 2 | | ['museums'] | | | | | | | | |
| 3 | figueroa-philly-cheese-steak-los-angeles-2 | ['cheesesteaks', 'sandwiches', 'breakfast_brun... | Los Angeles | vfHJzF0ShYtwmotXE-0PiA | 34.014196 | -118.282417 | $$ | 4.5 | 1076.0 | 90037.0 |
| 4 | dirt-dog-los-angeles-4 | ['hotdog', 'beerbar'] | Los Angeles | 0z23Jk7U_MpvtqKINPL2fA | 34.028292 | -118.275208 | $ | 4.5 | 1900.0 | 90007.0 |


In [5]:
master_df.dtypes

alias            object
category         object
city             object
id               object
latitude        float64
longitude       float64
price            object
rating          float64
review_count    float64
zip_code         object
dtype: object
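
The loading loop in this section can also be written as a single concatenation. The sketch below is one possible variant, not the notebook's own method: the helper name `load_business_csvs`, the `*businesses*.csv` glob pattern, and the demo file names are all assumptions made for illustration. Note that `pd.concat` raises an error on an empty list, so the directory must contain at least one matching file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_business_csvs(data_dir):
    """Read every CSV whose name contains 'businesses' into one DataFrame."""
    paths = sorted(Path(data_dir).glob('*businesses*.csv'))
    frames = [pd.read_csv(path) for path in paths]
    # sort=False leaves the columns in order of appearance instead of sorting them.
    return pd.concat(frames, ignore_index=True, sort=False)


# Demo with two throwaway files; the file names and rows are invented.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'id': ['aaa'], 'price': ['$']}).to_csv(
        Path(tmp) / 'businesses_1.csv', index=False)
    pd.DataFrame({'id': ['bbb'], 'price': ['$$']}).to_csv(
        Path(tmp) / 'businesses_2.csv', index=False)
    combined = load_business_csvs(tmp)

print(combined)
```

Collecting the frames in a list and concatenating once avoids copying the growing DataFrame on every iteration.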

## Drop duplicate businesses via `id`

In [6]:
print(master_df.shape)

# Sort for a deterministic order, then keep one row per (id, alias) pair
master_df = (
    master_df
    .sort_values(['id', 'alias'], ascending=False)
    .drop_duplicates(subset=['id', 'alias'], keep='first')
)

print(master_df.shape)

(102673, 10)
(29011, 10)


In [7]:
# Resetting master_df index

# master_df.reset_index(inplace=True, drop=True)

## Drop businesses that do not have geocoordinates or ZIP codes

By nature of their business, food trucks, caterers, and other "mobile" services are not tied to any single location. Therefore, their price rating (`$` to `$$$$`) should not contribute to our model. We identify these businesses by a missing latitude or ZIP code and drop them.

In [8]:
food_trucks = master_df.loc[(master_df['latitude'].isna()) | (master_df['zip_code'].isna())].index.tolist()

master_df.drop(index = food_trucks, inplace = True)
master_df.shape

(28812, 10)
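
The same filter can be expressed with `dropna`. This is a minimal sketch on an invented three-row DataFrame, offered as an equivalent alternative rather than the notebook's own code:

```python
import numpy as np
import pandas as pd

# Invented rows: one fixed location, one missing latitude, one missing ZIP code.
toy = pd.DataFrame({
    'alias': ['fixed-cafe', 'roaming-truck', 'no-zip-caterer'],
    'latitude': [34.05, np.nan, 34.10],
    'zip_code': ['90012', '90013', np.nan],
})

# Drop a row if either listed column is missing (the default how='any').
located = toy.dropna(subset=['latitude', 'zip_code']).reset_index(drop=True)
print(located)
```

Only the row with both a latitude and a ZIP code survives, which mirrors the `isna() | isna()` mask used in the cell above.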

## Drop businesses that are outside of LA County

The Fusion API is not perfect. Some of the scraped businesses are:
- outside the range of ZIP codes for LA County
- far outside the boundaries of LA County

In [9]:
master_df[['latitude', 'longitude']].describe()

| | latitude | longitude |
|---|---|---|
| count | 28812.0 | 28812.0 |
| mean | 34.075526 | -117.46658 |
| std | 0.457161 | 9.932811 |
| min | 26.07231 | -149.429066 |
| 25% | 33.952783 | -118.384165 |
| 50% | 34.05244 | -118.263 |
| 75% | 34.13328 | -118.120925 |
| max | 61.581385 | 13.37306 |


Using the LA County boundary `.json` file, we can compute the county's bounding box (its minimum and maximum latitude and longitude) and use it as a rough filter to remove offending businesses.

In [10]:
# Use a context manager so the file handle is closed after reading
with open('../Assets/la_county_coordinates.json') as f:
    COUNTY_BOUNDS = json.load(f)
COUNTY_BOUNDS = COUNTY_BOUNDS['geometries'][0]['coordinates'][0][0]

LATMIN = min([ele[1] for ele in COUNTY_BOUNDS])
LATMAX = max([ele[1] for ele in COUNTY_BOUNDS])

LONMIN = min([ele[0] for ele in COUNTY_BOUNDS])
LONMAX = max([ele[0] for ele in COUNTY_BOUNDS])

In [11]:
master_df = master_df.drop(index=master_df.loc[((master_df['latitude'] < LATMIN) |
                                                (master_df['latitude'] > LATMAX) |
                                                (master_df['longitude'] < LONMIN) |
                                                (master_df['longitude'] > LONMAX))].index
                          ).reset_index(drop=True)
master_df.shape

(27446, 10)
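
To make the bounding-box step concrete, here is a small sketch under stated assumptions: the `ring` coordinates and the three sample rows are invented, and the ring is assumed to hold `[longitude, latitude]` pairs as in the cells above. A bounding box is only a rough filter, since it also admits points that lie inside the rectangle but outside the actual county polygon.

```python
import pandas as pd

# Invented ring of [longitude, latitude] pairs; not the real county boundary.
ring = [[-118.9, 33.7], [-117.6, 33.7], [-117.6, 34.8], [-118.9, 34.8]]

lon_min = min(point[0] for point in ring)
lon_max = max(point[0] for point in ring)
lat_min = min(point[1] for point in ring)
lat_max = max(point[1] for point in ring)

# Invented sample rows: one inside the box and two far outside it.
toy = pd.DataFrame({
    'alias': ['inside-box', 'far-north', 'far-east'],
    'latitude': [34.05, 61.58, 52.52],
    'longitude': [-118.25, -149.43, 13.37],
})

# between() is inclusive on both ends by default.
inside = toy[toy['latitude'].between(lat_min, lat_max)
             & toy['longitude'].between(lon_min, lon_max)].reset_index(drop=True)
print(inside)
```

Selecting the rows to keep, rather than dropping the rows that fail, avoids building an intermediate index of offenders.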

In [12]:
master_df['zip_code'].astype(int).describe()

count    27446.000000
mean     90795.312905
std       1279.971004
min      10314.000000
25%      90069.000000
50%      90703.000000
75%      91356.000000
max      98188.000000
Name: zip_code, dtype: float64

In [13]:
# ZIP code cutoffs below are based on the LA County master ZIP code list:
# http://file.lacounty.gov/SDSInter/lac/1031552_MasterZipCodes.pdf

outer_zips = master_df.loc[(master_df['zip_code'].astype(int) > 93591) 
                           | (master_df['zip_code'].astype(int) <= 90000)].index.tolist()

In [14]:
master_df.drop(index=outer_zips, inplace=True)
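# Store ZIP codes as strings; astype() returns a new Series, so assign it back
master_df['zip_code'] = master_df['zip_code'].astype(int).astype(str)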
master_df.shape

(27427, 10)
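
The ZIP code filter and string conversion can be sketched together on a toy column. The values below are invented, and the mix of floats and strings is an assumption about what the column may contain after concatenation:

```python
import pandas as pd

# Invented values mixing floats and strings in one object column.
toy = pd.DataFrame({'zip_code': [90007.0, '91356', 10314.0, 98188.0]})

# Coerce to integers first so 90007.0 and '90007' are treated the same way.
zips = pd.to_numeric(toy['zip_code']).astype(int)

# Keep ZIP codes in 90001..93591, matching the cutoffs used in the cell above.
kept = toy[zips.between(90001, 93591)].copy()

# Write the cleaned values back as strings; assignment aligns on the index.
kept['zip_code'] = zips[kept.index].astype(str)
print(kept)
```

Converting through `int` before `str` avoids string values such as `'90007.0'`, which a direct `astype(str)` on floats would produce.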

## Converting `$` price levels to numbers

In [15]:
master_df['price'] = master_df['price'].replace({'$$$$':4, '$$$':3, '$$':2,'$':1})
master_df['price'].value_counts()

1    15245
2    11344
3      679
4      159
Name: price, dtype: int64
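
One design point worth illustrating: `replace` leaves any value that is not in the dictionary unchanged, whereas `map` turns it into `NaN`. The sketch below uses an invented Series and a hypothetical `PRICE_LEVELS` constant to show the `map` behavior, which can make unexpected price strings easier to spot:

```python
import pandas as pd

# Hypothetical constant holding the same mapping as the cell above.
PRICE_LEVELS = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}

# Invented values, including one missing price.
toy = pd.Series(['$$', '$', '$$$$', None], name='price')

# map() yields NaN for anything absent from the dictionary, including None.
levels = toy.map(PRICE_LEVELS)
print(levels)
```

Because of the `NaN`, the result here is a float column; a column with no missing prices can be cast back with `astype(int)`.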

In [16]:
master_df.isna().sum()

alias           0
category        0
city            0
id              0
latitude        0
longitude       0
price           0
rating          0
review_count    0
zip_code        0
dtype: int64

## Converting `category` into an actual list from a string

In [17]:
# print(master_df['category'][0])
# print(master_df['category'][0][2])
# master_df['category'] = [eval(value) for value in master_df['category'].values]
# print(master_df['category'][0])
# print(master_df['category'][0][2])
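
The commented-out cell above relies on `eval`. A possible alternative, sketched here on a single string copied from the preview earlier in this notebook, is `ast.literal_eval`, which parses Python literals only and therefore cannot execute arbitrary code:

```python
import ast

# String form of a category list, as it appears after reading the CSV.
raw = "['mexican', 'sandwiches', 'soup']"

# Before parsing, indexing returns a single character of the string.
third_char = raw[2]

# literal_eval accepts only Python literals such as lists, strings, and numbers.
categories = ast.literal_eval(raw)
print(third_char, categories[2])
```

Applied to the whole column, the same call could be passed to `Series.apply`, assuming every value is a well-formed list literal.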

## Save the cleaned data to a `.csv`

In [18]:
master_df.to_csv('../data/master.csv', index=False)

# Move on to 03 - Yelp Adding IRS Income