# Capstone Project 1: Data Wrangling 2

Data source: https://www.yelp.com/dataset. According to Yelp, the dataset includes over 6 million reviews of 192,609 businesses.

Data wrangling activities:
- Read json file.
- Format json strings.
- Create dictionary from json string.
- Create list of dictionaries.
- Create pandas dataframe from list of dictionaries.
- Filter for reviews of restaurants in each standing category.
- Parse date column.
- Sort and reset index for all resulting dataframes before using for analysis.
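
The activities above can be sketched end to end on a tiny sample. This is a minimal illustration, not the notebook's actual code: the two review lines are made up, the names `raw_lines`, `records`, and `sample_df` are hypothetical, and `json.loads` stands in for the `ast.literal_eval` call used in the cells below.

```python
import json
import pandas as pd

# Two made-up review lines in the same line-delimited JSON layout as review.json
raw_lines = [
    '{"review_id":"r1","business_id":"b1","stars":2.0,"text":"Meh","date":"2018-06-15 18:30:00"}\n',
    '{"review_id":"r2","business_id":"b2","stars":5.0,"text":"Great","date":"2019-01-02 10:00:00"}\n',
]

# Format the strings, build one dictionary per line, and collect them in a list
records = [json.loads(line.rstrip()) for line in raw_lines]

# Create the dataframe, keeping only the columns needed for analysis
sample_df = pd.DataFrame(records, columns=["business_id", "stars", "date"])

# Parse the date column, then sort by stars and reset the index
sample_df["date"] = pd.to_datetime(sample_df["date"], format="%Y-%m-%d %H:%M:%S")
sample_df = sample_df.sort_values("stars", ascending=False).reset_index(drop=True)
```

Passing `columns=` to `pd.DataFrame` drops the unwanted keys in one step, which is why the sketch does not delete them from each dictionary first.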

Data wrangling result:
- 1 dataframe containing reviews of restaurants in good standing (4 or more stars).
- 1 dataframe containing reviews of restaurants in moderate standing (at least 2 but fewer than 4 stars).
- 1 dataframe containing reviews of restaurants in poor standing (less than 2 stars).
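
The three standing categories can be expressed as star-rating bins. The sketch below is an assumption-laden illustration: this notebook does not show how `goodr`, `modr`, and `poorr` were built, so the boundary handling (exactly 2 stars counts as moderate, exactly 4 as good) is inferred from the descriptions above, and the names `restaurant_stars` and `standing` are hypothetical.

```python
import pandas as pd

# Hypothetical average star ratings for five restaurants
restaurant_stars = pd.Series([1.5, 2.0, 3.5, 4.0, 5.0])

# Left-closed bins: [0, 2) poor, [2, 4) moderate, [4, inf) good
standing = pd.cut(
    restaurant_stars,
    bins=[0, 2, 4, float("inf")],
    labels=["poor", "moderate", "good"],
    right=False,
)
```

With `right=False` each bin includes its lower edge and excludes its upper edge, so the three categories do not overlap.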

In [1]:
import pandas as pd
import json
import ast
import datetime as dt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Restore the restaurant dataframes (one per standing) saved earlier with %store
%store -r goodr
%store -r modr
%store -r poorr

In [3]:
# Read file into list
with open('dataset/review.json','r',encoding='utf8') as f:
    reviews = f.readlines()

# Remove the trailing "\n" from each line
reviews = list(map(lambda x: x.rstrip(), reviews))

reviews[0]

'{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}'

In [4]:
rv = []
for js in reviews:
    # Exclude the line that has a syntax error
    if 'HzeABNLq_UlhrpZXCsWAnA' not in js:
        # Evaluate the json string literally as dictionary
        d = ast.literal_eval(js)
        # Delete unwanted keys and values
        del d['text']
        del d['review_id']
        del d['user_id']
        del d['useful']
        del d['funny']
        del d['cool']
        # Add dictionary to list
        rv.append(d)
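
An alternative way to handle lines that fail to parse, sketched here under assumptions: `ast.literal_eval` accepts Python literals rather than JSON, so JSON tokens such as `null`, `true`, or `false` raise an error, whereas `json.loads` accepts them. Whether that is what affects the excluded line is not known from this notebook. The sample lines and the names `sample_lines`, `wanted_keys`, `parsed`, and `skipped` are hypothetical.

```python
import json

# Hypothetical sample: a valid line, a truncated line, and a line with a JSON null
sample_lines = [
    '{"business_id":"b1","stars":4.0,"date":"2019-01-02 10:00:00"}',
    '{"business_id":"b2","stars":3.0,"date":"2018-06-15 18:30:00"',
    '{"business_id":"b3","stars":null,"date":"2017-03-09 08:15:00"}',
]

wanted_keys = ("business_id", "stars", "date")
parsed, skipped = [], []
for line_number, line in enumerate(sample_lines):
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Remember which lines were malformed instead of hard-coding an ID
        skipped.append(line_number)
        continue
    # Keep only the keys needed for the dataframe
    parsed.append({key: record[key] for key in wanted_keys})
```

Recording the skipped line numbers keeps the exclusion visible and makes it easy to inspect the offending lines afterwards.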


In [5]:
# Create dataframe using list of dictionaries
df = pd.DataFrame(rv,columns=['business_id','stars','date'])

In [6]:
# Create boolean arrays to filter
grr = df['business_id'].isin(goodr['business_id'])
mrr = df['business_id'].isin(modr['business_id'])
prr = df['business_id'].isin(poorr['business_id'])

# Filter for restaurant reviews using boolean arrays by restaurant standing and assign to dataframes
goodrv = df[grr]
modrv = df[mrr]
poorrv = df[prr]
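
How the boolean filter works can be seen on a toy example. This is an illustrative sketch with made-up data; `reviews_df`, `good_businesses`, `mask`, and `good_reviews` are hypothetical names that mirror `df`, `goodr`, `grr`, and `goodrv`.

```python
import pandas as pd

# Made-up reviews: business b1 has two reviews
reviews_df = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b1"],
    "stars": [5.0, 2.0, 4.0, 3.0],
})

# Made-up list of restaurants in good standing
good_businesses = pd.DataFrame({"business_id": ["b1"]})

# True for each review whose business_id appears in the good-standing list
mask = reviews_df["business_id"].isin(good_businesses["business_id"])

# Keep only the matching rows; .copy() gives an independent dataframe
good_reviews = reviews_df[mask].copy()
```

Calling `.copy()` is optional here, but it avoids pandas warnings about modifying a slice if columns are later assigned on the filtered dataframe directly.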

In [7]:
# Sort by stars in descending order and reset index
goodrv = goodrv.sort_values('stars',ascending=False).reset_index(drop=True)
modrv = modrv.sort_values('stars',ascending=False).reset_index(drop=True)
poorrv = poorrv.sort_values('stars',ascending=False).reset_index(drop=True)

In [8]:
# Convert date strings to datetime values
goodrv['date'] = pd.to_datetime(goodrv['date'].str.strip(), format='%Y-%m-%d %H:%M:%S')
modrv['date'] = pd.to_datetime(modrv['date'].str.strip(), format='%Y-%m-%d %H:%M:%S')
poorrv['date'] = pd.to_datetime(poorrv['date'].str.strip(), format='%Y-%m-%d %H:%M:%S')


In [9]:
# Store the dataframes so they can be used in the data storytelling notebook and other Jupyter notebooks
%store goodrv
%store modrv
%store poorrv


Stored 'goodrv' (DataFrame)
Stored 'modrv' (DataFrame)
Stored 'poorrv' (DataFrame)
