### Data Importing and Cleaning

#### Problem Statement

Yelp is a platform where users can find businesses, make reservations, and leave reviews. Users can also mark the reviews they find useful. Reviews can both help and harm a business, so being able to determine what makes a review useful can help businesses create better, more targeted listings.



In [132]:
# imports
import pandas as pd
import sqlite3

In [139]:
# read in yelp json lines 
# (json was randomly sampled from source data* using perl)
yelp = pd.read_json('../data/yelp.json',lines=True)

*Yelp dataset sourced from [yelp.com/dataset](https://www.yelp.com/dataset).
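The sampling step itself is not shown in this notebook (it was done with perl). As a rough Python stand-in, and not the method actually used, reservoir sampling can draw a fixed number of lines uniformly at random from a JSON lines file without loading it all into memory. The function name `sample_json_lines`, the seed, and the toy input below are assumptions made for illustration.

```python
import io
import json
import random


def sample_json_lines(lines, k, seed=0):
    """Return k lines drawn uniformly at random from an iterable of lines."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Toy stand-in for the full review file: one JSON object per line.
source = io.StringIO(
    "".join(json.dumps({"review_id": n}) + "\n" for n in range(100))
)
sample = sample_json_lines(source, k=10)
```

The sampled lines could then be written to a file and loaded with `pd.read_json(..., lines=True)` as in the cell above.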

#### Review Data

In [118]:
# preview the first five rows
yelp.head()

| |review_id|user_id|business_id|stars|useful|funny|cool|text|date|
|---|---|---|---|---|---|---|---|---|---|
|0|-P5E9BYUaK7s3PwBF5oAyg|Jha0USGDMefGFRLik_xFQg|bMratNjTG5ZFEA6hVyr-xQ|5|0|0|0|First time there and it was excellent!!! It fe...|2017-02-19 13:32:05|
|1|dQ3EU6cevDqHAr_ygy1O8A|CNyXcn0c0V5CFmigqqw-Xg|oY5LFo6Yxxf32ePna6mEUQ|5|1|0|0|I absolutely love this place!\n\nGreat hours, ...|2014-12-30 17:55:51|
|2|Pgh9POx-bH7JFggKXqXWMQ|8fL5qUckzt_nAC1uwvbr0w|uW8L6awmCyjovD9OhWPo7g|5|1|0|1|As far as I know, this is the best video renta...|2008-04-30 15:49:16|
|3|Sgs-rER85vBaOBSPVo96xw|EIi4Fy_JW_6v7DaRDet1uw|Q1HHAb4FzrzfnnrRyA8fgg|4|0|0|0|Great atmosphere and service! I don't know how...|2015-07-28 14:26:48|
|4|yqJv_8CoXNb-NpaEiTY4yw|ZiI40HVbRbFE-tv2K8OQkw|45siW2fI0Cuv5ZKCS23knA|5|0|0|0|Great new location on Central. Great staff and...|2014-04-19 13:06:28|


In [97]:
# get shape of data
yelp.shape

(21032, 9)

In [98]:
# check for nulls
yelp.isnull().sum()

review_id      0
user_id        0
business_id    0
stars          0
useful         0
funny          0
cool           0
text           0
date           0
dtype: int64

In [99]:
# check datatypes
yelp.dtypes

review_id              object
user_id                object
business_id            object
stars                   int64
useful                  int64
funny                   int64
cool                    int64
text                   object
date           datetime64[ns]
dtype: object

In [100]:
# count unique values per column (review_id should match the row count)
yelp.nunique()

review_id      21032
user_id        19931
business_id    15494
stars              5
useful            43
funny             27
cool              38
text           21032
date           21032
dtype: int64
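In the output above, `review_id` has 21032 unique values, which equals the row count, so every review id is unique. `user_id` and `business_id` have fewer unique values than rows, so some users and businesses appear more than once. A more direct check is sketched below on a toy frame; the `reviews` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the review data.
reviews = pd.DataFrame(
    {"review_id": ["a", "b", "c"], "user_id": ["u1", "u1", "u2"]}
)

# True only when no value in the column repeats.
ids_unique = reviews["review_id"].is_unique
users_unique = reviews["user_id"].is_unique

# Number of rows whose review_id already appeared earlier in the frame.
n_duplicate_ids = int(reviews["review_id"].duplicated().sum())
```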

#### Data Preparation

In [119]:
# replace newlines with spaces & drop everything except letters, digits, and spaces
yelp.text = yelp.text.str.replace('\n', ' ')
yelp.text = yelp.text.replace('[^a-zA-Z0-9 ]', '', regex=True)
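To make the effect of these two steps concrete, here is a small sketch that applies the same replacements to a made-up review string. Note that apostrophes are removed too, and that consecutive newlines leave consecutive spaces behind.

```python
import pandas as pd

# Made-up review text with newlines and punctuation.
sample = pd.Series(["I absolutely love this place!\n\nGreat hours, don't miss it."])

# Step 1: each newline becomes a single space.
cleaned = sample.str.replace("\n", " ", regex=False)

# Step 2: remove every character that is not a letter, digit, or space.
cleaned = cleaned.replace("[^a-zA-Z0-9 ]", "", regex=True)

print(cleaned.iloc[0])
# I absolutely love this place  Great hours dont miss it
```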

In [120]:
# create target: 1 if the review has at least one "useful" vote, else 0
yelp['target'] = (yelp.useful >= 1).astype(int)
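Once the target exists, it can be worth checking how balanced the two classes are before modeling. The sketch below uses made-up vote counts, not the actual Yelp values.

```python
import pandas as pd

# Made-up "useful" vote counts.
useful = pd.Series([0, 1, 0, 3, 0])

# Same rule as above: at least one vote means useful.
target = (useful >= 1).astype(int)

# Share of each class, indexed by the class label.
balance = target.value_counts(normalize=True)
```

Here `balance.loc[0]` is the share of reviews with no useful votes and `balance.loc[1]` the share with at least one.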

In [126]:
# calculate number of words in text
# (split() with no argument collapses repeated spaces left by the cleaning step)
yelp['num_words'] = yelp.text.apply(lambda x: len(x.split()))
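Word counts are sensitive to how the text is split. Because the cleaning step can leave consecutive spaces, splitting on a single space counts empty strings as words, while splitting on any run of whitespace does not. A small comparison on made-up strings:

```python
import pandas as pd

# Made-up cleaned texts: one with a double space, one empty.
text = pd.Series(["Great place  Great hours", ""])

# Splitting on a single space keeps empty strings between repeated spaces.
naive = text.apply(lambda x: len(x.split(" ")))

# Splitting on whitespace runs counts only real tokens.
collapsed = text.str.split().str.len()

print(naive.tolist())      # [5, 1]
print(collapsed.tolist())  # [4, 0]
```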

In [128]:
# calculate number of chars in text
yelp['num_chars'] = yelp.text.str.len()

In [129]:
# check for null texts (this does not catch empty strings)
yelp.text.isnull().sum()

0
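`isnull()` only counts missing values. A review that consisted entirely of punctuation would become an empty string after cleaning and would not be counted. A possible extra check, sketched on made-up data:

```python
import pandas as pd

# Made-up texts: one normal, one empty, one whitespace-only, one missing.
text = pd.Series(["Great food", "", "   ", None])

# Missing values only.
n_null = int(text.isnull().sum())

# Empty or whitespace-only strings (missing values compare as False).
n_blank = int((text.str.strip() == "").sum())
```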

#### Export Data

In [130]:
# export clean data to csv
yelp.to_csv('../data/yelp.csv',index=False)
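One thing to keep in mind when reading the CSV back: CSV stores everything as text, so the `date` column does not come back as a datetime unless it is parsed again. A minimal sketch using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io

import pandas as pd

# Made-up frame with a datetime column.
reviews = pd.DataFrame(
    {
        "review_id": ["a", "b"],
        "date": pd.to_datetime(["2017-02-19 13:32:05", "2014-12-30 17:55:51"]),
    }
)

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
reviews.to_csv(buffer, index=False)

# Without parse_dates the column is read back as plain strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the column is converted back to datetimes.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])
```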

In [134]:
# export data to db
db_connection = sqlite3.connect('../data/yelp.db.sqlite')
yelp.to_sql(
    'reviews',
    con=db_connection,
    if_exists='replace',
    index=False
)

In [138]:
# check numbers of rows in db
pd.read_sql('select count(*) as count from reviews;',
           con=db_connection)

| |count|
|---|---|
|0|21032|
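The count of 21032 matches the shape of the DataFrame, so every row was written. The same round-trip check can be sketched so that the connection is always closed afterwards. The example below uses an in-memory database and a made-up frame rather than `../data/yelp.db.sqlite`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Made-up stand-in for the cleaned review data.
reviews = pd.DataFrame({"review_id": ["a", "b", "c"], "useful": [0, 1, 4]})

# closing() guarantees conn.close() runs when the block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    reviews.to_sql("reviews", con=conn, if_exists="replace", index=False)
    db_rows = pd.read_sql(
        "select count(*) as count from reviews;", con=conn
    )["count"].iloc[0]

# The table should hold exactly as many rows as the frame.
rows_match = db_rows == len(reviews)
```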


### Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|review_id|object|Yelp|Unique review id key|
|user_id|object|Yelp|User id key of the reviewer who left the review|
|business_id|object|Yelp|Business id key of the business being reviewed|
|stars|int64|Yelp|Yelp review star rating (1-5 stars)|
|useful|int64|Yelp|Number of "useful" votes received|
|funny|int64|Yelp|Number of "funny" votes received|
|cool|int64|Yelp|Number of "cool" votes received|
|text|object|Yelp|The review text|
|date|datetime64[ns]|Yelp|Date and time the review was posted|
|target|int64|Calculated|Binary flag: 1 = useful (at least one "useful" vote), 0 = not useful|
|num_words|int64|Calculated|Number of words in the review text|
|num_chars|int64|Calculated|Number of characters in the review text|