# EDA for Reviews

In this notebook, I explore the Yelp reviews dataset, obtain summary statistics about the reviews, and generate visualizations for the variables of interest.

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

engine = create_engine('postgresql://postgres:moop@18.236.133.116:5432/capstone')

The full reviews dataset has approximately 5.2 million reviews, about 2.5 million of which have received 'useful' votes. For this analysis, any review with at least 1 useful vote is considered useful.
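
The usefulness rule can be written as a binary label. The sketch below uses a small made-up frame rather than the real table, and the column name `is_useful` is an assumption introduced only for illustration:

```python
import pandas as pd

# Toy data standing in for the real reviews table (values are made up).
toy = pd.DataFrame({'useful': [0, 1, 7, 0, 3]})

# A review counts as useful when it has at least one useful vote.
# 'is_useful' is a hypothetical column name, not one from the dataset.
toy['is_useful'] = (toy['useful'] >= 1).astype(int)

print(toy['is_useful'].tolist())
```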

I chose to randomly sample 500,000 'useful' reviews and 500,000 'not useful' reviews for my initial exploration, as follows:

In [2]:
query = """
SELECT stars, text, useful, funny, cool
FROM reviews
WHERE useful > 0
"""

df = pd.read_sql_query(query, engine)
df.shape

(2517186, 5)
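
The shape shows that this query returned every review with at least one useful vote. If the sample is drawn in pandas rather than in SQL, one possible approach is `DataFrame.sample`. This is only a sketch on made-up data; the frame `toy_useful`, the sample size, and the seed are assumptions:

```python
import pandas as pd

# Toy frame standing in for the useful reviews (values are made up).
toy_useful = pd.DataFrame({'stars': [1, 2, 3, 4, 5] * 4,
                           'useful': range(1, 21)})

# Hypothetical sample size; the notebook text mentions 500,000.
n_sample = 5

# random_state fixes the seed so the same rows come back on every run.
sampled = toy_useful.sample(n=n_sample, random_state=42)
print(sampled.shape)
```

Sampling without replacement is the default, so no row appears twice.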

In [None]:
query = """
SELECT *
FROM reviews
WHERE useful = 0
ORDER BY random()
LIMIT 500000
"""

df_notuseful = pd.read_sql_query(query, engine)
df_notuseful.shape

In [6]:
del df

## Reviews with 'useful' votes

Create review length in words:

In [4]:
df['review_length'] = df['text'].map(lambda x: len(x.split(' ')))

In [5]:
df.head()

| | review_id | user_id | business_id | stars | date | text | useful | funny | cool | review_length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NhRRo6rIaShhtxgiJIpzkg | IyeakrwiGO4epQSoH7X_8g | t55OEa8kSgMBnGH-KzNHkw | 1.0 | 2017-01-03 | Normal arrogant foreigner who thinks he's in p... | 7 | 1 | 1 | 27 |
| 1 | kiOvZERx_t6K9eJJbYN-Kw | IyeakrwiGO4epQSoH7X_8g | MiUrif_vYOuITF4ALf7WHA | 1.0 | 2016-10-10 | OMG! The most basic, microwave oven food you c... | 3 | 0 | 0 | 76 |
| 2 | 53GierpsbVvI7-ll691qdQ | IyeakrwiGO4epQSoH7X_8g | OVczZ1qHXc3bjDprNvCKBQ | 1.0 | 2017-03-09 | 1 star because of coupon deal. I dropped my ph... | 2 | 1 | 0 | 106 |
| 3 | Ep6UiGn8g_jgiGuFg0n3Bg | IyeakrwiGO4epQSoH7X_8g | BrclERrbZrQNRyBmDGawQQ | 5.0 | 2016-08-26 | Stopped here for my prefight pizza and discove... | 1 | 1 | 4 | 46 |
| 4 | CKoXVjEwCu_8FECnuZ1V-Q | IyeakrwiGO4epQSoH7X_8g | IOEGLxXwCNiq4P-U359D7Q | 5.0 | 2016-08-17 | Made a mistake and reviewed the wrong pizza sh... | 1 | 1 | 1 | 49 |


In [None]:
sns.boxplot(x=df['review_length'])

It appears that most useful reviews are under 500 words. One useful review runs to about 1,700 words; let's take a closer look:

In [None]:
df[df['review_length'] == 1702]['text']

In [None]:
df.loc[385682, 'text']

This review contains a lot of empty space because it draws ASCII art, which inflates its word count. We can consider dropping this observation when analyzing the data.
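
Splitting on a single space, as the `review_length` cell does, produces an empty string for every extra space, so whitespace-heavy text gets a larger count. The short sketch below, on a made-up string, compares that with `str.split()` called with no argument, which collapses runs of whitespace:

```python
# A short made-up string with repeated spaces, standing in for ASCII art.
text = "great   food    here"

# split(' ') keeps an empty string for every extra space.
naive_count = len(text.split(' '))

# split() with no argument collapses any run of whitespace.
collapsed_count = len(text.split())

print(naive_count, collapsed_count)
```

Switching to `split()` would change the lengths reported in this notebook, so it is shown here only as an alternative to dropping such rows.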

In [None]:
df.drop(385682, inplace=True)

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(12, 8), sharey=True)
fig.suptitle('Boxplot of review length by score')
for i, ax in zip([1, 2, 3, 4, 5], axes):
    # Draw the review lengths for i-star reviews on their own axis
    sns.boxplot(y=df.loc[df['stars'] == i, 'review_length'], ax=ax)
    ax.set_xlabel(str(i) + ' stars')

Overall, it looks like reviews at every star level are mostly 250 words or fewer. It appears that 5-star reviews may be a little shorter in general (75th percentile at 350 words, compared to around 400 for the other star levels).
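
Percentiles read off a boxplot are approximate. A minimal sketch of computing the 75th percentile per star level directly is shown below. It uses made-up numbers, not the notebook's data, and assumes the same `stars` and `review_length` column names:

```python
import pandas as pd

# Toy data (made up); the real frame has one row per review.
toy = pd.DataFrame({
    'stars': [1, 1, 1, 5, 5, 5],
    'review_length': [100, 200, 300, 50, 100, 150],
})

# 75th percentile of review length within each star level.
q75 = toy.groupby('stars')['review_length'].quantile(0.75)
print(q75)
```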

In [None]:
plt.scatter('funny', 'useful', data=df)

In [None]:
plt.scatter('cool', 'useful', data=df)

Two of the reviews seem to have a large number of useful votes.
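
To inspect such outliers, one option is `DataFrame.nlargest`, which returns the rows with the highest values in a column. The sketch below runs on a made-up frame with the same vote columns:

```python
import pandas as pd

# Toy vote counts (made up), using the notebook's column names.
toy = pd.DataFrame({
    'useful': [3, 120, 1, 95, 2],
    'funny': [0, 40, 1, 10, 0],
    'cool': [1, 35, 0, 12, 0],
})

# The two rows with the most useful votes, highest first.
top_two = toy.nlargest(2, 'useful')
print(top_two)
```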

## Reviews without 'useful' votes

In [None]:
df_notuseful.head()

In [None]:
df_notuseful['review_length'] = df_notuseful['text'].map(lambda x: len(x.split(' ')))

In [None]:
fig = plt.subplots(1, 5, figsize=(12, 8), sharey=True)
plt.title('Boxplot of review length by score')
for i in [1, 2, 3, 4, 5]:
    plt.subplot(1, 5, i) 
    sns.boxplot(df_notuseful[df_notuseful['stars'] == i], df_notuseful[df_notuseful['stars'] == i]['review_length'], orient='v')
    plt.xlabel(str(i)+' stars')

Again, we see that 5-star reviews seem to be shorter than reviews at the other star levels. Perhaps those who are satisfied don't have much to say?

In [None]:
df_notuseful.loc[421129, 'text']

Why does this review have 111 funny votes?

In [None]:
df_notuseful.cool.describe()

## Combining Datasets

In [87]:
df = pd.concat([df, df_notuseful])
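
`DataFrame.append` was removed in pandas 2.0, where `pd.concat` does the same job. The sketch below uses two made-up frames to show how `pd.concat` behaves when the frames do not share every column, which may apply here if the two queries return different columns:

```python
import pandas as pd

# Two toy frames (made up); the second has an extra column.
a = pd.DataFrame({'stars': [5, 1], 'useful': [2, 1]})
b = pd.DataFrame({'stars': [4], 'useful': [0], 'review_id': ['abc']})

# ignore_index=True renumbers the rows from 0 so labels do not repeat.
combined = pd.concat([a, b], ignore_index=True)

# Rows that came from `a` have no review_id, so pandas fills in NaN.
print(combined)
```

Without `ignore_index=True` the original row labels are kept, so label-based lookups such as `df.loc[385682]` could match more than one row after combining.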

In [1]:
df.memory_usage()

NameError: name 'df' is not defined

In [89]:
df.to_csv('../data/combined_1mreviews.csv')

KeyboardInterrupt: 