# Import modules

In [1]:
import pandas as pd

# Before you begin...

Make a `dataset` directory at the root of this project and place your json files in there.

# Convert datasets to CSV

I tried countless methods to read the data into a dataframe directly from `JSON`, but I was not successful. The [yelp/dataset-examples](https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py) repository has a `json_to_csv_converter` script which can be used to convert the data to `csv`.

## Log of attempt to load JSON data

```
** <2018-02-04 Sun>
- Note taken on [2018-02-04 Sun 18:16] \\
  [[https://www.dataquest.io/blog/python-json-tutorial/][blogpost]] on some techniques to deal with large datasets
- Note taken on [2018-02-04 Sun 18:16] \\
  trying to load the =review.json= file but experiencing weird problems.
  Found a [[https://github.com/pandas-dev/pandas/issues/18152][issue]] on the pandas repository which documents the same error
  that I am seeing. The conclusion seems to be that the json file was
  malformed. I need to verify if my dataset has any issues.
** <2018-02-05 Mon>
- Note taken on [2018-02-06 Tue 12:18] \\
  the converter works when executed with python2!
- Note taken on [2018-02-05 Mon 22:43] \\
  found a =json= to =csv= converter at the [[https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py][yelp/dataset-examples]] repo. The
  code is for python 2 so need to make a few adjustments before it works.
```
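
For reference, here is a minimal sketch of one more way to attempt the direct load. It assumes the dataset files are newline-delimited JSON (one object per line), which I have not verified here; the two records below are made up, and in practice `source` would be a path such as `dataset/review.json`.

```python
import io

import pandas as pd

# Two made-up records in newline-delimited JSON form (one object per line).
source = io.StringIO(
    '{"review_id": "a1", "stars": 5, "useful": 0}\n'
    '{"review_id": "b2", "stars": 1, "useful": 3}\n'
)

# lines=True parses one JSON object per line; chunksize makes read_json
# return an iterator of DataFrames instead of loading everything at once.
chunks = pd.read_json(source, lines=True, chunksize=1)
sample = pd.concat([chunk for chunk in chunks], ignore_index=True)
print(sample)
```

Reading in chunks keeps memory use bounded, which matters for a file with millions of reviews. Whether this avoids the error logged above depends on what was actually wrong with the file.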

**Note:** the following cell takes a while to execute, so you might want to go grab some ☕️. Alternatively, I would recommend converting only the files you need manually in the shell.

In [6]:
%%bash
for file in dataset/*.json
do
    echo "converting $file to csv..."
    # quote the path so file names containing spaces are passed intact
    python2 lib/json_to_csv_converter.py "$file"
done
# python2 lib/json_to_csv_converter.py
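
If Python 2 is not available, the conversion can be sketched with the Python 3 standard library. This is not the Yelp converter: `json_lines_to_csv` is a hypothetical helper, it assumes one JSON object per line, and it assumes the first record carries every column name. Nested values are written as their Python string representation rather than flattened.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a newline-delimited JSON file to CSV (top-level keys only)."""
    with open(json_path, encoding='utf-8') as src:
        first = src.readline()
        if not first.strip():
            return 0  # empty input, nothing to write
        # Assumption: the first record contains every column name.
        fieldnames = list(json.loads(first).keys())
        src.seek(0)
        rows = 0
        with open(csv_path, 'w', newline='', encoding='utf-8') as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames,
                                    restval='', extrasaction='ignore')
            writer.writeheader()
            for line in src:
                if line.strip():  # skip blank lines
                    writer.writerow(json.loads(line))
                    rows += 1
    return rows
```

Calling `json_lines_to_csv('dataset/review.json', 'dataset/review.csv')` would then write one CSV row per review and return the number of rows written.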

# Exploration of reviews

## Hypothesis

H_o_: Reviews with higher `stars` should have a higher `useful` vote.

H_a_: Reviews with higher `stars` do not have a higher `useful` vote.

In [28]:
reviews = pd.read_csv('dataset/review.csv')

In [29]:
reviews.head()

| | funny | user_id | review_id | text | business_id | stars | date | useful | cool |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | bv2nCi5Qv5vroFiqKGopiw | v0i_UHJMo_hPBq9bxWvW4w | Love the staff, love the meat, love the place.... | 0W4lkclzZThpx3V65bVgig | 5 | 2016-05-28 | 0 | 0 |
| 1 | 0 | bv2nCi5Qv5vroFiqKGopiw | vkVSCC7xljjrAI4UGfnKEQ | Super simple place but amazing nonetheless. It... | AEx2SYEUJmTxVVB18LlCwA | 5 | 2016-05-28 | 0 | 0 |
| 2 | 0 | bv2nCi5Qv5vroFiqKGopiw | n6QzIUObkYshz4dz2QRJTw | Small unassuming place that changes their menu... | VR6GpWIda3SfvPC-lg9H3w | 5 | 2016-05-28 | 0 | 0 |
| 3 | 0 | bv2nCi5Qv5vroFiqKGopiw | MV3CcKScW05u5LVfF6ok0g | Lester's is located in a beautiful neighborhoo... | CKC0-MOWMqoeWf6s-szl8g | 5 | 2016-05-28 | 0 | 0 |
| 4 | 0 | bv2nCi5Qv5vroFiqKGopiw | IXvOzsEMYtiJI0CARmj77Q | Love coming here. Yes the place always needs t... | ACFtxLv8pGrrxMm6EgjreA | 4 | 2016-05-28 | 0 | 0 |


In [30]:
reviews.describe()

| | funny | stars | useful | cool |
|---|---|---|---|---|
| count | 5261669.0 | 5261669.0 | 5261669.0 | 5261669.0 |
| mean | 0.509196 | 3.72774 | 1.385085 | 0.5860916 |
| std | 2.686168 | 1.433593 | 4.528727 | 2.233706 |
| min | 0.0 | 1.0 | -1.0 | -1.0 |
| 25% | 0.0 | 3.0 | 0.0 | 0.0 |
| 50% | 0.0 | 4.0 | 0.0 | 0.0 |
| 75% | 0.0 | 5.0 | 2.0 | 1.0 |
| max | 1481.0 | 5.0 | 3364.0 | 1105.0 |


In [31]:
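# correlations between the numeric columns
# (selecting them explicitly, since recent pandas versions raise on string columns)
reviews[['funny', 'stars', 'useful', 'cool']].corr()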

| | funny | stars | useful | cool |
|---|---|---|---|---|
| funny | 1.0 | -0.048866 | 0.621663 | 0.661669 |
| stars | -0.048866 | 1.0 | -0.077122 | 0.044828 |
| useful | 0.621663 | -0.077122 | 1.0 | 0.677069 |
| cool | 0.661669 | 0.044828 | 0.677069 | 1.0 |


We note that there is a weak negative correlation between `stars` and `useful` (about -0.077), which means that a review with higher `stars` tends to receive a slightly *lower* `useful` vote. This runs against the expectation stated in H_o_. We can further validate this observation by creating a pivot table of `useful` vs. `stars`.

In [25]:
pd.pivot_table(reviews, values='useful', index='stars')

| stars | useful |
|---|---|
| 1 | 3.202899 |
| 2 | 2.900901 |
| 3 | 1.969112 |
| 4 | 1.734139 |
| 5 | 2.047826 |
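
`pd.pivot_table` aggregates with the mean by default, and the `describe()` output above reports a maximum `useful` of 3364 against a 75th percentile of 2, so a handful of reviews can pull the per-star means around. As a rough sketch, a `groupby` can show the median and the number of reviews next to the mean. `toy_reviews` and its values are made up for illustration and stand in for `reviews`.

```python
import pandas as pd

# Toy stand-in for the `reviews` DataFrame (values are made up).
toy_reviews = pd.DataFrame({
    'stars':  [1, 1, 1, 5, 5, 5],
    'useful': [0, 1, 20, 0, 0, 3],
})

# Mean, median and review count of `useful` for each star rating.
summary = toy_reviews.groupby('stars')['useful'].agg(['mean', 'median', 'count'])
print(summary)
```

In the toy data a single review with 20 votes lifts the 1-star mean well above its median, which is the kind of skew worth checking for before reading too much into the means.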


The pivot table above is inconclusive: the mean `useful` falls from 1 to 4 stars but rises again at 5 stars, so further analysis is required. Next, let's count the reviews per user in the `reviews` df. We can do this using the `user_id` column. Note that `user_counts` obtained below is sorted in descending order. We then take the top 25% of users, who have posted the most reviews, and similarly the bottom 25% of users, who have posted the fewest.

In [42]:
# count of users (sorted highest to lowest)
user_counts = reviews['user_id'].value_counts()

In [60]:
import math

REVIEWS_LEN = reviews.shape[0]
TOP_25 = math.floor(REVIEWS_LEN*0.25)
BOT_25 = -TOP_25

most_frequent_users = user_counts[:TOP_25] # first 25%
least_frequent_users = user_counts[BOT_25:] # last 25%

Next, we obtain a df for each group: one containing the reviews from the top 25% of users and one containing the reviews from the bottom 25%.

In [68]:
most_frequent_user_reviews = reviews.filter(items=most_frequent_users, axis=0)
least_frequent_user_reviews = reviews.filter(items=least_frequent_users, axis=0)
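
A note on the selection step: `DataFrame.filter(items=..., axis=0)` matches row index labels, and iterating over a `value_counts()` Series yields the counts rather than the user IDs. The sketch below shows one way to select reviews by `user_id` instead, with the 25% cutoff computed from the number of distinct users. It is an alternative under those assumptions, not the code that produced the outputs in this notebook; `toy_reviews` and the other names are made up for illustration.

```python
import math

import pandas as pd

# Toy stand-in for `reviews`: user 'a' has 4 reviews, 'b' 3, 'c' 2, 'd' 1.
toy_reviews = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
    'stars':   [5, 4, 5, 3, 1, 2, 1, 5, 4, 1],
})

# Reviews per user, sorted from most to fewest.
toy_counts = toy_reviews['user_id'].value_counts()

# 25% of the distinct users (at least one, so the negative slice is never empty).
n_quarter = max(1, math.floor(len(toy_counts) * 0.25))

# The index of value_counts() holds the user IDs; the values hold the counts.
top_ids = toy_counts.index[:n_quarter]
bottom_ids = toy_counts.index[-n_quarter:]

# Keep the reviews whose user_id belongs to each group.
top_reviews = toy_reviews[toy_reviews['user_id'].isin(top_ids)]
bottom_reviews = toy_reviews[toy_reviews['user_id'].isin(bottom_ids)]
print(len(top_reviews), len(bottom_reviews))
```

With many users tied at one review each, which of them land in the bottom 25% depends on how `value_counts()` orders the ties.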

In [69]:
most_frequent_user_reviews[['funny', 'stars', 'useful', 'cool']].corr()

| | funny | stars | useful | cool |
|---|---|---|---|---|
| funny | 1.0 | -0.341695 | 0.400509 | 0.31868 |
| stars | -0.341695 | 1.0 | -0.623933 | -0.38893 |
| useful | 0.400509 | -0.623933 | 1.0 | 0.368146 |
| cool | 0.31868 | -0.38893 | 0.368146 | 1.0 |


In [70]:
pd.pivot_table(most_frequent_user_reviews, values='useful', index='stars')

| stars | useful |
|---|---|
| 1 | 5.704199 |
| 2 | 1.97107 |
| 3 | 1.988311 |
| 4 | 0.198111 |
| 5 | 0.072265 |


In [71]:
least_frequent_user_reviews[['funny', 'stars', 'useful', 'cool']].corr()

| | funny | stars | useful | cool |
|---|---|---|---|---|
| funny | 1.0 | -0.371052 | 0.423652 | 0.314621 |
| stars | -0.371052 | 1.0 | -0.63229 | -0.3968 |
| useful | 0.423652 | -0.63229 | 1.0 | 0.357234 |
| cool | 0.314621 | -0.3968 | 0.357234 | 1.0 |


In [72]:
least_frequent_usefulness = pd.pivot_table(least_frequent_user_reviews, values='useful', index='stars')
least_frequent_usefulness

| stars | useful |
|---|---|
| 1 | 5.902944 |
| 2 | 2.154024 |
| 3 | 2.060877 |
| 4 | 0.180985 |
| 5 | 0.070508 |


Now, constructing pivot tables using the reviews from the top 25% and bottom 25% of users, we can clearly see that reviews with higher `stars` receive lower `useful` votes! This supports H_a_.

## Alternative Hypothesis

H_o_: `useful` is a better attribute for predicting whether a review is fake or not.

H_a_: `stars` is a better attribute for predicting whether a review is fake or not.