## Amazon Book Reviews Part IV: Hypothesis Testing 

#### This is the 5-core dataset, which means that each user and each item has at least 5 reviews. It has ~9 million reviews: http://jmcauley.ucsd.edu/data/amazon/

#### Introduction 

1. Overall ratings are split into two categories: ratings below 3 stars are negative, and ratings of 3 stars and above are positive (this matches the `partition` function used below). Helpfulness is also split into two categories: a review with a helpfulness ratio below 50% is 'NOT' helpful.
2. We test whether the star rating category is associated with helpfulness.

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.io import gbq
import scipy.stats as st
import statsmodels.stats.api as sms
%matplotlib inline

In [2]:
query1 = "SELECT overall, percHelpful FROM amazon_book_small.help_length \
WHERE unixReviewTime >= '1996-01-01 00:00:00 UTC' AND unixReviewTime < '2012-01-01 00:00:00 UTC'"

In [3]:
project_id = 'dotted-chiller-156222'

In [4]:
da = gbq.read_gbq(query1, project_id=project_id)

Requesting query... ok.
Query running...
Query done.
Processed: 172.1 Mb

In [5]:
da.head(1)

Unnamed: 0,overall,percHelpful
0,5,0.0


In [6]:
query2 = "SELECT overall, percHelpful FROM amazon_book_small.help_length \
WHERE unixReviewTime >= '2012-01-02 00:00:00 UTC' AND unixReviewTime < '2015-01-01 00:00:00 UTC'"

In [7]:
db = gbq.read_gbq(query2, project_id=project_id)

Requesting query... ok.
Query running...
  Elapsed 11.15 s. Waiting...
Query done.
Processed: 172.1 Mb

In [22]:
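# Combine both date ranges into one frame.
df = pd.concat([da, db])
# Missing values (presumably reviews with no votes) are set to 0, i.e. not helpful.
df = df.fillna(0)
df.head(1)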

Unnamed: 0,overall,percHelpful
0,5,0.0


In [23]:
df.shape

(8895872, 2)

A helpfulness ratio column was prepared in BigQuery as (yes votes / total votes). Note that 90/100 yields the same ratio as 9/10; I am not weighting by the number of votes, so a ratio of 0.9 (90%) carries the same weight regardless of how many votes produced it.
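
To make the weighting caveat concrete, here is a small sketch comparing the unweighted mean of per-review ratios with a vote-weighted ratio. The vote counts are made up for illustration and are not taken from the dataset.

```python
# Hypothetical (yes_votes, total_votes) pairs; illustrative only, not from the dataset.
votes = [(90, 100), (9, 10), (1, 2)]

# Per-review helpfulness ratio, as in the percHelpful column.
ratios = [yes / total for yes, total in votes]

# Unweighted mean: every review counts equally, however many votes it received.
unweighted = sum(ratios) / len(ratios)

# Vote-weighted ratio: reviews with more votes contribute more.
weighted = sum(yes for yes, _ in votes) / sum(total for _, total in votes)

print(ratios)                # 90/100 and 9/10 give the same ratio
print(unweighted, weighted)  # the two summaries differ
```

This notebook uses the unweighted view throughout, so a review with 2 votes influences the result as much as one with 100 votes.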

### Function to make overall ratings positive or negative

In [None]:
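def partition(x):
    # Ratings below 3 stars are negative; 3, 4 and 5 stars are positive.
    if x < 3.0:
        return 'negative'
    return 'positive'

# tmp is an alias of df (not a copy), so df itself is relabeled in place.
# Later cells rely on df['overall'] holding these labels.
tmp = df
tmp['overall'] = tmp['overall'].map(partition)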

In [25]:
tmp.head()

Unnamed: 0,overall,percHelpful
0,positive,0.0
1,positive,0.857143
2,positive,0.666667
3,positive,0.571429
4,positive,0.971429


### Function to make helpfulness useful (1) or not (0)

In [26]:
def partitionP(x):
    # A helpfulness ratio below 0.5 counts as not helpful (0.0), otherwise helpful (1.0).
    if x < 0.5:
        return 0.0
    return 1.0

# tmp is an alias of df (not a copy), so df itself is updated in place.
tmp = df
tmp['percHelpful'] = tmp['percHelpful'].map(partitionP)
tmp.head()

Unnamed: 0,overall,percHelpful
0,positive,0.0
1,positive,1.0
2,positive,1.0
3,positive,1.0
4,positive,1.0


In [27]:
total = len(tmp)
total

8895872

In [28]:
pos = len(tmp[tmp.overall == 'positive'])
pos

8157203

In [30]:
neg = len(tmp[tmp.overall == 'negative'])
neg

738669

In [31]:
pos_frac = float(pos)/float(total)
pos_frac

0.9169649698197097

#### 92% of the total reviews are positive.

In [32]:
neg_frac = (1.0 -pos_frac)
neg_frac

0.08303503018029035

I am going to choose 20,000 random reviews from the original dataset to test the hypothesis. The positive-to-negative ratio in the original dataset is roughly 11:1 (92% to 8%), and a simple random sample of this size should approximately preserve that ratio.

In [33]:
from random import sample
dj = df.iloc[sample(range(len(df)), 20000), :]
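
The cell above draws a simple random sample, so the class ratio is preserved only approximately. If an exact ratio were wanted, one option is proportional allocation per class. The sketch below uses only the class counts reported earlier; `proportional_allocation` is a hypothetical helper and not part of the original notebook.

```python
def proportional_allocation(counts, n):
    """Split a sample of size n across groups in proportion to their sizes."""
    total = sum(counts.values())
    # Round each group's share, then give any rounding remainder to the largest group.
    alloc = {k: round(n * c / total) for k, c in counts.items()}
    largest = max(counts, key=counts.get)
    alloc[largest] += n - sum(alloc.values())
    return alloc

# Class counts reported earlier in the notebook.
counts = {'positive': 8157203, 'negative': 738669}
alloc = proportional_allocation(counts, 20000)
print(alloc)
```

Each class could then be sampled separately with those sizes, for example with `DataFrame.sample(n=..., random_state=...)`, which would also make the draw reproducible.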

In [34]:
len(dj[dj.overall == 'positive'])

18352

In [35]:
len(dj[dj.overall == 'negative'])

1648

In [37]:
sum(dj[dj.overall=='positive'].percHelpful)

7838.0

In [39]:
sum(dj[dj.overall =='negative'].percHelpful)

843.0

In [40]:
posYes = 7838.0/18352.0
posYes

0.4270924149956408

In [41]:
negYes = 843.0/1648.0
negYes

0.5115291262135923

In [42]:
negYes - posYes

0.08443671121795143

For this sample, negative reviews were rated helpful about 8 percentage points more often than positive reviews (51.2% vs. 42.7%). We will test whether this difference is statistically significant.

The null hypothesis is that there is no difference, i.e., posYes = negYes. Since the samples are chosen randomly, let's calculate the pooled proportion $\hat{p}$:

In [60]:
# phat calculation 
p_exp = (7838.0+843)/20000.0
p_exp

0.43405

In [61]:
pnegY = p_exp*1648
pnegY

715.3144

In [62]:
pnegN = (1-p_exp)*1648
pnegN

932.6855999999999

In [63]:
pposY = p_exp*18352
pposY

7965.6856

In [64]:
pposN = (1-p_exp)*18352
pposN

10386.3144

All expected counts ($n\hat{p}$ and $n(1-\hat{p})$) are greater than 10, so we can use the central limit theorem (CLT) to calculate the z-score, p-value and confidence interval:
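
The success-failure check can also be written compactly. This sketch recomputes the four expected counts from the sample figures above and confirms that each is at least 10; the variable names are illustrative.

```python
# Sample figures from the cells above.
helpful = {'positive': 7838, 'negative': 843}
sizes = {'positive': 18352, 'negative': 1648}

# Pooled proportion under the null hypothesis of no difference.
p_pooled = sum(helpful.values()) / sum(sizes.values())

# Expected helpful (n*p) and not-helpful (n*(1-p)) counts per group.
expected = {}
for group, n in sizes.items():
    expected[group] = (n * p_pooled, n * (1 - p_pooled))

# Success-failure condition: every expected count should be at least 10.
condition_met = all(count >= 10 for pair in expected.values() for count in pair)
print(expected, condition_met)
```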

In [44]:
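def ZscoreProp(num1, num2, size1, size2):
    # Two-proportion z-test using the pooled proportion under the null hypothesis.
    num1, num2, size1, size2 = float(num1), float(num2), float(size1), float(size2)
    prop1 = num1/size1
    prop2 = num2/size2
    teststat = abs(prop1 - prop2)
    p_exp = (num1 + num2)/(size1+size2)
    # NOTE: std_err is rounded to 2 decimals here (about 0.0127 becomes 0.01), which
    # inflates the z-score and narrows the interval. ZscoreProp1 below drops this rounding.
    std_err = round(np.sqrt(p_exp*(1-p_exp)*(1/size1 + 1/size2)), 2)
    z_score = round(teststat/std_err, 2)
    # Two-sided p-value from the standard normal survival function.
    p_value = st.norm.sf(z_score)*2
    conf1 = round(teststat - (1.96*std_err), 3)
    conf2 = round(teststat + (1.96*std_err), 3)
    return z_score, p_value, conf1, conf2
ZscoreProp(7838, 843, 18352, 1648)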

(8.44, 3.1733766408965351e-17, 0.065, 0.104)

With a p-value < .001, we reject the null hypothesis. The 95% confidence interval above suggests that negative reviews are rated helpful roughly 6.5 to 10.4 percentage points more often than positive reviews.
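
Note that `ZscoreProp` rounds the standard error to two decimals before dividing, which turns roughly 0.0127 into 0.01. As a cross-check, the sketch below repeats the calculation without that rounding, using only the standard library (`math.erfc` gives the normal tail). `two_prop_ztest` is a hypothetical name. With the sample counts the z-score comes out near 6.6 rather than 8.44, and the interval widens to about 0.059 to 0.109. The conclusion (p < .001) is unchanged.

```python
import math

def two_prop_ztest(num1, num2, size1, size2):
    """Pooled two-proportion z-test; returns (z, two-sided p, 95% CI of |p1 - p2|)."""
    prop1, prop2 = num1 / size1, num2 / size2
    diff = abs(prop1 - prop2)
    p_pooled = (num1 + num2) / (size1 + size2)
    # Standard error under the null, kept at full precision.
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / size1 + 1 / size2))
    z = diff / std_err
    # Two-sided p-value: 2 * P(Z > z) = erfc(z / sqrt(2)).
    p_value = math.erfc(z / math.sqrt(2))
    # Same interval construction as the notebook (pooled standard error).
    ci = (diff - 1.96 * std_err, diff + 1.96 * std_err)
    return z, p_value, ci

z, p_value, ci = two_prop_ztest(7838, 843, 18352, 1648)
print(round(z, 2), p_value, [round(x, 3) for x in ci])
```

A confidence interval for the difference is more commonly built from the unpooled standard error; the pooled version is kept here only to stay comparable with the notebook's function.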

We can also use statsmodels:

In [47]:
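# Note: value=0.08 sets the null hypothesis to (positive rate - negative rate) = 0.08,
# not to "no difference"; the observed difference is about -0.084.
sms.proportions_ztest(np.array([7838.0, 843.0]), np.array([18352.0, 1648.0]), value=0.08)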

(-12.901647619219666, 4.4057541906528601e-38)
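
In `proportions_ztest`, `value` is the difference between the two proportions assumed under the null hypothesis. With `value=0.08` the call tests whether (positive rate minus negative rate) equals +0.08, not whether the two rates are equal. Because the observed difference is about -0.084, the statistic above is large and negative. The sketch below reproduces that arithmetic with the standard library to show the effect of `value`; `ztest_with_value` is a hypothetical helper that mirrors the pooled-variance calculation and is not the statsmodels source.

```python
import math

def ztest_with_value(counts, nobs, value=0.0):
    """z statistic for H0: p1 - p2 == value, using the pooled standard error."""
    p1, p2 = counts[0] / nobs[0], counts[1] / nobs[1]
    p_pooled = sum(counts) / sum(nobs)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))
    return (p1 - p2 - value) / std_err

counts, nobs = [7838, 843], [18352, 1648]
z_shifted = ztest_with_value(counts, nobs, value=0.08)  # null: difference is +0.08
z_equal = ztest_with_value(counts, nobs, value=0.0)     # null: no difference
print(round(z_shifted, 2), round(z_equal, 2))
```

For the null hypothesis of equal proportions stated earlier, the statsmodels call would leave `value` at its default.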

Since we have the whole population, we can calculate the confidence interval on the full dataset too:

In [49]:
pos, neg

(8157203, 738669)

In [50]:
sum(df[df.overall=='positive'].percHelpful)

3499532.0

In [51]:
sum(df[df.overall =='negative'].percHelpful)

373310.0

In [59]:
def ZscoreProp1(num1, num2, size1, size2):
    num1, num2, size1, size2 = float(num1), float(num2), float(size1), float(size2)
    prop1 = num1/size1
    prop2 = num2/size2
    teststat = abs(prop1 - prop2)
    p_exp = (num1 + num2)/(size1+size2)
    std_err = np.sqrt(p_exp*(1-p_exp)*(1/size1 + 1/size2))
    z_score = round(teststat/std_err, 2)
    p_value = st.norm.sf(z_score)*2
    conf1 = round(teststat - (1.96*std_err), 3)
    conf2 = round(teststat + (1.96*std_err), 3)
    return z_score, p_value, conf1, conf2
ZscoreProp1(3499532.0, 373310.0, 8157203.0, 738669.0)

(126.77, 0.0, 0.075, 0.078)

Now the confidence interval is very narrow (about 7.5 to 7.8 percentage points) because we are using the whole dataset.
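
The narrowing follows from the standard error shrinking roughly as one over the square root of the sample size. A quick sketch comparing the 20,000-review sample with the full dataset, using the counts reported above; `pooled_std_err` is a hypothetical helper.

```python
import math

def pooled_std_err(num1, num2, size1, size2):
    """Pooled standard error of the difference between two proportions."""
    p = (num1 + num2) / (size1 + size2)
    return math.sqrt(p * (1 - p) * (1 / size1 + 1 / size2))

se_sample = pooled_std_err(7838, 843, 18352, 1648)          # 20,000-review sample
se_full = pooled_std_err(3499532, 373310, 8157203, 738669)  # all 8,895,872 reviews

# Half-widths of the 95% intervals (1.96 standard errors).
print(1.96 * se_sample, 1.96 * se_full)
# The ratio of standard errors is close to sqrt(8895872 / 20000), about 21.
print(se_sample / se_full, math.sqrt(8895872 / 20000))
```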

### Recommendation

We see that negative reviews are more likely to be rated helpful. However, this does not mean that positive reviews are not helpful. To make that call, we would have to examine the ratings, the length of the reviews and the helpfulness ratings of the reviewers.