## About the dataset
This dataset contains Amazon product data, including product categories and various metadata. It covers the user ratings and reviews of the most-reviewed product in the Electronics category.
## Variables
- **reviewerID** – User ID
- **asin** – Product ID
- **reviewerName** – User name
- **helpful** – Helpfulness rating of the review
- **reviewText** – User-written review text
- **overall** – Product rating
- **summary** – Review summary
- **unixReviewTime** – Review time (Unix Time)
- **reviewTime** – Review time (Raw)

In [1]:
#Essential imports
import pandas as pd
import math
import scipy.stats as st

In [2]:
# Pandas options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [3]:
# Reading dataset
df = pd.read_csv("/Users/aslihankalyonkat/Desktop/DSMLBC/datasets/amazon_review.csv")
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


In [4]:
# checking the number of days since the comment was made
df["day_diff"].quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1])

0.10000    167.00000
0.20000    248.00000
0.30000    311.00000
0.40000    361.00000
0.50000    431.00000
0.60000    497.40000
0.70000    562.80000
0.80000    638.00000
0.90000    708.00000
0.99000    943.00000
1.00000   1064.00000
Name: day_diff, dtype: float64

### Defining a time-based weighted average function

In [5]:
def time_based_weighted_average(dataframe, w1=24, w2=22, w3=20, w4=18, w5=16):
    # Split reviews into day_diff quintiles; w1 weights the newest reviews, w5 the oldest
    days = dataframe["day_diff"]
    rating = dataframe["overall"]
    q20, q40, q60, q80 = days.quantile([0.2, 0.4, 0.6, 0.8])
    return rating[days <= q20].mean() * w1 / 100 + \
           rating[(days > q20) & (days <= q40)].mean() * w2 / 100 + \
           rating[(days > q40) & (days <= q60)].mean() * w3 / 100 + \
           rating[(days > q60) & (days <= q80)].mean() * w4 / 100 + \
           rating[days > q80].mean() * w5 / 100

In [6]:
# Time-based weighted average
time_based_weighted_average(df)

4.600026902863648

In [7]:
# Plain (unweighted) average
df["overall"].mean()

4.587589013224822

In [8]:
# Difference between the weighted and the plain average
time_based_weighted_average(df) - df["overall"].mean()

0.012437889638825972

## Interpretation
The time-weighted average (about 4.6000) is only about 0.012 higher than the plain average (about 4.5876). However, the product has 4915 reviews in total, so the average cannot be moved noticeably by a handful of high scores. The small increase therefore suggests that recent reviews have been scoring higher than older ones.
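The effect of the weights is easier to see on a small made-up sample. The sketch below is not the notebook's function: `bucketed_time_weighted_average` is a hypothetical helper that assumes `pd.qcut` can split `day_diff` into five quantile buckets without duplicate edges, and the `toy` ratings are invented for illustration only.

```python
import pandas as pd


def bucketed_time_weighted_average(dataframe, weights=(24, 22, 20, 18, 16)):
    # Label each review with its day_diff quantile bucket (0 = most recent reviews)
    bucket = pd.qcut(dataframe["day_diff"], q=len(weights), labels=False)
    # Mean rating per bucket, indexed by bucket label
    bucket_means = dataframe.groupby(bucket)["overall"].mean()
    # Weighted sum of the bucket means; the weights are percentages summing to 100
    return sum(bucket_means.loc[i] * w / 100 for i, w in enumerate(weights))


# Invented sample: newer reviews (small day_diff) carry higher ratings
toy = pd.DataFrame({
    "day_diff": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "overall": [5, 5, 5, 5, 4, 4, 4, 4, 3, 3],
})

weighted = bucketed_time_weighted_average(toy)
plain = toy["overall"].mean()
print(f"weighted: {weighted:.2f}, plain: {plain:.2f}")
```

Because the most recent buckets carry the largest weights, the weighted value ends up above the plain mean whenever newer reviews are rated higher, which is the pattern observed for this product.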

### Checking recent reviews

In [9]:
df.loc[df["day_diff"] <= df["day_diff"].quantile(0.2), "overall"].mean() 

4.680203045685279

### Checking old reviews

In [10]:
df.loc[df["day_diff"] > df["day_diff"].quantile(0.8), "overall"].mean()

4.4346938775510205

### Result
The average score of the most recent 20% of reviews is about 4.68, while the average of the oldest 20% is about 4.43, which means the product has been receiving better scores recently.

## Selecting 20 reviews to display on the product detail page

In [11]:
# Sorting by total vote number
df[["total_vote","helpful_yes"]].sort_values(by="total_vote",ascending=False)

Unnamed: 0,total_vote,helpful_yes
2031,2020,1952
4212,1694,1568
3449,1505,1428
317,495,422
2909,236,53
...,...,...
1729,0,0
1728,0,0
1727,0,0
1726,0,0


In [12]:
# Creating the helpful_no column for "not helpful" votes
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
df[["total_vote","helpful_yes", "helpful_no"]].sort_values(by="total_vote",ascending=False)

Unnamed: 0,total_vote,helpful_yes,helpful_no
2031,2020,1952,68
4212,1694,1568,126
3449,1505,1428,77
317,495,422,73
2909,236,53,183
...,...,...,...
1729,0,0,0
1728,0,0,0
1727,0,0,0
1726,0,0,0


### Defining the Wilson lower bound function
The lower bound of the Wilson score confidence interval for a Bernoulli parameter provides a way to rank items, here the reviews, by their positive and negative votes.
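Written out, and assuming the standard form of the Wilson score interval, the quantity computed by the `wilson_lower_bound` function in the next cell is the following, where $n$ is the total number of votes (`up + down`), $\hat{p}$ is the observed share of positive votes (`up / n`), and $z$ is the standard normal quantile at $1 - (1 - \text{confidence})/2$:

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
$$

For a fixed $\hat{p}$, the bound moves closer to $\hat{p}$ as $n$ grows, so a review needs many votes before a high helpful ratio translates into a high score.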

In [13]:
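def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson score interval for the share of positive votes."""
    n = up + down
    if n == 0:
        # No votes at all: return the lowest possible score
        return 0
    # Two-sided critical value of the standard normal distribution
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    # Observed proportion of positive votes
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)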
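One way to see what the lower bound adds over the raw helpful ratio is to compare a review with very few votes against one with many. The sketch below is self-contained and, as an assumption made for portability, uses `statistics.NormalDist` from the standard library in place of `scipy.stats`. The name `wilson_lower_bound_sketch` and the 2-vote review are made up for illustration; the 1952/68 vote counts are those of review 2031 in the table above.

```python
import math
from statistics import NormalDist


def wilson_lower_bound_sketch(up, down, confidence=0.95):
    # Same formula as wilson_lower_bound, with the z value taken from the stdlib
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)


# Hypothetical review: 2 helpful votes, 0 unhelpful (raw ratio 1.0)
few_votes = wilson_lower_bound_sketch(2, 0)
# Review 2031: 1952 helpful votes, 68 unhelpful (raw ratio about 0.966)
many_votes = wilson_lower_bound_sketch(1952, 68)

print(f"2 yes / 0 no:     ratio = {2 / 2:.3f}, lower bound = {few_votes:.3f}")
print(f"1952 yes / 68 no: ratio = {1952 / 2020:.3f}, lower bound = {many_votes:.3f}")
```

Sorting by the raw ratio would put the 2-vote review first; sorting by the lower bound puts the heavily voted review first, because two votes are too little evidence to support a high score.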

In [14]:
# applying wilson lower bound function
df["wilson_lower_bound"] = df.apply(lambda row: wilson_lower_bound(row["helpful_yes"], row["helpful_no"]), axis=1)
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,helpful_no,wilson_lower_bound
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0,0,0.0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0,0,0.0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0,0,0.0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0,0,0.0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0,0,0.0


## Getting top twenty reviews

In [15]:
df[["reviewText","wilson_lower_bound"]].sort_values(by="wilson_lower_bound", ascending=False).head(20)

Unnamed: 0,reviewText,wilson_lower_bound
2031,[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,0.95754
3449,I have tested dozens of SDHC and micro-SDHC ca...,0.93652
4212,NOTE: please read the last update (scroll to ...,0.91214
317,"If your card gets hot enough to be painful, it...",0.81858
4672,Sandisk announcement of the first 128GB micro ...,0.80811
1835,Bought from BestBuy online the day it was anno...,0.78465
3981,The last few days I have been diligently shopp...,0.73214
3807,I bought this card to replace a lost 16 gig in...,0.70044
4306,"While I got this card as a ""deal of the day"" o...",0.67033
4596,Hi:I ordered two card and they arrived the nex...,0.66359
