You can find a book rating dataset on the following GitHub page:
https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv
Calculate the average rating for each book. Also calculate the Bayesian
average rating for each book. How do the average and Bayesian average
ratings differ based on the number of reviews for each book?

In [10]:
#load tools
import pandas as pd
import numpy as np


In [39]:
#load file
df = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv")
df

,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3
...,...,...,...
5976474,49925,510,5
5976475,49925,528,4
5976476,49925,722,4
5976477,49925,949,5


In [12]:
#the first thing we notice about the data set is that it isn't indexed by book_id
#i'm going to check that there are multiple different ratings for a single book by sorting
df.sort_values(by=['book_id'])

,user_id,book_id,rating
2174136,29300,1,4
433265,6590,1,3
1907014,7546,1,5
3743260,43484,1,1
1266846,18689,1,5
...,...,...,...
2366366,31293,10000,3
3376022,12272,10000,4
2811513,35330,10000,4
4134364,46337,10000,5


In [63]:
#first we're going to find the average of the ratings for each book
#we'll keep the averages in a separate frame called book_avg
book_avg = pd.DataFrame()

#groupby collects the ratings for each book_id and mean() averages them
#no loop is needed: groupby handles every book_id in a single pass
book_avg['average'] = df.groupby('book_id')['rating'].mean()
book_avg

book_id,average
1,4.279707
2,4.351350
3,3.214341
4,4.329369
5,3.772224
...,...
9996,4.014184
9997,4.451613
9998,4.323529
9999,3.707692


In [40]:
#part 2 of this assignment asks for each book's Bayesian average rating
#i'm going to break it down into its different parts
#Ni = number of ratings for product i

book_count = pd.DataFrame()
book_count['Ni'] = df.groupby('book_id')['rating'].count()
book_count

book_id,Ni
1,22806
2,21850
3,16931
4,19088
5,16604
...,...
9996,141
9997,93
9998,102
9999,130


In [44]:
#N = average number of ratings per product
N = book_count['Ni'].mean()
N

597.6479

In [43]:
#u = mean of all individual ratings across all products (the global mean rating)
u = df['rating'].mean()
u

3.9198655261735214
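
A note on u: `df['rating'].mean()` averages every individual rating, which is not the same thing as averaging the per-book means. The sketch below uses a small made-up frame (`toy` is hypothetical, not part of the dataset) to show the two quantities diverging when books have different numbers of ratings.

```python
import pandas as pd

# Hypothetical toy data: book 1 has four ratings, book 2 has only one
toy = pd.DataFrame({
    "book_id": [1, 1, 1, 1, 2],
    "rating":  [5, 5, 5, 5, 1],
})

# Mean of every individual rating (the same call used for u above)
mean_of_ratings = toy["rating"].mean()

# Mean of the per-book means (each book counts once, whatever its size)
mean_of_book_means = toy.groupby("book_id")["rating"].mean().mean()

print(mean_of_ratings, mean_of_book_means)
```

Either choice can serve as the prior mean; this notebook uses the first one.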

In [50]:
#ui = mean rating for product i
#we already did this earlier, but i'll do it again in context of the formula
product_mean = pd.DataFrame()
product_mean['ui'] = df.groupby('book_id')['rating'].mean()
product_mean

book_id,ui
1,4.279707
2,4.351350
3,3.214341
4,4.329369
5,3.772224
...,...
9996,4.014184
9997,4.451613
9998,4.323529
9999,3.707692


In [52]:
#now we can piece everything together for the calculation
#BA = ((N*u)+(Ni*ui))/(N+Ni)
BA = pd.DataFrame()
BA['Bayesian Average'] = ((N*u)+(book_count['Ni']*product_mean['ui']))/(N+book_count['Ni'])
BA

book_id,Bayesian Average
1,4.270518
2,4.339862
3,3.238396
4,4.316937
5,3.777353
...,...
9996,3.937870
9997,3.991469
9998,3.978715
9999,3.881959
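
As a sanity check on the formula, here is a minimal sketch that recomputes two of the rows by hand. The function name `bayesian_average` and its parameter names are my own choices for illustration; the numbers are copied from the outputs printed earlier, so they carry the same rounding.

```python
def bayesian_average(n_i, u_i, n_prior, u_prior):
    """BA = (N*u + Ni*ui) / (N + Ni), the formula used in the cell above."""
    return (n_prior * u_prior + n_i * u_i) / (n_prior + n_i)

# Values copied from the outputs printed earlier in this notebook
N = 597.6479             # average number of ratings per book
u = 3.9198655261735214   # mean of all ratings

ba_9998 = bayesian_average(102, 4.323529, N, u)   # book with few ratings
ba_1 = bayesian_average(22806, 4.279707, N, u)    # book with many ratings

print(ba_9998, ba_1)
```

Both results should agree with the 3.978715 and 4.270518 in the table to roughly four decimal places; any remaining difference comes from the rounded inputs.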


In [53]:
#let's look at it side by side
BA['mean'] = product_mean['ui']
BA

book_id,Bayesian Average,mean
1,4.270518,4.279707
2,4.339862,4.351350
3,3.238396,3.214341
4,4.316937,4.329369
5,3.777353,3.772224
...,...,...
9996,3.937870,4.014184
9997,3.991469,4.451613
9998,3.978715,4.323529
9999,3.881959,3.707692


How do the average and Bayesian average ratings differ based on the number of reviews for each book?

As we can see, the regular average can be misleading when a book has only a few ratings. For example, for book_id 9998 the regular mean was 4.32, but the Bayesian average was 3.98. This is quite the difference. The reason lies in the number of ratings given for that book: 102, far fewer than the average of about 598 ratings per book. With so few ratings, the N*u term outweighs the book's own ratings, so the Bayesian average is pulled toward the global mean of 3.92. By contrast, book_id 1 has 22806 ratings, and its two averages (4.28 and 4.27) are nearly identical. The Bayesian average adjusts each rating to account for how many ratings the book has relative to the rest of the books.
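
To make the pattern easier to see, the sketch below rebuilds a small frame from the nine rows printed in the outputs above and rewrites the formula as a weighted average, BA = w*u + (1-w)*ui with w = N/(N+Ni). This is algebraically the same formula; the column names `prior_weight` and `gap` are my own labels, and only the printed (rounded) values are used, not the full dataset.

```python
import pandas as pd

# Constants copied from the outputs printed earlier in this notebook
N = 597.6479             # average number of ratings per book
u = 3.9198655261735214   # mean of all ratings

# The nine books whose Ni and ui appear in the printed tables
books = pd.DataFrame(
    {
        "Ni": [22806, 21850, 16931, 19088, 16604, 141, 93, 102, 130],
        "ui": [4.279707, 4.351350, 3.214341, 4.329369, 3.772224,
               4.014184, 4.451613, 4.323529, 3.707692],
    },
    index=pd.Index([1, 2, 3, 4, 5, 9996, 9997, 9998, 9999], name="book_id"),
)

# Weight on the global mean: near 1 for few ratings, near 0 for many
books["prior_weight"] = N / (N + books["Ni"])

# Same Bayesian average, written as a blend of u and ui
books["BA"] = books["prior_weight"] * u + (1 - books["prior_weight"]) * books["ui"]

# How far the Bayesian average moved away from the plain mean
books["gap"] = (books["BA"] - books["ui"]).abs()

print(books.sort_values("Ni").round(3))
```

Sorted by Ni, the rows with around a hundred ratings carry most of their weight on the global mean and show the largest gaps, while the rows with tens of thousands of ratings barely move.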

