## Exploratory Analysis: How to fill missing ratings

What is the probability that a user liked a book, given that they read another book by the author?

Is there a strong relationship between genre and how highly a user rates a book?

Is there a strong relationship between author and how highly a user rates a book?

Does the rating a user gives a book change over time? Do people become less likely to rate books highly if they have been reviewing books for a longer time?

In [None]:
## loading libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import plotly.express as px

In [7]:
# loading in read books
reads = pd.read_csv('data/reads_cleaned.csv')


In [8]:
reads.columns

Index(['Unnamed: 0', 'title', 'author', 'num ratings', 'rating',
       '# times read', 'date read', 'date added', 'link', 'user_id',
       'global_pop', 'book_id', 'book_tag', 'is_rated', 'average_user_rating',
       'book_popularity_rated', 'book_popularity_read', 'user_books_rated',
       'user_books_read', 'user_rating_variability', 'user_rating_percentage',
       'book_rating_percentage', 'adjusted_rating'],
      dtype='object')

In [None]:
# blank out the "not set" placeholder so those rows parse as missing dates
reads['date read'] = list(map(lambda x: re.sub("not set", "", x), reads['date read']))

# keep the first 12 characters when the date contains a comma (day included), otherwise the first 8
reads['date read'] = reads['date read'].apply(lambda x: x[:12] if ',' in x else x[:8])
reads['date read'].head(20)

# format='mixed' (pandas >= 2.0) infers the format per element; empty strings become NaT
reads['date read'] = pd.to_datetime(reads['date read'], format='mixed')

### What is the probability that a user liked a book, given that they read a subsequent book by the author?

To answer this question, we look at the first book each user rated by a given author, restricted to rows where the read date is available, and compare users who went on to read another book by that author with users who did not.

For the purposes of this exercise, a person liked a book if the rating they gave the book is higher than their average rating.
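
To make the definition concrete, here is a minimal sketch on a made-up four-row table. The data are hypothetical, and `average_user_rating` is computed here as a plain per-user mean, which is an assumption about how the column in `reads` was built.

```python
import pandas as pd

# Hypothetical toy data: one user, two authors, two books each.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'author': ['A', 'A', 'B', 'B'],
    'book_tag': ['a1', 'a2', 'b1', 'b2'],
    'rating': [5, 3, 2, 4],
    'date read': pd.to_datetime(
        ['2020-01-01', '2020-06-01', '2020-03-01', '2021-01-01']
    ),
})

# Assumption: the user's average is the plain mean of their ratings (3.5 here).
toy['average_user_rating'] = toy.groupby('user_id')['rating'].transform('mean')

# A book counts as "liked" when its rating beats the user's own average.
toy['liked_book'] = toy['rating'] > toy['average_user_rating']

# Earliest read per (user, author) pair.
first = toy.sort_values('date read').drop_duplicates(['user_id', 'author'])
print(first[['author', 'book_tag', 'rating', 'liked_book']])
```

In this toy table the first book by author A (rated 5) counts as liked and the first book by author B (rated 2) does not.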

In [11]:

# .copy() avoids SettingWithCopyWarning when new columns are added below
reads = reads.dropna(subset=['date read']).copy()

# adding 'liked book' variable
reads['liked_book']  = reads['rating'] > reads['average_user_rating']

# getting first book read by the author
reads['first_book_by_author'] = reads.sort_values('date read').groupby(['user_id', 'author'])['book_tag'].transform('first')

# getting number of books read by author
reads['num_books_read_by_author'] = reads.sort_values('date read').groupby(['user_id', 'author'])['book_tag'].transform('nunique')

# repeatedly read author
reads['repeat_author'] = reads['num_books_read_by_author'] > 1

# creating a df of each user's first read per author
# excluding cases where they did not rate the first book
first_books_rated = reads[(reads['book_tag'] == reads['first_book_by_author']) & (reads['rating'] > 0)].drop_duplicates(['user_id', 'author']).copy()

In [12]:
# number of liked first books in each group
liked_book_and_read_more = first_books_rated.loc[first_books_rated['repeat_author'], 'liked_book'].sum()
liked_book_didnt_read_more = first_books_rated.loc[(first_books_rated['num_books_read_by_author'] <= 1), 'liked_book'].sum()

reads['num_books_read_by_author'].describe()

# sample size for each group
n_read_more = first_books_rated.loc[(first_books_rated['num_books_read_by_author'] > 1), 'book_tag'].count()
n_didnt_read_more = first_books_rated.loc[(first_books_rated['num_books_read_by_author'] <= 1), 'book_tag'].count()

p_liked_read_more = liked_book_and_read_more / n_read_more

p_liked_didnt_read_more = liked_book_didnt_read_more / n_didnt_read_more

prob_liked = (liked_book_and_read_more + liked_book_didnt_read_more) / (n_didnt_read_more + n_read_more)

print(f'Given that a user read a second book by the author, there is a {round(p_liked_read_more * 100, 2)}% chance they liked the first book.')
print(f'Given that a user did not read a second book by the author, there is a {round(p_liked_didnt_read_more * 100, 2)}% chance they liked the first book.')
print(f'The overall chance that a user liked the first book they read by an author is {round(prob_liked * 100, 2)}%.')
print(f'The number of times a user rated a book then read at least one more book by the same author was: {n_read_more}')
print(f'The number of times a user rated a book then did not read a second book by the author was: {n_didnt_read_more}')

Given that a user read a second book by the author, there is a 63.61% chance they liked the first book.
Given that a user did not read a second book by the author, there is a 49.19% chance they liked the first book.
The overall chance that a user liked the first book they read by an author is 52.31%.
The number of times a user rated a book then read at least one more book by the same author was: 82135
The number of times a user rated a book then did not read a second book by the author was: 297506
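
As a quick sanity check, the printed figures can be combined by hand. The sketch below copies the numbers from the output above (the variable names are chosen here for illustration), confirms that the two conditional rates reproduce the overall rate through the law of total probability, and then applies Bayes' rule to get the reverse conditional: the chance a user reads another book by the author given whether they liked the first one.

```python
# Figures copied from the printed output above.
n_read_more = 82135
n_didnt_read_more = 297506
p_liked_given_more = 0.6361   # P(liked first book | read another)
p_liked_given_not = 0.4919    # P(liked first book | did not read another)

# Share of (user, author) pairs with a repeat read.
p_more = n_read_more / (n_read_more + n_didnt_read_more)

# Law of total probability: should land close to the reported 52.31%.
p_liked = p_liked_given_more * p_more + p_liked_given_not * (1 - p_more)

# Bayes' rule: probability of a repeat read, given liked / not liked.
p_more_given_liked = p_liked_given_more * p_more / p_liked
p_more_given_not_liked = (1 - p_liked_given_more) * p_more / (1 - p_liked)

print(f'P(liked) reconstructed: {p_liked:.4f}')
print(f'P(read more | liked): {p_more_given_liked:.4f}')
print(f'P(read more | not liked): {p_more_given_not_liked:.4f}')
```

The reconstructed overall rate matches the reported 52.31%, and users who liked the first book come out more likely to return to the author (roughly 26% versus 17%).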


Users who went on to read a second book by the author liked the first book more often (63.61%) than users who did not (49.19%). To test whether this difference is meaningful, we can use a generalized linear mixed model.

Note: a standard parametric test for a difference in proportions is not appropriate here. It assumes independent observations, and the same users and books appear repeatedly throughout the dataset.
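
One way to see how much the repeated users matter, without fitting the mixed model itself, is a cluster bootstrap that resamples whole users rather than individual rows. The sketch below is not the generalized linear mixed model named above; it is a simpler stand-in, run on synthetic data. Every name in it (`per_user_counts`, `diff_in_proportions`) and every simulated effect size is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for first_books_rated: one row per (user, author) pair.
n_users, reads_per_user = 300, 8
user = np.repeat(np.arange(n_users), reads_per_user)
repeat = rng.random(user.size) < 0.25                 # read another book by the author
user_bias = rng.normal(0.0, 0.1, n_users)[user]       # per-user tendency to "like"
liked = rng.random(user.size) < np.clip(0.5 + 0.15 * repeat + user_bias, 0.0, 1.0)

def per_user_counts(user, repeat, liked, n_users):
    """Per-user totals: [n_repeat, liked_repeat, n_single, liked_single]."""
    columns = [repeat, repeat & liked, ~repeat, ~repeat & liked]
    return np.column_stack(
        [np.bincount(user, weights=c.astype(float), minlength=n_users) for c in columns]
    )

def diff_in_proportions(counts):
    """Pooled P(liked | repeat) - P(liked | no repeat) over the given users."""
    n_rep, liked_rep, n_single, liked_single = counts.sum(axis=0)
    return liked_rep / n_rep - liked_single / n_single

counts = per_user_counts(user, repeat, liked, n_users)
observed = diff_in_proportions(counts)

# Resample users (clusters) with replacement, keeping each user's rows together.
boot = np.array([
    diff_in_proportions(counts[rng.integers(0, n_users, size=n_users)])
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f'difference: {observed:.3f}, 95% cluster-bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]')
```

Because each resample keeps a user's rows together, the interval reflects within-user correlation that a row-level test would ignore.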

In [13]:
# converting booleans to 0/1
first_books_rated['liked_book'] = first_books_rated['liked_book'].astype(int)
first_books_rated['repeat_author'] = first_books_rated['repeat_author'].astype(int)

# converting user_id and book_tag to integer category codes
first_books_rated['user_id'] = pd.Categorical(first_books_rated['user_id']).codes
first_books_rated['book_tag'] = pd.Categorical(first_books_rated['book_tag']).codes

In [None]:
from scipy.stats import chi2

# Per-user (stratum) proportion liked and row count, split by repeat_author (0/1)
proportions = first_books_rated.groupby(['user_id', 'repeat_author'])['liked_book'].mean().unstack()
counts = first_books_rated.groupby(['user_id', 'repeat_author'])['liked_book'].count().unstack()

# Weighted difference of proportions
prop_diff = (proportions[1] - proportions[0])
weights = counts.sum(axis=1) / counts.sum().sum()  # Weight by stratum size
weighted_diff = np.sum(prop_diff * weights)

# Compute standard error
se = np.sqrt(np.sum(weights**2 * (proportions[1] * (1 - proportions[1]) / counts[1] +
                                  proportions[0] * (1 - proportions[0]) / counts[0])))

# Compute z-score and two-sided p-value
# z**2 is chi-square with 1 df, so its upper tail is already two-sided (no factor of 2)
z_score = weighted_diff / se
p_value = chi2.sf(z_score**2, df=1)

print(f"Weighted Proportion Difference: {weighted_diff:.4f}, p-value: {p_value:.4f}")

Weighted Proportion Difference: 0.1577, p-value: 0.0000
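
Read off the code, the statistic in the cell above can be written as follows. Here $u$ indexes users, $\hat p_{u1}$ and $\hat p_{u0}$ are the per-user shares of liked first books with and without a repeat read, $n_{u1}$ and $n_{u0}$ are the matching row counts, and $N$ is the total number of rows. This notation is introduced here for illustration and does not appear in the original.

$$
w_u = \frac{n_{u0} + n_{u1}}{N}, \qquad
\hat d = \sum_u w_u \left( \hat p_{u1} - \hat p_{u0} \right)
$$

$$
\widehat{\mathrm{SE}} = \sqrt{ \sum_u w_u^2 \left( \frac{\hat p_{u1}(1 - \hat p_{u1})}{n_{u1}} + \frac{\hat p_{u0}(1 - \hat p_{u0})}{n_{u0}} \right) }, \qquad
z = \frac{\hat d}{\widehat{\mathrm{SE}}}
$$

The sums effectively run over users observed in both groups, because pandas skips the NaN terms produced by users missing one group. The weights are still divided by the full $N$, so they do not sum to one over the users that remain. Under the null hypothesis $z^2$ follows a chi-square distribution with one degree of freedom, so the two-sided p-value is $P(\chi^2_1 \ge z^2) = 2\,(1 - \Phi(|z|))$.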
