# Modeling

In [1]:
# Import libraries
# ----------------

# Pandas
import pandas as pd

# Numpy
import numpy as np

# SKLearn
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import pairwise_distances

In [2]:
# Load Datasets
# ----------------

# Ratings
ratings_data = "../data/raw/ratings.csv"
ratings = pd.read_csv(ratings_data)

# Books
books_data = "../data/raw/books.csv"
books = pd.read_csv(books_data)
books = books[['book_id', 'title']]

# Merge ratings and books datasets
df = pd.merge(books, ratings, on='book_id')
df.head()

|   | book_id | title | user_id | rating |
|---|---------|-------|---------|--------|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | 2886 | 5 |
| 1 | 1 | The Hunger Games (The Hunger Games, #1) | 6158 | 5 |
| 2 | 1 | The Hunger Games (The Hunger Games, #1) | 3991 | 4 |
| 3 | 1 | The Hunger Games (The Hunger Games, #1) | 5281 | 5 |
| 4 | 1 | The Hunger Games (The Hunger Games, #1) | 5721 | 5 |


## Build a simple model

A simple model recommends the same books to every user, regardless of reading or rating history. The recommendations are usually based on the most popular items, the highest-rated items, or a score that combines the two. For my simple model, I will give each book a weighted score so that books with only a few ratings but a high average rating do not skew the results.

To create the model and the weighted score, I will adapt DataCamp's _[IMDB 250 tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python)_, which gives each item a weighted score using the following formula:

<img src="figures/image1.png" width=400 align="center">

* _v_ is the number of ratings a book has received;
* _m_ is the minimum number of ratings a book needs to receive to be included in the model;
* _R_ is the average rating for the book; and
* _C_ is the mean of all ratings in the dataset.
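
In case the image does not render, the formula can be written out as below. This is transcribed from the `weighted_score` function used in this notebook; the name $W$ for the resulting score is my own label, not something taken from the tutorial.

```latex
W = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C
```

The two weights always sum to 1. When _v_ is much larger than _m_, the score stays close to the book's own average _R_; when _v_ is small relative to _m_, the score is pulled toward the overall mean _C_.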

In [5]:
# Create a column for the number of ratings a book has received called 'ratings_count'
df['ratings_count'] = df['book_id'].groupby(df['book_id']).transform('count')

# Create a column for the average rating a book has received called 'average_rating'
df['average_rating'] = df['rating'].groupby(df['book_id']).transform('mean')

# Calculate the average rating for all books
C = df['rating'].mean()

# Calculate the minimum number of ratings a book needs to receive in order to be included in the model
m = df['ratings_count'].quantile(0.90)

df.head()

|   | book_id | title | user_id | rating | ratings_count | average_rating |
|---|---------|-------|---------|--------|---------------|----------------|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | 2886 | 5 | 22806 | 4.279707 |
| 1 | 1 | The Hunger Games (The Hunger Games, #1) | 6158 | 5 | 22806 | 4.279707 |
| 2 | 1 | The Hunger Games (The Hunger Games, #1) | 3991 | 4 | 22806 | 4.279707 |
| 3 | 1 | The Hunger Games (The Hunger Games, #1) | 5281 | 5 | 22806 | 4.279707 |
| 4 | 1 | The Hunger Games (The Hunger Games, #1) | 5721 | 5 | 22806 | 4.279707 |


In [8]:
# Filter the dataset based on value m
df2 = df.loc[df['ratings_count'] >= m].copy()

# Create a function that gives a weighted score to each book
def weighted_score(x, m=m, C=C):
    v = x['ratings_count']
    R = x['average_rating']
    # Calculation based on an IMDB formula 
    return (v/(v+m) * R) + (m/(m+v) * C)

# Create a 'score' column and give each book a weighted score
df2['score'] = df2.apply(weighted_score, axis=1)
df2.head()

|   | book_id | title | user_id | rating | ratings_count | average_rating | score |
|---|---------|-------|---------|--------|---------------|----------------|-------|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | 2886 | 5 | 22806 | 4.279707 | 4.170325 |
| 1 | 1 | The Hunger Games (The Hunger Games, #1) | 6158 | 5 | 22806 | 4.279707 | 4.170325 |
| 2 | 1 | The Hunger Games (The Hunger Games, #1) | 3991 | 4 | 22806 | 4.279707 | 4.170325 |
| 3 | 1 | The Hunger Games (The Hunger Games, #1) | 5281 | 5 | 22806 | 4.279707 | 4.170325 |
| 4 | 1 | The Hunger Games (The Hunger Games, #1) | 5721 | 5 | 22806 | 4.279707 | 4.170325 |
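
Because the score depends only on `ratings_count` and `average_rating`, the same formula can also be applied to whole columns at once instead of row by row with `apply`. The sketch below shows this on a small made-up table; `books_toy`, `m_toy`, and `C_toy` are hypothetical names and values chosen for illustration, not results from the Goodreads data.

```python
import pandas as pd

# Toy per-book aggregates (values are made up for illustration)
books_toy = pd.DataFrame({
    "book_id": [1, 2, 3],
    "ratings_count": [1000, 50, 10],
    "average_rating": [4.2, 4.9, 5.0],
})

m_toy = 100  # hypothetical minimum-ratings threshold
C_toy = 3.9  # hypothetical mean of all ratings

v = books_toy["ratings_count"]
R = books_toy["average_rating"]

# Same formula as weighted_score, evaluated on whole columns at once
books_toy["score"] = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

print(books_toy.sort_values("score", ascending=False))
```

In this toy table, the book with a perfect 5.0 average but only 10 ratings ends up with the lowest score, which is the behavior the weighting is meant to produce.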


In [9]:
# Sort books based on score
df2 = df2.sort_values('score', ascending=False)

# Drop duplicate books
df2 = df2.drop_duplicates(subset='book_id', keep="first")

# Display the top 10 books
df2[['title', 'ratings_count', 'average_rating', 'score']].head(10)

|   | title | ratings_count | average_rating | score |
|---|-------|---------------|----------------|-------|
| 379644 | Harry Potter and the Deathly Hallows (Harry Po... | 15304 | 4.525941 | 4.287004 |
| 402878 | Harry Potter and the Half-Blood Prince (Harry ... | 15081 | 4.443339 | 4.235129 |
| 354945 | Harry Potter and the Goblet of Fire (Harry Pot... | 15523 | 4.43078 | 4.23109 |
| 262333 | Harry Potter and the Prisoner of Azkaban (Harr... | 15855 | 4.418732 | 4.226258 |
| 25217 | Harry Potter and the Sorcerer's Stone (Harry P... | 21850 | 4.35135 | 4.216248 |
| 63196 | To Kill a Mockingbird | 19088 | 4.329369 | 4.188958 |
| 310931 | Harry Potter and the Order of the Phoenix (Har... | 15258 | 4.358697 | 4.185378 |
| 451504 | The Help | 12727 | 4.382887 | 4.179612 |
| 20785 | The Hunger Games (The Hunger Games, #1) | 22806 | 4.279707 | 4.170325 |
| 535323 | A Game of Thrones (A Song of Ice and Fire, #1) | 10692 | 4.33988 | 4.137317 |


If Goodreads used this model, it would recommend these 10 books to every user. While this is interesting, I don't necessarily want to recommend Harry Potter to every single user. In my next model, I will use each reader's reading and rating history to create recommendations specific to that user.

## Building a collaborative model from scratch

Collaborative models make recommendations by comparing users to other users and books to other books. This is done by measuring the distance between every pair of users and between every pair of items, so that similar users and similar items can be grouped together. Because the model computes a distance for every pair, the cost grows quickly with the size of the dataset, so it works best on smaller datasets. I will therefore filter the dataset to include only users who have rated at least 175 books.
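
To make the idea of "distance between users" concrete, here is a minimal sketch on a made-up 3-user, 4-book matrix. The matrix `toy` and its values are hypothetical. Note that `pairwise_distances` with `metric="cosine"` returns cosine distance, so the sketch converts it to a similarity by subtracting it from 1.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Toy user-item matrix: rows are users, columns are books, 0 means "not rated"
toy = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
], dtype=float)

# pairwise_distances returns cosine distance; similarity is 1 - distance
user_sim = 1 - pairwise_distances(toy, metric="cosine")    # shape (3, 3)
item_sim = 1 - pairwise_distances(toy.T, metric="cosine")  # shape (4, 4)

print(np.round(user_sim, 3))
```

Users 0 and 1 rated the same books similarly, so their similarity is close to 1, while users 0 and 2 share no rated books and get a similarity of 0.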

In [10]:
# Filter data to only include users who have reviewed at least 175 books
# (work on a copy so the helper column is not added to df as a side effect)
ratings = df.copy()
ratings['number_of_reviews'] = ratings['user_id'].groupby(ratings['user_id']).transform('count')
ratings = ratings[ratings['number_of_reviews'] >= 175]
ratings = ratings.drop(columns='number_of_reviews')

# Create a new column for user_index and book_index
ratings = ratings.assign(user_index=(ratings['user_id']).astype('category').cat.codes)
ratings = ratings.assign(book_index=(ratings['book_id']).astype('category').cat.codes)

n_users = ratings.user_id.unique().shape[0]
n_items = ratings.book_id.unique().shape[0]
print('DataFrame shape: {}'.format(ratings.shape))
print('Number of unique users:', n_users)
print('Number of books:', n_items)
ratings.head()

DataFrame shape: (108497, 8)
Number of unique users: 596
Number of books: 7574


|   | book_id | title | user_id | rating | ratings_count | average_rating | user_index | book_index |
|---|---------|-------|---------|--------|---------------|----------------|------------|------------|
| 200 | 1 | The Hunger Games (The Hunger Games, #1) | 6323 | 4 | 22806 | 4.279707 | 60 | 0 |
| 459 | 1 | The Hunger Games (The Hunger Games, #1) | 10140 | 4 | 22806 | 4.279707 | 120 | 0 |
| 461 | 1 | The Hunger Games (The Hunger Games, #1) | 10146 | 5 | 22806 | 4.279707 | 121 | 0 |
| 616 | 1 | The Hunger Games (The Hunger Games, #1) | 10792 | 5 | 22806 | 4.279707 | 134 | 0 |
| 631 | 1 | The Hunger Games (The Hunger Games, #1) | 10120 | 3 | 22806 | 4.279707 | 119 | 0 |


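After removing users who have rated fewer than 175 books, the remaining user and book IDs are no longer consecutive, so I created a new user index and a new book index. These give each user a row position and each book a column position, which lets me place every rating into a matrix.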
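
As a small illustration of what `cat.codes` does, the sketch below re-indexes a made-up table. The IDs in `toy` are hypothetical. The codes are 0-based, consecutive, and assigned in sorted order of the original IDs, which is what makes them usable as matrix positions.

```python
import pandas as pd

# Hypothetical IDs left over after filtering (note the gaps between them)
toy = pd.DataFrame({
    "user_id": [6323, 10140, 6323, 10792],
    "book_id": [1, 1, 7, 42],
    "rating": [4, 4, 5, 3],
})

# cat.codes maps each distinct ID to a 0-based consecutive integer
toy["user_index"] = toy["user_id"].astype("category").cat.codes
toy["book_index"] = toy["book_id"].astype("category").cat.codes

print(toy)
```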

In [11]:
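# Create a user-item matrix (rows are users, columns are books, 0 means "not rated")
data_matrix = np.zeros((n_users, n_items))

# user_index and book_index are already 0-based, so they are used directly
for line in ratings.itertuples():
    data_matrix[line.user_index, line.book_index] = line.rating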
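
As an alternative to looping over rows, the matrix can also be filled in one step with NumPy integer-array indexing. The sketch below uses a made-up table with 0-based `user_index` and `book_index` columns; the names `toy` and `matrix` are hypothetical, and it assumes each (user, book) pair appears at most once.

```python
import numpy as np
import pandas as pd

# Toy ratings with 0-based row and column positions (values are made up)
toy = pd.DataFrame({
    "user_index": [0, 1, 0, 2],
    "book_index": [0, 0, 1, 2],
    "rating": [4, 4, 5, 3],
})

n_users_toy = toy["user_index"].nunique()
n_items_toy = toy["book_index"].nunique()

# Start from all zeros, then write every rating at its (user, book) position
matrix = np.zeros((n_users_toy, n_items_toy))
matrix[toy["user_index"].to_numpy(), toy["book_index"].to_numpy()] = toy["rating"].to_numpy()

print(matrix)
```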

In [12]:
# Calculate the similarity: pairwise_distances returns cosine distance, so subtract it from 1
user_similarity = 1 - pairwise_distances(data_matrix, metric='cosine')
item_similarity = 1 - pairwise_distances(data_matrix.T, metric='cosine')

In [13]:
# Define a function to make predictions
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # Use np.newaxis to turn the means into a column so they broadcast across each row of ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
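
Written out, the two branches of `predict` compute the following. The notation is my own shorthand for reading the code: $r_{u,i}$ is the entry of the ratings matrix for user $u$ and book $i$, $s$ is the similarity matrix passed in, and $\bar{r}_u$ is the mean of row $u$.

```latex
% type == 'user'
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{u'} s_{u,u'} \, (r_{u',i} - \bar{r}_{u'})}{\sum_{u'} |s_{u,u'}|}

% type == 'item'
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j} \, s_{j,i}}{\sum_{j} |s_{i,j}|}
```

One thing to keep in mind is that $\bar{r}_u$ comes from `ratings.mean(axis=1)`, so it averages over every book in the row, including the zeros that stand for unrated books.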

In [14]:
# Make Predictions
user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

## Next Steps

Now that I have created two different models, my next step will be to assess my second model's performance using the root mean squared error (RMSE). I will also build a Singular Value Decomposition (SVD) model to see whether it produces stronger recommendations.
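
As a rough sketch of what the RMSE step could look like, the helper below compares predictions with actual ratings. The function `rmse` is hypothetical and not part of this notebook yet, and scoring only the non-zero entries is an assumption based on zeros meaning "not rated" in the user-item matrix.

```python
import numpy as np

def rmse(prediction, ground_truth):
    """RMSE over the entries that have an actual rating (non-zero in ground_truth)."""
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return np.sqrt(np.mean(diff ** 2))

# Toy example: only the two non-zero entries of `truth` are scored
truth = np.array([[5.0, 0.0], [0.0, 3.0]])
pred = np.array([[4.0, 2.0], [1.0, 3.0]])

print(rmse(pred, truth))
```

In practice the comparison would be made against ratings that were held out from the matrix used to compute the similarities, for example with the `train_test_split` function imported at the top of the notebook.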