# Pre-processing and Model

This is the Jupyter Notebook version of this section of the project. It walks through every step of pre-processing the data and then manually computing similarity scores to rank books. The notebook is useful because it displays the dataframe at each step, up to the final recommendations. The main files for this project, however, are .py files written in Atom. I used object-oriented programming to make the code more reusable, and I layered in a Web App using Streamlit to make the project more interactive. Ultimately, this Notebook serves as a step-by-step guide to the modeling process.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
import random

pd.set_option('display.max_columns', None)

Load the saved data files from the previous Data_Wrangling_and_EDA Notebook.

In [2]:
ratings_file = '/Users/gregoryolson/Documents/Data Science CT/Capstone/Capstone_Books/Data/ratings_cleaned.csv'
books_file = '/Users/gregoryolson/Documents/Data Science CT/Capstone/Capstone_Books/Data/books_cleaned.csv'

ratings = pd.read_csv(ratings_file)
books = pd.read_csv(books_file)

Let's look at the two datasets again as a reminder of their structure.

In [3]:
ratings.head()

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4


In [4]:
books.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


We're ready to start setting up the model.

## Collaborative Filtering

I chose a collaborative filtering approach for my recommender, using [this](https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd) project by Bryan Tan as a guide to the steps involved in creating such a model. To begin, we need to create a simplified version of the books dataframe with only the essential attributes.

In [5]:
# make dataframe with only essential columns
books_cf = books[['id', 'title', 'authors', 'original_publication_year']].copy()
books_cf.head(10)

Unnamed: 0,id,title,authors,original_publication_year
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,2008.0
1,2,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",1997.0
2,3,"Twilight (Twilight, #1)",Stephenie Meyer,2005.0
3,4,To Kill a Mockingbird,Harper Lee,1960.0
4,5,The Great Gatsby,F. Scott Fitzgerald,1925.0
5,6,The Fault in Our Stars,John Green,2012.0
6,7,The Hobbit,J.R.R. Tolkien,1937.0
7,8,The Catcher in the Rye,J.D. Salinger,1951.0
8,9,"Angels & Demons (Robert Langdon, #1)",Dan Brown,2000.0
9,10,Pride and Prejudice,Jane Austen,1813.0


We can clean up this dataframe a little more to make it easier to work with.

In [6]:
# convert original_publication_year to int
books_cf['original_publication_year'] = books_cf['original_publication_year'].astype('Int64')

In [7]:
# rename column to something shorter
books_cf = books_cf.rename(columns={'original_publication_year': 'year'})

In [8]:
books_cf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9943 entries, 0 to 9942
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       9943 non-null   int64 
 1   title    9943 non-null   object
 2   authors  9943 non-null   object
 3   year     9943 non-null   Int64 
dtypes: Int64(1), int64(1), object(2)
memory usage: 320.6+ KB
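
The capital-I `'Int64'` used above is pandas' nullable integer dtype. The `info()` output shows no missing years in this dataframe, so the following is only a minimal sketch, on made-up values, of how the nullable dtype differs from NumPy's `'int64'` when a gap does exist.

```python
import numpy as np
import pandas as pd

# Toy column with one missing publication year (values are made up).
years = pd.Series([2008.0, 1997.0, np.nan])

# The nullable 'Int64' dtype stores integers and keeps the gap as <NA>.
nullable_years = years.astype('Int64')
print(nullable_years.dtype)          # Int64
print(nullable_years.isna().sum())   # 1

# NumPy's 'int64' cannot represent a missing value, so the cast raises.
try:
    years.astype('int64')
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)                   # True
```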


### Sample User Input

Now that the data is ready to be used, let's create a sample user input to stand in for the actual user input collected by the Web App.

In [9]:
# create sample dataframe with 5 book-rating pairs

user_input = [
    {'title': 'The Hobbit', 'authors': 'J.R.R. Tolkien', 'rating': 2},
    {'title': 'The Catcher in the Rye', 'authors': 'J.D. Salinger', 'rating': 3},
    {'title': 'The Great Gatsby', 'authors': 'F. Scott Fitzgerald', 'rating': 4},
    {'title': 'To Kill a Mockingbird', 'authors': 'Harper Lee', 'rating': 5},
    {'title': 'Pride and Prejudice', 'authors': 'Jane Austen', 'rating': 5}
]

input_books = pd.DataFrame(user_input)
input_books

Unnamed: 0,title,authors,rating
0,The Hobbit,J.R.R. Tolkien,2
1,The Catcher in the Rye,J.D. Salinger,3
2,The Great Gatsby,F. Scott Fitzgerald,4
3,To Kill a Mockingbird,Harper Lee,5
4,Pride and Prejudice,Jane Austen,5


This is a good start, but now let's manipulate the data so that it is ready for calculating correlations.

In [10]:
# filter the books by title, merge df's and drop year column
input_id = books_cf[books_cf['title'].isin(input_books['title'].tolist())]
input_books = pd.merge(input_id, input_books)
input_books = input_books.drop('year', axis=1)
input_books

Unnamed: 0,id,title,authors,rating
0,4,To Kill a Mockingbird,Harper Lee,5
1,5,The Great Gatsby,F. Scott Fitzgerald,4
2,7,The Hobbit,J.R.R. Tolkien,2
3,8,The Catcher in the Rye,J.D. Salinger,3
4,10,Pride and Prejudice,Jane Austen,5
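
Called without `on=`, `pd.merge` joins on every column the two frames share, which here means `title` and `authors`. The sketch below, on made-up toy frames, spells those keys out and adds one possible check for input titles that find no match in the catalogue. The frame names and the unmatched title are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Toy stand-ins for books_cf and the user's input (contents are illustrative).
catalog = pd.DataFrame({
    'id': [4, 5, 7],
    'title': ['To Kill a Mockingbird', 'The Great Gatsby', 'The Hobbit'],
    'authors': ['Harper Lee', 'F. Scott Fitzgerald', 'J.R.R. Tolkien'],
})
user_ratings = pd.DataFrame({
    'title': ['The Hobbit', 'To Kill a Mockingbird', 'Not In Catalog'],
    'authors': ['J.R.R. Tolkien', 'Harper Lee', 'Nobody'],
    'rating': [2, 5, 3],
})

# Explicit keys: the same join the default performs on all shared columns.
matched = pd.merge(catalog, user_ratings, on=['title', 'authors'], how='inner')

# A left join from the user's side with indicator=True exposes unmatched titles.
check = pd.merge(user_ratings, catalog, on=['title', 'authors'],
                 how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'title'].tolist()
print(unmatched)
```

An inner join silently drops any input book whose title or author string differs from the catalogue, so a check like this can be useful before computing similarities.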


In [11]:
# keep only the ratings of users who have read at least one of the input books
user_subset = ratings[ratings['book_id'].isin(input_books['id'].tolist())]
user_subset.head()

Unnamed: 0,book_id,user_id,rating
300,4,439,5
301,4,1169,5
302,4,2324,5
303,4,2487,5
304,4,3739,5


In [12]:
# group by user_id to create the user subset group
# (a string key, rather than a one-element list, keeps each group name a plain user_id)
user_subset_group = user_subset.groupby('user_id')

In [13]:
# sort so that users with the most books in common with the input have priority
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)
user_subset_group[0:5]

[(3922,
       book_id  user_id  rating
  305        4     3922       4
  409        5     3922       2
  605        7     3922       5
  706        8     3922       1
  901       10     3922       5),
 (6630,
       book_id  user_id  rating
  310        4     6630       4
  413        5     6630       3
  612        7     6630       5
  710        8     6630       4
  907       10     6630       3),
 (11868,
       book_id  user_id  rating
  318        4    11868       3
  424        5    11868       3
  630        7    11868       1
  718        8    11868       2
  920       10    11868       5),
 (12381,
       book_id  user_id  rating
  320        4    12381       5
  427        5    12381       4
  632        7    12381       5
  720        8    12381       3
  922       10    12381       5),
 (12874,
       book_id  user_id  rating
  321        4    12874       5
  429        5    12874       4
  634        7    12874       4
  721        8    12874       4
  925       10    128

In [14]:
# limit number of users we look through to 100
user_subset_group = user_subset_group[0:100]
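
As an alternative sketch of the same idea (not the approach used in this notebook), the users with the largest overlap can also be picked with `groupby(...).size()`, which avoids materialising every group in a Python list. The toy ratings below are made up.

```python
import pandas as pd

# Toy ratings restricted to the input books (values are made up).
toy = pd.DataFrame({
    'book_id': [4, 5, 7, 4, 5, 4],
    'user_id': [10, 10, 10, 20, 20, 30],
    'rating':  [5, 4, 2, 3, 3, 4],
})

# Count how many of the input books each user rated, largest first.
overlap = toy.groupby('user_id').size().sort_values(ascending=False)

# Keep the ids of the 2 users with the most books in common (100 in the notebook).
top_ids = overlap.head(2).index.tolist()

# Restrict the ratings to those users.
top_subset = toy[toy['user_id'].isin(top_ids)]
print(top_ids)
```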

### Calculating Pearson Correlations

We will calculate the Pearson correlation between the input user and each user in the subset manually.
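
For two rating lists $x$ (the input user) and $y$ (another user) over the same $n$ shared books, the loop in the next cell uses the sums-of-squares form of the Pearson correlation:

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

When $S_{xx}$ or $S_{yy}$ is zero (one side gave every shared book the same rating), $r$ is undefined, and the code sets the similarity to 0 instead.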

In [15]:
# create dict of pearson correlation values
pearson_corr_dict = {}

# sort the input books once so their ratings line up with each group's book_id order
input_books = input_books.sort_values(by='id')

# for loop that calculates Pearson correlation and stores values in above dict
for name, group in user_subset_group:

    group = group.sort_values(by='book_id')
    num_ratings = len(group)

    
    # get the review scores for the books in common
    temp_df = input_books[input_books['id'].isin(group['book_id'].tolist())]
    
    # store both ratings in list for calculations
    rating_list = temp_df['rating'].tolist()
    group_list = group['rating'].tolist()

    # calculate the pearson correlation between users 
    Sxx = sum([i**2 for i in rating_list]) - (sum(rating_list)**2 / float(num_ratings))
    Syy = sum([i**2 for i in group_list]) - (sum(group_list)**2 / float(num_ratings))
    Sxy = sum([i*j for i, j in zip(rating_list, group_list)]) \
        - (sum(rating_list) * sum(group_list) / num_ratings)
    
    # calculate Pearson corr if Sxx and Syy not 0, else set = 0
    if Sxx != 0 and Syy != 0:
        pearson_corr_dict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearson_corr_dict[name] = 0
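
As a sanity check on the manual formula, the sketch below wraps the same arithmetic in a small helper (the name `manual_pearson` is an illustrative assumption) and compares it with `np.corrcoef`. It uses the sample input ratings for book ids 4, 5, 7, 8, 10 and user 3922's ratings for the same books, as shown in the earlier outputs.

```python
from math import sqrt

import numpy as np


def manual_pearson(x, y):
    """Pearson r via the sums-of-squares form; returns 0 if either side is constant."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx != 0 and syy != 0:
        return sxy / sqrt(sxx * syy)
    return 0


# Ratings ordered by book id (4, 5, 7, 8, 10).
input_ratings = [5, 4, 2, 3, 5]   # the sample user input
user_3922 = [4, 2, 5, 1, 5]       # user 3922, from the grouped output

r_manual = manual_pearson(input_ratings, user_3922)
r_numpy = np.corrcoef(input_ratings, user_3922)[0, 1]
print(round(r_manual, 5), round(float(r_numpy), 5))
```

Both values round to 0.14777, the similarity the notebook reports for user 3922.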

In [16]:
# convert dictionary to dataframe
pearson_df = pd.DataFrame.from_dict(pearson_corr_dict, orient='index')
pearson_df.columns = ['similarity_index']
pearson_df['user_id'] = pearson_df.index
pearson_df.index = range(len(pearson_df))
pearson_df.head()

Unnamed: 0,similarity_index,user_id
0,0.14777,3922
1,-0.733359,6630
2,0.879049,11868
3,0.300123,12381
4,0.514496,12874


In [17]:
# get top 50 similar users
top_users = pearson_df.sort_values(by='similarity_index', ascending=False)[0:50]
top_users.head()

Unnamed: 0,similarity_index,user_id
50,1.0,33065
87,1.0,439
91,1.0,2900
90,1.0,2487
89,1.0,1952
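
When reading similarity scores of exactly 1.0, it may help to remember that a Pearson correlation over only two shared books can only be +1, -1, or undefined. This notebook does not check whether that applies to the users above. The sketch below, on made-up rating pairs, only illustrates the property.

```python
import numpy as np

# Each tuple holds (input ratings, other user's ratings) for two shared books.
pairs = [
    ([5, 3], [4, 1]),   # both users prefer the first book
    ([5, 3], [2, 4]),   # opposite preferences
    ([2, 5], [1, 3]),   # both users prefer the second book
]

# With two points, any non-constant pair lies exactly on a line, so |r| is 1.
two_point_r = [float(np.corrcoef(x, y)[0, 1]) for x, y in pairs]
print([round(r, 6) for r in two_point_r])
```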


In [18]:
# merge top_users df with ratings 
top_users_rating = top_users.merge(ratings, left_on='user_id', right_on='user_id', how='inner')
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,book_id,rating
0,1.0,33065,1,4
1,1.0,33065,2,5
2,1.0,33065,3,1
3,1.0,33065,4,5
4,1.0,33065,5,4


In [19]:
# multiply the user similarity by the user ratings
top_users_rating['weighted_rating'] = top_users_rating['similarity_index'] * top_users_rating['rating']
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,book_id,rating,weighted_rating
0,1.0,33065,1,4,4.0
1,1.0,33065,2,5,5.0
2,1.0,33065,3,1,1.0
3,1.0,33065,4,5,5.0
4,1.0,33065,5,4,4.0


In [20]:
# group by book_id and sum the similarity and weighted-rating columns
temp_top_users_rating = top_users_rating.groupby('book_id')[['similarity_index', 'weighted_rating']].sum()
temp_top_users_rating.columns = ['sum_similarity_index','sum_weighted_rating']
temp_top_users_rating.head()

book_id,sum_similarity_index,sum_weighted_rating
1,15.666497,68.243037
2,21.437768,93.621073
3,14.718443,43.777174
4,23.912154,113.433647
5,26.020819,102.288831


### Recommendations

Finally, we can divide the summed weighted rating by the summed similarity index to get a weighted average recommendation score for each book, which we then use to rank the recommendations.
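
To make the arithmetic concrete, here is a minimal worked sketch on made-up numbers: three similar users, two books, and the same column names as the notebook. Each book's score is the sum of `similarity_index * rating` divided by the sum of `similarity_index` over the users who rated it.

```python
import pandas as pd

# Toy data (made up): three similar users rating two books.
toy = pd.DataFrame({
    'user_id':          [1, 2, 3, 1, 2],
    'book_id':          [100, 100, 100, 200, 200],
    'similarity_index': [1.0, 0.5, 0.25, 1.0, 0.5],
    'rating':           [5, 3, 1, 2, 4],
})
toy['weighted_rating'] = toy['similarity_index'] * toy['rating']

# Sum similarity and weighted rating per book, then divide.
sums = toy.groupby('book_id')[['similarity_index', 'weighted_rating']].sum()
score = sums['weighted_rating'] / sums['similarity_index']
print(score)
```

For book 100 this gives (5 + 1.5 + 0.25) / 1.75, about 3.86, and for book 200 it gives (2 + 2) / 1.5, about 2.67. A book rated by only one similar user scores exactly that user's rating, which is one way a ranked list can fill up with scores of 5.0.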

In [21]:
# create an empty dataframe
recommendation_df = pd.DataFrame()

# find the weighted average
recommendation_df['weighted average recommendation score'] = temp_top_users_rating['sum_weighted_rating'] / temp_top_users_rating['sum_similarity_index']
recommendation_df['book_id'] = temp_top_users_rating.index
recommendation_df.head()

book_id,weighted average recommendation score,book_id
1,4.355986,1
2,4.367109,2
3,2.974307,3
4,4.743765,4
5,3.931038,5


In [22]:
# sort by recommendation score, highest first
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

book_id,weighted average recommendation score,book_id
1024,5.0,1024
906,5.0,906
1519,5.0,1519
863,5.0,863
958,5.0,958
283,5.0,283
989,5.0,989
2025,5.0,2025
3296,5.0,3296
755,5.0,755


In [23]:
# return rows from books_cf with above book_id's to make final recommendation
recommendation = books_cf.loc[books_cf['id'].isin(recommendation_df.head(10)['book_id'].tolist())]
recommendation = recommendation.drop('id', axis=1)
print(recommendation.to_string(index=False))

                                                                    title                       authors  year
      Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch  Terry Pratchett, Neil Gaiman  1990
                                          Mort (Death, #1; Discworld, #4)               Terry Pratchett  1987
                                                Guess How Much I Love You    Sam McBratney, Anita Jeram  1988
                                             Major Pettigrew's Last Stand                Helen Simonson  2010
 The Complete Anne of Green Gables Boxed Set (Anne of Green Gables, #1-8)               L.M. Montgomery  1908
                                                    The Day of the Jackal             Frederick Forsyth  1971
              The Sweetness at the Bottom of the Pie (Flavia de Luce, #1)                  Alan Bradley  2009
                   The Diamond Age: or, A Young Lady's Illustrated Primer               Neal Stephenson  1995
          

This gives us our final list of book recommendations, drawn from the 10 highest-scoring books.
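
One detail of the final lookup: filtering `books_cf` with `isin` returns rows in catalogue order rather than score order. It also silently drops any `book_id` that is missing from `books_cf`. The sketch below, on made-up toy frames, shows a left merge from the ranked frame as one possible way to keep the ranking. It is an alternative under those assumptions, not the notebook's method.

```python
import pandas as pd

# Toy stand-ins (made up) for recommendation_df and books_cf.
ranked = pd.DataFrame({'book_id': [30, 10, 20], 'score': [5.0, 4.8, 4.5]})
catalog = pd.DataFrame({
    'id': [10, 20, 30],
    'title': ['Book A', 'Book B', 'Book C'],
})

# isin keeps the catalogue's row order, not the score order.
by_isin = catalog[catalog['id'].isin(ranked['book_id'])]

# A left merge from the ranked frame keeps the score order.
by_merge = ranked.merge(catalog, left_on='book_id', right_on='id', how='left')

print(by_isin['title'].tolist())
print(by_merge['title'].tolist())
```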