# Pre-processing and Model

This is the Jupyter Notebook version of this section of the project. It walks through every pre-processing step and then manually manipulates the data to compute the similarity scores used to rank books. The Notebook is useful because it displays the dataframe at each step, up to the final recommendations. The main files for this project, however, are .py files written in Atom. There I used object-oriented programming to make the code more reusable, and I layered in a web app built with Streamlit to make the project more interactive. Ultimately, this Notebook serves as a step-by-step guide to the modeling process.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
import random

pd.set_option('display.max_columns', None)

Load the saved data files from the previous Data_Wrangling_and_EDA Notebook.

In [2]:
ratings_file = '/Users/gregoryolson/Documents/Data Science CT/Capstone/Capstone_Books/Data/ratings_cleaned.csv'
books_file = '/Users/gregoryolson/Documents/Data Science CT/Capstone/Capstone_Books/Data/books_cleaned.csv'

ratings = pd.read_csv(ratings_file)
books = pd.read_csv(books_file)
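The absolute paths above only resolve on my machine. A minimal sketch of building the same file names relative to a data folder, assuming (hypothetically) that the notebook is run from a project directory containing a `Data` subfolder:

```python
from pathlib import Path

# Assumption: DATA_DIR is a hypothetical location; point it at wherever
# the cleaned CSV files from the previous notebook were saved.
DATA_DIR = Path('Data')

ratings_file = DATA_DIR / 'ratings_cleaned.csv'
books_file = DATA_DIR / 'books_cleaned.csv'

# pd.read_csv accepts Path objects, so these can be passed to it directly
print(ratings_file.name, books_file.name)
```
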

Let's look at the two datasets again as a reminder of their structure.

In [3]:
ratings.head()

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4


In [4]:
books.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,goodreads_book_id,genre1,genre2,genre3
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,2767052,young-adult,fiction,fantasy
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,3,fantasy,young-adult,fiction
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,41865,young-adult,fantasy,fiction
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,2657,classics,historical-fiction,young-adult
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,4671,classics,fiction,historical-fiction


We're ready to start setting up the model.

## Collaborative Filtering

I chose a collaborative filtering approach for my recommender. I used [this](https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd) project by Bryan Tan as a guide to the steps involved in creating such a model. To begin, we need to create a simplified version of the books dataframe with only the essential attributes.

In [5]:
# make dataframe with only essential columns
books_cf = books[['id', 'title', 'authors', 'original_publication_year', 'genre1', 'genre2', 'genre3']].copy()
books_cf.head(10)

Unnamed: 0,id,title,authors,original_publication_year,genre1,genre2,genre3
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,2008.0,young-adult,fiction,fantasy
1,2,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",1997.0,fantasy,young-adult,fiction
2,3,"Twilight (Twilight, #1)",Stephenie Meyer,2005.0,young-adult,fantasy,fiction
3,4,To Kill a Mockingbird,Harper Lee,1960.0,classics,historical-fiction,young-adult
4,5,The Great Gatsby,F. Scott Fitzgerald,1925.0,classics,fiction,historical-fiction
5,6,The Fault in Our Stars,John Green,2012.0,young-adult,fiction,romance
6,7,The Hobbit,J.R.R. Tolkien,1937.0,fantasy,classics,fiction
7,8,The Catcher in the Rye,J.D. Salinger,1951.0,classics,fiction,young-adult
8,9,"Angels & Demons (Robert Langdon, #1)",Dan Brown,2000.0,fiction,mystery,thriller
9,10,Pride and Prejudice,Jane Austen,1813.0,classics,fiction,romance


We can clean up this dataframe a little more to make it easier to work with.

In [6]:
# convert original_publication_year to int
books_cf['original_publication_year'] = books_cf['original_publication_year'].astype('Int64')

In [7]:
# rename column to something shorter
books_cf = books_cf.rename(columns={'original_publication_year': 'year'})

In [8]:
books_cf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9942 entries, 0 to 9941
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       9942 non-null   int64 
 1   title    9942 non-null   object
 2   authors  9942 non-null   object
 3   year     9942 non-null   Int64 
 4   genre1   9942 non-null   object
 5   genre2   9942 non-null   object
 6   genre3   9942 non-null   object
dtypes: Int64(1), int64(1), object(5)
memory usage: 553.5+ KB
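
The capitalized `'Int64'` used in cell 6 is pandas' nullable integer dtype. The `year` column here has no missing values, but the nullable dtype would still work if it did, whereas a plain `int` cast would fail. A small illustration on made-up values (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# toy column: years stored as floats, one of them missing
years = pd.Series([2008.0, 1997.0, np.nan])

# the nullable integer dtype keeps the missing entry as <NA>
nullable_years = years.astype('Int64')
print(nullable_years.tolist())

# a plain int cast cannot represent the missing value and raises
try:
    years.astype(int)
    plain_cast_failed = False
except (ValueError, TypeError):
    plain_cast_failed = True
print(plain_cast_failed)
```
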


In [9]:
# rename book_id to id so that ratings and books_cf share the same key column
ratings = ratings.rename(columns={'book_id': 'id'})

### Sample User Input

Now that the data is ready, let's create a sample user input to stand in for the input an actual user would provide through the web app.

In [10]:
# create sample dataframe of book-rating pairs (sample #3, with two books, is the active one)

user_input = [
    #1
    #{'title': 'The Tipping Point: How Little Things Can Make a Big Difference', 'authors': 'Malcolm Gladwell', 'rating': 4},
    #{'title': 'How to Win Friends and Influence People', 'authors': 'Dale Carnegie', 'rating': 5},
    #{'title': 'The Power of Habit: Why We Do What We Do in Life and Business', 'authors': 'Charles Duhigg', 'rating': 5},
    #{'title': 'Stumbling on Happiness', 'authors': 'Daniel Todd Gilbert', 'rating': 4},
    #{'title': 'Flow: The Psychology of Optimal Experience', 'authors': 'Mihaly Csikszentmihalyi', 'rating': 5},
    
    #2
    #{'title': '1984', 'authors': 'George Orwell, Erich Fromm, Celâl Üster', 'rating': 5},
    #{'title': 'The Catcher in the Rye', 'authors': 'J.D. Salinger', 'rating': 3},
    #{'title': 'The Great Gatsby', 'authors': 'F. Scott Fitzgerald', 'rating': 4},
    #{'title': 'To Kill a Mockingbird', 'authors': 'Harper Lee', 'rating': 5},
    #{'title': 'Pride and Prejudice', 'authors': 'Jane Austen', 'rating': 5}
    
    #3
    {'title': 'The Catcher in the Rye', 'authors': 'J.D. Salinger', 'rating': 5},
    {'title': 'The Great Gatsby', 'authors': 'F. Scott Fitzgerald', 'rating': 4}
]

input_books = pd.DataFrame(user_input)
input_books

Unnamed: 0,title,authors,rating
0,The Catcher in the Rye,J.D. Salinger,5
1,The Great Gatsby,F. Scott Fitzgerald,4


This is good, but now let's manipulate the data so that it is ready for calculating correlations.

In [11]:
# filter the books by title, merge df's and drop year column
input_id = books_cf[books_cf['title'].isin(input_books['title'].tolist())]
input_books = pd.merge(input_id, input_books)
input_books = input_books.drop('year', axis=1)
input_books

Unnamed: 0,id,title,authors,genre1,genre2,genre3,rating
0,5,The Great Gatsby,F. Scott Fitzgerald,classics,fiction,historical-fiction,4
1,8,The Catcher in the Rye,J.D. Salinger,classics,fiction,young-adult,5
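
`pd.merge` is called here without an `on` argument, so pandas joins on every column name the two frames share (`title` and `authors` in this case) and, by default, keeps only the rows that match in both. A toy illustration using a cut-down version of the frames above:

```python
import pandas as pd

catalog = pd.DataFrame({
    'id': [5, 8, 10],
    'title': ['The Great Gatsby', 'The Catcher in the Rye', 'Pride and Prejudice'],
    'authors': ['F. Scott Fitzgerald', 'J.D. Salinger', 'Jane Austen'],
})
rated = pd.DataFrame({
    'title': ['The Catcher in the Rye', 'The Great Gatsby'],
    'authors': ['J.D. Salinger', 'F. Scott Fitzgerald'],
    'rating': [5, 4],
})

# no `on=`: the join keys are the shared columns, title and authors
merged = pd.merge(catalog, rated)
print(merged)
```

One consequence of the inner join is that a title or author string that does not exactly match the catalog drops out of the result without an error.
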


In [12]:
# remove input books from books_cf 
input_book_list = input_books['id'].tolist()
books_cf = books_cf[~books_cf['id'].isin(input_book_list)]
books_cf.shape

(9940, 7)

In [13]:
# create a list of genres
genre_list = []
id_list = input_books['id'].tolist()

# for loop that appends unique genre tags to genre_list
for item in id_list:
    temp = books.loc[books['id'] == item]
    # read genre1..genre3 by name rather than by column position
    for col in ['genre1', 'genre2', 'genre3']:
        a = temp.iloc[0][col]
        if a not in genre_list:
            genre_list.append(a)
genre_list

['classics', 'fiction', 'historical-fiction', 'young-adult']

In [14]:
# keep only the ratings of books that the input user has also rated
user_subset = ratings[ratings['id'].isin(input_books['id'].tolist())]
user_subset.head()

Unnamed: 0,id,user_id,rating
400,5,314,4
401,5,1169,5
402,5,1952,5
403,5,2324,5
404,5,2487,3


In [15]:
# group by user_id (a plain string key keeps each group name a scalar, not a 1-tuple)
user_subset_group = user_subset.groupby('user_id')

In [16]:
# sort so that users with the most books in common with the input have priority
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)
user_subset_group[0:5]

[(1952,
       id  user_id  rating
  402   5     1952       5
  701   8     1952       4),
 (2324,
       id  user_id  rating
  403   5     2324       5
  703   8     2324       5),
 (2900,
       id  user_id  rating
  405   5     2900       5
  704   8     2900       3),
 (3022,
       id  user_id  rating
  406   5     3022       1
  705   8     3022       1),
 (3922,
       id  user_id  rating
  409   5     3922       2
  706   8     3922       1)]

In [17]:
# limit number of users we look through to top 100
user_subset_group = user_subset_group[0:100]
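
The same ranking can also be expressed by counting each user's shared books directly. The sketch below runs on toy rows modeled on the output above and is only an illustration, not a replacement for the cells in this notebook:

```python
import pandas as pd

# toy rows: user 314 rated one input book, users 1952 and 2324 rated both
user_subset = pd.DataFrame({
    'id':      [5, 5, 8, 5, 8],
    'user_id': [314, 1952, 1952, 2324, 2324],
    'rating':  [4, 5, 4, 5, 5],
})

# number of input books each user has rated, largest first
overlap = (
    user_subset.groupby('user_id')
    .size()
    .sort_values(ascending=False, kind='stable')
)
print(overlap)
```
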

### Calculating Pearson Correlations

We will calculate these correlations manually, comparing the input user's ratings with each candidate user's ratings of the same books.
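
For reference, the quantities computed in the next cell correspond to the computational form of the Pearson correlation, where $x_i$ are the input user's ratings, $y_i$ are the other user's ratings of the same books, and $n$ is the number of books in common:

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
$$

$$
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

When $S_{xx}$ or $S_{yy}$ is zero (one of the two users gave every shared book the same rating), $r$ is undefined and the code sets it to 0.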

In [18]:
# create dict of pearson correlation values
pearson_corr_dict = {}

# for loop that calculates Pearson correlation and stores values in above dict
for name, group in user_subset_group:
    
    group = group.sort_values(by='id')
    input_books = input_books.sort_values(by='id')
    num_ratings = len(group)
    
    # get the review scores for the books in common
    temp_df = input_books[input_books['id'].isin(group['id'].tolist())]

    # store both ratings in list for calculations        
    rating_list = temp_df['rating'].tolist()
    group_list = group['rating'].tolist()

    # calculate the pearson correlation between users 
    Sxx = sum([i**2 for i in rating_list]) - (sum(rating_list)**2 / float(num_ratings))
    Syy = sum([i**2 for i in group_list]) - (sum(group_list)**2 / float(num_ratings))
    Sxy = sum([i*j for i, j in zip(rating_list, group_list)]) - (sum(rating_list) * sum(group_list) / float(num_ratings))

    # calculate Pearson corr if Sxx and Syy not 0, else set = 0
    if Sxx != 0 and Syy != 0:
        pearson_corr_dict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearson_corr_dict[name] = 0

In [19]:
# convert dictionary to dataframe
pearson_df = pd.DataFrame.from_dict(pearson_corr_dict, orient='index')
pearson_df.columns = ['similarity_index']
pearson_df['user_id'] = pearson_df.index
pearson_df.index = range(len(pearson_df))
pearson_df

Unnamed: 0,similarity_index,user_id
0,-1.0,1952
1,0.0,2324
2,-1.0,2900
3,0.0,3022
4,-1.0,3922
...,...,...
95,0.0,20076
96,0.0,20467
97,0.0,20782
98,0.0,20848
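
As a sanity check on the first rows above, here is a standalone sketch of the same calculation in plain Python. The function name `pearson` is my own label for illustration. The input ratings `[4, 5]` are the sample user's ratings for books 5 and 8, and the other lists are the ratings of users 1952 and 2324 shown earlier.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0 if undefined."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx != 0 and syy != 0:
        return sxy / sqrt(sxx * syy)
    return 0

input_ratings = [4, 5]  # sample user: book 5 -> 4, book 8 -> 5

print(pearson(input_ratings, [5, 4]))  # user 1952: -1.0
print(pearson(input_ratings, [5, 5]))  # user 2324: 0 (identical ratings, so undefined)
```

With only two books in common, the correlation can only be -1, 0, or 1, which is why the table contains no intermediate values.
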


In [20]:
# get top 50 similar users
top_users = pearson_df.sort_values(by='similarity_index', ascending=False)[0:50]
top_users.head()

Unnamed: 0,similarity_index,user_id
54,1.0,37284
56,1.0,39423
45,1.0,30681
21,1.0,14603
27,1.0,19729


In [21]:
# merge top_users df with ratings 
top_users_rating = top_users.merge(ratings, left_on='user_id', right_on='user_id', how='inner')
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,id,rating
0,1.0,37284,1,5
1,1.0,37284,3,1
2,1.0,37284,4,3
3,1.0,37284,5,2
4,1.0,37284,7,4


In [22]:
# multiply the user similarity by the user ratings
top_users_rating['weighted_rating'] = top_users_rating['similarity_index'] * top_users_rating['rating']
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,id,rating,weighted_rating
0,1.0,37284,1,5,5.0
1,1.0,37284,3,1,1.0
2,1.0,37284,4,3,3.0
3,1.0,37284,5,2,2.0
4,1.0,37284,7,4,4.0


In [23]:
# sum the similarity and weighted rating columns after grouping by book id
temp_top_users_rating = top_users_rating.groupby('id')[['similarity_index','weighted_rating']].sum()
temp_top_users_rating.columns = ['sum_similarity_index','sum_weighted_rating']
temp_top_users_rating.head()

Unnamed: 0_level_0,sum_similarity_index,sum_weighted_rating
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,9.0,37.0
2,10.0,46.0
3,6.0,16.0
4,13.0,60.0
5,15.0,48.0


### Recommendations

Finally, we can divide the summed weighted rating by the summed similarity index to get a weighted average recommendation score for each book, which is what we use to rank the recommendations.

In [24]:
# create an empty dataframe
recommendation_df = pd.DataFrame()

# find the weighted average
recommendation_df['weighted average recommendation score'] = temp_top_users_rating['sum_weighted_rating'] / temp_top_users_rating['sum_similarity_index']
recommendation_df['id'] = temp_top_users_rating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,id
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4.111111,1
2,4.6,2
3,2.666667,3
4,4.615385,4
5,3.2,5
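
Each score is the similarity-weighted mean of the ratings a book received from the top users. For book 1 above, that is 37.0 / 9.0 ≈ 4.11. A minimal sketch with made-up numbers follows; the function name and the zero-denominator guard are my additions for illustration and are not part of the notebook:

```python
def weighted_score(pairs):
    """pairs: (similarity, rating) tuples for one book.

    Returns the similarity-weighted mean rating, or None when the
    similarities sum to zero and the mean is undefined.
    """
    total_similarity = sum(sim for sim, _ in pairs)
    if total_similarity == 0:
        return None
    return sum(sim * rating for sim, rating in pairs) / total_similarity

print(weighted_score([(1.0, 5), (1.0, 3), (0.0, 1)]))  # 4.0
print(weighted_score([(0.0, 4)]))                      # None
```

In the pandas division used in this notebook, a book whose raters all have a similarity of 0 ends up with NaN rather than a score.
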


In [25]:
# sort by recommendation score, highest first
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,id
id,Unnamed: 1_level_1,Unnamed: 2_level_1
2199,5.0,2199
366,5.0,366
1399,5.0,1399
822,5.0,822
4445,5.0,4445
1954,5.0,1954
2079,5.0,2079
4383,5.0,4383
4370,5.0,4370
2852,5.0,2852


In [26]:
# keep only the books that share at least one genre with the input books
genre_cols = ['genre1', 'genre2', 'genre3']
books_cf = books_cf[books_cf[genre_cols].isin(genre_list).any(axis=1)]
            
books_cf.shape

(7725, 7)

In [27]:
# keep only the recommendations whose book shares a genre with the input books
shared_ids = books_cf['id'].tolist()
recommendation_df = recommendation_df[recommendation_df['id'].isin(shared_ids)]

In [28]:
fives = recommendation_df.loc[recommendation_df['weighted average recommendation score'] == 5]

In [29]:
len(fives)

179

In [30]:
if len(fives) > 10:
    # too many perfect scores to choose from: break ties by genre overlap
    # .copy() gives an independent frame, so adding the scores column is safe
    recommendation = books_cf.loc[books_cf['id'].isin(recommendation_df['id'].head(len(fives)).tolist())].copy()
    scores = []
    for index, row in recommendation.iterrows():
        # weight genre matches by position: genre1 = 5, genre2 = 3, genre3 = 1
        count = 0
        if row['genre1'] in genre_list:
            count += 5
        if row['genre2'] in genre_list:
            count += 3
        if row['genre3'] in genre_list:
            count += 1
        scores.append(count)

    recommendation['scores'] = scores
    recommendation = recommendation.sort_values(by=['scores', 'id'], ascending=[False, True])
    recommendation = recommendation.head(10)

else:
    recommendation = books_cf.loc[books_cf['id'].isin(recommendation_df['id'].head(10).tolist())]

recommendation


Unnamed: 0,id,title,authors,year,genre1,genre2,genre3,scores
298,300,The Boy in the Striped Pajamas,John Boyne,2006,historical-fiction,young-adult,fiction,9
542,544,"Little House on the Prairie (Little House, #2)","Laura Ingalls Wilder, Garth Williams",1935,classics,fiction,young-adult,9
732,734,Roots: The Saga of an American Family,Alex Haley,1976,historical-fiction,fiction,classics,9
761,763,The Bluest Eye,Toni Morrison,1970,fiction,classics,historical-fiction,9
837,840,"Shōgun (Asian Saga, #1)",James Clavell,1975,historical-fiction,fiction,classics,9
1069,1073,Vanity Fair,"William Makepeace Thackeray, John Carey",1847,classics,fiction,historical-fiction,9
1104,1108,The Complete Fairy Tales,"Hans Christian Andersen, Lily Owens, Arthur Ra...",1835,classics,fiction,young-adult,9
1117,1121,"Blood Meridian, or the Evening Redness in the ...",Cormac McCarthy,1985,fiction,historical-fiction,classics,9
1251,1255,The House on Mango Street,Sandra Cisneros,1984,fiction,young-adult,classics,9
3583,3590,The Tin Drum,Günter Grass,1959,fiction,classics,historical-fiction,9


This gives us a list of 10 recommended books.
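
The tie-break among books with a perfect score weights genre matches by position: 5 for `genre1`, 3 for `genre2`, and 1 for `genre3`. A standalone sketch of that scoring, where the names `genre_score` and `GENRE_WEIGHTS` are hypothetical and used only for illustration:

```python
GENRE_WEIGHTS = (5, 3, 1)  # weights for genre1, genre2, genre3

def genre_score(book_genres, input_genres):
    """Sum the positional weights of the book's genres found in input_genres."""
    return sum(w for g, w in zip(book_genres, GENRE_WEIGHTS) if g in input_genres)

genre_list = ['classics', 'fiction', 'historical-fiction', 'young-adult']

# The Boy in the Striped Pajamas: all three genres match
print(genre_score(['historical-fiction', 'young-adult', 'fiction'], genre_list))  # 9

# a made-up book whose first genre does not match
print(genre_score(['fantasy', 'classics', 'fiction'], genre_list))  # 4
```

All ten books in the table score 9 because each of their three genres appears in the input's genre list, so the order within the list is decided by `id`.
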

I have successfully included genre as a factor in the book recommendation. The next step is to copy this code over to the .py files in Atom so that these recommendations can be viewed in the web app.