# Pre-processing and Model

This is the Jupyter Notebook version of this section of the project. It walks through every pre-processing step and then manually builds the similarity scores used to rank books. The notebook is useful because it displays the dataframe at each step, up to the final recommendations. The main files for this project, however, are .py files written in Atom. I used object-oriented programming to make the code more reusable, and I layered a Streamlit web app on top to make the project more interactive. Ultimately, this notebook serves as a step-by-step guide to the modeling process.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
from statistics import pstdev
import random

# set preferences for displaying dataframes
pd.set_option('display.max_columns', None)

Load the saved data files from the previous Data_Wrangling_and_EDA Notebook.

In [2]:
# read in the books and ratings dataframes
ratings = pd.read_csv("Data/ratings_cleaned.csv")
books = pd.read_csv("Data/books_cleaned.csv")

Let's look at the two datasets again as a reminder of their structure.

In [3]:
ratings.head()

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4


In [4]:
books.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,goodreads_book_id,genre1,genre2,genre3
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,2767052,young-adult,fiction,fantasy
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,3,fantasy,young-adult,fiction
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,41865,young-adult,fantasy,fiction
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,2657,classics,historical-fiction,young-adult
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,4671,classics,fiction,historical-fiction


We're ready to start setting up the model.

## Collaborative Filtering

I chose a collaborative filtering approach for my recommender. I used [this](https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd) project by Bryan Tan as a guide to the steps involved in creating such a model. To begin, we need to create a simplified version of the books dataframe with only the essential attributes.

In [5]:
# make dataframe with only essential columns
books_cf = books[['id', 'title', 'authors', 'original_publication_year', 'genre1', 'genre2', 'genre3', 'small_image_url']].copy()
books_cf.head(50)

Unnamed: 0,id,title,authors,original_publication_year,genre1,genre2,genre3,small_image_url
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,2008.0,young-adult,fiction,fantasy,https://images.gr-assets.com/books/1447303603s...
1,2,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",1997.0,fantasy,young-adult,fiction,https://images.gr-assets.com/books/1474154022s...
2,3,"Twilight (Twilight, #1)",Stephenie Meyer,2005.0,young-adult,fantasy,fiction,https://images.gr-assets.com/books/1361039443s...
3,4,To Kill a Mockingbird,Harper Lee,1960.0,classics,historical-fiction,young-adult,https://images.gr-assets.com/books/1361975680s...
4,5,The Great Gatsby,F. Scott Fitzgerald,1925.0,classics,fiction,historical-fiction,https://images.gr-assets.com/books/1490528560s...
5,6,The Fault in Our Stars,John Green,2012.0,young-adult,fiction,romance,https://images.gr-assets.com/books/1360206420s...
6,7,The Hobbit,J.R.R. Tolkien,1937.0,fantasy,classics,fiction,https://images.gr-assets.com/books/1372847500s...
7,8,The Catcher in the Rye,J.D. Salinger,1951.0,classics,fiction,young-adult,https://images.gr-assets.com/books/1398034300s...
8,9,"Angels & Demons (Robert Langdon, #1)",Dan Brown,2000.0,fiction,mystery,thriller,https://images.gr-assets.com/books/1303390735s...
9,10,Pride and Prejudice,Jane Austen,1813.0,classics,fiction,romance,https://images.gr-assets.com/books/1320399351s...


We can clean up this dataframe a little more to make it easier to work with.

In [6]:
# convert original_publication_year to int, rename column to something shorter
books_cf['original_publication_year'] = books_cf['original_publication_year'].astype('Int64')
books_cf = books_cf.rename(columns={'original_publication_year': 'year'})

In [7]:
books_cf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9942 entries, 0 to 9941
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               9942 non-null   int64 
 1   title            9942 non-null   object
 2   authors          9942 non-null   object
 3   year             9942 non-null   Int64 
 4   genre1           9942 non-null   object
 5   genre2           9942 non-null   object
 6   genre3           9942 non-null   object
 7   small_image_url  9942 non-null   object
dtypes: Int64(1), int64(1), object(6)
memory usage: 631.2+ KB


In [8]:
# rename ratings book_id to id
ratings = ratings.rename(columns={'book_id': 'id'})

### Sample User Input

Now that the data is ready to be used, let's create a sample user input to stand in for the input a real user would provide through the web app.

In [9]:
# create sample dataframe with 5 book-rating pairs

user_input = [
    
    {'title': 'The Tipping Point: How Little Things Can Make a Big Difference', 'authors': 'Malcolm Gladwell', 'rating': 4},
    {'title': 'How to Win Friends and Influence People', 'authors': 'Dale Carnegie', 'rating': 5},
    {'title': 'The Power of Habit: Why We Do What We Do in Life and Business', 'authors': 'Charles Duhigg', 'rating': 5},
    {'title': 'Stumbling on Happiness', 'authors': 'Daniel Todd Gilbert', 'rating': 4},
    {'title': 'Flow: The Psychology of Optimal Experience', 'authors': 'Mihaly Csikszentmihalyi', 'rating': 5}
    
    #2 - an alternate testing user input
    #{'title': '1984', 'authors': 'George Orwell, Erich Fromm, Celâl Üster', 'rating': 5},
    #{'title': 'The Catcher in the Rye', 'authors': 'J.D. Salinger', 'rating': 5},
    #{'title': 'The Great Gatsby', 'authors': 'F. Scott Fitzgerald', 'rating': 4},
    #{'title': 'To Kill a Mockingbird', 'authors': 'Harper Lee', 'rating': 5},
    #{'title': 'Of Mice and Men', 'authors': 'John Steinbeck', 'rating': 5}

]

input_books = pd.DataFrame(user_input)
input_books

Unnamed: 0,title,authors,rating
0,The Tipping Point: How Little Things Can Make ...,Malcolm Gladwell,4
1,How to Win Friends and Influence People,Dale Carnegie,5
2,The Power of Habit: Why We Do What We Do in Li...,Charles Duhigg,5
3,Stumbling on Happiness,Daniel Todd Gilbert,4
4,Flow: The Psychology of Optimal Experience,Mihaly Csikszentmihalyi,5


This is a good start. Next, let's reshape the data so that it can be used to calculate correlations.

In [10]:
# filter the books by title, merge df's and drop year column
input_id = books_cf[books_cf['title'].isin(input_books['title'].tolist())]
input_books = pd.merge(input_id, input_books)
input_books = input_books.drop('year', axis=1)
input_books

Unnamed: 0,id,title,authors,genre1,genre2,genre3,small_image_url,rating
0,127,The Tipping Point: How Little Things Can Make ...,Malcolm Gladwell,non-fiction,business,psychology,https://images.gr-assets.com/books/1473396980s...,4
1,260,How to Win Friends and Influence People,Dale Carnegie,non-fiction,classics,philosophy,https://images.gr-assets.com/books/1442726934s...,5
2,537,The Power of Habit: Why We Do What We Do in Li...,Charles Duhigg,science,non-fiction,psychology,https://images.gr-assets.com/books/1366758683s...,5
3,2465,Stumbling on Happiness,Daniel Todd Gilbert,non-fiction,psychology,science,https://images.gr-assets.com/books/1327947323s...,4
4,2946,Flow: The Psychology of Optimal Experience,Mihaly Csikszentmihalyi,psychology,non-fiction,self-help,https://s.gr-assets.com/assets/nophoto/book/50...,5
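
The merge above joins on the shared `title` and `authors` columns, so an input title with no exact match in `books_cf` is silently dropped. Below is a minimal sketch of one way to surface such misses, using a left merge with the `indicator` flag of `pd.merge`. The tiny catalog, the unmatched title, and the helper name `match_input_books` are made up for illustration and are not part of the project code.

```python
import pandas as pd

def match_input_books(catalog, user_books):
    """Hypothetical helper: return (matched, unmatched) dataframes."""
    # A left merge keeps every user row; '_merge' records whether it matched
    merged = pd.merge(user_books, catalog, on=['title', 'authors'],
                      how='left', indicator=True)
    matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
    # Unmatched rows introduce NaN ids, so restore the integer dtype
    matched = matched.assign(id=matched['id'].astype(int))
    unmatched = merged.loc[merged['_merge'] == 'left_only', ['title', 'authors']]
    return matched.reset_index(drop=True), unmatched.reset_index(drop=True)

# Toy catalog with two of the books used above
catalog = pd.DataFrame({
    'id': [260, 2465],
    'title': ['How to Win Friends and Influence People', 'Stumbling on Happiness'],
    'authors': ['Dale Carnegie', 'Daniel Todd Gilbert'],
})
user_books = pd.DataFrame([
    {'title': 'How to Win Friends and Influence People', 'authors': 'Dale Carnegie', 'rating': 5},
    {'title': 'A Title Not In The Catalog', 'authors': 'Nobody', 'rating': 3},
])

matched, unmatched = match_input_books(catalog, user_books)
print(matched)
print(unmatched)
```

In the web app, a non-empty `unmatched` frame could be used to prompt the user to correct a title, though how that is handled is a design choice outside this notebook.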


In [11]:
# remove input books from books_cf 
input_book_list = input_books['id'].tolist()
books_cf = books_cf[~books_cf['id'].isin(input_book_list)]
books_cf.shape

(9937, 8)

In [12]:
# create a list of genres
genre_list = []
id_list = input_books['id'].tolist()

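# for loop that appends unique genre tags to genre_list, in order of first appearance
for item in id_list:
    temp = books.loc[books['id'] == item]
    # select the genre columns by name rather than by position
    for col in ['genre1', 'genre2', 'genre3']:
        a = temp.iloc[0][col]
        if a not in genre_list:
            genre_list.append(a)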
            
genre_list

['non-fiction',
 'business',
 'psychology',
 'classics',
 'philosophy',
 'science',
 'self-help']

In [13]:
# make average rating for each genre among user inputs
avg_genre_rating = []
for item in genre_list:
    temp_gdf = input_books.loc[(input_books['genre1'] == item) | \
                               (input_books['genre2'] == item) | \
                               (input_books['genre3'] == item)]
    c = temp_gdf['rating'].mean()
    avg_genre_rating.append(c)
avg_genre_rating

[4.6, 4.0, 4.5, 5.0, 5.0, 4.5, 5.0]

In [14]:
# create dictionary among the two lists
genre_rating_dict = {genre_list[i]: avg_genre_rating[i] for i in range(len(genre_list))}
genre_rating_dict

{'non-fiction': 4.6,
 'business': 4.0,
 'psychology': 4.5,
 'classics': 5.0,
 'philosophy': 5.0,
 'science': 4.5,
 'self-help': 5.0}
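
The same genre averages can be reproduced without explicit loops. The sketch below rebuilds the five input books from the table above and uses `melt` plus `groupby` to average the ratings per genre. It assumes a book never lists the same genre twice (otherwise that rating would be counted twice), and the resulting dictionary is ordered alphabetically rather than by first appearance.

```python
import pandas as pd

# Ratings and genre tags copied from the input_books table above
input_books = pd.DataFrame({
    'id': [127, 260, 537, 2465, 2946],
    'genre1': ['non-fiction', 'non-fiction', 'science', 'non-fiction', 'psychology'],
    'genre2': ['business', 'classics', 'non-fiction', 'psychology', 'non-fiction'],
    'genre3': ['psychology', 'philosophy', 'psychology', 'science', 'self-help'],
    'rating': [4, 5, 5, 4, 5],
})

# Stack the three genre columns into a single 'genre' column
long_form = input_books.melt(id_vars=['id', 'rating'],
                             value_vars=['genre1', 'genre2', 'genre3'],
                             value_name='genre')

# Average the user's ratings within each genre
genre_rating_dict = long_form.groupby('genre')['rating'].mean().to_dict()
print(genre_rating_dict)
```

For example, all five input books are tagged non-fiction, so its average is (4 + 5 + 5 + 4 + 5) / 5 = 4.6.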

In [15]:
# filter users that have read books that the input has also read
user_subset = ratings[ratings['id'].isin(input_books['id'].tolist())]
user_subset.head()

Unnamed: 0,id,user_id,rating
12600,127,173,4
12601,127,588,4
12602,127,1449,4
12603,127,1456,4
12604,127,1759,3


In [16]:
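# group by user_id to create the user subset group
# (a scalar key keeps each group name a plain user_id; a one-element list yields tuple names in pandas 2.x)
user_subset_group = user_subset.groupby('user_id')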

In [17]:
# sort so that users with the most books in common with the input have priority
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)
user_subset_group[0:5]

[(39720,
            id  user_id  rating
  12690    127    39720       2
  25872    260    39720       4
  245925  2465    39720       5
  293947  2946    39720       4),
 (14901,
            id  user_id  rating
  25825    260    14901       2
  245876  2465    14901       2
  293907  2946    14901       2),
 (36206,
            id  user_id  rating
  53571    537    36206       5
  245918  2465    36206       3
  293939  2946    36206       5),
 (47478,
            id  user_id  rating
  12697    127    47478       3
  25884    260    47478       2
  293963  2946    47478       3),
 (1456,
            id  user_id  rating
  12603    127     1456       4
  245844  2465     1456       5)]

In [18]:
# limit number of users we look through to top 100
user_subset_group = user_subset_group[0:100]

### Calculating Similarities

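We will calculate these similarities manually: the Pearson correlation between the input ratings and each user's ratings on the books they have in common, falling back to cosine similarity when the input ratings have zero variance.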

In [19]:
# create dict of pearson correlation values
sim_dict = {}

std = pstdev(input_books['rating'].tolist())

# sort the input once so its ratings line up with each user's ratings by id
input_books = input_books.sort_values(by='id')

# for loop that calculates each user's similarity and stores it in the dict above
for name, group in user_subset_group:
    
    group = group.sort_values(by='id')
    num_ratings = len(group)
    
    # get the review scores for the books in common
    temp_df = input_books[input_books['id'].isin(group['id'].tolist())]

    # store both ratings in list for calculations        
    rating_list = temp_df['rating'].tolist()
    group_list = group['rating'].tolist()
    
    if std == 0:
    
        # calculate cosine similarity
        cos_sim = np.dot(rating_list, group_list)/(np.linalg.norm(rating_list)*np.linalg.norm(group_list))
        sim_dict[name] = cos_sim
        
    else:

        # calculate the pearson correlation between users 
        Sxx = sum([i**2 for i in rating_list]) - (sum(rating_list)**2 / float(num_ratings))
        Syy = sum([i**2 for i in group_list]) - (sum(group_list)**2 / float(num_ratings))
        Sxy = sum([i*j for i, j in zip(rating_list, group_list)]) - (sum(rating_list) * sum(group_list) / float(num_ratings))

        # calculate Pearson corr if Sxx and Syy not 0, else set = 0
        if Sxx != 0 and Syy != 0:
            sim_dict[name] = Sxy/sqrt(Sxx*Syy)
        else:
            sim_dict[name] = 0
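
As a sanity check on the sums-of-squares formula used in the loop, the sketch below wraps it in a standalone function (the name `pearson` is introduced here only for illustration) and compares it with `np.corrcoef`. The ratings are taken from the tables above: the input rated books 127, 260, 2465 and 2946 as 4, 5, 4, 5, while user 39720 rated them 2, 4, 5, 4.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation via sums of squares; 0 if either side has no variance."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Input ratings vs. user 39720's ratings on the four books they share
r_manual = pearson([4, 5, 4, 5], [2, 4, 5, 4])
r_numpy = np.corrcoef([4, 5, 4, 5], [2, 4, 5, 4])[0, 1]
print(r_manual, r_numpy)

# Input ratings vs. user 47478's ratings on books 127, 260 and 2946
r_negative = pearson([4, 5, 5], [3, 2, 3])
print(r_negative)
```

Here Sxx = 1, Syy = 4.75 and Sxy = 0.5, so r = 0.5 / sqrt(4.75), which is approximately 0.2294. A user who gives every shared book the same rating has Syy = 0, which is why the loop assigns such users a similarity of 0.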

In [20]:
# convert dictionary to dataframe
sim_df = pd.DataFrame.from_dict(sim_dict, orient='index')
sim_df.columns = ['similarity_index']
sim_df['user_id'] = sim_df.index
sim_df.index = range(len(sim_df))
sim_df.head()

Unnamed: 0,similarity_index,user_id
0,0.229416,39720
1,0.0,14901
2,1.0,36206
3,-0.5,47478
4,0.0,1456


In [21]:
# get top 50 similar users
top_users = sim_df.sort_values(by='similarity_index', ascending=False)[0:50]
top_users.head()

Unnamed: 0,similarity_index,user_id
15,1.0,25812
9,1.0,14136
28,1.0,37035
31,1.0,41282
5,1.0,1877


In [22]:
# merge top_users df with ratings
top_users_rating = top_users.merge(ratings, on='user_id', how='inner')
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,id,rating
0,1.0,25812,127,1
1,1.0,25812,182,5
2,1.0,25812,183,2
3,1.0,25812,190,1
4,1.0,25812,193,2


In [23]:
# multiply the user similarity by the user ratings
top_users_rating['weighted_rating'] = top_users_rating['similarity_index'] * top_users_rating['rating']
top_users_rating.head()

Unnamed: 0,similarity_index,user_id,id,rating,weighted_rating
0,1.0,25812,127,1,1.0
1,1.0,25812,182,5,5.0
2,1.0,25812,183,2,2.0
3,1.0,25812,190,1,1.0
4,1.0,25812,193,2,2.0


In [24]:
# group by book id and sum the similarity indices and weighted ratings for each book
temp_top_users_rating = top_users_rating.groupby('id')[['similarity_index','weighted_rating']].sum()
temp_top_users_rating.columns = ['sum_similarity_index','sum_weighted_rating']
temp_top_users_rating.head()

id,sum_similarity_index,sum_weighted_rating
1,0.0,0.0
2,0.0,0.0
4,0.0,0.0
5,0.0,0.0
6,1.229416,4.917663


### Recommendations

Finally, we can use the weighted rating and similarity index to calculate a weighted average recommendation score for each book: the sum of its weighted ratings divided by the sum of the similarity indices of the users who rated it.

In [25]:
# create an empty dataframe
recommendation_df = pd.DataFrame()

# find the weighted average
recommendation_df['weighted average recommendation score'] = temp_top_users_rating['sum_weighted_rating'] / temp_top_users_rating['sum_similarity_index']
recommendation_df['id'] = temp_top_users_rating.index
recommendation_df.head()

id (index),weighted average recommendation score,id
1,,1
2,,2
4,,4
5,,5
6,4.0,6
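
To make the arithmetic concrete, here is a small sketch of the weighted average on toy rows. The rows are an assumption chosen so that book 6 reproduces the sums shown earlier (a similarity sum of 1.229416 and a score of 4.0). The individual ratings are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two similar users rated book 6; one zero-similarity user rated book 1
top_users_rating = pd.DataFrame({
    'similarity_index': [1.0, 0.229416, 0.0],
    'id': [6, 6, 1],
    'rating': [4, 4, 5],
})
top_users_rating['weighted_rating'] = (
    top_users_rating['similarity_index'] * top_users_rating['rating'])

# Sum similarities and weighted ratings per book
sums = top_users_rating.groupby('id')[['similarity_index', 'weighted_rating']].sum()

# 0 / 0 yields NaN, so books rated only by zero-similarity users get no score
score = sums['weighted_rating'] / sums['similarity_index']
print(score)
```

For book 6 this is (1.0 * 4 + 0.229416 * 4) / (1.0 + 0.229416) = 4.0. The NaN rows sort to the end in the next step, so they never reach the top of the recommendations.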


In [26]:
# sort by recommendation score, highest first
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

id (index),weighted average recommendation score,id
9999,5.0,9999
663,5.0,663
4389,5.0,4389
4356,5.0,4356
1280,5.0,1280
501,5.0,501
4080,5.0,4080
537,5.0,537
545,5.0,545
3910,5.0,3910


In [27]:
# check the number of books that have perfect 5's as a recommendation score
fives = recommendation_df.loc[recommendation_df['weighted average recommendation score'] == 5]
len(fives)

180

In [28]:
if len(fives) > 10:
    
    count = 0
    scores = []
    
    # calculate genre score for each book
    for index, row in books_cf.iterrows():
        if row['genre1'] in genre_list:
            count += 5 * (genre_rating_dict[row['genre1']] - 3)
        if row['genre2'] in genre_list:
            count += 3 * (genre_rating_dict[row['genre2']] - 3)
        if row['genre3'] in genre_list:
            count += 1 * (genre_rating_dict[row['genre3']] - 3)
        scores.append(count)
        count = 0
    
    # create new column in dataframe for genre score
    books_cf['genre_score'] = scores
    
    # among the books with a perfect recommendation score, return the 10 with the highest genre score
    recommendation = books_cf.loc[books_cf['id'].isin(recommendation_df['id'].head(len(fives)).tolist())]
    recommendation = recommendation.sort_values(by=['genre_score', 'id'], ascending=[False, True])
    recommendation = recommendation.head(10)
    
else:
    
    # return top 10 books by recommendation score
    recommendation = books_cf.loc[books_cf['id'].isin(recommendation_df['id'].head(10).tolist())]
    recommendation = recommendation.sort_values(by=['id'], ascending=True)
    recommendation = recommendation.head(10)
    
recommendation

Unnamed: 0,id,title,authors,year,genre1,genre2,genre3,small_image_url,genre_score
5613,5637,The Dhammapada,"Anonymous, Ananda Maitreya, Thich Nhat Hanh, B...",-500,philosophy,non-fiction,classics,https://s.gr-assets.com/assets/nophoto/book/50...,16.8
3458,3464,The Doors of Perception & Heaven and Hell,Aldous Huxley,1956,philosophy,non-fiction,psychology,https://images.gr-assets.com/books/1375947566s...,16.3
8438,8485,The Element: How Finding Your Passion Changes ...,"Ken Robinson, Lou Aronica",2009,self-help,psychology,non-fiction,https://s.gr-assets.com/assets/nophoto/book/50...,16.1
788,790,The Four Agreements: A Practical Guide to Pers...,Miguel Ruiz,1997,non-fiction,self-help,philosophy,https://s.gr-assets.com/assets/nophoto/book/50...,16.0
4986,5005,Feeling Good: The New Mood Therapy,David D. Burns,1980,non-fiction,self-help,psychology,https://s.gr-assets.com/assets/nophoto/book/50...,15.5
1309,1314,The Origin of Species,Charles Darwin,1859,science,non-fiction,classics,https://s.gr-assets.com/assets/nophoto/book/50...,14.3
1687,1692,He's Just Not That Into You: The No-Excuses Tr...,"Greg Behrendt, Liz Tuccillo",2004,non-fiction,self-help,chick-lit,https://s.gr-assets.com/assets/nophoto/book/50...,14.0
6591,6620,The Language Instinct: How the Mind Creates La...,Steven Pinker,1994,non-fiction,science,psychology,https://s.gr-assets.com/assets/nophoto/book/50...,14.0
7957,8000,The Red Queen: Sex and the Evolution of Human ...,Matt Ridley,1993,non-fiction,science,psychology,https://s.gr-assets.com/assets/nophoto/book/50...,14.0
5515,5537,The Blank Slate: The Modern Denial of Human Na...,Steven Pinker,2002,science,psychology,non-fiction,https://s.gr-assets.com/assets/nophoto/book/50...,13.6
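
The genre score weights a book's first, second and third genre by 5, 3 and 1, each multiplied by how far the user's average rating for that genre sits above 3. The sketch below is a vectorized alternative to the `iterrows` loop, applied to two books from the table above (ids 5637 and 1687). It is an illustration under those weights, not the code used in the project's .py files.

```python
import pandas as pd

# Average input rating per genre, copied from genre_rating_dict above
genre_rating_dict = {'non-fiction': 4.6, 'business': 4.0, 'psychology': 4.5,
                     'classics': 5.0, 'philosophy': 5.0, 'science': 4.5,
                     'self-help': 5.0}

# Genre tags for two of the recommended books
sample = pd.DataFrame({
    'id': [5637, 1687],
    'genre1': ['philosophy', 'non-fiction'],
    'genre2': ['non-fiction', 'self-help'],
    'genre3': ['classics', 'chick-lit'],
})

# Position weights: the first genre counts most, the third least
weights = {'genre1': 5, 'genre2': 3, 'genre3': 1}

# Genres the user never rated map to NaN and contribute 0
sample['genre_score'] = sum(
    w * (sample[col].map(genre_rating_dict) - 3).fillna(0)
    for col, w in weights.items()
)
print(sample[['id', 'genre_score']])
```

For book 5637 this gives 5 * (5.0 - 3) + 3 * (4.6 - 3) + 1 * (5.0 - 3) = 16.8, and for book 1687 it gives 5 * 1.6 + 3 * 2.0 + 0 = 14.0, matching the `genre_score` column above.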


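Because 180 books tied at a perfect recommendation score of 5, the genre score breaks the tie and gives us our final list of 10 recommended books.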