# Exercises in Recommender systems

This notebook contains exercises in recommender systems.

## Exercise 1

Use the "Coursera Courses Dataset 2021", available at kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available at kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

1. Create a Content-based filtering recommender system based on the Course Descriptions.


In [83]:
import pandas as pd

data = pd.read_csv('Coursera.csv')

data['Course Description'].head()

0    Write a Full Length Feature Film Script  In th...
1    By the end of this guided project, you will be...
2    This course consists of a general presentation...
3    When it comes to numbers, there is always more...
4    In this course you�ll learn how to effectively...
Name: Course Description, dtype: object

In [84]:
data.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...


In [85]:
# Checking for missing values
data.isnull().sum()

Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64

In [86]:
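# Import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object that removes English stop words such as 'the' and 'a'.
# Note: this rebinds the name `TfidfVectorizer` from the class to an instance,
# so later cells that call `TfidfVectorizer.fit_transform(...)` use this instance.
TfidfVectorizer = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix by fitting and transforming the course descriptions
Tfidf_matrix = TfidfVectorizer.fit_transform(data['Course Description'])

# Display the TF-IDF matrix (one row per course, one column per term)
Tfidf_matrix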

<3522x20074 sparse matrix of type '<class 'numpy.float64'>'
	with 253718 stored elements in Compressed Sparse Row format>

In [87]:
Tfidf_matrix.toarray()[1, :]

array([0., 0., 0., ..., 0., 0., 0.])

In [88]:
Tfidf_matrix

<3522x20074 sparse matrix of type '<class 'numpy.float64'>'
	with 253718 stored elements in Compressed Sparse Row format>

Now that each course description is represented as a vector, we need to measure how similar any two of these vectors are.
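As a minimal sketch of what the similarity measure does, the toy corpus below (hypothetical sentences, not taken from the Coursera data) is vectorized and compared. Cosine similarity is the dot product of two vectors divided by the product of their lengths. Because `TfidfVectorizer` L2-normalises each row by default, the plain dot product should give the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical, for illustration only)
docs = [
    "python programming for data science",
    "data science with python",
    "writing a screenplay for film",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Pairwise cosine similarity between all rows of X
sim = cosine_similarity(X)

# Rows are already unit length (norm="l2" is the default),
# so the dot product of the rows gives the same values.
dot = (X @ X.T).toarray()

print(np.round(sim, 3))
```

Documents that share terms (the first two) get a positive score, while documents with no terms in common get a score of zero.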

In [89]:
#Using the cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [90]:
cosine_sim = cosine_similarity(Tfidf_matrix, Tfidf_matrix)

In [91]:
cosine_sim

array([[1.00000000e+00, 3.12366523e-02, 1.97603991e-02, ...,
        3.17538002e-02, 3.33859933e-02, 1.96231367e-02],
       [3.12366523e-02, 1.00000000e+00, 8.58915185e-03, ...,
        3.13671991e-02, 4.88239107e-03, 4.56033552e-02],
       [1.97603991e-02, 8.58915185e-03, 1.00000000e+00, ...,
        3.45669421e-03, 1.65197252e-02, 6.37237740e-03],
       ...,
       [3.17538002e-02, 3.13671991e-02, 3.45669421e-03, ...,
        1.00000000e+00, 5.07544593e-04, 6.72367274e-03],
       [3.33859933e-02, 4.88239107e-03, 1.65197252e-02, ...,
        5.07544593e-04, 1.00000000e+00, 1.14068789e-03],
       [1.96231367e-02, 4.56033552e-02, 6.37237740e-03, ...,
        6.72367274e-03, 1.14068789e-03, 1.00000000e+00]])

In [92]:
cosine_sim.shape

(3522, 3522)

In [93]:
cosine_sim[0, 1]

0.0312366522978012

In [94]:
cosine_sim[1, 0]

0.0312366522978012

In [95]:
# Constructing a reverse map from course descriptions to row indices
indices = pd.Series(data.index, index=data['Course Description']).drop_duplicates()

In [96]:
indices[0:10]

Course Description
Write a Full Length Feature Film Script  In this course, you will write a complete, feature-length screenplay for film or television, be it a serious drama or romantic comedy or anything in between. You�ll learn to break down the creative process into components, and you�ll discover a structured process that allows you to produce a polished and pitch-ready script by the end of the course. Completing this project will increase your confidence in your ideas and abilities, and you�ll feel prepared to pitch your first script and get started on your next. This is a course designed to tap into your creativity and is based in "Active Learning". Most of the actual learning takes place within your own activities - that is, writing! You will learn by doing.  Here is a link to a TRAILER for the course. To view the trailer, please copy and paste the link into your browser. https://vimeo.com/382067900/b78b800dc0  Learner review: "Love the approach Professor Wheeler takes towards 

We can now define a recommender function.

In [97]:
def get_recommendations(course_des, cosine_sim=cosine_sim):
    # Find all course descriptions that contain the keyword
    matching_courses = data[data['Course Description'].str.contains(course_des, case=False, na=False)]

    if matching_courses.empty:
        return f"No courses found with the keyword '{course_des}' in their description."

    # Get the index of the first matching course
    idx = matching_courses.index[0]

    # Get the pairwise similarity scores of all courses with that course
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar courses
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar courses
    return data['Course Description'].iloc[course_indices]

# Getting recommendations for the course with 'Python'
recommendations = get_recommendations('Python')
print(recommendations)


740     This course will continue the introduction to ...
2329    This course will continue the introduction to ...
747     This course introduces the dictionary data str...
2039    Kickstart your learning of Python for data sci...
2518    Kickstart your learning of Python for data sci...
2519    Kickstart your learning of Python for data sci...
2886    Kickstart your learning of Python for data sci...
1210    By the end of this course, you will create a b...
3242    This two-part course is designed to help stude...
2087    This if the final course in the specialization...
Name: Course Description, dtype: object


In [98]:
# Getting recommendations for the course
get_recommendations('Python')

740     This course will continue the introduction to ...
2329    This course will continue the introduction to ...
747     This course introduces the dictionary data str...
2039    Kickstart your learning of Python for data sci...
2518    Kickstart your learning of Python for data sci...
2519    Kickstart your learning of Python for data sci...
2886    Kickstart your learning of Python for data sci...
1210    By the end of this course, you will create a b...
3242    This two-part course is designed to help stude...
2087    This if the final course in the specialization...
Name: Course Description, dtype: object
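The function above skips the first entry of the sorted list on the assumption that it is the query course itself. When two courses have identical descriptions (the output above contains repeated descriptions), that assumption may not hold. The sketch below shows one possible alternative that excludes the query index explicitly; `top_k_similar` and the toy matrix are hypothetical names and values used only for illustration.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return the indices of the k rows most similar to row idx, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                        # never recommend the query item itself
    order = np.argsort(-scores, kind="stable")   # highest similarity first
    return order[:k].tolist()

# Toy 4x4 similarity matrix (hypothetical values); rows 0 and 1 are exact duplicates
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])

print(top_k_similar(toy_sim, idx=1, k=2))
```

For the query row 1, the duplicate row 0 is returned first and row 1 itself never appears, regardless of how ties are ordered.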

2. Create a Content-based filtering recommender system based on the Skills.


In [99]:
# Printing an overview of the first 5 skills
data['Skills'].head()

0    Drama  Comedy  peering  screenwriting  film  D...
1    Finance  business plan  persona (user experien...
2    chemistry  physics  Solar Energy  film  lambda...
3    accounts receivable  dupont analysis  analysis...
4    Data Analysis  select (sql)  database manageme...
Name: Skills, dtype: object

In [100]:
data['Skills'][0]

'Drama  Comedy  peering  screenwriting  film  Document Review  dialogue  creative writing  Writing  unix shells arts-and-humanities music-and-art'

In [101]:
# Checking for missing values
data.isnull().sum()

Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64

In [102]:
# Since there are no null values in the 'Skills' column, we can proceed with the same steps as before

#Construct the required TF-IDF matrix by fitting and transforming the data
Tfidf_matrix = TfidfVectorizer.fit_transform(data['Skills'])

#Output the shape of tfidf_matrix
Tfidf_matrix.shape

(3522, 4337)

In [103]:
Tfidf_matrix.toarray()[1, :]

array([0., 0., 0., ..., 0., 0., 0.])

In [104]:
Tfidf_matrix

<3522x4337 sparse matrix of type '<class 'numpy.float64'>'
	with 55616 stored elements in Compressed Sparse Row format>
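The default tokenizer splits multi-word skills such as "Python Programming" into separate words. As a possible variant, and only under the assumption that skills are delimited by two or more spaces (as the sample row above suggests for most entries), each skill could be kept as a single token. The helper `split_skills` and the toy strings are hypothetical.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def split_skills(text):
    """Split a skills string on runs of two or more whitespace characters (assumed delimiter)."""
    return [s.strip() for s in re.split(r"\s{2,}", text) if s.strip()]

# Toy skill strings (hypothetical) using the double-space delimiter
toy_skills = [
    "Python Programming  Databases  Data Analysis",
    "Data Analysis  select (sql)  Databases",
]

# token_pattern=None signals that the custom tokenizer replaces the default pattern
skill_vec = TfidfVectorizer(tokenizer=split_skills, token_pattern=None)
skill_matrix = skill_vec.fit_transform(toy_skills)

print(sorted(skill_vec.vocabulary_))
```

Note that the trailing category tags in the real data appear to be separated by single spaces, so they would end up merged with the last skill under this assumption.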

In [105]:
%%time
cosine_sim = cosine_similarity(Tfidf_matrix, Tfidf_matrix)

CPU times: total: 109 ms
Wall time: 142 ms


In [106]:
cosine_sim

array([[1.        , 0.        , 0.05204333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.20061523, 0.        ,
        0.01306076],
       [0.05204333, 0.        , 1.        , ..., 0.        , 0.1787157 ,
        0.00490933],
       ...,
       [0.        , 0.20061523, 0.        , ..., 1.        , 0.        ,
        0.03178263],
       [0.        , 0.        , 0.1787157 , ..., 0.        , 1.        ,
        0.00459616],
       [0.        , 0.01306076, 0.00490933, ..., 0.03178263, 0.00459616,
        1.        ]])

In [107]:
cosine_sim.shape

(3522, 3522)

In [108]:
cosine_sim[0,1]

0.0

In [109]:
cosine_sim[1,0]

0.0

In [110]:
# Constructing a reverse map from skills to row indices
indices = pd.Series(data.index, index=data['Skills']).drop_duplicates()

In [111]:
indices[0:10]

Skills
Drama  Comedy  peering  screenwriting  film  Document Review  dialogue  creative writing  Writing  unix shells arts-and-humanities music-and-art                                                                                               0
Finance  business plan  persona (user experience)  business model canvas  Planning  Business  project  Product Development  presentation  Strategy business business-strategy                                                                  1
chemistry  physics  Solar Energy  film  lambda calculus  Electrical Engineering  electronics  energy  silicon  thinning physical-science-and-engineering electrical-engineering                                                                2
accounts receivable  dupont analysis  analysis  Accounting  Finance  Operations Management  Leadership and Management  balance sheet  inventory  Financial Analysis business finance                                                           3
Data Analysis  select (sql)  

In [112]:
# Function to get recommendations based on the skills
def Get_recommendations(Skills, cosine_sim=cosine_sim):
    # Find all courses that contain the keyword
    matching_courses = data[data['Skills'].str.contains(Skills, case=False, na=False)]

    if matching_courses.empty:
        return f"No courses found with the keyword '{Skills}' in their skills."

    # Get the index of the first matching course
    idx = matching_courses.index[0]

    # Get the pairwise similarity scores of all courses with that course
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the courses based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar courses
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar courses
    return data['Skills'].iloc[course_indices]

In [113]:
# Getting recommendations for the first course whose skills contain 'Python'
recommendations = Get_recommendations('Python')
print(recommendations)

1656    basic programming language  language  Computer...
548     Python Programming  Databases  ipython  langua...
305     dict  Python Programming  python syntax and se...
1871    Problem Solving  python syntax and semantics  ...
460     syntax  language  semantics  Computer Programm...
2824    Programming Principles  principle  semantics  ...
3517    Databases  syntax  analysis  web  Data Visuali...
3242    syntax  interactive computing  social media ma...
1000    list comprehension  python syntax and semantic...
734     python syntax and semantics  computer program ...
Name: Skills, dtype: object


Use the "Book Recommendation Dataset", available at kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.


In [137]:
# Loading the data
ratings = pd.read_csv('Books_Ratings.csv')

ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [138]:
# Counting the ratings per 'User-ID' and sorting in descending order to find the users who rated the most books
ratings.groupby('User-ID').count().sort_values(by='Book-Rating', ascending=False).head()

Unnamed: 0_level_0,ISBN,Book-Rating
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1
11676,13602,13602
198711,7550,7550
153662,6109,6109
98391,5891,5891
35859,5850,5850


In [151]:
top_200_users = ratings['User-ID'].value_counts().head(200).index

top_200_users

Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352, 110973,
       235105,
       ...
        28204, 150124, 180651, 149908,  33974, 262998, 210035, 106225, 133747,
       206534],
      dtype='int64', name='User-ID', length=200)

In [149]:
# Filtering the data to get the top 200 users that rated the most books
top_200_users_ratings = ratings[ratings['User-ID'].isin(top_200_users)]

In [150]:
top_200_users_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
4330,278418,0006128831,0
4331,278418,0006542808,5
4332,278418,0020209606,0
4333,278418,0020418809,0
4334,278418,0020420900,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8
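The selection above can be wrapped in a small helper. This is a minimal sketch on a toy frame with hypothetical values; `top_n_raters` is an illustrative name, and only the column names are taken from `Ratings.csv`.

```python
import pandas as pd

# Toy ratings frame (hypothetical values) with the same column names as Ratings.csv
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "d", "b"],
    "Book-Rating": [0, 5, 7, 8, 0, 10],
})

def top_n_raters(ratings_df, n):
    """Keep only the rows belonging to the n users with the most ratings."""
    top_users = ratings_df["User-ID"].value_counts().head(n).index
    return ratings_df[ratings_df["User-ID"].isin(top_users)]

filtered = top_n_raters(toy_ratings, n=2)
print(filtered)
```

`value_counts()` already sorts users by their number of ratings in descending order, so `head(n)` picks the most active ones without a separate `groupby` and `sort_values`.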


4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [140]:
books = pd.read_csv('Books.csv')

books.head()



Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [152]:
# Creating a Collaborative Filtering recommender system based on the user ratings from 3 together with the books dataset

# Merging the dataframes
df = top_200_users_ratings.merge(books, on='ISBN', how='inner')

df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,278418,0006542808,5,Silence of the Sirens,Adel Garcia Morales,0,Flamingo,http://images.amazon.com/images/P/0006542808.0...,http://images.amazon.com/images/P/0006542808.0...,http://images.amazon.com/images/P/0006542808.0...
1,278418,0020209606,0,NEVER ALONE REISSUE,Phyllis Hobe,1987,Scribner Paper Fiction,http://images.amazon.com/images/P/0020209606.0...,http://images.amazon.com/images/P/0020209606.0...,http://images.amazon.com/images/P/0020209606.0...
2,278418,0020418809,0,CADDIE WOODLAWN,Carol Ryrie Brink,1970,Simon Pulse,http://images.amazon.com/images/P/0020418809.0...,http://images.amazon.com/images/P/0020418809.0...,http://images.amazon.com/images/P/0020418809.0...
3,278418,0020420900,0,Paul Revere : Boston Patriot (Childhood Of Fam...,Augusta Stevenson,1986,Aladdin,http://images.amazon.com/images/P/0020420900.0...,http://images.amazon.com/images/P/0020420900.0...,http://images.amazon.com/images/P/0020420900.0...
4,278418,002043300X,0,Big Snow,Berta Hader,1972,MacMillan Publishing Company.,http://images.amazon.com/images/P/002043300X.0...,http://images.amazon.com/images/P/002043300X.0...,http://images.amazon.com/images/P/002043300X.0...


In [153]:
df.shape

(270629, 10)

In [154]:
df['User-ID'].nunique()

200

In [155]:
df['Book-Title'].nunique()

115766

We see that the merged data contains 270629 ratings by 200 users, covering 115766 unique book titles.

Let us see how many ratings the most-rated books received.

In [156]:
rating_counts = pd.DataFrame(df["Book-Title"].value_counts())

rating_counts.head(10)

Unnamed: 0_level_0,count
Book-Title,Unnamed: 1_level_1
Bridget Jones's Diary,117
Wild Animus,100
The Pelican Brief,99
Message in a Bottle,97
The Notebook,93
A Time to Kill,91
The Firm,90
A Painted House,89
The Horse Whisperer,89
Divine Secrets of the Ya-Ya Sisterhood: A Novel,84


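Let's drop any rows with a missing rating and count the ratings per book again. The result still lists 115766 titles and the top counts are unchanged, which suggests that no ratings were missing.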

In [157]:
df[["Book-Title", "Book-Rating"]].dropna().drop(columns=["Book-Rating"]).value_counts()

Book-Title                    
Bridget Jones's Diary             117
Wild Animus                       100
The Pelican Brief                  99
Message in a Bottle                97
The Notebook                       93
                                 ... 
King David's Spaceship              1
King Charlie                        1
King Charles III: A Biography       1
King Bongo : A Novel of Havana      1
Ã?Â?stlich der Berge.               1
Name: count, Length: 115766, dtype: int64

In [167]:
user_book_df = df[["User-ID", "Book-Title", "Book-Rating"]]

In [168]:
user_book_df

Unnamed: 0,User-ID,Book-Title,Book-Rating
0,278418,Silence of the Sirens,5
1,278418,NEVER ALONE REISSUE,0
2,278418,CADDIE WOODLAWN,0
3,278418,Paul Revere : Boston Patriot (Childhood Of Fam...,0
4,278418,Big Snow,0
...,...,...,...
270624,275970,There's a Porcupine in My Outhouse: Misadventu...,0
270625,275970,Die Biene.,10
270626,275970,The Penis Book,0
270627,275970,Musashi,0


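We now pivot the ratings into a user-by-book matrix, with one row per user and one column per book title. Since `pivot_table` aggregates with the mean by default, repeated ratings of the same title by the same user are averaged.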

In [169]:
user_book_df = user_book_df.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating')

In [161]:
user_book_df

Book-Title,"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",Always Have Popsicles,Apple Magic (The Collector's series),Beyond IBM: Leadership Marketing and Finance for the 1990s,Dark Justice,Deceived,"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",Final Fantasy Anthology: Official Strategy Guide (Brady Games),Garfield Bigger and Better (Garfield (Numbered Paperback)),"Good Wives: Image and Reality in the Lives of Women in Northern New England, 1650-1750",...,whataboutrick.com: a poetic tribute to Richard A. Ricci,"Â¡Corre, perro, corre!",Â¡Cristina! confidencias de una rubia,Â¿Eres tu mi mamÃ¡?/Are You My Mother?,"Â¿QuÃ© me quieres, amor?","Ã?ber den Wunsch, sich wohlzufÃ¼hlen: Geschichten",Ã?Â?ber das Fernsehen.,Ã?Â?ber die Pflicht zum Ungehorsam gegen den Staat.,Ã?Â?lpiraten.,Ã?Â?stlich der Berge.
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363,,,,,,,,,,,...,,,,,,,,,,
6251,,,,,,,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,
7346,,,,,,,,,,,...,,,,,,,,,,
11601,,,,0.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284,,,,,,,,,,,...,,,,,,,,,,
274061,,,,,,,,,,,...,,,,,,,,,,
274308,,,,,,,,,,,...,,,,,,,,,,
275970,,,,,,,,,,,...,,,,,,,,,,


In [170]:
user_book_df.shape

(200, 115766)
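To illustrate how the pivot handles repeated and missing ratings, here is a small sketch on hypothetical toy data. User 1 rates "1984" twice (for example under two ISBNs that share a title), and user 2 never rates "Years".

```python
import pandas as pd

# Toy long-format ratings (hypothetical values)
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2],
    "Book-Title": ["1984", "1984", "Years", "1984"],
    "Book-Rating": [2, 4, 9, 8],
})

# Duplicate (user, title) pairs are averaged; unrated combinations become NaN
matrix = toy.pivot_table(index="User-ID", columns="Book-Title", values="Book-Rating")
print(matrix)
```

The two ratings of "1984" by user 1 collapse into their mean, which is why fractional ratings can appear in the pivoted matrix even though the raw ratings are integers.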

In [171]:
import numpy as np

random_user = np.array(user_book_df.sample(random_state=42).index)[0]
random_user

143175

In [172]:
random_user_df = user_book_df[user_book_df.index == random_user]
random_user_df

Book-Title,"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",Always Have Popsicles,Apple Magic (The Collector's series),Beyond IBM: Leadership Marketing and Finance for the 1990s,Dark Justice,Deceived,"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",Final Fantasy Anthology: Official Strategy Guide (Brady Games),Garfield Bigger and Better (Garfield (Numbered Paperback)),"Good Wives: Image and Reality in the Lives of Women in Northern New England, 1650-1750",...,whataboutrick.com: a poetic tribute to Richard A. Ricci,"Â¡Corre, perro, corre!",Â¡Cristina! confidencias de una rubia,Â¿Eres tu mi mamÃ¡?/Are You My Mother?,"Â¿QuÃ© me quieres, amor?","Ã?ber den Wunsch, sich wohlzufÃ¼hlen: Geschichten",Ã?Â?ber das Fernsehen.,Ã?Â?ber die Pflicht zum Ungehorsam gegen den Staat.,Ã?Â?lpiraten.,Ã?Â?stlich der Berge.
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
143175,,,,,,,,,,,...,,,,,,,,,,


In [173]:
random_user_books_read = random_user_df.columns[random_user_df.notna().any()].tolist()
random_user_books_read

['1812 (The American Story)',
 '1984',
 '1st to Die',
 '1st to Die: A Novel',
 '3rd Degree',
 '4 Blondes',
 'A Colton Family Christmas',
 'A Day Late and Dollar Short',
 'A Five-Year Plan',
 'A Knife to Remember (Jane Jeffry Mysteries (Paperback))',
 'A Knight in Shining Armor',
 'A Lady of Property',
 'A Long Way from Chicago',
 'A Love to Remember',
 'A Man in Full',
 'A Painted House',
 'A Perfect Fit (Time of Your Life)',
 'A Season Of Miracles',
 'A THOUSAND ACRES (MOVIE TIE-IN REISSUE)  CASSETTE',
 'A Very Long Engagement',
 'A Walk in the Woods: Rediscovering America on the Appalachian Trail',
 'A Wizard of Earthsea (Earthsea Trilogy, Book 1)',
 'ANGELS FROM HELL : ALMOST TOTALLY TRUE TALES OF INFERNAL, AND OTHERWISE INEXPLICABLE, INTERVENTION',
 "Adam's Fall",
 'After Sundown',
 'Against Her Will',
 'Air Battle Force (Brown, Dale)',
 'All But The Queen Of Hearts (Harlequin Historical Romances, No 369)',
 'All Creatures Great and Small',
 'All Mothers Work: A Guilt Free Guide fo

In [174]:
len(random_user_books_read)

697

In [175]:
books_read_df = user_book_df[random_user_books_read]

In [176]:
books_read_df

Book-Title,1812 (The American Story),1984,1st to Die,1st to Die: A Novel,3rd Degree,4 Blondes,A Colton Family Christmas,A Day Late and Dollar Short,A Five-Year Plan,A Knife to Remember (Jane Jeffry Mysteries (Paperback)),...,Wild Hearts,Wild Swan (Wild Swan),Wizard War,"Woman With A Mystery (Harlequin Intrigue, No. 643)",Women &amp; Love : Finding True Love While Staying True to Yourself: The Eight Make-Or-Break Experiences in Women's Lives,Women on Men,Word of Honor,Worst Fears Realized,Years,Yuletide Brides (2 Novels in 1)
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3363,,,,,,,,,,,...,,,,,,,,,,
6251,,,,,,0.0,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,
7346,,8.0,,,,,,,,,...,,,,,,,,,,
11601,,,,,,,,,,,...,,,,,,,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284,,,,,,,,,,,...,,,,,,,,,,
274061,,,,,,,,,,,...,,,,,,,,,,
274308,,,,,0.0,,,,,,...,,,,,,,,0.0,,
275970,,0.0,,,,,,,,,...,,,,,,,,,,


In [177]:
books_read_df.shape

(200, 697)

We now want to count, for each user, how many of the books read by the random user they have also rated.

In [178]:
user_books_read = books_read_df.T.notnull().sum()

In [179]:
user_books_read

User-ID
3363      26
6251      30
6575      30
7346      40
11601     39
          ..
271284    43
274061    14
274308    44
275970    19
278418    50
Length: 200, dtype: int64
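The counting step can be checked on a toy matrix. This sketch uses hypothetical values and names (`toy_matrix`, `target_user`); it counts non-missing entries per row, which should be equivalent to transposing and summing as done above.

```python
import numpy as np
import pandas as pd

# Toy user-by-book matrix (hypothetical); NaN means "not rated"
toy_matrix = pd.DataFrame(
    {"A": [5.0, np.nan, 0.0], "B": [np.nan, np.nan, 7.0], "C": [3.0, 8.0, np.nan]},
    index=pd.Index([10, 20, 30], name="User-ID"),
)

target_user = 10
target_books = toy_matrix.columns[toy_matrix.loc[target_user].notna()].tolist()

# Number of the target user's books that each user has also rated
overlap = toy_matrix[target_books].notna().sum(axis=1)
print(overlap)
```

A rating of 0 still counts as "rated" here, because only missing values are excluded.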

In [180]:
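user_books_read = user_books_read.reset_index()
# Note: despite the names, 'Book-Title' holds the User-ID and 'Number of Users' holds
# the number of shared books; the filtering cells that follow rely on these names.
user_books_read.columns = ['Book-Title', 'Number of Users']
user_books_read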

Unnamed: 0,Book-Title,Number of Users
0,3363,26
1,6251,30
2,6575,30
3,7346,40
4,11601,39
...,...,...
195,271284,43
196,274061,14
197,274308,44
198,275970,19


We now select the users who have rated more than 70% of the books the random user has rated.

In [189]:
user_same_books = user_books_read[user_books_read['Number of Users'] > (len(random_user_books_read)*70)/100]["Book-Title"]
user_same_books

95    143175
Name: Book-Title, dtype: int64

Since only one user meets this threshold, and it is the random user themselves (143175), I will lower the percentage.

In [193]:
user_same_books_20 = user_books_read[user_books_read['Number of Users'] > (len(random_user_books_read)*20)/100]["Book-Title"]
user_same_books_20

5      11676
95    143175
Name: Book-Title, dtype: int64

That adds only one other user (11676), so I will lower the threshold again.

In [196]:
user_same_books_5 = user_books_read[user_books_read['Number of Users'] > (len(random_user_books_read)*5)/100]["Book-Title"]
user_same_books_5

3        7346
4       11601
5       11676
7       13552
10      16795
        ...  
192    265313
194    269566
195    271284
197    274308
199    278418
Name: Book-Title, Length: 95, dtype: int64
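The three thresholds tried above differ only in the percentage, so the filter can be expressed as one function. This is a sketch with hypothetical names (`users_above_threshold`, `toy_overlap`) and made-up counts, not the notebook's actual data.

```python
import pandas as pd

def users_above_threshold(overlap_counts, n_target_books, pct):
    """Return the user IDs whose overlap exceeds pct percent of the target's books."""
    cutoff = n_target_books * pct / 100
    return overlap_counts[overlap_counts > cutoff].index.tolist()

# Toy overlap counts (hypothetical), indexed by user ID; user 10 is the target user
toy_overlap = pd.Series({10: 100, 20: 25, 30: 6, 40: 3})

for pct in (70, 20, 5):
    print(pct, users_above_threshold(toy_overlap, n_target_books=100, pct=pct))
```

Lowering the percentage admits more neighbours, at the cost of basing each correlation on fewer shared books.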

Now that I am satisfied with the number of users (95), I can create a data frame with the ratings of only those users.

In [198]:
final_df = books_read_df[books_read_df.index.isin(user_same_books_5)]
final_df

Book-Title,1812 (The American Story),1984,1st to Die,1st to Die: A Novel,3rd Degree,4 Blondes,A Colton Family Christmas,A Day Late and Dollar Short,A Five-Year Plan,A Knife to Remember (Jane Jeffry Mysteries (Paperback)),...,Wild Hearts,Wild Swan (Wild Swan),Wizard War,"Woman With A Mystery (Harlequin Intrigue, No. 643)",Women &amp; Love : Finding True Love While Staying True to Yourself: The Eight Make-Or-Break Experiences in Women's Lives,Women on Men,Word of Honor,Worst Fears Realized,Years,Yuletide Brides (2 Novels in 1)
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7346,,8.000000,,,,,,,,,...,,,,,,,,,,
11601,,,,,,,,,,,...,,,,,,,,,,0.0
11676,,3.333333,,9.0,4.0,,,,,,...,,,,0.0,,,,0.0,,8.0
13552,,,,,,,,,,,...,,,,,,,,,,
16795,,8.000000,,9.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265313,,,,,,,,,,,...,,,,,,,,,,
269566,,,,,,,,,,,...,,,,,,,,,,
271284,,,,,,,,,,,...,,,,,,,,,,
274308,,,,,0.0,,,,,,...,,,,,,,,0.0,,


We can now calculate the pairwise Pearson correlation between these users, based on the books each pair has both rated.

In [199]:
corr_df = final_df.T.corr()
corr_df

User-ID,7346,11601,11676,13552,16795,21014,23768,25981,26544,31315,...,245963,246655,254465,261829,265115,265313,269566,271284,274308,278418
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7346,1.000000,,0.231382,-0.355523,-0.077045,0.046638,-0.267580,1.000000,,-0.148606,...,,,-0.038303,-0.424176,0.487762,,0.173153,,-0.389176,
11601,,,,,,,,,,,...,,,,,,,,,,
11676,0.231382,,1.000000,-0.077059,0.164093,-0.062206,-0.005900,0.101114,-0.296127,-0.062072,...,,,0.181113,-0.155620,0.024025,0.082451,0.148402,0.132622,0.116247,
13552,-0.355523,,-0.077059,1.000000,0.157841,-0.188777,0.380261,0.400000,,0.034668,...,,,-0.069151,,-0.092476,,-0.277464,0.490881,-0.321324,
16795,-0.077045,,0.164093,0.157841,1.000000,0.132341,0.233715,-0.287895,-0.308650,0.119999,...,,,0.245150,0.398834,0.028906,0.078361,-0.156706,-0.222817,0.369756,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265313,,,0.082451,,0.078361,-0.215166,-0.062500,-0.455983,-0.112839,-0.401116,...,,,-0.288675,0.129069,0.194320,1.000000,-0.120370,,-0.152349,
269566,0.173153,,0.148402,-0.277464,-0.156706,0.246289,,,,,...,,,,-0.310647,,-0.120370,1.000000,,,
271284,,,0.132622,0.490881,-0.222817,,,,,,...,,,,,,,,1.000000,-0.083333,
274308,-0.389176,,0.116247,-0.321324,0.369756,,,,-0.120281,0.347180,...,,,-0.239046,0.354857,0.128596,-0.152349,,-0.083333,1.000000,
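
To make the step above concrete, here is a minimal sketch on a made-up user-by-book matrix (the frame `toy_df` and its values are hypothetical, not taken from the dataset). `DataFrame.corr()` computes the Pearson correlation for each pair of columns using only the rows where both are present, so transposing first gives user-to-user correlations over the books both users rated. With only two books in common, the correlation can only be 1 or -1, which may explain exact values such as the 1.000000 between users 7346 and 25981 above. The `min_periods` argument is one way to require a minimum overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-by-book matrix (rows: users, columns: books); NaN = not rated
toy_df = pd.DataFrame(
    {
        "Book A": [8.0, 7.0, 2.0],
        "Book B": [6.0, 5.0, 9.0],
        "Book C": [9.0, np.nan, 1.0],
        "Book D": [np.nan, 3.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="User-ID"),
)

# Transpose so that users become columns, then correlate the columns pairwise
toy_corr = toy_df.T.corr()

# Users 1 and 2 share only two books, so their correlation is exactly +1 or -1
print(toy_corr.loc[1, 2])

# Requiring at least three co-rated books turns such fragile pairs into NaN
strict_corr = toy_df.T.corr(min_periods=3)
print(strict_corr)
```

A pair also comes out as NaN when one of the two users gave the same rating to every co-rated book, since the variance is then zero. That is one possible reason for all-NaN rows such as user 11601 above.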


In [204]:
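# Take the input user's column of the correlation matrix as a two-column dataframe
user_corr = corr_df[random_user].reset_index()
user_corr = user_corr.rename(columns={random_user: 'Correlation'})
# Rank the other users by similarity and drop the input user itself
user_corr = user_corr.sort_values('Correlation', ascending=False)
user_corr = user_corr.loc[user_corr["User-ID"] != random_user]
user_corr = user_corr.reset_index(drop=True)
user_corr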

,User-ID,Correlation
0,190925,0.378718
1,73394,0.347483
2,106225,0.309498
3,98391,0.304480
4,217375,0.273498
...,...,...
89,228998,
90,242824,
91,245963,
92,246655,


Next, we merge these correlations with every rating given by those users.

In [205]:
# Merge it with all the ratings of the users
top_users_rating = user_corr.merge(ratings[['User-ID', 'ISBN', 'Book-Rating']], how='inner')
top_users_rating

,User-ID,Correlation,ISBN,Book-Rating
0,190925,0.378718,*0515128325,0
1,190925,0.378718,0006473369,0
2,190925,0.378718,0006512143,7
3,190925,0.378718,0007146078,0
4,190925,0.378718,0020258801,0
...,...,...,...,...
183792,278418,,5008601364,0
183793,278418,,5008602064,0
183794,278418,,528826859,0
183795,278418,,684124645,0
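
As a small illustration of this merge, the sketch below uses two hypothetical frames, `toy_corr` and `toy_ratings`. When `on` is not given, pandas joins on the columns the two frames share, which here is only `User-ID`. The inner join drops ratings from users that have no row in the correlation table.

```python
import pandas as pd

# Hypothetical correlation table for two similar users
toy_corr = pd.DataFrame({"User-ID": [10, 20], "Correlation": [0.5, -0.2]})

# Hypothetical ratings, including a user (30) with no correlation entry
toy_ratings = pd.DataFrame(
    {
        "User-ID": [10, 10, 20, 30],
        "ISBN": ["A", "B", "A", "C"],
        "Book-Rating": [8, 0, 6, 9],
    }
)

# Without `on`, the join key is the shared column 'User-ID'
implicit_key = toy_corr.merge(toy_ratings, how="inner")

# Naming the key explicitly gives the same result and is easier to read
explicit_key = toy_corr.merge(toy_ratings, on="User-ID", how="inner")

print(explicit_key)
```

Each rating row now carries the correlation of the user who gave it, which is what the weighting step needs.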


We can now create ratings that are weighted by each user's correlation with the input user.

In [206]:
top_users_rating["Weighted Rating"] = top_users_rating["Correlation"] * top_users_rating["Book-Rating"]
top_users_rating

,User-ID,Correlation,ISBN,Book-Rating,Weighted Rating
0,190925,0.378718,*0515128325,0,0.000000
1,190925,0.378718,0006473369,0,0.000000
2,190925,0.378718,0006512143,7,2.651027
3,190925,0.378718,0007146078,0,0.000000
4,190925,0.378718,0020258801,0,0.000000
...,...,...,...,...,...
183792,278418,,5008601364,0,
183793,278418,,5008602064,0,
183794,278418,,528826859,0,
183795,278418,,684124645,0,
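
The sketch below repeats the weighting on a hypothetical frame `toy` to show two things visible in the output above: a missing correlation makes the weighted rating NaN, and a rating of 0 always gives a weighted rating of 0. In the Book-Crossing data behind this dataset, a rating of 0 is usually described as an implicit interaction rather than a score of zero, so filtering those rows out is an optional variation, not part of the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: one row per (similar user, book) rating
toy = pd.DataFrame(
    {
        "User-ID": [10, 10, 20, 30],
        "Correlation": [0.5, 0.5, -0.2, np.nan],
        "ISBN": ["A", "B", "A", "C"],
        "Book-Rating": [8, 0, 6, 9],
    }
)

# Element-wise product: similar users count more, dissimilar users count negatively
toy["Weighted Rating"] = toy["Correlation"] * toy["Book-Rating"]

# Rows whose correlation is NaN carry no usable information
usable = toy.dropna(subset=["Weighted Rating"])

# Optional variation (an assumption, not in the notebook): keep explicit ratings only
explicit_only = usable[usable["Book-Rating"] > 0]

print(toy)
```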


For each book, we can now take the average of the weighted ratings to get a final rating for all the books.

In [207]:
recommendation_df = top_users_rating.groupby('ISBN').agg({"Weighted Rating": "mean"}).sort_values(by='Weighted Rating', ascending=False)
recommendation_df = recommendation_df.reset_index()
recommendation_df

ISBN,Weighted Rating
0553575325,3.787181
0736413057,3.474826
0451166906,3.408463
0807072036,3.408463
0449224627,3.408463
...,...
9684320949,
970690042X,
B00005VY5H,
B00005W4C1,
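
Note that the top score above, 3.787181, is about 10 × 0.378718, which suggests it comes from a single rating of 10 by the most similar user. A plain mean lets such single-rater books reach the top. The sketch below, on a hypothetical frame `toy`, shows the same aggregation together with a rater count. The threshold `min_raters` is an illustrative assumption and is not used in the notebook.

```python
import pandas as pd

# Hypothetical weighted ratings: book B was rated by one similar user only
toy = pd.DataFrame(
    {
        "ISBN": ["A", "A", "B", "C", "C", "C"],
        "Weighted Rating": [4.0, 2.0, 5.0, 3.0, 3.5, 4.0],
    }
)

# Mean weighted rating per book, plus how many non-missing ratings it is based on
summary = toy.groupby("ISBN")["Weighted Rating"].agg(["mean", "count"])

# Hypothetical threshold: require at least two similar users per book
min_raters = 2
ranked = summary[summary["count"] >= min_raters].sort_values("mean", ascending=False)

print(summary)
print(ranked)
```

Without the threshold, book B ranks first on the strength of one rating. With it, the ranking only contains books that several similar users have rated.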


In [208]:
# Look up the title of each recommended ISBN and keep the top five
books_to_be_recommended = recommendation_df.merge(books[['ISBN', 'Book-Title']], on='ISBN').drop(columns='ISBN')
books_to_be_recommended = books_to_be_recommended.head()
books_to_be_recommended

,Weighted Rating,Book-Title
0,3.787181,Simple Justice: A Benjamin Justice Mystery (Be...
1,3.474826,Pooh Loves You (Disney Winnie the Pooh)
2,3.408463,Hard Candy
3,3.408463,Girl in the Mirror: Three Generations of Black...
4,3.408463,Dress Her in Indigo


We can now combine these steps into a single recommender function.

In [209]:
def user_based_recommender(input_user, user_book_df, rate_ratio=0.05, num_recommendations=5):
    # Note: this function relies on the notebook-level `ratings` and `books` dataframes

    # List the books the input user has rated
    input_user_df = user_book_df[user_book_df.index == input_user]
    input_user_books_read = input_user_df.columns[input_user_df.notna().any()].tolist()

    # Restrict the user-book matrix to the books the input user has rated
    books_read_df = user_book_df[input_user_books_read]

    # For each user, count how many of the input user's books they have also rated
    user_books_read = books_read_df.T.notnull().sum()
    user_books_read = user_books_read.reset_index()
    user_books_read.columns = ['User-ID', 'Books in Common']

    # Keep the users who share more than `rate_ratio` of the input user's books
    min_books_in_common = len(input_user_books_read) * rate_ratio
    similar_users = user_books_read[user_books_read['Books in Common'] > min_books_in_common]['User-ID']

    # Create the correlation matrix between the similar users
    final_df = books_read_df[books_read_df.index.isin(similar_users)]
    corr_df = final_df.T.corr()

    # Rank the other users by their correlation with the input user
    user_corr = corr_df[input_user].reset_index()
    user_corr = user_corr.rename(columns={input_user: 'Correlation'})
    user_corr = user_corr.sort_values('Correlation', ascending=False)
    user_corr = user_corr.loc[user_corr["User-ID"] != input_user]
    user_corr = user_corr.reset_index(drop=True)

    # Create the correlation-weighted ratings
    top_users_rating = user_corr.merge(ratings[['User-ID', 'ISBN', 'Book-Rating']], how='inner')
    top_users_rating["Weighted Rating"] = top_users_rating["Correlation"] * top_users_rating["Book-Rating"]

    # Average the weighted ratings per book and sort from best to worst
    recommendation_df = top_users_rating.groupby('ISBN').agg({"Weighted Rating": "mean"}).sort_values(by='Weighted Rating', ascending=False)
    recommendation_df = recommendation_df.reset_index()

    # Look up the titles and keep the requested number of recommendations
    books_to_be_recommended = recommendation_df.merge(books[['ISBN', 'Book-Title']], on='ISBN').drop(columns='ISBN')
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    return books_to_be_recommended["Book-Title"]
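
The `rate_ratio` threshold in the function can be illustrated on a small made-up matrix. In this sketch `toy_user_book` and its values are hypothetical, and `rate_ratio` is set to 0.5 rather than the default 0.05 so that the filter has a visible effect. A user is kept only if they rated more than `rate_ratio` times the number of books the input user has rated.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-book matrix; NaN = not rated
toy_user_book = pd.DataFrame(
    {
        "Book A": [8.0, 7.0, np.nan],
        "Book B": [6.0, np.nan, np.nan],
        "Book C": [9.0, 5.0, np.nan],
        "Book D": [np.nan, 3.0, 4.0],
    },
    index=pd.Index([1, 2, 3], name="User-ID"),
)

input_user = 1
rate_ratio = 0.5  # illustrative value, larger than the function's default

# Books the input user has rated
books_read = toy_user_book.columns[toy_user_book.loc[input_user].notna()].tolist()

# Per user: how many of those books they have rated as well
overlap = toy_user_book[books_read].notna().sum(axis=1)

# Keep users above the threshold (the input user always passes if they rated anything)
similar_users = overlap[overlap > len(books_read) * rate_ratio].index.tolist()

print(overlap)
print(similar_users)
```

User 3 shares no books with the input user and is dropped, while user 2 shares two of three and is kept.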

In [211]:
user_book_df

User-ID,"A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)",Always Have Popsicles,Apple Magic (The Collector's series),Beyond IBM: Leadership Marketing and Finance for the 1990s,Dark Justice,Deceived,"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",Final Fantasy Anthology: Official Strategy Guide (Brady Games),Garfield Bigger and Better (Garfield (Numbered Paperback)),"Good Wives: Image and Reality in the Lives of Women in Northern New England, 1650-1750",...,whataboutrick.com: a poetic tribute to Richard A. Ricci,"¡Corre, perro, corre!",¡Cristina! confidencias de una rubia,¿Eres tu mi mamá?/Are You My Mother?,"¿Qué me quieres, amor?","Über den Wunsch, sich wohlzufühlen: Geschichten",Über das Fernsehen.,Über die Pflicht zum Ungehorsam gegen den Staat.,Ölpiraten.,Östlich der Berge.
3363,,,,,,,,,,,...,,,,,,,,,,
6251,,,,,,,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,
7346,,,,,,,,,,,...,,,,,,,,,,
11601,,,,0.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284,,,,,,,,,,,...,,,,,,,,,,
274061,,,,,,,,,,,...,,,,,,,,,,
274308,,,,,,,,,,,...,,,,,,,,,,
275970,,,,,,,,,,,...,,,,,,,,,,


In [212]:
user_based_recommender(random_user, user_book_df)

0    Simple Justice: A Benjamin Justice Mystery (Be...
1              Pooh Loves You (Disney Winnie the Pooh)
2                                           Hard Candy
3    Girl in the Mirror: Three Generations of Black...
4                                  Dress Her in Indigo
Name: Book-Title, dtype: object