<a href="https://colab.research.google.com/github/amanichivilkar/Books-Recommendation-System/blob/main/Amani_Chivilkar_Content_Based_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many similar
web services, recommender systems have taken an ever larger place in our lives. From
e-commerce (suggesting articles that could interest buyers) to online advertising
(suggesting content that matches users' preferences), recommender systems are
now hard to avoid in our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant
items to users (movies to watch, text to read, products to buy, or anything else,
depending on the industry).
Recommender systems are critical in some industries because, when they work well, they can
generate a large amount of income or help a company stand out from its
competitors. The main objective here is to build a book recommendation system for users.

 **Content**
--------------
The Book-Crossing dataset comprises 3 files.
*  Users
---
Contains the users. Note that user IDs (User-ID) have been anonymized and map to
integers. Demographic data is provided (Location, Age) if available. Otherwise, these
fields contain NULL values. 
*  Books
---------
Books are identified by their respective ISBN. Invalid ISBNs have already been removed
from the dataset. Moreover, some content-based information is given (Book-Title,
Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web
Services. Note that in the case of several authors, only the first is provided. URLs linking
to cover images are also given, appearing in three different flavors (Image-URL-S,
Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.
*   Ratings
------------
Contains the book rating information. Ratings (Book-Rating) are either explicit,
expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit,
expressed by 0.
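
Because a rating of 0 marks an implicit interaction rather than a low score, it can help to look at the two kinds of rating separately. The following is a minimal sketch on a toy frame; the column names mirror the renamed columns used later in this notebook, and the values are made up for illustration.

```python
import pandas as pd

# Toy ratings frame (hypothetical values, not taken from the dataset)
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "a", "c", "b"],
    "rating": [0, 8, 5, 0, 10],
})

# Explicit ratings lie on the 1-10 scale; 0 marks an implicit interaction
explicit = toy_ratings[toy_ratings["rating"] > 0]
implicit = toy_ratings[toy_ratings["rating"] == 0]

print(len(explicit), len(implicit))
print(explicit["rating"].mean())
```

Averaging over all rows would treat the implicit zeros as very low scores, which is probably not what the scale intends.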

## **Import data and libraries**

In [None]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
a = np.random.randn(12288, 150)
a.shape

(12288, 150)

In [None]:
b = np.random.randn(150, 45) 

In [None]:
c=np.dot(a,b)
c.shape

(12288, 45)

In [None]:
books = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Books.csv')
ratings = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Ratings.csv')
user = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Users.csv')

In [None]:
books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)
# Rename only the columns that exist in the books dataframe
books.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)
print(len(books))
books.head(2)

271360


Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [None]:
# Working on books dataframe
books[(books['title']=='Selected Poems')].head()

Unnamed: 0,ISBN,title,author,year,publisher
4523,081120958X,Selected Poems,William Carlos Williams,1985,New Directions Publishing Corporation
39416,0811201465,Selected Poems,K. Patchen,1957,New Directions Publishing Corporation
41316,0679750800,Selected Poems,Rita Dove,1993,Vintage Books USA
106885,0060931744,Selected Poems,Gwendolyn Brooks,1999,Perennial
118775,0517101548,Selected Poems,John Donne,1994,Gramercy Books


*  Since the same title can have different authors, we combine the title with the author to distinguish books that share a name.

*  Since the same book can have several ISBNs, we cannot use the ISBN as an identifier. Instead, we combine the title and author and create a book_id for each unique title.
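
One caveat worth keeping in mind: Python's built-in `hash` for strings is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so ids produced with it can change between sessions. The sketch below shows one possible alternative based on `hashlib`; the helper name `stable_book_id` is hypothetical and is not part of this notebook.

```python
import hashlib

def stable_book_id(title_with_author: str) -> int:
    """Map a 'title author' string to an integer that is stable across runs."""
    digest = hashlib.sha1(title_with_author.encode("utf-8")).digest()
    # Take the first 8 bytes and shift right by one bit so the value
    # fits in a signed 64-bit integer (useful for pandas int64 columns)
    return int.from_bytes(digest[:8], "big") >> 1

id_a = stable_book_id("Selected Poems Rita Dove")
id_b = stable_book_id("Selected Poems John Donne")
id_c = stable_book_id("Selected Poems Rita Dove")
print(id_a == id_c, id_a != id_b)
```

Under this assumption, the same title and author always map to the same id, so saved profiles or recommendations stay valid after a kernel restart.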

In [None]:
# Combine title with author to differentiate between books with the same title
books['title']=books['title'] + " " + books['author']

In [None]:
# Generating book_id for each unique book title
books['book_id'] = books[['title']].sum(axis=1).map(hash)

In [None]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher,book_id
0,195153448,Classical Mythology Mark P. O. Morford,Mark P. O. Morford,2002,Oxford University Press,-7293735737548367065
1,2005018,Clara Callan Richard Bruce Wright,Richard Bruce Wright,2001,HarperFlamingo Canada,7149617095307170465
2,60973129,Decision in Normandy Carlo D'Este,Carlo D'Este,1991,HarperPerennial,4608496659329701630
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,1160329172564954229
4,393045218,The Mummies of Urumchi E. J. W. Barber,E. J. W. Barber,1999,W. W. Norton &amp; Company,-8279562061523982204


In [None]:
print(len(ratings))
ratings.rename(columns={'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)
ratings.head(2)

1149780


Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5


## **Data Preprocessing**

In [None]:
# Select the users who have given more than 50 ratings
df1=ratings.groupby(['user_id'])['rating'].count().reset_index()
list_of_imp_user=list(df1[df1['rating']>50]['user_id'])
len(list_of_imp_user)

3371

In [None]:
# Keep only the ratings of users who have given more than 50 ratings
ratings=ratings[ratings['user_id'].isin(list_of_imp_user)]
print(len(ratings))
ratings.head()

765672


Unnamed: 0,user_id,ISBN,rating
173,276847,446364193,0
174,276847,3257200552,5
175,276847,3379015180,0
176,276847,3404145909,8
177,276847,3404148576,8


In [None]:
# merge ratings with books
df=ratings.merge(books, on='ISBN')
print(len(df))
df.head()

700848


Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id
0,276847,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
1,278418,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
2,5483,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
3,7346,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
4,8362,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760


In [None]:
print(f"unique title = {df['title'].nunique()}")
print(f"unique ISBN = {df['ISBN'].nunique()}")

unique title = 206071
unique ISBN = 221678


## **Final DataFrame**

In [None]:
# Count each title once per user: drop rows where the same user has rated the same title more than once (e.g. under different ISBNs)
df.drop_duplicates(['user_id','title'], inplace=True)
print(len(df))
df.head()

697534


Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id
0,276847,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
1,278418,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
2,5483,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
3,7346,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
4,8362,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760


In [None]:
df.dropna(inplace=True)

## **Train Test Split**

*  **Before splitting df we need to remove every user_id that occurs only once, or else stratifying on user_id in train_test_split will raise an error**
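
The filtering step can be illustrated on a toy frame. This is a sketch under assumed data (users `A`, `B`, `C` are made up); it drops the user that appears only once and then performs a stratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy interactions: user C appears only once and cannot be stratified
toy = pd.DataFrame({
    "user_id": ["A"] * 4 + ["B"] * 4 + ["C"],
    "book_id": list(range(9)),
})

# Keep only users that appear at least twice
counts = toy["user_id"].value_counts()
keep = counts[counts >= 2].index
filtered = toy[toy["user_id"].isin(keep)]

# Stratifying on user_id keeps every remaining user in both parts
train_part, test_part = train_test_split(
    filtered, stratify=filtered["user_id"], test_size=0.5, random_state=42
)
print(test_part["user_id"].value_counts().to_dict())
```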

In [None]:
# Count interactions per user and collect the users that occur only once
x=df['user_id'].value_counts()
value1=list(x[x==1].index)

In [None]:
df=df[~df['user_id'].isin(value1)]
df['user_id'].value_counts().tail(2)

152464    3
76440     2
Name: user_id, dtype: int64

In [None]:
# TrainTest split df
train_df, test_df = train_test_split(df, stratify=df['user_id'] ,test_size=0.20, random_state=42)

In [None]:
# Indexing the User-ID to speed up the searches during evaluation
df = df.set_index('user_id')
train_df = train_df.set_index('user_id')
test_df = test_df.set_index('user_id')

## **You need to study how the TF-IDF vectorizer works in order to get better accuracy**

*  ISBN=014028009X title=Bridget Jones's Diary	year=1999	
*  ISBN=0330375253 title=Bridget Jones's Diary	year=2001
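
To see why the two editions above are indistinguishable to a title-based model, here is a small sketch with `TfidfVectorizer`. The third title is only there for contrast, and the vectorizer settings are simplified assumptions rather than the ones used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Bridget Jones's Diary",   # edition with year=1999
    "Bridget Jones's Diary",   # edition with year=2001
    "Classical Mythology",     # unrelated title for contrast
]

# Unigrams and bigrams, as in the notebook; other parameters left at defaults
toy_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
matrix = toy_vectorizer.fit_transform(titles)

# Similarity of the first title to all three titles
sims = cosine_similarity(matrix[0], matrix)
print(sims.round(3))
```

The two editions get identical vectors because the year and ISBN never enter the text that is vectorized, while titles without shared tokens have zero similarity.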

## **Vectorize the title**

In [None]:
# Get all the stop words
stopword_list=stopwords.words('english') 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1, 2) , min_df=0.003, max_df=0.5,
                     max_features=5000 , stop_words=stopword_list)

In [None]:
#  Drop NaN values before vectorizing title and author
books.dropna(inplace=True)

In [None]:
vectorized_matrix=vectorizer.fit_transform(books['title'])
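# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
vectorized_matrix_feature_name=vectorizer.get_feature_names_out()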

In [None]:
# Put all the book ids in a list
books_ids=books['book_id'].tolist()
books_ids

In [None]:
def get_item_profile(item_id):
    idx = books_ids.index(item_id)
    item_profile = vectorized_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_users_profile(person_id, interactions_indexed_df):
    interactions_person_df = interactions_indexed_df.loc[person_id]
    user_item_profiles = get_item_profiles(interactions_person_df['book_id'])
    
    user_item_strengths = np.array(interactions_person_df['rating']).reshape(-1,1)
    
    # Weighted average of item profiles by the interactions strength
    user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) /np.sum(user_item_strengths)
    #user_profile_norm = normalize(user_item_strengths_weighted_avg)
    return user_item_strengths_weighted_avg

def build_users_profiles(): 
    interactions_indexed_df = df[df['book_id'].isin(books['book_id'])]
    user_profiles = {}
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles

In [None]:
book_id= -7293735737548367065

In [None]:
for i in [book_id]:
      idx = books_ids.index(i)
      item_profile = vectorized_matrix[idx:idx+1]
      print(item_profile)

  (0, 136)	1.0


In [None]:
item_profiles_list = [get_item_profile(x) for x in books_ids[0:5]]
item_profiles = scipy.sparse.vstack(item_profiles_list)
print(item_profiles) 

  (0, 136)	1.0
  (1, 170)	1.0
  (3, 82)	0.7417688707518302
  (3, 191)	0.6706556063909066


In [None]:
user_item_strengths=np.array(df['rating'][0:5]).reshape(-1,1)
user_item_strengths

array([[0],
       [0],
       [0],
       [0],
       [0]])

In [None]:
np.sum(user_item_strengths)

0

In [None]:
user_profiles = build_users_profiles()
len(user_profiles)

3369

*  user_profiles is a dictionary with user ids as keys and user profiles as values
-------
{ -9223121837663643404 : array( [ [0.00679228, 0.01231635, 0.        , ..., 0.        , 0.        ,
         0.        ] ] ) , 
 -9212075797126931087  : array( [ [0.        , 0.02568444, 0.        , ..., 0.        , 0.00905023,
         0.        ] ] ) }
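
The profile is a rating-weighted average of item vectors, which divides by the sum of the ratings. If a user has only implicit ratings (all 0), that sum is zero. The sketch below shows one way to guard against this; the function name `weighted_profile` and the fallback to a plain mean are assumptions for illustration, not what the notebook does.

```python
import numpy as np
import scipy.sparse

def weighted_profile(item_profiles, strengths):
    """Rating-weighted average of item vectors.

    Falls back to an unweighted mean when all strengths are zero
    (an assumed design choice, e.g. for users with only implicit ratings).
    """
    strengths = np.asarray(strengths, dtype=float).reshape(-1, 1)
    total = strengths.sum()
    if total == 0:
        return np.asarray(item_profiles.mean(axis=0))
    weighted = item_profiles.multiply(strengths).sum(axis=0)
    return np.asarray(weighted) / total

# Two toy item vectors over a two-token vocabulary
items = scipy.sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(weighted_profile(items, [8, 2]))
print(weighted_profile(items, [0, 0]))
```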

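### Let's take a look at a particular user profile. It is a vector with one entry per vocabulary token (at most 5000, given max_features=5000). It is not normalized to unit length, because the normalize call is commented out. The value in each position represents how relevant a token (unigram or bigram) is for the selected user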

In [None]:
user_id=276847

In [None]:
# zip feature_name and user_profile
a=zip(vectorizer.get_feature_names_out() , user_profiles[user_id].flatten().tolist()[0])

In [None]:
# Create a DataFrame of the feature names and their relevance for this user
user_df=pd.DataFrame( sorted(a), columns=['token', 'relevance']).sort_values("relevance",ascending=False)

In [None]:
user_df.head(10)

Unnamed: 0,token,relevance
56,der,0.24874
60,die,0.185939
77,george,0.128292
70,elizabeth,0.125502
206,und,0.101812
89,harry,0.075526
57,des,0.065645
209,von,0.038183
173,roman,0.036535
49,das,0.035976


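### Looking at this user's profile, the most relevant tokens are German function words (**der**, **die**, **und**, **des**, **von**, **das**) and name tokens (**george**, **elizabeth**, **harry**). This suggests that the user mostly reads German-language books, and that the English stop word list does not filter out German function words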
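
Tokens such as `der`, `die` and `und` in the table above carry little information about taste. One possible remedy, sketched here under assumptions, is to pass additional stop words to the vectorizer. The short word list is copied from the output above; a fuller list could probably be taken from `stopwords.words('german')` in NLTK.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, hand-picked list based on the top tokens shown above
german_function_words = ["der", "die", "und", "des", "von", "das"]

# Toy titles; the vectorizer lowercases text before removing stop words
toy_titles = ["Die Nacht der Macht", "Der Weg der Kriegerin"]

extended_vectorizer = TfidfVectorizer(stop_words=german_function_words)
extended_vectorizer.fit(toy_titles)

vocab = sorted(extended_vectorizer.vocabulary_)
print(vocab)
```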

## **Class for Content-Based Filtering**

In [None]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, items_df=books):
        self.books_ids = books_ids
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_items_to_user_profile(self, person_id, topn=1000):
        
        # Compute the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(user_profiles[person_id], vectorized_matrix)
        
        # Get the top similar items
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        
        # Sort the similar items by similarity
        similar_items = sorted([(books_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_items
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self._get_similar_items_to_user_profile(user_id)
        
        # Ignore items the user has already interacted with
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['book_id', 'ratings']) \
                                    .head(topn)

        # if verbose:
        #     if self.items_df is None:
        #         raise Exception('"items_df" is required in verbose mode')

        recommendations_df = recommendations_df.merge(books, how = 'left', 
                                                          left_on = 'book_id', 
                                                          right_on = 'book_id')[['title', 'author', 'book_id','ratings']]


        return recommendations_df
    
content_based_recommender_model = ContentBasedRecommender(books)

In [None]:
def get_items_interacted(person_id, interactions_df):
    interacted_items = interactions_df.loc[person_id]['book_id']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [None]:
user_id=276847

In [None]:
I = get_items_interacted(user_id, train_df)
Recommended_user_df=content_based_recommender_model.recommend_items( user_id, items_to_ignore = I, topn=100, verbose=False)
Recommended_user_df.head(10)

Unnamed: 0,title,author,book_id,ratings
0,Die Zeit drÃ¤ngt: Eine Weltversammlung der Chr...,Carl Friedrich WeizsÃ¤cker,3128858540793276650,0.814412
1,Wenn du den Emu am Himmel siehst. Eine Reise i...,Elizabeth Fuller,731262988036679743,0.811261
2,Der Weg der Kriegerin. Die neuen Waffen der Fr...,Nina George,-6393379969813860173,0.807779
3,Die sechste Puppe im Bauch der fÃ?Â¼nften Pupp...,Urs Widmer,8006077406649666092,0.802604
4,"Der Sturz des DÃ¤dalus, oder, Eizes fÃ¼r die E...",Wolf Biermann,-7102167731247001579,0.798823
5,"Die Rolle der Musik in der Film-, Funk- und Fe...",Klaus WÃ¼sthoff,-4653702235353713550,0.796966
6,"Semele, Zeus und Hera: Die Rolle der Geliebten...",Hans Jellouschek,8802769490799028748,0.786305
7,Die Gottsucher: Eine Vereinigung der christlic...,Remo F Roth,-1191308468740631542,0.786305
8,Die Nacht der Macht. Der Spion und der PrÃ?Â¤s...,Liaty Pisani,-4797834653744313170,0.786305
9,Die Mennyms auf der Flucht. ( Ab 11 J.). Sylvi...,Sylvia Waugh,-4653854896053459589,0.77484


## **Evaluation**

In [None]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:

    # Function for getting the set of items which a user has not interacted with
    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, df)
        all_items = set(books['book_id'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        # random.sample no longer accepts a set (Python 3.11+), so convert to a list first
        non_interacted_items_sample = random.sample(list(non_interacted_items), sample_size)
        return set(non_interacted_items_sample)

    # Function to verify whether a particular item_id was present in the set of top N recommended items
    def _verify_hit_top_n(self, item_id, recommended_items, topn):
        try:
            index = next(i for i, c in enumerate(recommended_items) if c == item_id)
        except StopIteration:
            # The item does not appear in the recommendations at all
            index = -1
        hit = int(index in range(0, topn))
        return hit, index
    
    # Function to evaluate the performance of model for each user
    def evaluate_model_for_user(self, model, person_id):
        
        # Getting the items in test set
        interacted_values_testset = test_df.loc[person_id]
        
        if type(interacted_values_testset['book_id']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['book_id'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['book_id'])])
            
        interacted_items_count_testset = len(person_interacted_items_testset) 

        # Getting a ranked recommendation list from the model for a given user
        person_recs_df = model.recommend_items(person_id, items_to_ignore=get_items_interacted(person_id, train_df),topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        
        # For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            
            # Getting a random sample of 100 items the user has not interacted with
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, seed=item_id%(2**32))

            # Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['book_id'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['book_id'].values
            
            # Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        # Recall is the rate of the interacted items that are ranked among the Top-N recommended items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    
    # Function to evaluate the performance of model at overall level
    def evaluate_model(self, model):
        
        people_metrics = []
        
        for idx, person_id in enumerate(list(test_df.index.unique().values)):    
            person_metrics = self.evaluate_model_for_user(model,person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
            
        # idx is zero-based, so the number of processed users is idx + 1
        print('%d users processed' % (idx + 1))

        detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()    

In [None]:
for user_id in list(test_df.index.unique().values)[0:10]:
     print( model_evaluator.evaluate_model_for_user(content_based_recommender_model, user_id) )

{'hits@5_count': 1, 'hits@10_count': 1, 'interacted_count': 209, 'recall@5': 0.004784688995215311, 'recall@10': 0.004784688995215311}
{'hits@5_count': 1, 'hits@10_count': 1, 'interacted_count': 8, 'recall@5': 0.125, 'recall@10': 0.125}
{'hits@5_count': 2, 'hits@10_count': 2, 'interacted_count': 255, 'recall@5': 0.00784313725490196, 'recall@10': 0.00784313725490196}
{'hits@5_count': 0, 'hits@10_count': 0, 'interacted_count': 98, 'recall@5': 0.0, 'recall@10': 0.0}
{'hits@5_count': 2, 'hits@10_count': 2, 'interacted_count': 29, 'recall@5': 0.06896551724137931, 'recall@10': 0.06896551724137931}
{'hits@5_count': 0, 'hits@10_count': 0, 'interacted_count': 80, 'recall@5': 0.0, 'recall@10': 0.0}
{'hits@5_count': 3, 'hits@10_count': 3, 'interacted_count': 167, 'recall@5': 0.017964071856287425, 'recall@10': 0.017964071856287425}
{'hits@5_count': 3, 'hits@10_count': 3, 'interacted_count': 797, 'recall@5': 0.0037641154328732747, 'recall@10': 0.0037641154328732747}
{'hits@5_count': 5, 'hits@10_coun

**We see that content-based filtering does not perform well on this book data. The algorithm recommends items similar to those a user liked in the past, but here similarity is computed only from the book title (combined with the author), which differs from book to book. As a result, the model is unable to recommend the right books to the user.**
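
For reference, the recall@N figures above follow this idea: each held-out item is ranked against a random sample of items the user has not interacted with, and it counts as a hit if it lands in the top N. The sketch below is a simplified illustration with made-up ids; the helper `hit_at_n` is hypothetical and not the evaluator class itself.

```python
def hit_at_n(target, ranked_items, candidates, n):
    """Return 1 if target is among the first n ranked items once the ranking
    is restricted to the sampled candidates plus the target, else 0."""
    allowed = set(candidates) | {target}
    shortlist = [item for item in ranked_items if item in allowed]
    return int(target in shortlist[:n])

ranked = ["b7", "b2", "b9", "b4", "b1", "b5"]   # hypothetical model ranking
negatives = ["b2", "b4", "b5"]                   # sampled non-interacted items
test_items = ["b9", "b1"]                        # held-out interactions

hits_at_2 = sum(hit_at_n(t, ranked, negatives, 2) for t in test_items)
recall_at_2 = hits_at_2 / len(test_items)
print(hits_at_2, recall_at_2)
```

Here `b9` is ranked second among its candidates and counts as a hit, while `b1` is ranked third and does not, giving a recall@2 of 0.5.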

In [None]:
# print('Evaluating Content-Based Filtering model...')
# cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)


# print('\nGlobal metrics:\n%s' % cb_global_metrics)
# cb_detailed_results_df.head(10)

## **My testing**

In [None]:
user_id=276847

In [None]:
model_evaluator.evaluate_model_for_user(content_based_recommender_model, user_id)

{'hits@10_count': 0,
 'hits@5_count': 0,
 'interacted_count': 9,
 'recall@10': 0.0,
 'recall@5': 0.0}

In [None]:
Recommended_user_df['author'].value_counts().head(20)

Ken Follett                   9
Siri Hustvedt                 4
Wolfgang Hohlbein             3
Bertolt Brecht                2
Gaby Hauptmann                2
Andreas SteinhÃ?Â¶fel         2
Gerit Kopietz                 2
Carl Friedrich WeizsÃ¤cker    1
Leena Lander                  1
Virginia Doyle                1
Sabine Blau                   1
Maria Benedickt               1
Sparkle Hayter                1
Uwe Timm                      1
Brigitte Blobel               1
Donna Leon                    1
Lotte Ingrisch                1
Shani Mootoo                  1
Henning Mankell               1
Charlotte Link                1
Name: author, dtype: int64

In [None]:
train_df.loc[user_id]['author'].value_counts()

Elizabeth George             6
J. R. R. Tolkien             4
Joanne K. Rowling            3
Philip Kerr                  2
Horst Bosetzky               1
Guido Knopp                  1
Charles Todd                 1
Jan Kjaerstad                1
Patricia Cornwell            1
Arnaldur Indridason          1
Tess Gerritsen               1
Jockel Tschiersch            1
J. K. Rowling                1
Bill Bryson                  1
Michael Miersch              1
Stephen Fry                  1
Franz Kafka                  1
James Patterson              1
Tony Hawks                   1
Thea Dorn                    1
Ann Granger                  1
Robert Schneider             1
John Ronald Reuel Tolkien    1
Martha Grimes                1
Jan Philipp Reemtsma         1
P. D. James                  1
Frederick Forsyth            1
Name: author, dtype: int64

In [None]:
test_df.loc[user_id]['author'].value_counts()

Elke Eberhardt              1
Philip Pullman              1
Antoine de Saint-Exupery    1
Elizabeth George            1
Georg Friedrich Nikol       1
Rhue                        1
Kerr                        1
Martha Grimes               1
Michael Phillips            1
Name: author, dtype: int64

In [None]:
user_df.head(10)

Unnamed: 0,token,relevance
56,der,0.24874
60,die,0.185939
77,george,0.128292
70,elizabeth,0.125502
206,und,0.101812
89,harry,0.075526
57,des,0.065645
209,von,0.038183
173,roman,0.036535
49,das,0.035976
