# ADAM NOWAK
https://www.geeksforgeeks.org/recommendation-system-in-python/

### CONTENT-BASED RECOMMENDATION SYSTEM 
#### Item profile
1. TF (term frequency): how often a given term occurs in a document. It is used to rank the terms within a document and so identify the important ones.
2. IDF (inverse document frequency): used in text analysis and information retrieval to measure how rare or unique a term is across the corpus. Common terms have low IDF values; rare terms have high ones.

We use a TF-IDF vectorizer to count how many times each term appears and to measure how significant that term is.

$\text{TF-IDF}(w_{ij}) = TF_{ij} \times IDF_i$
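
A minimal sketch of this score using scikit-learn's `TfidfVectorizer`. The three toy descriptions are made up for illustration. Note that scikit-learn applies a smoothed IDF and L2-normalises each row by default, so the values are not the plain product from the formula, but the ordering (rare terms weigh more than common ones) is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy item descriptions (hypothetical, for illustration only)
docs = [
    "dragon fantasy adventure",
    "space adventure",
    "detective mystery adventure",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_          # term -> column index
idf = vectorizer.idf_                   # learned IDF weight per term

# "adventure" occurs in every document, "dragon" in only one
print(f"IDF(adventure) = {idf[vocab['adventure']]:.3f}")
print(f"IDF(dragon)    = {idf[vocab['dragon']]:.3f}")
print(tfidf.shape)
```

Each row of `tfidf` can serve as the item profile of the corresponding document.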

#### User profile 
The user profile is a vector that describes the user's preferences.
To build it, we use a utility matrix that describes the relationship between users and items.
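
One common way to build such a profile, sketched here under the assumption that it is the rating-weighted average of the profiles of the items the user rated. The feature values and ratings are made up.

```python
import numpy as np

# Item profiles: one row per item, one column per feature
# (hypothetical features, e.g. TF-IDF weights of three terms)
item_profiles = np.array([
    [1.0, 0.0, 0.0],   # item 0
    [0.0, 1.0, 0.0],   # item 1
    [0.5, 0.5, 0.0],   # item 2
    [0.0, 0.0, 1.0],   # item 3
])

# One row of the utility matrix: this user's ratings (0 = not rated)
user_ratings = np.array([5.0, 0.0, 3.0, 0.0])

# User profile = rating-weighted average of the rated items' profiles
user_profile = user_ratings @ item_profiles / user_ratings.sum()

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every item against the profile; higher = closer to the user's taste
scores = np.array([cosine(user_profile, item) for item in item_profiles])
print(user_profile)
print(scores)
```

Item 3 shares no feature with anything the user rated, so its score is zero: the system has no basis for recommending it.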


Problems with this approach:
1. Finding appropriate features is hard.
2. It doesn't recommend items outside the user's profile.

### Collaborative Filtering
Collaborative filtering is based on the idea that similar people (based on the data) generally tend to like similar things. It predicts which item a user will like based on the item preferences of other similar users. 

This system uses a user-item matrix to generate recommendations.
The matrix contains values that indicate a user's preference for a given item. These values can represent either explicit feedback (direct user ratings) or implicit feedback (indirect user behavior such as listening, purchasing, watching).
1. Explicit feedback: data that users choose to provide, such as personal ratings. The user tells us what they like.
2. Implicit feedback: we track user behavior and try to infer what the user likes from usage data.
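
A small sketch of the idea with explicit ratings, assuming cosine similarity between users and a similarity-weighted average as the prediction. The ratings are made up, and this is only one of several ways to do it.

```python
import numpy as np

# Toy user-item matrix of explicit ratings
# (rows: users, columns: items, 0 = no rating)
ratings = np.array([
    [5, 4, 0, 1],   # user 0: has not rated item 2
    [5, 5, 4, 1],   # user 1: similar taste to user 0
    [1, 1, 2, 5],   # user 2: opposite taste
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target_user, target_item = 0, 2

# Other users who have rated the target item
others = [u for u in range(len(ratings))
          if u != target_user and ratings[u, target_item] > 0]

# Similarity of the target user to each of them
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in others])

# Predicted rating: similarity-weighted average of the neighbours' ratings
predicted = sims @ ratings[others, target_item] / sims.sum()
print(f"Predicted rating: {predicted:.2f}")
```

The prediction lands closer to user 1's rating than to user 2's, because user 1 is the more similar neighbour.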

Problems:
1. It is hard to add new features that improve the quality of the model.
2. It cannot handle fresh items (the cold start problem).


### General Information: 
#### Problems: 
1. Cold start problem.
2. User preferences evolve over time.
3. Scalability.

#### Approaches to build such a system: 
1. Popularity
2. Classification model: using features such as purchase history or product information, a classifier estimates the probability that the user will buy a given product. Less efficient than collaborative filtering methods.
3. Co-occurrence
4. Matrix factorization
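
To make approaches 1 and 3 concrete, here is a minimal sketch on a made-up binary interaction matrix. It only illustrates the two ideas and is not a production recommender.

```python
import numpy as np

# Binary interaction matrix (rows: users, columns: items);
# 1 = the user interacted with the item (made-up data)
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])

# Popularity: rank items by how many users interacted with them
popularity = interactions.sum(axis=0)
most_popular = int(np.argmax(popularity))

# Co-occurrence: entry (i, j) counts users who interacted with both i and j
cooccurrence = interactions.T @ interactions
np.fill_diagonal(cooccurrence, 0)  # ignore an item's co-occurrence with itself

# Items most often seen together with item 0, best first
ranked_for_item0 = np.argsort(-cooccurrence[0], kind="stable")
print(popularity, most_popular)
print(cooccurrence[0], ranked_for_item0)
```

Popularity gives every user the same list, while co-occurrence tailors the list to an item the user has already interacted with.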




# Book recommendation system

In [2]:
import numpy as np
import pandas as pd

books_df = pd.read_csv('Books.csv')
ratings_df = pd.read_csv('Ratings.csv')
users_df = pd.read_csv('Users.csv')



In [3]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [4]:
books_df = books_df[['ISBN', 'Book-Title', 'Book-Author']]
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   ISBN         271360 non-null  object
 1   Book-Title   271360 non-null  object
 2   Book-Author  271358 non-null  object
dtypes: object(3)
memory usage: 6.2+ MB


In [5]:
books_df.dropna(inplace=True)

In [6]:
books_df.shape

(271358, 3)

In [7]:
ratings_df.shape

(1149780, 3)

In [8]:
ratings_df = ratings_df.dropna(subset=['ISBN', 'Book-Rating'])

In [9]:
#merging datasets and dropping useless columns 
book_user_rating = books_df.merge(ratings_df, on='ISBN', how='inner')
book_user_rating = book_user_rating[['ISBN', 'Book-Title', 'Book-Author', 'User-ID', 'Book-Rating']]

book_user_rating.head(10)

Unnamed: 0,ISBN,Book-Title,Book-Author,User-ID,Book-Rating
0,195153448,Classical Mythology,Mark P. O. Morford,2,0
1,2005018,Clara Callan,Richard Bruce Wright,8,5
2,2005018,Clara Callan,Richard Bruce Wright,11400,0
3,2005018,Clara Callan,Richard Bruce Wright,11676,8
4,2005018,Clara Callan,Richard Bruce Wright,41385,0
5,2005018,Clara Callan,Richard Bruce Wright,67544,8
6,2005018,Clara Callan,Richard Bruce Wright,85526,0
7,2005018,Clara Callan,Richard Bruce Wright,96054,0
8,2005018,Clara Callan,Richard Bruce Wright,116866,9
9,2005018,Clara Callan,Richard Bruce Wright,123629,9


In [10]:
users_df.head(10)

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
5,6,"santa monica, california, usa",61.0
6,7,"washington, dc, usa",
7,8,"timmins, ontario, canada",
8,9,"germantown, tennessee, usa",
9,10,"albacete, wisconsin, spain",26.0


In [11]:
print(f"Unique users: {ratings_df['User-ID'].nunique()}")
print(f"Unique books: {ratings_df['ISBN'].nunique()}")

Unique users: 105283
Unique books: 340556


In [12]:
#list of some books
selected_books = ["Where the Heart Is (Oprah's Book Club (Paperback))",
                  "The Surgeon",
                  "I Know This Much Is True"]

filtered_books = book_user_rating[book_user_rating['Book-Title'].isin(selected_books)]

#group by title and count reviews
reviews_count = filtered_books.groupby('Book-Title')['Book-Rating'].count()

print(reviews_count)

Book-Title
I Know This Much Is True                              276
The Surgeon                                           175
Where the Heart Is (Oprah's Book Club (Paperback))    585
Name: Book-Rating, dtype: int64


In [13]:
book_user_rating.shape

(1031134, 5)

#### Only books with more than 100 reviews and users with more than 20 reviews 

In [17]:
book_counts = book_user_rating['Book-Title'].value_counts()
popular_books = book_counts[book_counts > 100].index
filtered_books = book_user_rating[book_user_rating['Book-Title'].isin(popular_books)]
filtered_books.shape

(182799, 5)

In [32]:
user_counts = filtered_books['User-ID'].value_counts()
active_users = user_counts[user_counts > 20].index
filtered_data = filtered_books[filtered_books['User-ID'].isin(active_users)]
filtered_data.shape

(92883, 5)

In [33]:
prepared_data = filtered_data.sort_values(by=['Book-Title', 'User-ID'])
prepared_data.shape

(92883, 5)

In [34]:
prepared_data.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,User-ID,Book-Rating
94366,451524934,1984,George Orwell,254,9
240083,451519841,1984,George Orwell,7346,8
94372,451524934,1984,George Orwell,11676,0
240085,451519841,1984,George Orwell,11676,0
307406,452262933,1984,George Orwell,11676,10


In [35]:
pivot_table = prepared_data.pivot_table(
    index='Book-Title',        
    columns='User-ID',   
    values='Book-Rating',   
    fill_value=0            
)

pivot_table.head()

Book-Title \ User-ID,243,254,507,638,882,1131,1435,1848,1903,2033,...,276231,276463,276680,277195,277427,277639,278144,278418,278535,278633
1984,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
24 Hours,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Books in rows, users in columns! 
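
One detail worth checking: `pivot_table` aggregates duplicate (title, user) pairs with the mean by default. A user who rated several editions of the same title, as user 11676 does for 1984 in the rows printed earlier, ends up with the average of those ratings, zeros included. A small sketch using just those five rows:

```python
import pandas as pd

# The five 1984 rows shown in the prepared_data.head(5) output
toy = pd.DataFrame({
    "Book-Title": ["1984"] * 5,
    "User-ID": [254, 7346, 11676, 11676, 11676],
    "Book-Rating": [9, 8, 0, 0, 10],
})

toy_pivot = toy.pivot_table(
    index="Book-Title",
    columns="User-ID",
    values="Book-Rating",
    fill_value=0,       # fills only (title, user) pairs with no rows at all
)                       # aggfunc defaults to "mean"

print(toy_pivot)
```

If the average is not what is wanted, passing `aggfunc="max"` is one possible alternative.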

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:


matrix = pivot_table.values
#fitting a nearest-neighbours model that uses cosine distance
knn_model = NearestNeighbors(metric='cosine', algorithm='auto')
knn_model.fit(matrix)
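
Once fitted, the model can be queried with `kneighbors`. The sketch below shows the idea on a tiny made-up pivot table so that it runs on its own. `recommend` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real pivot table (rows: books, columns: users)
toy_pivot = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 0, 1],
     [0, 0, 10, 7],
     [0, 1, 9, 8]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103, 104],
    dtype=float,
)

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_pivot.values)

def recommend(title, pivot, model, n=2):
    """Return up to n (title, cosine distance) pairs closest to `title`."""
    query = pivot.loc[[title]].values  # shape (1, n_users)
    distances, indices = model.kneighbors(query, n_neighbors=n + 1)
    # Drop the query book itself, which comes back at distance ~0
    pairs = [(pivot.index[i], float(d))
             for d, i in zip(distances[0], indices[0])
             if pivot.index[i] != title]
    return pairs[:n]

print(recommend("Book A", toy_pivot, toy_knn, n=1))
```

With the real objects the call would look like `recommend("1984", pivot_table, knn_model)`, assuming the title is present in the pivot table index.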

In [8]:
from scipy.sparse import csr_matrix

# Merge the datasets and keep only the columns needed for the matrix
book_user_rating = books_df.merge(ratings_df, on='ISBN', how='inner')
book_user_rating = book_user_rating[['ISBN', 'User-ID', 'Book-Rating']].copy()

# Map user IDs and ISBNs to consecutive integers (0, 1, 2, ...),
# which are easier to use as matrix indices than the raw identifiers
user_id_mapping = {user_id: idx for idx, user_id in enumerate(users_df['User-ID'].unique())}
isbn_mapping = {isbn: idx for idx, isbn in enumerate(book_user_rating['ISBN'].unique())}

book_user_rating['user_idx'] = book_user_rating['User-ID'].map(user_id_mapping)
book_user_rating['isbn_idx'] = book_user_rating['ISBN'].map(isbn_mapping)

rows = book_user_rating['user_idx'].values
cols = book_user_rating['isbn_idx'].values
data = book_user_rating['Book-Rating'].values

# Build the sparse user x book matrix in CSR format
csr_matrix_data = csr_matrix((data, (rows, cols)), shape=(len(user_id_mapping), len(isbn_mapping)))

#basic matrix info 
print(f"Shape of the matrix: {csr_matrix_data.shape}")
print(f"Number of non-zero elements: {csr_matrix_data.nnz}")


Shape of the matrix: (278858, 270151)
Number of non-zero elements: 1031136


Next step: item-based collaborative filtering. Calculate the similarities between books and recommend titles based on a single book.
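
A minimal sketch of that step on a toy sparse matrix, assuming cosine similarity between item columns. The ratings are made up, and the `toy_` names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x item rating matrix, built from (data, (rows, cols)) triplets
toy_rows = np.array([0, 0, 1, 1, 2, 2, 3])
toy_cols = np.array([0, 1, 0, 1, 2, 3, 2])
toy_data = np.array([8, 9, 7, 8, 10, 9, 6])
toy_csr = csr_matrix((toy_data, (toy_rows, toy_cols)), shape=(4, 4))

# Items are columns, so transpose to get one row per item
item_user = toy_csr.T.tocsr()

# Cosine similarity of one item to every item (returned as a dense array)
target_item = 0
sims = cosine_similarity(item_user[target_item], item_user).ravel()
sims[target_item] = -1.0  # exclude the item itself from the ranking

# Indices of the two most similar items, best first
top = np.argsort(-sims)[:2]
print(top, sims[top])
```

Computing one row of similarities at a time avoids materialising the full item-by-item matrix, which would be very large for roughly 270,000 books.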