<a href="https://colab.research.google.com/github/developerdatascience/AMEX_Default_Prediction/blob/main/Recommended_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The main families of methods for RecSys are:

---



---



1.Simple Recommender: recommend the most popular items for each user.

2.Collaborative filtering: This method makes automatic predictions(filtering) about the interest of the users by collecting preferences or
taste information from many users(collaborating). The underlying assumption of the collaborative filtering approach is that if a person A
has the same opinion as a person B on the set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person

3.Content based filtering: This method uses only information about the description and attributes of the items users has previously consumed to model users's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with itesms previously rated by the user and best-matching items are recommended.

4.Hybrid methods: Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content based filtering could be more effective that pure approaches in some cases. These methods can also be used to overcome some of the common problems in the recommender systems such as cold start and the sparsity problems.

In [5]:
import warnings
warnings.filterwarnings('ignore')

# DATA

We will use the small version of MovieLens datasets that contains 100000 rating and 3600 tag applications applied to 9000 movies by 600 users, and last update on it was in 2018

for more information about bigger versions: https://grouplens.org/datasets/movieslens/

In [6]:
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2023-04-09 09:01:38--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2023-04-09 09:01:39 (3.06 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


This dataset consists of the following files:

movies.csv: This file contains genre information on movies featured in the dataset

tags.csv: Contains tages that are user-generated metadata about the movies. Each tag is typically a single word or short phrases. The meaning , value, and purpose of a particular tag is determined by each user.

links.csv: This file contains the TMBD and IMDB IDs of all the moves featured in the dataset.

ratings.csv: Each line of this file after the header row represents one rating of one movie by one user. Ratings are made on a 5-star scale,with half star increments.

In [7]:
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv", header=0)
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
movies.shape

(9742, 3)

In [9]:
ratings = pd.read_csv("ml-latest-small/ratings.csv")
print(ratings.head(5))
print(ratings.shape)

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
(100836, 4)


# Evaluation

We evaluate on the subset of data called Test data, in our case we can take 20% random data sample as test OR for more robust evaluation approach could be to split train and test sets by a reference date, where the train set is composed by all the movies before that date, and the test set are movies after that date to better simulate how the recsys would perform in the production predicting "future" users interaction

In [10]:
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [11]:
import re

def extract_year(title):
  match = re.search(r'\((.*?)\)', title)
  if match:
    return match.group(len(match.groups()))
  return '1999'

movies['year'] = movies['title'].apply(lambda x: extract_year(x))
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [12]:
threshold_date = '2015'
ids = movies[movies['year'] < threshold_date]['movieId'].values

training_data = ratings[ratings['movieId'].isin(ids)]
print(f'Training data size: {training_data.shape}')
testing_data = ratings[~ratings['movieId'].isin(ids)]
print(f"Testing data size: {testing_data.shape}")


Training data size: (91637, 4)
Testing data size: (9199, 4)


For evaluation we will work with Top-K metrics, which evaluates the accuracy of the top recomendations provided to a users, comparing to the items the users has actually interacted in the test set.

The Top-k accuracy metric choosen are Recall@K and Precision@K:

Recall@K = (Relevant_Items_Reccommended in top-k)/(Relevant_Items)

Precison@K = (Relevant_Items_Recommended in top-k)/(n_items_Recommended)

Let's understand the definitions of recall@k and precision@k, assume we are providing 5 recommendations in this order -1 0 1 0 1, where 1 represets relevant and 0 irrelevant. So the precision@k at different values of k will be precision@3 is 2/3, precision@4 is 2/4 and precision@5 is 3/5. The recall@k would be , recall@3 is 2/3, recall@4 is 2/3 and recall@5 is 3/3

In [13]:
def get_favorites_movies(user_id, ratings_df):
  favorites = ratings_df[(ratings_df['userId'] == user_id) & (ratings_df['rating'] >= 3.5)].sort_values(by='rating', ascending=False)['movieId']
  return set(favorites if type(favorites) == pd.Series else [favorites])

In [26]:
class ModelEvaluator:
  def __init__(self, training_data, testing_data, threshold=3.5):
    self.training_data = training_data
    self.testing_data = testing_data
    self.threshold = threshold

  
  def evaluate_model_for_user(self, model, user_id):
    #getting the items in test set
    favorites_in_test = get_favorites_movies(user_id, self.testing_data)

    #getting a ranked recommendations list from a model for a given user (movieId, predicted_rating)
    person_recs_df = model.recommend_items(user_id, 
                                          items_to_ignore=get_favorites_movies(user_id, self.training_data))
    
    #get only movies with predicted rating >=3.5
    person_recs_df = person_recs_df[person_recs_df['predicted_rating'] >= self.threshold].sort_values(by='predicted_rating', ascending=False)
    true_relevant = person_recs_df[person_recs_df['movieId'].isin(favorites_in_test)].shape[0]

    top_5_recommended = person_recs_df.head(5)
    top_10_recommended= person_recs_df.head(10)

    # Number of relevant and recommended items in top k
    hits_at_5_count = top_5_recommended[top_5_recommended['movieId'].isin(favorites_in_test)].shape[0]
    hits_at_10_count = top_10_recommended[top_10_recommended['movieId'].isin(favorites_in_test)].shape[0]

    precision_at_5 = hits_at_5_count / top_5_recommended.shape[0] if top_5_recommended.shape[0] != 0 else 1
    recall_at_5 = hits_at_5_count / true_relevant if true_relevant != 0 else 1

    precision_at_10 = hits_at_10_count / top_5_recommended.shape[0] if top_10_recommended.shape[0] != 0 else 1
    recall_at_10 = hits_at_10_count / true_relevant if true_relevant != 0 else 1

    person_metrics = {
        'hits@5_count': hits_at_5_count,
        "hits@10_count": hits_at_10_count,
        'recommended@5_count': top_5_recommended.shape[0],
        'recommended@10_count': top_10_recommended.shape[0],
        'relevants': true_relevant,
        'recall@5': recall_at_5,
        'recall@10': recall_at_10,
        'precision@5': precision_at_5,
        'precision@10': precision_at_10
    }

    return person_metrics


  def evaluate_model(self, model):
    #print('Runnning evaluation for user)
    user_metrics = []
    user_ids = list(set(self.testing_data['userId'].values))
    for idx, user_id in enumerate(user_ids):
      metrics = self.evaluate_model_for_user(model, user_id)
      metrics['user_id'] = user_id
      user_metrics.append(metrics)
    print("%d users processed" % idx)

    detailed_results_df = pd.DataFrame(user_metrics) \
                          .sort_values('hits@5_count', ascending=False)
    
    global_recall_at_5 = detailed_results_df['hits@5_count'].sum()/float(detailed_results_df['relevants'].sum())
    global_recall_at_10 = detailed_results_df['hits@10_count'].sum()/float(detailed_results_df['relevants'].sum())

    global_precision_at_5 = detailed_results_df['hits@5_count'].sum()/float(detailed_results_df['recommended@5_count'].sum())
    global_precision_at_10 = detailed_results_df['hits@10_count'].sum()/float(detailed_results_df['recommended@10_count'].sum())

    global_metrics = {
        'modelName': model.get_model_name(),
        'recall@5': global_recall_at_5,
        'recall@10': global_recall_at_10,
        'precision@5': global_precision_at_5,
        'precision@10': global_precision_at_10

    }

    return global_metrics, detailed_results_df

      


In [27]:
model_evaluator = ModelEvaluator(training_data, testing_data)

# Popularity Recommender

A common (and usually hard-to-beat) baseline approach is the Popularity model. This model is not personalized-it simply recommends to a users the most popular items that the user has not previously consumed. As the popularity accounts for the "wisdom of the crowds", it usally provides good recommendations, generally interesting for the most people.

However, using a rating as a metric has a few caveats:

For the one, it doest not take into considerations the popularity of a movie. Therefore, a movie with a rating of 9 from 10 voters will be considered better than a movie with a rating of 8.9 from 10,000 voters.

For example, imagine you want to order chinese food, you have a couple of options, one restaurant  has a 5-star rating by only 5 people while the other restaurant has 4.5 ratings y 1000 people. Which restaurant would you prefer? The second one, right?

Taking these shortcomings  into consideration , you must come up with a weighted rating that takes into account the average rating and the number of votes it has accumulated.

(v/(v+m)*R) + (m/(m+v) * C)

In the above equation,



*   v is the number of ratings for the movie
*   m is the minimum rating required to be listed in the chart


*   R is the average rating of the movie
*   C is the mean ratings across the whole movies





In [33]:
popularity = ratings.groupby('movieId').agg({'rating': ['mean', 'count']}).reset_index()
popularity.columns= ['movieId', 'ratings_mean', 'ratings_count']
popularity.sort_values(by='ratings_mean', ascending=False)

Unnamed: 0,movieId,ratings_mean,ratings_count
7638,88448,5.0,1
8089,100556,5.0,1
9065,143031,5.0,1
9076,143511,5.0,1
9078,143559,5.0,1
...,...,...,...
9253,157172,0.5,1
7536,85334,0.5,1
6486,53453,0.5,1
5200,8494,0.5,1


In [34]:
class PopularityRecommender:

  MODEL_NAME = 'Popularity'

  def __init__(self, popularities_df):
    self.popularities_df = popularities_df

  
  def get_model_name(self):
    return self.MODEL_NAME
  
  def weighted_rating(self, x, m, C):
    v = x['ratings_count']
    R = x['ratings_mean']

    # Calculation based on IMDB formula

    return (v/(v+m) * R) + (m/(m+v) * C)
  

  def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
    C = self.popularities_df['ratings_mean'].mean()

    self.popularities_df['predicted_rating'] = self.popularities_df.apply(lambda x: self.weighted_rating(x, 3.5, C), axis=1)

    recommendations_df = self.popularities_df[~self.popularities_df['movieId'].isin(items_to_ignore)] \
                        .sort_values('predicted_rating', ascending=False) \
                        .head(topn)
    
    return recommendations_df


popularity_model = PopularityRecommender(popularity)


In [36]:
print('Evaluating Popularity recommendation model....')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)
print('\Global metrics:\n%s' % pop_global_metrics)
pop_detailed_results_df.head(10)

Evaluating Popularity recommendation model....
571 users processed
\Global metrics:
{'modelName': 'Popularity', 'recall@5': 0.44, 'recall@10': 1.0, 'precision@5': 0.0038461538461538464, 'precision@10': 0.004370629370629371}


Unnamed: 0,hits@5_count,hits@10_count,recommended@5_count,recommended@10_count,relevants,recall@5,recall@10,precision@5,precision@10,user_id
443,2,3,5,10,3,0.666667,1.0,0.4,0.6,474
23,1,1,5,10,1,1.0,1.0,0.2,0.2,25
316,1,1,5,10,1,1.0,1.0,0.2,0.2,338
197,1,1,5,10,1,1.0,1.0,0.2,0.2,209
390,1,2,5,10,2,0.5,1.0,0.2,0.4,414
220,1,1,5,10,1,1.0,1.0,0.2,0.2,233
200,1,1,5,10,1,1.0,1.0,0.2,0.2,212
297,1,1,5,10,1,1.0,1.0,0.2,0.2,318
438,1,1,5,10,1,1.0,1.0,0.2,0.2,469
16,1,1,5,10,1,1.0,1.0,0.2,0.2,18


In [37]:
popularity_model.recommend_items(5)

Unnamed: 0,movieId,ratings_mean,ratings_count,predicted_rating
277,318,4.429022,317,4.416283
9600,177593,4.75,8,4.297267
840,1104,4.475,20,4.294407
659,858,4.289062,192,4.270683
796,1041,4.590909,11,4.270246
2224,2959,4.272936,218,4.256969
882,1178,4.541667,12,4.252811
2579,3451,4.545455,11,4.235763
921,1221,4.25969,129,4.233348
602,750,4.268041,97,4.233021


In [38]:
ids = popularity_model.recommend_items(10)['movieId'].values
movies[movies['movieId'].isin(ids)]['title'].values

array(['Shawshank Redemption, The (1994)',
       'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)',
       'Godfather, The (1972)', 'Secrets & Lies (1996)',
       'Streetcar Named Desire, A (1951)', 'Paths of Glory (1957)',
       'Godfather: Part II, The (1974)', 'Fight Club (1999)',
       "Guess Who's Coming to Dinner (1967)",
       'Three Billboards Outside Ebbing, Missouri (2017)'], dtype=object)

# Content Based Filtering

Content-based filtering approaches leverage description or attributes from items the user has interacted to recommend similar items. It depends only on the user previous choices, making this method robust to avoid the cold-start problem. For textual items, like articles, news and books, it is simple to use the raw text to build item profiles and user profiles using TF-IDF

# Use Genres as content

In [39]:
# get all the unique genres

genres = list(set([x for genres in movies['genres'].values for x in genres.split('|')]))
genres

['Crime',
 'Drama',
 'Documentary',
 'Comedy',
 'Mystery',
 'IMAX',
 '(no genres listed)',
 'Children',
 'Adventure',
 'Horror',
 'Fantasy',
 'Sci-Fi',
 'Western',
 'War',
 'Romance',
 'Musical',
 'Thriller',
 'Animation',
 'Film-Noir',
 'Action']

In [40]:
expanded_movies_df = movies.copy()
for g in genres:
  expanded_movies_df[g] = [ 0 if not g in genres.split('|') else 1 for genres in movies['genres'].values]

In [41]:
expanded_movies_df.head()

Unnamed: 0,movieId,title,genres,year,Crime,Drama,Documentary,Comedy,Mystery,IMAX,...,Fantasy,Sci-Fi,Western,War,Romance,Musical,Thriller,Animation,Film-Noir,Action
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,1995,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0




To model the user profile, we take all the item profiles the user has interacted and average them. The avergae is weighted by user ratings

In [42]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def normalize(value, old_max, old_min, new_max=5.0, new_min=0.0):
  old_range = (old_max - old_min)
  new_range = (new_max - new_min)
  return (((value - old_min) * new_range) / old_range) + new_min

In [55]:
class ContentBasedRecommender:
  MODEL_NAME = 'Content-Based'

  def __init__(self, items_df, training_data, testing_data):
    self.items_df = items_df
    self.features_names = items_df.columns[4:]
    self.training_data = training_data
    self.testing_data = testing_data
  
  def get_model_name(self):
    return self.MODEL_NAME

  
  def get_item_profile(self, item_id):
    item_profile = self.items_df[self.items['movieId']== item_id][self.features_name]
    return item_profile

  def get_items_profiles(self, ids):
    item_profile = self.items_df[self.items_df['movieId'].isin(ids)][self.features_names]
    return item_profile
  
  def build_users_profile(self, user_id):
    user_df = self.training_data[self.training_data['userId'] == user_id]
    user_items_profiles = self.get_items_profiles(user_df['movieId'].values)
  
    user_items_ratings = np.array(user_df['rating'].values).reshape(-1, 1)
    user_profile = np.sum(np.multiply(user_items_profiles, user_items_ratings), axis=0) / np.sum(user_items_ratings)
    return user_profile
  
  def get_similar_items_to_user_profile(self, user_id, topn=1000):
    user_profile = self.build_users_profile(user_id).reshape(1, -1)
    # Computes the cosine similarity between the user profile and all the item profiles
    cosine_similarities = cosine_similarity(user_profile, self.items_df[self.features_names].values)
    # Get the top similar items
    similar_indices = cosine_similarities.argsort().flatten()[-topn:]
    # sort the similar items by similarity
    similar_items = sorted([(self.items_df.iloc[i, 'movieId'], cosine_similarities[0, i]) for i in similar_indices], key=lambda x: -x[1])
    return similar_items
  
  def recommend_items(self, user_id, items_to_ignore=[], topn=10):
    similar_items = self.get_similar_items_to_user_profile(user_id)
    similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
    recommendations_df = pd.DataFrame(similar_items_filtered, columns=['movieId', 'predicted_rating']) \
                          .head(topn)
    
    recommendations_df['predicted_rating'] = recommendations_df['predicted_rating'].apply(lambda x: normalize(x, 1.0, 0.0))
    return recommendations_df


In [56]:
content_based_recommender_model = ContentBasedRecommender(expanded_movies_df, training_data, testing_data)

In [57]:
print('Evaluating Content-Based Filtering model....')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model....


AttributeError: ignored

# Use tags as content

The qualitu of your recommender would be increased with the usage of better metadata and by capturing more of the finer details.That is precisely what you re going todo in this section by using movies tags(keywords, cast, geners) as content

We will simply join all the required tags for a movie by a space. This is the final preprocessing step, and the output of this function will be fed into the word vector model

byt note that:


1.   Removing the spaces between words is an important preprocessing step.It is done so that your vectorizer doesn't count the Johnny of 'Johnny Depp' and 'Johnny Galecki' as the same. After this processing step, the aforementioned actors will be represented as 'johnnydepp' and 'johnnyhgalecki' and wikk be distinct to your vectorizer.

2.   Need to convert all letters to small



In [58]:
tags = pd.read_csv('ml-latest-small/tags.csv', header=0)
tags.head(10)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
5,2,89774,Tom Hardy,1445715205
6,2,106782,drugs,1445715054
7,2,106782,Leonardo DiCaprio,1445715051
8,2,106782,Martin Scorsese,1445715056
9,7,48516,way too long,1169687325


In [59]:
def check_name(x):
  words = x.split()
  return all(x[0].isupper() and x[1].islower() for x in words if len(x) > 1)

def clean(x):
  if x.istitle() or check_name(x):
    return x.replace(" ", "").lower()
  else:
    return x.lower().strip()

In [60]:
tags['tag'] = tags['tag'].apply(lambda x: clean(x))
tags.head(10)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,boxing story,1445715207
4,2,89774,mma,1445715200
5,2,89774,tomhardy,1445715205
6,2,106782,drugs,1445715054
7,2,106782,leonardodicaprio,1445715051
8,2,106782,martinscorsese,1445715056
9,7,48516,way too long,1169687325


In [61]:
expended_movies_df = movies.copy()

movies_ids = expended_movies_df['movieId'].values.tolist()
expended_movies_df['soup'] = [" ".join(tags[tags['movieId']==id]['tag'].values.tolist()) for id in movies_ids]

In [63]:
expended_movies_df.head(10)

Unnamed: 0,movieId,title,genres,year,soup
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,pixar pixar fun
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,fantasy magic board game robinwilliams game
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,moldy old
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,
4,5,Father of the Bride Part II (1995),Comedy,1995,pregnancy remake
5,6,Heat (1995),Action|Crime|Thriller,1995,
6,7,Sabrina (1995),Comedy|Romance,1995,remake
7,8,Tom and Huck (1995),Adventure|Children,1995,
8,9,Sudden Death (1995),Action,1995,
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995,


In [64]:
expended_movies_df.loc[8, 'soup']

''

In [65]:
expended_movies_df[expended_movies_df['soup'] != ''].shape

(1572, 5)

In [66]:
def fill_empty_tag(x):
  if x['soup'] =='':
    return " ".join(x['genres'].lower().split('|'))
  return x['soup']

  expended_movies_df['soup'] = expended_movies_df.apply(lambda x: fill_empty_tag(x), axis=1)
  expended_movies_df.head(10)

In [67]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(expended_movies_df['soup'])
count_matrix.shape

(9742, 1578)

In [71]:
class ContentBasedRecommender:

  MODEL_NAME = 'Content-Based'

  def __init__(self, items_ids, items_matrix, training_data, testing_data):
    self.items_ids = items_ids
    self.items_matrix = items_matrix
    self.training_data = training_data
    self.testing_data = testing_data

  
  def get_model_name(self):
    return self.MODEL_NAME
  
  def get_item_profile(self, item_id):
    idx = self.items_ids.index(item_id)
    return self.items_matrix[idx].toarray().reshape(-1)
  
  def get_items_profiles(self, ids):
    items_profiles = np.array([self.get_item_profile(x) for x in ids])
    return items_profiles
  
  def build_users_profile(self, user_id):
    user_df = self.training_data[self.training_data['userId'] == user_id]
    user_items_profiles = self.get_items_profiles(user_df['movieId'].values)

    user_items_ratings = np.array(user_df['rating'].values).reshape(-1, 1)
    user_profile= np.sum(np.multiply(user_items_profiles, user_items_ratings), axis=0) / np.sum(user_items_ratings)
    return user_profile
  
  def get_similar_items_to_user_profile(self, user_id, topn=1000):
    user_profile = self.build_users_profile(user_id).reshape(1, -1)

    # computes the cosine similarity between the user profile and all item profiles
    cosine_similarities = cosine_similarity(user_profile, self.items_matrix.toarray())
    # get the top similar items
    similar_indices = cosine_similarities.argsort().flatten()[-topn:]
    # Sort the similar items by similarity

    similar_items = sorted([(self.items_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
    return similar_items

  
  def recommend_items(self, user_id, items_to_ignore=[], topn=10):
    similar_items = self.get_similar_items_to_user_profile(user_id)
    similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))

    recommendations_df = pd.DataFrame(similar_items_filtered, columns=['movieId', 'predicted_rating']) \
                        .head(topn)
    
    recommendations_df['predicted_rating'] = recommendations_df['predicted_rating'].apply(lambda x: normalize(x, 1.0, 0.0))
    return recommendations_df


content_based_recommender_model = ContentBasedRecommender(expended_movies_df['movieId'].values.tolist(), count_matrix, training_data, testing_data)


In [72]:
print('Evaluating Content-Based Filtering model....')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model....
571 users processed

Global metrics:
{'modelName': 'Content-Based', 'recall@5': 1.0, 'recall@10': 1.0, 'precision@5': 0.025, 'precision@10': 0.025}


Unnamed: 0,hits@5_count,hits@10_count,recommended@5_count,recommended@10_count,relevants,recall@5,recall@10,precision@5,precision@10,user_id
18,1,1,5,5,1,1.0,1.0,0.2,0.2,20
393,0,0,0,0,0,1.0,1.0,1.0,1.0,417
377,0,0,0,0,0,1.0,1.0,1.0,1.0,400
378,0,0,0,0,0,1.0,1.0,1.0,1.0,401
379,0,0,0,0,0,1.0,1.0,1.0,1.0,402
380,0,0,0,0,0,1.0,1.0,1.0,1.0,403
381,0,0,1,1,0,1.0,1.0,0.0,0.0,404
382,0,0,0,0,0,1.0,1.0,1.0,1.0,405
383,0,0,0,0,0,1.0,1.0,1.0,1.0,407
384,0,0,0,0,0,1.0,1.0,1.0,1.0,408


You can make execution faster by foe example put user ids as dataframe index or change the dataframe

Issues in recommender System:

1. Cold Start Problem

Whenever a new user enters a recommender system, the question arises of what to recommend him/her and on what basis as previous data is not available and similarity calculation could not be performed. One solution to this problem is to make the new users enter a small introduction form containing basic information about the person's interest, hobbies, occupations, and creating a basic user profile and then recommending items to the new user. This would solve the cold start problem to a great extent

2. Data Sparsity Problem:

The major issue in a recommender system is the unavailability of appropriate data which is the main requirement for the recommendation process. Many users don't bother to review items they bought. As a result, the user-item rating matrix has many sparse entries which degrade the performance of the similarity calculation algorithm. So one solution is to predict sparse entries and many researchers have given algorithms to predict these ratings such as a negative weighted one slope algorithm.


3. Changing Dataset:


With the increase in the amount of data every day, there is an increase in the inclusion of data in the previous dataset of the recommender system which may alter the overall structure and composition of the dataset. Both new users and new items needed to get included in the dataset. So this change needed to be accommodated in the dataset.
4. Scalability Problem:
In a practical scenario, it is not always possible to find similar users and similar items every time and prevent the system from failure. So building a scalable recommender system is a major concern.



5. Shilling attack

Shilling attack is defined as the process of inclusion of fake profiles and biased reviews and ratings to bias the entire recommendation process.
A malicious attacker may inject these profiles so as to increase/decrease recommending frequency of target items.


# Collaborative Filtering model


Collaborative Filtering (CF) has two main implementation strategies:

• Memory-based. This approach uses the memory of previous users interactions to compute users similarities based on items they’ve interacted (user based approach) or compute items similarities based on the users that have interacted with them (item-based approach)
A typical example of this approach is User Neighbourhood-based CF, in which the top-N similar users (usually computed using Pearson correlation) for a user are selected and used to recommend items those similar users liked, but the current user have not interacted yet

• Model-based: This approach, models are developed using different machine learning algorithms to recommend items to users. There are many model-based CF algorithms, like neural networks, bayesian networks, clustering models, and latent factor models such as Singular Value Decomposition (SD) and, probabilistic latent semantic analysis.



In [1]:
# Creating a spares pivot table with users in rows and items in columns
users_items_pivot_matrix_df = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
users_items_pivot_matrix_df.head(10)

NameError: ignored

In [None]:
users_items_pivot_matrix = users_items_pivot_matrix_df.values
users_items_pivot_matrix

In [None]:
from scipy.sparse.linalg import svds

# The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
# Performs matrix factorization of the original user item matrix
#U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

In [None]:
U.shape

In [None]:
Vt.shape

In [None]:
sigma = np.diag(sigma)
sigma.shape

The resulting matrix is not sparse anymore. It was generated predictions for the items the user have not yet interaction, which we will exploit for recommendations:

In [2]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
all_user_predicted_ratings

SyntaxError: ignored

Convert the reconstrcuted matrix back to a Pandas dataframe

In [None]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns=users_items_pivot_matrix_df.columns, index=users_items_pivot_matrix_df.index)

In [None]:
preds_df.head()

In [None]:
preds_df = preds_df.apply(lambda x: normalize(x, all_user_predicted_ratings.max(), all_user_predicted_ratings.min()))
preds_df.head(10)

In [None]:
class CFRecommender:
  MODEL_NAME = 'Collaborative Filtering'

  def __init__(self, predictions_df):
    self.predictions_df = predictions_df

  def get_model_name(self):
    return self.MODEL_NAME
  
  def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
    # get and sort the user's predictions
    sorted_user_predictions = self.predictions_df.loc[user_id].sort_values(ascending=False)
    recommendations = {'movieId': sorted_user_predictions.index, 'predicted_rating': sorted_user_predictions.values}

    recommendations_df = pd.DataFrame(recommendations)

    # Recommend the highest predicted rating movies that the user hasn't seen yet

    recommendations_df = recommendations_df[~recommendations_df['movieId'].isin(items_to_ignore)] \
                          .sort_values('predicted_rating', ascending=False) \
                          .head(topn)
    
    return recommendations_df

  
cf_recommender_model = CFRecommender(preds_df)


In [None]:
model_evaluator = ModelEvaluator(training_data, testing_data, 2.5)

print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')

cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)