## **METHODS FOR TACKLING COLD START PROBLEM IN RECOMMENDER SYSTEMS**

**INITIAL RESULTS AND CODES**

**SUPERVISOR:**  CENI BABAOGLU

**COURSE:** BIG DATA ANALYTICS

**CODE:** CIND 820

**SUBMITTED BY:** ABDUR RAHMAN MAHBUB

**STUDENT ID:** 501142642

**SUBMITTED ON:** 26th NOVEMBER, 2023





This project was created in Google Colab.

First, the required packages are installed and the libraries are imported. The "MovieLens Small" dataset is then downloaded and read.

In [8]:
!pip install pandas numpy scipy matplotlib seaborn
!pip install tabulate
!pip install scikit-surprise
!pip install pyspark
!pip install scikit-learn

import pandas as pd
import numpy as np
import urllib.request
import os
import seaborn as sns
import matplotlib.pyplot as plt
from zipfile import ZipFile
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD



In this step, the dataset archive is downloaded from the GroupLens link, extracted, and its CSV files are loaded into pandas DataFrames.


In [12]:
# Dataset location
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
file_name = 'movielens_small.zip'

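# Download the archive only if it is not already present
if not os.path.exists(file_name):
    urllib.request.urlretrieve(url, file_name)

# Extract the archive only if the data folder is missing
if not os.path.exists('ml-latest-small'):
    with ZipFile(file_name, 'r') as zip_ref:
        zip_ref.extractall()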

movies_df = pd.read_csv('ml-latest-small/movies.csv')
ratings_df = pd.read_csv('ml-latest-small/ratings.csv')
tags_df = pd.read_csv('ml-latest-small/tags.csv')
links_df = pd.read_csv('ml-latest-small/links.csv')

In the next step, the data is merged and cleaned to simplify the analysis. The genre and tag columns are combined into a single feature string, and the ratings are split into training and test sets. A user-item matrix is then built from the training ratings and converted to a sparse matrix for further processing.

In [13]:
# Merging 'tags' with 'movies' dataframe (a movie with several tags yields several rows).
movies_df = pd.merge(movies_df, tags_df, on='movieId', how='left')

# Merging 'links' with 'movies' dataframe.
movies_df = pd.merge(movies_df, links_df, on='movieId', how='left')

# Cleaning NaN values in relevant columns.
movies_df['tag'] = movies_df['tag'].fillna('')

# Modifying 'genres' column in the new dataframe and combining all information for movie feature.
# regex=False treats '|' as a literal character instead of a regular expression.
movies_df['genres'] = movies_df['genres'].str.replace('|', ' ', regex=False)
movies_df['features'] = movies_df['genres'] + ' ' + movies_df['tag']

# Splitting data into training and test sets.
train_data, test_data = train_test_split(ratings_df, test_size=0.2, random_state=42)

# Creating user-item matrix and converting the dataframe into sparse
train_user_item_mtx = train_data.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
train_user_item_mtx_sp = csr_matrix(train_user_item_mtx.values)
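
Because 'tags.csv' holds one row per user, movie and tag, the left merge above repeats a movie once for every tag it has. The following is a minimal sketch of one way to keep a single row per movie, assuming the tags are first joined into one string per 'movieId'. The toy frames and the name `tags_per_movie` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for movies.csv and tags.csv (illustrative data only)
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A', 'B', 'C'],
    'genres': ['Adventure|Comedy', 'Drama', 'Comedy'],
})
tags = pd.DataFrame({
    'movieId': [1, 1, 3],
    'tag': ['pixar', 'fun', 'quirky'],
})

# Collapse all tags of a movie into one space-separated string
tags_per_movie = (
    tags.groupby('movieId')['tag']
        .agg(' '.join)
        .reset_index()
)

# One row per movie after the merge; movies without tags get an empty string
merged = movies.merge(tags_per_movie, on='movieId', how='left')
merged['tag'] = merged['tag'].fillna('')
merged['genres'] = merged['genres'].str.replace('|', ' ', regex=False)
merged['features'] = (merged['genres'] + ' ' + merged['tag']).str.strip()
```

With one row per movie, a title can appear at most once in a similarity ranking built from the 'features' column.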



**DEVELOPING RECOMMENDER SYSTEM USING COLLABORATIVE FILTERING METHOD**

We now develop a baseline recommender system using collaborative filtering, step by step. First, matrix factorization is performed with truncated singular value decomposition (SVD) to construct a matrix of predicted ratings, including ratings for items that users have not rated yet.

The predicted rating matrix, 'ratings_predict', contains an estimated rating for every user-movie pair in the training matrix and is used to generate the user recommendations.

For the collaborative filtering model, we then define a function, 'collaborative_filtering', that generates user-specific movie recommendations.

In [14]:
# Matrix Factorization using SVD method
num_factors = 50
U, sigma, Vt = svds(train_user_item_mtx_sp, k=num_factors)
sigma = np.diag(sigma)
ratings_predict = np.dot(np.dot(U, sigma), Vt)
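
To illustrate what the factorization produces, here is a small sketch on a toy matrix. It uses the same `svds` call with fewer factors; the matrix `R` and the names `k` and `R_hat` are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy user-item matrix: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

k = 2  # number of latent factors; must be smaller than min(R.shape)
U, sigma, Vt = svds(csr_matrix(R), k=k)

# Rank-k reconstruction: every cell, including the unrated ones, gets a score
R_hat = U @ np.diag(sigma) @ Vt
```

The reconstruction treats the zeros as observed ratings of 0, so predicted scores for sparsely rated users tend to be pulled toward zero.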

In [15]:
# Developing function for collaborative filtering
def collaborative_filtering(user_id, ratings_predict, num_recommendations=10):
    # Users without training ratings have no row in the user-item matrix
    if user_id not in train_user_item_mtx.index:
        return []

    # Row lookup is positional: assumes userId values are contiguous and start at 1
    viewer_ratings = ratings_predict[user_id - 1]
    sorted_ratings = viewer_ratings.argsort()[::-1]  # column positions, best first
    viewed_movies = train_user_item_mtx.columns[train_user_item_mtx.loc[user_id].gt(0)].tolist()

    recommended_movies = []
    for idx in sorted_ratings:
        # Column position is mapped to a movieId as idx + 1, which is exact only
        # when the matrix columns are the contiguous ids 1, 2, 3, ...
        movie_id = idx + 1
        if movie_id not in viewed_movies:
            movie_info = movies_df[movies_df['movieId'] == movie_id]['title'].values
            if len(movie_info) > 0:
                movie_title = movie_info[0]
                recommended_movies.append((movie_title, viewer_ratings[idx]))
                if len(recommended_movies) >= num_recommendations:
                    break

    return recommended_movies
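
The function above locates rows and columns by position ('user_id - 1' and 'idx + 1'). MovieLens movie IDs are not contiguous, so a column position does not generally equal 'movieId - 1'. Below is a hedged sketch of a label-based lookup; the function name `recommend_by_label`, the `titles` dictionary and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def recommend_by_label(user_id, predictions, user_item, titles, n=10):
    """Top-n unseen movies for user_id, using row/column labels for lookup."""
    if user_id not in user_item.index:
        return []
    row = user_item.index.get_loc(user_id)  # row position of this user
    scores = pd.Series(predictions[row], index=user_item.columns)
    seen = user_item.columns[user_item.loc[user_id] > 0]
    ranked = scores.drop(seen).sort_values(ascending=False).head(n)
    return [(titles.get(mid, str(mid)), float(s)) for mid, s in ranked.items()]

# Toy data with non-contiguous movieIds (illustrative only)
user_item = pd.DataFrame(
    [[5.0, 0.0, 0.0], [0.0, 4.0, 0.0]],
    index=pd.Index([1, 22], name='userId'),
    columns=pd.Index([10, 50, 300], name='movieId'),
)
predictions = np.array([[4.8, 3.1, 2.0], [3.5, 4.1, 4.4]])
titles = {10: 'Movie A', 50: 'Movie B', 300: 'Movie C'}

recs = recommend_by_label(22, predictions, user_item, titles, n=2)
```

In the notebook's terms, `user_item` would correspond to 'train_user_item_mtx' and `predictions` to 'ratings_predict'.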

In this step, we showcase an example of collaborative filtering recommendations for a particular user, 'User 22'. The output lists the recommended movie titles together with their predicted ratings.

In [None]:
userid_colfil = 22
colfil_recommended_movies = collaborative_filtering(userid_colfil, ratings_predict)
print(f"Recommendations for User {userid_colfil} Based on Collaborative Filtering")
for idx, (movie, rating) in enumerate(colfil_recommended_movies, start=1):
    print(f"{idx}. {movie}, {rating} ")

Recommendations for User 22 Based on Collaborative Filtering
1. Mary Shelley's Frankenstein (Frankenstein) (1994), 2.95008175706996 
2. Three Colors: White (Trzy kolory: Bialy) (1994), 2.6614098231000574 
3. Henry: Portrait of a Serial Killer (1986), 2.4370578355562427 
4. Hot Chick, The (2002), 2.2892239865759967 
5. Kansas City (1996), 1.9198190791633325 
6. Wildcats (1986), 1.899867668934507 
7. Death Wish (1974), 1.6741195440440744 
8. Men of Honor (2000), 1.6651515418966316 
9. 15 Minutes (2001), 1.6415064369515207 
10. Heidi (1937), 1.568954276681546 


**EVALUATING COLLABORATIVE FILTERING RECOMMENDATIONS**

To evaluate the collaborative filtering recommendations, we calculate the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) by comparing predicted ratings against the actual ratings in the test dataset. Lower RMSE and MAE values indicate better predictive performance.
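
For reference, the two metrics can be written as follows, where $r_i$ is the actual rating of the $i$-th test row, $\hat{r}_i$ is its predicted rating and $N$ is the number of test rows (notation introduced here for illustration):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|r_i - \hat{r}_i\right|
$$

RMSE squares each error before averaging and therefore penalizes large misses more heavily than MAE does.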

In [16]:
# Make predictions and evaluate
ratings_mean = ratings_df['rating'].mean()  # fallback when no prediction is found
predicted_test_ratings = []
for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:
    # Note: `in` on a NumPy array compares against stored values, not user or movie labels
    if user_id in ratings_predict and movie_id in ratings_predict[user_id]:
        predicted_ratings = ratings_predict[user_id][movie_id]
        predicted_test_ratings.append(predicted_ratings)
    else:
        predicted_test_ratings.append(ratings_mean)

test_ratings = test_data['rating'].values

# Calculating Root Mean Square Error (RMSE) of Collaborative Filtering
rmse_colfil = np.sqrt(mean_squared_error(test_ratings, predicted_test_ratings))
print(f"Root Mean Square Error (RMSE) of Collaborative Filtering: {rmse_colfil}")

# Calculating Mean Absolute Error (MAE) of Collaborative Filtering
mae_colfil = mean_absolute_error(test_ratings, predicted_test_ratings)
print(f"Mean Absolute Error (MAE) of Collaborative Filtering: {mae_colfil}")

Root Mean Square Error (RMSE) of Collaborative Filtering: 1.0488361768130714
Mean Absolute Error (MAE) of Collaborative Filtering: 0.8315511597145444
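
In the evaluation cell above, the test `user_id in ratings_predict` checks whether the value occurs anywhere in the array rather than whether the user has a row, so the global-mean fallback appears to be used for most or all test rows. The following sketch shows one way to look predictions up by user and movie labels instead. The function name `evaluate`, its parameters and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def evaluate(test_df, predictions, user_item, fallback):
    """RMSE and MAE over test_df, looking predictions up by userId/movieId labels."""
    user_pos = {u: i for i, u in enumerate(user_item.index)}
    movie_pos = {m: j for j, m in enumerate(user_item.columns)}
    preds = []
    for user_id, movie_id in zip(test_df['userId'], test_df['movieId']):
        if user_id in user_pos and movie_id in movie_pos:
            preds.append(predictions[user_pos[user_id], movie_pos[movie_id]])
        else:
            preds.append(fallback)  # user or movie unseen in training
    errors = test_df['rating'].to_numpy() - np.array(preds)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae

# Toy data (illustrative only)
user_item = pd.DataFrame(
    [[5.0, 0.0], [0.0, 4.0]],
    index=pd.Index([1, 22], name='userId'),
    columns=pd.Index([10, 300], name='movieId'),
)
predictions = np.array([[4.5, 3.0], [2.0, 4.0]])
test_df = pd.DataFrame({
    'userId': [1, 22, 7],
    'movieId': [300, 10, 10],
    'rating': [4.0, 3.0, 3.5],
})
rmse, mae = evaluate(test_df, predictions, user_item, fallback=3.5)
```

A reasonable choice for `fallback` is the mean rating of the training set, which avoids using test ratings in the prediction.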


**DEVELOPING RECOMMENDER SYSTEM USING CONTENT-BASED FILTERING METHOD**

In this section, we develop content-based recommendations for movies. First, a feature matrix is built from each movie's genres and tags using TfidfVectorizer. Cosine similarity between the feature vectors then measures how alike two movies are, and the resulting similarity matrix is used to find lists of similar items.

A new function, 'content_based_filtering', looks up the row of a given movie in the similarity matrix and returns the other movies ordered from most to least similar.

In [17]:
# Executing similarity matrix
tfidf_vec = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vec.fit_transform(movies_df['features'].values.astype('U'))
item_similar = cosine_similarity(tfidf_matrix, tfidf_matrix)
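
As a small illustration of the similarity step, the sketch below applies the same two calls to three made-up feature strings. The strings and the variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative feature strings (genres plus tags), one per movie
features = [
    'Adventure Animation Children Comedy pixar',   # movie 0
    'Adventure Animation Children Comedy sequel',  # movie 1
    'Crime Drama Thriller mafia',                  # movie 2
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(features)
similarity = cosine_similarity(tfidf)  # square matrix, 1.0 on the diagonal

# Rank the other movies by similarity to movie 0, excluding movie 0 itself
order = [int(i) for i in similarity[0].argsort()[::-1] if i != 0]
```

Filtering out the query index explicitly, rather than dropping the first entry of the ranking, stays correct when another row has identical features and ties with the query movie.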

In [18]:
# Developing function for content-based filtering
def content_based_filtering(movie_title, similarity_matrix, num_recommendations=10):
    # Row position of the first entry whose title matches
    movie_index = movies_df[movies_df['title'] == movie_title].index.values[0]
    similar_scores = similarity_matrix[movie_index]
    # Sort by descending similarity and drop the top-ranked entry
    similar_movies_content = similar_scores.argsort()[::-1][1:]
    similar_movies = movies_df.iloc[similar_movies_content]
    # The full ranking is returned; callers truncate it (e.g. with .head())
    return similar_movies[['title', 'genres', 'imdbId']]

The following code generates content-based recommendations for a particular movie. With 'Toy Story (1995)' as the selected movie, the function returns the movies whose genres and tags are most similar to it.

In [19]:
movie_content_title = 'Toy Story (1995)'
content_based_recommended_movies = content_based_filtering(movie_content_title, item_similar)
print("\nRecommendations Based on Content-Based Filtering")
print(content_based_recommended_movies.head(10))


Recommendations Based on Content-Based Filtering
                                                   title  \
1                                       Toy Story (1995)   
3214                                  Toy Story 2 (1999)   
3217                                  Toy Story 2 (1999)   
2484                                Bug's Life, A (1998)   
8672                                           Up (2009)   
4633                               Monsters, Inc. (2001)   
11499                                       Moana (2016)   
3966                    Emperor's New Groove, The (2000)   
9544   Asterix and the Vikings (Astérix et les Viking...   
10948                           The Good Dinosaur (2015)   

                                            genres   imdbId  
1      Adventure Animation Children Comedy Fantasy   114709  
3214   Adventure Animation Children Comedy Fantasy   120363  
3217   Adventure Animation Children Comedy Fantasy   120363  
2484           Adventure Animation Childr

**EVALUATING CONTENT-BASED FILTERING RECOMMENDATIONS**

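At this stage, we evaluate the content-based filtering recommendations using RMSE and MAE on a random sample of 1,000 ratings. In the code below, the predicted rating for each sampled row is drawn uniformly at random between 1 and 5 rather than derived from the similarity matrix, so the resulting errors act as a random baseline.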

In [None]:
# Creating a small test dataset with 1000 ratings
num_ratings = 1000
test_data = ratings_df.sample(n=num_ratings, random_state=42)

# Developing function that returns a predicted rating for every test row
def predict_movie_ratings(test_data, similarity_matrix):
    predicted_ratings = []
    for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:
        recommended_movies = content_based_filtering(movies_df[movies_df['movieId'] == movie_id]['title'].values[0], similarity_matrix)
        # Placeholder: the rating is drawn at random and does not use recommended_movies
        predicted_rating = np.random.uniform(1, 5)
        predicted_ratings.append(predicted_rating)
    return predicted_ratings

predicted_ratings_test_content = predict_movie_ratings(test_data, item_similar)
test_ratings = test_data['rating'].values

# Calculate RMSE for content-based filtering
rmse_confil = np.sqrt(mean_squared_error(test_ratings, predicted_ratings_test_content))
print(f"Root Mean Square Error (RMSE) of Content-Based Filtering: {rmse_confil}")

# Calculate MAE for content-based filtering
mae_confil = mean_absolute_error(test_ratings, predicted_ratings_test_content)
print(f"Mean Absolute Error (MAE) of Content-Based Filtering: {mae_confil}")


Root Mean Square Error (RMSE) of Content-Based Filtering: 1.643261559918365
Mean Absolute Error (MAE) of Content-Based Filtering: 1.354126601752953


**DEVELOPING HYBRID RECOMMENDER SYSTEM USING COLLABORATIVE FILTERING & CONTENT-BASED FILTERING**

In this step, we create a new function that combines the collaborative and content-based recommendation methods, with the aim of generating more accurate and relevant suggestions for a specific user and a particular movie.

In [20]:
# Developing function for hybrid recommendations
def hybrid_recommendations_csp(user_id, movie_title, num_recommendations=10):

    if user_id not in train_user_item_mtx.index: # new users
        return content_based_filtering(movie_title, item_similar, num_recommendations)

    recommend_colfil = collaborative_filtering(user_id, ratings_predict, num_recommendations)
    recommend_confil = content_based_filtering(movie_title, item_similar, num_recommendations)

    hybrid_recommendations = []
    colfil_titles = [title for title, _ in recommend_colfil]
    # Each title is checked against the list it was taken from, so this loop appends nothing
    for idx, (title, _) in enumerate(recommend_colfil):
        if title not in colfil_titles:
            hybrid_recommendations.append((title, idx+1))

    # Content-based titles not suggested by collaborative filtering; idx keeps its
    # last value from the loop above, so all of them share the same rank
    confil_titles = [title for title in recommend_confil['title']]
    for title in confil_titles:
        if title not in colfil_titles:
            hybrid_recommendations.append((title, idx+1))

    # Stable sort on rank, then keep the first num_recommendations titles
    hybrid_recommendations = sorted(hybrid_recommendations, key=lambda x: x[1])
    return [movie[0] for movie in hybrid_recommendations[:num_recommendations]]
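
In the function above, the first loop tests `title not in colfil_titles` against the very list it iterates over, so the collaborative titles do not appear to reach the final list. One simple way to blend the two sources is to interleave the two rankings and skip duplicates. The sketch below assumes that approach; the function name `interleave` and the example title lists are illustrative.

```python
def interleave(primary, secondary, n=10):
    """Alternate between two ranked title lists, skipping duplicates."""
    merged, seen = [], set()
    for i in range(max(len(primary), len(secondary))):
        for ranked in (primary, secondary):
            if i < len(ranked) and ranked[i] not in seen:
                seen.add(ranked[i])
                merged.append(ranked[i])
    return merged[:n]

# Illustrative rankings, best first
collab_titles = ['Heat (1995)', 'Up (2009)', 'Casino (1995)']
content_titles = ['Toy Story 2 (1999)', 'Up (2009)', 'Moana (2016)']

hybrid = interleave(collab_titles, content_titles, n=4)
```

Interleaving gives both methods equal weight. A weighted score over normalized ratings and similarities is an alternative when one source should dominate.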

With the hybrid function defined, we use both the collaborative and the content-based filtering approaches to build a single list of recommended movies. Let's take 'User 22' as the viewer and 'Toy Story (1995)' as the movie. The following code generates the list.

In [21]:
hybrid_user_id = 22
movie_title_content = 'Toy Story (1995)'
hybrid_recommended_movies_csp = hybrid_recommendations_csp(hybrid_user_id, movie_title_content)
print(f"\nHybrid Recommendations for User {hybrid_user_id} based on '{movie_title_content}':")
for idx, movie in enumerate(hybrid_recommended_movies_csp, start=1):
    print(f"{idx}. {movie}")


Hybrid Recommendations for User 22 based on 'Toy Story (1995)':
1. Toy Story (1995)
2. Toy Story 2 (1999)
3. Toy Story 2 (1999)
4. Bug's Life, A (1998)
5. Up (2009)
6. Monsters, Inc. (2001)
7. Moana (2016)
8. Emperor's New Groove, The (2000)
9. Asterix and the Vikings (Astérix et les Vikings) (2006)
10. The Good Dinosaur (2015)
