# Recommender Systems

Recommender systems provide personalized recommendations to users across many domains. In this assignment, we evaluate three families of algorithms covered in the course, namely naive methods, UV matrix decomposition, and matrix factorization, on the MovieLens 1M data set and compare their effectiveness. We use 5-fold cross-validation so that each model is also evaluated on ratings it did not see during training, which makes the reported results more reliable. For each algorithm, we report the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) on both the training and the test set.
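
For reference, the two error measures follow their standard definitions. The notation is introduced here only for illustration and is not taken from the course material: $r_i$ is an observed rating, $\hat{r}_i$ the corresponding prediction, and $N$ the number of ratings in the evaluated set.

$$ \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2}, \qquad \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left\lvert r_i - \hat{r}_i \right\rvert $$

RMSE penalizes large errors more strongly than MAE because the errors are squared before averaging.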

### Import Libraries

In [85]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import chardet

## Loading data

The data files do not all appear to use the same text encoding, so we detect the encoding of each file before reading it.
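
To illustrate why the encoding matters, here is a minimal sketch using a made-up title string (it is not read from the data files). Bytes written in one encoding may fail to decode, or decode to the wrong characters, under another. Which encoding each MovieLens file actually uses is left to the detection step.

```python
# Hypothetical title with a non-ASCII character, stored as Latin-1 bytes.
raw = "Misérables, Les (1995)".encode("latin-1")

# Decoding with a codec that does not match the bytes raises an error ...
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# ... while the matching codec recovers the original text.
title = raw.decode("latin-1")

print(decoded_as_utf8)
print(title)
```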

In [86]:
def get_file_encoding(file_path):
    """
    Detect the text encoding used in a file.

    :param file_path: Path of the file whose encoding should be detected
    :return: String containing the encoding type
    """
    
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
        return result['encoding']

In [87]:
# Loading ratings data
ratings_path = "./ratings.dat"
ratings = pd.read_csv(ratings_path, delimiter="::", header=None, engine='python', encoding=get_file_encoding(ratings_path))
ratings = ratings.rename(columns={0: "UserID", 1: "MovieID", 2: "Rating", 3:"Timestamp"}) # Set ratings column names

ratings.head()

| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [88]:
# Loading movies data
movies_path = "./movies.dat"
movies = pd.read_csv(movies_path, delimiter="::", header=None, engine='python', encoding=get_file_encoding(movies_path))
movies = movies.rename(columns={0: "MovieID", 1: "Title", 2: "Genres"})

movies.head()

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [89]:
# Loading users data
users_path = "./users.dat"
users = pd.read_csv(users_path, delimiter="::", header=None, engine='python', encoding=get_file_encoding(users_path))
users = users.rename(columns={0: "UserID", 1: "Gender", 2: "Age", 3: "Occupation", 4: "Zip-code"})

users.head()

| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


## Pre-processing Data

### Preparing Users

In [90]:
# One Hot Encode Gender
encoder = OneHotEncoder(sparse_output=False)

# Encode genders
encoded_gender = encoder.fit_transform(users[['Gender']])
encoded_gender_df = pd.DataFrame(encoded_gender, columns = encoder.get_feature_names_out(['Gender']))

# Concat new hot encoded columns
users = pd.concat([users, encoded_gender_df], axis = 1)

# Drop previous gender column
users.drop(['Gender'], axis='columns', inplace=True)

In [91]:
# Label Encode Zip-code
le = LabelEncoder()

# Update column
users['Zip-code'] = le.fit_transform(users['Zip-code'])
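
A label encoder numbers the sorted unique values of a column from 0 upwards, so the resulting integers act as identifiers and carry no numeric meaning. As a small sketch of that behaviour, using made-up zip codes and pandas categorical codes as a stand-in for `LabelEncoder` (they follow the same sorted-order numbering):

```python
import pandas as pd

# Made-up zip codes, used only for illustration.
zips = pd.Series(["48067", "02460", "55117", "48067"])

# Categorical codes number the sorted unique values 0..n-1;
# equal inputs always receive the same code.
codes = zips.astype("category").cat.codes

print(codes.tolist())  # [1, 0, 2, 1]
```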

In [92]:
users.head()

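| | UserID | Age | Occupation | Zip-code | Gender_F | Gender_M |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 10 | 1588 | 1.0 | 0.0 |
| 1 | 2 | 56 | 16 | 2248 | 0.0 | 1.0 |
| 2 | 3 | 25 | 15 | 1863 | 0.0 | 1.0 |
| 3 | 4 | 45 | 7 | 140 | 0.0 | 1.0 |
| 4 | 5 | 25 | 20 | 1938 | 0.0 | 1.0 |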


In [93]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   UserID      6040 non-null   int64  
 1   Age         6040 non-null   int64  
 2   Occupation  6040 non-null   int64  
 3   Zip-code    6040 non-null   int32  
 4   Gender_F    6040 non-null   float64
 5   Gender_M    6040 non-null   float64
dtypes: float64(2), int32(1), int64(3)
memory usage: 259.7 KB


# Naive Approaches

We start with the naive approaches and implement one function for each of the four: the Global Average, the Movie Average, the User Average, and a Linear Combination of the user and movie averages with an intercept term $\gamma$. The Global Average approach predicts the mean of all training ratings for every user-movie pair. The Movie Average and User Average approaches predict the mean rating received by the movie or given by the user, respectively; when a movie or user does not occur in the training data, so that no such average is available, they fall back to the global average. The Linear Combination predicts a weighted sum of the user and movie averages plus $\gamma$, with the coefficients fitted by least squares on the training set and the predictions clipped to the rating range. Because it is built from the User Average and Movie Average functions, it inherits the global average as its fall-back value for unseen users or movies.
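
The fall-back behaviour can be sketched on a toy example. The ratings below are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data: movie 3 appears in the test set but never in the training set.
train = pd.DataFrame({"MovieID": [1, 1, 2], "Rating": [4, 2, 5]})
test = pd.DataFrame({"MovieID": [1, 3]})

global_mean = train["Rating"].mean()                    # (4 + 2 + 5) / 3
movie_mean = train.groupby("MovieID")["Rating"].mean()  # movie 1 -> 3.0, movie 2 -> 5.0

# Look up each test movie's training mean; unseen movies give NaN,
# which is then replaced by the global mean.
pred = test["MovieID"].map(movie_mean).fillna(global_mean)

print(pred.tolist())
```

The first prediction is the average of movie 1, while the second falls back to the global mean because movie 3 has no training ratings.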

### Import Libraries

In [94]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error

#### 1. Global Average Rating:

$$ {R}_{global} (User, Movie) = mean(\text{all ratings})$$

In [95]:
def global_average(train, test, is_train=False):
    # Predict the training-set mean rating for every row of the evaluated set
    target = train if is_train else test
    return [train['Rating'].mean()] * len(target)

#### 2. Movie Average: 

$$ {R}_{movie} (User, Movie) = mean(\text{all ratings for movie})$$

In [96]:
def movie_average(train, test, is_train=False):
    movie_avg = train.groupby('MovieID')['Rating'].mean()   # mean training rating per movie

    if is_train:
        # Every training movie has an average, so a direct lookup suffices
        return movie_avg[train['MovieID']].to_numpy()

    # Test movies missing from the training set map to NaN; fall back to the global average
    return test['MovieID'].map(movie_avg).fillna(train['Rating'].mean())

#### 3. User Average:

$$ {R}_{user} (User, Movie) = mean(\text{all ratings by user})$$

In [97]:
def user_average(train, test, is_train=False):
    user_avg = train.groupby('UserID')['Rating'].mean()   # mean training rating per user

    if is_train:
        # Every training user has an average, so a direct lookup suffices
        return user_avg[train['UserID']].to_numpy()

    # Test users missing from the training set map to NaN; fall back to the global average
    return test['UserID'].map(user_avg).fillna(train['Rating'].mean())


#### 4. Linear Combination of the user and movie averages:

$$ {R}_{user-movie} (User, Movie) = \alpha \cdot {R}_{user} (User, Movie) + \beta \cdot {R}_{movie} (User, Movie) + \gamma$$

In [98]:
def linear_combination(train, test, is_train=False):
    user_avg = train.groupby('UserID')['Rating'].mean()
    movie_avg = train.groupby('MovieID')['Rating'].mean()

    # Design matrix: per-rating user average, movie average, and a constant column for the intercept
    A = np.vstack([user_avg[train['UserID']], movie_avg[train['MovieID']], np.ones(len(train))]).T
    b = train['Rating']

    # Least-squares fit of alpha, beta and gamma on the training set
    alpha, beta, gamma = np.linalg.lstsq(A, b, rcond=None)[0]     # https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html

    if is_train:
        prediction = alpha * user_average(train, test, is_train=True) + beta * movie_average(train, test, is_train=True) + gamma
    else:
        prediction = alpha * user_average(train, test) + beta * movie_average(train, test) + gamma

    # Keep predictions inside the valid rating range
    prediction = np.clip(prediction, 1, 5)

    return prediction
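
As a sanity check on the fitting step, the following sketch applies the same `np.linalg.lstsq` call to synthetic data generated from known coefficients. The data and the names `toy_user_avg`, `toy_movie_avg`, and `toy_rating` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Synthetic, noise-free data built from known coefficients (illustration only).
rng = np.random.default_rng(0)
toy_user_avg = rng.uniform(1, 5, size=200)
toy_movie_avg = rng.uniform(1, 5, size=200)
toy_rating = 0.8 * toy_user_avg + 0.9 * toy_movie_avg - 2.5

# Design matrix: one column per average plus a column of ones for the intercept.
A = np.column_stack([toy_user_avg, toy_movie_avg, np.ones_like(toy_user_avg)])
alpha, beta, gamma = np.linalg.lstsq(A, toy_rating, rcond=None)[0]

print(np.round([alpha, beta, gamma], 3))
```

Since the target contains no noise, the fit should recover the coefficients 0.8, 0.9, and -2.5 up to floating-point precision.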
    


## 5-fold Cross-Validation & Accuracy estimations

In [99]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)   # fixed random seed for reproducible splits

# training set
rmse_global_train = []
mae_global_train = []

rmse_user_train = []
mae_user_train = []

rmse_movie_train = []
mae_movie_train = []

rmse_combination_train = []
mae_combination_train = []

# test set
rmse_global_test = []
mae_global_test = []

rmse_user_test = []
mae_user_test = []

rmse_movie_test = []
mae_movie_test = []

rmse_combination_test = []
mae_combination_test = []

for train_index, test_index in kf.split(ratings):
    train_data, test_data = ratings.iloc[train_index], ratings.iloc[test_index]

    # Compute RMSE and MAE over training set
    rmse_global_train.append(np.sqrt(mean_squared_error(train_data['Rating'], global_average(train_data, test_data, is_train=True))))
    mae_global_train.append(mean_absolute_error(train_data['Rating'], global_average(train_data, test_data, is_train=True)))

    rmse_user_train.append(np.sqrt(mean_squared_error(train_data['Rating'], user_average(train_data, test_data, is_train=True))))
    mae_user_train.append(mean_absolute_error(train_data['Rating'], user_average(train_data, test_data, is_train=True)))

    rmse_movie_train.append(np.sqrt(mean_squared_error(train_data['Rating'], movie_average(train_data, test_data, is_train=True))))
    mae_movie_train.append(mean_absolute_error(train_data['Rating'], movie_average(train_data, test_data, is_train=True)))

    rmse_combination_train.append(np.sqrt(mean_squared_error(train_data['Rating'], linear_combination(train_data, test_data, is_train=True))))
    mae_combination_train.append(mean_absolute_error(train_data['Rating'], linear_combination(train_data, test_data, is_train=True)))

    # Compute RMSE and MAE over test set
    rmse_global_test.append(np.sqrt(mean_squared_error(test_data['Rating'], global_average(train_data, test_data))))
    mae_global_test.append(mean_absolute_error(test_data['Rating'], global_average(train_data, test_data)))

    rmse_user_test.append(np.sqrt(mean_squared_error(test_data['Rating'], user_average(train_data, test_data))))
    mae_user_test.append(mean_absolute_error(test_data['Rating'], user_average(train_data, test_data)))

    rmse_movie_test.append(np.sqrt(mean_squared_error(test_data['Rating'], movie_average(train_data, test_data))))
    mae_movie_test.append(mean_absolute_error(test_data['Rating'], movie_average(train_data, test_data)))

    rmse_combination_test.append(np.sqrt(mean_squared_error(test_data['Rating'], linear_combination(train_data, test_data))))
    mae_combination_test.append(mean_absolute_error(test_data['Rating'], linear_combination(train_data, test_data)))
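
In the loop above, every predictor is evaluated twice per fold, once for each metric. One possible way to avoid the repeated work is to compute both metrics from a single set of predictions. The helper `rmse_mae` below is hypothetical, is not part of the original notebook, and is shown on made-up numbers.

```python
import numpy as np

def rmse_mae(y_true, y_pred):
    """Return (RMSE, MAE) for one set of predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

# Made-up ratings and predictions, for illustration only.
rmse, mae = rmse_mae([5, 3, 4, 1], [4, 3, 2, 2])
print(rmse, mae)
```

Inside the cross-validation loop, such a helper would be called once per model and split, with the two returned values appended to the corresponding lists.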


In [100]:
d_train = {"Metric": ["RMSE", "MAE"],
          "Global Average": [np.mean(rmse_global_train), np.mean(mae_global_train)],
          "User Average": [np.mean(rmse_user_train), np.mean(mae_user_train)],
          "Movie Average": [np.mean(rmse_movie_train), np.mean(mae_movie_train)],
          "Linear Combination": [np.mean(rmse_combination_train),np.mean(mae_combination_train)]}
df_train = pd.DataFrame(data=d_train)

d_test = {"Metric": ["RMSE", "MAE"],
          "Global Average": [np.mean(rmse_global_test), np.mean(mae_global_test)],
          "User Average": [np.mean(rmse_user_test), np.mean(mae_user_test)],
          "Movie Average": [np.mean(rmse_movie_test), np.mean(mae_movie_test)],
          "Linear Combination": [np.mean(rmse_combination_test),np.mean(mae_combination_test)]}
df_test = pd.DataFrame(data=d_test)

#### RMSE and MAE table over training set:

In [101]:
df_train

| | Metric | Global Average | User Average | Movie Average | Linear Combination |
|---|---|---|---|---|---|
| 0 | RMSE | 1.117101 | 1.027673 | 0.974228 | 0.914556 |
| 1 | MAE | 0.933861 | 0.822719 | 0.778336 | 0.72481 |


#### RMSE and MAE table over test set:

In [102]:
df_test

| | Metric | Global Average | User Average | Movie Average | Linear Combination |
|---|---|---|---|---|---|
| 0 | RMSE | 1.117101 | 1.03548 | 0.979367 | 0.924256 |
| 1 | MAE | 0.933862 | 0.82895 | 0.782284 | 0.73243 |
