# Computational Linear Algebra: Singular Value Decomposition Homework

In this homework we explore *Singular Value Decomposition* (SVD) and use it to devise a movie recommendation system like those used nowadays by many streaming services.

## 1. Dataset Specifications
The dataset we use is the *MovieLens Dataset*, one of the most widely used datasets for movie recommendation tasks. It contains user ratings for movies along with metadata such as movie genres, titles, and timestamps.

In particular, we consider the "MovieLens 1M Dataset", which contains about 1 million ratings from about 6,000 users on about 4,000 movies. The dataset is divided into 3 main files:
- "ratings.dat": which contains all the ratings
- "users.dat": which contains all the user information
- "movies.dat": which contains all the movie information

### Ratings dataset
All ratings are contained in the file "ratings.dat" and are in the following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
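As a minimal sketch of the record format described above, one line of "ratings.dat" can be split on the `::` separator and the timestamp converted to a date. The helper name `parse_rating_line` and the `RatedAt` key are illustrative assumptions, not part of the dataset or of the notebook; the timestamp is interpreted as UTC.

```python
from datetime import datetime, timezone


def parse_rating_line(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' record into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return {
        "UserID": int(user_id),
        "MovieID": int(movie_id),
        "Rating": int(rating),
        # Timestamps are seconds since the Unix epoch; interpret them as UTC.
        "RatedAt": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }


# First record of the ratings file as shown later in this report.
record = parse_rating_line("1::1193::5::978300760")
print(record)
```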

### Users dataset
User information is in the file "users.dat" and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
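The age and occupation columns store numeric codes, so a lookup table is needed to make them readable. The sketch below shows one possible way to do this with `Series.map`; the names `AGE_LABELS`, `OCCUPATION_LABELS`, `users_sample`, `AgeGroup` and `OccupationLabel` are assumptions introduced for illustration, and the occupation table is deliberately a small subset of the list above.

```python
import pandas as pd

# Age codes and labels copied from the list above.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Only a subset of the occupation codes, enough for the sample rows below.
OCCUPATION_LABELS = {10: "K-12 student", 15: "scientist", 16: "self-employed"}

# Three users in the same layout as users.dat (Gender and Zip-code omitted).
users_sample = pd.DataFrame(
    {"UserID": [1, 2, 3], "Age": [1, 56, 25], "Occupation": [10, 16, 15]}
)

# Codes missing from a lookup table would become NaN.
users_sample["AgeGroup"] = users_sample["Age"].map(AGE_LABELS)
users_sample["OccupationLabel"] = users_sample["Occupation"].map(OCCUPATION_LABELS)
print(users_sample)
```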

### Movies dataset
Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western

- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
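Because genres are pipe-separated inside a single string, they have to be split before they can be used as features. A possible sketch, using the pandas `Series.str.get_dummies` method on three sample rows (`movies_sample` and `genre_flags` are hypothetical names, not variables of the notebook):

```python
import pandas as pd

# Three movies in the same layout as movies.dat.
movies_sample = pd.DataFrame(
    {
        "MovieID": [1, 2, 5],
        "Title": [
            "Toy Story (1995)",
            "Jumanji (1995)",
            "Father of the Bride Part II (1995)",
        ],
        "Genres": [
            "Animation|Children's|Comedy",
            "Adventure|Children's|Fantasy",
            "Comedy",
        ],
    }
)

# One 0/1 indicator column per genre found in the sample.
genre_flags = movies_sample["Genres"].str.get_dummies(sep="|")
print(genre_flags)
```

Each movie then has one row of indicators, which could later be joined back to the ratings on the row index.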

// TODO: complete the description of the report and explain what are the main steps

## 2. Dataset preparation
### 2.1 Loading the separate datasets
// TODO: add description

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# Load ratings.dat
ratings = pd.read_csv(
    'MovieLens1M/ratings.dat', 
    sep='::', 
    engine='python', 
    names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
    encoding='ISO-8859-1'
)

# Load movies.dat
movies = pd.read_csv(
    'MovieLens1M/movies.dat', 
    sep='::', 
    engine='python', 
    names=['MovieID', 'Title', 'Genres'],
    encoding='ISO-8859-1'
)

# Load users.dat
users = pd.read_csv(
    'MovieLens1M/users.dat', 
    sep='::', 
    engine='python', 
    names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'],
    encoding='ISO-8859-1'
)

# Display the first few rows
print("Ratings:")
print(ratings.head())
print("\nMovies:")
print(movies.head())
print("\nUsers:")
print(users.head())


Ratings:
   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291

Movies:
   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy

Users:
   UserID Gender  Age  Occupation Zip-code
0       1      F    1          10    48067
1       2      M   56          16    70072
2       3      M   25          15    55117
3       4      M   45           7    02460
4       5      M   25          20    55455


### 2.2 Merge DataFrames
We’ll merge the ratings, movies, and users DataFrames to create a single dataset for analysis.

In [10]:
# Merge ratings with movies
ratings_movies = pd.merge(ratings, movies, on='MovieID')

# Merge the result with users
full_data = pd.merge(ratings_movies, users, on='UserID')

# Display the merged dataset
print("Merged Data:")
full_data.head()

Merged Data:


Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Gender,Age,Occupation,Zip-code
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,F,1,10,48067
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical,F,1,10,48067
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance,F,1,10,48067
3,1,3408,4,978300275,Erin Brockovich (2000),Drama,F,1,10,48067
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy,F,1,10,48067
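`pd.merge` performs an inner join by default, so ratings whose key has no match in the other table are dropped silently. The following sketch, on toy data, shows how the `validate` and `indicator` parameters of `pd.merge` can be used to check the join. The frames `ratings_sample` and `movies_sample` and the MovieID 9999 are made up for illustration.

```python
import pandas as pd

ratings_sample = pd.DataFrame(
    {"UserID": [1, 1, 2], "MovieID": [1193, 661, 9999], "Rating": [5, 3, 4]}
)
movies_sample = pd.DataFrame(
    {
        "MovieID": [1193, 661],
        "Title": [
            "One Flew Over the Cuckoo's Nest (1975)",
            "James and the Giant Peach (1996)",
        ],
    }
)

# Inner join (the default): the rating for the unknown MovieID 9999 disappears.
# validate raises an error if a MovieID appears more than once in movies_sample.
inner = pd.merge(
    ratings_sample, movies_sample, on="MovieID", how="inner", validate="many_to_one"
)

# A left join with indicator=True makes the unmatched ratings visible instead.
left = pd.merge(ratings_sample, movies_sample, on="MovieID", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), "matched ratings;", len(unmatched), "unmatched")
```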


## 3. Preprocessing
Before applying SVD we need to preprocess the data.
### 3.1 Normalize ratings
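We standardize the ratings to zero mean and unit variance. Note that StandardScaler is applied here to the whole Rating column, so it uses the global mean and standard deviation rather than per-user statistics.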

// TODO: normalize categorical data and encode textual

In [11]:
from sklearn.preprocessing import StandardScaler

# Standardize the ratings using the global mean and standard deviation
scaler = StandardScaler()
full_data['Normalized_Rating'] = scaler.fit_transform(full_data[['Rating']])

print("Normalized Ratings:")
full_data[['Rating', 'Normalized_Rating']].head()

Normalized Ratings:


   Rating  Normalized_Rating
0       5           1.269747
1       3          -0.520601
2       3          -0.520601
3       4           0.374573
4       5           1.269747
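If the goal is a comparison that is fair across users, one common alternative is to subtract each user's own mean rating, so that generous and strict raters end up on the same scale. This is not what the cell above does; it is a minimal sketch of that alternative on toy data, and the column name `Centered_Rating` is an assumption.

```python
import pandas as pd

ratings_sample = pd.DataFrame(
    {"UserID": [1, 1, 1, 2, 2], "Rating": [5, 3, 4, 2, 4]}
)

# Mean rating of each user, broadcast back to every row of that user.
user_mean = ratings_sample.groupby("UserID")["Rating"].transform("mean")

# Positive values mean "above this user's average", negative values "below".
ratings_sample["Centered_Rating"] = ratings_sample["Rating"] - user_mean
print(ratings_sample)
```

User 1 has mean 4 and user 2 has mean 3, so a rating of 4 is average for the first user but above average for the second.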


### 3.2 Filter data
To reduce noise, we filter out movies and users with very few ratings.

In [13]:
# Filter out movies with fewer than 10 ratings
movie_counts = full_data['MovieID'].value_counts()
filtered_movies = movie_counts[movie_counts >= 10].index
full_data = full_data[full_data['MovieID'].isin(filtered_movies)]

# Filter out users with fewer than 10 ratings
user_counts = full_data['UserID'].value_counts()
filtered_users = user_counts[user_counts >= 10].index
full_data = full_data[full_data['UserID'].isin(filtered_users)]

print("Filtered Data:")
full_data.head()

Filtered Data:


Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Gender,Age,Occupation,Zip-code,Normalized_Rating
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,F,1,10,48067,1.269747
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical,F,1,10,48067,-0.520601
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance,F,1,10,48067,-0.520601
3,1,3408,4,978300275,Erin Brockovich (2000),Drama,F,1,10,48067,0.374573
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy,F,1,10,48067,1.269747
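Removing users after removing movies can push some of the remaining movies back below the threshold, so a single pass does not strictly guarantee both minimum counts. A possible refinement, sketched here under the assumption that both thresholds should hold at the same time, repeats the two filters until nothing changes. The helper `filter_min_counts` and the toy data are hypothetical.

```python
import pandas as pd


def filter_min_counts(df, min_ratings):
    """Repeat the movie and user filters until neither removes any rows."""
    while True:
        n_before = len(df)

        # Keep only movies with at least min_ratings ratings.
        movie_counts = df["MovieID"].value_counts()
        df = df[df["MovieID"].isin(movie_counts[movie_counts >= min_ratings].index)]

        # Keep only users with at least min_ratings ratings.
        user_counts = df["UserID"].value_counts()
        df = df[df["UserID"].isin(user_counts[user_counts >= min_ratings].index)]

        if len(df) == n_before:
            return df


toy = pd.DataFrame(
    {
        "UserID": [1, 1, 2, 2, 3, 3, 4, 4],
        "MovieID": [10, 20, 10, 20, 10, 30, 30, 40],
        "Rating": [5, 4, 3, 4, 2, 5, 1, 3],
    }
)

filtered = filter_min_counts(toy, min_ratings=2)
print(filtered)
```

On this toy data the first pass removes user 4, which leaves movie 30 with a single rating; the second pass removes that movie and then user 3.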


## 4. Data Matrix
To apply SVD we transform the dataset into a user-item matrix where rows are users, columns are movies, and values are ratings. Missing ratings are filled with 0.

// TODO: see if we can improve by using other data also

// TODO: implement SVD from scratch

In [22]:
from scipy.sparse.linalg import svds
import numpy as np

# Create user-item matrix
user_item_matrix = full_data.pivot(index='UserID', columns='MovieID', values='Rating').fillna(0)

# Convert to numpy array
matrix = user_item_matrix.values

# Display the matrix shape
print("User-Item Matrix Shape:", matrix.shape)

User-Item Matrix Shape: (6040, 3260)
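With 6040 users and 3260 movies the dense matrix has about 19.7 million cells, while the dataset holds roughly 1 million ratings, so most entries are zeros standing in for missing ratings. As a sketch of an alternative representation, the same matrix can be stored in sparse form with `scipy.sparse.csr_matrix`, which `svds` also accepts. The toy frame and variable names below are assumptions.

```python
import pandas as pd
from scipy.sparse import csr_matrix

ratings_sample = pd.DataFrame(
    {"UserID": [1, 1, 2, 3], "MovieID": [10, 20, 20, 30], "Rating": [5, 3, 4, 2]}
)

# Map the original IDs to consecutive row and column positions.
user_codes, user_ids = pd.factorize(ratings_sample["UserID"], sort=True)
movie_codes, movie_ids = pd.factorize(ratings_sample["MovieID"], sort=True)

# Only the observed ratings are stored; every other cell is an implicit zero.
sparse_matrix = csr_matrix(
    (ratings_sample["Rating"].to_numpy(dtype=float), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

# Fraction of cells that actually hold a rating.
density = sparse_matrix.nnz / (sparse_matrix.shape[0] * sparse_matrix.shape[1])
print("Shape:", sparse_matrix.shape, "Density:", density)
```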


## 5. Perform SVD
We perform Singular Value Decomposition on the user-item matrix.

In [15]:
# Perform SVD
U, sigma, Vt = svds(matrix, k=50)  # k is the number of latent features

# svds returns the singular values as a 1-D array; convert them to a diagonal matrix
sigma = np.diag(sigma)

print("U shape:", U.shape)
print("Sigma shape:", sigma.shape)
print("Vt shape:", Vt.shape)

U shape: (6040, 50)
Sigma shape: (50, 50)
Vt shape: (50, 3260)
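Unlike `np.linalg.svd`, `svds` computes only `k` singular triplets, and their order should not be relied upon. The sketch below, on a small random matrix rather than on the rating matrix, sorts the result in descending order and checks it against the leading singular values of the full decomposition.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((8, 6))
k = 3

# Truncated SVD: only k singular triplets are computed.
U, s, Vt = svds(A, k=k)

# Sort the triplets so that the largest singular value comes first.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Reference: all singular values from the dense routine, already descending.
s_full = np.linalg.svd(A, compute_uv=False)
print("Largest k values agree:", np.allclose(s, s_full[:k]))
```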


From this decomposition we can reconstruct a rank-50 approximation of the original matrix; its values at the positions of missing ratings serve as the predicted ratings.

In [16]:
# Reconstruct the predicted matrix
predicted_matrix = np.dot(np.dot(U, sigma), Vt)

# Convert back to a DataFrame
predicted_ratings = pd.DataFrame(predicted_matrix, index=user_item_matrix.index, columns=user_item_matrix.columns)

print("Predicted Ratings:")
print(predicted_ratings.head())

Predicted Ratings:
MovieID      1         2         3         4         5         6         7     \
UserID                                                                          
1        4.298115  0.163665 -0.183822 -0.017804  0.020646 -0.182256 -0.103408   
2        0.751522  0.129365  0.340417  0.008707  0.002355  1.314438  0.078472   
3        1.844318  0.473045  0.098737 -0.039069 -0.020426 -0.155059 -0.142436   
4        0.400815 -0.045541  0.034384  0.084438  0.051520  0.260889 -0.081613   
5        1.557491 -0.005934 -0.045463  0.248822 -0.042923  1.519572 -0.163107   

MovieID      8         9         10    ...      3942      3943      3945  \
UserID                                 ...                                 
1        0.156572 -0.058205 -0.166397  ...  0.010388  0.032132  0.033226   
2        0.062501  0.163912  1.516349  ...  0.012872 -0.050799 -0.011114   
3        0.111568  0.036230  0.738985  ...  0.012514  0.050203  0.022773   
4        0.023506  0.051903 -0.07
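
The product of the k retained factors is not the original matrix but its best rank-k approximation in the Frobenius norm (Eckart-Young theorem), and the approximation error equals the square root of the sum of the squared discarded singular values. A small numerical check of this property, on a random matrix and with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))

# Full (thin) SVD of the small example matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the truncation versus the discarded singular values.
error = np.linalg.norm(A - A_k, ord="fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
print("Errors agree:", np.isclose(error, expected))
```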

## 6. Recommend Movies
From the matrix of predictions we just computed, we can recommend to each user the movies with the highest predicted ratings.

In [21]:
def recommend_movies(user_id, predicted_ratings, original_data, num_recommendations=5):
    user_row = predicted_ratings.loc[user_id].sort_values(ascending=False)

    # Exclude movies the user has already rated
    rated_movies = original_data[original_data['UserID'] == user_id]['MovieID']
    recommendations = user_row[~user_row.index.isin(rated_movies)].head(num_recommendations)

    # Map back to movie titles (rows come back in catalogue order, not ranked by predicted rating)
    recommended_movies = movies[movies['MovieID'].isin(recommendations.index)]
    return recommended_movies

# Recommend movies for a specific user (here, user_id = 2)
user_id = 2
recommended_movies = recommend_movies(user_id, predicted_ratings, full_data, num_recommendations=5)

print("Recommended Movies for User {}:".format(user_id))
recommended_movies

Recommended Movies for User 2:


      MovieID                     Title                          Genres
523       527   Schindler's List (1993)                       Drama|War
724       733          Rock, The (1996)       Action|Adventure|Thriller
1539     1580       Men in Black (1997)  Action|Adventure|Comedy|Sci-Fi
1656     1704  Good Will Hunting (1997)                           Drama
1892     1961           Rain Man (1988)                           Drama
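The table above is listed by MovieID because filtering the movie catalogue with `isin` does not keep the ranking. If the recommendations should appear from highest to lowest predicted rating, one possible variant is sketched below. The function `top_n_ranked`, the column `PredictedRating` and the scores are made up for illustration.

```python
import pandas as pd


def top_n_ranked(user_scores, rated_movie_ids, movies_df, n=5):
    """Return the n best unseen movies, ordered by predicted rating."""
    unseen = user_scores[~user_scores.index.isin(rated_movie_ids)]
    top = unseen.nlargest(n).rename("PredictedRating").rename_axis("MovieID")
    # A left merge preserves the row order of the left frame, i.e. the ranking.
    return top.reset_index().merge(movies_df, on="MovieID", how="left")


movies_sample = pd.DataFrame(
    {
        "MovieID": [1, 2, 3, 4],
        "Title": [
            "Toy Story (1995)",
            "Jumanji (1995)",
            "Grumpier Old Men (1995)",
            "Waiting to Exhale (1995)",
        ],
    }
)

# Illustrative predicted scores for one user, indexed by MovieID.
user_scores = pd.Series({1: 4.3, 2: 0.2, 3: 1.5, 4: 0.9})

# The user has already rated movie 1, so it must not be recommended.
result = top_n_ranked(user_scores, rated_movie_ids=[1], movies_df=movies_sample, n=2)
print(result)
```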


## 7. Evaluation
// TODO: implement some MSE to evaluate our recommendations

In [None]:
# Evaluate the model using RMSE
from sklearn.metrics import mean_squared_error

# Replace any NaN predictions with 0
predicted_ratings = predicted_ratings.fillna(0)

# Calculate RMSE over every cell of the matrix (unrated cells count as 0)
rmse = np.sqrt(mean_squared_error(matrix, predicted_matrix))
print("RMSE:", rmse)

RMSE: 0.5996174825047401
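The value above is computed over every cell of the matrix, including the zeros that stand in for missing ratings, so it mostly measures how well those zeros are reproduced. A sketch of an RMSE restricted to the observed ratings is given below; it assumes, as in this notebook, that 0 marks a missing rating. The function `rmse_observed` and the toy arrays are illustrative. Even with the mask this remains a training error, since the same ratings were used to compute the decomposition.

```python
import numpy as np


def rmse_observed(actual, predicted):
    """RMSE over the cells where a rating exists (actual > 0)."""
    mask = actual > 0
    errors = actual[mask] - predicted[mask]
    return np.sqrt(np.mean(errors ** 2))


# Toy matrices: 0 in `actual` means "not rated".
actual = np.array([[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]])
predicted = np.array([[4.0, 0.5, 3.0], [0.2, 2.0, 0.1]])

rmse_all = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_rated = rmse_observed(actual, predicted)
print("RMSE over all cells:", rmse_all)
print("RMSE over rated cells:", rmse_rated)
```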


## 8. Conclusions
// TODO: write conclusions