In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [29]:
df_ratings = pd.read_pickle('/content/drive/My Drive/df_ratings_100k.pkl')
df_movies = pd.read_pickle('/content/drive/My Drive/df_movies_cleaned.pkl')

# 1. Preprocessing

## 1.1 Convert UserId & MovieId

In a collaborative filtering model, user IDs and movie IDs must be converted to contiguous, zero-based integers, because each ID is used as a row index into an embedding matrix. Indices with gaps would force the matrix to allocate rows that are never trained. After this step, every user and every movie corresponds to exactly one dense vector in the latent space.
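To make the indexing argument concrete, here is a minimal sketch with made-up IDs. A random NumPy matrix stands in for an embedding layer; the names `raw_ids`, `n_factors`, and `embedding` are illustrative assumptions and not part of this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw IDs with large gaps (not taken from the dataset)
raw_ids = np.array([10, 500, 10, 73929, 500])

encoder = LabelEncoder()
idx = encoder.fit_transform(raw_ids)    # contiguous, zero-based codes

n_items = len(encoder.classes_)         # number of distinct IDs
n_factors = 4                           # assumed embedding width
rng = np.random.default_rng(0)
embedding = rng.normal(size=(n_items, n_factors))  # stand-in for an embedding layer

vectors = embedding[idx]                # one row per rating, looked up by index
print(idx)            # [0 1 0 2 1]
print(vectors.shape)  # (5, 4)
```

Because the codes run from 0 to `n_items - 1`, the matrix needs only three rows here instead of 73,930.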

In [30]:
# Combine all movie IDs and user IDs from both dataframes to ensure comprehensive encoding
all_movie_ids = pd.concat([df_ratings['movieId'], df_movies['movieId']]).unique()
all_user_ids = df_ratings['userId'].unique()

# Create and fit the encoders
movie_encoder = LabelEncoder()
user_encoder = LabelEncoder()

movie_encoder.fit(all_movie_ids)
user_encoder.fit(all_user_ids)

# Transform movie and user IDs in both dataframes
df_ratings['user'] = user_encoder.transform(df_ratings['userId'])
df_ratings['movie'] = movie_encoder.transform(df_ratings['movieId'])
df_movies['movie'] = movie_encoder.transform(df_movies['movieId'])

In [31]:
# Check Encoding:

print("Unique users in ratings:", df_ratings['user'].nunique())
print("Unique movies in ratings:", df_ratings['movie'].nunique())
print("Min/Max user IDs:", df_ratings['user'].min(), '/', df_ratings['user'].max())
print("Min/Max movie IDs:", df_ratings['movie'].min(), '/', df_ratings['movie'].max())

Unique users in ratings: 55588
Unique movies in ratings: 9494
Min/Max user IDs: 0 / 55587
Min/Max movie IDs: 0 / 39082


User IDs appear to be correctly encoded. We have 55,588 unique users, and the user IDs range from 0 to 55,587, which suggests that every unique user ID has been mapped to a unique integer in a contiguous zero-based range.

Movie IDs, however, show a discrepancy. There are 9,494 unique movies in the ratings, but the encoded movie IDs range from 0 to 39,082. The encoder was fitted on the movie IDs of both dataframes, so the codes also cover movies in df_movies that have no ratings, and the rated movies are spread sparsely across this range. We therefore remap the movie IDs to a contiguous range based only on the movies that appear in df_ratings.
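The gap follows from fitting the encoder on the union of both tables. The toy example below reproduces the effect with made-up IDs; `ratings_movie_ids` and `catalog_movie_ids` are hypothetical stand-ins for the two `movieId` columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the two tables (values are made up)
ratings_movie_ids = pd.Series([3, 7, 7, 9])               # only 3 distinct movies are rated
catalog_movie_ids = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Fit on the union of both columns, as in the encoding cell
encoder = LabelEncoder()
encoder.fit(pd.concat([ratings_movie_ids, catalog_movie_ids]).unique())

codes = encoder.transform(ratings_movie_ids)
n_classes = len(encoder.classes_)
used_codes = sorted(set(codes.tolist()))

print(n_classes)    # 9
print(used_codes)   # [2, 6, 8]
```

Three rated movies occupy codes scattered across 0..8, which mirrors 9,494 rated movies scattered across 0..39,082.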

In [32]:
# Build a contiguous, zero-based mapping from the movie IDs that appear in the ratings
movie_id_map = {movie_id: i for i, movie_id in enumerate(sorted(df_ratings['movieId'].unique()))}

# Apply the new mapping to both DataFrames
df_ratings['movie'] = df_ratings['movieId'].map(movie_id_map)
# Movies in df_movies that were never rated are not in the map and get -1
df_movies['movie'] = df_movies['movieId'].map(movie_id_map).fillna(-1).astype(int)

# Check again
print("New max movie ID in ratings:", df_ratings['movie'].max())
print("New unique movie IDs in ratings:", df_ratings['movie'].nunique())

New max movie ID in ratings: 9493
New unique movie IDs in ratings: 9494
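The remapping and its effect on both tables can be sketched on toy data as follows. The IDs are made up, and the explicit contiguity check (`is_contiguous`) is an addition for illustration rather than something this notebook runs.

```python
import pandas as pd

# Made-up IDs: the catalogue contains movies (2 and 8) that nobody rated
ratings = pd.DataFrame({"movieId": [7, 3, 7, 9]})
movies = pd.DataFrame({"movieId": [2, 3, 7, 8, 9]})

# Sorted unique rated IDs -> 0..n-1
movie_id_map = {mid: i for i, mid in enumerate(sorted(ratings["movieId"].unique()))}

ratings["movie"] = ratings["movieId"].map(movie_id_map)
# Unrated catalogue movies have no entry in the map; mark them with -1
movies["movie"] = movies["movieId"].map(movie_id_map).fillna(-1).astype(int)

n_movies = len(movie_id_map)
is_contiguous = sorted(ratings["movie"].unique().tolist()) == list(range(n_movies))
print(is_contiguous)              # True
print(movies["movie"].tolist())   # [-1, 0, 1, -1, 2]
```

Comparing against `range(n_movies)` is a stricter test than comparing the maximum with the unique count, and the `-1` entries make unrated catalogue movies easy to filter out later.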


In [33]:
print(df_ratings[['movieId', 'movie']].head())
print(df_movies[['movieId', 'movie']].head())

          movieId  movie
11800835     1037    829
3192182     27316   5716
10041143      307    269
14911364    73929   7389
13024846     4308   3281
   movieId  movie
0        1      0
1        2      1
2        3      2
3        4      3
4        5      4


The mapping has now been applied correctly: the movie IDs range from 0 to 9493, which matches the count of unique movie IDs (9494), so the IDs are contiguous and zero-based. As a final validation, we check that every movie referenced in the ratings also appears in the movies dataframe:

In [27]:
# Ensure that all movies referenced in ratings are available in the movies DataFrame
missing_movies = df_ratings[~df_ratings['movie'].isin(df_movies['movie'])]
if missing_movies.empty:
    print("All movies in ratings are accounted for in the movies dataframe.")
else:
    print(f"There are {missing_movies.shape[0]} missing movies in the movies dataframe.")

There are 781 missing movies in the movies dataframe.


The check counts rows of df_ratings, so the result means that 781 ratings refer to movies for which no metadata is available in df_movies; it is not the number of distinct movies. Those rows will be dropped as they are not significant for the model.
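The difference between counting rating rows and counting distinct movies can be seen in a small sketch. The data is made up, and `n_rating_rows` and `n_distinct_movies` are illustrative names.

```python
import pandas as pd

# Toy data: movie 2 has three ratings but no metadata row
ratings = pd.DataFrame({"movie": [0, 0, 1, 2, 2, 2]})
movies = pd.DataFrame({"movie": [0, 1]})

has_metadata = ratings["movie"].isin(movies["movie"])

n_rating_rows = int((~has_metadata).sum())                          # rows that would be dropped
n_distinct_movies = ratings.loc[~has_metadata, "movie"].nunique()   # movies without metadata

print(n_rating_rows, n_distinct_movies)   # 3 1
```

Reporting both numbers shows how many ratings are lost and how many movies are affected.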

In [28]:
# Removing ratings with missing movie metadata
df_ratings = df_ratings[df_ratings['movie'].isin(df_movies['movie'])]

# Check again
missing_movies_after = df_ratings[~df_ratings['movie'].isin(df_movies['movie'])]
if missing_movies_after.empty:
    print("All movies in ratings are now accounted for in the movies dataframe.")
else:
    print(f"There are still {missing_movies_after.shape[0]} missing movies in the movies dataframe.")

All movies in ratings are now accounted for in the movies dataframe.


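The data is now encoded and we can move on to the next preprocessing step. Note that the dropped movies leave gaps in the movie indices that remain in df_ratings, so an embedding layer should still be sized from the full mapping (9,494 movies) rather than from the number of movies left after filtering.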

## 1.2 Normalize Ratings

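Next, we normalize the ratings to the range [0, 1]. Ratings run from 0.5 to 5.0, so subtracting 0.5 and dividing by 4.5 maps them onto this interval. Keeping the target in a small, fixed range generally helps the model train faster and converge more easily.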
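As a quick worked example, the sketch below applies the min-max formula to three sample ratings and also shows the inverse transform, which is needed to turn model predictions back into star ratings. The helper names `normalize` and `denormalize` are assumptions for illustration and are not defined elsewhere in this notebook.

```python
import numpy as np

R_MIN, R_MAX = 0.5, 5.0   # rating bounds implied by the (rating - 0.5) / 4.5 formula

def normalize(r):
    """Map ratings from [R_MIN, R_MAX] to [0, 1]."""
    return (r - R_MIN) / (R_MAX - R_MIN)

def denormalize(r_norm):
    """Map values from [0, 1] back to the original rating scale."""
    return r_norm * (R_MAX - R_MIN) + R_MIN

ratings = np.array([0.5, 2.75, 5.0])
normed = normalize(ratings)
restored = denormalize(normed)

print(normed)     # [0.  0.5 1. ]
print(restored)   # [0.5  2.75 5.  ]
```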

In [5]:
# Normalize the ratings to a scale of 0 to 1
df_ratings['rating_norm'] = (df_ratings['rating'] - 0.5) / 4.5

In [13]:
# Check Normalization:

print("Min/Max normalized ratings:", df_ratings['rating_norm'].min(), "/", df_ratings['rating_norm'].max())
print(df_ratings[['rating', 'rating_norm']].head())

Min/Max normalized ratings: 0.0 / 1.0
          rating  rating_norm
11800835     0.5          0.0
3192182      0.5          0.0
10041143     0.5          0.0
14911364     0.5          0.0
13024846     0.5          0.0


## 1.3 Prepare Data for Modelling

Now, we split the data into training and test sets. For recommendation tasks, a chronological split is often more appropriate than a random one because user preferences and item popularity change over time. Training on earlier ratings and testing on later ones simulates deployment, where the model must predict preferences from data that arrives after it was trained.
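A minimal sketch of such a split on made-up data is shown below. The frame `toy` and its values are hypothetical; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Five made-up ratings in arbitrary time order
toy = pd.DataFrame({
    "user": [0, 1, 0, 2, 1],
    "movie": [0, 0, 1, 1, 2],
    "rating_norm": [1.0, 0.5, 0.0, 0.75, 0.25],
    "timestamp": pd.to_datetime(
        ["2015-03-01", "2014-01-01", "2016-06-01", "2013-05-01", "2017-01-01"]
    ),
})

# Sort by time, then take the earliest 80% of rows for training
toy_sorted = toy.sort_values("timestamp")
cutoff = int(len(toy_sorted) * 0.8)
train, test = toy_sorted.iloc[:cutoff], toy_sorted.iloc[cutoff:]

X_train, y_train = train[["user", "movie"]], train["rating_norm"]
X_test, y_test = test[["user", "movie"]], test["rating_norm"]

# No training rating is later than any test rating
no_leakage = train["timestamp"].max() <= test["timestamp"].min()
print(no_leakage)               # True
print(len(train), len(test))    # 4 1
```

Sorting before selecting features keeps `X` and `y` aligned with the split without any index lookups.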

In [15]:
X = df_ratings[['user', 'movie']]
y = df_ratings['rating_norm']

In [16]:
# Sort data chronologically
df_ratings = df_ratings.sort_values(by='timestamp')

# Define a cutoff for splitting the data (80% train, 20% test)
cutoff = int(len(df_ratings) * 0.8)
train_df = df_ratings.iloc[:cutoff]
test_df = df_ratings.iloc[cutoff:]

# Select features and target directly from the chronologically split dataframes
X_train, y_train = train_df[['user', 'movie']], train_df['rating_norm']
X_test, y_test = test_df[['user', 'movie']], test_df['rating_norm']

In [17]:
# Check Train-Test-Split:

print("Train data range from {} to {}".format(train_df['timestamp'].min(), train_df['timestamp'].max()))
print("Test data range from {} to {}".format(test_df['timestamp'].min(), test_df['timestamp'].max()))

Train data range from 1996-02-15 12:59:58 to 2015-03-15 10:25:12
Test data range from 2015-03-15 11:38:12 to 2017-08-04 03:38:23
