Features for the MVP:
- ratings
- userID
- genres (first)
- language (first)
- release_date (in ten year steps) 
- popularity
- cast (first)
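
The "(first)" and "ten year steps" features above could be derived roughly as follows. This is a minimal sketch, assuming that `genres` (and similarly `cast`) is stored as a stringified list of dicts with a `name` key and that `release_date` is an ISO date string. The helper names `first_name` and `decade`, the output column names, and the toy data are hypothetical.

```python
import ast
import pandas as pd

def first_name(raw):
    """Return the 'name' of the first entry in a stringified list of dicts, else None."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return None  # unparsable or missing value
    if isinstance(items, list) and items and isinstance(items[0], dict):
        return items[0].get('name')
    return None

def decade(raw_date):
    """Map a date string such as '1995-10-30' to the start of its decade (1990)."""
    ts = pd.to_datetime(raw_date, errors='coerce')
    if ts is None or pd.isna(ts):
        return None  # missing or invalid date
    return ts.year // 10 * 10

# Toy rows standing in for the real metadata (assumed format).
toy = pd.DataFrame({
    'genres': ["[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]", "[]"],
    'release_date': ['1995-10-30', None],
})
toy['genre_first'] = toy['genres'].apply(first_name)
toy['release_decade'] = toy['release_date'].apply(decade)
```

Rows with an empty list or a missing date end up with a missing value, which can be dropped or imputed later.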

In [1]:
import torch
import pandas as pd

Import CSV files into dataframes and merge them into one joined dataframe.
In `movies_metadata.csv`, three rows have misaligned columns. For instance, their `id` column contains date values such as `1997-08-20`. We need to clean up the dataframes before we can merge them.

In [None]:
dfr = pd.read_csv('data/ratings_small.csv', usecols=['userId','movieId','rating'])
# rename() returns a new dataframe, so the result has to be assigned back
dfr = dfr.rename(columns={'movieId':'id'})
dfm = pd.read_csv('data/movies_metadata.csv', usecols=['id', 'genres','original_title', 'original_language','release_date','popularity'])
dfc = pd.read_csv('data/credits.csv', usecols=['id','cast'])

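# Clean up movies_metadata dataset by removing rows with invalid IDs
# Non-numeric IDs (e.g. date strings) become NaN and are dropped
dfm['id'] = pd.to_numeric(dfm['id'], errors='coerce')
# .copy() avoids a SettingWithCopyWarning on the assignment below
dfm = dfm.dropna(subset=['id']).copy()
dfm['id'] = dfm['id'].astype('int')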

df_merged = pd.merge(dfm, dfc, on='id')
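
To end up with one joined dataframe, the ratings would also have to be merged in. The following is a minimal sketch on toy data, assuming the ratings' movie IDs and the metadata `id` column refer to the same identifier space (if they do not, a mapping table would be needed first). The frames `ratings`, `movies` and `joined` are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins for the ratings and the cleaned metadata.
ratings = pd.DataFrame({'userId': [1, 1, 2],
                        'movieId': [10, 20, 10],
                        'rating': [4.0, 3.5, 5.0]})
movies = pd.DataFrame({'id': [10, 20, 30],
                       'original_title': ['A', 'B', 'C']})

# Align the key name, then join every rating with its movie row.
ratings = ratings.rename(columns={'movieId': 'id'})
joined = pd.merge(ratings, movies, on='id', how='inner',
                  validate='many_to_one')  # raises if movie ids are duplicated
```

`validate='many_to_one'` makes pandas raise an error instead of silently multiplying rows when the right-hand frame contains duplicate IDs.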

In [None]:
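# 70/30 train/test split of the ratings
total_size = len(dfr)
train_size = int(total_size * 0.7)
test_size = total_size - train_size

# random_split only needs len(dfr); each returned Subset stores the selected
# row positions in .indices. The fixed seed makes the split reproducible.
train_dataset, test_dataset = torch.utils.data.random_split(dfr, [train_size,test_size], generator=torch.Generator().manual_seed(42))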
print(len(train_dataset))
print(len(test_dataset))
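
A `Subset` built on a dataframe cannot be indexed row by row, because integer indexing on a dataframe looks up columns. One possible way to get usable train and test dataframes is to select the rows by the positions stored in `.indices`. This is a sketch on toy data; `toy`, `train_df` and `test_df` are hypothetical names.

```python
import pandas as pd
import torch
from torch.utils.data import random_split

# Toy ratings frame standing in for dfr.
toy = pd.DataFrame({'userId': range(10),
                    'id': range(100, 110),
                    'rating': [3.0] * 10})

train_size = int(len(toy) * 0.7)
test_size = len(toy) - train_size

train_subset, test_subset = random_split(
    toy, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

# Materialize each split as its own dataframe via positional indexing.
train_df = toy.iloc[train_subset.indices].reset_index(drop=True)
test_df = toy.iloc[test_subset.indices].reset_index(drop=True)
```

The two dataframes are disjoint and together cover every row of the original frame.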