### Exploration

The main goal of this notebook is to explore the data and test ideas before the actual Python scripts are written. We will download the data and try each of the recommender system approaches we are targeting:

* Collaborative Filtering;

* Matrix Factorization;

* Content-Based.

We will attempt to use the whole MovieLens 25M dataset instead of just a slice of it. To do so, we need to carefully choose the right data structures for each strategy and approach.

### Libraries

In [1]:
import os
from datetime import datetime
import pandas as pd
import numpy as np
import scipy
import boto3

### Data

In [2]:
FILES = ["movies.csv", "ratings.csv"]

In [3]:
bucket_name = "data-ml-gfluz94"
target_folder = "../data"

# Required files that are not yet present locally
os.makedirs(target_folder, exist_ok=True)
missing_files = set(FILES) - set(os.listdir(target_folder))

if len(missing_files) > 0:

    client = boto3.client(
        "s3", aws_access_key_id=os.environ["AWS_ACCESS_KEY"], aws_secret_access_key=os.environ["AWS_SECRET_KEY"]
    )
    content_response = client.list_objects(Bucket=bucket_name)

    for content in content_response["Contents"]:
        client.download_file(
            Bucket=bucket_name,
            Key=content["Key"],
            Filename=os.path.join(target_folder, content["Key"]),
        )

In [4]:
df_movies = pd.read_csv(os.path.join(target_folder, "movies.csv"))
df_ratings = pd.read_csv(os.path.join(target_folder, "ratings.csv"))

In [5]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [6]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB


In [7]:
display(df_movies.head())
display(df_ratings.head())

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510


Adjusting and preprocessing the data

In [8]:
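# Vectorized conversion from Unix seconds to (naive UTC) datetimes;
# much faster than a row-by-row apply over ~25M rows
df_ratings["timestamp"] = pd.to_datetime(df_ratings["timestamp"], unit="s")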

In [9]:
display(df_ratings.head())

   userId  movieId  rating            timestamp
0       1      296     5.0  2006-05-17 15:34:04
1       1      306     3.5  2006-05-17 12:26:57
2       1      307     5.0  2006-05-17 12:27:08
3       1      665     5.0  2006-05-17 15:13:40
4       1      899     3.5  2006-05-17 12:21:50


In [10]:
n_users = df_ratings.userId.nunique()
n_movies = df_ratings.movieId.nunique()

print(f"Total users: {n_users}")
print(f"Total movies: {n_movies}")

Total users: 162541
Total movies: 59047


The `movies` dataset contains metadata about each movie: its title and its genres. We will see later that the genres can be turned into useful features for a content-based approach, but for now we are mostly concerned with the `ratings` dataset.
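As a quick preview of how the pipe-separated `genres` column could become features, here is a minimal sketch using pandas' `Series.str.get_dummies`. The three rows are copied from the `df_movies.head()` output above; the names `toy_movies` and `genre_features` are illustrative only, and the encoding actually used in the content-based section may differ.

```python
import pandas as pd

# Three rows copied from df_movies.head(); toy_movies is an illustrative name
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Adventure|Children|Fantasy",
            "Comedy|Romance",
        ],
    }
)

# One binary column per genre: 1 if the movie has that genre, 0 otherwise
genre_features = toy_movies["genres"].str.get_dummies(sep="|")
print(genre_features)
```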

With roughly 25M ratings (about 763 MB as a DataFrame), a dense users vs. movies matrix is impractical to hold in memory. Its dimensions (162,541 x 59,047) amount to about 9.6 billion cells, or roughly 77 GB as `float64`, and it is **highly sparse**: only about 0.26% of the cells would hold a rating, because each user has rated just a small fraction of the movies.
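The scale of the dense matrix can be checked with a back-of-the-envelope calculation. The sketch below hard-codes the counts printed by the cells above and assumes 8 bytes per cell (`float64`); the names `n_ratings`, `n_cells`, `dense_gb` and `density` are introduced here for illustration.

```python
# Counts taken from the outputs of the cells above
n_users = 162_541
n_movies = 59_047
n_ratings = 25_000_095

# Size of a dense users x movies matrix
n_cells = n_users * n_movies
dense_gb = n_cells * 8 / 1e9  # assumes float64, i.e. 8 bytes per cell

# Fraction of cells that would actually hold a rating
density = n_ratings / n_cells

print(f"Cells in dense matrix: {n_cells:,}")
print(f"Dense float64 size: {dense_gb:.1f} GB")
print(f"Filled cells: {density:.2%}")
```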

We therefore have a few options for handling this amount of data, each of which may suit a different algorithm:

* **Dictionary**: A dictionary representation considerably reduces the memory footprint, since movies that a user has not rated are simply not stored for that user;

* **Sparse Matrix**: `scipy.sparse` stores only the non-zero entries, which is memory-efficient and also convenient when building custom datasets to feed to neural networks.
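The two representations above can be sketched on a handful of rows. This is a minimal illustration, not the final implementation: `toy_ratings` holds made-up values in the same layout as `df_ratings`, and the names `user_to_ratings` and `ratings_matrix` are hypothetical. The CSR format is one of several that `scipy.sparse` offers and is assumed here only because it supports efficient row (per-user) access.

```python
from collections import defaultdict

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for df_ratings (made-up values, same column layout)
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "movieId": [296, 306, 296, 899, 306],
        "rating": [5.0, 3.5, 4.0, 3.5, 2.0],
    }
)

# Dictionary: user -> {movie: rating}; only observed ratings are stored
user_to_ratings = defaultdict(dict)
for user_id, movie_id, rating in toy_ratings.itertuples(index=False):
    user_to_ratings[user_id][movie_id] = rating

# Sparse matrix: raw ids are not contiguous, so map them to 0-based
# row/column indices first; user_ids / movie_ids keep the reverse mapping
user_codes, user_ids = pd.factorize(toy_ratings["userId"])
movie_codes, movie_ids = pd.factorize(toy_ratings["movieId"])

ratings_matrix = csr_matrix(
    (
        toy_ratings["rating"].to_numpy(dtype=np.float32),
        (user_codes, movie_codes),
    ),
    shape=(len(user_ids), len(movie_ids)),
)

print(dict(user_to_ratings))
print(ratings_matrix.shape, ratings_matrix.nnz)
```

Because MovieLens ratings start at 0.5, a stored value of 0 never collides with a real rating, so an absent entry in the sparse matrix can safely be read as "not rated".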

### Collaborative Filtering