# Info

Read in my exported Letterboxd data and the MovieLens 32M dataset (for use with the `surprise` library). Characterize the features and transformations needed to join the Letterboxd data to the MovieLens dataset.

# Prep Data

Map each Letterboxd URI to its IMDb ID so that the Letterboxd data can be merged with the MovieLens dataset. Reference for the lookup function: [letterboxd2imdb.py](https://github.com/TobiasPankner/Letterboxd-to-IMDb/blob/master/letterboxd2imdb.py)

Then, output a combined ratings dataset that will be used as the input for the SVD model.

Letterboxd data is sourced from their website's CSV export; the zipped contents are extracted and kept under `data/letterboxd`. The MovieLens 32M dataset is downloaded from [here](https://grouplens.org/datasets/movielens/32m/), with the zipped contents extracted and kept under `data/ml-32m`.
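
The IMDb ID mapping described above comes down to a regular expression applied to the film page's HTML. Below is a minimal sketch of just that extraction step, run against a made-up HTML fragment. The `sample_html` string and the `extract_imdb_id` helper name are illustrative assumptions; Letterboxd's real markup may differ.

```python
import re

# Hypothetical HTML fragment for illustration only; real Letterboxd markup may differ.
sample_html = '<a href="http://www.imdb.com/title/tt0111161/maindetails" class="track-event">IMDb</a>'

def extract_imdb_id(html):
    """Return the digits of the first IMDb title ID found in html, or None."""
    match = re.search(r'href="[^"]*title/(tt\d+)/maindetails"', html)
    if match is None:
        return None
    # Drop the 'tt' prefix so only the numeric part remains
    return match.group(1)[2:]

imdb_id = extract_imdb_id(sample_html)
print(imdb_id)       # 0111161
print(int(imdb_id))  # 111161
```

Converting the result to a number drops the leading zeros, which is what lets it line up with a numeric `imdb_id` column when merging.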

In [1]:
import os
import pandas as pd
import numpy as np
from surprise import Dataset
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import re

def get_imdb_id(letterboxd_uri):
  headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
  }

  # timeout keeps a stalled request from hanging the whole mapping run
  resp = requests.get(letterboxd_uri, headers=headers, timeout=10)
  if resp.status_code != 200:
    return None

  # extract the IMDb url
  re_match = re.findall(r'href=".+title/(tt\d+)/maindetails"', resp.text)
  if not re_match:
    return None

  return re_match[0].replace('tt', '')

# import movielens data
data_dir = '../data/'
data_dir_ml = os.path.join(data_dir, 'ml-32m')
# dimension table for the film list; maps MovieLens movie IDs to IMDb IDs (the join key for Letterboxd)
df_films = pd.merge(
    pd.read_csv(os.path.join(data_dir_ml, 'movies.csv')),
    pd.read_csv(os.path.join(data_dir_ml, 'links.csv')),
    how='left', on='movieId'
  ) \
  .drop('tmdbId', axis=1) \
  .rename(columns={'movieId': 'movie_id', 'imdbId': 'imdb_id'})
df_ratings = pd.read_csv(os.path.join(data_dir_ml, 'ratings.csv')) \
  .rename(columns={'userId': 'user_id', 'movieId': 'movie_id'})

In [None]:
# Letterboxd data: set lb_rerun = True to rebuild from the raw export;
# otherwise load the saved CSVs so the slow IMDb mapping doesn't run every time
lb_rerun = False
data_dir_lb = os.path.join(data_dir, 'letterboxd')
if lb_rerun:
  # letterboxd data
  df_ratings_ec = pd.read_csv(os.path.join(data_dir_lb, 'ratings.csv')) \
    .set_axis(['date', 'title', 'year', 'lb_uri', 'rating'], axis=1) \
    .assign(decade = lambda x: (x['year'] // 10) * 10)
  # use diary for watch counts
  df_log_counts = pd.read_csv(os.path.join(data_dir_lb, 'diary.csv')) \
    .groupby(['Name', 'Year']).size().reset_index(name='n_logs') \
    .assign(rewatched = lambda x: x['n_logs'] > 1) \
    .rename(columns={'Name':'title', 'Year':'year'})
  df_lb = pd.merge(df_ratings_ec, df_log_counts, how='left', on=['title', 'year']) \
    .fillna({'n_logs':0}) # never logged = NA -> 0
  df_lb['imdb_id'] = pd.to_numeric(df_lb['lb_uri'].apply(get_imdb_id))
  df_lb = pd.merge(
    df_lb,
    df_films[['imdb_id', 'movie_id']],
    how='inner', # inner join - only want to keep common set of titles
    on='imdb_id'
  )
  
  # add myself to ratings database, with my ID being 0
  df_ratings_with_ec = pd.concat([
    df_lb.assign(user_id=0)[['user_id', 'movie_id', 'rating']],
    df_ratings[['user_id', 'movie_id', 'rating']]
  ])

  # write out
  df_lb.to_csv(os.path.join(data_dir_lb, 'lb_joined.csv'), index=False)
  df_ratings_with_ec.to_csv(os.path.join(data_dir, 'ratings_combined.csv'), index=False)
else: 
  df_lb = pd.read_csv(os.path.join(data_dir_lb, 'lb_joined.csv'))
  df_ratings_with_ec = pd.read_csv(os.path.join(data_dir, 'ratings_combined.csv'))
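
The join-and-append step in the cell above can be illustrated on toy data. This is a sketch under assumptions: the frames `lb`, `films` and `ml_ratings` are made-up stand-ins for `df_lb`, `df_films` and `df_ratings`, and `ignore_index=True` is added here only to give the toy result a clean index.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; every value is made up.
lb = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'imdb_id': [111, 222, 444],   # 222 and 444 have no MovieLens counterpart
    'rating': [4.5, 3.0, 5.0],
})
films = pd.DataFrame({'imdb_id': [111, 333], 'movie_id': [1, 3]})
ml_ratings = pd.DataFrame({'user_id': [7, 8], 'movie_id': [1, 3], 'rating': [4.0, 2.5]})

# Inner join keeps only titles present in both sources
lb_joined = pd.merge(lb, films, how='inner', on='imdb_id')

# Put the Letterboxd ratings in front of the MovieLens ratings as user 0
combined = pd.concat(
    [
        lb_joined.assign(user_id=0)[['user_id', 'movie_id', 'rating']],
        ml_ratings[['user_id', 'movie_id', 'rating']],
    ],
    ignore_index=True,
)

print(len(lb) - len(lb_joined), 'Letterboxd titles dropped by the join')
print(combined)
```

Counting the rows lost to the inner join, as the first `print` does, is a quick way to see how much of the Letterboxd history has no MovieLens match.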

# EDA

## Ratings Distribution
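
A minimal sketch of one way the user-0 (Letterboxd) rating distribution could be compared with the MovieLens users', assuming a combined frame shaped like `df_ratings_with_ec`. The `ratings` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_ratings_with_ec; every value is made up.
ratings = pd.DataFrame({
    'user_id': [0, 0, 0, 0, 1, 1, 2, 2],
    'rating':  [4.0, 4.0, 5.0, 3.0, 3.0, 4.0, 3.0, 2.0],
})

# Label each row by source: user 0 is the Letterboxd export, everyone else is MovieLens
source = (
    ratings['user_id'].eq(0)
    .map({True: 'letterboxd', False: 'movielens'})
    .rename('source')
)

# Share of each rating value within each source (each row sums to 1)
dist = pd.crosstab(source, ratings['rating'], normalize='index')
print(dist)
```

Normalizing within each source makes the two rows comparable even though one user contributes far fewer ratings than the full MovieLens population.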

## Additional Letterboxd EDA

Further characterizations and superlatives of my Letterboxd data (mostly just for fun, as they are not used by the SVD model):

* Letterboxd log count/rewatches
* Year or Decade characterizations/superlatives
* Highest rated/most rewatched compared against MovieLens' ratings
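
The first two items in the list above could be computed along these lines. This is a sketch on made-up data: the `lb` frame stands in for `df_lb`, with column names taken from the prep cell, and the titles and values are invented.

```python
import pandas as pd

# Toy stand-in for df_lb; titles and values are made up.
lb = pd.DataFrame({
    'title':  ['Film A', 'Film B', 'Film C', 'Film D'],
    'year':   [1994, 1999, 2007, 2019],
    'rating': [5.0, 4.0, 3.5, 4.5],
    'n_logs': [3, 1, 0, 2],
})
lb['decade'] = (lb['year'] // 10) * 10

# Per-decade film count, mean rating and total diary logs
by_decade = lb.groupby('decade').agg(
    n_films=('title', 'count'),
    mean_rating=('rating', 'mean'),
    total_logs=('n_logs', 'sum'),
)

# Most rewatched title: the one with the highest diary log count
most_rewatched = lb.loc[lb['n_logs'].idxmax(), 'title']

print(by_decade)
print(most_rewatched)
```

The comparison against MovieLens ratings in the third item would additionally need per-film aggregates from the MovieLens ratings joined on `movie_id`, which this sketch leaves out.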