<a href="https://colab.research.google.com/github/bil-ardhry/fastai_course_22/blob/main/Collaborate_Filtering_Deepdive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Collaborative Filtering**

In [1]:
import pandas as pd
import numpy as np
import torch
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

## Import Data from [MovieLens](https://grouplens.org/datasets/movielens/)

In [3]:
path = untar_data(URLs.ML_100k)

In [6]:
# Inspect the source of untar_data to see what it does behind the scenes
??untar_data

In [8]:
# The main ratings table is in u.data, a tab-separated file with no header row
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user       100000 non-null  int64
 1   movie      100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [9]:
ratings.head()

,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [11]:
ratings.describe()

,user,movie,rating,timestamp
count,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0
std,266.61442,330.798356,1.125674,5343856.0
min,1.0,1.0,1.0,874724700.0
25%,254.0,175.0,3.0,879448700.0
50%,447.0,322.0,4.0,882826900.0
75%,682.0,631.0,4.0,888260000.0
max,943.0,1682.0,5.0,893286600.0




*   Latent factors are the key to building recommender systems with collaborative filtering.
*   If we knew how strongly each user preferred characteristics such as directors, actors, or genres, and how strongly each movie exhibited them, making recommendations would be straightforward: we would simply match the two.



In [12]:
# Assume the latent factors are known and fall in a range between -1 and +1,
# with positive numbers indicating a stronger match. The three factors here
# are Science Fiction, Action and Old Movies.

# With these factors, The Last Skywalker can be represented as
last_skywalker = np.array([0.98, 0.9, -0.9])

# A user can be represented as
user1 = np.array([0.9, 0.8, -0.6])

In [14]:
# Calculate the match between this user and this movie using a dot product
(user1 * last_skywalker).sum().round(2)

2.14

In [15]:
# Casablanca can be represented as
casablanca = np.array([-0.99, -0.3, 0.8])

# and the dot product shows a poor match for this user
(user1 * casablanca).sum()

-1.611
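
The two dot products above can be compared side by side to rank the movies for this user. This is a minimal sketch that reuses the same hand-picked factor values; the `movies_lf` dictionary and the `match_score` helper are hypothetical names introduced only for illustration and are not part of the notebook.

```python
import numpy as np

# Hand-picked latent factors (science fiction, action, old movies), as above
user1 = np.array([0.9, 0.8, -0.6])
movies_lf = {  # hypothetical container, used only for this illustration
    "last_skywalker": np.array([0.98, 0.9, -0.9]),
    "casablanca": np.array([-0.99, -0.3, 0.8]),
}

def match_score(user, movie):
    """Dot product of a user vector and a movie vector."""
    return float(np.dot(user, movie))

scores = {name: match_score(user1, vec) for name, vec in movies_lf.items()}

# Sort movie names from the strongest match to the weakest
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A higher score means the movie's factors line up with the user's preferences, so sorting by score gives a simple recommendation order.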

## Creating our Data Loaders

In [18]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()

,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [19]:
# Merge the movie titles into the ratings table (joined on the shared 'movie' column)
ratings = ratings.merge(movies)
ratings.head()

,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)
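
Calling `merge` without arguments joins on the columns the two tables have in common and keeps only the rows with a match in both (an inner join). The sketch below spells this out on toy data; `ratings_toy` and `movies_toy` are hypothetical stand-ins built from a few values shown in the outputs above, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables
ratings_toy = pd.DataFrame({
    "user":   [196, 186, 63],
    "movie":  [242, 302, 242],
    "rating": [3, 3, 3],
})
movies_toy = pd.DataFrame({
    "movie": [242, 1],
    "title": ["Kolya (1996)", "Toy Story (1995)"],
})

# Explicit form of ratings_toy.merge(movies_toy):
# an inner join on the shared 'movie' column
merged = ratings_toy.merge(movies_toy, on="movie", how="inner")
print(merged)
```

In this toy example the rating for movie 302 is dropped because `movies_toy` has no title for it. Passing `how="left"` instead would keep that row, with a missing title.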


In [20]:
# Create the DataLoaders, using the title rather than the movie ID as the item
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

,user,title,rating
0,542,My Left Foot (1989),4
1,422,Event Horizon (1997),3
2,311,"African Queen, The (1951)",4
3,595,Face/Off (1997),4
4,617,Evil Dead II (1987),1
5,158,Jurassic Park (1993),5
6,836,Chasing Amy (1997),3
7,474,Emma (1996),3
8,466,Jackie Chan's First Strike (1996),3
9,554,Scream (1996),3


In [23]:
??CollabDataLoaders

In [25]:
# To work with this data in PyTorch, we represent the user and movie
# latent-factor tables as simple matrices
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

In [26]:
# Looking up an index can be expressed as a matrix product: replace the index
# with a one-hot-encoded vector, as in the following example.
one_hot_3 = one_hot(3, n_users).float()

user_factors.t() @ one_hot_3


tensor([-0.4586, -0.9915, -0.4052, -0.3621, -0.5908])
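
To see why the one-hot product is equivalent to a lookup, the sketch below compares the two directly. It uses NumPy rather than PyTorch so that it runs on its own, and the sizes and the `factors` matrix are illustrative assumptions, not the notebook's real values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users_toy, n_factors_toy = 6, 5      # illustrative sizes, not the real ones
factors = rng.standard_normal((n_users_toy, n_factors_toy))

idx = 3
one_hot_vec = np.zeros(n_users_toy)
one_hot_vec[idx] = 1.0                 # 1 at position idx, 0 elsewhere

via_matmul = factors.T @ one_hot_vec   # the matrix product picks out row idx
via_index = factors[idx]               # direct lookup of the same row

print(np.allclose(via_matmul, via_index))
```

Every row except row `idx` is multiplied by zero, so the product returns exactly that row. Direct indexing gives the same result without building the one-hot vector.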

In [28]:
??one_hot

In [29]:
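# Direct indexing returns one row of the factor matrix. This is index 5,
# so the row differs from the one-hot result for index 3 above.
user_factors[5]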

tensor([-1.2018, -1.2946, -1.8869,  1.2259,  0.2970])
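
Combining the row lookup with the dot product from earlier gives a match score for any (user, movie) pair. The following is a minimal NumPy sketch under assumed toy sizes; `user_f`, `movie_f`, `user_idx`, and `movie_idx` are hypothetical names, and the random factors are untrained, so the scores carry no meaning yet.

```python
import numpy as np

rng = np.random.default_rng(42)
user_f = rng.standard_normal((4, 5))    # 4 toy users, 5 factors each
movie_f = rng.standard_normal((3, 5))   # 3 toy movies, 5 factors each

# One (user, movie) pair per position
user_idx = np.array([0, 2, 3])
movie_idx = np.array([1, 1, 2])

# Look up the factor rows, multiply elementwise, and sum across factors:
# this is one dot product per pair
preds = (user_f[user_idx] * movie_f[movie_idx]).sum(axis=1)
print(preds.shape)
```

Each entry of `preds` is the dot product of one user's factors with one movie's factors, computed for the whole batch at once.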