<a href="https://colab.research.google.com/github/bachaudhry/FastAI-22-23/blob/main/FastAI_2022_Collaborative_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative Filtering Deep Dive

In [1]:
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

## Loading and Assessing Data

In [3]:
path = untar_data(URLs.ML_100k)

For reference, `untar_data()` stores these downloads under `~/.fastai/data` by default, which resolves to `/root/.fastai/data` on Colab.

In [6]:
# Load u.data, a tab-separated file with four columns and no header row,
# so the column names have to be supplied when creating the dataframe
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()


| | user | movie | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [8]:
# If we already knew each user's preferences and each movie's
# characteristics along the same dimensions, recommending would be a
# straightforward exercise in calculating dot products, as below.
# The movie Last Skywalker would have scores for (science-fiction, action,
# old): very sci-fi, lots of action, and not old at all, i.e.
last_skywalker = np.array([0.98, 0.9, -0.9])
# A user, on the other hand, might like sci-fi movies with lots of action
# that aren't very old, i.e.
user1 = np.array([0.9, 0.8, -0.6])
# The dot product of the two vectors gives a recommendation score of
round((user1 * last_skywalker).sum(), 3)


2.142
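
To see how the same calculation separates good matches from poor ones, here is a minimal sketch that scores `user1` against a second movie. The `old_drama` vector is a hypothetical example with made-up values; it is not taken from the dataset.

```python
import numpy as np

# Factors are ordered as (science-fiction, action, old), as in the cell above.
user1 = np.array([0.9, 0.8, -0.6])
last_skywalker = np.array([0.98, 0.9, -0.9])
# Hypothetical factors for an old, slow-paced drama (illustrative values only).
old_drama = np.array([-0.99, -0.3, 0.8])

# np.dot is equivalent to multiplying element-wise and summing.
match_score = round(float(np.dot(user1, last_skywalker)), 3)
mismatch_score = round(float(np.dot(user1, old_drama)), 3)
print(match_score, mismatch_score)
```

The second score comes out negative, because the user's preferences and the movie's factors point in opposite directions on every dimension.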

## Learning Latent Factors and Creating Data Loaders

Most of the time, we only have an incomplete matrix of user preferences for the products on our platform, because each user has rated only a small fraction of them. We also do not know the underlying factors in advance, which makes it harder to give users the right recommendations.

The preferred approach is to:
1. Randomly initialize some parameters, which serve as a set of latent factors for each user and each movie. The lesson does not go into much detail about choosing the number of these factors.
2. Calculate predictions as the dot product of each movie's factors with each user's factors. Strong matches yield higher dot products, whereas weak matches yield lower ones.
3. Finally, calculate the loss, for example the mean squared error between the predictions and the actual ratings.
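
The three steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the toy sizes, the `observed` ratings, and the choice of mean squared error as the loss are all made up for the example, and fastai's own implementation is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_movies, n_factors = 4, 5, 3  # toy sizes, chosen arbitrarily

# Step 1: randomly initialise latent factors for every user and every movie.
user_factors = rng.normal(0, 0.1, size=(n_users, n_factors))
movie_factors = rng.normal(0, 0.1, size=(n_movies, n_factors))

# Made-up observations: (user index, movie index, rating).
observed = [(0, 1, 4.0), (1, 3, 2.0), (2, 0, 5.0), (3, 4, 3.0)]
u_idx = np.array([u for u, _, _ in observed])
m_idx = np.array([m for _, m, _ in observed])
targets = np.array([r for _, _, r in observed])

# Step 2: each prediction is the dot product of a user's and a movie's factors.
preds = (user_factors[u_idx] * movie_factors[m_idx]).sum(axis=1)

# Step 3: mean squared error against the ratings that were actually observed.
loss = ((preds - targets) ** 2).mean()
print(preds.shape, loss)
```

Because the factors start out as small random numbers, the initial predictions are close to zero and the loss is large. Training would then adjust the factors to reduce it.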

In [9]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()

| | movie | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [16]:
# Merge ratings with movies on their shared 'movie' column to attach titles
ratings = ratings.merge(movies)
ratings.head()

| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |


In [18]:
ratings.shape

(100000, 5)
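
Called without arguments, `merge` performs an inner join on every column name the two tables share, which here is just `movie`. As a minimal sketch, the join can be made more explicit and checked for accidental row duplication. The `toy_ratings` and `toy_movies` frames below are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical toy tables that mirror the column names used above.
toy_ratings = pd.DataFrame({'user': [1, 2, 3],
                            'movie': [10, 10, 20],
                            'rating': [4, 5, 3]})
toy_movies = pd.DataFrame({'movie': [10, 20],
                           'title': ['Movie A', 'Movie B']})

# Name the join key explicitly. validate='many_to_one' raises an error if a
# movie id appears more than once in toy_movies, which would otherwise
# silently duplicate rating rows.
merged = toy_ratings.merge(toy_movies, on='movie', how='left',
                           validate='many_to_one')

# A left join keeps every rating row, so the row counts should match.
print(len(merged) == len(toy_ratings))
print(merged)
```

Comparing row counts before and after the merge is the same check that `ratings.shape` performs above: 100,000 ratings go in and 100,000 come out.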

In [19]:
# Build the DataLoaders, using the movie title (rather than the id) as the item
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

| | user | title | rating |
|---|---|---|---|
| 0 | 542 | My Left Foot (1989) | 4 |
| 1 | 422 | Event Horizon (1997) | 3 |
| 2 | 311 | African Queen, The (1951) | 4 |
| 3 | 595 | Face/Off (1997) | 4 |
| 4 | 617 | Evil Dead II (1987) | 1 |
| 5 | 158 | Jurassic Park (1993) | 5 |
| 6 | 836 | Chasing Amy (1997) | 3 |
| 7 | 474 | Emma (1996) | 3 |
| 8 | 466 | Jackie Chan's First Strike (1996) | 3 |
| 9 | 554 | Scream (1996) | 3 |
