# MovieLens-1M Example

This notebook demonstrates how `lynx` can be used to run experiments on
real-world datasets such as MovieLens-1M, for which `lynx` also provides a
dataset loader.

This notebook is for illustrative purposes only. It uses a subset of
MovieLens-1M so that the experiments at the end of the notebook run quickly.
To run them on the full dataset, remove the `nrows` parameter from the data
loader calls or set it to `None`. On the full dataset, the runtime improvement
from exploiting the relational structure of the data becomes apparent.

The table below compares dense and block-structure (BS) `libFM` MCMC
regression, each run for 50 iterations with a `k` of 8 on the full
MovieLens-1M dataset. The experiments were run on a 2023 MacBook Pro with an
M2 Max chip and 32 GB of RAM.

| | Runtime (s) | RMSE | 
| --- | --- | --- |
| Dense libFM | 797.43 | 0.87 |
| BS libFM | 9.33 | 0.85 |

Yes, that is a *huge* speedup (roughly 85x) while retaining a similar RMSE!
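
To see where a saving of this kind can come from, here is a minimal sketch in plain Python. It is not `lynx` or `libFM` code: the toy `user_block` and the `rating_to_user` index are hypothetical. It compares how many cells a dense design matrix stores with how many a block representation would store if it kept each unique user row once plus one index per rating.

```python
# Toy users block: one row of (hypothetical) encoded attributes per user.
user_block = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

# Hypothetical index: the user row that each of 8 ratings refers to.
rating_to_user = [0, 0, 1, 2, 2, 2, 0, 1]

# Dense structure: the user's attribute row is copied into every rating row.
dense = [user_block[u] for u in rating_to_user]
dense_cells = sum(len(row) for row in dense)

# Block structure: store the block once, plus one index per rating.
block_cells = sum(len(row) for row in user_block) + len(rating_to_user)

print(dense_cells, block_cells)
```

In this toy model the dense count grows with the number of ratings times the number of attribute columns, while the block count grows only with the number of ratings, so the gap widens as users accumulate more ratings.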

In [1]:
import lynx as lx
from lynx.datasets import movielens

dataset_path = "~/Downloads/lynx/datasets/movielens/ml-1m"

In [2]:
# Change to `True` to download the dataset on the first run.
if False:
    movielens.download(destination="~/Downloads/lynx/datasets/movielens")

In [3]:
users = movielens.load_users(
    dataset_path,
    usecols=["user_id", "gender", "age", "occupation"],
    nrows=1000
)
users.head()

Unnamed: 0,user_id,gender,age,occupation
0,1,F,1,10
1,2,M,56,16
2,3,M,25,15
3,4,M,45,7
4,5,M,25,20


In [4]:
users_table = lx.Table(users, "users")

users_table = (
    users_table
    .onehot("age")
    .onehot("gender")
    .onehot("occupation")
)
users_table.to_dataframe().head()

Unnamed: 0,user_id,0,1,2,3,4,5,6,0.1,1.1,...,11,12,13,14,15,16,17,18,19,20
0,1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0
2,3,0,0,1,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
3,4,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1


In [5]:
movies = movielens.load_movies(
    dataset_path,
    usecols=["movie_id", "genres"],
    nrows=1000
)
movies["genres"] = movies["genres"].str.split("|")
movies.head()

Unnamed: 0,movie_id,genres
0,1,"[Animation, Children's, Comedy]"
1,2,"[Adventure, Children's, Fantasy]"
2,3,"[Comedy, Romance]"
3,4,"[Comedy, Drama]"
4,5,[Comedy]


In [6]:
movies_table = lx.Table(movies, "movies")

movies_table = movies_table.manyhot("genres")
movies_table.to_dataframe().head()

Unnamed: 0,movie_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,1,0.0,0.0,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
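
The values in the output above suggest that each movie's genre weights sum to 1, so a movie with three genres gets 1/3 in each of their columns. A minimal plain-Python sketch of such a normalized many-hot encoding follows. It is an illustration only, not the `lynx` implementation: the helper `manyhot_row` is hypothetical, and the alphabetical column order is an assumption that happens to match the output shown.

```python
# Genres of the first five movies, as listed by `movies.head()`.
movies = {
    1: ["Animation", "Children's", "Comedy"],
    2: ["Adventure", "Children's", "Fantasy"],
    3: ["Comedy", "Romance"],
    4: ["Comedy", "Drama"],
    5: ["Comedy"],
}

# Assumed column order: genres sorted alphabetically.
vocabulary = sorted({g for genres in movies.values() for g in genres})
column = {genre: i for i, genre in enumerate(vocabulary)}


def manyhot_row(genres):
    """Return a row whose active columns share a total weight of 1."""
    row = [0.0] * len(vocabulary)
    for genre in genres:
        row[column[genre]] = 1.0 / len(genres)
    return row


encoded = {movie_id: manyhot_row(genres) for movie_id, genres in movies.items()}
for movie_id, row in encoded.items():
    print(movie_id, [round(value, 3) for value in row])
```

A movie with a single genre, such as movie 5, reduces to an ordinary one-hot row.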


In [7]:
ratings = movielens.load_ratings(
    dataset_path,
    usecols=["user_id", "movie_id", "rating"],
    nrows=10000
)
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [8]:
ratings_table = lx.Table(ratings, "ratings")
ratings_table.to_dataframe().head()

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [9]:
merged_table = (
    ratings_table
    .merge(movies_table, left_on="movie_id", right_on="movie_id")
    .merge(users_table, left_on="user_id", right_on="user_id")
    .model_interactions("user_id", "movie_id")
    .onehot("movie_id")
    .onehot("user_id")
)

print(merged_table.shape)
print(merged_table.block_shapes)
# Order looks different because some ratings were removed during the inner join
# with `movies` and `users`.
merged_table.to_dataframe().head()

(2544, 1227)
{'ratings': (10000, 1), 'genres_manyhot': (154, 18), 'age_onehot': (7, 7), 'gender_onehot': (2, 2), 'occupation_onehot': (21, 21), 'user_id_movie_id_interactions': (70, 554), 'movie_id_onehot': (554, 554), 'user_id_onehot': (70, 70)}


Unnamed: 0,rating,0,1,2,3,4,5,6,7,8,...,60,61,62,63,64,65,66,67,68,69
0,3,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,4,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,4,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.25,0.0,...,0,0,0,0,0,0,0,0,0,0
4,5,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
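
The merged table has far fewer rows than the 10,000 ratings that were loaded because only a subset of users and movies was loaded, and an inner join keeps a rating only when both of its keys are present. A minimal plain-Python sketch of that filtering follows. The id sets are hypothetical stand-ins for the ids in the truncated tables (real MovieLens ids are not guaranteed to be contiguous), and this is not how `lynx` implements `merge`.

```python
# First five ratings from `ratings.head()`: (user_id, movie_id, rating).
ratings = [
    (1, 1193, 5),
    (1, 661, 3),
    (1, 914, 3),
    (1, 3408, 4),
    (1, 2355, 5),
]

# Hypothetical stand-ins for the ids present in the truncated tables.
loaded_user_ids = set(range(1, 1001))
loaded_movie_ids = set(range(1, 1001))

# An inner join keeps a rating only if both keys have a matching row.
kept = [
    (user_id, movie_id, rating)
    for user_id, movie_id, rating in ratings
    if user_id in loaded_user_ids and movie_id in loaded_movie_ids
]

print(kept)
```

Under these assumptions only the ratings for movies 661 and 914 survive, which is consistent with the merged table starting with two ratings of 3.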


# `libFM` Experiments

The following experiments show how `lynx` exposes the tasks and learning
methods provided by `libFM` through a user-friendly API.

All of the experiments below are regression tasks, but `lynx` also provides an
`FMClassification` class for each learning method.

For accurate timing, do not time `fit()`. Instead, call `write()` first and
time only `train()`, as done in the `experiments` scripts, e.g.

```python
import time

fm = ...FMRegression()
fm.write(...)
start_time = time.perf_counter()
fm.train(...)
end_time = time.perf_counter()

train_time = end_time - start_time  # elapsed seconds
```

To set up the experiments, we first split the dataset into training and test
sets.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

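def rmse(predictions, targets):
    # Take the square root explicitly: the `squared` argument was deprecated
    # in scikit-learn 1.4 and later removed.
    return mean_squared_error(targets, predictions) ** 0.5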

seed = 0

y = merged_table.pop("rating")
X_train, X_test, y_train, y_test = train_test_split(
    merged_table, y,
    train_size=0.8,
    test_size=0.2,
    random_state=seed
)

## Dense Structure (X)

Markov Chain Monte Carlo (MCMC) Regression

In [11]:
from lynx.libfm import mcmc

fm = mcmc.FMRegression(seed=seed)
pred = fm.fit_predict(X_train, y_train, X_test)
fm.flush()

rmse(pred, y_test)

0.9146968351235664

Alternating Least Squares (ALS) Regression

In [12]:
from lynx.libfm import als

fm = als.FMRegression(regularizations=(0,0,10), seed=seed)
fm.fit(X_train, y_train)
pred = fm.predict(X_test)
fm.flush()

rmse(pred, y_test)

1.0202029856645878

Stochastic Gradient Descent (SGD) Regression

In [13]:
from lynx.libfm import sgd

fm = sgd.FMRegression(learn_rate=0.001, seed=seed)
fm.fit(X_train, y_train)
pred = fm.predict(X_test)
fm.flush()

rmse(pred, y_test)

0.9764159338677373

Adaptive Stochastic Gradient Descent (SGDA) Regression

In [14]:
from lynx.libfm import sgda

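# `fit_validation()` takes a validation set, so the held-out data is split in
# half. The RMSE below is therefore computed on a smaller test set than in
# the other experiments.
X_val, X_sgda_test, y_val, y_sgda_test = train_test_split(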
    X_test, y_test,
    train_size=0.5,
    test_size=0.5,
    random_state=seed
)

fm = sgda.FMRegression(learn_rate=0.001, seed=seed)
fm.fit_validation(X_train, y_train, X_val, y_val)
pred = fm.predict(X_sgda_test)
fm.flush()

rmse(pred, y_sgda_test)

0.970302366639935

## Block Structure (BS)

BS MCMC Regression

In [15]:
from lynx.libfm.bs import mcmc as mcmc_bs

fm = mcmc_bs.FMRegression(seed=seed)
pred = fm.fit_predict(X_train, y_train, X_test)
fm.flush()

rmse(pred, y_test)

0.9202588476668059

BS ALS Regression

In [16]:
from lynx.libfm.bs import als as als_bs

fm = als_bs.FMRegression(seed=seed)
fm.fit(X_train, y_train)
pred = fm.predict(X_test)
fm.flush()

rmse(pred, y_test)

1.4089717533840018