# Last FM 360k Factorization Machine Implementation
This notebook implements a factorization machine using the tffm module. Concretely, we perform regression to predict the number of times a given user has played a given artist. More information on factorization machines can be found in this paper: https://www.csie.ntu.edu.tw/~b97053/paper/Factorization%20Machines%20with%20libFM.pdf.
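For orientation, a second-order factorization machine predicts with a global bias, one linear weight per feature, and a pairwise interaction term in which the weight for features i and j is the dot product of two learned latent vectors. The sketch below is a minimal NumPy illustration of that prediction rule, not tffm's internals; the function names `fm_predict` and `fm_predict_naive` and all parameter values are made up for the example.

```python
import numpy as np


def fm_predict(X, w0, w, V):
    """Second-order factorization machine prediction for each row of X.

    X:  (n_samples, n_features) dense array
    w0: scalar global bias
    w:  (n_features,) linear weights
    V:  (n_features, k) latent factors, one k-dimensional vector per feature
    """
    linear = X @ w
    # Pairwise term computed in O(k * n_features) via the identity
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
    sum_then_square = (X @ V) ** 2
    square_then_sum = (X ** 2) @ (V ** 2)
    pairwise = 0.5 * np.sum(sum_then_square - square_then_sum, axis=1)
    return w0 + linear + pairwise


def fm_predict_naive(X, w0, w, V):
    """Same model with an explicit loop over feature pairs (for checking only)."""
    n_samples, n_features = X.shape
    out = np.full(n_samples, w0, dtype=float) + X @ w
    for i in range(n_features):
        for j in range(i + 1, n_features):
            out += (V[i] @ V[j]) * X[:, i] * X[:, j]
    return out


# Random toy inputs: 4 samples, 6 features, 3 latent dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 6))
w0_demo = 0.5
w_demo = rng.normal(size=6)
V_demo = rng.normal(size=(6, 3))

fast = fm_predict(X_demo, w0_demo, w_demo, V_demo)
slow = fm_predict_naive(X_demo, w0_demo, w_demo, V_demo)
assert np.allclose(fast, slow)
```

The factorized form matters for this notebook because the one-hot encoded feature matrix has thousands of columns, and learning a separate weight for every pair of features would be infeasible.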

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

## Load Data

In [2]:
plays_df = pd.read_csv("lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv", sep="\t", names = ["userId", "artistId", "artistName", "plays"])
profile_df = pd.read_csv("lastfm-dataset-360K/usersha1-profile.tsv", sep="\t", names = ["userId", "gender", "age", "country", "signup"])

In [3]:
# Number of rows to sample from the joined data
num_rows = 5000
full_df = plays_df.join(profile_df.set_index("userId"), on="userId", how="inner").sample(n=num_rows, axis=0).drop(columns=["artistId"])
full_df

index,userId,artistName,plays,gender,age,country,signup
3144413,2df214c3ba2a0bf8ffc6d52b549707a58c3c970f,terrorvision,2,m,28.0,United Kingdom,"Sep 1, 2008"
6530843,5f7c7bf9b60ad244f97819e5214910a5e5647b79,the cure,358,f,20.0,Kazakhstan,"Feb 24, 2008"
9107567,85137d5069c180912cb52e158c56f7ed252b5a8e,big bang,278,f,22.0,United States,"Feb 28, 2007"
12588702,b80285fe25904ea70d232648b3160bbad25d231a,wondermints,136,m,33.0,United States,"Mar 9, 2005"
4412387,406f7de346d6dd251a7127e6358727af0cfd8075,less than jake,163,m,21.0,Germany,"Jun 14, 2007"
9077779,84a1e3df7545a5cc85ce0d74eed357063f6ce509,otros aires,146,f,20.0,Brazil,"Mar 25, 2008"
10738985,9cdb2fe91882e5c476cbd59ef175a8d2764d50e5,johann sebastian bach,2,m,36.0,United Kingdom,"Sep 6, 2008"
6822027,63b24bc1e8e4e4bf1b1eb59202e05f29c40dd2fb,tool,2121,m,29.0,United States,"Sep 15, 2007"
5379666,4ec3b47d7edac4b963a938bfc14927830c6ccf1d,the clash,158,m,30.0,Norway,"Sep 15, 2005"
14650853,d5e274d4f21c4145fea103ab63ab2174946a2af8,angra,25,f,19.0,Croatia,"Aug 17, 2008"


## Transform the dataset to have one-hot encodings for categorical variables

In [4]:
from dateutil import parser

# Drop rows with missing values so that they do not bias the training data
full_df = full_df.dropna()

# Convert each signup date string to a numeric Unix timestamp
full_df["signup"] = full_df["signup"].map(lambda date_str: parser.parse(date_str).timestamp())

# Scale the plays, signup timestamps, and age by their respective means
full_df["signup"] = full_df["signup"] / full_df["signup"].mean()
full_df["plays"] = full_df["plays"] / full_df["plays"].mean()
full_df["age"] = full_df["age"] / full_df["age"].mean()

# Keep the regression target (plays) separately, then one-hot encode the categorical features
truth_df = full_df.drop(["userId", "artistName", "gender", "age", "country", "signup"], axis=1)
full_df = pd.get_dummies(full_df).drop("plays", axis=1)
full_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
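
The message above is pandas' SettingWithCopyWarning: pandas cannot tell whether a column assignment is writing to an independent frame or to a slice of another one. A common way to avoid it is to take an explicit copy before assigning columns. The sketch below assumes the warning here comes from assigning to the frame returned by `dropna()`; the toy values and the names `toy_df` and `clean_df` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the sampled frame; the values are illustrative only.
toy_df = pd.DataFrame({
    "plays": [2.0, 358.0, None, 136.0],
    "age": [28.0, 20.0, 22.0, None],
})

# .copy() makes the filtered frame an independent object, so the column
# assignment below is unambiguous and leaves toy_df untouched.
clean_df = toy_df.dropna().copy()
clean_df["plays"] = clean_df["plays"] / clean_df["plays"].mean()
```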

index,age,signup,userId_0003906ab668111f2cd332962cb09f8e3b795c6c,userId_000bce5b008caef9cce3f2b981ec71ef20a5926e,userId_0019c9f8e0ba0ddba802878b93ba36905c5aaf23,userId_001e2ade35f2476b47c15cce7bcb39dafa89b97a,userId_00225cadbc42648b61dd73ddb571e3995b2d462e,userId_0027bdefcde28fe58e126906c08c58d843b77c69,userId_0027e5c1f880965e1aabd56609fd88c21f326e6e,userId_0057ec15ebf78f74e84da3f2caebbbbb5bfd5c8b,...,country_Ukraine,country_United Arab Emirates,country_United Kingdom,country_United States,country_United States Minor Outlying Islands,country_Venezuela,country_Viet Nam,"country_Virgin Islands, U.s.",country_Yemen,country_Zimbabwe
3144413,1.121212,1.030367,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6530843,0.800866,1.016508,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9107567,0.880952,0.990171,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
12588702,1.321428,0.937570,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4412387,0.840909,0.997901,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9077779,0.800866,1.018694,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10738985,1.441558,1.030731,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6822027,1.161255,1.004686,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5379666,1.201298,0.951428,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14650853,0.760822,1.029272,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
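
The wide frame above is what `pd.get_dummies` produces: numeric columns such as `age` and `signup` pass through unchanged, while each string column (`userId`, `artistName`, `gender`, `country`) is expanded into one indicator column per distinct value. A minimal sketch on a made-up three-row frame (the name `toy_df` and its values are illustrative) shows the effect:

```python
import pandas as pd

# Tiny illustrative frame with one numeric and two categorical columns.
toy_df = pd.DataFrame({
    "age": [1.1, 0.8, 0.9],
    "gender": ["m", "f", "f"],
    "country": ["United Kingdom", "Kazakhstan", "United States"],
})

# String columns become one indicator column per distinct value;
# the numeric column is left as-is.
encoded = pd.get_dummies(toy_df)
print(list(encoded.columns))
```

Because every distinct user and artist gets its own column, the number of features grows with the size of the sample, which is why the feature matrix is converted to a sparse matrix before training.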


## Apply the Factorization Machine model to our dataset

In [5]:
from sklearn.model_selection import train_test_split
from scipy import sparse

# Rows with missing values were already dropped above, so nan_to_num is only a safeguard here
X = np.nan_to_num(full_df.values)
y = np.squeeze(np.array(truth_df))

X_tr, X_te, y_tr, y_te = train_test_split(sparse.csr_matrix(X), y, test_size=0.2)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)

(2960, 6280) (2960,) (741, 6280) (741,)
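
The shapes above add up to 3,701 rows, so roughly a quarter of the 5,000 sampled rows were dropped for missing values. Note also that `train_test_split` is called without a seed, so each run of the notebook produces a different split and different error values. A minimal sketch of a reproducible split on toy data follows; the seed value 42 and the names `X_toy` and `y_toy` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Toy sparse feature matrix and target vector.
X_toy = sparse.csr_matrix(np.eye(10))
y_toy = np.arange(10, dtype=float)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr_a, X_te_a, y_tr_a, y_te_a = split_a
assert np.array_equal(y_te_a, split_b[3])
print(X_tr_a.shape, X_te_a.shape)
```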


This part requires TensorFlow and the tffm module. The code uses `tf.train.AdamOptimizer`, which is part of the TensorFlow 1.x API. More information about tffm is available at https://github.com/geffy/tffm.

In [6]:
import tensorflow as tf
from tffm import TFFMRegressor
from sklearn.metrics import mean_squared_error

learning_rates = [0.1, 0.01, 0.001]
epochs = [100, 500, 1000, 5000]

for lr in learning_rates:
    for epoch in epochs:
        # Create the factorization machine model
        model = TFFMRegressor(
            optimizer=tf.train.AdamOptimizer(learning_rate=lr),
            n_epochs=epoch,
            input_type='sparse'
        )

        # Fit the model on the training set
        model.fit(X_tr, y_tr, show_progress=True)

        # Predict on both splits to compare training and test error
        predictions_tr = model.predict(X_tr)
        predictions_te = model.predict(X_te)

        print(f"-------Learning Rate:{lr}, Num Epochs: {epoch} ----------")
        print(f"MSE Train Set: {mean_squared_error(y_tr, predictions_tr)}")
        print(f"MSE Test Set: {mean_squared_error(y_te, predictions_te)}")

        # Close the TensorFlow session and free its resources before the next run
        model.destroy()


100%|████████████████████████████████████| 100/100 [00:00<00:00, 207.46epoch/s]


-------Learning Rate:0.1, Num Epochs: 100 ----------
MSE Train Set: 5.7410993412200585e-05
MSE Test Set: 5.815242605050522


100%|████████████████████████████████████| 500/500 [00:02<00:00, 240.32epoch/s]


-------Learning Rate:0.1, Num Epochs: 500 ----------
MSE Train Set: 1.185527245125794e-07
MSE Test Set: 5.458953708617969


100%|██████████████████████████████████| 1000/1000 [00:04<00:00, 243.98epoch/s]


-------Learning Rate:0.1, Num Epochs: 1000 ----------
MSE Train Set: 1.0165728555340257e-06
MSE Test Set: 5.536753188655327


100%|██████████████████████████████████| 5000/5000 [00:20<00:00, 248.53epoch/s]


-------Learning Rate:0.1, Num Epochs: 5000 ----------
MSE Train Set: 3.789229082764683e-13
MSE Test Set: 4.516978873588441


100%|████████████████████████████████████| 100/100 [00:00<00:00, 207.64epoch/s]


-------Learning Rate:0.01, Num Epochs: 100 ----------
MSE Train Set: 0.3209802292220321
MSE Test Set: 12.290804193942098


100%|████████████████████████████████████| 500/500 [00:02<00:00, 241.85epoch/s]


-------Learning Rate:0.01, Num Epochs: 500 ----------
MSE Train Set: 8.766452952508507e-11
MSE Test Set: 19.422737639329952


100%|██████████████████████████████████| 1000/1000 [00:04<00:00, 246.40epoch/s]


-------Learning Rate:0.01, Num Epochs: 1000 ----------
MSE Train Set: 9.945945616005649e-08
MSE Test Set: 18.54961851428771


100%|██████████████████████████████████| 5000/5000 [00:20<00:00, 244.56epoch/s]


-------Learning Rate:0.01, Num Epochs: 5000 ----------
MSE Train Set: 1.8737498738639386e-08
MSE Test Set: 18.09974524439696


100%|████████████████████████████████████| 100/100 [00:00<00:00, 206.21epoch/s]


-------Learning Rate:0.001, Num Epochs: 100 ----------
MSE Train Set: 3.8938875333261276
MSE Test Set: 4.2071943971250505


100%|████████████████████████████████████| 500/500 [00:02<00:00, 233.28epoch/s]


-------Learning Rate:0.001, Num Epochs: 500 ----------
MSE Train Set: 1.1929895111559017
MSE Test Set: 6.759159642951926


100%|██████████████████████████████████| 1000/1000 [00:04<00:00, 243.21epoch/s]


-------Learning Rate:0.001, Num Epochs: 1000 ----------
MSE Train Set: 0.19448551060927202
MSE Test Set: 14.089031346868886


100%|██████████████████████████████████| 5000/5000 [00:20<00:00, 245.39epoch/s]


-------Learning Rate:0.001, Num Epochs: 5000 ----------
MSE Train Set: 3.1176622776227896e-10
MSE Test Set: 20.336244397199508
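
In most runs above the training MSE is close to zero while the test MSE is orders of magnitude larger, which suggests that the model largely memorizes the training rows. To compare configurations at a glance, the test errors can be collected into a small lookup table. The sketch below copies the test MSE values from the output above; the names `test_mse` and `best_config` are illustrative.

```python
# Test-set MSE copied from the output above, keyed by (learning rate, epochs).
test_mse = {
    (0.1, 100): 5.815242605050522,
    (0.1, 500): 5.458953708617969,
    (0.1, 1000): 5.536753188655327,
    (0.1, 5000): 4.516978873588441,
    (0.01, 100): 12.290804193942098,
    (0.01, 500): 19.422737639329952,
    (0.01, 1000): 18.54961851428771,
    (0.01, 5000): 18.09974524439696,
    (0.001, 100): 4.2071943971250505,
    (0.001, 500): 6.759159642951926,
    (0.001, 1000): 14.089031346868886,
    (0.001, 5000): 20.336244397199508,
}

# Configuration with the lowest test error in this particular run.
best_config = min(test_mse, key=test_mse.get)
print(best_config, test_mse[best_config])
```

Two caveats apply. Choosing hyperparameters by test error makes the reported error optimistic, so a separate validation split would be a safer basis for selection. The numbers also come from a single unseeded split of a small sample, so some of the differences between configurations may be noise.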
