## Building a Recommender System with a Factorization Machine

### Reasons to use Factorization Machine

* While matrix factorization algorithms such as SVD++ only deal with user-item interactions, a Factorization Machine (FM) can also use extra data such as user profiles. This dataset includes user profile data: gender, age, and country.
* FM is a more generic approach: it handles classification and regression under the same framework.


### Some issues in the prototype

+ Since the matrix is extremely large, memory is an issue. Because of limited computational resources, I only select the 200 female users with the most artist records.
+ Training time can also be an issue. Since the number of users and items can be very large, FM may take a long time to train on the full dataset, even though FM is itself a highly efficient algorithm.

### Background on Factorization Machine

A Factorization Machine achieves high efficiency by decomposing $w_{ij}$ (the weight of the interaction between features $i$ and $j$) into the dot product $\langle v_i, v_j \rangle$ of two $k$-dimensional latent vectors. This reduces the number of interaction parameters from $O(n^2)$ to $O(kn)$, where $k$ is the rank. The approach also addresses sparsity: few samples contain both feature $i$ and feature $j$, so estimating $w_{ij}$ directly is hard, whereas estimating $v_i$ is much easier because many samples involve feature $i$.
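
As a rough illustration of the factorized interaction term, the sketch below computes the second-order FM sum in two ways: a naive loop over all feature pairs, and the standard reformulation that needs only $O(kn)$ operations. The function names and the toy data are assumptions made for this example; they are not part of tffm.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, computed pair by pair."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    xv = V.T @ x                    # shape (k,)
    x2v2 = (V ** 2).T @ (x ** 2)    # shape (k,)
    return 0.5 * float(np.sum(xv ** 2 - x2v2))

# Toy example: 6 features, rank 3, a sparse binary input vector.
rng = np.random.default_rng(0)
n_features, rank = 6, 3
V = rng.normal(size=(n_features, rank))
x = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
```

Both functions return the same value up to floating-point error; only the second scales linearly with the number of features.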


### About the input

The input to the model is a sparse matrix consisting of five kinds of features: user, artist, user history, gender, and country. I removed the age feature due to limited computational resources. User, artist, gender, and country are one-hot encoded, and user history is many-hot because each user likes more than one artist: an entry of the user history vector is set to one if the user has listened to that artist before. Each record is the concatenation of these sparse vectors, and the full input is a sparse matrix made up of these records.
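
The record layout described above can be sketched with `scipy.sparse` as follows. This is a minimal illustration, not the code used later in the notebook: the helper `encode_record` and its arguments are hypothetical, and it covers only the user, artist, gender, and history blocks (any further one-hot block would be stacked the same way).

```python
import numpy as np
import scipy.sparse as sp

def one_hot(index, size):
    """1 x size sparse row with a single 1 at position `index`."""
    return sp.csr_matrix(([1.0], ([0], [index])), shape=(1, size))

def encode_record(user, artist, gender, history, n_users, n_artists, n_genders):
    """Concatenate one-hot user/artist/gender blocks with a many-hot history block."""
    hist = np.array(sorted(set(history)), dtype=int)
    history_vec = sp.csr_matrix(
        (np.ones(len(hist)), (np.zeros(len(hist), dtype=int), hist)),
        shape=(1, n_artists),
    )
    blocks = [
        one_hot(user, n_users),
        one_hot(artist, n_artists),
        one_hot(gender, n_genders),
        history_vec,
    ]
    return sp.hstack(blocks, format="csr")

# User 1 listening to artist 2, gender code 0, having heard artists 0, 2 and 3.
row = encode_record(user=1, artist=2, gender=0, history=[0, 2, 3],
                    n_users=3, n_artists=4, n_genders=2)
dense = row.toarray().ravel()
```

Stacking such rows with `sp.vstack` would give the full training matrix without ever materializing the dense form.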


### Preprocessing

The target (play count) is log-transformed because its range is very wide. The absolute scale does not matter in this problem, since the predicted scores are only used to rank artists.
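
A small sketch of the transform, using made-up play counts: `np.log1p` compresses the range while keeping the ordering, and `np.expm1` inverts it if raw counts are needed again.

```python
import numpy as np

# Hypothetical play counts spanning several orders of magnitude.
plays = np.array([0, 1, 10, 1000, 100000])

log_plays = np.log1p(plays)      # log(1 + plays), well defined at 0
recovered = np.expm1(log_plays)  # inverse transform back to raw counts
```

Because the transform is strictly increasing, ranking artists by the log-scale score gives the same order as ranking by raw play counts.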


### Optimization

The hyper-parameters (rank, learning rate, and regularization strength) are optimized by Bayesian optimization.


### Recommendation

The recommender system predicts, for each user, the score he or she would give to each artist. It returns the artist with the highest predicted score for each user.
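
The final selection step amounts to an argmax over the predicted scores. The sketch below generalizes it slightly to a top-k list; `top_k_artists`, the scores, and the artist names are all hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_artists(scores, artist_names, k=1):
    """Return the k artist names with the highest predicted scores, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [artist_names[i] for i in order]

# Made-up predicted scores for four artists and one user.
scores = np.array([0.3, 2.1, 1.7, 0.9])
names = ["artist_a", "artist_b", "artist_c", "artist_d"]

best = top_k_artists(scores, names, k=1)
top2 = top_k_artists(scores, names, k=2)
```

With `k=1` this matches the behaviour described above, where a single artist is returned per user.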

## <span style="color:blue">Remark: tffm can only be run in tensorflow==1.3</span>

In [4]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
import gc

def data_preprocessing():

    data = pd.read_csv('data/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv', sep = '\t', header = None)
    data.columns = ['user_id', 'artist_id', 'artist_name', 'plays']

    profile = pd.read_csv('data/lastfm-dataset-360K/usersha1-profile.tsv', sep = '\t', header = None)
    profile.columns = ['user_id', 'gender', 'age', 'country', 'signup_date']


    female_user = profile[profile['gender'] == 'f']['user_id']

    data = data[data['user_id'].isin(female_user)]

    selected_user = data['user_id'].value_counts().iloc[:200].index

    data = data[data['user_id'].isin(selected_user)]

    # Join play records with profiles on user_id (left_on/right_on plus left_index=True conflict)
    data = pd.merge(data, profile, how = 'inner', on = 'user_id')


    data['log_plays'] = np.log1p(data['plays'])

    plays = data['plays']
    data.drop('plays', inplace = True, axis = 1)

    data.dropna(subset = ['artist_name'], inplace = True)

    user_encoder = preprocessing.LabelEncoder()
    artist_encoder = preprocessing.LabelEncoder()

    data['encoded_user'] = user_encoder.fit_transform(data['user_id'])
    data['encoded_artist'] = artist_encoder.fit_transform(data['artist_name'])

    y = data['log_plays']
    data = data[['encoded_user', 'encoded_artist', 'gender', 'country']]

    data.reset_index(inplace = True)


    gc.collect()

    X_train, X_test, y_train, y_test = train_test_split(data.index, y, random_state=42, test_size=0.3)

    history = data.groupby('encoded_user')['encoded_artist'].apply(list)

    X = pd.get_dummies(data, columns = ['encoded_user', 'encoded_artist', 'gender', 'country'], sparse = True)
    X.drop(['index'], inplace = True, axis = 1)

    history_record = pd.DataFrame(0, index = range(200), columns = ['history_' + str(artist) for artist in data['encoded_artist'].unique()])

    for i, records in enumerate(history.values):
        records_col = ['history_' + str(each) for each in records]
        # Single .loc call: chained indexing may assign to a temporary copy
        history_record.loc[i, records_col] = 1


    history_to_be_added = history_record.loc[data['encoded_user']]

    history_to_be_added = history_to_be_added.to_sparse()

    gc.collect()

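    # drop=True keeps the old index (the encoded user id) from leaking into X as an 'index' column
    history_to_be_added.reset_index(drop=True, inplace=True)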

    for col in history_to_be_added.columns:
        X[col] = history_to_be_added[col]
        
        
    return X, y, X_train, X_test, y_train, y_test, user_encoder, artist_encoder, selected_user, profile, history

In [14]:
X, y, X_train, X_test, y_train, y_test, user_encoder, artist_encoder, selected_user, profile, history = data_preprocessing()

In [15]:
import gc

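# X is still needed for the sparse conversion in the next cell, and `data`
# only exists inside data_preprocessing(), so only `profile` is released here.
del profile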
gc.collect()

49

In [33]:
import scipy.sparse as sp

X_tr_sparse = sp.csr_matrix(X.loc[X_train])
X_te_sparse = sp.csr_matrix(X.loc[X_test])

In [37]:
from sklearn.metrics import mean_squared_error
from math import sqrt
from tffm import TFFMRegressor
import tensorflow as tf

def tffm_regressor_score(rank, learning_rate, reg):
    rank = int(rank)
    
    order = 2
    model = TFFMRegressor(
        order=order, 
        rank=rank, 
        optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate), 
        n_epochs=30, 
        batch_size=64,
        init_std=0.1,
        reg=reg,
        input_type='sparse'
    )
    model.fit(X_tr_sparse, np.array(y_train), show_progress=True)
    predictions = model.predict(X_te_sparse)
    
    # BayesianOptimization maximizes its objective, so return the negative RMSE.
    return -sqrt(mean_squared_error(np.array(y_test), predictions))

In [38]:
from bayes_opt import BayesianOptimization

FM_BO = BayesianOptimization(tffm_regressor_score, 
                          {'rank' :(3, 10), 
                           'learning_rate' :(0.0005, 0.01),
                           'reg': (0.005, 0.1)}
                         )

In [39]:
FM_BO.maximize(n_iter = 20, init_points=10)

Initialization
-----------------------------------------------------------------------
 Step |   Time |      Value |   learning_rate |      rank |       reg | 
    1 | 01m19s |   -1.22981 |          0.0032 |    5.8588 |    0.0546 | 
    2 | 01m30s |   -2.65849 |          0.0051 |    9.8469 |    0.0314 | 
    3 | 01m29s |  -42.63065 |          0.0048 |    9.7880 |    0.0430 | 
    4 | 01m18s |   -2.02902 |          0.0097 |    5.9987 |    0.0712 | 
    5 | 01m19s |   -1.15964 |          0.0006 |    5.8242 |    0.0920 | 
    6 | 01m18s |   -1.24147 |          0.0040 |    5.0260 |    0.0176 | 
    7 | 01m24s |   -2.11322 |          0.0050 |    7.5138 |    0.0686 | 
    8 | 01m17s |   -1.46208 |          0.0039 |    3.8986 |    0.0379 | 
    9 | 01m18s |   -1.17107 |          0.0011 |    4.3645 |    0.0591 | 
   10 | 01m17s |   -1.34251 |          0.0039 |    4.8152 |    0.0661 | 
Bayesian Optimization
-----------------------------------------------------------------------
 Step |   Time |      Value |   learning_rate |      rank |       reg | 
   11 | 01m18s |   -2.06113 |          0.0100 |    3.0000 |    0.1000 | 
   12 | 01m27s |   -2.13338 |          0.0044 |    8.5492 |    0.0955 | 
   13 | 01m23s |   -4.29490 |          0.0100 |    6.9858 |    0.0050 | 
   14 | 01m28s |   -4.36408 |          0.0082 |    8.0704 |    0.0280 | 
   15 | 01m32s |   -7.88231 |          0.0100 |   10.0000 |    0.0050 | 
   16 | 01m30s |  -10.60281 |          0.0090 |    9.2724 |    0.0251 | 
   17 | 01m19s |   -1.76239 |          0.0092 |    4.7670 |    0.0699 | 
   18 | 01m23s |   -2.37699 |          0.0072 |    6.4820 |    0.0976 | 
   19 | 01m17s |   -2.70548 |          0.0100 |    3.4518 |    0.0050 | 
   20 | 01m19s |   -1.43841 |          0.0025 |    5.4307 |    0.0052 | 
   21 | 01m17s |   -1.94782 |          0.0091 |    3.2284 |    0.0127 | 
   22 | 01m28s |   -2.74664 |          0.0078 |    8.8960 |    0.0053 | 
   23 | 01m25s |   -3.25478 |          0.0064 |    7.7930 |    0.0996 | 
   24 | 01m20s |   -1.65622 |          0.0070 |    5.9263 |    0.0546 | 
   25 | 01m27s |   -1.17053 |          0.0005 |    7.2448 |    0.1000 | 
   26 | 01m24s |   -1.21448 |          0.0017 |    6.7377 |    0.0991 | 
   27 | 01m23s |   -1.36145 |          0.0037 |    6.2435 |    0.0055 | 
   28 | 01m20s |   -1.90252 |          0.0083 |    4.1320 |    0.0999 | 
   29 | 01m28s |   -1.67510 |          0.0086 |    8.3220 |    0.0078 | 
   30 | 01m32s |   -1.33205 |          0.0065 |    9.5170 |    0.1000 | 


In [40]:
FM_BO.res['max']

{'max_val': -1.1596375231638125,
 'max_params': {'rank': 5.8242099837337813,
  'learning_rate': 0.00064084533179448405,
  'reg': 0.092006569503125024}}

In [41]:
import json
with open('optimized_params.json', 'w') as outfile:
    json.dump(FM_BO.res['max'], outfile)

## Recommender System

### You can start running from here

In [5]:
X, y, X_train, X_test, y_train, y_test, user_encoder, artist_encoder, selected_user, profile, history = data_preprocessing()

In [6]:
import json

with open('optimized_params.json') as f:
    params = json.load(f)

In [7]:
from sklearn.metrics import mean_squared_error
from math import sqrt
from tffm import TFFMRegressor
import tensorflow as tf
from bayes_opt import BayesianOptimization
import scipy.sparse as sp

class recommender_system(object):
    
    def __init__(self):

        self.X = X
        self.y = y
        self.artist_list = [each for each in self.X.columns if 'artist' in each]
        self.user_encoder = user_encoder
        self.artist_encoder = artist_encoder
        self.selected_user = profile[(profile['gender'] == 'f') & 
                                     (profile['user_id'].isin(selected_user))]

        self.X_sparse = sp.csr_matrix(X)
        
        
    def train(self, rank, learning_rate, reg):

        self.rank = int(rank)
        self.learning_rate = learning_rate
        self.reg = reg

        order = 2
        self.model = TFFMRegressor(
            order=order, 
            rank=self.rank, 
            optimizer=tf.train.AdamOptimizer(learning_rate=self.learning_rate), 
            n_epochs=30, 
            batch_size=64,
            init_std=0.1,
            reg=self.reg,
            input_type='sparse'
        )

        self.model.fit(self.X_sparse, np.array(self.y), show_progress=True)
        
    def recommend(self):
        
        total_artist = len(self.artist_list)
        self.recommendation = {}
        for user in self.selected_user.values:
            user_id = self.user_encoder.transform([user[0]])[0]
            country = user[3]
            prediction_matrix = pd.DataFrame(0, index = range(total_artist), columns = self.X.columns)

            for i in range(total_artist):
                # Single .loc call: chained indexing may assign to a temporary copy
                prediction_matrix.loc[i, 'encoded_artist_' + str(i)] = 1

            prediction_matrix['encoded_user_' + str(user_id)] = 1
            prediction_matrix['gender_f'] = 1
            prediction_matrix['country_' + country] = 1

            for artist in history[user_id]:
                prediction_matrix['history_' + str(artist)] = 1



            prediction = self.model.predict(sp.csr_matrix(prediction_matrix))
            # inverse_transform expects an array-like, so wrap the argmax and unwrap the result
            self.recommendation[user[0]] = self.artist_encoder.inverse_transform([np.argmax(prediction)])[0]
            del prediction, prediction_matrix
            gc.collect()
            
        return self.recommendation
                

In [8]:
system = recommender_system()

In [9]:
system.train(**params['max_params'])

100%|███████████████████████████████████████████████████████████████████████████████| 30/30 [01:43<00:00,  3.44s/epoch]


In [10]:
recommendation = system.recommend()

In [11]:
recommendation

{'00b853397a79019699fba7ceb1dbb8e47ec94a67': 'the weepies',
 '019db87d5373986820b439d2c3a9c3a96bdab540': 'turin brakes',
 '062c3a27808341862c7203a95008d16de008258e': 'vanguart',
 '06728ba447eb3096c4397bf4998f7b3489049957': 'boom boom satellites',
 '0701715a7c6c6bc35036ad5fa7d89ddfe6691c95': 'the hot toddies',
 '0af0de2900ec1dc10afc8a4782ff1549c8b160a1': 'sam milby',
 '0b693c05ad30fbec114e4b83ba8171e90ac696ca': 'frightened rabbit',
 '0dc9c167dfca7d11e91b8e45dc917683bfb03b3a': 'sam milby',
 '0e6b168a73f4ca504957b21ff5ac5d7b7408f00f': 'the uniques',
 '1172d5780cfe1389588469ad557afabd3d6beadc': 'nick lachey',
 '12185cbb5d0007207d6adaef2fba64721487a4ec': 'vienna teng',
 '140c9b8363f4cce509595d31e199a55990248937': 'tomte',
 '1598b8ce0be701110d5cc178de8a280387c809e4': 'butthole surfers',
 '16d334e10f16ea7cdc2c5447d3312fb748e7cd93': 'indigo girls',
 '17adfff22b9477141eff9419e8e77b196193e7a2': 'luna',
 '19907dac6921cdf2f7150f6d737f66a12e5fcd8a': 'nick lachey',
 '1b17285c8468a4c77d01488293c9ef06

In [12]:
pd.DataFrame.from_dict(recommendation, orient='index').to_csv('recommendation.csv')