# Service Intelligence [Recommender Systems for Services]
### TA: Jongkyung Shin (AIGS)
### contact : shinjk1156@unist.ac.kr (if you have any questions about this session, feel free to ask me)



# Factorization Machine (FM)


![screensh](https://drive.google.com/uc?export=view&id=1EQU3eXDI1GHVbEaYx9pN8FDpzDC1S9Oz)

<img src="https://drive.google.com/uc?export=view&id=1a7g5NudnRDJRc8vn4KXNYZDnsmFzBpsy" width="80%">

## Input data format for FM

- Combination of one-hot encoded vectors (and continuous variables)
- Very sparse (most elements of the matrix are zero)
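
As a minimal sketch of this input format, the toy table below (column names and values are illustrative assumptions, not the session data) is one-hot encoded so that each interaction becomes one sparse 0/1 row:

```python
import pandas as pd

# Toy interactions already joined with user/item attributes (illustrative values only)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "item_id": ["i1", "i2", "i2"],
    "sex": ["M", "M", "F"],
    "large_category": ["fish", "vegetable", "vegetable"],
})

# One-hot encode every categorical column; each row becomes a sparse 0/1 vector
fm_input = pd.get_dummies(
    toy, columns=["user_id", "item_id", "sex", "large_category"]
).astype(int)

print(fm_input.columns.tolist())
print(fm_input.iloc[0].tolist())
```

Each row has exactly one active entry per encoded field, so most of the matrix is zero even in this tiny example.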

<img src="https://drive.google.com/uc?export=view&id=1hpL4zcxGxBtXg_pM_BOfXozpBFaTErR4" width="80%">


# Data preparation (DP)

In [2]:
import numpy as np
import pandas as pd
#from google.colab import files
import io

In [4]:
# Load interaction data (user_id, item_id) using pandas read_csv() function
## User A buys item a
## User B buys item c ...
### Output variable name : interaction_df
'''
file_uploaded = files.upload()
df = pd.read_csv(io.BytesIO(file_uploaded['interaction.csv']))
'''
interaction_df = pd.read_csv('interaction.csv')
interaction_df.head()



Unnamed: 0,user_id,item_id
0,1369335,26615000
1,1369335,30626000
2,1369335,23917000
3,1369335,26563000
4,1369335,1294201000


In [5]:
# Load User information data (user_id, sex, age) using pandas read_csv() function
## User information
## User A is male and in his fifties.
## User B is female and in her forties.
### Output variable name : user_df
'''
file_uploaded = files.upload()
df = pd.read_csv(io.BytesIO(file_uploaded['users.csv']))
'''
user_df = pd.read_csv('users.csv')
user_df.head()



Unnamed: 0,user_id,sex,age
0,1369335,M,50
1,2965149,F,40
2,2107571,F,50
3,22182386,F,50
4,3082967,F,40


In [6]:
# Load Item information data (item_id, large_category) using pandas read_csv() function
## Item A belongs to the fish category
## Item B belongs to the vegetable category
### Output variable name : item_df
'''
file_uploaded = files.upload()
df = pd.read_csv(io.BytesIO(file_uploaded['items.csv']))
'''
item_df = pd.read_csv('items.csv')
item_df.head()



Unnamed: 0,item_id,large_category
0,26615000,fish
1,30626000,vegetable
2,23917000,vegetable
3,26563000,vegetable
4,1294201000,dried_seafood


In [8]:
# Check the number of users/items/interactions and print the results
# Hint : use pandas unique() and tolist() function
# Output variable name : num_user, num_item, num_interactions

### Write your code ###

#######################

print("# of Users : {}".format(num_user))
print("# of Items : {}".format(num_item))
print("# of interactions : {}".format(num_interactions))

NameError: name 'num_user' is not defined
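
The hint above can be illustrated on a toy frame. This is only a sketch: the `toy_` names and values are assumptions made for the example, not the session data.

```python
import pandas as pd

# Toy interaction table (illustrative values only)
toy_interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
})

# unique() returns the distinct values; tolist() converts them to a plain Python list
toy_num_user = len(toy_interactions["user_id"].unique().tolist())
toy_num_item = len(toy_interactions["item_id"].unique().tolist())
toy_num_interactions = len(toy_interactions)  # one row per interaction

print(toy_num_user, toy_num_item, toy_num_interactions)
```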

## (DP1) Create dummy variables to represent categorical variables


### User information

In [None]:
# Create dummy variables of user information using pandas get_dummies() function (excluding user_id)
## Output variable name : user_features

### Write your code ###

#######################

user_features.head()

### Item information

In [None]:
# Create dummy variables of item information using pandas get_dummies() function (excluding item_id)
## Output variable name : item_features

### Write your code  ###

#######################

item_features.head()
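
A small sketch of `get_dummies()` on a toy user table (values are illustrative). Keeping `user_id` as an un-encoded key column is an assumption made here so that rows can still be matched to users; check what the downstream library expects.

```python
import pandas as pd

# Toy user table (illustrative values only)
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "sex": ["M", "F", "F"],
    "age": [50, 40, 50],
})

# Encode only sex and age; user_id is not listed in `columns`, so it is left as-is
toy_user_features = pd.get_dummies(toy_users, columns=["sex", "age"]).astype(int)

print(toy_user_features)
```

Passing `age` through `columns` treats each age group as a category rather than as a continuous number.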

### Calculation of the number of dimensions and matrix sparsity 

In [None]:
# Merge the user, item, and interaction datasets into one dataset using pandas merge() function
## Output variable name : merge_df

### Write your code ###

#######################

merge_df

In [None]:
# Create dummy variables for all categorical variables in the merged dataset using pandas get_dummies() function
# Note: input_matrix is not used to train the FM model; it is only used to calculate the number of dimensions and the sparsity
### Output variable name : input_matrix

### Write your code ###

#######################
input_matrix

In [None]:
# Print the shape of merged_datasets (variable name : input_matrix)

print("Shape of input matrix : {}".format(input_matrix.shape))

- The final data format is **wide**: the number of dimensions is large (5753)
- Training an ML model on this kind of dataset is computationally expensive
- It also leads to the **'curse of dimensionality'**, which can degrade model performance

### FM's unique trick to reduce dimensions of interaction matrix


![screensh](https://drive.google.com/uc?export=view&id=1wnmcd1PeH0Le6U2gg6_nZVPslBALapxK)

- For the interaction term, FM does not estimate the full W matrix (5753 \* 5753); it estimates a V matrix (5753 \* K) instead
- This reduces the computational cost and memory usage
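
The sketch below checks this trick numerically on toy sizes. It assumes the standard FM formulation, in which the weight of the pair (i, j) is the dot product of rows i and j of V. The pairwise term is computed twice: once with a double loop over pairs and once with the well-known linear-time rewrite that only needs V.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                    # toy sizes: n features, k latent factors
x = rng.random(n)              # one input row
V = rng.normal(size=(n, k))    # latent factor matrix (n * k instead of n * n)

# Naive pairwise term: sum over i < j of <V[i], V[j]> * x[i] * x[j]
naive = 0.0
for i in range(n):
    for j in range(i + 1, n):
        naive += (V[i] @ V[j]) * x[i] * x[j]

# Linear-time rewrite: 0.5 * sum_f [ (sum_i V[i,f] x[i])^2 - sum_i V[i,f]^2 x[i]^2 ]
xv = x @ V                     # shape (k,)
x2v2 = (x ** 2) @ (V ** 2)     # shape (k,)
fast = 0.5 * np.sum(xv ** 2 - x2v2)

print(np.isclose(naive, fast))
```

With sparse one-hot inputs, only the non-zero entries of x contribute, which is why the cost stays low even when the number of dimensions is large.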

In [None]:
# Calculate the sparsity of the matrix using numpy count_nonzero() function
## sparsity == 1 means that the matrix consists only of zeros
## Output variable name : sparsity

### Write your code ###

#######################

print("Sparsity : {}".format(round(sparsity,3)))

- A matrix made up of dummy variables is highly sparse.
- High sparsity tends to lower model performance.
- This is called the **sparsity problem** in the recommendation research domain
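
As a sketch of the sparsity definition used here (the fraction of zero entries), on a toy matrix with illustrative values:

```python
import numpy as np

# Toy 0/1 matrix (illustrative values only): 4 non-zero entries out of 12
toy_matrix = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
])

# Sparsity = 1 - (number of non-zero entries / total number of entries)
toy_sparsity = 1.0 - np.count_nonzero(toy_matrix) / toy_matrix.size

print(round(toy_sparsity, 3))
```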

## (DP2) Split the data into Training / Test sets

In [None]:
# Split the Data into training and test sets in your own way
## Output variable name : train_interaction, test_interaction

### Write your code ###

#######################

train_interaction

In [None]:
# Reset the index of each dataframe using pandas reset_index() function
## Output variable name : train_interaction, test_interaction

### Write your code ###

#######################

train_interaction
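
One possible way to split, sketched on toy data: a random 80/20 row split with pandas `sample()` and `drop()`. The ratio, the seed, and the `toy_` names are assumptions for illustration; other strategies (for example, a per-user split) are equally valid.

```python
import pandas as pd

# Toy interaction table with 10 rows (illustrative values only)
toy_interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "item_id": [10, 20, 10, 30, 20, 30, 10, 40, 20, 40],
})

# Randomly hold out 20% of the rows for testing; the rest is used for training
toy_test = toy_interactions.sample(frac=0.2, random_state=42)
toy_train = toy_interactions.drop(toy_test.index)

# Reset the index so that both frames are numbered from 0 again
toy_train = toy_train.reset_index(drop=True)
toy_test = toy_test.reset_index(drop=True)

print(len(toy_train), len(toy_test))
```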

### Cold start issue
- The cold-start problem is a very important issue in the recommendation domain.
- Cold-start users : **new users that the model did not observe** during the training step
- Cold-start items : **new items that the model did not observe** during the training step
- For more details about the cold-start problem, please refer to the links below.
  - https://www.yusp.com/blog-posts/cold-start-problem/
  - https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)

In [None]:
# Identify the users in each interaction dataset
# Hint: Use pandas unique() function
## Output variable name : train_users_list, test_users_list

### Write your code ###

#######################

# Calculate the number of cold-start users, i.e. users that appear in the test dataset but not in the training dataset.
cold_start_users = set(test_users_list) - set(train_users_list)
print("# of cold_start_users : {}".format(len(cold_start_users)))

In [None]:
# Identify the items in each interaction dataset
# Hint: Use pandas unique() function
## Output variable name : train_items_list, test_items_list

### Write your code ###

#######################

# Calculate the number of cold-start items, i.e. items that appear in the test dataset but not in the training dataset.
cold_start_items = set(test_items_list) - set(train_items_list)
print("# of cold_start_items : {}".format(len(cold_start_items)))

# Cold-start items cannot be recommended by the FM model because they are not observed during training.

In [None]:
# Create dummy variables of user and item information for training datasets
# Hint : pandas isin() function
## Output variable name : train_user_features, train_item_features

### Write your code ###

#######################
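
The `isin()` hint can be sketched on toy data as follows. All `toy_` names and values are assumptions for illustration: the feature table is restricted to the users that actually appear in the training interactions.

```python
import pandas as pd

# Toy training interactions and a toy user feature table (illustrative values only)
toy_train = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 20, 10]})
toy_user_features = pd.DataFrame({
    "user_id": [1, 2, 3],
    "sex_F": [0, 1, 1],
    "sex_M": [1, 0, 0],
})

# Keep only the feature rows whose user_id occurs in the training interactions
mask = toy_user_features["user_id"].isin(toy_train["user_id"].unique())
toy_train_user_features = toy_user_features[mask].reset_index(drop=True)

print(toy_train_user_features)
```

The same pattern applies to item features, using `item_id` instead of `user_id`.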


# RankFM
#### Ref : https://rankfm.readthedocs.io/en/latest/home.html, https://github.com/etlundquist/rankfm

- RankFM has 9 hyper-parameters.
    - **factors** : the number of latent factors (>1)
    - **loss** : optimization/loss function to use for training: ['bpr', 'warp'] ('warp' recommended.)
       -  For more details about 'warp loss' and 'negative sampling', see https://medium.com/@gabrieltseng/intro-to-warp-loss-automatic-differentiation-and-pytorch-b6aa5083187a
    - **max_samples** : Maximum number of negative samples to draw for WARP loss (>0)
    - **alpha** : L2 regularization penalty on [user, item] model weights (>0.0)
    - **beta** : L2 regularization penalty on [user-feature, item-feature] model weights (>0.0)
    - **sigma** : standard deviation to use for random initialization of factor weights (>0.0)
    - **learning_rate** : initial learning rate for gradient step updates (>0.0)
    - **learning_schedule** : schedule for adjusting learning rates by training epoch: ['constant', 'invscaling']
    - **learning_exponent** : exponent applied to epoch number to adjust learning rate (>0.0): scaling = 1 / pow(epoch + 1, learning_exponent) 

- You can change the value of each hyper-parameter to obtain a better-performing model.
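
To illustrate the `invscaling` formula quoted above, here is a plain-Python sketch that does not call rankfm. It assumes that epochs are counted from 0 and that the effective rate is `learning_rate` multiplied by the scaling factor; both are assumptions for illustration, not confirmed library behaviour.

```python
def inv_scaling(epoch, learning_exponent):
    """Scaling factor from the formula above: 1 / pow(epoch + 1, learning_exponent)."""
    return 1.0 / pow(epoch + 1, learning_exponent)

learning_rate = 0.1        # illustrative value
learning_exponent = 0.25   # illustrative value

# Assumed effective rate per epoch: the initial rate times the scaling factor
rates = [learning_rate * inv_scaling(epoch, learning_exponent) for epoch in range(4)]

for epoch, rate in enumerate(rates):
    print(epoch, round(rate, 4))
```

A larger `learning_exponent` makes the rate decay faster, while `'constant'` would keep it fixed.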

## 1) Training step for FM model

In [None]:
pip install rankfm

In [None]:
import rankfm
from rankfm.rankfm import RankFM
# Instantiate RankFM and set the hyper-parameters in your own way.
## Output variable name : model

### Write your code ###

#######################


In [None]:
%%time
# Train the model using fit() function. You can choose the number of epochs.
# You must input the interaction data, user_features, and item_features (for the training datasets) into the fit() function.
# Note: the %%time cell magic must be the first line of the cell.

### Write your code ###

#######################

In [None]:
# The FM model is a kind of regression model.
# Therefore, it calculates a score for each interaction (user/item pair: [user_id, item_id]) (the larger the score, the higher the probability)
## Calculate the score of each interaction in the test interaction dataset using rankfm predict() function.
### Output variable name : test_scores

### Write your code ###

#######################

test_scores

## 2) Generating TopK recommendation
- Based on the scores, TopK recommendations can be generated.
- The 'recommend()' function in the rankfm package provides each user's TopK recommended items in descending order
  - The best recommended item is in column 0.

In [None]:
# Generate TopK recommendations for each user in the test interaction data using rankfm recommend() function
## Output variable name : test_recommendation
TopK = 10

### Write your code ###

#######################

test_recommendation
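
The idea of turning scores into a TopK list can be sketched without rankfm. The item IDs and scores below are made up for illustration; the sketch sorts the items of one user by score in descending order and keeps the first K.

```python
import numpy as np

# Toy scores of five candidate items for one user (illustrative values only)
item_ids = np.array([101, 102, 103, 104, 105])
scores = np.array([0.2, 1.5, -0.3, 0.9, 1.1])
top_k = 3

# argsort on the negated scores gives indices in descending score order
order = np.argsort(-scores)[:top_k]
top_items = item_ids[order].tolist()

print(top_items)  # best item first
```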

## 3) Evaluating FM model

- For more information about metrics for recommendation, please refer to https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54

- For this practice, Hit ratio, reciprocal rank, discounted cumulative gain (DCG), precision, recall and F1 score are used to measure the performance of the FM model (The higher the better)
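
For intuition, the sketch below computes these metrics for a single user with commonly used definitions. The helper `metrics_at_k` is hypothetical and is not the rankfm implementation, whose exact definitions may differ. The reported model score would be the average over users.

```python
import math

def metrics_at_k(recommended, relevant, k):
    """Per-user ranking metrics for a ranked list and a set of relevant (test) items."""
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    n_hits = sum(hits)
    return {
        "hit": 1.0 if n_hits > 0 else 0.0,
        # Reciprocal of the rank (1-based) of the first relevant item, 0 if none
        "reciprocal_rank": next((1.0 / (r + 1) for r, h in enumerate(hits) if h), 0.0),
        # Each hit is discounted by log2(rank + 1) with 1-based ranks
        "dcg": sum(h / math.log2(r + 2) for r, h in enumerate(hits)),
        "precision": n_hits / k,
        "recall": n_hits / len(relevant) if relevant else 0.0,
    }

# Toy example: 4 recommended items, 3 relevant items, 2 of them recommended
example = metrics_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=4)
print(example)
```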

In [10]:
from rankfm.evaluation import hit_rate, reciprocal_rank, discounted_cumulative_gain, precision, recall

ModuleNotFoundError: No module named 'rankfm'

In [11]:
# Evaluate the trained FM model using all metrics (Hit rate, reciprocal rank, DCG, precision, recall, F1 score) when K is equal to 10.

K = 10

Hit_rate = hit_rate(model, test_interaction, k=K)
Reciprocal_rank = reciprocal_rank(model, test_interaction, k=K)
Dcg = discounted_cumulative_gain(model, test_interaction, k=K)
Precision = precision(model, test_interaction, k=K)
Recall = recall(model, test_interaction, k=K)

print("*"*5 + " Performance of RankFM " + "*"*5)
print("Hit_ratio: {}".format(round(Hit_rate, 3)))
print("Reciprocal_rank: {}".format(round(Reciprocal_rank, 3)))
print("Dcg: {}".format(round(Dcg, 3)))
print("Precision: {}".format(round(Precision, 3)))
print("Recall: {}".format(round(Recall, 3)))
print("F1 score: {}".format(round((2*Recall*Precision)/(Recall+Precision),3)))

NameError: name 'hit_rate' is not defined

## 4) Performance Comparison

- Here, we compare the trained RankFM model with a baseline, PoP
- PoP is a popularity-based recommendation model (very simple)


- **The evaluation results of your FM model must be better than those of the baseline model, PoP. To achieve this, you need to tune the hyper-parameters of the FM model appropriately.**


In [None]:
popular_items = train_interaction.groupby('item_id')['user_id'].count().sort_values(ascending=False)[:K]
popular_items

In [None]:
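# Build the set of training users once instead of rebuilding it for every key
train_users = set(train_interaction.user_id.unique())
# Map each test user to the set of items they interacted with, keeping only users seen in training
test_user_items = test_interaction.groupby('user_id')['item_id'].apply(set).to_dict()
test_user_items = {key: val for key, val in test_user_items.items() if key in train_users}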

In [None]:
base_pre = np.mean([len(set(popular_items.index) & set(val)) / len(set(popular_items.index)) for key, val in test_user_items.items()])
base_rec = np.mean([len(set(popular_items.index) & set(val)) / len(set(val))                for key, val in test_user_items.items()])

In [None]:
print("Performance Comparison\n")

print("*"*5 + " Performance of PoP " + "*"*5)
print("Precision: {:.3f}".format(base_pre))
print("Recall: {:.3f}".format(base_rec))
print("F1 score: {}\n".format(round((2*base_rec*base_pre)/(base_rec + base_pre),3)))

print("*"*5 + " Performance of RankFM " + "*"*5)
print("Precision: {}".format(round(Precision, 3)))
print("Recall: {}".format(round(Recall, 3)))
print("F1 score: {}".format(round((2*Recall*Precision)/(Recall+Precision),3)))
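
Note that the F1 expressions above divide by `Recall + Precision`, which is zero when a model has no hits at all. A small guard avoids the resulting `ZeroDivisionError`; the helper name `f1_score` is hypothetical and used only for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.2, 0.1), 3))
print(f1_score(0.0, 0.0))
```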