<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Neural Collaborative Filtering on MovieLens dataset.

Neural Collaborative Filtering (NCF) is a well-known recommendation algorithm that generalizes matrix factorization with a multi-layer perceptron.
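To make the idea concrete, here is a minimal NumPy sketch of a NeuMF-style forward pass. It assumes randomly initialized, untrained weights and toy dimensions; the function name `neumf_score` and every array below are illustrative and are not part of `reco_utils`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_factors = 5, 7, 4  # toy sizes, chosen only for illustration

# Separate embedding tables for the GMF branch and the MLP branch
P_gmf = rng.normal(scale=0.1, size=(n_users, n_factors))
Q_gmf = rng.normal(scale=0.1, size=(n_items, n_factors))
P_mlp = rng.normal(scale=0.1, size=(n_users, n_factors))
Q_mlp = rng.normal(scale=0.1, size=(n_items, n_factors))

# One hidden layer for the MLP branch, then an output layer over both branches
W1 = rng.normal(scale=0.1, size=(2 * n_factors, n_factors))
b1 = np.zeros(n_factors)
h = rng.normal(scale=0.1, size=2 * n_factors)


def neumf_score(u, i):
    """Return a score in (0, 1) for user index u and item index i."""
    gmf = P_gmf[u] * Q_gmf[i]                      # element-wise product (generalized MF)
    mlp_in = np.concatenate([P_mlp[u], Q_mlp[i]])  # concatenated user and item embeddings
    mlp = np.maximum(0.0, mlp_in @ W1 + b1)        # ReLU hidden layer
    logit = np.concatenate([gmf, mlp]) @ h         # fuse both branches
    return 1.0 / (1.0 + np.exp(-logit))            # sigmoid


score = neumf_score(0, 3)
```

If the MLP branch is removed and the output weights are all ones, the logit reduces to the dot product of the two embeddings, which is plain matrix factorization.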

This notebook provides an example of how to use and evaluate the NCF implementation in `reco_utils`. We use a smaller dataset in this example to run NCF efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/).


* https://github.com/microsoft/recommenders/blob/6815e5663ef87da1d0b9029bc9a8a367dc3d33a7/examples/02_model_hybrid/ncf_deep_dive.ipynb

In [1]:
%reload_ext autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import sys
sys.path.append("../../")

sys.path.append("./recommenders-master")
import time
import pandas as pd
import tensorflow as tf

from reco_utils.recommender.ncf.ncf_singlenode import NCF
from reco_utils.recommender.ncf.dataset import Dataset as NCFDataset
from reco_utils.dataset import movielens
from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset.python_splitters import python_chrono_split
from reco_utils.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, 
                                                     recall_at_k, get_top_k_items)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.6.11 (default, Nov 27 2020, 18:37:51) [MSC v.1916 64 bit (AMD64)]
Pandas version: 0.25.3
Tensorflow version: 1.15.2


Set the default parameters.

In [3]:
# top k items to recommend
TOP_K = 4

# Model parameters
EPOCHS = 20 # 20
BATCH_SIZE =  64#256

SEED = 42

MIN_TARGET_FREQ = 25

In [4]:
USE_COLS = ["utrip_id","city_id","checkin"]

df = pd.read_csv("booking_train_set.csv",
#                  nrows=423456,
#                  index_col=[0],
                 parse_dates=["checkin"],infer_datetime_format=True,
                usecols=USE_COLS)

df.sort_values([ "utrip_id", "checkin"],inplace=True)

print(df.shape[0])
##################
## filter rare items (min freq) and rare users
## two passes, because dropping rows in one filter can push the other below its threshold
for _ in range(2):
    # keep cities seen at least MIN_TARGET_FREQ times
    freq = df["city_id"].value_counts()
    df = df.loc[df["city_id"].map(freq) >= MIN_TARGET_FREQ]
    # keep trips with at least 4 check-ins
    freq = df["utrip_id"].value_counts()
    df = df.loc[df["utrip_id"].map(freq) >= 4]
#################
print(df.shape[0])

# df.columns = ["userID", "itemID", "timestamp"]
df.rename(columns={"utrip_id":"userID","city_id":"itemID","checkin":"timestamp"},inplace=True)

df["rating"] = 1
print(df.columns)
df

1166835
867651
Index(['timestamp', 'itemID', 'userID', 'rating'], dtype='object')


Unnamed: 0,timestamp,itemID,userID,rating
1061281,2016-04-09,38677,1000033_1,1
1061282,2016-04-11,52089,1000033_1,1
1061283,2016-04-12,21328,1000033_1,1
1061284,2016-04-14,27485,1000033_1,1
1061285,2016-04-16,38677,1000033_1,1
...,...,...,...,...
1120385,2016-04-21,24718,999855_1,1
1120386,2016-04-22,33408,999855_1,1
1120390,2016-04-27,63729,999855_1,1
1120391,2016-04-29,44489,999855_1,1


In [5]:
## userID could also be cast to categorical (and timestamp to total seconds) to save memory, but that may cause errors downstream
df["itemID"] = pd.to_numeric(df["itemID"].values, errors="ignore",downcast="integer")
df["rating"] = pd.to_numeric(df["rating"].values, errors="ignore",downcast="integer")

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 867651 entries, 1061281 to 1120392
Data columns (total 4 columns):
timestamp    867651 non-null datetime64[ns]
itemID       867651 non-null int32
userID       867651 non-null object
rating       867651 non-null int8
dtypes: datetime64[ns](1), int32(1), int8(1), object(1)
memory usage: 24.0+ MB


In [6]:
print(df[['itemID', 'userID']].nunique())

itemID      3874
userID    164813
dtype: int64


### 2. Split the data using the Python chronological splitter provided in utilities

In [7]:
train, test = python_chrono_split(df, 0.9, filter_by="user")
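As a rough illustration of what a per-user chronological split does, the pandas sketch below sends each user's earliest interactions to train and the most recent ones to test. The helper `chrono_split_sketch` and its rounding rule are assumptions made for this example; they are not the `python_chrono_split` implementation.

```python
import pandas as pd


def chrono_split_sketch(df, ratio=0.9, col_user="userID", col_time="timestamp"):
    """Per-user chronological split: the earliest `ratio` of rows go to train."""
    df = df.sort_values([col_user, col_time])
    # 1-based position of each row within its user's history
    rank = df.groupby(col_user).cumcount() + 1
    # number of rows each user has, aligned to the rows of df
    size = df.groupby(col_user)[col_user].transform("size")
    is_train = rank <= (size * ratio).round()
    return df[is_train], df[~is_train]


toy = pd.DataFrame({
    "userID": ["a"] * 5 + ["b"] * 4,
    "itemID": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "timestamp": pd.to_datetime(
        ["2016-01-05", "2016-01-01", "2016-01-03", "2016-01-02", "2016-01-04",
         "2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04"]),
})
toy_train, toy_test = chrono_split_sketch(toy, ratio=0.75)
```

In this toy data each user ends up with exactly one test row, namely the latest check-in.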

Generate an NCF dataset object from the data subsets.

In [None]:
data = NCFDataset(train=train, test=test, seed=SEED,n_neg_test=15)
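With implicit feedback every observed row is a positive, so negatives have to be sampled from items the user has not interacted with; `n_neg_test` appears to set how many such negatives accompany each test positive. The sketch below shows the general idea only. The helper `sample_negatives` and the toy data are hypothetical and do not reflect the `NCFDataset` internals.

```python
import random


def sample_negatives(user_items, all_items, n_neg, seed=42):
    """For each user, draw up to n_neg items the user has not interacted with."""
    rng = random.Random(seed)
    negatives = {}
    for user, seen in user_items.items():
        # candidate pool: every item minus the ones already seen by this user
        candidates = sorted(set(all_items) - set(seen))
        negatives[user] = rng.sample(candidates, min(n_neg, len(candidates)))
    return negatives


user_items = {"u1": [1, 2], "u2": [3]}
all_items = list(range(1, 11))
negs = sample_negatives(user_items, all_items, n_neg=4)
```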

### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data

NCF accepts implicit feedback and generates a propensity score between 0 and 1 for each item to be recommended to a user. A recommended item list can then be generated from the scores. Note that this quickstart notebook uses a smaller number of epochs to reduce training time. As a consequence, model performance will be slightly lower.

In [None]:
model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
#     n_factors=6,
#     layer_sizes=[16,8], # layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=1,
    seed=SEED
)

In [None]:
start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

In the movie recommendation scenario, movies a user has already seen are not recommended again. That filter is commented out in the cell below, because repeat predictions are wanted here.

In [None]:
start_time = time.time()
#### this part is very slow and compute intensive: it scores every (user, item) pair, and the resulting DataFrame takes about 2.5 GB in memory

users, items, preds = [], [], []
item = list(train.itemID.unique())
for user in train.userID.unique():
    user = [user] * len(item) 
    users.extend(user)
    items.extend(item)
    preds.extend(list(model.predict(user, item, is_list=True)))

all_predictions = pd.DataFrame(data={"userID": users, "itemID":items, "prediction":preds})

# ### following was in original code - I comment out, as we do want repeat predictions - dan
# merged = pd.merge(train, all_predictions, on=["userID", "itemID"], how="outer")
# all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)

test_time = time.time() - start_time
print("Took {} seconds for prediction.".format(test_time))
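When only the top-k items per user are needed, one way to bound memory is to keep just the k best scores inside the loop instead of storing every (user, item) pair. The sketch below assumes a scoring callable with the same shape as `model.predict(users, items, is_list=True)`; `top_k_per_user` and `fake_score` are hypothetical names, and `fake_score` exists only so the example runs without a trained model.

```python
import numpy as np
import pandas as pd


def top_k_per_user(user_ids, item_ids, score_fn, k):
    """Score all items for each user but keep only the k best rows."""
    item_ids = np.asarray(item_ids)
    rows = []
    for user in user_ids:
        scores = np.asarray(score_fn([user] * len(item_ids), list(item_ids)))
        # positions of the k highest scores, then ordered best-first
        top = np.argpartition(-scores, k - 1)[:k]
        top = top[np.argsort(-scores[top])]
        rows.append(pd.DataFrame({"userID": user,
                                  "itemID": item_ids[top],
                                  "prediction": scores[top]}))
    return pd.concat(rows, ignore_index=True)


def fake_score(users, items):
    # Deterministic stand-in scorer (an assumption for this sketch, not a model)
    mult = {"a": 7, "b": 3}
    return [((mult[u] * i) % 10) / 10.0 for u, i in zip(users, items)]


topk = top_k_per_user(["a", "b"], [1, 2, 3, 4, 5], fake_score, k=2)
```

The trade-off is that metrics needing the full score list can no longer be computed from the result, only top-k metrics.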

### 4. Evaluate how well NCF performs

The ranking metrics are used for evaluation.

In [None]:
%%time
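## get top-k accuracy (k = TOP_K) - per user
# all_predictions may not contain a "timestamp" column, so ignore it if it is missing
all_preds = all_predictions.drop("timestamp",axis=1,errors="ignore").sort_values(["userID","prediction"],ascending=False)

all_preds = all_preds.groupby("userID").head(TOP_K).drop("prediction",axis=1)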
# .head(250123)

test_users_total = test["userID"].nunique()
print("test_users_total",test_users_total)

print("%  match of users with top4 acc", round(100*test.merge(all_preds,on=["itemID","userID"],how="inner")["userID"].nunique()/test_users_total,5))
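The figure printed above is the share of test users for whom at least one test item appears in their top-k list. A toy example of the same merge-based computation, with made-up frames `test_toy` and `top_toy`:

```python
import pandas as pd

# Held-out interactions per user (toy data)
test_toy = pd.DataFrame({"userID": ["a", "a", "b", "c"],
                         "itemID": [1, 2, 3, 4]})
# Two recommended items per user (toy data)
top_toy = pd.DataFrame({"userID": ["a", "a", "b", "b", "c", "c"],
                        "itemID": [2, 9, 7, 8, 4, 5]})

# An inner merge keeps only (user, item) pairs present in both frames, i.e. hits
hits = test_toy.merge(top_toy, on=["userID", "itemID"], how="inner")
hit_rate = hits["userID"].nunique() / test_toy["userID"].nunique()
```

Users `a` and `c` each have a hit and user `b` has none, so the hit rate is 2/3.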

In [None]:
%%time
### very slowwww

# eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
# eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)

# eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)
eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)


print(#"MAP:\t%f" % eval_map,
      #"NDCG:\t%f" % eval_ndcg,
#       "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

In [None]:
recall_at_k(test, all_predictions, col_prediction='prediction', k=90)
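For reference, recall@k is commonly defined per user as the number of test items found in the top-k list divided by the number of test items, averaged over users. The sketch below follows that definition on toy data; `recall_at_k_sketch` is a hypothetical helper and may differ from the library's `recall_at_k` in details such as which users are counted.

```python
import pandas as pd


def recall_at_k_sketch(truth, preds, k):
    """Mean over users of (hits in top k) / (number of held-out items)."""
    top = (preds.sort_values(["userID", "prediction"], ascending=[True, False])
                .groupby("userID").head(k))
    hits = truth.merge(top, on=["userID", "itemID"]).groupby("userID").size()
    actual = truth.groupby("userID").size()
    # users with no hit are missing from `hits`, so fill them with 0
    return (hits.reindex(actual.index, fill_value=0) / actual).mean()


truth = pd.DataFrame({"userID": ["a", "a", "b"], "itemID": [1, 2, 3]})
preds = pd.DataFrame({
    "userID": ["a", "a", "a", "b", "b", "b"],
    "itemID": [1, 5, 2, 3, 6, 7],
    "prediction": [0.9, 0.8, 0.1, 0.2, 0.7, 0.6],
})
recall_2 = recall_at_k_sketch(truth, preds, k=2)
recall_3 = recall_at_k_sketch(truth, preds, k=3)
```

Recall can only stay flat or rise as k grows, which is why a large k such as 90 gives an upper-end view of how many held-out items the model can reach.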

In [None]:
if is_jupyter():
    # Record results with scrapbook for tests
    import scrapbook as sb
    # map, ndcg and precision are commented out above, so only recall is recorded
    sb.glue("recall", eval_recall)
    sb.glue("train_time", train_time)
    sb.glue("test_time", test_time)