Collaborative Filtering Utilizing Neural Networks/DBLP

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-pre-processing" data-toc-modified-id="Data-pre-processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data pre-processing</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#MF-MLP-model" data-toc-modified-id="MF-MLP-model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>MF-MLP model</a></span><ul class="toc-item"><li><span><a href="#Model-Construction" data-toc-modified-id="Model-Construction-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Model Construction</a></span></li><li><span><a href="#Prediction-and-evaluation" data-toc-modified-id="Prediction-and-evaluation-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Prediction and evaluation</a></span></li></ul></li><li><span><a href="#MF-LSTM-model" data-toc-modified-id="MF-LSTM-model-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>MF-LSTM model</a></span><ul class="toc-item"><li><span><a href="#Model-Construction" data-toc-modified-id="Model-Construction-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Model Construction</a></span></li><li><span><a href="#Prediction-and-evaluation" data-toc-modified-id="Prediction-and-evaluation-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Prediction and evaluation</a></span></li></ul></li></ul></li></ul></div>

## Data pre-processing

**Convert dat to txt file**

**Load data, split columns by delimiter and keep columns 'Authors' and 'References'**

In [1]:
import pandas as pd
authors=[]
references=[]

dff = pd.read_excel('data/paper_collection.xlsx', header=None)
dff.columns = ['col']
for index, row in dff.iterrows():
    objects=json.loads(row['col'])
    if "authors" in objects:
            authors.append(objects['authors'])
    else:
            authors.append("NaN")
    if "references" in objects:
            references.append(objects['references'])
    else:
            references.append("NaN")
            
data = pd.DataFrame()
data["Authors"] = authors
data["References"] = references
     
data.head()   

Unnamed: 0,Authors,References
0,"[{'_id': '53f43403dabfaedce5517a1c', 'name': '...",
1,"[{'name': 'Rachid Hba', 'sid': '2371749277'}, ...","[56d8b0e7dabfae2eeee050a8, 53e9ae11b7602d97038..."
2,"[{'name': 'David Poirier-Quinot', 'org': 'Airb...","[53e9a952b7602d97032a6396, 53e9b9bab7602d97045..."
3,"[{'_id': '562d3b5845cedb3398d965a8', 'name': '...",
4,"[{'_id': '53f42e97dabfaeb2acff9b40', 'name': '...",


**Distinguish Authors only by id**

In [2]:
counter = -1
new_df = pd.DataFrame(columns = ['Authors', 'Articles'])
for row in data["Authors"]:
    counter = counter + 1
    for i in range(len(row)):
        if "_id" in row[i]:
            new_row = {'Authors': row[i]['_id'], 'Articles': data["References"][counter]}
            new_df = new_df.append(new_row, ignore_index=True)
            
new_df.head(6)

Unnamed: 0,Authors,Articles
0,53f43403dabfaedce5517a1c,
1,54055b50dabfae8faa5c6b14,
2,562d198845cedb3398d5191b,
3,53f43767dabfaeb2ac05c0bc,
4,53f42930dabfaeb2acfb5f31,
5,560587a245ce1e595e65d699,"[53e9a952b7602d97032a6396, 53e9b9bab7602d97045..."


**Replace NaN values with 0**

In [3]:
from numpy import nan

new_df = new_df.replace('NaN', 0)
new_df.head(6)

Unnamed: 0,Authors,Articles
0,53f43403dabfaedce5517a1c,0
1,54055b50dabfae8faa5c6b14,0
2,562d198845cedb3398d5191b,0
3,53f43767dabfaeb2ac05c0bc,0
4,53f42930dabfaeb2acfb5f31,0
5,560587a245ce1e595e65d699,"[53e9a952b7602d97032a6396, 53e9b9bab7602d97045..."


**Make new rows for each article an author liked**

In [4]:
counter = -1
df = pd.DataFrame(columns = ['Authors', 'Articles'])
for row in new_df["Articles"]:
    counter = counter + 1
    if row != 0:
        for i in row:
            new_row = {'Authors': new_df['Authors'][counter], 'Articles': i}
            df = df.append(new_row, ignore_index=True)
            
df.head()

Unnamed: 0,Authors,Articles
0,560587a245ce1e595e65d699,53e9a952b7602d97032a6396
1,560587a245ce1e595e65d699,53e9b9bab7602d97045b2219
2,560587a245ce1e595e65d699,5550441f45ce0a409eb4b702
3,560587a245ce1e595e65d699,53e9ba7cb7602d970469c8e9
4,560587a245ce1e595e65d699,555041f245ce0a409eb3eda7


**Check the number of registrations, the number of authors and the number of articles we have in our dataset**

In [5]:
print('len:', len(df))
print('Authors:', df['Authors'].nunique())
print('Articles:', df['Articles'].nunique())

len: 229183
Authors: 10063
Articles: 65874


**Keep authors who has read at least 40 articles and articles with at least 20 authors read them in out dataset**

In [6]:
users_keep = df['Authors'].value_counts() > 40
y = users_keep[users_keep].index
data = df[df['Authors'].isin(y)]

data.reset_index(inplace = True, drop = True)

books_keep = data['Articles'].value_counts() > 20
y = books_keep[books_keep].index
data = data[data['Articles'].isin(y)]

**Check the number of registrations, the number of authors and the number of articles we have in our dataset after the previous restrictions**

In [7]:
print('len:', len(data))
print('Authors:', data['Authors'].nunique())
print('Articles:', data['Articles'].nunique())

len: 3254
Authors: 676
Articles: 78


**Reset the index column**

In [8]:
data.reset_index(inplace = True, drop = True)
data.head()

Unnamed: 0,Authors,Articles
0,5487f377dabfae8a11fb3f0a,53e9bcc1b7602d97049412d4
1,542a1cabdabfae61d49563cd,53e9ab6fb7602d970350dc5d
2,54876d00dabfae8a11fb39ca,599c7f08601a182cd28e5abd
3,54876d00dabfae8a11fb39ca,53e99fc2b7602d9702899ee6
4,53f44d05dabfaeee22a11481,53e9b068b7602d9703acf032


**Create a new column which represents that author liked the article**

In [9]:
import numpy as np

data['View'] = np.ones(len(data), dtype=int)
data.head()

Unnamed: 0,Authors,Articles,View
0,5487f377dabfae8a11fb3f0a,53e9bcc1b7602d97049412d4,1
1,542a1cabdabfae61d49563cd,53e9ab6fb7602d970350dc5d,1
2,54876d00dabfae8a11fb39ca,599c7f08601a182cd28e5abd,1
3,54876d00dabfae8a11fb39ca,53e99fc2b7602d9702899ee6,1
4,53f44d05dabfaeee22a11481,53e9b068b7602d9703acf032,1


**Create a new dataset with all combinations Author-Article**

In [10]:
dataset = pd.DataFrame(columns = ['Authors', 'Articles'])

In [11]:
from itertools import product

for x, y in product(set(data['Articles']), set(data['Authors'])):
        dataset = dataset.append({'Authors': y, 'Articles': x}, ignore_index = True)

In [12]:
dataset.head()

Unnamed: 0,Authors,Articles
0,562d996945cedb3398e78be8,53e9b108b7602d9703b85b88
1,53f433efdabfaedce5516b23,53e9b108b7602d9703b85b88
2,5628eae445ce1e59660effe6,53e9b108b7602d9703b85b88
3,53f38e3edabfae4b34a44275,53e9b108b7602d9703b85b88
4,548691a3dabfaed7b5fa2a43,53e9b108b7602d9703b85b88


**Check the number of registrations we have in that(dff) dataset**

In [13]:
len(dataset)

52728

**Merge the two datasets. This way we will have a dataset whith all combinations Author-Article and a column 'View' which represents if each author liked the corresponding article (1) or not (0)**

In [14]:
merged_df = dataset.merge(data, how='left', left_on=["Authors", "Articles"], right_on=["Authors","Articles"])
merged_df.head()

Unnamed: 0,Authors,Articles,View
0,562d996945cedb3398e78be8,53e9b108b7602d9703b85b88,
1,53f433efdabfaedce5516b23,53e9b108b7602d9703b85b88,
2,5628eae445ce1e59660effe6,53e9b108b7602d9703b85b88,
3,53f38e3edabfae4b34a44275,53e9b108b7602d9703b85b88,
4,548691a3dabfaed7b5fa2a43,53e9b108b7602d9703b85b88,1.0


**Replace NaN values with 0**

In [15]:
merged_df= merged_df.fillna(0)
merged_df.head()

Unnamed: 0,Authors,Articles,View
0,562d996945cedb3398e78be8,53e9b108b7602d9703b85b88,0.0
1,53f433efdabfaedce5516b23,53e9b108b7602d9703b85b88,0.0
2,5628eae445ce1e59660effe6,53e9b108b7602d9703b85b88,0.0
3,53f38e3edabfae4b34a44275,53e9b108b7602d9703b85b88,0.0
4,548691a3dabfaed7b5fa2a43,53e9b108b7602d9703b85b88,1.0


**Convert authors and articles names into numerical IDs**

In [16]:
from sklearn import preprocessing

lbl_authors = preprocessing.LabelEncoder()
lbl_articles = preprocessing.LabelEncoder()

merged_df['Lbl_Authors'] = lbl_authors.fit_transform(merged_df['Authors'].values)
merged_df['Lbl_Articles'] = lbl_articles.fit_transform(merged_df['Articles'].values)

merged_df.head()

Unnamed: 0,Authors,Articles,View,Lbl_Authors,Lbl_Articles
0,562d996945cedb3398e78be8,53e9b108b7602d9703b85b88,0.0,640,18
1,53f433efdabfaedce5516b23,53e9b108b7602d9703b85b88,0.0,103,18
2,5628eae445ce1e59660effe6,53e9b108b7602d9703b85b88,0.0,563,18
3,53f38e3edabfae4b34a44275,53e9b108b7602d9703b85b88,0.0,24,18
4,548691a3dabfaed7b5fa2a43,53e9b108b7602d9703b85b88,1.0,497,18


**Handle Imbalanced Data & split dataset in train and test datasets**

In [17]:
print(merged_df['View'].value_counts())

0.0    49619
1.0     3254
Name: View, dtype: int64


In [18]:
from imblearn import over_sampling
from imblearn.over_sampling import RandomOverSampler
from sklearn import model_selection

oversample = RandomOverSampler(sampling_strategy='minority')

Y = merged_df['View'] 
X = merged_df[['Authors', 'Articles', 'Lbl_Authors', 'Lbl_Articles']]

X , Y = oversample.fit_resample(X , Y)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

data_train = X_train
data_train['View'] = Y_train
data_test = X_test
data_test['View'] = Y_test

## Models


### MF-MLP model

#### Model Construction

In [19]:
import tensorflow.keras as keras
from tensorflow.keras.layers import Concatenate, Dense, Embedding, Flatten, Input, Multiply, LSTM, Dropout, Reshape, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from typing import List

**Define the model**

In [20]:
def create_ncf(
    number_of_users: int,
    number_of_items: int,
    latent_dim_mf: int = 4,
    latent_dim_mlp: int = 32,
    reg_mf: int = 0,
    reg_mlp: int = 0.01
) -> keras.Model:

    # input layer
    user = Input(shape=(), dtype="int32", name="Lbl_Authors")
    item = Input(shape=(), dtype="int32", name="Lbl_Articles")

    # embedding layers
    mf_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mf, name = "mf_user_embedding", 
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)
    mf_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mf, name = "mf_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)

    mlp_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mlp, name = "mlp_user_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mlp), input_length = 1)
    mlp_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mlp, name = "mlp_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mlp), input_length = 1)

    # MF vector
    mf_user_latent = Flatten()(mf_user_embedding(user))
    mf_item_latent = Flatten()(mf_item_embedding(item))
    mf_cat_latent = Multiply()([mf_user_latent, mf_item_latent])

    # MLP vector
    mlp_user_latent = Flatten()(mlp_user_embedding(user))
    mlp_item_latent = Flatten()(mlp_item_embedding(item))
    mlp_cat_latent = Concatenate()([mlp_user_latent, mlp_item_latent])
    # Add a first dropout layer.
    dropout = Dropout(0.2)(mlp_cat_latent)
    #Αdd four hidden layers along with batch normalization and dropouts.
    layer_1 = Dense(64, activation='relu', name='layer1')(dropout)
    batch_norm1 = BatchNormalization(name='batch_norm1')(layer_1)
    dropout1 = Dropout(0.2, name='dropout1')(batch_norm1)

    layer_2 = Dense(32, activation='relu', name='layer2')(layer_1)
    batch_norm2 = BatchNormalization(name='batch_norm1')(layer_2)
    dropout2 = Dropout(0.2, name='dropout1')(batch_norm2)

    layer_3 = Dense(16, activation='relu', name='layer3')(layer_2)
    layer_4 = Dense(8, activation='relu', name='layer4')(layer_3)
    #Merge the two networks together
    merged_vector = Concatenate()([mf_cat_latent, layer_4])
    #Add the final single neuron output layer.
    output_layer = Dense(1, activation = "sigmoid", kernel_initializer="lecun_uniform", name="View")(merged_vector)

    model = Model(inputs = [user, item], outputs = [output_layer])

    return model

**Create and compile the model**

In [21]:
from tensorflow.keras.optimizers import Adam
import tensorflow as tf

n_users = merged_df['Lbl_Authors'].nunique()
n_items = merged_df['Lbl_Articles'].nunique()

ncf_model = create_ncf(n_users, n_items)

ncf_model.compile(optimizer = Adam(), loss = "binary_crossentropy",
    metrics=[
        tf.keras.metrics.TruePositives(name="tp"),
        tf.keras.metrics.FalsePositives(name="fp"),
        tf.keras.metrics.TrueNegatives(name="tn"),
        tf.keras.metrics.FalseNegatives(name="fn"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)

ncf_model._name = "neural_collaborative_filtering"
ncf_model.summary()

Model: "neural_collaborative_filtering"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Lbl_Authors (InputLayer)       [(None,)]            0           []                               
                                                                                                  
 Lbl_Articles (InputLayer)      [(None,)]            0           []                               
                                                                                                  
 mlp_user_embedding (Embedding)  (None, 32)          21632       ['Lbl_Authors[0][0]']            
                                                                                                  
 mlp_item_embedding (Embedding)  (None, 32)          2496        ['Lbl_Articles[0][0]']           
                                                                     

**Make TensorFlow dataset from Pandas DataFrame to use it as input**

In [22]:
def make_tf_dataset(
    df: pd.DataFrame,
    targets: List[str],
    val_split: float = 0.1,
    batch_size: int = 512,
    seed=42,
):
    """
    :param df: input DataFrame - only contains features and target(s)
    :param targets: list of columns names corresponding to targets
    :param val_split: fraction of the data that should be used for validation
    :param batch_size: batch size for training
    :param seed: random seed for shuffling data - `None` won't shuffle the data
    """

    n_val = round(df.shape[0] * val_split)
    if seed:
        # shuffle all the rows
        x = df.sample(frac=1, random_state=seed).to_dict("series")
    else:
        x = df.to_dict("series")
    y = dict()
    for t in targets:
        y[t] = x.pop(t)
    ds = tf.data.Dataset.from_tensor_slices((x, y))

    ds_val = ds.take(n_val).batch(batch_size)
    ds_train = ds.skip(n_val).batch(batch_size)
    return ds_train, ds_val

**Create train and validation datasets**

In [23]:
ds_train, ds_val = make_tf_dataset(data_train[['Lbl_Authors', 'Lbl_Articles', 'View']], ["View"])

**Fit the model**

In [24]:
%%time
train_hist = ncf_model.fit(ds_train, validation_data = ds_val, epochs = 40, verbose=1)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Wall time: 56.3 s


#### Prediction and evaluation

**Make tf testing dataset** 

In [25]:
ds_test, _ = make_tf_dataset(data_test[['Lbl_Authors', 'Lbl_Articles', 'View']], ["View"], val_split=0, seed=None)

**Make the prediction**

In [26]:
%%time
ncf_predictions = ncf_model.predict(ds_test)
data_test["ncf_predictions"] = ncf_predictions

data_test.head()

Wall time: 451 ms


Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions
35368,5485810adabfae9b40133700,5550411745ce0a409eb38760,485,31,0.0,0.999039
46207,562d148645cedb3398d49e87,573696026e3b12023e515eec,615,48,1.0,0.977494
35273,53f4c02adabfaedce5658d38,5550411745ce0a409eb38760,297,31,0.0,0.017526
21235,53f438c7dabfaeee229c1616,53e9aa48b7602d97033b6b90,140,12,0.0,1.2e-05
61424,5409688bdabfae450f481730,58d82fced649053542fd7289,386,65,1.0,0.966127


**Delete duplicates**

In [27]:
len(data_test[data_test.duplicated(subset=['Lbl_Authors','Lbl_Articles'])])

6931

In [28]:
df_test = data_test.drop_duplicates(subset=['Lbl_Authors','Lbl_Articles'])

In [29]:
len(df_test[df_test.duplicated(subset=['Lbl_Authors','Lbl_Articles'])])

0

**Compute Precision, Recall and RMSE metrics**

In [105]:
from tensorflow.keras.metrics import Precision, Recall

Precision = Precision(top_k=100)
#Recall = Recall(top_k=5)

Precision.update_state(df_test["View"], df_test["ncf_predictions"])
#Recall.update_state(df_test["View"], df_test["ncf_predictions"])

print("We have a precision of",  Precision.result().numpy()) #, ",a recall of", Recall.result().numpy())

We have a precision of 0.93133336


**Εquate predictions to 1 if it is grater than 0.5 or else to 0** 

In [31]:
df_test['ncf_predictions_dummy'] = df_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)
print(df_test['ncf_predictions_dummy'].value_counts())
df_test.head()

0    9393
1    3524
Name: ncf_predictions_dummy, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['ncf_predictions_dummy'] = df_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)


Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions,ncf_predictions_dummy
35368,5485810adabfae9b40133700,5550411745ce0a409eb38760,485,31,0.0,0.999039,1
46207,562d148645cedb3398d49e87,573696026e3b12023e515eec,615,48,1.0,0.977494,1
35273,53f4c02adabfaedce5658d38,5550411745ce0a409eb38760,297,31,0.0,0.017526,0
21235,53f438c7dabfaeee229c1616,53e9aa48b7602d97033b6b90,140,12,0.0,1.2e-05,0
61424,5409688bdabfae450f481730,58d82fced649053542fd7289,386,65,1.0,0.966127,1


**Compute the accuracy score**

In [32]:
from sklearn.metrics import accuracy_score

print("Accuracy is equal to", accuracy_score(df_test["View"], df_test["ncf_predictions_dummy"]))

Accuracy is equal to 0.954401176743826


**Make an example recommendation**

In [33]:
#Pick a random author
smpl = df_test.sample()
#Find his predictions
author_pred = df_test.loc[df_test['Lbl_Authors'] == smpl.iloc[0]['Lbl_Authors']]
#Sort them by ncf_prediction from largest to smallest.
recommendation = author_pred.sort_values(by=['ncf_predictions'], ascending=False)
#Make the 5 most likely to like recommendations
recommendation.head()

Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions,ncf_predictions_dummy
6837,5486563cdabfae9b40133d17,53e9bcc1b7602d97049412d4,494,25,1.0,0.999934,1
65504,5486563cdabfae9b40133d17,573698016e3b12023e6da477,494,59,1.0,0.998943,1
56244,5486563cdabfae9b40133d17,58d82fced649053542fd7355,494,66,1.0,0.9926,1
33955,5486563cdabfae9b40133d17,53e9a479b7602d9702d98afa,494,6,0.0,0.881807,1
14300,5486563cdabfae9b40133d17,58d82fced649053542fd7289,494,65,0.0,0.511678,1


### MF-LSTM model

#### Model Construction

**Define the model**

In [34]:
def create_ncf2(
    number_of_users: int,
    number_of_items: int,
    latent_dim_mf: int = 4,
    latent_dim_lstm: int = 32,
    reg_mf: int = 0,
    reg_lstm: int = 0.01,
    dense_layers: List[int] = [8, 4],
    reg_layers: List[int] = [0.01, 0.01],
    activation_dense: str = "relu",
) -> keras.Model:

    # input layer
    user = Input(shape=(1,), name="Lbl_Authors")
    item = Input(shape=(1,), name="Lbl_Articles")

    # embedding layers
    mf_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mf, name = "mf_user_embedding", 
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)
    mf_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mf, name = "mf_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)

    lstm_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_lstm, name = "lstm_user_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_lstm), input_length = 1)
    lstm_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_lstm, name = "lstm_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_lstm), input_length = 1)
    
    # MF vector
    mf_user_latent = Flatten()(mf_user_embedding(user))
    mf_item_latent = Flatten()(mf_item_embedding(item))
    mf_cat_latent = Multiply()([mf_user_latent, mf_item_latent])

    # LSTM vector
    lstm_user_latent = Flatten()(lstm_user_embedding(user))
    lstm_item_latent = Flatten()(lstm_item_embedding(item))
    nn = Concatenate()([lstm_user_latent, lstm_item_latent])
    lstm_cat_latent = Reshape((1, latent_dim_lstm * 2), input_shape=(latent_dim_lstm * 2,))(nn)
    
    lstm1 = LSTM(name="LSTM1", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm1 = Dropout(0.3)(lstm1)
    lstm2 = LSTM(name="LSTM2", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm2 = Dropout(0.3)(lstm2)
    lstm3 = LSTM(name="LSTM3", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm3 = Dropout(0.3)(lstm3)
    lstm4 = LSTM(name="LSTM4", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm4 = Dropout(0.3)(lstm4)
    
    output = Concatenate()([lstm1, lstm2, lstm3, lstm4])
    
    output = Dense(units=int(latent_dim_lstm / 2), activation='relu')(output)
    output = Dropout(.3)(output)
    lstm_vector = Reshape((int(latent_dim_lstm / 2),), input_shape=(1, int(latent_dim_lstm / 2)))(output) 
    
    predict_layer = Concatenate()([mf_cat_latent, lstm_vector])

    result = Dense(1, activation = "sigmoid", kernel_initializer = "lecun_uniform", name = "View")

    output = result(predict_layer)

    model = Model(inputs = [user, item], outputs = [output])

    return model

**Create and compile the model**

In [35]:
from tensorflow.keras.optimizers import Adam
import tensorflow as tf

n_users = merged_df['Lbl_Authors'].nunique()
n_items = merged_df['Lbl_Articles'].nunique()

ncf2_model = create_ncf2(n_users, n_items)

ncf2_model.compile(optimizer = Adam(), loss = "binary_crossentropy",
    metrics=[
        tf.keras.metrics.TruePositives(name="tp"),
        tf.keras.metrics.FalsePositives(name="fp"),
        tf.keras.metrics.TrueNegatives(name="tn"),
        tf.keras.metrics.FalseNegatives(name="fn"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)

ncf2_model._name = "neural_collaborative_filtering"
ncf2_model.summary()

Model: "neural_collaborative_filtering"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Lbl_Authors (InputLayer)       [(None, 1)]          0           []                               
                                                                                                  
 Lbl_Articles (InputLayer)      [(None, 1)]          0           []                               
                                                                                                  
 lstm_user_embedding (Embedding  (None, 1, 32)       21632       ['Lbl_Authors[0][0]']            
 )                                                                                                
                                                                                                  
 lstm_item_embedding (Embedding  (None, 1, 32)       2496        ['Lb

**Fit the model**

In [36]:
%%time
train_hist = ncf2_model.fit(ds_train, validation_data = ds_val, epochs = 40, verbose=1)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Wall time: 1min 29s


#### Prediction and evaluation

**Make the prediction**

In [37]:
%%time
ncf2_predictions = ncf2_model.predict(ds_test)
data_test["ncf_predictions"] = ncf2_predictions
data_test.head()

Wall time: 764 ms


Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions
35368,5485810adabfae9b40133700,5550411745ce0a409eb38760,485,31,0.0,0.93397
46207,562d148645cedb3398d49e87,573696026e3b12023e515eec,615,48,1.0,0.997932
35273,53f4c02adabfaedce5658d38,5550411745ce0a409eb38760,297,31,0.0,9e-06
21235,53f438c7dabfaeee229c1616,53e9aa48b7602d97033b6b90,140,12,0.0,9.9e-05
61424,5409688bdabfae450f481730,58d82fced649053542fd7289,386,65,1.0,0.993731


**Delete duplicates**

In [38]:
len(data_test[data_test.duplicated(subset=['Lbl_Authors','Lbl_Articles'])])

6931

In [39]:
data_test = data_test.drop_duplicates(subset=['Lbl_Authors','Lbl_Articles'])

In [40]:
len(data_test[data_test.duplicated(subset=['Lbl_Authors','Lbl_Articles'])])

0

**Compute Precision, Recall and RMSE metrics**

In [67]:
from tensorflow.keras.metrics import Precision, Recall

Precision = Precision(top_k=20
#Recall = Recall(top_k=5)

Precision.update_state(data_test["View"], data_test["ncf_predictions"])
#Recall.update_state(data_test["View"], data_test["ncf_predictions"])

print("We have a precision of",  Precision.result().numpy()) #, ",a recall of", Recall.result().numpy())

We have a precision of 0.95


**Εquate predictions to 1 if it is grater than 0.5 or else to 0** 

In [42]:
data_test['ncf_predictions_dummy'] = data_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)
print(data_test['ncf_predictions_dummy'].value_counts())
data_test.head()

0    9308
1    3609
Name: ncf_predictions_dummy, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_test['ncf_predictions_dummy'] = data_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)


Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions,ncf_predictions_dummy
35368,5485810adabfae9b40133700,5550411745ce0a409eb38760,485,31,0.0,0.93397,1
46207,562d148645cedb3398d49e87,573696026e3b12023e515eec,615,48,1.0,0.997932,1
35273,53f4c02adabfaedce5658d38,5550411745ce0a409eb38760,297,31,0.0,9e-06,0
21235,53f438c7dabfaeee229c1616,53e9aa48b7602d97033b6b90,140,12,0.0,9.9e-05,0
61424,5409688bdabfae450f481730,58d82fced649053542fd7289,386,65,1.0,0.993731,1


**Compute the accuracy score**

In [43]:
print("Accuracy is equal to", accuracy_score(data_test["View"], data_test["ncf_predictions_dummy"]))

Accuracy is equal to 0.9453433459781683


**Make an example recommendation**

In [44]:
#Pick a random author
smpl = data_test.sample()
#Find his predictions
author_pred = data_test.loc[data_test['Lbl_Authors'] == smpl.iloc[0]['Lbl_Authors']]
#Sort them by ncf_prediction from largest to smallest.
recommendation = author_pred.sort_values(by=['ncf_predictions'], ascending=False)
#Make the 5 most likely to like recommendations
recommendation.head()

Unnamed: 0,Authors,Articles,Lbl_Authors,Lbl_Articles,View,ncf_predictions,ncf_predictions_dummy
69091,53f48a48dabfaea6f277b420,53e99a85b7602d97022f8644,271,3,1.0,0.913931,1
25066,53f48a48dabfaea6f277b420,53e9b068b7602d9703acf032,271,17,0.0,0.402783,0
31161,53f48a48dabfaea6f277b420,53e9a62eb7602d9702f5a6a2,271,8,0.0,0.032814,0
52186,53f48a48dabfaea6f277b420,53e9986eb7602d97020a7ef9,271,0,0.0,0.001351,0
11508,53f48a48dabfaea6f277b420,53e99a20b7602d9702279af2,271,2,0.0,0.000126,0
