Collaborative Filtering Utilizing Neural Networks/book-crossing dataset

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-pre-processing" data-toc-modified-id="Data-pre-processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data pre-processing</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#MF-MLP-model" data-toc-modified-id="MF-MLP-model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>MF-MLP model</a></span><ul class="toc-item"><li><span><a href="#Model-Construction" data-toc-modified-id="Model-Construction-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Model Construction</a></span></li><li><span><a href="#Prediction-and-evaluation" data-toc-modified-id="Prediction-and-evaluation-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Prediction and evaluation</a></span></li></ul></li><li><span><a href="#MF-LSTM-model" data-toc-modified-id="MF-LSTM-model-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>MF-LSTM model</a></span><ul class="toc-item"><li><span><a href="#Model-Construction" data-toc-modified-id="Model-Construction-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Model Construction</a></span></li><li><span><a href="#Prediction-and-evaluation" data-toc-modified-id="Prediction-and-evaluation-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Prediction and evaluation</a></span></li></ul></li></ul></li></ul></div>

## Data pre-processing

**Load data and split columns by delimiter**

In [95]:
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing

data = pd.read_csv('BX/BX-Book-Ratings.csv', delimiter = ';"', skiprows = 1, names=["User-ID", "ISBN", "Book-Rating"], encoding="latin1")
data.head()

  data = pd.read_csv('BX/BX-Book-Ratings.csv', delimiter = ';"', skiprows = 1, names=["User-ID", "ISBN", "Book-Rating"], encoding="latin1")


Unnamed: 0,User-ID,ISBN,Book-Rating
0,"""276725","""034545104X""""","""0"""""""
1,"""276726","""0155061224""""","""5"""""""
2,"""276727","""0446520802""""","""0"""""""
3,"""276729","""052165615X""""","""3"""""""
4,"""276729","""0521795028""""","""6"""""""


In [96]:
data['User-ID'] = data['User-ID'].str[1:].astype(str)
data['ISBN'] = data['ISBN'].str[1:-2].astype(str)
data['Book-Rating'] = data['Book-Rating'].str[1:-3].astype(int)
data.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


**Check for duplicates**

In [97]:
data[data.duplicated()]

Unnamed: 0,User-ID,ISBN,Book-Rating


**Check the number of registrations, the number of users and the number of books we have in our dataset**

In [98]:
print('len:', len(data))
print('Users:', data['User-ID'].nunique())
print('Books:', data['ISBN'].nunique())

len: 1048575
Users: 95513
Books: 323416


**Keep users who has read at least 40 books and books with at least 20 users read them in out dataset**

In [99]:
# we keep users who has rated at least 200 books. And Books with at least 40 ratings
users_keep = data['User-ID'].value_counts() > 200
y = users_keep[users_keep].index
data = data[data['User-ID'].isin(y)]

data.reset_index(inplace = True, drop = True)

books_keep = data['ISBN'].value_counts() > 40
y = books_keep[books_keep].index
data = data[data['ISBN'].isin(y)]

**Check the number of registrations, the number of users and the number of books we have in our dataset**

In [100]:
print('len:', len(data))
print('Users:', data['User-ID'].nunique())
print('Books:', data['ISBN'].nunique())

len: 43560
Users: 801
Books: 654


**Check the ratings of the books**

In [101]:
data['Book-Rating'].value_counts()

0     32840
8      2646
10     2344
9      2168
7      1627
5       953
6       677
4       154
3        77
1        40
2        34
Name: Book-Rating, dtype: int64

**Keep only registrations which represents that user liked the book (Book-Rating>5) or hasn't read it yet(Book-Rating=0)**

In [102]:
data = data[(data['Book-Rating'] == 0) | (data['Book-Rating'] > 5)]
data['Book-Rating'].value_counts()

0     32840
8      2646
10     2344
9      2168
7      1627
6       677
Name: Book-Rating, dtype: int64

**Reset the index column**

In [103]:
data.reset_index(inplace = True, drop = True)

**Book-Rating=1 represents that user liked the corresponding book & Book-Rating=0 that he hasn't read it yet**

In [104]:
data['Book-Rating'] = data['Book-Rating'].apply(lambda rating : +1 if rating != 0 else 0)
print(data['Book-Rating'].value_counts())
data.head()

0    32840
1     9462
Name: Book-Rating, dtype: int64


Unnamed: 0,User-ID,ISBN,Book-Rating
0,277427,002542730X,1
1,277427,0060930535,0
2,277427,0060934417,0
3,277427,0061009059,1
4,277427,0140067477,0


**Convert users and books names into encoded IDs**

In [105]:
lbl_user = preprocessing.LabelEncoder()
lbl_book = preprocessing.LabelEncoder()

data['Lbl_User-ID'] = lbl_user.fit_transform(data['User-ID'].values)
data['Lbl_ISBN'] = lbl_book.fit_transform(data['ISBN'].values)

data.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Lbl_User-ID,Lbl_ISBN
0,277427,002542730X,1,562,0
1,277427,0060930535,0,562,11
2,277427,0060934417,0,562,12
3,277427,0061009059,1,562,21
4,277427,0140067477,0,562,32


**Handling Imbalanced Data && split dataset in train and test datasets**

In [106]:
print(data['Book-Rating'].value_counts())

0    32840
1     9462
Name: Book-Rating, dtype: int64


In [107]:
from imblearn import over_sampling
from imblearn.over_sampling import RandomOverSampler
from sklearn import model_selection

oversample = RandomOverSampler(sampling_strategy='minority')

Y = data['Book-Rating'] 
X = data[['User-ID', 'ISBN', 'Lbl_User-ID', 'Lbl_ISBN']]

X , Y = oversample.fit_resample(X , Y)

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

data_train = X_train
data_train['Book-Rating'] = Y_train
data_test = X_test
data_test['Book-Rating'] = Y_test

## Models

### MF-MLP model

#### Model Construction

In [108]:
import tensorflow.keras as keras
from tensorflow.keras.layers import Concatenate, Dense, Embedding, Flatten, Input, Multiply, LSTM, Dropout, Reshape, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from typing import List

**Define the model**

In [109]:
def create_ncf(
    number_of_users: int,
    number_of_items: int,
    latent_dim_mf: int = 4,
    latent_dim_mlp: int = 32,
    reg_mf: int = 0,
    reg_mlp: int = 0.01
) -> keras.Model:

    # input layer
    user = Input(shape=(), dtype="int32", name="Lbl_User-ID")
    item = Input(shape=(), dtype="int32", name="Lbl_ISBN")

    # embedding layers
    mf_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mf, name = "mf_user_embedding", 
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)
    mf_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mf, name = "mf_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)

    mlp_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mlp, name = "mlp_user_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mlp), input_length = 1)
    mlp_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mlp, name = "mlp_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mlp), input_length = 1)

    # MF vector
    mf_user_latent = Flatten()(mf_user_embedding(user))
    mf_item_latent = Flatten()(mf_item_embedding(item))
    mf_cat_latent = Multiply()([mf_user_latent, mf_item_latent])

    # MLP vector
    mlp_user_latent = Flatten()(mlp_user_embedding(user))
    mlp_item_latent = Flatten()(mlp_item_embedding(item))
    mlp_cat_latent = Concatenate()([mlp_user_latent, mlp_item_latent])
    # Add a first dropout layer.
    dropout = Dropout(0.2)(mlp_cat_latent)
    #Αdd four hidden layers along with batch normalization and dropouts.
    layer_1 = Dense(64, activation='relu', name='layer1')(dropout)
    batch_norm1 = BatchNormalization(name='batch_norm1')(layer_1)
    dropout1 = Dropout(0.2, name='dropout1')(batch_norm1)

    layer_2 = Dense(32, activation='relu', name='layer2')(layer_1)
    batch_norm2 = BatchNormalization(name='batch_norm1')(layer_2)
    dropout2 = Dropout(0.2, name='dropout1')(batch_norm2)

    layer_3 = Dense(16, activation='relu', name='layer3')(layer_2)
    layer_4 = Dense(8, activation='relu', name='layer4')(layer_3)
    
    #Merge the two networks together
    merged_vector = Concatenate()([mf_cat_latent, layer_4])
    #Add the final single neuron output layer.
    output_layer = Dense(1, activation = "sigmoid", kernel_initializer="lecun_uniform", name="Book-Rating")(merged_vector)
    
    model = Model(inputs = [user, item], outputs = [output_layer])

    return model

**Create and compile the model**

In [110]:
from tensorflow.keras.optimizers import Adam

n_users = data['Lbl_User-ID'].nunique()
n_items = data['Lbl_ISBN'].nunique()

ncf_model = create_ncf(n_users, n_items)

ncf_model.compile(optimizer = Adam(), loss = "binary_crossentropy",
    metrics=[
        tf.keras.metrics.TruePositives(name="tp"),
        tf.keras.metrics.FalsePositives(name="fp"),
        tf.keras.metrics.TrueNegatives(name="tn"),
        tf.keras.metrics.FalseNegatives(name="fn"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)
ncf_model._name = "neural_collaborative_filtering"
ncf_model.summary()

Model: "neural_collaborative_filtering"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Lbl_User-ID (InputLayer)       [(None,)]            0           []                               
                                                                                                  
 Lbl_ISBN (InputLayer)          [(None,)]            0           []                               
                                                                                                  
 mlp_user_embedding (Embedding)  (None, 32)          25568       ['Lbl_User-ID[0][0]']            
                                                                                                  
 mlp_item_embedding (Embedding)  (None, 32)          20928       ['Lbl_ISBN[0][0]']               
                                                                     

**Make TensorFlow dataset from Pandas DataFrame to use it as input**

In [111]:
def make_tf_dataset(
    df: pd.DataFrame,
    targets: List[str],
    val_split: float = 0.1,
    batch_size: int = 512,
    seed=42,
):
    """Make TensorFlow dataset from Pandas DataFrame.
    :param df: input DataFrame - only contains features and target(s)
    :param targets: list of columns names corresponding to targets
    :param val_split: fraction of the data that should be used for validation
    :param batch_size: batch size for training
    :param seed: random seed for shuffling data - `None` won't shuffle the data"""

    n_val = round(df.shape[0] * val_split)
    if seed:
        # shuffle all the rows
        x = df.sample(frac=1, random_state=seed).to_dict("series")
    else:
        x = df.to_dict("series")
    y = dict()
    for t in targets:
        y[t] = x.pop(t)
    ds = tf.data.Dataset.from_tensor_slices((x, y))

    ds_val = ds.take(n_val).batch(batch_size)
    ds_train = ds.skip(n_val).batch(batch_size)
    return ds_train, ds_val

**Create train and validation datasets**

In [112]:
ds_train, ds_val = make_tf_dataset(data_train[['Lbl_User-ID', 'Lbl_ISBN', 'Book-Rating']], ["Book-Rating"])

**Fit the model**

In [136]:
%%time
train_hist = ncf_model.fit(ds_train, validation_data = ds_val, epochs =30, verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30


Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Wall time: 29.8 s


#### Prediction and evaluation

**Make tf testing dataset** 

In [114]:
ds_test, _ = make_tf_dataset(data_test[['Lbl_User-ID', 'Lbl_ISBN', 'Book-Rating']], ["Book-Rating"], val_split=0, seed=None)

**Make the prediction**

In [115]:
%%time
ncf_predictions = ncf_model.predict(ds_test)
data_test["ncf_predictions"] = ncf_predictions
data_test.head()

Wall time: 250 ms


Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions
30229,183196,074343627X,309,603,0,0.734587
22168,133689,0312278586,123,58,0,0.602596
30823,187145,0804111359,317,621,0,0.062038
12649,77809,068484267X,717,586,0,0.646334
7206,39616,0449221512,610,395,0,0.218145


**Delete duplicates**

In [116]:
len(data_test[data_test.duplicated(subset=['Lbl_User-ID','Lbl_ISBN'])])

1679

In [117]:
df_test = data_test.drop_duplicates(subset=['User-ID','ISBN'])

In [118]:
len(df_test[data_test.duplicated(subset=['Lbl_User-ID','Lbl_ISBN'])])

  len(df_test[data_test.duplicated(subset=['Lbl_User-ID','Lbl_ISBN'])])


0

**Compute Precision and Recall metrics**

In [119]:
from tensorflow.keras.metrics import Precision, Recall
from sklearn.metrics import average_precision_score

Precision = Precision(top_k=17)
#Recall = Recall(top_k=10)

Precision.update_state(df_test["Book-Rating"], df_test["ncf_predictions"])
#Recall.update_state(df_test["Book-Rating"], df_test["ncf_predictions"])

print("We have a precision of",  Precision.result().numpy())#, "and a recall of", Recall.result().numpy(), average_precision_score(df_test["Book-Rating"], df_test["ncf_predictions"]))

We have a precision of 0.8235294


**Εquate predictions to 1 if it is grater than 0.5 or else to 0** 

In [120]:
df_test['ncf_predictions_dummy'] = df_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)
print(df_test['ncf_predictions_dummy'].value_counts())
df_test.head()

1    6105
0    5352
Name: ncf_predictions_dummy, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['ncf_predictions_dummy'] = df_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)


Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions,ncf_predictions_dummy
30229,183196,074343627X,309,603,0,0.734587,1
22168,133689,0312278586,123,58,0,0.602596,1
30823,187145,0804111359,317,621,0,0.062038,0
12649,77809,068484267X,717,586,0,0.646334,1
7206,39616,0449221512,610,395,0,0.218145,0


**Compute the accuracy score**

In [121]:
from sklearn.metrics import accuracy_score

print("We have an accuracy score of", accuracy_score(df_test["Book-Rating"], df_test["ncf_predictions_dummy"]))

We have an accuracy score of 0.7854586715545082


**Make an example recommendation**

In [122]:
#Pick a random author
smpl = df_test.sample()
#Find his predictions
author_pred = df_test.loc[data_test['User-ID'] == smpl.iloc[0]['User-ID']]
#Sort them by ncf_prediction from largest to smallest.
recommendation = author_pred.sort_values(by=['ncf_predictions'], ascending=False)
#Make the 5 most likely to like recommendations
recommendation.head()

Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions,ncf_predictions_dummy
62454,168064,446602213,248,349,1,0.942224,1
50399,168064,786817070,248,610,1,0.933502,1
42395,168064,805063897,248,626,1,0.91793,1
27721,168064,385335482,248,194,0,0.901821,1
27776,168064,786868716,248,612,1,0.892277,1


### MF-LSTM model

#### Model Construction

**Define the model**

In [123]:
def create_ncf2(
    number_of_users: int,
    number_of_items: int,
    latent_dim_mf: int = 4,
    latent_dim_lstm: int = 32,
    reg_mf: int = 0,
    reg_lstm: int = 0.01,
    dense_layers: List[int] = [8, 4],
    reg_layers: List[int] = [0.01, 0.01],
    activation_dense: str = "relu",
) -> keras.Model:

    # input layer
    user = Input(shape=(1,), name="Lbl_User-ID")
    item = Input(shape=(1,), name="Lbl_ISBN")

    # embedding layers
    mf_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_mf, name = "mf_user_embedding", 
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)
    mf_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_mf, name = "mf_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_mf), input_length = 1)

    lstm_user_embedding = Embedding(input_dim = number_of_users, output_dim = latent_dim_lstm, name = "lstm_user_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_lstm), input_length = 1)
    lstm_item_embedding = Embedding(input_dim = number_of_items, output_dim = latent_dim_lstm, name = "lstm_item_embedding",
        embeddings_initializer = "RandomNormal", embeddings_regularizer = l2(reg_lstm), input_length = 1)
    
    # MF vector
    mf_user_latent = Flatten()(mf_user_embedding(user))
    mf_item_latent = Flatten()(mf_item_embedding(item))
    mf_cat_latent = Multiply()([mf_user_latent, mf_item_latent])

    # LSTM vector
    lstm_user_latent = Flatten()(lstm_user_embedding(user))
    lstm_item_latent = Flatten()(lstm_item_embedding(item))
    nn = Concatenate()([lstm_user_latent, lstm_item_latent])
    lstm_cat_latent = Reshape((1, latent_dim_lstm * 2), input_shape=(latent_dim_lstm * 2,))(nn)
    
    lstm1 = LSTM(name="LSTM1", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm1 = Dropout(0.3)(lstm1)
    lstm2 = LSTM(name="LSTM2", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm2 = Dropout(0.3)(lstm2)
    lstm3 = LSTM(name="LSTM3", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm3 = Dropout(0.3)(lstm3)
    lstm4 = LSTM(name="LSTM4", units=latent_dim_lstm, activation='relu')(lstm_cat_latent)
    lstm4 = Dropout(0.3)(lstm4)
    
    output = Concatenate()([lstm1, lstm2, lstm3, lstm4])
    
    output = Dense(units=int(latent_dim_lstm / 2), activation='relu')(output)
    output = Dropout(.3)(output)
    lstm_vector = Reshape((int(latent_dim_lstm / 2),), input_shape=(1, int(latent_dim_lstm / 2)))(output) 
    
    predict_layer = Concatenate()([mf_cat_latent, lstm_vector])

    result = Dense(1, activation = "sigmoid", kernel_initializer = "lecun_uniform", name = "Book-Rating")

    output = result(predict_layer)

    model = Model(inputs = [user, item], outputs = [output])

    return model

**Create and compile the model**

In [124]:
from tensorflow.keras.optimizers import Adam

n_users = data['Lbl_User-ID'].nunique()
n_items = data['Lbl_ISBN'].nunique()

ncf2_model = create_ncf2(n_users, n_items)

ncf2_model.compile(optimizer = Adam(), loss = "binary_crossentropy",
    metrics=[
        tf.keras.metrics.TruePositives(name="tp"),
        tf.keras.metrics.FalsePositives(name="fp"),
        tf.keras.metrics.TrueNegatives(name="tn"),
        tf.keras.metrics.FalseNegatives(name="fn"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)
ncf2_model._name = "neural_collaborative_filtering"
ncf2_model.summary()

Model: "neural_collaborative_filtering"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Lbl_User-ID (InputLayer)       [(None, 1)]          0           []                               
                                                                                                  
 Lbl_ISBN (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 lstm_user_embedding (Embedding  (None, 1, 32)       25568       ['Lbl_User-ID[0][0]']            
 )                                                                                                
                                                                                                  
 lstm_item_embedding (Embedding  (None, 1, 32)       20928       ['Lb

**Fit the model**

In [135]:
%%time
train_hist = ncf2_model.fit(ds_train, validation_data = ds_val, epochs = 30, verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30


Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Wall time: 54.5 s


#### Prediction and evaluation

**Make the prediction**

In [126]:
%%time
ncf2_predictions = ncf2_model.predict(ds_test)
data_test["ncf_predictions"] = ncf2_predictions
data_test.head()

Wall time: 1.16 s


Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions
30229,183196,074343627X,309,603,0,0.672769
22168,133689,0312278586,123,58,0,0.020289
30823,187145,0804111359,317,621,0,0.494309
12649,77809,068484267X,717,586,0,0.002745
7206,39616,0449221512,610,395,0,0.044098


**Delete duplicates**

In [127]:
len(data_test[data_test.duplicated(subset=['Lbl_User-ID','Lbl_ISBN'])])

1679

In [128]:
data_test.drop_duplicates(subset=['Lbl_User-ID','Lbl_ISBN'], inplace=True)

In [129]:
len(data_test[data_test.duplicated(subset=['Lbl_User-ID','Lbl_ISBN'])])

0

**Compute Precision and Recall metrics**

In [130]:
from tensorflow.keras.metrics import Precision, Recall
from sklearn.metrics import mean_squared_error

Precision = Precision(top_k=100)
#Recall = Recall(top_k=10)

Precision.update_state(data_test["Book-Rating"], data_test["ncf_predictions"])
#Recall.update_state(data_test["Book-Rating"], data_test["ncf_predictions"])

print("We have a precision of",  Precision.result().numpy())#, "and a recall of", Recall.result().numpy())

We have a precision of 0.78


**Εquate predictions to 1 if it is grater than 0.5 or else to 0** 

In [131]:
data_test['ncf_predictions_dummy'] = data_test['ncf_predictions'].apply(lambda rating : +1 if rating >= 0.5 else 0)
print(data_test['ncf_predictions_dummy'].value_counts())
data_test.head()

1    6027
0    5430
Name: ncf_predictions_dummy, dtype: int64


Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions,ncf_predictions_dummy
30229,183196,074343627X,309,603,0,0.672769,1
22168,133689,0312278586,123,58,0,0.020289,0
30823,187145,0804111359,317,621,0,0.494309,0
12649,77809,068484267X,717,586,0,0.002745,0
7206,39616,0449221512,610,395,0,0.044098,0


**Compute the accuracy score**

In [132]:
print("Accuracy is equal to", accuracy_score(data_test["Book-Rating"], data_test["ncf_predictions_dummy"]))

Accuracy is equal to 0.8002967618050101


**Make an example recommendation**

In [133]:
#Pick a random author
smpl = data_test.sample()
#Find his predictions
author_pred = data_test.loc[data_test['User-ID'] == smpl.iloc[0]['User-ID']]
#Sort them by ncf_prediction from largest to smallest.
recommendation = author_pred.sort_values(by=['ncf_predictions'], ascending=False)
#Make the 5 most likely to like recommendations
recommendation.head()

Unnamed: 0,User-ID,ISBN,Lbl_User-ID,Lbl_ISBN,Book-Rating,ncf_predictions,ncf_predictions_dummy
16882,104636,0449219461,16,391,1,0.999716,1
47840,104636,0440224764,16,312,1,0.999557,1
54272,104636,0312990456,16,78,1,0.999177,1
16792,104636,0312924585,16,63,0,0.998012,1
16807,104636,034540761X,16,128,0,0.997816,1
