# Testing Models

Testing recommender systems is less intuitive than testing other predictive models. Many common metrics exist, and [this Medium article](https://medium.com/swlh/rank-aware-recsys-evaluation-metrics-5191bba16832) gives a useful overview of them. The idea behind all of these methods is similar: generate a list of top-ranked items to show the user, and score the list by how many of the recommended items are relevant to that user. The word "relevant" is not well-defined, but it normally means that the user has interacted with the item in the past.

In our case, the items are restaurants and the interactions are orders. One issue with our data is that between 30% and 40% of our users have only ever ordered from a single restaurant, which makes the above methods tricky, or even impossible, to apply.

Recall that our model takes a user's past five (5) orders together with one restaurant R as input and tries to predict how likely the user is to order from R next. We will evaluate our model as follows. Given a user's sequence of five restaurants and the target restaurant R that comes next in the sequence, we use the model to generate a ranked list of the $k=5$ restaurants that it considers most likely to be ordered from next. If R is among the $k$ generated restaurants, we add $1$ to the score; otherwise we add $0$.
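
Written as a formula, the score described above is what is commonly called the hit rate at $k$. The symbols $N$, $s_i$, $R_i$ and $\mathrm{Top}_k$ are notation introduced here for illustration; they do not appear in the notebook's code.

$$
\mathrm{HR@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ R_i \in \mathrm{Top}_k(s_i) \right]
$$

Here $N$ is the number of test sequences, $s_i$ is the $i$-th sequence of five orders, $R_i$ is its target restaurant, and $\mathrm{Top}_k(s_i)$ is the set of $k$ restaurants the model ranks highest for $s_i$. The cells below report the raw count of hits alongside this ratio.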

Since there are only $100$ restaurants to recommend from, it is feasible to simply score each one, sort the scores, and take the top 5 as recommendations. This is a privileged position to be in, since many recommender systems are deployed in settings with millions of items to recommend from. In that case, one could first cluster the customer and vendor embeddings and then rank only within the relevant clusters.
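
The "score everything, sort, take the top 5" procedure can be sketched without any model at all. The snippet below is only an illustration: `top_k_items`, `hit`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the trained model, not the notebook's scoring function.

```python
import heapq
from typing import Callable, List, Sequence


def top_k_items(score_fn: Callable[[Sequence[int], int], float],
                seq: Sequence[int], n_items: int = 100, k: int = 5) -> List[int]:
    """Score every candidate item, then keep the k highest-scoring ids."""
    candidates = range(1, n_items + 1)      # item ids 1..n_items (0 is the null token)
    return heapq.nlargest(k, candidates, key=lambda item: score_fn(seq, item))


def hit(score_fn, seq, target, n_items=100, k=5) -> int:
    """Return 1 if the target is among the top-k recommendations, else 0."""
    return int(target in top_k_items(score_fn, seq, n_items, k))


def toy_score(seq: Sequence[int], item: int) -> float:
    """Hypothetical stand-in for the model: favors items seen in the sequence."""
    return seq.count(item) - 0.001 * item   # small penalty breaks ties toward low ids


print(top_k_items(toy_score, [3, 3, 7, 0, 0]))    # [3, 7, 1, 2, 4]
print(hit(toy_score, [3, 3, 7, 0, 0], target=7))  # 1
```

With only 100 candidates, an exhaustive pass like this costs 100 score evaluations per test sequence, which is why no candidate-retrieval stage is needed here.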

Each user in the test set may generate many sequences of length $5+1$, and we use all of them during evaluation. In the deployed version of the model, we could simply give recommendations based on the last $5$ orders made by the user.
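
As a rough sketch of how one user's order history could be turned into several (sequence, target) pairs: the functions below are hypothetical and are not the preprocessing code used to build the pickled datasets. They assume left-padding with the null token `0` and at least one earlier order per example; the actual preprocessing may differ on both points.

```python
from typing import List, Tuple


def make_examples(orders: List[int], window: int = 5,
                  pad: int = 0) -> List[Tuple[List[int], int]]:
    """Slide over a user's orders and emit (padded history, next order) pairs."""
    examples = []
    for i in range(1, len(orders)):             # assumption: need one earlier order
        history = orders[max(0, i - window):i]  # up to `window` most recent orders
        padded = [pad] * (window - len(history)) + history
        examples.append((padded, orders[i]))
    return examples


def latest_sequence(orders: List[int], window: int = 5, pad: int = 0) -> List[int]:
    """Input for a deployed model: the user's last `window` orders, left-padded."""
    history = orders[-window:]
    return [pad] * (window - len(history)) + history


user_orders = [28, 28, 14, 6]                   # made-up vendor ids
for seq, target in make_examples(user_orders):
    print(seq, "->", target)
print(latest_sequence(user_orders))
```

A user with four orders yields three evaluation examples under these assumptions, while deployment needs only the single most recent window.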

Baselines:
1) Recommend the $5$ most popular restaurants.
2) Given a sequence of $5$ orders, $m$ of which are unique and not null, recommend those restaurants together with the $5-m$ most popular restaurants. 

We define the 'popularity' of a vendor as its number of orders in the training data. Note that the unordered set of the five most popular vendors would be the same if popularity were instead defined as the number of unique customers, although the rankings within that set differ slightly (see the last cell of "Munging.ipynb").
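
To make the two definitions of popularity concrete, here is a small sketch on made-up data. The `(customer, vendor)` order log and all variable names are hypothetical; the real counts come from the training data and "Munging.ipynb".

```python
from collections import Counter

# Hypothetical order log: one (customer_id, vendor_id) pair per order.
orders = [("a", 28), ("a", 28), ("a", 28), ("b", 25), ("c", 25), ("d", 14)]

# Definition 1: popularity = number of orders placed with the vendor.
orders_per_vendor = Counter(vendor for _, vendor in orders)

# Definition 2: popularity = number of distinct customers of the vendor.
customers_per_vendor = Counter(vendor for _, vendor in set(orders))


def ranked(counts: Counter) -> list:
    """Vendors from most to least popular; ties broken by smaller vendor id."""
    return sorted(counts, key=lambda vendor: (-counts[vendor], vendor))


by_orders = ranked(orders_per_vendor)
by_customers = ranked(customers_per_vendor)
print(by_orders)       # [28, 25, 14]
print(by_customers)    # [25, 14, 28]
```

In this toy log the two rankings contain the same vendors in a different order, which is the same kind of situation the paragraph above describes for the real top five.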

In [1]:
import pandas as pd
import pickle
import torch
from PreprocessingHelpers import CustomDataset
from torch.utils.data import Dataset, DataLoader
from Models.Models import Model1, Model2, Model3

## Load Test Data

In [2]:
with open("ProcessedData/test_sequences_padded_dataset.pkl", "rb") as file:
    test_sequences_padded_dataset = pickle.load(file)

test_loader = DataLoader(test_sequences_padded_dataset, batch_size=1)
num_trials = test_sequences_padded_dataset.vendor.shape[0]

In [3]:
with open("ProcessedData/vendors_tensor.pkl", "rb") as file:
    vendors_tensor = pickle.load(file)

In [4]:
with open("ProcessedData/popular_vendors.pkl", "rb") as file:
    popular_vendors = pickle.load(file)

## Define Scoring for Model & Baselines

In [5]:
popular_vendors.head(10)

id
28    3237
25    2717
14    2681
19    2453
18    2184
15    2159
75    1377
21    1274
36    1145
6     1078
Name: num_orders, dtype: int64

In [6]:
k_most_popular = popular_vendors[:5].index.tolist()
k_most_popular

[28, 25, 14, 19, 18]

In [7]:
def baseline1_scoring(target:int, k_most_popular:list):
    # 1 if the target is one of the k most popular vendors, else 0
    return int(target in k_most_popular)

In [8]:
def baseline2_scoring(seq:list, target:int, k_most_popular:list):
    seq = list(set(seq) - {0})      # remove duplicates and the null-token (0)
    m = len(seq)
    # Fill the remaining slots with the most popular vendors.
    # Slicing from the front also handles m == 0, where all of them are used.
    seq = seq + k_most_popular[:len(k_most_popular) - m]
    return int(target in seq)

In [9]:
def model_scoring(seq:torch.Tensor, target:int, model, k:int=5):
    seq_dupe = seq.view(1, -1).repeat(100, 1)       # one copy of the sequence per vendor: 100 x 5 matrix
    v_ids = torch.arange(start=1, end=101, dtype=torch.long)    # vendor ids 1..100
    rankings = model(c_seq=seq_dupe, v_id=v_ids).view(-1)       # one score per vendor
    top_k = torch.topk(rankings, k).indices         # positions of the k highest scores
    top_k = top_k + 1                               # Shift indices to match vendors
    return int(target in top_k)

In [27]:
# Score baselines
print("SCORING\n=======================================================")
print("Random:\t\t2384 / 47677 = 5.00% ")
baseline1_score = 0
baseline2_score = 0
for i, (c_seq, v_id) in enumerate(test_loader):
    target = v_id.item()
    c_seq_list = c_seq.view(-1).tolist()
    baseline1_score += baseline1_scoring(target=target, k_most_popular=k_most_popular)
    baseline2_score += baseline2_scoring(seq=c_seq_list, target=target, k_most_popular=k_most_popular)
print(f'Baseline1:\t{baseline1_score} / {num_trials} = {baseline1_score*100/num_trials:.2f}%')
print(f'Baseline2:\t{baseline2_score} / {num_trials} = {baseline2_score*100/num_trials:.2f}%')
print("")

SCORING
Random:		2384 / 47677 = 5.00% 
Baseline1:	9233 / 47677 = 19.37%
Baseline2:	22134 / 47677 = 46.42%
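
The hard-coded "Random" line is consistent with a recommender that picks $5$ distinct restaurants uniformly at random out of the $100$ available. This is an assumption about how the figure was obtained, since the notebook does not show the calculation.

$$
\frac{k}{\text{number of restaurants}} = \frac{5}{100} = 5\%, \qquad 0.05 \times 47677 = 2383.85 \approx 2384
$$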



In [11]:
# Score model1_64 at different epochs
model1 = Model1(vendors=vendors_tensor, d_fc=64)
for epoch in range(0, 201, 40):
    PATH = "Models/model1_64_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model1.load_state_dict(checkpoint['model_state_dict'])
    model1.eval()
    model1_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model1_score += model_scoring(seq=c_seq, target=v_id.item(), model=model1)
    print(f'Model1_64_{epoch}:\t{model1_score} / {num_trials} = {model1_score*100/num_trials:.2f}%')
print("Done!")

Model1_0:	15579 / 47677 = 32.68%
Model1_40:	21454 / 47677 = 45.00%
Model1_80:	22135 / 47677 = 46.43%
Model1_120:	22525 / 47677 = 47.25%
Model1_160:	22844 / 47677 = 47.91%
Model1_200:	22797 / 47677 = 47.82%



In [12]:
# Score model1_64s at different epochs
model1 = Model1(vendors=vendors_tensor, d_fc=64)
for epoch in range(0, 201, 20):
    PATH = "Models/model1_64s_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model1.load_state_dict(checkpoint['model_state_dict'])
    model1.eval()
    model1_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model1_score += model_scoring(seq=c_seq, target=v_id.item(), model=model1)
    print(f'Model1_64s_{epoch}:\t{model1_score} / {num_trials} = {model1_score*100/num_trials:.2f}%')
print("Done!")

Model1_64s_0:	15644 / 47677 = 32.81%
Model1_64s_20:	21455 / 47677 = 45.00%
Model1_64s_40:	21628 / 47677 = 45.36%
Model1_64s_60:	22108 / 47677 = 46.37%
Model1_64s_80:	22378 / 47677 = 46.94%
Model1_64s_100:	22836 / 47677 = 47.90%
Model1_64s_120:	22922 / 47677 = 48.08%
Model1_64s_140:	22967 / 47677 = 48.17%
Model1_64s_160:	23027 / 47677 = 48.30%
Model1_64s_180:	23044 / 47677 = 48.33%
Model1_64s_200:	23041 / 47677 = 48.33%
Done!


In [24]:
# Score model1_128 at different epochs
model1 = Model1(vendors=vendors_tensor, d_fc=128)
for epoch in range(0, 201, 40):
    PATH = "Models/model1_128_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model1.load_state_dict(checkpoint['model_state_dict'])
    model1.eval()
    model1_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model1_score += model_scoring(seq=c_seq, target=v_id.item(), model=model1)
    print(f'Model1_128_{epoch}:\t{model1_score} / {num_trials} = {model1_score*100/num_trials:.2f}%')

Model1_0:	15844 / 47677 = 33.23%
Model1_40:	21924 / 47677 = 45.98%
Model1_80:	22116 / 47677 = 46.39%
Model1_120:	22375 / 47677 = 46.93%
Model1_160:	22801 / 47677 = 47.82%
Model1_200:	22547 / 47677 = 47.29%


In [None]:
# Score model2 at different epochs
model2 = Model2(vendors=vendors_tensor, d_fc=64)
for epoch in range(0, 101, 10):
    PATH = "Models/model2_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model2.load_state_dict(checkpoint['model_state_dict'])
    model2.eval()
    model2_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model2_score += model_scoring(seq=c_seq, target=v_id.item(), model=model2)
    print(f'Model2_{epoch}:\t{model2_score} / {num_trials} = {model2_score*100/num_trials:.2f}%')
print("Done!")

In [16]:
# Score model2s at different epochs
model2 = Model2(vendors=vendors_tensor, d_fc=64)
for epoch in range(0, 101, 10):
    PATH = "Models/model2s_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model2.load_state_dict(checkpoint['model_state_dict'])
    model2.eval()
    model2_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model2_score += model_scoring(seq=c_seq, target=v_id.item(), model=model2)
    print(f'Model2_{epoch}:\t{model2_score} / {num_trials} = {model2_score*100/num_trials:.2f}%')
print("Done!")

Model2_0:	10297 / 47677 = 21.60%
Model2_10:	14955 / 47677 = 31.37%
Model2_20:	14355 / 47677 = 30.11%
Model2_30:	13962 / 47677 = 29.28%
Model2_40:	13971 / 47677 = 29.30%
Model2_50:	14559 / 47677 = 30.54%
Model2_60:	14728 / 47677 = 30.89%
Model2_70:	14832 / 47677 = 31.11%
Model2_80:	14923 / 47677 = 31.30%
Model2_90:	15288 / 47677 = 32.07%
Model2_100:	15402 / 47677 = 32.30%
Done!


In [10]:
# Score model3s at different epochs
model3 = Model3(vendors=vendors_tensor, d_fc=64)
for epoch in range(0, 201, 20):
    PATH = "Models/model3s_epoch"+str(epoch)+".pt"
    checkpoint = torch.load(PATH)
    model3.load_state_dict(checkpoint['model_state_dict'])
    model3.eval()
    model3_score = 0
    with torch.no_grad():
        for c_seq, v_id in test_loader:
            model3_score += model_scoring(seq=c_seq, target=v_id.item(), model=model3)
    print(f'Model3_{epoch}:\t{model3_score} / {num_trials} = {model3_score*100/num_trials:.2f}%')
print("Done!")

Model3_0:	17206 / 47677 = 36.09%
Model3_20:	21731 / 47677 = 45.58%
Model3_40:	21939 / 47677 = 46.02%
Model3_60:	22209 / 47677 = 46.58%
Model3_80:	22862 / 47677 = 47.95%
Model3_100:	23002 / 47677 = 48.25%
Model3_120:	22947 / 47677 = 48.13%
Model3_140:	23112 / 47677 = 48.48%
Model3_160:	23155 / 47677 = 48.57%
Model3_180:	23144 / 47677 = 48.54%
Model3_200:	23141 / 47677 = 48.54%
Done!


In [17]:
PATH = "Models/model1_epoch200.pt"
checkpoint = torch.load(PATH)
model1.load_state_dict(checkpoint['model_state_dict'])
model1.eval()

seq = torch.tensor([[1, 1, 1, 1, 1]])

y = torch.ones([100, 1], dtype=torch.long)      # 100 vendors
seq_dupe = y @ seq                              # 100 x 5 matrix
v_ids = torch.arange(start=1, end=101, dtype=torch.long)
rankings = model1.forward(c_seq=seq_dupe, v_id=v_ids).view(-1)    
top_k = torch.topk(rankings, 5).indices         # Essentially argmax for top k
top_k = top_k + 1

print(rankings.sort(descending=True)[1] + 1)
print(len(rankings.tolist()))
print(top_k)

tensor([  1,  36,   3,   6,  30,  23,  53,   2,  24,  41,  71,   4,  17,  73,
         39,  16,  67,  68,  62,  44,  28,  70,  12,  63,  43,  64,  82,  13,
         42,  27,  86,  46,  59,  15,  26,  92,  14,   9,  54,  19,  78,  25,
         89,  11,  60,  85,  76,  55,  40,  88,  74,  90,  75,   7,   5,  72,
         31,  48,  29,  61,  87,  91,  20,   8,  52,  79,  99,  22,  21,  51,
         50,  35,  95,  98,  33,  10,  32,  18,  65,  37,  56,  96,  49,  94,
         45,  81,  93,  77,  34,  66,  80,  69,  57,  83,  97,  38,  84,  58,
         47, 100])
100
tensor([ 1, 36,  3,  6, 30])
