# Weakly Supervised Recommendation Systems

Experiment steps:
 1. **User Preference Model**: Use the most *explicit* feedback (ratings) to build a *rating/ranking prediction model*. This is a simple *Explicit Matrix Factorization* model.
 2. **Generate Weak Dataset**: Use the model above to *predict* a rating for every user/item pair $(u,i)$ in the *implicit feedback dataset*, producing a new *weak explicit dataset* $(u, i, r^*)$.
 3. **Evaluate**: Evaluate every model on the untouched test split of the most explicit feedback.
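
The three steps can be sketched end to end on toy data. This is only an illustration under stated assumptions, not the notebook's `utils` code: `fit_item_mean_model` is a hypothetical stand-in for the explicit matrix factorization model (it predicts an item's mean rating), the helper names are invented for this sketch, and all data values are made up.

```python
from collections import defaultdict


def fit_item_mean_model(explicit_train):
    """Step 1 stand-in (hypothetical): predict an item's mean training rating."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in explicit_train:
        sums[item] += rating
        counts[item] += 1
    global_mean = sum(r for _, _, r in explicit_train) / len(explicit_train)

    def predict(user, item):
        # fall back to the global mean for items never rated in training
        if counts.get(item, 0) == 0:
            return global_mean
        return sums[item] / counts[item]

    return predict


def build_weak_dataset(predict, implicit_pairs):
    """Step 2: attach a predicted rating r* to every implicit (u, i) pair."""
    return [(u, i, predict(u, i)) for u, i in implicit_pairs]


def rmse(predict, explicit_test):
    """Step 3: score a predictor on the untouched explicit test split."""
    squared = [(predict(u, i) - r) ** 2 for u, i, r in explicit_test]
    return (sum(squared) / len(squared)) ** 0.5


# toy data: (user, item, rating) triples and implicit (user, item) pairs
explicit_train = [(0, 0, 5.0), (1, 0, 3.0), (1, 1, 2.0)]
explicit_test = [(2, 0, 4.0)]
implicit_pairs = [(0, 1), (2, 0), (2, 2)]

predict = fit_item_mean_model(explicit_train)
weak_dataset = build_weak_dataset(predict, implicit_pairs)
test_rmse = rmse(predict, explicit_test)
```

In the experiments that follow, the weak dataset is then used to train a second model, which is evaluated on the same explicit test split.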

## Explicit Model Experiments

This section contains all the experiments based on the explicit matrix factorization model.
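
For readers unfamiliar with explicit matrix factorization, the idea can be sketched in plain Python. This is a simplified variant (global mean plus a dot product of user and item factors, trained with SGD on squared error); the class `ToyExplicitMF`, its hyperparameters, and the toy ratings are assumptions made for illustration and are not the model behind `utils.train_explicit`.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyExplicitMF:
    """Toy explicit MF (hypothetical): r_hat(u, i) = global_mean + p_u . q_i."""

    def __init__(self, num_users, num_items, dim=4, seed=0):
        rng = random.Random(seed)
        self.user_factors = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                             for _ in range(num_users)]
        self.item_factors = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                             for _ in range(num_items)]
        self.global_mean = 0.0

    def predict(self, user, item):
        return self.global_mean + dot(self.user_factors[user],
                                      self.item_factors[item])

    def fit(self, ratings, epochs=200, lr=0.05, reg=0.01):
        self.global_mean = sum(r for _, _, r in ratings) / len(ratings)
        for _ in range(epochs):
            for user, item, rating in ratings:
                err = rating - self.predict(user, item)
                p, q = self.user_factors[user], self.item_factors[item]
                for k in range(len(p)):
                    p_k, q_k = p[k], q[k]
                    # SGD step on squared error with L2 regularization
                    p[k] += lr * (err * q_k - reg * p_k)
                    q[k] += lr * (err * p_k - reg * q_k)
        return self


def rmse(predict, ratings):
    return (sum((predict(u, i) - r) ** 2 for u, i, r in ratings)
            / len(ratings)) ** 0.5


# toy (user, item, rating) triples
toy_train = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
             (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

toy_model = ToyExplicitMF(num_users=3, num_items=3).fit(toy_train)
mean_rating = sum(r for _, _, r in toy_train) / len(toy_train)
baseline_rmse = rmse(lambda u, i: mean_rating, toy_train)  # always predict the mean
trained_rmse = rmse(toy_model.predict, toy_train)
```

Comparing `trained_rmse` against `baseline_rmse` shows whether the factors learned anything beyond the mean rating; on real data this comparison should be made on a held-out split rather than the training set.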

### Explicit Rate Model

In [1]:
import utils
from spotlight.evaluation import rmse_score

dataset_recommend_train, dataset_recommend_test, dataset_recommend_dev, dataset_read, dataset_shelve = utils.parse_goodreads(
    path='/local/terrier/Collections/Recommendations/Goodreads/goodreads_interactions_fantasy_paranormal.json.gz')

print('Explicit dataset (TEST) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_test.ratings), ','),
          format(dataset_recommend_test.num_users, ','),
          format(dataset_recommend_test.num_items, ',')))

print('Explicit dataset (VALID) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_dev.ratings), ','),
          format(dataset_recommend_dev.num_users, ','),
          format(dataset_recommend_dev.num_items, ',')))

print('Explicit dataset (TRAIN) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_train.ratings), ','),
          format(dataset_recommend_train.num_users, ','),
          format(dataset_recommend_train.num_items, ',')))

print('Implicit dataset (READ) contains %s interactions of %s users and %s items'%(
          format(len(dataset_read.ratings), ','),
          format(dataset_read.num_users, ','),
          format(dataset_read.num_items, ',')))

print('Implicit dataset (SHELVE) contains %s interactions of %s users and %s items'%(
          format(len(dataset_shelve.ratings), ','),
          format(dataset_shelve.num_users, ','),
          format(dataset_shelve.num_items, ',')))

# train the explicit model based on recommend feedback
model = utils.train_explicit(train_interactions=dataset_recommend_train, 
                             valid_interactions=dataset_recommend_dev,
                             run_name='model_goodreads_fantasy_explicit_rate')

# evaluate the new model
mrr, ndcg, ndcg10, ndcg_5, mmap, success_10, success_5 = utils.evaluate(interactions=dataset_recommend_test,
                                                                        model=model,
                                                                        topk=20)
rmse = rmse_score(model=model, test=dataset_recommend_test)
print('-'*20)
print('RMSE: {:.4f}'.format(rmse))
print('MRR: {:.4f}'.format(mrr))
print('nDCG: {:.4f}'.format(ndcg))
print('nDCG@10: {:.4f}'.format(ndcg10))
print('nDCG@5: {:.4f}'.format(ndcg_5))
print('MAP: {:.4f}'.format(mmap))
print('success@10: {:.4f}'.format(success_10))
print('success@5: {:.4f}'.format(success_5))

Explicit dataset (TEST) contains 2,539,491 interactions of 419,774 users and 258,037 items
Explicit dataset (VALID) contains 2,539,492 interactions of 419,774 users and 258,037 items
Explicit dataset (TRAIN) contains 20,315,929 interactions of 419,774 users and 258,037 items
Implicit dataset (READ) contains 26,919,068 interactions of 419,774 users and 258,037 items
Implicit dataset (SHELVE) contains 52,141,533 interactions of 419,774 users and 258,037 items
epoch 1 start at:  Tue Apr 23 14:40:05 2019
epoch 1 end at:  Tue Apr 23 15:00:51 2019
RMSE: 0.4113
epoch 2 start at:  Tue Apr 23 15:01:02 2019
epoch 2 end at:  Tue Apr 23 15:21:45 2019
RMSE: 0.4089
epoch 3 start at:  Tue Apr 23 15:21:57 2019
epoch 3 end at:  Tue Apr 23 15:42:42 2019
RMSE: 0.4082
epoch 4 start at:  Tue Apr 23 15:42:54 2019
epoch 4 end at:  Tue Apr 23 16:03:39 2019
RMSE: 0.4079
epoch 5 start at:  Tue Apr 23 16:03:51 2019
epoch 5 end at:  Tue Apr 23 16:21:25 2019
RMSE: 0.4076
epoch 6 start at:  Tue Apr 23 16:21:37 2019

## Remove all valid/test samples

In [2]:
# collect every (user, item) pair that appears in the test or dev split
test_interact = set(zip(dataset_recommend_test.user_ids, dataset_recommend_test.item_ids))
test_interact.update(zip(dataset_recommend_dev.user_ids, dataset_recommend_dev.item_ids))

# mark implicit interactions that overlap with test/dev by setting their rating to -1
for idx, (uid, iid) in enumerate(zip(dataset_read.user_ids, dataset_read.item_ids)):
    if (uid, iid) in test_interact:
        dataset_read.ratings[idx] = -1
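
The effect of the cell above can be checked on a toy example. The helper `mask_heldout` below is hypothetical and works on plain lists; how the `-1` sentinel is consumed later (for example by `utils.annotate`) is not shown in this notebook, so the sketch only demonstrates the marking itself.

```python
def mask_heldout(user_ids, item_ids, ratings, heldout_pairs, sentinel=-1):
    """Return a copy of ratings with held-out (user, item) pairs set to sentinel."""
    return [sentinel if (u, i) in heldout_pairs else r
            for u, i, r in zip(user_ids, item_ids, ratings)]


# toy implicit interactions (all ratings are 1) and two held-out pairs
toy_users = [0, 0, 1, 2]
toy_items = [0, 1, 1, 2]
toy_ratings = [1, 1, 1, 1]
heldout = {(0, 1), (2, 2)}

masked = mask_heldout(toy_users, toy_items, toy_ratings, heldout)
# masked == [1, -1, 1, -1]
```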

### Explicit Read Model

Use the **explicit rate model** trained in the previous section to annotate the **missing ratings** in the **read** dataset.

In [3]:
# annotate the missing ratings in the read dataset using the explicit rate model
dataset_read = utils.annotate(interactions=dataset_read, 
                              model=model, 
                              run_name='dataset_goodreads_fantasy_read_explicit_annotated')

# train a new explicit model on the annotated read dataset
model = utils.train_explicit(train_interactions=dataset_read, 
                             valid_interactions=dataset_recommend_dev,
                             run_name='model_goodreads_fantasy_explicit_read')

# evaluate the new model
mrr, ndcg, ndcg10, ndcg_5, mmap, success_10, success_5 = utils.evaluate(interactions=dataset_recommend_test,
                                                                        model=model,
                                                                        topk=20)
rmse = rmse_score(model=model, test=dataset_recommend_test)
print('-'*20)
print('RMSE: {:.4f}'.format(rmse))
print('MRR: {:.4f}'.format(mrr))
print('nDCG: {:.4f}'.format(ndcg))
print('nDCG@10: {:.4f}'.format(ndcg10))
print('nDCG@5: {:.4f}'.format(ndcg_5))
print('MAP: {:.4f}'.format(mmap))
print('success@10: {:.4f}'.format(success_10))
print('success@5: {:.4f}'.format(success_5))

epoch 1 start at:  Wed Apr 24 17:33:55 2019
epoch 1 end at:  Wed Apr 24 17:54:50 2019
RMSE: 0.4156
epoch 2 start at:  Wed Apr 24 17:55:02 2019
epoch 2 end at:  Wed Apr 24 18:15:59 2019
RMSE: 0.4144
epoch 3 start at:  Wed Apr 24 18:16:11 2019
epoch 3 end at:  Wed Apr 24 18:37:09 2019
RMSE: 0.4141
epoch 4 start at:  Wed Apr 24 18:37:21 2019
epoch 4 end at:  Wed Apr 24 18:58:17 2019
RMSE: 0.4139
epoch 5 start at:  Wed Apr 24 18:58:29 2019
epoch 5 end at:  Wed Apr 24 19:19:17 2019
RMSE: 0.4139
epoch 6 start at:  Wed Apr 24 19:19:29 2019
epoch 6 end at:  Wed Apr 24 19:40:22 2019
RMSE: 0.4139
epoch 7 start at:  Wed Apr 24 19:40:34 2019
epoch 7 end at:  Wed Apr 24 20:01:19 2019
RMSE: 0.4139
--------------------
RMSE: 0.4141
MRR: 0.0247
nDCG: 0.0208
nDCG@10: 0.0170
nDCG@5: 0.0130
MAP: 0.0101
success@10: 0.0661
success@5: 0.0396
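
For reference, the ranking metrics reported above can be computed for a single user as in the sketch below. These are the common binary-relevance definitions; the exact definitions inside `utils.evaluate` are not shown in this notebook and may differ (for example in how relevance is graded or how `topk` is applied), so treat the function names and formulas here as assumptions.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def success_at_k(ranked, relevant, k):
    """1 if at least one relevant item appears in the top k, else 0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# toy ranking for one user: item ids ordered by predicted score
ranked_items = [7, 3, 9, 1, 4]
relevant_items = {3, 4}

rr = reciprocal_rank(ranked_items, relevant_items)  # first hit at rank 2 -> 0.5
ndcg5 = ndcg_at_k(ranked_items, relevant_items, 5)
hit_at_1 = success_at_k(ranked_items, relevant_items, 1)
hit_at_2 = success_at_k(ranked_items, relevant_items, 2)
```

MRR and the other reported numbers are then averages of these per-user values over all test users.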


## Implicit Model Experiments

This section contains all the experiments based on the implicit matrix factorization model.

### Implicit Model using Negative Sampling
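
Implicit feedback contains only positive interactions, so training needs sampled negatives: items the user has not interacted with. The sketch below pairs each observed item with one uniformly sampled negative and applies a BPR-style pairwise update. Which loss `utils.train_implicit_negative_sampling` actually uses is not shown here; the pairwise loss, the helper names, and the toy data are assumptions for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_negative(user, positives_by_user, num_items, rng):
    """Draw a uniformly random item the user has not interacted with."""
    while True:
        candidate = rng.randrange(num_items)
        if candidate not in positives_by_user[user]:
            return candidate


def bpr_step(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One ascent step on log sigmoid(p_u.q_i - p_u.q_j) with L2 regularization."""
    x = dot(p_u, q_i) - dot(p_u, q_j)
    g = 1.0 / (1.0 + math.exp(x))  # sigmoid(-x), the gradient weight
    for k in range(len(p_u)):
        pu, qi, qj = p_u[k], q_i[k], q_j[k]
        p_u[k] += lr * (g * (qi - qj) - reg * pu)
        q_i[k] += lr * (g * pu - reg * qi)
        q_j[k] += lr * (-g * pu - reg * qj)


rng = random.Random(0)
num_items, dim = 6, 2
positives_by_user = {0: {0, 1}, 1: {2}, 2: {3, 4}}  # toy implicit feedback
user_factors = {u: [rng.gauss(0.0, 0.1) for _ in range(dim)]
                for u in positives_by_user}
item_factors = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                for _ in range(num_items)]

triples = []  # (user, positive item, sampled negative item)
for epoch in range(20):
    for user, positives in positives_by_user.items():
        for pos in sorted(positives):
            neg = sample_negative(user, positives_by_user, num_items, rng)
            triples.append((user, pos, neg))
            bpr_step(user_factors[user], item_factors[pos], item_factors[neg])
```

Note that scores from a model trained this way are ranking scores, not ratings, so they are not on the 1 to 5 rating scale.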

In [3]:
import utils
from spotlight.evaluation import rmse_score

dataset_recommend_train, dataset_recommend_test, dataset_recommend_dev, dataset_read, dataset_shelve = utils.parse_goodreads(
    path='/local/terrier/Collections/Recommendations/Goodreads/goodreads_interactions_fantasy_paranormal.json.gz')

print('Explicit dataset (TEST) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_test.ratings), ','),
          format(dataset_recommend_test.num_users, ','),
          format(dataset_recommend_test.num_items, ',')))

print('Explicit dataset (VALID) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_dev.ratings), ','),
          format(dataset_recommend_dev.num_users, ','),
          format(dataset_recommend_dev.num_items, ',')))

print('Explicit dataset (TRAIN) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_train.ratings), ','),
          format(dataset_recommend_train.num_users, ','),
          format(dataset_recommend_train.num_items, ',')))

print('Implicit dataset (READ) contains %s interactions of %s users and %s items'%(
          format(len(dataset_read.ratings), ','),
          format(dataset_read.num_users, ','),
          format(dataset_read.num_items, ',')))

print('Implicit dataset (SHELVE) contains %s interactions of %s users and %s items'%(
          format(len(dataset_shelve.ratings), ','),
          format(dataset_shelve.num_users, ','),
          format(dataset_shelve.num_items, ',')))

# train the implicit model (negative sampling) on the read dataset
model = utils.train_implicit_negative_sampling(train_interactions=dataset_read, 
                                               valid_interactions=dataset_recommend_dev,
                                               run_name='model_goodreads_fantasy_implicit')

# evaluate the new model
mrr, ndcg, ndcg10, ndcg_5, mmap, success_10, success_5 = utils.evaluate(interactions=dataset_recommend_test,
                                                                        model=model,
                                                                        topk=20)
rmse = rmse_score(model=model, test=dataset_recommend_test)
print('-'*20)
print('RMSE: {:.4f}'.format(rmse))
print('MRR: {:.4f}'.format(mrr))
print('nDCG: {:.4f}'.format(ndcg))
print('nDCG@10: {:.4f}'.format(ndcg10))
print('nDCG@5: {:.4f}'.format(ndcg_5))
print('MAP: {:.4f}'.format(mmap))
print('success@10: {:.4f}'.format(success_10))
print('success@5: {:.4f}'.format(success_5))

Explicit dataset (TEST) contains 2,539,491 interactions of 419,774 users and 258,037 items
Explicit dataset (VALID) contains 2,539,492 interactions of 419,774 users and 258,037 items
Explicit dataset (TRAIN) contains 20,315,929 interactions of 419,774 users and 258,037 items
Implicit dataset (READ) contains 26,919,068 interactions of 419,774 users and 258,037 items
Implicit dataset (SHELVE) contains 52,141,533 interactions of 419,774 users and 258,037 items
--------------------
RMSE: 3.3754
MRR: 0.1145
nDCG: 0.0982
nDCG@10: 0.0829
nDCG@5: 0.0647
MAP: 0.0528
success@10: 0.2732
success@5: 0.1757


## Popularity
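
A popularity baseline scores every item by how often it occurs in the training interactions and recommends the same ranking to all users. The source of `PopularityModel` (imported from the local `popularity` module) is not shown in this notebook, so the class below is a hypothetical stand-in that only illustrates the idea.

```python
from collections import Counter


class ToyPopularityModel:
    """Hypothetical popularity baseline: score = training interaction count."""

    def fit(self, item_ids):
        self.counts = Counter(item_ids)
        return self

    def predict(self, item_ids):
        # unseen items get a score of 0
        return [self.counts.get(item, 0) for item in item_ids]

    def top_k(self, k, exclude=()):
        excluded = set(exclude)
        ranked = [item for item, _ in self.counts.most_common()
                  if item not in excluded]
        return ranked[:k]


# toy training interactions: item 3 occurs three times, item 1 twice, item 2 once
toy_item_ids = [3, 1, 3, 2, 3, 1]
pop_model = ToyPopularityModel().fit(toy_item_ids)

scores = pop_model.predict([2, 9])         # [1, 0]
top2 = pop_model.top_k(2)                  # [3, 1]
top2_unseen = pop_model.top_k(2, exclude=[3])  # [1, 2]
```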

In [4]:
import utils
from spotlight.evaluation import rmse_score
from popularity import PopularityModel

dataset_recommend_train, dataset_recommend_test, dataset_recommend_dev, dataset_read, dataset_shelve = utils.parse_goodreads(
    path='/local/terrier/Collections/Recommendations/Goodreads/goodreads_interactions_fantasy_paranormal.json.gz')

print('Explicit dataset (TEST) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_test.ratings), ','),
          format(dataset_recommend_test.num_users, ','),
          format(dataset_recommend_test.num_items, ',')))

print('Explicit dataset (VALID) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_dev.ratings), ','),
          format(dataset_recommend_dev.num_users, ','),
          format(dataset_recommend_dev.num_items, ',')))

print('Explicit dataset (TRAIN) contains %s interactions of %s users and %s items'%(
          format(len(dataset_recommend_train.ratings), ','),
          format(dataset_recommend_train.num_users, ','),
          format(dataset_recommend_train.num_items, ',')))

print('Implicit dataset (READ) contains %s interactions of %s users and %s items'%(
          format(len(dataset_read.ratings), ','),
          format(dataset_read.num_users, ','),
          format(dataset_read.num_items, ',')))

print('Implicit dataset (SHELVE) contains %s interactions of %s users and %s items'%(
          format(len(dataset_shelve.ratings), ','),
          format(dataset_shelve.num_users, ','),
          format(dataset_shelve.num_items, ',')))

# fit the popularity baseline on the explicit training interactions
model = PopularityModel()
print('fit the model')
model.fit(interactions=dataset_recommend_train)

# evaluate the new model
print('evaluate the model')
mrr, ndcg, ndcg10, ndcg_5, mmap, success_10, success_5 = utils.evaluate(interactions=dataset_recommend_test,
                                                                        model=model,
                                                                        topk=20)
# rmse = rmse_score(model=model, test=dataset_recommend_test, batch_size=512)
# print('-'*20)
# print('RMSE: {:.4f}'.format(rmse))
print('MRR: {:.4f}'.format(mrr))
print('nDCG: {:.4f}'.format(ndcg))
print('nDCG@10: {:.4f}'.format(ndcg10))
print('nDCG@5: {:.4f}'.format(ndcg_5))
print('MAP: {:.4f}'.format(mmap))
print('success@10: {:.4f}'.format(success_10))
print('success@5: {:.4f}'.format(success_5))

Explicit dataset (TEST) contains 2,539,491 interactions of 419,774 users and 258,037 items
Explicit dataset (VALID) contains 2,539,492 interactions of 419,774 users and 258,037 items
Explicit dataset (TRAIN) contains 20,315,929 interactions of 419,774 users and 258,037 items
Implicit dataset (READ) contains 26,919,068 interactions of 419,774 users and 258,037 items
Implicit dataset (SHELVE) contains 52,141,533 interactions of 419,774 users and 258,037 items
fit the model
evaluate the model
MRR: 0.1143
nDCG: 0.0981
nDCG@10: 0.0819
nDCG@5: 0.0651
MAP: 0.0527
success@10: 0.2669
success@5: 0.1773
