# Your task: implement several models and compare them via CV

- You'll need to implement several models and use standard polara functionality to perform a comprehensive evaluation.
- You'll use the `BookCrossing` dataset for your experiments.

# Models to choose from:

<div class="alert alert-block alert-info">Implement one of the following 3 groups of models:</div>
  
1:  
 - simple content-based model (recommend items based on their feature similarity)
 - folding-in for unbiased matrix factorization (you can reuse the code for MF shared with you previously)

2:
 - simple item-to-item model
 - biased matrix factorization model + folding-in
  
3:
 - folding-in for the LightFM model, following the solution described at https://github.com/lyst/lightfm/issues/300

# Cross-validation for model comparison:

All models should be compared via CV experiments with the following two baselines:
    - Popularity-based model
    - PureSVD (or ScaledSVD)

<div class="alert alert-block alert-warning">You must perform fair hyper-parameter tuning for both your models (using random grid search) and PureSVD (using rank truncation).</div>
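
A random grid search could be organised along the following lines. This is only a sketch in plain Python: the grid values, the parameter names `rank` and `regularization`, and the `toy_score` function are hypothetical placeholders and not part of polara. In a real experiment the scoring function would build the model with the given configuration and return the metric measured on the tuning fold.

```python
import itertools
import random


def sample_grid(param_grid, n_samples, seed=0):
    """Draw up to n_samples distinct configurations from the full grid."""
    names = list(param_grid)
    all_points = list(itertools.product(*(param_grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_points, min(n_samples, len(all_points)))
    return [dict(zip(names, point)) for point in chosen]


def random_grid_search(param_grid, evaluate_config, n_samples, seed=0):
    """Return the best sampled configuration and its score (higher is better)."""
    best_config, best_score = None, float('-inf')
    for config in sample_grid(param_grid, n_samples, seed):
        score = evaluate_config(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical grid for a matrix factorization model.
example_grid = {'rank': [10, 20, 40, 80], 'regularization': [0.01, 0.1, 1.0]}


def toy_score(config):
    # Stand-in for "build the model and evaluate it on the tuning fold".
    return -abs(config['rank'] - 40) - config['regularization']


# Evaluate only 6 of the 12 grid points, chosen at random.
best_config, best_score = random_grid_search(example_grid, toy_score, n_samples=6)
```

Fixing the seed keeps the sampled configurations reproducible, which makes it easier to give every model the same tuning budget.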

Use one of the test folds for tuning. Evaluation settings:
1. Explicit data:
    - warm_start = True
    - holdout_size = 3
    - random_holdout = True
    - models' switch_positive = 7
    - evaluation metric: Informedness@10
2. Implicit data:
    - warm_start = True
    - holdout_size = 1
    - random_holdout = True
    - evaluation metric: MRR@10 (use `model.evaluate('ranking')`)
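
For reference, the idea behind MRR@10 with a single holdout item per user can be sketched as follows. This is an illustrative stand-alone version on made-up recommendation lists; the function names are hypothetical, and polara's own implementation behind `model.evaluate('ranking')` may differ in details.

```python
def reciprocal_rank_at_k(recommended, holdout_item, k=10):
    """Return 1/rank of the holdout item within the top-k list, or 0 if it is absent."""
    top_k = list(recommended)[:k]
    if holdout_item in top_k:
        return 1.0 / (top_k.index(holdout_item) + 1)
    return 0.0


def mrr_at_k(recommendations, holdout_items, k=10):
    """Average the reciprocal ranks over all test users."""
    ranks = [
        reciprocal_rank_at_k(recs, item, k)
        for recs, item in zip(recommendations, holdout_items)
    ]
    return sum(ranks) / len(ranks)


# Made-up top-3 lists for three test users and their holdout items.
toy_recommendations = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
toy_holdout = ['b', 'x', 'g']

toy_mrr = mrr_at_k(toy_recommendations, toy_holdout, k=10)
```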

Provide the average value of the metric across all 5 folds, as well as confidence intervals, for all models.
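
One common way to report such an interval is a Student's t interval around the mean of the per-fold scores. The sketch below assumes 5 folds and a 95% confidence level, for which the two-sided critical value with 4 degrees of freedom is approximately 2.776. The fold scores are made up for illustration, and other interval constructions (for example, bootstrap) are equally acceptable.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom (5 folds).
T_CRIT_95_DF4 = 2.776


def mean_with_ci(fold_scores, t_crit=T_CRIT_95_DF4):
    """Return the mean and the lower and upper bounds of the t-based interval."""
    mean = statistics.mean(fold_scores)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Made-up per-fold metric values, for illustration only.
toy_fold_scores = [0.051, 0.048, 0.055, 0.050, 0.046]
mean, lower, upper = mean_with_ci(toy_fold_scores)
```

If the number of folds changes, the critical value has to be replaced accordingly.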

# Preparing data

## Loading the dataset for experiments

In this homework, you will be using another dataset - **BookCrossing**.
It can be downloaded from  
http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip.

You are provided with a function that will download it for you. As the download may take quite a long time, it is probably better to download the data manually and pass the path to the local file as the first argument to the function.

In [None]:
from bookcrossing import get_bx_data
%matplotlib inline

In [None]:
bx_data, bx_books_meta = get_bx_data('c:/Users/evfro/Downloads/BX-CSV-Dump.zip', get_books=True)

In [None]:
bx_data.head()

In [None]:
bx_books_meta.head()

What are the ratings values?

In [None]:
# your answer

What is the number of unique users and items with respect to explicit and implicit ratings?

In [None]:
# your answer

<div class="alert alert-block alert-info">Depending on the type of algorithm you've implemented, perform experiments on either the implicit or the explicit part of the dataset.  
If your algorithm allows, you can run both (not mandatory).</div>

## Example: selecting the implicit part

*The explicit part of the experiments can be set up in a similar way, with the same filtering rules applied. Don't forget to replace implicit data with explicit data in the code below!*

In [None]:
implicit_data = bx_data.query('rating==0').drop('rating', axis=1) # or just bx_data.query('rating>0') for explicit part

The number of unique entities will probably be too high for their embeddings to fit into the memory of a standard computer. That's why you need to subsample the data.

The following rules should be applied:
- filter out all entities with only a single preference
- filter out users with too many preferences

Do entities with only a single known preference contribute to standard collaborative filtering models?

In [None]:
# your answer

Defining valid entities:

In [None]:
# frequency of books
books_pref_count = implicit_data['isbn'].value_counts()
# mark books with more than 1 user preference
valid_books = books_pref_count > 1

What about the distribution of the number of books per user?

In [None]:
(implicit_data.userid.value_counts().value_counts().sort_index().cumsum()
              .plot(logy=True, logx=True, title='Cumulative distribution of user profile length'));

As you can see, users with 100 or more books in their preferences make up a tiny fraction of the dataset. Let's filter them out, along with users who have only a single preference.

In [None]:
users_pref_count = implicit_data['userid'].value_counts()
valid_users = (users_pref_count > 1) & (users_pref_count < 100)

In [None]:
valid_book_index = valid_books.index[valid_books]
valid_user_index = valid_users.index[valid_users]
sampled_data = implicit_data.query('isbn in @valid_book_index and userid in @valid_user_index')

What is the resulting data sparsity and the number of unique entities?

In [None]:
# your answer

How is it different from the Movielens data?

In [None]:
# your answer

# Appendix: functions for content-based model

In [None]:
from polara.lib.similarity import combine_similarity_data

You'll need to build a similarity matrix. This can be done with polara's built-in function `combine_similarity_data`. The input to this function should be a pandas dataframe with an index corresponding to items and columns corresponding to different types of features. Each entry of the dataframe should be a list of feature values. Empty features should be represented by an empty list. The following code makes the necessary modifications:

In [None]:
meta_info = (bx_books_meta.query('isbn in @valid_book_index').set_index('isbn').fillna('')
                          .applymap(lambda x: x.split(',') if len(x) else [])
                          .reindex(sampled_data.isbn.unique(), fill_value=[])) # avoid missing isbn index in similarity data

In [None]:
meta_info.head()
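
To make the expected input format concrete, here is a small stand-alone illustration of the same kind of transformation on a made-up dataframe. The ISBNs, authors, publishers, and the `to_feature_list` helper are invented for this example. It applies the conversion column by column, which is one way to avoid relying on `applymap`.

```python
import pandas as pd

# Made-up metadata: one book has no known author.
toy_meta = pd.DataFrame(
    {
        'author': ['A. Writer', None],
        'publisher': ['Pub One,Pub Two', 'Pub One'],
    },
    index=pd.Index(['isbn-1', 'isbn-2'], name='isbn'),
)


def to_feature_list(value):
    """Split a comma-separated string into a list; an empty string gives an empty list."""
    return value.split(',') if len(value) else []


# Replace missing values with empty strings, then convert every entry to a list.
toy_features = toy_meta.fillna('').apply(lambda column: column.map(to_feature_list))
```

Every cell of `toy_features` is now a list, and the missing author is represented by an empty list.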

## Different ways to build similarity matrix

### Weighted similarity

You can mix similarities computed for various features with different weights, or simply sum them with uniform weights (default).

In [None]:
jw = 'jaccard-weighted'
jd = 'jaccard'
cs = 'cosine'
tc = 'tfidf-cosine'

sim_type = {'publisher':cs, 'author':cs}
item_similarity = combine_similarity_data(meta_info[list(sim_type.keys())], similarity_type=sim_type, weights=None)

In [None]:
item_similarity
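
The idea of mixing per-feature similarities can be illustrated with plain NumPy on made-up one-hot features. This is only a conceptual sketch: the helper names are hypothetical, and it is not necessarily how `combine_similarity_data` combines or normalises the matrices internally.

```python
import numpy as np


def cosine_similarity(feature_matrix):
    """Cosine similarity between the rows of a dense feature matrix."""
    norms = np.linalg.norm(feature_matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by zero
    normalized = feature_matrix / norms
    return normalized @ normalized.T


def combine(similarities, weights=None):
    """Weighted sum of similarity matrices; uniform unit weights if none are given."""
    if weights is None:
        weights = [1.0] * len(similarities)
    return sum(w * s for w, s in zip(weights, similarities))


# Three made-up books with one-hot encoded authors and publishers.
author_features = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
publisher_features = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)

author_sim = cosine_similarity(author_features)
publisher_sim = cosine_similarity(publisher_features)

uniform_mix = combine([author_sim, publisher_sim])
weighted_mix = combine([author_sim, publisher_sim], weights=[0.8, 0.2])
```

In `weighted_mix`, books sharing an author end up more similar than books sharing only a publisher, which is the effect that non-uniform weights are meant to achieve.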

### All-at-once similarity

**Alternatively**, you can build a single matrix of features of all types and compute similarity on top of it:

In [None]:
from polara.lib.similarity import get_similarity_data

In [None]:
all_features = meta_info.author.combine(meta_info.publisher, lambda x, y: x+y).to_frame('all_features')

In [None]:
all_similarity = get_similarity_data(all_features, similarity_type=tc)['all_features']
all_similarity

You'll also need to add functionality for working with similarity data to polara's data model. Here's how to do it:

## Preparing data model

Let's first define a new data model with the necessary functionality.

In [None]:
from polara import RecommenderData
from polara.recommender.coldstart.data import FeatureSimilarityMixin

In [None]:
# new class to mix in the similarity data
class SimilarityDataModel(FeatureSimilarityMixin, RecommenderData): pass

Below is an example of how it should be set up for the implicit data. **You'll need to add the `rating` field for explicit data, as in the standard data model.**

In [None]:
similarities = {'userid': None, 'isbn': item_similarity}
sim_indices = {'userid': None, 'isbn': meta_info.index}

In [None]:
data_model = SimilarityDataModel(similarities, sim_indices,
                                 sampled_data, 'userid', 'isbn', # rating is omitted for implicit case
                                 seed=42) 

In [None]:
data_model.random_holdout = True
data_model.holdout_size = 1
data_model.warm_start = True
data_model.prepare()

In [None]:
data_model.item_similarity

<div class="alert alert-block alert-info">Your content-based model should use this similarity data to find aggregated scores based on known user preferences for all test users.</div>

Hint: use the following command to get sparse representation of test data:  
`test_matrix, slice_data = self.get_test_matrix(test_data, shape, (start, stop))`
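
The aggregation idea can be illustrated on a toy example, independently of polara: multiplying the matrix of known user preferences by the item similarity matrix gives each candidate item a score equal to the sum of its similarities to the items in the user's profile. All names and values below are made up, and dense NumPy arrays stand in for the sparse matrices used in practice. The masking of already seen items is shown only for illustration and may be handled elsewhere within polara.

```python
import numpy as np

# Known preferences of 2 test users over 4 items (1 means the item is in the profile).
toy_preferences = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Made-up symmetric item-to-item similarity matrix.
toy_item_similarity = np.array([
    [1.0, 0.5, 0.0, 0.2],
    [0.5, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.7],
    [0.2, 0.0, 0.7, 1.0],
])


def content_based_scores(preferences, item_similarity):
    """Sum, for every user, the similarities of each item to the user's known items."""
    return preferences @ item_similarity


def top_n_unseen(scores, preferences, n):
    """Indices of the n highest-scoring items that are not already in the profile."""
    masked = np.where(preferences > 0, -np.inf, scores)
    return np.argsort(-masked, axis=1)[:, :n]


toy_scores = content_based_scores(toy_preferences, toy_item_similarity)
toy_top2 = top_n_unseen(toy_scores, toy_preferences, n=2)
```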


In [None]:
from polara import RecommenderModel

class ContentBased(RecommenderModel):
    def __init__(self, *args, **kwargs):
        super(ContentBased, self).__init__(*args, **kwargs)
        self.method = 'CB'
    
    def build(self, *args, **kwargs):
        # your implementation
        raise NotImplementedError

    def slice_recommendations(self, test_data, shape, start, stop, test_users=None):
        # your implementation
        raise NotImplementedError