In [16]:
import requests
import numpy as np
import pandas as pd

from polara import RecommenderData
from polara import get_movielens_data
from polara import SVDModel

# Preparing data

The test data is now provided externally, and the holdout is not shared with you. Your task is to build a model using the `Movielens-10M` dataset (either full or subsampled, it's up to you) and generate recommendations for the provided test users. Everything can be done within polara's API, so you only need to configure the `data_model` accordingly. Here's how to do it:

In [17]:
data = get_movielens_data('<path to ml-10m.zip file>')

In [18]:
data_model = RecommenderData(data, 'userid', 'movieid', 'rating', seed=0)
data_model.fields

Fields(userid='userid', itemid='movieid', feedback='rating')

## Setting the training data

If you want to use the entire ML-10M dataset for training, simply call the `prepare_training_only` method, which doesn't perform any data splitting and only reindexes users and movies:

In [19]:
data_model.prepare_training_only()

Preparing data...
Done.
There are 10000054 events in the training and 0 events in the holdout.


As you can see, the entire dataframe went into the training part.
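
To make the reindexing step more concrete, here is a minimal sketch on toy data. It is not polara's internal code: the `reindex` helper and the toy `ratings` frame are hypothetical, and only the `old`/`new` column names follow the index table used later in `submit_model`.

```python
import pandas as pd

# Toy ratings with sparse, non-contiguous ids (hypothetical data).
ratings = pd.DataFrame({
    'userid':  [10, 10, 42, 97],
    'movieid': [500, 7, 500, 1200],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

def reindex(column):
    """Map raw ids to contiguous 0-based codes (hypothetical helper).

    Returns the codes and a lookup table with 'old' and 'new' columns.
    """
    codes, uniques = pd.factorize(column, sort=True)
    index = pd.DataFrame({'old': uniques, 'new': range(len(uniques))})
    return codes, index

user_codes, user_index = reindex(ratings['userid'])
item_codes, item_index = reindex(ratings['movieid'])

# Replace raw ids with their contiguous internal codes.
reindexed = ratings.assign(userid=user_codes, movieid=item_codes)
```

Contiguous codes can be used directly as row and column positions of a rating matrix, while the lookup tables make it possible to translate results back to the original ids.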

## Setting the test data

Now you need to specify which test data should be used by the data model. First, download the test data itself:

In [20]:
testset = pd.read_csv('https://github.com/Evfro/acaml2018_recsys/raw/master/Part%204/team_testset.gz', header=0)

In [21]:
print(testset.shape)
testset.apply('nunique')

(388502, 3)


userid     8671
movieid    7381
rating       10
dtype: int64

Note that the **test users are not from the ML-10M dataset**, so the `warm_start` scenario should be invoked. The special method `set_test_data` assigns the externally provided test data and activates the necessary configuration:

In [22]:
data_model.set_test_data(testset=testset, warm_start=True)

The method will automatically perform all the necessary steps in order to ensure consistency between training and test data.
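
If you would like to inspect the relation between training and test data by hand, a manual sanity check could look like the sketch below. It uses toy frames (hypothetical data) and plain pandas; it only illustrates the kind of mismatch to look for and is not a description of what `set_test_data` does internally.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (hypothetical data).
train = pd.DataFrame({'userid': [1, 1, 2], 'movieid': [10, 20, 10]})
test = pd.DataFrame({'userid': [7, 7, 8], 'movieid': [10, 30, 20]})

# Users present in both sets; an empty result matches the warm-start
# setting, where test users are unknown at training time.
shared_users = set(train['userid']) & set(test['userid'])

# Test interactions with movies that never occur in training: a model
# built on the training data alone has no representation for them.
unseen_items = ~test['movieid'].isin(train['movieid'])

print(f'shared users: {len(shared_users)}')
print(f'test rows with unseen movies: {unseen_items.sum()}')
```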

# Submitting your model

Now everything is ready to use the model of your choice as before. Below is an example based on `PureSVD`.

In [23]:
svd = SVDModel(data_model)

Let's assume you've already performed hyper-parameter tuning and verified it with cross-validation. All that is needed now is to generate recommendations with your optimally tuned model and submit them to the leaderboard.

In [24]:
svd.rank = 35
svd.build()

PureSVD training time: 3.204495320337614s
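
For intuition about how a PureSVD-style model can score users it has never seen, here is a minimal NumPy sketch of the standard folding-in formula: a new user's rating vector `p` is projected onto the item factors `V`, giving scores `p V Vᵀ`. The toy matrix, the rank, and all variable names are assumptions made for illustration; this is not necessarily how polara's `SVDModel` computes warm-start recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training matrix: 6 users x 5 items (hypothetical data).
train = rng.integers(0, 6, size=(6, 5)).astype(float)

# Truncated SVD: keep the leading `rank` right singular vectors.
rank = 2
_, _, vt = np.linalg.svd(train, full_matrices=False)
v = vt[:rank].T  # item factors, shape (n_items, rank)

# Ratings of a user who was not part of the training matrix.
new_user = np.array([5.0, 0.0, 3.0, 0.0, 0.0])

# Folding-in: project onto the latent space and back onto items.
scores = new_user @ v @ v.T

# Exclude already rated items, then pick the two best candidates.
scores_unseen = np.where(new_user > 0, -np.inf, scores)
top2 = np.argsort(-scores_unseen)[:2]
```

No retraining is needed for the new user: the only inputs are the item factors learned from the training data and the user's known ratings.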


Below is a convenience function that ensures the correct configuration of your model when submitting it. The configuration corresponds to the rules of the competition.

In [25]:
def submit_model(model, submission_name):
    model.topk = 50 # recommendations will be evaluated in a range from 1 to 50
    recs = model.recommendations
    # restoring actual movieid indices instead of internal ones
    mapping = model.data.index.itemid.set_index('new').old
    recs = pd.Series(recs.ravel()).map(mapping).values.reshape(recs.shape)
    # saving the array (np.savez appends the .npz extension)
    np.savez(submission_name, recs=recs)
    # submitting the file; the context manager closes it after the upload
    url = "http://recsysvalley.azurewebsites.net/upload"
    with open(f'{submission_name}.npz', 'rb') as upload:
        r = requests.post(url, files={'upload': upload})
    return r.status_code, r.reason

In [26]:
submit_model(svd, 'svd_baseline')

(200, 'OK')
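
Before uploading, it may be worth checking the recommendations array locally. The sketch below assumes that a valid submission has one row per test user and `topk = 50` distinct movie ids per row; the `check_submission` helper is hypothetical, and the exact format expected by the leaderboard is not specified here.

```python
import numpy as np

def check_submission(recs, n_users, topk=50):
    """Return True if `recs` looks like a well-formed submission.

    Hypothetical helper: checks the (n_users, topk) shape and that
    no movie id is repeated within a single user's row.
    """
    recs = np.asarray(recs)
    if recs.shape != (n_users, topk):
        return False
    return all(len(np.unique(row)) == topk for row in recs)

# Toy example: 3 users, each recommended movie ids 0..49.
good = np.tile(np.arange(50), (3, 1))

# Same array with a duplicated movie id in the first row.
bad = good.copy()
bad[0, 1] = bad[0, 0]

print(check_submission(good, n_users=3))
print(check_submission(bad, n_users=3))
```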