# Quickstart
In this example, we'll build an implicit feedback recommender using the Movielens 100k dataset (http://grouplens.org/datasets/movielens/100k/).

The code behind this example is available as a [Jupyter notebook](https://github.com/lyst/lightfm/tree/master/examples/quickstart/quickstart.ipynb).

LightFM includes functions for getting and processing this dataset, so obtaining it is quite easy.

In [4]:
#!pip install lightfm

In [1]:
import numpy as np

from lightfm.datasets import fetch_movielens

data = fetch_movielens(min_rating=5.0)

In [14]:
data

{'train': <943x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 19048 stored elements in COOrdinate format>,
 'test': <943x1682 sparse matrix of type '<class 'numpy.int32'>'
 	with 2153 stored elements in COOrdinate format>,
 'item_features': <1682x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 1682 stored elements in Compressed Sparse Row format>,
 'item_feature_labels': array(['T', 'G', 'F', ..., 'S', 'Y', 'S'], dtype='<U1'),
 'item_labels': array(['T', 'G', 'F', ..., 'S', 'Y', 'S'], dtype='<U1')}

In [2]:
x = data['test'].toarray()
x[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int32)

In [3]:
x[0][:500]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

This downloads the dataset and automatically pre-processes it into sparse matrices suitable for further calculation. In particular, it prepares the sparse user-item matrices, containing positive entries where a user interacted with a product, and zeros otherwise.

We have two such matrices, a training set and a testing set. Both cover the same 943 users and 1682 items. We'll train the model on the train matrix and evaluate it on the test matrix.
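
To make the shape of these matrices concrete, here is a toy sketch of turning rating triples into a user-item matrix that keeps only ratings at or above a threshold. The helper `build_interactions` and the sample ratings are hypothetical, and the sketch uses a dense NumPy array for readability. It is not how `fetch_movielens` is implemented, and it assumes `min_rating` means "keep ratings greater than or equal to this value".

```python
import numpy as np

# Hypothetical (user, item, rating) triples for 3 users and 3 items.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 5.0), (2, 2, 4.0), (2, 0, 5.0)]


def build_interactions(ratings, n_users, n_items, min_rating=5.0):
    """Return a dense user-item matrix holding only ratings >= min_rating."""
    matrix = np.zeros((n_users, n_items), dtype=np.float32)
    for user, item, rating in ratings:
        if rating >= min_rating:
            matrix[user, item] = rating
    return matrix


interactions = build_interactions(ratings, n_users=3, n_items=3)
print(interactions)
```

Rows are users and columns are items. Every cell that stays at zero is an item the user either never rated or rated below the threshold, which is why the real matrices are stored in a sparse format.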

In [4]:
print(repr(data['train']))
print(repr(data['test']))

<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 19048 stored elements in COOrdinate format>
<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 2153 stored elements in COOrdinate format>


We need to import the model class to fit the model:

In [5]:
from lightfm import LightFM

We're going to use the WARP (Weighted Approximate-Rank Pairwise) model. WARP is an implicit feedback model: all interactions in the training matrix are treated as positive signals, and products that users did not interact with are assumed to be ones they implicitly do not like. The goal of the model is to score these implicit positives highly while assigning low scores to implicit negatives.
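
The core idea behind WARP can be sketched as follows: for a known positive item, keep drawing random negatives until one scores within a margin of the positive, then weight the update by how quickly that violation was found. This is a simplified illustration and not LightFM's implementation. The function name, the margin of 1.0, the trial limit, and the logarithmic weight are all assumptions made for this example.

```python
import numpy as np


def sample_warp_negative(scores, positive_item, rng, margin=1.0, max_trials=10):
    """Draw random negatives until one scores within `margin` of the positive.

    Returns (negative_item, weight), or (None, 0.0) if no violation is found.
    All names and defaults are illustrative, not LightFM internals.
    """
    n_items = len(scores)
    for trial in range(1, max_trials + 1):
        negative_item = int(rng.integers(n_items))
        if negative_item == positive_item:
            continue  # the positive cannot serve as its own negative
        if scores[negative_item] > scores[positive_item] - margin:
            # Finding a violation in few trials suggests many items outrank
            # the positive, so the update receives a larger weight.
            estimated_rank = (n_items - 1) // trial
            return negative_item, float(np.log(max(estimated_rank, 1)))
    return None, 0.0


rng = np.random.default_rng(0)
well_ranked = np.array([10.0, 0.0, 0.0, 0.0])   # positive (item 0) is far ahead
badly_ranked = np.array([0.0, 5.0, 5.0, 5.0])   # every negative beats item 0
print(sample_warp_negative(well_ranked, 0, rng))
print(sample_warp_negative(badly_ranked, 0, rng))
```

When the positive already outranks everything by more than the margin, no violating negative exists and no update is needed. When it is ranked poorly, a violation turns up quickly and the update is weighted more heavily.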

Model training is accomplished via SGD (stochastic gradient descent). With each pass through the data (an epoch), the model learns to fit the data more closely. We'll run it for 30 epochs in this example. Training can also run on multiple cores, so we'll set the number of threads to 2. (The dataset in this example is too small for that to make a difference, but it will matter on bigger datasets.)

In [6]:
model = LightFM(loss='warp')
%time model.fit(data['train'], epochs=30, num_threads=2)

CPU times: user 328 ms, sys: 1.8 ms, total: 329 ms
Wall time: 168 ms


<lightfm.lightfm.LightFM at 0x7f75387deee0>

Done! We should now evaluate the model to see how well it's doing. We're most interested in how good the ranking produced by the model is. Precision@k is one suitable metric: it expresses the fraction of the top k items in the ranking that the user has actually interacted with. `lightfm` implements a number of metrics in the `evaluation` module.
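
As a rough illustration of what precision@k measures for a single user, the sketch below ranks items by score and counts how many of the top k are known positives. The function `precision_at_k_single` and the sample scores are hypothetical. This is not the code behind `lightfm.evaluation.precision_at_k`, which works on whole interaction matrices.

```python
import numpy as np


def precision_at_k_single(scores, positives, k=5):
    """Fraction of the k highest-scoring items that are in `positives`."""
    top_k = np.argsort(-scores)[:k]            # item indices, best score first
    hits = np.isin(top_k, list(positives)).sum()
    return hits / k


# Hypothetical scores for six items; items 1 and 2 are the user's positives.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
positives = {1, 2}

print(precision_at_k_single(scores, positives, k=3))
```

Here the top three items by score are 1, 3 and 5, of which only item 1 is a positive, so the result is one third. Averaging this quantity over all users gives a single figure like the ones reported for the train and test sets.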

In [7]:
from lightfm.evaluation import precision_at_k

We'll measure precision in both the train and the test set.

In [8]:
print("Train precision: %.2f" % precision_at_k(model, data['train'], k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, data['test'], k=5).mean())

Train precision: 0.40
Test precision: 0.05


Unsurprisingly, the model fits the train set better than the test set.

For an alternative way of judging the model, we can sample a couple of users and get their recommendations. To make predictions for a given user, we pass the id of that user and the ids of all products we want predictions for into the `predict` method.

In [12]:
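def sample_recommendation(model, data, user_ids):
    """Print three known positives and the top three recommendations per user."""

    n_users, n_items = data['train'].shape
    # Convert once to CSR so each user's row can be sliced.
    train = data['train'].tocsr()

    for user_id in user_ids:
        # Labels of the items this user interacted with in the training set.
        known_positives = data['item_labels'][train[user_id].indices]

        # Score every item for this user and rank from highest to lowest score.
        scores = model.predict(user_id, np.arange(n_items))
        top_items = data['item_labels'][np.argsort(-scores)]

        print("User %s" % user_id)
        print("     Known positives:")

        for x in known_positives[:3]:
            print("        %s" % x)

        print("     Recommended:")

        for x in top_items[:3]:
            print("        %s" % x)

sample_recommendation(model, data, [0])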

User 0
     Known positives:
        T
        S
        D
     Recommended:
        P
        S
        W


In [13]:
user_id = 0
n_users, n_items = data['train'].shape

known_positives = data['item_labels'][data['train'].tocsr()[user_id].indices]
known_positives

array(['T', 'S', 'D', 'U', 'M', 'P', 'M', 'F', 'A', 'C', 'C', 'D', 'E',
       'H', 'S', 'P', 'P', 'T', 'T', 'S', 'H', 'J', 'R', 'S', 'B', 'N',
       'W', 'T', 'F', 'K', 'M', 'T', 'H', 'W', 'H', 'M', 'L', 'G', 'B',
       'B', 'S', 'S', 'M', 'J', 'M', 'M', 'W', 'C', 'E', 'P', 'R', 'B',
       'A', 'G', '1', 'R', 'A', 'H', 'A', 'T', 'D', 'G', 'N', 'B', 'C',
       'Y', 'W', 'B', 'S', 'R', 'S', 'M', 'K', 'C', 'P', 'C', 'C', 'F',
       'G'], dtype='<U1')

In [20]:
n_users

943

In [15]:
scores = model.predict(user_id, np.arange(n_items))
scores

array([-2.2952812, -4.640448 , -4.1432896, ..., -5.41136  , -5.703436 ,
       -5.4059935], dtype=float32)

In [21]:

for user_id in range(n_users):
    scores = model.predict(user_id, np.arange(n_items))
    print(np.min(scores), np.max(scores))

-6.556858 -0.47575802
-3.1770997 1.4096434
-3.3157828 2.0125866
-3.9707596 1.7174916
-5.02475 0.79588413
-5.7206655 1.4674811
-8.6262455 -0.50813156
-4.2476745 1.5475407
-3.3032255 1.7626998
-5.920145 1.209553
-4.2111654 0.052727282
-4.2030234 0.9638717
-8.423445 -2.4151134
-5.4317484 0.36516747
-4.3386316 1.5031738
-7.008901 0.47597763
-1.8157058 1.6345513
-6.5243464 -0.8122159
-2.8556573 2.913535
-3.091422 1.7814847
-4.2773423 1.2104431
-5.446405 1.4592605
-4.7075725 0.7746018
-4.9070935 1.6396271
-3.712769 1.2036506
-2.122909 1.619989
-2.272413 1.9895868
-3.5128317 0.97010696
-2.5518458 1.1367903
-2.7387 1.1719729
-3.8282833 1.3638606
-2.3939452 1.4449996
-0.8746483 1.8665171
-3.6348116 1.3461726
-2.1938632 1.8587593
-4.1293464 2.0228665
-3.9330215 1.8903208
-6.7547235 0.0425042
-2.7759597 1.3796371
-0.82939315 1.7720833
-3.8176112 2.128035
-5.8681016 0.9882653
-5.426277 0.36067486
-4.5531726 1.6548772
-2.7685957 1.4569209
-2.995668 0.88857937
-2.619432 1.6688845
-4.6731553 1.957157

-4.6583934 1.1450193
-3.5621912 1.8757315
-1.8278301 1.6257634
-3.7996156 1.7247596
-3.0712247 1.4068571
-5.0579367 1.4599751
-1.7415048 1.8526328
-3.5515354 1.8679216
-5.038834 2.1318593
-4.3525324 1.7180948
-5.1029363 1.1498827
-4.69254 0.33163947
-5.3490076 1.0731844
-4.425092 2.602664
-2.847752 1.2219634
-2.9971726 1.851233
-2.3218784 1.6484113
-4.877856 1.4685138
-4.1584415 1.219297
-5.7060466 0.7090313
-3.8925138 1.6564789
-0.84083337 1.7403606
-5.710891 2.3723028
-0.82883924 1.7372036
-3.3049517 1.3284956
-2.6885865 1.4194441
-2.9398108 0.8126015
-2.835842 0.9620273
-4.376765 1.8963068
-4.115374 1.2456505
-5.0712757 0.003282236
-4.0712433 1.6019183
-8.030024 -0.22679071
-5.4572544 1.2821655
-2.3456888 1.2816072
-5.9027276 2.4793649
-1.5216643 1.8015379
-3.5479047 1.3365421
-6.7767663 0.98848724
-2.7085524 1.7042911
-4.412075 2.3154268
-2.2897441 1.6460238
-0.8356899 1.7382066
-4.156502 1.3217481
-4.678862 1.4393145
-4.7204704 1.1039667
-2.5272014 2.0614457
-2.400654 1.190763
-4.

In [None]:

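# Rank all items for one user and show the top recommendations.
# Recompute the scores here: the loop above left `user_id` and `scores`
# pointing at the last user.
user_id = 0
scores = model.predict(user_id, np.arange(n_items))
top_items = data['item_labels'][np.argsort(-scores)]

print("User %s" % user_id)
print("     Known positives:")

for x in known_positives[:3]:
    print("        %s" % x)

print("     Recommended:")

for x in top_items[:3]:
    print("        %s" % x)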