Support negative scores #114

Closed
ravimody opened this issue May 24, 2018 · 9 comments

@ravimody

This library should support negative scores. They can be quite useful, and fit elegantly into the original ALS algorithm. Spark's ALS implemented this change years ago: https://gite.lirmm.fr/yagoubi/spark/commit/9e63f80e75bb6d9bbe6df268908c3219de6852d9
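For context, the way negative scores typically slot into the implicit ALS formulation (and roughly what the linked Spark change does, as far as I can tell) is to keep the preference at 0 for disliked items while raising their confidence. A minimal sketch of that mapping, with purely illustrative names:

```python
# Sketch of the preference/confidence split with negative feedback (illustrative only):
# each raw score r becomes a binary preference and a confidence weight.
def preference_and_confidence(r, alpha=40.0):
    # positive r -> preference 1, confidence 1 + alpha * r
    # negative r -> preference 0, confidence 1 + alpha * |r|  (a "confident zero")
    # missing r  -> preference 0, confidence 1 (the usual implicit-ALS default)
    preference = 1.0 if r > 0 else 0.0
    confidence = 1.0 + alpha * abs(r)
    return preference, confidence
```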

@benfred
Owner

benfred commented May 25, 2018

You're the second person to ask for this : #90 =)

I've run some experiments on doing this before (at a previous job working on news recommendations) and found that it didn't make an appreciable difference in the results for me. I had much better luck using only the implicit positive signals to rank items, and then using a different model with the positive/negative values to filter the ranked list. However, the link you supplied has a different point of view and says that they found it helpful 🤷‍♂️
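A rough sketch of that rank-then-filter idea (purely illustrative - `dislike_model` and its `predict_dislike` method are hypothetical, not part of this library):

```python
# Rank candidates with an implicit model trained on positive signals only, then
# drop items that a second model (trained on the positive/negative feedback)
# flags as likely dislikes.
def recommend_filtered(als_model, dislike_model, userid, user_items, N=10, threshold=0.5):
    candidates = als_model.recommend(userid, user_items, N=N * 5)  # over-fetch candidates
    kept = [(itemid, score) for itemid, score in candidates
            if dislike_model.predict_dislike(userid, itemid) < threshold]  # hypothetical API
    return kept[:N]
```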

Since it is an easy change to make that multiple people are asking for, I'll add this sometime soon.

@lucidyan

@benfred
I think a lot of people previously used Spark's implementation, where this is supported out of the box, and they may already have built some business logic around negative feedback. FYI, in our case it showed some benefit in production. So this would be a reasonable step.

I admit I became one of those people: instead of convergence I got NaNs, negative loss, and confusing errors, until I understood what the problem was. Perhaps handling this type of error in the positive-confidence-only case would be a good start.
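Something like the following guard is presumably what's meant - a minimal sketch of a fail-fast check (hypothetical, not the library's actual code), so that unsupported negative confidences raise an error instead of silently turning into NaNs:

```python
import numpy as np
from scipy.sparse import csr_matrix


def check_nonnegative(item_users):
    # Hypothetical guard: reject negative confidence values while they are
    # unsupported, rather than letting the solver diverge into NaNs/negative loss.
    data = csr_matrix(item_users).data
    if np.any(data < 0):
        raise ValueError("negative confidence values are not supported - "
                         "remove them from the input matrix")
```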

@ravimody
Author

To add a data point, at iHeartRadio we had quite a bit of thumbs-down data. I found adding the negative information was useful, especially for users with complex taste in music. For example, a user who likes punk but thumbed down emo would understandably expect less emo in their recommendations/playlists.

@benfred
Owner

benfred commented May 26, 2018

I have a rough draft of this here - https://github.com/benfred/implicit/compare/negative_als_prefs

I ran a simple experiment to test this out using the movielens datasets. The idea is to pass in low ratings (1 or 2 stars) as negative items, and compare including them as negative values versus removing them entirely. I experimented with different confidence values (multiplying by a constant alpha).

Anyways - this code improved things in 3 cases and hurt performance in 5 cases. I will run some more tests on larger datasets; I only included the smaller movielens ones to test out quickly.

| dataset | alpha | p@10 with negatives | p@10 without negatives | ratio |
|---|---|---|---|---|
| movielens-100k | 20 | 0.3036 | 0.3238 | 0.9377 |
| movielens-1m | 20 | 0.2671 | 0.2813 | 0.9493 |
| movielens-100k | 10 | 0.3287 | 0.3402 | 0.9662 |
| movielens-1m | 10 | 0.2996 | 0.3075 | 0.9743 |
| movielens-100k | 5 | 0.3239 | 0.3190 | 1.0153 |
| movielens-1m | 5 | 0.3311 | 0.3330 | 0.9943 |
| movielens-100k | 2 | 0.2435 | 0.2432 | 1.0013 |
| movielens-1m | 2 | 0.3209 | 0.3205 | 1.0010 |
| movielens-100k | 1 | 0.1909 | 0.1909 | 1.0000 |
| movielens-1m | 1 | 0.2556 | 0.2556 | 1.0000 |

The code for this experiment is this:

import numpy as np

from implicit.evaluation import precision_at_k, train_test_split
from implicit.als import AlternatingLeastSquares
from implicit.datasets.movielens import get_movielens

import logging
logging.basicConfig(level=logging.DEBUG)


# Train model including negative prefs and report p@10
# Note: setting alpha=1 should have identical results (since missing entries are assumed
# negative with a confidence of 1)
SEED = 32
FACTORS = 128
REGULARIZATION = 50
USE_CG = True

stats = []
for ALPHA in [20, 10, 5, 2, 1]:
    for dataset in ["100k", "1m"]:
        movies, ratings = get_movielens(dataset)

        # remove ratings between 2 and 4 (ambiguous)
        # set ratings >= 4 as being positive
        # set ratings <= 2 as being negative
        ratings.data[(ratings.data > 2) & (ratings.data < 4)] = 0
        ratings.eliminate_zeros()
        ratings.data[ratings.data <= 2] = -ALPHA
        ratings.data[ratings.data >= 4] = ALPHA

        # generate a random test/train split (removing negatives from test set)
        np.random.seed(SEED)
        train, test = train_test_split(ratings)
        test.data[test.data < 0] = 0
        test.eliminate_zeros()

        # train a first model with the negatives in the training set, and get p@10
        np.random.seed(SEED)
        model = AlternatingLeastSquares(factors=FACTORS, regularization=REGULARIZATION,
                                        use_cg=USE_CG,
                                        calculate_training_loss=False)
        model.fit(train)
        p_neg = precision_at_k(model, train.T.tocsr(), test.T.tocsr(), K=10, num_threads=4)

        # train another model without the negatives and get p@10
        positive_train = train.copy()
        positive_train.data[train.data < 0] = 0
        positive_train.eliminate_zeros()

        np.random.seed(SEED)
        model = AlternatingLeastSquares(factors=FACTORS, regularization=REGULARIZATION,
                                        use_cg=USE_CG,
                                        calculate_training_loss=False)
        model.fit(positive_train)
        p_pos = precision_at_k(model, train.T.tocsr(), test.T.tocsr(), K=10, num_threads=4)

        # write out some stats
        print("ALPHA %s" % ALPHA)
        print("Have %i positive entries, %i negative entries" %
              (len(ratings.data[ratings.data > 0]), len(ratings.data[ratings.data < 0])))
        print("p@10 (including negatives): %.4f" % p_neg)
        print("p@10 (excluding negatives): %.4f" % p_pos)
        print("Ratio %.4f" % (p_neg/p_pos))

        stats.append(("movielens-" + dataset.ljust(4), str(ALPHA),
                      "%.4f" % p_neg, "%.4f" % p_pos, "%.4f" % (p_neg/p_pos)))

# print out markdown table body
for row in stats:
    print("| " + (" | ".join(row)) + " |")

It's possible I messed up somewhere, but I got similar results using both the CG and cholesky optimizer - and the change for supporting this in the cholesky optimizer is pretty simple.

Edit: A better test might be to check whether including the negative items prevents the withheld negative items from being recommended. Also, just testing on movielens probably isn't sufficient. Will try those out later.

@lucidyan

I decided to conduct an experiment on my own data. Here's what I got:

Parameters:

  • Regularization: 0.1
  • Iterations: 10
  • Factors: 40

Data:

  • Implicit feedback

  • Negative/Positive interactions ratio: 0.004266

  • Split by date ~ 90 / 10

  • Train data
    <3940135x25367 sparse matrix of type '<class 'numpy.float64'>'

  • Test data
    <3949967x25433 sparse matrix of type '<class 'numpy.float64'>'

It shows small improvements on my implicit data:

Implicit ALS (positive feedback only)
['recall_at', 'precision_at', 'f1_at', 'success_at']
[[0.0306  0.0897  0.04054 0.0897 ]
 [0.05156 0.07805 0.05359 0.1561 ]
 [0.06889 0.0713  0.06032 0.2139 ]
 [0.08381 0.06545 0.06312 0.2618 ]
 [0.09788 0.06148 0.06508 0.3074 ]
 [0.10962 0.05793 0.06561 0.3476 ]
 [0.12102 0.05513 0.06585 0.3859 ]
 [0.13202 0.05257 0.06573 0.4206 ]
 [0.14143 0.0505  0.06534 0.4545 ]]

Implicit ALS  (With negative feedback)
['recall_at', 'precision_at', 'f1_at', 'success_at']
[[0.03048 0.0902  0.04038 0.0902 ]
 [0.05267 0.0792  0.05477 0.1584 ]
 [0.07034 0.07193 0.06119 0.2158 ]
 [0.08487 0.06632 0.06405 0.2653 ]
 [0.09879 0.0619  0.06567 0.3095 ]
 [0.11111 0.0584  0.06631 0.3504 ]
 [0.12109 0.0552  0.06602 0.3864 ]
 [0.13186 0.05305 0.0661  0.4244 ]
 [0.14207 0.05084 0.06573 0.4576 ]]

Spark ALS (positive feedback only)
['recall_at', 'precision_at', 'f1_at', 'success_at']
[[0.00904 0.0285  0.01202 0.0285 ]
 [0.01881 0.02855 0.01948 0.0571 ]
 [0.02917 0.02933 0.02495 0.088  ]
 [0.04095 0.03125 0.03045 0.125  ]
 [0.05183 0.03172 0.03395 0.1586 ]
 [0.06395 0.03245 0.03733 0.1947 ]
 [0.07618 0.03337 0.04047 0.2336 ]
 [0.08914 0.03422 0.04343 0.2738 ]
 [0.1021  0.03457 0.04563 0.3111 ]]

Spark ALS (With negative feedback)
['recall_at', 'precision_at', 'f1_at', 'success_at']
[[0.00886 0.0272  0.01205 0.0272 ]
 [0.01921 0.02985 0.02024 0.0597 ]
 [0.02903 0.0306  0.02562 0.0918 ]
 [0.03983 0.0312  0.03004 0.1248 ]
 [0.05142 0.03208 0.03419 0.1604 ]
 [0.06304 0.03297 0.03764 0.1978 ]
 [0.0752  0.03357 0.04061 0.235  ]
 [0.08738 0.03406 0.04312 0.2725 ]
 [0.10022 0.0345  0.04544 0.3105 ]]

But the speed left much to be desired:
01:52/05:36 (Negative Feedback)

However, I understand this was a draft implementation.

@lucidyan

@benfred
I also think that MovieLens wasn't the best dataset. The ratio of negatives (<= 2) to positives (>= 4) there is 40 (1M) and 27 (100K)!

@benfred
Owner

benfred commented May 28, 2018

@lucidyan: thanks for running the experiment on your data! If I'm reading that right, implicit seems to generate more accurate recommendations than Spark? =)

I think movielens has a pretty ordinary ratings distribution. I might be calculating this wrong - but for the 1m dataset it seems like ~16% are negative (<=2 stars) and ~57% are positive (>= 4 stars):

In [1]: from implicit.datasets.movielens import get_movielens

In [2]: _, ratings = get_movielens("100k")

In [3]: len(ratings.data), len(ratings.data[ratings.data <= 2]), len(ratings.data[ratings.data >= 4])
Out[3]: (100000, 17480, 55375)

In [4]: _, ratings = get_movielens("1m")

In [5]: len(ratings.data), len(ratings.data[ratings.data <= 2]), len(ratings.data[ratings.data >= 4])
Out[5]: (1000209, 163731, 575281)

A long time ago I wrote a blog post looking at recommender dataset distributions: https://www.benfrederickson.com/rating-set-distributions/ I looked at datasets from netflix/yelp/amazon/reddit and found they all broke down roughly like that. (Also datasets from companies I’ve worked at have been similarly skewed).

I still want to try out different datasets though, movielens seems to be truncated to only contain users with at least 20 ratings.

benfred added a commit that referenced this issue May 29, 2018
Allow negative confidence values to be passed in, signifying that the
user disliked the item. This lets us set higher confidence
values for the known negatives.

Experiments here indicate setting known negatives to a higher confidence
level slightly reduces p@10, but significantly reduces the likelihood
of withheld negative items being recommended: #114
@benfred
Owner

benfred commented May 29, 2018

I ran this also on the ml-10m and ml-20m datasets:

| dataset | alpha | p@10 with negatives | p@10 without negatives | ratio |
|---|---|---|---|---|
| movielens-100k | 20 | 0.3034 | 0.3233 | 0.9386 |
| movielens-1m | 20 | 0.2672 | 0.2812 | 0.9500 |
| movielens-10m | 20 | 0.2734 | 0.2783 | 0.9824 |
| movielens-20m | 20 | 0.2707 | 0.2742 | 0.9873 |
| movielens-100k | 10 | 0.3287 | 0.3402 | 0.9662 |
| movielens-1m | 10 | 0.2997 | 0.3074 | 0.9747 |
| movielens-10m | 10 | 0.2954 | 0.2965 | 0.9962 |
| movielens-20m | 10 | 0.2900 | 0.2904 | 0.9985 |
| movielens-100k | 5 | 0.3239 | 0.3190 | 1.0153 |
| movielens-1m | 5 | 0.3312 | 0.3331 | 0.9942 |
| movielens-10m | 5 | 0.2958 | 0.2954 | 1.0014 |
| movielens-20m | 5 | 0.2902 | 0.2896 | 1.0021 |
| movielens-100k | 2 | 0.2435 | 0.2432 | 1.0013 |
| movielens-1m | 2 | 0.3209 | 0.3205 | 1.0010 |
| movielens-10m | 2 | 0.2794 | 0.2794 | 1.0000 |
| movielens-20m | 2 | 0.2661 | 0.2659 | 1.0008 |
| movielens-100k | 1 | 0.1909 | 0.1909 | 1.0000 |
| movielens-1m | 1 | 0.2556 | 0.2556 | 1.0000 |
| movielens-10m | 1 | 0.3081 | 0.3081 | 1.0000 |
| movielens-20m | 1 | 0.2600 | 0.2600 | 1.0000 |

Including the negative ratings seems to reduce p@10 in most cases, but the impact is pretty small for the larger datasets.

However, the interesting thing is that including the negatives with higher confidence also seems to reduce how often the withheld negatives in the test set get recommended. Changing the above code to only test for withheld negatives (by filtering out positive test data with test.data[test.data > 0] = 0), the same code then computes something like the false discovery rate@k:

| dataset | alpha | fdr@10 with negatives | fdr@10 without negatives | ratio |
|---|---|---|---|---|
| movielens-100k | 20 | 0.0329 | 0.0391 | 0.8426 |
| movielens-1m | 20 | 0.0175 | 0.0251 | 0.6977 |
| movielens-100k | 10 | 0.0344 | 0.0406 | 0.8482 |
| movielens-1m | 10 | 0.0180 | 0.0238 | 0.7571 |
| movielens-100k | 5 | 0.0406 | 0.0485 | 0.8358 |
| movielens-1m | 5 | 0.0180 | 0.0220 | 0.8186 |
| movielens-100k | 2 | 0.0402 | 0.0438 | 0.9174 |
| movielens-1m | 2 | 0.0193 | 0.0203 | 0.9520 |
| movielens-100k | 1 | 0.0348 | 0.0348 | 1.0000 |
| movielens-1m | 1 | 0.0178 | 0.0178 | 1.0000 |
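For reference, the change to the earlier experiment script is roughly this (a sketch reusing its variables, not the exact code used):

```python
# fdr@10 variant: keep only the withheld *negative* test entries, so that
# precision_at_k now measures how often known dislikes land in the top 10
# (lower is better). This replaces the earlier test-set filtering step.
train, test = train_test_split(ratings)
test.data[test.data > 0] = 0      # drop positives instead of negatives
test.eliminate_zeros()

model.fit(train)
fdr_at_10 = precision_at_k(model, train.T.tocsr(), test.T.tocsr(), K=10, num_threads=4)
```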

So for the movielens datasets, p@10 is only slightly reduced by including known negatives at a higher confidence level - but withheld negatives are much less likely to be recommended.

I also ran this on the reddit dataset I just added here, and got similar results. Setting alpha=20:

Reddit, alpha=20
Have 19446392 positive entries, 3633062 negative entries
p@10 (including negatives): 0.0617
p@10 (excluding negatives): 0.0618
fdr@10 (including negatives): 0.0041
fdr@10 (excluding negatives): 0.0075
Ratio 0.5383

So p@10 for the reddit dataset here is basically unchanged, while the negatives are almost half as likely to be recommended.

Anyways, I'm probably going to merge the PR. It doesn't seem to hurt p@K much - and can significantly reduce the number of bad results being recommended.

edit: link to pr #119

benfred added a commit that referenced this issue Jun 4, 2018
Allow negative confidence values to be passed in, signifying that the
user disliked the item. This lets us set higher confidence
values for the known negatives.

Experiments here indicate setting known negatives to a higher confidence
level slightly reduces p@10, but significantly reduces the likelihood
of withheld negative items being recommended: #114
@benfred
Owner

benfred commented Jun 4, 2018

Done here: #119. Support will be in the next version.
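For anyone finding this later: usage is just passing negative values in the item/user confidence matrix, the same way the experiment script above does. A minimal sketch (the toy data here is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# rows = items, cols = users (the layout the experiment script above uses);
# positive entries mean "liked with this confidence", negative entries mean
# "disliked with this confidence", and zeros stay as unobserved.
item_users = csr_matrix(np.array([
    [4.0, 0.0, 2.0],     # item 0: liked by users 0 and 2
    [-4.0, 1.0, 0.0],    # item 1: disliked by user 0, liked by user 1
]))

model = AlternatingLeastSquares(factors=32)
model.fit(item_users)
```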
