
Feature/add_bias_terms_to_als #310

Closed
howtodowtle wants to merge 10 commits

Conversation

howtodowtle

Added optional user and item bias terms to als.py as suggested by ita9naiwa in issue #176 and already implemented in LMF. Set use_bias=True to train with bias terms (not implemented for GPU yet).
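For reference, the proposed usage would look roughly like the following. This is a sketch based on the description above: use_bias is the flag added in this PR (CPU only), not an option in the released library, and the variable names are illustrative.

from implicit.als import AlternatingLeastSquares

# use_bias is the new flag proposed in this PR; the other arguments are standard
model = AlternatingLeastSquares(factors=64, regularization=0.01, use_bias=True)
model.fit(interactions)  # interactions: scipy.sparse confidence matrix, in the orientation the installed version expects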

howtodowtle and others added 7 commits December 19, 2019 16:00
default value: False
... so that second-to-last element in self.item_factors becomes a learned item bias term
... so that last element in self.user_factors becomes a learned user bias term
Training with bias terms not implemented for GPU yet.
@dselivanov

This is far from being the same as LMF. Just setting part of the latent factors to 1 doesn't do anything useful.

@benfred
Owner

benfred commented Jan 2, 2020

I'm also not sure that this is doing the right thing here - just setting the value back to 1 after each iteration isn't quite right, afaict. We shouldn't have to set it back to 1 after each iteration, because the least squares regression at each iteration shouldn't have changed that value; the fact that it does change means that the coefficients on the other parameters aren't totally correct.

@ita9naiwa
Collaborator

ita9naiwa commented Jan 16, 2020

Sorry for the late response.
Setting the last element of the user/item factors to 1.0 will work together with setting the last / second-to-last element of the item/user vectors to 1.0 in the first place, as in

implicit/implicit/lmf.pyx

Lines 143 to 152 in fbed621

if self.item_factors is None:
    self.item_factors = np.random.normal(size=(items, self.factors + 2)).astype(np.float32)
    self.item_factors[:, -1] = 1.0

    # set factors to all zeros for items without any ratings
    self.item_factors[item_counts == 0] = np.zeros(self.factors + 2)

if self.user_factors is None:
    self.user_factors = np.random.normal(size=(users, self.factors + 2)).astype(np.float32)
    self.user_factors[:, -2] = 1.0

It's actually the same representation as ALS with bias (though I'm not completely sure).
Also, I don't think adding bias terms to the ALS model will give performance improvements.
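For reference, the analogous initialization on the ALS side (a sketch, not the PR's actual diff) would reserve two extra columns per factor matrix, with the learned bias in one and a constant 1.0 in the other, following the layout described in the commit messages above (item bias second-to-last, user bias last):

import numpy as np

factors, users, items = 64, 10_000, 5_000

# two extra columns: one for the learned bias, one pinned to 1.0
item_factors = np.random.rand(items, factors + 2).astype(np.float32) * 0.01
item_factors[:, -1] = 1.0   # constant column that multiplies the user bias

user_factors = np.random.rand(users, factors + 2).astype(np.float32) * 0.01
user_factors[:, -2] = 1.0   # constant column that multiplies the item bias

With this layout a user-item score is x_u · y_i + b_i + b_u, i.e. the dot product of the extended vectors.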


@SavanGowda left a comment


Hi @ita9naiwa

Thank you very much for this implementation of ALS with bias. I have a question:
in this change, shouldn't we add the bias dimensions to the factors (self.factors + 2), as done in LMF?

self.user_factors = np.random.rand(users, self.factors + 2).astype(self.dtype) * 0.01
self.item_factors = np.random.rand(items, self.factors + 2).astype(self.dtype) * 0.01

Thanks in advance,
Savan

@ita9naiwa
Collaborator

shouldn't we add the bias to the factors (self.factors+2) as done in LMF?
I think we need to. I missed that.

Could you please add some comparison between ALS with bias and plain ALS on datasets like MovieLens 20M or Last.fm?

@IanEisenberg

This functionality would be very helpful! In addition to the changes already proposed, wrapping similar_items and rank_items with options aimed at removing popularity bias would also be useful.

@lucatrovato

lucatrovato commented Mar 29, 2021

Hi all,
can this be reviewed? It is a very interesting feature that would add a lot of value to the model.

I had a look at the code, and in principle the approach implemented by @ita9naiwa is correct. However, having these additional ones in the last and second-to-last positions means the regularization terms become very large, on the order of the number of users and items.

In theory, one should exclude these elements from the regularization calculation, but I am not sure how that would fit into the overall implementation.
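To make the size of that extra term concrete (a back-of-the-envelope sketch, writing $\tilde{x}_u$ and $\tilde{y}_i$ for the factor vectors extended with a bias and a constant 1): each extended vector contributes an extra $\lambda \cdot 1^2$ to the penalty, so

$$\lambda\Big(\sum_u \|\tilde{x}_u\|^2 + \sum_i \|\tilde{y}_i\|^2\Big) = \lambda\Big(\sum_u \big(\|x_u\|^2 + b_u^2\big) + \sum_i \big(\|y_i\|^2 + b_i^2\big)\Big) + \lambda\,(\#\text{users} + \#\text{items}).$$

The trailing $\lambda(\#\text{users} + \#\text{items})$ piece is harmless only if the 1 entries are truly held fixed; if the solver treats them as free parameters it will shrink them toward zero, which is consistent with the drift away from 1 that benfred describes above.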

@dselivanov

dselivanov commented Mar 29, 2021 via email

@lucatrovato

lucatrovato commented Mar 29, 2021

@dselivanov,
thanks a lot for your answer.

Would you mind expanding on your comment? If we write the biased ALS equation:

$$\min_{x_*,\,y_*,\,b_*}\ \sum_{u,i} c_{ui}\left(p_{ui} - x_u^\top y_i - b_u - b_i\right)^2 + \lambda\Big(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 + \sum_u b_u^2 + \sum_i b_i^2\Big)$$

and we define the extended vectors, which correspond to the implementation suggested by @ita9naiwa,

$$\tilde{x}_u = [\,x_u,\ 1,\ b_u\,], \qquad \tilde{y}_i = [\,y_i,\ b_i,\ 1\,]$$

We then have:

$$\tilde{x}_u^\top \tilde{y}_i = x_u^\top y_i + b_u + b_i, \qquad \text{so the objective becomes} \qquad \sum_{u,i} c_{ui}\left(p_{ui} - \tilde{x}_u^\top \tilde{y}_i\right)^2 + \lambda\Big(\sum_u \|\tilde{x}_u\|^2 + \sum_i \|\tilde{y}_i\|^2\Big)$$

which, apart from the issues with the regularization term, looks to me like it leads to the right implementation. There is a slight mismatch in the number of factors (which should be handled better), but in reality one of the slots in the factor array is occupied by the bias and the other by the constant one.

Am I missing something here?
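As a quick sanity check of the algebra above, a toy numpy snippet (not part of the PR), with the bias and the constant 1 placed as in @ita9naiwa's layout:

import numpy as np

factors = 8
x_u, y_i = np.random.rand(factors), np.random.rand(factors)
b_u, b_i = 0.3, -0.2

# user carries [x_u, 1, b_u]; item carries [y_i, b_i, 1]
x_ext = np.concatenate([x_u, [1.0, b_u]])
y_ext = np.concatenate([y_i, [b_i, 1.0]])

assert np.isclose(x_ext @ y_ext, x_u @ y_i + b_u + b_i)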

@dselivanov

dselivanov commented Mar 30, 2021

You can take a look here, which is mostly correct:

  • need to subtract the user bias (beta) from p_ui when you solve the system for item embeddings
  • need to subtract the item bias (gamma) from p_ui when you solve the system for user embeddings

What the article misses is that when you subtract these biases from p_ui (which is very sparse), you ruin the sparsity. Then you need to solve a much larger least squares problem for each user/item unless you use some smart cache/init.

Long story short - this PR requires quite some work on the C/CUDA side in order to solve the problem with biases correctly.
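Spelling this out (a sketch of the one-coordinate-per-side formulation being referred to, not code from this PR): write $C^u$ for the diagonal confidence matrix of user $u$, $p(u)$ for the binary preference vector, $\gamma$ for the vector of item biases, and let $\tilde{Y}$ stack the item vectors extended with a constant 1, $\tilde{y}_i = [\,y_i,\ 1\,]$, so that the user bias $\beta_u$ is learned jointly with $x_u$. The per-user solve becomes

$$\begin{bmatrix} x_u \\ \beta_u \end{bmatrix} = \left(\tilde{Y}^\top C^u \tilde{Y} + \lambda I\right)^{-1} \tilde{Y}^\top C^u \left(p(u) - \gamma\right).$$

Since $p(u)$ is sparse but $\gamma$ is dense, the right-hand side can no longer be accumulated from just the items the user interacted with, unless terms like $\tilde{Y}^\top \gamma$ are precomputed once per sweep - the "smart cache/init" mentioned above.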

@lucatrovato

Hi @dselivanov,
I already went through that page, and the approach there is different: the user and item factor vectors are not fully extended with two entries (a 1 and the bias, properly placed). The implementation from @ita9naiwa does exactly that, and it lets you avoid ruining the sparsity of p_ui. The only difference is in the regularization term, which will contain as many 1's as there are users and items, leading to a very different regularization term for equivalent problems.

@benfred
Owner

benfred commented Apr 2, 2021

@lucatrovato I agree with you that once these factors are calculated, the loss function calculations are basically correct (aside from regularization). However, the method for calculating these factors in this PR is wrong: if it were correct, we wouldn't have to set the user/item bias terms back to 1 every iteration after calculating the item/user factors - they could just be set once at the beginning. When learning the user factors, we only want to update Xu and the user bias (but not update the '1' value that corresponds to the item bias).

Also - does anyone have any proof that bias terms have any benefit for implicit feedback models? I realize that they have a benefit for ratings prediction tasks with datasets like MovieLens or the Netflix Prize (and other explicit feedback scenarios), but implicit feedback models are different in that the missing data indicates a true 0 (no interactions) rather than an unknown rating. I haven't seen any benefit myself from bias terms in limited testing with the BPR model - and I'm not sure that adding this feature is worth the effort, tbh.

@skrypkvi

skrypkvi commented Apr 2, 2021

I agree with @dselivanov that just adding two coordinates and re-assigning one of them to 1 is not sufficient.

Assume that for user u we extend the coordinates as

$$\tilde{x}_u = [\,x_u,\ b_u,\ 1\,]$$

and for item i as

$$\tilde{y}_i = [\,y_i,\ 1,\ b_i\,],$$

so that $\tilde{x}_u^\top \tilde{y}_i = x_u^\top y_i + b_u + b_i$.

Then, as @lucatrovato pointed out, we can write the objective function as

$$\sum_{u,i} c_{ui}\left(p_{ui} - \tilde{x}_u^\top \tilde{y}_i\right)^2 + \lambda\Big(\sum_u \|\tilde{x}_u\|^2 + \sum_i \|\tilde{y}_i\|^2\Big)$$

with the exception (*) that we do not want to regularize with lambda the coordinate corresponding to 1 (we will correct this later). For now we proceed as usual: we take the derivative of the objective function and write the solution in matrix form:

$$\tilde{x}_u = \left(\tilde{Y}^\top C^u \tilde{Y} + \lambda I\right)^{-1} \tilde{Y}^\top C^u\, p(u)$$

In order to take care of (*) and not penalize the last coordinate in $\tilde{x}_u$, we define

$$\tilde{I} = \operatorname{diag}(1, \ldots, 1, 0),$$

which is equal to $I$ but with the last element on the diagonal set to 0, and use $\lambda\tilde{I}$ in place of $\lambda I$ above.
So far this has been basic algebra with equalities, so one should not have to worry about "interaction of variables" or "side effects."
Now, in code, one has to disable the impact of the last coordinate corresponding to "1":

  • one has to remove the lambda regularization on the "1" coordinate, as described in the formula above
  • in the conjugate gradient we do not care about the residual (r) for this "1" coordinate, so we have to zero it out as well

Therefore, in my opinion the following additional code changes should suffice in als.py (CPU version):

def least_squares_cg(Cui, X, Y, regularization, num_threads=0, cg_steps=3):

a) In the line YtY = Y.T.dot(Y) + regularization * np.eye(factors, dtype=Y.dtype), instead of taking np.eye we need to set the last diagonal element (corresponding to the "1" in x_u) to 0, so that we disable the regularization of "1".

b) In the conjugate gradient loop we also need to disable the tracking of the fit for the "1" coordinate:
  • prior to line 378, p = r.copy(), we need to set the last coordinate of the residual vector r to 0 (corresponding to "1")
  • after line 395, r -= alpha * Ap, we again overwrite the last coordinate of the residual with 0

With these changes we ensure the conjugate gradient is not impacted by the dummy coordinate "1".
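A rough, self-contained sketch of what (a) and (b) could look like for the pure-Python CG path is below. It is an illustration under this thread's assumptions, not the library's actual code: least_squares_cg_bias and the local nonzeros helper are hypothetical names, the last column of X is taken to be the pinned "1" coordinate, and the caller is assumed to have initialized that column to 1.0. (With the PR's actual layout the pinned column is last for items but second-to-last for users, so in practice the pinned index would need to be a parameter.)

import numpy as np


def nonzeros(m, row):
    # iterate over (column index, value) of the nonzero entries of a CSR row
    for idx in range(m.indptr[row], m.indptr[row + 1]):
        yield m.indices[idx], m.data[idx]


def least_squares_cg_bias(Cui, X, Y, regularization, cg_steps=3):
    # Solve for X given Y with conjugate gradient, keeping the last column of X
    # pinned to 1 (the dummy coordinate): it is neither regularized nor updated.
    users, factors = X.shape

    # (a) regularize every coordinate except the pinned "1" coordinate
    reg = regularization * np.eye(factors, dtype=Y.dtype)
    reg[-1, -1] = 0.0
    YtY = Y.T.dot(Y) + reg

    for u in range(users):
        x = X[u]                      # view: updates write back into X

        # residual r = b - A x, accumulated over the user's nonzero items only
        r = -YtY.dot(x)
        for i, confidence in nonzeros(Cui, u):
            r += (confidence - (confidence - 1) * Y[i].dot(x)) * Y[i]

        # (b) the pinned coordinate is fixed, so drop it from the residual
        r[-1] = 0.0

        p = r.copy()
        rsold = r.dot(r)
        if rsold < 1e-20:
            continue

        for _ in range(cg_steps):
            Ap = YtY.dot(p)
            for i, confidence in nonzeros(Cui, u):
                Ap += (confidence - 1) * Y[i].dot(p) * Y[i]

            alpha = rsold / p.dot(Ap)
            x += alpha * p
            r -= alpha * Ap
            r[-1] = 0.0               # keep the solve away from the pinned coordinate

            rsnew = r.dot(r)
            if rsnew < 1e-20:
                break
            p = r + (rsnew / rsold) * p
            rsold = rsnew

Because the residual's last coordinate is zeroed before the loop and after every update, the search direction p never develops a component along the pinned coordinate, so X[:, -1] stays at 1.0 throughout the solve.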

@benfred: regarding the benefits of an intercept: I believe it is helpful to be able to say "I recommend you this movie because it is popular and 80% of people watched it" vs. "I recommend you this movie because of your affinity to it". That is what the bias helps to do; without it one cannot make such a separation.

@ita9naiwa
Collaborator

@dselivanov @skrypkvi
I made a wrong point earlier: it behaves awkwardly when we use conjugate gradient. I found that the 1 coordinate affects CG and makes convergence slower.

@ita9naiwa
Collaborator

ita9naiwa commented Apr 2, 2021

IMHO, I personally see no performance improvements from bias terms when we deal with implicit datasets, unlike explicit datasets, where a lot of the gain comes from them.

I'd love to see someone (perhaps @skrypkvi) run a benchmark on some publicly available dataset with and without bias terms. I think there's no serious problem if we use exact least squares for the model update.

@dselivanov

With the proposed reparametrization, biases need to be subtracted from p_ui on the RHS of the least squares equation, which destroys sparsity.
As for the usefulness: it naturally allows recommending popular items (item biases are usually strongly correlated with popularity) to users with few or no interactions. But the overall NDCG/MAP gain is tiny to moderate.

@julioasotodv

julioasotodv commented Apr 2, 2021

@benfred as for whether biases are useful for implicit datasets: sometimes they should be, especially if you have very heavy users (with tons of interactions on quite a few items compared to regular users) or very popular items. An example is the Steam V1 dataset (https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data), where some players with tons of game hours (minutes in the dataset) completely dominate the dataset, alongside very popular games that are recommended to pretty much everyone just because they are popular (so the preferences almost fade away).

With that said, you can always preprocess the dataset to normalize this (for instance, subtracting the mean for each user row/column, or using IDF weighting similar to how it is done in the examples folder).
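A small sketch of that kind of preprocessing, using the weighting helpers that ship with the library (the K1/B values are only illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from implicit.nearest_neighbours import bm25_weight  # tfidf_weight is also available

# toy play-count matrix standing in for something like the Steam data
plays = csr_matrix(np.array([[120.0, 0.0, 3.0],
                             [  0.0, 1.0, 0.0],
                             [500.0, 2.0, 0.0]], dtype=np.float32))

# dampen heavy users / very popular items before fitting ALS,
# as an alternative to learning explicit bias terms
weighted = bm25_weight(plays, K1=100, B=0.8).tocsr()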

@skrypkvi

skrypkvi commented Apr 6, 2021

@dselivanov :

With the proposed reparametrization biases need to be subtracted from p_u_i on the RHS of the least squares equation. Which destroys sparsity.

-> I am quite certain the solution I propose for the conjugate gradient (with 2 coordinates added) is one-to-one with the conjugate gradient derived for the problem defined in #176, where 1 coordinate is added (so the subtraction of the bias is taken care of). I will try to upload the derivation in the next few days. Who can check it?

@ita9naiwa :

I'd love to see(might be @skrypkvi) conducts benchmark on some publicly available dataset with/without bias term. I think there's no serious problem if we use exact least squares for model update.

-> I believe that once the equivalence above is proven, there is no need to conduct an empirical analysis.
-> I am afraid one cannot use the exact least squares solution (i.e., skip the conjugate gradient), since the matrices are large, so one needs a numerical method.

@dselivanov

dselivanov commented Apr 6, 2021

problem defined in #176 where 1 coordinate is added

This is applicable to the LHS, but not to the RHS of the least squares equation.

@skrypkvi

skrypkvi commented Apr 6, 2021

@skrypkvi

skrypkvi commented May 10, 2021

Dear all,

As promised, you can find a detailed explanation of the bias integration in the document https://github.com/skrypkvi/recommender/blob/main/Recommender_with_bias.pdf.
The structure is the following:
Sections 1-2: intro, motivation, and an overview of the original algorithm from "Collaborative Filtering for Implicit Feedback Datasets" to introduce notation
Section 3.1: description of the original proposal at http://activisiongamescience.github.io/2016/01/11/Implicit-Recommender-Systems-Biased-Matrix-Factorization/#Implict-ALS-with-Biases (1 coordinate added at a time), which has so far been supported by the community
Section 3.2: description of the alternative (preferred) proposal discussed in this thread, with 2 coordinates added, and an analytical proof of its equivalence to the original, including the conjugate gradient method
Section 4: final list of the required changes in the package

Since the changes are so small, it would be great if the community could endorse the proposal so that the bias terms can be added to the package. As stated in the document, they would be a great added value when it comes to the interpretation of the ratings.

@lucatrovato

Hi all,
I went through the document, and in my opinion it completely and correctly describes the idea that @ita9naiwa and I mentioned, including the fix needed in the conjugate gradient algorithm.

Any opinion on this @benfred or @dselivanov?

@benfred deleted the branch benfred:master on October 2, 2021
@benfred closed this on Oct 2, 2021