Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added dot product everywhere were cosine similarity was used #676

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

joancf
Copy link

@joancf joancf commented Apr 5, 2024

Related Issue

Support for inner product similarity measures

Changes

almost everywhere (except some tests) where there was a cosine reference a parallel dot function/option is added

What changed?

most files where cosine was used

Testing and Validation

Not done , needs to be done, it is pending.
I'll try to generate and replace the plugin in my ES isntallation and check it.
but not sure which prodedures I must follow.

Copy link
Owner

@alexklibisz alexklibisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I left some initial comments.

Here are some other items that don't quite fit as a comment:

  • what happens when the vectors are not normalized? I don't want to return a negative score to Elasticsearch. I'm not sure how that would behave. I would rather detect this failure state and return an explicit exception. I'm not totally sure where we should detect it. I'm thinking maybe add an if-statement here and here. If the similarity is < 0, we throw an ElastiknnRuntimeException.
  • We should add some regression tests to the RecallSuite, similar to these.

docs/_posts/2021-07-30-how-does-elastiknn-work.md Outdated Show resolved Hide resolved
docs/_posts/2021-07-30-how-does-elastiknn-work.md Outdated Show resolved Hide resolved
Three of these are problematic with respect to this scoring requirement.

Specifically, L1 and L2 are generally defined as _distance_ functions, rather than similarity functions,
which means that higher relevance (i.e., lower distance) yields _lower_ scores.
Cosine similarity is defined over $$[-1, 1]$$, and we can't have negative scores.

Dot similarity is defined over $$[-1, 1]$$, and we can't have negative scores, if vectors have a magnitude of 1, then it's equivalent to cosine similarity.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rewrite this part slightly:

Cosine similarity is defined over $$[-1, 1]$$.
Dot similarity is defined over $$[-1, 1]$$, and if vectors have a magnitude of 1, then it's equivalent to cosine similarity.
Elasticsearch does not allow negative scores.
To work around this, Elastiknn applies simple transformations to produce L1, L2, Cosine, and Dot scores in accordance with the Elasticsearch requirements.

|L1|`1 / (1 + l1 distance)`|0|1|
|L2|`1 / (1 + l2 distance)`|0|1|

Dot similirarity will produce negative scores if the vectors are not normalized
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure to catch this and return an error in the plugin.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added max(0,distance) so, it will always be positive.

docs/pages/api.md Outdated Show resolved Hide resolved
docs/pages/api.md Outdated Show resolved Hide resolved
docs/pages/index.md Outdated Show resolved Hide resolved
@@ -110,13 +110,16 @@ class XContentCodecSuite extends AnyFreeSpec with Matchers {
("L2", Similarity.L2),
("cosine", Similarity.Cosine),
("Cosine", Similarity.Cosine),
("COSINE", Similarity.Cosine)
("COSINE", Similarity.Cosine),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment for this file: Let's also add a block of tests below similar to this:

"CosineLsh" - {
"roundtrip" in {
for {
_ <- 1 to 100
(dims, l, k) = (rng.nextInt(), rng.nextInt(), rng.nextInt())
mapping = Mapping.CosineLsh(dims, l, k)
expected = Json.obj(
"type" -> "elastiknn_dense_float_vector".asJson,
"elastiknn" -> Json.obj(
"model" -> "lsh".asJson,
"dims" -> dims.asJson,
"similarity" -> "cosine".asJson,
"L" -> l.asJson,
"k" -> k.asJson
)
)
} {
roundtrip[Mapping](expected, mapping)
}
}
"errors" in {
val ex1 = intercept[XContentParseException](decodeUnsafeFromString[Mapping]("""
|{
| "type": "elastiknn_dense_float_vector",
| "elastiknn": {
| "model": "lsh",
| "dims": 33,
| "similarity": "cosine",
| "L": "33",
| "k": 3
| }
|}
|""".stripMargin))
ex1.getMessage shouldBe "Expected [L] to be one of [VALUE_NUMBER] but found [VALUE_STRING]"
}
}

joancf and others added 10 commits April 5, 2024 16:38
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
…tLshModel.java

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
@joancf
Copy link
Author

joancf commented Apr 5, 2024

Hi @alexklibisz
First thanks for fast response, and the plug-in itself!!
let me apologize , for sending the pull request before deeply testing it. It's my first time doing things in Scala, and I'm a bit confused on how to do some things.
Finally i could compile and build the zip in my side, so I can try to run the plugin! (and check if it works and the performance)
But as I said my knowledge of scala is my knowledge of Java. ... and for some of the things you ask me, i'm not able to do them. (basically testing is where I think it will take a me a while to understand everything!)

About the changes you asked. I did all of them .
One I did in a different way was to ensure that the similarity was returning a positive value with max(0,1+dotProduct)
In this way we don't raise an exception and negative values will have a 0 similartiy

Thanks
Joan

Copy link
Owner

@alexklibisz alexklibisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fixes! I can help with the tests once we're happy with the implementation.

I'll have to think about more about the strategy for accounting for non-normalized vectors. As a user I think I would rather get a failed response than a bunch of missing results. Also, we say that the similarity is in [0, 2], but AFAICT the current strategy would allow for scores > 2, wouldn't it?

@@ -446,9 +470,12 @@ The exact transformations are described below.
|Jaccard|N/A|0|1.0|
|Hamming|N/A|0|1.0|
|Cosine[^note-angular-cosine]|`cosine similarity + 1`|0|2|
|Dot[^note-dot-product]|`Dot similarity + 1`|0|2|
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be something wrong with the table formatting:
image

Also, if we end up using it, we should describe the updated transformation here: max(0, 1 + dot product)

@alexklibisz
Copy link
Owner

Hi @joancf can you try adding the exceptions for scores outside [0, 2]? If you're having trouble I can try this, but probably not til the weekend or next week.

@joancf
Copy link
Author

joancf commented Apr 17, 2024

hi. @alexklibisz my company doesn't want to use it. So, I will finish it out-hours
I'll do my best with exceptions and testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants