Added dot product everywhere where cosine similarity was used #676

joancf · 2024-04-05T09:49:33Z

Related Issue

Support for inner product similarity measures (dot product) #265

Changes

Almost everywhere cosine was referenced (except in some tests), a parallel dot product function or option has been added.

What changed?

Most files where cosine was used.

Testing and Validation

Not done yet; testing is still pending.
I'll try to build the plugin, replace it in my Elasticsearch installation, and check it,
but I'm not sure which procedures I must follow.

alexklibisz

Thanks for the contribution. I left some initial comments.

Here are some other items that don't quite fit as a comment:

- What happens when the vectors are not normalized? I don't want to return a negative score to Elasticsearch, and I'm not sure how that would behave. I would rather detect this failure state and throw an explicit exception. I'm not totally sure where we should detect it. I'm thinking maybe add an if-statement here and here. If the similarity is < 0, we throw an ElastiknnRuntimeException.
- We should add some regression tests to the RecallSuite, similar to these.

docs/_posts/2021-07-30-how-does-elastiknn-work.md

alexklibisz · 2024-04-05T13:55:06Z

docs/_posts/2021-07-30-how-does-elastiknn-work.md

 Three of these are problematic with respect to this scoring requirement.

 Specifically, L1 and L2 are generally defined as _distance_ functions, rather than similarity functions,
 which means that higher relevance (i.e., lower distance) yields _lower_ scores.
 Cosine similarity is defined over $$[-1, 1]$$, and we can't have negative scores.
-
+Dot similarity is defined over $$[-1, 1]$$, and we can't have negative scores, if vectors have a magnitude of 1, then it's equivalent to cosine similarity.


I would rewrite this part slightly:

Cosine similarity is defined over $$[-1, 1]$$. Dot similarity is defined over $$[-1, 1]$$, and if vectors have a magnitude of 1, then it's equivalent to cosine similarity. Elasticsearch does not allow negative scores. To work around this, Elastiknn applies simple transformations to produce L1, L2, Cosine, and Dot scores in accordance with the Elasticsearch requirements.
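
As a short worked check of the statement above, using only the standard definitions (this is not text from the Elastiknn docs): for two vectors $$u$$ and $$v$$,

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad \lvert u \cdot v \rvert \le \lVert u \rVert \, \lVert v \rVert$$

So when $$\lVert u \rVert = \lVert v \rVert = 1$$, the dot product equals the cosine similarity and lies in $$[-1, 1]$$, and adding 1 gives a score in $$[0, 2]$$. Without normalization, the dot product can fall anywhere in $$[-\lVert u \rVert \lVert v \rVert, \lVert u \rVert \lVert v \rVert]$$, so the $$[-1, 1]$$ range only holds for unit vectors.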

alexklibisz · 2024-04-05T13:56:30Z

docs/pages/api.md

 |L1|`1 / (1 + l1 distance)`|0|1|
 |L2|`1 / (1 + l2 distance)`|0|1|

+Dot similirarity will produce negative scores if the vectors are not normalized


We should make sure to catch this and return an error in the plugin.

joancf · 2024-04-08

I added max(0, distance), so the score will never be negative.

docs/pages/api.md

docs/pages/index.md

alexklibisz · 2024-04-05T14:01:25Z

elastiknn-api4s/src/test/scala/com/klibisz/elastiknn/api/XContentCodecSuite.scala

@@ -110,13 +110,16 @@ class XContentCodecSuite extends AnyFreeSpec with Matchers {
          ("L2", Similarity.L2),
          ("cosine", Similarity.Cosine),
          ("Cosine", Similarity.Cosine),
-          ("COSINE", Similarity.Cosine)
+          ("COSINE", Similarity.Cosine),


General comment for this file: Let's also add a block of tests below similar to this:

elastiknn/elastiknn-api4s/src/test/scala/com/klibisz/elastiknn/api/XContentCodecSuite.scala

Lines 371 to 406 in 39fa610

"CosineLsh" - {
  "roundtrip" in {
    for {
      _ <- 1 to 100
      (dims, l, k) = (rng.nextInt(), rng.nextInt(), rng.nextInt())
      mapping = Mapping.CosineLsh(dims, l, k)
      expected = Json.obj(
        "type" -> "elastiknn_dense_float_vector".asJson,
        "elastiknn" -> Json.obj(
          "model" -> "lsh".asJson,
          "dims" -> dims.asJson,
          "similarity" -> "cosine".asJson,
          "L" -> l.asJson,
          "k" -> k.asJson
        )
      )
    } {
      roundtrip[Mapping](expected, mapping)
    }
  }
  "errors" in {
    val ex1 = intercept[XContentParseException](decodeUnsafeFromString[Mapping]("""
        |{
        | "type": "elastiknn_dense_float_vector",
        | "elastiknn": {
        | "model": "lsh",
        | "dims": 33,
        | "similarity": "cosine",
        | "L": "33",
        | "k": 3
        | }
        |}
        |""".stripMargin))
    ex1.getMessage shouldBe "Expected [L] to be one of [VALUE_NUMBER] but found [VALUE_STRING]"
  }
}

elastiknn-models/src/main/java/com/klibisz/elastiknn/models/DotLshModel.java


joancf · 2024-04-05T17:05:31Z

Hi @alexklibisz
First, thanks for the fast response, and for the plugin itself!
Let me apologize for sending the pull request before testing it thoroughly. It's my first time working in Scala, and I'm a bit confused about how to do some things.
I finally managed to compile and build the zip on my side, so I can try to run the plugin (and check whether it works and how it performs).
But as I said, my knowledge of Scala is limited to what I know from Java, and I'm not able to do some of the things you asked for. (Mainly the testing, where I think it will take me a while to understand everything!)

About the changes you asked for: I made all of them.
The one I did differently was to ensure that the similarity returns a non-negative value, using max(0, 1 + dotProduct).
This way we don't raise an exception, and negative values get a similarity of 0.

Thanks
Joan

alexklibisz

Thanks for the quick fixes! I can help with the tests once we're happy with the implementation.

I'll have to think more about the strategy for accounting for non-normalized vectors. As a user, I think I would rather get a failed response than a bunch of missing results. Also, we say that the similarity is in [0, 2], but AFAICT the current strategy would allow for scores > 2, wouldn't it?
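
To make this concern concrete, here is a minimal Java sketch of the clamped transformation max(0, 1 + dot product) as described in this thread. The class and method names (`ClampedDotSketch`, `dot`, `clampedScore`) are hypothetical and are not taken from the plugin; the sketch only assumes the formula as stated above.

```java
// Hypothetical sketch; names are illustrative, not the plugin's actual code.
final class ClampedDotSketch {

    // Plain dot product of two equal-length vectors.
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // The transformation discussed in this thread: max(0, 1 + dot product).
    static float clampedScore(float[] a, float[] b) {
        return Math.max(0f, 1f + dot(a, b));
    }

    public static void main(String[] args) {
        float[] unit = {1f, 0f};
        float[] opposite = {-1f, 0f};
        float[] big = {3f, 0f};
        float[] bigOpposite = {-3f, 0f};
        System.out.println(clampedScore(unit, unit));       // 2.0  (unit vectors, in range)
        System.out.println(clampedScore(unit, opposite));   // 0.0  (unit vectors, in range)
        System.out.println(clampedScore(big, big));         // 10.0 (not normalized, above 2)
        System.out.println(clampedScore(big, bigOpposite)); // 0.0  (1 + (-9) is clamped to 0)
    }
}
```

With unit vectors the score stays in [0, 2]. With non-normalized vectors the clamp hides negative values as 0 and lets large values through unchanged, which matches both problems raised in the comment.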

alexklibisz · 2024-04-05T23:21:35Z

docs/pages/api.md

@@ -446,9 +470,12 @@ The exact transformations are described below.
 |Jaccard|N/A|0|1.0|
 |Hamming|N/A|0|1.0|
 |Cosine[^note-angular-cosine]|`cosine similarity + 1`|0|2|
+|Dot[^note-dot-product]|`Dot similarity + 1`|0|2| 


I think there might be something wrong with the table formatting:

Also, if we end up using it, we should describe the updated transformation here: max(0, 1 + dot product)

alexklibisz · 2024-04-10T17:23:06Z

Hi @joancf can you try adding the exceptions for scores outside [0, 2]? If you're having trouble I can try this, but probably not til the weekend or next week.
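
One way the requested check could look is sketched below in Java. This is an illustration under stated assumptions, not the plugin's implementation: `CheckedDotSketch`, `checkedScore`, and the `TOLERANCE` constant are hypothetical, and `IllegalArgumentException` stands in for the ElastiknnRuntimeException mentioned earlier in the thread, whose constructor is not shown here.

```java
// Hypothetical sketch; the plugin's real exception type and call sites may differ.
final class CheckedDotSketch {

    // Assumed allowance for floating-point rounding on unit vectors; not from the thread.
    static final float TOLERANCE = 1e-4f;

    // Plain dot product of two equal-length vectors.
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns 1 + dot product, or throws if the result falls outside [0, 2].
    static float checkedScore(float[] a, float[] b) {
        float score = 1f + dot(a, b);
        if (score < -TOLERANCE || score > 2f + TOLERANCE) {
            // Stand-in for the ElastiknnRuntimeException mentioned in this thread.
            throw new IllegalArgumentException(
                "Dot score " + score + " is outside [0, 2]; are the vectors normalized?");
        }
        // Snap tiny rounding overshoots back into the documented range.
        return Math.min(2f, Math.max(0f, score));
    }

    public static void main(String[] args) {
        System.out.println(checkedScore(new float[]{1f, 0f}, new float[]{1f, 0f})); // 2.0
        try {
            checkedScore(new float[]{3f, 0f}, new float[]{3f, 0f});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The tolerance is a design choice for this sketch: normalized float vectors rarely have a norm of exactly 1, so a strict comparison against 2 could reject valid inputs.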

joancf · 2024-04-17T09:10:24Z

Hi @alexklibisz, my company doesn't want to use it, so I will finish it outside working hours.
I'll do my best with the exceptions and testing.

Added dot product everywhere were cosine similarity was used

39fa610


joancf and others added 10 commits April 5, 2024 16:38

Found some bugs when trying to build/run tests

a706edf

Update docs/_posts/2021-07-30-how-does-elastiknn-work.md

10e3487

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Update docs/_posts/2021-07-30-how-does-elastiknn-work.md

39c66e3

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Update docs/pages/index.md

61404e0

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Update docs/pages/api.md

0cf5f58

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Update docs/pages/api.md

60fde2a

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Update elastiknn-models/src/main/java/com/klibisz/elastiknn/models/DotLshModel.java

3f835eb

Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>

Addd changes to footnote

59dff2c

dotSimilarity does not return negative floats

1028572

zero as min value

281e6c4
