SOLR-16574: Demonstrate Dense Vectors and KNN as part of the Films example by gabrielmagno · Pull Request #1213 · apache/solr

gabrielmagno · 2022-12-05T13:24:31Z

https://issues.apache.org/jira/browse/SOLR-16574

Description

Enrich the films example to demonstrate how to use the Dense Vectors feature.

Solution

Added the field film_vector to the films dataset. This is a 10-dimensional embedding vector that represents each movie. The vector combines the first 5 dimensions of a pre-trained BERT sentence model applied to the movie name plus the genre names, followed by a 5-dimensional item2vec model of itemset co-occurrence of genres in the movies, for 10 dimensions in total. Even though similar movies are expected to be close to each other, this is just a "toy example" model that serves as a source for the film vectors.

The README of the example was also updated to include the specification of the Dense Vector field in the schema. Also, a new section was created, with examples showing how to make KNN queries with the vectors.
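
As a rough illustration of the schema change described above, the sketch below builds a Schema API request body that declares a 10-dimensional dense vector field type and a film_vector field that uses it. The type name `knn_vector_10`, the cosine similarity choice, and the indexed/stored flags are assumptions made for this example; the README in this PR remains the authoritative source for the actual values.

```python
import json


def dense_vector_schema_payload(field_name="film_vector", dims=10,
                                similarity="cosine"):
    """Build a Schema API body that adds a DenseVectorField type and a field.

    The type name and the default arguments are illustrative assumptions,
    not values taken from the PR.
    """
    type_name = f"knn_vector_{dims}"  # hypothetical field type name
    return {
        "add-field-type": {
            "name": type_name,
            "class": "solr.DenseVectorField",
            "vectorDimension": dims,
            "similarityFunction": similarity,
        },
        "add-field": {
            "name": field_name,
            "type": type_name,
            "indexed": True,
            "stored": True,
        },
    }


if __name__ == "__main__":
    # Print the JSON that could be POSTed to the collection's /schema endpoint.
    print(json.dumps(dense_vector_schema_payload(), indent=2))
```

The printed JSON could be sent to the collection's Schema API endpoint (for a local install and a collection named films, presumably `http://localhost:8983/solr/films/schema`).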

Tests

Added the new field film_vector to the 3 dataset formats (JSON, XML, CSV), making sure to preserve the exact same data from the original datasets, so that the diff contains only the addition of the new field.
Checked the creation of the collection for the 3 dataset formats. Regardless of the format all the 1100 films were indexed, and the film_vector field was correctly parsed and indexed as well.
Checked the KNN example queries for all the 3 dataset formats.
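
For readers who want to repeat the KNN check, here is a minimal sketch of how such a query string can be assembled with Solr's `knn` query parser. The helper names, the local URL, the `/select` handler, and the sample vector are assumptions for illustration; the queries actually tested are the ones in the example README.

```python
from urllib.parse import urlencode


def knn_query_params(vector, field="film_vector", top_k=10):
    """Return Solr request parameters for a KNN query against `field`."""
    # Solr expects the query vector as a bracketed, comma-separated list.
    vector_literal = "[" + ", ".join(str(float(x)) for x in vector) + "]"
    return {"q": f"{{!knn f={field} topK={top_k}}}{vector_literal}"}


def knn_query_url(vector,
                  base_url="http://localhost:8983/solr/films/select",
                  **kwargs):
    """Build a full request URL; base_url is an assumed local install."""
    return base_url + "?" + urlencode(knn_query_params(vector, **kwargs))


if __name__ == "__main__":
    # A made-up 10-dimensional query vector, not one from the dataset.
    sample = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2, 0.4, -0.3, 0.05]
    print(knn_query_params(sample, top_k=5)["q"])
    print(knn_query_url(sample, top_k=5))
```

The query vector must have the same number of dimensions as the indexed field, so a 10-dimensional field needs a 10-element vector.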

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide


gabrielmagno · 2022-12-05T13:25:32Z

I tried running gradlew check, but got an error that I don't think is related to the changes of this PR:

> Task :solr:solr-ref-guide:buildLocalAntoraSite FAILED
/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/@antora/ui-loader/lib/load-ui.js:180
      if (response.statusCode !== 200) {
                   ^

TypeError: Cannot read properties of undefined (reading 'statusCode')
    at /home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/@antora/ui-loader/lib/load-ui.js:180:20
    at ClientRequest.<anonymous> (/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/simple-get/index.js:88:21)
    at ClientRequest.f (/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/once/once.js:25:25)
    at ClientRequest.emit (node:events:527:28)
    at TLSSocket.socketErrorListener (node:_http_client:454:9)
    at TLSSocket.emit (node:events:527:28)
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

epugh · 2022-12-05T13:47:13Z

How hard would it be to include the steps you did for vectorization? I notice that most tutorials start with the vectors... However, what if I wanted to play with this example and add my own tweaks to the vector? Is creating the vector a simple script, or was it a ton of work and out of scope????

gabrielmagno · 2022-12-05T14:13:23Z

@epugh for this version I combined two "example" models (BERT + item2vec), just to serve as an example.

If we are willing to provide the instructions on how to create the models and the vectors themselves, I guess it would be better to use a single-model solution, for simplicity. I could recreate the vectors using only BERT (which I believe is good enough for our example).

The easiest way I know to create a vector representation of text data is by using the sentence_transformers Python library with a pre-trained BERT model. It is possible to create vectors with 3 lines of code:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

my_vector = model.encode("This is my text")

The only issue is that the vectors from this model have 768 dimensions. For the example I simply took the first 5 dimensions and concatenated them with the vector from the other model. This is not really an appropriate way to create the vector in real scenarios. There are other techniques (e.g. Model Distillation) that could reduce the number of dimensions.
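
To make the caveat about truncation concrete, the following sketch compares cosine similarity before and after keeping only the leading dimensions. It uses plain Python lists and two made-up 4-dimensional vectors rather than real model output, so it only illustrates the effect; it is not the procedure used to build the dataset.

```python
import math


def truncate(vector, dims):
    """Keep only the first `dims` components of a vector."""
    return list(vector[:dims])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical vectors that differ only in their last component.
    a = [1.0, 0.0, 0.0, 1.0]
    b = [1.0, 0.0, 0.0, -1.0]
    print("full vectors:", cosine_similarity(a, b))
    print("first 2 dims:", cosine_similarity(truncate(a, 2), truncate(b, 2)))
```

In this contrived case the two vectors are orthogonal in full but look identical once truncated, because the information that distinguished them sat in the discarded dimensions.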

epugh · 2022-12-05T14:16:49Z

I think I'm mostly thinking, what if I add a new movie to the list... or want to play with it, can we provide that script as well? I sometimes dream that the vectorization process is supported by Solr ;-) You could imagine a ScriptingUpdateRequestProcessor stage that called out to a service to do it... Or even do it directly!

gabrielmagno · 2022-12-05T14:42:27Z

Having a built-in model loader and vector encoder in Solr would be amazing!

Regarding the current vector example, how about I recreate the vectors with the 1-algorithm solution (extracting the first 10 dimensions), then provide the instructions (code) of how to vectorize the movies?

gabrielmagno · 2022-12-09T21:42:56Z

Hey @epugh !

I've created a new model and have put the Python scripts that were used to create the model and calculate the films vectors inside a subfolder called vectors.

gabrielmagno · 2022-12-09T22:10:24Z

Gradlew returned an error:

Execution failed for task ':solr:example:rat'.
> Detected license header issues (skip with -Pvalidation.rat.failOnError=false):
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/films.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_dataset.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_model.py

Not sure how to proceed.

sonatype-lift · 2022-12-09T22:21:50Z

solr/example/films/vectors/create_dataset.py

+from films import *
+
+#### Load the 10-dimensions model
+model = SentenceTransformer(FILEPATH_FILMS_MODEL)


💬 7 similar findings have been found in this PR

Unbound name: Name FILEPATH_FILMS_MODEL is used but not defined in the current scope.

File Path	Line Number
solr/example/films/vectors/create_dataset.py	10
solr/example/films/vectors/create_dataset.py	13
solr/example/films/vectors/create_dataset.py	47
solr/example/films/vectors/create_dataset.py	48
solr/example/films/vectors/create_model.py	68
solr/example/films/vectors/create_model.py	69
solr/example/films/vectors/create_model.py	94

Visit the Lift Web Console to find more details in your report.

ℹ️ Learn about @sonatype-lift commands

You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.

Command	Usage
`@sonatype-lift ignore`	Leave out the above finding from this PR
`@sonatype-lift ignoreall`	Leave out all the existing findings from this PR
`@sonatype-lift exclude <file|issue|path|tool>`	Exclude specified `file|issue|path|tool` from Lift findings by updating your config.toml file


epugh · 2022-12-09

> Gradlew returned an error:
>
> Execution failed for task ':solr:example:rat'.
> Detected license header issues (skip with -Pvalidation.rat.failOnError=false):
> Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/films.py
> Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_dataset.py
> Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_model.py
>
> Not sure how to proceed.

I think you need to just add the ASL license text at the top... look at some of the scripts in dev-tools/scripts for examples...
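
The suggestion above can be sketched as a small helper that prepends the standard ASF source header to a Python script's text while keeping any shebang line first. The function name is hypothetical, and whether the RAT check accepts exactly this layout is an assumption; the scripts in dev-tools/scripts are the reference to copy from.

```python
ASF_HEADER = """\
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""


def add_license_header(source: str) -> str:
    """Return `source` with the ASF header added, unless already present."""
    if "Licensed to the Apache Software Foundation" in source:
        return source
    if source.startswith("#!"):
        # Keep the shebang as the very first line of the script.
        shebang, _, rest = source.partition("\n")
        return shebang + "\n" + ASF_HEADER + "\n" + rest
    return ASF_HEADER + "\n" + source


if __name__ == "__main__":
    print(add_license_header("#!/usr/bin/env python3\nimport json\n"))
```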

sonatype-lift · 2022-12-10T14:09:57Z

solr/example/films/vectors/films.py

+
+import json
+import csv
+from lxml import etree


blacklist: Using etree to parse untrusted XML data is known to be vulnerable to XML attacks. Replace etree with the equivalent defusedxml package.


sonatype-lift · 2022-12-10T14:09:59Z

solr/example/films/vectors/films.py

+
+import json
+import csv
+from lxml import etree


opt.semgrep.python.lang.security.use-defused-xml.use-defused-xml: Found use of the native Python XML libraries, which is vulnerable to XML external entity (XXE)
attacks. The Python documentation recommends the 'defusedxml' library instead if the XML being
loaded is untrusted.


epugh · 2022-12-10T14:25:38Z

progress is looking great!

epugh · 2022-12-15T14:59:18Z

@gabrielmagno are you ready for me to review this PR? I was sort of waiting till you gave me the all clear!

gabrielmagno · 2022-12-15T15:06:22Z

Hey @epugh!

Yeah, at least from my side I think I have finished it.
Sorry for not making that more clear 😅

By the way, sonatype complained about using lxml, but since we are creating the XML and not really parsing an XML, I think we could ignore it.

You can go ahead and review it.
Thank you very much!
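
On the point about generating rather than parsing XML: a minimal sketch of writing one film as a Solr XML update document, using the standard library's `xml.etree.ElementTree` instead of lxml. This is a swapped-in technique for illustration, not the code in the PR's scripts, and the sample film (id, name, vector values) is made up. Multi-valued fields, including the vector, are written as repeated `field` elements.

```python
import xml.etree.ElementTree as ET


def film_to_solr_xml(film):
    """Serialize a dict as a Solr <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in film.items():
        # Lists and tuples become one <field> element per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    # Special characters such as & are escaped during serialization.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical film record with a shortened vector.
    film = {
        "id": "example-film-1",
        "name": "Tom & Jerry",
        "film_vector": [0.1, -0.2],
    }
    print(film_to_solr_xml(film))
```

Because the script only writes XML from data it already holds, the XXE concern raised by the scanner, which is about parsing untrusted input, does not come into play here.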

epugh · 2022-12-16T16:45:49Z

@gabrielmagno how do you want to be credited in CHANGES.txt?

epugh · 2022-12-16T18:08:45Z

bin/solr start -example films no longer creates all the fields....

epugh · 2022-12-16T21:01:32Z

This is looking really cool... I wonder, is it worth documenting running the python scripts?

gabrielmagno · 2022-12-17T14:18:14Z

Hey @epugh !

You can credit me as "Gabriel Magno".

How about I create a readme file inside the "vectors" folder, explaining how to use the scripts?

…ample (#1213)

* Extend SolrCLI to support the additional fieldtype so that bin/solr start -e films works.
* Introduce documentation and scripts for creating a vector in the first place, an often missing part of a demo.

Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>

krickert · 2022-12-19T20:28:13Z

Hello - I just saw these requests and I'd like to (eventually) have a full blog post on how to create the vector for input, with the example being a full Wikipedia dump using Java, a pretrained model, and deeplearning4j. I suspect that once I have this completed, it can help create an example that can lead to this being a feature in Solr.

Would anyone be up for a discussion on this? I suppose this point might not be the best place to ask - but this thread is the closest to finding someone who is trying to accomplish the same thing as I'm doing.

epugh · 2022-12-22T13:28:32Z

@krickert one thought that I have is that having this sequence of events would be useful in benchmarking Solr's performance in this evolving space of dense vectors ;-). I think it would be a fantastic tutorial as well... I suggest you join the dev@solr.apache.org mailing list and ask there!

krickert · 2022-12-22T13:35:53Z

Thanks! I have to get it to the point where I get the dense vectors and the vector search working first :) After that I would be glad to write a tutorial. I'll open up a discussion. I'd like someone to ensure that I'm calculating the dense vectors right. Thanks. I'll ask there!


dsmiley · 2022-12-27T00:17:48Z

Next time, please put newlines at the end of each sentence in our asciidoc, as is consistent with most of our asciidoc. It's much easier to consume in diffs.

epugh · 2022-12-27T16:54:26Z

I wish we had a lint checker for this kind of thing. We have a lot of "best practices" in writing code, but they are all enforced manually ;-(. I did google briefly for "ventilated prose lint checker" with no luck. https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line
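
As a very rough sketch of what such a check could look like, the snippet below flags lines of an AsciiDoc file that appear to contain more than one sentence. It is a heuristic under stated assumptions: the function name is hypothetical, sentence boundaries are guessed from punctuation followed by a capital letter (so abbreviations such as "e.g. Model" will produce false positives), and only `----` and `....` delimited blocks are skipped.

```python
import re

# Sentence-ending punctuation, whitespace, then an uppercase letter.
_SENTENCE_BREAK = re.compile(r"[.!?]\s+(?=[A-Z])")


def find_multi_sentence_lines(text):
    """Return (line_number, line) pairs that seem to hold several sentences."""
    hits = []
    in_block = False
    for number, line in enumerate(text.splitlines(), start=1):
        # Toggle on listing (----) and literal (....) block delimiters.
        if line.startswith("----") or line.startswith("...."):
            in_block = not in_block
            continue
        if in_block:
            continue
        if _SENTENCE_BREAK.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    sample = (
        "First sentence. Second sentence.\n"
        "Only one sentence here.\n"
        "----\n"
        "Code. More code.\n"
        "----\n"
    )
    for number, line in find_multi_sentence_lines(sample):
        print(f"line {number}: {line}")
```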

gabrielmagno added 15 commits December 2, 2022 18:42

Created the films-vectors example

17f58f2

Remove superfluous paragraph

b0e32f3

Fix name of collection

3dc0eca

Updated films dataset files adding the new film_vector field

cf82a5d

Properly encoded XML fields

aacf535

Remove empty tags from fields

3074cfa

Update instructions for creating the collection considering the new film_vector field

d40a564

Add paragraph explaining how the film_vector field was generated

9da8012

Typo

c52753d

Add examples of KNN queries

8ca3ade

Deleted the separeted vectors example, since now it is merged with the films example

9453699

Fix typo

cb10b89

Remove complex example

5ae3786

Better wording

06c7c6d

Merge branch 'apache:main' into main

3410e0b

gabrielmagno added 4 commits December 9, 2022 18:18

Updated vectors considering the new simpler model

6734917

Included Python scripts to create the model and the vectors of the films

3badd89

Updated readme example considering the new vectors

5f697ef

Updated the explanation of the model in the readme

92ff897

sonatype-lift bot reviewed Dec 9, 2022


gabrielmagno added 3 commits December 10, 2022 09:58

Add licence to the Python scripts

1952fd7

Improve packaging importing

b3245fc

Improve code and add exporting of dataset for XML and CSV

56d02b5

gabrielmagno added 2 commits December 10, 2022 10:26

Set scripts as executable

70df38d

Merge branch 'apache:main' into main

f87771b

sonatype-lift bot reviewed Dec 10, 2022


epugh added 2 commits December 16, 2022 15:41

Add additional schema fields set up to the films example script.

0bf725b

track changes

ca7703e

epugh added 2 commits December 16, 2022 16:03

Merge remote-tracking branch 'upstream/main' into pr/1213

f013257

fix formatting

cc4bfd4

gabrielmagno and others added 2 commits December 17, 2022 17:59

Created a readme for the vectors scripts

17b77bd

add license

56346f2

epugh merged commit 516180f into apache:main Dec 18, 2022

cpoerschke mentioned this pull request Sep 11, 2025

Add Solr reader integration run-llama/llama_index#19843

Merged

18 tasks
