SOLR-16574: Demonstrate Dense Vectors and KNN as part of the Films example#1213

Merged
epugh merged 30 commits into apache:main from gabrielmagno:main
Dec 18, 2022

@gabrielmagno
Contributor

https://issues.apache.org/jira/browse/SOLR-16574

Description

Enrich the films example to demonstrate how to use the Dense Vectors feature.

Solution

Added the field film_vector to the films dataset. This is a 10-dimensional embedding vector created to represent each movie. It combines the first 5 dimensions of a pre-trained BERT sentence model, applied to the movie name plus its genre names, with a 5-dimensional item2vec model of genre co-occurrence across movies. Although similar movies are expected to be close to each other, this is only a "toy example" model that serves as a source for the film vectors.
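
As a rough illustration of the combination described above, the sketch below concatenates the first 5 dimensions of a text embedding with a 5-dimensional genre embedding. The helper name `combine_embeddings` and all vector values are hypothetical; this is not the script used to build the dataset.

```python
def combine_embeddings(text_vector, genre_vector, text_dims=5):
    """Keep the first `text_dims` values of the text embedding and append the genre embedding."""
    return list(text_vector[:text_dims]) + list(genre_vector)


# Hypothetical inputs: a 768-dimensional sentence embedding and a 5-dimensional item2vec vector.
bert_vector = [0.01 * i for i in range(768)]
item2vec_vector = [0.5, -0.2, 0.1, 0.0, 0.3]

film_vector = combine_embeddings(bert_vector, item2vec_vector)
print(len(film_vector))  # 10 values: 5 from the text model, 5 from the genre model
```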

The README of the example was also updated to include the specification of the Dense Vector field in the schema. Also, a new section was created, with examples showing how to make KNN queries with the vectors.
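
For readers who have not seen a Dense Vector field definition, here is a minimal sketch of the kind of Schema API payload such a specification implies. `solr.DenseVectorField`, `vectorDimension` and `similarityFunction` are Solr attributes; the field type name `knn_vector_10`, the `cosine` similarity choice and the helper function are assumptions made for illustration, so check the README for the values actually used.

```python
import json


def dense_vector_schema_commands(dimension=10, similarity="cosine"):
    """Build Schema API commands for a dense vector field type and the film_vector field."""
    field_type = {
        "name": "knn_vector_%d" % dimension,  # assumed type name
        "class": "solr.DenseVectorField",
        "vectorDimension": dimension,
        "similarityFunction": similarity,  # assumed similarity function
    }
    field = {
        "name": "film_vector",
        "type": field_type["name"],
        "indexed": True,
        "stored": True,
    }
    return {"add-field-type": field_type, "add-field": field}


# The resulting JSON could be POSTed to the collection's /schema endpoint.
print(json.dumps(dense_vector_schema_commands(), indent=2))
```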

Tests

  • Added the new field film_vector to the 3 dataset formats (JSON, XML, CSV), making sure to preserve the exact same data from the original datasets, so that the "diff" shows only the addition of the new field.
  • Checked the creation of the collection for the 3 dataset formats. Regardless of the format, all 1100 films were indexed, and the film_vector field was correctly parsed and indexed as well.
  • Checked the KNN example queries for all the 3 dataset formats.
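
A KNN query of the kind checked above can be sketched as follows. The `{!knn f=... topK=...}` query parser syntax is Solr's; the base URL (a local Solr with a collection named films), the `fl` field list and the `knn_query_url` helper are assumptions for illustration. The snippet only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def knn_query_url(vector, top_k=10,
                  base_url="http://localhost:8983/solr/films/select"):
    """Build a select URL that runs a KNN query against the film_vector field."""
    vector_text = "[" + ", ".join(str(value) for value in vector) + "]"
    params = {
        "q": "{!knn f=film_vector topK=%d}%s" % (top_k, vector_text),
        "fl": "id,name,score",  # assumed field list
    }
    return base_url + "?" + urlencode(params)


# Hypothetical 10-dimensional target vector.
url = knn_query_url([0.1] * 10, top_k=5)
print(url)
```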

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@gabrielmagno
Contributor Author

I tried running gradlew check, but got an error that I don't think is related to the changes of this PR:

> Task :solr:solr-ref-guide:buildLocalAntoraSite FAILED
/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/@antora/ui-loader/lib/load-ui.js:180
      if (response.statusCode !== 200) {
                   ^

TypeError: Cannot read properties of undefined (reading 'statusCode')
    at /home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/@antora/ui-loader/lib/load-ui.js:180:20
    at ClientRequest.<anonymous> (/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/simple-get/index.js:88:21)
    at ClientRequest.f (/home/gabrielmagno/dev-solr/apache-solr/.gradle/node/solr-ref-guide/node_modules/once/once.js:25:25)
    at ClientRequest.emit (node:events:527:28)
    at TLSSocket.socketErrorListener (node:_http_client:454:9)
    at TLSSocket.emit (node:events:527:28)
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

@epugh
Contributor

epugh commented Dec 5, 2022

How hard would it be to include the steps you did for vectorization? I notice that most tutorials start with the vectors... However, what if I wanted to play with this example and add my own tweaks to the vector? Is creating the vector a simple script, or was it a ton of work and out of scope????

@gabrielmagno
Contributor Author

@epugh for this version I combined two "example" models (BERT + item2vec), just to serve as an example.

If we are willing to provide instructions on how to create the models and the vectors themselves, I guess it would be better to use a single-model solution, for simplicity. I could recreate the vectors using only BERT (which I believe is good enough for our example).

The easiest way I know to create a vector representation of text data is by using the sentence_transformers Python library with a pre-trained BERT model. It is possible to create vectors with 3 lines of code:

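from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (downloaded on first use)
model = SentenceTransformer("all-mpnet-base-v2")

# Encode the text into a 768-dimensional vector
my_vector = model.encode("This is my text")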

The only issue is that the vectors from this model have 768 dimensions. For the example I simply took the first 5 dimensions and concatenated them with the vector from the other model. This is not an appropriate way to create vectors in real scenarios. There are other techniques (e.g. Model Distillation) that could reduce the number of dimensions.
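
To make the caveat about truncation concrete, here is a small sketch with hypothetical 8-dimensional vectors (not real model output). The two vectors differ only in their trailing dimensions, so keeping just the first 5 makes them look identical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors that differ only in the last two dimensions.
vec_a = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
vec_b = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

full_similarity = cosine(vec_a, vec_b)               # about 0.5
truncated_similarity = cosine(vec_a[:5], vec_b[:5])  # 1.0, the difference is lost
print(full_similarity, truncated_similarity)
```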

@epugh
Contributor

epugh commented Dec 5, 2022

I think I'm mostly thinking: what if I add a new movie to the list, or want to play with it? Can we provide that script as well? I sometimes dream that the vectorization process is supported by Solr ;-) You could imagine a ScriptingUpdateRequestProcessor stage that called out to a service to do it. Or even do it directly!

@gabrielmagno
Contributor Author

Having a builtin model loader and vector encoder in Solr would be amazing!

Regarding the current vector example, how about I recreate the vectors with the 1-algorithm solution (extracting the first 10 dimensions), then provide the instructions (code) of how to vectorize the movies?

@gabrielmagno
Contributor Author

Hey @epugh !

I've created a new model and have put the Python scripts that were used to create the model and calculate the films vectors inside a subfolder called vectors.

@gabrielmagno
Contributor Author

Gradlew returned an error:

Execution failed for task ':solr:example:rat'.
> Detected license header issues (skip with -Pvalidation.rat.failOnError=false):
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/films.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_dataset.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_model.py

Not sure how to proceed.

from films import *

#### Load the 10-dimensions model
model = SentenceTransformer(FILEPATH_FILMS_MODEL)

💬 7 similar findings have been found in this PR


Unbound name: Name FILEPATH_FILMS_MODEL is used but not defined in the current scope.


All instances of this finding (file path: line numbers):
  • solr/example/films/vectors/create_dataset.py: 10, 13, 47, 48
  • solr/example/films/vectors/create_model.py: 68, 69, 94
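
The finding above is likely a side effect of the wildcard import: a static analyzer cannot easily tell which names `from films import *` brings into scope, so it reports FILEPATH_FILMS_MODEL as unbound. A minimal sketch of the explicit alternative follows. The stand-in `films` module and its path value are hypothetical and exist only so the snippet runs on its own; in the real scripts the import would resolve against films.py.

```python
import sys
import types

# Hypothetical stand-in for films.py so this sketch runs without the real file.
# The path value is illustrative, not the one used in the PR.
films_stub = types.ModuleType("films")
films_stub.FILEPATH_FILMS_MODEL = "./models/films-model"
sys.modules["films"] = films_stub

# Explicit import: the name is visibly bound, so a linter can resolve it.
from films import FILEPATH_FILMS_MODEL

print(FILEPATH_FILMS_MODEL)
```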


Contributor

Gradlew returned an error:

Execution failed for task ':solr:example:rat'.
> Detected license header issues (skip with -Pvalidation.rat.failOnError=false):
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/films.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_dataset.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_model.py

Not sure how to proceed.

I think you just need to add the ASL license text at the top... look at some of the scripts in dev-tools/scripts for examples...
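
For reference, a sketch of what such a header could look like at the top of a Python script. The text is the standard ASF source header written as Python comments; copying it from an existing script in dev-tools/scripts, as suggested above, remains the safer route. The `with_license_header` helper is hypothetical and only illustrates prepending the header to existing source.

```python
ASF_HEADER = """\
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""


def with_license_header(source):
    """Prepend the header unless the source already starts with it."""
    if source.startswith(ASF_HEADER):
        return source
    return ASF_HEADER + "\n" + source


print(with_license_header("import json\n"))
```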


import json
import csv
from lxml import etree

blacklist: Using etree to parse untrusted XML data is known to be vulnerable to XML attacks. Replace etree with the equivalent defusedxml package.




import json
import csv
from lxml import etree

opt.semgrep.python.lang.security.use-defused-xml.use-defused-xml: Found use of the native Python XML libraries, which is vulnerable to XML external entity (XXE)
attacks. The Python documentation recommends the 'defusedxml' library instead if the XML being
loaded is untrusted.



@epugh
Contributor

epugh commented Dec 10, 2022

progress is looking great!

@epugh
Contributor

epugh commented Dec 15, 2022

@gabrielmagno are you ready for me to review this PR? I was sort of waiting till you gave me the all clear!

@gabrielmagno
Contributor Author

Hey @epugh!

Yeah, at least from my side I think I have finished it.
Sorry for not making that more clear 😅

By the way, Sonatype Lift complained about using lxml, but since we are creating the XML and not actually parsing any XML, I think we could ignore it.

You can go ahead and review it.
Thank you very much!
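
To illustrate the point about lxml: the flagged risk concerns parsing untrusted XML, while a dataset script of this kind only serializes data it already holds. Below is a minimal sketch of building a Solr XML update document. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the film record and the `films_to_xml` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def films_to_xml(films):
    """Serialize film records into a Solr <add> update document."""
    add = ET.Element("add")
    for film in films:
        doc = ET.SubElement(add, "doc")
        for name, value in film.items():
            # Multi-valued fields (such as the vector) become repeated <field> elements.
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record; a real film_vector would have 10 values.
sample = [{"id": "/en/example_film", "name": "Example Film",
           "film_vector": [0.1, -0.2, 0.3]}]
xml_text = films_to_xml(sample)
print(xml_text)
```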

@epugh
Contributor

epugh commented Dec 16, 2022

@gabrielmagno how do you want to be credited in CHANGES.txt?

@epugh
Contributor

epugh commented Dec 16, 2022

bin/solr start -example films no longer creates all the fields....

@epugh
Contributor

epugh commented Dec 16, 2022

This is looking really cool... I wonder, is it worth documenting how to run the Python scripts?

@gabrielmagno
Contributor Author

Hey @epugh !

You can credit me as "Gabriel Magno".

How about I create a README file inside the "vectors" folder, explaining how to use the scripts?

@epugh epugh merged commit 516180f into apache:main Dec 18, 2022
epugh added a commit that referenced this pull request Dec 18, 2022
…ample (#1213)

* Extend SolrCLI to support the additional fieldtype so that bin/solr start -e films works.
* Introduce documentation and scripts for creating a vector in the first place, an often missing part of a demo.

Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
@krickert

Hello - I just saw these requests and I'd like to (eventually) have a full blog post as to how to create the vector for input with the example being a full wikipedia dump using java, a pretrained model, and deeplearning4j. I suspect once I have this completed, this can help create an example that can lead to this being a feature in solr.

Would anyone be up for a discussion on this? I suppose this point might not be the best place to ask - but this thread is the closest to finding someone who is trying to accomplish the same thing as I'm doing.

@epugh
Contributor

epugh commented Dec 22, 2022

@krickert one thought that I have is that having this sequence of events would be useful in benchmarking Solr's performance in this evolving space of dense vectors ;-). I think it would be a fantastic tutorial as well... I suggest you join the dev@solr.apache.org mailing list and ask there!

@krickert

krickert commented Dec 22, 2022 via email

@dsmiley
Contributor

dsmiley commented Dec 27, 2022

Next time, please put newlines at the end of each sentence in our asciidoc, as is consistent with most of our asciidoc. It's much easier to consume in diffs.

@epugh
Contributor

epugh commented Dec 27, 2022

I wish we had a lint checker for this kind of thing. We have a lot of "best practices" in writing code, but they are all enforced manually ;-(. I did google briefly for "ventilated prose lint checker" with no luck. https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line
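
A very naive sketch of what such a check could look like, assuming a "sentence break" is sentence-ending punctuation followed by whitespace and an uppercase letter. The function name is hypothetical, and the heuristic will produce false positives (abbreviations such as "e.g. Solr"), so it is only a starting point rather than a real linter.

```python
import re

# Sentence end (., ! or ?) followed by whitespace and an uppercase letter.
SENTENCE_BREAK = re.compile(r"[.!?]\s+[A-Z]")


def lines_with_multiple_sentences(text):
    """Return 1-based numbers of lines that appear to hold more than one sentence."""
    flagged = []
    in_listing = False
    for number, line in enumerate(text.splitlines(), start=1):
        # Skip AsciiDoc listing blocks delimited by ---- lines.
        if line.startswith("----"):
            in_listing = not in_listing
            continue
        if not in_listing and SENTENCE_BREAK.search(line):
            flagged.append(number)
    return flagged


sample_text = "First sentence. Second sentence.\nOnly one sentence here.\n"
print(lines_with_multiple_sentences(sample_text))
```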
