SOLR-16574: Demonstrate Dense Vectors and KNN as part of the Films example #1213
epugh merged 30 commits into apache:main from
Conversation
|
I tried running
|
How hard would it be to include the steps you did for vectorization? I notice that most tutorials start with the vectors... However, what if I wanted to play with this example and add my own tweaks to the vector? Is creating the vector a simple script, or was it a ton of work and out of scope?
|
@epugh for this version I combined two "example" models (BERT + item2vec), just to serve as an example. If we are willing to provide the instructions on how to create the models and the vectors themselves, I guess it would be better to use a single-model solution, for simplicity. I could recreate the vectors using only BERT (which I believe is good enough for our example). The easiest way I know to create a vector representation of text data is by using the … The only issue is that the vectors from this model have 768 dimensions. For the example I simply took the first 5 dimensions and concatenated them to the other model's vector. This is not a really appropriate way to create the vector in real scenarios. There are other techniques (e.g. model distillation) that could reduce the number of dimensions.
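The truncate-and-concatenate step described here can be sketched in pure Python; the function and the toy numbers below are illustrative stand-ins, not the PR's actual scripts:

```python
def combine_embeddings(bert_vec, item2vec_vec, dims_each=5):
    """Truncate each embedding to `dims_each` dimensions and concatenate.

    Truncating a 768-dim BERT sentence embedding like this is a toy
    shortcut, not a sound dimensionality-reduction technique; model
    distillation or PCA would be more appropriate in real scenarios.
    """
    return list(bert_vec[:dims_each]) + list(item2vec_vec[:dims_each])

# Toy film vector: first 5 BERT dims + 5 item2vec dims = 10 dims total.
film_vector = combine_embeddings([0.1] * 768, [0.9, 0.8, 0.7, 0.6, 0.5])
```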
|
I think I'm mostly thinking: what if I add a new movie to the list, or want to play with it? Can we provide that script as well? I sometimes dream that the vectorization process is supported by Solr ;-) You could imagine a ScriptingUpdateRequestProcessor stage that called out to a service to do it... Or even do it directly!
|
Having a built-in model loader and vector encoder in Solr would be amazing! Regarding the current vector example, how about I recreate the vectors with the single-algorithm solution (extracting the first 10 dimensions), then provide the instructions (code) for how to vectorize the movies?
|
Hey @epugh ! I've created a new model and have put the Python scripts that were used to create the model and calculate the films vectors inside a subfolder called
|
Gradlew returned an error. Not sure how to proceed.
```python
from films import *

#### Load the 10-dimensions model
model = SentenceTransformer(FILEPATH_FILMS_MODEL)
```
💬 7 similar findings have been found in this PR
Unbound name: Name FILEPATH_FILMS_MODEL is used but not defined in the current scope.
All instances of this finding:
| File Path | Line Number |
|---|---|
| solr/example/films/vectors/create_dataset.py | 10 |
| solr/example/films/vectors/create_dataset.py | 13 |
| solr/example/films/vectors/create_dataset.py | 47 |
| solr/example/films/vectors/create_dataset.py | 48 |
| solr/example/films/vectors/create_model.py | 68 |
| solr/example/films/vectors/create_model.py | 69 |
| solr/example/films/vectors/create_model.py | 94 |
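For context, a star-import like `from films import *` only resolves names such as `FILEPATH_FILMS_MODEL` if the shared module defines them at top level, which is presumably why a static analyzer that does not follow star-imports flags them as unbound. A hypothetical sketch of such a shared-constants module (the concrete paths below are illustrative, not the PR's):

```python
# films.py -- shared constants star-imported by the vector scripts.
# The concrete paths below are illustrative only.
import os

# Folder where the example keeps its model and generated datasets.
DIRPATH_VECTORS = os.path.join("solr", "example", "films", "vectors")

# Path the scripts load/save the trained 10-dimension sentence model from.
FILEPATH_FILMS_MODEL = os.path.join(DIRPATH_VECTORS, "films_model")
```

Using explicit imports (`from films import FILEPATH_FILMS_MODEL`) instead of `import *` would also make the names visible to the analyzer.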
Gradlew returned an error:

```
Execution failed for task ':solr:example:rat'.
> Detected license header issues (skip with -Pvalidation.rat.failOnError=false):
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/films.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_dataset.py
  Unknown license: /home/runner/work/solr/solr/solr/example/films/vectors/create_model.py
```

Not sure how to proceed.
I think you need to just add the ASL license text at the top... look at some of the scripts in dev-tools/scripts for examples...
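For reference, the standard ASF source header as a Python comment block (this is the stock Apache License 2.0 header text used across ASF repositories):

```python
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

Placing this at the top of each flagged `.py` file should satisfy the rat check.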
|
|
```python
import json
import csv
from lxml import etree
```
blacklist: Using etree to parse untrusted XML data is known to be vulnerable to XML attacks. Replace etree with the equivalent defusedxml package.
|
|
```python
import json
import csv
from lxml import etree
```
opt.semgrep.python.lang.security.use-defused-xml.use-defused-xml: Found use of the native Python XML libraries, which is vulnerable to XML external entity (XXE) attacks. The Python documentation recommends the 'defusedxml' library instead if the XML being loaded is untrusted.
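Since these scripts generate the films XML rather than parse untrusted input, the XXE findings are arguably moot here; for pure generation even the standard library is enough. A minimal sketch (the `<doc>`/`<field>` shape follows Solr's XML update format; the film dict and helper name are made up):

```python
import xml.etree.ElementTree as ET

def film_to_xml(film):
    """Render one film dict as a Solr <doc> element.

    This only *builds* XML, so the XXE concern about parsing
    untrusted input does not apply.
    """
    doc = ET.Element("doc")
    for name, value in film.items():
        # Multi-valued fields (e.g. genre) become repeated <field> elements.
        values = value if isinstance(value, list) else [value]
        for v in values:
            ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(doc, encoding="unicode")

xml_doc = film_to_xml({"id": "/en/some_film", "genre": ["Thriller", "Drama"]})
```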
|
progress is looking great!
|
@gabrielmagno are you ready for me to review this PR? I was sort of waiting till you gave me the all clear!
|
Hey @epugh! Yeah, at least from my side I think I have finished it. By the way, Sonatype complained about using lxml, but since we are creating the XML and not really parsing an XML, I think we can ignore it. You can go ahead and review it.
|
@gabrielmagno how do you want to be credited in
|
|
|
This is looking really cool... I wonder, is it worth documenting running the Python scripts?
|
Hey @epugh ! You can credit me as "Gabriel Magno". How about I create a README file inside the "vectors" folder, explaining how to use the scripts?
…ample (#1213)

* Extend SolrCLI to support the additional fieldtype so that bin/solr start -e films works.
* Introduce documentation and scripts for creating a vector in the first place, an often missing part of a demo.

Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
|
Hello - I just saw these requests and I'd like to (eventually) write a full blog post on how to create the vector for input, with the example being a full Wikipedia dump using Java, a pretrained model, and deeplearning4j. I suspect that once I have this completed, it can help create an example that leads to this being a feature in Solr. Would anyone be up for a discussion on this? I suppose this point might not be the best place to ask, but this thread is the closest to finding someone who is trying to accomplish the same thing I'm doing.
|
@krickert one thought that I have is that having this sequence of events would be useful in benchmarking Solr's performance in this evolving space of dense vectors ;-). I think it would be a fantastic tutorial as well... I suggest you join the dev@solr.apache.org mailing list and ask there!
|
Thanks!
I have to get it to the point where I get the dense vectors and the vector search working first :) After that I would be glad to write a tutorial. I'll open up a discussion. I'd like someone to ensure that I'm calculating the dense vectors right. Thanks. I'll ask there!
|
|
Next time, please put newlines at the end of each sentence in our asciidoc, as is consistent with most of our asciidoc. It's much easier to consume in diffs.
|
I wish we had a lint checker for this kind of thing. We have a lot of "best practices" in writing code, but they are all enforced manually ;-(. I did google briefly for "ventilated prose lint checker" with no luck. https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line
https://issues.apache.org/jira/browse/SOLR-16574
Description
Enrich the `films` example to demonstrate how to use the Dense Vectors feature.

Solution
Added the field `film_vector` to the films dataset. This is an embedding vector created to represent the movie with 10 dimensions. The vector is created by combining the first 5 dimensions of a pre-trained BERT sentence model applied on the name of the movies plus the name of the genres, followed by an item2vec 5-dimensions model of itemset co-occurrence of genres in the movies, totaling 10 dimensions. Even though it is expected that similar movies will be close to each other, this is just a "toy example" model to serve as source for creating the films vectors.

The `README` of the example was also updated to include the specification of the Dense Vector field in the schema. Also, a new section was created, with examples showing how to make KNN queries with the vectors.

Tests
* Added `film_vector` to the 3 dataset formats (JSON, XML, CSV), making sure to preserve the exact same data from the original datasets, so that the "diff" will be only the appendage of the new field.
* The `film_vector` field was correctly parsed and indexed as well.

Checklist
Please review the following and check all that apply:
* … the `main` branch.
* … `./gradlew check`.
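As a reference for the KNN queries mentioned in the description: a 10-dimension dense vector field is declared in a Solr schema roughly as below (the type name and the `similarityFunction` value here are assumptions; check the example's actual schema), and queried with the `knn` query parser, e.g. `q={!knn f=film_vector topK=10}[0.1, 0.2, ...]`.

```xml
<!-- vectorDimension must match the embedding size; similarityFunction is assumed -->
<fieldType name="knn_vector_10" class="solr.DenseVectorField"
           vectorDimension="10" similarityFunction="cosine"/>
<field name="film_vector" type="knn_vector_10" indexed="true" stored="true"/>
```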