Merged
30 commits
17f58f2
Created the films-vectors example
gabrielmagno Dec 2, 2022
b0e32f3
Remove superfluous paragraph
gabrielmagno Dec 2, 2022
3dc0eca
Fix name of collection
gabrielmagno Dec 2, 2022
cf82a5d
Updated films dataset files adding the new film_vector field
gabrielmagno Dec 3, 2022
aacf535
Properly encoded XML fields
gabrielmagno Dec 3, 2022
3074cfa
Remove empty tags from fields
gabrielmagno Dec 3, 2022
d40a564
Update instructions for creating the collection considering the new f…
gabrielmagno Dec 4, 2022
9da8012
Add paragraph explaining how the film_vector field was generated
gabrielmagno Dec 4, 2022
c52753d
Typo
gabrielmagno Dec 4, 2022
8ca3ade
Add examples of KNN queries
gabrielmagno Dec 4, 2022
9453699
Deleted the separeted vectors example, since now it is merged with th…
gabrielmagno Dec 4, 2022
cb10b89
Fix typo
gabrielmagno Dec 4, 2022
5ae3786
Remove complex example
gabrielmagno Dec 4, 2022
06c7c6d
Better wording
gabrielmagno Dec 5, 2022
3410e0b
Merge branch 'apache:main' into main
gabrielmagno Dec 5, 2022
6734917
Updated vectors considering the new simpler model
gabrielmagno Dec 9, 2022
3badd89
Included Python scripts to create the model and the vectors of the films
gabrielmagno Dec 9, 2022
5f697ef
Updated readme example considering the new vectors
gabrielmagno Dec 9, 2022
92ff897
Updated the explanation of the model in the readme
gabrielmagno Dec 9, 2022
1952fd7
Add licence to the Python scripts
gabrielmagno Dec 10, 2022
b3245fc
Improve packaging importing
gabrielmagno Dec 10, 2022
56d02b5
Improve code and add exporting of dataset for XML and CSV
gabrielmagno Dec 10, 2022
70df38d
Set scripts as executable
gabrielmagno Dec 10, 2022
f87771b
Merge branch 'apache:main' into main
gabrielmagno Dec 10, 2022
0bf725b
Add additional schema fields set up to the films example script.
epugh Dec 16, 2022
ca7703e
track changes
epugh Dec 16, 2022
f013257
Merge remote-tracking branch 'upstream/main' into pr/1213
epugh Dec 16, 2022
cc4bfd4
fix formatting
epugh Dec 16, 2022
17b77bd
Created a readme for the vectors scripts
gabrielmagno Dec 17, 2022
56346f2
add license
epugh Dec 18, 2022
2 changes: 2 additions & 0 deletions solr/CHANGES.txt
@@ -180,6 +180,8 @@ Other Changes

* SOLR-16569: Add java system property to overseer queue size (Nick Ginther via noble)

* SOLR-16574: Demonstrate Dense Vectors and KNN as part of the Films example (Gabriel Magno via Eric Pugh)

================== 9.1.1 ==================

Bug Fixes
27 changes: 26 additions & 1 deletion solr/core/src/java/org/apache/solr/util/SolrCLI.java
@@ -3009,7 +3009,26 @@ protected void runExample(CommandLine cli, String exampleName) throws Exception
} else if ("films".equals(exampleName) && !alreadyExists) {
SolrClient solrClient = new HttpSolrClient.Builder(solrUrl).build();

echo("Adding name and initial_release_data fields to films schema \"_default\"");
echo("Adding dense vector field type to films schema \"_default\"");
try {
SolrCLI.postJsonToSolr(
solrClient,
"/" + collectionName + "/schema",
"{\n"
+ " \"add-field-type\" : {\n"
+ " \"name\":\"knn_vector_10\",\n"
+ " \"class\":\"solr.DenseVectorField\",\n"
+ " \"vectorDimension\":10,\n"
+ " \"similarityFunction\":\"cosine\",\n"
+ " \"knnAlgorithm\":\"hnsw\"\n"
+ " }\n"
+ " }");
} catch (Exception ex) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
}

echo(
"Adding name, initial_release_date, and film_vector fields to films schema \"_default\"");
try {
SolrCLI.postJsonToSolr(
solrClient,
@@ -3025,6 +3044,12 @@ protected void runExample(CommandLine cli, String exampleName) throws Exception
+ " \"name\":\"initial_release_date\",\n"
+ " \"type\":\"pdate\",\n"
+ " \"stored\":true\n"
+ " },\n"
+ " \"add-field\" : {\n"
+ " \"name\":\"film_vector\",\n"
+ " \"type\":\"knn_vector_10\",\n"
+ " \"indexed\":true,\n"
+ " \"stored\":true\n"
+ " }\n"
+ " }");
} catch (Exception ex) {
87 changes: 75 additions & 12 deletions solr/example/films/README.md
@@ -10,11 +10,14 @@ This data consists of the following fields:
* `directed_by` - The person(s) who directed the making of the film
* `initial_release_date` - The earliest official initial film screening date in any country
* `genre` - The genre(s) that the movie belongs to
* `film_vector` - A 10-dimensional vector representing the film, produced by a toy example embedding model

The `name` and `initial_release_date` are created via the Schema API, and the `genre` and `directed_by` fields
are created by the use of an Update Request Processor Chain called `add-unknown-fields-to-the-schema`.

The below steps walk you through learning how to start up Solr, setup the films collection yourself, and then load data. We'll then create ParamSets to organize our query parameters.
The `film_vector` field holds a 10-dimensional embedding vector that represents the movie. The vector is created from a pre-trained BERT model, followed by a dimensionality reduction step that shrinks the embeddings from 768 to 10 dimensions. Similar movies are expected to be close to each other, but this model is just a "toy example", so it's not guaranteed to be a good representation of the movies. The Python scripts used to create the model and calculate the film vectors are in the [vectors directory](./vectors).

The below steps walk you through learning how to start up Solr, setup the films collection yourself, and then load data. We'll then create ParamSets to organize our query parameters. Finally, we also show some examples of KNN queries exploring the Dense Vectors feature.

You can also run `bin/solr start -example films` or `bin/solr start -c -example films` for SolrCloud version which does all the below steps for you.

@@ -34,17 +37,32 @@ This data consists of the following fields:

```
curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
"add-field" : {
"name":"name",
"type":"text_general",
"multiValued":false,
"stored":true
"add-field-type" : {
"name":"knn_vector_10",
"class":"solr.DenseVectorField",
"vectorDimension":10,
"similarityFunction":"cosine",
"knnAlgorithm":"hnsw"
},
"add-field" : {
"name":"initial_release_date",
"type":"pdate",
"stored":true
}
"add-field" : [
{
"name":"name",
"type":"text_general",
"multiValued":false,
"stored":true
},
{
"name":"initial_release_date",
"type":"pdate",
"stored":true
},
{
"name":"film_vector",
"type":"knn_vector_10",
"indexed":true,
"stored":true
}
]
}'
```
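
For readers who prefer to script this step, here is a minimal Python sketch of the new Schema API request (the field type plus the three fields). It assumes a local Solr node at `http://localhost:8983` with a `films` collection; the helper names `build_schema_payload` and `build_schema_request` are hypothetical and not part of Solr. The request is only built, not sent.

```python
import json
import urllib.request

# Assumption: a local Solr node with a "films" collection, as in the curl command.
SCHEMA_URL = "http://localhost:8983/solr/films/schema"


def build_schema_payload():
    """Return the Schema API commands as a dict (same content as the new curl body)."""
    return {
        "add-field-type": {
            "name": "knn_vector_10",
            "class": "solr.DenseVectorField",
            "vectorDimension": 10,
            "similarityFunction": "cosine",
            "knnAlgorithm": "hnsw",
        },
        "add-field": [
            {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
            {"name": "initial_release_date", "type": "pdate", "stored": True},
            {"name": "film_vector", "type": "knn_vector_10", "indexed": True, "stored": True},
        ],
    }


def build_schema_request(url=SCHEMA_URL):
    """Build (but do not send) the POST request for the Schema API."""
    body = json.dumps(build_schema_payload()).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_schema_request()
    # Print the payload for inspection. Passing `request` to
    # urllib.request.urlopen would send it to a running Solr instance.
    print(request.data.decode("utf-8"))
```

Serializing with `json.dumps` guarantees strictly valid JSON (quoted strings, commas between members), which avoids relying on lenient parsing on the server side.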

@@ -57,7 +75,7 @@ This data consists of the following fields:
bin/post \
-c films \
example/films/films.csv \
-params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
-params "f.genre.split=true&f.directed_by.split=true&f.film_vector.split=true&f.genre.separator=|&f.directed_by.separator=|&f.film_vector.separator=|"
```


@@ -125,8 +143,53 @@ This data consists of the following fields:
We can say that we believe Algorithm *B* is better than Algorithm *A*. You can extend
this to online A/B testing very easily to confirm with real users.

* Now let's search with Dense Vectors and KNN!

- Before making the queries, we define an example target vector, simulating a person who
watched 3 movies: _Finding Nemo_, _Bee Movie_, and _Harry Potter and the Chamber of Secrets_.
We take the vector of each of the 3 movies and average them; the resulting vector is
used as the input vector for all the following example queries.

```
[-0.1784, 0.0096, -0.1455, 0.4167, -0.1148, -0.0053, -0.0651, -0.0415, 0.0859, -0.1789]
```
- Search for the top 10 movies most similar to the target vector (KNN Query for recommendation):

http://localhost:8983/solr/films/query?q={!knn%20f=film_vector%20topK=10}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

* Notice that among the results, there are some animation family movies, such as _Curious George_ and _Bambi_, which makes sense, since the target vector was created with two other animation family movies (_Finding Nemo_ and _Bee Movie_).
* We also notice that the results include two movies that the person already watched. In the next example we will filter them out.

- Search for the top 10 movies most similar to the resulting vector, excluding the movies already watched (KNN query with Filter Query):

http://localhost:8983/solr/films/query?q={!knn%20f=film_vector%20topK=10}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]&fq=-id:("%2Fen%2Ffinding_nemo"%20"%2Fen%2Fbee_movie"%20"%2Fen%2Fharry_potter_and_the_chamber_of_secrets_2002")

- Search for movies with "cinderella" in the name among the top 50 movies most similar to the target vector (KNN as Filter Query):

http://localhost:8983/solr/films/query?q=name:cinderella&fq={!knn%20f=film_vector%20topK=50}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

* There are 3 "cinderella" movies in the index, but only 1 is among the top 50 most similar to the target vector (_Cinderella III: A Twist in Time_).

- Search for movies with "animation" in the genre, and rerank the top 5 documents by combining (sum) the original query score with twice (2x) the similarity to the target vector (KNN with ReRanking):

http://localhost:8983/solr/films/query?q=genre:animation&rqq={!knn%20f=film_vector%20topK=10000}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]&rq={!rerank%20reRankQuery=$rqq%20reRankDocs=5%20reRankWeight=2}

* To guarantee we calculate the vector similarity score for all the movies, we set `topK=10000`, a number higher than the total number of documents (`1100`).
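
The target vector and the KNN query URLs used above could also be built programmatically. The following is a minimal Python sketch under stated assumptions: the helper names (`average_vectors`, `knn_query`, `query_url`) are hypothetical, and the three "watched" vectors in the demo are placeholder values, not the real film vectors from the dataset.

```python
from urllib.parse import quote, urlencode


def average_vectors(vectors):
    """Return the component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def knn_query(field, top_k, vector):
    """Format a query string for Solr's knn query parser."""
    values = ",".join(str(x) for x in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)


def query_url(base_url, **params):
    """Append percent-encoded query parameters to the base URL."""
    return base_url + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    # Placeholder vectors for illustration only (not taken from films.csv).
    watched = [
        [0.1, 0.2, 0.3],
        [0.3, 0.2, 0.1],
        [0.2, 0.2, 0.2],
    ]
    print(average_vectors(watched))

    # The target vector given in the README.
    target = [-0.1784, 0.0096, -0.1455, 0.4167, -0.1148,
              -0.0053, -0.0651, -0.0415, 0.0859, -0.1789]
    print(query_url(
        "http://localhost:8983/solr/films/query",
        q=knn_query("film_vector", 10, target),
    ))
```

Encoding the parameters with `urlencode` avoids hand-escaping the braces, spaces, and brackets that appear in the example URLs.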

* It's possible to combine the vector similarity scores with other scores, by using Sub-query,
Function Queries and Parameter Dereferencing Solr features:

- Search for "harry potter" movies, ranking the results by similarity to the target vector instead of the lexical query score. Besides the `q` parameter, we define a "sub-query" named `q_vector` that calculates the similarity score between the target vector and all the movies (since we set `topK=10000`). Then we use the sub-query parameter name as input for `sort`, specifying that we want to rank in descending order of vector similarity score (`sort=$q_vector desc`):

http://localhost:8983/solr/films/query?q=name:"harry%20potter"&q_vector={!knn%20f=film_vector%20topK=10000}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]&sort=$q_vector%20desc

- Search for movies with "the" in the name, keeping the original lexical query ranking, but returning only movies with similarity to the target vector of 0.8 or higher. Like previously, we define the sub-query `q_vector`, but this time we use it as input for the `frange` filter, specifying that we want documents with at least 0.8 of vector similarity score:

http://localhost:8983/solr/films/query?q=name:the&q_vector={!knn%20f=film_vector%20topK=10000}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]&fq={!frange%20l=0.8}$q_vector

- Search for "batman" movies, ranking the results by combining 70% of the original lexical query score and 30% of the similarity to the target vector. Besides the `q` main query and the `q_vector` sub-query, we also specify the `q_lexical` query, which will hold the lexical score of the main `q` query. Then we specify a parameter variable called `score_combined`, which scales the lexical and similarity scores to the 0 to 1 range, applies the 0.7 and 0.3 weights, then sums the results. We set the `sort` parameter to order by the combined score, and also set the `fl` parameter so that we can view the intermediate and combined score values in the response:

http://localhost:8983/solr/films/query?q=name:batman&q_lexical={!edismax%20v=$q}&q_vector={!knn%20f=film_vector%20topK=10000}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]&score_combined=sum(mul(scale($q_lexical,0,1),0.7),mul(scale($q_vector,0,1),0.3))&sort=$score_combined%20desc&fl=name,score,$q_lexical,$q_vector,$score_combined
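
To make the arithmetic behind `score_combined` concrete, here is a small Python sketch of the same formula, `sum(mul(scale(lexical,0,1),0.7),mul(scale(vector,0,1),0.3))`, applied to plain lists of scores. It is an illustration only, not Solr's implementation: the function names are hypothetical, the scores are made up, and mapping a constant score list to the lower bound is a choice made for this sketch.

```python
def scale(values, lo=0.0, hi=1.0):
    """Min-max scale values into [lo, hi], computed across all given scores."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Assumption for this sketch: constant input maps to the lower bound.
        return [lo for _ in values]
    span = largest - smallest
    return [lo + (hi - lo) * (v - smallest) / span for v in values]


def combine_scores(lexical, vector, w_lexical=0.7, w_vector=0.3):
    """Weighted sum of the scaled lexical and vector similarity scores."""
    if len(lexical) != len(vector):
        raise ValueError("both score lists must cover the same documents")
    scaled_lexical = scale(lexical)
    scaled_vector = scale(vector)
    return [
        w_lexical * l + w_vector * v
        for l, v in zip(scaled_lexical, scaled_vector)
    ]


if __name__ == "__main__":
    # Made-up scores for three documents, for illustration only.
    lexical_scores = [4.2, 3.1, 2.5]
    vector_scores = [0.61, 0.95, 0.80]
    for combined in combine_scores(lexical_scores, vector_scores):
        print(round(combined, 4))
```

Scaling both scores to the same range before weighting matters because lexical scores and cosine similarities live on different scales; without it, the 0.7 and 0.3 weights would not reflect the intended balance.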


FAQ: