
dot_product does not work! #107780

Open
longday1102 opened this issue Apr 23, 2024 · 5 comments
Assignees
Labels
:Search/Vectors Vector search Team:Search Meta label for search team

Comments


longday1102 commented Apr 23, 2024

Elasticsearch Version

8.13.0

Below is the code where I define my index mappings; the content_vector field uses dot_product similarity. When I get the embedding vectors from my model, I normalize each vector to a length of 1. The actions variable is what I use to push my data to Elasticsearch.
However, this code fails with elasticsearch.helpers.BulkIndexError: 423 document(s) failed to index.
I found that no error occurs if I either replace _source in actions with doc or change the similarity of the content_vector field to cosine.

fields_setting = {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "similarity": "BM25",
                    "analyzer": "vn_analyzer"
                },
                "grade": {
                    "type": "integer"
                },
                "subject": {
                    "type": "integer"
                },
                "unit": {
                    "type": "text",
                    "similarity": "BM25",
                    "analyzer": "vn_analyzer"
                },
                "title": {
                    "type": "text",
                    "similarity": "BM25",
                    "analyzer": "vn_analyzer"
                },
                "section": {
                    "type": "text",
                    "similarity": "BM25",
                    "analyzer": "vn_analyzer"
                },
                "content_vector": {
                    "type": "dense_vector",
                    "index": True,
                    "dims": hidden_size,
                    "element_type": "float",
                    "similarity": "dot_product"
                }
            }
        }
    }

index_config = ESConfig.index_config(stopwords = stopwords, fields_setting = fields_setting)
es.indices.create(index = "es_text", body = index_config, request_timeout = 1000)

actions = [
        {
            "_op_type": "index",
            "_index": args.index_name,
            "_id": cfg["_id"],
            "_source": {
                "content": cfg["content"],
                "content_vector": cfg["content_vector"],
                "grade": cfg["grade"],
                "subject": cfg["subject"],
                "unit": cfg["unit"],
                "title": cfg["title"],
                "section": cfg["section"],
            }
        } for cfg in document
    ]
helpers.bulk(es, actions, request_timeout = 1000)

Can someone explain why this error occurs?
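
For reference, normalizing a vector to length 1 as described above can be done like this (a minimal sketch for illustration, not the exact embedding code; numpy is assumed):

import numpy as np

def unit_normalize(vector):
    # Scale the vector so its L2 (Euclidean) norm is 1, which is what the
    # dot_product similarity expects for float vectors.
    v = np.asarray(vector, dtype=np.float32)
    return (v / np.linalg.norm(v)).tolist()

# Applied to every document before building actions:
for cfg in document:
    cfg["content_vector"] = unit_normalize(cfg["content_vector"])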

@longday1102 longday1102 added >bug needs:triage Requires assignment of a team area label labels Apr 23, 2024
@longday1102 longday1102 reopened this Apr 24, 2024
@benwtrent
Member

@longday1102 this is likely due to your vector values not being normalized (magnitude of length 1).

The bulk response should contain details indicating why the documents failed to index.
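
One way to surface those details (a sketch, assuming the 8.x Python client, where BulkIndexError carries the per-document errors):

from elasticsearch import helpers

try:
    helpers.bulk(es, actions, request_timeout=1000)
except helpers.BulkIndexError as e:
    # Each entry describes one rejected document, including the error reason.
    for err in e.errors[:5]:
        print(err)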

@benwtrent benwtrent added :Search/Vectors Vector search and removed >bug needs:triage Requires assignment of a team area label labels Apr 24, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Apr 24, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Member

@longday1102 if you have trouble getting the bulk response to print the error, you can also make a manual request with one of your documents in the Kibana console (if you have Kibana) or via curl; the response body will contain the reason for the failure.
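
If curl or Kibana are not handy, roughly the same check can be done from Python with the existing client (a sketch, assuming the 8.x client and that document[0] is one of the failing documents):

from elasticsearch import ApiError

doc = document[0]
try:
    es.index(index=args.index_name, id=doc["_id"], document={
        "content": doc["content"],
        "content_vector": doc["content_vector"],
    })
except ApiError as e:
    # The response body contains the reason the document was rejected.
    print(e.body)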

@benwtrent benwtrent self-assigned this Apr 24, 2024
@longday1102
Author

longday1102 commented Apr 26, 2024

@longday1102 this is likely due to your vector values not being normalized (magnitude of length 1).

The bulk response should contain details indicating why the documents failed to index.

@benwtrent I normalized each dense_vector to length 1, but that didn't solve the error.
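
A quick way to double-check the norms right before indexing might look like this (a rough sketch; numpy and an arbitrary tolerance are assumed):

import numpy as np

for cfg in document:
    norm = np.linalg.norm(np.asarray(cfg["content_vector"], dtype=np.float32))
    if abs(norm - 1.0) > 1e-4:
        # Any vector printed here is not unit length in float32 terms.
        print(cfg["_id"], norm)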

@benwtrent
Member

@longday1102 to solve the issue, we will need to know what the bulk error failure details are. Could you provide those? Another option is to attempt to index the same documents (or just one of the failing ones) manually via the Kibana Console or curl.
