Add sparse_vector query #108254

Merged May 22, 2024 (81 commits)

Member

@kderusso commented May 3, 2024

Relates to #106261

This PR introduces a new sparse_vector query that combines the functionality of the text_expansion and weighted_tokens queries. Eventually this query will replace the other two queries.

Actions that will occur in future PRs:

  • Deprecating the text_expansion and weighted_tokens queries
  • Removing references to the weighted_tokens query in the SparseVectorQueryBuilder
  • Moving the sparse_vector query outside of the ML plugin (requires some inference API work)

Examples of how to use this new query type:

POST /docs/_search
{
  "query": {
    "sparse_vector": {
      "field": "content_embedding",
      "inference_id": "my-elser-model",
      "query": "how is the weather in jamaica"
    }
  }
}

POST /docs/_search
{
  "query": {
    "sparse_vector": {
      "field": "content_embedding",
      "query_vector": {
        "heat": 0.8471008,
        "atmosphere": 0.24251926,
        "very": 0.15496244,
        "brazil": 0.34641686,
        "winter": 0.5070546,
        "hardy": 0.15455529,
        "cold": 0.34308356,
        "sun": 0.052155916,
        "summer": 0.44945887,
        "beautiful": 0.28135294,
        "caribbean": 1.0542055,
        "jamaican": 1.4637164,
        "jamaica": 2.7778716,
        "geography": 0.48493838,
        "weather": 1.9407427,
        "temperature": 1.0894402,
        "season": 0.50890195,
        "quite": 0.20918,
        "cuba": 0.17437862,
        "rain": 0.14961956,
        "africa": 0.35994464,
        "festival": 0.3104579,
        "pleasant": 0.28202823,
        "island": 0.11068311,
        "forecast": 0.34817883,
        "climate": 1.2159579,
        "humid": 0.8120001,
        "fiji": 0.28404987,
        "tropical": 1.0291497,
        "te": 0.15929775,
        "warm": 1.3682823,
        "kingston": 0.104007065,
        "culture": 0.1396368,
        "beach": 0.085419096,
        "visit": 0.007837615,
        "barbados": 0.23561467,
        "desert": 0.05903566
      }
    }
  }
}
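
The first two examples show the two input forms of the query: text plus an inference ID, or a precomputed token-weight map. A minimal Python sketch of a helper that builds either request body is given below. The function name `build_sparse_vector_query` is hypothetical and not part of this PR, and the rule that the two forms cannot be mixed is an assumption based on the examples; check the query documentation for the version you run.

```python
from typing import Any, Dict, Optional


def build_sparse_vector_query(
    field: str,
    inference_id: Optional[str] = None,
    query: Optional[str] = None,
    query_vector: Optional[Dict[str, float]] = None,
) -> Dict[str, Any]:
    """Build a _search body for one of the two sparse_vector forms.

    Hypothetical helper: the exclusivity check below is an assumption,
    not behaviour confirmed by this PR.
    """
    uses_inference = inference_id is not None or query is not None
    if uses_inference and query_vector is not None:
        raise ValueError("use either inference_id + query, or query_vector")
    if uses_inference:
        # The text form needs both the inference ID and the query text.
        if inference_id is None or query is None:
            raise ValueError("inference_id and query must be given together")
        clause: Dict[str, Any] = {
            "field": field,
            "inference_id": inference_id,
            "query": query,
        }
    elif query_vector is not None:
        # The precomputed form sends token -> weight pairs directly.
        clause = {"field": field, "query_vector": dict(query_vector)}
    else:
        raise ValueError("provide inference_id + query, or query_vector")
    return {"query": {"sparse_vector": clause}}
```

The returned dictionary can be serialized with `json.dumps` and sent as the body of `POST /docs/_search`.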

POST /docs/_search
{
  "query": {
    "sparse_vector": {
      "field": "content_embedding",
      "inference_id": "my-elser-model",
      "query": "how is the weather in jamaica",
      "prune": true,
      "pruning_config": {
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
        "only_score_pruned_tokens": false
      }
    }
  }
}
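
The pruning options above can also be attached programmatically. The following Python sketch adds `prune` and `pruning_config` to an existing `sparse_vector` clause. The helper name `with_pruning` is hypothetical. The defaults mirror the values in the example, and the range checks (1 to 100 for the frequency ratio, 0 to 1 for the weight threshold) are assumptions taken from my reading of the query documentation, so verify them against your Elasticsearch version.

```python
from typing import Any, Dict


def with_pruning(
    clause: Dict[str, Any],
    tokens_freq_ratio_threshold: float = 5,
    tokens_weight_threshold: float = 0.4,
    only_score_pruned_tokens: bool = False,
) -> Dict[str, Any]:
    """Return a copy of a sparse_vector clause with pruning enabled.

    Hypothetical helper; the accepted ranges are assumptions.
    """
    if not 1 <= tokens_freq_ratio_threshold <= 100:
        raise ValueError("tokens_freq_ratio_threshold must be in [1, 100]")
    if not 0 <= tokens_weight_threshold <= 1:
        raise ValueError("tokens_weight_threshold must be in [0, 1]")
    pruned = dict(clause)  # shallow copy so the caller's clause is untouched
    pruned["prune"] = True
    pruned["pruning_config"] = {
        "tokens_freq_ratio_threshold": tokens_freq_ratio_threshold,
        "tokens_weight_threshold": tokens_weight_threshold,
        "only_score_pruned_tokens": only_score_pruned_tokens,
    }
    return pruned
```

Wrapping the result as `{"query": {"sparse_vector": pruned}}` yields a body with the same shape as the example above.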

POST /docs/_search
{
  "query": {
    "sparse_vector": {
      "field": "content_embedding",
      "query_vector": {
        "heat": 0.8471008,
        "atmosphere": 0.24251926,
        "very": 0.15496244,
        "brazil": 0.34641686,
        "winter": 0.5070546,
        "hardy": 0.15455529,
        "cold": 0.34308356,
        "sun": 0.052155916,
        "summer": 0.44945887,
        "beautiful": 0.28135294,
        "caribbean": 1.0542055,
        "jamaican": 1.4637164,
        "jamaica": 2.7778716,
        "geography": 0.48493838,
        "weather": 1.9407427,
        "temperature": 1.0894402,
        "season": 0.50890195,
        "quite": 0.20918,
        "cuba": 0.17437862,
        "rain": 0.14961956,
        "africa": 0.35994464,
        "festival": 0.3104579,
        "pleasant": 0.28202823,
        "island": 0.11068311,
        "forecast": 0.34817883,
        "climate": 1.2159579,
        "humid": 0.8120001,
        "fiji": 0.28404987,
        "tropical": 1.0291497,
        "te": 0.15929775,
        "warm": 1.3682823,
        "kingston": 0.104007065,
        "culture": 0.1396368,
        "beach": 0.085419096,
        "visit": 0.007837615,
        "barbados": 0.23561467,
        "desert": 0.05903566
      },
      "prune": true,
      "pruning_config": {
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
        "only_score_pruned_tokens": false
      }
    }
  }
}
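
Because the description says this query combines `text_expansion` and `weighted_tokens`, existing query clauses could in principle be rewritten mechanically. The Python sketch below illustrates one such mapping. The function `to_sparse_vector` is hypothetical. It assumes the legacy clauses use the field name as the key, with `model_id`/`model_text` for `text_expansion` and `tokens` for `weighted_tokens`. It also assumes the legacy `model_id` can be reused as the `inference_id` unless a different one is passed, which may not hold for every deployment.

```python
from typing import Any, Dict, Optional


def to_sparse_vector(
    legacy: Dict[str, Any], inference_id: Optional[str] = None
) -> Dict[str, Any]:
    """Rewrite a text_expansion or weighted_tokens clause as sparse_vector.

    Hypothetical migration sketch, not code from this PR.
    """
    if len(legacy) != 1:
        raise ValueError("expected exactly one query type")
    query_type, by_field = next(iter(legacy.items()))
    if len(by_field) != 1:
        raise ValueError("expected exactly one field")
    field, params = next(iter(by_field.items()))

    clause: Dict[str, Any] = {"field": field}
    if query_type == "text_expansion":
        # Assumption: model_id is a valid inference_id unless overridden.
        clause["inference_id"] = inference_id or params["model_id"]
        clause["query"] = params["model_text"]
    elif query_type == "weighted_tokens":
        clause["query_vector"] = dict(params["tokens"])
    else:
        raise ValueError(f"unsupported query type: {query_type}")

    if "pruning_config" in params:
        # sparse_vector uses an explicit prune flag next to pruning_config.
        clause["prune"] = True
        clause["pruning_config"] = dict(params["pruning_config"])
    return {"sparse_vector": clause}
```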

@kderusso changed the title from "Add sparse_vector query" to "WIP: Add sparse_vector query" on May 3, 2024
Member Author

Auto formatting by IntelliJ

Contributor

@Mikep86 left a comment

LGTM, thank you for the iterations!

Member

@carlosdelest left a comment

LGTM, good work!

I think some doc changes are important. They can be done in a separate PR, but let's not forget about them ;)

Member

We should modify the sparse vector field type doc to specify that sparse_vector is the preferred query to use instead of text_expansion.

We could say that text_expansion still works but will be deprecated in the future, or something along those lines.

Member Author

I have a follow-up issue to deprecate the text_expansion query as well.

Member Author

There is a lot of other documentation that should be updated (examples, etc.), but that will be handled in a dedicated follow-up PR.

Contributor

@mayya-sharipova left a comment

@kderusso Excellent job on the main code, documentation, and tests! I left some clarifying questions, but nothing major.

@kderusso
Member Author

@elasticmachine update branch

@kderusso merged commit 7f35f1b into elastic:main on May 22, 2024
15 checks passed
@kderusso deleted the kderusso/sparse-vector-query-2 branch on July 8, 2024 17:25
Labels: >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.15.0
8 participants