Perform a Wildcard search #1365

Closed
nicovak opened this issue Jan 24, 2020 · 10 comments

@nicovak nicovak commented Jan 24, 2020

Hello,

It seems that Searchkick currently uses a regexp query for wildcard-style searches (the like operator, etc.).

Elasticsearch natively provides a wildcard query: link to docs

elasticsearch-ruby also supports it: link to git file

I ran into an issue with this: I want to search a field containing JSON, e.g. "name":"John".
It fails with the like operator but works with a wildcard query.

Is there a reason to use regexp over wildcard? Should I open a PR to integrate a wildcard operator, or have I misunderstood something?

IMO, wildcard would be more efficient for the new like operator added in 4.1.0 (2019-08-01).

Thanks,

@ankane ankane commented Jan 24, 2020

Hey @nicovak, thanks for the suggestion. Happy to consider this. Can you put together a script with the elasticsearch-ruby gem that shows:

  1. How they differ with JSON fields
  2. How they differ in performance

@nicovak nicovak commented Jan 24, 2020

@ankane I'm not running the query on a real JSON field. It's a Postgres text column.
Here is a simple comparison on my local data set (approximately 25 million rows), running the same query in Kibana.

My query, le, should match as many records as possible.

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": [
        {
          "regexp": {
            "data": {
              "value": ".*le.*"
            }
          }
        }
      ]
    }
  },
  "sort": {
    "created_at": "desc"
  },
  "timeout": "11s",
  "_source": false,
  "size": 150
}

regexp took

  • 12481
  • 11047
  • 1775
  • 1807
  • 1628
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": [
        {
          "wildcard": {
            "data": {
              "value": "*le*"
            }
          }
        }
      ]
    }
  },
  "sort": {
    "created_at": "desc"
  },
  "timeout": "11s",
  "_source": false,
  "size": 150
}

wildcard took

  • 11099
  • 9221
  • 1760
  • 1764
  • 1781

Performance looks pretty much the same for this type of query.

Now with a JSON-like query:

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": [
        {
          "regexp": {
            "data": {
              "value": ".*\"ch\":\"2\"*."
            }
          }
        }
      ]
    }
  },
  "sort": {
    "created_at": "desc"
  },
  "timeout": "11s",
  "_source": false,
  "size": 150
}

regexp took

  • 5120
  • 5913
  • 719
  • 516
  • 512

regexp didn't match any results...

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": [
        {
          "wildcard": {
            "data": {
              "value": "*\"ch\":\"2\"*"
            }
          }
        }
      ]
    }
  },
  "sort": {
    "created_at": "desc"
  },
  "timeout": "11s",
  "_source": false,
  "size": 150
}

wildcard took

  • 6671
  • 6374
  • 710
  • 723
  • 730

wildcard worked as expected

@ankane ankane commented Jan 24, 2020

Hey @nicovak, thanks for the additional info, but without a reproducible script, it's not clear why the matching is different and there's no way to confirm the benchmarks.

@nicovak nicovak commented Jan 24, 2020

Sorry to ask, @ankane, but can you please tell me what exactly you expect? A blank Rails app with a data set and both implementations, one with Searchkick and one with elasticsearch-ruby?

@ankane ankane commented Jan 24, 2020

The two things you've mentioned are different matching behavior and performance, so it's probably best to create separate Ruby scripts to demonstrate each of them. It could be with Searchkick or elasticsearch-ruby. The gist in the Contributing Guide is a good place to start.

@nicovak nicovak commented Jan 28, 2020

OK @ankane, thanks :)
Here it is: https://gist.github.com/nicovak/3433806c4c257cbd2d3c46ae46db393d

And the output:

Searchkick version: 4.2.1
Elasticsearch version: 7.5.0
-- create_table(:products)
   -> 0.0062s
{"took"=>1, "timed_out"=>false, "_shards"=>{"total"=>1, "successful"=>1, "skipped"=>0, "failed"=>0}, "hits"=>{"total"=>{"value"=>1, "relation"=>"eq"}, "max_score"=>1.0, "hits"=>[{"_index"=>"products_development_20200128162835840", "_type"=>"_doc", "_id"=>"3", "_score"=>1.0}]}}
"-----------"
"1 found"
"-----------"
{"took"=>2, "timed_out"=>false, "_shards"=>{"total"=>1, "successful"=>1, "skipped"=>0, "failed"=>0}, "hits"=>{"total"=>{"value"=>0, "relation"=>"eq"}, "max_score"=>nil, "hits"=>[]}}
"-----------"
"0 found"
"-----------"
{"took"=>7, "timed_out"=>false, "_shards"=>{"total"=>1, "successful"=>1, "skipped"=>0, "failed"=>0}, "hits"=>{"total"=>{"value"=>1, "relation"=>"eq"}, "max_score"=>1.0, "hits"=>[{"_index"=>"products_development_20200128162835840", "_type"=>"_doc", "_id"=>"3", "_score"=>1.0, "_source"=>{"name"=>"Test2", "data"=>"{\"2\":{\"ch\":\"2\"}}"}}]}}
"-----------"
"1 found"
"-----------"
ankane added a commit that referenced this issue Jan 28, 2020

@ankane ankane commented Jan 28, 2020

Hey @nicovak, thanks for the gist. It was an issue with the " character. Should be fixed on master.
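
For context, `"` is one of the reserved characters in Elasticsearch's regexp syntax, so an unescaped `"` in a regexp pattern is not matched literally. The Ruby sketch below shows one way to escape user text before embedding it in a regexp query. It is an illustration only: `escape_regexp` and `contains_regexp` are hypothetical helpers, not the code from the referenced commit, and the reserved-character list assumes the default regexp flags (all optional operators enabled).

```ruby
# Hypothetical helpers for illustration; not Searchkick's implementation.
# Characters reserved by Elasticsearch's regexp syntax, including those
# used by the optional operators (assumes the default flags).
ES_REGEXP_RESERVED = '.?+*|{}[]()"\\#@&<>~'.chars.freeze

# Prefix every reserved character with a backslash so it matches literally.
def escape_regexp(text)
  text.gsub(Regexp.union(ES_REGEXP_RESERVED)) { |ch| "\\#{ch}" }
end

# Build a "contains" pattern around the escaped text.
def contains_regexp(text)
  ".*#{escape_regexp(text)}.*"
end

puts contains_regexp('"ch":"2"')
```

The resulting string would go in the `value` of the regexp filter shown earlier in the thread.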

@ankane ankane commented Jan 28, 2020

It looks like it's possible to use a wildcard query (see wildcard branch) with slightly less escaping. However, it's not documented that \ escapes the wildcard characters (but does work). https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html
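
To illustrate the "slightly less escaping" point, here is a minimal Ruby sketch that converts a LIKE-style pattern into a wildcard query value. `like_to_wildcard` is a hypothetical helper, not the code on the wildcard branch. It assumes `%` is the only LIKE wildcard to translate, and it relies on the backslash escaping described above, which works but is undocumented.

```ruby
# Hypothetical helper for illustration; not the code on the wildcard branch.
# Translates "%" to the wildcard operator "*" and backslash-escapes the
# characters that are special in a wildcard query ("*", "?" and "\").
def like_to_wildcard(pattern)
  pattern.gsub(/[*?\\%]/) do |ch|
    ch == "%" ? "*" : "\\#{ch}"
  end
end

puts like_to_wildcard('%"ch":"2"%')
```

Quotes, braces, and colons pass through unchanged here, whereas a regexp pattern would need several of them escaped.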

@nicovak nicovak commented Jan 29, 2020

If you use Kibana and auto-indent the query, it automatically replaces the opening and closing double quotes of the value with triple quotes.

Example:

GET _search
{
  "query": {
    "wildcard": {
      "data": {
        "value": "*{\"name\":\"John\"}*"
      }
    }
  }
}

becomes

GET _search
{
  "query": {
    "wildcard": {
      "data": {
        "value": """*{"name":"John"}*"""
      }
    }
  }
}

Maybe this helps. I don't really understand why, but the results returned are the same in both cases.
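
One likely explanation, offered as an assumption rather than something confirmed in this thread: the triple-quoted form is a convenience syntax of the Kibana console, which sends it as a regular escaped JSON string, so both requests carry the same value. The Ruby sketch below only checks that the escaped JSON form decodes to the same text that appears between the triple quotes.

```ruby
require "json"

# The request body as written with backslash-escaped quotes.
escaped_body = '{"query":{"wildcard":{"data":{"value":"*{\"name\":\"John\"}*"}}}}'

# The text that appears between the triple quotes in the Kibana console.
raw_value = '*{"name":"John"}*'

# Decode the JSON body and pull out the wildcard value.
decoded = JSON.parse(escaped_body).dig("query", "wildcard", "data", "value")

puts decoded == raw_value
```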

@ankane ankane commented Jan 30, 2020

I don't see a strong reason to switch to wildcards right now, but if you're able to put together a benchmark that shows it's significantly faster (can use same gist and benchmark-ips), feel free to open a new issue.

@ankane ankane closed this Jan 30, 2020