Case insensitive field names #68561

Closed
niemyjski opened this issue Feb 4, 2021 · 12 comments
Labels
>enhancement, high hanging fruit, :Search Foundations/Mapping, Team:Search Foundations

Comments

@niemyjski
Contributor

niemyjski commented Feb 4, 2021

I've searched the issues and community forum but couldn't really find any requests or issues talking about this.

I'd love for field names to be case insensitive. This would enable more scenarios, such as source includes (take a list of fields to include from an API and have it just work). It would probably also save people a lot of time tracking down other issues.

POST test-v1/_doc/test
{
  "test": "abc",
  "Test": "abcd",
  "tEst": "abcde"
}

GET /test-v1/_mapping
{
  "test-v1" : {
    "mappings" : {
      "properties" : {
        "Test" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "tEst" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "test" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

For example, we have a query parser (https://github.com/FoundatioFx/Foundatio.Parsers) and we can resolve any mapped field to the correct case, but cannot with unmapped fields.
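
A minimal sketch of the kind of case-insensitive resolution described above, assuming the caller already holds the list of mapped field names. It is not taken from Foundatio.Parsers; `build_case_index` and `resolve_field` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def build_case_index(field_names):
    """Group exact mapped field names by their case-folded form."""
    index = defaultdict(list)
    for name in field_names:
        index[name.casefold()].append(name)
    return index


def resolve_field(requested, index):
    """Return the exact mapped name for `requested`, ignoring case.

    Returns None when the field is unmapped, and raises when several
    mapped fields differ only by case, since picking one would be a guess.
    """
    candidates = index.get(requested.casefold(), [])
    if requested in candidates:
        return requested  # an exact match always wins
    if len(candidates) > 1:
        raise ValueError(f"ambiguous field {requested!r}: {sorted(candidates)}")
    return candidates[0] if candidates else None
```

An unmapped field resolves to `None` here, which is exactly the case the comment says cannot be handled: there is nothing to resolve against.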

Also, I'd really love to know why field names are still case sensitive (I understand that JSON is case sensitive) and how there hasn't been a breaking change to change the field name behavior. One could log a warning when multiple field names differing only by case are present, or just index the first field (and discard any extras with the same name?). I could see a document having different casings of a field, and that's OK: they would share a single mapping. This would also help prevent field explosions and make querying and using the Elastic API easier.

Databases have a variety of sensitivities. SQL, by default, is case insensitive to identifiers and keywords, but case sensitive to data. JSON is case sensitive to both field names and data.
https://blog.couchbase.com/json-case-sensitive-insensitive-search-index-data/#:~:text=SQL%2C%20by%20default%2C%20is%20case,both%20field%20names%20and%20data.

niemyjski added the >enhancement and needs:triage labels Feb 4, 2021
niemyjski changed the title from "Support case insensitive field names" to "Case insensitive field names" Feb 4, 2021
@markharwood
Contributor

markharwood commented Feb 5, 2021

Thanks for the comments.

Also, I'd really love to know why field names are still case sensitive (I understand that JSON is case sensitive)

Unfortunately, I think that's the answer. We're built on JSON and its behaviour is something we can't change.
As far as I can tell, MongoDB is the same in this regard.

and how there hasn't been a breaking change to change the field name behavior

Any breaking change has to reach a certain level of importance for it to be considered. The importance can be measured by things like:

  1. The number of people calling for the change
  2. The lack of any good workarounds in the status quo
  3. Our ability to migrate cleanly (e.g. having old indices and new indices co-exist under new software)

By the above measures:

  1. I think this is the first time we've had this issue logged
  2. Clients can normalise data in their client code or using ingest pipelines (although query/agg field names have no equivalent of doc ingest pipelines to change fieldnames)
  3. I imagine it will be very difficult to provide software that allows a cluster to run a mix of old and new indices. Also, the alternative of asking customers to reindex all historical data is typically a no-no
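
The client-side workaround mentioned in point 2 can be sketched as follows. This is only an illustration under the assumption that lowercasing every key is acceptable for the application; `normalise_keys` is a hypothetical helper, not part of any Elasticsearch client.

```python
def normalise_keys(value):
    """Recursively lowercase every object key in a JSON-like value.

    Raises if two keys in the same object differ only by case, because
    silently keeping one of them would drop data.
    """
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            folded = key.lower()
            if folded in result:
                raise ValueError(f"keys collide after lowercasing: {key!r}")
            result[folded] = normalise_keys(item)
        return result
    if isinstance(value, list):
        return [normalise_keys(item) for item in value]
    return value  # scalars (the data itself) are left untouched
```

The document would be passed through this helper before indexing. As the comment notes, it does nothing for field names used in queries and aggregations, which would need the same treatment separately.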

I'll keep this issue open to see if it attracts any more interest but I wouldn't bank on this change happening anytime soon.

markharwood added the :Search Foundations/Mapping label Feb 5, 2021
elasticmachine added the Team:Search label Feb 5, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

markharwood removed the needs:triage label Feb 5, 2021
@ejsmith

ejsmith commented Feb 5, 2021

I've wondered this as well. It seems like insanity to me that you can create 2 fields with the same name and different casing.

@markharwood
Contributor

I discussed this with the team today and we agreed that, while it would be a nice-to-have for some users, the impact of such a change is huge and not something we will realistically attempt in the foreseeable future.
Closing, but will reopen if this ever changes.

Thanks for reaching out and sorry we're not able to help with this.

@StingyJack

a nice-to-have for some user

@markharwood - Would any of you assign a different meaning to the word "DOG" if it were spelled "dog"? No, because it's still referring to Canis familiaris.

This problem manifests itself as duplicated data points, often with different values. Think that's rare? It happens every time I use NEST. I can create a mapping like this...

{
    "settings": {
        "analysis": {
            "normalizer": {
                "customSearchNormalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "ListName": {
                "type": "keyword",
                "normalizer": "customSearchNormalizer"
            },
            "FieldNames": {
                "type": "keyword",
                "normalizer": "customSearchNormalizer"
            }          
        }
    }
}

... and verify that in Kibana, but when the first document is indexed, NEST decides to use a different casing for the field names and the mapping changes to this...

{
  "mappings": {
    "_doc": {
      "properties": {
        "FieldNames": {
          "type": "keyword",
          "normalizer": "customSearchNormalizer"
        },
        "ListName": {
          "type": "keyword",
          "normalizer": "customSearchNormalizer"
        },
        "fieldNames": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "listName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

... and then I can't find any documents, because I created the index using FieldNames and fieldNames was added to the mapping, making it the place where the data is actually stored.

 "_source" : {
          "listName" : "MyList",
          "fieldNames" : [
            "OrgField1",
...
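
One way to catch this kind of drift early is to scan a mapping for field names that differ only by case. The sketch below assumes the `properties` object of a mapping has already been loaded into a Python dict; `case_variant_fields` is a hypothetical helper name.

```python
from collections import defaultdict


def case_variant_fields(properties, prefix=""):
    """Return groups of field paths under `properties` that differ only by case."""
    groups = defaultdict(list)
    clashes = []
    for name, definition in properties.items():
        groups[name.lower()].append(prefix + name)
        # Recurse into object fields; multi-fields under "fields" are skipped.
        nested = definition.get("properties")
        if nested:
            clashes.extend(case_variant_fields(nested, prefix + name + "."))
    clashes.extend(sorted(paths) for paths in groups.values() if len(paths) > 1)
    return clashes
```

Run against the `properties` of the second mapping above, this would report FieldNames/fieldNames and ListName/listName as clashing pairs.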

Yes, I know about .DefaultFieldNameInferrer(p => p), but the point is I shouldn't have to google around or remember to do that.

JSON is specced wrong for the same reason. Nobody really wants to have two object properties with different cases and different values in a data payload. That's part of the recipe for a nightmare-level support and troubleshooting experience, and ES isn't required to jump into that pit just because JSON does.

MSSQL (or Sybase at the time) figured out 30+ years ago that users don't want to deal with differences in casing when they go look for their data, and that they don't want their data altered by forcing some normalization scheme in order to enable that searching. The scenario in your userbase where someone actually depends on having field names of different cases is going to be exceedingly rare if at all. Give an option to allow duplicates that differ only by case if you think there are any users who need it, but please don't make the rest of us continue to suffer this problem.

@markharwood
Contributor

markharwood commented Sep 20, 2021

Would any of you assign a different meaning to the word "DOG" if it were spelled "dog"?

We're not compiling a dictionary here :)
As you well know, some things in computing are case sensitive, e.g. the Unix file system, and the same questions over "usefulness" could be raised there.
I happen to agree that case sensitivity is not generally useful in field names but it is so firmly entrenched in so many deployments that we cannot simply flick a switch to change this.

where someone actually depends on having field names of different cases is going to be exceedingly rare if at all.

Trust me, someone out there somewhere is using field names to store hashes where a change in case would be catastrophic to them. As a result, we have to go through a complex procedure of introducing opt-in case-insensitivity flags, deprecation warnings and backward compatibility code for old clusters before flipping default behaviour etc. This migration effort for us and our users is what puts this firmly in the "high-hanging fruit" category and why we are not rushing to fix right now.

@StingyJack

Are you sure you aren't compiling a dictionary somewhere? At least as a test case?

Thank you for hearing my complaints.

@olfek

olfek commented Feb 28, 2023

I need this.

Maybe you can introduce a new API that will let us do this:

{
    "my_index": {
        "mappings": {
            "dynamic": "strict",
            "properties": {
                "name": {
                    "type": "text",
                    "key_case_sensitivity": "case_insensitive"  <-- THIS
                }
            }
        }
    }
}

@olfek

olfek commented Feb 28, 2023

Then I could create documents where the name key looks like:

  • nAmE
  • nAME
  • NAME
  • ...

With the "dynamic": "strict" setting, documents with the correct keys but different casing are rejected.
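
Until something like the proposed option exists, the behaviour could be approximated on the client before indexing. The sketch below assumes a flat document and a known list of declared field names; `canonicalise_document` is a hypothetical helper, and `key_case_sensitivity` itself remains only a proposal in this thread.

```python
def canonicalise_document(document, declared_fields):
    """Rename top-level keys to the casing declared in the mapping."""
    by_folded = {name.lower(): name for name in declared_fields}
    canonical = {}
    for key, value in document.items():
        target = by_folded.get(key.lower())
        if target is None:
            # Mirrors the strict-mapping behaviour: unknown fields are rejected.
            raise KeyError(f"field {key!r} is not declared in the mapping")
        if target in canonical:
            raise ValueError(f"document supplies {target!r} more than once")
        canonical[target] = value
    return canonical
```

With this in place, a document keyed by nAmE, nAME, or NAME would be rewritten to use name before it reaches an index with a strict mapping.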

@olfek

olfek commented Feb 28, 2023

Without "dynamic": "strict", the mapping can become polluted with casing variations, none of which adhere to the manually configured mapping, potentially altering or breaking application functionality.

@olfek

olfek commented Feb 28, 2023

"key_case_sensitivity": "case_insensitive" can be made OPT-IN so it is not a breaking change.

@olfek

olfek commented Feb 28, 2023

@niemyjski thanks for creating this issue. I was confused about why Elasticsearch was behaving like this.

Pinging @elastic/es-search (Team:Search)

Can we reopen this issue to discuss an OPT-IN, non-breaking solution?

javanna added the Team:Search Foundations label and removed the Team:Search label Jul 16, 2024

7 participants