Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GET: Add parameter to GET for checking if generated fields can be retrieved #6676

Closed
brwe opened this Issue Jul 2, 2014 · 9 comments

Comments

Projects
None yet
3 participants
@brwe
Copy link
Contributor

commented Jul 2, 2014

I index a text field with type token_count. When indexing, this creates an additional field that holds the number of tokens in the text field.
When a document is retrieved from the transaction log (because no flush happened yet), and I want to get the token_count of my text field, I would assume that the token_count field is simply not retrieved, because it does not exist yet. Instead I get a NumberFormatException.

Here are the steps to reproduce:

DELETE testidx

PUT testidx
{
  "settings": {
    "index.translog.disable_flush": true,
    "index.number_of_shards": 1,
    "refresh_interval": "1h"
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          },
          "type": "string"
        }
      }
    }
  }
}

PUT testidx/doc/1
{
  "text": "some text"
}

#ok, get document from translog
GET testidx/doc/1?realtime=true
#ok, get document from index but it is not there yet
GET testidx/doc/1?realtime=false
# try to get the document from translog but also field text.word_count which is not there yet: NumberFormatException
GET testidx/doc/1?fields=text.word_count&realtime=true

@clintongormley clintongormley added the bug label Jul 2, 2014

@brwe brwe added v1.2.2 labels Jul 2, 2014

@brwe brwe self-assigned this Jul 2, 2014

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 4, 2014

Here is what happens:

For multi-fields, the parent field is returned instead of null if a sub-field is requested. For the example above, when getting text.word_count, text is retrieved from the source and returned.
We could prevent this easily like this: 7f522fbd9542

However, the FastVectorHighlighter relies on this functionality to highlight on multi-fields (see here), so this is not really a solution unless we want to prevent highlighting with the FastVectorHighlighter on multi-fields.

The other option is to simply catch the NumberFormatException and handle it like here: be999b1042

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2014

@brwe what is the status of this?

@s1monw s1monw added v1.2.3 and removed v1.2.2 labels Jul 9, 2014

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2014

@s1monw Need to write more tests, did not get to it yet. Will continue Friday.

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2014

cool ok but it's going to be ready for 1.3 right?

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2014

depends on when the release is

brwe added a commit to brwe/elasticsearch that referenced this issue Jul 11, 2014

Catch the NumberFormatException and handle it in GET
If a field is a multi field and requested with a GET request then
this will return the value of the parent field in case the document
is retrieved from the transaction log.
If the type is numeric but the value a string, the numeric value will
be parsed. This caused the NumberFprmatException in case the field
is of type `token_count`.

closes elastic#6676

@s1monw s1monw added v1.4.0 and removed v1.3.0 labels Jul 14, 2014

brwe added a commit to brwe/elasticsearch that referenced this issue Jul 18, 2014

Add parameter to GET for checking if generated fields can be retrieved
Fields of type `token_count` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) and the token_count field is requested,
then GET will try to parse the number from the String and the result
is a NumberFormatException.
Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true

closes elastic#6676
@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

A field of type murmur3 actually has the same issue. In addition, if murmur3 and token_count fields are not stored, GET will also return NumberFormatException after refresh, example below.

DELETE testidx
PUT testidx
{
  "settings": {
    "index.translog.disable_flush": true,
    "index.number_of_shards": 1,
    "refresh_interval": "1h"
  },
  "mappings": {
    "doc": {
      "properties": {
        "token_count": {
          "type": "token_count",
          "analyzer": "standard"
        },
        "murmur": {
          "type": "murmur3"
        }
      }
    }
  }
}

POST testidx/doc/1
{
  "murmur": "Some value that can be hashed",
  "token_count": "A text with five words."
}

GET testidx/doc/1?routing=2&fields=murmur,token_count

POST testidx/_refresh

GET testidx/doc/1?routing=2&fields=murmur,token_count

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 18, 2014

Following the discussion on pull request #6826 I checked all field mappers and tried to figure out what they should return upon GET. I will call a field "generated" if the content is only available after indexing.
We discussed that we either throw a meaningful exception if the field is generated or ignore the field silently if a (new) parameter is set with the GET request. FieldMapper should get a new method isGenerated() which indicates weather the field will be found in the source or not.

Here is what I think we should do:

For some core field types (integer, float, string,...), the behavior (isGenerated() returns true or false) should be configurable. The reason is that a different mapper might use them and store generated data in them. The Mapper attachment plugin does that: Fields like author (string), content_type (string) etc. are only available after tika parsing.

There are currently four field types (detailed list below):

  1. Fields that should not be configurable, because they are always generated
  2. Fields that not be configurable because they are never generated
  3. Fields that should not be configurable because they are never stored
  4. Fields that should be configurable

For 1-3 we simply have to implement isGenerated() accordingly.

To make the fields configurable we could add a parameter "is_generated" to the mapping which steers the behavior.

Pro: would be easy to do and also allow different types in plugin to very easily use the feature.

Con: This would allow users to set "is_generated" accidentally - fields that are accessible via source would then still cause an exception if requested via GET while the document is not yet indexed

For fields that are not configurable, the parameter "is_generated" could be ignored without warning like so many other parameters.

List of types and their category:

There is core types, root types, geo an ip.

Core types

These should be configurable:

IntegerFieldMapper.java
ShortFieldMapper.java
BinaryFieldMapper.java      
DateFieldMapper.java        
LongFieldMapper.java        
StringFieldMapper.java
BooleanFieldMapper.java     
DoubleFieldMapper.java      

The following two should not be configurable because they are always generated:

Murmur3FieldMapper.java     
TokenCountFieldMapper.java

This should not be configurable because it is never stored:

CompletionFieldMapper.java

ip an geo

Should be configurable:

GeoPointFieldMapper.java    
GeoShapeFieldMapper.java
IpFieldMapper.java

root types

Never generated and should not be configurable:

RoutingFieldMapper.java     
TimestampFieldMapper.java
IdFieldMapper.java      
SizeFieldMapper.java        
TypeFieldMapper.java
BoostFieldMapper.java       
IndexFieldMapper.java       
SourceFieldMapper.java  
ParentFieldMapper.java      
TTLFieldMapper.java     
VersionFieldMapper.java

Always generated and should not be configurable:

AllFieldMapper.java     
FieldNamesFieldMapper.java  

The following should not be configurable, because they are never stored:

AnalyzerMapper.java     
UidFieldMapper.java
@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 19, 2014

hmpf. while writing tests I figured there are actually more cases to consider. will update soon...

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2014

There are two numeric fields that are currently generated (Murmur3FieldMapper.java and TokenCountFieldMapper.java) and two string fields (AllFieldMapper.java and FieldNamesFieldMapper.java ).

These should only be returned with GET (fields=...) if set to stored and not retuned if not stored regardless of if source is enabled or not (this was not so, see example above). If refresh has not been called between indexing and GET then this should cause an Exception unless ignore_errors_on_generated_fields=true (working title) is set with the GET request.
Until now _all and _field_names where silently ignored and getting the numeric fields caused a NumberFormatException.

I am now unsure if we should make the core types configurable. By configurable, I actually meant adding a parameter to the type mapping such as

{
   type: string,
   is_generated: true/false
   ...
}

I'll make a pull request without that and then maybe we can discuss further.

Just for completeness, below is a list of all ungenerated field types and how they behave with GET.


Fields with fixed behavior:

Never stored -> should never be returned via GET

CompletionFieldMapper

Always stored -> should always be returned via GET

ParentFieldMapper.java      
TTLFieldMapper.java   

Stored or source enabled -> always return via GET, else never return

BoostFieldMapper.java  

Stored (but independent of source) -> always return via GET, else never return

TimestampFieldMapper.java
SizeFieldMapper.java    
RoutingFieldMapper.java   

Fields that might be configurable

IntegerFieldMapper.java
ShortFieldMapper.java
BinaryFieldMapper.java      
DateFieldMapper.java        
LongFieldMapper.java        
StringFieldMapper.java
BooleanFieldMapper.java     
DoubleFieldMapper.java      
GeoPointFieldMapper.java    
GeoShapeFieldMapper.java
IpFieldMapper.java

Special fields which can never be in the "fields" list returned by GET anyway

IdFieldMapper.java       
TypeFieldMapper.java
IndexFieldMapper.java       
SourceFieldMapper.java  
VersionFieldMapper.java
AnalyzerMapper.java     
UidFieldMapper.java

brwe added a commit to brwe/elasticsearch that referenced this issue Jul 24, 2014

Add parameter to GET for checking if generated fields can be retrieved
Fields of type `token_count`, `murmur3`, `_all` and `_field_names` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) then these fields cannot be retrieved at all.
Before the behavior was so:

`_all, _field_names`: The field was siletly ignored
`murmur3, token_count`: `NumberFormatException` because GET tried to parse the values from the source.

In addition, if these fields were not stored, the same behavior occured if the fields were
retrieved with GET after a `refresh()` because here also the source was used to get the fields.

Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true
- Always ignore the field if it was not set to stored

This changes the behavior for `_all` and `_field_names` as now an Exception is thrown if a user
tries to GET them before a `refresh()`.

closes elastic#6676

brwe added a commit that referenced this issue Aug 4, 2014

Add parameter to GET for checking if generated fields can be retrieved
Fields of type `token_count`, `murmur3`, `_all` and `_field_names` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) then these fields cannot be retrieved at all.
Before the behavior was so:

`_all, _field_names`: The field was siletly ignored
`murmur3, token_count`: `NumberFormatException` because GET tried to parse the values from the source.

In addition, if these fields were not stored, the same behavior occured if the fields were
retrieved with GET after a `refresh()` because here also the source was used to get the fields.

Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true
- Always ignore the field if it was not set to stored

This changes the behavior for `_all` and `_field_names` as now an Exception is thrown if a user
tries to GET them before a `refresh()`.

closes #6676
closes #6973

@brwe brwe closed this in 5706858 Aug 4, 2014

@brwe brwe changed the title NumberFormatException when using type `token_count` and realtime get GET: Add parameter to GET for checking if generated fields can be retrieved Aug 4, 2014

brwe added a commit that referenced this issue Sep 8, 2014

Add parameter to GET for checking if generated fields can be retrieved
Fields of type `token_count`, `murmur3`, `_all` and `_field_names` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) then these fields cannot be retrieved at all.
Before the behavior was so:

`_all, _field_names`: The field was siletly ignored
`murmur3, token_count`: `NumberFormatException` because GET tried to parse the values from the source.

In addition, if these fields were not stored, the same behavior occured if the fields were
retrieved with GET after a `refresh()` because here also the source was used to get the fields.

Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true
- Always ignore the field if it was not set to stored

This changes the behavior for `_all` and `_field_names` as now an Exception is thrown if a user
tries to GET them before a `refresh()`.

closes #6676
closes #6973
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.