Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for 2 date formats: #70

Merged
merged 58 commits into from
Dec 4, 2019
Merged

Added support for 2 date formats: #70

merged 58 commits into from
Dec 4, 2019

Conversation

viglia
Copy link

@viglia viglia commented Nov 27, 2019

This PR addresses #22 by adding support for 2 date formats:

  1. epoch_millis
  2. epoch_seconds

When the mapping are fetched each date type will carry additional format information.
Ex. date[epoch_millis], date[epoch_second], etc.

This will later on be converted to pandas types according to their format: datetime64[ms], datetime64[s]

@viglia viglia added the enhancement New feature or request label Nov 27, 2019
@viglia viglia self-assigned this Nov 27, 2019
Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large number of existing tests fail with this PR - please always make sure all tests pass before submitting (via ci or manually).

No tests are supplied with this PR - please add tests.

@viglia
Copy link
Author

viglia commented Nov 27, 2019

ok. checking it, thanks!

Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in previous comment, we should try to write tests first and then develop code against these tests.

Also, can we not map the es_dtype/format to the appropriate pd.to_datetime arguments in mappings, so we can avoid this code in flatten.

eland/mappings.py Outdated Show resolved Hide resolved
eland/mappings.py Outdated Show resolved Hide resolved
eland/query_compiler.py Outdated Show resolved Hide resolved
@stevedodson
Copy link
Contributor

Another thing we may want to consider is if we should parse doc_values for datetime rather than _source. For example, how would we resolve this:

PUT my_index
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date" 
      }
    }
  }
}

PUT my_index/_doc/1
{ "date": "2015-01-01" } 

PUT my_index/_doc/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my_index/_doc/3
{ "date": 1420070400001 } 

@viglia
Copy link
Author

viglia commented Nov 27, 2019

@stevedodson another question I have is: I noticed that by default date is mapped to datetime64[ns] is there a specific reason for mapping to that specific dtype? Should we keep it that way?

* adds support for __add__ ops for string objects and literals

* adds tests for string arithmetic

* updates comment in numeric field resolution

* adds op_type parameter for numeric_ops
@stevedodson
Copy link
Contributor

@viglia - for now I'd look to implement the simplest way to supported epoch_millis as a format + test against other common formats (e.g. the ones that contains date and time).

Having the default map to datetime64[ns] is ok.

Winterflower and others added 6 commits November 27, 2019 21:03
…g and avoid repulling dependencies in requirements.txt if they don't change
…5.1 is not compatible with Python 3.8 and Pandas 0.25.2 is breaking some eland tests
The following formats are currently explicitly supported:

1. implicit: epoch_millis (no format defined)
2. implicit: strict_date_optional_time (no format defined)
3. explicit: epoch_millis
4. explicit: epoch_second

In order to keep track of the format of the different date fields, a new instance attribute `_date_fields_format` has been added to Mappings

Support for further formats can be handled by adding new cases to the `eland_date_to_pandas_date` function in the `date_utils.py` file
@stevedodson
Copy link
Contributor

stevedodson commented Dec 3, 2019

FYI - It is good to test multiple time formats in a single index alongside a single time format.

  • Happy for you to write all the tests if you want, but I'd be satisfied for now if the test above passes when a eland.DataFrame is constructed from the index and all the results are consistent with the indexed values. We can then move on to other issues.

@stevedodson
Copy link
Contributor

@blaklaybul don't worry about review at this stage - I clicked incorrect reviewer in UI.

@stevedodson
Copy link
Contributor

From slack:

The implementation of this should be very simple. In elasticsearch_date_to_pandas_datetime simply add the correct conversions for all date formats. E.g.

    if date_format == "epoch_millis":
        return pd.to_datetime(x, unit='ms')
    elif date_format == "epoch_second":
        return pd.to_datetime(x, unit='s')
    elif date_format == "basic_ordinal_date":
        return pd.to_datetime(x, format='%Y%j')
    ...

etc.
In general, vanilla pd.to_datetime will work (and we should use it where it does as it will probably be more efficient). The cases it doesn’t work use an explicit format.

@stevedodson
Copy link
Contributor

@viglia + please merge with master. Thanks!

@stevedodson
Copy link
Contributor

Francesco Vigliaturo 8:58 AM
@steve.dodson that is actually the “hack” I was thinking of 😅
I think it’s hacky because the actual dtypes differ between the eland dtype (datetime64[ns]) and the pandas one (depending on the date object/datetime64[ns, UTC]). There is a mismatch.

Steve Dodson 9:02 AM
There is no one-to-one mapping between pandas datetime dtypes and Elasticsearch dates. For instance, basic_ordinal_date doesn’t map to a pandas datetime. So whatever you do there is require translation. Assuming all eland dtypes map to datetime64[ns] or datetime64[s] is not a hacky as we have to make an informed decision about datetime conversion. (edited)
Adding this as a comment would be useful to justify the decision.

Francesco Vigliaturo and others added 3 commits December 4, 2019 13:41
Adds support to multiple date formats and a test case that tests all of this formats

Currently the test fails. Pushing this commit to facilitate debugging and share the current code
Some %G formats explicitly not supported.
@stevedodson
Copy link
Contributor

@viglia - please add support for all non 'strict_' formats.

Francesco Vigliaturo added 2 commits December 4, 2019 16:30
1. removed index creation from setup_tests and added proper setup_class and tear_down class
2. added support for non-strict dates in query_compile and in tests
@viglia
Copy link
Author

viglia commented Dec 4, 2019

Hi @stevedodson

I've addressed all the feedback left. Let me know if I missed anything.

Regarding this one:

Also, do we need to add a new member variable for date_fields_format that is separate from _mappings_capabilities

I did not merged them and I've kept date_fields_formatseparate as it seems that that info is very specific to only date field so I wasn't sure it made sense having in the matrix along with searchable, aggregatable, etc. this info

Ps. I've also run all the tests and everything run fine. I'm going to check what Jenkins complains about

Francesco Vigliaturo added 2 commits December 4, 2019 17:01
1. `epoch_millis`
2. `epoch_seconds`

When the mapping are fetched each date type will carry additional format information.
Ex. date[epoch_millis], date[epoch_second], etc.

This will later on be converted to pandas types according to their format: datetime64[ms], datetime64[s]
@stevedodson
Copy link
Contributor

Please merge into master!

Francesco Vigliaturo and others added 9 commits December 4, 2019 17:09
The following formats are currently explicitly supported:

1. implicit: epoch_millis (no format defined)
2. implicit: strict_date_optional_time (no format defined)
3. explicit: epoch_millis
4. explicit: epoch_second

In order to keep track of the format of the different date fields, a new instance attribute `_date_fields_format` has been added to Mappings

Support for further formats can be handled by adding new cases to the `eland_date_to_pandas_date` function in the `date_utils.py` file
1. All the handling of the date formatting has been moved into the 'eland_date_to_pandas_date' in date_utils.py
Also, 'eland_date_to_pandas_date' now returns the parsed date and not an anonymous function.

2. a warning has been added in cases where the user is using a date formatting that we don't support explicitly

3. a log was added to the new indices setup
Adds support to multiple date formats and a test case that tests all of this formats

Currently the test fails. Pushing this commit to facilitate debugging and share the current code
Some %G formats explicitly not supported.
1. removed index creation from setup_tests and added proper setup_class and tear_down class
2. added support for non-strict dates in query_compile and in tests
@viglia viglia merged commit 99bfea4 into elastic:master Dec 4, 2019
@stevedodson
Copy link
Contributor

@viglia - next time please wait until the PR is approved before merging.

@viglia
Copy link
Author

viglia commented Dec 9, 2019

@stevedodson sorry for the misunderstanding!

I've read your comment above "Please merge into master!" and I thought it was a green light to rebase on master and merge

@viglia viglia deleted the fix_date branch December 9, 2019 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants