add ability to coerce incomplete datetime info + tests #58

elbeejay · 2022-12-05T19:42:48Z

This PR is designed to close #47.

This PR does not change current default functionality

Problem Summary

The gist of the problem (as I understood it) is that some groundwater data records lack full date-time information. Sometimes the timestamp is missing, or even the month or day is not provided. The current approach to creating pandas datetime objects assumes all of that information is known, and if not, creates the equivalent of a NaN entry (NaT). This is fundamentally a conservative solution, as no datetime information is provided when it is incomplete; the user can always look at the original columns (lev_dt, lev_tm, lev_tz_cd) if they wanted to dig into the available partial data or construct their own index/ordering for the data frame. The existing implementation could be the final implementation - the burden of handling atypical or incomplete date-times could be left to the user.

### Proposed Solution
A number of people have had issues with the existing approach, however, hence the creation of issue #47 and comments within it. One potential resolution is provided in this PR. Effectively this PR proposes:

Creation of an optional key-word argument coerce_datetime that is False by default, but can be input as True

If True, incomplete dates are coerced into complete pandas datetime objects. This coercion sets any missing value to 0 - for example if no time information is available, the time assigned is 00:00:00+00:00

If False (default behavior) the function works as it currently does, and incomplete dates are set as NaT

This PR adds 3 unit tests using 3 groundwater sites with incomplete information. The test are simple, first they confirm that the default behavior results in NaT data frame index values. Next they use the optional coercion functionality, and confirm that the new output has no NaT index values.

The suggestion by @SarkenD in issue #47 to have a separate date and time column, is (somewhat) already present by default. The lev_dt column provides date information, and the lev_tm and lev_tz_cd provide time information for each measurement. The current practice of creating a combined datetime index for the data frames was not something I wanted to change, especially as it seems to make sense and work for the vast majority of cases. This optional method gives users a way to force their data to have complete datetimes. Due to the loss of timestamp integrity, this functionality should not be the default. To me the real question is: Should this optional date-time coercion even be provided as an option?

Revised Solution

Follows @thodson-usgs's proposed standardized approach below:

default behavior is to parse datetimes and use them to set the index (datetime_index=True)
if there are NaT datetime values, their count is provided as a warning and it is suggested that the parameter be switched to datetime_index=False
When datetime_index=False then the indexing is simply by integers (i.e. no datetime formatting is done)

To further standardize the different service functions, the order of parameters was made more consistent and is now - sites, start, end, multi_index, wide_format, datetime_index, **kwargs - as relevant for the individual functions.

This standardization impacted some functions, including get_qwdata, get_gwlevels, get_dv and get_iv. Consequently, this has the potential to change function behavior in two ways:

Prior function calls where argument names were not supplied and the order of arguments was used may be different from the proposed (standardized) order of function arguments
Conversely, new function arguments, for example, the addition of the argument sites to get_iv, make it so that a function call made with the revised code might not work using the existing code - get_iv(siteno, start, end) would work with proposed changes, but would not have previously worked without naming the arguments because get_iv does not currently expect the first argument to be the site numbers

Ultimately, the proposed changes did not break any of the existing unit tests. But because of the potential impacts to workflows identified above, I incremented the minor version of the package.

thodson-usgs · 2022-12-07T03:29:54Z

@elbeejay, this is a clear and well organized PR. Thank you!
I think we should follow the approach in nwis.get_qwdata and add a datetime_index=True flag as the default case, though it would be good to settle on a standard approach:

by default, parse datetimes and use them for the index
drop the columns used to create the index (perhaps we should not do this)
throw a warning if NaT's were encountered, their number, and advise trying datetime_index=False
implement a datetime_index flag that when false, the dataframe is returned with the original date columns and an integer index.

What do you think? I like this way because we don't coerce the data, but the user is still allowed to do so when needed.

elbeejay · 2022-12-07T13:14:00Z

Your suggested approach works for me, I will modify the PR to create the datetime_index parameter rather than doing any of this data coercion.

drop the columns used to create the index (perhaps we should not do this)

I'm inclined to leave the columns in place (which is the current behavior?) as I think this can help diagnose any NaT values without needing to change the datetime_index flag and re-run the function. The "cost" also seems relatively low to keeping them in place, in my view.

SarkenD · 2022-12-07T13:40:45Z

Hi @elbeejay, Thanks for jumping on this. I'm fine with your solution, provided the result is that we will be able to get all of the data publicly available.

elbeejay · 2022-12-07T17:36:41Z

Proposed solution text has been edited above to reflect latest commits.

@SarkenD the existing code and proposed changes both retain all data obtained by the API queries. The proposed changes only affect the indexing of the resulting data frame - those data which have incomplete datetime information are not (and will not be) discarded, their location (index) within the data frame right now is just a bit awkward.

thodson-usgs · 2022-12-07T18:57:03Z

Nice @elbeejay,
Right, I think the columns are kept here, which I agree should be the standard behavior; however, early in the project I dropped them. There may be some lingering instances of that behavior.

jjkennedy · 2022-12-07T19:44:03Z

I'm still a relative newbie to coding so I'm not sure if I'm completely tracking with the proposed solution. Will this solution keep the date of measurement somewhere within the retrieved data regardless if the time stamp is blank?

elbeejay · 2022-12-08T00:45:28Z

@jjkennedy to clarify, current behavior does not discard any data, but when incomplete datetime information is present those rows are assigned an index value of NaT, which puts them out of sequence and makes the resulting data frame harder to work with.

The proposed solution (now merged) preserves that same behavior as the default, except now a warning is printed telling the user how many rows have NaT index values, and suggests changing the datetime_index parameter (which is True by default) to False. If the datetime_index parameter is changed to False, then the index of the dataframe is no longer datetimes, but rather are integers and the data is ordered as it was received from the web-service. Whether the datetime_index parameter is True or False, the original measurement date/time information (which exist in separate columns) are still present, and this is consistent with the behavior before the code was modified.

add ability to coerce incomplete datetime info + tests

97b4f07

elbeejay requested a review from thodson-usgs December 5, 2022 19:43

elbeejay added 2 commits December 7, 2022 12:00

handle datetime and make args consistent across functions

d743381

minor version increment

fe78cdd

thodson-usgs merged commit 4318d07 into DOI-USGS:master Dec 7, 2022

elbeejay mentioned this pull request Dec 19, 2022

Allow user to work in timezones besides UTC #3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add ability to coerce incomplete datetime info + tests #58

add ability to coerce incomplete datetime info + tests #58

Uh oh!

elbeejay commented Dec 5, 2022 •

edited

Loading

Uh oh!

thodson-usgs commented Dec 7, 2022 •

edited

Loading

Uh oh!

elbeejay commented Dec 7, 2022

Uh oh!

SarkenD commented Dec 7, 2022

Uh oh!

elbeejay commented Dec 7, 2022

Uh oh!

thodson-usgs commented Dec 7, 2022

Uh oh!

jjkennedy commented Dec 7, 2022

Uh oh!

elbeejay commented Dec 8, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

add ability to coerce incomplete datetime info + tests #58

add ability to coerce incomplete datetime info + tests #58

Uh oh!

Conversation

elbeejay commented Dec 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem Summary

Revised Solution

Uh oh!

thodson-usgs commented Dec 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elbeejay commented Dec 7, 2022

Uh oh!

SarkenD commented Dec 7, 2022

Uh oh!

elbeejay commented Dec 7, 2022

Uh oh!

thodson-usgs commented Dec 7, 2022

Uh oh!

jjkennedy commented Dec 7, 2022

Uh oh!

elbeejay commented Dec 8, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

elbeejay commented Dec 5, 2022 •

edited

Loading

thodson-usgs commented Dec 7, 2022 •

edited

Loading