Skip to content

Conversation

@elbeejay
Copy link
Contributor

@elbeejay elbeejay commented Dec 5, 2022

This PR is designed to close #47.

This PR does not change current default functionality

Problem Summary

The gist of the problem (as I understood it) is that some groundwater data records lack full date-time information. Sometimes the timestamp is missing, or even the month or day is not provided. The current approach to creating pandas datetime objects assumes all of that information is known, and if not, creates the equivalent of a NaN entry (NaT). This is fundamentally a conservative solution, as no datetime information is provided when it is incomplete; the user can always look at the original columns (lev_dt, lev_tm, lev_tz_cd) if they wanted to dig into the available partial data or construct their own index/ordering for the data frame. The existing implementation could be the final implementation - the burden of handling atypical or incomplete date-times could be left to the user.

### Proposed Solution

A number of people have had issues with the existing approach, however, hence the creation of issue #47 and comments within it. One potential resolution is provided in this PR. Effectively this PR proposes:

  1. Creation of an optional key-word argument coerce_datetime that is False by default, but can be input as True
  2. If True, incomplete dates are coerced into complete pandas datetime objects. This coercion sets any missing value to 0 - for example if no time information is available, the time assigned is 00:00:00+00:00
  3. If False (default behavior) the function works as it currently does, and incomplete dates are set as NaT

This PR adds 3 unit tests using 3 groundwater sites with incomplete information. The test are simple, first they confirm that the default behavior results in NaT data frame index values. Next they use the optional coercion functionality, and confirm that the new output has no NaT index values.

The suggestion by @SarkenD in issue #47 to have a separate date and time column, is (somewhat) already present by default. The lev_dt column provides date information, and the lev_tm and lev_tz_cd provide time information for each measurement. The current practice of creating a combined datetime index for the data frames was not something I wanted to change, especially as it seems to make sense and work for the vast majority of cases. This optional method gives users a way to force their data to have complete datetimes. Due to the loss of timestamp integrity, this functionality should not be the default. To me the real question is: Should this optional date-time coercion even be provided as an option?

Revised Solution

Follows @thodson-usgs's proposed standardized approach below:

  • default behavior is to parse datetimes and use them to set the index (datetime_index=True)
  • if there are NaT datetime values, their count is provided as a warning and it is suggested that the parameter be switched to datetime_index=False
  • When datetime_index=False then the indexing is simply by integers (i.e. no datetime formatting is done)

To further standardize the different service functions, the order of parameters was made more consistent and is now - sites, start, end, multi_index, wide_format, datetime_index, **kwargs - as relevant for the individual functions.

This standardization impacted some functions, including get_qwdata, get_gwlevels, get_dv and get_iv. Consequently, this has the potential to change function behavior in two ways:

  1. Prior function calls where argument names were not supplied and the order of arguments was used may be different from the proposed (standardized) order of function arguments
  2. Conversely, new function arguments, for example, the addition of the argument sites to get_iv, make it so that a function call made with the revised code might not work using the existing code - get_iv(siteno, start, end) would work with proposed changes, but would not have previously worked without naming the arguments because get_iv does not currently expect the first argument to be the site numbers

Ultimately, the proposed changes did not break any of the existing unit tests. But because of the potential impacts to workflows identified above, I incremented the minor version of the package.

@elbeejay elbeejay requested a review from thodson-usgs December 5, 2022 19:43
@thodson-usgs
Copy link
Collaborator

thodson-usgs commented Dec 7, 2022

@elbeejay, this is a clear and well organized PR. Thank you!
I think we should follow the approach in nwis.get_qwdata and add a datetime_index=True flag as the default case, though it would be good to settle on a standard approach:

  • by default, parse datetimes and use them for the index
  • drop the columns used to create the index (perhaps we should not do this)
  • throw a warning if NaT's were encountered, their number, and advise trying datetime_index=False
  • implement a datetime_index flag that when false, the dataframe is returned with the original date columns and an integer index.

What do you think? I like this way because we don't coerce the data, but the user is still allowed to do so when needed.

@elbeejay
Copy link
Contributor Author

elbeejay commented Dec 7, 2022

Your suggested approach works for me, I will modify the PR to create the datetime_index parameter rather than doing any of this data coercion.

drop the columns used to create the index (perhaps we should not do this)

I'm inclined to leave the columns in place (which is the current behavior?) as I think this can help diagnose any NaT values without needing to change the datetime_index flag and re-run the function. The "cost" also seems relatively low to keeping them in place, in my view.

@SarkenD
Copy link

SarkenD commented Dec 7, 2022

Hi @elbeejay, Thanks for jumping on this. I'm fine with your solution, provided the result is that we will be able to get all of the data publicly available.

@elbeejay
Copy link
Contributor Author

elbeejay commented Dec 7, 2022

Proposed solution text has been edited above to reflect latest commits.

@SarkenD the existing code and proposed changes both retain all data obtained by the API queries. The proposed changes only affect the indexing of the resulting data frame - those data which have incomplete datetime information are not (and will not be) discarded, their location (index) within the data frame right now is just a bit awkward.

@thodson-usgs thodson-usgs merged commit 4318d07 into DOI-USGS:master Dec 7, 2022
@thodson-usgs
Copy link
Collaborator

Nice @elbeejay,
Right, I think the columns are kept here, which I agree should be the standard behavior; however, early in the project I dropped them. There may be some lingering instances of that behavior.

@jjkennedy
Copy link

I'm still a relative newbie to coding so I'm not sure if I'm completely tracking with the proposed solution. Will this solution keep the date of measurement somewhere within the retrieved data regardless if the time stamp is blank?

@elbeejay
Copy link
Contributor Author

elbeejay commented Dec 8, 2022

@jjkennedy to clarify, current behavior does not discard any data, but when incomplete datetime information is present those rows are assigned an index value of NaT, which puts them out of sequence and makes the resulting data frame harder to work with.

The proposed solution (now merged) preserves that same behavior as the default, except now a warning is printed telling the user how many rows have NaT index values, and suggests changing the datetime_index parameter (which is True by default) to False. If the datetime_index parameter is changed to False, then the index of the dataframe is no longer datetimes, but rather are integers and the data is ordered as it was received from the web-service. Whether the datetime_index parameter is True or False, the original measurement date/time information (which exist in separate columns) are still present, and this is consistent with the behavior before the code was modified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datetime Records Missing if time code is not present

4 participants