add ability to coerce incomplete datetime info + tests #58
Conversation
|
@elbeejay, this is a clear and well organized PR. Thank you!
What do you think? I like this way because we don't coerce the data, but the user is still allowed to do so when needed. |
|
Your suggested approach works for me, I will modify the PR to create the
I'm inclined to leave the columns in place (which is the current behavior?) as I think this can help diagnose any |
|
Hi @elbeejay, Thanks for jumping on this. I'm fine with your solution, provided the result is that we will be able to get all of the data publicly available. |
|
Proposed solution text has been edited above to reflect latest commits. @SarkenD the existing code and proposed changes both retain all data obtained by the API queries. The proposed changes only affect the indexing of the resulting data frame. Data with incomplete datetime information are not (and will not be) discarded; their location (index) within the data frame is just a bit awkward right now. |
|
Nice @elbeejay, |
|
I'm still a relative newbie to coding so I'm not sure if I'm completely tracking with the proposed solution. Will this solution keep the date of measurement somewhere within the retrieved data regardless if the time stamp is blank? |
|
@jjkennedy to clarify, current behavior does not discard any data, but when incomplete datetime information is present those rows are assigned an index value of The proposed solution (now merged) preserves that same behavior as the default, except now a warning is printed telling the user how many rows have |
This PR is designed to close #47.
This PR does not change current default functionality
Problem Summary
The gist of the problem (as I understood it) is that some groundwater data records lack full date-time information. Sometimes the timestamp is missing, or even the month or day is not provided. The current approach to creating `pandas` datetime objects assumes all of that information is known, and if not, creates the equivalent of a `NaN` entry (`NaT`). This is fundamentally a conservative solution, as no datetime information is provided when it is incomplete; the user can always look at the original columns (`lev_dt`, `lev_tm`, `lev_tz_cd`) if they want to dig into the available partial data or construct their own index/ordering for the data frame. The existing implementation could be the final implementation - the burden of handling atypical or incomplete date-times could be left to the user.
Proposed Solution
A number of people have had issues with the existing approach, however, hence the creation of issue #47 and the comments within it. One potential resolution is provided in this PR. Effectively this PR proposes:
- An optional parameter `coerce_datetime` that is False by default, but can be input as True
- If True, incomplete date-times are coerced into `pandas` datetime objects. This coercion sets any missing value to 0 - for example if no time information is available, the time assigned is 00:00:00+00:00
- If False, incomplete date-times are left as `NaT`
This PR adds 3 unit tests using 3 groundwater sites with incomplete information. The tests are simple: first they confirm that the default behavior results in `NaT` data frame index values. Next they use the optional coercion functionality, and confirm that the new output has no `NaT` index values.
The suggestion by @SarkenD in issue #47 to have a separate date and time column is (somewhat) already present by default. The `lev_dt` column provides date information, and the `lev_tm` and `lev_tz_cd` columns provide time information for each measurement. The current practice of creating a combined datetime index for the data frames was not something I wanted to change, especially as it seems to make sense and work for the vast majority of cases. This optional method gives users a way to force their data to have complete datetimes. Due to the loss of timestamp integrity, this functionality should not be the default. To me the real question is: Should this optional date-time coercion even be provided as an option?
Revised Solution
Follows @thodson-usgs's proposed standardized approach below:
- A datetime index is created by default (`datetime_index=True`)
- If the index contains `NaT` datetime values, their count is provided as a warning and it is suggested that the parameter be switched to `datetime_index=False`
- If `datetime_index=False` then the indexing is simply by integers (i.e. no datetime formatting is done)
To further standardize the different service functions, the order of parameters was made more consistent and is now `sites, start, end, multi_index, wide_format, datetime_index, **kwargs`, as relevant for the individual functions.
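The `datetime_index` behavior described in the list above can be sketched roughly as follows. This is only an illustration and not the code in this PR: the helper name `index_by_datetime`, the `lev_va` column, the sample rows, the warning text, and the `%Y-%m-%d %H:%M` format are all assumptions, and the time zone column (`lev_tz_cd`) is ignored for brevity.

```python
import warnings

import pandas as pd


def index_by_datetime(df, datetime_index=True):
    """Hypothetical helper: index rows by datetime, or by plain integers."""
    if not datetime_index:
        # No datetime formatting at all: rows are simply numbered 0..n-1.
        return df.reset_index(drop=True)

    # Join the date and time columns; rows missing either part will not
    # match the expected format and become NaT under errors="coerce".
    combined = df["lev_dt"].astype(str) + " " + df["lev_tm"].astype(str)
    stamps = pd.to_datetime(combined, format="%Y-%m-%d %H:%M", errors="coerce")

    n_nat = int(stamps.isna().sum())
    if n_nat:
        warnings.warn(
            f"{n_nat} rows have NaT index values; "
            "consider datetime_index=False",
            UserWarning,
        )

    out = df.copy()
    out.index = pd.DatetimeIndex(stamps, name="datetime")
    return out


# Made-up records: one complete, one missing the day and time, one missing the time.
records = pd.DataFrame(
    {
        "lev_dt": ["2020-01-15", "2020-02", "2020-03-10"],
        "lev_tm": ["12:30", None, None],
        "lev_va": [1.0, 2.0, 3.0],
    }
)

integer_indexed = index_by_datetime(records, datetime_index=False)
```

In both modes every row is retained, along with the original `lev_dt` and `lev_tm` columns; only the index differs.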
This standardization impacted some functions, including `get_qwdata`, `get_gwlevels`, `get_dv` and `get_iv`. Consequently, this has the potential to change function behavior in two ways:
- Additions to the parameter list, such as `sites` to `get_iv`, make it so that a function call made with the revised code might not work using the existing code - `get_iv(siteno, start, end)` would work with the proposed changes, but would not have previously worked without naming the arguments because `get_iv` does not currently expect the first argument to be the site numbers
Ultimately, the proposed changes did not break any of the existing unit tests. But because of the potential impacts to workflows identified above, I incremented the minor version of the package.
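To make the positional-argument concern concrete, here is a small sketch using stand-in functions. Neither function is the real `get_iv`: the "old" signature, the default values, and the site number are assumptions chosen only to show why a positional call can succeed under one parameter order and fail under another.

```python
def get_iv_old(start=None, end=None, **kwargs):
    """Stand-in for the earlier style (assumed): sites passed by keyword only."""
    return {"sites": kwargs.get("sites"), "start": start, "end": end}


def get_iv_new(sites=None, start=None, end=None, multi_index=True,
               datetime_index=True, **kwargs):
    """Stand-in for the standardized order: sites, start, end, ..."""
    return {"sites": sites, "start": start, "end": end}


# Hypothetical site number and date range.
args = ("01234567", "2020-01-01", "2020-01-31")

# The positional call works with the standardized parameter order.
new_result = get_iv_new(*args)

# The same positional call fails against the older stand-in, which only
# accepts two positional parameters.
try:
    get_iv_old(*args)
    old_call_failed = False
except TypeError:
    old_call_failed = True

# Naming the arguments works with either stand-in.
keyword_result = get_iv_old(sites="01234567", start="2020-01-01", end="2020-01-31")
```

Calls that name every argument are unaffected by the reordering, which is one reason to prefer keywords in scripts that must run against several package versions.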