Munge GCM downloads into GLM-ready variables #266
Conversation
@jread-usgs looking for your review specifically on the
#' @description the final GCM driver data will need to be daily, but geoknife
#' returns hourly values. This step summarizes the data into daily values. It
#' creates a file with the exact same name, except that the "_raw" part of the
#' `in_file` filepath is replaced with "_daily".
Might be good to add the conversion steps to this function description
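For instance (purely a sketch of wording, not the actual roxygen in this PR), the description could spell out the steps discussed in this thread:

```r
#' @description The final GCM driver data need to be daily, but geoknife
#'   returns hourly values. This function (1) renames the GDP variables to
#'   their GLM equivalents, (2) applies any needed unit/variable conversions,
#'   and (3) summarizes the hourly values into daily values, writing a file
#'   whose name matches `in_file` with "_raw" replaced by "_daily".
```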
I focused on the `munge_to_glm` function.
) %>%

# Simply rename GDP variables into GLM variables
mutate(RelHum = qas,
`RelHum` isn't equivalent to `qas`, so conversion is needed here. To calculate `RelHum`, you need specific humidity, pressure, and air temperature (see here). See helper function here.
`ps` is the surface air pressure in hPa (see here); it would need to be part of your download and would need to be converted to mb from hPa to use that helper function above.
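Not the project's helper, but a minimal sketch of one common specific-humidity-to-relative-humidity conversion (Magnus-type saturation vapor pressure), assuming `q` is specific humidity in kg/kg, `p` is surface pressure in mb, and `airtemp` is air temperature in deg C:

```r
# Sketch only -- the linked helper function should be preferred.
sh_to_rh <- function(q, p, airtemp) {
  # vapor pressure (mb) from specific humidity and surface pressure
  e  <- q * p / (0.622 + 0.378 * q)
  # saturation vapor pressure (mb), Magnus-type approximation over water
  es <- 6.112 * exp(17.67 * airtemp / (airtemp + 243.5))
  # relative humidity (%), capped at 100
  pmin(100 * e / es, 100)
}

sh_to_rh(q = 0.010, p = 1000, airtemp = 20)  # ~68%
```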
# Convert from hourly to daily data
group_by(time, cell) %>%
# TODO: should we `na.rm = TRUE`?
summarize(across(.cols = -WindSpeed, .fns = ~ mean(.x, na.rm = FALSE)),
nice way of doing this 👍
Agreed! This is a really clean approach.
This gets a bit muddied by the precip need - "meters/day". Wouldn't this need a `sum` of the hourly rate and not a `mean`?
I think I can just follow the pattern I did for `WindSpeed`, where it has its own method defined.
Since it's a rate, in units of m/day already, rather than the depth value that some models use, I think using mean() is fine, right @jread-usgs ?
Yes, `mean` would be equivalent in this case. If it were m/hour, though, we'd need to treat it differently.
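A tiny numeric check of that point (made-up values, assuming `Rain` is a rate in m/day reported at each hourly timestep):

```r
library(dplyr)

# 24 hourly rows for one day, each reporting rain as a rate in m/day.
# Averaging the rate preserves the daily value; summing would inflate it
# by a factor of 24. If the variable were a per-hour depth (m/hour),
# sum() would be the right aggregation instead.
hourly <- tibble(date = as.Date("1980-06-01"), Rain = rep(0.012, 24))

hourly %>%
  group_by(date) %>%
  summarize(Rain_mean = mean(Rain),  # 0.012 m/day (correct daily rate)
            Rain_sum  = sum(Rain))   # 0.288 (overcounted)
```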
n = length(time),
.groups = "keep") %>%
ungroup() %>%
# This drops Jan 1, 1980
hmmm...we lose the first day of the year because it isn't a complete day? Bummer. Are we working in the appropriate timezone for that summation?
) %>%

# Create a column with just the date to use for summarizing
mutate(date = as.Date(DateTime)) %>%
I'm not running this so I can't tell quite what you are seeing here, but I am wondering if we're calculating `date` using the timezone of the local machine running this code instead of the timezone of the dataset itself. If so, that may explain why we're losing Jan 1 1980 (but funny, as I'd expect we'd also lose the leading date in each of the GCM time chunks...).
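For reference, a small base-R illustration of how the `tz` argument changes the calendar date that `as.Date()` returns for a POSIXct (the timestamps here are made up):

```r
# as.Date() on a POSIXct defaults to tz = "UTC", regardless of the timezone
# attached to the data, so late-evening local timestamps can roll over to
# the next UTC date.
x <- as.POSIXct("1980-01-01 22:00:00", tz = "America/Chicago")
as.Date(x)                          # "1980-01-02" (interpreted in UTC)
as.Date(x, tz = "America/Chicago")  # "1980-01-01"
```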
#' creates a file with the exact same name, except that the "_raw" part of the
#' `in_file` filepath is replaced with "_daily".
#' @param in_file filepath to a feather file containing the hourly geoknife data
munge_to_glm <- function(in_file) {
Might want to call this `munge_notaro_to_glm`, since this function is specific to a particular set of variables and units.
In the future, might want to pull out the generic munging components into a separate function and keep the munge prep work that is specific to a particular source in a unique function (e.g., could use the same generic for NLDAS and Notaro GCMs, but unique prep functions for both). But probably not important now.
Hmmm what would you say are the "generic munging components" here? Feels like most are Notaro specific
I see generic as: going from hourly to daily, writing the files, and perhaps some of the variable changes. Probably good to just ignore this since I agree there isn't a clear divide and NLDAS isn't handled here currently anyhow.
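If that split were ever revisited, it might look roughly like this (the function names and the feather I/O via `arrow` are hypothetical, just to illustrate the generic vs. source-specific divide):

```r
library(dplyr)

# Source-specific prep: rename/derive Notaro (GDP) variables into GLM names.
prep_notaro_vars <- function(hourly_data) {
  mutate(hourly_data, RelHum = qas)  # placeholder; real unit conversions go here
}

# Generic munging: collapse hourly values to daily and write the output file.
hourly_to_daily_glm <- function(hourly_data, out_file) {
  daily <- hourly_data %>%
    mutate(date = as.Date(DateTime, tz = "UTC")) %>%
    group_by(date, cell) %>%
    summarize(across(-DateTime, mean), .groups = "drop")
  arrow::write_feather(daily, out_file)
  out_file
}

munge_notaro_to_glm <- function(in_file) {
  out_file <- gsub("_raw", "_daily", in_file)
  arrow::read_feather(in_file) %>%
    prep_notaro_vars() %>%
    hourly_to_daily_glm(out_file)
}
```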
#' @param dim_time_input vector of GCM driver data dates
#' @param dim_cell_input vector of all GCM driver grid cells (whether or not data was pulled)
#' @param vars_info variables and descriptions to store in NetCDF
#' @param global_att global attribute description for the observations (e.g. notaro_ACCESS_1980_1999)
Looks like you may have tweaked the format of this global attribute description since providing this example.
Oh yes. I should probably just delete all of this documentation, since I anticipate the arguments passed to the function will change immensely with #252, but this runs in the pipeline, so I was keeping it for now.
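In case it's useful down the road, a minimal `ncdf4` sketch of how a description like `notaro_ACCESS_1980_1999` could be stamped onto the file as a global attribute (the file name here is made up):

```r
library(ncdf4)

# varid = 0 targets the file's global attributes rather than a variable.
nc <- nc_open("gcm_output.nc", write = TRUE)
ncatt_put(nc, varid = 0, attname = "description",
          attval = "notaro_ACCESS_1980_1999")
nc_close(nc)
```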
Commenting just to say that I am seeing all these comments and starting to work through them! Thanks for these very helpful reviews @jread-usgs and @hcorson-dosch
Okay - done with my review! I looked over the `targets` code as well as the `munge_glm()` function. Thanks for all of the detailed comments and function descriptions, Lindsay. Those really helped. I think this is looking great, and will be good to merge in once the unit conversions are finalized.
Working my way through the following list after combing through your reviews:
Pausing for now (just relative humidity and tz stuff left) because the updated downscaled, debiased GCM data might make some of this moot. See #273
Decided that merging this, while noting those two incomplete items - timezone issues & correct relative humidity conversions - is the best way to keep moving forward. Given that the new GCMs are already daily and mostly in the units we need, these two concerns likely won't be present anyway.
In addition to the munging, this adds
Putting this up here now but just want it to sit for now. Need to do the following before it can be reviewed:
- add `retry()` to GCM download code
- add some basic optimization to the pivoting code (waiting to do this later given mixed results. See GCM pipeline: scale up to MN #258 (comment))