Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package review, updates and addition of translation and cleaning functions #16

Merged
merged 203 commits into from
May 15, 2023

Conversation

joshwlambert
Copy link
Contributor

This PR contains several updates to the package. These are separated into sections, organised by headings, and the change and reasoning for the change is stated.

Package maintenance

  • Use up-to-date tidyverse styles (data-masking and tidy-select). This ensures the best practises of the tidyverse are implemented and that the developments of tidyverse packages (e.g. {dplyr}, {tidyr}) are being fully utilised.
  • change output to consistently use tibbles. This provides a cleaner output for users to view the output of functions, especially in cases when the API returns tabular data with many columns and/or rows.
  • removed function argument defaults that are recursive (i.e. equal to the name of the argument). Supplying an (potentially undefined) variable as an argument default to a function will result in an error when the user does not supply this argument when calling the function. This error will most likely look like this: promise already under evaluation: recursive default argument reference or earlier problems? which is not very intuitive especially for a new R users. By removing the recursive argument defaults this will resolve this issue.
  • removed pipe operator (%>%) from functions. This makes the code more modular and can improve debugging.
  • Use explicit namespaces in functions instead of importFrom in function documentation. The choice of using explicit namespacing instead of importing a package in the documentation is subjective, but the benefit of the namespacing is it makes clear which functions come from other packages and which are in {godataR}.
  • Linted the style of the package (using devtools::lint()). This is a style guide set out by the tidyverse team. The use of a consistent, widely used, code style makes it easier to read the code and is likely easier for people to contribute given they are likely familiar with the style. One future additon could be a style check within the package that checks future changes also conform to this style.

New functionality

  • Added translate_categories() function (exported), as well as translate_token() and any_tokens() functions (internal). These take the data returned by the API and use the language tokens (returned by get_language_tokens()) to translate hard to read strings to a simplier form given by the language tokens.
  • Added cleaning functions that previously stored as a cleaning script. Having the cleaning code as functions allows for better documentation, testing and distribution. It also allows the users of {godataR} to use the cleaning functions after importing data without have to access a separate script.

Testing

  • Testing infrastructure has been added to the package. The {testthat} unit testing framework was used as it is one of, if not the most use testing framework and provides lots of useful functions for testing. It also integrates nicely with {devtools} to run devtools::test() and devtools::check() to run tests.
  • Tests were added for API functions (get_*() functions). These are skipped by default as they require credential to connect to the the API. However, they can be run locally if a user has credientials.
  • Tests were added for non-API functions to check they work as expected.

Please do not merge pull request until fully reviewed.

@jamesfuller-cdc
Copy link
Collaborator

jamesfuller-cdc commented Mar 31, 2023 via email

@joshwlambert
Copy link
Contributor Author

@jamesfuller-cdc @sarahollis I've added the first draft of a get_cases_questionnaire() function (in commit f05251f).

Some design considerations:

  • it only uses the export_downloader() as the batch_downloader() does not have a file_type argument. I assumed the default would be JSON for the batch_downloader() and therefore removed it for this function. However, if I'm wrong and the default of the batch_downloader() is "csv" please let me know and I can add it to the function.
  • Due to the above comment this function does not currently have a batch_size argument.
  • When using the export_downloader() to retrieve csv from the API the column names do not contain "questionnaireAnswers" as they do for the JSON file type. Therefore it is harder to distinguish questionnaire answer columns from others. I have subset by columns containing "FA", however, this might not be selecting all of the correct columns. Please let me know what you think. Alternatively for this function we could retrieve JSON, subset by columns containing "questionnaireAnswers" and then unnest. It would require more code but might be more certain that we are selecting the correct columns.

@pratikunterwegs
Copy link

pratikunterwegs commented Apr 3, 2023

@pratikunterwegs

clean_cases()
I have checked parts of this function up until the stage that cases_address_history_clean is required, works fine until then - see issue with clean_case_address_history().

I cannot reproduce the issue, please provide a reprex.

It is possible commits in the interim have fixed possible issues identified in the review. If this is the case no need for a reprex, just let me know the issue is resolved.

Update: This is because your NAMESPACE is not populated correctly.

Update: After populating NAMEPSACE with devtools::document(), and passing location data to to clean_case_address_history(), clean_cases() works. Suggest updating documentation to pass location data as well. This might not be caught in testing as these API-requiring tests are skipped if I understand correctly.

Update: cases_from_contacts() also works after these edits, addressing the other function that didn't work.

Here's a reprex @joshwlambert -

Previously: "not sure whether some functions have been removed in the interim, because the example from clean_cases() documentation now fails because functions required for clean_cases() input are not found. If these have been replaced with other functions, suggest updating the documentation to reflect this."

Previously: "Now, see error in clean_case_address_history()"

suppressWarnings({library(godataR)})
devtools::package_info("godataR", dependencies = FALSE)
#>  package * version date (UTC) lib source
#>  godataR * 2.0.0   2023-04-03 [1] local
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

# Your Go.Data URL
url <- "https://godata-r19.who.int/"

# Your email address to log in to Go.Data
username <- getPass::getPass(msg = "Enter your Go.Data username (email address):")
#> Please enter password in TK window (Alt+Tab)

# Your password to log in to Go.Data
password <- getPass::getPass(msg = "Enter your Go.Data password:")
#> Please enter password in TK window (Alt+Tab)

# Get ID for active outbreak:
outbreak_id <- godataR::get_active_outbreak(url = url,
                                            username = username,
                                            password = password)

# get cases from current outbreak
cases <- get_cases(
  url = url,
  username = username,
  password = password,
  outbreak_id = outbreak_id
)
#> ...beginning download
#> ...download complete!

locations <- get_locations(
  url = url,
  username = username,
  password = password
)

locations_clean <- clean_locations(locations = locations)

case_address_history <- clean_case_address_history(
  cases = cases,
  locations_clean = locations_clean
)

# from example in `clean_cases()` documentation
# other cleaned data required for `clean_cases()`
cases_vacc_history_clean <- clean_case_vax_history(cases = cases)
cases_address_history_clean <- clean_case_address_history(
  cases = cases,
  locations_clean = locations_clean
)
cases_dateranges_history_clean <- clean_case_med_history(cases = cases)

cases_clean <- clean_cases(
  cases = cases,
  cases_address_history_clean = cases_address_history_clean,
  cases_vacc_history_clean = cases_vacc_history_clean,
  cases_dateranges_history_clean = cases_dateranges_history_clean
)

cases_from_contacts <- cases_from_contacts(cases_clean)

Created on 2023-04-03 by the reprex package (v2.0.1)

@pratikunterwegs
Copy link

Hi @joshwlambert just adding results from another look at this code; some issues are listed below from the test suite. These issue primarily relate to the accessed data not having the expected columns, either because some columns are missing, or some have been added. It may be useful to determine whether there is a data schema that could be tested against rather than hardcoding the column name and type expectations. Hope this helps!

  1. clean_contacts_of_contacts() is broken because the function expects columns called "relationship_x_x" (where "x_x" are further names) to exist in the output of get_contacts_of_contacts() but these columns do not exist. This appears to be a pre-existing problem.
  2. clean_events() is broken as it expects location data which are not provided in the function arguments
  3. Test for clean_followups() fails as the expected number of rows is not returned. May be good to check whether data are being added to and not freeze these tests to expect specific numbers of rows.
  4. Tets for clean_relationships() also fails as get_relationships() returns more rows than expected, same reason as above likely.
  5. Same issue with clean_teams()
  6. Same issue with exposures_per_case(), but overall functions with this issue seem to work fine.
  7. Tests for get_cases() fails with extra rows; test also fails as "responsibleUserId" is missing from case data returned by get_cases(); test for "questionnaireAnswer" in column names also fails on columns 50 -- 55 inclusive, expected for all cols 50 - 357 inclusive; test for column types also fail as some columns appear to be missing
  8. Test for get_contacts_of_contacts() fails as column "responsibleUserId" is replaced with "responsibleUser"
  9. Test for get_contacts() fails as expected cols are not returned, "responsibleUserId" is missing, this col appears to have been split into 3 new cols with firstName, lastName, and id as separate identifiers for responsibleUser.
  10. Test for get_events() fails as "responsibleUserId" is missing from data
  11. Test for get_followups() fails as columns are missing including "responsibleUserId" and some "questionnaireAnswers" cols are missing
  12. Test for get_language_tokens() fails as there are more rows than expected
  13. Same as above for get_reference_data()
  14. Tests for get_relationships() fails as some column types have changed from logical to character
  15. Test for get_teams() fails due to more rows than expected
  16. Test for get_users() fails as new columns relating to addresses have been added which are not expected

@sarahollis sarahollis merged commit ba1f8a8 into WorldHealthOrganization:main May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants