Package review, updates and addition of translation and cleaning functions #16

joshwlambert · 2023-03-15T11:35:26Z

This PR contains several updates to the package. They are separated into sections, organised by headings, and each change is stated along with the reasoning for it.

Package maintenance

Use up-to-date tidyverse styles (data-masking and tidy-select). This ensures the best practices of the tidyverse are implemented and that the developments of tidyverse packages (e.g. {dplyr}, {tidyr}) are being fully utilised.
Changed output to consistently use tibbles. This gives users a cleaner view of function output, especially in cases when the API returns tabular data with many columns and/or rows.
Removed function argument defaults that are recursive (i.e. equal to the name of the argument). Supplying a (potentially undefined) variable with the same name as the argument as its default will result in an error when the user does not supply this argument when calling the function. This error will most likely look like this: promise already under evaluation: recursive default argument reference or earlier problems? which is not very intuitive, especially for new R users. Removing the recursive argument defaults resolves this issue.
Removed pipe operator (%>%) from functions. This makes the code more modular and can improve debugging.
Use explicit namespaces in functions instead of importFrom in function documentation. The choice of using explicit namespacing instead of importing a package in the documentation is subjective, but the benefit of namespacing is that it makes clear which functions come from other packages and which are in {godataR}.
Linted the style of the package (using devtools::lint()). This follows the style guide set out by the tidyverse team. The use of a consistent, widely used code style makes the code easier to read and is likely to make it easier for people to contribute, given they are likely familiar with the style. One future addition could be a style check within the package that checks that future changes also conform to this style.
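
To illustrate the recursive default argument issue described above, here is a minimal sketch. The functions fetch_bad() and fetch_good() are hypothetical and are not part of {godataR}; they only show how the two styles of argument definition fail when the argument is not supplied.

```r
# Hypothetical functions for illustration only; not part of {godataR}.

# A default that refers to the argument's own name is evaluated inside the
# function, so the argument ends up referring to itself.
fetch_bad <- function(url = url) {
  paste("Connecting to", url)
}

# Without a default, a missing argument produces a clearer error message.
fetch_good <- function(url) {
  paste("Connecting to", url)
}

# Capture the error messages instead of stopping the script.
bad_msg <- tryCatch(fetch_bad(), error = function(e) conditionMessage(e))
good_msg <- tryCatch(fetch_good(), error = function(e) conditionMessage(e))

print(bad_msg)
print(good_msg)
```

Both calls fail, but the second message names the missing argument directly, which is easier for a new user to act on.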

New functionality

Added translate_categories() function (exported), as well as translate_token() and any_tokens() functions (internal). These take the data returned by the API and use the language tokens (returned by get_language_tokens()) to translate hard-to-read strings to a simpler form given by the language tokens.
Added cleaning functions that were previously stored as a cleaning script. Having the cleaning code as functions allows for better documentation, testing and distribution. It also allows the users of {godataR} to use the cleaning functions after importing data without having to access a separate script.
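
The idea behind token translation can be sketched as a simple lookup. This is not the implementation of translate_categories(); the token table layout, the example token strings and the helper translate_values() are assumptions made for illustration, and the real structure returned by get_language_tokens() may differ.

```r
# Hypothetical token table; the real output of get_language_tokens() may
# have a different structure.
tokens <- data.frame(
  token = c(
    "LNG_REFERENCE_DATA_CATEGORY_GENDER_MALE",
    "LNG_REFERENCE_DATA_CATEGORY_GENDER_FEMALE"
  ),
  translation = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each value that matches a token with its
# translation and leave non-matching values (including NA) unchanged.
translate_values <- function(x, tokens) {
  idx <- match(x, tokens$token)
  ifelse(is.na(idx), x, tokens$translation[idx])
}

gender <- c("LNG_REFERENCE_DATA_CATEGORY_GENDER_FEMALE", "Unknown", NA)
translated <- translate_values(gender, tokens)
print(translated)
```

Leaving unmatched values untouched means columns that are only partly tokenised are not damaged by the translation step.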

Testing

Testing infrastructure has been added to the package. The {testthat} unit testing framework was used as it is one of the most used testing frameworks, if not the most used, and provides lots of useful functions for testing. It also integrates nicely with {devtools}, so tests can be run with devtools::test() and devtools::check().
Tests were added for API functions (get_*() functions). These are skipped by default as they require credentials to connect to the API. However, they can be run locally if a user has credentials.
Tests were added for non-API functions to check they work as expected.
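
One way to skip API tests unless credentials are available is sketched below. This is not necessarily how the tests in this PR are skipped; the environment variable names and the helper has_godata_credentials() are assumptions made for illustration. The get_cases() and get_active_outbreak() arguments follow the calls shown elsewhere in this conversation.

```r
library(testthat)

# Hypothetical helper: TRUE only when all credential environment variables
# are set. The variable names are assumptions, not part of {godataR}.
has_godata_credentials <- function() {
  vars <- Sys.getenv(c("GODATA_URL", "GODATA_USERNAME", "GODATA_PASSWORD"))
  all(nzchar(vars))
}

test_that("get_cases() returns a tibble", {
  # Skip unless the user has provided credentials for a Go.Data instance.
  skip_if_not(has_godata_credentials(), "Go.Data credentials not available")

  url <- Sys.getenv("GODATA_URL")
  username <- Sys.getenv("GODATA_USERNAME")
  password <- Sys.getenv("GODATA_PASSWORD")

  outbreak_id <- godataR::get_active_outbreak(
    url = url,
    username = username,
    password = password
  )
  cases <- godataR::get_cases(
    url = url,
    username = username,
    password = password,
    outbreak_id = outbreak_id
  )

  expect_s3_class(cases, "tbl_df")
})
```

With this pattern the test is reported as skipped rather than failed when no credentials are set, so devtools::test() stays usable for contributors without API access.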

Please do not merge pull request until fully reviewed.


jamesfuller-cdc · 2023-03-31T14:35:03Z

Thanks, Amy. I think we are going to try to use another function, e.g., get_cases_questionnaire(), with an API call that requests the output as CSV. We think this should take care of all of the unnesting. Hopefully that works.

-James

From: AmyM, sent Tuesday, March 28, 2023 9:12 PM, Subject: Re: [WorldHealthOrganization/godataR] Package review, updates and addition of translation and cleaning functions (PR #16)

Had meant to add some thoughts to this, some great additions here but on the subject of questionnaire fields - I thought the nested fields issue had been mostly taken care of? That is, if you use the correct API they are not nested any more? Or maybe I am just thinking of the core nested fields which are now unpacked... Either way, difficult as they are to deal with I think there needs to be a level agnostic un-nesting method. I've done it a few times in different ways, more recently tried to use a dplyr friendly approach. I think I might have included that in some of the lab2godata code, so have a look there in case there's anything generalisable. When I have some more time I will put a code snippet with a suggestion here.

On Tue, 28 Mar 2023 at 09:18, James Fuller commented on this pull request, on R/clean_cases.R:

In general, I think we can remove a lot of the data cleaning steps in the current version of clean_cases. These may also apply to other cleaning functions. @sarahollis anything else to add?

Keep as-is:
1. clean date fields
2. clean field names
3. current address location

Modify:
4. clean age_years & age_months; generate numeric age variable; but no need to create the age category field
5. can we use translate categories as part of the clean_cases function?

Remove:
6. vaccine data, hospitalization/isolation/icu data
7. remove the final step that only keeps certain fields.

For Discussion:
8. What to do with questionnaire fields? should we keep them? should we use the get_cases(file_type="csv") to get them as a flat version?
9. Should we remove all nested fields? yes they cause problems for exporting to a flat file, but could be frustrating if the data are needed
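
As a rough illustration of what a level-agnostic un-nesting method could look like, here is a base R sketch that flattens a nested list of any depth into a single-level list. It is not taken from the lab2godata code or from this PR; flatten_record() and the example record are hypothetical.

```r
# Hypothetical helper: recursively flatten a nested list, joining the names
# of each level with "_". Unnamed elements are labelled by their position.
flatten_record <- function(x, prefix = NULL) {
  out <- list()
  for (i in seq_along(x)) {
    key <- names(x)[i]
    if (is.null(key) || !nzchar(key)) {
      key <- as.character(i)
    }
    name <- if (is.null(prefix)) key else paste(prefix, key, sep = "_")
    if (is.list(x[[i]])) {
      # Recurse into nested lists regardless of how deep they are.
      out <- c(out, flatten_record(x[[i]], name))
    } else {
      out[[name]] <- x[[i]]
    }
  }
  out
}

# Hypothetical record with two levels of nesting and an unnamed list.
record <- list(
  id = "case-1",
  address = list(city = "Geneva", geo = list(lat = 46.2, lng = 6.1)),
  answers = list(list(value = "yes"), list(value = "no"))
)

flat <- flatten_record(record)
print(names(flat))
```

Applying such a function to each record and binding the results by name would give one row per record, at the cost of a variable number of columns when repeated fields differ in length.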


joshwlambert · 2023-03-31T15:24:07Z

@jamesfuller-cdc @sarahollis I've added the first draft of a get_cases_questionnaire() function (in commit f05251f).

Some design considerations:

It only uses the export_downloader() as the batch_downloader() does not have a file_type argument. I assumed the default would be JSON for the batch_downloader() and therefore removed it for this function. However, if I'm wrong and the default of the batch_downloader() is "csv" please let me know and I can add it to the function.
Due to the above comment this function does not currently have a batch_size argument.
When using the export_downloader() to retrieve csv from the API the column names do not contain "questionnaireAnswers" as they do for the JSON file type. Therefore it is harder to distinguish questionnaire answer columns from others. I have subset by columns containing "FA", however, this might not be selecting all of the correct columns. Please let me know what you think. Alternatively for this function we could retrieve JSON, subset by columns containing "questionnaireAnswers" and then unnest. It would require more code but might be more certain that we are selecting the correct columns.
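
The trade-off between the two ways of selecting questionnaire columns can be sketched as follows. The column names are hypothetical, and the stricter pattern assumes questionnaire columns start with "FA" followed by a digit, which may not hold for real exports.

```r
# Hypothetical column names; real exports from the API may differ.
json_names <- c(
  "id", "firstName",
  "questionnaireAnswers.FA0_symptoms", "questionnaireAnswers.FA1_onset"
)
csv_names <- c("id", "firstName", "FA0_symptoms", "FA1_onset", "FAX number")

# JSON: the "questionnaireAnswers" prefix identifies the columns directly.
json_q <- grep("questionnaireAnswers", json_names, fixed = TRUE, value = TRUE)

# CSV: matching "FA" anywhere can also pick up unrelated columns.
csv_q_loose <- grep("FA", csv_names, fixed = TRUE, value = TRUE)

# CSV: anchoring the pattern to the start of the name is stricter
# (assumes questionnaire columns begin with "FA" and a digit).
csv_q_strict <- grep("^FA[0-9]", csv_names, value = TRUE)

print(json_q)
print(csv_q_loose)
print(csv_q_strict)
```

If the CSV naming cannot be relied on, retrieving JSON and selecting on the "questionnaireAnswers" prefix remains the less ambiguous option.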

pratikunterwegs · 2023-04-03T08:24:31Z

@pratikunterwegs

clean_cases()
I have checked parts of this function up until the stage that cases_address_history_clean is required, works fine until then - see issue with clean_case_address_history().

I cannot reproduce the issue, please provide a reprex.

It is possible commits in the interim have fixed possible issues identified in the review. If this is the case no need for a reprex, just let me know the issue is resolved.

Update: This is because your NAMESPACE is not populated correctly.

Update: After populating NAMESPACE with devtools::document(), and passing location data to clean_case_address_history(), clean_cases() works. Suggest updating documentation to pass location data as well. This might not be caught in testing as these API-requiring tests are skipped if I understand correctly.

Update: cases_from_contacts() also works after these edits, addressing the other function that didn't work.

Here's a reprex @joshwlambert -

Previously: "not sure whether some functions have been removed in the interim, because the example from clean_cases() documentation now fails because functions required for clean_cases() input are not found. If these have been replaced with other functions, suggest updating the documentation to reflect this."

Previously: "Now, see error in clean_case_address_history()"

suppressWarnings({library(godataR)})
devtools::package_info("godataR", dependencies = FALSE)
#>  package * version date (UTC) lib source
#>  godataR * 2.0.0   2023-04-03 [1] local
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

# Your Go.Data URL
url <- "https://godata-r19.who.int/"

# Your email address to log in to Go.Data
username <- getPass::getPass(msg = "Enter your Go.Data username (email address):")
#> Please enter password in TK window (Alt+Tab)

# Your password to log in to Go.Data
password <- getPass::getPass(msg = "Enter your Go.Data password:")
#> Please enter password in TK window (Alt+Tab)

# Get ID for active outbreak:
outbreak_id <- godataR::get_active_outbreak(url = url,
                                            username = username,
                                            password = password)

# get cases from current outbreak
cases <- get_cases(
  url = url,
  username = username,
  password = password,
  outbreak_id = outbreak_id
)
#> ...beginning download
#> ...download complete!

locations <- get_locations(
  url = url,
  username = username,
  password = password
)

locations_clean <- clean_locations(locations = locations)

case_address_history <- clean_case_address_history(
  cases = cases,
  locations_clean = locations_clean
)

# from example in `clean_cases()` documentation
# other cleaned data required for `clean_cases()`
cases_vacc_history_clean <- clean_case_vax_history(cases = cases)
cases_address_history_clean <- clean_case_address_history(
  cases = cases,
  locations_clean = locations_clean
)
cases_dateranges_history_clean <- clean_case_med_history(cases = cases)

cases_clean <- clean_cases(
  cases = cases,
  cases_address_history_clean = cases_address_history_clean,
  cases_vacc_history_clean = cases_vacc_history_clean,
  cases_dateranges_history_clean = cases_dateranges_history_clean
)

cases_from_contacts <- cases_from_contacts(cases_clean)

Created on 2023-04-03 by the reprex package (v2.0.1)


pratikunterwegs · 2023-04-06T16:49:08Z

Hi @joshwlambert just adding results from another look at this code; some issues from the test suite are listed below. These issues primarily relate to the accessed data not having the expected columns, either because some columns are missing, or some have been added. It may be useful to determine whether there is a data schema that could be tested against rather than hardcoding the column name and type expectations. Hope this helps!

clean_contacts_of_contacts() is broken because the function expects columns called "relationship_x_x" (where "x_x" are further names) to exist in the output of get_contacts_of_contacts() but these columns do not exist. This appears to be a pre-existing problem.
clean_events() is broken as it expects location data which are not provided in the function arguments
Test for clean_followups() fails as the expected number of rows is not returned. It may be good to check whether data are being added, and to avoid freezing these tests to expect specific numbers of rows.
Test for clean_relationships() also fails as get_relationships() returns more rows than expected, likely for the same reason as above.
Same issue with clean_teams()
Same issue with exposures_per_case(), but overall functions with this issue seem to work fine.
Tests for get_cases() fail with extra rows; a test also fails as "responsibleUserId" is missing from case data returned by get_cases(); the test for "questionnaireAnswer" in column names also fails on columns 50-55 inclusive, expected for all cols 50-357 inclusive; tests for column types also fail as some columns appear to be missing
Test for get_contacts_of_contacts() fails as column "responsibleUserId" is replaced with "responsibleUser"
Test for get_contacts() fails as expected cols are not returned, "responsibleUserId" is missing, this col appears to have been split into 3 new cols with firstName, lastName, and id as separate identifiers for responsibleUser.
Test for get_events() fails as "responsibleUserId" is missing from data
Test for get_followups() fails as columns are missing including "responsibleUserId" and some "questionnaireAnswers" cols are missing
Test for get_language_tokens() fails as there are more rows than expected
Same as above for get_reference_data()
Tests for get_relationships() fail as some column types have changed from logical to character
Test for get_teams() fails due to more rows than expected
Test for get_users() fails as new columns relating to addresses have been added which are not expected
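
The suggestion of testing against a schema rather than hardcoded expectations could be sketched like this. The schema, the column names and the helper check_schema() are hypothetical; no published Go.Data schema is assumed here. The check only requires that certain columns exist with the right type, so extra columns and extra rows do not cause failures.

```r
# Hypothetical minimal schema: required column names mapped to type checks.
schema <- list(
  id = is.character,
  age_years = is.numeric,
  deleted = is.logical
)

# Hypothetical helper: report required columns that are missing or have an
# unexpected type. Extra columns and the number of rows are ignored.
check_schema <- function(data, schema) {
  missing_cols <- setdiff(names(schema), names(data))
  present <- intersect(names(schema), names(data))
  ok <- vapply(present, function(col) schema[[col]](data[[col]]), logical(1))
  list(missing = missing_cols, wrong_type = present[!ok])
}

# Example data with a missing column, a wrong type and an unexpected extra
# column.
cases <- data.frame(
  id = c("a", "b"),
  age_years = c("30", "41"),
  extra_col = 1:2,
  stringsAsFactors = FALSE
)

result <- check_schema(cases, schema)
print(result)
```

A test could then assert that both elements of the result are empty, which would tolerate new rows or added columns while still flagging removed columns and changed types.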

joshwlambert added 30 commits January 19, 2023 10:34

bumped RoxygenNote in DESCRIPTION

20fc9b6

removed tidyverse dependency from DESCRIPTION

091f93f

added testing infrastructure and a test for get_cases to run locally (skipped by default)

b9313e0

added testthat to DESCRIPTION

a9a5949

linted style of get_cases

f727169

changed file.type argument and variables to snake case

976d73f

updated get_cases documentation

56417f4

remove imports that are not used in get_cases

70bf982

changed get_cases output from data frame to tibble

987cd45

linted get_active_outbreak

f32c6a3

added tests for get_active_outbreak (skipped by default)

fc26e4c

removed pipes from get_active_outbreak

ceb3a7f

used explicit namespace instead of import in get_active_outbreak

6f30665

added comments to get_active_outbreak

5cf6e5d

updated get_active_outbreak documentation

f0606a1

linted check_godata_version

00cca1b

updated documentation for check_godata_version

8ad29dc

added test for check_godata_version (skipped by default)

b6dc35d

removed pipes from check_godata_version

2833b13

used explicit namespace instead of import for check_godata_version

d1833a5

updated documentation for check_godata_version

fb6581f

reduce cyclomatic complexity of check_godata_version

f73a524

linted get_godata_version

f6d894f

added test for get_godata_version (skipped by default)

e1f7972

updated documentation for get_godata_version

95511f1

removed pipe from get_godata_version

fefc74e

used explicit namespace instead of import in get_godata_version

6d5a167

linted batch_downloader

25edbda

added tests for batch_downloader

d75335f

updated batch_downloader documentation

7749434

fixed get_all_outbreaks test

b8344d7

joshwlambert added 2 commits March 31, 2023 16:14

added first draft of get_cases_questionnaire

f05251f

updated get_cases_questionnaire documentation

fbaeeac

joshwlambert added 7 commits April 6, 2023 15:20

updated documentation

e6be607

used translation function in clean_case_address_history

7084b71

used translation function in clean_case_med_history

fe63123

added translation function to clean_case_vax_history

7ed2c25

used translation function in clean_cases

4e27174

used translation function in clean_contact_address_history

5bdef54

used translation function in clean_contact_vax_history

c9460d5

pratikunterwegs reviewed Apr 6, 2023


joshwlambert added 4 commits April 6, 2023 17:03

used translation function in clean_contacts_of_contacts_address_history

3316a80

used translation function in clean_contacts_of_contacts_vax_history

fe7009d

removing extra #' in clean_case_vax_history

ae4a7e5

used translation function in clean_contacts_of_contacts (function fails)

77c0665

joshwlambert added 10 commits April 14, 2023 11:48

added janitor as package dependency

3e4d502

used translation function in clean_contacts

dedc9df

added test for clean_contacts

dd93982

updated clean_events function and test

31734f8

used translation function for clean_followups and updated test

0076247

used translation function in clean_locations and updated test

2569f9f

used translation function in clean_relationships and updated test

ac142b3

updated clean_teams test

fdcb33a

used translation function in clean_users and updated test

5bb71f6

updated examples in documentation

f6daa6f

sarahollis merged commit ba1f8a8 into WorldHealthOrganization:main May 15, 2023
