Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add length check <=200 bytes #194

Merged
merged 7 commits into from
Dec 15, 2023
Merged

add length check <=200 bytes #194

merged 7 commits into from
Dec 15, 2023

Conversation

kaz462
Copy link
Collaborator

@kaz462 kaz462 commented Nov 28, 2023

Thank you for your Pull Request!

We have developed a Pull Request template to aid you and our reviewers. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the xportr codebase remains robust and consistent.

The scope of {xportr}

{xportr}'s scope is to enable R users to write out submission compliant xpt files that can be delivered to a Health Authority or to downstream validation software programs. We see labels, lengths, types, ordering and formats from a dataset specification object (SDTM and ADaM) as being our primary focus. We also see messaging and warnings to users around applying information from the specification file as a primary focus. Please make sure your Pull Request meets this scope of {xportr}. If your Pull Request moves beyond this scope, please get in touch with the {xportr} team on slack or create an issue to discuss.

Please check off each task box as an acknowledgment that you completed the task. This checklist is part of the Github Action workflows and the Pull Request will not be merged into the devel branch until you have checked off each task.

Changes Description

(descriptions of changes)

Task List

  • The spirit of xportr is met in your Pull Request
  • Place Closes #<insert_issue_number> into the beginning of your Pull Request Title (Use Edit button in top-right if you need to update)
  • Summary of changes filled out in the above Changes Description. Can be removed or left blank if changes are minor/self-explanatory.
  • Check that your Pull Request is targeting the devel branch, Pull Requests to main should use the Release Pull Request Template
  • Code is formatted according to the tidyverse style guide. Use styler package and functions to style files accordingly.
  • Updated relevant unit tests or have written new unit tests. See our Wiki for conventions used in this package.
  • Creation/updated relevant roxygen headers and examples. See our Wiki for conventions used in this package.
  • Run devtools::document() so all .Rd files in the man folder and the NAMESPACE file in the project root are updated appropriately
  • Run pkgdown::build_site() and check that all affected examples are displayed correctly and that all new/updated functions occur on the "Reference" page.
  • Update NEWS.md if the changes pertain to a user-facing function (i.e. it has an @export tag) or documentation aimed at users (rather than developers)
  • Address any updates needed for vignettes and/or templates
  • Link the issue Development Panel so that it closes after successful merging.
  • Fix merge conflicts
  • Pat yourself on the back for a job well done! Much love to your accomplishment!

Copy link

codecov bot commented Nov 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b09805e) 100.00% compared to head (6a74578) 100.00%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #194   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines          588       595    +7     
=========================================
+ Hits           588       595    +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@bms63 bms63 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!! Thanks for doing this one!!

@bms63
Copy link
Collaborator

bms63 commented Nov 29, 2023

For #189 are you going to implement that in here or in another PR? I see it tagged on the side

@kaz462
Copy link
Collaborator Author

kaz462 commented Nov 29, 2023

@bms63 thanks for the review!

For #189 are you going to implement that in here or in another PR? I see it tagged on the side

we used nchar(type = chars) for labels and nchar(type = bytes) for length as suggested by @siye6. I think users don't need to choose between nchar(type = bytes) and nchar(type = chars). @siye6 @bms63 do you agree?

from SAS help doc:

Variable names can be up to 8 characters, and they are stored in their original case (upper or lower).
Character variables can have lengths up to 200 bytes.
Variable labels can be up to 40 characters.

@cpiraux
Copy link
Collaborator

cpiraux commented Nov 29, 2023

For #189 are you going to implement that in here or in another PR? I see it tagged on the side

I see that the issue #91 is also tagged but there are other elements to consider regarding the length in this issue. We could have another PR for it or do it in this one; I am not sure of the best approach, but I prefer the issue not to be closed until the following points are handled:
There are two types of length:

  • Metadata length
  • Data length (maximum value length)

What to do in case these lengths are different:

  • Data length > metadata length: This will cause the truncation of the data. This case is handled by Haven. The length is the one from the data, and a message is given.
  • Data length < metadata length: In that case, to reduce the file size, the FDA requests trimming the variable length. In this case, we could give a message so the user can update the metadata length.
  • Metadata length is missing: For character variables, should we assign the data length or 200? In order to trim the variable, I suggest using the data length. Per define.xml specification, the --DTC variables should have the length missing in define.xml, so this scenario is probable.

Note:
In theory, the metadata length corresponds to the planned length and might be different from the data length. In practice, the metadata and the data length are usually similar.

The FDA requests trimming the variable across datasets. For example, if AVISIT data length = 40 in ADLB and AVISIT length = 30 in ADVS, the AVISIT length should be 40 for all datasets. In this case, the metadata length will be different from the data length. I don’t think we can check this scenario, but maybe we could mention this point in the documentation.

Copy link
Collaborator

@cpiraux cpiraux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation; it looks great. I've just added a small comment.
Since the 200-bytes limitation is a technical constraint, I also believe that users don't need to choose between nchar (type = bytes) and nchar (type = chars).

R/utils-xportr.R Outdated Show resolved Hide resolved
@bms63
Copy link
Collaborator

bms63 commented Dec 3, 2023

Shall we mention the byte/character issue in the documentaion for xportr_length() would that suffice?

@siye6
Copy link

siye6 commented Dec 5, 2023

@bms63 thanks for the review!

For #189 are you going to implement that in here or in another PR? I see it tagged on the side

we used nchar(type = chars) for labels and nchar(type = bytes) for length as suggested by @siye6. I think users don't need to choose between nchar(type = bytes) and nchar(type = chars). @siye6 @bms63 do you agree?

from SAS help doc:

Variable names can be up to 8 characters, and they are stored in their original case (upper or lower).
Character variables can have lengths up to 200 bytes.
Variable labels can be up to 40 characters.

Hi @kaz462 , thanks for the updates and sorry for late reply.
Regarding label, may also need to consider bytes and warning user if it is > 40, see below my test with Chinese Label

label <- "这是一段文字,用来测试在XPTversion5中作为数据集label是否会被截断"

label_len_char <- nchar(label, type = "chars")
label_len_char
label_len_bytes <- nchar(label, type = "bytes")
label_len_bytes

df <- data.frame(test = c(1, 2, 3))

haven::write_xpt(
  data = df,
  path = "test.xpt",
  version = 5,
  label = label
)

> label_len_char
[1] 40
> label_len_bytes
[1] 88

/* Use SAS convert XPT to SAS and show the label */

libname test xport "~/test.xpt";
libname sas "~";

proc copy in=test out=sas memtype=data;
	select test;
run;

proc contents data=sas.test;
run;

Label	这是一段文字,用来测试在XPTv
Encoding	utf-8 Unicode (UTF-8)

> nchar("这是一段文字,用来测试在XPTv", type="bytes")
[1] 40

@kaz462
Copy link
Collaborator Author

kaz462 commented Dec 7, 2023

Regarding label, may also need to consider bytes and warning user if it is > 40, see below my test with Chinese Label

Thanks @siye6 for providing a detailed example :) That's a good finding for haven::write_xpt - the label got truncated without any messaging, it seems haven used max 40 bytes for labels but checked against nchar(label) > 40.

For xportr, the following error message will be displayed if we use xportr_write. Do you think this message will be sufficient?

xportr_write(
    df, 
    path = "test.xpt",
    label = "这是一段文字,用来测试在XPTversion5中作为数据集label是否会被截断"
)
Error in `xportr_write()`:
! `label` cannot contain any non-ASCII, symbol or special characters.
Run `rlang::last_trace()` to see where the error occurred.

R/utils-xportr.R Outdated Show resolved Hide resolved
@siye6
Copy link

siye6 commented Dec 8, 2023

non-ASCII

Hi @kaz462 , thanks for update the warning message, regarding the non-ascii I suggest we should allow utf8 characters (including Chinese, Japanese ...etc), For eSubmission in China, one of the request is translate the foreign language data package (e.g. English) to Chinese. (variable label, dataset label, meddra, whodrug terms, primary end point related code list ..etc need to be translated from English to Chinese).

Co-authored-by: André Veríssimo <211358+averissimo@users.noreply.github.com>
@kaz462 kaz462 mentioned this pull request Dec 14, 2023
@kaz462
Copy link
Collaborator Author

kaz462 commented Dec 14, 2023

Hi @kaz462 , thanks for update the warning message, regarding the non-ascii I suggest we should allow utf8 characters (including Chinese, Japanese ...etc), For eSubmission in China, one of the request is translate the foreign language data package (e.g. English) to Chinese. (variable label, dataset label, meddra, whodrug terms, primary end point related code list ..etc need to be translated from English to Chinese).

@siye6 Thanks for the further information! I've created two issues below -

  1. Dataset label truncated after write_xpt tidyverse/haven#746
  2. allow utf8 characters  #202

R/utils-xportr.R Outdated Show resolved Hide resolved
kaz462 and others added 2 commits December 14, 2023 12:56
Co-authored-by: André Veríssimo <211358+averissimo@users.noreply.github.com>
@bms63
Copy link
Collaborator

bms63 commented Dec 15, 2023

@kaz462, @cpiraux and I met and think this is ready to merge!! yay!

@bms63 bms63 merged commit c84f815 into main Dec 15, 2023
12 of 13 checks passed
@bms63 bms63 deleted the 189_91_max_var_length branch December 15, 2023 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants