Import of SAS files into R with associated user-defined formats. Generation of a Data Definition Table (DDT) in SAS.
Data checks regarding values in data base vs. format definitions/value labels.
The goal of this SAS program is to import SAS .7bdat files into R with associated user-defined formats by generating automatically an executable .R script file that generates R data frames - step by step - with R factors, levels and labels.
For importing of .sas7bdat files two steps must be followed:
- Run SAS program sas_r.sas in SAS
- Run impsas.R (generated by the SAS program) in R or RStudio
The results are:
- A Data Definition Table (SAS Output)
- An .R script file that contains R statements/functions to import .sas7bdat files in R data frames with variable factors, levels and labels.
SAS on Windows, R or RStudio IDE
Modify the SAS program sas_r.sas
Enter manually:
- Library location where SAS .7bdat files are stored (e.g. C:\Temp\R, NOTE: do not change library name clean)
libname clean "C:\Temp\R";
- R working directory (the same directory as in 1. but with slash /. NOTE: must end with /;)
%let RSETWD=C:/Temp/R/;
- Specify the location of the SAS program with proc format statements (alternatively copy SAS format catalog to 'work.formats')
%include "C:\Temp\R\Formats01.sas";
Afterwards:
-> Run program sas_r.sas in SAS
-> Open impsas.R (generated by the SAS program) in R or RStudio IDE
-> Set encoding to UTF-8 (e.g. RStudio -> File -> Reopen with Encoding -> UTF-8) and run the script file
Following SAS output (example) will be automatically generated by the SAS program:
Member Column Column
Name Name Column Label Type Format Decode
DM STUDYID Study Identifier char $13.
DOMAIN Domain Abbreviation char $2.
USUBJID Unique Subject Identifier char $20.
SUBJID Subject Identifier for the Study char $3.
RFSTDTC Subject Reference Start Date/Time char $16.
RFENDTC Subject Reference End Date/Time char $10.
RFXSTDTC Date/Time of First Study Treatment char $16.
RFXENDTC Date/Time of Last Study Treatment char $16.
RFICDTC Date/Time of Informed Consent char $10.
RFPENDTC Date/Time of End of Participation char $10.
DTHDTC Date/Time of Death char $10.
DTHFL Subject Death Flag char $YES. Y = Yes
SITEID Study Site Identifier char $2.
INVNAM Investigator Name char $28.
AGE Age num BEST8.
AGEU Age Units char $5.
SEX Sex char $SEX. F = Female, M = Male
RACE Race char $41.
ARMCD Planned Arm Code char $20.
ARM Description of Planned Arm char $11.
TESTDTM Datetime of (Test) dtim DATETIME16.
TESTDT Date of (Test) date DATE9.
TESTTM Time of (Test) time TIME6.
a.s.o.
Following .R script file (example) will be automatically generated by the SAS program:
#### Set R working directory
setwd("C:/Temp/R/")
#### Use install.packages("Hmisc"), install.packages("haven") if necessary
library(haven)
library(Hmisc)
#### Read all .sas7bdat files from working directory. Set factor, levels, label(s).
DM <- read_sas("DM.sas7bdat")
DS <- read_sas("DS.sas7bdat")
TA <- read_sas("TA.sas7bdat")
DM$DTHFL <- factor(DM$DTHFL, c("Y"), exclude = "")
levels(DM$DTHFL) <- c("Yes")
label(DM$DTHFL) <- "Subject Death Flag"
DM$SEX <- factor(DM$SEX, c("F", "M"), exclude = "")
levels(DM$SEX) <- c("Female", "Male")
label(DM$SEX) <- "Sex"
After running this .R script file in R or RStudio -> look for warnings (there should not be any)
Note: Beside the R base functions (factor, levels) only the functions haven::read_sas, and Hmisc::label will be used for generating R data frames.
A very much shorter - but also substantially different - method is to use haven catalog_file
option to import SAS files with user-defined formats (SAS catalog file) and to use as_factor()
to generate value labels/factors.
In this case, value labels will not be assigned to factor/levels, but 1st translated into a labelled
class and then (optionally) coerced to factor/levels.
Pls. note, that different results may be obtained (e.g., in the case of character formats with missing observations in the database).
library(haven)
### Read DM.sas7bdat file with SAS formats, i.e. SAS format catalog
DM <- read_sas("DM.sas7bdat", catalog_file = "formats.sas7bcat")
### Apply as_factor() to data frame / coerce to factors
DM <- as_factor(DM)
### Example table;
table(DM$SEX)
### Display structure
str(DM$SEX)
Test platform: SAS 9.4 windows 7 64-bit, R 3.4.0
See test data in .../testsas
Only "simple" user-defined SAS formats (numeric or character) can be used, e.g.:
********** Definition of SAS Formats in library 'work';
proc format;
value $NY (notsorted)
"N"="No"
"Y"="Yes"
"U"="Unknown";
value $YES "Y"="Yes";
value $SEX "F"="Female"
"M"="Male";
value AESEV 1="MILD"
2="MODERATE"
3="SEVERE";
run;
quit;
Note that the notsorted option is very useful if the rank order should be maintained (instead of an alphabetic order), here No, Yes, Unknown
Formats like e.g. 30 - 50 = 'Between 30 and 50', or picture formats cannot be used.
Recommendation: All SAS variables should have "labels".
Consider also to assign 'formats' to all variables, e.g. $20., BEST12., 8., date9., datetime22.3, a.s.o., depending on the type of the variables.
There may be some values in the SAS database - where a user-defined format is assigned - but with no corresponding format definition/value label, e.g. when $SEX in 'proc format' is defined as "F", "M" but in the database also "U" (for Unknown) is present. In this case the value "U" will be set to 'NA' in R, w/o any warning.
Consider extension of definitions in SAS proc format, here "U"="Unknown" (highly recommended!).
On the other hand, there may be SAS formats/value labels defined, but the value in the database is missing, e.g. SEX is defined as "F", "M" and "U", but the value "U" (for Unknown) is missing in the database.
Check, if changes are necessary. Such categorization can be intended e.g. for tabulation of categories with missing values, here category Unknown with N=0.
Include or run the SAS program sasformat_chk.sas directly after sas_r.sas. This results in an output of all Critical Errors and Non-Critical Discrepancies (if any).
This SAS program allows a quick transformation of many SAS .7bdat files to R data frames with associated user-defined SAS formats (assigned to R factors with levels - in contrast to the haven read_sas(catalog_file)
method - that translates value labels into a labelled
class - and that can produce different results, too). The program provides also a Data Definition Table to see the variable properties and the decodes of the formats. Finally, a 2nd SAS program can be run to check if there are values in the database w/o corresponding value labels/format definitions (and vice versa).
Beside the R base functions the functions haven::read_sas and Hmisc::label will be used for generating R data frames.