Skip to content
/ sas-r Public

Import of SAS files into R with associated formats

Notifications You must be signed in to change notification settings

clindocu/sas-r

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

sas-r

Import of SAS files into R with associated user-defined formats. Generation of a Data Definition Table (DDT) in SAS.

Data checks regarding values in data base vs. format definitions/value labels.

Overview

The goal of this SAS program is to import SAS .7bdat files into R with associated user-defined formats by generating automatically an executable .R script file that generates R data frames - step by step - with R factors, levels and labels.

For importing of .sas7bdat files two steps must be followed:

  1. Run SAS program sas_r.sas in SAS
  2. Run impsas.R (generated by the SAS program) in R or RStudio

The results are:

  1. A Data Definition Table (SAS Output)
  2. An .R script file that contains R statements/functions to import .sas7bdat files in R data frames with variable factors, levels and labels.

Platform

SAS on Windows, R or RStudio IDE

Usage

Modify the SAS program sas_r.sas

Enter manually:

  1. Library location where SAS .7bdat files are stored (e.g. C:\Temp\R, NOTE: do not change library name clean)
libname clean "C:\Temp\R";
  1. R working directory (the same directory as in 1. but with slash /. NOTE: must end with /;)
%let RSETWD=C:/Temp/R/;
  1. Specify the location of the SAS program with proc format statements (alternatively copy SAS format catalog to 'work.formats')
%include "C:\Temp\R\Formats01.sas";

Afterwards:

-> Run program sas_r.sas in SAS

-> Open impsas.R (generated by the SAS program) in R or RStudio IDE

-> Set encoding to UTF-8 (e.g. RStudio -> File -> Reopen with Encoding -> UTF-8) and run the script file

Result

Following SAS output (example) will be automatically generated by the SAS program:

Data Definition Table (SAS ouput)

Member Column                                                 Column
Name   Name        Column Label                        Type   Format      Decode

DM     STUDYID     Study Identifier                    char   $13.
       DOMAIN      Domain Abbreviation                 char   $2.
       USUBJID     Unique Subject Identifier           char   $20.
       SUBJID      Subject Identifier for the Study    char   $3.
       RFSTDTC     Subject Reference Start Date/Time   char   $16.
       RFENDTC     Subject Reference End Date/Time     char   $10.
       RFXSTDTC    Date/Time of First Study Treatment  char   $16.
       RFXENDTC    Date/Time of Last Study Treatment   char   $16.
       RFICDTC     Date/Time of Informed Consent       char   $10.
       RFPENDTC    Date/Time of End of Participation   char   $10.
       DTHDTC      Date/Time of Death                  char   $10.
       DTHFL       Subject Death Flag                  char   $YES.       Y = Yes
       SITEID      Study Site Identifier               char   $2.
       INVNAM      Investigator Name                   char   $28.
       AGE         Age                                 num    BEST8.
       AGEU        Age Units                           char   $5.
       SEX         Sex                                 char   $SEX.       F = Female, M = Male
       RACE        Race                                char   $41.
       ARMCD       Planned Arm Code                    char   $20.
       ARM         Description of Planned Arm          char   $11.
       TESTDTM     Datetime of (Test)                  dtim   DATETIME16.
       TESTDT      Date of (Test)                      date   DATE9.
       TESTTM      Time of (Test)                      time   TIME6.
          
a.s.o.

File impsas.R (automatically generated)

Following .R script file (example) will be automatically generated by the SAS program:

#### Set R working directory
setwd("C:/Temp/R/")
#### Use install.packages("Hmisc"), install.packages("haven") if necessary
library(haven)
library(Hmisc)
#### Read all .sas7bdat files from working directory. Set factor, levels, label(s).
DM <- read_sas("DM.sas7bdat")
DS <- read_sas("DS.sas7bdat")
TA <- read_sas("TA.sas7bdat")
DM$DTHFL <- factor(DM$DTHFL, c("Y"), exclude = "")
levels(DM$DTHFL) <- c("Yes")
label(DM$DTHFL) <- "Subject Death Flag"
DM$SEX <- factor(DM$SEX, c("F", "M"), exclude = "")
levels(DM$SEX) <- c("Female", "Male")
label(DM$SEX) <- "Sex"

After running this .R script file in R or RStudio -> look for warnings (there should not be any)

Note: Beside the R base functions (factor, levels) only the functions haven::read_sas, and Hmisc::label will be used for generating R data frames.

Haven catalog_file Option - Alternative Method

A very much shorter - but also substantially different - method is to use haven catalog_file option to import SAS files with user-defined formats (SAS catalog file) and to use as_factor() to generate value labels/factors.

In this case, value labels will not be assigned to factor/levels, but 1st translated into a labelled class and then (optionally) coerced to factor/levels.

Pls. note, that different results may be obtained (e.g., in the case of character formats with missing observations in the database).

library(haven)
### Read DM.sas7bdat file with SAS formats, i.e. SAS format catalog
DM <- read_sas("DM.sas7bdat", catalog_file = "formats.sas7bcat")

### Apply as_factor() to data frame / coerce to factors
DM <- as_factor(DM)

### Example table;
table(DM$SEX)

### Display structure
str(DM$SEX)

Tests

Test platform: SAS 9.4 windows 7 64-bit, R 3.4.0

See test data in .../testsas

Restrictions

SAS Formats

Only "simple" user-defined SAS formats (numeric or character) can be used, e.g.:

********** Definition of SAS Formats in library 'work';
proc format;
    value $NY (notsorted)
              "N"="No"
              "Y"="Yes"
              "U"="Unknown";
              
    value $YES "Y"="Yes";
              
    value $SEX "F"="Female"
               "M"="Male";

    value AESEV 1="MILD"
                2="MODERATE"
                3="SEVERE";
               
run;
quit;

Note that the notsorted option is very useful if the rank order should be maintained (instead of an alphabetic order), here No, Yes, Unknown

Formats like e.g. 30 - 50 = 'Between 30 and 50', or picture formats cannot be used.

Labels and Formats

Recommendation: All SAS variables should have "labels".

Consider also to assign 'formats' to all variables, e.g. $20., BEST12., 8., date9., datetime22.3, a.s.o., depending on the type of the variables.

Discrepancies between Values in Database and Formats

Critical Errors

There may be some values in the SAS database - where a user-defined format is assigned - but with no corresponding format definition/value label, e.g. when $SEX in 'proc format' is defined as "F", "M" but in the database also "U" (for Unknown) is present. In this case the value "U" will be set to 'NA' in R, w/o any warning.

Consider extension of definitions in SAS proc format, here "U"="Unknown" (highly recommended!).

Non-Critical Discrepancies

On the other hand, there may be SAS formats/value labels defined, but the value in the database is missing, e.g. SEX is defined as "F", "M" and "U", but the value "U" (for Unknown) is missing in the database.

Check, if changes are necessary. Such categorization can be intended e.g. for tabulation of categories with missing values, here category Unknown with N=0.

SAS program to check for these Discrepancies

Include or run the SAS program sasformat_chk.sas directly after sas_r.sas. This results in an output of all Critical Errors and Non-Critical Discrepancies (if any).

Conclusion

This SAS program allows a quick transformation of many SAS .7bdat files to R data frames with associated user-defined SAS formats (assigned to R factors with levels - in contrast to the haven read_sas(catalog_file) method - that translates value labels into a labelled class - and that can produce different results, too). The program provides also a Data Definition Table to see the variable properties and the decodes of the formats. Finally, a 2nd SAS program can be run to check if there are values in the database w/o corresponding value labels/format definitions (and vice versa).

Beside the R base functions the functions haven::read_sas and Hmisc::label will be used for generating R data frames.

About

Import of SAS files into R with associated formats

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published