# Working with Data Files in R

## Exercises

For these exercises, we will be using the **pain** data.

1. Find summary statistics for the `PROMIS_PHYSICAL_FUNCTION` and `PROMIS_ANXIETY` variables and examine the distributions of these patient-reported measures. What striking feature do you notice in the summary?

In [None]:
## solutions:
library(RforHDSdata)
data(pain)

summary(pain$PROMIS_PHYSICAL_FUNCTION)
summary(pain$PROMIS_ANXIETY)
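
When reading `summary()` output, it helps to know how missing values are reported. The sketch below uses a small made-up vector (the `scores` values are hypothetical, not taken from the pain data) to show that `summary()` appends an `NA's` entry when values are missing, and how to count them directly.

```r
# Toy scores (hypothetical values, not taken from the pain data)
scores <- c(35.2, 41.0, NA, 52.7, 38.4, NA, 47.1)

# summary() reports min, quartiles, median, and mean, and appends
# an "NA's" entry whenever the vector contains missing values
score_summary <- summary(scores)
print(score_summary)

# Count and proportion of missing values, computed directly
n_missing <- sum(is.na(scores))
prop_missing <- mean(is.na(scores))
```

The same two lines with `is.na()` can be applied to any column of `pain` to check how much of it is missing.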

2. Create frequency tables for `PAT_SEX` and `PAT_RACE`, and describe what they show about the distribution of demographic characteristics.


In [None]:
table(pain$PAT_SEX)
table(pain$PAT_RACE)
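
Raw counts can be easier to interpret as proportions, and `table()` silently drops missing values unless told otherwise. A minimal sketch on a made-up vector (the `sex` values are hypothetical) shows both options:

```r
# Toy demographic vector (hypothetical values, not taken from the pain data)
sex <- c("female", "male", "female", NA, "female", "male")

# Counts, with missing values shown as their own category when present
sex_counts <- table(sex, useNA = "ifany")
print(sex_counts)

# Convert counts to proportions and round for readability
sex_props <- round(prop.table(sex_counts), 2)
print(sex_props)
```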

3. Create a data frame containing the total number of patients who reported pain in each bodily pain region. Then create another data frame of summary statistics for these totals.

In [None]:
# Count the patients reporting pain in each body region (columns 2 to 75)
colsum <- data.frame(colSumValue = colSums(pain[, 2:75], na.rm = TRUE))

colsum_summary <- data.frame(Variable = c("Sum of Columns"),
                             Min =  min(colsum$colSumValue),
                             Median = median(colsum$colSumValue),
                             Mean  = mean(colsum$colSumValue),
                             Max = max(colsum$colSumValue),
                             SD = sd(colsum$colSumValue),
                             Var = var(colsum$colSumValue))

4. Calculate the median and interquartile range of the distribution of the total number of painful regions selected for each patient. Write a sentence to explain any interesting observations in the context of this dataset. 

In [None]:
# Count the painful regions selected by each patient (columns 2 to 75)
rowsum <- data.frame(RowSumValue = rowSums(pain[, 2:75], na.rm = TRUE))

median(rowsum$RowSumValue)
IQR(rowsum$RowSumValue)
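
As a reminder of what these two statistics measure, the sketch below uses a made-up vector of region counts (hypothetical values, with one deliberately extreme entry) to show that `IQR()` is the 75th percentile minus the 25th, and that the median and IQR are less affected by an extreme value than the mean and standard deviation.

```r
# Toy counts of selected regions per patient (hypothetical values)
n_regions <- c(1, 2, 2, 3, 4, 5, 7, 9, 40)

# Quartiles of the distribution
q <- quantile(n_regions, probs = c(0.25, 0.5, 0.75))
print(q)

# IQR() is the 75th percentile minus the 25th percentile
iqr_manual <- unname(q["75%"] - q["25%"])

# Compare robust summaries with the mean and standard deviation
c(median = median(n_regions), IQR = IQR(n_regions),
  mean = mean(n_regions), SD = sd(n_regions))
```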

5. Assume that 15 is a reasonable number of painful regions for a patient, and use `subset()` to create a trimmed version of the pain dataset called **pain.subset** that contains only patients with fewer than 15 painful regions in total.

In [None]:
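# Attach the per-patient region count computed in exercise 4
pain$rowsum <- rowsum$RowSumValue
# Keep patients with fewer than 15 painful regions
pain.subset <- subset(pain, rowsum < 15)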
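
The choice of comparison operator matters at the boundary. A small sketch on a made-up data frame (the `toy` data and its column names are hypothetical) shows how `<` and `<=` differ for a patient with exactly 15 regions, and that `subset()` drops rows where the condition is `NA`.

```r
# Toy data frame (hypothetical values, not taken from the pain data)
toy <- data.frame(id = 1:5, n_regions = c(3, 15, 22, 14, NA))

# Strict inequality: the row with exactly 15 regions is dropped
under_15 <- subset(toy, n_regions < 15)

# Non-strict inequality: the row with exactly 15 regions is kept
up_to_15 <- subset(toy, n_regions <= 15)

# subset() also drops rows where the condition evaluates to NA
nrow(under_15)
nrow(up_to_15)
```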

6. Examine the distribution of `PAIN_INTENSITY_AVERAGE.follow_up`. Then create a column indicating whether this follow-up variable is missing.

In [None]:
summary(pain$PAIN_INTENSITY_AVERAGE.follow_up)
hist(pain$PAIN_INTENSITY_AVERAGE.follow_up)

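# Row indices where the follow-up value is missing
which(is.na(pain$PAIN_INTENSITY_AVERAGE.follow_up))
# Spot-check a single row
is.na(pain$PAIN_INTENSITY_AVERAGE.follow_up[11749])
# Indicator column: 1 if the follow-up value is missing, 0 otherwise
pain$missing_follow_up <- ifelse(is.na(pain$PAIN_INTENSITY_AVERAGE.follow_up), 1, 0)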
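
Once a 0/1 missingness indicator exists, it can be summarized like any other variable. The sketch below builds one on a made-up data frame (the `toy` data and its values are hypothetical) using only base R, then tabulates it and computes the proportion missing.

```r
# Toy follow-up scores (hypothetical values, not taken from the pain data)
toy <- data.frame(id = 1:6, follow_up = c(4, NA, 7, NA, NA, 2))

# is.na() returns TRUE/FALSE; as.integer() converts that to 1/0
toy$missing_follow_up <- as.integer(is.na(toy$follow_up))

# Tabulate the indicator to see how many values are missing
table(toy$missing_follow_up)

# The mean of a 0/1 indicator is the proportion missing
mean(toy$missing_follow_up)
```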