# STAT 450: Missing Data Notebooks: Missing at Random

## Materials: 
- Notes on missing data can be found [here](link...). 

## Learning Objectives
In this notebook, we will explore how missing data impacts our analysis by analyzing data that is MAR. Specifically we will explore the use of complete case analysis (CCA), single imputation (SI), and multiple imputation by chained equations (MICE). 


## The Data
We will explore a subset of the `penguins` dataset that is complete, and available in R from the `palmerpenguins` package. Here are the first few rows: 


In [1]:
# run this once if needed
# install.packages(c("tidyverse", "palmerpenguins", "mice", "VIM"))

In [3]:
# Set up
library(tidyverse)
library(palmerpenguins)
library(mice)
library(VIM)

expit <- function(x){exp(x)/(1 + exp(x))}

penguins <- penguins %>% drop_na() #for simulation purposes only!
head(penguins)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007


Right now the data are complete. If we wanted to estimate the association between bill length and bill depth, controlling for species and sex, we could fit the following linear regression model:

In [4]:
true_model <- lm(bill_length_mm ~ bill_depth_mm + species + sex, penguins )
summary(true_model)


Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1344 -1.2052 -0.0056  1.2543 10.9430 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       27.6224     2.6718  10.339  < 2e-16 ***
bill_depth_mm      0.5317     0.1513   3.514 0.000504 ***
speciesChinstrap   9.9709     0.3357  29.699  < 2e-16 ***
speciesGentoo     10.4890     0.5828  17.999  < 2e-16 ***
sexmale            2.8938     0.3385   8.548 4.82e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.286 on 328 degrees of freedom
Multiple R-squared:  0.8274,	Adjusted R-squared:  0.8253 
F-statistic: 393.2 on 4 and 328 DF,  p-value: < 2.2e-16


**<font color='red'>Given the complete data, we see that the association between `bill_depth_mm` and `bill_length_mm` is strong and positive. That is, for every unit increase in the bill depth (in mm), we estimate the bill length to increase by 0.532mm.  Keep this number (0.532) in mind as we go through our analyses.</font>** 



## Simulating MAR

It's not often that the data are missing with some equal probability (MCAR). **Data can be (and often are) missing in relation to some factors**. If we have **observed data** that relate to the probability of the value being missing, then the data are said to be missing at random (MAR). 

We will simulate data with 20% missingness in the `bill_depth_mm` covariate that is related to other observed covariates in the model to simulate data that is MAR.

Let's allow the probability of `bill_depth_mm` being missing depend on `body_mass_g` and `year`. The following code will simulate MAR with 20% missingness. 

In [5]:
p = 0.20 ### change to update the proportion of missing values for `bill_length_mm`

set.seed(5)

#Below is the code to solve for model parameters to simulate missingness. You don't need to understand it - we are just
#using it to simulate missingness!

x <- as.matrix(penguins[, 6]) #matrix form of dataset for covariate in MAR model (body mass and year columns)
eta <- c(-0.001) 
      
f <- function(t) {            # Define a path through parameter space
    sapply(t, function(y) mean(1 / (1 + exp(-y -x %*% eta))))
    }
      
# solve for eta0 yielding any specified proportions p
results <- sapply(p, function(p) {
    alpha <- uniroot(function(t) f(t) - p, c(-1e6, 1e6), tol = .Machine$double.eps^0.5)$root
    c(alpha, f(alpha))})
    dimnames(results) <- list(c("alpha", "f(alpha)"), p=p)
      
    eta0 <- results[1,1]
    
# Calculating probability of being missing and remove those values
penguins_MAR <- penguins %>%
    mutate(pmiss = expit(eta0 + eta[1]*body_mass_g)) %>%
    mutate(depthmissing = rbinom(nrow(penguins), 1, pmiss)) %>%
    mutate(bill_depth_mm = ifelse(depthmissing == 1, NA, bill_depth_mm))

head(penguins_MAR)

# Sanity check: ~10% missing: 
print(paste0(sum(is.na(penguins_MAR$bill_depth_mm))/nrow(penguins_MAR)*100, "% of depth values are missing"))

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,pmiss,depthmissing
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<dbl>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007,0.2478089,0
Adelie,Torgersen,39.5,17.4,186,3800,female,2007,0.2386069,0
Adelie,Torgersen,40.3,,195,3250,female,2007,0.3519834,1
Adelie,Torgersen,36.7,19.3,193,3450,female,2007,0.3078197,0
Adelie,Torgersen,39.3,20.6,190,3650,male,2007,0.2669148,0
Adelie,Torgersen,38.9,17.8,181,3625,female,2007,0.271835,0


[1] "19.5195195195195% of depth values are missing"


## Complete Case Analysis

**<font color='blue'>Question 1: Re-fit the linear regression model using the `penguins_MAR` data set, ignoring the missing data. Name the model `model_CCA`. Display the output using the code cell below.</font>** 

In [7]:
### BEGIN SOLUTION ###




### END SOLUTION ###


Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_MAR)

Residuals:
   Min     1Q Median     3Q    Max 
-7.457 -1.183  0.062  1.206 10.845 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       24.9644     3.0111   8.291 5.85e-15 ***
bill_depth_mm      0.6776     0.1701   3.984 8.78e-05 ***
speciesChinstrap  10.1298     0.3907  25.924  < 2e-16 ***
speciesGentoo     11.0165     0.6598  16.698  < 2e-16 ***
sexmale            2.7951     0.3848   7.264 4.26e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.299 on 263 degrees of freedom
  (65 observations deleted due to missingness)
Multiple R-squared:  0.8249,	Adjusted R-squared:  0.8222 
F-statistic: 309.7 on 4 and 263 DF,  p-value: < 2.2e-16


**<font color='blue'>Question 2: Examine the coefficient associatd with `bill_depth_mm`. Calculate the percent change in the coefficient. Is CCA valid? </font>** Keep in mind that in the data set with no missingness, the estimate of the coefficient associated with `bill_depth_mm` was 0.532.

In [8]:
### BEGIN SOLUTION



### END SOLUTION

[1] "The coefficient changed by 27.4436090225564%"


%%% BEGIN SOLUTION

The coefficient is overestimated by more than 20%. CCA is not valid for MAR data, which we confirm here. 

%%% END SOLUTION

## Single Mean Imputation 

Mean imputation is a commonly used approach for MAR data. To do single imputation using the mean, you can:
1. Identify all missing observations in a given column
2. Calculate the mean of the observed values in the given column
3. Replace the NAs with the mean you calculated in Step 3.
4. Repeat for all columns with missing data.

You can also use other measures like the median to perform imputation. 

You can fill in the mean manually, or use a package like `mice` with the method `mean` to impute. You can do either. 

**<font color='blue'>Question 3: Create a new dataset named `penguins_meanimputation` where mean imputation was performed to fill in the missing values of `bill_depth_mm`. Use the `penguins_MAR` data set only. Do not overwrite the original `penguins_MAR` data set.</font>**



In [9]:
### BEGIN SOLUTION 1
mean_bill_depth_mm <- [FILLTHISIN]

penguins_meanimputation <- penguins_MAR %>%
    mutate(bill_depth_mm = [FILLTHISIN])
### END SOLUTION 1 

### OR

### BEGIN SOLUTION 2
imp_mean <- mice([FILLTHISIN], method = "[FILLTHISIN]", m = 1, maxit = 1) # use mice package for mean imputation
penguins_meanimputation_mice <- complete([FILLTHISIN]) #grab the imputed dataset from imp_mean
### END SOLUTION 2 



 iter imp variable
  1   1  bill_depth_mm


“Number of logged events: 1”


**<font color='blue'>Question 4: Re-fit the linear regression model using `penguins_meanimputation`. Name the model `model_meanimputation`. Display the output and examine the coefficient associatd with `bill_depth_mm` using the code cell below. .</font>**

In [10]:
### BEGIN SOLUTION




### END SOLUTION 



Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_meanimputation)

Coefficients:
     (Intercept)     bill_depth_mm  speciesChinstrap     speciesGentoo  
         28.6537            0.4763            9.9664           10.0869  
         sexmale  
          3.0772  



Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_meanimputation_mice)

Coefficients:
     (Intercept)     bill_depth_mm  speciesChinstrap     speciesGentoo  
         28.6537            0.4763            9.9664           10.0869  
         sexmale  
          3.0772  


**<font color='blue'> Question 5: Calculate the percent change in the coefficient associated with `bill_depth_mm` and comment on the change. Are these results surprising? </font>** Keep in mind that in the data set with no missingness, the estimate of the coefficient associated with `bill_depth_mm` was 0.532.

In [11]:
### BEGIN SOLUTION





### END SOLUTION

[1] "The coefficient changed by -10.3383458646617%"


%%% BEGIN SOLUTION 

Better than CCA, but still slightly biased. Although single imputation is an unbiased and commonly used approach, it is highly variable. On average, it performs well, but on individual data sets the results may not be great. 

%%% end solution

## Single Regression Imputation: The Automated Way

Instead of imputting (filling in missing values) with the mean, we can actually fit a regression model to predict the missing values using the observed data.

There are many ways to do this - you can manually fit a linear regression model to get predictions, perform KNN regression to estimate the missing values, or use packages like `mice`  to perform the regression for you. 

In this section, we'll perform regression imputation using `mice`. There are many methods available for imputing numeric data, which you can find [here](https://www.rdocumentation.org/packages/mice/versions/3.19.0/topics/mice#:~:text=Built%2Din%20univariate%20imputation%20methods%20are%3A). 

We will begin by setting the method to be a classificaton and regression tree ("cart"), but you will later look at other methods. 

In [18]:
set.seed(6)
imp_SI_cart <- mice(penguins_MAR, method = "cart", m = 1, printFlag = F) #m = 1 indicates single imputation 
penguins_SI_cart <- complete(imp_SI_cart) #extract the imputed data set

“Number of logged events: 5”


Now, we will re-fit the linear regression model using `penguins_regimputation`.

In [22]:
model_SI_cart <-lm(bill_length_mm ~ bill_depth_mm + species + sex, penguins_SI_cart) #fit model on imputed data set
summary(model_SI_cart) #view output


Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_SI_cart)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1821 -1.3305  0.0852  1.2276 10.9333 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       26.7665     2.6527  10.090  < 2e-16 ***
bill_depth_mm      0.5806     0.1503   3.863 0.000135 ***
speciesChinstrap   9.9648     0.3345  29.791  < 2e-16 ***
speciesGentoo     10.6673     0.5823  18.320  < 2e-16 ***
sexmale            2.7641     0.3467   7.972 2.61e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.277 on 328 degrees of freedom
Multiple R-squared:  0.8287,	Adjusted R-squared:  0.8266 
F-statistic: 396.8 on 4 and 328 DF,  p-value: < 2.2e-16


Now, let's calculate the percent change in the coefficient associated with `bill_depth_mm`. Keep in mind that in the data set with no missingness, the estimate of the coefficient associated with `bill_depth_mm` was 0.532.

In [23]:

print(paste0("The coefficient changed by ", (0.581 - 0.532)/0.532*100, "%"))


[1] "The coefficient changed by 9.21052631578946%"


**<font color='blue'> Question 6: Now it's your turn. Repeat the above process using a different imputation method. Name the imputed dataset `penguins_SI_[methodname]` and the model `model_SI_[methodname]`. Do the results differ? Are they better or worse? </font>** 

## Single Regression Imputation: The Manual Way

Suppose the researchers told you that the **heavier penguins were more fiesty, and made it difficult to obtain the `bill_depth_mm ` measurements** (this is what we simulated in the data - the missingness was related to body mass!) Using this domain knowledge, we should be able to build a better model for single imputation than what the `mice` package, which used all available covariates.

**<font color='blue'> Question 7: Create a new data set called `penguins_regimpute_manual` that has the missing values imputed using a predictions from a regression model that you created. This regression model should use `body_mass_g` to predict `bill_depth_mm`. </font>** 

In [24]:
### BEGIN SOLUTION
missingnessmodel <- lm([FILLTHISIN] ~ [FILLTHISIN], data = [FILLTHISIN])


penguins_regimpute_manual <- penguins_MAR %>%
    mutate(predictedvals = predict(missingnessmodel, newdata = [FILLTHISIN])) %>%
    mutate(bill_depth_mm = ifelse(is.na(bill_depth_mm), [FILLTHISIN], [FILLTHISIN]))

head(penguins_regimpute_manual)
### END SOLUTION 

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,pmiss,depthmissing,predictedvals
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>,<dbl>,<int>,<dbl>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007,0.2478089,0,17.74444
Adelie,Torgersen,39.5,17.4,186,3800,female,2007,0.2386069,0,17.68499
Adelie,Torgersen,40.3,18.33896,195,3250,female,2007,0.3519834,1,18.33896
Adelie,Torgersen,36.7,19.3,193,3450,female,2007,0.3078197,0,18.10116
Adelie,Torgersen,39.3,20.6,190,3650,male,2007,0.2669148,0,17.86335
Adelie,Torgersen,38.9,17.8,181,3625,female,2007,0.271835,0,17.89307


**<font color='blue'>Question 8: Re-fit the linear regression model using `penguins_regimpute_manual`. Name the model `model_regimputation_manual`. Display the output and examine the coefficient associatd with `bill_depth_mm` using the code cell below. .</font>**

In [26]:
### BEGIN SOLUTION



### END SOLUTION


Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_regimpute_manual)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1627 -1.3323 -0.0674  1.2248 11.0310 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       28.2430     2.4196  11.673  < 2e-16 ***
bill_depth_mm      0.4922     0.1358   3.626 0.000334 ***
speciesChinstrap   9.9651     0.3354  29.712  < 2e-16 ***
speciesGentoo     10.2625     0.5156  19.905  < 2e-16 ***
sexmale            3.1346     0.2940  10.664  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.283 on 328 degrees of freedom
Multiple R-squared:  0.8278,	Adjusted R-squared:  0.8257 
F-statistic: 394.3 on 4 and 328 DF,  p-value: < 2.2e-16


**<font color='blue'> Question 9: Calculate the percent change in the coefficient associated with `bill_depth_mm` and comment on the change. Are these results surprising? </font>** Keep in mind that in the data set with no missingness, the estimate of the coefficient associated with `bill_depth_mm` was 0.532.

In [27]:
### BEGIN SOLUTION



### END SOLUTION

[1] "The coefficient changed by -7.4812030075188%"


%%%

We've reduced our error a lot! But we can still do better. 

%%%

## KNN Single Imputation

KNN is a non-parametric method that also may be useful for MAR data. 

Let's see if it's helpful here:




In [28]:
# Impute missing values for all variables using default k=5 neighbors
penguins_KNNimpute <- kNN(penguins_MAR, variable = c("bill_depth_mm"), k = 3, ) # need to choose value of K

model_KNN <- lm(bill_length_mm ~ bill_depth_mm + species + sex, data = penguins_KNNimpute)
summary(model_KNN)

   bill_length_mm flipper_length_mm       body_mass_g              year 
     3.210000e+01      1.720000e+02      2.700000e+03      2.007000e+03 
            pmiss      depthmissing    bill_length_mm flipper_length_mm 
     2.507884e-02      0.000000e+00      5.960000e+01      2.310000e+02 
      body_mass_g              year             pmiss      depthmissing 
     6.300000e+03      2.009000e+03      4.849215e-01      1.000000e+00 



Call:
lm(formula = bill_length_mm ~ bill_depth_mm + species + sex, 
    data = penguins_KNNimpute)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4242 -1.2630 -0.0014  1.2852 10.9755 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       24.7737     2.7959   8.861  < 2e-16 ***
bill_depth_mm      0.6922     0.1581   4.379 1.61e-05 ***
speciesChinstrap   9.9301     0.3327  29.844  < 2e-16 ***
speciesGentoo     11.0585     0.6073  18.209  < 2e-16 ***
sexmale            2.6456     0.3447   7.674 1.92e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.263 on 328 degrees of freedom
Multiple R-squared:  0.8308,	Adjusted R-squared:  0.8288 
F-statistic: 402.7 on 4 and 328 DF,  p-value: < 2.2e-16


In [29]:
### BEGIN SOLUTION
print(paste0("The coefficient changed by ", (0.6922 - 0.532)/0.532*100, "%"))
### END SOLUTION

[1] "The coefficient changed by 30.1127819548872%"


**<font color='blue'> Question 10: Try fitting a KNN model with another value of K. Calculate the percent change in the coefficient associated with `bill_depth_mm` and comment on the change. </font>** Keep in mind that in the data set with no missingness, the estimate of the coefficient associated with `bill_depth_mm` was 0.532.

## Multiple Imputation via MICE

Instead of imputing a value once, we can actually perform imputation many times and fit the model on these individual data sets. Then, we can pool the estimate together. Multiple imputation produces estimates that are less variable than single imputation methods. 

Of course, you can do multiple imputation with various methods, but here we will just use regression imputation for the sake of time. 

Remember that single regression imputation from the `mice` package that didn't perform great? We can actually tweak the code to do multiple imputation quite easily. Let's see if it changes the results at all. 

Set `m` (the number of imputations) to be 20 for now. I've set `printFlag = F` so that the messages for each iteration are suppressed, and `maxiter` (the maximum number of iterations) to be large as this data set is relatively small. 


In [30]:
#perform multiple imputation 
set.seed(11)
imp_MI_cart <- mice(penguins_MAR, method = "cart", m = 15, maxiter = 100, printFlag = F) #m = 10 indicates 10 imputations

# fit the outcome model on each data set
model_MI_cart <- with(imp_MI_cart, lm(bill_length_mm ~ bill_depth_mm + species + sex))

# pool the estimates
pooled_MI_est <- pool(model_MI_cart)
summary(pooled_MI_est)

“Number of logged events: 75”


term,estimate,std.error,statistic,df,p.value
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),27.0720139,2.8850271,9.383626,176.6451,3.1501040000000004e-17
bill_depth_mm,0.5609639,0.1628466,3.444737,176.1676,0.0007144439
speciesChinstrap,9.9727849,0.3357426,29.703658,323.8336,1.766144e-94
speciesGentoo,10.6310072,0.6278036,16.933651,208.0835,5.20822e-41
sexmale,2.8312815,0.3523158,8.036203,268.303,2.960296e-14


In [31]:

print(paste0("The coefficient changed by ", (0.561 - 0.532)/0.532*100, "%"))


[1] "The coefficient changed by 5.45112781954888%"


Better than with only one imputation!

**<font color='blue'>Question 11: Repeat the above steps, but this time using the method you chose in Question 6. Name your objects appropriately. </font>**