<center><h1>Missing Data Methods & Mechanisms</h1></center>
<center><h3>Ellen Duong</h3></center>
<center><h3>August Guang</h3></center>
<center><h3>Paul Stey</h3></center>

# 1. Missing Data Methods

  * Entire field of research in statistics
  * Two canonical textbooks
    - _Statistical Analysis with Missing Data_, Little \& Rubin
    - _Applied Missing Data Analysis_, Enders

## 1.1 Missing Data

In general, missing data refers to any instance in which we have a variable for which one or more of our observations is not present.

<br>
<br>
<center>¯\_(ツ)_/¯</center>

### 1.1.1 Why are data missing?
    
Missing data may arise for any number of reasons. For example,
 1. Patient left our clinical trial early
 2. Survey respondent failed to complete all items on the questionnaire
 3. Hard drive failure on server storing data
 4. Respondent declined to answer question

## 1.2 Mechanisms of Missingness

There are three recognized forms of missingness, often called _missingness mechanisms_:


* Missing completely at random (MCAR)
* Missing at random (MAR)
* Missing not at random (MNAR)

### 1.2.2 Missing Completely At Random (MCAR)

The designation "missing completely at random" is used when the probability of missing data on a variable, $Y$, is _not_ related to other measured variables, nor to $Y$ itself.

Examples:
- Weather Data: Temperature recording equipment fails at random, with no relation to the temperature or any other factor.
- Lab Measurements: Some instruments fail to record data, and the failure is unrelated to the values being measured.
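
A small simulation can make the MCAR definition concrete. The sketch below is illustrative only: the temperature values, the 20% failure rate, and all variable names are assumptions, not data from this workshop. Because every reading has the same chance of being lost, the mean of the surviving readings stays close to the mean of the full data.

```r
set.seed(42)

n <- 10000
# Hypothetical temperature readings (degrees C)
temp <- rnorm(n, mean = 20, sd = 5)

# MCAR: every reading has the same 20% chance of being lost,
# regardless of its own value or of anything else
is_missing <- runif(n) < 0.2
temp_mcar <- ifelse(is_missing, NA, temp)

full_mean <- mean(temp)
obs_mean  <- mean(temp_mcar, na.rm = TRUE)

# The two means should differ only by sampling noise
print(c(full = full_mean, observed = obs_mean))
```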

### 1.2.3 Missing at Random (MAR)

Data are said to be "missing at random" when the probability of missing data on a variable, say $Y$, is related to some other measured variable(s) in the model, but not $Y$ itself.

Examples:
- Educational Testing: Students are more likely to skip certain types of questions on a standardized test when they are less familiar with the topic being tested, and this skipping behavior can be predicted from their performance on other parts of the test.
- Income Reporting Survey: Individuals with high incomes are less likely to disclose their exact income, and this likelihood can be predicted from other observed factors such as education level, employment status, or age.
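
The income example can be sketched as a simulation. Everything here is an assumption made for illustration: the variable names, the income model, and the logistic form of the missingness probability. Missingness depends only on education (fully observed), so the naive mean of reported income is biased, but within a single education level the observed and full means agree up to sampling noise.

```r
set.seed(42)

n <- 10000
# Hypothetical fully observed covariate: years of education
educ <- sample(10:20, n, replace = TRUE)
# Hypothetical income that rises with education, plus noise
income <- 20000 + 4000 * educ + rnorm(n, sd = 10000)

# MAR: the chance that income is missing depends only on education
p_missing <- plogis(-6 + 0.4 * educ)
is_missing <- runif(n) < p_missing
income_mar <- ifelse(is_missing, NA, income)

# Naive observed mean is too low, because highly educated
# (and therefore higher-income) respondents are underrepresented
naive_gap <- mean(income_mar, na.rm = TRUE) - mean(income)

# Within one education level, observed and full means nearly agree
at16 <- educ == 16
within_gap <- mean(income_mar[at16], na.rm = TRUE) - mean(income[at16])

print(round(c(naive_gap = naive_gap, within_gap = within_gap)))
```

This is why methods that condition on the observed predictors can work under MAR, while simply dropping the missing cases generally does not.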

### 1.2.4 Missing Not at Random (MNAR)

Data are described as "missing not at random" when the probability of missing data on a variable, $Y$, is related to the values of $Y$ itself, even after accounting for other measured variables.

Examples:
- Patient Follow-up in a Clinical Trial: If patients who experience severe side effects are more likely to drop out of the study, then the missing outcome data are not missing at random.
- Social Surveys on Illegal Activities: If those who are actively involved in illegal behaviors are less likely to participate in the survey, and this decision is related to their involvement in illegal activities (an unobserved variable), then the missing data on illegal behaviors are not missing at random.
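
The clinical trial example can be sketched in the same way. The severity score, the dropout model, and the variable names below are assumptions chosen for illustration. Dropout depends on the value that goes missing, so the observed mean understates severity, and no other variable in the data can be used to correct for it.

```r
set.seed(42)

n <- 10000
# Hypothetical side-effect severity score (higher = worse)
severity <- rnorm(n, mean = 50, sd = 10)

# MNAR: the chance of dropping out rises with severity itself
p_dropout <- plogis((severity - 55) / 5)
dropped <- runif(n) < p_dropout
severity_mnar <- ifelse(dropped, NA, severity)

# The patients who remain look healthier than the full cohort
gap <- mean(severity_mnar, na.rm = TRUE) - mean(severity)
print(c(full = mean(severity),
        observed = mean(severity_mnar, na.rm = TRUE),
        gap = gap))
```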

# 2. Methods of Addressing Missingness


There are many approaches to dealing with missingness. They differ substantially in their properties and in when they can be used (if at all).

## 2.1 Listwise Deletion

* Delete all data from observations with missing values
* Not a good idea: it discards information and can bias results unless the data are MCAR
* But very commonly done, nonetheless

### 2.1.1 Listwise Deletion Example

In [1]:
arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")

dim(arrests_df)

In [2]:
is_complete_obs <- complete.cases(arrests_df)

arrests_comp <- arrests_df[is_complete_obs, ]

dim(arrests_comp)
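
The same steps can be tried on a small data frame that does not depend on the arrests file. The toy data below are made up for illustration. Each column is missing only one or two values, yet listwise deletion removes four of the six rows, because a row is dropped if any of its variables is missing.

```r
# Hypothetical toy data: each variable has a few missing values
toy_df <- data.frame(
    age    = c(25, NA, 31, 47, 52, 38),
    income = c(50000, 62000, NA, 81000, 75000, NA),
    city   = c("Providence", "Boston", "Providence", NA, "Boston", "Providence")
)

# Fraction of missing values in each column
print(colMeans(is.na(toy_df)))

# Listwise deletion: na.omit() keeps only rows with no NA in any column
toy_comp <- na.omit(toy_df)

print(nrow(toy_df))
print(nrow(toy_comp))
```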

## 2.2 Single Imputation

* Examples: mean imputation, regression imputation, simple random sample ("hot-deck") imputation
* Generally better than deletion, but can still introduce bias
* Some single imputation methods (e.g., mean imputation) reduce the variance of the imputed variable
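
Two of the methods listed above can be sketched on simulated data. The data-generating model and all names here are assumptions for illustration. Mean imputation shrinks the spread of the imputed variable, and regression imputation places every imputed point exactly on the fitted line, which inflates the correlation between the two variables.

```r
set.seed(1)

n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Remove 30% of y completely at random
y_miss <- y
y_miss[sample(n, 60)] <- NA
is_miss <- is.na(y_miss)

# Mean imputation: every missing y gets the observed mean
y_mean <- y_miss
y_mean[is_miss] <- mean(y_miss, na.rm = TRUE)

# Regression imputation: every missing y gets its fitted value from y ~ x
fit <- lm(y_miss ~ x)   # lm() drops incomplete rows by default
y_reg <- y_miss
y_reg[is_miss] <- predict(fit, newdata = data.frame(x = x[is_miss]))

# Spread of y: mean imputation shrinks it
sd_obs  <- sd(y_miss, na.rm = TRUE)
sd_mean <- sd(y_mean)

# Correlation with x: regression imputation inflates it
cor_obs <- cor(x[!is_miss], y_miss[!is_miss])
cor_reg <- cor(x, y_reg)

print(round(c(sd_obs = sd_obs, sd_mean = sd_mean,
              cor_obs = cor_obs, cor_reg = cor_reg), 3))
```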


<img src="images/mean_impute.png"></img>

<img src="images/regression_impute.png"></img>


### 2.2.1 Simple Random Sample ("hot-deck") Imputation

* Replaces each missing value with a random draw from the observed values
* Preserves the variable's marginal distribution

In [3]:
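hotdeck <- function(v) {
    # Pool of observed (non-missing) values to draw donors from
    obs_values <- v[!is.na(v)]
    n <- length(v)
    
    for (i in seq_len(n)) {
        if (is.na(v[i])) {
            # Draw by index: sample(x, 1) would sample from 1:x
            # if x happened to be a single number
            v[i] <- obs_values[sample.int(length(obs_values), 1)]
        }
    }
    return(v)
}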

### 2.2.2 Running "Hot-Deck" Example

In [7]:
animals <- c("cat", "dog", "cat", NA, NA, "dog", "bird", NA, "dog", NA)

table(animals)

animals
bird  cat  dog 
   1    2    3 

In [8]:
animals_comp <- hotdeck(animals)
print(animals_comp)

table(animals_comp)

 [1] "cat"  "dog"  "cat"  "cat"  "cat"  "dog"  "bird" "dog"  "dog"  "dog" 


animals_comp
bird  cat  dog 
   1    4    5 
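
With only six observed values, the imputed proportions can drift noticeably from the observed ones. On a larger sample the marginal distribution is preserved much more closely. The sketch below uses a vectorized variant, `hotdeck_vec`, which is a hypothetical helper written for this illustration, and simulated pet data that are also an assumption.

```r
set.seed(2024)

# Hypothetical vectorized hot-deck: draw all donors in one call
hotdeck_vec <- function(v) {
    is_miss <- is.na(v)
    donors <- v[!is_miss]
    v[is_miss] <- donors[sample.int(length(donors), sum(is_miss), replace = TRUE)]
    v
}

# Simulated data: 5000 pets, 30% of labels removed completely at random
pets <- sample(c("bird", "cat", "dog"), 5000, replace = TRUE,
               prob = c(0.2, 0.3, 0.5))
pets_miss <- pets
pets_miss[sample(5000, 1500)] <- NA

pets_comp <- hotdeck_vec(pets_miss)

# Proportions before and after imputation (table() ignores NA by default)
prop_obs  <- prop.table(table(pets_miss))
prop_comp <- prop.table(table(pets_comp))
print(round(rbind(observed = prop_obs, imputed = prop_comp), 3))
```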

## 2.3 Multiple Imputation

* Current "gold-standard" in missing data methods
* Handles missing data by creating multiple plausible imputed datasets. Each imputed dataset is analyzed separately, and the results are then combined to provide more robust, reliable estimates.
* Preserves variance-covariance matrix of data set
* Implemented in R in the _mice_ and _Amelia_ packages

In [12]:
# Load the Amelia package (install it first if needed)
library(Amelia)

# Create a sample dataset with missing values
set.seed(123)
data <- data.frame(
  ID = 1:10,
  Age = c(25, 30, NA, 22, 35, NA, 28, 40, NA, 32),
  Income = c(50000, NA, 60000, 75000, NA, 80000, 90000, NA, 55000, 70000),
  Education = c("High School", "College", "College", "High School", "Graduate", "College", "Graduate", "High School", "College", "Graduate")
)

# Convert the character variable 'Education' to integer codes
# (1 = High School, 2 = College, 3 = Graduate, in order of first appearance)
data$Education <- as.numeric(factor(data$Education, levels = unique(data$Education)))

# Impute missing values using Amelia
imputed_data <- amelia(data, m = 5, ty = "mix")

# View imputed datasets
print("Imputed datasets:")
print(imputed_data$imputations$imp1)  # View the first imputed dataset


Loading required package: Rcpp

## 
## Amelia II: Multiple Imputation
## (Version 1.8.1, built: 2022-11-18)
## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
## 

“You have a small number of observations, relative to the number, of variables in the imputation model.  Consider removing some variables, or reducing the order of time polynomials to reduce the number of parameters.”


-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 23
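
The combining step of multiple imputation is usually done with Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus $(1 + 1/m)$ times the between-imputation variance. The sketch below is a minimal base R version. The function name `pool_rubin` and the five estimates and standard errors are hypothetical; they are not results from the Amelia run above.

```r
# Hypothetical helper: pool one parameter across m imputed datasets
pool_rubin <- function(estimates, variances) {
    m <- length(estimates)
    q_bar <- mean(estimates)      # pooled point estimate
    within <- mean(variances)     # average within-imputation variance
    between <- var(estimates)     # between-imputation variance
    total_var <- within + (1 + 1 / m) * between
    c(estimate = q_bar, se = sqrt(total_var))
}

# Made-up estimates and standard errors from m = 5 separate analyses
est <- c(10.2, 9.8, 10.5, 10.1, 9.9)
se  <- c(1.1, 1.0, 1.2, 1.1, 1.0)

pooled <- pool_rubin(est, se^2)
print(pooled)
```

The pooled standard error is larger than the typical single-dataset standard error, because it also reflects how much the estimates vary from one imputed dataset to the next.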

# 3. Implications and Considerations when Dealing with Missing Data

- Bias in results: ignoring or mishandling missing data may lead to inaccurate or misleading conclusions.
- Reduced sample size: dropping incomplete observations leaves fewer data points and lowers statistical power.
- Selection bias can occur when missingness is related to specific characteristics or outcomes.
- Deleting observations with missing data can lead to loss of potentially valuable information.
- Imputation methods can introduce bias. It's crucial to understand the assumptions underlying these methods and their potential impact on the results.
- Be transparent: report how you handled missing data.

# 4. Summary of Strategies

| **Strategy**                       | **Description**                                                                                                                                                                   | **Pros**                                                     | **Cons**                                                                                                           |
| ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------- |
| **Complete Case Analysis**          | Exclude observations with missing values.                                                                         | Simple and easy to implement.                                | May lead to biased results if missingness is not completely at random (MCAR). Reduces sample size.                |
| **Mean/Median Imputation**         | Replace missing values with the mean or median of the observed values for that variable.                           | Easy and quick. Can work well if missingness is MCAR.       | May distort variable distributions. Ignores relationships.                                                            |
| **Regression Imputation**           | Predict missing values using regression models based on other observed variables.                                | Captures relationships between variables.                  | Assumes a linear relationship. Sensitive to model assumptions.                                                       |
| **Multiple Imputation**            | Generate multiple datasets with imputed values and pool results.                                                   | Accounts for uncertainty. Preserves variability.            | Requires more advanced methods. Can be computationally intensive.                                                   |
| **Last Observation Carried Forward**| Use the last available observation to impute missing values.                                                      | Simple, suitable for time-series data.                     | Assumes values are stable over time. May not be suitable for all types of data.                                    |
| **Interpolation/Extrapolation**    | Estimate missing values based on trends in the observed data.                                                      | Useful for time-series data.                               | Assumes a consistent pattern. May not be suitable for all types of data.                                           |
| **Dummy Variable Indicators**       | Create dummy variables indicating missingness.                                                                   | Preserves information about missingness.                  | Adds complexity to the model. May require careful interpretation.                                                 |
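
Three of the simpler strategies in the table (last observation carried forward, linear interpolation, and a missingness indicator) can be sketched in base R. The series `v` and the helper `locf` are hypothetical and written only for this illustration.

```r
# Hypothetical short time series with gaps
v <- c(NA, 4, NA, NA, 10, NA)

# Last observation carried forward (hypothetical helper)
locf <- function(v) {
    # Position of the most recent observed value at each time point
    idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
    idx[idx == 0] <- NA   # nothing observed yet: stay missing
    v[idx]
}
v_locf <- locf(v)

# Linear interpolation between observed points;
# positions outside the observed range stay NA
v_interp <- approx(seq_along(v), v, xout = seq_along(v))$y

# Dummy variable flagging which values were originally missing
v_was_missing <- as.integer(is.na(v))

print(data.frame(v, v_locf, v_interp, v_was_missing))
```

Neither method fills the leading gap, since there is no earlier observation to carry forward or to interpolate from.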
