<center><h1>Missing Data Methods & Mechanisms</h1></center>
<center><h3>Paul Stey</h3></center>

# 1. Missing Data Methods

  * Entire field of research in statistics
  * Two canonical textbooks
    - _Statistical Analysis with Missing Data_, Little \& Rubin
    - _Applied Missing Data Analysis_, Enders

## 1.1 Missing Data

In general, missing data refers to any instance in which the value of a variable is not recorded for one or more of our observations.

<br>
<br>
<center>¯\_(ツ)_/¯</center>

### 1.1.1 Why are data missing?
    
Missing data may arise for any number of reasons. For example,
 1. Patient left our clinical trial early
 2. Survey respondent failed to complete all items on the questionnaire
 3. Hard drive failure on server storing data
 4. Respondent declined to answer question

### 1.2.1 Mechanisms of Missingness

There are a few recognized forms of missingness, often called _missingness mechanisms_:


* Missing completely at random (MCAR)
* Missing at random (MAR)
* Missing not at random (MNAR)
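
For reference, the three mechanisms are commonly written in terms of a missingness indicator. The notation here is an assumption that does not appear elsewhere in these notes: $R$ indicates which values are missing, $Y_{obs}$ is the observed part of the data, and $Y_{mis}$ is the part that is missing.

$$
\begin{aligned}
\text{MCAR:} \quad & P(R \mid Y_{obs}, Y_{mis}) = P(R) \\
\text{MAR:} \quad & P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs}) \\
\text{MNAR:} \quad & P(R \mid Y_{obs}, Y_{mis}) \text{ depends on } Y_{mis}
\end{aligned}
$$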

### 1.2.2 Missing Completely At Random (MCAR)

The designation "missing completely at random" is used when the probability of missing data on a variable, $Y$, is related neither to other measured variables nor to the values of $Y$ itself. A hard drive failure that destroys an arbitrary subset of records is one example.

### 1.2.3 Missing at Random (MAR)

Data are said to be "missing at random" when the probability of missing data on a variable, say $Y$, is related to some other measured variable(s) in the model, but not to the values of $Y$ itself once those variables are taken into account.

### 1.2.4 Missing Not at Random (MNAR)

Data are described as "missing not at random" when the probability of missing data on a variable, $Y$, is related to the values of $Y$ itself, even after accounting for the other measured variables.
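
The three mechanisms can be made concrete with a small simulation. This is a minimal sketch, not part of the original analysis: the variables `x` and `y`, the sample size, and the coefficients passed to `plogis()` are all arbitrary choices made for illustration.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)              # fully observed covariate (hypothetical)
y <- 0.5 * x + rnorm(n)    # variable that will receive missing values

# MCAR: every value of y has the same 30% chance of being missing
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

# MAR: the chance of missingness depends only on the observed covariate x
y_mar <- ifelse(runif(n) < plogis(-1 + 1.5 * x), NA, y)

# MNAR: the chance of missingness depends on the value of y itself
y_mnar <- ifelse(runif(n) < plogis(-1 + 1.5 * y), NA, y)

# Compare the mean of the observed values under each mechanism
round(c(full = mean(y),
        mcar = mean(y_mcar, na.rm = TRUE),
        mar  = mean(y_mar,  na.rm = TRUE),
        mnar = mean(y_mnar, na.rm = TRUE)), 2)
```

Under MCAR the observed mean should stay close to the full-data mean. Under MAR and MNAR, larger values of `y` are more likely to be missing in this setup, so the observed mean is pulled downward.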

# 2. Methods of Addressing Missingness


There are many approaches to dealing with missingness. They differ substantially in their properties and in when they can be used (if at all).

## 2.1 Listwise Deletion

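* Delete every observation (row) that has a missing value on any variable
* Generally not a good idea: it discards observed data and can bias estimates unless the data are MCAR
* But very commonly done, nonetheless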

### 2.1.1 Listwise Deletion Example

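In [None]:
# Read in the arrests data and check its dimensions (rows, columns)
arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")

dim(arrests_df)

In [None]:
# TRUE for rows that have no missing values in any column
is_complete_obs <- complete.cases(arrests_df)

# Keep only the complete rows
arrests_comp <- arrests_df[is_complete_obs, ]

# Compare with the dimensions above to see how many rows were dropped
dim(arrests_comp)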
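
The cost of listwise deletion is easier to see on a data set small enough to inspect by eye. The sketch below uses a made-up data frame, `toy_df`, whose column names and values are hypothetical and not taken from the arrests data.

```r
# Hypothetical toy data: each column is missing only one or two values
toy_df <- data.frame(
    age    = c(34, NA, 51, 27, 45, NA),
    gender = c("F", "M", NA, "F", "M", "M"),
    charge = c("A", "B", "C", NA, "A", "B")
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy_df))
n_missing

# Listwise deletion: keep only rows with no missing values
toy_comp <- toy_df[complete.cases(toy_df), ]

# Proportion of rows that survive deletion
nrow(toy_comp) / nrow(toy_df)
```

No single column is badly affected, yet the missing values fall on different rows, so most of the rows are dropped.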

## 2.2 Single Imputation

* Examples: mean imputation, regression imputation, simple random sample ("hot-deck") imputation
* Keeps every observation, unlike deletion, but can still bias estimates
* Some single imputation methods (e.g., mean imputation) reduce the variance of our variables


<img src="images/mean_impute.png"></img>

<img src="images/regression_impute.png"></img>
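
The variance reduction caused by mean imputation can be checked directly. This is a minimal sketch on simulated data: the variable `x`, its mean and standard deviation, and the amount of missingness are arbitrary assumptions.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# Remove 60 values at random
x_miss <- x
x_miss[sample(200, 60)] <- NA

# Mean imputation: replace every missing value with the observed mean
x_mean_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

# Standard deviation of the observed values vs. the mean-imputed variable
c(observed_sd = sd(x_miss, na.rm = TRUE),
  imputed_sd  = sd(x_mean_imp))
```

Every imputed value sits exactly at the observed mean, so it adds nothing to the sum of squared deviations while still increasing the sample size. The imputed variable therefore always has the smaller standard deviation.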


### 2.2.1 Simple Random Sample ("hot-deck") Imputation

* Replaces each missing value with a random draw from the observed values
* Preserves the distribution of the variable itself, but not its relationships with other variables

In [None]:
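hotdeck <- function(v) {
    is_missing <- is.na(v)
    obs_values <- v[!is_missing]

    # Nothing to impute, or no observed values to draw from
    if (!any(is_missing) || length(obs_values) == 0) {
        return(v)
    }

    # Draw positions rather than values: sample(x, 1) on a single number x
    # would sample from 1:x instead of returning x
    draws <- sample.int(length(obs_values), sum(is_missing), replace = TRUE)
    v[is_missing] <- obs_values[draws]
    return(v)
}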

### 2.2.2 Running "Hot-Deck" Example

In [None]:
animals <- c("cat", "dog", "cat", NA, NA, "dog", "bird", NA, "dog", NA)

In [None]:
animals_comp <- hotdeck(animals)

table(animals_comp)

## 2.3 Multiple Imputation

* Current "gold standard" in missing data methods
* Generates several imputed data sets, analyzes each one, and pools the results
* Preserves the variance-covariance structure of the data set
* Implemented in the _mice_ package in R
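
A typical _mice_ workflow has three steps: impute, analyze, pool. The sketch below assumes the _mice_ package is installed and uses `nhanes`, a small example data set that ships with the package. The number of imputations, the imputation method, and the regression model are arbitrary choices for illustration.

```r
library(mice)  # assumes the mice package is installed

# nhanes: example data shipped with mice (columns age, bmi, hyp, chl)
# Step 1: create m = 5 imputed data sets using predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Step 2: fit the same model to each imputed data set
fits <- with(imp, lm(chl ~ age + bmi))

# Step 3: pool the estimates across imputations
pooled <- pool(fits)
summary(pooled)

# Extract the first completed data set for inspection
completed_1 <- complete(imp, 1)
head(completed_1)
```

Because the pooled standard errors combine within-imputation and between-imputation variability, they reflect the uncertainty about the imputed values, which single imputation ignores.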