# Missing Data

Missing data are common, and dealing with them is time consuming. Fixing the problems they cause sometimes takes longer than the analysis itself.

#### Fix-up methods when data are missing for noninformative reasons:
1. Delete the cases with missing values. This is the simplest solution, and it is acceptable if it causes the loss of only a relatively small number of cases.
2. Fill in or impute the missing values, using the rest of the data.
    - Replacing the missing value with the average of the observed values is the simplest approach.
    - Regressing the variable on the other predictors is another possibility.
3. Consider $(x_i, y_i)$ pairs in which some observations are missing. The means and SDs of $x$ and $y$ can still be used in the estimate even when one member of a pair is missing.
4. Maximum likelihood methods can be used assuming the multivariate normality of the data. The EM algorithm is often used here
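
As a rough illustration of the third approach, the sketch below estimates a simple regression from summary statistics. The means and SDs use every observed value, while the correlation can only come from complete pairs. The vectors `x` and `y` are invented toy values, not data from this notebook.

```r
# Toy (x, y) pairs; the values are invented purely for illustration
x <- c(1.2, 2.4, NA, 4.1, 5.0, NA, 7.3, 8.8)
y <- c(2.0, NA, 3.9, 5.2, NA, 6.8, 8.1, 9.5)

# Complete-case deletion keeps only pairs with both members observed
n_complete <- sum(!is.na(x) & !is.na(y))

# Available-case summaries use every observed value of each variable
mean_x <- mean(x, na.rm = TRUE)
mean_y <- mean(y, na.rm = TRUE)
sd_x <- sd(x, na.rm = TRUE)
sd_y <- sd(y, na.rm = TRUE)

# The correlation needs both members, so it uses complete pairs only
r_xy <- cor(x, y, use = "complete.obs")

# Simple-regression slope and intercept assembled from the pieces
b1 <- r_xy * sd_y / sd_x
b0 <- mean_y - b1 * mean_x
c(n_complete = n_complete, intercept = b0, slope = b1)
```

Here each mean and SD is based on six values, although only four pairs are complete.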


In [4]:
library(faraway)

In [6]:
data(chmiss)
head(chmiss)

| | race | fire | theft | age | involact | income |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 60626 | 10.0 | 6.2 | 29 | 60.4 | NA | 11.744 |
| 60640 | 22.2 | 9.5 | 44 | 76.5 | 0.1 | 9.323 |
| 60613 | 19.6 | 10.5 | 36 | NA | 1.2 | 9.948 |
| 60657 | 17.3 | 7.7 | 37 | NA | 0.5 | 10.656 |
| 60614 | 24.5 | 8.6 | 53 | 81.4 | 0.7 | 9.73 |
| 60610 | 54.0 | 34.1 | 68 | 52.6 | 0.3 | 8.231 |


* There are 20 missing observations denoted by NA here
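
A quick way to locate the gaps is to count the `NA` values per column and the number of incomplete rows. The sketch below re-enters by hand the six rows printed by `head(chmiss)`, so that it runs without the `faraway` package. The name `chm6` is an illustrative assumption, and the same two calls should work unchanged on the full `chmiss` data frame.

```r
# The six rows shown by head(chmiss), typed in by hand (chm6 is a made-up name)
chm6 <- data.frame(
  race     = c(10.0, 22.2, 19.6, 17.3, 24.5, 54.0),
  fire     = c(6.2, 9.5, 10.5, 7.7, 8.6, 34.1),
  theft    = c(29, 44, 36, 37, 53, 68),
  age      = c(60.4, 76.5, NA, NA, 81.4, 52.6),
  involact = c(NA, 0.1, 1.2, 0.5, 0.7, 0.3),
  income   = c(11.744, 9.323, 9.948, 10.656, 9.730, 8.231),
  row.names = c("60626", "60640", "60613", "60657", "60614", "60610")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(chm6))
na_per_column

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(chm6))
n_incomplete
```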

In [7]:
g = lm(involact ~ ., chmiss)
summary(g)


Call:
lm(formula = involact ~ ., data = chmiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53370 -0.16325 -0.07015  0.12615  0.66316 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.116483   0.605761  -1.843 0.079475 .  
race         0.010487   0.003128   3.352 0.003018 ** 
fire         0.043876   0.010319   4.252 0.000356 ***
theft       -0.017220   0.005900  -2.918 0.008215 ** 
age          0.009377   0.003494   2.684 0.013904 *  
income       0.068701   0.042156   1.630 0.118077    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3382 on 21 degrees of freedom
  (20 observations deleted due to missingness)
Multiple R-squared:  0.7911,	Adjusted R-squared:  0.7414 
F-statistic: 15.91 on 5 and 21 DF,  p-value: 1.594e-06
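
The row accounting behind such a summary can be checked directly on a fitted model. The sketch below uses simulated data (the data frame `d` and its columns are hypothetical) to show that `lm` silently drops incomplete rows under its default `na.action`, and how to recover the counts.

```r
set.seed(1)

# Simulated data: 30 rows, two predictors, one response (all hypothetical)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Knock out values so that exactly 10 rows are incomplete
d$x1[1:6] <- NA
d$x2[7:10] <- NA

fit <- lm(y ~ ., data = d)

n_used    <- nobs(fit)                 # rows actually used in the fit
n_dropped <- length(na.action(fit))    # rows removed because of NAs
df_resid  <- df.residual(fit)          # rows used minus coefficients

c(used = n_used, dropped = n_dropped, residual_df = df_resid)
```

Applied to the fit above, the same accounting gives 21 + 6 = 27 cases used and 27 + 20 = 47 rows in total.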


* Only 21 residual degrees of freedom remain: 20 of the 47 cases were dropped, so almost half of the data is lost
* Let's try some data imputation using the mean

In [11]:
# first calculate the means
cmeans = apply(chmiss, 2, mean, na.rm=TRUE)
round(cmeans, 3)

# duplicate data to preserve the original set
mchm = chmiss

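# fill in each predictor's missing values with its column mean;
# column 5 (involact, the response) is deliberately left untouched
for (i in c(1,2,3,4,6)) mchm[is.na(chmiss[,i]), i] = cmeans[i]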
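
One side effect of mean fill-in is worth seeing on a small scale: it leaves the column mean unchanged but shrinks the spread, because every filled-in value sits exactly at the centre. The sketch below uses the six `age` values printed by `head(chmiss)`; the variable names are illustrative.

```r
# The six 'age' values shown by head(chmiss); two of them are missing
age <- c(60.4, 76.5, NA, NA, 81.4, 52.6)

# Replace each missing value with the mean of the observed values
age_filled <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# The mean is preserved by construction
c(observed = mean(age, na.rm = TRUE), filled = mean(age_filled))

# The standard deviation shrinks: same sum of squares, larger divisor
c(observed = sd(age, na.rm = TRUE), filled = sd(age_filled))
```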

In [12]:
g = lm(involact ~. , mchm)
summary(g)


Call:
lm(formula = involact ~ ., data = mchm)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02401 -0.15681 -0.00333  0.22201  0.81174 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.070802   0.509453   0.139  0.89020   
race         0.007117   0.002706   2.631  0.01224 * 
fire         0.028742   0.009385   3.062  0.00402 **
theft       -0.003059   0.002746  -1.114  0.27224   
age          0.006080   0.003208   1.895  0.06570 . 
income      -0.027092   0.031678  -0.855  0.39779   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 38 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.682,	Adjusted R-squared:  0.6401 
F-statistic:  16.3 on 5 and 38 DF,  p-value: 1.409e-08


* There are important differences between the two fits. For example, theft and age are significant in the first fit, but not in the second
* The coefficients have also all moved closer to zero, a situation analogous to the errors-in-variables case
* The bias introduced can be substantial
* Let's try a regression method

In [14]:
gr = lm(race ~ fire + theft + age + income, chmiss)
chmiss[is.na(chmiss$race), ]

| | race | fire | theft | age | involact | income |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 60646 | NA | 5.7 | 11 | 27.9 | 0.0 | 16.25 |
| 60651 | NA | 15.1 | 30 | 89.8 | 0.8 | 10.51 |
| 60616 | NA | 12.2 | 46 | 48.0 | 0.6 | 8.212 |
| 60617 | NA | 10.8 | 34 | 58.0 | 0.9 | 11.156 |


In [16]:
round(predict(gr, chmiss[is.na(chmiss$race), ]), 3)

* Notice that the first prediction is negative, which is impossible for a percentage such as race
* When the response is bounded between zero and one, a useful trick is to apply the logit transformation

$$
y \rightarrow \log \left( \frac{y}{1-y} \right)
$$


In [19]:
logit = function(x) log(x/ (1-x))
ilogit = function(x) exp(x)/(1+exp(x))
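
Base R's `stats` package already provides these two transforms as `qlogis` (logit) and `plogis` (inverse logit). The sketch below checks that they agree with the hand-written versions and shows one practical difference: `exp(x) / (1 + exp(x))` overflows for large `x`, whereas `plogis` does not.

```r
# Hand-written versions, as defined in the cell above
logit  <- function(x) log(x / (1 - x))
ilogit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.1, 0.5, 0.9)

# Same values as the built-in quantile and distribution functions
all.equal(logit(p), qlogis(p))
all.equal(ilogit(logit(p)), p)

# For large inputs exp(x) overflows to Inf, and Inf / Inf is NaN
ilogit(800)

# plogis handles the same input without overflow
plogis(800)
```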

In [20]:
# refit with a logit-transformed response; race is a percentage, so rescale it to 0-1 first
gr = lm(logit(race/100) ~  fire + theft + age + income, chmiss)

In [22]:
round(ilogit(predict(gr, chmiss[is.na(chmiss$race), ]))*100, 3)
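
To use such predictions in a later fit, they still have to be written back into a copy of the data. The sketch below shows one way to do that on simulated data: the data frame `toy`, its coefficients, and the positions in `miss` are all invented, and only the column names are borrowed from `chmiss`. Back-transforming with `plogis` keeps every imputed percentage strictly between 0 and 100.

```r
set.seed(42)

# Simulated stand-in data (values are invented, not taken from chmiss)
n <- 40
toy <- data.frame(fire = runif(n, 2, 40), income = runif(n, 5, 20))
toy$race <- 100 * plogis(-1 + 0.08 * toy$fire - 0.05 * toy$income +
                           rnorm(n, sd = 0.5))

# Make four values missing at arbitrary positions
miss <- c(3, 11, 25, 34)
toy$race[miss] <- NA

# Fit on the logit scale; lm drops the rows where race is NA
fit <- lm(qlogis(race / 100) ~ fire + income, data = toy)

# Predict the missing rows and map back to the percentage scale
pred <- 100 * plogis(predict(fit, newdata = toy[miss, ]))

# Write the imputed values into a copy, leaving the original intact
filled <- toy
filled$race[miss] <- pred
round(pred, 3)
```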

* Let's now see how the predicted values compare to the actual values

In [23]:
data(chredlin)
chredlin$race[is.na(chmiss$race)]

* The first two predictions are good, but the other two are somewhat wide of the mark
* Like the fill-in method, the regression method will also introduce a bias towards zero in the coefficients while tending to reduce the variance
* Success of the regression method depends somewhat on the collinearity of the predictors.
* For data with a substantial proportion of missing values, it is recommended to investigate more sophisticated methods like EM or multiple imputation
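
To give a flavour of the EM approach, here is a minimal sketch for the simplest case: bivariate normal data in which `x` is fully observed and some `y` values are missing. The function `em_bvn` and the simulated data are hypothetical, written only for illustration. Real problems with several incomplete variables are better served by a dedicated package.

```r
# EM for a bivariate normal with missing y (x fully observed); a sketch only
em_bvn <- function(x, y, max_iter = 1000, tol = 1e-10) {
  mis  <- is.na(y)
  mu_x <- mean(x)
  s_xx <- mean((x - mu_x)^2)                 # ML variance (divisor n)
  # Start from complete-case estimates
  mu_y <- mean(y[!mis])
  s_yy <- mean((y[!mis] - mu_y)^2)
  s_xy <- mean((x[!mis] - mean(x[!mis])) * (y[!mis] - mu_y))
  for (iter in seq_len(max_iter)) {
    beta    <- s_xy / s_xx                   # slope of y on x
    res_var <- s_yy - s_xy^2 / s_xx          # conditional variance of y given x
    # E-step: expected y and y^2 for the missing entries
    ey <- y
    ey[mis] <- mu_y + beta * (x[mis] - mu_x)
    ey2 <- ey^2
    ey2[mis] <- ey2[mis] + res_var
    # M-step: update the moments from the completed sufficient statistics
    new <- c(mean(ey),
             mean(ey2) - mean(ey)^2,
             mean(x * ey) - mu_x * mean(ey))
    delta <- max(abs(new - c(mu_y, s_yy, s_xy)))
    mu_y <- new[1]; s_yy <- new[2]; s_xy <- new[3]
    if (delta < tol) break
  }
  list(mu_x = mu_x, mu_y = mu_y, s_xx = s_xx, s_yy = s_yy, s_xy = s_xy,
       iterations = iter)
}

# Simulated data in which y is missing whenever x is large
set.seed(7)
n <- 200
x <- rnorm(n, mean = 10, sd = 2)
y <- 3 + 0.8 * x + rnorm(n)
y[x > 11] <- NA

fit_em <- em_bvn(x, y)

# Known closed form for this pattern: the complete-case mean of y, adjusted
# by the complete-case slope times the shift in the mean of x
b_cc <- unname(coef(lm(y ~ x))[2])
mu_y_closed <- mean(y, na.rm = TRUE) + b_cc * (mean(x) - mean(x[!is.na(y)]))

c(em = fit_em$mu_y, closed_form = mu_y_closed,
  complete_case = mean(y, na.rm = TRUE))
```

For this particular missingness pattern the EM estimate of the mean of `y` should agree with the closed-form adjustment, which offers a convenient check on the iteration.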