# Handling Missing Values with the mice Package


In [3]:
library(mice)
library(dplyr, warn.conflicts = FALSE)

Warning message: package 'mice' was built under R version 3.3.2
Loading required package: Rcpp
mice 2.25 2015-11-09


The `mice` package can be used to handle missing values in a data set.

In [4]:
data("BostonHousing", package="mlbench")
original <- BostonHousing

In [5]:
head(original)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7


In [6]:
glimpse(original)

Observations: 506
Variables: 14
$ crim    (dbl) 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
$ zn      (dbl) 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
$ indus   (dbl) 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
$ chas    (fctr) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ nox     (dbl) 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
$ rm      (dbl) 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
$ age     (dbl) 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
$ dis     (dbl) 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
$ rad     (dbl) 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
$ tax     (dbl) 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
$ ptratio (dbl) 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
$ b       (dbl) 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
$ lstat   (dbl) 4.98

## 1. Introducing missing values

In [15]:
set.seed(123)
original[sample(1:nrow(original),40), "rad"] <- NA
original[sample(1:nrow(original), 40), "ptratio"] <- NA
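The same pattern can be tried on a small data frame first. This is a minimal sketch: `toy` and its values are hypothetical stand-ins for `BostonHousing`, and only base R is used.

```r
# Toy data frame standing in for BostonHousing (hypothetical values).
set.seed(123)
toy <- data.frame(x = 1:10, y = seq(0.5, 5, by = 0.5))

# sample() draws 3 distinct row indices; those cells of column "y" become NA.
na_rows <- sample(1:nrow(toy), 3)
toy[na_rows, "y"] <- NA

sum(is.na(toy$y))   # number of missing values in y
sum(is.na(toy$x))   # x is left untouched
```

Because the notebook calls `sample()` separately for `rad` and `ptratio`, the two variables generally lose values in different rows.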

## 2. Checking where values are missing

In [8]:
md.pattern(original)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,tax,ptratio,b,lstat,medv,rad,Unnamed: 15
466.0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
40.0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1
,0,0,0,0,0,0,0,0,0,0,0,0,0,40,40


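The output shows that 466 rows have no missing values and 40 rows have one. Inside the table, 1 marks an observed value and 0 a missing one. The last column gives the number of missing variables in each row pattern (1 here), and the bottom row gives the number of missing values per variable (40 for `rad`, 40 in total).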
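The counts that `md.pattern()` reports can be cross-checked with base R. This is a small sketch on a hypothetical data frame `toy`, not on the notebook's data.

```r
# Hypothetical data frame with a few missing cells.
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c("x", "y", NA, "z"),
                  c = 1:4)

na_per_column <- colSums(is.na(toy))       # missing values per variable
na_per_row    <- rowSums(is.na(toy))       # missing values per observation
n_complete    <- sum(complete.cases(toy))  # rows without any missing value

na_per_column
n_complete
```

`na_per_column` corresponds to the bottom row of the `md.pattern()` table, and `n_complete` to the count shown for the all-observed pattern.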

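## 3.1 Treatment option: deleting the observations

Deleting rows with missing values is only acceptable when it does not substantially affect the subsequent modelling.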

In [9]:
lm(medv~ptratio+rad, data=original, na.action=na.omit)


Call:
lm(formula = medv ~ ptratio + rad, data = original, na.action = na.omit)

Coefficients:
(Intercept)      ptratio          rad  
    56.6771      -1.7501      -0.1962  
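Before relying on `na.omit`, it helps to see how many rows the model actually used. The following is a sketch on hypothetical toy data; `nobs()` reports the number of observations that went into the fit.

```r
# Hypothetical data: two of the six rows contain a missing predictor.
toy <- data.frame(y  = c(1.2, 2.1, 2.9, 4.3, 5.1, 5.8),
                  x1 = c(1, NA, 3, 4, 5, 6),
                  x2 = c(2, 1, NA, 3, 5, 4))

fit <- lm(y ~ x1 + x2, data = toy, na.action = na.omit)

n_used    <- nobs(fit)             # rows that entered the fit
n_dropped <- nrow(toy) - n_used    # rows silently removed by na.omit
c(used = n_used, dropped = n_dropped)
```

If `n_dropped` is a large share of the data, deletion is probably not a safe choice.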


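## 3.2 Deleting the variable

If a variable has too many missing values, we have to consider dropping that variable altogether.

## 3.3 Imputation with mean/median/mode

Replace the missing values with the mean, median or mode. This is only reasonable when the variable does not vary much.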
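One way to make "too many" concrete is a cutoff on the fraction of missing values per column. The sketch below uses a hypothetical data frame `toy` and a hypothetical threshold `max_na_fraction`; the right cutoff depends on the problem.

```r
# Hypothetical data: column "a" is mostly missing.
toy <- data.frame(a = c(1, NA, NA, NA),
                  b = c(1, 2, NA, 4),
                  c = 1:4)

na_fraction <- colMeans(is.na(toy))   # share of missing values per variable
max_na_fraction <- 0.5                # hypothetical cutoff, chosen for illustration

# Keep only the variables at or below the cutoff.
reduced <- toy[, na_fraction <= max_na_fraction, drop = FALSE]
names(reduced)
```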

In [19]:
library(Hmisc)
impute(original$ptratio, mean)      # replace missing values with the mean
# impute(original$ptratio, median)  # replace with the median
# impute(original$ptratio, 20)      # replace with a constant (20)

        1         2         3         4         5         6         7         8 
18.42232*  17.80000  17.80000  18.70000  18.70000  18.70000  15.20000  15.20000 
        9        10        11        12        13        14        15        16 
 15.20000  15.20000  15.20000  15.20000  15.20000  21.00000  21.00000  21.00000 
       17        18        19        20        21        22        23        24 
 21.00000  21.00000  21.00000  21.00000  21.00000  21.00000 18.42232*  21.00000 
       25        26        27        28        29        30        31        32 
 21.00000  21.00000  21.00000  21.00000  21.00000  21.00000  21.00000  21.00000 
       33        34        35        36        37        38        39        40 
 21.00000  21.00000  21.00000  19.20000  19.20000  19.20000  19.20000  18.30000 
       41        42        43        44        45        46        47        48 
 18.30000  17.90000  17.90000  17.90000  17.90000 18.42232*  17.90000  17.90000 
       49        50        5

This is equivalent to:

In [20]:
# original$ptratio[is.na(original$ptratio)] <- mean(original$ptratio, na.rm = TRUE)
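The same idea extends to the median and the mode. Here is a minimal base-R sketch; `impute_with` and `stat_mode` are hypothetical helper names, not functions from `Hmisc`, and `stat_mode` is only meant for factor or character vectors.

```r
# Replace the NAs in x with a summary of the observed values.
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x[!is.na(x)])
  x
}

# Most frequent value of a factor or character vector (returned as a string).
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

num <- c(1, 2, NA, 4, 13)
impute_with(num, mean)     # NA replaced by the mean of 1, 2, 4, 13
impute_with(num, median)   # NA replaced by the median of 1, 2, 4, 13

fac <- factor(c("a", "b", "b", NA))
impute_with(fac, stat_mode)  # NA replaced by the most frequent level
```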


How well does mean imputation perform? Let's compute the error against the true values.


In [37]:
library(DMwR)  # provides regr.eval()
actuals <- BostonHousing$ptratio[is.na(original$ptratio)]
preds <- impute(original$ptratio, mean)[is.na(original$ptratio)]
regr.eval(actuals, preds)
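For readers without `DMwR`, the usual regression error metrics can be computed by hand. This sketch assumes that `regr.eval()` reports mae, mse, rmse and mape by default; `regression_errors` is a hypothetical helper name and the input numbers are made up.

```r
# Common regression error metrics between true and imputed values.
regression_errors <- function(actual, predicted) {
  err <- actual - predicted
  c(mae  = mean(abs(err)),            # mean absolute error
    mse  = mean(err^2),               # mean squared error
    rmse = sqrt(mean(err^2)),         # root mean squared error
    mape = mean(abs(err / actual)))   # mean absolute percentage error
}

errors <- regression_errors(actual = c(10, 20, 30), predicted = c(12, 18, 33))
errors
```

Lower values mean the imputed values are closer to the true ones, which makes these metrics convenient for comparing the imputation methods in this notebook.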

## 3.4 Prediction

### i) kNN imputation

`DMwR::knnImputation` uses a k-nearest-neighbours approach to impute missing values.

For each observation with a missing value, it finds the k nearest neighbours and computes their distance-weighted average.

In [26]:
library(DMwR)
knnoutput <- knnImputation(original[,!names(original) %in% "medv"])
anyNA(knnoutput)
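To make the distance-weighted average concrete, here is a simplified base-R sketch for a single missing numeric value. It is not the `DMwR` implementation: `knn_impute_value` is a hypothetical helper, the `exp(-d)` weights and the lack of scaling are assumptions of this sketch, and all predictors are assumed to be numeric and complete.

```r
# Impute data[row, target] from the k nearest rows that have the target observed.
knn_impute_value <- function(data, row, target, k = 3) {
  predictors <- setdiff(names(data), target)
  donors <- data[!is.na(data[[target]]), , drop = FALSE]

  # Euclidean distance from the incomplete row to every donor row.
  query <- unlist(data[row, predictors, drop = FALSE])
  diffs <- sweep(as.matrix(donors[, predictors, drop = FALSE]), 2, query)
  d <- sqrt(rowSums(diffs^2))

  # Closer neighbours get larger weights (assumed weighting: exp(-distance)).
  nearest <- order(d)[seq_len(min(k, length(d)))]
  w <- exp(-d[nearest])
  sum(w * donors[[target]][nearest]) / sum(w)
}

toy <- data.frame(x = c(0, 1, 2, 10, 1), y = c(0, 1, 2, 10, NA))
imputed <- knn_impute_value(toy, row = 5, target = "y", k = 3)
imputed
```

In the toy data the distant row (`x = 10`) is not among the three nearest neighbours, so it does not influence the imputed value.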

Compute the accuracy, i.e. the regression error between the original values and the imputed values:

In [30]:
actuals <- BostonHousing$ptratio[is.na(original$ptratio)]
predicteds <- knnoutput[is.na(original$ptratio), "ptratio"]
regr.eval(actuals, predicteds)

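### ii) rpart decision trees

kNN imputation can run into trouble when the variable to impute is a factor. `rpart` covers both cases: `method = "class"` for categorical targets and `method = "anova"` for numeric ones.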

In [38]:
library(rpart)
# rad is treated as a categorical target, so fit a classification tree.
class_mod <- rpart(rad ~ . - medv, data = original[!is.na(original$rad), ],
                   method = "class", na.action = na.omit)

# ptratio is numeric, so fit a regression ("anova") tree.
anova_mod <- rpart(ptratio ~ . - medv,
                   data = original[!is.na(original$ptratio), ],
                   method = "anova", na.action = na.omit)

# Predict only for the rows where the target is missing.
rad_pred <- predict(class_mod, original[is.na(original$rad), ])  # class probability matrix
ptratio_pred <- predict(anova_mod, original[is.na(original$ptratio), ])


Accuracy for ptratio:

In [40]:
actual1 <- BostonHousing$ptratio[is.na(original$ptratio)]
regr.eval(actual1, ptratio_pred)

Accuracy for rad (misclassification rate):

In [46]:
actual2 <- BostonHousing$rad[is.na(original$rad)]
predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])

mean(actual2 != predicteds)
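The conversion from a class probability matrix to predicted labels can be checked on a tiny example. The matrix `prob` below is made up; its column names play the role of the `rad` levels.

```r
# Hypothetical class probabilities: one row per observation, one column per class.
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("1", "4", "24")))

# For each row, pick the column with the highest probability and use its name as the label.
predicted <- as.numeric(colnames(prob)[apply(prob, 1, which.max)])

actual <- c(1, 4)
misclassification <- mean(actual != predicted)  # share of wrong predictions
misclassification
```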


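### iii) mice: Multivariate Imputation by Chained Equations

`mice()` builds several imputed versions of the data (five by default, as the log below shows); `complete()` returns a completed data set (the first imputation by default).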

In [47]:
library(mice)
micemod <- mice(original[, !names(original) %in% "medv"],
               method="rf")   # based on random forests

miceoutput <- complete(micemod)
anyNA(miceoutput)


 iter imp variable
  1   1  rad  ptratio
  1   2  rad  ptratio
  1   3  rad  ptratio
  1   4  rad  ptratio
  1   5  rad  ptratio
  2   1  rad  ptratio
  2   2  rad  ptratio
  2   3  rad  ptratio
  2   4  rad  ptratio
  2   5  rad  ptratio
  3   1  rad  ptratio
  3   2  rad  ptratio
  3   3  rad  ptratio
  3   4  rad  ptratio
  3   5  rad  ptratio
  4   1  rad  ptratio
  4   2  rad  ptratio
  4   3  rad  ptratio
  4   4  rad  ptratio
  4   5  rad  ptratio
  5   1  rad  ptratio
  5   2  rad  ptratio
  5   3  rad  ptratio
  5   4  rad  ptratio
  5   5  rad  ptratio
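The log lists one pass per iteration and imputation over the incomplete variables. The idea of cycling through the variables can be sketched in base R. This is a deliberately simplified, deterministic single-imputation version: it uses linear models instead of random forests, produces one data set instead of several, and does not draw from a predictive distribution as `mice` does. `chained_impute` is a hypothetical helper and the data are simulated.

```r
# Simplified chained-equations loop for numeric data (not the mice algorithm itself).
chained_impute <- function(data, n_iter = 5) {
  missing <- lapply(data, is.na)
  vars <- names(data)[sapply(missing, any)]   # variables that need imputation
  filled <- data

  # Step 1: start from mean imputation.
  for (v in vars) filled[missing[[v]], v] <- mean(data[[v]], na.rm = TRUE)

  # Step 2: repeatedly re-predict each incomplete variable from all the others.
  for (i in seq_len(n_iter)) {
    for (v in vars) {
      fit <- lm(reformulate(setdiff(names(filled), v), response = v),
                data = filled[!missing[[v]], , drop = FALSE])
      filled[missing[[v]], v] <- predict(fit, filled[missing[[v]], , drop = FALSE])
    }
  }
  filled
}

set.seed(1)
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- 2 * toy$x1 + rnorm(50, sd = 0.1)
toy$x3 <- toy$x1 - toy$x2 + rnorm(50, sd = 0.1)
toy[1:5, "x2"] <- NA
toy[6:10, "x3"] <- NA

completed <- chained_impute(toy)
anyNA(completed)
```

Only the originally missing cells are overwritten; observed values are carried through unchanged.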


Computing the accuracy for ptratio shows a considerable improvement over rpart:

In [50]:
actuals <- BostonHousing$ptratio[is.na(original$ptratio)]
predict_rf <- miceoutput[is.na(original$ptratio), "ptratio"]
regr.eval(actuals, predict_rf)

In [52]:
actuals <- BostonHousing$rad[is.na(original$rad)]
predict_rf <- miceoutput[is.na(original$rad), "rad"]
mean(actuals != predict_rf)

The content above is adapted from http://datascienceplus.com/missing-value-treatment/