# Identifying and Fixing Missing Values

We will cover the following in this chapter:
- identifying missing values
- cleaning missing values
- imputing values with regression
- using KNN imputation
- using random forest for imputation

## Identifying Missing Values

In [1]:
import numpy as np
import pandas as pd

from data.load import load_covid, load_nls97b

In [2]:
nls97 = load_nls97b()
covid = load_covid()

In [3]:
covid.shape

(221, 16)

In [4]:
nls97.shape

(8984, 93)

We count the number of missing values for columns that we may use as features.

We use `axis=0` to sum over the rows for each column. If we want the number of missing values for each row, we can specify `axis=1` when summing.

In [5]:
demo_vars = [
    "population_density",
    "aged_65_older",
    "gdp_per_capita",
    "life_expectancy",
    "diabetes_prevalence",
]

# Count missing values in each column.
covid[demo_vars].isnull().sum(axis=0)

population_density     15
aged_65_older          33
gdp_per_capita         28
life_expectancy         4
diabetes_prevalence    21
dtype: int64

In [6]:
# Find countries with missing values in columns.
demo_vars_missing_count = covid[demo_vars].isnull().sum(axis=1)
demo_vars_missing_count.value_counts().sort_index()

0    181
1     15
2      6
3      5
4     11
5      3
dtype: int64

^ 181 countries have values for all of the features, 11 are missing values for four of the five features, and three are missing values for all of the features.

In [7]:
covid.loc[demo_vars_missing_count >= 4, ["location"] + demo_vars].sample(
    6, random_state=1
).T

iso_code,FLK,NIU,MSR,COK,SYR,GGY
location,Falkland Islands,Niue,Montserrat,Cook Islands,Syria,Guernsey
population_density,,,,,,
aged_65_older,,,,,,
gdp_per_capita,,,,,,
life_expectancy,81.44,73.71,74.16,76.25,72.7,
diabetes_prevalence,,,,,,


Let's also check missing values for total cases and deaths.

In [8]:
total_vars = ["location", "total_cases_mill", "total_deaths_mill"]
covid[total_vars].isnull().sum(axis=0)

location              0
total_cases_mill     29
total_deaths_mill    36
dtype: int64

^ 29 countries have missing values for cases per million in population, and 36 have missing deaths per million.

We should also get a sense of which countries are missing both.

In [9]:
covid[total_vars].isnull().sum(axis=1).value_counts().sort_index()

0    185
1      7
2     29
dtype: int64

^ 29 countries are missing both cases and deaths, and we only have both for 185 countries.
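
To list which rows are missing both values, rather than only counting them, a boolean mask built with `isnull().all(axis=1)` is one option. The sketch below uses a small made-up DataFrame (the locations `A` to `D` and their values are illustrative, not taken from the covid data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the covid data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "location": ["A", "B", "C", "D"],
        "total_cases_mill": [10.0, np.nan, np.nan, 7.5],
        "total_deaths_mill": [0.2, np.nan, 0.1, np.nan],
    }
)

value_cols = ["total_cases_mill", "total_deaths_mill"]

# True only for rows where every listed column is missing.
missing_both = toy[value_cols].isnull().all(axis=1)
missing_both_locations = toy.loc[missing_both, "location"].tolist()
print(missing_both_locations)
```

Only location `B` is missing both values here; `C` and `D` are each missing just one.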

In the NLS dataset, the codes reveal why the respondent did not provide an answer to a question:
- `-3` is an invalid skip
- `-4` is a valid skip
- `-5` is a non-interview
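
As a small illustration of working with these codes, the sketch below maps them to readable labels before counting. The `responses` Series is made up; only the code meanings come from the list above.

```python
import pandas as pd

# Code meanings as listed above; the responses themselves are made up.
nonresponse_labels = {-3: "invalid skip", -4: "valid skip", -5: "non-interview"}
responses = pd.Series([12, -3, 16, -4, -3, 14], name="motherhighgrade")

# Keep only the coded non-responses, then translate each code to its label.
is_nonresponse = responses.isin(list(nonresponse_labels))
label_counts = responses[is_nonresponse].map(nonresponse_labels).value_counts()
print(label_counts)
```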

In [10]:
nls97.columns[-4:]

Index(['motherage', 'parentincome', 'fatherhighgrade', 'motherhighgrade'], dtype='object')

In [11]:
nls97_parents = nls97.iloc[:, -4:].copy()
nls97_parents.shape

(8984, 4)

In [12]:
nls97_parents.loc[
    nls97_parents.motherhighgrade.between(-5, -1), "motherhighgrade"
].value_counts()

-3    523
-4    165
Name: motherhighgrade, dtype: int64

^ There are 523 invalid skips and 165 valid skips. 


Below we look at individuals that have at least one of these non-response values for these four features.

In [13]:
nls97_parents.loc[nls97_parents.apply(lambda x: x.between(-5, -1)).any(axis=1)]

personid,motherage,parentincome,fatherhighgrade,motherhighgrade
100284,22,50000,12,-3
100931,23,60200,-3,13
101122,25,-4,-3,-3
101414,27,24656,10,-3
101526,-3,79500,-4,-4
...,...,...,...,...
999087,-3,121000,-4,16
999103,-3,73180,12,-4
999406,19,-4,17,15
999698,-3,13000,-4,-4


In [14]:
nls97_parents.apply(lambda x: x.between(-5, -1).sum())

motherage           608
parentincome       2396
fatherhighgrade    1856
motherhighgrade     688
dtype: int64

In [15]:
nls97_parents.isnull().sum()

motherage          0
parentincome       0
fatherhighgrade    0
motherhighgrade    0
dtype: int64

For our analysis, the reason why there is a non-response is not important. We should set these values to missing before using these features in our analysis.

In [16]:
nls97_parents.replace(list(range(-5, 0)), np.nan, inplace=True)
nls97_parents.isnull().sum()

motherage           608
parentincome       2396
fatherhighgrade    1856
motherhighgrade     688
dtype: int64
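
An alternative that avoids modifying the DataFrame in place is `DataFrame.mask`, which returns a new object with the flagged cells set to missing. This is a minimal sketch on made-up values, assuming the non-response codes all fall between -5 and -1 as described above.

```python
import pandas as pd

# Made-up rows in the shape of the parent features.
parents = pd.DataFrame(
    {"motherage": [22, -3, 25], "parentincome": [50000, 60200, -4]}
)

# Flag every cell holding a non-response code, then blank those cells out.
is_code = (parents >= -5) & (parents <= -1)
cleaned = parents.mask(is_code)

print(cleaned.isnull().sum())
```

The original `parents` DataFrame is left unchanged, which makes it easy to compare counts before and after.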

## Cleaning missing values

Approaches for handling missing values:
- dropping observations
- assigning a sample-wide summary statistic, such as the mean
- assigning a value based on the mean value for an appropriate subset of the data

In [17]:
nls97 = load_nls97b()
school_record_list = [
    "satverbal",
    "satmath",
    "gpaoverall",
    "gpaenglish",
    "gpamath",
    "gpascience",
    "highestdegree",
    "highestgradecompleted",
]
school_record = nls97[school_record_list]
school_record.shape

(8984, 8)

In [18]:
school_record.isnull().sum(axis=0)

satverbal                7578
satmath                  7577
gpaoverall               2980
gpaenglish               3186
gpamath                  3218
gpascience               3300
highestdegree              31
highestgradecompleted    2321
dtype: int64

^ The overwhelming majority of observations have missing values for `satverbal`. Only 31 observations have missing values for `highestdegree`.

We can create a Series, `missing_counts`, that specifies the number of missing features for each observation.

In [19]:
missing_counts = school_record.isnull().sum(axis=1)
missing_counts.value_counts().sort_index()

0    1087
1     312
2    3210
3    1102
4     176
5     101
6    2039
7     946
8      11
dtype: int64

^ 946 observations have seven missing values for the educational data, while 11 are missing values for all eight features.

Let's drop observations that have missing values for seven or more features out of eight. We can accomplish this by setting the `thresh` parameter of `dropna` to `2`. This will drop observations that have fewer than two non-missing values; that is, 0 or 1 non-missing values. We get the expected number of observations after using `dropna`; that is, 8984-946-11=8027:

In [20]:
school_record = school_record.dropna(thresh=2)
school_record.shape

(8027, 8)

In [21]:
school_record.isnull().sum(axis=1).value_counts().sort_index()

0    1087
1     312
2    3210
3    1102
4     176
5     101
6    2039
dtype: int64
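
The behavior of `thresh` is easier to see on a tiny example. The DataFrame below is made up; its rows have three, one, zero, and two non-missing values respectively.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [2.0, np.nan, np.nan, np.nan],
        "c": [3.0, 5.0, np.nan, 6.0],
    }
)

# thresh=2 keeps only rows with at least two non-missing values.
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```

Rows 1 and 2 are dropped because they have fewer than two non-missing values.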

The following code uses the pandas Series `fillna` method to assign all missing values of `gpaoverall` to the mean value of the Series.

In [22]:
school_record.gpaoverall.agg(["mean", "std", "count"])

mean      281.840773
std        61.635667
count    6004.000000
Name: gpaoverall, dtype: float64

In [23]:
school_record.gpaoverall.fillna(school_record.gpaoverall.mean(), inplace=True)
school_record.gpaoverall.isnull().sum()

0

In [24]:
school_record.gpaoverall.agg(["mean", "std", "count"])

mean      281.840773
std        53.304846
count    8027.000000
Name: gpaoverall, dtype: float64

^ The mean has not changed. However, there is a substantial reduction in the standard deviation, from 61.6 to 53.3. This is one of the disadvantages of using the dataset's mean for all missing values.
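
The shrinkage follows from the arithmetic: filling with the mean adds no squared deviation, while the denominator `n - 1` in the sample variance grows. The sketch below shows this on a made-up Series, then applies the same reasoning to the counts and standard deviation reported above.

```python
import math

import numpy as np
import pandas as pd

# Made-up GPA-like values with two missing entries.
gpa = pd.Series([200.0, 250.0, np.nan, 300.0, np.nan, 350.0])
filled = gpa.fillna(gpa.mean())

before = gpa.agg(["mean", "std", "count"])
after = filled.agg(["mean", "std", "count"])
print(before, after, sep="\n")

# Same reasoning with the reported figures: 6004 observed values, 8027 after filling.
# The sum of squared deviations is unchanged, so std scales by sqrt((n_old-1)/(n_new-1)).
expected_std = 61.635667 * math.sqrt((6004 - 1) / (8027 - 1))
print(round(expected_std, 2))
```

The last value comes out at roughly 53.30, in line with the standard deviation shown after filling.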

In [25]:
wage_income = nls97.wageincome.copy(deep=True)
wage_income.isnull().sum()

3893

In [26]:
wage_income.agg(["mean", "std", "count"])

mean     49477.022196
std      40677.696798
count     5091.000000
Name: wageincome, dtype: float64

In [27]:
wage_income.head().T

personid
100061     12500.0
100139    120000.0
100284     58000.0
100292         NaN
100583     30000.0
Name: wageincome, dtype: float64

For `wageincome`, we assign the nearest non-missing value from a preceding observation.

In [28]:
wage_income.ffill(inplace=True)
wage_income.head().T

personid
100061     12500.0
100139    120000.0
100284     58000.0
100292     58000.0
100583     30000.0
Name: wageincome, dtype: float64

In [29]:
wage_income.isnull().sum()

0

In [30]:
wage_income.agg(["mean", "std", "count"])

mean     49549.331256
std      40014.338205
count     8984.000000
Name: wageincome, dtype: float64

In [31]:
# Doing a backward fill instead of forward fill.
wage_income = nls97.wageincome.copy(deep=True)
wage_income.bfill(inplace=True)

In [32]:
wage_income.head().T

personid
100061     12500.0
100139    120000.0
100284     58000.0
100292     30000.0
100583     30000.0
Name: wageincome, dtype: float64

In [33]:
wage_income.agg(["mean", "std", "count"])

mean     49419.050757
std      41111.535089
count     8984.000000
Name: wageincome, dtype: float64

In [34]:
wage_income = nls97.wageincome.copy(deep=True)
wage_income.fillna(wage_income.mean(), inplace=True)
wage_income.agg(["mean", "std", "count"])

mean     49477.022196
std      30619.954865
count     8984.000000
Name: wageincome, dtype: float64

^ If missing values are randomly distributed, then forward or backward filling has one advantage over using the mean: it is more likely to approximate the distribution of non-missing values for the feature. Notice that the standard deviation barely changed with forward or backward filling (about 40,014 and 41,112, versus 40,678 originally), whereas filling with the mean cut it to about 30,620.
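
One detail worth keeping in mind is that directional fills only work when there is a value to copy from. In the made-up Series below, a backward fill leaves the trailing missing value untouched because nothing follows it.

```python
import numpy as np
import pandas as pd

# Made-up incomes; the last entry has no later value to borrow from.
income = pd.Series([12500.0, np.nan, np.nan, 30000.0, np.nan])

forward = income.ffill()   # copy the previous non-missing value down
backward = income.bfill()  # copy the next non-missing value up

print(forward.tolist())
print(backward.tolist())
```

Checking `isnull().sum()` after filling, as the cells above do, confirms whether any such gaps remain.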

In [35]:
# We should not use the mean across all degrees, but instead find the mean per degree.
nls97.weeksworked17.mean()

39.01664167916042

In [36]:
nls97.groupby(["highestdegree"])["weeksworked17"].mean()

highestdegree
0. None            28.719608
1. GED             34.587264
2. High School     38.150469
3. Associates      40.443508
4. Bachelors       43.565574
5. Masters         45.143123
6. PhD             44.313725
7. Professional    47.195876
Name: weeksworked17, dtype: float64

The following code assigns the mean value of weeks worked across observations with the same degree attainment level, for those observations missing weeks worked.

In [37]:
nls97[~nls97.highestdegree.isnull()].weeksworked17

personid
100061    48.0
100139    52.0
100284     0.0
100292     NaN
100583    52.0
          ... 
999291    52.0
999406    52.0
999543    30.0
999698     0.0
999963    52.0
Name: weeksworked17, Length: 8953, dtype: float64

In [38]:
nls97[~nls97.highestdegree.isnull()].groupby(["highestdegree"])["weeksworked17"].apply(
    lambda group: group.fillna(np.mean(group))
)

personid
100061    48.000000
100139    52.000000
100284     0.000000
100292    43.565574
100583    52.000000
            ...    
999291    52.000000
999406    52.000000
999543    30.000000
999698     0.000000
999963    52.000000
Name: weeksworked17, Length: 8953, dtype: float64

In [39]:
nls97.loc[~nls97.highestdegree.isnull(), "weeksworked17imp"] = (
    nls97[~nls97.highestdegree.isnull()]
    .groupby(["highestdegree"])["weeksworked17"]
    .apply(lambda group: group.fillna(np.mean(group)))
)
nls97[["weeksworked17imp", "weeksworked17", "highestdegree"]].head(10)

personid,weeksworked17imp,weeksworked17,highestdegree
100061,48.0,48.0,2. High School
100139,52.0,52.0,2. High School
100284,0.0,0.0,0. None
100292,43.565574,,4. Bachelors
100583,52.0,52.0,2. High School
100833,47.0,47.0,2. High School
100931,52.0,52.0,3. Associates
101089,52.0,52.0,2. High School
101122,38.150469,,2. High School
101132,44.0,44.0,0. None
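
An alternative way to express the same group-mean fill is `groupby(...).transform("mean")`, which always returns a result aligned to the original index. This is a sketch on made-up rows, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same two column names as the NLS features.
toy = pd.DataFrame(
    {
        "highestdegree": [
            "2. High School",
            "2. High School",
            "4. Bachelors",
            "4. Bachelors",
            "4. Bachelors",
        ],
        "weeksworked17": [40.0, np.nan, 50.0, 46.0, np.nan],
    }
)

# Per-row mean of the row's own degree group (missing values are skipped).
group_mean = toy.groupby("highestdegree")["weeksworked17"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["weeksworked17imp"] = toy["weeksworked17"].fillna(group_mean)
print(toy)
```

The missing High School row receives 40.0 and the missing Bachelors row receives 48.0, the means of their respective groups.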


### Summary

Imputation strategies
- removing observations with missing values
- assigning a dataset's mean or median
- using forward or backward filling
- using the mean of a group defined by a correlated feature


## Imputing values with regression

Regression imputation replaces a feature's missing values with values predicted by a regression model of correlated features.

In [40]:
import statsmodels.api as sm

nls97 = load_nls97b()
nls97.shape

(8984, 93)

In [41]:
nls97[["wageincome", "highestdegree", "weeksworked16", "parentincome"]].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   wageincome     5091 non-null   float64
 1   highestdegree  8953 non-null   object 
 2   weeksworked16  7068 non-null   float64
 3   parentincome   8984 non-null   int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 350.9+ KB


Let's convert the `highestdegree` feature into a numeric value. This will make the analysis we will be doing in the rest of the section easier:

In [42]:
nls97["hdegnum"] = nls97.highestdegree.str[0:1].astype("float")
nls97.groupby(["highestdegree", "hdegnum"]).size()

highestdegree    hdegnum
0. None          0.0         953
1. GED           1.0        1146
2. High School   2.0        3667
3. Associates    3.0         737
4. Bachelors     4.0        1673
5. Masters       5.0         603
6. PhD           6.0          54
7. Professional  7.0         120
dtype: int64

We need to replace logical missing values for `parentincome` with actual missing values.

In [43]:
nls97.parentincome.replace(list(range(-5, 0)), np.nan, inplace=True)
nls97[["wageincome", "hdegnum", "weeksworked16", "parentincome"]].corr()

,wageincome,hdegnum,weeksworked16,parentincome
wageincome,1.0,0.399572,0.18088,0.273167
hdegnum,0.399572,1.0,0.235785,0.326685
weeksworked16,0.18088,0.235785,1.0,0.098737
parentincome,0.273167,0.326685,0.098737,1.0


In [44]:
nls97.weeksworked16.fillna(nls97.weeksworked16.mean(), inplace=True)
nls97.parentincome.fillna(nls97.parentincome.mean(), inplace=True)
nls97["degltcol"] = np.where(nls97.hdegnum <= 2, 1, 0)
nls97["degcol"] = np.where(nls97.hdegnum.between(3, 4), 1, 0)
nls97["degadv"] = np.where(nls97.hdegnum > 4, 1, 0)

In [45]:
def get_linear_model(df, y_colname, x_colnames):
    df = df[[y_colname] + x_colnames].dropna()
    y = df[y_colname]
    X = df[x_colnames]
    X = sm.add_constant(X)
    lm = sm.OLS(y, X).fit()
    coefficients = pd.DataFrame(
        zip(["constant"] + x_colnames, lm.params, lm.pvalues),
        columns=["features", "params", "pvalues"],
    )
    return coefficients, lm

In [46]:
x_vars = ["weeksworked16", "parentincome", "degcol", "degadv"]
coefficients, lm = get_linear_model(nls97, "wageincome", x_vars)
coefficients

,features,params,pvalues
0,constant,7389.368875,0.000694236
1,weeksworked16,494.068765,1.71525e-31
2,parentincome,0.181831,1.923812e-33
3,degcol,15770.073363,1.028133e-40
4,degadv,36737.842205,2.8412099999999995e-100


^ The `get_linear_model` function returns the parameter estimates and the fitted model. All of the coefficients are positive and statistically significant at the 5% level, since their p-values are below 0.05.

Other observations include:
- wage income increases with the number of weeks worked and with parental income
- having a college degree gives a nearly 16k boost to earnings
- a post-graduate degree bumps up the earnings prediction even more, by almost 37k relative to those with less than a college degree

> NOTE: The coefficients of `degcol` and `degadv` are interpreted as relative to those without a college degree since that is the omitted dummy variable.

We can use this model to impute values for wage income where they are missing.

In [47]:
pred = lm.predict(sm.add_constant(nls97[x_vars])).to_frame().rename(columns={0: "pred"})
pred.head()

personid,pred
100061,32450.219808
100139,43939.386701
100284,39702.15634
100292,36547.437777
100583,36938.888957


In [48]:
nls97 = nls97.join(pred)

In [49]:
nls97["wageincomeimp"] = np.where(
    nls97.wageincome.isnull(), nls97.pred, nls97.wageincome
)

In [50]:
nls97[["wageincomeimp", "wageincome"] + x_vars].head(10)

personid,wageincomeimp,wageincome,weeksworked16,parentincome,degcol,degadv
100061,12500.0,12500.0,48.0,7400.0,0,0
100139,120000.0,120000.0,53.0,57000.0,0,0
100284,58000.0,58000.0,47.0,50000.0,0,0
100292,36547.437777,,4.0,62760.0,1,0
100583,30000.0,30000.0,53.0,18500.0,0,0
100833,39000.0,39000.0,45.0,37000.0,0,0
100931,56000.0,56000.0,53.0,60200.0,1,0
101089,36000.0,36000.0,53.0,32307.0,0,0
101122,35151.031821,,39.127476,46361.69915,0,0
101132,0.0,0.0,22.0,2470.0,0,0
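
As a sanity check, the imputed value for `personid` 100292 can be reproduced by hand from the coefficient table. The numbers below are copied from the outputs above; because the printed coefficients are rounded, the result matches the model's prediction only approximately.

```python
# Coefficients copied (rounded) from the regression output above.
coef = {
    "constant": 7389.368875,
    "weeksworked16": 494.068765,
    "parentincome": 0.181831,
    "degcol": 15770.073363,
    "degadv": 36737.842205,
}

# Feature values for personid 100292, copied from the table above.
person = {"weeksworked16": 4.0, "parentincome": 62760.0, "degcol": 1, "degadv": 0}

# Prediction = constant + sum of coefficient * feature value.
pred_by_hand = coef["constant"] + sum(
    coef[name] * value for name, value in person.items()
)
print(round(pred_by_hand, 2))
```

The result is about 36,547.43, which agrees with the `wageincomeimp` value shown for this person.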


In [51]:
nls97[["wageincomeimp", "wageincome"]].agg(["count", "mean", "std"])

,wageincomeimp,wageincome
count,8984.0,5091.0
mean,42558.776723,49477.022196
std,33405.858412,40677.696798


^ The mean for the imputed wage income feature is lower than the original wage income mean. This is not surprising since, as we have seen, individuals with missing wage income have lower values for positively correlated features.

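What is more striking is the sharp reduction in the standard deviation, from about 40,678 to 33,406. This is one of the drawbacks of deterministic regression imputation: every imputed value sits exactly on the regression line, so the imputed observations carry no residual variation.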

Stochastic regression imputation adds a normally distributed error to the predictions based on the residuals from our model. 

We want this error to have a mean of 0 with the same standard deviation as our residuals. 

We can use NumPy's normal function for that with `np.random.normal(0, lm.resid.std(), nls97.shape[0])`. 

The `lm.resid.std()` parameter gets us the standard deviation of the residuals from our model.
The final parameter value, `nls97.shape[0]`, indicates how many values to create; in this case, we want a value for every row in our data.
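
The effect of the added noise can be seen on simulated data. The sketch below is self-contained and makes several assumptions for illustration: the data are generated rather than loaded, `np.polyfit` stands in for the statsmodels OLS fit, and the names `deterministic` and `stochastic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated feature and target with a known linear relationship.
n = 2000
x = rng.normal(0, 2, n)
y = 3.0 + 1.5 * x + rng.normal(0, 1, n)

# Treat roughly 40% of the target values as missing.
missing = rng.random(n) < 0.4

# Fit y = intercept + slope * x on the observed rows only.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
pred = intercept + slope * x
resid_std = np.std(y[~missing] - pred[~missing], ddof=1)

# Deterministic imputation: use the prediction as is.
deterministic = np.where(missing, pred, y)

# Stochastic imputation: add normal noise scaled to the residual spread.
noise = rng.normal(0, resid_std, n)
stochastic = np.where(missing, pred + noise, y)

print(y.std(), deterministic.std(), stochastic.std())
```

The deterministic version understates the spread of the target, while the stochastic version recovers most of it.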

In [52]:
random_add = np.random.normal(0, lm.resid.std(), nls97.shape[0])
random_add_df = pd.DataFrame(random_add, columns=["randomadd"], index=nls97.index)
nls97 = nls97.join(random_add_df)
nls97["stochastic_pred"] = nls97.pred + nls97.randomadd
nls97["wageincomeimpstoc"] = np.where(
    nls97.wageincome.isnull(), nls97.stochastic_pred, nls97.wageincome
)

In [54]:
nls97[["wageincomeimpstoc", "wageincome"]].agg(["count", "mean", "std"])

,wageincomeimpstoc,wageincome
count,8984.0,5091.0
mean,42768.316196,49477.022196
std,41320.627162,40677.696798


^ Our stochastic prediction has nearly the same standard deviation as the original wage income feature (about 41,321 versus 40,678).

### Summary

Regression imputation is a good way to take advantage of all the data we have to impute values for a feature. It is often superior to the imputation methods we examined in the previous section, particularly when missing values _are not random_. If we use stochastic regression imputation, we will not artificially reduce our variance.

## Using KNN imputation

In [56]:
from sklearn.impute import KNNImputer

nls97 = load_nls97b()

In [58]:
nls97.parentincome.fillna(nls97.parentincome.mean(), inplace=True)
nls97["hdegnum"] = nls97.highestdegree.str[0:1].astype("float")
nls97["degltcol"] = np.where(nls97.hdegnum <= 2, 1, 0)
nls97["degcol"] = np.where(nls97.hdegnum.between(3, 4), 1, 0)
nls97["degadv"] = np.where(nls97.hdegnum > 4, 1, 0)

In [62]:
wage_data_list = [
    "wageincome",
    "weeksworked16",
    "parentincome",
    "degltcol",
    "degcol",
    "degadv",
]
wage_data = nls97[wage_data_list]

We need to specify the number of nearest neighbors, `k`. We use a general rule of thumb for determining `k`: the square root of the number of observations divided by 2, `sqrt(N)/2`. With 8,984 observations, this gives `k = 47` after rounding down.

In [63]:
import math

k = math.floor(math.sqrt(wage_data.shape[0]) / 2)

In [64]:
impKNN = KNNImputer(n_neighbors=k)
new_values = impKNN.fit_transform(wage_data)
wage_data_list_imp = [
    "wageincomeimp",
    "weeksworked16imp",
    "parentincomeimp",
    "degltcol",
    "degcol",
    "degadv",
]
wage_data_imp = pd.DataFrame(new_values, columns=wage_data_list_imp, index=nls97.index)

In [65]:
wage_data = wage_data.join(wage_data_imp[["wageincomeimp", "weeksworked16imp"]])
wage_data[
    ["wageincome", "weeksworked16", "parentincome", "degcol", "degadv", "wageincomeimp"]
].head(10)

personid,wageincome,weeksworked16,parentincome,degcol,degadv,wageincomeimp
100061,12500.0,48.0,7400,0,0,12500.0
100139,120000.0,53.0,57000,0,0,120000.0
100284,58000.0,47.0,50000,0,0,58000.0
100292,,4.0,62760,1,0,61213.87234
100583,30000.0,53.0,18500,0,0,30000.0
100833,39000.0,45.0,37000,0,0,39000.0
100931,56000.0,53.0,60200,1,0,56000.0
101089,36000.0,53.0,32307,0,0,36000.0
101122,,,-4,0,0,42402.12766
101132,0.0,22.0,2470,0,0,0.0


In [66]:
wage_data[["wageincome", "wageincomeimp"]].agg(["count", "mean", "std"])  # CMS

,wageincome,wageincomeimp
count,5091.0,8984.0
mean,49477.022196,47504.603655
std,40677.696798,31990.535058


### Summary

KNN imputes values without making any assumptions about the distribution of the underlying data. With regression imputation, the standard assumptions for linear regression apply; that is, that there is a linear relationship between the features and the target and that the residuals are normally distributed.

KNN imputation has some limitations:
- we must tune the model with an initial assumption about a good value for `k`
- KNN is computationally more expensive and may be impractical for large datasets
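
To make the role of `k` concrete, here is a hand-rolled sketch of the idea on made-up data. It is a simplification, not scikit-learn's implementation: `KNNImputer` also handles missing values in the features through a NaN-aware distance, whereas the hypothetical helper `knn_impute_last_column` below assumes only the last column has gaps.

```python
import numpy as np

# Made-up rows: columns are [weeks worked, wage income]; NaN marks the gap.
data = np.array(
    [
        [50.0, 60000.0],
        [48.0, 55000.0],
        [10.0, 8000.0],
        [12.0, 10000.0],
        [49.0, np.nan],
    ]
)


def knn_impute_last_column(rows, k):
    """Fill NaNs in the last column with the mean of the k nearest donors."""
    filled = rows.copy()
    target_missing = np.isnan(rows[:, -1])
    donors = rows[~target_missing]
    for i in np.flatnonzero(target_missing):
        # Euclidean distance on the feature columns only.
        distances = np.linalg.norm(donors[:, :-1] - rows[i, :-1], axis=1)
        nearest = np.argsort(distances)[:k]
        filled[i, -1] = donors[nearest, -1].mean()
    return filled


imputed_k2 = knn_impute_last_column(data, k=2)[-1, -1]
imputed_k4 = knn_impute_last_column(data, k=4)[-1, -1]
print(imputed_k2, imputed_k4)
```

With `k=2` the imputed income is the average of the two closest rows (57,500); with `k=4` the distant low-income rows pull it down to 33,250, which is why the choice of `k` matters.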

## Using random forest for imputation

In [68]:
!pip install -qq missingpy

In [70]:
import sys

import sklearn.neighbors._base

sys.modules["sklearn.neighbors.base"] = sklearn.neighbors._base
from missingpy import MissForest

nls97 = load_nls97b()

In [71]:
nls97.parentincome.fillna(nls97.parentincome.mean(), inplace=True)
nls97["hdegnum"] = nls97.highestdegree.str[0:1].astype("float")
nls97["degltcol"] = np.where(nls97.hdegnum <= 2, 1, 0)
nls97["degcol"] = np.where(nls97.hdegnum.between(3, 4), 1, 0)
nls97["degadv"] = np.where(nls97.hdegnum > 4, 1, 0)

In [72]:
wage_data_list = [
    "wageincome",
    "weeksworked16",
    "parentincome",
    "degltcol",
    "degcol",
    "degadv",
]
wage_data = nls97[wage_data_list]

In [73]:
imputer = MissForest()
new_values = imputer.fit_transform(wage_data)
wage_data_list_imp = [
    "wageincomeimp",
    "weeksworked16imp",
    "parentincomeimp",
    "degltcol",
    "degcol",
    "degadv",
]
wage_data_imp = pd.DataFrame(
    new_values, columns=wage_data_list_imp, index=wage_data.index
)

Iteration: 0
Iteration: 1
Iteration: 2

In [74]:
wage_data = wage_data.join(wage_data_imp[["wageincomeimp", "weeksworked16imp"]])
wage_data[
    ["wageincome", "weeksworked16", "parentincome", "degcol", "degadv", "wageincomeimp"]
].head(10)

personid,wageincome,weeksworked16,parentincome,degcol,degadv,wageincomeimp
100061,12500.0,48.0,7400,0,0,12500.0
100139,120000.0,53.0,57000,0,0,120000.0
100284,58000.0,47.0,50000,0,0,58000.0
100292,,4.0,62760,1,0,22040.0
100583,30000.0,53.0,18500,0,0,30000.0
100833,39000.0,45.0,37000,0,0,39000.0
100931,56000.0,53.0,60200,1,0,56000.0
101089,36000.0,53.0,32307,0,0,36000.0
101122,,,-4,0,0,14956.133333
101132,0.0,22.0,2470,0,0,0.0


In [75]:
wage_data[["wageincome", "wageincomeimp"]].agg(["count", "mean", "std"])  # CMS

,wageincome,wageincomeimp
count,5091.0,8984.0
mean,49477.022196,40675.682434
std,40677.696798,35139.887381


### Summary

`MissForest` uses the random forest algorithm to generate highly accurate predictions.

Unlike KNN, it doesn't need to be tuned with an initial value for `k`. It is also computationally less expensive than KNN.

Random forest imputation is less sensitive to low or very high correlation among features.