# How to Handle Missing Data with Python
by Jason Brownlee on June 30, 2020. [Reference](https://machinelearningmastery.com/handle-missing-data-python/)

Handling missing data is important as many machine learning algorithms do not support data with missing values.

After completing this tutorial you will know:

- How to mark invalid or corrupt values as missing in your dataset.
- How to remove rows with missing data from your dataset.
- How to impute missing values with mean values in your dataset.

## Overview
This tutorial is divided into 6 parts:

1. `Diabetes Dataset`: where we look at a dataset that has known missing values.
2. `Mark Missing Values`: where we learn how to mark missing values in a dataset.
3. `Missing Values Cause Problems`: where we see how a machine learning algorithm can fail when the data contains missing values.
4. `Remove Rows With Missing Values`: where we see how to remove rows that contain missing values.
5. `Impute Missing Values`: where we replace missing values with sensible values.
6. `Algorithms that Support Missing Values`: where we learn about algorithms that support missing values.

## 1. Diabetes Dataset
The Diabetes Dataset involves predicting the onset of diabetes within 5 years given medical details.

- [Dataset File](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv).
- [Dataset Details](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names).

It is a binary (2-class) classification problem. The number of observations for each class *`is not balanced`*. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

- 0. Number of times pregnant.
- 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).

## 2. Mark Missing Values
Most data has missing values, and the likelihood of having missing values increases with the size of the dataset.

In this section, we will look at how we can identify and mark values as missing.

We can use plots and summary statistics to help identify missing or corrupt data.

In [2]:
# load and summarize the dataset
from pandas import read_csv

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# summarize the dataset
print(dataset.shape)
print(dataset.describe())

(768, 9)
                0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000   
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578   
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000   
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000   
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000   
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000   

                6           7           8  
count  768.000000  768.000000  768.000000  
mean     0.471876   33.240885    0.348958  
std      0.331329   11.760232    0.476951  
min      0.078000   21.000000    0.000000  
25%      0.243750   24.000000    0.000000

We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index
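
The check for suspicious zero minimums can also be done programmatically. The sketch below uses a small made-up frame (the values and the choice of columns are assumptions for illustration, not the real dataset) and lists the columns whose minimum is exactly zero.

```python
import pandas as pd

# Toy stand-in for the diabetes data; all values are made up for illustration.
toy = pd.DataFrame({
    0: [6, 1, 0, 3],                     # times pregnant: zero is a valid value
    1: [148, 85, 0, 89],                 # glucose: zero is not plausible
    2: [72, 66, 64, 0],                  # blood pressure: zero is not plausible
    6: [0.627, 0.351, 0.672, 0.167],     # pedigree function: no zeros present
})

# Columns whose minimum is exactly zero are candidates for hidden missing values.
zero_min_columns = toy.columns[toy.min() == 0].tolist()
print(zero_min_columns)  # [0, 1, 2]
```

Column 0 is flagged here even though zero pregnancies is a legitimate value, so the list is only a set of candidates. Domain knowledge is still needed to decide which zeros really mean "missing".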

We can confirm this by printing the first 20 rows of data.

In [3]:
# print the first 20 rows of data
print(dataset.head(20))

     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0  137  40  35  168  43.1  2.288  33  1
5    5  116  74   0    0  25.6  0.201  30  0
6    3   78  50  32   88  31.0  0.248  26  1
7   10  115   0   0    0  35.3  0.134  29  0
8    2  197  70  45  543  30.5  0.158  53  1
9    8  125  96   0    0   0.0  0.232  54  1
10   4  110  92   0    0  37.6  0.191  30  0
11  10  168  74   0    0  38.0  0.537  34  1
12  10  139  80   0    0  27.1  1.441  57  0
13   1  189  60  23  846  30.1  0.398  59  1
14   5  166  72  19  175  25.8  0.587  51  1
15   7  100   0   0    0  30.0  0.484  32  1
16   0  118  84  47  230  45.8  0.551  31  1
17   7  107  74   0    0  29.6  0.254  31  1
18   1  103  30  38   83  43.3  0.183  33  0
19   1  115  70  30   96  34.6  0.529  32  1


We can get a count of the number of missing values in each of these columns. We can do this by marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of True values in each column.
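
As a minimal sketch of the idea on a made-up two-column frame (the values are assumptions for illustration): comparing to zero gives a boolean mask, `sum()` counts the True values per column, and `mean()` gives the fraction directly without hard-coding the number of rows.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({1: [148, 0, 183, 89], 4: [0, 0, 0, 94]})

# True wherever a value is zero.
zero_mask = toy == 0

# True counts as 1, so sum() counts zeros and mean() gives the fraction of zeros.
num_zero = zero_mask.sum()
pct_zero = zero_mask.mean() * 100

print(num_zero)
print(pct_zero)
```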

In [4]:
# count the number of missing values for each column
num_missing = (dataset[[1,2,3,4,5]] == 0).sum()

# get % of missing values for each column
num_missing_prc = ((dataset[[1,2,3,4,5]] == 0).sum()/768) * 100

# report the results
print(num_missing)
print(num_missing_prc)

1      5
2     35
3    227
4    374
5     11
dtype: int64
1     0.651042
2     4.557292
3    29.557292
4    48.697917
5     1.432292
dtype: float64


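We can see that columns 1, 2 and 5 have only a few zero values, whereas columns 3 and 4 have many more: nearly 30% of the rows for column 3 and nearly half of the rows for column 4.

This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.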

In Python, specifically Pandas, NumPy and Scikit-Learn, `we mark missing values as NaN`.

In pandas, NaN values are skipped by operations such as sum and count.
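
A short sketch of this behaviour on a made-up Series. Note that it is the pandas methods that skip NaN by default. A plain NumPy `sum` on the underlying array propagates NaN, and `numpy.nansum` is the NaN-skipping variant.

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

print(s.sum())              # 4.0, the NaN is skipped
print(s.count())            # 2, only non-missing values are counted
print(s.mean())             # 2.0, mean of the two known values

# Plain NumPy reductions do not skip NaN unless the nan* variants are used.
print(np.sum(s.values))     # nan
print(np.nansum(s.values))  # 4.0
```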

After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

In [5]:
# example of marking missing values with nan values
from numpy import nan

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# count the number of nan values in each column
print(dataset.isnull().sum())
dataset.head()

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64


   0      1     2     3      4     5      6   7  8
0  6  148.0  72.0  35.0    NaN  33.6  0.627  50  1
1  1   85.0  66.0  29.0    NaN  26.6  0.351  31  0
2  8  183.0  64.0   NaN    NaN  23.3  0.672  32  1
3  1   89.0  66.0  23.0   94.0  28.1  0.167  21  0
4  0  137.0  40.0  35.0  168.0  43.1  2.288  33  1


## 3. Missing Values Cause Problems
Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle any missing values. Therefore, this problem must be addressed prior to modeling.

In this section, we will try to evaluate the `Linear Discriminant Analysis (LDA)` algorithm on the dataset with missing values.

This is an algorithm that does not work when there are missing values in the dataset.

In [7]:
# example where missing values cause errors
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8] # input
y = values[:,8] # output

# define the model
model = LinearDiscriminantAnalysis()

# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: nan


Running the example does not produce a usable score. The model cannot be fit on inputs that contain NaN values, so the reported accuracy is itself NaN:

`Accuracy: nan`

We are prevented from evaluating the `LDA algorithm` (and other algorithms) on the dataset with missing values.

Many popular predictive models, such as `support vector machines`, `glmnet`, and `neural networks`, cannot tolerate any amount of missing values.
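
To see the underlying failure more directly, the sketch below fits LDA on a tiny made-up array that contains a single NaN (the data is an assumption for illustration). scikit-learn validates the inputs before fitting and raises a `ValueError` when it finds NaN values.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Tiny made-up dataset with one missing input value.
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

model = LinearDiscriminantAnalysis()
try:
    model.fit(X, y)
    fit_failed = False
except ValueError as err:
    # Input validation rejects NaN before any fitting takes place.
    fit_failed = True
    print('fit failed:', err)
```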

## 4. Remove Rows With Missing Values
The simplest approach for dealing with missing values is to remove entire predictor(s) and/or sample(s) that contain missing values.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed. Use dropna() to remove all rows with missing data, as follows:

In [8]:
# example of removing rows that contain missing values
from numpy import nan
from pandas import read_csv

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# summarize the shape of the raw data
print('Before: ' + str(dataset.shape))

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# drop rows with missing values
dataset.dropna(inplace=True)

# summarize the shape of the data with missing rows removed
print('After: ' + str(dataset.shape))

Before: (768, 9)
After: (392, 9)


In [9]:
# evaluate LDA on the dataset from the previous cell, after rows with missing values were removed
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8] # input
y = values[:,8] # output

# define the model
model = LinearDiscriminantAnalysis()

# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: 0.781
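
Removing every row with any missing value is only one option. This section also mentions removing whole predictors. The sketch below, on a made-up frame with hypothetical column names, shows a few `dropna()` variations: restricting the check to certain columns, dropping columns instead of rows, and keeping only columns with a minimum number of known values.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are made up for illustration.
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [5.0, 6.0, 7.0, 8.0],
})

rows_complete = toy.dropna()                      # rows with no NaN at all
rows_a_known = toy.dropna(subset=['a'])           # only require column 'a' to be known
cols_complete = toy.dropna(axis=1)                # columns with no NaN at all
cols_mostly_known = toy.dropna(axis=1, thresh=3)  # columns with at least 3 known values

print(rows_complete.shape)              # (1, 3)
print(rows_a_known.shape)               # (3, 3)
print(list(cols_complete.columns))      # ['c']
print(list(cols_mostly_known.columns))  # ['a', 'c']
```

Dropping a mostly empty column first can leave far more rows available than dropping rows alone.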


## 5. Impute Missing Values
Missing data can be imputed. In this case, we can use information in the training set predictors to, in essence, estimate the values of other predictors.

There are many options we could consider when replacing a missing value, for example:

- A `constant value` that has meaning within the domain, such as 0, distinct from all other values.
- A value from another `randomly selected` record.
- A `mean`, `median` or `mode` value for the column.
- A `value estimated` by `another predictive model`.

*`Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values`*.

Note. *For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values*.
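
A minimal sketch of that workflow, assuming hypothetical column names and made-up values: the column means are computed on the training data, serialized (here to a JSON string, where a real project would write to a file), and later restored to fill missing values in new data.

```python
import json

import numpy as np
import pandas as pd

# Made-up training data with hypothetical column names.
train = pd.DataFrame({
    'glucose': [148.0, np.nan, 183.0],
    'bmi': [33.6, 26.6, np.nan],
})

# Compute the means on the training data only and serialize them.
train_means = train.mean().to_dict()
saved = json.dumps(train_means)  # in practice this would be written to a file

# Later: restore the stored means and apply them to new data.
restored = json.loads(saved)
new_data = pd.DataFrame({'glucose': [np.nan], 'bmi': [30.0]})
new_filled = new_data.fillna(value=restored)

print(new_filled)
```

The important point is that the new data is filled with the training means, not with statistics computed on the new data itself.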

### - fillna() function 

In [10]:
# manually impute missing values with pandas
from pandas import read_csv
from numpy import nan

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)

# count the number of NaN values in each column
print(dataset.isnull().sum())
print(dataset.head(10))

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64
    0      1          2         3           4          5      6   7  8
0   6  148.0  72.000000  35.00000  155.548223  33.600000  0.627  50  1
1   1   85.0  66.000000  29.00000  155.548223  26.600000  0.351  31  0
2   8  183.0  64.000000  29.15342  155.548223  23.300000  0.672  32  1
3   1   89.0  66.000000  23.00000   94.000000  28.100000  0.167  21  0
4   0  137.0  40.000000  35.00000  168.000000  43.100000  2.288  33  1
5   5  116.0  74.000000  29.15342  155.548223  25.600000  0.201  30  0
6   3   78.0  50.000000  32.00000   88.000000  31.000000  0.248  26  1
7  10  115.0  72.405184  29.15342  155.548223  35.300000  0.134  29  0
8   2  197.0  70.000000  45.00000  543.000000  30.500000  0.158  53  1
9   8  125.0  96.000000  29.15342  155.548223  32.457464  0.232  54  1


### -  SimpleImputer pre-processing

The example below uses the SimpleImputer class to replace missing values with the mean of each column, then prints the number of NaN values in the transformed matrix.

In [11]:
# example of imputing missing values using scikit-learn
from numpy import nan
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# retrieve the numpy array
values = dataset.values

# define the imputer
imputer = SimpleImputer(missing_values=nan, strategy='mean')

# transform the dataset
transformed_values = imputer.fit_transform(values)

# count the total number of NaN values in the transformed array
print('Missing: %d' % isnan(transformed_values).sum())

import numpy as np
np.set_printoptions(precision=1)
print(transformed_values[0:10,:])

Missing: 0
[[6.0e+00 1.5e+02 7.2e+01 3.5e+01 1.6e+02 3.4e+01 6.3e-01 5.0e+01 1.0e+00]
 [1.0e+00 8.5e+01 6.6e+01 2.9e+01 1.6e+02 2.7e+01 3.5e-01 3.1e+01 0.0e+00]
 [8.0e+00 1.8e+02 6.4e+01 2.9e+01 1.6e+02 2.3e+01 6.7e-01 3.2e+01 1.0e+00]
 [1.0e+00 8.9e+01 6.6e+01 2.3e+01 9.4e+01 2.8e+01 1.7e-01 2.1e+01 0.0e+00]
 [0.0e+00 1.4e+02 4.0e+01 3.5e+01 1.7e+02 4.3e+01 2.3e+00 3.3e+01 1.0e+00]
 [5.0e+00 1.2e+02 7.4e+01 2.9e+01 1.6e+02 2.6e+01 2.0e-01 3.0e+01 0.0e+00]
 [3.0e+00 7.8e+01 5.0e+01 3.2e+01 8.8e+01 3.1e+01 2.5e-01 2.6e+01 1.0e+00]
 [1.0e+01 1.2e+02 7.2e+01 2.9e+01 1.6e+02 3.5e+01 1.3e-01 2.9e+01 0.0e+00]
 [2.0e+00 2.0e+02 7.0e+01 4.5e+01 5.4e+02 3.0e+01 1.6e-01 5.3e+01 1.0e+00]
 [8.0e+00 1.2e+02 9.6e+01 2.9e+01 1.6e+02 3.2e+01 2.3e-01 5.4e+01 1.0e+00]]


The example below shows the LDA algorithm trained on the SimpleImputer-transformed dataset.

We use a Pipeline to define the modeling pipeline, where data is first passed through the imputer transform, then provided to the model. `This ensures that the imputer and model are both fit only on the training dataset and evaluated on the test dataset within each cross-validation fold`. This is important to avoid data leakage.

In [12]:
# example of evaluating a model after an imputer transform
from numpy import nan
from pandas import read_csv
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# load the dataset
dataset = read_csv('..//..//..//data/pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the imputer
imputer = SimpleImputer(missing_values=nan, strategy='mean')
#imputer = SimpleImputer(missing_values=nan, strategy='median')
#imputer = SimpleImputer(missing_values=nan, strategy='most_frequent')
#imputer = SimpleImputer(missing_values=nan, strategy='constant')
#imputer = SimpleImputer(missing_values=nan, strategy='constant', fill_value=10)

# define the model
lda = LinearDiscriminantAnalysis()

# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer), ('model', lda)])

# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')

# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: 0.762


Results with several imputer strategies:

- Accuracy (mean): 0.762
- Accuracy (median): 0.760
- Accuracy (most_frequent): 0.760
- Accuracy (constant): 0.763
- Accuracy (constant, fill_value=10): 0.767
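
Rather than commenting lines in and out, the strategies can be compared in a loop. The sketch below is self-contained, so it uses a synthetic dataset with randomly injected NaN values instead of the diabetes file. The dataset, the 10% missing rate and the seeds are all assumptions, and the scores will not match the ones listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with roughly 10% of the inputs set to NaN.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

results = {}
for strategy in ['mean', 'median', 'most_frequent', 'constant']:
    # The imputer sits inside the pipeline, so it is fit on the training folds only.
    pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(missing_values=np.nan, strategy=strategy)),
        ('model', LinearDiscriminantAnalysis()),
    ])
    cv = KFold(n_splits=3, shuffle=True, random_state=1)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    results[strategy] = scores.mean()
    print('%s: %.3f' % (strategy, results[strategy]))
```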

## 6. Algorithms that Support Missing Values

There are algorithms that can be made robust to missing data, such as `k-Nearest Neighbors` that can ignore a column from a distance measure when a value is missing. `Naive Bayes` can also support missing values when making a prediction.
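
The idea of leaving a column out of a distance measure can be sketched in a few lines of NumPy. The function name `nan_aware_distance` is hypothetical, and this is only an illustration of the principle, not how any particular library implements it.

```python
import numpy as np

def nan_aware_distance(a, b):
    """Euclidean distance over the coordinates that are known in both vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Keep only positions where neither vector has a missing value.
    present = ~np.isnan(a) & ~np.isnan(b)
    if not present.any():
        return float('nan')  # no shared coordinates to compare
    diff = a[present] - b[present]
    return float(np.sqrt(np.sum(diff ** 2)))

# The second coordinate is missing in the first vector, so it is ignored.
d = nan_aware_distance([1.0, np.nan, 3.0], [4.0, 2.0, 7.0])
print(d)  # 5.0
```

A practical version would usually rescale the result by the share of coordinates that were actually used, so that distances computed over fewer columns stay comparable.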

There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees. `Tree-based techniques` in particular can specifically account for missing data.

`Note`. *Sadly, at the time of writing, the scikit-learn implementations of naive bayes, decision trees and k-Nearest Neighbors are not robust to missing values, although this is being considered* [here](https://github.com/scikit-learn/scikit-learn/issues/5870).

## Summary
In this tutorial, you discovered how to handle machine learning data that contains missing values.

Specifically, you learned:

- How to mark missing values in a dataset as numpy.nan.
- How to remove rows from the dataset that contain missing values.
- How to replace missing values with sensible values.