# Introduction to Missing Data

Real-world domains often have [missing data](https://ydata.ai/resources/what-is-missing-data-in-machine-learning). Data can have missing values for a [number of reasons](https://ydata.ai/resources/understanding-missing-data-mechanisms) such as observations that were not recorded or data corruption. 

Handling missing data is important because many machine learning algorithms do not support data with missing values. And even when they do, their predictions can be biased by the missing information.

In this tutorial, you will discover how to handle missing data for machine learning with Python. 

This is what we will cover:

1. How to mark invalid or corrupt values as missing in your dataset
2. How to confirm that the presence of marked missing values causes problems for learning algorithms
3. How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset

To follow the tutorial, you can download the data from Kaggle: [Pima Indians Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

## 1. How to identify and mark Missing Values

As the basis of this tutorial, we will use the [Pima Indians dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) (also called the "diabetes" dataset), which has been widely studied as a machine learning dataset since the 1990s. 

The dataset classifies patient data as either an onset of diabetes within five years or not. 

It is a binary classification problem and there are 768 examples and 8 input variables.

You can learn more about this dataset by following our previous tutorials on [descriptive statistics](https://github.com/Data-Centric-AI-Community/awesome-python-for-data-science/blob/main/tutorials/data_descriptive_statistics.ipynb) and [data visualization](https://github.com/Data-Centric-AI-Community/awesome-python-for-data-science/blob/main/tutorials/data_basic_visualization.ipynb).

In [45]:
import pandas as pd
from ydata_profiling import ProfileReport

In [46]:
data = pd.read_csv("diabetes.csv")

# Create the ProfileReport
profile = ProfileReport(data, title="Pima Indians Diabetes")
profile.to_file("pima_report.html")

In [4]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [5]:
rows, cols = data.shape

In [6]:
rows

768

In [7]:
cols  # Outcome is the target, so there are 9 columns in total but only 8 predictors

9

Looking at the data, we can see that all nine columns (the eight input variables plus the `Outcome` target) are numerical.

**This dataset is known to have missing values.**

Specifically, there are missing observations for some columns that are marked as a zero value. 

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

Most data has missing values, and the likelihood of having missing values increases with the size of the dataset.

**Let's identify and mark values as missing. We can use plots and summary statistics to help identify missing or corrupt data.**

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

In [8]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


This is useful. We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

**Missing values are frequently indicated by out-of-range entries; perhaps a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0.**
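As a minimal sketch of this idea, the snippet below marks out-of-range placeholders as NaN on a small made-up DataFrame. The column names, the values, and the `sentinels` mapping are hypothetical and only serve to illustrate the pattern; which value counts as "impossible" always depends on domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -1 and 0 are used as "not recorded" placeholders.
toy = pd.DataFrame({
    "age": [34, -1, 51, 29],
    "bmi": [22.5, 0.0, 31.2, 27.8],
})

# Map each column to the placeholder that stands for "missing" in that column.
sentinels = {"age": -1, "bmi": 0.0}

# Work on a copy so the raw values stay available for comparison.
marked = toy.copy()
for column, sentinel in sentinels.items():
    marked[column] = marked[column].replace(sentinel, np.nan)

print(marked.isna().sum())
```

Keeping the placeholder per column avoids accidentally treating a legitimate value (for example, a real zero in another column) as missing.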

Specifically, the following columns have an invalid zero minimum value:

- Plasma glucose concentration (`Glucose`)
- Diastolic blood pressure (`BloodPressure`)
- Triceps skinfold thickness (`SkinThickness`)
- 2-Hour serum insulin (`Insulin`)
- Body Mass Index (`BMI`)

**We can get a count of the number of missing values on each of these columns.**

We can do this by selecting the columns we are interested in and marking every zero value in that subset of the DataFrame as `True`. We can then count the number of `True` values in each column.

In [9]:
# Summarizing the number of missing values for each feature
data_missing = data[["Glucose","BloodPressure", "SkinThickness","Insulin", "BMI"]]

In [10]:
data_missing

Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI
0,148,72,35,0,33.6
1,85,66,29,0,26.6
2,183,64,0,0,23.3
3,89,66,23,94,28.1
4,137,40,35,168,43.1
...,...,...,...,...,...
763,101,76,48,180,32.9
764,122,70,27,0,36.8
765,121,72,23,112,26.2
766,126,60,0,0,30.1


In [11]:
n_missing = (data_missing == 0).sum()

In [12]:
n_missing

Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64

We can see that columns Glucose, BloodPressure and BMI have just a few zero values, whereas SkinThickness (227 of 768 rows, about 30%) and Insulin (374 of 768 rows, nearly half) have far more. 

**This highlights that different missing value strategies may be needed for different columns**, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
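When deciding on a strategy per column, the share of affected rows is often easier to compare than the raw count. The sketch below computes it on a small made-up DataFrame (hypothetical values, with column names borrowed from the dataset); the same expression could be applied to `data_missing`.

```python
import pandas as pd

# Hypothetical toy frame where 0 stands for "not recorded".
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
})

# The mean of a boolean mask is the fraction of True values per column.
zero_share = (toy == 0).mean()

# Express the shares as percentages for easier reading.
print((zero_share * 100).round(1))
```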

**In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.** 

NaN values are skipped by default in Pandas operations such as `sum()` and `count()`. 
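A quick way to see this behaviour is to run a few aggregations on a tiny Series containing one NaN. The values are made up for illustration; `skipna` is the Pandas parameter that controls whether NaN is skipped.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

total = values.sum()                    # NaN is skipped by default
n_observed = values.count()             # counts only non-missing entries
average = values.mean()                 # average over the non-missing entries
strict_total = values.sum(skipna=False) # NaN propagates when skipping is off

print(total, n_observed, average, strict_total)
```

Note that the mean is taken over the two observed values only, which is why silently skipped NaN values can still bias summary statistics.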

We can mark values as NaN easily with the Pandas DataFrame by using the `replace()` function on a subset of the columns we are interested in. 

After we have marked the missing values, we can use the `isna()` or `isnull()` functions to mark all of the NaN values in the dataset as `True` and get a count of the missing values for each column.

In [13]:
# Marking Missing Values with NaN values
import numpy as np

data.iloc[:, [1, 2, 3, 4, 5]] = data.iloc[:, [1, 2, 3, 4, 5]].replace(0, np.nan)

In the code above, we're using `iloc` to make the code easier to read. Columns 1 to 5 simply correspond to Glucose, BloodPressure, SkinThickness, Insulin, and BMI, respectively. To get the NaN value (`np.nan`), we also needed to import the `numpy` module.
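Selecting columns by position works only as long as the column order never changes. As an alternative sketch, the same replacement can be written with column names. The DataFrame below is a small stand-in with made-up rows (not the real data), and `zero_is_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Small stand-in for the diabetes data (hypothetical rows, same column names).
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0],
    "Glucose": [148, 0, 137],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 168],
    "BMI": [33.6, 26.6, 0.0],
})

# Columns where a zero is physically implausible and therefore means "missing".
zero_is_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace zeros with NaN only in those columns; Pregnancies keeps its valid zero.
toy[zero_is_missing] = toy[zero_is_missing].replace(0, np.nan)

print(toy.isna().sum())
```

Naming the columns also documents which features were treated this way, at the cost of a slightly longer line.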

In [14]:
data.isna().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

The code above prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as zero values identified above. This is a sign that we have marked the identified missing values correctly.

If we want to confirm that we have not fooled ourselves somehow, we can print the first 20 rows of data:

In [15]:
data.head(20)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,,,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,,,,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,,,,0.232,54,1


Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows. It is clear from the raw data that marking the missing values had the intended effect.

## 2. Confirm that missing values cause problems for machine learning algorithms

Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.

Although missing values are common occurrences in data, **most predictive modeling techniques cannot handle any missing values**. Therefore, this problem must be addressed prior to modeling.

We'll try to fit the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values. This is an algorithm that does not work when there are missing values in the dataset. The example below takes the dataset with its missing values marked as NaN, as we did in the previous section, splits it into training and test sets, and then attempts to fit LDA on the training data.

In [19]:
# Import necessary libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

In [20]:
# Split data into inputs and outputs
X = data.drop('Outcome', axis=1)
y = data['Outcome']

In [21]:
X

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148.0,72.0,35.0,,33.6,0.627,50
1,1,85.0,66.0,29.0,,26.6,0.351,31
2,8,183.0,64.0,,,23.3,0.672,32
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63
764,2,122.0,70.0,27.0,,36.8,0.340,27
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30
766,1,126.0,60.0,,,30.1,0.349,47


In [22]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [23]:
# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
# Initialize the LDA model
lda = LinearDiscriminantAnalysis()

# Train the LDA model using the training data
lda.fit(X_train, y_train)

ValueError: Input X contains NaN.
LinearDiscriminantAnalysis does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

**As you can see from the above error message, LDA does not accept missing values!**

This is as we expect. We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.

**Many popular predictive models, such as support vector machines and neural networks, cannot tolerate any amount of missing values.**

## 3. Remove rows with Missing Values and try again!

The simplest strategy for handling missing data is to remove records that contain a missing value.

This is **not always the best approach**, but for an introductory tutorial to the field, let's do it!

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed. Pandas provides the `dropna()` function that can be used to drop either columns or rows with missing data. 

We can use `dropna()` to remove all rows with missing data, as follows:

In [81]:
# Start from the beginning, removing rows with missing values! 
import pandas as pd
import numpy as np

# Load the data
data = pd.read_csv("../../data/diabetes.csv")

In [82]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [83]:
rows, cols = data.shape

In [84]:
rows

768

In [85]:
cols

9

In [86]:
# Replace '0' values with 'nan' in columns 1-5
data.iloc[:, [1,2,3,4,5]] = data.iloc[:, [1,2,3,4,5]].replace(0, np.nan)

In [87]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1


In [88]:
# Drop missing values
data.dropna(inplace=True)

In `data.dropna(inplace=True)`, the `inplace=True` argument modifies the DataFrame directly instead of returning a new one: the rows containing missing values (NaN) are removed from the original DataFrame object.
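For comparison, here is a small sketch of `dropna()` without `inplace=True`, together with the `axis` and `subset` parameters that control what gets dropped. The DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "Age": [50, 31, 32, 21],
})

rows_dropped = toy.dropna()                      # keep only fully observed rows
cols_dropped = toy.dropna(axis=1)                # keep only fully observed columns
subset_dropped = toy.dropna(subset=["Glucose"])  # only check Glucose for NaN

print(rows_dropped.shape, cols_dropped.shape, subset_dropped.shape)
print(toy.shape)  # the original is untouched without inplace=True
```

Returning a new DataFrame keeps the original available, which makes it easier to compare how many rows each option removes.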

In [89]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
13,1,189.0,60.0,23.0,846.0,30.1,0.398,59,1


In [90]:
rows, cols = data.shape

In [91]:
rows

392

In [92]:
cols

9

We can see that the number of rows has been aggressively cut from 768 in the original dataset to 392 (about 51% of the rows) with all rows containing a NaN removed.

**We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.**

In [93]:
# Import necessary libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Split data into inputs and outputs
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the LDA model
lda = LinearDiscriminantAnalysis()

# Train the LDA model using the training data
lda.fit(X_train, y_train)

In [94]:
# Predict on the test set
y_pred = lda.predict(X_test)

In [95]:
# Evaluate the model's performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7468354430379747
              precision    recall  f1-score   support

           0       0.80      0.83      0.81        52
           1       0.64      0.59      0.62        27

    accuracy                           0.75        79
   macro avg       0.72      0.71      0.71        79
weighted avg       0.74      0.75      0.74        79

Confusion Matrix:
[[43  9]
 [11 16]]
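A single 80/20 split on 392 rows leaves only 79 test samples, so the accuracy estimate can shift noticeably with the random seed. k-fold cross-validation is one common way to get a steadier estimate. The sketch below uses 3 folds on synthetic data of the same size (392 rows, 8 inputs) generated with `make_classification`; it is not the diabetes data, so the scores it prints say nothing about the results above. Passing the cleaned `X` and `y` instead would apply the same evaluation to the tutorial's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the cleaned data.
X_demo, y_demo = make_classification(n_samples=392, n_features=8, random_state=42)

# Fit and score LDA on 3 different train/validation partitions.
scores = cross_val_score(
    LinearDiscriminantAnalysis(), X_demo, y_demo, cv=3, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```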


**LDA is now able to operate! However, note that removing rows with missing values can be too limiting on some predictive modeling problems; an alternative is to impute the missing values.**

# What is Next?

**Data Imputation!** In the next tutorial, we will explore how we can impute missing data values using statistical methods and machine learning methods!

### 👾 Join our [Discord community](https://tiny.ydata.ai/dcai-community-github) and follow our Code-with-Me sessions to learn more about data science!