# Missing data

- Missing data is a common problem in machine learning
- It can lead to biased or incomplete results if not properly addressed



## Challenges of Missing Data
- Most predictive models cannot handle missing data
- Missing data can impact model accuracy, interpretability, and generalization

## Techniques for Handling Missing Data
- Statistical imputation methods: mean, median, mode imputation, regression imputation, k-nearest neighbors imputation
- Deep learning techniques: autoencoders, GANs
- Ensemble methods: model averaging, stacked models

## Pros and Cons of Imputation Methods
- Statistical imputation methods: simple and easy to implement, but may not capture complex patterns
- Deep learning techniques: can capture complex patterns, but require large amounts of data and tuning
- Ensemble methods: provide robust predictions, but can be computationally expensive

## Best Practices for Handling Missing Data
- Data exploration: examine patterns and characteristics of missing data
- Multiple imputation: create several imputed versions of the dataset and compare or pool the results, rather than relying on a single fill-in
- Sensitivity analysis: assess impact of imputation on results
- Evaluation of imputation performance: use appropriate metrics to evaluate imputation methods
- Incorporating missingness indicators: add binary flags marking which values were imputed, so the model can account for potential bias introduced by imputation
- Domain expertise: leverage subject matter knowledge in decision-making
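
One way to evaluate imputation performance is to hide values that are actually known, impute them, and score the imputations against the truth. The sketch below does this on synthetic data; the column, the 20% masking rate, and the `rmse` helper are all illustrative assumptions, and the values are hidden completely at random, which real missingness may not be.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, fully observed column (hypothetical data, right-skewed)
complete = pd.Series(rng.lognormal(mean=4.5, sigma=0.6, size=500))

# Hide 20% of the known values so imputations can be scored against the truth
mask = rng.random(complete.size) < 0.2
observed = complete.mask(mask)

def rmse(candidate_value):
    """RMSE between the hidden true values and a constant imputation."""
    imputed = observed.fillna(candidate_value)
    return float(np.sqrt(((imputed[mask] - complete[mask]) ** 2).mean()))

# Compare two simple strategies on the same hidden entries
scores = {
    "mean": rmse(observed.mean()),
    "median": rmse(observed.median()),
}
print(scores)
```

The same masking loop can be reused to compare any set of imputation methods, as long as each one is fitted only on the values left visible.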

## Examples of Imputation Methods
- Mean, median, mode imputation: replace missing values with the mean, median, or mode of the column
- Regression imputation: use regression models to predict missing values based on other variables
- K-nearest neighbors imputation: use k-nearest neighbors to impute missing values based on similar samples
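
As a minimal sketch of k-nearest neighbors imputation, the example below uses scikit-learn's `KNNImputer` on a tiny made-up matrix. The data and the choice of `n_neighbors=2` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny made-up feature matrix; np.nan marks the missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the jointly observed features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the neighbors are found by distance, features on very different scales are usually standardized before this step.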

## Examples of Deep Learning Techniques
- Autoencoders: neural networks that learn to encode and decode data, can be used for imputation
- GANs (Generative Adversarial Networks): generate synthetic data with realistic imputations

## Examples of Ensemble Methods
- Model averaging: combine predictions from multiple models, such as mean or weighted average
- Stacked models: use predictions from multiple models as input to another model for final prediction
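
Model averaging can be sketched in a few lines of NumPy. The predictions and weights below are made-up numbers; in practice the weights might come from validation scores.

```python
import numpy as np

# Hypothetical predicted probabilities from three models for four samples
preds = np.array([
    [0.2, 0.7, 0.9, 0.4],
    [0.3, 0.6, 0.8, 0.5],
    [0.1, 0.8, 0.7, 0.3],
])

# Simple average: every model counts equally
simple_avg = preds.mean(axis=0)

# Weighted average: the weights are illustrative and sum to 1
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = np.average(preds, axis=0, weights=weights)

print(simple_avg)
print(weighted_avg)
```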

## Conclusion
- Handling missing data is crucial in machine learning
- Choosing appropriate imputation methods depends on data characteristics, problem context, and evaluation of performance
- Following best practices can help in obtaining reliable and accurate results in machine learning applications.




## Example


### The Diabetes Dataset
The Diabetes Dataset involves predicting the onset of diabetes within 5 years, given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. 

The variable names are as follows:
0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

This dataset is known to have missing values.
Specifically, there are missing observations for some columns that are marked as a zero value.

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

### Marking Missing Values
Missing data are not rare in real data sets. In fact, the chance that at least one data point is missing increases as the data set size increases. 

We can use the pandas describe() function to get a general view of our data's properties.



In [None]:
# load and summarize the dataset
import pandas as pd

# load the dataset (the file has no header row, so column names are supplied)
url = "https://raw.githubusercontent.com/profandresg/pyzero/main/pima-indians-diabetes.txt"
columns = ['#pregnant', 'plasma_glucose', 'diastolic_BP', 'triceps_thickness',
           '2h_insulin', 'BMI', 'diabetes_pedigree', 'age', 'class']
dataset = pd.read_csv(url, header=None, names=columns)
# show all rows and columns without truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# None means "no limit"; passing -1 here is deprecated
pd.set_option('display.max_colwidth', None)
display(dataset.values)
# summarize the dataset
print(dataset.describe())



array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

        #pregnant  plasma_glucose  diastolic_BP  triceps_thickness  \
count  768.000000  768.000000      768.000000    768.000000          
mean   3.845052    120.894531      69.105469     20.536458           
std    3.369578    31.972618       19.355807     15.952218           
min    0.000000    0.000000        0.000000      0.000000            
25%    1.000000    99.000000       62.000000     0.000000            
50%    3.000000    117.000000      72.000000     23.000000           
75%    6.000000    140.250000      80.000000     32.000000           
max    17.000000   199.000000      122.000000    99.000000           

       2h_insulin         BMI  diabetes_pedigree         age       class  
count  768.000000  768.000000  768.000000         768.000000  768.000000  
mean   79.799479   31.992578   0.471876           33.240885   0.348958    
std    115.244002  7.884160    0.331329           11.760232   0.476951    
min    0.000000    0.000000    0.078000           21.000000   0.00000

Missing values are frequently indicated by out-of-range entries; perhaps a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0.
Specifically, the following columns have an invalid zero minimum value:

1: Plasma glucose concentration
2: Diastolic blood pressure
3: Triceps skinfold thickness
4: 2-Hour serum insulin
5: Body mass index

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
print(dataset.head(20))
print(dataset.tail(20))

    #pregnant  plasma_glucose  diastolic_BP  triceps_thickness  2h_insulin  \
0   6          148             72            35                 0            
1   1          85              66            29                 0            
2   8          183             64            0                  0            
3   1          89              66            23                 94           
4   0          137             40            35                 168          
5   5          116             74            0                  0            
6   3          78              50            32                 88           
7   10         115             0             0                  0            
8   2          197             70            45                 543          
9   8          125             96            0                  0            
10  4          110             92            0                  0            
11  10         168             74            0                  



In [None]:
num_missing = (dataset[['plasma_glucose','diastolic_BP','triceps_thickness','2h_insulin','BMI']] == 0).sum()
print(num_missing)

plasma_glucose       5  
diastolic_BP         35 
triceps_thickness    227
2h_insulin           374
BMI                  11 
dtype: int64


This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.

When a predictor is discrete in nature, missingness can be directly encoded into the predictor as if it were a naturally occurring category.
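
For a discrete predictor, encoding missingness as its own category might look like the sketch below. The `smoker` column is a hypothetical example and is not part of the diabetes dataset.

```python
import pandas as pd

# Hypothetical discrete predictor with missing entries
smoker = pd.Series(["yes", "no", None, "no", None], name="smoker")

# Treat "missing" as its own level instead of dropping or guessing
smoker_encoded = smoker.fillna("missing")
print(smoker_encoded.value_counts())

# One-hot encoding then yields a dedicated column for the missing level
dummies = pd.get_dummies(smoker_encoded, prefix="smoker")
print(dummies.columns.tolist())
```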

In [None]:
from numpy import nan
dataset[['plasma_glucose','diastolic_BP','triceps_thickness','2h_insulin','BMI']] = dataset[['plasma_glucose','diastolic_BP','triceps_thickness','2h_insulin','BMI']].replace(0, nan)
# count the number of nan values in each column
print(dataset.isnull().sum())

#pregnant            0  
plasma_glucose       5  
diastolic_BP         35 
triceps_thickness    227
2h_insulin           374
BMI                  11 
diabetes_pedigree    0  
age                  0  
class                0  
dtype: int64


In [None]:
display(dataset.head(20))
display(dataset.tail(20))

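| | #pregnant | plasma_glucose | diastolic_BP | triceps_thickness | 2h_insulin | BMI | diabetes_pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |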


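| | #pregnant | plasma_glucose | diastolic_BP | triceps_thickness | 2h_insulin | BMI | diabetes_pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 748 | 3 | 187.0 | 70.0 | 22.0 | 200.0 | 36.4 | 0.408 | 36 | 1 |
| 749 | 6 | 162.0 | 62.0 | NaN | NaN | 24.3 | 0.178 | 50 | 1 |
| 750 | 4 | 136.0 | 70.0 | NaN | NaN | 31.2 | 1.182 | 22 | 1 |
| 751 | 1 | 121.0 | 78.0 | 39.0 | 74.0 | 39.0 | 0.261 | 28 | 0 |
| 752 | 3 | 108.0 | 62.0 | 24.0 | NaN | 26.0 | 0.223 | 25 | 0 |
| 753 | 0 | 181.0 | 88.0 | 44.0 | 510.0 | 43.3 | 0.222 | 26 | 1 |
| 754 | 8 | 154.0 | 78.0 | 32.0 | NaN | 32.4 | 0.443 | 45 | 1 |
| 755 | 1 | 128.0 | 88.0 | 39.0 | 110.0 | 36.5 | 1.057 | 37 | 1 |
| 756 | 7 | 137.0 | 90.0 | 41.0 | NaN | 32.0 | 0.391 | 39 | 0 |
| 757 | 0 | 123.0 | 72.0 | NaN | NaN | 36.3 | 0.258 | 52 | 1 |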


Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle any missing values. Therefore, this problem must be addressed prior to modeling.

### Remove Rows With Missing Values
The simplest approach for dealing with missing values is to remove entire predictor(s) and/or sample(s) that contain missing values.

In [None]:
print(dataset.shape)
# drop rows with missing values
dataset_remnan=dataset.copy()
dataset_remnan.dropna(inplace=True)
# summarize the shape of the data with missing rows removed
print(dataset.shape)
print(dataset_remnan.shape)


(768, 9)
(768, 9)
(392, 9)
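
Dropping every incomplete row removes about half of the records here, largely because of the two columns with the most missing values. A gentler option is to drop rows only where a sparsely missing column is NaN and leave the heavily missing columns for imputation. The sketch below shows the idea on a small made-up frame; the column names `a`, `b`, and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'a' is rarely missing, 'b' is missing in half the rows
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],
    "y": [0, 1, 0, 1, 0, 1],
})

# Fraction of missing values per column guides the choice of strategy
print(df.isna().mean())

# Dropping every incomplete row keeps only the two fully observed rows
all_complete = df.dropna()

# Dropping rows only where 'a' is NaN keeps five rows;
# column 'b' is left for imputation instead
partial = df.dropna(subset=["a"])

print(all_complete.shape, partial.shape)
```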


### Impute Missing Values
Instead of discarding rows, missing data can be imputed: information in the training set predictors is used to, in essence, estimate the values of other predictors.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.

For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.
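
Storing the fill values for later reuse could look like the following sketch. The tiny frames, the JSON format, and the file name `impute_means.json` are assumptions made for illustration.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical training data with missing values
train = pd.DataFrame({"BMI": [30.0, np.nan, 34.0], "age": [25, 40, 31]})

# Learn the fill values from the training data only and persist them
fill_values = train.mean().to_dict()
path = os.path.join(tempfile.gettempdir(), "impute_means.json")  # assumed location
with open(path, "w") as f:
    json.dump(fill_values, f)

# Later, at prediction time: reload the stored means and reuse them
with open(path) as f:
    stored = json.load(f)

new_data = pd.DataFrame({"BMI": [np.nan, 28.0], "age": [50, 22]})
new_data_filled = new_data.fillna(value=stored)
print(new_data_filled)
```

The point of the design is that the new data is filled with the training means, never with statistics computed from the new data itself.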

In [None]:
print('Mean Values')
display(dataset.mean())
# impute missing values with the column means using pandas fillna
dataset_impmean=dataset.copy()
dataset_impmean.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print('Sum of Null Values')
print(dataset_impmean.isnull().sum())
print(dataset.shape)
print(dataset_impmean.shape)


Mean Values


#pregnant            3.845052  
plasma_glucose       121.686763
diastolic_BP         72.405184 
triceps_thickness    29.153420 
2h_insulin           155.548223
BMI                  32.457464 
diabetes_pedigree    0.471876  
age                  33.240885 
class                0.348958  
dtype: float64

Sum of Null Values
#pregnant            0
plasma_glucose       0
diastolic_BP         0
triceps_thickness    0
2h_insulin           0
BMI                  0
diabetes_pedigree    0
age                  0
class                0
dtype: int64
(768, 9)
(768, 9)
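
After mean imputation, nothing in the data shows which values were originally missing. A missingness indicator column preserves that information. The sketch below uses a small hypothetical frame; the `_missing` suffix is just a naming assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one partly missing column
df = pd.DataFrame({"2h_insulin": [94.0, np.nan, 168.0, np.nan]})

# Record which entries were missing before they are overwritten
df["2h_insulin_missing"] = df["2h_insulin"].isna().astype(int)

# Mean imputation, as in the cell above
df["2h_insulin"] = df["2h_insulin"].fillna(df["2h_insulin"].mean())
print(df)
```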


In [None]:
print('Median Values')
display(dataset.median())
# impute missing values with the column medians using pandas fillna
dataset_impmedian=dataset.copy()
dataset_impmedian.fillna(dataset.median(), inplace=True)
# count the number of NaN values in each column
print('Sum of Null Values')
print(dataset_impmedian.isnull().sum())

Median Values


#pregnant            3.0000  
plasma_glucose       117.0000
diastolic_BP         72.0000 
triceps_thickness    29.0000 
2h_insulin           125.0000
BMI                  32.3000 
diabetes_pedigree    0.3725  
age                  29.0000 
class                0.0000  
dtype: float64

Sum of Null Values
#pregnant            0
plasma_glucose       0
diastolic_BP         0
triceps_thickness    0
2h_insulin           0
BMI                  0
diabetes_pedigree    0
age                  0
class                0
dtype: int64


In [None]:
print('Mode Values')
display(dataset.mode(dropna=True).iloc[0])
# impute missing values with the column modes (first mode if tied) using pandas fillna
dataset_impmode=dataset.copy()
dataset_impmode.fillna(dataset.mode(dropna=True).iloc[0], inplace=True)
# count the number of NaN values in each column
print('Sum of Null Values')
print(dataset_impmode.isnull().sum())

Mode Values


#pregnant            1.000  
plasma_glucose       99.000 
diastolic_BP         70.000 
triceps_thickness    32.000 
2h_insulin           105.000
BMI                  32.000 
diabetes_pedigree    0.254  
age                  22.000 
class                0.000  
Name: 0, dtype: float64

Sum of Null Values
#pregnant            0
plasma_glucose       0
diastolic_BP         0
triceps_thickness    0
2h_insulin           0
BMI                  0
diabetes_pedigree    0
age                  0
class                0
dtype: int64


