# Handling Missing Data
We say that data is missing when there is no value for a given feature in a particular row. This can occur in the real world for many reasons: there may have been no observation, there may have been a transcription error, or the data may have been corrupted. Whatever the cause, we, as data scientists, need to deal with it.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('diabetes.csv')
df1 = df.copy()

In [2]:
df1.shape

(768, 9)

In [3]:
df1.head()

   pregnancies  glucose  diastolic  triceps  insulin   bmi    dpf  age  diabetes
0            6      148         72       35        0  33.6  0.627   50         1
1            1       85         66       29        0  26.6  0.351   31         0
2            8      183         64        0        0  23.3  0.672   32         1
3            1       89         66       23       94  28.1  0.167   21         0
4            0      137         40       35      168  43.1  2.288   33         1


Looking at the output of df1.head(), we see observations where insulin is zero, and others where triceps, which measures the thickness of the skin, is zero. These values are not physically possible and, as we have no indication of the real values, the data is, for all intents and purposes, missing.
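
A quick way to see how widespread such zeros are is to count them per column. The sketch below rebuilds three columns from the five rows shown above in a small frame (the name toy is hypothetical); the same comparison could be applied to df1 to get the counts for the full dataset.

```python
import pandas as pd

# Three columns of the five rows printed by head(); `toy` is an illustrative name.
toy = pd.DataFrame({
    "insulin": [0, 0, 0, 94, 168],
    "triceps": [35, 29, 0, 23, 35],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Comparing with 0 gives a boolean frame; summing counts the True entries per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```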

Before we go any further, let's replace all these entries with NaN using the replace method on the relevant columns. Once the missing values are marked, how do we deal with them? One way is to drop every row that contains missing data.

In [4]:
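import numpy as np

# Mark zeros as missing; assigning back avoids inplace=True on a column accessor
df1['insulin'] = df1['insulin'].replace(0, np.nan)
# Caution: 0 is a valid class label in the target column, so this line marks
# every non-diabetic row as missing; the outputs below reflect that
df1['diabetes'] = df1['diabetes'].replace(0, np.nan)
df1['triceps'] = df1['triceps'].replace(0, np.nan)
df1.info()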

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      541 non-null    float64
 4   insulin      394 non-null    float64
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     268 non-null    float64
dtypes: float64(5), int64(4)
memory usage: 54.1 KB


We can drop the rows that contain missing values using the pandas DataFrame method dropna.

In [5]:
df1 = df1.dropna()
df1.shape

(130, 9)

Checking the shape of the resulting DataFrame, though, we see that only 130 of the original 768 rows are left. We've lost more than 80% of our data, and this is unacceptable. Much of that loss comes from the diabetes column, where replacing 0 with NaN also marked every negative case as missing. If only a few rows contain missing values, then dropping them is not so bad, but generally we need a more robust method. Removing columns that contain NaNs is generally an equally bad idea.
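
When dropping rows is still the preferred route, dropna can be told to be less aggressive. The sketch below uses a made-up frame (toy and its values are illustrative, not taken from diabetes.csv) to compare the default behaviour with the subset and thresh parameters.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137],
    "triceps": [35.0, 29.0, np.nan, 23.0, 35.0],
    "insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
})

dropped_any = toy.dropna()                        # keep rows with no NaN at all
dropped_subset = toy.dropna(subset=["triceps"])   # only require triceps to be present
dropped_thresh = toy.dropna(thresh=2)             # keep rows with at least 2 non-NaN values

print(len(toy), len(dropped_any), len(dropped_subset), len(dropped_thresh))
```

Restricting the check to the columns that matter for the analysis can keep considerably more rows than the default.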

## Imputing Missing Data
Another option is to impute missing data. Imputing simply means making an educated guess as to what the missing values could be. A common strategy is, for each column with missing values, to compute the mean of the non-missing entries and replace every missing value with that mean. Let's try this now on our dataset. We import SimpleImputer from sklearn.impute and instantiate it as imp. The keyword argument missing_values specifies that missing values are represented by NaN; strategy='mean' specifies that we will use the mean, as described above.
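
To make the mean strategy concrete, the following sketch performs the same computation by hand with pandas on a single made-up column (the values are illustrative, not taken from diabetes.csv).

```python
import numpy as np
import pandas as pd

# Illustrative column with two missing entries.
insulin = pd.Series([np.nan, 94.0, 168.0, np.nan, 98.0], name="insulin")

# Series.mean skips NaN by default, so only 94, 168 and 98 contribute.
observed_mean = insulin.mean()

# Replace every missing entry with that mean.
filled = insulin.fillna(observed_mean)
print(observed_mean)
print(filled.tolist())
```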

In [6]:
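features = df.drop('diabetes', axis=1)  # drop the target
features[['insulin', 'triceps']] = features[['insulin', 'triceps']].replace(0, np.nan)  # mark impossible zeros as missing
X1 = features.values
y1 = df['diabetes'].values  # keep the target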

In [7]:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

Now we can fit this imputer to our data using the fit method and then transform our data using the transform method. Because imputers transform data in this way, they are known as transformers; more generally, any scikit-learn object that exposes a transform method is called a transformer.

In [8]:
imp.fit(X1)

SimpleImputer()

In [9]:
X1 = imp.transform(X1)
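
As a small illustration of what fit and transform each do, the sketch below runs SimpleImputer on a made-up array (toy_X and toy_imp are hypothetical names). After fit, the learned column means are available in the statistics_ attribute, and fit_transform combines both steps in one call.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative 3x2 array with one missing entry per column.
toy_X = np.array([
    [1.0, np.nan],
    [3.0, 10.0],
    [np.nan, 30.0],
])

toy_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imp.fit(toy_X)
print(toy_imp.statistics_)  # per-column means learned during fit

# transform fills each NaN with the mean of its column.
filled = toy_imp.transform(toy_X)

# fit_transform does the same in a single call.
filled_one_call = SimpleImputer(strategy="mean").fit_transform(toy_X)
print(filled)
```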

After transforming the data, we could then fit our supervised learning model to it, but is there a way to do both at once?

## Imputing Within a Pipeline

We can use the scikit-learn Pipeline object. We import Pipeline from sklearn.pipeline and instantiate a logistic regression model. We then construct a list of steps, where each step is a 2-tuple containing the name you wish to give the step and the estimator itself, and pass this list to the Pipeline constructor.

We can then split our data into training and test sets, fit the pipeline to the training set, and predict on the test set, as with any other model. For good measure, we also compute the accuracy. Note that, in a pipeline, each step but the last must be a transformer, and the last must be an estimator, such as a classifier or a regressor.

In [10]:
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [11]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logreg = LogisticRegression()

In [12]:
steps = [('imputation',imp), ('logistic_regression',logreg)]

In [13]:
pipeline = Pipeline(steps)

In [14]:
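# Note: df still encodes missing values as 0 rather than NaN, so the imputation
# step only has an effect if those zeros are replaced with np.nan first
X = df.drop('diabetes', axis=1).values  # features: drop the target
y = df['diabetes'].values  # keep the target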

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,random_state=42)

In [16]:
pipeline.fit(X_train,y_train)

Pipeline(steps=[('imputation', SimpleImputer()),
                ('logistic_regression', LogisticRegression())])

In [17]:
y_pred = pipeline.predict(X_test)

In [18]:
pipeline.score(X_test,y_test)

0.7662337662337663
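
One practical consequence of placing the imputer inside the pipeline is that its column means are learned from the training rows only. The sketch below illustrates this on synthetic data (generated with make_classification, with NaNs injected at random; none of it comes from diabetes.csv, and the variable names are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; not related to diabetes.csv.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)

# Knock out roughly 10% of the entries to simulate missing values.
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=42
)

pipe = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("logistic_regression", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The fitted imputer is reachable by step name; its means come from X_tr only.
train_means = pipe.named_steps["imputation"].statistics_
print(np.allclose(train_means, np.nanmean(X_tr, axis=0)))

# Scoring applies the training means to the test rows before predicting.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

Because the test rows never influence the imputed values, the reported accuracy is not inflated by information leaking from the test set.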