# Missing Indicator

---

🔹 What is a Missing Indicator?

A Missing Indicator is a binary feature that indicates whether a value in the dataset was missing or not.

If a value is missing, the indicator = 1.

If a value is present, the indicator = 0.

This allows the model to use the information about the "missingness" itself, which can sometimes be informative.
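
As a minimal sketch of the definition above, the indicator for a single column can be derived with pandas. The column name and values here are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are illustrative).
age = pd.Series([25, np.nan, 40], name="Age")

# isna() yields True/False; casting to int gives the 1/0 indicator.
age_missing = age.isna().astype(int)

print(age_missing.tolist())  # [0, 1, 0]
```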

🔹 Why Use It?

Missingness is often not random and can carry useful information of its own. For example, a medical test that was never ordered may indicate that the patient appeared healthy, or that the record was kept carelessly.

After imputation, a filled-in value (e.g., the mean or median) looks the same as a recorded one. The indicator lets the model distinguish between the two.

🔹 How It’s Used

Usually, the workflow is:

1. Create a Missing Indicator for the feature(s) with missing values. Example: for a feature Age, create Age_missing (1 = missing, 0 = present).

2. Impute the missing values (e.g., mean, median, constant, or model-based imputation).

3. Keep both the imputed column and the missing indicator in the dataset.
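
The three steps of this workflow can be sketched with plain pandas. The helper name `impute_with_indicator` and the choice of the median as fill value are assumptions made for illustration; they are not a library API.

```python
import numpy as np
import pandas as pd

def impute_with_indicator(df, column, fill_value):
    """Return a copy of df with `column` imputed and a `<column>_missing` flag added."""
    out = df.copy()
    # Step 1: record missingness before imputation overwrites it.
    out[f"{column}_missing"] = out[column].isna().astype(int)
    # Step 2: fill the gaps with the chosen value.
    out[column] = out[column].fillna(fill_value)
    # Step 3: both columns are kept in the returned frame.
    return out

df = pd.DataFrame({"Age": [25, np.nan, 40]})
result = impute_with_indicator(df, "Age", fill_value=df["Age"].median())
print(result)
```

The order matters: the flag has to be computed before the fill, because after imputation the gaps can no longer be detected.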

🔹 Example

Suppose you have this dataset:

| ID | Age | Salary |
|----|-----|--------|
| 1 | 25 | 50000 |
| 2 | NaN | 60000 |
| 3 | 40 | NaN |

After applying missing indicator:

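| ID | Age | Age_missing | Salary | Salary_missing |
|----|-----|-------------|--------|----------------|
| 1 | 25 | 0 | 50000 | 0 |
| 2 | 32.5* | 1 | 60000 | 0 |
| 3 | 40 | 0 | 55000* | 1 |

Values marked with * are imputed: 32.5 and 55000 are the means of the observed Age and Salary values.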

🔹 In Scikit-learn

Scikit-learn provides this functionality via MissingIndicator or SimpleImputer(add_indicator=True).

Example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data
data = pd.DataFrame({
    "Age": [25, np.nan, 40],
    "Salary": [50000, 60000, np.nan]
})

# Imputer with missing indicator
imputer = SimpleImputer(strategy="mean", add_indicator=True)
transformed = imputer.fit_transform(data)

print(transformed)


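This imputes the missing values with each column's mean and appends one indicator column for every feature that had missing values during fit.

The result is a NumPy array whose columns are Age, Salary, the Age indicator, and the Salary indicator.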
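
Because the transformed result is an unlabelled array, it can help to put names back on the columns. The sketch below repeats the example and builds the labels from `indicator_.features_`, which holds the indices of the flagged columns; the `_missing` suffix is a naming choice made here, not something scikit-learn produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    "Age": [25, np.nan, 40],
    "Salary": [50000, 60000, np.nan],
})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
transformed = imputer.fit_transform(data)

# indicator_.features_ lists the indices of the columns that received a flag.
flag_names = [f"{data.columns[i]}_missing" for i in imputer.indicator_.features_]
labelled = pd.DataFrame(transformed, columns=list(data.columns) + flag_names)
print(labelled)
```

The imputed columns come first, in their original order, followed by the indicator columns.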

✅ Summary:
The Missing Indicator method handles missing data by creating additional binary columns to flag missing values. This preserves potential signal in the missingness pattern and is often used along with imputation.

In [16]:

import pandas as pd
import numpy as np

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator, SimpleImputer

In [18]:
df = pd.read_csv('titanic.csv', usecols=['Age','Fare','Survived'])

In [19]:
df.head()

|   | Survived | Age | Fare |
|---|----------|-----|------|
| 0 | 0 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 0 | 35.0 | 8.05 |


In [20]:
x= df.drop(columns= ['Survived'])
y = df['Survived']

In [21]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state=2)

In [22]:
x_train.head(2)

|    | Age | Fare |
|----|-----|------|
| 30 | 40.0 | 27.7208 |
| 10 | 4.0 | 16.7 |


In [23]:
si = SimpleImputer(strategy='mean')
x_train_trf = si.fit_transform(x_train)
x_test_trf = si.transform(x_test)

In [24]:
x_train_trf

array([[ 40.        ,  27.7208    ],
       [  4.        ,  16.7       ],
       [ 47.        ,   9.        ],
       ...,
       [ 71.        ,  49.5042    ],
       [ 29.78590426, 221.7792    ],
       [ 29.78590426,  25.925     ]])

In [25]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

clf.fit(x_train_trf,y_train)

y_pred=clf.predict(x_test_trf)

from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, y_pred)

0.6145251396648045

In [26]:
mi = MissingIndicator()
mi.fit(x_train)

In [27]:
mi.features_

array([0])
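
The `features_` attribute holds the indices of the columns that had missing values during `fit`; here only column 0 (Age) did. A small sketch on made-up values shows how the `features` parameter changes which columns get a flag.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Column 0 has a gap, column 1 is complete (toy values, not the Titanic data).
X = np.array([[22.0, 7.25],
              [np.nan, 71.28],
              [26.0, 7.93]])

mi_default = MissingIndicator()            # default: features="missing-only"
mi_default.fit(X)
print(mi_default.features_)                # indices of columns with gaps
print(mi_default.transform(X).shape)       # one flag column

mi_all = MissingIndicator(features="all")
print(mi_all.fit_transform(X).shape)       # one flag column per input column
```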

In [28]:
x_train_missing = mi.transform(x_train)

In [29]:
x_train_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       ...])

In [30]:
x_test_missing = mi.transform(x_test)

In [31]:
x_test_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       ...])

In [33]:
x_train['Age_NA'] = x_train_missing
x_test['Age_NA'] = x_test_missing

In [34]:
x_train

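|     | Age | Fare | Age_NA |
|-----|-----|------|--------|
| 30 | 40.0 | 27.7208 | False |
| 10 | 4.0 | 16.7000 | False |
| 873 | 47.0 | 9.0000 | False |
| 182 | 9.0 | 31.3875 | False |
| 876 | 20.0 | 9.8458 | False |
| ... | ... | ... | ... |
| 534 | 30.0 | 8.6625 | False |
| 584 | NaN | 8.7125 | True |
| 493 | 71.0 | 49.5042 | False |
| 527 | NaN | 221.7792 | True |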


In [35]:
si = SimpleImputer()

x_train_trf2=si.fit_transform(x_train)
x_test_trf2 = si.transform(x_test)

In [37]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(x_train_trf2,y_train)

y_pred = clf.predict(x_test_trf2)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.6312849162011173
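
The two-step route used so far (a `MissingIndicator` plus a separate `SimpleImputer`) should produce the same feature matrix as a single `SimpleImputer(add_indicator=True)`. The sketch below checks this on made-up arrays; the values are assumptions and stand in for the Titanic columns.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Toy train/test arrays with gaps in column 0 only.
X_train = np.array([[22.0, 7.25], [np.nan, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_test = np.array([[np.nan, 8.05], [54.0, 51.86]])

# Two-step route: fit both on the training data, then stack the flag on the right.
mi = MissingIndicator().fit(X_train)
si = SimpleImputer(strategy="mean").fit(X_train)
manual = np.hstack([si.transform(X_test), mi.transform(X_test)])

# One-step route.
combined = SimpleImputer(strategy="mean", add_indicator=True).fit(X_train)
one_step = combined.transform(X_test)

print(np.allclose(manual, one_step))
```

Both routes learn the mean and the set of flagged columns from the training data only, so the test rows are transformed with training statistics.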

In [38]:
si = SimpleImputer(add_indicator=True)

In [39]:
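# x_train/x_test already contain the manual Age_NA flag from the cells above;
# pass only the original columns so add_indicator does not add a duplicate flag
x_train = si.fit_transform(x_train[['Age', 'Fare']])
x_test = si.transform(x_test[['Age', 'Fare']])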

In [40]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(x_train,y_train)

y_pred = clf.predict(x_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.6312849162011173
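
The imputer and the classifier can also be chained in a scikit-learn pipeline, so that imputation statistics are always learned from the training split only. This is a sketch on synthetic data generated with NumPy (an assumption, since `titanic.csv` is not bundled here), so its score is unrelated to the accuracies reported above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: two features, with about 20% gaps in the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.2, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Imputation plus indicator, followed by the classifier.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```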