MICE is applied only to the input (feature) columns, so remove the target column first.

MICE algorithm

1. Replace the NaN values in every column with that column's mean.
2. Process the columns one at a time, from left to right.
3. In the first column from the left, set the values that were mean-imputed back to NaN.
4. Train a model on the rows where this column is observed, using the other columns as input features and this column as the target.
5. Each row with a missing value then becomes a new data point (X_test), for example
NaN, f1, f2
Here f1 and f2 are the features, and the model predicts the missing value.
6. Once the missing value is predicted, write it into the column and move on to the next column.
7. In that column, set the imputed values back to NaN, treat the remaining columns (including the freshly imputed first column) as input features and this column as the target, and train a new model.
The new data point (X_test) is
f1, NaN, f2
and the model predicts the missing value.
8. Once the missing values of that column are predicted, repeat the same procedure for every remaining column.
9. One full left-to-right pass over the columns is one MICE iteration.

Iteration 0 -> fill the missing values of every column with that column's mean

Iteration 1 -> re-estimate the missing values of every column with the MICE algorithm (here we predict with linear regression)

Difference -> Iteration 1 - Iteration 0


Then take Iteration 1 as the base and compute
Difference -> Iteration 2 - Iteration 1

Then take Iteration 2 as the base and compute
Difference -> Iteration 3 - Iteration 2


... and so on.

Keep going until the difference becomes 0 (in practice, smaller than a small tolerance).
Why? The first fill used column means, which may be far from the true values. Each pass re-predicts the missing values from the current estimates of the other columns, so we repeat until the imputed values stop changing.

End goal: the imputed values should be as close as possible to the true (unobserved) values.

More iterations give the imputed values more chances to settle, but convergence is not guaranteed, so the number of iterations is usually capped.
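
The whole procedure can be sketched as one small function. This is only a minimal illustration of the steps described above, not a reference implementation: the name `mice_impute`, the parameters `max_iter` and `tol`, and the column names `f0`, `f1`, `f2` are all made up for this sketch, and plain linear regression is assumed as the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def mice_impute(df, max_iter=10, tol=1e-3):
    """Toy MICE loop (hypothetical helper): mean-fill, then re-predict each column's gaps."""
    missing = df.isna()                  # remember which cells were NaN originally
    current = df.fillna(df.mean())       # iteration 0: column means
    for _ in range(max_iter):
        previous = current.copy()
        for col in df.columns:           # left-to-right over the columns
            rows = missing[col]
            if not rows.any():
                continue                 # nothing to impute in this column
            features = current.drop(columns=col)
            model = LinearRegression()
            # train on rows where this column was observed
            model.fit(features[~rows], current.loc[~rows, col])
            # overwrite the previously imputed cells with new predictions
            current.loc[rows, col] = model.predict(features[rows])
        # stop once a full pass changes no value by more than tol
        if (current - previous).abs().to_numpy().max() < tol:
            break
    return current


toy = pd.DataFrame({
    "f0": [8.0, np.nan, 15.0, 12.0, 2.0],
    "f1": [15.0, 5.0, 10.0, np.nan, 15.0],
    "f2": [30.0, 20.0, 41.0, 26.0, np.nan],
})
filled = mice_impute(toy)
print(filled.round(2))
```

Observed values are never touched; only the cells that were NaN in the input get rewritten on each pass. With so few rows the loop may hit `max_iter` before the change drops below `tol`.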

In [247]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [248]:
df = pd.read_csv('50_Startups.csv')

In [249]:
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [250]:
df = np.round(
    pd.read_csv('50_Startups.csv')[[
        'R&D Spend',
        'Administration',
        'Marketing Spend',
        'Profit'
    ]]/10000
)
np.random.seed(9)
df = df.sample(5)
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
21,8.0,15.0,30.0,11.0
37,4.0,5.0,20.0,9.0
2,15.0,10.0,41.0,19.0
14,12.0,16.0,26.0,13.0
44,2.0,15.0,3.0,7.0


*MICE is applied only to the input columns, so remove the target column*

In [251]:
df = df.iloc[:,0:-1]
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,4.0,5.0,20.0
2,15.0,10.0,41.0
14,12.0,16.0,26.0
44,2.0,15.0,3.0


In [252]:
df.isnull().mean()*100

Unnamed: 0,0
R&D Spend,0.0
Administration,0.0
Marketing Spend,0.0


*Introduce some fake NaN values in each column*

In [253]:
df = df.copy()  # explicit copy, so the assignments below do not raise SettingWithCopyWarning
df.iloc[1,0] = np.nan
df.iloc[3,1] = np.nan
df.iloc[-1,-1] = np.nan

df.head()


Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,
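
The walkthrough below hard-codes the positions of the NaN cells. A small sketch of how those positions could be recorded instead, right after the NaNs are introduced (the names `df_demo` and `missing_mask` are made up for this example; the values are copied from the table above):

```python
import numpy as np
import pandas as pd

# Same toy values as the table above
df_demo = pd.DataFrame(
    {
        "R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
        "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
        "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan],
    },
    index=[21, 37, 2, 14, 44],
)

missing_mask = df_demo.isna()                     # True exactly where a value is missing
print(missing_mask.sum())                         # number of NaNs per column
rows, cols = np.where(missing_mask.to_numpy())    # positional (row, column) pairs
nan_positions = list(zip(rows.tolist(), cols.tolist()))
print(nan_positions)
```

Keeping the mask around matters because after mean imputation the frame no longer contains any NaN, so the cells to re-predict can only be found from the mask.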


**Step 1: Impute every missing value with the mean of its column**

In [254]:
df['R&D Spend'] = df['R&D Spend'].fillna(df['R&D Spend'].mean())
df.head()


Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,9.25,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,


In [255]:
df['Administration'] = df['Administration'].fillna(df['Administration'].mean())
df['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())

*Iteration 0*

In [256]:
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,9.25,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25
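
As a side note, the column-by-column mean fill can also be written in one call. A minimal sketch, assuming the same toy values as above (`df_nan` and `df_iter0` are names invented here):

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame(
    {
        "R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
        "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
        "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan],
    },
    index=[21, 37, 2, 14, 44],
)

# df.mean() skips NaN by default, and fillna matches the means to columns by name
df_iter0 = df_nan.fillna(df_nan.mean())
print(df_iter0)
```

This gives the same Iteration 0 table: 9.25, 11.25 and 29.25 in the three previously missing cells.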


*Iteration 1*

1) Set the imputed value in column 1 back to NaN (columns are processed left to right)

In [257]:
df1 = df.copy()
df1.iloc[1,0] = np.nan
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


2) Train the model on the remaining rows, then use the row with the missing value as the test data to predict that value

*input features*

In [258]:
X = df1.iloc[
    [0,2,3,4],1:3
]
X

Unnamed: 0,Administration,Marketing Spend
21,15.0,30.0
2,10.0,41.0
14,11.25,26.0
44,15.0,29.25


*target col*

In [259]:
y = df1.iloc[
    [0,2,3,4],0
]
y

Unnamed: 0,R&D Spend
21,8.0
2,15.0
14,12.0
44,2.0


In [260]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df1.iloc[
        [1],1:3
    ].values.reshape(1,2)
)



array([23.14158651])

In [261]:
df1.iloc[1,0] = 23.14
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


*Remove the col2 imputed value*

In [262]:
df1.iloc[3,1] = np.nan
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,29.25


In [263]:
X = df1.iloc[
    [0,1,2,4],[0,2]
]
X

Unnamed: 0,R&D Spend,Marketing Spend
21,8.0,30.0
37,23.14,20.0
2,15.0,41.0
44,2.0,29.25


In [264]:
y = df1.iloc[
    [0,1,2,4],1
]
y

Unnamed: 0,Administration
21,15.0
37,5.0
2,10.0
44,15.0


In [265]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df1.iloc[
        3,[0,2]
    ].values.reshape(1,2)
)



array([11.06331285])

In [266]:
df1.iloc[3,1] = 11.06
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,29.25


*remove the col3 imputed value*

In [267]:
df1.iloc[-1,-1] = np.nan
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,


In [268]:
X = df1.iloc[
    0:-1,0:-1
]
X

Unnamed: 0,R&D Spend,Administration
21,8.0,15.0
37,23.14,5.0
2,15.0,10.0
14,12.0,11.06


In [269]:
y = df1.iloc[
    0:-1,-1
]
y

Unnamed: 0,Marketing Spend
21,30.0
37,20.0
2,41.0
14,26.0


In [270]:
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df1.iloc[
        -1,0:-1
    ].values.reshape(1,2)
)



array([31.56351448])

In [271]:
df1.iloc[4,-1] = 31.56

**After 1st Iteration**

In [272]:
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.14,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,31.56


*Subtract 0th iteration from 1st iteration*

In [273]:
df1 - df

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,13.89,0.0,0.0
2,0.0,0.0,0.0
14,0.0,-0.19,0.0
44,0.0,0.0,2.31
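
To turn the difference table into a single number per column, one option is the largest absolute change. A small sketch using the Iteration 0 and Iteration 1 values from above (`iter0`, `iter1` and `change_per_col` are names made up here):

```python
import pandas as pd

cols = ["R&D Spend", "Administration", "Marketing Spend"]
idx = [21, 37, 2, 14, 44]

iter0 = pd.DataFrame(
    [[8.0, 15.0, 30.0], [9.25, 5.0, 20.0], [15.0, 10.0, 41.0],
     [12.0, 11.25, 26.0], [2.0, 15.0, 29.25]],
    columns=cols, index=idx,
)
iter1 = pd.DataFrame(
    [[8.0, 15.0, 30.0], [23.14, 5.0, 20.0], [15.0, 10.0, 41.0],
     [12.0, 11.06, 26.0], [2.0, 15.0, 31.56]],
    columns=cols, index=idx,
)

change_per_col = (iter1 - iter0).abs().max()   # largest absolute change in each column
print(change_per_col)
print(change_per_col.max())                    # largest change anywhere in the table
```

Only the three imputed cells can differ between iterations, so every other entry of the difference is exactly 0.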


*Now take df1 as the base*

*Iteration 2 starts*

In [274]:
df2 = df1.copy()
df2.iloc[1,0] = np.nan
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,31.56


In [275]:
X = df2.iloc[
    [0,2,3,4],1:3
]
y = df2.iloc[
    [0,2,3,4],0
]
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df2.iloc[
        [1],1:3
    ].values.reshape(1,2)
)



array([23.78627207])

In [276]:
df2.iloc[1,0] = 23.78

In [277]:
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.78,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.06,26.0
44,2.0,15.0,31.56


In [278]:
X = df2.iloc[
    [0,1,2,4],[0,2]
]
y = df2.iloc[
    [0,1,2,4],1
]
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df2.iloc[
        3,[0,2]
    ].values.reshape(1,2)
)



array([11.22020174])

In [279]:
df2.iloc[3,1] = 11.22
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.78,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.22,26.0
44,2.0,15.0,31.56


In [280]:
df2.iloc[-1,-1] = np.nan

X = df2.iloc[
    0:-1,0:-1
]
y = df2.iloc[
    0:-1,-1
]
lr = LinearRegression()
lr.fit(X,y)
lr.predict(
    df2.iloc[
        -1,0:-1
    ].values.reshape(1,2)
)




array([38.87979054])

In [281]:
df2.iloc[4,-1] = 38.87
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,23.78,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.22,26.0
44,2.0,15.0,38.87


In [282]:
df2 - df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,0.64,0.0,0.0
2,0.0,0.0,0.0
14,0.0,0.16,0.0
44,0.0,0.0,7.31


Keep iterating until the difference becomes 0 (in practice, until it falls below a small tolerance).
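
An exact difference of 0 is rarely reached, so a tolerance is normally used. The scikit-learn documentation for `IterativeImputer` describes its stopping rule as `max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol`. A rough sketch of that check on the Iteration 1 and Iteration 2 tables above (the variable names are invented here, and `tol=1e-3` simply mirrors the library default):

```python
import numpy as np

# Original data with the NaNs still in place
observed = np.array([
    [8.0, 15.0, 30.0],
    [np.nan, 5.0, 20.0],
    [15.0, 10.0, 41.0],
    [12.0, np.nan, 26.0],
    [2.0, 15.0, np.nan],
])
it1 = observed.copy()
it1[1, 0], it1[3, 1], it1[4, 2] = 23.14, 11.06, 31.56   # after iteration 1
it2 = observed.copy()
it2[1, 0], it2[3, 1], it2[4, 2] = 23.78, 11.22, 38.87   # after iteration 2

tol = 1e-3
# largest change between iterations, scaled by the largest observed magnitude
ratio = np.max(np.abs(it2 - it1)) / np.nanmax(np.abs(observed))
converged = ratio < tol
print(ratio, converged)
```

Here the ratio is about 0.18, far above the tolerance, so by this rule the toy example has not converged after two iterations.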

# using sklearn

> class sklearn.impute.IterativeImputer(estimator=None, *, missing_values=nan, sample_posterior=False, max_iter=10, tol=0.001, n_nearest_features=None, initial_strategy='mean', fill_value=None, imputation_order='ascending', skip_complete=False, min_value=-inf, max_value=inf, verbose=0, random_state=None, add_indicator=False, keep_empty_features=False)
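
Before moving to a bigger dataset, the same 5-row toy table can be passed through `IterativeImputer`. This is a sketch under a few assumptions: `imputation_order='roman'` is chosen to mimic the left-to-right traversal of the manual walkthrough, `max_iter=10` is arbitrary, and `toy` / `toy_filled` are names made up here. The imputed numbers are not guaranteed to match the hand-rounded values above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

toy = pd.DataFrame(
    {
        "R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
        "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
        "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan],
    },
    index=[21, 37, 2, 14, 44],
)

imp = IterativeImputer(
    estimator=LinearRegression(),
    initial_strategy="mean",      # iteration 0: column means
    imputation_order="roman",     # left to right over the columns
    max_iter=10,
)
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
toy_filled = pd.DataFrame(
    imp.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_filled.round(2))
```

On a table this small scikit-learn may emit a `ConvergenceWarning`, which fits the growing differences seen in the manual iterations.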

In [303]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

In [284]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [302]:
from sklearn.datasets import make_regression

In [332]:
X,y = make_regression(
    n_samples=1000, n_features=10, random_state=0
)

In [333]:
X.shape

(1000, 10)

> numpy.random.rand(d0, d1, ..., dn)
> Random values in a given shape, drawn uniformly from [0, 1), so `< 0.1` is True for roughly 10% of the cells.

In [334]:
# introducing 10% missing values
mask = np.random.rand(*X.shape) < 0.1
X[mask] = np.nan

In [335]:
X[:5]

array([[ 1.60904498,  0.52768048,  1.64924824, -0.49148502, -1.09841686,
         1.06513264,  0.68972086, -0.22262126, -0.0025571 ,  0.14301667],
       [ 0.01881479, -0.99604409,  1.70377549,         nan,  1.18198079,
         1.51062759,  0.45870585,  0.18201431, -1.15953989, -1.21321363],
       [ 1.33531628, -1.8814838 ,         nan,  1.18997805,  0.38899195,
         1.08474758, -0.76203896,  0.19832706,         nan, -0.41327481],
       [-0.38487981, -0.28688719, -1.12682581, -0.10730528, -0.13370156,
        -0.73067775,  1.07774381, -0.0616264 , -0.04217145,         nan],
       [-0.09845252, -1.07993151, -0.74475482, -0.43782004, -0.06824161,
        -0.82643854,  1.71334272, -1.14746865,  1.12663592,         nan]])
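
A quick sanity check on how many values actually went missing. This sketch rebuilds the data with a seeded generator so it is reproducible; the seed and the `_demo` names are assumptions made here, and the cell above uses the unseeded global `np.random` instead.

```python
import numpy as np
from sklearn.datasets import make_regression

X_demo, _ = make_regression(n_samples=1000, n_features=10, random_state=0)

rng = np.random.default_rng(0)                 # seeded generator (assumption for this sketch)
mask_demo = rng.random(X_demo.shape) < 0.1     # True for roughly 10% of the cells
X_demo[mask_demo] = np.nan

missing_fraction = np.isnan(X_demo).mean()     # overall share of NaN cells
print(missing_fraction)
print(np.isnan(X_demo).mean(axis=0))           # share of NaN per feature
```

The mask is random, so the share is close to 10% but not exactly 10%, and it varies a little from feature to feature.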

In [336]:
imputer = IterativeImputer(
    estimator = LinearRegression(),
    max_iter = 100,
    add_indicator=False
)



In [337]:
X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.2,random_state=2
)


In [338]:
X_train_trf = imputer.fit_transform(X_train)
X_test_trf = imputer.transform(X_test)

lr = LinearRegression()
lr.fit(X_train_trf,y_train)
y_pred = lr.predict(X_test_trf)
r2_score(y_test,y_pred)

0.9041121602862509
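
To judge whether the iterative imputer helps, it is worth comparing it with plain mean imputation on the same split. A hedged sketch, with a seeded mask and variable names that are assumptions of this example (so the scores will not equal the number above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_cmp, y_cmp = make_regression(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_cmp[rng.random(X_cmp.shape) < 0.1] = np.nan     # roughly 10% missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_cmp, y_cmp, test_size=0.2, random_state=2
)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "iterative": IterativeImputer(estimator=LinearRegression(), max_iter=10),
}
scores = {}
for name, imp in imputers.items():
    # the pipeline fits the imputer on the training data only, then reuses it on the test data
    model = make_pipeline(imp, LinearRegression())
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
print(scores)
```

`make_regression` generates features that are largely independent of each other by default, so the other columns carry little information about a missing one and the two imputers may score about the same here. MICE is expected to pay off mainly when the features are correlated.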