# **<ins style="color:aqua">Feature Engineering</ins>**
## **<ins style="color:green">Handling Missing Values</ins>**
1. ### **<ins style="color:red">(CCA : Complete Case Analysis)</ins>**
   - Remove the whole row in which a NaN value is present.
   - The values must be missing at random, not systematically.
   - Complete Case Analysis (CCA), also called list-wise deletion, discards every observation (row) in which a value is missing for any variable (column).
   - In other words, CCA analyzes only those observations that have information for every variable in the dataset.
   - __Assumption For CCA:__ MCAR : Missing Completely at Random
   - __Advantage__ :
     - Easy to implement, as no data manipulation is required.
     - Preserves the variable distributions: if the data is MCAR, the distribution of each variable in the reduced dataset should match its distribution in the original dataset.
   - __Disadvantage__:
     - It can exclude a large fraction of the original dataset (if missing data is abundant).
     - Excluded observations could be informative for the analysis (if data is not missing at random).
     - When using our models in production, the model will not know how to handle missing data.
   - __When to use CCA__
     - MCAR : Missing Completely At Random
     - The percentage of missing data in the column should be low (a common rule of thumb is under about 5%). If the percentage of missing data is high, do not apply CCA, because too many rows would be discarded.
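
As a minimal sketch of CCA with pandas: check the fraction of missing values per column first, then drop every row that contains a NaN. The toy DataFrame and its column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 31.0, 58.0],
    "salary": [50.0, 62.0, np.nan, 45.0, 80.0],
    "city":   ["A", "B", "A", None, "C"],
})

# Fraction of missing values per column, to judge whether CCA is reasonable.
missing_fraction = toy.isnull().mean()

# Complete Case Analysis: keep only the rows with no missing value in any column.
toy_cca = toy.dropna(axis=0, how="any")

print(missing_fraction)
print(f"rows kept: {len(toy_cca)} of {len(toy)}")
```

Here three of the five rows contain at least one missing value, so CCA would discard most of this toy dataset, which is exactly the situation in which it should be avoided.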

2. ### **<ins style="color:red">Impute (Fill NaN Value)</ins>**
   - #### **Univariate** : scikit-learn provides the _SimpleImputer_ class for _univariate_ imputation.
   - If a column has missing values, fill them using the remaining data in that same column.
     - <ins style="color:blue"> __Numerical Type Column__ </ins>
       - Methods to fill missing values in numerical columns:
         - Mean
         - Median
         - Random Value
         - End of Distribution Value
     - <ins style="color:blue"> __Categorical Type Column__ </ins>
       - Methods to fill missing values in categorical columns:
         - Mode
         - A constant label such as "Missing"
   - #### **Multivariate**
   - If a column has missing values, fill them using the data in the other columns.
     - __KNN Imputer__ Method
     - __Iterative Imputer__ Method
- __Missing Indicator__
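
The options listed above map onto scikit-learn roughly as follows. This is a sketch on hypothetical toy data (the names `num`, `cat`, `x1`, `x2` and `color` are made up for illustration); it covers median, mode, constant-label, KNN and missing-indicator imputation, but not the random-value or end-of-distribution methods.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data; the column names are illustrative only.
num = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0],
                    "x2": [10.0, np.nan, 30.0, 40.0]})
cat = pd.DataFrame({"color": pd.Series(["red", np.nan, "red", "blue"], dtype=object)})

# Univariate, numerical: fill each column with its own median.
num_median = SimpleImputer(strategy="median").fit_transform(num)

# Univariate, categorical: fill with the mode, or with a constant "Missing" label.
cat_mode = SimpleImputer(strategy="most_frequent").fit_transform(cat)
cat_const = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(cat)

# Multivariate: KNN fills a value from the rows that are closest in the other columns.
num_knn = KNNImputer(n_neighbors=2).fit_transform(num)

# Missing indicator: add_indicator=True appends one 0/1 column
# for every feature that contained NaN during fit.
num_flagged = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(num)

print(num_median)
print(cat_mode.ravel(), cat_const.ravel())
print(num_knn)
print(num_flagged)
```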

# <b style="color:aqua">Multivariate Imputation for Handling Missing Data</b>
## <b style="color:green">Iterative Imputer</b>
- `class sklearn.impute.IterativeImputer(estimator=None, *, missing_values=nan, sample_posterior=False, max_iter=10, tol=0.001, n_nearest_features=None, initial_strategy='mean', imputation_order='ascending', skip_complete=False, min_value=-inf, max_value=inf, verbose=0, random_state=None, add_indicator=False, keep_empty_features=False)`
- Also called __MICE__, which stands for _Multivariate Imputation by Chained Equations_.
- Missing-data mechanisms (MICE is generally applied under the __MAR__ assumption):
  - __MCAR__ : Missing Completely At Random
  - __MAR__ : Missing At Random
  - __MNAR__ : Missing Not At Random
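- __Pros__
  - Usually more accurate than univariate imputation, because it uses the relationships between columns.
- __Cons__
  - Slow, since a model is fitted for every column with missing values in every iteration.
  - Uses more memory.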
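- First step: replace every NaN value with the mean of its column.
- Second step: move left to right through the columns. For each column, set its originally missing cells (which now hold the mean) back to NaN, train a machine learning model on the rows where that column is observed, using the other columns as inputs, and predict the missing cells.
- Repeat the second step until the imputed values stop changing much between iterations, or until a maximum number of iterations is reached.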
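
For comparison with the manual walkthrough, here is a minimal sketch of the same idea using scikit-learn's `IterativeImputer`. The toy DataFrame and its column names `a`, `b`, `c` are hypothetical; `LinearRegression` is chosen to mirror the manual steps rather than because it is the library default, and `imputation_order="roman"` is used to request left-to-right processing.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "a": [8.0, np.nan, 15.0, 12.0, 2.0, 9.0],
    "b": [15.0, 5.0, 10.0, np.nan, 15.0, 7.0],
    "c": [30.0, 20.0, 41.0, 26.0, np.nan, 22.0],
})

imputer = IterativeImputer(
    estimator=LinearRegression(),  # model used to predict each column from the others
    initial_strategy="mean",       # first step: fill NaN with the column mean
    imputation_order="roman",      # second step: visit columns left to right
    max_iter=10,
    random_state=0,
)

toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_imputed)
```

On a dataset this small the imputer may emit a convergence warning; the observed values are returned unchanged and only the NaN cells are filled.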

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

In [2]:
df = np.round(pd.read_csv('../data/50_Startups.csv')[['R&D Spend','Administration','Marketing Spend','Profit']]/10000)
np.random.seed(9)
df = df.sample(5)
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
21,8.0,15.0,30.0,11.0
37,4.0,5.0,20.0,9.0
2,15.0,10.0,41.0,19.0
14,12.0,16.0,26.0,13.0
44,2.0,15.0,3.0,7.0


In [3]:
df.shape

(5, 4)

In [4]:
df.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
Profit             0
dtype: int64

In [5]:
df = df.iloc[:,0:-1]
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,4.0,5.0,20.0
2,15.0,10.0,41.0
14,12.0,16.0,26.0
44,2.0,15.0,3.0


In [6]:
df = df.copy()  # work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
df.iloc[1,0] = np.nan
df.iloc[3,1] = np.nan
df.iloc[-1,-1] = np.nan


In [7]:
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,


In [8]:
df.isnull().sum()

R&D Spend          1
Administration     1
Marketing Spend    1
dtype: int64

In [9]:
# Step 1 - Impute all missing values with mean of respective col

df0 = pd.DataFrame()

df0['R&D Spend'] = df['R&D Spend'].fillna(df['R&D Spend'].mean())
df0['Administration'] = df['Administration'].fillna(df['Administration'].mean())
df0['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())
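
The three column-wise fills above can also be written as a single call, since `DataFrame.fillna` accepts the Series of column means. A small sketch, with the example values retyped by hand into a separate frame (`demo` is a hypothetical name) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# The 5x3 example with its three missing cells, retyped by hand.
demo = pd.DataFrame(
    {"R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
     "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
     "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan]},
    index=[21, 37, 2, 14, 44],
)

# demo.mean() is a Series indexed by column name, so each NaN
# is replaced by the mean of its own column.
demo0 = demo.fillna(demo.mean())
print(demo0)
```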

### **1st Iteration**

In [10]:
# Starting point for the 1st iteration: the mean-imputed data (0th iteration)
df0

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,9.25,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [11]:
# Remove the col1 imputed value

df1 = df0.copy()

df1.iloc[1,0] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [12]:
# Train on the rows where 'R&D Spend' is observed (positions 0, 2, 3, 4); the remaining row (position 1) is the one to predict

X = df1.iloc[[0,2,3,4],1:3]
X

Unnamed: 0,Administration,Marketing Spend
21,15.0,30.0
2,10.0,41.0
14,11.25,26.0
44,15.0,29.25


In [13]:
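# NOTE: the target for this step should be 'R&D Spend' on the training rows,
# i.e. df1.iloc[[0,2,3,4],0], as in the 2nd and 3rd iterations. The line below
# selects 'Administration' instead; the outputs shown for the 1st iteration reflect it as run.
y = df1.iloc[[0,1,2,4],1]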
y

21    15.0
37     5.0
2     10.0
44    15.0
Name: Administration, dtype: float64

In [14]:
df1.iloc[1,0:]

R&D Spend           NaN
Administration      5.0
Marketing Spend    20.0
Name: 37, dtype: float64

In [15]:
# Predict NaN of 'R&D Spend' column
lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df1.iloc[1,1:].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([1.64])

In [16]:
df1.iloc[1,0] = pred
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [17]:
# Remove the col2 imputed value

df1.iloc[3,1] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,29.25


In [18]:
# Train on the rows where 'Administration' is observed (positions 0, 1, 2, 4); the remaining row (position 3) is the one to predict
X = df1.iloc[[0,1,2,4],[0,2]]
X

Unnamed: 0,R&D Spend,Marketing Spend
21,8.0,30.0
37,1.64,20.0
2,15.0,41.0
44,2.0,29.25


In [19]:
y = df1.iloc[[0,1,2,4],1]
y

21    15.0
37     5.0
2     10.0
44    15.0
Name: Administration, dtype: float64

In [20]:
# predict NaN of Administration
lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df1.iloc[3,[0,2]].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([2.8])

In [21]:
df1.iloc[3,1] = pred
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,2.8,26.0
44,2.0,15.0,29.25


In [22]:
# Remove the col3 imputed value

df1.iloc[4,-1] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,2.8,26.0
44,2.0,15.0,


In [23]:
# Train on the first four rows, where 'Marketing Spend' is observed; the last row is the one to predict
X = df1.iloc[0:4,0:2]
X

Unnamed: 0,R&D Spend,Administration
21,8.0,15.0
37,1.64,5.0
2,15.0,10.0
14,12.0,2.8


In [24]:
y = df1.iloc[0:4,-1]
y

21    30.0
37    20.0
2     41.0
14    26.0
Name: Marketing Spend, dtype: float64

In [25]:
lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df1.iloc[4,0:2].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([25.2])

In [26]:
df1.iloc[4,-1] = pred
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,2.8,26.0
44,2.0,15.0,25.2


### **2nd Iteration**

In [27]:
# Starting point for the 2nd iteration: the result of the 1st iteration
df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,1.64,5.0,20.0
2,15.0,10.0,41.0
14,12.0,2.8,26.0
44,2.0,15.0,25.2


In [28]:
# Subtract 0th iteration from 1st iteration

df1 - df0

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,-7.61,0.0,0.0
2,0.0,0.0,0.0
14,0.0,-8.45,0.0
44,0.0,0.0,-4.05


In [29]:
df2 = df1.copy()

df2.iloc[1,0] = np.nan

df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,2.8,26.0
44,2.0,15.0,25.2


In [30]:
X = df2.iloc[[0,2,3,4],1:3]
y = df2.iloc[[0,2,3,4],0]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df2.iloc[1,1:].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([6.99])

In [31]:
df2.iloc[1,0] = pred

In [32]:
df2.iloc[3,1] = np.nan
X = df2.iloc[[0,1,2,4],[0,2]]
y = df2.iloc[[0,1,2,4],1]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df2.iloc[3,[0,2]].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([3.65])

In [33]:
df2.iloc[3,1] = pred

In [34]:
df2.iloc[4,-1] = np.nan

X = df2.iloc[0:4,0:2]
y = df2.iloc[0:4,-1]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df2.iloc[4,0:2].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([18.61])

In [35]:
df2.iloc[4,-1] = pred
df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,6.99,5.0,20.0
2,15.0,10.0,41.0
14,12.0,3.65,26.0
44,2.0,15.0,18.61


In [36]:
# Subtract 1st iteration from 2nd iteration
df2 - df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,5.35,0.0,0.0
2,0.0,0.0,0.0
14,0.0,0.85,0.0
44,0.0,0.0,-6.59
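
The difference tables are what a stopping rule looks at. As a rough sketch, assuming the criterion described in the scikit-learn documentation for `IterativeImputer` (stop once the largest absolute change, divided by the largest absolute observed value, falls below `tol`), the two differences computed so far can be checked by hand. The numbers are copied from the tables above; 41.0 is the largest observed value in the example.

```python
import numpy as np

# Largest observed (non-imputed) value in the 5x3 example table.
max_observed = 41.0

# Changes in the three imputed cells, copied from the two difference tables above.
changes = {
    "iteration 1 vs 0": np.array([-7.61, -8.45, -4.05]),
    "iteration 2 vs 1": np.array([5.35, 0.85, -6.59]),
}

tol = 1e-3  # default `tol` in the IterativeImputer signature quoted earlier

scaled_changes = {}
for label, delta in changes.items():
    # Largest absolute change, scaled by the largest observed value.
    scaled_changes[label] = np.max(np.abs(delta)) / max_observed
    converged = scaled_changes[label] < tol
    print(f"{label}: scaled max change = {scaled_changes[label]:.3f}, converged = {converged}")
```

Both scaled changes are far above `tol`, so under this rule the procedure has not converged yet and further iterations are needed.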


### **3rd Iteration**

In [37]:
df3 = df2.copy()

df3.iloc[1,0] = np.nan

df3

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,3.65,26.0
44,2.0,15.0,18.61


In [38]:
X = df3.iloc[[0,2,3,4],1:3]
y = df3.iloc[[0,2,3,4],0]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df3.iloc[1,1:].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([8.4])

In [39]:
df3.iloc[1,0] = pred
df3

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,8.4,5.0,20.0
2,15.0,10.0,41.0
14,12.0,3.65,26.0
44,2.0,15.0,18.61


In [40]:
df3.iloc[3,1] = np.nan
X = df3.iloc[[0,1,2,4],[0,2]]
y = df3.iloc[[0,1,2,4],1]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df3.iloc[3,[0,2]].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([3.33])

In [41]:
df3.iloc[3,1] = pred
df3

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,8.4,5.0,20.0
2,15.0,10.0,41.0
14,12.0,3.33,26.0
44,2.0,15.0,18.61


In [42]:
df3.iloc[4,-1] = np.nan

X = df3.iloc[0:4,0:2]
y = df3.iloc[0:4,-1]

lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(df3.iloc[4,0:2].values.reshape(1,2))
pred = np.round(pred, 2)
pred



array([16.07])

In [43]:
df3.iloc[4,-1] = pred
df3

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,8.4,5.0,20.0
2,15.0,10.0,41.0
14,12.0,3.33,26.0
44,2.0,15.0,16.07


In [44]:
# Subtract 2nd iteration from 3rd iteration
df3 - df2

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,0.0,0.0,0.0
37,1.41,0.0,0.0
2,0.0,0.0,0.0
14,0.0,-0.32,0.0
44,0.0,0.0,-2.54
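
The cell-by-cell procedure above can be condensed into a loop. The following is only a sketch: `chained_imputation` is a hypothetical helper, not a scikit-learn function, and it always uses the column being imputed as the regression target and does not round intermediate predictions, so its numbers need not match the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def chained_imputation(data, n_iter=3):
    """Hypothetical helper: mean-fill, then re-predict each column's missing cells in turn."""
    mask = data.isnull()                    # remember which cells were originally missing
    filled = data.fillna(data.mean())       # first step: mean imputation
    history = [filled.copy()]               # history[0] is the 0th iteration
    for _ in range(n_iter):
        for col in data.columns:            # second step: move left to right
            missing_rows = mask[col]
            if not missing_rows.any():
                continue
            predictors = filled.drop(columns=col)
            model = LinearRegression()
            # Train only on the rows where this column was observed.
            model.fit(predictors[~missing_rows], filled.loc[~missing_rows, col])
            # Overwrite the originally missing cells with the predictions.
            filled.loc[missing_rows, col] = model.predict(predictors[missing_rows])
        history.append(filled.copy())
    return filled, history


# The 5x3 example with its three missing cells, retyped by hand.
toy = pd.DataFrame(
    {"R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
     "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
     "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan]},
    index=[21, 37, 2, 14, 44],
)

result, history = chained_imputation(toy, n_iter=3)
print(result)
print(history[-1] - history[-2])  # change between the last two iterations
```

Keeping the `history` list makes it easy to inspect the change between consecutive iterations, in the same way as the `df1 - df0`, `df2 - df1` and `df3 - df2` tables.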
