# Pre Processing

#### Missing Value Treatments

1. Mean or median or other summary statistic substitution
2. Forward fill and backward fill (can be used according to business problem)
3. Nearest neighbors imputation
4. MultiOutputRegressor
5. IterativeImputer
6. Time-Series Specific Methods

# __Missing Value Treatments__
The methods used to handle missing values are as follows:<br>
1. Drop missing values
2. Fill the missing values with a summary statistic
3. Predict the missing values with a machine learning algorithm

In [1]:
import pandas as pd
import numpy as np
# Check for missing values in a dataset
# (named `scores` so that the built-in `dict` is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95, 75],
          'Second Score': [30, 45, 56, np.nan, np.nan],
          'Third Score': [np.nan, 40, 98, 98, 56]}

# Create a DataFrame from the dictionary
df = pd.DataFrame(scores)
print(df)
print('\nNo of null values:')
df.isnull().sum()

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         98.0
3         95.0           NaN         98.0
4         75.0           NaN         56.0

No of null values:


First Score     1
Second Score    2
Third Score     1
dtype: int64

In [2]:
# If a missing value is not encoded as NaN (for example as 0 or a placeholder string), it first has to be converted to NaN.
# To illustrate such a dataset, the NaN in 'First Score' is replaced here with the sentinel value 0.
df_2 = df.copy()
df_2['First Score'] = df_2['First Score'].replace(np.nan, 0)
df_2[df_2['First Score'] == 0].head(2)

Unnamed: 0,First Score,Second Score,Third Score
2,0.0,56.0,98.0
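
The conversion that the comment in the cell above describes, turning placeholder entries back into NaN, might look like the following minimal sketch. The sentinel values `-999` and `"missing"` and the column names are assumptions made up for illustration; a real dataset defines its own placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: -999 and "missing" stand in for absent values
raw = pd.DataFrame({"score": [100, -999, 95],
                    "grade": ["A", "missing", "B"]})

# Map each column's placeholder to NaN so that pandas treats it as missing
cleaned = raw.replace({"score": {-999: np.nan},
                       "grade": {"missing": np.nan}})

null_counts = cleaned.isnull().sum()
print(null_counts)
```

Once the placeholders are NaN, `isnull()`, `dropna()` and the imputers used in the rest of this notebook can see them.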


## Imputation vs Removing Data
Before choosing between imputation and removal, we have to understand why data goes missing.
1. **Missing completely at random**: The probability that a value is missing is the same for all observations. For example, respondents in a data collection process decide whether to declare their earnings by tossing a fair coin: on heads they declare them, on tails they do not. Every observation then has an equal chance of a missing value.
2. **Missing at random**: The probability that a value is missing varies with the values or levels of other observed input variables. For example, when collecting data on age, women have a higher share of missing values than men.
3. **Missing that depends on unobserved predictors**: The missing values are not random and are related to an input variable that was not observed. For example, in a medical study, if a particular diagnostic causes discomfort, patients are more likely to drop out of the study. These missing values are not at random unless “discomfort” is recorded as an input variable for all patients.
4. **Missing that depends on the missing value itself**: The probability that a value is missing is directly correlated with the value itself. For example, people with very high or very low incomes are more likely to withhold their earnings.
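
The difference between the first two mechanisms can be made concrete with a small simulation. This is only a sketch: the data is synthetic, and the column names and the missing rates (10% overall, 20% versus 5% by gender) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df_sim = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "age": rng.integers(18, 80, size=n).astype(float),
})

# Missing completely at random: every row has the same 10% chance
mcar = df_sim.copy()
mcar.loc[rng.random(n) < 0.10, "age"] = np.nan

# Missing at random: the chance depends on the observed 'gender' column
mar = df_sim.copy()
p_missing = np.where(mar["gender"] == "female", 0.20, 0.05)
mar.loc[rng.random(n) < p_missing, "age"] = np.nan

# Share of missing ages per gender under each mechanism
mcar_rates = mcar["age"].isna().groupby(mcar["gender"]).mean()
mar_rates = mar["age"].isna().groupby(mar["gender"]).mean()
print(mcar_rates)
print(mar_rates)
```

Comparing missing rates across the levels of other variables, as done in the last lines, is one simple way to check whether missingness looks related to observed data.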
 

**Simple approaches**<br>
A number of simple approaches exist. For basic use cases, these are often enough.<br><br>
**Dropping rows with null values**
1. Rows can be dropped if the number of data points is high enough that removing some of them will not cost the resulting models generalizability (a learning curve can be used to check whether this is the case).
2. Dropping too much data is dangerous.
3. In a large data set where only about 3-5% of the values are missing, dropping the affected rows is feasible.
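
Before dropping rows it helps to measure how much data would be lost. The sketch below rebuilds the small score table from earlier and applies a cut-off; the 5% threshold and the name `MAX_DROP_SHARE` are assumptions taken from the rule of thumb above, not fixed rules.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"First Score": [100, 90, np.nan, 95, 75],
                       "Second Score": [30, 45, 56, np.nan, np.nan],
                       "Third Score": [np.nan, 40, 98, 98, 56]})

# Share of rows that contain at least one missing value
share_incomplete = scores.isnull().any(axis=1).mean()

# Hypothetical cut-off: only drop rows when at most 5% of them are affected
MAX_DROP_SHARE = 0.05
if share_incomplete <= MAX_DROP_SHARE:
    result = scores.dropna()
else:
    result = scores  # too much would be lost, so impute instead

print(f"{share_incomplete:.0%} of rows are incomplete")
```

For this toy table most rows are incomplete, so dropping is not an option and the data is left untouched.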

In [3]:
df_3=df.copy()
df_3.dropna()

Unnamed: 0,First Score,Second Score,Third Score
1,90.0,45.0,40.0


**Dropping features with high nullity**

A feature that has a high number of empty values is unlikely to be very useful for prediction. It can often be safely dropped.
<br>**Note:** Before deciding that a variable is not useful, validate this with a feature importance test; tree-based methods can be used for this.
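
One way to find high-nullity features is to compute the share of missing values per column and keep only the columns below a threshold. This is a sketch on made-up data; the 50% threshold and the name `MAX_NULL_SHARE` are assumptions, and the right value depends on the problem.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                     "b": [np.nan, np.nan, np.nan, 1.0],
                     "c": [1.0, np.nan, 3.0, 4.0]})

# Fraction of missing values in each column
null_share = data.isnull().mean()

# Hypothetical threshold: drop columns with more than half their values missing
MAX_NULL_SHARE = 0.5
kept = data.loc[:, null_share <= MAX_NULL_SHARE]

print(null_share)
print(list(kept.columns))
```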

In [4]:
df_2.drop(['Second Score'], axis= 1, inplace = True)

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Pre-Processing" role="tab" aria-controls="settings">Go to Top<span class="badge badge-primary badge-pill"></span></a>

## Mean or median or other summary statistic substitution
When to use:
1. Check for outliers. If there are few outliers, mean imputation can be used.
2. If there are many outliers, median imputation can be used.
3. For categorical variables, mode imputation can be used.

<br>**NOTE:** This is acceptable if less than 3% of the data is missing; otherwise it introduces too much bias and artificially lowers the variability of the data.
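
The effect of outliers on the choice between mean and median can be seen on a tiny example. The numbers below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Numeric column with one extreme outlier (1000) and one missing value
values = pd.Series([10, 12, 11, 13, 1000, np.nan])

mean_filled = values.fillna(values.mean())      # pulled up by the outlier
median_filled = values.fillna(values.median())  # stays near the typical values

# Categorical column: fill with the most frequent category (the mode)
colors = pd.Series(["red", "blue", "red", None])
mode_filled = colors.fillna(colors.mode().iloc[0])

print(mean_filled.iloc[-1], median_filled.iloc[-1], mode_filled.iloc[-1])
```

The mean fill lands far away from every typical observation, while the median fill stays among them.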

In [5]:
# Simple illustration of missing value imputation with the mean
# SimpleImputer also supports strategy='median' and strategy='most_frequent' (the mode)
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the column means from the training data
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
# transform() replaces every NaN with the mean learned for its column
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]


In [6]:
import pandas as pd
import numpy as np

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange","class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]
df_copy= df.copy()
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,,orange,class 1,17.0
4,M,11.0,green,class 3,
5,M,7.0,red,class 1,22.0


In [7]:
# Impute a single column; the strategy can be mean, median or most_frequent (mode)
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(df[['boh']])
df["boh"]=imp_mean.transform(df[["boh"]])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [8]:
# some other column 
imp_mean.fit(df[['price']])
df["price"]=imp_mean.transform(df[["price"]])
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [9]:
# Column-specific imputation directly in pandas
# df_copy still holds the original data with its missing values
df_4 = df_copy.copy()
mean_value = df_4['price'].mean()
# Replace every NaN in 'price' with the mean of the non-null values
df_4['price_mean'] = df_4['price'].fillna(mean_value)
# Same with the median
median_value = df_4['price'].median()
df_4['price_median'] = df_4['price'].fillna(median_value)
print(df_4)

  size  price   color    class   boh  price_mean  price_median
0  XXL    8.0   black  class 1  22.0         8.0           8.0
1    L    NaN    gray  class 2  20.0         9.0           9.0
2   XL   10.0    blue  class 2  19.0        10.0          10.0
3    M    NaN  orange  class 1  17.0         9.0           9.0
4    M   11.0   green  class 3   NaN        11.0          11.0
5    M    7.0     red  class 1  22.0         7.0           7.0



### Forward fill and backward fill (can be used according to business problem)
Forward filling replaces a missing value with the previous observed data point. Backward filling replaces it with the next observed data point.

In [10]:
# Creating the Series 
sr = pd.Series([100, None, None, 18, 65, None, 32, 10, 5, 24, 60]) 

# Create the Index 
index_ = pd.date_range('2010-10-09', periods = 11, freq ='M')   

# set the index
sr.index = index_   

# Print the series
print('Series  :\n',sr) 


Series  :
 2010-10-31    100.0
2010-11-30      NaN
2010-12-31      NaN
2011-01-31     18.0
2011-02-28     65.0
2011-03-31      NaN
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64


In [11]:
# ffill() is equivalent to fillna(method='ffill'), which recent pandas releases deprecate
result = sr.ffill()
print('Series after forward fill :\n',result)


Series after forward fill :
 2010-10-31    100.0
2010-11-30    100.0
2010-12-31    100.0
2011-01-31     18.0
2011-02-28     65.0
2011-03-31     65.0
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64


In [12]:
# bfill() is equivalent to fillna(method='bfill'), which recent pandas releases deprecate
result = sr.bfill()
print('Series after backward fill :\n',result)


Series after backward fill :
 2010-10-31    100.0
2010-11-30     18.0
2010-12-31     18.0
2011-01-31     18.0
2011-02-28     65.0
2011-03-31     32.0
2011-04-30     32.0
2011-05-31     10.0
2011-06-30      5.0
2011-07-31     24.0
2011-08-31     60.0
Freq: M, dtype: float64
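
Two details of forward and backward filling are easy to miss: a forward fill cannot fill a gap at the very start of a series, and a long gap gets filled with one increasingly stale value. The sketch below uses a made-up series to show the `limit` argument and a forward fill followed by a backward fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# Carry each observation forward by at most one step; longer gaps stay NaN
capped = s.ffill(limit=1)

# Forward fill first, then backward fill what is left at the start
both = s.ffill().bfill()

print(capped.tolist())
print(both.tolist())
```

Whether a capped fill or a combined fill is appropriate depends on the business problem, as noted in the heading of this section.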



### Nearest neighbors imputation
KNN imputation can be used for continuous, discrete, ordinal and categorical data, which makes it particularly useful for dealing with all kinds of missing data. The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points closest to it, based on the other variables. <br><br>The distance metric varies according to the type of data:
1. **Continuous Data**: The commonly used distance metrics for continuous data are Euclidean, Manhattan and Cosine.
2. **Categorical Data**: Hamming distance is generally used in this case. It takes all the categorical attributes and counts the ones on which two points have different values.

<br>**Note:** scikit-learn's `KNNImputer`, used in the cells that follow, works on numeric arrays with a NaN-aware Euclidean distance, so categorical columns have to be encoded as numbers first.
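
For categorical data, the Hamming distance and a nearest-neighbor fill can be sketched in a few lines of plain Python. The records, the attribute to impute and the helper name `hamming_distance` are all hypothetical; this is not how any particular library implements it.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical complete records: ((size, class), color)
known = [
    (("M", "class 1"), "red"),
    (("XL", "class 2"), "blue"),
    (("M", "class 3"), "green"),
]

# Record whose color is missing
query = ("M", "class 1")

# Borrow the color of the closest complete record (k = 1)
nearest_features, nearest_color = min(
    known, key=lambda rec: hamming_distance(rec[0], query)
)
print(nearest_color)
```

With more than one neighbor, the most frequent category among the k closest records would be used instead of a single borrowed value.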

In [13]:
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
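
The imputed values 4 and 5.5 in the output above can be traced by hand. The sketch below computes the NaN-aware Euclidean distances with `nan_euclidean_distances` and averages the two nearest rows that have the needed column. The helper `knn_fill` is a hypothetical name, and it only mirrors uniform weighting on this small example, not every detail of `KNNImputer`.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]], dtype=float)

# Pairwise distances that skip coordinates missing in either row
dist = nan_euclidean_distances(X)

def knn_fill(row, col, k=2):
    """Average column `col` over the k nearest rows that have a value there."""
    donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
    nearest = sorted(donors, key=lambda i: dist[row, i])[:k]
    return X[nearest, col].mean()

fill_row0 = knn_fill(0, 2)  # neighbors are rows 1 and 2: mean of 3 and 5
fill_row2 = knn_fill(2, 0)  # neighbors are rows 1 and 3: mean of 3 and 8
print(fill_row0, fill_row2)
```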

In [14]:
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [15]:
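# KNN Imputer for a dataframe column
# 'boh' was already filled by the mean imputation above, so nothing is left to impute and the output is unchanged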
import numpy as np
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
q = imputer.fit_transform(df[['boh']])
df['boh'] = q
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0


In [16]:
q = imputer.fit_transform(df[['price']])
df['price'] = q
df

Unnamed: 0,size,price,color,class,boh
0,XXL,8.0,black,class 1,22.0
1,L,9.0,gray,class 2,20.0
2,XL,10.0,blue,class 2,19.0
3,M,9.0,orange,class 1,17.0
4,M,11.0,green,class 3,20.0
5,M,7.0,red,class 1,22.0



### MultiOutputRegressor


This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression.

In [17]:
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
X, y = load_linnerud(return_X_y=True)
print(X)
print(y)
# Fit one Ridge regressor per target column of y
clf = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)
# Predict all three targets for the first sample
pred = clf.predict(X[[0]])

[[  5. 162.  60.]
 [  2. 110.  60.]
 [ 12. 101. 101.]
 [ 12. 105.  37.]
 [ 13. 155.  58.]
 [  4. 101.  42.]
 [  8. 101.  38.]
 [  6. 125.  40.]
 [ 15. 200.  40.]
 [ 17. 251. 250.]
 [ 17. 120.  38.]
 [ 13. 210. 115.]
 [ 14. 215. 105.]
 [  1.  50.  50.]
 [  6.  70.  31.]
 [ 12. 210. 120.]
 [  4.  60.  25.]
 [ 11. 230.  80.]
 [ 15. 225.  73.]
 [  2. 110.  43.]]
[[191.  36.  50.]
 [189.  37.  52.]
 [193.  38.  58.]
 [162.  35.  62.]
 [189.  35.  46.]
 [182.  36.  56.]
 [211.  38.  56.]
 [167.  34.  60.]
 [176.  31.  74.]
 [154.  33.  56.]
 [169.  34.  50.]
 [166.  33.  52.]
 [154.  34.  64.]
 [247.  46.  50.]
 [193.  36.  46.]
 [202.  37.  62.]
 [176.  37.  54.]
 [157.  32.  52.]
 [156.  33.  54.]
 [138.  33.  68.]]


In [18]:
pred

array([[176.16484296,  35.0548407 ,  57.09000136]])
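
The "one regressor per target" strategy can be checked directly: the fitted wrapper exposes its per-target models in `estimators_`, and fitting a separate `Ridge` on each column of `y` by hand gives the same predictions. This is a sketch on the same Linnerud data; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

X, y = load_linnerud(return_X_y=True)
multi = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)

# One fitted Ridge model per target column
n_models = len(multi.estimators_)
per_target = np.column_stack([est.predict(X) for est in multi.estimators_])

# Manual equivalent: fit an independent Ridge on each column of y
manual = np.column_stack([
    Ridge(random_state=123).fit(X, y[:, j]).predict(X)
    for j in range(y.shape[1])
])

print(n_models, np.allclose(per_target, manual))
```

In an imputation setting, the targets would be the columns that contain missing values and the features the columns that are complete.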


### IterativeImputer

`IterativeImputer` is a multivariate imputer that estimates each feature from all the others. It imputes missing values by modeling each feature with missing values as a function of the other features, in a round-robin fashion.

In [19]:
import numpy as np
# IterativeImputer is still experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
imp_mean = IterativeImputer(random_state=0)
# Learn how each feature depends on the others
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X)

array([[ 6.95847623,  2.        ,  3.        ],
       [ 4.        ,  2.6000004 ,  6.        ],
       [10.        ,  4.99999933,  9.        ]])
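
The round-robin idea can be sketched by hand: start from a mean fill, then repeatedly re-predict each column's missing entries from the other columns. This is a simplified illustration with `LinearRegression` on made-up data that follows y = 2x. The real `IterativeImputer` uses `BayesianRidge` by default and differs in further details, so this is not its implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x, with one value missing in each column
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0]])
missing = np.isnan(X)

# Step 1: initial guess, every gap gets its column mean
filled = np.where(missing, np.nanmean(X, axis=0), X)

# Step 2: visit the columns in turn and refine their gaps, several rounds
for _ in range(25):
    for col in range(X.shape[1]):
        rows = missing[:, col]
        if not rows.any():
            continue
        others = np.delete(filled, col, axis=1)
        # Train on rows where this column was observed, predict its gaps
        model = LinearRegression().fit(others[~rows], filled[~rows, col])
        filled[rows, col] = model.predict(others[rows])

print(filled.round(3))
```

On this toy data the two gaps move from their mean fills toward the values implied by the linear relationship, 8 and 5, while the observed entries stay untouched.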


## Time-Series Specific Methods
1. **Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)**
<br>This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing. Longitudinal data track the same sample at different points in time. Both methods can introduce bias into the analysis and perform poorly when the data has a visible trend.
2. **Data without trend and seasonality**
<br>Mean, median, mode and random sample imputation can be used.
3. **Linear Interpolation**
<br>This method works well for a time series with some **trend** but is not suitable for **seasonal data**.
4. **Seasonal Adjustment + Linear Interpolation**
<br>This method works well for data with both **trend and seasonality**.
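
The last two methods can be sketched with `Series.interpolate`. The series below are synthetic, and for the seasonal case the seasonal pattern is simply assumed to be known; in practice it would have to be estimated first, for example with a seasonal decomposition.

```python
import numpy as np
import pandas as pd

# Trend only: linear interpolation recovers the gaps
trend = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, np.nan, 7.0])
trend_filled = trend.interpolate(method="linear")

# Trend plus a season of period 4 (pattern assumed known for illustration)
pattern = np.array([0.0, 5.0, 0.0, -5.0])
t = np.arange(12)
seasonal = pd.Series(pattern[t % 4])
full = pd.Series(t.astype(float)) + seasonal

observed = full.copy()
observed[5] = np.nan  # the true value at this position is 10.0

# Plain interpolation ignores the seasonal peak
naive = observed.interpolate(method="linear")

# Remove the season, interpolate the trend, then add the season back
adjusted = (observed - seasonal).interpolate(method="linear") + seasonal

print(trend_filled.tolist())
print(naive[5], adjusted[5])
```

Plain interpolation misses the seasonal peak at the gap, while the seasonally adjusted version recovers it.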
