## AIM

Perform data preprocessing, such as outlier detection, handling missing values, analyzing redundancy, and normalization, on different datasets.

## Introduction

All machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly. Data preprocessing therefore has two main goals:

•	Preparing the proper input dataset, compatible with the machine learning algorithm requirements.

•	Improving the performance of machine learning models.


## Importing Libraries 

In [19]:
#Importing libraries
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import pandas as pd
from sklearn import datasets 
import seaborn as sb

## Libraries Used 

NumPy is the fundamental numerical computing library for Python. Its most important feature is the N-dimensional array interface.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

Pandas is a data analysis library for Python that provides high-level data structures and a wide variety of analysis tools. One of its great features is the ability to express complex data operations in one or two commands.

Scikit-learn is a Python machine learning library built on NumPy and SciPy. It provides tools for preprocessing, modelling, and evaluation, along with bundled datasets such as the one used here.

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.


In [None]:
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names) 
df['target'] = diabetes.target
df.head()
print(diabetes.DESCR)

.. _diabetes_dataset:
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
  :Number of Instances: 442
  :Number of Attributes: First 10 columns are numeric predictive values
  :Target: Column 11 is a quantitative measure of disease progression one year
after baseline
  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by
the standard deviation times `n_samples` (i.e. the sum of squares of each column
totals 1).
Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html 


For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [None]:
print("The features of the dataset are", diabetes.feature_names)

The features of the dataset are ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3','s4', 's5', 's6']

## Data Preprocessing

### 1. Handling missing values

Imputation fills in each missing value with a substitute number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely. By default, SimpleImputer fills each gap with the mean of its column.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# Fit on the complete data so the column means are learned before values are removed
imputer.fit(X)
# Inserting missing values manually
X[0][0] = np.nan
X[2][1] = np.nan
# Dataset with missing values
print("Dataset with missing values")
print(pd.DataFrame(X, columns=diabetes.feature_names).head())
print()
# Dataset without missing values
X[:, :] = imputer.transform(X)
print("Dataset WITHOUT missing values")
print(pd.DataFrame(X, columns=diabetes.feature_names).head())
print()

Dataset with missing values
        age       sex       bmi        bp        s1        s2        s3  \
0       NaN  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412
2  0.085299       NaN  0.044451 -0.005671 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142

         s4        s5        s6
0 -0.002592  0.019908 -0.017646
1 -0.039493 -0.068330 -0.092204
2 -0.002592  0.002864 -0.025930
3  0.034309  0.022692 -0.009362
4 -0.002592 -0.031991 -0.046641

Dataset WITHOUT missing values
            age           sex       bmi        bp        s1        s2  \
0 -3.634285e-16  5.068012e-02  0.061696  0.021872 -0.044223 -0.034821
1 -1.882017e-03 -4.464164e-02 -0.051474 -0.026328 -0.008449 -0.019163
2  8.529891e-02  1.308343e-16  0.044451 -0.005671 -0.045599 -0.034194
3 -8.906294e-02 -4.464164e-02 -0.011595 -0.036656  0.012191  0.024991
4  5.383060e-03 -4.464164e-02 -0.036385  0.021872  0.003935  0.015596

         s3        s4        s5        s6
0 -0.043401 -0.002592  0.019908 -0.017646
1  0.074412 -0.039493 -0.068330 -0.092204
2 -0.032356 -0.002592  0.002864 -0.025930
3 -0.036038  0.034309  0.022692 -0.009362
4  0.008142 -0.002592 -0.031991 -0.046641
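
The imputed entries in the output above are tiny numbers (on the order of 1e-16) because every feature in this dataset is mean centered, so each column mean is effectively zero. The following minimal sketch uses a small made-up array (the `toy` values are hypothetical, not part of the diabetes data) to make the mean fill easier to see.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two columns, each with one missing entry.
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [6.0, 50.0],
])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = mean_imputer.fit_transform(toy)

# The fill values are the column means of the observed entries only.
print(mean_imputer.statistics_)
print(filled)
```

Here the first column is filled with (1 + 2 + 6) / 3 = 3 and the second with (10 + 30 + 50) / 3 = 30.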

### 2. Data normalisation

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization; it is needed only when features have different ranges.

In [None]:
# Normalize dataset
# Note: preprocessing.normalize rescales each row (sample) to unit L2 norm by default
from sklearn import preprocessing
normalized_X = preprocessing.normalize(X)
print("This is the dataset after normalisation")
print(pd.DataFrame(normalized_X, columns=diabetes.feature_names).head())

This is the dataset after normalisation
            age           sex       bmi        bp        s1        s2  \
0 -3.235165e-15  4.511439e-01  0.549207  0.194703 -0.393668 -0.309967
1 -1.166166e-02 -2.766158e-01 -0.318952 -0.163137 -0.052351 -0.118743
2  7.141360e-01  1.095365e-15  0.372153 -0.047475 -0.381766 -0.286282
3 -7.210986e-01 -3.614412e-01 -0.093879 -0.296789  0.098701  0.202336
4  6.276940e-02 -5.205457e-01 -0.424265  0.255044  0.045883  0.181859
         s3        s4        s5        s6
0 -0.386345 -0.023076  0.177221 -0.157082
1  0.461081 -0.244715 -0.423396 -0.571330
2 -0.270889 -0.021703  0.023976 -0.217093
3 -0.291778  0.277782  0.183726 -0.075799
4  0.094941 -0.030227 -0.373038 -0.543858
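
The text above describes bringing columns to a common scale, while `preprocessing.normalize` rescales each row to unit length by default. The sketch below contrasts the two on a small made-up array (the `toy` values are hypothetical). `MinMaxScaler` is used here as one possible column-wise scaler; it is an assumption for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: two columns on very different scales.
toy = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Row-wise: each sample is divided by its own L2 norm (default of normalize).
row_normalized = preprocessing.normalize(toy)
print(np.linalg.norm(row_normalized, axis=1))  # every row now has length 1

# Column-wise: each feature is mapped linearly onto the range [0, 1].
col_scaled = preprocessing.MinMaxScaler().fit_transform(toy)
print(col_scaled)
```

Row-wise normalization suits cases where only the direction of each sample vector matters; column-wise scaling is the usual choice when features have different ranges.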

### 3. Outlier Detection

In statistics, an outlier is an observation that lies far from the other observations; in other words, something separate from the crowd. A boxplot shows where the median and the extremes of the values lie, and makes the outliers visible.

In [None]:
sb.boxplot(x=df['age'])

<matplotlib.axes._subplots.AxesSubplot at 0x23f1ef0cd30>

In [None]:
sb.boxplot(x=df['bmi'])

<matplotlib.axes._subplots.AxesSubplot at 0x23f1e173fa0>

### IQR Score

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.
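
As a small worked example of the definition above, the sketch below computes the IQR of a made-up series (the `values` are hypothetical) and applies the common 1.5 × IQR rule to flag points outside the fences.

```python
import pandas as pd

# Hypothetical toy series with one obvious extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(outliers.tolist())
```

With pandas' default linear interpolation, Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are -3 and 13; only the value 100 falls outside them.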

In [None]:
Q1 = df.quantile(0.25) 
Q3 = df.quantile(0.75) 
IQR = Q3 - Q1 
print(IQR)

age         0.075375
sex         0.095322
bmi         0.065477
bp          0.072300
s1          0.062606
s2          0.060203
s3          0.064429
s4          0.073802
s5          0.065682
s6          0.061096
target    124.500000
dtype: float64

In [None]:
print((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR)))

       age    sex    bmi     bp     s1     s2     s3     s4     s5     s6  \
0    False  False  False  False  False  False  False  False  False  False
1    False  False  False  False  False  False  False  False  False  False
2    False  False  False  False  False  False  False  False  False  False
3    False  False  False  False  False  False  False  False  False  False
4    False  False  False  False  False  False  False  False  False  False
..     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
437  False  False  False  False  False  False  False  False  False  False
438  False  False  False  False  False  False  False  False  False  False
439  False  False  False  False  False  False  False  False  False  False
440  False  False  False  False  False  False  False  False  False  False
441  False  False  False  False  False  False   True  False  False  False

     target
0     False
1     False
2     False
3     False
4     False
..      ...
437   False
438   False
439   False
440   False
441   False

[442 rows x 11 columns]
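
The full boolean table is hard to read at a glance. One possible summary, sketched below, counts the flagged values per column and the number of rows that contain at least one flagged value. The name `outlier_mask` is introduced here for illustration only, and the block reloads the dataset so it can run on its own.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this sketch is self-contained.
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# True where a value lies outside the 1.5 * IQR fences of its column.
outlier_mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))

# Number of flagged values in each column.
print(outlier_mask.sum())

# Number of rows containing at least one flagged value.
rows_flagged = int(outlier_mask.any(axis=1).sum())
print(rows_flagged)
```

The exact counts may vary with the scikit-learn version that supplies the data, so no expected output is shown here.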

### Removing outlier values

In [None]:
print("Shape of Dataframe BEFORE outlier correction", df.shape)

# Removing rows that contain an outlier in any column
df_outlier_corrected = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
print("Shape of Dataframe AFTER outlier correction", df_outlier_corrected.shape)

Shape of Dataframe BEFORE outlier correction (442, 11)
Shape of Dataframe AFTER outlier correction (409, 11)