## Data Preprocessing Common Steps:

**Basic Steps:**
1. Dataset Import.
2. Basic Analysis & Feature Selection.
3. Replacing missing values using scikit-learn (imputation of missing values).
4. Transforming categorical features for ML algorithms (categorical data) -> OneHotEncoding, LabelEncoding, etc.
5. Preprocessing [Building X, Y vectors].
6. Preparing the data [Train/Test Splitting].
7. Feature Scaling [Normalization / Standardization].
8. Applying the same scale to both train & test data.

## Scikit-learn library:

- We use the scikit-learn library's `preprocessing` module (together with `sklearn.impute` and `sklearn.model_selection`) to perform the data preprocessing operations listed above.

In [1]:
# importing modules...
import numpy as np, pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

### 1. Basic Analysis and Feature Selection:

- This step covers the basic analysis of the raw imported data. Here we choose the relevant features by analysing the data with visualizations such as heatmaps (of correlations) and clustermaps.
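As a rough illustration of correlation-based feature selection, the sketch below builds a small synthetic DataFrame (the column names `f1`, `f2`, `noise`, `target` and the `0.1` threshold are assumptions made up for this example), computes the correlation matrix that a heatmap would display, and ranks features by their absolute correlation with the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'target' depends strongly on 'f1', weakly on 'f2',
# and not at all on 'noise'.
rng = np.random.default_rng(0)
n = 200
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
toy = pd.DataFrame({
    "f1": f1,
    "f2": f2,
    "noise": rng.normal(size=n),
    "target": 3.0 * f1 + 0.5 * f2 + rng.normal(scale=0.1, size=n),
})

# Pairwise Pearson correlations. This matrix is what a heatmap would show,
# e.g. sns.heatmap(corr, annot=True).
corr = toy.corr()

# Rank the features by absolute correlation with the target.
ranking = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)

# Keep features above an arbitrary, illustrative threshold.
selected = ranking[ranking > 0.1].index.tolist()
print(selected)
```

Correlation only captures linear relationships, so a ranking like this is a starting point for feature selection rather than a final answer.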

### 2. Imputation of missing values using Scikit-learn:
- We also clean the data by removing or replacing null values, dropping irrelevant columns, and so on, depending on the use case. Missing values can be imputed as follows:

In [19]:
from sklearn.impute import SimpleImputer

In [20]:
# Sample Data:

df = pd.DataFrame(
    {
        'A' : [1,3,4,5, np.nan],
        'B' : [43, 64, np.nan, 23, 78],
        'C' : [4.5, np.nan, 6.51, np.nan, 9.44]
    }
)
df

|   | A   | B    | C    |
|---|-----|------|------|
| 0 | 1.0 | 43.0 | 4.5  |
| 1 | 3.0 | 64.0 | NaN  |
| 2 | 4.0 | NaN  | 6.51 |
| 3 | 5.0 | 23.0 | NaN  |
| 4 | NaN | 78.0 | 9.44 |


In [21]:
# Initializing and fitting the data to the imputer...
imp = SimpleImputer(
    missing_values = np.nan, 
    strategy = 'mean'
)

# We will fit above sample data to replace the null values with the mean...
imp.fit(
    df
)

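- `SimpleImputer` supports several strategies for filling the missing values, selected with `strategy='strategy_name'`:

1. 'mean' (column mean; numeric data only)
2. 'median' (column median; numeric data only)
3. 'most_frequent' (most common value in the column)
4. 'constant' (a fixed value passed through `fill_value`)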
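The strategies listed above can be compared on a small frame. This is a minimal sketch; the `sample` DataFrame and its column `C` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample with one missing value per column.
sample = pd.DataFrame({
    "A": [1, 3, 4, 5, np.nan],
    "B": [43, 64, np.nan, 23, 78],
    "C": [2, 2, 7, np.nan, 2],
})

# 'median' is less sensitive to outliers than 'mean'.
median_imp = SimpleImputer(strategy="median")
median_filled = median_imp.fit_transform(sample)

# 'most_frequent' fills with the mode of each column.
mode_imp = SimpleImputer(strategy="most_frequent")
mode_filled = mode_imp.fit_transform(sample)

# 'constant' fills every gap with the value given in fill_value.
const_imp = SimpleImputer(strategy="constant", fill_value=0)
const_filled = const_imp.fit_transform(sample)

# statistics_ holds the fill value learned for each column during fit.
print(median_imp.statistics_)
print(mode_imp.statistics_)
```

Inspecting `statistics_` after fitting is a quick way to confirm which value each column will be filled with.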

In [22]:
# Transform the data: each NaN is replaced by the mean of its column.
# The array is transposed below, so each row of the result is one original column.
imputed_arr = imp.transform(df)
imputed_df = pd.DataFrame(
    imputed_arr.T, 
    index = df.columns
)
imputed_df

|   | 0    | 1        | 2    | 3        | 4    |
|---|------|----------|------|----------|------|
| A | 1.0  | 3.0      | 4.0  | 5.0      | 3.25 |
| B | 43.0 | 64.0     | 52.0 | 23.0     | 78.0 |
| C | 4.5  | 6.816667 | 6.51 | 6.816667 | 9.44 |


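There are more imputers available, for example:

- `KNNImputer(n_neighbors=N, weights='uniform')` (imputation based on the K-nearest-neighbours algorithm)

- `MissingIndicator(missing_values=-1)` (fills nothing; it returns a boolean mask of the entries equal to `missing_values`, by default only for the features that contain missing values)

- `IterativeImputer` (still experimental; it requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported) & many more...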
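A minimal sketch of two of the imputers above, applied to the same kind of sample data used earlier. The choice of `n_neighbors=2` is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, MissingIndicator

sample = pd.DataFrame({
    "A": [1, 3, 4, 5, np.nan],
    "B": [43, 64, np.nan, 23, 78],
    "C": [4.5, np.nan, 6.51, np.nan, 9.44],
})

# Each missing entry is filled with the average of that feature over the
# 2 nearest rows (distance is computed on the features both rows have).
knn = KNNImputer(n_neighbors=2, weights="uniform")
knn_filled = knn.fit_transform(sample)
print(knn_filled)

# MissingIndicator does not impute; it reports where values are missing.
indicator = MissingIndicator(missing_values=np.nan)
mask = indicator.fit_transform(sample)
print(mask)
```

The boolean mask can be kept as extra features next to the imputed data when the fact that a value was missing may itself be informative.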

### 3. One-Hot & Label Encoding:

- Performed on categorical data/features.
- e.g.: Male (0) & Female (1)
- e.g.: Cat (0), Dog (1), Lion(2)

**We use the `OneHotEncoder` class from `sklearn.preprocessing`:**

In [23]:
from sklearn.preprocessing import OneHotEncoder

In [24]:
# Sample Data:

df = pd.DataFrame(
    {
        'id' : [1,2,3,4,5,6,7],
        'features' : [23,43,55,56,34,23,43],
        'Gender' : ['M', 'F', 'F', 'M', 'M', 'M', 'F'],
        
    }
)
df


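|   | id | features | Gender |
|---|----|----------|--------|
| 0 | 1  | 23       | M      |
| 1 | 2  | 43       | F      |
| 2 | 3  | 55       | F      |
| 3 | 4  | 56       | M      |
| 4 | 5  | 34       | M      |
| 5 | 6  | 23       | M      |
| 6 | 7  | 43       | F      |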


In [None]:
# Let's encode the gender data...

encoder = OneHotEncoder(max_categories = 2)

encoded_data = encoder.fit_transform(df[['Gender']]).toarray()
encoded_data  # one column per category; in the output below, column 0 flags 'M' and column 1 flags 'F'

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

**We use the `LabelEncoder` class from `sklearn.preprocessing`:**

In [95]:
# Sample Data:

df = sns.load_dataset('iris')
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [99]:
# Label Encoder:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['encoded_species'] = encoder.fit_transform(df['species'])
df['encoded_species'].unique()

array([0, 1, 2])
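The mapping between labels and integer codes is stored on the fitted encoder, so codes can be turned back into the original labels. A minimal sketch on a short hand-written list (not the full iris data):

```python
from sklearn.preprocessing import LabelEncoder

species = ["setosa", "versicolor", "virginica", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted; the position of a label in it is its integer code.
print(encoder.classes_)
print(codes)

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

This is handy after prediction, when a model returns integer codes that need to be reported as class names.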

### 4. Building X, Y vectors:

- Here, we divide the data into two parts as follows:
1. X -> Input features for the model.
2. Y -> Target (the original output the model should learn to predict).

- This is done directly with pandas, by selecting or dropping columns.

### 5. Feature Scaling:

Scaling brings the features onto comparable ranges. This reduces the computing cost and often improves performance, particularly for algorithms based on gradients or distances.

**1. Min-Max Scaling:**

- `sklearn.preprocessing.MinMaxScaler()`
- X_scaled = ( X_i - X_min ) / (X_max - X_min)
- X_scaled range: [0, 1]

**2. Normalization:**

- `sklearn.preprocessing.Normalizer()`
- Rescales each sample (row), not each feature (column), to unit norm: X_normalized = X_i / ||X_i|| (L2 norm by default)
- range : [-1, 1]

**3. Standardization:**

- `sklearn.preprocessing.StandardScaler()`
- X_scaled = ( X_i - X_mean) / std_deviation


In [103]:
# Sample Data:

X_train = pd.DataFrame(
    {
        'feature_1': [1,-1,2],
        'feature_2': [2,0,0],
        'feature_3': [0,1,-1],
    }
)
X_train

|   | feature_1 | feature_2 | feature_3 |
|---|-----------|-----------|-----------|
| 0 | 1         | 2         | 0         |
| 1 | -1        | 0         | 1         |
| 2 | 2         | 0         | -1        |


In [110]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Min-Max Scaled Data:
scaler.fit_transform(X_train)

array([[0.66666667, 1.        , 0.5       ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ]])

In [111]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
# Normalized Data
scaler.fit_transform(X_train)

array([[ 0.4472136 ,  0.89442719,  0.        ],
       [-0.70710678,  0.        ,  0.70710678],
       [ 0.89442719,  0.        , -0.4472136 ]])

In [113]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Standardized Data...
scaler.fit_transform(X_train)

array([[ 0.26726124,  1.41421356,  0.        ],
       [-1.33630621, -0.70710678,  1.22474487],
       [ 1.06904497, -0.70710678, -1.22474487]])
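To connect the three outputs above with their formulas, the sketch below recomputes each transform by hand with NumPy on the same 3x3 sample and compares it with scikit-learn's result. The `manual_*` variable names are only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 2.0, 0.0],
              [-1.0, 0.0, 1.0],
              [2.0, 0.0, -1.0]])

# Min-max scaling works per column (feature).
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization works per column and uses the population
# standard deviation (ddof=0, NumPy's default).
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer works per row (sample): divide each row by its L2 norm.
manual_normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

checks = {
    "minmax": np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)),
    "standard": np.allclose(manual_standard, StandardScaler().fit_transform(X)),
    "normalizer": np.allclose(manual_normalized, Normalizer().fit_transform(X)),
}
print(checks)
```

The key difference is the axis: `MinMaxScaler` and `StandardScaler` look down each column, while `Normalizer` looks across each row.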

### 6. Prepare the Data For Model:

- Here we split the data into training and test sets, with the test fraction chosen to suit our requirements.

In [115]:
# Sample Data:
df.head(5)

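|   | sepal_length | sepal_width | petal_length | petal_width | species | encoded_species |
|---|--------------|-------------|--------------|-------------|---------|-----------------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | setosa  | 0               |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | setosa  | 0               |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | setosa  | 0               |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | setosa  | 0               |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | setosa  | 0               |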


In [120]:
X = df.drop(['species', 'encoded_species'], axis=1)
y = df['encoded_species']

In [123]:
# Let's Split the above data into train and test using train_test_split:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state = 101,
    test_size = 0.3,
    shuffle=True
)
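For classification data it can help to keep the class proportions equal in both parts of the split. The sketch below shows the optional `stratify` argument of `train_test_split` on a synthetic stand-in for the iris frame; `X_demo`, `y_demo` and the column names are hypothetical and chosen only to mimic 150 rows with 3 balanced classes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 rows, 4 features, 3 balanced classes.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=["f1", "f2", "f3", "f4"])
y_demo = pd.Series(np.repeat([0, 1, 2], 50), name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    random_state=101,
    stratify=y_demo,  # preserve the class proportions in both splits
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index())
```

Without `stratify`, a random split can leave one class under-represented in the test set, which matters more as the dataset gets smaller or more imbalanced.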

In [126]:
print('X_train : ')
print(X_train.head(2))
print('')
print('X_test : ')
print(X_test.head(2))
print('')
print('y_train : ')
print(y_train.head(2))
print('')
print('y_test : ')
print(y_test.head(2))

X_train : 
     sepal_length  sepal_width  petal_length  petal_width
13            4.3          3.0           1.1          0.1
102           7.1          3.0           5.9          2.1

X_test : 
    sepal_length  sepal_width  petal_length  petal_width
33           5.5          4.2           1.4          0.2
16           5.4          3.9           1.3          0.4

y_train : 
13     0
102    2
Name: encoded_species, dtype: int32

y_test : 
33    0
16    0
Name: encoded_species, dtype: int32
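The last basic step, applying the same scale to both train and test data, can be sketched as follows: the scaler is fitted on the training split only and then reused, unchanged, on the test split. The data here is synthetic and the `*_demo`, `X_tr`, `X_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test data

print(X_tr_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_te_scaled.mean(axis=0))  # near 0, but generally not exactly 0
```

Calling `fit` (or `fit_transform`) on the test data would leak information about the test set into preprocessing, so only `transform` is used there.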


All of the steps performed above are among the most common parts of data preprocessing. Beyond these, there are many more steps, which depend on the type of data and the use case.