# Data Preprocessing Pipeline in Python

This pipeline demonstrates a systematic approach to preprocessing a dataset using `pandas`, `numpy`, and `scikit-learn`. It handles missing values, detects and replaces outliers, and normalizes numeric features.

### Steps:

1. **Identify Numeric and Categorical Features**
   - Use `pandas` to separate numeric and categorical columns:
     - Numeric: Columns with data types `float` or `int`.
     - Categorical: Columns with the data type `object`.

2. **Handle Missing Values**
   - For numeric features, replace missing values with the column mean.
   - For categorical features, replace missing values with the most frequent value (mode). If several values tie for most frequent, the first value returned by `mode()` is used.

3. **Detect and Handle Outliers in Numeric Features**
   - Use the **Interquartile Range (IQR)** to identify outliers:
     - Calculate the first quartile (`Q1`) and third quartile (`Q3`).
     - Define the lower and upper bounds as:
       - Lower bound = `Q1 - 1.5 * IQR`
       - Upper bound = `Q3 + 1.5 * IQR`
     - Replace any value outside these bounds with the column mean. The mean is computed before replacement, so it still includes the outliers.

4. **Normalize Numeric Features**
   - Use `StandardScaler` from `scikit-learn` to standardize numeric features so that each one has a mean of 0 and a standard deviation of 1. `StandardScaler` uses the population standard deviation (`ddof=0`).

5. **Return the Processed Data**
   - The processed dataset is returned with missing values addressed, outliers handled, and numeric features normalized.
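
To make the IQR rule from step 3 concrete, here is a minimal sketch that applies it to six illustrative values. The variable names and the values are assumptions chosen for illustration. It relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Illustrative values; 50 sits far away from the rest
values = pd.Series([7, 8, 9, 10, 11, 50], dtype="float64")

q1 = values.quantile(0.25)      # first quartile
q3 = values.quantile(0.75)      # third quartile
iqr = q3 - q1                   # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside [lower_bound, upper_bound]
is_outlier = (values < lower_bound) | (values > upper_bound)

# Keep non-outliers as they are; replace outliers with the column mean
cleaned = values.where(~is_outlier, values.mean())

print(f"bounds: [{lower_bound}, {upper_bound}]")
print("outliers:", values[is_outlier].tolist())
```

With these values the bounds come out at 4.5 and 14.5, so only 50 is flagged and replaced.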

### Example Implementation
#### Below is a complete example of a preprocessing pipeline:


In [6]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Identify numeric and categorical features
    numeric_features = data.select_dtypes(include=['float', 'int']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Handle missing values in numeric features (fill with the column mean)
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Handle missing values in categorical features (fill with the mode;
    # on a tie, the first value returned by mode() is used)
    if len(categorical_features) > 0:
        data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    # Detect and handle outliers in numeric features using IQR
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        is_outlier = (data[feature] < lower_bound) | (data[feature] > upper_bound)
        # Replace outliers with the column mean (computed with the outliers included)
        data[feature] = np.where(is_outlier, data[feature].mean(), data[feature])

    # Normalize numeric features (fit and transform in a single pass)
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    return data

#### Let’s have a look at the sample data first:

In [8]:
data = pd.read_csv("data.csv")

print("Original Data:")
print(data)

Original Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C
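
`data.csv` itself is not shown here. To reproduce the example without the file, one option is to build an equivalent DataFrame by hand from the table printed above. This is a sketch that assumes the CSV holds exactly those six rows and nothing else.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for data.csv (assumption: the file contains exactly these rows)
data = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

# Count missing values per column before preprocessing
print(data.isna().sum())
```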


#### Here’s how to run the pipeline on the sample data:

In [10]:
#Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data:")
print(cleaned_data)

Preprocessed Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C
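
As a quick sanity check on the normalization step, the sketch below standardizes `NumericFeature1` by hand and compares the result with `StandardScaler`. The input array is the column after mean imputation (the missing entry becomes 3.6); the variable names are illustrative. `StandardScaler` divides by the population standard deviation (`ddof=0`), which matches NumPy's default, whereas `Series.std()` in pandas defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# NumericFeature1 from the sample data, with the missing value filled by the mean (3.6)
x = np.array([1.0, 2.0, 3.6, 4.0, 5.0, 6.0])

# Manual z-score using the population standard deviation (NumPy's default, ddof=0)
z_manual = (x - x.mean()) / x.std()

# Same computation via StandardScaler, which expects a 2-D array
z_scaler = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(np.allclose(z_manual, z_scaler))
print(np.round(z_manual, 6))
```

The rounded values should line up with the `NumericFeature1` column in the preprocessed output above.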
