<a href="https://colab.research.google.com/github/ashwin1410-stack/EDA-PROJECTS/blob/main/DATA_PREPROCESSING_PIPELINE_USING_PYTHON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA PREPROCESSING PIPELINE USING PYTHON

A Data Preprocessing pipeline should be able to handle missing values, standardize numerical features, remove outliers, and ensure easy replication of preprocessing steps on new datasets. Now, here’s how to create a Data Preprocessing pipeline using Python based on the fundamental functions that every pipeline should perform while preprocessing any dataset:

In [38]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [59]:
def data_preprocessing_pipeline(data):
  #indentify numeric and categorical feature
  numeric_features = data.select_dtypes(include=['float' , 'int']).columns
  categorical_features = data.select_dtypes(include=['object']).columns

  #handling missing values in numeric feature
  data[numeric_features] = data[numeric_features] .fillna(data[numeric_features].mean())

  #detect and handling outliers using IQR
  for feature in numeric_features:
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    higher_bound = Q3 -(1.5 * IQR)
    data[feature]= np.where((data[feature] < lower_bound) | (data[feature] > higher_bound),
                                 data[feature].mean(), data[feature])
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data[numeric_features])
    data[numeric_features]=scaler.transform(data[numeric_features])

    #handling missing value of categorical feature
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data

This pipeline is designed to handle various preprocessing tasks on any given dataset. Let’s explore how each step in the pipeline contributes to the overall preprocessing process:

1) The pipeline begins by identifying the numeric and categorical features in the dataset.
Next, the pipeline addresses any missing values present in the numeric features. It fills these missing values with the mean value of each respective numeric feature (you can modify this step according to your desired way of filling in missing values of a numerical feature). It ensures that missing data does not hinder subsequent analysis and computations.\
2) The pipeline then identifies and handles outliers within the numeric features using the Interquartile Range (IQR) method. Calculating the quartiles and the IQR determines upper and lower boundaries for outliers. Any values outside these boundaries are replaced with the mean value of the respective numeric feature. This step helps prevent the influence of extreme values on subsequent analyses and model building.\
3) After handling missing values and outliers, the pipeline normalizes the numeric features. This process ensures that all numeric features contribute equally to subsequent analysis, avoiding biases caused by varying magnitudes.\
4) The pipeline proceeds to handle missing values in the categorical features. It fills these missing values with the mode value, representing the most frequently occurring category.

**By following this pipeline, data professionals can automate and streamline the process of preparing data for analysis, ensuring data quality, reliability, and consistency.**

In [60]:
data = pd.read_csv("data.csv")

print("ORIGINAL DAta")
data

ORIGINAL DAta


Unnamed: 0,NumericFeature1,NumericFeature2,CategoricalFeature
0,1.0,7,A
1,2.0,8,B
2,,9,
3,4.0,10,A
4,5.0,11,B
5,6.0,50,C


**And here’s how you can use this pipeline to perform all the preprocessing steps using Python:**

In [61]:
cleaned_data = data_preprocessing_pipeline(data)
print("PREPROCESSED DATA")
cleaned_data

PREPROCESSED DATA


Unnamed: 0,NumericFeature1,NumericFeature2,CategoricalFeature
0,-2.236068,-0.576053,A
1,0.447214,-0.510839,B
2,0.447214,-0.445626,A
3,0.447214,-0.380412,A
4,0.447214,-0.315199,B
5,0.447214,2.228129,C


# SUMMARY

**Data Preprocessing involves transforming and manipulating raw data to improve its quality, consistency, and relevance for analysis. A data preprocessing pipeline is a systematic and automated approach that combines multiple preprocessing steps into a cohesive workflow. It serves as a roadmap for data professionals, guiding them through the transformations and calculations needed to cleanse and prepare data for analysis.**