In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
    """Impute missing values, replace IQR outliers, and standardize numeric columns."""
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Identify numeric and categorical features
    numeric_features = data.select_dtypes(include='number').columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Fill missing numeric values with the column mean
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Replace values outside the 1.5 * IQR fences with the column mean
    # (the mean is computed before replacement, so it still includes the outliers)
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        is_outlier = (data[feature] < lower_bound) | (data[feature] > upper_bound)
        data[feature] = np.where(is_outlier, data[feature].mean(), data[feature])

    # Standardize numeric features to zero mean and unit variance
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Fill missing categorical values with the mode (the first one if several are tied)
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data

In [2]:
data = pd.read_csv("data.csv")

print("Original dataset: ")
print(data)

Original dataset: 
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C


In [3]:
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed dataset: ")
print(cleaned_data)

Preprocessed dataset: 
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C


# Data Preprocessing Explanation

Comparing the original and preprocessed datasets shows which transformations `data_preprocessing_pipeline` applied. They are explained below.

## 1. Handling Missing Values

- **NumericFeature1**: The `NaN` in the third row (index 2) was filled with the column mean, 3.6, computed from the five non-missing values. Because the filled value equals the mean, it becomes exactly 0 after standardization.
- **CategoricalFeature**: The missing value in the third row was replaced with 'A'. In this column 'A' and 'B' are tied as the most frequent category (two occurrences each); `mode()` returns tied values in sorted order, so `.iloc[0]` selects 'A'.
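
The two imputation rules can be reproduced on the example values. This is a minimal sketch that rebuilds the two affected columns inline instead of reading `data.csv`; the names `df`, `numeric_fill`, `modes`, and `categorical_fill` are illustrative and not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Same values as the example dataset above
df = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

# mean() skips NaN, so it averages the five known values
numeric_fill = df["NumericFeature1"].mean()

# mode() ignores NaN and returns every tied value in sorted order,
# so both "A" and "B" appear and .iloc[0] selects "A"
modes = df["CategoricalFeature"].mode()
categorical_fill = modes.iloc[0]

imputed = df.fillna({
    "NumericFeature1": numeric_fill,
    "CategoricalFeature": categorical_fill,
})

print(numeric_fill)
print(modes.tolist())
print(imputed)
```

Printing `modes` makes the tie visible, which the preprocessed output alone does not show.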

## 2. Normalization/Standardization of Numeric Features

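Both numeric columns (`NumericFeature1` and `NumericFeature2`) have been standardized with `StandardScaler`, which applies two steps to each column:
- Subtract the column mean from each entry.
- Divide the result by the column's standard deviation (the population standard deviation, `ddof=0`).

Each numeric feature then has a mean of 0 and a standard deviation of 1, so features measured on different scales contribute comparably to scale-sensitive models.

Standardization runs after the IQR outlier step, which also changed the data: the value 50 in `NumericFeature2` lies outside the fences [4.5, 14.5] and was replaced by the column mean (about 15.83, computed with the 50 still included) before scaling. No value in `NumericFeature1` was flagged.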
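
As a rough check, the `NumericFeature2` column can be reproduced by hand, including the pipeline's IQR outlier step that runs before scaling. The sketch below uses plain pandas in place of `StandardScaler`; the variable names are illustrative, and `ddof=0` is chosen to match the population standard deviation that `StandardScaler` uses.

```python
import pandas as pd

# NumericFeature2 from the original dataset
values = pd.Series([7, 8, 9, 10, 11, 50], dtype="float64")

# IQR fences, as in the pipeline's outlier step
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are replaced by the column mean, which is
# computed before replacement and therefore still includes the outlier
is_outlier = (values < lower) | (values > upper)
cleaned = values.where(~is_outlier, values.mean())

# z-score using the population standard deviation (ddof=0)
z = (cleaned - cleaned.mean()) / cleaned.std(ddof=0)

print(lower, upper)
print(cleaned.tolist())
print(z.round(6).tolist())
```

The fences come out as 4.5 and 14.5, so only the value 50 is replaced, and the resulting z-scores should agree with the `NumericFeature2` column in the preprocessed output.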

## 3. No Change in Categorical Features Except for Missing Value Imputation

- The categorical values are unchanged apart from the imputed value. The pipeline does not encode them (for example, by converting categories to numeric indicator columns), a step that many machine learning workflows apply before modeling.
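
For readers who want to see what such an encoding step could look like, here is a minimal sketch using one-hot encoding with `pd.get_dummies`. This is not part of the pipeline above; the choice of one-hot encoding and the name `encoded` are assumptions made for illustration.

```python
import pandas as pd

# Categorical column after imputation, as in the preprocessed output
df = pd.DataFrame({"CategoricalFeature": ["A", "B", "A", "A", "B", "C"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["CategoricalFeature"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so the three new columns carry the same information as the original column in numeric form.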

## Summary

These preprocessing steps prepare the dataset for statistical analysis or machine learning: missing values are filled, extreme numeric values are replaced, and the numeric features share a common scale. The categorical feature is left unencoded, so models that accept only numeric input still need an encoding step.



