# Data Preprocessing with Python

Data preprocessing is a crucial step in preparing raw data for analysis or modeling. This process includes cleaning, transforming, and standardizing data. Below, you'll find an overview and implementation of a typical preprocessing workflow in Python.

### Steps:

1. **Load the Data**
   - Use `pandas` to load and explore the dataset.
   - Check for missing values, duplicates, and overall structure:
     ```python
     import pandas as pd
     df = pd.read_csv("data.csv")  # Replace "data.csv" with your dataset
     print(df.info())  # Overview of the dataset
     print(df.head())  # First few rows of the dataset
     ```

2. **Handle Missing Values**
   - Identify missing data and decide whether to drop or fill missing values:
     ```python
     print(df.isnull().sum())  # Count missing values in each column
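     # Option 1: Remove rows with missing values
     df = df.dropna()
     # Option 2 (use instead of Option 1): fill missing values with the column mean.
     # Assigning the result back avoids the chained `inplace=True` pattern,
     # which recent pandas versions warn about.
     df["column_name"] = df["column_name"].fillna(df["column_name"].mean())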
     ```

3. **Remove Duplicates**
   - Check and remove duplicate rows to ensure data quality:
     ```python
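     print(df.duplicated().sum())  # Count fully duplicated rows
     df = df.drop_duplicates()  # Keep the first occurrence of each duplicate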
     ```

4. **Feature Scaling**
   - Normalize or standardize features to bring them to a similar scale:
     ```python
     from sklearn.preprocessing import StandardScaler
     scaler = StandardScaler()
     df[["feature1", "feature2"]] = scaler.fit_transform(df[["feature1", "feature2"]])
     ```

5. **Categorical Encoding**
   - Convert categorical variables into numeric formats using one-hot encoding or label encoding:
     ```python
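     # Option 1: One-hot encoding (replaces the column with one indicator column per category)
     df = pd.get_dummies(df, columns=["categorical_column"])

     # Option 2: Label encoding (maps each category to an integer code).
     # Apply it to a column that has not been one-hot encoded: get_dummies removes
     # the original column, so "other_categorical_column" is a separate placeholder.
     from sklearn.preprocessing import LabelEncoder
     encoder = LabelEncoder()
     df["other_categorical_column"] = encoder.fit_transform(df["other_categorical_column"])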
     ```

6. **Feature Engineering**
   - Create new features or transform existing ones to add value:
     ```python
     df["new_feature"] = df["feature1"] / df["feature2"]
     ```

7. **Split Data for Training and Testing**
   - Prepare the dataset for modeling by splitting it into training and testing sets:
     ```python
     from sklearn.model_selection import train_test_split
     X = df.drop("target_column", axis=1)
     y = df["target_column"]
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     ```
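
The steps above can be sketched end to end on a small in-memory dataset. The column names (`age`, `city`, `bought`) and all values are made up for illustration. One design choice differs from step 4 as written: the fill value and the scaler are computed on the training split only and then applied to the test split, so that no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are illustrative only
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 32.0, 58.0, 47.0, 36.0, 29.0, 51.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
    "bought": [0, 1, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Remove duplicate rows, then one-hot encode the categorical column
df = df.drop_duplicates().reset_index(drop=True)
df = pd.get_dummies(df, columns=["city"])

# Split before imputing and scaling
X = df.drop("bought", axis=1)
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Impute with the training mean only
age_mean = X_train["age"].mean()
X_train["age"] = X_train["age"].fillna(age_mean)
X_test["age"] = X_test["age"].fillna(age_mean)

# Fit the scaler on the training split, then reuse it for the test split
scaler = StandardScaler()
X_train[["age"]] = scaler.fit_transform(X_train[["age"]])
X_test[["age"]] = scaler.transform(X_test[["age"]])

print(X_train.shape, X_test.shape)
```

Whether this ordering matters depends on the use case; for pure exploratory analysis without a held-out set, scaling the full DataFrame as in step 4 is usually fine.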

### Example Implementation
Below is an example pipeline that imputes missing values, replaces IQR outliers with the column mean, and standardizes the numeric features:


```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def data_preprocessing_pipeline(data):
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Identify numeric and categorical features
    numeric_features = data.select_dtypes(include="number").columns
    categorical_features = data.select_dtypes(include=["object"]).columns

    # Fill missing values in numeric features with the column mean
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Detect outliers in numeric features using the 1.5 * IQR rule
    # and replace them with the column mean
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                 data[feature].mean(), data[feature])

    # Standardize numeric features (zero mean, unit variance);
    # fit_transform fits the scaler and transforms in a single call
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Fill missing values in categorical features with the most frequent value
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data
```
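
To make the outlier step concrete, here is a minimal sketch that computes the IQR bounds for the values 7, 8, 9, 10, 11, 50 (the `NumericFeature2` column of the sample data shown later). The variable names are illustrative, and `Series.quantile` is used with its default linear interpolation.

```python
import pandas as pd

values = pd.Series([7, 8, 9, 10, 11, 50])

q1 = values.quantile(0.25)      # 8.25
q3 = values.quantile(0.75)      # 10.75
iqr = q3 - q1                   # 2.5
lower_bound = q1 - 1.5 * iqr    # 4.5
upper_bound = q3 + 1.5 * iqr    # 14.5

# Values outside [lower_bound, upper_bound] are treated as outliers
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers.tolist())        # [50]
```

Note that the pipeline replaces the outlier with the column mean computed while the outlier is still present (95 / 6 ≈ 15.83 here), so the replacement value is itself pulled upward. Using the median instead is a common alternative if that matters for your data.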

Let’s have a look at the sample data first:

```python
data = pd.read_csv("data.csv")

print("Original Data:")
print(data)
```

```
Original Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0              1.0                7                  A
1              2.0                8                  B
2              NaN                9                NaN
3              4.0               10                  A
4              5.0               11                  B
5              6.0               50                  C
```
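
If you do not have a `data.csv` at hand, the sample shown above can be rebuilt in memory. This is a sketch based only on the printed table; the actual file may differ in details such as column dtypes.

```python
import numpy as np
import pandas as pd

# Values copied from the printed sample; NaN marks the missing entries
data = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

# One missing value each in NumericFeature1 and CategoricalFeature
print(data.isnull().sum())
```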


And here’s how you can use this pipeline to perform all the preprocessing steps using Python:

```python
# Perform data preprocessing
cleaned_data = data_preprocessing_pipeline(data)

print("Preprocessed Data:")
print(cleaned_data)
```

```
Preprocessed Data:
   NumericFeature1  NumericFeature2 CategoricalFeature
0        -1.535624        -1.099370                  A
1        -0.944999        -0.749128                  B
2         0.000000        -0.398886                  A
3         0.236250        -0.048645                  A
4         0.826874         0.301597                  B
5         1.417499         1.994431                  C
```
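
A quick sanity check on this result: after `StandardScaler`, each numeric column should have a mean of about 0 and a population standard deviation of about 1. The sketch below re-enters the `NumericFeature1` values printed above rather than re-running the pipeline, so a small rounding error is expected.

```python
import numpy as np

# Preprocessed NumericFeature1 values, copied from the printed output
scaled = np.array([-1.535624, -0.944999, 0.000000, 0.236250, 0.826874, 1.417499])

mean = scaled.mean()
# StandardScaler divides by the population standard deviation (ddof=0)
std = scaled.std(ddof=0)

print(round(mean, 4), round(std, 4))
```

Row 2 is exactly 0 because its missing value was filled with the column mean before scaling.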
