# Module 3: Data Preprocessing and Feature Engineering

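This module covers techniques for cleaning, preprocessing, and transforming raw data into features suitable for machine learning algorithms: handling missing values, encoding categorical features, detecting outliers, engineering new features, and reducing dimensionality with PCA. It closes with privacy-preserving methods.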

## 1. Data Analysis and Preprocessing Techniques

In [None]:

import pandas as pd

# Sample dataset
data = {
    'Age': [25, 27, None, 22, 28],
    'Salary': [50000, 54000, 58000, None, 52000],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', None]
}

df = pd.DataFrame(data)
df.head()
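
Before cleaning, it helps to quantify how much data is missing. The sketch below rebuilds the same sample dataset under a separate name (`sample`, chosen here so the cell runs on its own) and counts missing entries per column.

```python
import pandas as pd

# Same sample dataset as above, rebuilt so this cell is self-contained
sample = pd.DataFrame({
    'Age': [25, 27, None, 22, 28],
    'Salary': [50000, 54000, 58000, None, 52000],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', None]
})

# Number of missing entries in each column
missing_counts = sample.isna().sum()

# Fraction of rows that contain at least one missing value
rows_with_missing = sample.isna().any(axis=1).mean()

print(missing_counts)
print(rows_with_missing)
```

Each column is missing exactly one value here, which is why the next section fills values in rather than dropping rows.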
    

## 2. Data Cleaning

In [None]:

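# Handle missing values: column mean for numeric columns, 'Unknown' for City.
# Assigning the result back avoids chained in-place modification, which
# recent pandas versions warn about and which may not update df.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df['City'] = df['City'].fillna('Unknown')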
df
    

## 3. Encoding Categorical Features

In [None]:

# Convert categorical 'City' into numerical using one-hot encoding
df = pd.get_dummies(df, columns=['City'])
df
    

## 4. Detecting Outliers

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['Salary'])
plt.show()
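
The boxplot flags outliers visually. The same check can be done numerically with the 1.5 × IQR rule, which is the default whisker convention for boxplots. This is a minimal sketch: the `salary` values are assumed to be the Salary column after mean imputation of the sample data, and with only five rows the result is illustrative rather than meaningful.

```python
import pandas as pd

# Salary after mean imputation (assumed from the sample data: the missing value becomes 53500)
salary = pd.Series([50000, 54000, 58000, 53500, 52000], name='Salary')

# Interquartile range
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = salary[(salary < lower_bound) | (salary > upper_bound)]

print(lower_bound, upper_bound)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the domain; the rule only identifies candidates.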
    

## 5. Feature Engineering

In [None]:

# Create new feature: 'Salary per Age'
df['Salary_per_Age'] = df['Salary'] / df['Age']
df
    

## 6. Dimensionality Reduction with PCA

In [None]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

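# Standardize every column to zero mean and unit variance.
# Note: this includes the one-hot City columns and Salary_per_Age.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Project the scaled data onto its first two principal components
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled)
pca_data[:5]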
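
To judge how much information two components retain, one option is to look at the explained variance ratio of the fitted PCA. The sketch below uses only the numeric columns, with values assumed from the sample data after imputation; the names `features` and `pca_model` are chosen here for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric columns as they look after imputation (values assumed from the sample data)
features = pd.DataFrame({
    'Age': [25.0, 27.0, 25.5, 22.0, 28.0],
    'Salary': [50000.0, 54000.0, 58000.0, 53500.0, 52000.0],
})
features['Salary_per_Age'] = features['Salary'] / features['Age']

# Standardize, then fit PCA with two components
scaled_features = StandardScaler().fit_transform(features)
pca_model = PCA(n_components=2).fit(scaled_features)

# Fraction of the total variance captured by each component
variance_ratio = pca_model.explained_variance_ratio_
print(variance_ratio, variance_ratio.sum())
```

A sum close to 1 suggests the two components preserve most of the variance in the scaled features.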
    

## 7. Privacy-Preserving Data Preprocessing


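- **Anonymization**: Removing or masking direct identifiers (e.g., names, IDs), for example by hashing or encrypting them.
- **Differential Privacy**: Adding calibrated random noise to data or query results so that any single individual's record has only a limited effect on the output.
- **Example**: The Laplace-noise cell in this section perturbs each `Salary` value.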
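
As a rough illustration of the anonymization bullet, the sketch below replaces a direct identifier with a keyed hash. The `Name` column, the `people` table, and `SECRET_KEY` are hypothetical and not part of this module's dataset. Strictly speaking this is pseudonymization: the remaining columns may still allow re-identification.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical records; the 'Name' column is not part of the module's dataset
people = pd.DataFrame({
    'Name': ['Ayesha', 'Bilal', 'Ayesha'],
    'Salary': [50000, 54000, 58000],
})

# Hypothetical secret key; assumed to be stored outside the notebook in practice
SECRET_KEY = b'replace-with-a-secret-key'

def pseudonymize(value: str) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier, truncated for readability."""
    digest = hmac.new(SECRET_KEY, value.encode('utf-8'), hashlib.sha256)
    return digest.hexdigest()[:16]

# Replace the identifier with its pseudonym, then drop the original column
people['Person_ID'] = people['Name'].map(pseudonymize)
pseudonymized = people.drop(columns=['Name'])
print(pseudonymized)
```

The same name always maps to the same `Person_ID`, so records can still be linked without exposing the name itself.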
    

In [None]:

import numpy as np

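# Add Laplace noise to Salary (illustrative only)
epsilon = 1.0  # privacy budget: smaller values mean more noise
# Sensitivity is taken as the observed salary range. A formal guarantee would
# need bounds fixed independently of the data.
sensitivity = df['Salary'].max() - df['Salary'].min()
# Laplace mechanism: noise scale = sensitivity / epsilon
noise = np.random.laplace(loc=0.0, scale=sensitivity/epsilon, size=df.shape[0])
df['Noisy_Salary'] = df['Salary'] + noise
df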
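
Noise can also be added to a query result instead of to each record. The sketch below applies the Laplace mechanism to a mean. It assumes the salary bounds (`lower`, `upper`) are fixed in advance and that the number of records is public; the function name `noisy_mean` and the bounds used are illustrative choices, not part of the module.

```python
import numpy as np

def noisy_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], with Laplace noise added."""
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Changing one record moves the clipped mean by at most (upper - lower) / n
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

# Seeded generator so the sketch is reproducible
rng = np.random.default_rng(seed=0)

# Salary values assumed from the sample data after mean imputation
salaries = [50000, 54000, 58000, 53500, 52000]

# Assumed bounds of 30000 and 100000, chosen without looking at the data
private_mean = noisy_mean(salaries, lower=30000, upper=100000, epsilon=1.0, rng=rng)
print(private_mean)
```

Because the sensitivity shrinks with the number of records, an aggregate query typically needs much less noise than perturbing every individual value.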
    