# Data Preprocessing Techniques
> A comprehensive guide to cleaning, transforming, and optimizing your data for machine learning

## Table of Contents
- [Introduction](#introduction)
- [Handling Missing Values](#handling-missing-values)
- [Dealing with Outliers](#dealing-with-outliers)
  - [Standard Deviation Method](#standard-deviation-method)
  - [Z-Score Method](#z-score-method)
- [Data Encoding Techniques](#data-encoding-techniques)
  - [Label Encoding](#label-encoding)
  - [One-Hot Encoding](#one-hot-encoding)
  - [Ordinal Encoding](#ordinal-encoding)
- [Advanced Preprocessing](#advanced-preprocessing)
  - [Feature Selection](#feature-selection)
  - [Handling Imbalanced Data](#handling-imbalanced-data)
  - [Feature Scaling](#feature-scaling)
  - [Data Binning](#data-binning)
- [Summary](#summary)

## Introduction
Data preprocessing is a critical step in any data science or machine learning workflow. Clean, well-prepared data leads to better model performance, more accurate predictions, and faster development cycles. This notebook walks through essential preprocessing techniques with practical examples.


---

## Handling Missing Values
Missing values are a common problem in real-world datasets. Let's explore how to identify and handle them.


In [None]:
import pandas as pd

data_null = {'Umur': [25, 30, None, 35, 40],
        'Gaji': [50000, None, 60000, 65000, None],
        'Status': ['Single', 'Single', 'Married', None, 'Single']}

df_null = pd.DataFrame(data_null)
print(df_null.isnull().sum())
df_null

> 💡 **Insight**: The `.isnull().sum()` method quickly shows us how many missing values exist in each column.

### Option 1: Removing Missing Values
Dropping every row that contains a missing value is the simplest fix, but it often discards too much data. Here only one of the five rows survives:

In [None]:
df_null.dropna()

### Option 2: Imputing Missing Values
A better approach is to fill missing values with appropriate replacements:

In [None]:
df_null.fillna({'Umur': df_null['Umur'].mean(),
           'Gaji': 0,
           'Status': df_null['Status'].mode()[0]}, inplace=True)
df_null

> ⚠️ **Warning**: Choose imputation strategies carefully based on the nature of your data and the specific column.
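As one illustration of matching the strategy to the column: salary-like columns are often skewed, so the median can be a more robust fill value than the mean or a constant 0. The helper `impute_median` below is a hypothetical name, written in plain Python so the logic stays visible. Treat it as a sketch under that assumption, not as the recommended choice for every dataset.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Same salary values as the 'Gaji' column above
gaji = [50000, None, 60000, 65000, None]
gaji_imputed = impute_median(gaji)
print(gaji_imputed)  # [50000, 60000, 60000, 65000, 60000]
```

With pandas, the equivalent would be passing `df_null['Gaji'].median()` as the fill value for that column.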

---

## Dealing with Outliers
Outliers can significantly impact statistical analyses and model performance. Here we'll explore methods to detect and handle them.

### Standard Deviation Method
Standard deviation measures how spread out the values are from the mean.

![Standard Deviation](https://deintrovert.wordpress.com/wp-content/uploads/2017/10/std.png?w=640)

In [None]:
def std(data: list, ddof: int = 1) -> float:
    if ddof > 1 or ddof < 0:
        raise ValueError('ddof must be 0 or 1')
    n = len(data)
    mean = sum(data) / n
    total_variance = sum((x - mean) ** 2 for x in data)
    variance = total_variance / (n - ddof)
    stdev = variance ** 0.5
    return stdev

In [None]:
data_std = [10,1,5,1,1,1,1,2,1,1,1]
std(data_std, ddof=1)

In [None]:
import numpy as np

np.std(data_std, ddof=1)

> 📊 **Example**: We've implemented a custom standard deviation function and compared it with NumPy's implementation.
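The `ddof` argument decides whether the sum of squared deviations is divided by `n` (population, `ddof=0`) or by `n - 1` (sample, `ddof=1`). A minimal sketch of the difference, using the standard library's `statistics.pstdev` and `statistics.stdev` on the same data:

```python
import math
from statistics import pstdev, stdev

data_std = [10, 1, 5, 1, 1, 1, 1, 2, 1, 1, 1]
n = len(data_std)

population_sd = pstdev(data_std)  # divides by n      (ddof=0)
sample_sd = stdev(data_std)       # divides by n - 1  (ddof=1)

print(f"population (ddof=0): {population_sd:.4f}")
print(f"sample     (ddof=1): {sample_sd:.4f}")

# The two estimates differ only by a factor of sqrt(n / (n - 1)),
# so the gap shrinks as the dataset grows.
ratio = sample_sd / population_sd
print(f"ratio: {ratio:.4f}, sqrt(n/(n-1)): {math.sqrt(n / (n - 1)):.4f}")
```

The sample version is slightly larger, which matters most for small datasets like this one.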


### Z-Score Method
Z-score tells us how many standard deviations an element is from the mean.

![Z-Score](https://miro.medium.com/v2/resize:fit:748/0*yRjhv84-t1Xa-9pW.png)

In [None]:
def z_scores(data: list) -> list:
    mean = sum(data) / len(data)
    std_dev = std(data)
    if std_dev == 0:
        return [0] * len(data)
    return [(x - mean) / std_dev for x in data]

In [None]:
data_z = [1,1,1,1,1,1,1,1,1,1,1]
z_scores(data_z)

#### Is your score special (an outlier)?

In [None]:
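def z_scores_1(num, data):
    """Return the z-score of `num` relative to the values in `data`."""
    mean = sum(data) / len(data)
    std_dev = std(data)
    if std_dev == 0:
        raise ValueError('z-score is undefined when all values are identical')
    return (num - mean) / std_dev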

In [None]:
nilai_teman_teman = [75, 78, 80, 82, 85]  # classmates' scores

nilai_kamu = 90  # your score
print("mean: ", np.mean(nilai_teman_teman))
print("z-score: ", z_scores_1(nilai_kamu, nilai_teman_teman))

In [None]:
nilai_teman_teman = [50, 60, 80, 90, 100]  # classmates' scores

nilai_kamu = 90  # your score
print("mean: ", np.mean(nilai_teman_teman))
print("z-score: ", z_scores_1(nilai_kamu, nilai_teman_teman))

> 🔍 **Analysis**: A high absolute z-score suggests a value might be an outlier.
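The two classroom examples above can be checked by hand. A worked sketch, assuming the sample standard deviation (`ddof=1`, the default of `std`) and rounding to two decimals:

$$
z = \frac{x - \bar{x}}{s}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

For the scores 75, 78, 80, 82, 85:

$$
\bar{x} = 80, \qquad s = \sqrt{\frac{25 + 4 + 0 + 4 + 25}{4}} = \sqrt{14.5} \approx 3.81, \qquad z = \frac{90 - 80}{3.81} \approx 2.63
$$

For the scores 50, 60, 80, 90, 100:

$$
\bar{x} = 76, \qquad s = \sqrt{\frac{676 + 256 + 16 + 196 + 576}{4}} = \sqrt{430} \approx 20.74, \qquad z = \frac{90 - 76}{20.74} \approx 0.68
$$

The same score of 90 stands out in the tightly clustered class but is unremarkable in the widely spread one.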

#### Proof

In [None]:
umur_pasien = [10, 15, 5, 3, 1, 2, 2, 1,20, 21, 22, 23, 24, 3, 1, 2, 5, 8, 25, 26, 1, 9, 27, 2, 1, 1, 28, 29, 90]
print("mean: ", np.mean(umur_pasien))
print("median: ", np.median(umur_pasien))

In [None]:
z_scores_1(90, umur_pasien)

In [None]:
z_scores_1(-1, umur_pasien)

> 🧪 **Experiment**: We can see how the z-score identifies the outlier value of 90 in our patient age data.
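Checking values one at a time can be turned into a filter over the whole list. The sketch below flags every value whose absolute z-score exceeds a cutoff. The name `flag_outliers` and the default cutoff of 3 are assumptions for illustration (3 is a common rule of thumb, not a fixed standard).

```python
from statistics import mean, stdev

def flag_outliers(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(data)
    sd = stdev(data)  # sample standard deviation (ddof=1)
    if sd == 0:
        return []  # all values identical: nothing can stand out
    return [x for x in data if abs((x - mu) / sd) > threshold]

umur_pasien = [10, 15, 5, 3, 1, 2, 2, 1, 20, 21, 22, 23, 24, 3, 1, 2, 5, 8,
               25, 26, 1, 9, 27, 2, 1, 1, 28, 29, 90]
outliers = flag_outliers(umur_pasien)
print(outliers)  # [90]
```

Lowering the threshold flags more values, so the cutoff is best chosen with the data's distribution in mind.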

---

## Data Encoding Techniques
Machine learning algorithms generally require numerical input. Encoding converts categorical data to numerical format.


In [None]:
data_le = pd.DataFrame({
    "Status": ["Married", "Single", "Married", "Married", "Single", "Single", "Irul", "Married", "Married", "Single"]
})
data_le.value_counts()

In [None]:
data_le['Status'] = data_le['Status'].replace("Irul", "Single")
data_le

> 🧹 **Cleaning Step**: First, we find and fix inconsistent categories.

### Label Encoding
Transforms categories into unique integer values (0, 1, 2, etc.).

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data_le['Status_clean'] = le.fit_transform(data_le['Status'])
# data_le = data_le.drop(columns='Status')  # optionally drop the original column
data_le

> 🔄 **Transformation**: Label encoding creates a single numerical column, but beware of implying ordinality.
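To see where the integers come from, here is a plain-Python sketch of the same idea. It assigns codes in sorted order of the distinct labels, which is also how `LabelEncoder` orders its `classes_`. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# The cleaned 'Status' column from above
status = ["Married", "Single", "Married", "Married", "Single",
          "Single", "Single", "Married", "Married", "Single"]
codes, mapping = label_encode(status)
print(mapping)  # {'Married': 0, 'Single': 1}
print(codes)    # [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
```

The codes reflect alphabetical order only, which is why a model may read a ranking into them that the categories do not have.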

### One-Hot Encoding
Creates binary columns for each category.

In [None]:
data = pd.DataFrame(
    {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
    }
)
df_pandas_encoded = pd.get_dummies(data, columns=['Gender'], dtype=int)
df_pandas_encoded

In [None]:
from sklearn.preprocessing import OneHotEncoder
data_ohe = pd.DataFrame(
    {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', "Male"],
    }
)

ohe = OneHotEncoder(sparse_output=False, dtype=int)
encoded = ohe.fit_transform(data_ohe[['Gender']])

encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Gender']))

data_ohe = pd.concat([data_ohe, encoded_df], axis=1)

data_ohe


> 📋 **Best Practice**: One-hot encoding is usually preferred for nominal categories with no inherent order.
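The mechanics can be sketched without any library: each distinct category becomes its own 0/1 column, and exactly one column is set per row. The helper `one_hot` and its `prefix` parameter are hypothetical names chosen to mimic the `Gender_Female` / `Gender_Male` column naming seen above.

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0/1 entry for every category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

gender = ['Male', 'Female', 'Male', 'Female', 'Male', 'Male']
encoded_rows = one_hot(gender, prefix="Gender")
for row in encoded_rows:
    print(row)
```

Because every row sums to 1, one of the columns is redundant. Some workflows drop one column for that reason.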

### Ordinal Encoding
Used when categories have a natural order.

In [None]:
data = pd.DataFrame(
    {
        'Education Level': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    }
)

education_order = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}

data['Education_OrdinalEncoded'] = data['Education Level'].map(education_order)
data

In [None]:
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame(
    {
        'Education Level': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    }
)

order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=order, dtype=int)
data['Education_OrdinalEncoded'] = oe.fit_transform(data[['Education Level']])
data

> 🎓 **Use Case**: Ordinal encoding preserves the hierarchical relationship between categories.

---

## Advanced Preprocessing
These additional techniques can significantly improve your model's performance.

### Feature Selection
Understanding relationships between variables is crucial for selecting the most relevant features.

In [None]:
np.random.seed(42)
umur = np.random.normal(loc=35, scale=10, size=200)
umur = np.clip(umur, 18, 65)

gaji = 3 * (umur ** 2) + np.random.normal(0, 10000, len(umur))  
gaji = np.clip(gaji, 3000, 150000)

kesehatan_skor = 100 - (umur * 0.7) + np.random.normal(0, 7, len(umur))
kesehatan_skor = np.clip(kesehatan_skor, 10, 100)

df = pd.DataFrame({
    "Umur": umur,
    "Gaji": gaji,
    "Kesehatan": kesehatan_skor
})
df.info()

#### Pearson Correlation Coefficient
Measures linear correlation between variables:

In [None]:
from scipy.stats import pearsonr

corr_gaji, _ = pearsonr(df["Umur"], df["Gaji"])
corr_kesehatan, _ = pearsonr(df["Umur"], df["Kesehatan"])

print(f"Pearson Correlation (Umur vs. Gaji): {corr_gaji:.2f}")
print(f"Pearson Correlation (Umur vs. Kesehatan): {corr_kesehatan:.2f}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.regplot(x="Umur", y="Gaji", data=df, ax=axes[0])
axes[0].set_title(f"Positive Correlation (r={corr_gaji:.2f})")
sns.regplot(x="Umur", y="Kesehatan", data=df, ax=axes[1])
axes[1].set_title(f"Negative Correlation (r={corr_kesehatan:.2f})")

plt.show()

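> 💯 **Interpretation**: Pearson's r ranges from -1 to 1. Values near ±1 indicate a strong linear relationship, and 0 indicates no linear correlation, although a non-linear relationship may still exist.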
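The coefficient itself is the covariance of the two variables divided by the product of their spreads. A minimal plain-Python sketch (the name `pearson_r` is hypothetical) makes the point about linearity concrete: a perfectly linear relationship gives r = 1, while a perfect but symmetric quadratic relationship gives r = 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [2 * v + 1 for v in x])  # y = 2x + 1
r_quadratic = pearson_r(x, [v ** 2 for v in x])  # y = x^2
print(r_linear, r_quadratic)
```

A low r is therefore not enough on its own to discard a feature. A scatter plot is a useful second check.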

### Handling Imbalanced Data
Class imbalance can significantly impact model performance, especially for classification tasks.

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)

email = pd.DataFrame(X, columns=[f"Kata_{i}" for i in range(1, 6)])
email["Spam"] = y
email.head()

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x=email['Spam'], hue=email['Spam'])

plt.title("Email inbox")
plt.xlabel("Class")
plt.ylabel("Number of Samples")
plt.ylim(0,1000)
plt.xticks([0, 1], ['Not Spam', 'Spam'])
plt.yticks(range(0, 1001, 100))

plt.show()


#### Oversampling
Increases the number of minority class instances:

In [None]:
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)

X_resampled, y_resampled = oversampler.fit_resample(X, y)
sns.countplot(x=y_resampled, hue=y_resampled)

plt.title("Email inbox")
plt.xlabel("Class")
plt.ylabel("Number of Samples")
plt.ylim(0,1000)
plt.xticks([0, 1], ['Not Spam', 'Spam'])
plt.yticks(range(0, 1001, 100))
plt.show()

#### Undersampling
Reduces the number of majority class instances:

In [None]:
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)

X_resampled, y_resampled = undersampler.fit_resample(X, y)
sns.countplot(x=y_resampled, hue=y_resampled)

plt.title("Email inbox")
plt.xlabel("Class")
plt.ylabel("Number of Samples")
plt.ylim(0,1000)
plt.xticks([0, 1], ['Not Spam', 'Spam'])
plt.yticks(range(0, 1001, 100))
plt.show()

> ⚖️ **Balance**: Both techniques help create a more balanced dataset for training models.
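What random oversampling does can be sketched in a few lines of plain Python: minority rows are duplicated at random until every class matches the largest one. The function `random_oversample` and the toy data are assumptions for illustration only, not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # an exact copy of an existing row
            y_out.append(label)
    return X_out, y_out

# Toy data: nine "not spam" rows and a single "spam" row
X_toy = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9], [5.0]]
y_toy = [0] * 9 + [1]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))  # Counter({0: 9, 1: 9})
```

Since the added rows are exact copies, resampling is usually applied to the training split only, so duplicated rows cannot leak into the evaluation data.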

### Feature Scaling
Scaling ensures features with different magnitudes don't dominate the model's learning process.

In [None]:
np.random.seed(42)
gaji = np.random.normal(10, 3, 1000)
harga_rumah = np.random.normal(500, 150, 1000)
data_scl = pd.DataFrame({"Gaji (Juta IDR)": gaji, "Harga Rumah (Ratusan Juta IDR)": harga_rumah})

data_scl.hist(bins=30, edgecolor='black', alpha=0.7)
plt.suptitle("Histograms Before Scaling")
plt.show()

plt.figure(figsize=(8, 5))
sns.kdeplot(data_scl['Gaji (Juta IDR)'], label='Salary', color='red')
sns.kdeplot(data_scl['Harga Rumah (Ratusan Juta IDR)'], label='House Price', color='blue')
plt.title("Distributions Before Scaling")
plt.legend()
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler_standard = StandardScaler()
data_standard = scaler_standard.fit_transform(data_scl)
data_standard = pd.DataFrame(data_standard, columns=data_scl.columns)
data_standard.describe()

plt.figure(figsize=(8, 5))
sns.kdeplot(data_standard['Gaji (Juta IDR)'], label='Salary', color='red')
sns.kdeplot(data_standard['Harga Rumah (Ratusan Juta IDR)'], label='House Price', color='blue')
plt.title("Distributions After Standard Scaling")
plt.legend()
plt.show()

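> 📏 **Note**: After standardization, both features have a mean of approximately 0 and a standard deviation of approximately 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row shows a value slightly above 1.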
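The transformation behind standard scaling is `(x - mean) / std` applied to each column independently. A minimal sketch in plain Python, using the population standard deviation as `StandardScaler` does. The function name `standardize` and the small toy columns are assumptions for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

# Two toy columns on very different scales
gaji = [8.0, 10.0, 12.0, 9.0, 11.0]
harga_rumah = [400.0, 500.0, 600.0, 450.0, 550.0]

gaji_scaled = standardize(gaji)
harga_scaled = standardize(harga_rumah)
print([round(v, 3) for v in gaji_scaled])
print([round(v, 3) for v in harga_scaled])
```

Both columns end up on the same scale even though the raw house prices are fifty times larger than the salaries.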

### Data Binning
Binning groups continuous data into discrete categories, which can help reduce noise.

In [None]:
from numpy.random import seed, randint
seed(42)
age = pd.DataFrame({'age' : randint(0, 100, 100)})

In [None]:
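# include_lowest=True keeps age 0 in the first bin instead of turning it into NaN
age['bin'] = pd.cut(age['age'], [0, 5, 17, 25, 50, 100], include_lowest=True)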
age

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

# Histogram before binning
axes[0].hist(age['age'], bins=15, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xticks(np.arange(0, 110, 10))
axes[0].grid()
axes[0].set_title("Age Distribution Before Binning")
axes[0].set_xlabel("Age")
axes[0].set_ylabel("Frequency")

# Bar chart after binning
age['bin'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='salmon', edgecolor='black', alpha=0.7)
axes[1].set_title("Age Distribution After Binning")
axes[1].set_xlabel("Age Group")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

> 📦 **Application**: Binning is particularly useful for creating features that capture non-linear relationships.
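The lookup that binning performs can be sketched with the standard library's `bisect` module. The edges match the ones used above. The labels and the name `bin_age` are hypothetical, and the intervals are right-closed with the lowest edge included in the first bin.

```python
from bisect import bisect_left

EDGES = [0, 5, 17, 25, 50, 100]
LABELS = ["0-5", "6-17", "18-25", "26-50", "51-100"]  # hypothetical labels for integer ages

def bin_age(age):
    """Return the label of the right-closed interval containing `age`, or None."""
    if age < EDGES[0] or age > EDGES[-1]:
        return None  # outside every bin
    # bisect_left finds the first edge >= age; the bin starts one edge earlier.
    # max(..., 0) places the lowest edge itself (age 0) in the first bin.
    idx = max(bisect_left(EDGES, age) - 1, 0)
    return LABELS[idx]

ages = [0, 5, 6, 17, 18, 50, 51, 100]
binned = [bin_age(a) for a in ages]
print(binned)
# ['0-5', '0-5', '6-17', '6-17', '18-25', '26-50', '51-100', '51-100']
```

Values that fall exactly on an edge, such as 5 or 17, land in the lower bin, which is the boundary behaviour worth checking whenever bins are defined by hand.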

## Summary
Proper data preprocessing is essential for successful machine learning projects. This notebook covered:

1. **Handling Missing Values**: Detection and imputation strategies
2. **Dealing with Outliers**: Standard deviation and Z-score methods
3. **Data Encoding**: Label, one-hot, and ordinal encoding for categorical data
4. **Advanced Techniques**:
   - Feature selection using correlation analysis
   - Balancing imbalanced datasets
   - Feature scaling via standardization
   - Binning continuous variables

Remember that the choice of preprocessing techniques should be guided by your specific dataset characteristics and the requirements of your machine learning algorithm.