# **COMPLETE GUIDE TO DATA PRE-PROCESSING: WHAT, WHY & HOW TO HANDLE MESSY DATA**

### Let's prepare our data for better analysis and modeling

## **A. Learning Objectives**

* **Get a clearer idea of why data preprocessing actually matters**

* **Learn how to apply different data preprocessing techniques**

## **B. Background Theory**


The quality of your data plays a big role in how accurate your analysis and models turn out. Why? Because raw data usually comes with a bunch of issues—like errors, inconsistencies, or stuff that’s just not useful—which can mess up your results and lead to misleading insights. That’s where data preprocessing comes in. It’s basically the step where you clean and organize your data so it’s ready to be used properly.

### **What Is Data Pre-Processing?**

Data preprocessing is the process of preparing raw data so it can be used effectively in machine learning models. Think of it like cleaning and organizing ingredients before cooking a meal—if the ingredients are messy or spoiled, the final dish won’t turn out well.

### **What It Involves**

Here are the key steps in data preprocessing:

1. Cleaning the Data
    
    - Fixing or removing missing values
    
    - Removing duplicates
    
    - Correcting errors or inconsistencies
    
2. Transforming the Data

    - Scaling values (e.g., making sure all numbers are on a similar scale)
    
    - Encoding categories (e.g., turning "Yes"/"No" into 1/0)
    
    - Normalizing or standardizing data

3. Reducing the Data

    - Selecting only the most useful features

    - Removing irrelevant or redundant information

    - Dimensionality reduction (e.g., using PCA)

4. Splitting the Data
    
    - Dividing into training, validation, and test sets
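The splitting step can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed and a 60/20/20 split is wanted; the `toy` frame and its column names are hypothetical and not part of the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature and a binary target
rng = np.random.default_rng(seed=42)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

# First hold out 20% of the rows as the test set
train_val, test = train_test_split(toy, test_size=0.2, random_state=42)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))  # 60 20 20
```

Splitting in two passes is one common way to get three sets, since `train_test_split` produces only two parts per call.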


### **Why It’s Important in Machine Learning**

Machine learning models learn patterns from data. If the data is messy, the model might learn the wrong patterns or get confused. Preprocessing helps:

- Improve model accuracy

- Speed up training

- Avoid errors and bias

- Make the model more generalizable

## **C. Practical Steps for Effective Data Preprocessing**

Let's load the dataset first. In this tutorial, we will use the Customer Churn Prediction dataset from Bharti Prasad: [view-dataset](https://www.kaggle.com/code/bhartiprasad17/customer-churn-prediction)

In [2]:
import pandas as pd
# Load the dataset

url_dataset = "https://raw.githubusercontent.com/adisetiawannn/DataCraft/main/data/raw/dataset_preprocessing/customer_churn_data.csv"

df = pd.read_csv(url_dataset)

# display the first five rows of the dataframe
df.head(5)


Unnamed: 0,CustomerID,Age,Gender,Tenure,MonthlyCharges,ContractType,InternetService,TotalCharges,TechSupport,Churn
0,1,49,Male,4,88.35,Month-to-Month,Fiber Optic,353.4,Yes,Yes
1,2,43,Male,0,36.67,Month-to-Month,Fiber Optic,0.0,Yes,Yes
2,3,51,Female,2,63.79,Month-to-Month,Fiber Optic,127.58,No,Yes
3,4,60,Female,8,102.34,One-Year,DSL,818.72,Yes,Yes
4,5,42,Male,32,69.01,Month-to-Month,,2208.32,No,Yes
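Before cleaning, it helps to see where values are missing. The sketch below rebuilds the five preview rows by hand so it runs without a download; the name `preview` is hypothetical, and the blank `InternetService` cell in the last row is assumed to be a missing value. On the full dataset the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# The five preview rows shown above, rebuilt by hand (illustration only)
preview = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [49, 43, 51, 60, 42],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Tenure": [4, 0, 2, 8, 32],
    "MonthlyCharges": [88.35, 36.67, 63.79, 102.34, 69.01],
    "ContractType": ["Month-to-Month", "Month-to-Month", "Month-to-Month",
                     "One-Year", "Month-to-Month"],
    "InternetService": ["Fiber Optic", "Fiber Optic", "Fiber Optic", "DSL", np.nan],
    "TotalCharges": [353.4, 0.0, 127.58, 818.72, 2208.32],
    "TechSupport": ["Yes", "Yes", "No", "Yes", "No"],
    "Churn": ["Yes", "Yes", "Yes", "Yes", "Yes"],
})

# Column types and non-null counts
preview.info()

# Number of missing values per column, showing only columns that have any
missing_per_column = preview.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```

The `Unnamed: 0` column in the preview looks like a saved row index; passing `index_col=0` to `pd.read_csv` is one way to keep it from appearing as a separate column.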


### **1. Cleaning Data: Handling Missing Values**

Missing values are empty spots in your data—like blanks or NaNs—that can mess things up when training a model.

- **Why Fix Them?**

    Most machine learning algorithms cannot work with missing values directly. If you leave them in, training may fail with an error, or the model may give unreliable predictions.

- **How to Handle?**

    - Drop them if they’re not important
    
    - Fill them with something like the mean, median, or a default value

    - Use forward/backward fill for time-based data


Comparison of Techniques for Handling Missing Values

| Technique | When to Use | Pros | Cons | How to Do |
|-----------|-------------|------|------|-----------|
| **Remove Missing Values** | - Missing data is rare (<5% of the dataset)<br>- Missingness is random | - Simple and fast<br>- No risk of introducing bias from imputation | - Can lose valuable information<br>- Reduces dataset size | `df.dropna()` |
| **Fill with a Static Value (mean, median, mode, or a constant)** | - Missing data is more frequent but predictable<br>- You have a meaningful constant or statistic (mean, median, mode) | - Keeps all rows<br>- Simple to implement | - May introduce bias if the fill value is not representative<br>- Can artificially reduce variance | `df.fillna(0)` or `df['col'].fillna(df['col'].mean())` |
| **Forward/Backward Fill** | - Time series or sequential data where the last/next value makes sense<br>- Small gaps in data that should be smoothed | - Preserves temporal consistency<br>- Good for short gaps in ordered data | - Can propagate incorrect values forward/backward<br>- Not suitable for large gaps | `df.ffill()` (forward) or `df.bfill()` (backward) |
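To apply the "rare (<5%)" rule of thumb from the table, you first need the share of missing values per column. A minimal sketch follows, using a hypothetical `toy` frame; the 5% cut-off is just the table's guideline, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' is missing in 1 of 40 rows, 'b' in 10 of 40
toy = pd.DataFrame({
    "a": [1.0] * 39 + [np.nan],
    "b": [np.nan] * 10 + [2.0] * 30,
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# The 5% cut-off mirrors the rule of thumb in the table; adjust it to your data
threshold = 5.0
has_missing = missing_pct > 0
drop_candidates = missing_pct[has_missing & (missing_pct < threshold)].index.tolist()
impute_candidates = missing_pct[missing_pct >= threshold].index.tolist()

print("dropping rows is likely fine for:", drop_candidates)
print("consider imputing:", impute_candidates)
```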


In [None]:
# Create a sample DataFrame with missing values

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jane',
             'Kevin', 'Laura', 'Mike', 'Nina', 'Oscar', 'Paula', 'Quinn', 'Rachel', 'Steve', 'Tina'],
    'Age': [25, np.nan, 30, 22, np.nan, 28, 35, 40, np.nan, 29,
            31, 27, np.nan, 33, 26, 24, 38, np.nan, 32, 30],
    'City': ['Jakarta', 'Medan', np.nan, 'Bandung', 'Surabaya', 'Jakarta', 'Medan', 'Bandung', 'Surabaya', np.nan,
             'Jakarta', 'Medan', 'Bandung', np.nan, 'Surabaya', 'Jakarta', 'Medan', 'Bandung', 'Surabaya', 'Jakarta'],
    'Score': [88, 92, np.nan, 85, 90, 87, np.nan, 91, 89, 86,
              93, np.nan, 84, 88, 90, 85, 92, 89, np.nan, 87]
}

# Create DataFrame
df = pd.DataFrame(data)


# Check which columns contain missing values
columns_with_missing = df.columns[df.isnull().any()].tolist()

print("we have missing values in the following columns:", columns_with_missing)


we have missing values in the following columns: ['Age', 'City', 'Score']


In [None]:
# 1. Drop missing values
# From the check above, the 'Age', 'City', and 'Score' columns have missing values.
# Here we drop only the rows where 'City' or 'Score' is missing.

df_drop = df.dropna(subset=['City', 'Score'])
null_counts = df_drop.isnull().sum()  # count remaining nulls per column

# 'City' and 'Score' now have no missing values, but 'Age' still does
print(null_counts)

Name     0
Age      5
City     0
Score    0
dtype: int64
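How many rows `dropna` removes depends on its arguments. The sketch below compares three variants on a small hypothetical frame (`toy` is not the tutorial's `df`), which makes the "reduces dataset size" trade-off easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with gaps in different places
toy = pd.DataFrame({
    "Age": [25, np.nan, 30, 22],
    "City": ["Jakarta", "Medan", np.nan, "Bandung"],
    "Score": [88, 92, np.nan, 85],
})

# Drop rows that have a missing value in ANY column
all_dropped = toy.dropna()

# Drop rows only when 'City' is missing
subset_dropped = toy.dropna(subset=["City"])

# Keep rows that have at least 2 non-missing values
thresh_dropped = toy.dropna(thresh=2)

print(len(toy), len(all_dropped), len(subset_dropped), len(thresh_dropped))
```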


In [23]:
# 2. Fill missing numerical values with mean

df['Age_mean'] = df['Age'].fillna(df['Age'].mean())

# Fill missing numerical values with median
df['Score_median'] = df['Score'].fillna(df['Score'].median())

# Fill missing categorical/string values with a default : Unknown
df['City_filled'] = df['City'].fillna('Unknown')

print(df)   # Display the DataFrame after filling missing values

       Name   Age      City  Score  Age_mean  Score_median City_filled
0     Alice  25.0   Jakarta   88.0      25.0          88.0     Jakarta
1       Bob   NaN     Medan   92.0      30.0          92.0       Medan
2   Charlie  30.0       NaN    NaN      30.0          88.5     Unknown
3     David  22.0   Bandung   85.0      22.0          85.0     Bandung
4       Eva   NaN  Surabaya   90.0      30.0          90.0    Surabaya
5     Frank  28.0   Jakarta   87.0      28.0          87.0     Jakarta
6     Grace  35.0     Medan    NaN      35.0          88.5       Medan
7    Hannah  40.0   Bandung   91.0      40.0          91.0     Bandung
8       Ian   NaN  Surabaya   89.0      30.0          89.0    Surabaya
9      Jane  29.0       NaN   86.0      29.0          86.0     Unknown
10    Kevin  31.0   Jakarta   93.0      31.0          93.0     Jakarta
11    Laura  27.0     Medan    NaN      27.0          88.5       Medan
12     Mike   NaN   Bandung   84.0      30.0          84.0     Bandung
[remaining rows truncated]
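The comparison table also lists the mode as a fill statistic. For a categorical column such as `City`, that could look like the sketch below; the short `cities` series is hypothetical, and whether the most frequent value is a sensible fill depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two gaps
cities = pd.Series(
    ["Jakarta", "Medan", np.nan, "Jakarta", "Bandung", np.nan],
    name="City",
)

# mode() returns a Series because ties are possible; take the first entry
most_common = cities.mode().iloc[0]
cities_filled = cities.fillna(most_common)

print(most_common)
print(cities_filled.tolist())
```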

In [None]:
# 3. Forward/Backward Fill

df['Age_ffill'] = df['Age'].ffill()  # Forward fill
df['Age_bfill'] = df['Age'].bfill()  # Backward fill

print(df[['Name','Age_ffill', 'Age_bfill']])  # Display the DataFrame with forward and backward filled values

       Name  Age_ffill  Age_bfill
0     Alice       25.0       25.0
1       Bob       25.0       30.0
2   Charlie       30.0       30.0
3     David       22.0       22.0
4       Eva       22.0       28.0
5     Frank       28.0       28.0
6     Grace       35.0       35.0
7    Hannah       40.0       40.0
8       Ian       40.0       29.0
9      Jane       29.0       29.0
10    Kevin       31.0       31.0
11    Laura       27.0       27.0
12     Mike       27.0       33.0
13     Nina       33.0       33.0
14    Oscar       26.0       26.0
15    Paula       24.0       24.0
16    Quinn       38.0       38.0
17   Rachel       38.0       32.0
18    Steve       32.0       32.0
19     Tina       30.0       30.0
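Forward fill makes the most sense on ordered data, and the table warns that it is not suitable for large gaps. One way to guard against that is the `limit` argument, sketched here on a hypothetical daily series (`readings` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a 1-day gap and a 3-day gap
dates = pd.date_range("2024-01-01", periods=8, freq="D")
readings = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, np.nan, 16.0, 17.0],
    index=dates,
)

# limit=1 fills at most one consecutive NaN, so the long gap stays visible
filled = readings.ffill(limit=1)
print(filled)
```

The single-day gap is filled completely, while only the first day of the three-day gap is filled; the remaining NaNs can then be handled deliberately instead of being silently carried forward.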


### **2. Transforming Data**

- **What Is Data Transformation?**

    Before we feed data into a machine learning model, we often need to clean it up and reshape it a bit. That’s what data transformation is all about — making sure the data is in the right format and scale so the model can actually learn from it.

    Think of it like prepping ingredients before cooking. You wouldn’t throw whole vegetables into a blender without chopping them first, right?

- **Why Do We Need to Transform Our Data?**

  Machine learning models are picky. They work best when:

  - a. Numbers are on similar scales  
  - b. Categories are turned into something they can understand (like numbers)  
  - c. Data is clean, consistent, and standardized


- **How Do We Transform Data?**

  | Technique Name | Definition | Popular Methods | When to Use | Pros | Cons | How to Do (Python Code) |
  |----------------|------------|-----------------|-------------|------|------|-------------------------|
  | Scaling Values | Adjusts numerical values to a common scale without changing the shape of the distribution | Min-Max Scaling, Robust Scaling, MaxAbs Scaling | When features have very different ranges (e.g., 0–1 vs 0–1000) | Helps models converge faster and keeps large-range features from dominating | Min-Max and MaxAbs are sensitive to outliers | `MinMaxScaler().fit_transform(data)`<br>`RobustScaler().fit_transform(data)` |
  | Encoding Categories | Converts categorical data into numerical format | One-Hot Encoding, Label Encoding, Ordinal Encoding | When you have non-numeric features like "Yes"/"No", "Red"/"Blue" | Makes categorical data usable by ML models | One-Hot can increase dimensionality; Label and Ordinal encoding may imply an order that does not exist | `pd.get_dummies(data)`<br>`OrdinalEncoder().fit_transform(data)`<br>`LabelEncoder().fit_transform(y)` (intended for target labels) |
  | Normalizing | Rescales each feature to a fixed range, usually [0, 1] (Min-Max), or rescales each row to unit length (L2) | Min-Max Normalization, L2 Normalization, MaxAbs | When data needs to be bounded or when using distance-based models | Keeps data within a consistent range | Sensitive to outliers | `MinMaxScaler().fit_transform(data)`<br>`normalize(data, norm='l2')` |
  | Standardizing | Centers data around mean 0 with standard deviation 1 | Z-score Standardization (`StandardScaler`), Robust Scaling, `PowerTransformer` | When features have different means and spreads, or the model expects centered inputs | Less distorted by outliers than Min-Max; `RobustScaler` uses the median and IQR when outliers are heavy | Works best when features are roughly Gaussian; the mean and standard deviation are still influenced by outliers | `StandardScaler().fit_transform(data)`<br>`RobustScaler().fit_transform(data)` |
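The techniques in the table can be tried side by side on a small example. This is a sketch under assumptions: scikit-learn is installed, the `toy` frame is hypothetical (it borrows column names from the churn preview, and the "Two-Year" contract value and the category order are made up for illustration), and 500.0 is a deliberate outlier.

```python
import pandas as pd
from sklearn.preprocessing import (
    MinMaxScaler,
    OrdinalEncoder,
    RobustScaler,
    StandardScaler,
)

# Hypothetical toy data; 500.0 is a deliberate outlier
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 40.0, 60.0, 80.0, 500.0],
    "ContractType": ["Month-to-Month", "One-Year", "Two-Year",
                     "One-Year", "Month-to-Month"],
    "TechSupport": ["Yes", "No", "No", "Yes", "No"],
})

numeric = toy[["MonthlyCharges"]]  # scalers expect 2-D input

# Compare three scalers on the same column
scaled = pd.DataFrame({
    "minmax": MinMaxScaler().fit_transform(numeric).ravel(),
    "standard": StandardScaler().fit_transform(numeric).ravel(),
    "robust": RobustScaler().fit_transform(numeric).ravel(),
})
print(scaled.round(3))

# One-hot encoding for a nominal column (no natural order)
one_hot = pd.get_dummies(toy["TechSupport"], prefix="TechSupport")
print(one_hot)

# Ordinal encoding with an explicit order (assumed: shorter to longer contracts)
contract_order = [["Month-to-Month", "One-Year", "Two-Year"]]
encoder = OrdinalEncoder(categories=contract_order)
toy["ContractType_encoded"] = encoder.fit_transform(toy[["ContractType"]]).ravel()
print(toy[["ContractType", "ContractType_encoded"]])
```

In the `minmax` column the outlier pushes the other four values close to 0, while the `robust` column keeps them evenly spread around the median. In a real project, a scaler or encoder is usually fitted on the training set only and then reused to transform the validation and test sets.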

