# **COMPLETE GUIDE TO DATA PREPROCESSING: WHAT, WHY & HOW TO HANDLE MESSY DATA**

### Let's prepare our data for better analysis and modeling

## **A. Learning Objectives**

* **Get a clearer idea of why data preprocessing actually matters**

* **Learn how to apply different data preprocessing techniques**

## **B. Background Theory**


The quality of your data plays a big role in how accurate your analysis and models turn out. Why? Because raw data usually comes with a bunch of issues—like errors, inconsistencies, or stuff that’s just not useful—which can mess up your results and lead to misleading insights. That’s where data preprocessing comes in. It’s basically the step where you clean and organize your data so it’s ready to be used properly.

### **What is Data Preprocessing?**

Data preprocessing is the process of preparing raw data so it can be used effectively in machine learning models. Think of it like cleaning and organizing ingredients before cooking a meal—if the ingredients are messy or spoiled, the final dish won’t turn out well.

### **What It Involves**

Here are the key steps in data preprocessing:

1. Cleaning the Data
    
    - Fixing or removing missing values
    
    - Removing duplicates
    
    - Correcting errors or inconsistencies
    
2. Transforming the Data

    - Scaling values (e.g., making sure all numbers are on a similar scale)
    
    - Encoding categories (e.g., turning "Yes"/"No" into 1/0)
    
    - Normalizing or standardizing data

3. Reducing the Data

    - Selecting only the most useful features

    - Removing irrelevant or redundant information

    - Dimensionality reduction (e.g., using PCA)

4. Splitting the Data
    
    - Dividing into training, validation, and test sets
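
As a rough illustration, the four steps above can be sketched end to end on a tiny table. This is a minimal sketch, not a recommended pipeline: the toy data, the column names (`customer_id`, `age`, `monthly_fee`, `subscribed`), and the 80/20 split ratio are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age": [25, 32, 32, 47, np.nan, 51],
    "monthly_fee": [30.0, 55.5, 55.5, 80.0, 42.0, 99.0],
    "subscribed": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# 1. Cleaning: remove exact duplicate rows, then fill the missing age with the median.
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 2. Transforming: encode Yes/No as 1/0 and min-max scale numeric columns to [0, 1].
clean["subscribed"] = clean["subscribed"].map({"Yes": 1, "No": 0})
for col in ["age", "monthly_fee"]:
    col_min, col_max = clean[col].min(), clean[col].max()
    clean[col] = (clean[col] - col_min) / (col_max - col_min)

# 3. Reducing: keep only the chosen features; the identifier column is left out.
features = clean[["age", "monthly_fee"]]
target = clean["subscribed"]

# 4. Splitting: shuffle once, then hold out the last 20% of rows as a test set.
shuffled = features.sample(frac=1.0, random_state=0)
n_test = max(1, int(len(shuffled) * 0.2))
X_train, X_test = shuffled.iloc[:-n_test], shuffled.iloc[-n_test:]
y_train, y_test = target.loc[X_train.index], target.loc[X_test.index]

print(X_train.shape, X_test.shape)
```

The sketch scales before splitting only to mirror the order of the list. In practice, scaling statistics are usually computed on the training split alone, so that information from the test set does not leak into training.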


### **Why It’s Important in Machine Learning**

Machine learning models learn patterns from data. If the data is messy, the model might learn the wrong patterns or get confused. Preprocessing helps:

- Improve model accuracy

- Speed up training

- Avoid errors and bias

- Make the model more generalizable

## **C. Practical Steps for Effective Data Preprocessing**

Let's load the dataset first. In this tutorial, we will use the Customer Churn Prediction dataset from Bharti Prasad: [view-dataset](https://www.kaggle.com/code/bhartiprasad17/customer-churn-prediction)

In [None]:
import pandas as pd
# Load the dataset

url_dataset = "https://raw.githubusercontent.com/adisetiawannn/DataCraft/main/data/raw/dataset_preprocessing/customer_churn_data.csv"

df = pd.read_csv(url_dataset)

# display the first five rows of the dataframe
df.head(5)

# summarize the columns, non-null counts, and dtypes (this is the output shown below)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerID       1000 non-null   int64  
 1   Age              1000 non-null   int64  
 2   Gender           1000 non-null   object 
 3   Tenure           1000 non-null   int64  
 4   MonthlyCharges   1000 non-null   float64
 5   ContractType     1000 non-null   object 
 6   InternetService  703 non-null    object 
 7   TotalCharges     1000 non-null   float64
 8   TechSupport      1000 non-null   object 
 9   Churn            1000 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 78.3+ KB


### **1. Cleaning Data**

#### 1.1 Missing Value

Missing values are empty spots in your data—like blanks or NaNs—that can mess things up when training a model.

- Why Fix Them?

Because most machine learning models cannot handle missing values. If you ignore them, training may fail with an error or the model may give poor predictions.

- How to Handle?

    - Drop them if they’re not important
    
    - Fill them with something like the mean, median, or a default value
    - Use forward/backward fill for time-based data
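
To make these options concrete, here is a small sketch on made-up data. The `readings` frame, its column names, and the `"unknown"` default label are assumptions for illustration and are not part of the churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps; names and values are illustrative only.
readings = pd.DataFrame(
    {
        "temperature": [20.0, np.nan, 23.0, np.nan, 30.0],
        "status": ["ok", None, "ok", "warn", None],
    },
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Option 1: drop every row that has at least one missing value.
dropped = readings.dropna()

# Option 2: fill numeric gaps with a summary statistic,
# and categorical gaps with a default label.
filled_mean = readings["temperature"].fillna(readings["temperature"].mean())
filled_median = readings["temperature"].fillna(readings["temperature"].median())
filled_default = readings["status"].fillna("unknown")

# Option 3: for time-ordered data, carry the previous (ffill)
# or the next (bfill) observation into each gap.
forward = readings["temperature"].ffill()
backward = readings["temperature"].bfill()

print(dropped)
print(forward.tolist())
print(backward.tolist())
```

Which option fits depends on the column: dropping is simplest but discards rows, the median is less sensitive to outliers than the mean, and forward or backward fill only makes sense when row order carries meaning.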

In [48]:
# check which columns contain missing values
columns_with_missing = df.columns[df.isnull().any()].tolist()

print("we have missing values in the following columns:", columns_with_missing)


# from the summary above, only the InternetService column has missing values (703 of 1000 are non-null),
# so we drop the rows where InternetService is missing
df_drop = df.dropna(subset=['InternetService'])
null_counts = df_drop.isnull().sum()  # check null counts after dropping rows
print(null_counts)
print(null_counts)

we have missing values in the following columns: ['InternetService']
CustomerID         0
Age                0
Gender             0
Tenure             0
MonthlyCharges     0
ContractType       0
InternetService    0
TotalCharges       0
TechSupport        0
Churn              0
dtype: int64
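
Dropping is the simplest fix, but the summary earlier shows only 703 non-null `InternetService` values out of 1000, so it removes 297 rows. The sketch below shows one way to measure that cost before deciding, and a possible alternative that keeps every row. It uses a stand-in frame named `demo` that only mirrors those counts; the category labels and the `"Unknown"` placeholder are assumptions, not values taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the tutorial's df: 1000 rows, 703 of them with a non-null
# InternetService value. The category labels here are hypothetical.
present = ["Fiber Optic" if i % 2 == 0 else "DSL" for i in range(703)]
missing = [np.nan] * 297
demo = pd.DataFrame({
    "CustomerID": range(1, 1001),
    "InternetService": present + missing,
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = demo.isnull().mean()

# Number of rows lost if we drop on InternetService.
rows_lost = len(demo) - len(demo.dropna(subset=["InternetService"]))

# Alternative: keep all rows and mark the gap with an explicit placeholder category.
demo_filled = demo.copy()
demo_filled["InternetService"] = demo_filled["InternetService"].fillna("Unknown")

print(missing_share)
print("rows lost by dropping:", rows_lost)
print("rows kept by filling:", len(demo_filled))
```

When a column is missing in roughly 30% of rows, as here, a placeholder category may preserve more training data than dropping, at the price of adding a category whose meaning has to be interpreted with care.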
