Welcome to my Data Preprocessing Project
This repository demonstrates core data preprocessing techniques in Python using Pandas, NumPy, and basic Python. Every step is implemented by hand, without ready-made helpers such as get_dummies or sklearn.
It is aimed at beginners who want to understand how encoding and imputation work under the hood.
Project Overview
Data preprocessing is a crucial step in any Machine Learning pipeline. It ensures that the dataset is clean, consistent, and ready for modelling.
In this project, we focus on three main techniques:
- Ordinal Encoding
  - Converting categorical data with natural order (like education levels) into numbers.
  - Example: 12th Pass = 1, Graduate = 2, Post-Graduate = 3.
- One Hot Encoding
  - Converting categorical data with no specific order into separate binary columns (0/1).
  - Example: Cities like Delhi, Mumbai, Bangalore → `City_Delhi`, `City_Mumbai`, `City_Bangalore`.
- Imputation (Mean)
  - Handling missing values (`NaN`) by replacing them with the mean of the column.
  - Ensures there are no gaps in the dataset for numerical analysis.
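
The two encodings above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the names `EDUCATION_ORDER`, `ordinal_encode`, and `one_hot_encode` are hypothetical, and plain lists stand in for DataFrame columns. The sample values reuse the education levels and cities from the examples.

```python
# Hypothetical helper names; a sketch of the two encodings described above.

# Ordinal encoding: the mapping itself carries the order of the categories.
EDUCATION_ORDER = {"12th Pass": 1, "Graduate": 2, "Post-Graduate": 3}


def ordinal_encode(values, order):
    """Replace each category with its rank; an unknown category raises KeyError."""
    return [order[value] for value in values]


def one_hot_encode(values, prefix):
    """Build one 0/1 column per distinct category, in order of first appearance."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)

    columns = {}
    for category in categories:
        columns[f"{prefix}_{category}"] = [
            1 if value == category else 0 for value in values
        ]
    return columns


education = ["Graduate", "Post-Graduate", "12th Pass"]
cities = ["Delhi", "Mumbai", "Bangalore", "Delhi"]

print(ordinal_encode(education, EDUCATION_ORDER))  # [2, 3, 1]
for name, column in one_hot_encode(cities, "City").items():
    print(name, column)
```

Each row ends up with exactly one `1` across the `City_*` columns, which is what distinguishes one hot encoding from assigning arbitrary numbers to unordered categories.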
Dataset
| ID | Name | City | Education | Experience (Years) | Salary (₹) |
|---|---|---|---|---|---|
| 1 | Amit | Delhi | Graduate | 2 | 32000 |
| 2 | Riya | Mumbai | Post-Graduate | 5 | 54000 |
| 3 | Sam | Delhi | 12th Pass | NaN | 25000 |
| 4 | John | Bangalore | Graduate | 3 | NaN |
| 5 | Neha | Mumbai | Post-Graduate | 4 | 58000 |
| 6 | Arjun | Delhi | 12th Pass | 1 | NaN |
| 7 | Priya | Bangalore | Graduate | NaN | 41000 |
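
Mean imputation on the two numeric columns of the table above can be sketched as follows. This is an illustrative sketch rather than the project's own implementation: `is_missing` and `mean_impute` are hypothetical names, the columns are copied into plain lists, and `math.nan` stands in for the `NaN` cells. The mean is computed from the known values only.

```python
import math

# Columns copied from the table above; math.nan marks the missing cells.
experience = [2, 5, math.nan, 3, 4, 1, math.nan]
salary = [32000, 54000, 25000, math.nan, 58000, math.nan, 41000]


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_impute(values):
    """Return a copy of values with missing entries replaced by the mean of the rest."""
    known = [v for v in values if not is_missing(v)]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if is_missing(v) else v for v in values]


print(mean_impute(experience))  # [2, 5, 3.0, 3, 4, 1, 3.0]
print(mean_impute(salary))      # [32000, 54000, 25000, 42000.0, 58000, 42000.0, 41000]
```

For this dataset the five known experience values average 3.0 years and the five known salaries average ₹42000, so those are the values written into the empty cells.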
Techniques Implemented
- Ordinal Encoding → Mapping education levels to numeric values using basic Python.
- One Hot Encoding → Creating city dummy columns using loops.
- Mean Imputation → Filling missing values using manually calculated means.