## Data Preprocessing in Machine Learning ⭐

The goal is simple: **prepare raw data so our model can learn patterns instead of noise**.

**The Complete Data Preprocessing Pipeline:**

1️⃣ **Data Collection**

You gather raw data from:

- CSV files
- Databases
- APIs
- Web scraping
- Sensors, logs
- Data warehouses (Snowflake, BigQuery, etc.)

No preprocessing yet — just collecting.

```python
import pandas as pd

df = pd.read_csv("sleep_gym.csv")
print(df.head())
```

```
   user_id  hours_sleep  energy_level  went_gym
0        1          7.5             8         1
1        2          5.0             4         0
2        3          6.0             6         1
3        4          4.5             3         0
4        5          8.0             9         1
```


2️⃣ **Data Understanding / Exploration (EDA = Exploratory Data Analysis)**

Before cleaning anything, you explore the data to understand what it contains.

You check:

- What each column means
- Which columns are numeric, categorical, text, dates
- Shape of dataset
- Missing values
- Outliers
- Distribution of each feature
- Correlations

Tools:

- df.info(), df.describe()
- Histograms
- Boxplots
- Pairplots
- Correlation matrix

This helps you decide what cleaning/transformation is needed.
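
The checks above can be sketched with a few pandas calls. This is a minimal example that rebuilds the five rows from the `df.head()` output above instead of reading the CSV file; the plots (histograms, boxplots, pairplots) are left out to keep it dependency-free.

```python
import pandas as pd

# Rows copied from the df.head() output shown earlier.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "hours_sleep": [7.5, 5.0, 6.0, 4.5, 8.0],
    "energy_level": [8, 4, 6, 3, 9],
    "went_gym": [1, 0, 1, 0, 1],
})

print(df.shape)        # (number of rows, number of columns)
df.info()              # dtypes and non-null counts per column
print(df.describe())   # summary statistics for numeric columns

# Count of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Pairwise Pearson correlations between numeric columns.
corr = df.corr()
print(corr)
```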

---

**❗IMPORTANT:**
**EDA — Understanding the Dataset’s Nature and Context**

Before performing any cleaning, scaling, or transformation, **it is crucial to understand the nature of the dataset** you’re working with. EDA is not just about generating plots and summaries — **it’s about understanding the context in which the data was collected** and **what each feature truly represents**.

Many preprocessing steps (like dropping columns, filling missing values, removing outliers, or smoothing noisy data) can accidentally delete **meaningful information** if you don’t understand this context.

For example:

- A “missing” value might actually mean the event didn’t happen (not an error).
- A “weird” value might be rare but important (fraud detection, medical anomalies).
- An outlier might represent a critical edge case rather than noise.
- A categorical label with only a few samples might be a rare but valid class.

Therefore, a major part of EDA is to make sure you understand:

- How the data was collected
- What each feature means
- Why some values may be missing or extreme
- Whether irregularities are errors or meaningful signals

Only after understanding the story behind the data can you decide the right preprocessing steps.

3️⃣ **Data Cleaning**

This is often the most time-consuming part, and one of the most important.

Includes:

**✔ Handling Missing Values**

Ways:

- Delete rows/columns (if too many values are missing)
- Fill with mean/median (numeric)
- Fill with mode (categorical)
- Forward/backward fill (time series)
- Use models to predict missing values
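
A minimal sketch of the simpler strategies, using pandas only. The data and the `plan` column are hypothetical, made up for illustration; model-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (not taken from the original dataset).
df = pd.DataFrame({
    "hours_sleep": [7.5, np.nan, 6.0, 4.5, 8.0],
    "plan": ["basic", "pro", None, "basic", "basic"],
})

# Numeric column: fill with the median, which is less affected by outliers than the mean.
df["hours_sleep"] = df["hours_sleep"].fillna(df["hours_sleep"].median())

# Categorical column: fill with the most frequent value (mode).
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])

# Time series: forward fill carries the last known value forward.
readings = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = readings.ffill()
```

Whether the median, the mode, or a forward fill is appropriate depends on why the values are missing, which is what the EDA step is meant to establish.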

**✔ Handling Noise / Errors**

Examples:

- Wrong values (“age = 300”)
- Duplicates
- Inconsistent labels or typos (“Male”, “male”, “MALE”)
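
These three problems can be handled roughly as follows. The rows are invented for illustration, and the valid age range of 0 to 120 is an assumption you would adjust to your own data.

```python
import pandas as pd

# Hypothetical records with a duplicate row, inconsistent labels, and an impossible age.
df = pd.DataFrame({
    "name": ["Sam", "Sam", "Bob", "Cy"],
    "gender": ["Male", "Male", "male ", "MALE"],
    "age": [34, 34, 300, 28],
})

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Normalize label spelling: trim whitespace and lowercase.
df["gender"] = df["gender"].str.strip().str.lower()

# Mark impossible ages as missing instead of guessing a replacement.
# The 0-120 range is an assumed plausibility check.
df["age"] = df["age"].where(df["age"].between(0, 120))
```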

**✔ Outlier Detection**

Methods:

- Z-Score
- IQR
- Isolation Forest
- Boxplot inspection

You either remove, cap, or transform outliers.
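
As one example, the IQR rule flags values that fall more than 1.5 × IQR outside the first or third quartile. The sketch below uses made-up values and shows both flagging and capping; the 1.5 multiplier is the usual convention, not a fixed requirement.

```python
import pandas as pd

# Hypothetical sleep durations; 25.0 looks suspicious.
values = pd.Series([4.5, 5.0, 6.0, 7.5, 8.0, 25.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the IQR fences.
is_outlier = (values < lower) | (values > upper)

# Capping: pull extreme values back to the nearest fence instead of dropping them.
capped = values.clip(lower=lower, upper=upper)
```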

**✔ Data Type Corrections**

- Converting strings → dates
- Strings → categories
- Float → int
- Object type → numeric
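
Each of these conversions has a direct pandas counterpart. A small sketch with hypothetical columns:

```python
import pandas as pd

# Hypothetical columns that were all loaded with the wrong type.
df = pd.DataFrame({
    "signup": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "plan": ["basic", "pro", "basic"],
    "visits": [3.0, 5.0, 2.0],
    "spend": ["10.5", "n/a", "7"],
})

df["signup"] = pd.to_datetime(df["signup"])                 # strings -> dates
df["plan"] = df["plan"].astype("category")                  # strings -> categories
df["visits"] = df["visits"].astype(int)                     # float -> int (no missing values here)
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")   # object -> numeric, unparseable -> NaN
```

Note that `astype(int)` fails if the column still contains missing values, so this conversion usually comes after imputation.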

4️⃣ **Data Integration (if you have multiple data sources)**

When data comes from different places, you must integrate it.

Tasks:

- Merge tables (SQL joins)
- Resolve inconsistencies across systems
    - Different names ("user_id" vs "uid")
    - Different units (kg vs lbs)
    - Different formats (timestamps)
- Detect duplicate records across datasets

Example:

User info in MySQL + user activity in MongoDB → must combine into a single dataset.
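
Once both sources are loaded into DataFrames, the integration tasks look roughly like this. The two tables below are invented stand-ins for the MySQL and MongoDB exports, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for two source systems.
users = pd.DataFrame({"user_id": [1, 2, 3], "weight_kg": [70.0, 82.5, 64.0]})
activity = pd.DataFrame({
    "uid": [1, 2, 2, 4],
    "lifted_lbs": [100.0, 150.0, 150.0, 90.0],
})

# Resolve naming differences so both tables share the same key.
activity = activity.rename(columns={"uid": "user_id"})

# Resolve unit differences: convert pounds to kilograms.
activity["lifted_kg"] = activity["lifted_lbs"] * 0.45359237
activity = activity.drop(columns="lifted_lbs")

# Drop records that appear twice in the activity source.
activity = activity.drop_duplicates()

# Left join keeps every user, even those without activity.
merged = users.merge(activity, on="user_id", how="left")
```

A left join is used here so users without activity are kept with missing values; an inner join would silently drop them.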

5️⃣ **Data Transformation**

This prepares the data for the algorithm.

**✔ Scaling / Normalization**

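Used for algorithms sensitive to feature scale, either because they rely on distances (KNN, SVM, K-means) or because they are trained with gradient descent or regularization (Logistic Regression, Neural Networks):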

- KNN
- SVM
- Logistic Regression
- Neural Networks
- K-means

Methods:

- Standardization (zero mean, unit variance)
- Min-Max scaling (rescales to a fixed range, usually 0 to 1)
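
Both methods are available in scikit-learn. The sketch below applies them to the `hours_sleep` and `energy_level` values from the sample rows shown earlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: hours_sleep, energy_level (values from the sample rows).
X = np.array([[7.5, 8], [5.0, 4], [6.0, 6], [4.5, 3], [8.0, 9]])

# Standardization: each column gets mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Min-Max scaling: each column is rescaled to the range [0, 1].
minmax = MinMaxScaler().fit_transform(X)
```

In a real project the scaler should be fitted on the training data only and then applied to the test data with `transform`.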


**✔ Encoding Categorical Variables**

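Most models need numeric inputs, so categories are converted to numbers:

- One-hot encoding (few distinct categories)
- Label/ordinal encoding (ordered categories; on unordered ones the integer codes imply a ranking that does not exist)
- Target encoding (high-cardinality features)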
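
A small pandas sketch of the first two options. The `city` and `size` columns are hypothetical, and the ordering small < medium < large is an assumption stated explicitly in the code; target encoding is not shown.

```python
import pandas as pd

# Hypothetical categorical columns.
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo"],    # no natural order
    "size": ["small", "large", "medium", "small"],  # has a natural order
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal encoding: integer codes that follow an explicit, assumed order.
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes
```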

**✔ Binning**

- Converting numeric data to categories (e.g. Age → Age groups).
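
For example, `pd.cut` can turn ages into age groups. The bin edges and labels below are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 65, 80])

# Assumed bin edges: [0, 18) child, [18, 65) adult, [65, 120) senior.
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]

# right=False makes each bin include its left edge and exclude its right edge.
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
```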

**✔ Feature Construction**

Creating new important variables:

- BMI = weight / height²
- Day of week from date
- Length of text input
- Interaction terms (feature1 × feature2)
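
Each of these constructed features is a one-line pandas expression. The columns below are hypothetical and only serve to illustrate the four ideas.

```python
import pandas as pd

# Hypothetical raw columns.
df = pd.DataFrame({
    "weight_kg": [70.0, 90.0],
    "height_m": [1.75, 1.80],
    "visit_date": pd.to_datetime(["2024-01-01", "2024-01-06"]),
    "note": ["felt great", "tired"],
})

df["bmi"] = df["weight_kg"] / df["height_m"] ** 2           # BMI = weight / height²
df["day_of_week"] = df["visit_date"].dt.day_name()          # day of week from date
df["note_length"] = df["note"].str.len()                    # length of text input
df["weight_x_height"] = df["weight_kg"] * df["height_m"]    # interaction term
```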

**✔ Text preprocessing**

- Tokenization
- Lowercasing
- Stopword removal
- Lemmatization
- TF-IDF
- Embeddings (Word2Vec, BERT)

**✔ Time-series preprocessing**

- Creating lags
- Rolling means
- Seasonality decomposition
- Differencing (for stationarity)
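
Lags, rolling means, and differencing map directly onto pandas methods. The daily values below are invented; seasonality decomposition is not shown.

```python
import pandas as pd

# Hypothetical daily series.
sales = pd.Series(
    [10.0, 12.0, 15.0, 11.0, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="sales",
)

features = pd.DataFrame({
    "sales": sales,
    "lag_1": sales.shift(1),                           # value from the previous day
    "rolling_mean_3": sales.rolling(window=3).mean(),  # mean of the last 3 days
    "diff_1": sales.diff(),                            # change versus the previous day
})
```

The first rows contain missing values because there is no earlier data to look back on; they are usually dropped before training.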

6️⃣ **Data Reduction (optional but useful)**

This step makes the data smaller and training faster, often with less noise.

**✔ Dimensionality Reduction**

- PCA
- t-SNE (mainly for visualization)
- UMAP
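
A minimal PCA sketch with scikit-learn. The data is synthetic: five features generated from two hidden factors plus noise, so two components should capture most of the variance. Features are standardized first because PCA is sensitive to scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features built from 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(100, 2))
X = hidden @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

X_scaled = StandardScaler().fit_transform(X)

# Keep the 2 directions with the highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance retained by the kept components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, explained)
```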

**✔ Feature Selection**

- Filter methods (correlation, chi-square)
- Wrapper methods (RFE)
- Embedded methods (Lasso)
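
As an illustration of a filter method, the sketch below scores each feature with an ANOVA F-test and keeps the best one. This swaps in `f_classif` rather than correlation or chi-square because the synthetic features are continuous; wrapper and embedded methods are not shown.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: column 1 tracks the label closely, the other columns are pure noise.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
informative = y + 0.1 * rng.normal(size=100)
noise = rng.normal(size=(100, 3))
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Score every feature against the label and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)  # indices of the selected columns
print(kept)
```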

**✔ Sampling**

- Downsampling
- Upsampling (SMOTE for classification)

Useful when the dataset is extremely large or imbalanced. Resample only the training set, so the test set keeps the real class distribution.
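
Random downsampling and upsampling can be sketched with pandas alone. The toy dataset is invented, and plain random oversampling is used here in place of SMOTE, which lives in the separate imbalanced-learn package.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 negatives, 2 positives.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Downsampling: keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=0)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=0)

# Upsampling: draw minority rows with replacement until the classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority_up])
```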

7️⃣ **Train-Test Split**

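You must separate the data to evaluate fairly. Although it is listed here as step 7, do the split before fitting any imputer, scaler, or encoder, so their statistics come from the training set only; otherwise information from the test set leaks into training.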

Typical:

- Train: 70–80%
- Test: 20–30%

Sometimes also:

- Validation set
- K-fold cross-validation
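
A short sketch of an 80/20 split followed by 5-fold cross-validation on the training part. The data is synthetic and `LogisticRegression` is just a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# 80% train / 20% test; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold cross-validation on the training set only; the test set stays untouched.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(scores.mean())
```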

8️⃣ **Final Preprocessing Pipeline (for production)**

You package everything into a data pipeline (sklearn Pipeline):

- ✔ Scaling
- ✔ Encoding
- ✔ Imputation
- ✔ Feature selection
- ✔ Model

This ensures:

- No data leakage
- Reproducibility
- Cleaner code
- Same transformation in training & prediction
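
One possible way to assemble such a pipeline with scikit-learn is shown below. The data extends the sleep/gym example with invented rows and a hypothetical `plan` column; feature selection is left out for brevity, and `LogisticRegression` is only a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data; the "plan" column and the extra rows are invented.
df = pd.DataFrame({
    "hours_sleep": [7.5, 5.0, np.nan, 4.5, 8.0, 6.5, 5.5, 7.0],
    "energy_level": [8, 4, 6, 3, 9, 7, 4, 8],
    "plan": ["basic", "pro", "basic", "basic", "pro", "pro", "basic", "pro"],
    "went_gym": [1, 0, 1, 0, 1, 1, 0, 1],
})
X, y = df.drop(columns="went_gym"), df["went_gym"]

# Numeric columns: impute missing values, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column type to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric, ["hours_sleep", "energy_level"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# Preprocessing and model are fitted together, so the same steps run at prediction time.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
predictions = model.predict(X)
```

Because the imputer, scaler, and encoder live inside the pipeline, calling `fit` on training data and `predict` on new data applies identical transformations without refitting them on unseen rows.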