Let's dive into a **guide to data cleaning**, covering the **key steps** with **code examples** in Python using `pandas`, `numpy`, `scipy`, and `sklearn`.

---

## ✅ What is Data Cleaning?

**Data cleaning** is the process of **detecting and correcting (or removing)** errors, inconsistencies, and inaccuracies in the data to improve its quality before analysis or modeling.

---

## 🔧 A. DATA CLEANING CHECKLIST (with CODE)

---

### 🔹 1. **Load and Understand the Data**

```python
import pandas as pd

df = pd.read_csv("data.csv")  # or pd.read_excel, pd.read_json, etc.
print(df.shape)
print(df.dtypes)
print(df.head())
```

---

### 🔹 2. **Identify and Handle Missing Values**

#### ✅ a. Detect Missing Values

```python
df.isnull().sum()         # Total missing per column
df.isnull().mean()*100    # Percentage missing
```

#### ✅ b. Drop Missing Values

```python
df.dropna(inplace=True)  # Drop rows with any missing values
```

#### ✅ c. Fill Missing Values

```python
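# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())            # Numeric: mean
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Categorical: mode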
```

#### ✅ d. Interpolation

```python
df['Temperature'] = df['Temperature'].interpolate(method='linear')
```

---

### 🔹 3. **Handle Duplicates**

```python
df.duplicated().sum()
df.drop_duplicates(inplace=True)
```

---

### 🔹 4. **Fix Data Types**

```python
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # Unparseable dates become NaT
df['Age'] = df['Age'].round().astype('Int64')  # Nullable integer; plain 'int' raises if NaN is present
```

---

### 🔹 5. **Standardize Categorical Values**

```python
df['Gender'] = df['Gender'].str.lower().str.strip()
df['Gender'] = df['Gender'].replace({'m': 'male', 'f': 'female'})
```

---

### 🔹 6. **Outlier Detection and Treatment**

#### ✅ a. Z-Score Method

```python
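from scipy.stats import zscore

# nan_policy='omit' ignores NaNs when computing the mean and std;
# without it, a single NaN turns every z-score into NaN.
z_scores = zscore(df['Income'], nan_policy='omit')
df = df[abs(z_scores) < 3]  # Rows with NaN Income are dropped as well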
```

#### ✅ b. IQR Method

```python
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Income'] >= Q1 - 1.5 * IQR) & (df['Income'] <= Q3 + 1.5 * IQR)]
```
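
Both methods above drop the outlying rows. When losing rows is undesirable, one alternative treatment is to cap values at the IQR bounds instead. The following is a minimal sketch on made-up `Income` values (the data are an assumption for illustration only), using the same 1.5 × IQR rule:

```python
import pandas as pd

# Illustrative values only; 500 is the obvious outlier
income = pd.Series([30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 500.0], name="Income")

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() replaces values outside [lower, upper] with the nearest bound
capped = income.clip(lower=lower, upper=upper)
print(capped.tolist())  # the last value is capped at 39.0
```

Capping keeps the row count unchanged, which can matter when the other columns of an outlying row are still valid.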

---

### 🔹 7. **Handle Inconsistent Formatting**

#### ✅ a. Remove Whitespace & Special Characters

```python
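# Caution: this pattern also strips digits, hyphens, apostrophes, and accented letters (e.g. "O'Neil", "José")
# Strip after replacing so that removed characters do not leave trailing spaces behind
df['Name'] = df['Name'].str.replace('[^a-zA-Z ]', '', regex=True).str.strip()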
```

#### ✅ b. Fix Case Sensitivity

```python
df['City'] = df['City'].str.title()
```

---

### 🔹 8. **Drop Unnecessary Columns**

```python
# errors='ignore' skips columns that are not present instead of raising KeyError
df.drop(columns=['Unnamed: 0', 'Temp_ID'], inplace=True, errors='ignore')
```

---

### 🔹 9. **Detect and Handle Inconsistent Units or Scales**

```python
# Convert "cm" to "m" for height
df['Height_m'] = df['Height_cm'] / 100
```

---

### 🔹 10. **Validate Business Rules**

#### ✅ Example:

```python
# Salary should not be negative
df = df[df['Salary'] >= 0]

# Age should be within realistic range
df = df[(df['Age'] > 0) & (df['Age'] < 120)]
```
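
Before dropping rows, it can help to see how many rows break each rule. The sketch below is one possible way to do that; the sample values and the rule names (`salary_non_negative`, `age_in_range`) are made up for illustration:

```python
import pandas as pd

# Illustrative data: one negative salary, one unrealistic age
people = pd.DataFrame({"Salary": [50000, -100, 42000], "Age": [34, 29, 150]})

# Each rule is a boolean Series: True where the row passes
rules = {
    "salary_non_negative": people["Salary"] >= 0,
    "age_in_range": (people["Age"] > 0) & (people["Age"] < 120),
}
checks = pd.DataFrame(rules)

violations = (~checks).sum()            # number of failing rows per rule
valid = people[checks.all(axis=1)]      # rows that pass every rule

print(violations)
print(valid)
```

Keeping the rules in one dictionary makes it easy to log the violation counts before deciding whether to drop, fix, or flag the offending rows.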

---

### 🔹 11. **Text Cleaning (for NLP)**

```python
import re

def clean_text(text):
    if not isinstance(text, str):                    # NaN or other non-string input
        return ""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)             # Remove URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)          # Remove special characters (text is already lowercase)
    text = re.sub(r"\s+", " ", text).strip()        # Collapse repeated whitespace
    return text

df['Clean_Review'] = df['Review'].apply(clean_text)
```

---

## ✅ BONUS: Automate Cleaning with Pipelines (Scikit-learn)

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

X_cleaned = pipeline.fit_transform(df[['Age', 'Salary']])
```
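
The pipeline above handles numeric columns only. For a mix of numeric and categorical columns, one common pattern is scikit-learn's `ColumnTransformer`, which applies a separate sub-pipeline to each group of columns. This is a sketch under assumptions: the sample values are invented, and the choice of most-frequent imputation plus one-hot encoding for `Gender` is just one reasonable option.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data with gaps in both numeric and categorical columns
data = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Salary": [50000, 62000, np.nan, 58000],
    "Gender": ["male", "female", np.nan, "female"],
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(transformers=[
    ("num", numeric, ["Age", "Salary"]),
    ("cat", categorical, ["Gender"]),
])

X_prepared = preprocess.fit_transform(data)
print(X_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Fitting the transformer on training data and calling only `transform` on new data keeps the imputation statistics from leaking across splits.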

---

## 🧠 Summary Table

| Task                 | Tools/Methods                            |
| -------------------- | ---------------------------------------- |
| Missing values       | `dropna`, `fillna`, `interpolate`        |
| Duplicates           | `drop_duplicates`                        |
| Data type fixing     | `astype`, `to_datetime`                  |
| Categorical cleanup  | `.str.lower()`, `.replace()`, `.strip()` |
| Outlier handling     | Z-score, IQR                             |
| Inconsistent formats | Regex, `.str.replace()`, `.title()`      |
| Text cleanup         | Regex (`re`)                             |
| Column reduction     | `drop(columns=[...])`                    |

---


## ✅ What is a KMeans Imputer?

Instead of filling missing values with the global **mean/median** (which ignores relationships between features), a **KMeans imputer**:

* Groups similar rows using **K-Means clustering**.
* Fills missing values using the **mean of the feature within its cluster**.

It’s especially useful when:

* Your data has **structured patterns** (e.g., customer segments, behavior groups).
* You're dealing with **numerical features**.
* You want a **smarter imputation** than global averages.

---

## 🛠️ How It Works (Step-by-Step)

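1. Standardize the features, then use only rows with no missing values to **train the KMeans model**.
2. Temporarily fill the gaps (e.g., with column means) and **predict a cluster for every row**, including the incomplete ones.
3. Impute each missing value with the **cluster-wise mean** of that feature.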

---

## ✅ Code Example: KMeans Imputer (Manual Implementation)

### 🔹 Step 1: Import Libraries

```python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
```

### 🔹 Step 2: Sample Data

```python
# Sample data with missing values
df = pd.DataFrame({
    'Age': [25, 27, np.nan, 22, 35, np.nan],
    'Income': [50000, 54000, 52000, 48000, np.nan, 60000],
    'Spending': [200, 220, 210, 190, 250, np.nan]
})
```

### 🔹 Step 3: Scale & Separate Complete Rows

```python
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Separate rows with and without missing values
df_complete = df_scaled.dropna()
df_missing = df_scaled[df_scaled.isnull().any(axis=1)]
```

### 🔹 Step 4: Fit KMeans on Complete Rows

```python
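# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df_complete = df_complete.copy()  # work on an independent copy before adding a column
# Fit on a plain array so the later predict() on an array does not warn about feature names
df_complete['cluster'] = kmeans.fit_predict(df_complete.to_numpy())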
```

### 🔹 Step 5: Assign Cluster to Incomplete Rows

```python
# Fill missing temporarily for distance calculation
temp_imputer = SimpleImputer(strategy='mean')
temp_data = temp_imputer.fit_transform(df_scaled)

# Predict clusters for all rows (including missing)
df_scaled['cluster'] = kmeans.predict(temp_data)
```

### 🔹 Step 6: Impute Missing with Cluster Means

```python
# Inverse scaling to get back original values
df_scaled_original = pd.DataFrame(scaler.inverse_transform(df_scaled.drop(columns=['cluster'])), columns=df.columns)

# Add cluster column back
df_scaled_original['cluster'] = df_scaled['cluster']

# Fill missing with cluster-wise mean
for col in df.columns:
    for cluster in df_scaled_original['cluster'].unique():
        mask = (df_scaled_original['cluster'] == cluster) & (df_scaled_original[col].isnull())
        mean_value = df_scaled_original.loc[df_scaled_original['cluster'] == cluster, col].mean()
        df_scaled_original.loc[mask, col] = mean_value
```

---

## 🧠 Result

```python
print(df_scaled_original.drop(columns=['cluster']))
```

The result is a DataFrame in which every missing value has been replaced by the **mean of that feature within the row's cluster**, rather than by a global average.
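
For reuse, the steps above can be wrapped in a single function. The sketch below is one possible consolidation, not a library API: `kmeans_impute` is a hypothetical helper name, it assumes all columns are numeric, and it falls back to the global column mean if a cluster has no observed value for a feature.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def kmeans_impute(df, n_clusters=2, random_state=42):
    """Hypothetical helper: cluster-wise mean imputation for numeric columns."""
    # StandardScaler ignores NaNs when fitting and keeps them in the output
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(df), columns=df.columns, index=df.index
    )

    # Fit KMeans on complete rows only
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(scaled.dropna().to_numpy())

    # Temporary fill: 0 is the column mean after standardization
    clusters = km.predict(scaled.fillna(0).to_numpy())

    # Per-row cluster means in original units, then a global-mean fallback
    cluster_means = df.groupby(clusters).transform("mean")
    return df.fillna(cluster_means).fillna(df.mean())


sample = pd.DataFrame({
    "Age": [25, 27, np.nan, 22, 35, np.nan],
    "Income": [50000, 54000, 52000, 48000, np.nan, 60000],
    "Spending": [200, 220, 210, 190, 250, np.nan],
})
imputed = kmeans_impute(sample)
print(imputed)
```

The function returns a new DataFrame and leaves its input untouched, so the original gaps remain available for comparison.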

---

## 🔍 When to Use KMeans Imputer?

| Use Case                    | Suitability            |
| --------------------------- | ---------------------- |
| Numerical data              | ✅ Ideal                |
| Obvious clustering patterns | ✅ Best case            |
| Categorical data            | ❌ Not ideal            |
| Mixed data types            | ⚠️ Needs preprocessing |

---

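## ✅ Alternative: `fancyimpute` (3rd-party lib)

`fancyimpute` does not appear to ship a KMeans-based imputer (its documented solvers include `KNN`, `SoftImpute`, and `IterativeImputer`), so the closest off-the-shelf option there is its nearest-neighbor imputer:

```bash
pip install fancyimpute
```

```python
from fancyimpute import KNN

# KNN fills each gap from the k most similar rows (not from cluster means)
df_imputed = KNN(k=3).fit_transform(df.values)
```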

---



## 🧾 Table of Contents

1. [Simple Imputation Methods](#1)
2. [Statistical/Group-Based Imputation](#2)
3. [Model-Based Imputation](#3)
4. [KNN Imputation](#4)
5. [KMeans Imputation](#5)
6. [Multivariate Imputation (MICE)](#6)
7. [Deep Learning Imputation](#7)
8. [Domain-Specific or Custom Imputation](#8)
9. [Dropping Missing Values](#9)

---

## 🔹 <a name="1"></a>1. **Simple Imputation Methods**

### ✅ a. Fill with Constant

```python
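# Assign the result back instead of using inplace=True on a selected column
df['Gender'] = df['Gender'].fillna('Unknown')
df['Age'] = df['Age'].fillna(0)  # A constant like 0 distorts statistics; use only as an explicit sentinel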
```

### ✅ b. Fill with Mean/Median/Mode

```python
from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')
df[['Age']] = mean_imputer.fit_transform(df[['Age']])

mode_imputer = SimpleImputer(strategy='most_frequent')
df[['Gender']] = mode_imputer.fit_transform(df[['Gender']])
```

---

## 🔹 <a name="2"></a>2. **Statistical/Group-Based Imputation**

### ✅ a. Fill with Grouped Mean

```python
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.mean()))
```

### ✅ b. Interpolation (for time series)

```python
# Other methods: 'time' (needs a DatetimeIndex), 'polynomial' (needs order=...)
df['Temperature'] = df['Temperature'].interpolate(method='linear')
```

---

## 🔹 <a name="3"></a>3. **Model-Based Imputation**

Train a **supervised ML model** (e.g., regression) to predict missing values.

### ✅ a. Predict Age with Random Forest

```python
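from sklearn.ensemble import RandomForestRegressor

features = ['Income', 'Spending']
# Only use rows whose predictors are complete; NaN predictors fail in older scikit-learn versions
has_features = df[features].notnull().all(axis=1)

# Separate known and unknown targets
known = df[df['Age'].notnull() & has_features]
unknown = df[df['Age'].isnull() & has_features]

model = RandomForestRegressor(random_state=42)
model.fit(known[features], known['Age'])

# Rows that also lack a predictor keep their NaN and need another method
if not unknown.empty:
    df.loc[unknown.index, 'Age'] = model.predict(unknown[features])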
```

---

## 🔹 <a name="4"></a>4. **KNN Imputation**

Fills missing values using the average of the k-nearest neighbors.

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

✅ Works well when features are **correlated**.

---

## 🔹 <a name="5"></a>5. **KMeans Imputation (Cluster-Based)**

Use KMeans to cluster similar rows and impute missing values with **cluster-wise means**.

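```python
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Compact sketch (numeric columns only): mean-fill and scale temporarily, cluster, then fill per cluster
X_temp = StandardScaler().fit_transform(SimpleImputer(strategy='mean').fit_transform(df))
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_temp)
df_imputed = df.fillna(df.groupby(clusters).transform('mean'))
```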

✅ Best when dataset shows **natural clusters or segmentation**.

---

## 🔹 <a name="6"></a>6. **Multivariate Imputation (MICE - Iterative)**

Uses chained equations to model each feature as a function of the others.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required: enables the experimental IterativeImputer)
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

✅ Suitable for **complex data with multiple correlated features**.

---

## 🔹 <a name="7"></a>7. **Deep Learning Imputation (Autoencoders)**

Train an autoencoder to learn feature representations and reconstruct missing values.

```python
# Typically done with TensorFlow or PyTorch; requires custom pipeline
# Pseudo-code:
# - Normalize
# - Train autoencoder to reconstruct input
# - Use reconstruction to fill in missing
```

✅ Ideal for large, high-dimensional datasets, including data that mixes image, text, and tabular features.
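
To make the pseudo-code above concrete without a deep learning framework, here is a small autoencoder-style sketch that uses scikit-learn's `MLPRegressor` as a stand-in for a TensorFlow or PyTorch model. Everything in it is an assumption for illustration: the synthetic data, the 10% missing rate, the 2-unit bottleneck, and the three refinement rounds.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, correlated features (illustration only)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    -base + 0.1 * rng.normal(size=(200, 1)),
])

# Knock out roughly 10% of the entries
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Normalize (NaNs are ignored when fitting), then start from a mean fill (0 after scaling)
scaler = StandardScaler()
X_filled = np.nan_to_num(scaler.fit_transform(X_missing), nan=0.0)

# Train the network to reconstruct its own input through a narrow hidden layer
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
for _ in range(3):
    autoencoder.fit(X_filled, X_filled)
    reconstruction = autoencoder.predict(X_filled)
    X_filled[mask] = reconstruction[mask]   # update only the missing entries

X_imputed = scaler.inverse_transform(X_filled)
```

Only the masked positions are overwritten, so observed values pass through unchanged. A production setup would more likely use a dedicated framework and a loss that ignores the missing entries during training.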

---

## 🔹 <a name="8"></a>8. **Domain-Specific or Custom Imputation**

Examples:

* Fill missing temperatures with seasonal averages
* Fill missing product prices with category-wise medians
* NLP: Fill missing text fields with `""` or `"unknown"`

```python
# Example: fill NaNs in 'Salary' based on JobTitle
df['Salary'] = df.groupby('JobTitle')['Salary'].transform(lambda x: x.fillna(x.median()))
```
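
The seasonal-average idea from the list above follows the same group-and-fill pattern. A minimal sketch, assuming a `Date` column and using the calendar month as the "season" (both the data and that choice are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative readings: one gap in January, one in July
weather = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-01-25",
        "2023-07-03", "2023-07-15", "2023-07-28",
    ]),
    "Temperature": [2.0, np.nan, 4.0, 28.0, 30.0, np.nan],
})

# Mean temperature of each row's month, aligned to the original index
month = weather["Date"].dt.month
monthly_mean = weather.groupby(month)["Temperature"].transform("mean")

weather["Temperature"] = weather["Temperature"].fillna(monthly_mean)
print(weather)  # the January gap becomes 3.0, the July gap becomes 29.0
```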

---

## 🔹 <a name="9"></a>9. **Dropping Missing Values (When Safe)**

```python
df.dropna(axis=0, how='any', inplace=True)   # Drop rows with any missing values
df.dropna(axis=1, how='all', inplace=True)   # Drop columns where all values are NaN
```

✅ Only recommended when:

* Dataset is large
* Affected rows/columns are few
* Data is missing completely at random (MCAR)
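
The first two conditions can be checked in code before dropping anything. The helper below is a hypothetical sketch (`drop_if_sparse` and its 5% default are assumptions, not a standard rule); it does not test whether the data are MCAR, which needs domain knowledge or a dedicated statistical test.

```python
import numpy as np
import pandas as pd


def drop_if_sparse(df, max_row_loss=0.05):
    """Hypothetical helper: drop incomplete rows only if few rows are affected."""
    row_loss = df.isnull().any(axis=1).mean()   # fraction of rows with any NaN
    if row_loss <= max_row_loss:
        return df.dropna()
    return df                                    # too many rows would be lost


demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})

kept = drop_if_sparse(demo, max_row_loss=0.05)     # 25% of rows affected: left unchanged
dropped = drop_if_sparse(demo, max_row_loss=0.5)   # within the threshold: row removed
print(len(kept), len(dropped))
```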

---

## ✅ Summary Table

| Method                  | Type           | Good For                                  |
| ----------------------- | -------------- | ----------------------------------------- |
| Mean/Median/Mode        | Simple         | Basic numeric/categorical imputation      |
| Grouped Mean            | Statistical    | Data grouped by category (e.g. by gender) |
| Interpolation           | Statistical    | Time series                               |
| KNN                     | Distance-based | Correlated numeric data                   |
| KMeans                  | Cluster-based  | Segmented data                            |
| IterativeImputer (MICE) | Model-based    | Multivariate imputation                   |
| Regression/Tree Model   | Supervised     | Targeted missing value prediction         |
| Autoencoder             | Deep Learning  | Complex, high-dimensional data            |
| Drop NA                 | Last resort    | Sparse missing data or early EDA          |

---
