# 📘 Section 9: Advanced Data Cleaning, Outlier Detection & Data Validation

**Level:** Advanced

In this section, we will master **data cleaning, outlier detection, and validation** using Pandas. These techniques ensure data integrity and consistency before analysis or modeling.

We'll cover:
- Handling inconsistent data types
- Detecting and treating outliers
- Dealing with duplicates and missing patterns
- Data validation and constraints
- Real-world case studies: cleaning sales and sensor data

---

## 🔹 9.1 Detecting Data Type Issues and Conversion

Real-world data often comes with **mixed types**, such as numbers stored as strings or inconsistent formats.

In [None]:
import pandas as pd
import numpy as np

# Simulated dataset with mixed data types
df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'price': ['100', '200', 'N/A', '350', 'Two Hundred'],
    'quantity': [10, 5, np.nan, 8, 3]
})
df

### Cleaning and Converting to Numeric Types

In [None]:
# Coerce invalid entries to NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Fill NaN with median value
df['price'] = df['price'].fillna(df['price'].median())
df.info()
df
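
Coercion silently turns every unparseable entry into `NaN`, so it can help to list the offending values before overwriting the column. A minimal sketch, assuming the same `price` values as above; the names `raw`, `parsed`, and `bad_mask` are illustrative only:

```python
import pandas as pd

raw = pd.Series(['100', '200', 'N/A', '350', 'Two Hundred'], name='price')

# Parse once; anything that cannot be parsed becomes NaN
parsed = pd.to_numeric(raw, errors='coerce')

# Entries that were present in the raw data but failed to parse
bad_mask = parsed.isna() & raw.notna()
bad_values = raw[bad_mask].tolist()
print(bad_values)
```

Reviewing `bad_values` first makes it easier to decide whether an entry such as `'Two Hundred'` should be mapped by hand rather than imputed.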

## 🔹 9.2 Detecting Missing Patterns

Identifying missing values is crucial for choosing an appropriate imputation or dropping strategy.

In [None]:
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [np.nan, 5, 6, np.nan, 8],
    'C': ['x', 'y', 'z', 'y', np.nan]
})

# Boolean mask of missing cells (True = missing)
print(df_missing.isnull())
print('\nTotal missing per column:')
print(df_missing.isnull().sum())
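
Per-column counts do not show whether values tend to be missing together. The following sketch, which rebuilds the same `df_missing` frame so that it runs on its own, adds the share of missing values per column, the count per row, and the frequency of each missingness combination (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [np.nan, 5, 6, np.nan, 8],
    'C': ['x', 'y', 'z', 'y', np.nan],
})

mask = df_missing.isna()

# Share of missing values per column
missing_share = mask.mean()

# Number of missing cells in each row
missing_per_row = mask.sum(axis=1)

# How often each combination of missing columns occurs
pattern_counts = mask.value_counts()

print(missing_share)
print(missing_per_row)
print(pattern_counts)
```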

### Fill Patterns Strategically

In [None]:
# Forward fill for time-series-like data (leading NaNs stay NaN)
df_missing_ffill = df_missing.ffill()

# Mean imputation for a numeric column
df_missing['A'] = df_missing['A'].fillna(df_missing['A'].mean())
df_missing
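
Different column types usually call for different fills. As a hedged sketch (not the only reasonable strategy), the example below flags which values were originally missing, fills a numeric column with its median, and fills a categorical column with its most frequent value. The column `A_was_missing` is a hypothetical name introduced here:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'C': ['x', 'y', 'z', 'y', np.nan],
})

# Record which values were originally missing before imputing
df_missing['A_was_missing'] = df_missing['A'].isna()

# Numeric column: the median is less sensitive to extreme values than the mean
df_missing['A'] = df_missing['A'].fillna(df_missing['A'].median())

# Categorical column: fill with the most frequent value
most_common = df_missing['C'].mode().iloc[0]
df_missing['C'] = df_missing['C'].fillna(most_common)

print(df_missing)
```

Keeping the indicator column lets later analysis check whether imputed rows behave differently from observed ones.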

## 🔹 9.3 Outlier Detection with IQR and Z-Score

Outliers can distort your analysis or models. Pandas works well with statistical techniques like **IQR** and **Z-score** for detection.

In [None]:
np.random.seed(42)
sales = pd.DataFrame({'revenue': np.random.normal(1000, 100, 100)})
# Add some extreme outliers
sales.loc[[10, 20, 50], 'revenue'] = [3000, 50, 5000]

# IQR Method
Q1 = sales['revenue'].quantile(0.25)
Q3 = sales['revenue'].quantile(0.75)
IQR = Q3 - Q1

outliers = sales[(sales['revenue'] < (Q1 - 1.5 * IQR)) | (sales['revenue'] > (Q3 + 1.5 * IQR))]
print('Detected outliers:')
display(outliers)

### Handling Outliers

In [None]:
# Cap outliers at the IQR fences instead of dropping them
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
sales['revenue_capped'] = sales['revenue'].clip(lower=lower, upper=upper)

sales[['revenue', 'revenue_capped']].head(15)
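
The Z-score approach named in this section's title can be sketched on the same simulated data. Because the mean and standard deviation are themselves pulled by extreme values, a plain Z-score may miss some points, so the sketch also computes a modified Z-score based on the median and the median absolute deviation (MAD). The constant 0.6745 and the 3.5 cutoff are a common convention, not a requirement:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
sales = pd.DataFrame({'revenue': np.random.normal(1000, 100, 100)})
sales.loc[[10, 20, 50], 'revenue'] = [3000, 50, 5000]

rev = sales['revenue']

# Classic z-score: distance from the mean in units of standard deviation
z = (rev - rev.mean()) / rev.std()
z_outliers = sales[z.abs() > 3]

# Modified z-score: same idea, but built on the median and the
# median absolute deviation (MAD), which extreme values barely move
median = rev.median()
mad = (rev - median).abs().median()
modified_z = 0.6745 * (rev - median) / mad
robust_outliers = sales[modified_z.abs() > 3.5]

print(z_outliers)
print(robust_outliers)
```

Comparing the two results shows how much the injected extremes inflate the standard deviation.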

## 🔹 9.4 Validating Data Integrity with Rules

Ensure the dataset follows logical and domain-specific rules.

In [None]:
products = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D'],
    'price': [50, -10, 30, 0],
    'quantity': [10, 5, 0, -3]
})

# Rule 1: Price and quantity should be positive
invalid = products[(products['price'] <= 0) | (products['quantity'] <= 0)]
print('Invalid entries:')
display(invalid)
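
When there are several rules, it can be convenient to name each one and report violations per rule. A minimal sketch under that assumption; the rule names and the uniqueness check are illustrative additions, not part of the example above:

```python
import pandas as pd

products = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D'],
    'price': [50, -10, 30, 0],
    'quantity': [10, 5, 0, -3],
})

# Each rule maps a name to a boolean Series that is True where the row passes
rules = {
    'price_positive': products['price'] > 0,
    'quantity_positive': products['quantity'] > 0,
    'product_id_unique': ~products['product_id'].duplicated(keep=False),
}

checks = pd.DataFrame(rules)

# Number of failing rows per rule
violations = (~checks).sum()

# Rows that fail at least one rule
invalid_rows = products[~checks.all(axis=1)]

print(violations)
print(invalid_rows)
```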

## ⚙️ Under the Hood

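- Pandas does not use NumPy masked arrays for missing data. Classic dtypes rely on **sentinel values** (`NaN` for floats, `NaT` for datetimes), while nullable extension dtypes such as `Int64`, `Float64`, and `boolean` store a separate boolean mask and expose gaps as `pd.NA`.
- `pd.to_numeric(errors='coerce')` does its parsing in compiled (Cython) code, so unparseable values become `NaN` without a Python-level loop.
- Boolean indexing and `np.where()` are **vectorized operations**, minimizing Python loops.
- IQR-based filtering needs only two `quantile()` calls (linear interpolation by default) followed by a vectorized comparison.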
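
The difference between the two missing-value representations can be seen directly. A short sketch, using only standard pandas dtypes:

```python
import pandas as pd

# Default behaviour: integers with a missing value are upcast to float64,
# and the gap is stored as NaN
floats = pd.Series([1, None, 3])
print(floats.dtype)

# Nullable extension dtype: values stay integers and the gap is pd.NA
nullable = pd.Series([1, None, 3], dtype='Int64')
print(nullable.dtype)

# Both report the same missingness through isna()
print(floats.isna().tolist(), nullable.isna().tolist())
```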

---

## 💼 Real-World Problem 1: Customer Transactions Cleanup

**Scenario:** You receive a messy CSV file of customer transactions. Some records have missing values, typos in prices, and duplicate rows.

**Goal:**
1. Identify and remove duplicates.
2. Convert `price` to numeric.
3. Drop rows with invalid or incomplete data.
4. Summarize total revenue by customer.

In [None]:
data = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Bob'],
    'price': ['200', '100', 'Two Hundred', '150', None, '100'],
    'quantity': [2, 1, 2, 1, 1, 1]
})

# Clean duplicates
data = data.drop_duplicates()

# Convert and clean
data['price'] = pd.to_numeric(data['price'], errors='coerce')
data = data.dropna()
data['total'] = data['price'] * data['quantity']

# Revenue summary
revenue_summary = data.groupby('customer')['total'].sum()
revenue_summary
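
The first goal asks to identify duplicates, not only remove them. One way to inspect what would be discarded before cleaning is sketched below, on the same data. The names `duplicate_rows` and `rejected` are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Bob'],
    'price': ['200', '100', 'Two Hundred', '150', None, '100'],
    'quantity': [2, 1, 2, 1, 1, 1],
})

# keep=False marks every member of a duplicate group, not just the repeats
dup_mask = data.duplicated(keep=False)
duplicate_rows = data[dup_mask]
print(duplicate_rows)

# Rows that would be lost to conversion or missing prices
price_num = pd.to_numeric(data['price'], errors='coerce')
rejected = data[price_num.isna()]
print(rejected)
```

In real transaction data, identical rows can be legitimate repeat purchases, so it is worth checking `duplicate_rows` against a transaction ID or timestamp when one is available.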

## 🌍 Real-World Problem 2: IoT Sensor Outlier Filtering

**Scenario:** IoT sensors sometimes send corrupted readings. You need to identify anomalous readings and either replace or cap them.

**Goal:** Use statistical techniques to detect and fix faulty temperature readings.

In [None]:
# Simulate temperature readings
temp = pd.DataFrame({'reading': np.append(np.random.normal(25, 2, 100), [100, -10, 60])})

# Detect outliers using Z-score
mean = temp['reading'].mean()
std = temp['reading'].std()
temp['zscore'] = (temp['reading'] - mean) / std

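# Replace readings more than 3 standard deviations from the mean with the mean
temp['cleaned'] = np.where(temp['zscore'].abs() > 3, mean, temp['reading'])
# The injected anomalies sit at the end of the frame
temp.tail(5)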
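
The cell above replaces flagged readings with the mean. If capping is preferred, one option is to clip readings to the IQR fences from Section 9.3, which are less affected by the anomalies than the mean and standard deviation. A sketch under that assumption; the seeded generator is added here only to make the example reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
temp = pd.DataFrame({
    'reading': np.append(rng.normal(25, 2, 100), [100, -10, 60])
})

q1 = temp['reading'].quantile(0.25)
q3 = temp['reading'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() pulls values outside [lower, upper] back to the nearest bound
temp['capped'] = temp['reading'].clip(lower=lower, upper=upper)

print(temp.tail(3))
```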

## ✅ Best Practices / Pitfalls

- ✅ Always inspect data types with `df.info()` before cleaning.
- ✅ Use `errors='coerce'` to safely convert invalid numeric data.
- ✅ Use **IQR** or **Z-score** methods to handle outliers instead of naive trimming.
- ⚠️ Don't over-impute missing data; it can introduce bias.
- ⚙️ Use `DataFrame.query()` for cleaner conditional validation.
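
As a brief illustration of the last point, the validation rule from Section 9.4 can be written with `DataFrame.query()`. This is a sketch on the same `products` data:

```python
import pandas as pd

products = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D'],
    'price': [50, -10, 30, 0],
    'quantity': [10, 5, 0, -3],
})

# Same rule as the boolean mask in 9.4, written as a query string
invalid = products.query('price <= 0 or quantity <= 0')

# The complementary condition selects the rows that pass
valid = products.query('price > 0 and quantity > 0')

print(invalid)
print(valid)
```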

---

## 💪 Challenge Exercise

**Task:** You're analyzing product reviews. Some entries have missing text, negative ratings, or duplicates.
1. Drop duplicates.
2. Ensure ratings are within [1, 5].
3. Fill missing review text with 'No Review'.
4. Compute average rating per product.

_(Try to implement this step-by-step in your own notebook!)_

---
# --- End of Section 9: Continue to Section 10 ---