
# Chapter 3. Data Analysis Guide
# Index
- [1. Data Cleaning and Handling Missing Values](#data-clean)
- [2. Outlier Detection Techniques](#outlier)
- [3. Scaling and Normalization](#scaling)
- [4. Encoding Categorical Variables](#encoding)
- [5. Feature Engineering](#feature-eng)
- [6. Filter Methods for Feature Selection](#feature-selection)
- [7. Recursive Feature Elimination (RFE)](#rfe)
- [8. Multicollinearity Detection and Handling](#multicol)
- [9. Time-Series Feature Engineering](#ts-fe)
- [10. Model Monitoring](#model-moni)
- [11. Concept Drift Detection](#concept-drift)
- [12. Data Drift Detection](#data-drift)

## 1. Data Cleaning and Handling Missing Values  <a name="data-clean"></a>

Handling missing data is essential to ensure reliable model performance.

### Numeric Missing Value Processing:
- **Mean/Median Imputation:** Replace missing values with the mean or median of the column.
- **Interpolation:** Fill missing values using interpolation (linear, time-based, etc.).
- **Model-based Imputation:** Use predictive models like KNN or regression to estimate missing values.
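
As a minimal sketch of the three numeric strategies above, the snippet below fills gaps in a small made-up table. The column names (`age`, `income`, `tenure`) and the choice of `n_neighbors=2` are illustrative assumptions, not values from any dataset in this guide.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with gaps; all column names are hypothetical
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0, 45.0],
    "income": [30.0, 40.0, np.nan, 60.0, 70.0],
    "tenure": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Median imputation: each NaN becomes its column's median
median_filled = df.fillna(df.median())

# Linear interpolation: each NaN is filled from its neighbours in row order
interpolated = df.interpolate(method="linear")

# Model-based imputation: each NaN is estimated from the 2 most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
```

Interpolation only makes sense when row order is meaningful (for example, a time index). For the median and KNN approaches, it is usually safer to fit the imputer on training data only and reuse it on validation and test data.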

### Categorical Missing Value Processing:
- **Mode Imputation:** Replace missing values with the most frequent value.
- **Create 'Unknown' Category:** Assign a placeholder category like `"Unknown"` or `"Missing"`.
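
The two categorical strategies above can be sketched in a few lines of pandas. The series name `color` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy categorical column with gaps; the name "color" is hypothetical
colors = pd.Series(["red", "blue", np.nan, "red", np.nan], name="color")

# Mode imputation: fill with the most frequent observed value
mode_filled = colors.fillna(colors.mode().iloc[0])

# Placeholder category: keep "missingness" as its own level
unknown_filled = colors.fillna("Unknown")
```

The placeholder approach tends to be the better choice when the fact that a value is missing may itself carry information.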

---

## 2. Outlier Detection Techniques <a name="outlier"></a>

Outliers can distort statistical analyses and model performance.

- **Z-Score Method:**
  $$ z = \frac{x - \mu}{\sigma} $$
  Values with $|z| > 3$ are typically considered outliers.

- **IQR Method:**
  - Compute Q1 (25th percentile) and Q3 (75th percentile)
  - IQR = Q3 - Q1
  - A value $x$ is an outlier if $x < Q_1 - 1.5 \times \text{IQR}$ or $x > Q_3 + 1.5 \times \text{IQR}$

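
Both univariate rules above can be sketched with NumPy alone. The sample values are made up, with one deliberately extreme point at the end.

```python
import numpy as np

# Toy sample with one obviously extreme value (illustrative data)
x = np.array([10, 11, 12, 13, 12, 11, 10, 12, 13, 11,
              12, 10, 11, 12, 13, 12, 11, 12, 11, 95], dtype=float)

# Z-score rule: flag values more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

Because the mean and standard deviation are themselves pulled by extreme values, the z-score rule can miss outliers in very small samples; the IQR rule is based on quartiles and is less affected.
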
### 2.2 Multidimensional Outlier Detection

Detecting outliers in **multidimensional data** requires methods that account for the relationships between features.


### 2.2.1 Mahalanobis Distance

The Mahalanobis distance measures the distance of a point from the center of a multivariate distribution, accounting for feature correlations.

### Formula:

$$
D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
$$

- $\mu$: Mean vector of the dataset  
- $\Sigma$: Covariance matrix  
- $D_M(x)$: Mahalanobis distance

### Python Implementation:





In [None]:
import numpy as np
from scipy.stats import chi2

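def mahalanobis_outliers(X, threshold=0.99):
    """Flag rows whose squared Mahalanobis distance exceeds the chi-square
    quantile given by `threshold` (a probability, e.g. 0.99)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse tolerates a singular (rank-deficient) covariance matrix
    inv_cov = np.linalg.pinv(cov)
    diff = X - mean
    # Squared distance per row: (x - mu)^T Sigma^{-1} (x - mu)
    md_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

    # Under multivariate normality, D_M^2 follows a chi-square distribution
    # with degrees of freedom equal to the number of features
    chi2_cutoff = chi2.ppf(threshold, df=X.shape[1])
    return md_sq > chi2_cutoff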

### 2.2.2 Isolation Forest
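An unsupervised ensemble of random trees that isolates points through recursive random splits. Anomalies tend to be isolated in fewer splits than normal points, so a short average path length signals an outlier.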

Python Implementation:

In [None]:
from sklearn.ensemble import IsolationForest

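# X is assumed to be an array of shape (n_samples, n_features) defined earlier.
# contamination is the expected fraction of outliers; 0.05 is an illustrative choice.
clf = IsolationForest(contamination=0.05, random_state=42)
y_pred = clf.fit_predict(X)  # -1 = outlier, 1 = inlier

outliers = y_pred == -1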


### 2.2.3 Local Outlier Factor (LOF)
Compares the local density around a data point with the density around its nearest neighbors. Points that sit in a much sparser region than their neighbors are flagged as outliers.



In [None]:
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)  # -1 = outlier

outliers = y_pred == -1


### 2.2.4 DBSCAN (Density-Based Clustering)

A density-based clustering algorithm that classifies points as core, border, or noise. Points labeled as noise (cluster label `-1`) are treated as outliers.

In [None]:
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
outliers = db.labels_ == -1  # -1 = noise

### Method Comparison Table

| Method               | Best For                                 | Advantages                         | Limitations                          |
|----------------------|-------------------------------------------|-------------------------------------|---------------------------------------|
| **Mahalanobis Distance** | Multivariate Gaussian data              | Accounts for feature correlations  | Assumes normal distribution          |
| **Isolation Forest** | High-dimensional, general-purpose data   | Fast, scalable, works on any shape | May miss local outliers              |
| **Local Outlier Factor (LOF)** | Local density variations        | Detects local anomalies            | Sensitive to choice of `n_neighbors` |
| **DBSCAN**           | Clustered data with noise points          | Finds arbitrarily shaped clusters  | Requires tuning `eps`, `min_samples` |



---

## 3. Scaling and Normalization <a name="scaling"></a>

Rescale features to bring them to a similar range or distribution.

### Min-Max Scaling
Scales data to a fixed range, usually [0, 1]:
$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

### Standardization (Z-score normalization)
Centers data around 0 with standard deviation of 1:
$$ x' = \frac{x - \mu}{\sigma} $$

### Normalization
Scales each sample (row) vector so that its norm (e.g., the L2 norm) equals 1:
$$ x' = \frac{x}{\|x\|_2} $$
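
The three transformations above map onto scikit-learn's `MinMaxScaler`, `StandardScaler`, and `Normalizer`. The sketch below applies them to a small made-up matrix; note that the first two work column by column, while `Normalizer` works row by row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Toy feature matrix: rows are samples, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)        # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X)         # each column: mean 0, std 1
X_unit = Normalizer(norm="l2").fit_transform(X)   # each row: L2 norm 1
```

In a modeling workflow, the scaler is typically fit on the training split only and then applied to the test split with `transform`, so that test statistics do not leak into training.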

---

## 4. Encoding Categorical Variables <a name="encoding"></a>

Convert non-numeric labels into numeric formats.

### One-Hot Encoding
Creates binary columns for each category.
- Category: Red, Blue, Green → Columns: Red(0/1), Blue(0/1), Green(0/1)

### Label Encoding
Assigns a unique integer to each category. The integers imply an order, so this is best suited to ordinal categories or to tree-based models.
- Red → 0, Blue → 1, Green → 2

### Binary Encoding
Converts categories to binary digits and splits them into separate columns.
- Category → Integer → Binary → Split into bits
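
A minimal pandas sketch of the three encodings follows. The `color` series is made up, the binary encoding is written by hand purely for illustration, and the `color_bit*` column names are hypothetical.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Blue"], name="color")

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(colors, dtype=int)

# Label encoding: integers assigned in order of first appearance
codes, categories = pd.factorize(colors)

# Binary encoding: write each integer code in base 2, one column per bit
n_bits = max(1, (len(categories) - 1).bit_length())
binary = pd.DataFrame(
    {f"color_bit{b}": (codes >> b) & 1 for b in reversed(range(n_bits))}
)
```

With three categories, one-hot encoding needs three columns while binary encoding needs only two, which is why binary encoding is sometimes preferred for high-cardinality features.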

---

## 5. Feature Engineering <a name="feature-eng"></a>

Creating new features from existing data to improve model performance.

Examples:
- Extracting year, month, or day from a date column.
- Creating ratio or interaction features (e.g., `price_per_sqft = price / area`)
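
Both examples above can be sketched with pandas. The housing table and its column names (`sale_date`, `price`, `area`) are made up for illustration.

```python
import pandas as pd

# Toy housing data; all column names are hypothetical
df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2023-01-15", "2023-06-30", "2024-12-01"]),
    "price": [300000.0, 450000.0, 500000.0],
    "area": [1500.0, 1800.0, 2500.0],
})

# Date parts extracted from the datetime column
df["sale_year"] = df["sale_date"].dt.year
df["sale_month"] = df["sale_date"].dt.month
df["sale_day"] = df["sale_date"].dt.day

# Ratio feature combining two existing columns
df["price_per_sqft"] = df["price"] / df["area"]
```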

---

## 6. Filter Methods for Feature Selection <a name="feature-selection"></a>

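Score each feature against the target with a statistical test, independently of any model, and keep the top-scoring features:
- **Chi-Square Test:** For categorical (or other non-negative) features with a categorical target
- **ANOVA F-test:** For numeric features with a categorical target
- **Mutual Information:** For either feature type; can capture non-linear dependence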
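
As a sketch of how these tests are typically applied, the snippet below scores the features of scikit-learn's bundled iris dataset with each method. The dataset and `k=2` are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif
)

X, y = load_iris(return_X_y=True)  # 4 non-negative numeric features, 3 classes

# chi2 requires non-negative features (counts, frequencies, one-hot columns)
chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# ANOVA F-test: compares each feature's mean across the classes
f_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Mutual information: one non-negative score per feature
mi_scores = mutual_info_classif(X, y, random_state=0)

# Keep only the two features with the highest F-scores
X_top2 = f_selector.transform(X)
```

`get_support()` on a fitted selector returns a boolean mask showing which columns were kept.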

---

## 7. Recursive Feature Elimination (RFE) <a name="rfe"></a>

A wrapper method that repeatedly fits a model and removes the least important features, as ranked by the model's coefficients or feature importances.

Steps:
1. Train model
2. Remove least important feature(s)
3. Repeat until desired number of features is reached
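
The loop above is what scikit-learn's `RFE` class automates. The sketch below runs it on the bundled iris dataset; the logistic regression estimator and `n_features_to_select=2` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The estimator must expose coef_ or feature_importances_ so features can be ranked
estimator = LogisticRegression(max_iter=1000)

# step=1 removes one feature per iteration until 2 remain
rfe = RFE(estimator=estimator, n_features_to_select=2, step=1).fit(X, y)

selected_mask = rfe.support_   # True for the features that were kept
ranking = rfe.ranking_         # 1 = kept; larger values were eliminated earlier
```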

---

## 8. Multicollinearity Detection and Handling <a name="multicol"></a>

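Highly correlated features make coefficient estimates unstable and hard to interpret.

- **Correlation Matrix:** Identify pairs of features with high absolute correlation (e.g., $|r| > 0.8$)
- **Variance Inflation Factor (VIF):**
  $$ VIF_j = \frac{1}{1 - R_j^2} $$
  where $R_j^2$ is the $R^2$ obtained by regressing feature $j$ on all the other features. A VIF above 5 (or, more leniently, 10) is a common rule of thumb for problematic multicollinearity.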

Handling Techniques:
- Drop one of the correlated features
- Use PCA or regularization (Ridge)
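
To make the VIF formula concrete, here is a minimal NumPy sketch that regresses each column on the others and applies $1 / (1 - R^2)$. The helper name `vif` and the synthetic data are assumptions for illustration; it is not a replacement for a library routine such as statsmodels' `variance_inflation_factor`.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the other columns plus an intercept
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic example: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
```

In this setup the VIFs of `a` and `c` come out far above the usual thresholds, while the VIF of `b` stays close to 1, so dropping either `a` or `c` would resolve the problem.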

---

## 9. Time-Series Feature Engineering <a name="ts-fe"></a>

Special techniques for time-dependent data:
- **Lag Features:** Previous time steps as new features (e.g., `sales_{t-1}`)
- **Rolling Statistics:** Moving averages, standard deviations
- **Datetime Extraction:** Day of week, month, holiday indicator, etc.
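
A minimal pandas sketch of these features on a made-up daily series follows. The column names and the 3-day window are illustrative assumptions; a holiday indicator is omitted because it needs an external calendar.

```python
import pandas as pd

# Toy daily sales series; the column names are hypothetical
df = pd.DataFrame(
    {"sales": [10.0, 12.0, 9.0, 14.0, 15.0, 11.0, 13.0]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Lag feature: yesterday's sales (the first row has no history, so it is NaN)
df["sales_lag_1"] = df["sales"].shift(1)

# Rolling statistics over the previous 3 days;
# shift(1) keeps today's value out of its own feature
df["sales_roll_mean_3"] = df["sales"].shift(1).rolling(window=3).mean()
df["sales_roll_std_3"] = df["sales"].shift(1).rolling(window=3).std()

# Datetime extraction from the index
df["day_of_week"] = df.index.dayofweek          # Monday = 0
df["month"] = df.index.month
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
```

Shifting before rolling matters when the feature is used to predict the same day's value: without it, the target would leak into its own predictor.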

---

## 10. Model Monitoring <a name="model-moni"></a>

Track how a deployed model performs in production.

Metrics to monitor:
- Accuracy, precision, recall, F1
- Prediction latency
- Drift in data or concept
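
One way to collect the first two groups of metrics is to score each labeled batch as it arrives and time the prediction call. The sketch below assumes a binary classifier; the helper name `monitor_batch` is hypothetical, and the synthetic data stands in for production traffic.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def monitor_batch(model, X_batch, y_true):
    """Score one labeled batch and time the prediction call."""
    start = time.perf_counter()
    y_pred = model.predict(X_batch)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "latency_ms": latency_ms,
    }

# Stand-in for a deployed model and a batch of production traffic
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
report = monitor_batch(model, X[400:], y[400:])
```

In practice, reports like this would be logged per batch or per time window so that trends can be plotted and alerted on.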

---

## 11. Concept Drift Detection <a name="concept-drift"></a>

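Concept drift occurs when the relationship between features and target changes over time.

Detection Methods:
- Monitor for a sustained drop in accuracy (or another performance metric) on recent labeled data
- Use drift detectors (e.g., DDM, ADWIN)

Mitigation:
- Retrain regularly, or whenever drift is detected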
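
To show the idea behind a detector such as DDM, here is a simplified, self-contained sketch. It tracks the running error rate $p$ and its standard deviation $s = \sqrt{p(1-p)/n}$, remembers the lowest $p + s$ seen so far, and signals when the current $p + s$ rises well above that minimum. The class name `SimpleDDM`, the default thresholds, and the synthetic error stream are assumptions for illustration; a production setup would more likely rely on a maintained streaming library.

```python
import math

class SimpleDDM:
    """Simplified Drift Detection Method over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # its standard deviation

    def update(self, error):
        """error: 1 if the model's prediction was wrong, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"

# Synthetic stream: 10% errors for 300 steps, then 50% errors
stream = [1 if i % 10 == 0 else 0 for i in range(300)] + [i % 2 for i in range(300)]
detector = SimpleDDM()
first_drift = next(
    (i for i, e in enumerate(stream) if detector.update(e) == "drift"), None
)
```

On this stream the detector stays quiet during the first phase and signals drift shortly after the error rate jumps at step 300. Note that this approach needs true labels to arrive soon after each prediction.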

---

## 12. Data Drift Detection <a name="data-drift"></a>

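Data drift occurs when the distribution of the input features changes, even if the relationship between features and target does not.

Detection Methods:
- Compare statistical properties of a reference sample and a recent sample (e.g., Kolmogorov-Smirnov (KS) test, Population Stability Index (PSI))
- Monitor feature distributions over time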
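
Both checks can be sketched for a single numeric feature. The KS test comes from SciPy's `ks_2samp`; the `psi` helper is a hand-written assumption that bins by reference quantiles, and the normal samples are synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    # Interior bin edges come from the reference sample's quantiles
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_counts = np.bincount(np.searchsorted(inner, expected), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(inner, actual), minlength=n_bins)
    # Clip fractions away from zero to avoid log(0) and division by zero
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # e.g., training-time feature values
same = rng.normal(0.0, 1.0, size=5000)        # recent data, no drift
shifted = rng.normal(0.5, 1.0, size=5000)     # recent data, mean shifted by 0.5

ks_same = ks_2samp(reference, same)
ks_shifted = ks_2samp(reference, shifted)
psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
```

A small KS p-value suggests the two samples come from different distributions. For PSI, a commonly quoted rule of thumb treats values below 0.1 as stable and values above roughly 0.25 as a major shift, though these cutoffs are conventions rather than formal tests.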
