In [6]:
import torch

# Creating a PyTorch tensor with missing data (NaN) for testing
missing_data_tensor = torch.tensor([
    [1.0, 2.0, float('nan')],
    [4.0, float('nan'), 6.0],
    [float('nan'), 8.0, 9.0],
    [10.0, 11.0, 12.0],
    [float('nan'), float('nan'), float('nan')]
])

print(missing_data_tensor)


tensor([[ 1.,  2., nan],
        [ 4., nan,  6.],
        [nan,  8.,  9.],
        [10., 11., 12.],
        [nan, nan, nan]])


# 1. Ignoring or Dropping Missing Data
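
A minimal sketch of dropping missing data, applied to the test tensor defined above. It assumes rows are observations and columns are features. The names `nan_mask` and `max_missing_fraction`, and the 0.5 threshold, are illustrative choices rather than anything fixed by PyTorch.

```python
import torch

# Same test tensor as in the cell above.
missing_data_tensor = torch.tensor([
    [1.0, 2.0, float('nan')],
    [4.0, float('nan'), 6.0],
    [float('nan'), 8.0, 9.0],
    [10.0, 11.0, 12.0],
    [float('nan'), float('nan'), float('nan')],
])

nan_mask = torch.isnan(missing_data_tensor)  # True where a value is missing

# Listwise deletion: keep only rows that contain no missing value.
complete_rows = missing_data_tensor[~nan_mask.any(dim=1)]

# Softer variant: drop only rows in which every value is missing.
non_empty_rows = missing_data_tensor[~nan_mask.all(dim=1)]

# Column-wise deletion: drop columns whose missing fraction exceeds a threshold.
max_missing_fraction = 0.5  # hypothetical threshold, chosen for illustration
missing_fraction = nan_mask.float().mean(dim=0)
kept_columns = missing_data_tensor[:, missing_fraction <= max_missing_fraction]

print(complete_rows)
print(non_empty_rows.shape)
print(missing_fraction)
```

On this tensor only the row `[10., 11., 12.]` survives listwise deletion. Every column is 40% missing, so the 0.5 threshold drops none of them.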

# Dealing with Missing Data

Handling missing data is a critical step in data preprocessing and analysis. The table below summarizes common methods, grouped by approach:

| **Category** | **Method** | **Description** | **Pros** | **Cons** |
|---|---|---|---|---|
| **1. Ignoring or Dropping** | **Listwise Deletion** | Remove rows where any value is missing. | Simple to implement. | Loss of data. |
| | **Column-Wise Deletion** | Drop columns with a high proportion of missing values. | Useful for irrelevant columns. | Loss of features. |
| **2. Statistical Imputation** | **Mean/Median/Mode Imputation** | Replace missing values with the mean, median, or mode of the column. | Quick and easy. | Reduces variability in data. |
| | **Forward/Backward Fill** | Fill missing values using the previous or next observation. | Useful in time series data. | Assumes data continuity. |
| | **Random Imputation** | Replace missing values with values drawn at random from the observed values in the column. | Preserves the column's distribution. | Results vary between runs. |
| **3. Predictive Imputation** | **Regression Imputation** | Use regression models to predict missing values based on other variables. | Often more accurate than simple statistical imputation. | Computationally expensive. |
| | **k-Nearest Neighbors (KNN)** | Replace missing values with the average or mode of the k nearest neighbors. | Captures variable relationships. | Computationally intensive for large datasets. |
| | **Multivariate Imputation by Chained Equations (MICE)** | Impute data by iteratively modeling each variable based on the others. | Effective for complex datasets. | Computationally expensive. |
| **4. Advanced Techniques** | **Deep Learning or ML Models** | Use models like random forests or neural networks to predict missing values. | Captures complex patterns. | Requires significant resources. |
| | **Multiple Imputation** | Generate multiple imputed datasets, analyze each, and combine the results. | Captures uncertainty in imputations. | More complex process. |
| | **Domain-Specific Imputation** | Use heuristics or domain rules to fill missing values. | Leverages domain knowledge. | Requires domain expertise. |
| **5. Transformation** | **Create Missing Indicator** | Add a binary variable indicating missing data. | Preserves missingness information. | Adds dimensionality. |
| | **Replace with Constant** | Replace missing values with a constant (e.g., `0`, `-1`). | Simple to implement. | May distort interpretation. |
| | **Binning** | Transform continuous variables into categorical bins, with missing values assigned their own bin. | Keeps rows with missing values usable. | Loss of granularity. |
| **6. Data Integration** | **External Data Sources** | Use additional data to infer missing values. | Can improve accuracy. | Limited by availability of external data. |
| | **Data Augmentation** | Generate synthetic data points to supplement missing data. | Can enhance datasets. | May introduce noise or bias. |
| **7. Model-Specific Handling** | **Model Tolerance** | Use algorithms that handle missing data natively (e.g., XGBoost, LightGBM). | No imputation step required. | Limited to specific algorithms. |
| | **Use Data Directly** | Use models capable of handling missing values without transformation. | Simplifies workflow. | May not work for all models. |
| **8. Analysis Adaptation** | **Sensitivity Analysis** | Test how different imputation methods affect results. | Shows whether conclusions are robust to the imputation choice. | Requires additional analysis. |
| | **Weighting or Reweighting** | Reweight the observed cases, based on known distributions, to compensate for missing observations. | Can reduce bias from dropped observations. | Requires distributional assumptions. |
| **9. Do Nothing** | **Leave Missing Values Unchanged** | Suitable for models that work with missing data directly (e.g., some deep learning models with explicit masking). | Simplifies process. | May reduce accuracy for some methods; unmasked NaN values propagate through most tensor operations. |
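
Three of the simpler rows in the table (mean imputation, constant replacement, and a missing indicator) can be sketched in PyTorch as follows. This is an illustrative sketch on the test tensor from earlier, assuming columns are features. It relies on `torch.nanmean` and `torch.nan_to_num`, which are available in recent PyTorch releases; the variable names are hypothetical.

```python
import torch

x = torch.tensor([
    [1.0, 2.0, float('nan')],
    [4.0, float('nan'), 6.0],
    [float('nan'), 8.0, 9.0],
    [10.0, 11.0, 12.0],
    [float('nan'), float('nan'), float('nan')],
])

nan_mask = torch.isnan(x)

# Missing indicator: 1.0 where the original value was missing.
missing_indicator = nan_mask.float()

# Mean imputation: per-column mean over the observed (non-NaN) values,
# broadcast into the missing positions.
column_means = torch.nanmean(x, dim=0)
mean_imputed = torch.where(nan_mask, column_means, x)

# Constant replacement: every NaN becomes 0.0.
zero_filled = torch.nan_to_num(x, nan=0.0)

# Keep the missingness information next to the imputed values.
features = torch.cat([mean_imputed, missing_indicator], dim=1)

print(column_means)    # tensor([5., 7., 9.])
print(mean_imputed)
print(features.shape)  # torch.Size([5, 6])
```

A column with no observed values at all would have a NaN mean, so it would need to be dropped or filled with a constant first.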

---

## Choosing the Right Method

The choice depends on:
- **Data Size:** Large datasets tolerate listwise deletion better than small ones.
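- **Data Type:** Numerical or categorical data. Mean and median apply only to numerical columns; mode also works for categorical ones.
- **Missingness Pattern:** Missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).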
- **Domain Knowledge:** Contextual understanding of the problem.
- **Analysis Goal:** Predictive accuracy, interpretability, or exploratory insights.
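
Before choosing, it can help to quantify how much is actually missing. The helper below is a hypothetical sketch (`missingness_summary` is not a PyTorch function) that reports how many rows would survive listwise deletion and how much of each column is missing. It says nothing about whether the pattern is MCAR, MAR, or MNAR.

```python
import torch

def missingness_summary(x: torch.Tensor) -> dict:
    """Summarize how much of a 2-D float tensor is missing (NaN)."""
    nan_mask = torch.isnan(x)
    return {
        "n_rows": x.shape[0],
        # Share of rows that listwise deletion would keep.
        "complete_row_fraction": (~nan_mask.any(dim=1)).float().mean().item(),
        # Share of missing entries in each column.
        "missing_fraction_per_column": nan_mask.float().mean(dim=0).tolist(),
    }

x = torch.tensor([
    [1.0, 2.0, float('nan')],
    [4.0, float('nan'), 6.0],
    [float('nan'), 8.0, 9.0],
    [10.0, 11.0, 12.0],
    [float('nan'), float('nan'), float('nan')],
])

summary = missingness_summary(x)
print(summary)
```

For the test tensor only one row in five is complete, so listwise deletion would discard 80% of the rows, which is a strong argument for imputation on a dataset this small.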


# Outlier Detection and Handling Methods

## Outlier Detection Methods

| **Method**                       | **Description**                                                                 | **Used For**                                  |
|-----------------------------------|---------------------------------------------------------------------------------|-----------------------------------------------|
| **Z-Score / Standard Score**      | Measures how many standard deviations a data point is from the mean.            | Identifying outliers in normal distributions. |
| **Modified Z-Score**              | Robust version using median and MAD instead of mean and standard deviation.     | Outliers in non-normally distributed data.    |
| **IQR (Interquartile Range)**     | Flags points that fall more than 1.5 × IQR below the first quartile or above the third quartile. | Numerical data.                              |
| **Euclidean Distance**            | Measures distance between data points; far distances suggest outliers.          | General use, often in clustering.            |
| **k-NN (k-Nearest Neighbors)**    | Detects outliers by checking the distance to the k-th nearest neighbor.         | High-dimensional datasets.                   |
| **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)** | Identifies points with low density as outliers.                                | Clustering-based outlier detection.          |
| **Isolation Forest**              | Uses random trees to isolate points. The fewer splits required to isolate a point, the more likely it is an outlier. | High-dimensional data.                       |
| **One-Class SVM**                 | Identifies outliers by fitting a decision boundary around the majority of data points. | Unsupervised learning for anomaly detection. |
| **LOF (Local Outlier Factor)**    | Measures density deviation of a point with respect to its neighbors.           | Local outliers in density-based datasets.     |
| **K-Means Clustering**            | Identifies outliers as points far from their nearest cluster centroid.         | Data with roughly spherical clusters.        |
| **Gaussian Mixture Models (GMM)** | Assumes data points follow a mixture of Gaussian distributions. Points with low probabilities are outliers. | Probabilistic clustering.                   |
| **Agglomerative Clustering**      | Detects outliers by examining points that form singleton clusters.              | Clustering-based detection.                  |
| **Box Plot**                      | Visual method using quartiles to detect outliers as points outside the whiskers. | Simple visual detection.                     |
| **Scatter Plot**                  | Identifies outliers by visualizing data points far from the main concentration. | Two- or three-variable data.                 |
| **Histogram**                     | Shows outliers as isolated bins far from the bulk of the distribution.         | Simple univariate visualization.             |
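
The three statistical rules at the top of the table can be sketched in PyTorch as below. The sample data is hypothetical, and the cutoffs (3 for the z-score, 3.5 with the 0.6745 scaling for the modified z-score, 1.5 × IQR for the quartile rule) are common conventions rather than fixed requirements.

```python
import torch

# Hypothetical 1-D sample with one obvious extreme value (95).
values = torch.tensor([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Z-score: distance from the mean in units of the sample standard deviation.
z_scores = (values - values.mean()) / values.std()
z_outliers = z_scores.abs() > 3.0

# Modified z-score: median and MAD replace mean and standard deviation.
# torch.median returns the lower of the two middle values for even-length input.
median = values.median()
mad = (values - median).abs().median()
modified_z = 0.6745 * (values - median) / mad  # assumes mad > 0
modified_z_outliers = modified_z.abs() > 3.5

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = torch.quantile(values, torch.tensor([0.25, 0.75]))
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(z_outliers)
print(modified_z_outliers)
print(iqr_outliers)
```

On this sample the modified z-score and the IQR rule flag only the value 95. The plain z-score flags nothing: the extreme value inflates the standard deviation it is measured against, so its z-score is only about 2.47. This masking effect is one reason the robust variants are preferred for small or heavily contaminated samples.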

## Outlier Handling Methods

| **Method**                        | **Description**                                                               | **Use Case**                                  |
|------------------------------------|-------------------------------------------------------------------------------|-----------------------------------------------|
| **Remove Outliers**                | Remove data points identified as outliers based on detection methods.         | When outliers are erroneous or not useful.    |
| **Replace with Mean/Median/Mode**  | Replace outliers with the mean, median, or mode of the feature.               | When outliers are suspected to be errors.     |
| **Impute using KNN**               | Use k-nearest neighbors to replace outliers with values based on similar data. | When outliers need to be replaced with plausible values. |
| **Log Transformation**             | Apply a logarithmic transformation to reduce the impact of extreme values.    | Right-skewed, positive data with extreme values. |
| **Square Root or Cube Root**       | Apply square root or cube root to stabilize variance and reduce skewness.    | Skewed data with a small number of large values. |
| **Box-Cox Transformation**         | A family of power transformations that makes data closer to normally distributed and reduces outlier impact. | Non-normal, strictly positive data with outliers. |
| **Clipping**                       | Cap values at predefined thresholds or percentiles (e.g., the 5th and 95th percentiles). | When outliers are reasonable but extreme.     |
| **Robust Regression**              | Use models like RANSAC that are less sensitive to outliers.                   | When using regression models.                 |
| **Robust PCA**                     | PCA method that reduces the influence of outliers in dimensionality reduction. | High-dimensional data with outliers.          |
| **Cluster Outliers**               | Treat outliers as a separate cluster if they represent valuable but rare events. | Fraud detection, anomaly detection.           |
| **Ensemble Methods (e.g., Random Forests)** | Combine multiple decision trees to reduce the impact of outliers.            | When using tree-based algorithms.             |
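
A few of the simpler handling strategies from the table, sketched in PyTorch on a hypothetical sample. The IQR rule is used as the detector here only for illustration, and the 5th/95th percentile bounds for clipping are an assumed choice.

```python
import torch

# Hypothetical 1-D sample with one extreme value (95).
values = torch.tensor([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Flag outliers with the IQR rule (any detector from the table could be used).
q1, q3 = torch.quantile(values, torch.tensor([0.25, 0.75]))
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Remove outliers: keep only the points that were not flagged.
removed = values[~is_outlier]

# Replace with median: use the median of the remaining points.
median_replaced = torch.where(is_outlier, removed.median(), values)

# Clipping: cap values at the 5th and 95th percentiles.
low, high = torch.quantile(values, torch.tensor([0.05, 0.95]))
clipped = torch.clamp(values, min=low.item(), max=high.item())

# Log transformation: log1p(x) = log(1 + x), suitable for non-negative data.
log_scaled = torch.log1p(values)

print(removed)
print(median_replaced)
print(clipped)
print(log_scaled)
```

Removal and replacement change the data only at the flagged positions, whereas the log transformation rescales every value, so it is usually applied to the whole feature before modeling rather than as a repair for individual points.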

---
