## Python Basics

In [None]:
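# List comprehension: squares of the even numbers in 0..9
[x**2 for x in range(10) if x % 2 == 0]    # [0, 4, 16, 36, 64]

# Dict comprehension: map each letter to its position
{letter: i for i, letter in enumerate(['a', 'b', 'c'])}    # {'a': 0, 'b': 1, 'c': 2}

# Lambda: a small anonymous function
f = lambda x: x * 2

# map/filter: both return lazy iterators, so wrap them in list()
list(map(f, [1, 2, 3]))                     # [2, 4, 6]
list(filter(lambda x: x > 1, [1, 2, 3]))    # [2, 3]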


✅ 1. MCAR — Missing Completely At Random
The missingness is independent of both observed and unobserved data.

No systematic pattern: the fact that data is missing has nothing to do with any variables in your dataset.

👉 Example: A survey sheet is accidentally lost; any data on that sheet is missing purely by chance.

Key point: MCAR is the best-case scenario. Analyzing only the complete rows stays unbiased, although the smaller sample reduces statistical power.
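
To make the MCAR claim concrete, here is a small simulation sketch using only the standard library. The score distribution, the 30% loss rate, and the variable names are illustrative assumptions, not taken from any real survey.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical survey scores: 10,000 values centered on 50.
scores = [random.gauss(50, 10) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of being lost,
# regardless of its own size or of any other variable.
observed = [s for s in scores if random.random() > 0.30]

full_mean = statistics.mean(scores)
observed_mean = statistics.mean(observed)
print(f"full mean:     {full_mean:.2f}")
print(f"observed mean: {observed_mean:.2f}")  # expected to sit close to the full mean
```

The two means should differ only by sampling noise, which is what "unbiased" means here. The cost is that roughly 30% of the rows are gone.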

✅ 2. MAR — Missing At Random
The missingness is related to observed data but not to the missing data itself.

There’s a systematic relationship between missingness and other variables you’ve measured.

👉 Example: In a medical study, men are less likely to report weight than women. Once gender is taken into account, whether a weight is missing no longer depends on the weight itself.

Key point: MAR allows valid imputation using other observed variables.
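
A rough simulation of the weight example, assuming made-up weight distributions and non-response rates (60% for men, 10% for women). It compares the observed-only mean with a gender-wise mean imputation; all names and numbers are illustrative.

```python
import random
import statistics

random.seed(1)

# Hypothetical study: gender is always recorded, weight sometimes is not.
people = []
for _ in range(10_000):
    gender = random.choice(["M", "F"])
    weight = random.gauss(80, 10) if gender == "M" else random.gauss(65, 10)
    # MAR: the chance of a missing weight depends only on gender (observed),
    # not on the weight itself.
    p_missing = 0.60 if gender == "M" else 0.10
    reported = None if random.random() < p_missing else weight
    people.append((gender, weight, reported))

true_mean = statistics.mean(w for _, w, _ in people)
naive_mean = statistics.mean(r for _, _, r in people if r is not None)

# Impute each missing weight with the observed mean of the same gender.
group_mean = {
    g: statistics.mean(r for gg, _, r in people if gg == g and r is not None)
    for g in ("M", "F")
}
imputed = [r if r is not None else group_mean[g] for g, _, r in people]
imputed_mean = statistics.mean(imputed)

print(f"true mean:           {true_mean:.2f}")
print(f"observed-only mean:  {naive_mean:.2f}")    # pulled toward the women's mean
print(f"gender-wise imputed: {imputed_mean:.2f}")  # close to the true mean again
```

Ignoring the missing rows underestimates the average weight because men are under-represented among the responses. Using the observed variable (gender) to impute removes most of that bias.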

✅ 3. MNAR — Missing Not At Random
The missingness is related to the missing values themselves (unobserved data).

The reason for missing data is directly tied to the missing value.

👉 Example: People with higher incomes are less likely to report their income in a survey because their income is high.

Key point: MNAR is the most problematic case. You can't solve it just by looking at observed data; it needs an explicit model of the missingness or external information.
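
The income example can be simulated the same way. This is a sketch under invented assumptions: incomes in thousands drawn from a normal distribution, and a 70% non-response rate above 60.

```python
import random
import statistics

random.seed(2)

# Hypothetical incomes in thousands.
incomes = [random.gauss(50, 15) for _ in range(10_000)]

# MNAR: the chance of non-response depends on the income itself.
# Above 60 the non-response rate is 70%; below it everyone answers.
reported = [
    None if (inc > 60 and random.random() < 0.70) else inc
    for inc in incomes
]

true_mean = statistics.mean(incomes)
observed = [r for r in reported if r is not None]
observed_mean = statistics.mean(observed)

# Mean imputation only copies the biased observed mean into the gaps.
imputed = [r if r is not None else observed_mean for r in reported]
imputed_mean = statistics.mean(imputed)

print(f"true mean:     {true_mean:.2f}")
print(f"observed mean: {observed_mean:.2f}")  # too low: high earners are missing
print(f"imputed mean:  {imputed_mean:.2f}")   # same bias as the observed mean
```

No other column explains the gaps here, so nothing in the observed data can correct the underestimate.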



## 📝 Summary of Imputation Techniques

| Technique                  | How it works                                       | Best used when...                               | Pros                                  | Cons                              |
|---------------------------|---------------------------------------------------|------------------------------------------------|-------------------------------------|-----------------------------------|
| **Mean Imputation**        | Replace missing value with column mean             | Data is **normally distributed** (no skew)     | Simple, fast                         | Reduces variance, ignores relation between features |
| **Median Imputation**      | Replace missing value with column median           | Data is **skewed or has outliers**             | Robust to outliers                   | Ignores relation between features |
| **Mode Imputation**        | Replace missing categorical value with most common | Categorical data, **balanced classes**         | Works for categorical                | Biased if class imbalance exists  |
| **Forward Fill (ffill)**   | Fill missing value with last valid observation     | **Time-series data** where prev value matters   | Preserves trends                     | Bad if sudden changes happen      |
| **Backward Fill (bfill)**  | Fill missing value with next valid observation     | **Time-series data**                          | Similar to ffill                     | Can propagate incorrect values    |
| **Interpolation**          | Estimate missing values between known points       | **Continuous, ordered data**                   | Fills gaps smoothly                  | Bad for categorical data          |
| **KNN Imputation**         | Use **k-nearest neighbors** to estimate missing    | Missingness depends on other variables (MAR)    | Captures feature relationships       | Slow for large data               |
| **Multivariate Imputation (MICE)** | Iteratively predicts missing using other variables | **Multiple variables missing, MAR**        | Statistically principled; multiple imputations reflect uncertainty | Complex, computationally expensive|
| **Random Sample Imputation** | Replace missing with random sample from column    | Data is **MCAR (Missing Completely At Random)** | Preserves variance                   | Adds randomness (noise)           |
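
A minimal pandas sketch of the single-column techniques from the table. The toy values and variable names are made up for illustration; the calls are standard `pandas.Series` methods.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two gaps and one outlier.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 100.0])

mean_filled = s.fillna(s.mean())      # gaps become about 34.67 (pulled up by the outlier)
median_filled = s.fillna(s.median())  # gaps become 3.0
ffilled = s.ffill()                   # 1, 1, 3, 3, 100
bfilled = s.bfill()                   # 1, 3, 3, 100, 100
interpolated = s.interpolate()        # linear by default: 1, 2, 3, 51.5, 100

# Mode imputation for a categorical column.
colors = pd.Series(["red", "blue", None, "red"])
mode_filled = colors.fillna(colors.mode()[0])  # gap becomes "red"

# Random sample imputation: draw replacements from the observed values.
missing = s.isna()
draws = s.dropna().sample(n=int(missing.sum()), replace=True, random_state=0)
random_filled = s.copy()
random_filled[missing] = draws.to_numpy()
```

The outlier shows why the median is preferred for skewed data: the mean fill lands far from the typical values, while the median fill does not.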

---

## 🟢 General Recommendations:
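- ✅ If **MCAR →** you can safely drop rows or use random-sample/mean imputation
- ✅ If **MAR →** use KNN, regression, or MICE imputation so the fill draws on other features
- ✅ If **MNAR →** you need external information or an explicit model of the missingness; standard imputation may **bias results**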
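
For the MAR case, the idea behind KNN imputation can be sketched in plain Python. `knn_impute` is a hypothetical helper written for this note, simplified on purpose (no feature scaling, no distance weighting); the height/weight rows are invented.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest rows.

    Distance is the root-mean-square difference over the columns
    that both rows have observed. The input is not modified.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

# Made-up data: [height_cm, weight_kg]; one weight is missing.
data = [
    [160, 55.0],
    [162, 57.0],
    [185, 90.0],
    [161, None],
]
result = knn_impute(data, k=2)
print(result[3])  # [161, 56.0]: mean weight of the two rows closest in height
```

In practice a library implementation is the usual choice; scikit-learn, for example, provides `KNNImputer` and a MICE-style `IterativeImputer`.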

---

### 🚩 Rule of Thumb:
- **Numerical, normal data → mean imputation**
- **Numerical, skewed data → median**
- **Categorical → mode**
- **Time-series → ffill or interpolation**
- **Missingness related to other variables → KNN or MICE**
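
The single-column rules above could be encoded roughly as follows. `suggest_imputation` is a hypothetical helper, and the skewness cutoff of 1.0 is an arbitrary assumption, not a standard value. The KNN/MICE rule is left out because it cannot be decided from one column alone.

```python
import pandas as pd

def suggest_imputation(s: pd.Series, skew_threshold: float = 1.0) -> str:
    """Map a column to the rule-of-thumb strategy (illustrative only)."""
    if isinstance(s.index, pd.DatetimeIndex):
        return "ffill or interpolation"
    if pd.api.types.is_numeric_dtype(s):
        # |skew| above the threshold is treated as "skewed"; NaN is skipped by skew().
        return "median" if abs(s.skew()) > skew_threshold else "mean"
    return "mode"

symmetric = pd.Series([1.0, 2.0, None, 4.0, 5.0])
skewed = pd.Series([1, 1, 2, 2, 3, 100, None])
categorical = pd.Series(["a", "b", None])
timeseries = pd.Series([1.0, None, 3.0], index=pd.date_range("2024-01-01", periods=3))

for name, col in [("symmetric", symmetric), ("skewed", skewed),
                  ("categorical", categorical), ("timeseries", timeseries)]:
    print(name, "->", suggest_imputation(col))
```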
