<a href="https://colab.research.google.com/github/gitmystuff/DTSC5082/blob/main/Interview_Prep_4/interview_prep_concepts_review_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering, Selection & Dimensionality Reduction

## 🎯 Interview Prep: Concepts Review
**Class Duration:** ~1 hour

---

### 📚 Topics Covered
1. **Feature Engineering:** Creating and transforming variables to improve model performance.
2. **Feature Selection:** Identifying the most relevant subset of predictors.
3. **Feature Extraction:** Creating new features from raw data (including deep learning approaches).
4. **Dimensionality & The Curse of Dimensionality:** Understanding how high-dimensional data affects model behavior.
5. **Dimensionality Reduction Techniques:** Methods like PCA, t-SNE, and LDA.

---

### 🚀 Learning Objectives
By the end of this session, you should be able to:

* **Implement** core feature engineering techniques used in industry (scaling, encoding, etc.).
* **Strategize** when and how to apply specific feature selection methods (Filter, Wrapper, and Embedded).
* **Explain** the mechanics of automatic feature extraction within deep learning architectures.
* **Identify** and mitigate the "Curse of Dimensionality" in large datasets.
* **Apply** the appropriate dimensionality reduction techniques based on the data structure and goal.

## Setup

In [None]:
# ==========================================
# Module 4: Environment Setup
# ==========================================

# Standard Data Science Stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Model Selection
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    LabelEncoder
)
from sklearn.model_selection import train_test_split

# Feature Selection Methods
from sklearn.feature_selection import mutual_info_classif, chi2, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Lasso

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE

# Datasets for Practice
from sklearn.datasets import load_iris, fetch_california_housing

# Global Configuration
np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("✓ Libraries loaded successfully!")

# 1. Feature Engineering

> **Interview Question:** *"Why is feature engineering important?"*
> **Answer:** Feature engineering can make or break a model. Better features often matter more than better algorithms; they allow simpler models to perform better and make the underlying patterns in the data more accessible to the learner.

### 🛠 Key Techniques

* **Derived Features:** Creating new variables from existing ones (e.g., calculating "Age" from "Date of Birth").
* **Transformations:** Applying mathematical functions like Log, Square Root, or Box-Cox to handle skewed data.
* **Scaling:**
  * **Standardization:** Rescaling data to have a mean of 0 and a standard deviation of 1.
  * **Normalization:** Rescaling data to a fixed range, usually [0, 1].


* **Encoding:** Converting categorical variables into numerical formats (e.g., One-Hot Encoding or Label Encoding).
* **Missing Data:** Using imputation strategies (Mean, Median, Mode, or KNN) to handle null values.
* **Outlier Treatment:** Detection via Z-score or IQR and handling via clipping, capping, or removal.
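
Several of the techniques listed above can be sketched in a few lines of pandas. This is a minimal illustration, not a recipe: the column names, the values, and the fixed `REFERENCE_YEAR` are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
df = pd.DataFrame({
    "birth_year": [1990, 1985, 2000, 1975, 1995],
    "income": [40_000, 52_000, 48_000, 1_000_000, 45_000],  # one extreme value
})

# Derived feature: age from birth year (reference year fixed for reproducibility).
REFERENCE_YEAR = 2025
df["age"] = REFERENCE_YEAR - df["birth_year"]

# Transformation: log1p compresses the long right tail of income.
df["log_income"] = np.log1p(df["income"])

# Outlier treatment: cap values that fall outside the 1.5 * IQR fences.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["income_capped"] = df["income"].clip(lower=lower, upper=upper)

print(df)
```

Capping keeps the row while limiting its influence; whether to cap, clip, or remove depends on whether the extreme value is an error or a genuine observation.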

## 1.1 Scaling: Why It Matters

## 🎯 Interview Prep: Concepts Review
**Context:** Essential for model convergence and accuracy.

---

### 📚 Core Concepts
* **Definition:** Scaling transforms features to a similar range so that no single variable dominates the model due to its magnitude.
* **Standardization ($Z$-score):** Centers data around a mean of 0 with a standard deviation of 1.
* **Normalization (Min-Max):** Rescales data to a fixed range, typically [0, 1].



---

### 🚀 Key Takeaways: When to Scale

* **Distance-Based Algorithms:** **KNN**, **SVM**, and **K-Means** require scaling because they rely on Euclidean distance; without it, features with larger units (like Salary) will drown out smaller ones (like Age).
* **Gradient Descent Optimization:** Models like **Neural Networks** and **Linear/Logistic Regression** converge significantly faster when features are on the same scale.
* **Dimensionality Reduction:** **PCA** is variance-driven. If one feature has a much larger scale, it will dominate the first principal component because of its units, not because it carries more information.
* **Unit Consistency:** Always scale when your dataset mixes different units (e.g., meters, kilograms, and currency).
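
To make the "drowning out" effect concrete, the sketch below measures how much of the total squared Euclidean distance comes from salary, before and after standardization. The data are synthetic and `salary_share` is a hypothetical helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=200)              # years
salary = rng.uniform(30_000, 150_000, size=200)  # dollars
X = np.column_stack([age, salary])

def salary_share(data):
    """Fraction of the summed squared pairwise distance contributed by column 1."""
    diffs = data[:, None, :] - data[None, :, :]   # shape (n, n, 2)
    per_feature = (diffs ** 2).sum(axis=(0, 1))   # one total per feature
    return per_feature[1] / per_feature.sum()

# Z-score standardization: mean 0, standard deviation 1 per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(f"Salary share of distance (raw):          {salary_share(X):.6f}")
print(f"Salary share of distance (standardized): {salary_share(X_std):.6f}")
```

On the raw data, age contributes almost nothing to the distance; after standardization both features contribute equally.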

In [None]:
# Load California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Displaying min/max to identify scale discrepancies
print("Original data ranges:")
print(X.describe().loc[['min', 'max']])

print("\n⚠️ Notice: Features have vastly different scales!")

In [None]:
# Compare scaling methods
scaler_std = StandardScaler()
scaler_minmax = MinMaxScaler()

X_std = scaler_std.fit_transform(X)
X_minmax = scaler_minmax.fit_transform(X)

# Visualize one feature: 'MedInc'
feature = 'MedInc'
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original Data
axes[0].hist(X[feature], bins=30, edgecolor='black')
axes[0].set_title(f'Original: {feature}')
axes[0].set_xlabel('Value')

# StandardScaler
axes[1].hist(X_std[:, 0], bins=30, edgecolor='black', color='orange')
axes[1].set_title('StandardScaler (mean=0, std=1)')
axes[1].set_xlabel('Scaled Value')

# MinMaxScaler
axes[2].hist(X_minmax[:, 0], bins=30, edgecolor='black', color='green')
axes[2].set_title('MinMaxScaler (range 0-1)')
axes[2].set_xlabel('Scaled Value')

plt.tight_layout()
plt.show()

print("\n✓ Same distribution shape, different scales!")

## 1.2 Comparison: Choosing the Right Scaler

### 📚 Key Distinctions
Use the following table to determine which scaling method fits your specific data profile.

| Method | Formula | When to Use |
| :--- | :--- | :--- |
| **StandardScaler** | $$z = \frac{x - \mu}{\sigma}$$ | **The Default Choice.** Works well for roughly Gaussian features, for linear models trained with regularization or gradient descent (Linear/Logistic Regression), and for PCA. |
| **MinMaxScaler** | $$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$ | **Bounded Range.** Best for Image processing, Neural Networks, or when you need exactly $[0, 1]$. |
| **RobustScaler** | $$x_{scaled} = \frac{x - Q_2}{Q_3 - Q_1}$$ | **Outlier-Resistant.** Uses median and IQR; essential when data contains significant noise or extreme outliers. |
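
The table's distinctions show up clearly on a column with one extreme value. The following sketch applies all three scikit-learn scalers to a small, made-up array; the numbers are chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature column with a single extreme outlier (500).
x = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),
    "minmax": MinMaxScaler().fit_transform(x).ravel(),
    "robust": RobustScaler().fit_transform(x).ravel(),
}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

With MinMaxScaler the five typical values are squeezed into a sliver near 0, while RobustScaler (median and IQR) keeps them spread out and leaves only the outlier far away.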


## 1.3 Categorical Encoding

## 🎯 Interview Prep: Concepts Review
**Context:** Converting non-numeric data into a format usable by machine learning models while maintaining statistical integrity.


### 📚 Core Concept: The Dummy Variable Trap
> **Interview Question:** *"What's the dummy variable trap?"*
>
> **Answer:** The dummy variable trap is a scenario where independent variables are highly correlated (multicollinearity). If you have 3 categories and create 3 dummy variables, the 3rd variable is perfectly predictable from the first two. This "perfect multicollinearity" can make it impossible to calculate unique coefficients in models like Linear Regression.
>
> **Solution:** Drop one category: use `drop_first=True` in `pd.get_dummies` or `drop='first'` in `OneHotEncoder`.
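
The two libraries spell the option differently, which is easy to mix up in an interview. A small sketch on a made-up `Pet` column, showing the pandas and scikit-learn versions side by side:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

pets = pd.DataFrame({"Pet": ["dog", "cat", "cat", "dog", "bird"]})

# pandas: drop_first=True removes the first category in sorted order ('bird').
dummies = pd.get_dummies(pets, columns=["Pet"], drop_first=True)

# scikit-learn: the equivalent option is drop='first'.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(pets[["Pet"]]).toarray()

print(dummies.columns.tolist())          # ['Pet_cat', 'Pet_dog']
print(encoder.categories_[0].tolist())   # ['bird', 'cat', 'dog']
print(encoded.shape)                     # (5, 2)
```

In both cases the dropped category becomes the baseline: a row of all zeros means `bird`.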

In [None]:
# Demonstration
pets = {'Pet': ['dog', 'cat', 'cat', 'dog', 'bird', 'cat', 'dog', 'bird']}
df = pd.DataFrame(pets)

print("Original:")
print(df['Pet'].value_counts())

# WRONG WAY (dummy trap): all k columns are kept
df_wrong = pd.get_dummies(df, columns=['Pet'], prefix='Pet')
print("\n❌ WRONG (creates 3 columns for 3 categories):")
print(df_wrong.head())
print(f"Columns: {df_wrong.columns.tolist()}")

# CORRECT WAY: drop_first=True removes the first category in sorted order
df_correct = pd.get_dummies(df, columns=['Pet'], prefix='Pet', drop_first=True)
print("\n✓ CORRECT (drop_first=True, only 2 columns):")
print(df_correct.head())
print(f"Columns: {df_correct.columns.tolist()}")

print("\n'bird' is the reference category (all 0s = bird)")

## 1.4 Dummy Trap

The **dummy trap (dummy variable trap)** refers to the problem of **perfect multicollinearity** that occurs when you one-hot encode a categorical variable and keep **all** dummy columns **along with an intercept** in a linear model.

**What happens**

Suppose a categorical variable has *k* categories.
One-hot encoding creates *k* binary columns.

If you also include an intercept term, then:

* The sum of the *k* dummy columns = 1 for every row
* One column can be written as a linear combination of the others
* The design matrix becomes **rank-deficient**

This breaks assumptions of linear regression and similar models, making coefficients **non-identifiable** (infinite or unstable solutions).

### Simple example

Categories: {Red, Blue, Green}

One-hot encoded:

* Red
* Blue
* Green

For every row:

```
Red + Blue + Green = 1
```

With an intercept, this creates perfect multicollinearity.
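
The rank deficiency can be checked numerically. This sketch builds the design matrix by hand with NumPy for the Red/Blue/Green example; the six sample rows are arbitrary.

```python
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Red", "Green", "Blue"])
categories = ["Red", "Blue", "Green"]

# One 0/1 column per category (k = 3 dummies).
dummies = np.column_stack([(colors == c).astype(float) for c in categories])
intercept = np.ones((len(colors), 1))

full = np.hstack([intercept, dummies])            # intercept + k dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # intercept + (k - 1) dummies

print(np.linalg.matrix_rank(full), "of", full.shape[1], "columns")        # 3 of 4 columns
print(np.linalg.matrix_rank(reduced), "of", reduced.shape[1], "columns")  # 3 of 3 columns
```

The full matrix has four columns but rank three, so one column is redundant; dropping a dummy restores full column rank.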

### Why it matters

* Linear and generalized linear models cannot uniquely estimate coefficients
* Inversion of the matrix fails or becomes numerically unstable
* Coefficient interpretations become meaningless

### How to avoid it

* **Drop one dummy column** (use *k−1* encoding)
* The dropped category becomes the **reference (baseline)**
* Most libraries offer this as an option rather than a default (e.g., `drop_first=True` in pandas, `drop='first'` in scikit-learn's `OneHotEncoder`)

### Key takeaway

The dummy trap is not about encoding itself; it's about **keeping redundant information**.
Always use **k−1 dummy variables** when your model includes an intercept.


## 1.5 Missing Data Strategy

## 🎯 Interview Prep: Concepts Review
**Context:** Strategizing data imputation based on the underlying cause of missingness.

---

### 📚 Core Concepts
> **Interview Question:** *"How do you handle missing data?"*
>
> **Answer depends on the type:**
>
> 1. **MCAR (Missing Completely at Random):** No relationship between missing data and any values. **Implication:** Mean/Median imputation is generally safe.
> 2. **MAR (Missing at Random):** Missingness is related to other observed features (e.g., men are less likely to report weight). **Implication:** Use other features to predict or impute.
> 3. **MNAR (Missing Not at Random):** The missingness itself is related to the unobserved value. **Implication:** The fact it is missing is informative; flag it! **Example:** High earners refuse to report income (MNAR) → Use an extreme value or a binary indicator to flag the missingness.
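
Each of the three cases maps to a different tool. The sketch below is one possible pairing, using scikit-learn's `SimpleImputer` and `KNNImputer` plus a hand-made indicator column; the `Income_missing` name and `n_neighbors=2` are choices made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

data = pd.DataFrame({
    "Age": [25, 30, np.nan, 45, 50, np.nan, 35],
    "Income": [50000, np.nan, 60000, 80000, np.nan, 70000, 65000],
})

# MCAR-style: a simple per-column median is usually acceptable.
median_filled = SimpleImputer(strategy="median").fit_transform(data)

# MAR-style: borrow information from the other observed feature via nearest neighbors.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(data)

# MNAR-style: keep a binary flag so the model can use the missingness itself.
flagged = data.assign(Income_missing=data["Income"].isna().astype(int))
flagged["Income"] = flagged["Income"].fillna(flagged["Income"].median())

print(pd.DataFrame(knn_filled, columns=data.columns))
print(flagged)
```

In practice the imputer should be fit on the training split only and then applied to the test split, to avoid leakage.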

In [None]:
import pandas as pd
import numpy as np

# Reproducibility
np.random.seed(42)

# Generate synthetic data
data = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50, np.nan, 35],
    'Income': [50000, np.nan, 60000, 80000, np.nan, 70000, 65000]
})

print("Original data:")
print(data)


In [None]:
# Strategy: Imputation (Pandas 3.0 Compatible)
data_imputed = data.copy()

# Direct assignment
data_imputed['Age'] = data_imputed['Age'].fillna(data['Age'].median())
data_imputed['Income'] = data_imputed['Income'].fillna(data['Income'].median())

print("\nAfter median imputation:")
print(data_imputed)

# 2. Feature Selection

## 🎯 Interview Prep: Concepts Review
**Context:** Identifying the most relevant features to improve model efficiency and reduce overfitting.



### 📚 Core Concepts
> **Interview Question:** *"What are the three types of feature selection methods?"*
>
> **Answer:**
>
> | Method | How It Works | Speed | Accuracy |
> | :--- | :--- | :--- | :--- |
> | **Filter** | Statistical tests (correlation, chi², MI) | Fast | Good |
> | **Wrapper** | Train model iteratively (RFE) | Slow | Best |
> | **Embedded** | Built into training (Lasso, tree importance) | Medium | High |



---

## 2.1 Filter Method: Correlation

## 🎯 Interview Prep: Concepts Review
**Context:** Using statistical measures to identify and remove redundant information before model training.


### 📚 Core Concepts
* **Use Case:** Quick first pass to remove redundant features.
* **Rule of Thumb:** If |correlation| > 0.8–0.9, consider dropping one feature of the pair to reduce multicollinearity.
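
The rule of thumb can be turned into a small helper. `drop_correlated` below is a hypothetical function written for this sketch (it is not a pandas or scikit-learn API), and the 0.9 threshold and synthetic columns are assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # nearly duplicates 'a'
    "b": rng.normal(size=200),                           # independent of 'a'
})

reduced, dropped = drop_correlated(demo)
print("Dropped:", dropped)
print("Kept:", reduced.columns.tolist())
```

Which member of a correlated pair to keep is a judgment call; this helper simply keeps the one that appears first.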


In [None]:
# Load Iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = iris.target

# Correlation matrix
corr_matrix = X_iris.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.show()

print("\nHighly correlated pairs:")
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            print(f"  {corr_matrix.columns[i]} & {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.3f}")

In [None]:
# Correlation matrix
corr_matrix = X_iris.corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={"shrink": 0.8}
)

plt.title('Feature Correlation Matrix (Lower Triangle)', fontsize=14)
plt.tight_layout()
plt.show()


## 2.2 Filter Method: Mutual Information

## 🎯 Interview Prep: Concepts Review
**Context:** Evaluating feature relevance through information gain and statistical dependence.


### 📚 Core Concepts
* **Advantage over correlation:** Unlike Pearson correlation, which only measures linear associations, **Mutual Information (MI)** captures any kind of statistical dependency, including non-linear relationships!
* **Scale:**
    * **MI = 0:** The variables are completely independent.
    * **Higher MI:** Indicates a stronger relationship between the feature and the target.



> **Interview Tip:** Always mention that Mutual Information is non-parametric and doesn't assume a normal distribution, making it more robust than simple correlation for complex datasets.
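
A quick way to see the difference is a purely quadratic relationship, where Pearson correlation is zero by symmetry but the dependence is total. The sketch uses `mutual_info_regression` (the continuous-target counterpart of `mutual_info_classif`) on a made-up parabola.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# y depends on x deterministically, but not linearly.
x = np.linspace(-1, 1, 1001)
y = x ** 2

pearson_r = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=42)[0]

print(f"Pearson r:          {pearson_r:.4f}")
print(f"Mutual information: {mi:.4f}")
```

Pearson correlation reports no relationship, while the MI estimate is clearly positive.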


In [None]:
# Mutual Information
mi_scores = mutual_info_classif(X_iris, y_iris, random_state=42)
mi_scores = pd.Series(mi_scores, index=X_iris.columns).sort_values(ascending=False)

plt.figure(figsize=(8, 4))
mi_scores.plot(kind='barh', color='coral')
plt.xlabel('Mutual Information Score')
plt.title('Feature Importance via Mutual Information')
plt.tight_layout()
plt.show()

print("Mutual Information Scores:")
print(mi_scores)

## 2.3 Wrapper Method: Recursive Feature Elimination (RFE)

## 🎯 Interview Prep: Concepts Review
**Context:** Iteratively selecting features by training a model and removing the least important predictors.

### 📚 Core Concepts
> **Interview Question:** *"What's the advantage of wrapper methods?"*
>
> **Answer:** Unlike filter methods, wrappers consider **feature interactions**. They evaluate how features perform together within a specific model architecture, making them highly optimized for that learner.
>
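> **Downside:** They are **computationally expensive** because they require retraining the model many times (e.g., RFE refits the model after every elimination round, so going from $N$ features to $k$ with one feature removed per step takes about $N-k$ fits).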


In [None]:
# RFE with Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X_train, y_train)

# Results
rfe_results = pd.DataFrame({
    'Feature': X_iris.columns,
    'Selected': rfe.support_,
    'Ranking': rfe.ranking_
}).sort_values('Ranking')

print("RFE Results:")
print(rfe_results)
print(f"\nSelected features: {X_iris.columns[rfe.support_].tolist()}")

## 2.4 Embedded Method: Tree-Based Feature Importance

## 🎯 Interview Prep: Concepts Review
**Context:** Leveraging models that perform feature selection automatically as part of the training process.

### 📚 Core Concepts
* **Definition:** Embedded methods integrate feature selection directly into the model construction.
* **Tree-Based Importance:** Decision trees and ensembles (like Random Forest or XGBoost) calculate importance based on how much each feature reduces impurity (Gini or Entropy) across all nodes.


> **Interview Tip:** Random Forests and Gradient Boosting Machines provide feature importance "for free" during the training phase, making them highly efficient for initial data exploration.

In [None]:
# Random Forest feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get importances
importances = pd.Series(rf.feature_importances_, index=X_iris.columns).sort_values(ascending=False)

plt.figure(figsize=(8, 4))
importances.plot(kind='barh', color='forestgreen')
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

print("Feature Importances:")
print(importances)

Importance measures how much a feature reduces impurity (e.g., Gini or entropy), averaged over all trees in the forest.
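
Lasso, the other embedded method named in the overview table, selects features by shrinking some coefficients exactly to zero. A minimal sketch on synthetic regression data, where only the first two of five features drive the target; `alpha=0.1` and the noise level are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
# Only features 0 and 1 influence y; features 2-4 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=n)

# L1 penalties are scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X_scaled, y)
selected = np.flatnonzero(lasso.coef_)

print("Coefficients:", lasso.coef_.round(3))
print("Selected feature indices:", selected.tolist())
```

The surviving non-zero coefficients are the selected features; a larger `alpha` prunes more aggressively.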

# 3. Feature Extraction (Deep Learning)

## 🎯 Interview Prep: Concepts Review
**Context:** Understanding how deep neural networks automatically transform raw data into informative representations.

### 📚 Core Concepts
> **Interview Question:** *"How does feature extraction differ from feature selection?"*
>
> **Answer:**
> * **Feature Selection:** Selecting a subset of the **existing** features.
> * **Feature Extraction:** Creating **new** features by transforming or projecting existing data into a new space (e.g., PCA or Neural Network embeddings).







## 3.1 CNNs (Computer Vision)
**How CNNs extract features:**
CNNs learn a hierarchy of features through successive layers of convolution and pooling:

1. **Early Layers:** Detect low-level features like **edges, corners, and colors**.
2. **Middle Layers:** Combine low-level features into **textures and patterns**.
3. **Deep Layers:** Assemble patterns into complex, high-level **objects or faces**.

CNNs learn visual meaning the way we learn language: from letters, to words, to sentences, and finally to concepts.
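
To see what an "edge detector" in an early layer computes, the sketch below slides a 3×3 filter over a tiny image. It uses a hand-written Sobel kernel as a stand-in for a learned filter (a real CNN learns its kernel values during training), and `conv2d_valid` is a helper defined here, not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding, the operation a conv layer applies."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds to horizontal changes in brightness.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, sobel_x)
print(feature_map)
```

The feature map is zero in the flat regions and 4 in the two columns that straddle the edge, which is exactly the kind of low-level signal later layers combine into textures and shapes.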

**Transfer Learning:**
You can use a pre-trained CNN (such as **VGG16**, **ResNet**, or **Inception**) as a fixed feature extractor by removing the final classification head and using the output of the convolutional base as input for a new model.

## 3.2 LLMs (Natural Language Processing)

## 🎯 Interview Prep: Concepts Review
**Context:** Understanding how Transformer architectures transform raw text into high-dimensional contextual embeddings.


### 📚 Core Concepts
**How Transformers extract features:**

1.  **Tokenization:** Breaking raw text into smaller sub-word units or tokens.
2.  **Embeddings:** Mapping these tokens into a dense vector space where semantic meaning is represented numerically.
3.  **Attention:** The core mechanism that creates **contextualized representations** by weighing the importance of surrounding words.



**Example: The Polysemy Problem**
The word "**bank**" is mapped to entirely different feature vectors depending on its neighbors:
* *"river **bank**"*: Contextual features relate to **geography/nature**.
* *"**bank** account"*: Contextual features relate to **finance/business**.

The **Attention Mechanism** dynamically extracts the relevant context, ensuring the model "understands" which version of the word is being used based on the surrounding tokens.
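
A toy version of this can be written in NumPy. The sketch below is heavily simplified: the 4-dimensional embeddings are invented, and the query/key/value projections are taken to be the identity, whereas a real Transformer learns them.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections (a simplification)."""
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d))  # how much each token attends to every token
    return weights @ X                       # context-weighted mix of the embeddings

# Hypothetical static embeddings, made up for illustration.
emb = {
    "river":   np.array([1.0, 0.0, 0.0, 0.0]),
    "account": np.array([0.0, 1.0, 0.0, 0.0]),
    "bank":    np.array([0.5, 0.5, 1.0, 0.0]),
}

bank_in_river = self_attention(np.stack([emb["river"], emb["bank"]]))[1]
bank_in_account = self_attention(np.stack([emb["bank"], emb["account"]]))[0]

print("bank | river bank:  ", bank_in_river.round(3))
print("bank | bank account:", bank_in_account.round(3))
```

The same input vector for "bank" comes out tilted toward the "river" dimension in one sentence and toward the "account" dimension in the other.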


# 4. Dimensionality & The Curse

## 🎯 Interview Prep: Concepts Review
**Context:** Understanding the theoretical and practical challenges of working with high-dimensional feature spaces.


### 📚 Core Concepts
> **Interview Question:** *"What is the curse of dimensionality?"*
>
> **Answer:** The "Curse" refers to a set of phenomena that arise when analyzing data in high-dimensional spaces that do not occur in low-dimensional settings. As dimensions increase:
>
> 1. **Data Sparsity:** The volume of the space increases so fast that the available data becomes sparse. You require exponentially more data to maintain the same level of density.
> 2. **Distance Meaninglessness:** In high dimensions, the relative difference between the nearest and farthest distances from a point shrinks. To a model, everything begins to look equally "far apart," which undermines distance-based algorithms like KNN.
> 3. **Computational Explosion:** Processing time and memory requirements grow significantly.
> 4. **Overfitting:** With too many features relative to observations, models easily "memorize" noise rather than learning the underlying signal.


**Example:** To maintain the same data density when moving from **2D** to **10D**, you would theoretically need **$10^8$** times more data! With 10 points per axis, that is $10^{10}$ points instead of $10^2$, and $10^{10} / 10^2 = 10^8$.
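
The arithmetic behind that figure, assuming a grid with 10 sample points along each axis (the value 10 is only an illustrative choice):

```python
points_per_axis = 10  # assumed sampling density along each axis

for d in (2, 3, 10):
    needed = points_per_axis ** d
    print(f"{d:>2}D grid: {needed:,} points")

ratio = points_per_axis ** 10 // points_per_axis ** 2
print(f"10D needs {ratio:,}x the data of 2D")   # 100,000,000x
```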



---

In [None]:
# Demonstrate curse of dimensionality
from scipy.spatial import distance

dimensions = [2, 10, 50, 100]
n_points = 100
results = []

for d in dimensions:
    # Generate random points
    points = np.random.rand(n_points, d)

    # Calculate all pairwise distances
    distances = []
    for i in range(n_points):
        for j in range(i+1, n_points):
            distances.append(distance.euclidean(points[i], points[j]))

    distances = np.array(distances)
    results.append({
        'Dimensions': d,
        'Mean Distance': distances.mean(),
        'Std Distance': distances.std(),
        'Coef. of Variation': distances.std() / distances.mean()
    })

results_df = pd.DataFrame(results)

print("Effect of Dimensionality on Distances:")
print(results_df)

print("\n⚠️ As dimensions increase, coefficient of variation DECREASES")
print("   This means all points become roughly equidistant!")
print("   → Clustering and nearest-neighbor methods fail!")

# 5. Dimensionality Reduction

## 🎯 Interview Prep: Concepts Review
**Context:** Compressing feature spaces while retaining essential information or maximizing class discriminability.

---

### 📚 Core Concepts
> **Interview Question:** *"When would you use PCA vs LDA vs t-SNE?"*
>
> **Answer:**
>
> | Method | Type | Supervised? | Best For |
> | :--- | :--- | :--- | :--- |
> | **PCA** (Principal Component Analysis) | Linear | No | **Variance Preservation.** Best for general preprocessing and noise reduction. |
> | **LDA** (Linear Discriminant Analysis) | Linear | Yes | **Class Separation.** Best for supervised dimensionality reduction before classification. |
> | **t-SNE** (t-Distributed Stochastic Neighbor Embedding) | Non-linear | No | **Visualization.** Best for exploring local clusters in high-dimensional data; not for preprocessing. |



---

### 🚀 Key Takeaways
* **PCA** finds the axes with the maximum variance regardless of labels.
* **LDA** finds the axes that maximize the separation between class means relative to the spread within each class.
* **t-SNE** is computationally intensive and preserves local neighborhoods, but the distances between clusters in a t-SNE plot are not always meaningful.
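
The API differences mirror the conceptual ones. A short sketch on Iris, with `perplexity=30` and the random seed as arbitrary choices: PCA is fit without labels, LDA requires them, and scikit-learn's `TSNE` only offers `fit_transform`, so it cannot project new samples later.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised, linear: labels are never seen.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Supervised, linear: at most (n_classes - 1) = 2 components for Iris.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

# Unsupervised, non-linear: intended for visualization only.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

for name, Z in [("PCA", X_pca), ("LDA", X_lda), ("t-SNE", X_tsne)]:
    print(f"{name:>5}: {Z.shape}")

print("TSNE can transform new data:", hasattr(TSNE, "transform"))
```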

---

## 5.1 Demonstration: PCA Implementation

## 🎯 Interview Prep: Concepts Review

**Context:** Projecting high-dimensional data into 2D for visualization while maximizing variance.

In [None]:
# PCA demonstration
from sklearn.preprocessing import StandardScaler

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)

# Fit PCA
pca = PCA(n_components=4)
pca.fit(X_scaled)

# Explained variance
explained_var = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(4)],
    'Explained Variance': pca.explained_variance_ratio_,
    'Cumulative': np.cumsum(pca.explained_variance_ratio_)
})

print("Explained Variance:")
print(explained_var)

# Scree plot
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(explained_var['PC'], explained_var['Explained Variance'], color='steelblue', edgecolor='black')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')

plt.subplot(1, 2, 2)
plt.plot(explained_var['PC'], explained_var['Cumulative'], marker='o', color='darkred')
plt.axhline(y=0.95, color='black', linestyle='--', label='95% threshold')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance')
plt.legend()

plt.tight_layout()
plt.show()

print(f"\n✓ First 2 components explain {explained_var.iloc[1]['Cumulative']:.1%} of variance!")

## 5.2 Interview Strategy: Selecting Optimal Components

## 🎯 Interview Prep: Concepts Review
**Context:** Defending your choice of dimensionality and explaining the trade-offs between information loss and model simplicity.

### 📚 Selection Strategies

> **Interview Question:** *"How do you choose the number of components?"*

#### 1. The Elbow Method
Plot the explained variance for each component in a **Scree Plot**. Look for the "elbow": the point where the variance drop-off levels off. This indicates that additional components are likely capturing more noise than signal.


#### 2. Cumulative Explained Variance
Calculate the running total of variance captured. A common heuristic is to retain enough components to cover **90% to 95%** of the total variance in the dataset.


#### 3. Downstream Model Performance (Cross-Validation)
The most robust "real-world" method. Treat the number of components ($k$) as a **hyperparameter**. Use cross-validation to find the value of $k$ that yields the best performance on your specific target metric (e.g., F1-score or RMSE).
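
One way to treat $k$ as a hyperparameter is a scikit-learn `Pipeline` searched with `GridSearchCV`. The step names (`scale`, `pca`, `clf`), the classifier, and the 5-fold accuracy setup below are choices made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and PCA live inside the pipeline so they are refit on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [1, 2, 3, 4]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best k:", search.best_params_["pca__n_components"])
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Keeping PCA inside the pipeline matters: fitting it on the full dataset before cross-validation would leak information from the validation folds.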


### 🚀 Key Takeaways
* **Parsimony:** If 2 components capture 85% of variance and 10 components capture 90%, the simpler 2-component model is often preferred to prevent overfitting and reduce computational cost.
* **Context Matters:** If the goal is **visualization**, you are usually restricted to $k=2$ or $k=3$. If the goal is **compression**, follow the 95% variance rule.


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Iris dataset
iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Standardize features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=4)
pca.fit(X_scaled)

loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(4)],
    index=feature_names
)

loadings


In [None]:
# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(8, 6))
for i, species in enumerate(iris.target_names):
    mask = y_iris == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=species, alpha=0.6, s=50)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA: 4D → 2D Projection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Classes well-separated in just 2 dimensions!")

## PCA and Covariance

**PCA finds the directions of maximum variance by performing an eigen-decomposition of the covariance matrix.**

**Variance (Population)**:

* $\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mu)^2\right]$

**Variance (Sample)**:

* $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

**Covariance (Population)**:

* $\mathrm{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right]$


### Step-by-step relationship

### 1. Center the data

Given a data matrix $X \in \mathbb{R}^{n \times d}$:

$$
\tilde{X} = X - \mu
$$

where $\mu$ is the mean of each feature.

### 2. Compute the covariance matrix

$$
\Sigma = \frac{1}{n-1} \tilde{X}^\top \tilde{X}
$$

* Diagonal entries → variances of features
* Off-diagonal entries → covariances between features

This matrix encodes **how features vary together**.

### 3. Eigen-decomposition of covariance

**Eigen-decomposition is the process of expressing a matrix in terms of its eigenvectors and eigenvalues. In PCA, the eigenvectors define the directions of the new features, and the eigenvalues quantify their importance by how much variance they explain. The covariance matrix determines the eigenvectors, which define the new features in PCA.**

$$
\Sigma v_i = \lambda_i v_i
$$

* $v_i$ → **principal components** (directions)
* $\lambda_i$ → **variance explained** by each component

### 4. Interpretation

* **Principal components are the eigenvectors of the covariance matrix**
* **Eigenvalues rank components by importance**
* PCA rotates the coordinate system to:

  * maximize variance
  * remove covariance (decorrelate features)

After PCA, the new features are **uncorrelated**.
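
The four steps can be reproduced with NumPy alone. This sketch runs them on synthetic correlated 2D data (the mixing matrix is arbitrary) and checks that the projected features are indeed uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: independent noise mixed by an arbitrary matrix.
z = rng.normal(size=(500, 2))
X = z @ np.array([[2.0, 0.0], [1.5, 0.5]])

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (n - 1 in the denominator).
cov = X_centered.T @ X_centered / (len(X) - 1)

# 3. Eigen-decomposition; eigh returns ascending order, so sort descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the eigenvectors and inspect the new covariance.
scores = X_centered @ eigvecs
cov_scores = np.cov(scores, rowvar=False)

print("Eigenvalues:", eigvals.round(4))
print("Covariance of projected data:\n", cov_scores.round(6))
```

The projected covariance is diagonal, with the eigenvalues on the diagonal: the covariances are gone and only variances remain.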

### Geometric intuition

* Covariance defines the **shape of the data cloud**
* PCA finds the **axes of the ellipsoid**
* The longest axis = first principal component




## **Our PC1 Eigenvector**

$$
\mathbf v_1=
\begin{bmatrix}
0.5210659 \\
-0.2693474 \\
0.5804131 \\
0.5648565
\end{bmatrix}
$$

If you standardized the features first (as in `StandardScaler`), then ($\Sigma$) is the covariance of the standardized data (very close to the correlation matrix). For Iris (standardized), it is:

$$
\Sigma \approx
\begin{bmatrix}
1.006711 & -0.118359 & 0.877604 & 0.823431 \\
-0.118359 & 1.006711 & -0.431316 & -0.368583 \\
0.877604 & -0.431316 & 1.006711 & 0.969328 \\
0.823431 & -0.368583 & 0.969328 & 1.006711
\end{bmatrix}
$$


## Step 1: Use the eigenvector equation

$$
\Sigma \mathbf v_1 = \lambda_1 \mathbf v_1
$$

Compute the left side:

$$
\Sigma \mathbf v_1 =
\begin{bmatrix}
1.530936 \\
-0.791366 \\
1.705303 \\
1.659597
\end{bmatrix}
$$

## Step 2: Use the Rayleigh quotient to get the eigenvalue

Because $\mathbf v_1$ is unit-length in PCA, the eigenvalue is:

$$
\lambda_1 = \mathbf v_1^\top \Sigma \mathbf v_1
$$

So:

$$
\lambda_1
=
\begin{bmatrix}
0.5210659 & -0.2693474 & 0.5804131 & 0.5648565
\end{bmatrix}
\begin{bmatrix}
1.530936 \\
-0.791366 \\
1.705303 \\
1.659597
\end{bmatrix}
$$


Compute the dot product:

* $0.5210659 \times 1.530936 \approx 0.798$
* $(-0.2693474) \times (-0.791366) \approx 0.213$
* $0.5804131 \times 1.705303 \approx 0.990$
* $0.5648565 \times 1.659597 \approx 0.937$

Sum:

$$
\lambda_1 \approx 0.798 + 0.213 + 0.990 + 0.937 = 2.938 \approx 2.94
$$

That is the eigenvalue scikit-learn reports for PC1, since it divides by $n-1$. Dividing by $n$ instead gives the "2.92-ish" value, so the exact number depends on rounding and on whether the covariance uses $n$ or $n-1$.

## How to print the exact value in Python

```python
pca.explained_variance_[0]          # eigenvalue for PC1
# and equivalently:
v1 = pca.components_[0]
Sigma = np.cov(X_scaled, rowvar=False, ddof=1)
v1 @ Sigma @ v1
```



In [None]:
print(pca.explained_variance_[0])   # eigenvalue for PC1
# and equivalently:
v1 = pca.components_[0]
Sigma = np.cov(X_scaled, rowvar=False, ddof=1)
print(v1 @ Sigma @ v1)


## Eigenvalues

$$
[2.94, 0.92, 0.15, 0.02]
$$

* One eigenvalue per eigenvector (PC1–PC4)
* Computed from the covariance (or correlation) matrix
* Each equals the **variance captured** along that PC direction


## Why eigenvalues = importance in PCA

In PCA, **importance is defined as variance explained**.

So:

* Larger eigenvalue → more variance → more information
* Smaller eigenvalue → little variance → often mostly noise

That is why PCs are ordered by eigenvalue.


## Explained variance ratio (the key interpretation)

Total variance (for standardized Iris data, using the $n-1$ covariance):

$$
\sum \lambda_i = 4 \cdot \frac{150}{149} \approx 4.03
$$

With the correlation matrix (dividing by $n$) the total would be exactly 4; the ratios are the same either way.

So the explained variance ratios are:

$$
\text{PC1: } \frac{2.94}{4.03} \approx 73\%
$$
$$
\text{PC2: } \frac{0.92}{4.03} \approx 23\%
$$
$$
\text{PC3: } \frac{0.15}{4.03} \approx 4\%
$$
$$
\text{PC4: } \frac{0.02}{4.03} < 1\%
$$

This tells you:

* PC1 is overwhelmingly important
* PC2 still meaningful
* PC3, PC4 contribute very little

## How this connects to loadings

* Loadings tell you **what each PC is made of**
* Eigenvalues tell you **how much that PC matters**

Example:

* PC1 loadings → largest for petal length and petal width, with sepal length close behind
* PC1 eigenvalue = 2.94 → explains ~73% of variance

So you can say:

> **Petal features carry the largest weights in PC1, and PC1 explains most of the dataset's variance.**

---

## What importance does *not* mean (important distinction)

Eigenvalues do **not** mean:

* Feature importance (like Random Forest)
* Causal importance
* Predictive power for a target variable

They mean **importance for representing variance in the data**.

## Teaching-safe summary

You can safely say:

> **Eigenvalues quantify the importance of each principal component by measuring how much variance it captures.**


## One-line takeaway

**Eigenvectors tell you the direction; eigenvalues tell you how much that direction matters.**



## Covariance vs correlation PCA

* If features are on **different scales**, PCA on covariance can be misleading
* Standardizing features first is equivalent to doing PCA on the **correlation matrix**
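
Both bullets can be checked numerically. The sketch below assumes the same Iris data used elsewhere in this notebook and uses illustrative variable names; `ddof=0` is passed to `np.cov` because `StandardScaler` divides by the population standard deviation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

# Correlation matrix of the raw data vs covariance matrix of the standardized data
corr_raw = np.corrcoef(X_raw, rowvar=False)
cov_std = np.cov(X_std, rowvar=False, ddof=0)   # ddof=0 matches StandardScaler
print("cov(standardized) == corr(raw):", np.allclose(corr_raw, cov_std))

# Share of variance on PC1 under each formulation (eigvalsh returns ascending order)
eig_cov = np.linalg.eigvalsh(np.cov(X_raw, rowvar=False))[::-1]
eig_corr = np.linalg.eigvalsh(corr_raw)[::-1]
share_cov = eig_cov[0] / eig_cov.sum()
share_corr = eig_corr[0] / eig_corr.sum()
print(f"PC1 share, covariance PCA:  {share_cov:.1%}")
print(f"PC1 share, correlation PCA: {share_corr:.1%}")
```

On unscaled data the feature with the largest raw variance pulls PC1 toward itself, which is why the covariance-based share comes out larger here.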


### Equivalent formulation (SVD)

**SVD (Singular Value Decomposition) is a fundamental matrix factorization that expresses any matrix as a product of three simpler matrices. It is the backbone of PCA and many methods in data science.**

PCA can also be computed via SVD:

$
\tilde{X} = U \Sigma V^\top
$

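* $\tilde{X}$ is the centered data matrix, and $\Sigma$ here is the diagonal matrix of singular values $\sigma_i$ (not the covariance matrix)
* Columns of $V$ = principal directions
* Singular values give the covariance eigenvalues: $\lambda_i = \sigma_i^2 / (n-1)$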

This avoids explicitly computing the covariance matrix and is numerically stable.
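
As a rough check of this equivalence, the sketch below runs NumPy's SVD on standardized (and therefore centered) Iris data and compares the result with scikit-learn's `PCA`. The variable names are illustrative, and the directions are compared up to sign because each singular vector is only defined up to a factor of -1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_tilde = StandardScaler().fit_transform(load_iris().data)  # centered columns
n = X_tilde.shape[0]

# Thin SVD: X_tilde = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
eig_from_svd = S**2 / (n - 1)        # singular values -> covariance eigenvalues

pca_ref = PCA().fit(X_tilde)
same_eigenvalues = np.allclose(eig_from_svd, pca_ref.explained_variance_)
same_directions = np.allclose(np.abs(Vt), np.abs(pca_ref.components_), atol=1e-6)

print("Eigenvalues from SVD:", np.round(eig_from_svd, 3))
print("Match PCA eigenvalues:", same_eigenvalues)
print("Match PCA directions (up to sign):", same_directions)
```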

### Why this matters in practice

* Highly correlated features → strong covariance → PCA compresses them
* PCA is a **systematic way to remove redundancy**
* This connects directly to your earlier correlation analysis on the Iris data


### One-line takeaway

**Covariance is the object PCA diagonalizes; PCA is the transformation that turns covariance into variance.**

## 5.3 LDA (Linear Discriminant Analysis)

## 🎯 Interview Prep: Concepts Review
**Context:** Utilizing class labels to find a feature space that maximizes group distinctness.

---

### 📚 Core Concepts

#### Key Difference from PCA: Supervised Objective
While PCA is unsupervised and focuses on **variance preservation**, LDA is **supervised**. It seeks to project data into a lower-dimensional space that maximizes the distance between class means while minimizing the variance within each class.



* **PCA:** Finds directions of maximum variation.
* **LDA:** Finds directions that maximize class separability.

#### Mathematical Limitation: Number of Components
A critical constraint of LDA is its dimensionality limit. You can project the data onto at most:
$$\min(n_{classes} - 1,\ n_{features}) \text{ dimensions}$$

**Example:** If you are classifying the Iris dataset (3 classes: Setosa, Versicolor, Virginica; 4 features), LDA can only provide a maximum of **2** linear discriminants, and adding more input features would not raise that limit.
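
The component limit can be seen directly in scikit-learn. This is a small sketch with illustrative names (`X_demo`, `y_demo`, `lda_ok`); it assumes a reasonably recent scikit-learn release, where asking for too many components raises a `ValueError` (older releases may only warn).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_demo, y_demo = load_iris(return_X_y=True)   # 4 features, 3 classes

# Two components is the maximum for three classes
lda_ok = LinearDiscriminantAnalysis(n_components=2).fit(X_demo, y_demo)
X_lda_demo = lda_ok.transform(X_demo)
print("Projected shape:", X_lda_demo.shape)
print("Variance ratio per discriminant:", lda_ok.explained_variance_ratio_)

# Three components exceeds min(n_features, n_classes - 1) = 2
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_demo, y_demo)
    too_many_rejected = False
except ValueError as err:
    too_many_rejected = True
    print("ValueError:", err)
```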



---

> **Interview Tip:** Mention that because LDA uses labels, it is often more effective than PCA for dimensionality reduction specifically intended to improve the performance of a classification model.

---

In [None]:
# LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y_iris)

# Plot
plt.figure(figsize=(8, 6))
for i, species in enumerate(iris.target_names):
    mask = y_iris == i
    plt.scatter(X_lda[mask, 0], X_lda[mask, 1], label=species, alpha=0.6, s=50)

plt.xlabel('LD1')
plt.ylabel('LD2')
plt.title('LDA: Supervised Dimensionality Reduction')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Even better class separation than PCA!")
print("  (Because LDA uses class labels)")

## 5.4 t-SNE (t-distributed Stochastic Neighbor Embedding)

## 🎯 Interview Prep: Concepts Review
**Context:** A non-linear dimensionality reduction technique optimized for visualizing high-dimensional data in 2D or 3D.


### 📚 Core Concepts

#### Use Case: Visualization ONLY
Unlike PCA or LDA, t-SNE is **not** a preprocessing tool for machine learning. It is used exclusively to explore and visualize the underlying structure of high-dimensional datasets.

#### Strengths:
* **Local Structure:** It is exceptionally good at keeping points that are close together in high-dimensional space close together in 2D.
* **Non-Linear:** Can unravel complex manifolds that linear methods like PCA might miss.
* **Clustering:** Naturally creates tight, well-separated visual clusters.

#### Weaknesses:
* **Computational Expense:** Much slower than PCA, especially as the number of data points increases ($O(N^2)$ or $O(N \log N)$ depending on implementation).
* **Hyperparameter Sensitivity:** Results (and even the existence of clusters) can vary wildly based on the **Perplexity** setting.
* **No Transformation:** You cannot "fit" a t-SNE model and then "transform" new, incoming data points. You must re-run the algorithm on the entire combined dataset.


> **Interview Tip:** If asked why we don't use t-SNE for training models, mention that it does not preserve global distances (the distance between two distant clusters is often meaningless) and the lack of a `transform()` method makes it impossible to use in a production inference pipeline.
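
Two of the weaknesses above are easy to demonstrate. The sketch below checks for a `transform` method on scikit-learn's `PCA` and `TSNE` estimators, then fits t-SNE at two perplexity values; the names `X_demo`, `tsne_demo`, and `embeddings` are illustrative, and the perplexity values 5 and 30 are arbitrary choices for comparison.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

# PCA can project unseen data; scikit-learn's TSNE only offers fit_transform
print("PCA has transform: ", hasattr(PCA(), "transform"))
print("TSNE has transform:", hasattr(TSNE(), "transform"))

# The embedding depends on perplexity, so compare more than one setting
embeddings = {}
for perp in (5, 30):
    tsne_demo = TSNE(n_components=2, perplexity=perp, random_state=42)
    embeddings[perp] = tsne_demo.fit_transform(X_demo)
    print(f"perplexity={perp}: shape={embeddings[perp].shape}, "
          f"KL divergence={tsne_demo.kl_divergence_:.3f}")
```

Plotting the two embeddings side by side is a quick way to see which visual clusters are stable and which are artifacts of one setting.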


In [None]:
# t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(8, 6))
for i, species in enumerate(iris.target_names):
    mask = y_iris == i
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=species, alpha=0.6, s=50)

plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE: Non-linear Dimensionality Reduction')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Creates very tight, well-separated clusters")
print("  Perfect for exploratory visualization!")

## Comparison: PCA vs LDA vs t-SNE

In [None]:
# Side-by-side comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

methods = [
    (X_pca, 'PCA (Unsupervised, Linear)', 'PC1', 'PC2'),
    (X_lda, 'LDA (Supervised, Linear)', 'LD1', 'LD2'),
    (X_tsne, 't-SNE (Unsupervised, Non-linear)', 't-SNE 1', 't-SNE 2')
]

for ax, (X_reduced, title, xlabel, ylabel) in zip(axes, methods):
    for i, species in enumerate(iris.target_names):
        mask = y_iris == i
        ax.scatter(X_reduced[mask, 0], X_reduced[mask, 1],
                   label=species, alpha=0.6, s=40)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary: Key Interview Takeaways

## 🎯 Final Concept Review
**Context:** Consolidating the core principles of feature engineering and dimensionality reduction for technical interviews.


### 📚 Feature Engineering
* **Always scale** before PCA, KNN, SVM, and Neural Networks.
* **Drop the first category** in one-hot encoding for linear models to avoid the **dummy variable trap** (perfect multicollinearity).
* **Understand missingness type** (MCAR/MAR/MNAR) before choosing an imputation strategy.



### 🔍 Feature Selection
* **Hierarchy:** **Filter** (fast, model-agnostic) → **Wrapper** (slow, accurate) → **Embedded** (balanced).
* **Mutual Information** is your go-to for capturing **non-linear** relationships.
* **Ensembles:** Random Forests and XGBoost provide feature importance "for free" during training.
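
To illustrate the mutual information point, here is a minimal sketch on synthetic data (an assumption for illustration, not part of the Iris example): the target depends on one feature through a square, so Pearson correlation is near zero while mutual information is clearly positive. It uses `mutual_info_regression` because the synthetic target is continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_signal = rng.uniform(-1, 1, size=2000)   # drives the target non-linearly
x_noise = rng.uniform(-1, 1, size=2000)    # unrelated to the target
y_demo = x_signal**2 + rng.normal(scale=0.05, size=2000)

X_demo = np.column_stack([x_signal, x_noise])

# Linear filter: Pearson correlation with the target
pearson = [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(2)]

# Non-linear filter: k-nearest-neighbour estimate of mutual information
mi = mutual_info_regression(X_demo, y_demo, random_state=0)

for name, r, m in zip(["x_signal", "x_noise"], pearson, mi):
    print(f"{name}: pearson = {r:+.3f}, mutual info = {m:.3f}")
```

A correlation-based filter would likely discard `x_signal` here, whereas the mutual information score ranks it well above the noise feature.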

### 🧬 Feature Extraction
* **Computer Vision:** CNNs extract features hierarchically (Early = edges, Deep = complex objects).
* **NLP:** Transformers use the **Attention mechanism** for dynamic, contextual feature selection.

### 📏 Dimensionality & Reduction
* **Curse of Dimensionality:** Leads to sparse data, meaningless distance metrics, and overfitting.
* **PCA:** Unsupervised; maximizes **variance preservation**.
* **LDA:** Supervised; maximizes **class separation**.
* **t-SNE:** Non-linear; use for **visualization only** (never for preprocessing!).


## ❓ Common Interview Questions

**1. "When would you use StandardScaler vs MinMaxScaler?"**
* **StandardScaler:** The default choice; centers each feature at 0 with unit variance. Works best when features are roughly Gaussian.
* **MinMaxScaler:** Scales to a fixed range (usually [0, 1]). Preferred for Neural Networks or algorithms that require a specific bounded range; note that a single outlier compresses all other values into a narrow band.
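
A toy comparison makes the difference concrete. The five values below are made up for illustration, with one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with an obvious outlier (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)   # mean 0, unit variance
x_mm = MinMaxScaler().fit_transform(x)      # mapped onto [0, 1]

print("StandardScaler:", np.round(x_std.ravel(), 3))
print("  mean =", round(float(x_std.mean()), 6), " std =", round(float(x_std.std()), 6))
print("MinMaxScaler:  ", np.round(x_mm.ravel(), 3))
print("  min =", float(x_mm.min()), " max =", float(x_mm.max()))
```

With min-max scaling the four ordinary values end up squeezed near 0 because the outlier defines the top of the range, which is a common reason to clip or transform outliers first.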

**2. "What's the dummy variable trap?"**
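* It is a case of **perfect multicollinearity**. If you have $N$ categories and include all $N$ as dummy variables, the dummies always sum to 1, so the $N^{th}$ variable is perfectly predictable from the others (and the full set is collinear with the intercept).
* **Solution:** Use `drop_first=True` in `pd.get_dummies` (or `drop='first'` in scikit-learn's `OneHotEncoder`).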
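
The trap shows up as a rank-deficient design matrix. The sketch below uses a made-up `color` column and a hypothetical helper, `design_rank`, that prepends an intercept column and compares the column count with the matrix rank.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})

full = pd.get_dummies(df_demo["color"], dtype=float)                      # 3 dummies
reduced = pd.get_dummies(df_demo["color"], drop_first=True, dtype=float)  # 2 dummies

def design_rank(dummies):
    """Return (n_columns, rank) of the design matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return X.shape[1], int(np.linalg.matrix_rank(X))

full_cols, full_rank = design_rank(full)
red_cols, red_rank = design_rank(reduced)
print(f"All dummies:     {full_cols} columns, rank {full_rank}")
print(f"drop_first=True: {red_cols} columns, rank {red_rank}")
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, which is exactly the problem that dropping one category removes.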

**3. "Difference between PCA and LDA?"**
* **PCA** is unsupervised and focuses on the "spread" of the data regardless of labels.
* **LDA** is supervised and focuses on finding a path that makes classes as distinct as possible.



**4. "Why can't you use t-SNE for preprocessing?"**
* It is **non-deterministic** (results change with the seed).
* It lacks a `.transform()` method for new data.
* It does not preserve **global structure** (distances between distant clusters are arbitrary).

**5. "How do you handle missing data?"**
* **Simple:** Mean/Median/Mode imputation (defensible mainly under MCAR; it shrinks variance and weakens correlations between variables).
* **Advanced:** KNN Imputation or MICE (Multiple Imputation by Chained Equations) to preserve relationships between variables.
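
The sketch below compares the three strategies on synthetic data where one column is almost a linear function of the other and values are removed completely at random. All names and data are illustrative assumptions. scikit-learn's `IterativeImputer` is used as a MICE-style stand-in; it is still experimental, needs the explicit enabling import, and returns a single imputation rather than multiple ones.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # b is strongly related to a
X_full = np.column_stack([a, b])

# Remove 40 values of b completely at random (MCAR)
X_missing = X_full.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 1] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

errors = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    diff = X_filled[missing_rows, 1] - X_full[missing_rows, 1]
    errors[name] = float(np.sqrt(np.mean(diff**2)))   # RMSE on the removed values
    print(f"{name:>9}: RMSE = {errors[name]:.3f}")
```

Because the median ignores column `a`, its error stays close to the spread of `b`, while the two relationship-aware imputers recover the removed values much more closely on this toy data.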


### 🚀 Next Steps
* **Code:** Practice with `interview_prep_simulation_scenarios_4.ipynb`.
* **Theory:** Review the `Interview_Prep_4_Terms_and_Concepts.pdf`.
* **Focus:** Practice explaining the **"why"** (the trade-offs), not just the "how" (the code).

**Remember:** In interviews, demonstrating that you know *when* a technique will fail is just as important as knowing how to implement it!