# Daily Blog #69 - Dimensionality Reduction with PCA (Principal Component Analysis)
### July 8, 2025 

When working with large datasets, especially ones with many features (columns), models can:

* Overfit easily (too much noise)
* Run slower
* Become hard to interpret

**PCA** helps reduce the number of features while keeping as much **variance** (information) as possible.


### **Core Idea**

PCA transforms the original features into **new axes** (called *principal components*) that are:

* **Uncorrelated**
* Ordered so that the **first few capture most of the variance**

Imagine rotating the entire dataset so that most of the meaningful information aligns with the first axis.
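To make the "rotation" idea concrete, here is a minimal NumPy sketch of what PCA does under the hood: center the data, eigen-decompose the covariance matrix, and project onto the eigenvectors. The toy data and variable names are made up for illustration, and scikit-learn's `PCA` uses an SVD internally rather than this exact route.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 2 strongly correlated features.
x = rng.normal(size=200)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

# Center the data, then eigen-decompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the new axes (this is the rotation).
Z = X_centered @ eigvecs

print("variance along each new axis:", Z.var(axis=0, ddof=1))
print("covariance between new axes:", np.cov(Z, rowvar=False)[0, 1])
```

The variances come out in descending order and the covariance between the two new axes is zero up to floating-point error, which is exactly the "ordered" and "uncorrelated" properties listed above.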

### Example:

Suppose you have 3 features: height, weight, and BMI. BMI is computed from the other two, so the features are highly redundant.

PCA can reduce this to just **1 or 2 principal components** that still retain the essence of the data.
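Here is a rough sketch of that example on synthetic data. The numbers (mean height of 1.70 m, BMI around 24) are invented purely for illustration, and weight is derived from the other two so the redundancy is built in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic, made-up data: heights in metres, BMI drawn around 24,
# weight derived from the two so that the three features are redundant.
height = rng.normal(1.70, 0.10, size=500)
bmi = rng.normal(24.0, 3.0, size=500)
weight = bmi * height**2

X = np.column_stack([height, weight, bmi])
X_scaled = StandardScaler().fit_transform(X)

# Keep all 3 components so we can see how the variance is distributed.
pca = PCA(n_components=3).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("cumulative explained variance:", cumulative)
```

Because one feature is determined by the other two, the first two components should account for nearly all of the variance, and the third adds almost nothing.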


### Step-by-Step Breakdown:

#### 1. **Standardize** the Data

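PCA is sensitive to scale: features with larger numeric ranges dominate the components. Standardize each feature to zero mean and unit variance first: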

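```python
from sklearn.preprocessing import StandardScaler

# X is assumed to be a numeric array of shape (n_samples, n_features)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature
```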
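To see why scaling matters, the sketch below runs PCA on two independent features with very different units, once on the raw values and once after standardizing. The `income` and `age` features are hypothetical and generated at random.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent, hypothetical features on very different scales.
income = rng.normal(50_000, 15_000, size=300)  # large numeric range
age = rng.normal(40, 10, size=300)             # small numeric range
X = np.column_stack([income, age])

# PCA on the raw values vs. PCA on standardized values.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X)
).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

On the raw data the first component is essentially just `income`, simply because its numbers are bigger. After standardizing, the variance is split much more evenly, which is what you would expect from two unrelated features.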

#### 2. **Apply PCA**

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)            # keep the first two principal components
X_pca = pca.fit_transform(X_scaled)  # shape: (n_samples, 2)
```

#### 3. **Inspect Explained Variance**

```python
print(pca.explained_variance_ratio_)
```

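This prints the fraction of the total variance captured by each component. The values sum to at most 1, and the closer the sum is to 1, the less information the reduction has discarded.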
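A common follow-up question is how many components to keep. One possible approach, sketched below, is to pick the smallest number whose cumulative explained variance reaches a target. The 95% target and the use of scikit-learn's bundled wine dataset are assumptions for the sake of the example, not a rule.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled example dataset (13 numeric features), used here only for illustration.
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to see the full variance profile.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the target.
target = 0.95  # assumed threshold
k = int(np.searchsorted(cumulative, target, side="right") + 1)
print(f"{k} components reach {target:.0%} of the variance")

# Shortcut: pass the target as a float and let PCA choose the count.
pca_95 = PCA(n_components=target, svd_solver="full").fit(X_scaled)
print("chosen by PCA:", pca_95.n_components_)
```

Both routes should agree on the number of components. The float form is shorter, while the manual cumulative sum is handy if you want to plot the curve and look for an elbow.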


### Visualization

You can now plot your 2D PCA-reduced data. Here `y` is assumed to hold one label per sample and is used only to color the points:

```python
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Result")
plt.show()
```

### Takeaways:

* PCA is **unsupervised** – doesn’t care about labels.
* Great for **visualizing** high-dimensional data.
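* Use it **before clustering or classification** to speed up training. It can also help accuracy when features are noisy or redundant, though that is not guaranteed.
* But: You lose **interpretability** – PCs are linear combinations of the original features, not the features themselves.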
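
As a rough sketch of the last two points, the example below puts PCA in front of a classifier using a scikit-learn pipeline, then inspects `components_` to see how each principal component mixes the original features. The wine dataset, the choice of 2 components, and logistic regression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

# The scaler and PCA are fitted on the training split only, which avoids
# leaking test-set statistics into the transform.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy with 2 components: {accuracy:.2f}")

# Each row of components_ holds the weights of the original features
# in one principal component.
pca = model.named_steps["pca"]
for i, weights in enumerate(pca.components_, start=1):
    top = np.argsort(np.abs(weights))[::-1][:3]
    summary = [(data.feature_names[j], round(float(weights[j]), 2)) for j in top]
    print(f"PC{i} top features:", summary)
```

Every component draws on several original features at once, so a statement like "PC1 is high" does not translate directly into "feature X is high". That is the interpretability trade-off in practice.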
