
# **Principal Component Analysis (PCA) in Python**

**Principal Component Analysis (PCA)** is a **dimensionality reduction** technique that transforms high-dimensional data into a lower-dimensional form while preserving as much variability as possible. It is widely used for **exploratory data analysis, noise reduction, and feature extraction**.

---

## **1. What is PCA?**
- PCA identifies **patterns** in data by finding directions (principal components) where the data varies the most.
- It **projects** data onto these directions, reducing the number of features while retaining most of the information.
- The first principal component captures the **maximum variance**, the second captures the next highest variance (orthogonal to the first), and so on.
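
The "direction of maximum variance" idea above can be written as an optimization problem. This is the standard textbook formulation; the symbols are introduced here for illustration and are not used elsewhere in this notebook: $X$ is the centered data matrix with $n$ rows, $\Sigma$ its sample covariance, and $w_k$ the unit-length direction of the $k$-th component.

$$
\Sigma = \frac{1}{n-1} X^\top X, \qquad
w_1 = \arg\max_{\|w\| = 1} \; w^\top \Sigma \, w, \qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} \; w^\top \Sigma \, w
$$

The maximizers are eigenvectors of $\Sigma$, and the variance along each one is the matching eigenvalue: $\Sigma w_k = \lambda_k w_k$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$.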

---

## **2. Key Concepts**
- **Principal Components (PCs)**: New features created by PCA, ordered by the amount of variance they explain.
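- **Eigenvalues**: The eigenvalues of the data's covariance matrix. Each one equals the variance of the data along the corresponding principal component.
- **Eigenvectors**: The eigenvectors of the covariance matrix. Each one defines the direction of a principal component.
- **Explained Variance**: The proportion of total variance explained by each principal component (its eigenvalue divided by the sum of all eigenvalues).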
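
The concepts above can be checked directly with NumPy. The following is a minimal sketch, assuming a small synthetic dataset (the names `latent`, `mixing`, and `pca_check` are made up for this example); it computes eigenvalues and eigenvectors of the covariance matrix and compares them with scikit-learn's results.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 3 correlated features
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center the data and form the sample covariance matrix (divides by n - 1)
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # shape (3, 3)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]           # re-sort in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # columns are the directions

# Explained variance: each eigenvalue as a share of the total
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_variance_ratio.round(4))

# Compare with scikit-learn (the sign of each component may differ)
pca_check = PCA(n_components=3).fit(X)
print(np.allclose(eigenvalues, pca_check.explained_variance_))
print(np.allclose(np.abs(eigenvectors.T), np.abs(pca_check.components_)))
```

The absolute values are compared in the last line because an eigenvector and its negation describe the same direction.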

---

## **3. Steps to Perform PCA in Python**
### **Step 1: Import Required Libraries**
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
```

### **Step 2: Load and Prepare Data**
PCA is sensitive to feature scale, so it works best with **standardized data** (mean=0, variance=1), especially when features are measured in different units.

```python
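# Toy dataset (replace with your data). Note: these three columns are perfectly
# correlated, so a single component captures essentially all of the variance.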
data = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [5, 4, 3, 2, 1],
    'Feature3': [2, 3, 4, 5, 6]
})

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
```
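
To make the standardization step concrete, here is a small sketch of what `StandardScaler` computes, assuming a hypothetical two-column frame (`height_cm` and `income` are invented for illustration). It subtracts each column's mean and divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0, 180.0, 190.0],
    'income':    [20000.0, 35000.0, 50000.0, 52000.0, 90000.0],
})

# Manual z-score; ddof=0 matches the population std used by StandardScaler
manual = (df - df.mean()) / df.std(ddof=0)

scaled = StandardScaler().fit_transform(df)

# The two results should agree, with per-column mean 0 and std 1
print(np.allclose(manual.to_numpy(), scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that pandas defaults to `ddof=1` in `DataFrame.std`, so the explicit `ddof=0` is needed for the manual version to match.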

---

### **Step 3: Apply PCA**
```python
# Initialize PCA (choose number of components)
pca = PCA(n_components=2)  # Reduce to 2 dimensions

# Fit and transform the data
principal_components = pca.fit_transform(scaled_data)
```
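
It can help to see what `fit_transform` does under the hood. The sketch below, which uses the Iris data as an assumed stand-in dataset and a separate variable name `pca_demo`, reproduces the projection by hand: subtract the fitted mean, then multiply by the transposed component matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_std)

# Same projection by hand: center with the fitted mean, then project
manual_scores = (X_std - pca_demo.mean_) @ pca_demo.components_.T

print(scores.shape)
print(np.allclose(scores, manual_scores))

# The component vectors are orthonormal, so this product is the identity
print(np.allclose(pca_demo.components_ @ pca_demo.components_.T, np.eye(2)))
```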

---

### **Step 4: Explained Variance**
```python
# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative_variance)
```

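#### **Output Example (illustrative):**
The toy dataset from Step 2 is perfectly collinear, so running the code on it gives approximately `[1. 0.]`. The figures below only illustrate what the output looks like for a less degenerate dataset.
```
Explained variance ratio: [0.924 0.075]  # 92.4% variance explained by PC1, 7.5% by PC2
Cumulative explained variance: [0.924 0.999]  # 99.9% total variance explained by 2 PCs
```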

---

### **Step 5: Visualize Principal Components**
```python
# Create a DataFrame for the principal components
pc_df = pd.DataFrame(
    data=principal_components,
    columns=['PC1', 'PC2']
)

# Plot the principal components
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: Principal Components')
plt.grid()
plt.show()
```

---

## **4. Choosing the Number of Components**
### **Scree Plot**
A scree plot helps visualize the explained variance to decide how many components to keep.

```python
# Plot explained variance
plt.figure(figsize=(8, 6))
plt.bar(
    range(1, len(pca.explained_variance_ratio_) + 1),
    pca.explained_variance_ratio_,
    alpha=0.5,
    align='center',
    label='Individual explained variance'
)
plt.step(
    range(1, len(cumulative_variance) + 1),
    cumulative_variance,
    where='mid',
    label='Cumulative explained variance'
)
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.title('Scree Plot')
plt.show()
```

#### **How to Interpret:**
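- Look for the **"elbow"** in the scree plot: the point after which each additional component adds little explained variance.
- A common rule of thumb is to keep enough components to capture **~95% of the total variance**, although the right threshold depends on the data and the downstream task.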
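
The variance-threshold rule can also be applied in code. The sketch below assumes the Wine dataset bundled with scikit-learn as example data and shows two equivalent routes: counting components from the cumulative curve, and passing a float to `n_components`, which scikit-learn interprets as a target share of variance.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = StandardScaler().fit_transform(load_wine().data)   # 13 features

# Option 1: fit all components, then count how many are needed
full_pca = PCA().fit(X_wine)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio exceeds 0.95
n_manual = int(np.searchsorted(cumulative, 0.95, side='right') + 1)

# Option 2: a float in (0, 1) asks PCA to pick the count itself
auto_pca = PCA(n_components=0.95, svd_solver='full').fit(X_wine)

print("Manual count:", n_manual)
print("Chosen by PCA:", auto_pca.n_components_)
```

The 0.95 threshold is only the rule of thumb mentioned above; any other value between 0 and 1 works the same way.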

---

## **5. PCA for Dimensionality Reduction**
### **Example: Reduce to 2 Components**
```python
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)
print("Reduced data shape:", reduced_data.shape)  # (n_samples, 2)
```
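
One way to see what is lost by reducing dimensions is to map the reduced data back with `inverse_transform` and measure the reconstruction error. This is a sketch on the Iris data (an assumed example dataset); the `results` list is just a helper for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)   # shape (150, 4)

results = []
for k in (1, 2, 3, 4):
    pca_k = PCA(n_components=k).fit(X_std)
    # Project down to k dimensions, then map back to the 4 original features
    X_back = pca_k.inverse_transform(pca_k.transform(X_std))
    mse = np.mean((X_std - X_back) ** 2)
    kept = pca_k.explained_variance_ratio_.sum()
    results.append((k, kept, mse))
    print(f"k={k}: variance kept={kept:.3f}, reconstruction MSE={mse:.4f}")
```

Because the features are standardized to unit variance, the mean squared error here equals one minus the retained variance ratio, so the two printed columns carry the same information.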

---

## **6. PCA in Machine Learning**
PCA is often used as a **preprocessing step** to reduce the number of features before training a model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example: Load a dataset (e.g., Iris)
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

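# Split first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler and PCA on the training set only, then apply them to both sets
iris_scaler = StandardScaler().fit(X_train)
iris_pca = PCA(n_components=2).fit(iris_scaler.transform(X_train))
X_train_pca = iris_pca.transform(iris_scaler.transform(X_train))
X_test_pca = iris_pca.transform(iris_scaler.transform(X_test))

# Train a model on the reduced data (fixed seed for reproducible results)
model = RandomForestClassifier(random_state=42)
model.fit(X_train_pca, y_train)

# Evaluate the model
print("Accuracy:", model.score(X_test_pca, y_test))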
```
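
When PCA is a preprocessing step, wrapping it in a scikit-learn pipeline keeps the scaler and PCA from ever seeing held-out data. The following is a sketch under that assumption; the variable names `pipe` and `scores_cv` are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Each fold refits the scaler, PCA, and model on that fold's training part only
scores_cv = cross_val_score(pipe, X_iris, y_iris, cv=5)
print("Fold accuracies:", scores_cv.round(3))
print("Mean accuracy:", scores_cv.mean().round(3))
```

Cross-validation gives a less noisy estimate than a single 80/20 split, which matters on a dataset as small as Iris.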

---

## **7. Common Pitfalls and Best Practices**
### **Pitfalls**
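- **Not standardizing data**: PCA is sensitive to the scale of features, so a feature with a large numeric range can dominate the leading components. Always standardize your data first when features use different units.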
- **Over-relying on explained variance**: Don’t blindly keep components based on variance. Consider the interpretability and downstream task performance.
- **Ignoring the scree plot**: Always visualize the explained variance to make an informed decision.

### **Best Practices**
- **Standardize your data** before applying PCA.
- **Use the scree plot** to decide the number of components.
- **Interpret the principal components** by examining the loadings (eigenvectors).
- **Test the impact** of PCA on your model’s performance.
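
To illustrate the scaling pitfall listed above, the sketch below builds two independent synthetic features, one in much larger units than the other (both are hypothetical), and compares the explained variance ratios with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features, one in much larger units
small = rng.normal(scale=1.0, size=500)
large = rng.normal(scale=1000.0, size=500)
X_raw = np.column_stack([small, large])

# PCA on raw data: the large-scale feature dominates PC1
raw_ratio = PCA(n_components=2).fit(X_raw).explained_variance_ratio_

# PCA on standardized data: both features contribute comparably
X_scaled_demo = StandardScaler().fit_transform(X_raw)
std_ratio = PCA(n_components=2).fit(X_scaled_demo).explained_variance_ratio_

print("Unscaled:    ", raw_ratio.round(4))
print("Standardized:", std_ratio.round(4))
```

Without scaling, PC1 is almost entirely the large-unit feature, even though neither feature is more informative than the other by construction.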

---

## **8. Interpreting Principal Components**
The principal components are linear combinations of the original features. You can examine the **loadings** to understand what each component represents.

```python
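# Refit on the toy dataset from Step 2 so the loadings line up with data.columns
pca_toy = PCA(n_components=2).fit(scaled_data)

# Rows of components_ are the unit-length eigenvectors (component weights)
loadings = pca_toy.components_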

# Create a DataFrame for the loadings
loadings_df = pd.DataFrame(
    loadings,
    columns=data.columns,
    index=[f'PC{i+1}' for i in range(loadings.shape[0])]
)

print("Loadings:\n", loadings_df)
```

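#### **Example Output (illustrative):**
```
Loadings:
          Feature1   Feature2   Feature3
PC1      0.707107  -0.707107   0.000000
PC2      0.408248   0.408248  -0.816497
```
- In this illustrative output, **PC1** is a contrast between `Feature1` and `Feature2`: their weights are equal in size and opposite in sign.
- **PC2** is influenced by all three features but most strongly by `Feature3`.
- On the toy dataset from Step 2 itself, the three features are perfectly collinear, so PC1 weights them equally in magnitude (about 0.577 each, with `Feature2` opposite in sign to the other two) and PC2 carries no variance, which makes its weights numerically arbitrary.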
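
The word "loadings" is used in two ways: for the unit-length eigenvectors in `components_`, and for those eigenvectors scaled by each component's standard deviation. The sketch below, on the Iris data as an assumed example, computes the scaled version; on standardized data these values are approximately the correlations between each original feature and each component.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_data = load_iris()
X_std = StandardScaler().fit_transform(iris_data.data)
pca_iris = PCA(n_components=2).fit(X_std)

# Rows of components_ are unit-length eigenvectors
eigvecs = pca_iris.components_                      # shape (2, 4)

# Scaled loadings: eigenvector times the component's standard deviation
scaled_loadings = eigvecs.T * np.sqrt(pca_iris.explained_variance_)

loadings_table = pd.DataFrame(
    scaled_loadings,
    index=iris_data.feature_names,
    columns=['PC1', 'PC2'],
)
print(loadings_table.round(3))
```

Features with a large absolute value in a column contribute most to that component; the sign shows the direction of the relationship.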

---

## **9. When to Use PCA**
- **Dimensionality Reduction**: Reduce the number of features while retaining most of the information.
- **Noise Reduction**: Remove less important components to focus on the signal.
- **Visualization**: Project high-dimensional data into 2D or 3D for plotting.
- **Feature Extraction**: Create new features that capture the most important patterns.
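
As a sketch of the noise-reduction use case, the example below builds synthetic data that lies in 2 of 20 dimensions, adds Gaussian noise, and reconstructs it from the first two components. The data, the noise level, and the choice of 2 components are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical signal: 300 samples that really live in 2 of 20 dimensions
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 20))
clean = latent @ mixing
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Keep 2 components, then map back to the original 20-dimensional space
pca_denoise = PCA(n_components=2).fit(noisy)
denoised = pca_denoise.inverse_transform(pca_denoise.transform(noisy))

# Compare both versions against the clean signal
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before: {err_noisy:.4f}, after: {err_denoised:.4f}")
```

This works because the noise is spread across all 20 directions while the signal occupies only two, so discarding the other 18 removes mostly noise. On real data the true dimensionality is unknown and has to be estimated.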

---

## **10. Alternatives to PCA**
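- **t-SNE**: A nonlinear method that preserves local neighborhoods. It is often better than PCA for 2D or 3D visualization of high-dimensional data, but distances between clusters in the plot are not reliable.
- **UMAP**: A nonlinear method that preserves local structure and often retains more global structure than t-SNE.
- **Factor Analysis**: Similar to PCA but assumes a statistical model in which latent factors generate the data and each feature has its own noise variance.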
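
For a quick side-by-side comparison, scikit-learn ships `TSNE` and `FactorAnalysis` with a similar interface to `PCA`. The sketch below runs all three on the Iris data (an assumed example dataset). UMAP is left out because it lives in a separate third-party package.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

embeddings = {
    'PCA': PCA(n_components=2).fit_transform(X_std),
    'FactorAnalysis': FactorAnalysis(n_components=2, random_state=0).fit_transform(X_std),
    # perplexity must be smaller than the number of samples (150 here)
    't-SNE': TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std),
}

for name, emb in embeddings.items():
    print(name, emb.shape)
```

One practical difference: `PCA` and `FactorAnalysis` can project new samples with `transform`, while scikit-learn's `TSNE` only offers `fit_transform`, so it is suited to visualization rather than to preprocessing inside a model pipeline.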

---

## **Summary**
- PCA is a powerful tool for **dimensionality reduction** and **feature extraction**.
- Always **standardize your data** before applying PCA.
- Use the **scree plot** to decide how many components to keep.
- PCA is widely used in **exploratory data analysis, machine learning, and visualization**.