
## PCA Evaluation on Most Linearly Correlated Feature Pair

### Objective
To evaluate whether PCA on the most linearly correlated pair of input features improves classification performance compared to the baseline model.

### Method
- Identified the most linearly correlated feature pair: `Oldpeak` and `ExerciseAngina` (corr ≈ 0.41).
- Applied PCA with `n_components=1` to reduce these two features into a single principal component.
- Replaced the two original columns with this component and trained a Random Forest classifier on the resulting feature set (all other features unchanged).
- Compared results to the baseline Random Forest model trained on the full set of encoded features.
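
The search for the most correlated pair is not shown in this notebook. A minimal sketch of one way to do it, assuming all columns are numeric (as in the encoded dataset) and using Pearson correlation; the helper name `most_correlated_pair` and the toy frame are illustrative and not part of the original workflow:

```python
import numpy as np
import pandas as pd


def most_correlated_pair(frame: pd.DataFrame) -> tuple[str, str, float]:
    """Return the two columns with the largest absolute Pearson correlation."""
    corr = frame.corr().abs().to_numpy(copy=True)
    # Blank out the diagonal and lower triangle so each pair is seen once
    # and a column is never paired with itself.
    corr[np.tril_indices_from(corr)] = np.nan
    i, j = np.unravel_index(np.nanargmax(corr), corr.shape)
    return frame.columns[i], frame.columns[j], float(corr[i, j])


# Toy data (hypothetical): "b" tracks "a" closely, "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
col_1, col_2, strength = most_correlated_pair(toy)
print(col_1, col_2, round(strength, 2))
```

On the notebook's data this would be called on the input features only, for example `most_correlated_pair(df.drop(columns='HeartDisease'))`. Derived columns such as `HRRatio` may correlate strongly with their source columns, so the set of columns passed in matters.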

### Result
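- The PCA component retains the variance shared by the two features. With a correlation of about 0.41 the features are only moderately redundant, so the discarded second component still carries a substantial part of the pair's variance.
- In the run shown below, performance was essentially unchanged: accuracy 0.89 with the PCA component versus 0.88 for the full model, a difference of two test samples out of 184 (11 vs. 13 false positives; false negatives unchanged at 9).
- The transformation yields a slightly more compact input (one column fewer) at the cost of the information that is unique to each of the two features.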
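
How much variance a single component can retain follows directly from the correlation. As a short worked derivation, assuming both features are standardized to unit variance (as `StandardScaler` does) and writing $r$ for their Pearson correlation, the covariance matrix of the pair, its eigenvalues, and the share of variance explained by the first component are:

$$
C = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_{1,2} = 1 \pm |r|, \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + |r|}{2}
$$

With $r \approx 0.41$, the first component keeps roughly $(1 + 0.41)/2 \approx 0.71$ of the pair's variance and the remaining share of about $0.29$ is discarded. The exact value in the notebook depends on the correlation within the training split, which may differ slightly from 0.41.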

### Conclusion
PCA on the most linearly correlated pair (`Oldpeak`, `ExerciseAngina`) provides a compact representation of their shared variance but does not clearly outperform the full-featured Random Forest: the 0.89 vs. 0.88 accuracy gap amounts to two test samples on a single split. This is consistent with PCA being useful for compressing redundant features, but not necessarily beneficial for predictive accuracy when the inputs are few or only weakly dependent.


In [83]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [84]:
df = pd.read_csv('../data/processed/heart_feature_engineering.csv')

In [85]:
df.columns

Index(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
       'HeartDisease', 'Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina',
       'ST_Slope', 'RestingBP_missing', 'Cholesterol_missing',
       'Oldpeak_missing', 'CholesterolPerAge', 'HRRatio',
       'Sex_FastingBS_Freq'],
      dtype='object')

# Baseline

In [86]:
# --- 1. Separate features and target ---
X = df.drop(columns='HeartDisease')
y = df['HeartDisease']

In [None]:
# --- 2. Train/Test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


In [None]:
# --- 3. Baseline model using all features ---
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
y_pred_base = rf_baseline.predict(X_test)


In [89]:
print("=== Baseline (All Features) ===")
print(classification_report(y_test, y_pred_base))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_base))

=== Baseline (All Features) ===
              precision    recall  f1-score   support

           0       0.88      0.84      0.86        82
           1       0.88      0.91      0.89       102

    accuracy                           0.88       184
   macro avg       0.88      0.88      0.88       184
weighted avg       0.88      0.88      0.88       184

Confusion Matrix:
[[69 13]
 [ 9 93]]


# PCA

In [None]:
features_to_compress = ['Oldpeak', 'ExerciseAngina']
scaler = StandardScaler()

In [91]:
X_train_scaled = scaler.fit_transform(X_train[features_to_compress])
X_test_scaled = scaler.transform(X_test[features_to_compress])

In [92]:
pca = PCA(n_components=1)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [93]:
X_train_pca_final = X_train.drop(columns=features_to_compress).copy()
X_test_pca_final = X_test.drop(columns=features_to_compress).copy()

In [94]:
# PCA returns an (n_samples, 1) array; flatten it to a 1-D column
X_train_pca_final['Oldpeak_Exercise_PCA'] = X_train_pca.ravel()
X_test_pca_final['Oldpeak_Exercise_PCA'] = X_test_pca.ravel()

In [95]:
rf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
rf_pca.fit(X_train_pca_final, y_train)
y_pred_pca = rf_pca.predict(X_test_pca_final)

In [96]:
print("\n=== Model After PCA (2→1 replacement) ===")
print(classification_report(y_test, y_pred_pca))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_pca))


=== Model After PCA (2→1 replacement) ===
              precision    recall  f1-score   support

           0       0.89      0.87      0.88        82
           1       0.89      0.91      0.90       102

    accuracy                           0.89       184
   macro avg       0.89      0.89      0.89       184
weighted avg       0.89      0.89      0.89       184

Confusion Matrix:
[[71 11]
 [ 9 93]]


## Model Evaluation Summary: Feature Engineering and PCA Impact

### Objective

To evaluate the impact of engineered features and PCA-based dimensionality reduction on classification performance using Random Forest.  
This experiment serves as a **baseline diagnostic**, with **no hyperparameter tuning** applied.

---

### Experimental Scenarios

We evaluated the model in four distinct setups:

1. **Baseline** — original features only  
2. **Baseline + PCA (2→1)** — `Oldpeak` and `ExerciseAngina` replaced with a PCA component  
3. **Engineered Features** — original + new features (`CholesterolPerAge`, `HRRatio`, `Sex_FastingBS_Freq`)  
4. **Engineered + PCA (2→1)** — PCA reduction applied as in (2)

---

### Metrics

#### Scenario 1: Original Features Only
- Accuracy: 0.90
- Precision: 0 = 0.90, 1 = 0.90
- Recall: 0 = 0.88, 1 = 0.92
- F1-score (macro): 0.90
- Confusion Matrix: `[[72, 10], [8, 94]]`

#### Scenario 2: Original + PCA (2→1)
- Accuracy: 0.90
- Precision: 0 = 0.91, 1 = 0.90
- Recall: 0 = 0.87, 1 = 0.93
- F1-score (macro): 0.90
- Confusion Matrix: `[[71, 11], [7, 95]]`

#### Scenario 3: Engineered Features (no PCA)
- Accuracy: 0.88
- Precision: 0 = 0.88, 1 = 0.88
- Recall: 0 = 0.84, 1 = 0.91
- F1-score (macro): 0.88
- Confusion Matrix: `[[69, 13], [9, 93]]`

#### Scenario 4: Engineered + PCA (2→1)
- Accuracy: 0.91
- Precision: 0 = 0.91, 1 = 0.90
- Recall: 0 = 0.88, 1 = 0.93
- F1-score (macro): 0.91
- Confusion Matrix: `[[72, 10], [7, 95]]`
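
The per-class figures listed for each scenario follow from its confusion matrix by simple arithmetic. A small sketch for Scenario 4, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative:

```python
# Scenario 4 confusion matrix: rows = true class, columns = predicted class.
cm = [[72, 10],
      [7, 95]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: correct predictions of a class / all predictions of that class.
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])
precision_1 = cm[1][1] / (cm[1][1] + cm[0][1])

# Recall: correct predictions of a class / all true members of that class.
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])

print(f"accuracy    {accuracy:.2f}")
print(f"precision   0 = {precision_0:.2f}, 1 = {precision_1:.2f}")
print(f"recall      0 = {recall_0:.2f}, 1 = {recall_1:.2f}")
```

The same calculation applied to the other three matrices reproduces their listed values.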

---

### Final Notes

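- Adding the engineered features alone lowered accuracy from 0.90 to 0.88 (22 vs. 18 misclassified test samples), which suggests they introduce redundancy or noise when used alongside the original features.
- Replacing `Oldpeak` and `ExerciseAngina` with a PCA component coincided with a recovery to 0.91 in the engineered setup. One plausible reading is that the compression removes some of this redundancy and lets the new features contribute, although the experiment does not isolate the cause.
- The highest performance was achieved in the combined scenario: **engineered features + PCA(2→1)**, with 17 misclassified samples out of 184.
- Because the scenarios differ by at most five test samples on a single split, these results suggest, rather than establish, that **selective dimensionality reduction** can help in more complex feature spaces.
- No hyperparameter optimization was performed, so the results reflect baseline model behavior.
- While a production-grade pipeline (e.g., `sklearn.pipeline.Pipeline`) would typically be used for clean feature processing and modeling, this workflow was kept manual to facilitate transparent experimentation and interpretation.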
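
For reference, the manual 2→1 replacement can be expressed as a pipeline so that the scaler and PCA are refit inside every cross-validation fold. This is a sketch rather than the notebook's actual workflow: the data is synthetic, the column names `f1` to `f4` are placeholders, and 5-fold stratified cross-validation is an assumed evaluation choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): f1 and f2 are correlated.
rng = np.random.default_rng(42)
n = 300
f1 = rng.normal(size=n)
X = pd.DataFrame({
    "f1": f1,
    "f2": 0.5 * f1 + rng.normal(size=n),
    "f3": rng.normal(size=n),
    "f4": rng.normal(size=n),
})
y = (X["f1"] + X["f3"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Scale and compress only the chosen pair; pass every other column through.
compress_pair = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=1)),
])
preprocess = ColumnTransformer(
    [("pair_pca", compress_pair, ["f1", "f2"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Each fold fits the scaler and PCA on its own training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running the same cross-validation for each of the four scenarios would give a spread per scenario, which helps judge whether differences of a few test samples are meaningful.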


In [None]:
df = pd.read_csv('../data/processed/heart_feature_engineering.csv')

# 1. Select columns
cols_to_pca = ['Oldpeak', 'ExerciseAngina']

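# 2. Standardize the columns (PCA is scale-sensitive, so both features are put on a common scale).
#    Note: the scaler and PCA are fit on the full dataset here; if the saved file is later
#    split into train/test sets, fit them on the training portion only to avoid leakage.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols_to_pca])

# 3. PCA from 2 to 1 component
pca = PCA(n_components=1)
df['Oldpeak_Exercise_PCA'] = pca.fit_transform(scaled).ravel()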

# 4. Drop original columns (optional)
df.drop(columns=cols_to_pca, inplace=True)


In [98]:
df.to_csv('../data/processed/heart_processed.csv', index=False)