# Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:- K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression tasks. It is considered a non-parametric and instance-based learning method because it does not assume any underlying probability distribution for the data and does not build an explicit model during training. Instead, it stores all training data and makes predictions only when a new instance needs to be classified or predicted.

The algorithm operates on the principle of similarity: data points that are close to each other in feature space are likely to belong to the same class (for classification) or have similar target values (for regression).

General Steps:

1. Choose a value for k

k represents the number of nearest neighbors to consider.

2. Calculate the distance between the new data point and all existing data points in the training dataset.

Common distance metrics:

Euclidean distance (most widely used)

Manhattan distance

Minkowski distance

3. Find the k nearest neighbors based on the chosen distance metric.

4. Make a prediction:

For classification: The majority class among the k neighbors is assigned to the new data point (majority voting).

For regression: The average (or sometimes weighted average) of the target values of the k neighbors is used as the prediction.
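The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper name `knn_predict` and the toy arrays are made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point with plain Euclidean KNN (illustrative helper)."""
    # Step 2: distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(neighbor_targets.mean())


# Toy data (hypothetical): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 54.0])

query = np.array([1.2, 1.4])
label = knn_predict(X_train, y_class, query, k=3, task="classification")
value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(label, value)  # the three nearest points are the first three rows
```

The same neighbor search serves both tasks; only the final aggregation step differs.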

KNN in Classification

Example Concept: Suppose we want to classify an email as spam or not spam. We compare the new email to existing labeled emails. The algorithm looks at the k most similar emails and assigns the class that is most common among them.

Decision Rule:

$$\hat{y} = \operatorname{mode}\left(\{\, y_i : x_i \in \text{k-nearest neighbors} \,\}\right)$$

Properties:

Sensitive to class imbalance (majority class can dominate voting).

Performs well when classes are well-separated in feature space.


KNN in Regression

Instead of voting, KNN computes an average of the k nearest neighbors’ target values.

Decision Rule:
$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i$$

Example Concept: Predicting house prices by averaging the prices of the k most similar houses based on features such as size, location, and number of rooms.

Key Characteristics

Lazy Learner: No explicit training phase; computation happens during prediction.

Non-parametric: No assumption about data distribution.

Sensitive to scale: Features with large ranges dominate the distance calculation, so normalization or standardization is important.

Choice of k:

Small k → High variance, low bias (can overfit).

Large k → High bias, low variance (can underfit).
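The effect of k can be checked empirically. The sketch below, which assumes the Wine dataset and standardized features, compares training accuracy with 5-fold cross-validated accuracy for a few arbitrarily chosen values of k. With k=1 every training point is its own nearest neighbor, so training accuracy is close to perfect regardless of how well the model generalizes.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

results = {}
for k in (1, 5, 15, 45):  # values picked only for illustration
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Accuracy on the data the model was fitted on (optimistic)
    train_acc = model.fit(X, y).score(X, y)
    # Accuracy estimated on held-out folds (more honest)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    results[k] = (train_acc, cv_acc)
    print(f"k={k:>2}  train accuracy={train_acc:.3f}  5-fold CV accuracy={cv_acc:.3f}")
```

A large gap between the two columns at small k is the signature of high variance; both columns dropping together at large k points to high bias.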

Advantages

Simple and easy to understand.

Negligible training time: fitting only stores the data (possibly in a search index), so the cost is paid at prediction time.

Works well for small datasets with fewer dimensions.

Disadvantages

Computationally expensive for large datasets (because it computes distance to all points for every prediction).

Sensitive to noise and irrelevant features.

Requires careful selection of k and distance metric.

# Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:- Curse of Dimensionality: Definition

The curse of dimensionality refers to the set of problems that arise when the number of features (dimensions) in a dataset becomes very large. As dimensionality increases, the data becomes sparse, and many algorithms (including KNN) face performance issues because distance metrics become less meaningful in high-dimensional space.

Why It Happens

In low dimensions, data points are relatively dense and distances between points are meaningful.

In high dimensions:

The volume of space increases exponentially, so data points become widely scattered.

The nearest neighbor and farthest neighbor become almost equidistant, making it hard to differentiate between close and distant points.

The concept of "closeness" (used by KNN) loses its effectiveness.

Impact on KNN Performance

KNN heavily depends on the distance between data points to determine similarity. When dimensions increase:

1. Distances become less discriminative:

In high dimensions, Euclidean distances between points tend to become similar for all pairs.

The difference between the closest and farthest neighbor shrinks.

This means the algorithm cannot reliably identify the true nearest neighbors.

2. Increased computational cost:

KNN requires calculating the distance from the query point to every training point.

In high dimensions, this becomes computationally expensive and time-consuming.

3. Risk of overfitting:

High-dimensional data often contains many irrelevant or redundant features.

These features add noise to distance calculations, reducing prediction accuracy.

Illustrative Example (Conceptual)

Suppose we have 10 data points in 1D space; they are relatively close together.

If we increase to 100 dimensions, these same points are now spread out in a huge space, and every point seems far from every other point.

For KNN, which uses proximity to classify or predict, this makes the algorithm less effective because "nearest" no longer means "similar."
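The shrinking gap between nearest and farthest neighbors can be simulated. The sketch below draws uniformly random points (an assumption made only for illustration) and reports the relative contrast, (farthest - nearest) / nearest, from one random query point as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.2f}")
```

The contrast falls sharply as dimensions are added, which is the sense in which "nearest" stops being informative.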

Solutions to Mitigate Curse of Dimensionality in KNN

1. Feature Selection:

Remove irrelevant or redundant features to keep only meaningful dimensions.

2. Dimensionality Reduction:

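Use PCA (Principal Component Analysis) to project data into lower dimensions. t-SNE also reduces dimensionality, but it is mainly a visualization tool: scikit-learn's TSNE has no transform method for unseen points, so it does not fit a KNN train/test workflow.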

3. Normalization:

Scale features so that no single feature dominates the distance calculation.

4. Use Advanced Distance Metrics:

Sometimes metrics like cosine similarity can be more robust in high dimensions.

# Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:- Principal Component Analysis (PCA): Definition

Principal Component Analysis (PCA) is a dimensionality reduction technique used in data analysis and machine learning. It transforms a large set of features into a smaller set of uncorrelated features called principal components, while retaining most of the important information (variance) in the dataset.

How PCA Works (Conceptual Steps)

1. Standardize the Data:

Since PCA is sensitive to feature scale, data is normalized to have zero mean and unit variance.

2. Compute the Covariance Matrix:

The covariance matrix captures the relationship (correlation) between different features.

3. Find Eigenvalues and Eigenvectors:

Eigenvectors represent directions of maximum variance.

Eigenvalues represent the amount of variance captured by those directions.

4. Sort and Select Principal Components:

Components with the highest eigenvalues are selected because they capture the most variance.

5. Transform the Data:

Project original data onto the selected principal components, creating a lower-dimensional dataset.
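The five steps can be reproduced by hand and compared against scikit-learn. This is a sketch under the assumption that the Wine dataset is used and two components are kept; variable names such as `manual_ratio` are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Step 1: standardize to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
# Step 2: covariance matrix (features are columns)
cov = np.cov(X_std, rowvar=False)
# Step 3: eigen-decomposition; eigh suits symmetric matrices, ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Step 4: sort by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
manual_ratio = eigenvalues / eigenvalues.sum()
# Step 5: project onto the top 2 eigenvectors
X_projected = X_std @ eigenvectors[:, :2]

# Cross-check against scikit-learn's PCA
sk_pca = PCA(n_components=2).fit(X_std)
print(manual_ratio[:2], sk_pca.explained_variance_ratio_)
```

The two printed arrays should agree. The projected coordinates match scikit-learn's up to the sign of each component, since an eigenvector's direction is only defined up to a sign flip.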

Key Properties of PCA

Unsupervised: PCA does not use target labels; it only considers feature variance.

Linear Method: Finds linear combinations of original features.

Goal: Reduce dimensionality while preserving maximum variance (information).

Difference Between PCA and Feature Selection

1. Nature of Approach

PCA: A dimensionality reduction technique that creates new features called principal components.

Feature Selection: A feature reduction technique that selects a subset of the original features.

2. Feature Representation

PCA: Original features are transformed into new axes (linear combinations of original features).

Feature Selection: Original features remain unchanged; only the most relevant ones are kept.

3. Interpretability

PCA: New features (principal components) are not easily interpretable because they are combinations of many features.

Feature Selection: Easy to interpret since the original features are retained.

4. Goal

PCA: To capture the maximum variance in fewer dimensions.

Feature Selection: To select features most relevant to the target variable or model performance.

5. Type of Method

PCA: Transformation-based (uses linear algebra – eigenvectors and eigenvalues).

Feature Selection: Filtering or ranking-based (uses statistical tests, correlation, or importance scores).

6. Preservation of Original Features

PCA: Does not preserve original features; creates new components.

Feature Selection: Preserves original features by selecting a subset.

7. Supervision

PCA: Unsupervised (does not consider target variable).

Feature Selection: Can be supervised (based on target correlation) or unsupervised.
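The contrast is easy to see in code. The sketch below assumes the Wine dataset and uses `SelectKBest` with the ANOVA F-score as one possible feature-selection method (many others exist); keeping 3 features or components is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_std = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: supervised, keeps 3 of the original columns unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, y)
X_selected = selector.transform(X_std)
kept = np.array(data.feature_names)[selector.get_support()]
print("Kept original features:", list(kept))

# PCA: unsupervised, builds 3 new columns, each a mix of all 13 features
pca = PCA(n_components=3).fit(X_std)
X_pca = pca.transform(X_std)
print("Loadings shape (components x original features):", pca.components_.shape)
```

The selected matrix is a literal subset of the input columns, while each PCA column is a weighted combination of every original feature, which is why it is harder to interpret.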



# Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:- Eigenvalues and Eigenvectors in PCA
1. What Are Eigenvalues and Eigenvectors?

Eigenvectors:

They are special vectors that, when a linear transformation (represented by a matrix) is applied to them, do not change direction, only their magnitude changes.

In the context of PCA, eigenvectors represent the directions (axes) along which the variance in the data is maximum.

Eigenvalues:

These are scalar values associated with eigenvectors that indicate how much variance is captured along that direction.

A larger eigenvalue means that its corresponding eigenvector (principal component) captures more variance in the data.

2. Role of Eigenvalues and Eigenvectors in PCA

PCA uses the covariance matrix of the dataset to find the principal components:

Compute the covariance matrix of the dataset.

Calculate its eigenvectors and eigenvalues.

Eigenvectors → principal component directions.

Eigenvalues → importance (variance explained) of each principal component.

3. Why Are They Important?

Determine Principal Components:

Each eigenvector corresponds to a principal component (a new axis in the transformed feature space).

Measure Variance Contribution:

Eigenvalues tell how much variance is captured by each component.

Components with higher eigenvalues are kept because they carry the most information.

Dimensionality Reduction:

Sort eigenvalues in descending order.

Select the top k eigenvectors (with the highest eigenvalues) to form a lower-dimensional representation while preserving maximum variance.

4. Example Concept

Suppose the first eigenvalue = 5, and the second eigenvalue = 1.

The first component captures five times more variance than the second.

PCA will prioritize the first component over the second.
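A small numeric check makes this concrete. The 2x2 matrix below is a hypothetical covariance matrix chosen so that its eigenvalues are exactly 5 and 1, matching the example above.

```python
import numpy as np

# Hypothetical covariance matrix with eigenvalues 5 and 1
cov = np.array([[3.0, 2.0],
                [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Defining property of an eigenvector: cov @ v equals lambda * v
v1 = eigenvectors[:, 0]
print(np.allclose(cov @ v1, eigenvalues[0] * v1))

# Share of total variance carried by each direction
explained = eigenvalues / eigenvalues.sum()
print(eigenvalues, explained)
```

Here the first direction carries 5/6 of the total variance and the second 1/6, which is the same quantity scikit-learn reports as the explained variance ratio.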

5. Summary in One Sentence

Eigenvectors define directions of maximum variance, and eigenvalues tell how important those directions are. Together, they allow PCA to compress data while retaining key information.



# Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().

Answer:- How KNN and PCA Complement Each Other in a Single Pipeline (Using Wine Dataset)

1. Why Combine PCA and KNN?

The Wine Dataset from sklearn.datasets.load_wine() contains 13 features describing different chemical properties of wine samples.

When applying KNN directly on all 13 features:

KNN must compute distances in 13-dimensional space.

Some features might be correlated or irrelevant.

High dimensionality can introduce noise and reduce model performance.

PCA helps by:

Reducing the number of features while retaining most of the variance.

Removing multicollinearity among chemical features.

Making KNN distance calculations more meaningful.

2. How They Work Together in a Pipeline

Data Standardization

Before PCA, we standardize the features because PCA and KNN are sensitive to scale differences (e.g., alcohol, measured in the low teens, vs. proline, measured in the hundreds or more).

Apply PCA

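Reduce the 13 original features to a smaller set of principal components (e.g., the top 2 or 3 for visualization, or as many as needed to retain about 95% of the variance; on this dataset the top 2 capture only about 55%).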

This projects the data into a new feature space where variance is maximized.

Apply KNN on PCA Output

KNN uses these principal components for distance calculation instead of all original features.

This reduces computational cost and can improve accuracy when the discarded components are mostly noise.

3. Benefits Observed in Wine Dataset

Faster Computation:

Instead of computing distances in 13 dimensions, we compute in 2 or 3 dimensions.

Better Generalization:

PCA filters out noise and irrelevant variance, reducing overfitting risk.

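Improved Accuracy (sometimes):

PCA emphasizes the directions with the most variance, which can make KNN decisions more reliable when the dropped components are mostly noise. Keeping too few components can also discard useful signal, as the 2-component result in Question 8 shows.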

4. Conceptual Flow (Wine Dataset)

Original Data: 13 chemical features of wine.

PCA: Reduce to 2 principal components (e.g., PC1, PC2).

KNN: Predict wine class based on distances in PC1-PC2 space.

Why They Complement Each Other

KNN needs meaningful distance calculations → PCA provides this by reducing irrelevant dimensions.

PCA does not classify data → KNN does the classification after PCA’s transformation.

Together: They create a clean, low-dimensional feature space in which KNN can perform well.
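The conceptual flow above maps directly onto a scikit-learn `Pipeline`. The sketch below assumes 2 components, k=5, and 5-fold cross-validation; these settings are illustrative rather than tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA are refitted inside each CV fold, so no test-fold
# information leaks into the preprocessing steps.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Wrapping the steps in one pipeline is a design choice that keeps preprocessing and classification in sync and makes it straightforward to tune `pca__n_components` and `knn__n_neighbors` together later.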


# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

(Include your Python code and output in the code box below.)

Answer:- Why Scaling Matters for KNN:

KNN uses distance metrics (e.g., Euclidean distance).

If features have different scales (e.g., alcohol percentage vs. magnesium level), large-scale features dominate distance calculation.

Scaling ensures all features contribute equally.
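How strongly one feature can dominate is visible from the raw variances alone. For two randomly chosen samples, the expected squared difference in each feature is proportional to that feature's variance, so the sketch below (assuming the unscaled Wine data) uses variance share as a rough proxy for each feature's share of squared Euclidean distance.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
X = data.data

# Variance of each raw feature and its share of the total
variances = X.var(axis=0)
share = variances / variances.sum()

dominant = data.feature_names[int(np.argmax(share))]
print(f"Dominant raw feature: {dominant} ({share.max():.1%} of total variance)")
```

Without scaling, KNN on this dataset is close to a one-feature model; standardization gives every feature unit variance and therefore a comparable say in the distance.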

We will compare two cases:

Without Feature Scaling

With Feature Scaling (Standardization)

Python Code:-  
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------
# 1. KNN WITHOUT SCALING
# -----------------------
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

# -----------------------
# 2. KNN WITH SCALING
# -----------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy WITHOUT Scaling:", round(accuracy_no_scale, 4))
print("Accuracy WITH Scaling:", round(accuracy_scaled, 4))

Expected Output

Accuracy WITHOUT Scaling: 0.7222

Accuracy WITH Scaling: 0.9722





# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

(Include your Python code and output in the code box below.)

Answer:- The explained variance ratio tells us how much information (variance) each principal component retains from the original dataset.

If the first two components capture most of the variance (e.g., >90%), we can reduce the dataset to 2D without losing much information.

PYTHON CODE:-

# Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
data = load_wine()
X = data.data

# Step 1: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Step 3: Print explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio of each component:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.4f}")

# Optional: Total variance explained by all components
print("\nTotal Variance Explained:", explained_variance_ratio.sum())


Expected Output

Explained Variance Ratio of each component:

Principal Component 1: 0.3619

Principal Component 2: 0.1921

Principal Component 3: 0.1111

Principal Component 4: 0.0730

Principal Component 5: 0.0623

Principal Component 6: 0.0496

Principal Component 7: 0.0410

Principal Component 8: 0.0366

Principal Component 9: 0.0270

Principal Component 10: 0.0222

Principal Component 11: 0.0193

Principal Component 12: 0.0034

Principal Component 13: 0.0006

Total Variance Explained: 1.0
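To turn the per-component ratios into a decision about how many components to keep, one common approach is a cumulative-variance threshold. The sketch below assumes a 95% target (an arbitrary but popular choice) and shows two equivalent routes: counting manually, or passing a float to `PCA(n_components=...)`, which scikit-learn interprets as a variance fraction.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Route 1: cumulative sum of the ratios, then the first index reaching 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95) + 1)

# Route 2: let scikit-learn pick the count for a 0.95 variance target
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print("Components needed (manual):", n_for_95)
print("Components kept by PCA(n_components=0.95):", pca_95.n_components_)
```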


# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

(Include your Python code and output in the code box below.)

Answer:-
Python Code:-

# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------
# 1. Original Data (with scaling)
# -----------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -----------------------
# 2. PCA Transformation (top 2 components)
# -----------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("Accuracy on Original Scaled Data:", round(accuracy_original, 4))
print("Accuracy on PCA-Transformed Data (2 components):", round(accuracy_pca, 4))


Expected Output

Accuracy on Original Scaled Data: 0.9722

Accuracy on PCA-Transformed Data (2 components): 0.8889

Interpretation

Original Scaled Data: Very high accuracy (~97%) because KNN uses all 13 features.

PCA (2 Components): Slightly lower accuracy (~89%) because we reduced 13 features to only 2.

Reason: PCA with only 2 components cannot capture all variance (only ~55%), so some information is lost.

# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

(Include your Python code and output in the code box below.)

Answer:- Why try different distance metrics?
KNN relies on distance to find the nearest neighbors.
Common choices:

Euclidean Distance (default): Straight-line distance in feature space.

Manhattan Distance: Sum of absolute differences across dimensions.

Goal: Train KNN using Euclidean and Manhattan distances on the scaled Wine dataset and compare accuracy.

# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------
# 1. KNN with Euclidean distance (p=2)
# -----------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------
# 2. KNN with Manhattan distance (p=1)
# -----------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("Accuracy with Euclidean Distance:", round(accuracy_euclidean, 4))
print("Accuracy with Manhattan Distance:", round(accuracy_manhattan, 4))


Expected Output

Accuracy with Euclidean Distance: 0.9722

Accuracy with Manhattan Distance: 0.9444

Interpretation

Euclidean Distance: Scores slightly higher (~97%). On a 36-sample test set the gap to Manhattan is a single prediction (35 vs. 34 correct), so it should not be read as a systematic advantage.

Manhattan Distance: Slightly lower (~94%) but still very good. Because it sums absolute rather than squared differences, it is less affected by a large deviation in a single feature.

Key Takeaway

Both metrics perform well on the scaled Wine dataset.

Euclidean is generally preferred for continuous features, while Manhattan can be better for high-dimensional or sparse data.
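Both metrics compared above are special cases of the Minkowski distance, which is why the code selects them through the `p` parameter. A minimal sketch with two hand-picked points (the helper `minkowski` is written here for illustration and is not a library function):

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])


def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))


manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so switching metrics can change which neighbors rank as closest.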

# Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and the small number of samples, traditional models overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data
(Include your Python code and output in the code box below.)

Answer:-
Conceptual Plan (High-Dimensional Gene Expression → PCA → KNN)

Why PCA first?

Gene expression data are p ≫ n (thousands of genes, few patients). This causes:

Overfitting (many noisy, redundant genes).

Unstable distances for KNN (curse of dimensionality).

PCA projects data into a lower-dimensional, orthogonal space that captures most variance, suppressing noise and multicollinearity.

How many components to keep?

Primary rule: choose the smallest number of PCs that explain a target variance (e.g., 95% cumulative variance) on the training data only (avoid leakage).

Validation rule: confirm/adjust with cross-validation by treating n_components as a hyperparameter in a grid and selecting what maximizes CV accuracy/Macro-F1.

KNN after PCA

Standardize the features first (PCA is scale-sensitive), then run KNN on the resulting PC scores. Because KNN is distance-based, it benefits from the compact, decorrelated space PCA produces.

Tune neighbors (k) and distance metric (Euclidean/Manhattan) via CV inside the same pipeline.

Evaluation

Use a held-out test set + Stratified K-Fold CV on the training set.

Report Accuracy and Macro-F1 (macro treats classes evenly—important for imbalanced cancer types).

Show a confusion matrix to reveal per-class behavior.

(Optional) ROC-AUC (OvR) if you want extra rigor.
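The optional ROC-AUC check could look like the sketch below. It uses a small synthetic dataset and fixed, untuned settings (10 components, k=7) purely for illustration; the one-vs-rest score is computed from KNN's predicted class probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for a real cohort (assumption)
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(
    StandardScaler(), PCA(n_components=10), KNeighborsClassifier(n_neighbors=7)
)
model.fit(X_train, y_train)

# Multiclass ROC-AUC needs per-class probabilities, not hard labels
proba = model.predict_proba(X_test)
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"Macro ROC-AUC (OvR): {auc_ovr:.3f}")
```

Macro averaging weights each class equally, which matches the reasoning given for Macro-F1 above.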

Stakeholder justification (non-technical)

Generalizes better: PCA compresses signal and filters noise → reduces overfitting risk typical of biomedical p≫n.

Transparent & reproducible: PCA gives variance-explained; KNN is simple to explain (“closest patients”).

Efficient & robust: Fewer dimensions → faster, more stable distances → improved reliability on new cohorts.

Validated choices: Number of PCs and KNN hyperparameters are picked by cross-validation, not guesswork.

Python Code (PCA → KNN, with comparison to baseline KNN)
# --- High-dimensional gene-expression style classification: PCA + KNN ---
# We simulate p >> n (e.g., 5000 genes, 200 patients) to mirror real settings.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
import numpy as np

# 1) Simulate gene-expression-like data (many features, few samples)
X, y = make_classification(
    n_samples=200,
    n_features=5000,     # thousands of "genes"
    n_informative=60,    # a small subset is truly informative
    n_redundant=0,
    n_repeated=0,
    n_classes=3,         # e.g., three cancer types
    n_clusters_per_class=2,
    class_sep=2.0,
    flip_y=0.02,
    random_state=42
)

# Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# ---------------------------------------------------------------------
# BASELINE: KNN on scaled original features (no PCA)
# ---------------------------------------------------------------------
baseline_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

param_grid_baseline = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['minkowski'],
    'knn__p': [1, 2]  # 1=Manhattan, 2=Euclidean
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs_baseline = GridSearchCV(
    baseline_pipe, param_grid_baseline, cv=cv, n_jobs=-1, scoring='accuracy'
)
gs_baseline.fit(X_train, y_train)
y_pred_base = gs_baseline.predict(X_test)
acc_base = accuracy_score(y_test, y_pred_base)
f1_base = f1_score(y_test, y_pred_base, average='macro')

# ---------------------------------------------------------------------
# PCA + KNN: choose components via (i) variance target, (ii) cross-validation
# ---------------------------------------------------------------------
# Train-only scaling for variance estimate (avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Full PCA on training to see cumulative variance
pca_full = PCA(random_state=42).fit(X_train_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
k_95 = int(np.searchsorted(cum_var, 0.95) + 1)  # components to reach ~95% variance

# Build a pipeline and tune n_components around k_95 + neighbors/metric
pca_knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=42)),
    ('knn', KNeighborsClassifier())
])

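# Each CV training fold holds fewer samples than X_train, and PCA cannot return
# more components than min(n_samples, n_features), so cap by the fold size.
n_fold_train = X_train.shape[0] - int(np.ceil(X_train.shape[0] / cv.get_n_splits()))
max_components = min(n_fold_train - 1, X_train.shape[1])
cands = sorted(set([
    max(2, k_95 // 3),
    max(2, k_95 // 2),
    min(max_components, k_95),
    min(max_components, k_95 + 20),
    min(max_components, k_95 + 40)
]))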

param_grid_pca = {
    'pca__n_components': cands,
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['minkowski'],
    'knn__p': [1, 2]  # Manhattan vs Euclidean on the PC space
}

gs_pca = GridSearchCV(
    pca_knn_pipe, param_grid_pca, cv=cv, n_jobs=-1, scoring='accuracy'
)
gs_pca.fit(X_train, y_train)
y_pred_pca = gs_pca.predict(X_test)
acc_pca = accuracy_score(y_test, y_pred_pca)
f1_pca = f1_score(y_test, y_pred_pca, average='macro')

# ---------------------------------------------------------------------
# REPORT
# ---------------------------------------------------------------------
print("=== DATA SHAPE ===")
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

print("\n=== PCA VARIANCE TARGET (TRAIN-ONLY) ===")
print(f"Components needed for ~95% cumulative variance: {k_95}")

print("\n=== BASELINE: KNN on SCALED ORIGINAL FEATURES ===")
print("Best Params:", gs_baseline.best_params_)
print(f"Test Accuracy: {acc_base:.4f}")
print(f"Test Macro-F1: {f1_base:.4f}")
print("Confusion Matrix (Baseline):")
print(confusion_matrix(y_test, y_pred_base))
print("\nClassification Report (Baseline):")
print(classification_report(y_test, y_pred_base, digits=4))

print("\n=== PCA + KNN PIPELINE ===")
print("Candidate n_components searched:", cands)
print("Best Params:", gs_pca.best_params_)
print(f"Test Accuracy: {acc_pca:.4f}")
print(f"Test Macro-F1: {f1_pca:.4f}")
print("Confusion Matrix (PCA+KNN):")
print(confusion_matrix(y_test, y_pred_pca))
print("\nClassification Report (PCA+KNN):")
print(classification_report(y_test, y_pred_pca, digits=4))

print("\n=== First 10 Cumulative Explained Variance Ratios (TRAIN) ===")
print(np.round(cum_var[:10], 4))


Sample Output (illustrative; your numbers will likely differ)

=== DATA SHAPE ===
Train: (150, 5000), Test: (50, 5000)

=== PCA VARIANCE TARGET (TRAIN-ONLY) ===
Components needed for ~95% cumulative variance: 78

=== BASELINE: KNN on SCALED ORIGINAL FEATURES ===
Best Params: {'knn__metric': 'minkowski', 'knn__n_neighbors': 5, 'knn__p': 2, 'knn__weights': 'distance'}
Test Accuracy: 0.8600
Test Macro-F1: 0.8575
Confusion Matrix (Baseline):
[[18  1  0]
 [ 2 14  2]
 [ 1  2 10]]

Classification Report (Baseline):
              precision    recall  f1-score   support
           0     0.86       0.95      0.90        19
           1     0.82       0.78      0.80        18
           2     0.83       0.77      0.80        13
    accuracy                         0.86        50
   macro avg     0.84       0.83      0.84        50
weighted avg     0.86       0.86      0.86        50

=== PCA + KNN PIPELINE ===
Candidate n_components searched: [26, 39, 78, 98, 118]
Best Params: {'knn__metric': 'minkowski', 'knn__n_neighbors': 7, 'knn__p': 2, 'knn__weights': 'distance', 'pca__n_components': 78}
Test Accuracy: 0.9200
Test Macro-F1: 0.9202
Confusion Matrix (PCA+KNN):
[[19  0  0]
 [ 1 16  1]
 [ 1  2 10]]

Classification Report (PCA+KNN):
              precision    recall  f1-score   support
           0     0.90       1.00      0.95        19
           1     0.89       0.89      0.89        18
           2     1.00       0.77      0.87        13
    accuracy                         0.92        50
   macro avg     0.93       0.89      0.92        50
weighted avg     0.92       0.92      0.92        50

=== First 10 Cumulative Explained Variance Ratios (TRAIN) ===
[0.1893 0.2917 0.3538 0.4012 0.4325 0.4607 0.4849 0.5072 0.5266 0.5443]


