Question 1:
What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:

K-Nearest Neighbors (KNN) is a supervised, instance-based, and non-parametric machine learning algorithm used for both classification and regression tasks.

How KNN Works (General Working):

KNN does not build an explicit model during training.

It stores the entire training dataset.

When a new data point is given, KNN calculates the distance between this point and all training points.

It selects the K nearest data points based on distance.

The prediction is made based on these neighbors.

Common distance measures include Euclidean and Manhattan distance.

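The value of K is a hyperparameter chosen by the user, often tuned with cross-validation.

Smaller K → flexible decision boundary, but sensitive to noise.

Larger K → smoother decision boundary, but may blur class boundaries if K is too large.

KNN is simple to implement and makes no assumption about the shape of the data distribution.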
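
The neighbor-search step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name k_nearest and the toy data are made up for the example.

```python
import numpy as np

def k_nearest(X_train, query, k, metric="euclidean"):
    """Return (indices, distances) of the k training points closest to query."""
    diff = X_train - query
    if metric == "euclidean":
        dists = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        dists = np.abs(diff).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    # argsort orders the training points from nearest to farthest
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

# Toy training set (hypothetical values, for illustration only)
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [5.0, 5.0]])
query = np.array([0.0, 0.5])

idx, d = k_nearest(X_train, query, k=2)
print("Nearest indices:", idx)
print("Distances:", d)
```

Note that nothing is "trained" here: the whole training matrix is kept and every query is compared against it, which is why prediction is the expensive step.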

KNN for Classification:

The class labels of the K nearest neighbors are considered.

Majority voting is applied.

The class with the highest frequency is assigned.

Works well for multi-class problems.

Sensitive to feature scale, so features should usually be normalized or standardized before distances are computed.

Common in image and text classification.

Performs well with well-separated classes.

Prediction time is high for large training sets, because each query is compared against the stored training points.

Accuracy depends on K and distance metric.
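
As a rough sketch of the majority-voting step, assuming Euclidean distance and a made-up two-class toy dataset (the function name knn_classify is hypothetical):

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, query, k):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    # most_common(1) returns the label with the highest vote count
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two "red" points and three "blue" points
X_train = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_train = np.array(["red", "red", "blue", "blue", "blue"])

print(knn_classify(X_train, y_train, np.array([1.5, 1.5]), k=3))
print(knn_classify(X_train, y_train, np.array([5.0, 5.5]), k=3))
```

With an even K or more than two classes, votes can tie; this sketch simply returns whichever tied label was counted first, whereas a real implementation may break ties differently.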

KNN for Regression:

The output values of K nearest neighbors are considered.

The average (mean) of these values is calculated.

The average becomes the predicted value.

No voting mechanism is used.

Sensitive to outliers.

Suitable for smooth numerical data.

Performs poorly with noisy data.

Requires proper scaling.

Computationally expensive.

Simple to understand and implement.
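
The regression variant differs only in the last step: the neighbors' target values are averaged instead of voted on. A minimal sketch, assuming unweighted averaging and hypothetical toy values:

```python
import numpy as np

def knn_regress(X_train, y_train, query, k):
    """Predict a numeric value as the mean target of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Toy 1-D data (hypothetical); the last point acts as an outlier
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

pred_k3 = knn_regress(X_train, y_train, np.array([2.2]), k=3)
pred_k4 = knn_regress(X_train, y_train, np.array([2.2]), k=4)
print("k=3 prediction:", pred_k3)
print("k=4 prediction:", pred_k4)
```

Comparing the two predictions shows the outlier sensitivity mentioned above: once the distant point enters the neighborhood, it pulls the mean strongly toward its own value.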

Question 2:
What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:

The Curse of Dimensionality refers to the problems that arise when the number of features (dimensions) increases.

Explanation:

As dimensions increase, data becomes sparse.

Distance between data points becomes less meaningful.

All points appear almost equally distant.

KNN relies heavily on distance calculations.

Nearest neighbors may not be truly “near”.

Model accuracy decreases.

Computational cost increases.

More data is required to maintain performance.

Irrelevant or noisy features add to every distance and can drown out the informative features.

Model overfits easily.
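
The claim that points become almost equally distant can be checked numerically. The sketch below assumes uniformly random points in the unit hypercube (an illustrative assumption, not real data) and measures the relative gap between the farthest and nearest point from a random query; the function name distance_contrast is made up.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist for a random query point."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(X - query, axis=1)
    return (d.max() - d.min()) / d.min()

for dims in (2, 10, 100, 1000):
    print(f"{dims:>5} dimensions: contrast = {distance_contrast(500, dims):.3f}")
```

A contrast that shrinks as the number of dimensions grows means the "nearest" neighbor is barely closer than the farthest one, which is exactly the situation in which KNN struggles.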

Effect on KNN:

Distance metrics lose discriminative power.

Poor neighbor selection.

Increased misclassification.

Slower prediction time.

Higher memory usage.

Reduced generalization.

Scaling alone is insufficient.

Feature selection or reduction needed.

PCA helps mitigate this problem.

Dimensionality reduction improves KNN performance.
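
One way to see the effect on KNN directly is to append pure-noise features to a dataset and watch the cross-validated accuracy. The sketch below assumes scikit-learn's built-in Wine data, Gaussian noise columns, and K = 5; these choices are illustrative, and the exact scores will vary with the random seed.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

def cv_accuracy(n_noise):
    """Mean 5-fold accuracy of scaled KNN after adding n_noise random features."""
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_aug = np.hstack([X, noise])
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, X_aug, y, cv=5).mean()

results = {n: cv_accuracy(n) for n in (0, 50, 500)}
for n, acc in results.items():
    print(f"{n:>3} noise features: accuracy = {acc:.3f}")
```

Scaling is applied in every run, so any drop in accuracy comes from the extra dimensions themselves, which illustrates why scaling alone is insufficient.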

Question 3:
What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space.

What PCA Does:

Converts correlated features into uncorrelated components.

Each component is a linear combination of original features.

Components are ordered by variance.

First component captures maximum variance.

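Reduces noise and redundancy.

Can improve downstream model performance.

Can reduce overfitting.

Makes visualization easier (for example, plotting the first 2 or 3 components).

Speeds up computation.

Retains most of the variance when enough components are kept.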

Difference Between PCA and Feature Selection:
Aspect            PCA                               Feature Selection
Type              Feature extraction                Feature filtering
Features          Creates new features              Keeps original features
Interpretability  Low                               High
Correlation       Removes correlation               May keep correlated features
Supervision       Unsupervised                      Can be supervised
Dimensionality    Reduced                           Reduced
Information loss  Possible (dropped components)     Possible (dropped features)
Example           Projection onto top eigenvectors  Selecting top-ranked features
Data transform    Yes                               No
Use case          High dimensions                   Interpretability
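
The difference can be made concrete with a short sketch. It assumes scikit-learn's built-in Wine data and uses SelectKBest with an ANOVA F-score as one possible feature selection method; the choice of 3 components and 3 features is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
X_scaled = StandardScaler().fit_transform(X)

# PCA: builds 3 new features, each a linear combination of all 13 originals
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: keeps 3 of the original columns unchanged (uses the labels y)
selector = SelectKBest(score_func=f_classif, k=3)
X_sel = selector.fit_transform(X_scaled, y)
kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

print("PCA output shape:", X_pca.shape)
print("Selection output shape:", X_sel.shape)
print("Original features kept by selection:", kept)
print("Weights of PC1 over the original features:", pca.components_[0].round(2))
```

Both outputs have the same shape, but the selected columns still carry their original names, while each principal component mixes every input feature.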
Question 4:
What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:

Eigenvalues and eigenvectors are mathematical concepts used in PCA to identify important directions in data.

Eigenvectors:

Represent directions of maximum variance.

Define new feature axes.

Are orthogonal to each other.

Used to form principal components.

Capture relationships between features.

Ordered by their eigenvalues, from largest to smallest.

Projecting the data onto them yields uncorrelated components.

Used for projection.

Define transformation matrix.

Determine new coordinate system.

Eigenvalues:

Represent magnitude of variance.

Measure importance of eigenvectors.

Larger eigenvalue → more information.

Used to rank components.

Help decide number of components.

Used in explained variance ratio.

Indicate data spread.

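Components with very small eigenvalues often capture mostly noise and can be discarded.

Control dimensionality reduction.

Determine how much variance is kept for a given number of components.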
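
To make the roles of eigenvalues and eigenvectors concrete, the sketch below computes PCA by hand from the covariance matrix. It assumes small synthetic 2-D data with correlated features (made up for illustration); library implementations typically use an SVD instead, but the quantities are the same.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data (hypothetical): second feature depends on the first
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])

# 1. Center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigen-decomposition (eigh is for symmetric matrices, returns ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Eigenvalues give the explained variance ratio
explained_ratio = eigvals / eigvals.sum()

# 4. Eigenvectors define the projection onto the new axes
scores = Xc @ eigvecs

print("Eigenvalues:", eigvals)
print("Explained variance ratio:", explained_ratio)
print("Covariance of projected data:\n", np.cov(scores, rowvar=False).round(6))
```

The covariance of the projected data is diagonal, with the eigenvalues on the diagonal: the new axes are uncorrelated, and each eigenvalue is the variance along its eigenvector.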

Question 5:
How do KNN and PCA complement each other when applied in a single pipeline?

Answer:

KNN and PCA work together to improve performance and efficiency.

How They Complement Each Other:

PCA reduces dimensionality.

KNN suffers in high dimensions.

PCA removes redundant features.

Distance calculations become more meaningful.

Noise is reduced.

Model becomes faster.

Memory usage decreases.

Overfitting is reduced.

Accuracy often improves.

Pipeline becomes more robust to noisy and redundant features.

Overall Benefits:

Better generalization.

Faster prediction.

Improved accuracy.

Reduced impact of the curse of dimensionality.

Suitable for real-world data.
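
A single pipeline might look like the following sketch. It assumes scikit-learn's built-in Wine data, 5 principal components, and K = 5; these values are placeholders that would normally be tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Each step is fit on the training data only, then applied to the test data
pipe = Pipeline([
    ("scale", StandardScaler()),               # put features on a common scale
    ("pca", PCA(n_components=5)),              # reduce 13 features to 5 components
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("Pipeline test accuracy:", test_accuracy)
```

Wrapping the steps in a Pipeline keeps the scaler and PCA from seeing the test data during fitting, which avoids data leakage.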

Dataset: Wine Dataset
Question 6:
Train a KNN Classifier on the Wine dataset with and without feature scaling

Answer:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc_without_scaling = accuracy_score(y_test, knn.predict(X_test))

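# With scaling: fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Refit the same classifier on the standardized features
knn.fit(X_train_scaled, y_train)
acc_with_scaling = accuracy_score(y_test, knn.predict(X_test_scaled))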

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)


Output:

Accuracy without scaling: 0.72
Accuracy with scaling: 0.97

Question 7:
Train a PCA model and print explained variance ratio

Answer:

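from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is the Wine feature matrix loaded in Question 6.
# Standardize first: PCA is variance-based, so large-scale features would dominate otherwise.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep all components to inspect how the variance is distributed
pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)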


Output (sample):

[0.36 0.19 0.11 0.07 0.06 ...]

Question 8:
Train KNN on PCA-transformed dataset (top 2 components)

Answer:

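# Reuses X_scaled and y from the previous questions.
# Note: the scaler and PCA are fit on the full dataset before splitting, which
# leaks test-set information. For an unbiased estimate, split first and fit
# both on the training data only (for example with sklearn.pipeline.Pipeline).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = accuracy_score(y_test, knn.predict(X_test))
print("Accuracy with PCA:", accuracy)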


Output:

Accuracy with PCA: 0.94

Question 9:
Train KNN with Euclidean and Manhattan distance metrics

Answer:

# n_neighbors defaults to 5; X_train_scaled and X_test_scaled come from Question 6
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)


Output:

Euclidean Accuracy: 0.97
Manhattan Accuracy: 0.95

Question 10:
Gene Expression Dataset – PCA + KNN Pipeline Explanation

Answer:

Using PCA to Reduce Dimensionality:

Gene datasets have thousands of features.

PCA reduces feature space.

Removes noise and redundancy.

Retains major variance.

Improves model stability.

Deciding Number of Components:

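Use the cumulative explained variance ratio.

Retain enough components to explain roughly 90–95% of the variance (a common rule of thumb).

Use a scree plot to look for an elbow.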

Balance information and simplicity.

Avoid over-compression.
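
The cumulative-variance rule can be sketched as follows. No real gene expression data is assumed here: the matrix below is a synthetic stand-in (60 samples, 2000 features driven by 5 hidden factors plus noise), so the resulting component count is only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a gene expression matrix (hypothetical data):
# 60 samples x 2000 "genes", generated from 5 latent factors plus noise
latent = rng.normal(size=(60, 5))
loadings = rng.normal(size=(5, 2000))
X = latent @ loadings + 0.5 * rng.normal(size=(60, 2000))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then accumulate the explained variance ratio
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_components_95)
print("Variance explained by them:", cumulative[n_components_95 - 1])
```

scikit-learn can also apply this rule directly: passing a float between 0 and 1 as n_components (for example PCA(n_components=0.95)) keeps just enough components to reach that fraction of variance.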

Using KNN after PCA:

With fewer dimensions, distances between samples are more informative.

Faster neighbor search.

Less overfitting.

Improved generalization.

Better classification.

Model Evaluation:

Accuracy

Precision

Recall

F1-score

Cross-validation
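
The metrics above can be computed together with cross-validation. The sketch below is one possible setup under stated assumptions: a synthetic dataset from make_classification stands in for gene expression data, and the fold count, K, and 95% variance threshold are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): few samples, many features
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),           # keep 95% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
scores = cross_validate(pipe, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

Because the scaler and PCA sit inside the pipeline, they are refit on the training folds of each split, so the reported scores are not inflated by information from the held-out fold.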

Justifying to Stakeholders:

PCA reduces complexity.

Improves robustness.

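Helps when there are far fewer samples than features.

Reduces overfitting.

Transparent, easy-to-explain workflow, although individual components are harder to interpret than original genes.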

Proven ML techniques.

Computationally efficient.

Suitable for biomedical data.

Scientifically justified.

Reliability can be demonstrated with cross-validated results before real-world use.