**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

Answer:-

K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm used for both classification and regression. It works by storing the training data and making predictions based on similarity. For a new data point, KNN calculates the distance (such as Euclidean distance) between the point and all training samples, then selects the K nearest neighbors.
In classification, the predicted class is determined by majority voting among the neighbors.
In regression, the prediction is the average (or distance-weighted average) of the neighbors' target values.
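
The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and distance ties are broken by index order.

```python
from collections import Counter

import numpy as np


def k_nearest(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new (Euclidean)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the majority class among the k nearest neighbors."""
    idx = k_nearest(X_train, x_new, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k=3):
    """Predict the mean target value of the k nearest neighbors."""
    idx = k_nearest(X_train, x_new, k)
    return float(np.mean(y_train[idx]))


# Toy data (hypothetical): two well-separated groups on a line
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

x_new = np.array([2.5])
print(knn_classify(X_train, y_class, x_new, k=3))  # 0
print(knn_regress(X_train, y_value, x_new, k=3))   # 2.0
```

The three nearest neighbors of 2.5 are the points 1, 2, and 3, so the vote returns class 0 and the regression returns their mean.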

**Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

Answer:-

The Curse of Dimensionality refers to problems that arise when the number of features (dimensions) increases in a dataset. As dimensions grow, data points become sparse, and the distance between points becomes less meaningful.

Effect on KNN Performance:

- Distances to the nearest and farthest neighbors become almost equal.
- KNN struggles to identify truly "nearest" neighbors.
- More data is required to maintain accuracy.
- Prediction performance degrades and computation cost rises.
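
The first effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from a unit hypercube (an arbitrary choice for illustration) and reports the ratio of the nearest to the farthest distance from a random query point. The function name `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np


def nearest_to_farthest_ratio(n_dims, n_points=500, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(nearest_to_farthest_ratio(n_dims), 3))
```

A ratio close to 0 means the nearest neighbor is much closer than the farthest point. As the number of dimensions grows, the ratio moves toward 1, so "nearest" carries less information.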

**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Answer:- Principal Component Analysis (PCA) is an unsupervised dimensionality-reduction technique that transforms original correlated features into a smaller set of new, uncorrelated variables called principal components. These components capture the maximum variance present in the data while reducing dimensionality. PCA changes the feature space by combining features mathematically.

Feature selection, in contrast, chooses a subset of the original features based on statistical tests, model importance, or domain knowledge, without transforming them.
In summary, PCA creates new features, while feature selection retains existing ones, which keeps the model easier to interpret.
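
The contrast can be made concrete on the Wine dataset. This sketch uses scikit-learn's `SelectKBest` with the `f_classif` score as one possible feature-selection method; the choice of 3 features and 3 components is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep 3 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected original columns:", kept)
print("Columns are copied as-is:", np.array_equal(X_selected, X_scaled[:, kept]))

# PCA: build 3 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("Weights per component (shape):", pca.components_.shape)
```

The selected columns are exact copies of original features, while each row of `pca.components_` holds 13 weights, one per original feature.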

**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

Answer:- In Principal Component Analysis (PCA), eigenvectors represent the directions (axes) along which the data varies the most, while eigenvalues indicate the amount of variance captured along each eigenvector. They are computed from the covariance matrix of the data. Eigenvectors with larger eigenvalues correspond to more important principal components. PCA selects the top eigenvectors based on the highest eigenvalues to reduce dimensionality while preserving maximum information. Thus, eigenvectors define the new feature space, and eigenvalues help decide how many principal components to keep, ensuring minimal information loss.
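
The computation described above can be reproduced directly with NumPy. The sketch below uses synthetic two-feature data (an assumption for illustration) and shows how the eigenvalues of the covariance matrix turn into explained variance ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): the second feature closely follows the first
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=300)
X = np.column_stack([x1, x2])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)

# Project onto the first principal component (reduce 2 features to 1)
X_projected = X_centered @ eigenvectors[:, :1]
print("Projected shape:", X_projected.shape)
```

The variance of the projected data equals the largest eigenvalue, which is why sorting by eigenvalue ranks the components by how much information they keep.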

**Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

Answer:- KNN and PCA complement each other effectively in a single machine learning pipeline.
KNN is a distance-based algorithm whose performance degrades in high-dimensional spaces due to the curse of dimensionality. PCA addresses this by reducing the number of features while preserving most of the data's variance. By projecting data onto fewer, informative principal components, PCA can filter out noise, reduce sparsity, and make distance calculations more meaningful. As a result, KNN typically becomes faster and less sensitive to irrelevant features, and often more accurate. In practice, applying PCA before KNN improves computational efficiency and often leads to better classification or regression performance.
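
One way to combine the two is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. Keeping 5 components and using 5 folds are arbitrary choices for this example, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify; each step is refit on the training folds only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),  # 5 is an arbitrary choice for this sketch
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Wrapping the steps in one pipeline means the scaler and PCA never see the held-out fold during fitting, which avoids leaking test information into the preprocessing.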

**Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

Answer:-

In [1]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------
# KNN WITHOUT Feature Scaling
# -----------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_without_scaling = accuracy_score(y_test, y_pred)

# -----------------------------
# KNN WITH Feature Scaling
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy without scaling:", accuracy_without_scaling)
print("Accuracy with scaling:", accuracy_with_scaling)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
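
A likely reason for the gap in the output above is that the Wine features live on very different scales, so unscaled Euclidean distances are dominated by the features with the largest spread. The sketch below only inspects the per-feature standard deviations. It does not prove the cause, but it shows the size of the imbalance.

```python
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# Print features from the widest to the narrowest spread
for name, std in sorted(zip(data.feature_names, stds), key=lambda p: -p[1]):
    print(f"{name:30s} std = {std:8.3f}")

print("Largest / smallest std:", stds.max() / stds.min())
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.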


**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

Answer:-

In [2]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keep all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
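
The ratios printed above can be accumulated to see how many components are needed to reach a given share of the variance. The sketch below copies the values from that output; the helper name `components_needed` is made up for this example.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)


def components_needed(threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    return int(np.argmax(cumulative >= threshold)) + 1


for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} variance -> {components_needed(threshold)} components")
```

For these values, 5 components reach 80%, 8 reach 90%, and 10 reach 95% of the variance.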


**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

Answer:-

In [3]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------
# KNN on Original Dataset
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_original = knn.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)

# -----------------------------
# PCA (Top 2 Components)
# -----------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("Accuracy with Original Features:", accuracy_original)
print("Accuracy with PCA (2 Components):", accuracy_pca)


Accuracy with Original Features: 0.9629629629629629
Accuracy with PCA (2 Components): 0.9814814814814815


**Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

Answer:-

In [4]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# KNN with Euclidean Distance
# -----------------------------
knn_euclidean = KNeighborsClassifier(
    n_neighbors=5, metric='euclidean'
)
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------------
# KNN with Manhattan Distance
# -----------------------------
knn_manhattan = KNeighborsClassifier(
    n_neighbors=5, metric='manhattan'
)
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("Accuracy with Euclidean distance:", accuracy_euclidean)
print("Accuracy with Manhattan distance:", accuracy_manhattan)


Accuracy with Euclidean distance: 0.9629629629629629
Accuracy with Manhattan distance: 0.9629629629629629
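
Both metrics are special cases of the Minkowski distance, with p=1 for Manhattan and p=2 for Euclidean. The sketch below computes them by hand for one pair of points; the function name `minkowski` is chosen for this example and is not the scikit-learn implementation.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
print("Manhattan:", minkowski(a, b, p=1))  # 7.0
print("Euclidean:", minkowski(a, b, p=2))  # 5.0
```

Identical accuracies on one split, as in the output above, do not by themselves show that the two metrics selected the same neighbors or made the same predictions. Comparing `y_pred_euclidean` and `y_pred_manhattan` element by element would answer that.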


**Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:**

- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

Answer:-

The approach is explained step by step below, followed by a Python example on simulated data.

Solution Approach for High-Dimensional Gene Expression Data

1. Using PCA to Reduce Dimensionality

Gene expression datasets have thousands of genes but few samples, which leads to overfitting.
PCA reduces dimensionality by transforming correlated gene features into fewer orthogonal principal components that retain most of the variance and filter out some of the noise.

2. Deciding How Many Components to Keep

- Use the explained variance ratio.
- Retain enough components to explain 90 to 95% of the cumulative variance.
- Use a scree plot or cumulative variance curve to balance information retention against complexity.

3. Using KNN After PCA

- KNN is distance-based and performs poorly in high dimensions.
- PCA makes distances more meaningful by removing redundant dimensions.
- Train KNN on the PCA-transformed data for better generalization and speed.

4. Model Evaluation

- Use cross-validation.
- Metrics:
  - Accuracy
  - Precision, Recall, F1-score (important in healthcare)
  - Confusion matrix to analyze misclassifications

5. Justification to Stakeholders

"This pipeline reduces noise, limits overfitting, keeps the model simple to explain, and provides stable predictions, all of which are critical for reliable biomedical decision-making."

In [5]:
# Import libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Simulate high-dimensional gene expression data
X, y = make_classification(
    n_samples=200,
    n_features=1000,
    n_informative=50,
    n_classes=3,
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA (retain 95% variance)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Number of PCA components retained:", pca.n_components_)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

# Predictions and evaluation
y_pred = knn.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)


Number of PCA components retained: 125
Model Accuracy: 0.3333333333333333
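
The accuracy of about 0.33 reported above is what random guessing would give for three roughly balanced classes, so a single train-test split is not enough to call the pipeline robust on this simulated data. The evaluation step described earlier (cross-validation, per-class metrics, confusion matrix) could be sketched as follows. The 5-fold stratified split and its seed are assumptions made for this example, and no expected numbers are claimed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as in the cell above
X, y = make_classification(
    n_samples=200, n_features=1000, n_informative=50,
    n_classes=3, random_state=42,
)

# Scaler and PCA are refit inside every fold, so no test data leaks into them
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred))
```

If the cross-validated scores also stay near chance, the number of components, the value of K, or the choice of PCA itself would need to be revisited before presenting the pipeline to stakeholders.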
