Q1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
>- K-Nearest Neighbors (KNN) is a simple, non-parametric, instance-based machine learning algorithm that makes predictions based on the K closest data points in the training set.
How it works:


1. Choose a value of K (number of neighbors).

2. Calculate the distance (e.g., Euclidean) between the new data point and all training points.


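3. Select the K nearest neighbors.


4. Aggregate the neighbors’ labels (classification) or target values (regression) into the prediction.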


>- KNN for Classification:


> The class is predicted by majority voting among the K neighbors.


> Example: If most neighbors are “Default,” the output is “Default.”


>- KNN for Regression:


> The output is the average (or weighted average) of the K neighbors’ target values.


>- Example: Predicting house price as the mean price of nearby houses.


>- Key points:


> No explicit training phase (lazy learner): the model stores the training data and defers computation to prediction time


> Sensitive to K value and feature scaling


> Works best on small, clean, low-dimensional datasets, because each prediction compares the query against every stored point
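
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point with brute-force Euclidean KNN (illustrative helper)."""
    X_train = np.asarray(X_train, dtype=float)
    # Distance from the new point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data, invented for illustration only
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["No Default", "No Default", "No Default", "Default", "Default", "Default"]
prices = [100.0, 110.0, 105.0, 300.0, 310.0, 290.0]

label_pred = knn_predict(X, labels, [5.1, 5.0], k=3)
price_pred = knn_predict(X, prices, [1.0, 1.0], k=3, task="regression")
print(label_pred)   # Default
print(price_pred)   # 105.0
```

The same function covers both tasks because only the final aggregation step differs: a vote for classification, a mean for regression.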

Q2. What is the Curse of Dimensionality and how does it affect KNN
performance?
>- Curse of Dimensionality refers to the problem that occurs when the number of features (dimensions) increases, making data points farther apart and sparse.

>- Effect on KNN performance:

> Distance measures become less meaningful in high dimensions.

> Nearest neighbors are no longer truly “near.”

> KNN accuracy tends to decrease, and computation becomes slower because every distance involves more features.

> Result: KNN performs poorly on high-dimensional data unless dimensionality reduction (e.g., PCA) or feature selection is applied.
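
One way to see distances losing meaning is to measure how much farther the farthest point is than the nearest one as the number of dimensions grows. The sketch below assumes uniformly random points in a unit hypercube; `relative_contrast` is a name chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension increases, the contrast shrinks toward zero: the nearest and farthest points end up at almost the same distance, so "nearest" carries little information.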

Q3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
>- Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms original features into a smaller set of new uncorrelated variables (principal components) that capture maximum variance.

>- Difference from Feature Selection:

> PCA: Creates new features by combining existing ones (feature transformation).

> Feature Selection: Chooses a subset of original features without changing them.

> Key idea: PCA reduces dimensions by compression, while feature selection reduces dimensions by selection.
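
The difference can be checked directly on the Wine data. The sketch below uses `SelectKBest` with `f_classif` as one possible feature selection method (an assumption for illustration; unlike PCA it uses the labels) and compares it with a 2-component PCA.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keeps 2 of the original columns, values untouched
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)

# PCA: builds 2 new columns, each a weighted combination of all 13 features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Selected original columns:", kept)
print("Selected columns unchanged:", np.allclose(X_selected, X_scaled[:, kept]))
print("Shape of PCA loadings (components x features):", pca.components_.shape)
```

Both outputs have two columns, but the selected columns are still original measurements, while each PCA column mixes all the original features through its loadings.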

Q4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
>- In PCA, eigenvectors and eigenvalues come from the covariance matrix of the data.

- Eigenvectors: Represent the directions (principal components) along which the data varies the most.

- Eigenvalues: Indicate the amount of variance captured along each eigenvector.

>- Why they are important:

> Eigenvectors define the new feature axes.

> Eigenvalues help decide how many principal components to keep (a higher eigenvalue means that component explains more of the variance).
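
As a rough check of this relationship, the sketch below eigen-decomposes the covariance matrix of the standardized Wine data with NumPy and compares the result with scikit-learn's `PCA`. It is an illustration under the assumption that the features are standardized first.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Covariance matrix of the features (columns are variables)
cov = np.cov(X_scaled, rowvar=False)

# eigh is suited to symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # column i is the i-th principal direction

# Each eigenvalue's share of the total is the explained variance ratio
variance_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print(np.allclose(eigenvalues, pca.explained_variance_))
print(np.allclose(variance_ratio, pca.explained_variance_ratio_))
```

The eigenvectors agree with `pca.components_` only up to sign, since an eigenvector and its negation describe the same axis.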

Q5. How do KNN and PCA complement each other when applied in a single
pipeline?

Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().
>- How KNN and PCA complement each other in one pipeline (Wine Dataset):

PCA first: Transforms the Wine dataset’s 13 standardized features into fewer uncorrelated principal components, reducing redundancy among correlated features and filtering some noise.

Then KNN: Runs on the reduced feature space where distance calculations are more meaningful.

>Benefits together:

Can improve KNN accuracy by mitigating the curse of dimensionality

Faster computation (fewer features)

Often better generalization, because low-variance directions that are mostly noise are discarded

In short:
PCA simplifies the data → KNN performs better distance-based classification.
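
A minimal sketch of such a pipeline, using scikit-learn's `make_pipeline` so that scaling and PCA are fitted on the training data only. The choices of 2 components, `n_neighbors=5`, and a stratified split are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale -> reduce -> classify, chained into a single estimator
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("Pipeline test accuracy:", test_accuracy)
```

Wrapping the steps in one pipeline also means the same object can be passed to cross-validation or grid search without leaking test data into the scaler or PCA.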

Q6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# KNN WITHOUT scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred)

# KNN WITH scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)

# Results
print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
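
A likely reason for the gap is that Euclidean distance sums squared differences per feature, so a feature measured on a much larger scale dominates the total. The sketch below inspects the raw feature spreads to illustrate this; it is an exploratory check, not part of the required answer.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# Features sorted from largest to smallest standard deviation
order = np.argsort(stds)[::-1]
for i in order[:3]:
    print(f"{data.feature_names[i]:<30} std = {stds[i]:.2f}")

largest_feature = data.feature_names[order[0]]
# Share of the total (unscaled) variance carried by that single feature
share = stds[order[0]] ** 2 / (stds ** 2).sum()
print(f"{largest_feature} carries {share:.1%} of the total raw variance")
```

Without scaling, the neighbors are therefore chosen almost entirely by that one feature; `StandardScaler` gives every feature unit variance so that all of them contribute to the distance.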


Q7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.


In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
X, y = load_wine(return_X_y=True)

# Feature scaling
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
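
To turn these ratios into a decision about how many components to keep, one common approach is to look at the cumulative sum. The sketch below reuses the values printed above (hard-coded so it runs on its own) and assumes a 95% threshold.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)
for i, c in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {c:.4f}")

# Smallest number of components whose cumulative ratio reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_95)   # 10
```

The first two components together explain about 55% of the variance, and ten are needed to reach 95%.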


Q8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on original data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
acc_original = accuracy_score(y_test, knn.predict(X_test_scaled))

# PCA (top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on PCA data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

# Results
print("Accuracy on original dataset:", acc_original)
print("Accuracy on PCA dataset (2 components):", acc_pca)


Accuracy on original dataset: 0.9629629629629629
Accuracy on PCA dataset (2 components): 0.9814814814814815
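
With 54 test samples, the two accuracies above differ by a single prediction (52 versus 53 correct), so one split is weak evidence that PCA helps. A hedged sketch of a cross-validated comparison follows; the 5-fold setup and the helper name `cv_accuracy` are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def cv_accuracy(n_components=None):
    """Per-fold accuracy of scaled KNN, optionally with PCA in between (illustrative helper)."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    steps.append(KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(make_pipeline(*steps), X, y, cv=cv)

scores_all = cv_accuracy()
scores_pca2 = cv_accuracy(n_components=2)
print(f"All 13 features : {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"PCA, 2 components: {scores_pca2.mean():.3f} +/- {scores_pca2.std():.3f}")
```

Comparing the fold-to-fold spread with the difference in means gives a better sense of whether the two setups really differ.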


Q9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with Euclidean distance
knn_eu = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_eu.fit(X_train_scaled, y_train)
acc_eu = accuracy_score(y_test, knn_eu.predict(X_test_scaled))

# KNN with Manhattan distance
knn_man = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_man.fit(X_train_scaled, y_train)
acc_man = accuracy_score(y_test, knn_man.predict(X_test_scaled))

print("Accuracy (Euclidean):", acc_eu)
print("Accuracy (Manhattan):", acc_man)


Accuracy (Euclidean): 0.9629629629629629
Accuracy (Manhattan): 0.9629629629629629
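
The two metrics are both special cases of the Minkowski distance (p = 1 for Manhattan, p = 2 for Euclidean), which is also the default metric family of `KNeighborsClassifier`. A small worked example, with a helper named `minkowski` written for this illustration:

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors (illustrative helper)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print("Manhattan:", manhattan)     # 7.0
print("Euclidean:", euclidean)     # 5.0
```

The two metrics can rank neighbors differently in general; on this particular split they happen to produce the same accuracy.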


Q10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
>- 1. Using PCA to reduce dimensionality

Gene expression data has thousands of correlated features.

PCA transforms them into fewer uncorrelated components that retain maximum variance.

This reduces noise and redundancy and lowers the risk of overfitting.

2. Deciding how many components to keep

Keep enough components to explain about 90–95% of the cumulative variance.

Use:

Explained variance ratio

Scree plot (optional)

This balances information retention vs complexity.

3. Using KNN after PCA

Apply KNN on PCA-transformed data.

Fewer dimensions → more meaningful distance calculations.

Can improve accuracy and speed for distance-based models.

4. Model evaluation

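Use a train–test split or, preferably when samples are few, stratified cross-validation. Fit the scaler and PCA on the training data only to avoid data leakage.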

Metrics:

Accuracy

Precision / Recall (important in medical diagnosis)

Confusion matrix

5. Justification to stakeholders (business/clinical)

✔ Reduces overfitting in small-sample, high-feature data

✔ Improves model stability and generalization

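✔ Faster to run and simple to explain (each prediction comes from the most similar known patients), although principal components do not map to individual genes and reliability still has to be confirmed on independent data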

✔ Suitable for real-world biomedical decision support


In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset (small stand-in for gene expression data; it has only 30 features)
X, y = load_breast_cancer(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA (retain 95% variance)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

# Evaluation
y_pred = knn.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)

print("Number of PCA components:", pca.n_components_)
print("Total explained variance:", pca.explained_variance_ratio_.sum())
print("KNN Accuracy after PCA:", accuracy)


Number of PCA components: 10
Total explained variance: 0.9513920521735783
KNN Accuracy after PCA: 0.9649122807017544
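
The single split above yields one accuracy figure. A hedged sketch of the fuller evaluation described in step 4 follows: the whole pipeline is cross-validated so that scaling and PCA are refitted inside each fold, and a confusion matrix plus precision and recall are reported. The 5-fold setup is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Same stand-in dataset as above (label 0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep components explaining 95% of variance
    KNeighborsClassifier(n_neighbors=5),
)

# Every sample is predicted by a model that never saw it during fitting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(pipe, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred, target_names=["malignant", "benign"]))
```

In a clinical setting, the recall for the malignant class is usually the number to watch, since a missed cancer is typically costlier than a false alarm.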
