1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**K-Nearest Neighbors (KNN)** is a supervised machine learning algorithm used for both classification and regression tasks. It is a non-parametric and instance-based (lazy learning) algorithm, meaning it does not build an explicit model during training. Instead, it stores the entire training dataset and makes predictions only when a new data point needs to be classified or predicted.

KNN works based on the idea of similarity. When a new data point is given, the algorithm calculates the distance between this point and every point in the training dataset. Common distance measures include Euclidean distance, Manhattan distance, and Minkowski distance (which generalizes the other two). After computing the distances, the algorithm selects the **K** nearest data points (neighbors) under the chosen distance metric.
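To make the distance measures concrete, here is a minimal sketch using NumPy. The helper name `minkowski_distance` and the two sample points are illustrative assumptions, not part of any library API; the sketch relies only on the standard fact that Minkowski distance of order p = 1 is Manhattan distance and of order p = 2 is Euclidean distance.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two illustrative points in a 2-D feature space
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print("Manhattan distance:", manhattan)
print("Euclidean distance:", euclidean)
```

In scikit-learn the same choice is made through the `metric` (and, for Minkowski, `p`) parameters of `KNeighborsClassifier`.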

In **classification problems**, KNN assigns the new data point to the class that is most common among its K nearest neighbors. This is known as majority voting. For example, if K = 5 and among the five nearest neighbors, three belong to Class A and two belong to Class B, the new data point is classified as Class A.

In **regression problems**, instead of majority voting, KNN predicts the average of the target values of the K nearest neighbors. A common variant uses a distance-weighted average, in which closer neighbors contribute more to the prediction.
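The voting and averaging steps described above can be sketched from scratch as follows. This is a simplified illustration that assumes Euclidean distance and unweighted neighbors; the function names and the toy one-dimensional data are hypothetical, and it is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np

def k_nearest_indices(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]

def knn_classify(X_train, y_train, x_new, k):
    """Classification: majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_new, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]

def knn_regress(X_train, y_train, x_new, k):
    """Regression: plain mean of the k nearest neighbors' target values."""
    idx = k_nearest_indices(X_train, x_new, k)
    return float(np.mean(y_train[idx]))

# Toy data: two well-separated groups on a line
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array(["A", "A", "A", "B", "B", "B"])
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
x_new = np.array([2.5])

predicted_class = knn_classify(X_train, labels, x_new, k=3)
predicted_value = knn_regress(X_train, values, x_new, k=3)

print("Predicted class:", predicted_class)
print("Predicted value:", predicted_value)
```

The three nearest neighbors of 2.5 are the points at 1, 2, and 3, so the vote is unanimous for class A and the regression prediction is their mean.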

The choice of K is important. A small value of K may lead to overfitting (sensitive to noise), while a large value of K may lead to underfitting (overly smooth predictions). Overall, KNN is simple, intuitive, and effective, especially when the dataset is small and well-structured.
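One way to see the effect of K in practice is to compare cross-validated accuracy over several values. The sketch below uses the Wine dataset with standardized features; the particular list of K values and the use of 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of K (an illustrative choice, from very small to fairly large)
k_values = [1, 3, 5, 11, 21, 51]

mean_scores = {}
for k in k_values:
    # Scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in mean_scores.items():
    print(f"k={k:>2}: mean CV accuracy = {score:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best k by mean CV accuracy:", best_k)
```

The best K depends on the dataset, so a sweep like this is usually repeated for each new problem rather than reused.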


2. What is the Curse of Dimensionality and how does it affect KNN performance?

The **Curse of Dimensionality** refers to the problems that arise when working with data in very high-dimensional spaces (that is, when the number of features is large). As the number of dimensions increases, the volume of the feature space grows exponentially, and the data points become increasingly sparse. This sparsity makes it difficult for machine learning algorithms to find meaningful patterns because the notion of “closeness” or similarity between points becomes less reliable.

In the context of **K-Nearest Neighbors (KNN)**, the curse of dimensionality significantly affects performance because KNN relies entirely on distance calculations to identify the nearest neighbors. In high-dimensional spaces, the distances from a query point to the other points tend to become very similar to one another. The gap between the nearest and the farthest neighbor shrinks relative to the distances themselves, making it hard to tell which points are truly close. As a result, KNN may select neighbors that are not genuinely similar, leading to poor predictions.
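The shrinking gap between nearest and farthest neighbors can be observed with a small simulation. This sketch assumes points drawn uniformly at random from the unit hypercube, which is an idealized setting chosen for illustration; the helper `relative_contrast` is a hypothetical name for the quantity (farthest − nearest) / nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Compare low-dimensional and high-dimensional spaces
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"dimensions={d:>4}: relative contrast = {c:.3f}")
```

A relative contrast near zero means the nearest and farthest points are almost equally far away, which is the situation in which nearest-neighbor lookups carry little information.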

Additionally, with more dimensions, more data is required to maintain the same level of accuracy. If the dataset is not large enough, the model may overfit or perform inconsistently. Computational cost also increases because distance calculations must be performed across many features.

Overall, the curse of dimensionality reduces the effectiveness of distance-based algorithms like KNN by making distance measures less meaningful, increasing data sparsity, and raising computational complexity. Dimensionality reduction techniques such as feature selection or Principal Component Analysis (PCA) are often used to mitigate this problem.


3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance (information) as possible. PCA works by transforming the original correlated features into a new set of uncorrelated variables called **principal components**. These components are linear combinations of the original features and are arranged in such a way that the first principal component captures the maximum variance in the data, the second captures the next highest variance, and so on. By selecting only the top few principal components, we can reduce dimensionality while retaining most of the important information in the dataset.

PCA is different from **feature selection** in a fundamental way. Feature selection chooses a subset of the original features based on certain criteria (such as correlation, importance scores, or statistical tests). It does not modify the original features; it simply removes the less important ones. In contrast, PCA does not select existing features. Instead, it creates entirely new features (principal components) by combining the original ones. Therefore, PCA is a **feature extraction** method, while feature selection is a subset-selection method; both are forms of dimensionality reduction, but only feature selection keeps the original, directly interpretable features.

In summary, PCA transforms the feature space into a new lower-dimensional space by creating new variables, whereas feature selection keeps some of the original variables and discards others. PCA focuses on capturing maximum variance, while feature selection focuses on identifying and retaining the most relevant original features.
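The contrast can be illustrated on the Wine dataset. In this sketch, `SelectKBest` with the ANOVA F-test (`f_classif`) stands in for feature selection; that scoring function and the choice of three features or components are assumptions made for the example, not the only options.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept_columns = selector.get_support(indices=True)
print("Kept original features:", [data.feature_names[i] for i in kept_columns])

# Feature extraction: build 3 new columns, each a mix of all 13 originals
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("PC1 loadings on the original features:", np.round(pca.components_[0], 2))

# The selected matrix is literally a column subset of the input
print("Selected == column subset:", np.allclose(X_selected, X_scaled[:, kept_columns]))
```

Note that `SelectKBest` uses the labels `y`, while PCA is unsupervised and looks only at the variance of `X`.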


4. What are eigenvalues and eigenvectors in PCA, and why are they important?

In **Principal Component Analysis (PCA)**, eigenvalues and eigenvectors are mathematical concepts derived from the covariance matrix of the dataset, and they play a central role in determining the principal components.

An **eigenvector** of the covariance matrix represents a direction in the feature space. In PCA, each eigenvector corresponds to a principal component, and together these vectors define the new axes onto which the original data is projected. The eigenvector with the largest eigenvalue points in the direction of maximum variance in the data, the next one points in the direction of the highest remaining variance (orthogonal to the first), and so on.

An **eigenvalue** represents the amount of variance captured along its corresponding eigenvector. In other words, it tells us how important that principal component is. A larger eigenvalue means that the principal component explains a greater portion of the total variance in the dataset.

Eigenvalues and eigenvectors are important in PCA because they help in identifying the most informative directions in the data. By ranking eigenvectors based on their eigenvalues (from highest to lowest), we can select the top principal components that capture most of the variance. This allows us to reduce the dimensionality of the dataset while retaining as much useful information as possible. Without eigenvalues and eigenvectors, PCA would not be able to determine which directions preserve the most significant patterns in the data.
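The ranking procedure described above can be reproduced by hand with NumPy and checked against scikit-learn. This is a sketch of the textbook covariance-matrix route on the standardized Wine data; scikit-learn's `PCA` computes the same quantities through a singular value decomposition, so the comparison is on results rather than on method.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is suited to symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank directions from highest to lowest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance captured by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio (manual):", np.round(explained_ratio, 4))

# Project the data onto the top 2 eigenvectors
X_projected = X_scaled @ eigenvectors[:, :2]
print("Projected shape:", X_projected.shape)

# Cross-check against scikit-learn
pca = PCA().fit(X_scaled)
print("Matches sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

Individual eigenvectors may differ from `pca.components_` by a sign flip, which does not change the variance they capture.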


5. How do KNN and PCA complement each other when applied in a single pipeline?

K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) complement each other effectively when used together in a single machine learning pipeline, especially for high-dimensional datasets.

KNN is a distance-based algorithm that relies heavily on calculating distances between data points. However, in high-dimensional spaces, the **curse of dimensionality** makes distance measures less meaningful, and KNN performance can degrade. This is where PCA becomes useful. PCA reduces the number of features by transforming the original data into a smaller set of principal components that retain most of the variance. By discarding low-variance directions (which often contain noise) and removing linear correlations among the variables, PCA can make distance calculations more reliable.

When PCA is applied before KNN in a pipeline, it simplifies the feature space and often improves classification or regression accuracy. It also reduces computational cost because KNN has fewer dimensions to process when calculating distances. Additionally, since PCA creates uncorrelated components, it can enhance KNN’s performance by focusing on the most informative directions in the data.

In summary, PCA can improve KNN by reducing dimensionality, suppressing noise, and making distance metrics more meaningful, so that KNN works on a cleaner and more compact feature space. Together, they form a more efficient and often more accurate machine learning pipeline.
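As a rough illustration of such a pipeline, the sketch below chains scaling, PCA, and KNN with scikit-learn's `Pipeline` and compares it with KNN on all scaled features. The Wine dataset, the 95% variance threshold, K = 5, and 5-fold cross-validation are all assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Baseline: scaling followed by KNN on all 13 features
knn_only = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Combined: scaling, then PCA keeping 95% of the variance, then KNN
pca_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Every step is refit on each training fold, so no test-fold information leaks
scores_knn = cross_val_score(knn_only, X, y, cv=5)
scores_pca_knn = cross_val_score(pca_knn, X, y, cv=5)

print(f"KNN only : {scores_knn.mean():.3f}")
print(f"PCA + KNN: {scores_pca_knn.mean():.3f}")
```

Whether the PCA step helps depends on the data; on a dataset with only 13 features the difference may be small, and the benefit is expected to be larger when there are many correlated or noisy features.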


6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

In [1]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# ----------------------------
# KNN without Feature Scaling
# ----------------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ----------------------------
# KNN with Feature Scaling
# ----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy without Scaling:", accuracy_no_scaling)
print("Accuracy with Scaling:", accuracy_scaled)

Accuracy without Scaling: 0.7407407407407407
Accuracy with Scaling: 0.9629629629629629


7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

In [2]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
data = load_wine()
X = data.data

# Feature scaling (important before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keep all components)
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [3]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------
# Step 1: Scaling
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# KNN on Original Scaled Data
# -----------------------------
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -----------------------------
# Step 2: Apply PCA (Top 2 Components)
# -----------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# -----------------------------
# KNN on PCA-Transformed Data
# -----------------------------
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# -----------------------------
# Print Results
# -----------------------------
print("Accuracy with Original Features:", accuracy_original)
print("Accuracy with PCA (Top 2 Components):", accuracy_pca)

Accuracy with Original Features: 0.9629629629629629
Accuracy with PCA (Top 2 Components): 0.9814814814814815


9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.


In [4]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# KNN with Euclidean Distance
# -----------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -----------------------------
# KNN with Manhattan Distance
# -----------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print Results
print("Accuracy with Euclidean Distance:", accuracy_euclidean)
print("Accuracy with Manhattan Distance:", accuracy_manhattan)

Accuracy with Euclidean Distance: 0.9629629629629629
Accuracy with Manhattan Distance: 0.9629629629629629


10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:
- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data



1. Use PCA to Reduce Dimensionality

Gene expression datasets often contain thousands of genes (features), many of which are correlated or noisy. PCA helps by:

- Transforming correlated genes into uncorrelated principal components.
- Capturing maximum variance in fewer dimensions.
- Reducing noise and redundancy.
- Making distance-based models like KNN more reliable.

Before applying PCA:

- Standardize the data (important because PCA is variance-based).
- Fit PCA on training data only (to avoid data leakage).

---

2. Decide How Many Components to Keep

We can decide the number of components using:

- Explained variance ratio
- Cumulative variance threshold (e.g., retain 95% variance)
- Scree plot (elbow method)

For biomedical data, retaining 90–95% variance is common because:

- We preserve biological signal.
- We remove noise.
- We drastically reduce dimensionality.
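The cumulative-variance rule can be sketched in a few lines. The Wine data is used here only as a convenient stand-in for a gene expression matrix, and the 95% threshold is the example value from the list above. On real gene expression data with far fewer samples than genes, PCA can return at most (number of samples − 1) informative components, so the threshold is applied within that limit.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Fit PCA with all components, then accumulate the explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)

for i, total in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {total:.4f}")
print(f"Components needed for {threshold:.0%} variance:", n_components)

# scikit-learn applies the same rule when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("sklearn picks:", pca_auto.n_components_)
```

Plotting `cumulative` against the component index gives the scree-style curve used for the elbow method.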

---

3. Use KNN for Classification After PCA

KNN is appropriate because:

- It is non-parametric (no strong assumptions).
- It works well when dimensionality is reduced.
- It makes decisions based on similarity, which is biologically intuitive for gene expression patterns.

After PCA:

- Train KNN on transformed data.
- Tune k using cross-validation.
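Tuning k with cross-validation might look like the following sketch, which wraps the scaler, PCA, and KNN in a single pipeline and searches over `n_neighbors` with `GridSearchCV`. The Wine dataset again stands in for real data, and the candidate grid and fold settings are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# Step name + double underscore + parameter name addresses a pipeline step
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print("Best k:", best_k)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

Because the scaler and PCA are inside the pipeline, they are refit on the training portion of every fold, which keeps the tuning free of leakage. The number of PCA components could be added to the same grid under the key `pca__n_components`.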

---

4. Evaluate the Model

Since biomedical datasets are small:

- Use stratified cross-validation.
- Measure:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - Possibly ROC-AUC (for binary cancer classification)

Cross-validation reduces optimistic bias and improves reliability.

In [5]:
# Import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Simulate high-dimensional dataset (100 samples, 1000 features)
X, y = make_classification(
    n_samples=100,
    n_features=1000,
    n_informative=50,
    n_redundant=100,
    n_classes=2,
    random_state=42
)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Build PCA + KNN pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # retain 95% variance
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate accuracy
test_accuracy = accuracy_score(y_test, y_pred)

# Cross-validation accuracy
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("Test Accuracy:", test_accuracy)
print("Cross-Validation Accuracy:", np.mean(cv_scores))
print("Number of PCA Components Retained:",
      pipeline.named_steps['pca'].n_components_)

Test Accuracy: 0.43333333333333335
Cross-Validation Accuracy: 0.54
Number of PCA Components Retained: 63
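The accuracies printed above are close to the 0.5 expected from guessing on a roughly balanced two-class problem, so this synthetic run shows the mechanics of the pipeline rather than evidence that it performs well. To cover the other metrics listed in the evaluation step, a sketch along the following lines could be used. It assumes the same simulated data, 5 shuffled stratified folds, and scikit-learn's built-in scorer names for a binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated high-dimensional data as in the cell above
X, y = make_classification(
    n_samples=100,
    n_features=1000,
    n_informative=50,
    n_redundant=100,
    n_classes=2,
    random_state=42,
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds preserve the class balance in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

# Average each metric over the folds
mean_scores = {metric: results[f"test_{metric}"].mean() for metric in scoring}
for metric, score in mean_scores.items():
    print(f"{metric:>9}: {score:.3f}")
```

Reporting the spread across folds (for example the standard deviation of each metric) alongside the mean is a reasonable addition when the sample size is this small.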
