# Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?


Answer:

**Introduction**  
K-Nearest Neighbors (KNN) is a simple and intuitive supervised machine learning algorithm used for both classification and regression tasks. It is a non-parametric and instance-based method, meaning it does not build an explicit model during training. Instead, it stores the training data and makes predictions by comparing new input samples with existing data points. The main idea behind KNN is that samples that are close to each other in feature space tend to have similar outputs.

---

**What is KNN**  
KNN identifies the K closest data points from the training dataset when a new input point is given. These neighbors are determined using distance metrics such as Euclidean distance or Manhattan distance. Once these nearest points are found, the prediction is made based on their values. Since KNN delays computation until prediction time and simply stores the dataset during training, it is called a lazy learner.

---

**How KNN Works**  
1. Choose a value for K.  
2. Compute the distance between the new data point and all training samples.  
3. Identify the K nearest neighbors based on distance.  
4. Make the prediction based on these neighbors.

Although the steps remain the same, the way predictions are made differs between classification and regression.
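
The four steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how a library implements KNN; the function name `knn_predict`, the `task` parameter, and the toy data are all hypothetical and chosen only for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: distance from the query point to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 4: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical): two well-separated groups in 2D
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 14.0, 50.0, 52.0, 54.0]

print(knn_predict(X_toy, labels, [1.1, 1.0], k=3))                      # A
print(knn_predict(X_toy, values, [1.1, 1.0], k=3, task="regression"))  # 12.0
```

Only step 4 changes between the two tasks; the neighbor search is identical.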

---

**KNN for Classification**  
In classification, KNN predicts the class label of a new data point by applying **majority voting** among the K nearest neighbors. The class that appears most frequently among these neighbors becomes the predicted class.

*Example:*

If K = 5 and the neighbors belong to classes A, A, B, A, and B, the predicted class is **A**.

---

**KNN for Regression**  
In regression tasks, KNN predicts a continuous numerical value by taking the **average** or **weighted average** of the K nearest neighbors' values. This makes KNN flexible for both discrete and continuous output variables.

*Example:*

If K = 3 and the neighbor values are 10, 12, and 14, the predicted value is:  
(10 + 12 + 14) / 3 = **12**
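
To show how a weighted average differs from the plain average, here is a small sketch that weights each neighbor by the inverse of its distance. The helper `weighted_average` and the three distances are assumptions made up for illustration; only the neighbor values 10, 12, and 14 come from the example above.

```python
def weighted_average(values, distances, eps=1e-9):
    """Inverse-distance weighted mean: closer neighbors count more."""
    # eps avoids division by zero when a neighbor coincides with the query
    weights = [1.0 / (d + eps) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


neighbor_values = [10, 12, 14]
neighbor_distances = [1.0, 2.0, 4.0]  # hypothetical distances to the query

plain = sum(neighbor_values) / len(neighbor_values)
weighted = weighted_average(neighbor_values, neighbor_distances)

print(plain)               # 12.0
print(round(weighted, 2))  # 11.14
```

The weighted prediction is pulled toward 10 because that neighbor is closest. In scikit-learn, the same idea is available through `weights="distance"` on `KNeighborsRegressor`.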

---

**Distance Metrics Used in KNN**  
- Euclidean Distance  
- Manhattan Distance  
- Minkowski Distance  
- Hamming Distance (for categorical data)

The choice of distance metric influences how neighbors are determined and affects accuracy.
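
As a rough illustration of how these metrics relate, the sketch below implements Minkowski distance (which reduces to Manhattan at p = 1 and Euclidean at p = 2) and a normalized Hamming distance. The function names and sample vectors are hypothetical; Hamming distance is expressed here as a fraction of mismatching positions, although some texts report it as a raw count.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


u, v = [0, 0], [3, 4]
print(minkowski(u, v, p=2))  # 5.0 (Euclidean)
print(minkowski(u, v, p=1))  # 7.0 (Manhattan)

# Two of the three categorical attributes differ
print(hamming(["red", "S", "yes"], ["red", "M", "no"]))
```

The same pair of points is 5 units apart under Euclidean distance and 7 under Manhattan distance, so the ranking of neighbors can change with the metric.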

---

**Choosing the Value of K**  
- A **small K** makes the model sensitive to noise and can lead to overfitting.  
- A **large K** smooths decision boundaries too much and may cause underfitting.  

Cross-validation is commonly used to select the optimal K.
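
One way this selection might look with scikit-learn is sketched below. The candidate list and the use of the Wine dataset are assumptions for illustration; scaling is placed inside the pipeline so that each fold is scaled using its own training rows.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of K (an arbitrary illustrative grid)
candidate_ks = [1, 3, 5, 7, 9, 11, 15]

mean_scores = {}
for k in candidate_ks:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated accuracy for this K
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:2d}  mean CV accuracy={score:.3f}")
print("Best K:", best_k)
```

Odd values of K are a common choice in classification because they reduce the chance of tied votes.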

---

**Advantages of KNN**  
- Simple to understand and easy to implement  
- Negligible training cost, since fitting only stores the training data  
- Works for both classification and regression  
- Effective for smaller datasets  

---

**Limitations of KNN**  
- Slow at prediction time on large datasets, because each query is compared against the stored training samples  
- Sensitive to irrelevant and unscaled features  
- Poor performance in high-dimensional spaces  
- Requires storing the entire dataset in memory  

---

**Conclusion**  
K-Nearest Neighbors is an intuitive algorithm that predicts outcomes based on similarity between data points. It uses majority voting for classification and averaging for regression. Although simple, its performance depends on choosing an appropriate K value, selecting a proper distance metric, and ensuring good feature scaling. When applied correctly, KNN can deliver reliable and interpretable results for a wide range of applications.

# Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:

**Introduction**  
The Curse of Dimensionality refers to a set of problems that arise when data is represented in a high number of dimensions (features). As dimensionality increases, data becomes sparse, distance measures become less meaningful, and algorithms that rely on proximity or density begin to perform poorly. This phenomenon significantly impacts KNN because KNN makes predictions entirely based on distance calculations.

---

**Definition of Curse of Dimensionality**  
The Curse of Dimensionality describes how the behavior of data, distance, and volume changes as the number of features increases. In high-dimensional spaces, pairwise distances grow and concentrate around a similar value, so the gap between a point's nearest and farthest neighbors shrinks relative to the distances themselves. Additionally, the amount of data required to cover the space at a fixed density grows exponentially with the number of dimensions, making it difficult to learn meaningful patterns.

---

**Why the Curse Occurs**  
As the number of features increases:  
- The volume of the feature space expands rapidly.  
- Data becomes more spread out and sparse.  
- All points begin to appear nearly equidistant from each other.  
- Traditional distance metrics lose their ability to distinguish between near and far points.

These conditions undermine algorithms that rely on neighborhood relationships.
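
The "nearly equidistant" effect can be observed with a small simulation. The sketch below draws uniform random points (an assumption; real data is rarely uniform) and measures the relative contrast, defined here as (farthest distance - nearest distance) / nearest distance, from one random query point. The function name and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:5d} dimensions: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, meaning the nearest neighbor is barely closer than the farthest one.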

---

**Impact on KNN Performance**  
KNN depends heavily on identifying the closest neighbors using distance measures. However, in high-dimensional spaces, distances between data points become similar, making it difficult for KNN to find truly nearest neighbors.

Effects include:  
1. **Reduced Accuracy**  
   KNN struggles to accurately classify or predict because the distinction between close and distant points weakens.

2. **Increased Noise Sensitivity**  
   Irrelevant or unimportant features distort distance calculations and mislead the algorithm.

3. **Higher Computational Cost**  
   Calculating distances in high-dimensional spaces becomes more expensive and time-consuming.

4. **Overfitting**  
   With many dimensions and few data points, KNN may fit noise instead of meaningful patterns.

---

**Example of the Problem**  
In a low-dimensional space, such as 2D, points from the same class can form clearly separated clusters. In a 100-dimensional space with many uninformative features, even points from the same class can lie about as far apart as points from different classes. As a result, KNN may misclassify points because the "nearest" neighbors are not truly similar.

---

**How to Reduce the Curse of Dimensionality**  
- **Feature Selection**: Removing irrelevant features to reduce dimensionality.  
- **Dimensionality Reduction Techniques**: Methods like PCA or LDA can compress data into fewer meaningful dimensions; t-SNE is mainly suited to visualization rather than as a preprocessing step for KNN.  
- **Normalization**: Scaling features can help distances behave more consistently.  

These techniques help improve KNN’s accuracy and efficiency.

In short, the Curse of Dimensionality causes distance metrics to lose meaning in high-dimensional spaces, leading to poor KNN performance. Because KNN relies entirely on distance-based neighbor selection, sparsity and noise in high-dimensional data greatly reduce its effectiveness.

# Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:

**Introduction**  
Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that transforms high-dimensional data into a smaller set of uncorrelated components. These new components are ordered so that each captures as much of the remaining variance as possible. PCA is used in machine learning to simplify datasets, reduce noise, and often improve model performance while preserving most of the important information.

---

**What is PCA**  
PCA is a mathematical transformation technique that converts original correlated features into a new set of uncorrelated variables called principal components. Each principal component is a linear combination of the original features, ordered such that the first principal component captures the highest variance, the second captures the next highest variance, and so on.

PCA works by:  
1. Standardizing the data.  
2. Computing the covariance matrix.  
3. Finding eigenvalues and eigenvectors of the covariance matrix.  
4. Forming principal components based on the eigenvectors corresponding to the largest eigenvalues.

The result is a reduced-dimensional representation that retains most of the important information from the original dataset.
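
The four steps listed above can be written out directly in NumPy. This is a minimal sketch for illustration, assuming no constant columns (which would cause division by zero when standardizing); the function name `pca_fit_transform` and the random demo data are hypothetical. In practice `sklearn.decomposition.PCA` is the usual choice.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project X onto its top principal components via the covariance matrix."""
    X = np.asarray(X, dtype=float)

    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigen-decomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by decreasing eigenvalue and keep the leading eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()

    return X_std @ components, explained_ratio


# Demo data (hypothetical): features 0 and 1 are strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

Z, ratio = pca_fit_transform(X_demo, n_components=2)
print(Z.shape)
print(ratio)
```

Because two of the three demo features carry nearly the same information, the first component alone accounts for most of the variance.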

---

**Purpose of PCA**  
The main goal of PCA is to reduce dimensionality while keeping as much variance as possible. It helps in:  
- Removing redundant or highly correlated features  
- Reducing computational costs  
- Improving model performance  
- Visualizing high-dimensional data in 2D or 3D  

---

**What PCA Does Not Do**  
PCA does not select original features but instead **creates new features** that are combinations of the original ones. These new features may not have direct interpretability but carry essential information.

---

**Difference Between PCA and Feature Selection**  

**1. Nature of Output**  
- **PCA:** Produces new transformed features called principal components.  
- **Feature Selection:** Selects a subset of the original features without altering them.

**2. Interpretability**  
- **PCA:** Components are not easily interpretable because they are linear combinations of multiple features.  
- **Feature Selection:** Retains original features, so interpretability remains intact.

**3. Purpose**  
- **PCA:** Reduces dimensionality by transforming features and minimizing redundancy.  
- **Feature Selection:** Reduces dimensionality by selecting the most relevant features and removing irrelevant ones.

**4. Relationship with Original Data**  
- **PCA:** Creates new axes that maximize variance.  
- **Feature Selection:** Keeps original features and discards the rest.

**5. Handling Correlated Features**  
- **PCA:** Handles correlation by combining highly correlated features into a single component.  
- **Feature Selection:** Removes or keeps features but does not combine them.

---

*Example:*

If you have 10 features and many of them are correlated, PCA may reduce the dataset to 3 principal components that capture 90 percent of the variance.  
Feature selection, however, might choose 3 out of the original 10 features based on importance or correlation thresholds.
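
The contrast between the two approaches can be made concrete with a short sketch. The data is random and deliberately left unstandardized, and "keep the two highest-variance columns" is only one illustrative selection criterion among many; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Five features with different spreads (hypothetical data)
X = rng.normal(size=(100, 5)) * np.array([3.0, 0.1, 2.0, 0.5, 1.0])

# Feature selection: keep 2 original columns unchanged
keep = np.sort(np.argsort(X.var(axis=0))[-2:])
X_selected = X[:, keep]

# PCA via SVD on centered data: each new column mixes all 5 features
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:2].T

print("Selected original columns:", keep)
print("PCA loadings (rows = components, columns = original features):")
print(np.round(Vt[:2], 2))
```

`X_selected` contains two untouched original columns, whereas every column of `X_pca` is a weighted combination of all five features, as the loadings show.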

# Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?


Answer:

**Introduction**  
Eigenvalues and eigenvectors are fundamental mathematical concepts used in Principal Component Analysis (PCA). They help identify the directions of maximum variance in the data and determine how much variance each direction captures. Understanding them is essential for understanding how PCA reduces dimensionality.

---

**What Are Eigenvalues**  
Eigenvalues are scalar values that represent the amount of variance captured by each principal component. In the context of PCA, each eigenvalue corresponds to a specific eigenvector and indicates how significant that component is.  
A larger eigenvalue means the corresponding component captures more variability in the data.

---

**What Are Eigenvectors**  
Eigenvectors are direction vectors that show where the data varies the most. They define the new axes (principal components) onto which PCA projects the original data.  
The first eigenvector points in the direction of maximum variance; each subsequent eigenvector points in the direction of maximum remaining variance that is orthogonal to the previous ones.

---

**Role of Eigenvalues and Eigenvectors in PCA**  
1. **Eigenvectors determine the direction of principal components**  
   PCA rotates the original feature space to align with directions of maximum variance. These directions are given by eigenvectors of the covariance matrix.

2. **Eigenvalues determine the importance of each principal component**  
   Components with larger eigenvalues capture more information about data variation. Components with very small eigenvalues contribute little and can be removed during dimensionality reduction.

3. **Ranking of Components**  
   PCA ranks principal components based on eigenvalues. The first component has the largest eigenvalue and captures the highest variance.

4. **Dimensionality Reduction**  
   PCA keeps only the components with the highest eigenvalues. This reduces dimensions while preserving most of the information contained in the original dataset.

---

*Example:*
  
If the eigenvalues of a dataset are 5.2, 1.8, 0.3, and 0.02, it means:  
- The first component explains most of the variance.  
- The fourth component explains almost no variance and can be safely removed.

The corresponding eigenvectors give the directions of these components in the feature space.
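
To put numbers on this example, each eigenvalue can be divided by the sum of all eigenvalues to get the share of variance its component explains. The short sketch below uses only the four eigenvalues given above.

```python
eigenvalues = [5.2, 1.8, 0.3, 0.02]

total = sum(eigenvalues)
ratios = [ev / total for ev in eigenvalues]

# Running total of explained variance
cumulative = []
running = 0.0
for r in ratios:
    running += r
    cumulative.append(running)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```

The first component explains about 71% of the variance and the first two together about 96%, while the fourth contributes roughly 0.3%.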

---

**Why They Are Important**  
- They help identify which directions in the data have the most significant patterns.  
- They determine how many components should be retained for effective dimensionality reduction.  
- They reduce redundancy by capturing maximum variance using fewer components.  
- They make downstream computation cheaper, because later steps work with a few components instead of every original feature.

---

**Conclusion**  
Eigenvalues and eigenvectors are central to PCA. Eigenvectors define the directions of the new axes (principal components), while eigenvalues show how much variance each component captures. Together, they allow PCA to reduce dimensionality effectively while preserving the most important information in the data.

# Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Answer:

**Introduction**  
K-Nearest Neighbors (KNN) is a distance-based algorithm that performs well when data is low-dimensional and well-structured. Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a smaller set of components. When used together in a pipeline, PCA can improve the speed of KNN and, when the discarded components are mostly noise, its accuracy and reliability.

---

**Why PCA Is Needed Before KNN**  
KNN relies entirely on distance calculations to find the nearest neighbors. In high-dimensional data, distances become less meaningful due to the curse of dimensionality. PCA reduces the number of dimensions while retaining the most important variance, making distance-based comparisons more accurate and meaningful for KNN.

---

**How PCA Improves KNN Performance**  

1. **Reduces Noise and Irrelevant Features**  
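   PCA compresses data by keeping the components with the highest variance. When noise and unimportant features account for little variance, this discards them and lets KNN work with cleaner, more informative data. Because PCA ignores the labels, a low-variance direction that is useful for prediction can also be lost.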

2. **Mitigates the Curse of Dimensionality**  
   By reducing dimensions, PCA ensures that neighbors in KNN are meaningful and not distorted by high-dimensional sparsity.

3. **Improves Computational Efficiency**  
   KNN becomes faster because fewer dimensions mean fewer distance calculations. This is especially valuable for large datasets.

4. **Enhances Model Accuracy**  
   With PCA merging redundant and correlated features, KNN can often make more accurate predictions, although keeping too few components can lower accuracy.

5. **Reduces Overfitting**  
   KNN can overfit when many irrelevant features influence distance calculations. PCA reduces this risk by compressing the feature space to the most important directions.

---

**Pipeline Workflow for PCA and KNN**  
1. Standardize the data  
2. Apply PCA to reduce dimensionality  
3. Use the transformed principal components as input to KNN  
4. Perform classification or regression using the K nearest neighbors  

This pipeline gives KNN a compact, decorrelated input; the number of components should still be validated against held-out accuracy.

---

*Example:*

Suppose a dataset has 200 features, many of which are correlated or irrelevant. Without PCA, KNN may produce inaccurate results because distance calculations become unreliable.  
After applying PCA, the dataset might be reduced to 20 principal components that capture most of the variance. KNN then performs more accurately and efficiently on these components.

# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN WITHOUT SCALING
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
pred_no_scale = knn_no_scale.predict(X_test)

accuracy_no_scaling = accuracy_score(y_test, pred_no_scale)

# KNN WITH SCALING
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
pred_scaled = knn_scaled.predict(X_test_scaled)

accuracy_with_scaling = accuracy_score(y_test, pred_scaled)

# Print comparison
print("Accuracy without scaling:", accuracy_no_scaling)
print("Accuracy with scaling:", accuracy_with_scaling)

Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444


# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X = wine.data

# Feature scaling is important before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (all components)
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of Each Principal Component:\n")
for idx, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{idx+1}: {ratio:.4f}")

Explained Variance Ratio of Each Principal Component:

PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1. KNN ON ORIGINAL SCALED DATA

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on original scaled data
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)

# 2. KNN ON PCA-TRANSFORMED DATA (TOP 2 COMPONENTS)


# Apply PCA to keep top 2 principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)

# 3. Print comparison
print("Accuracy of KNN on original scaled data:", accuracy_original)
print("Accuracy of KNN on PCA-transformed data (top 2 components):", accuracy_pca)

Accuracy of KNN on original scaled data: 0.9722222222222222
Accuracy of KNN on PCA-transformed data (top 2 components): 0.9166666666666666


# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

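# Scale the dataset
# Note: the scaler is fit on all rows before the split, so test-set statistics
# influence the scaling; fitting on the training split only (as in Question 8)
# avoids this leakage
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)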

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# KNN with Euclidean Distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, pred_euclidean)

# KNN with Manhattan Distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, pred_manhattan)

# Print results
print("Euclidean Distance Accuracy:", acc_euclidean)
print("Manhattan Distance Accuracy:", acc_manhattan)

Euclidean Distance Accuracy: 0.9722222222222222
Manhattan Distance Accuracy: 1.0


# Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X, y = make_classification(
    n_samples=200,
    n_features=2000,
    n_informative=50,
    n_redundant=50,
    n_classes=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

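# Fit PCA on the scaled training data only, to inspect cumulative explained variance
pca_full = PCA()
pca_full.fit(X_train_scaled)

# Smallest number of components whose cumulative explained variance reaches 95%
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative_variance >= 0.95) + 1)

# PCA cannot return more components than samples; each CV training fold holds
# only (n_splits - 1) / n_splits of the training rows, so cap the count there
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
max_components = int((cv.n_splits - 1) * X_train.shape[0] / cv.n_splits)
n_components_95 = min(n_components_95, max_components)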

pca = PCA(n_components=n_components_95)
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', pca),
    ('knn', knn)
])

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print("Number of components (>=95% or capped by CV limit):", n_components_95)
print("First 10 cumulative explained variance values:", cumulative_variance[:10])
print("Cross-validation accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

Number of components (>=95% or capped by CV limit): 128
First 10 cumulative explained variance values: [0.01042577 0.02064903 0.03082016 0.0408319  0.05075927 0.06065697
 0.07031695 0.07990645 0.08943378 0.09890067]
Cross-validation accuracies: [0.375   0.46875 0.40625 0.5     0.5    ]
Mean CV accuracy: 0.45
Test accuracy: 0.35

Classification report:
               precision    recall  f1-score   support

           0       0.38      0.21      0.27        14
           1       0.30      0.23      0.26        13
           2       0.36      0.62      0.46        13

    accuracy                           0.35        40
   macro avg       0.35      0.35      0.33        40
weighted avg       0.35      0.35      0.33        40

Confusion matrix:
 [[3 5 6]
 [2 3 8]
 [3 2 8]]


Here is how the code addresses each part of the question:

**Used PCA to reduce dimensionality**  
The code uses synthetic data from `make_classification` (200 samples, 2,000 features) as a stand-in for a gene expression dataset and applies PCA after scaling the features. PCA transforms the thousands of original features into a much smaller set of principal components that capture the most variance. This reduces dimensionality and is intended to limit overfitting.

---

**Decided how many components to keep**  
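The number of PCA components is chosen using the cumulative explained variance. The code looks for the smallest number of components that together explain at least 95% of the variance, then caps that number at the size of a cross-validation training fold (128 samples here), because PCA cannot produce more components than it has samples. The reported value of 128 equals this cap, which suggests the cap rather than the 95% threshold determined the final count. The first 10 components explain only about 10% of the variance, so the variance in this synthetic data is spread thinly across many directions.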

---

**Used KNN for classification post-dimensionality reduction**  
After PCA reduces the dataset, the transformed features are passed into a KNN classifier within a pipeline. KNN operates on the lower-dimensional PCA space, which is meant to make distance-based classification less sensitive to irrelevant features. Wrapping the scaler, PCA, and KNN in one pipeline also means each cross-validation fold refits all three steps on its own training rows, so no information leaks from the validation fold.

---


**Evaluated the model**  
The model is evaluated using multiple techniques:  
- Stratified k-fold cross-validation for robust accuracy estimation  
- Test accuracy on unseen data  
- A classification report showing precision, recall, and F1-score  
- A confusion matrix to analyze class-wise performance  

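These evaluation methods give a fuller picture than a single accuracy figure. In this run they show weak performance: mean cross-validation accuracy is 0.45 and test accuracy is 0.35, close to the 0.33 expected from random guessing among three classes.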
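
As a sanity check on the reported metrics, the figures in the classification report can be recomputed from the confusion matrix printed in the output, alongside a simple baseline. The baseline used here (always predicting the most frequent true class) is an assumption added for comparison and is not part of the original code.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[3, 5, 6],
               [2, 3, 8],
               [3, 2, 8]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per true class
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class

# Accuracy of always predicting the most frequent true class
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print("Accuracy:", accuracy)
print("Recall per class:", np.round(recall, 2))
print("Precision per class:", np.round(precision, 2))
print("Majority-class baseline:", majority_baseline)
```

Both the accuracy and the baseline come out to 14/40 = 0.35, so on this test split the pipeline does no better than always predicting class 0.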

---

**Justified the pipeline for real-world biomedical data**  
Gene expression datasets in cancer research typically contain thousands of gene features but only a small number of patient samples. This makes traditional machine learning models overfit easily. PCA helps address this by compressing the high-dimensional data into a smaller set of components that can capture underlying biological variation while filtering out some noise. Using KNN after PCA provides a simple classifier whose predictions can be explained by pointing to the most similar patients in the reduced space. Keeping every step inside one cross-validated pipeline guards against data leakage and gives an honest estimate of how the model generalizes to new patients. The near-chance scores on the synthetic data above show why that honesty matters: the pipeline should be presented to stakeholders as a sound, auditable procedure whose accuracy must still be confirmed on real data, not as a guarantee of good results.