#Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Ans - K-Nearest Neighbors (KNN) is a simple, non-parametric, instance-based machine learning algorithm used for both classification and regression. It makes predictions based on the “similarity” between data points.

###How KNN Works (General Idea)

KNN does not fit an explicit model during training (it is a "lazy learner"). Instead, it stores the training data and defers all computation to prediction time:

- Calculate the distance (usually Euclidean) between the new data point and every training point.
- Select the K closest (nearest) neighbors.
- Make a prediction based on those neighbors.
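
As a minimal sketch of the first two steps (not a production implementation), the pure-Python snippet below computes Euclidean distances from a query point to a tiny training set and returns the K closest points. The function name `k_nearest` and the toy coordinates are illustrative assumptions, not part of the original answer.

```python
import math

def k_nearest(train_points, query, k):
    """Return the k training points closest to `query` as (distance, index) pairs."""
    distances = [
        (math.dist(point, query), index)   # Euclidean distance (Python 3.8+)
        for index, point in enumerate(train_points)
    ]
    distances.sort()                        # smallest distance first
    return distances[:k]

# Toy 2-D training set (hypothetical values)
train = [(1.0, 1.0), (2.0, 2.0), (8.0, 8.0), (9.0, 9.0)]
neighbors = k_nearest(train, query=(1.5, 1.5), k=2)
print(neighbors)  # indices 0 and 1 are the two closest points
```

The prediction step then depends on the task: a vote for classification, an average for regression.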

###KNN for Classification

In classification, KNN assigns a class label to a new point by majority vote among its K nearest neighbors.

Steps

- Choose K (e.g., 3, 5, 7; an odd K avoids tied votes in two-class problems).
- Compute the distance from the new sample to all training samples.
- Pick the K closest neighbors.
- Count how many neighbors belong to each class.
- Predict the most frequent class.

Example

If K = 5 and neighbors’ classes are:

Class A: 3

Class B: 2

→ Prediction = Class A
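
The voting step can be sketched with `collections.Counter`. The helper name `majority_vote` is an assumption made for illustration; the labels reproduce the K = 5 example (three A, two B).

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent label among the neighbors."""
    counts = Counter(neighbor_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Labels of the K = 5 nearest neighbors: three of class A, two of class B
neighbor_labels = ["A", "B", "A", "A", "B"]
prediction = majority_vote(neighbor_labels)
print(prediction)  # A
```

If two classes tie, `most_common` returns whichever label it encountered first, so a real implementation may want an explicit tie-breaking rule.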

###KNN for Regression

In regression, KNN predicts a numerical value rather than a class.

Steps

- Find the K nearest neighbors.
- Take the average (or distance-weighted average) of their target values.
- Use this average as the prediction.

Example

Neighbors' target values: 10, 12, 13 (K=3)
→ Predicted value = (10 + 12 + 13) / 3 ≈ 11.67
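
A short sketch of both variants follows: the plain average used in the example, and an inverse-distance weighted average. The function name `knn_regress` and the distances passed to it are hypothetical and chosen only for illustration.

```python
def knn_regress(neighbor_targets, neighbor_distances=None):
    """Average the neighbors' targets; weight by 1/distance if distances are given."""
    if neighbor_distances is None:
        return sum(neighbor_targets) / len(neighbor_targets)
    weights = [1.0 / d for d in neighbor_distances]   # assumes all distances > 0
    weighted_sum = sum(w * t for w, t in zip(weights, neighbor_targets))
    return weighted_sum / sum(weights)

targets = [10, 12, 13]
print(round(knn_regress(targets), 2))                   # 11.67
# Hypothetical distances: the closest neighbor (target 10) gets the most weight
print(round(knn_regress(targets, [1.0, 2.0, 4.0]), 2))  # 11.0
```

With weighting, the prediction moves toward the target of the closest neighbor.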

###Important Considerations

Choosing K

- Small K → sensitive to noise (low bias, high variance).
- Large K → smoother but may blur class boundaries (higher bias).
- In practice, K is usually chosen by cross-validation.

###Distance Metrics

Common options:

- Euclidean distance
- Manhattan distance
- Minkowski distance
- Cosine distance (1 − cosine similarity), often used for high-dimensional sparse data such as text
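
To make the options concrete, here is a small pure-Python sketch. Minkowski distance covers Manhattan (`p=1`) and Euclidean (`p=2`) as special cases, and the cosine entry is implemented as a distance, 1 − cosine similarity, so that smaller means closer. The function names are illustrative assumptions.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=1))                       # 7.0  (Manhattan)
print(minkowski(a, b, p=2))                       # 5.0  (Euclidean)
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))    # 1.0  (orthogonal vectors)
```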

###Feature Scaling

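KNN is distance-based, so features on larger numeric scales dominate the distance. Scale the features before fitting, using either:

- Normalization (min-max scaling to a fixed range such as [0, 1])
- Standardization (zero mean, unit variance)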
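
The effect of scaling on neighbor choice can be illustrated with a made-up example. In the sketch below, the `income` and `age` values are hypothetical, and the helpers `standardize`, `min_max`, and `nearest_to_first` are names introduced only for this illustration.

```python
import math
import statistics

def standardize(column):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

def min_max(column):
    """Min-max normalization to the range [0, 1]."""
    low, high = min(column), max(column)
    return [(value - low) / (high - low) for value in column]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0], excluding itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical people: income in dollars, age in years
income = [30_000.0, 30_500.0, 33_000.0, 80_000.0]
age = [25.0, 65.0, 26.0, 45.0]

raw = list(zip(income, age))
standardized = list(zip(standardize(income), standardize(age)))
normalized = list(zip(min_max(income), min_max(age)))

print(nearest_to_first(raw))           # 1: income dominates, the 40-year age gap is ignored
print(nearest_to_first(standardized))  # 2: similar age and similar income
print(nearest_to_first(normalized))    # 2
```

Without scaling, the nearest neighbor of the first person is decided almost entirely by income, simply because income is measured in larger units.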

###Advantages

- Simple to understand
- Works well with small to medium datasets
- Essentially no training time (fitting only stores the data or builds a search index)

### Disadvantages

- Slow prediction with large datasets
- Sensitive to irrelevant features
- Requires careful scaling

---

#Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Ans - The "curse of dimensionality" refers to the fact that the volume of the feature space grows exponentially with the number of dimensions, so a fixed number of samples becomes increasingly sparse and computation becomes more expensive. This hits the k-Nearest Neighbors (KNN) algorithm hard because "nearest" loses its local meaning in high-dimensional spaces: the distances from a query point to its closest and farthest neighbors become almost equal. As a result, KNN needs far more data to give reliable predictions and is more prone to overfitting.

##How the curse of dimensionality affects KNN

- Loss of "nearness": In high-dimensional spaces, the distance between the closest and farthest points becomes very similar. This makes the notion of a "nearest neighbor" less meaningful, as points that are "close" by a distance metric might not be meaningfully similar in reality.

- Increased data requirements: The data becomes sparse, meaning more data is needed to adequately sample the feature space and find reliable neighbors. The amount of data required grows exponentially with the number of dimensions, quickly becoming impractical.

- Performance degradation: Due to the sparsity and loss of "nearness," KNN's predictive accuracy decreases. The algorithm struggles to find relevant neighbors, and adding noisy or redundant features further degrades performance.

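- Computational complexity: With brute-force search, every prediction computes a distance to each training point, and each distance costs time proportional to the number of dimensions. Tree-based indexes such as KD-trees also lose most of their speed advantage in high dimensions, so KNN becomes slower and less efficient for high-dimensional data.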

- Overfitting: With a high number of dimensions and insufficient data to represent the space, the algorithm can become overly sensitive to noise and random fluctuations in the training data, leading to overfitting.
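
The loss of "nearness" can be observed in a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube and measures the relative contrast, (farthest − nearest) / nearest, from one random query point. The function name `relative_contrast` and the sample sizes are illustrative choices.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_points)]
    query = tuple(rng.random() for _ in range(n_dims))
    distances = [math.dist(point, query) for point in points]
    return (max(distances) - min(distances)) / min(distances)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(200, n_dims, rng), 3))
```

The printed contrast should shrink as the number of dimensions grows, meaning the nearest and farthest points become hard to tell apart by distance alone.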

##How to mitigate the effect

- Dimensionality reduction: Techniques like principal component analysis (PCA) can reduce the number of dimensions while retaining most of the data's variance.

- Feature selection: This involves identifying and keeping only the most relevant features and removing the rest, which can improve performance and reduce computational cost.

- Use other algorithms: For high-dimensional data, algorithms that are less sensitive to dimensionality, such as support vector machines or decision trees, may be more suitable alternatives.






---

#Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Ans - Principal Component Analysis (PCA) is a dimensionality reduction technique that creates a smaller set of new features (called principal components) by taking linear combinations of the original features, whereas feature selection keeps a subset of the original features unchanged. The key difference is that PCA's new features mix all the original ones and are therefore harder to interpret, while feature selection works with the original features and is generally more interpretable.

##Principal Component Analysis (PCA)

- What it is: A technique that transforms original features into a new set of linearly uncorrelated features called principal components.

- How it works: It finds the directions (principal components) of maximum variance in the data. These components are ranked by the amount of variance they explain, allowing for dimensionality reduction by keeping only the most important components.

- Outcome: A new dataset with fewer features, where each feature is a linear combination of the original ones. This can lead to a loss of original feature interpretability.

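- Example: Given features like "height," "weight," and "age," PCA might produce a first component (PC1) that weights height and weight heavily, acting as a general "body size" axis, and a second component (PC2) dominated by age. Each component is a weighted sum of the original features, so PCA cannot produce a nonlinear quantity such as body mass index.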

##Feature Selection

- What it is: A process of selecting a subset of the most relevant features from the original dataset to use for a model.

- How it works: It ranks the original input variables based on their importance for predicting a target variable and discards the rest.

- Outcome: The original features are kept, so the model is built using only a subset of the original data. This maintains the original interpretability of the features.

- Example: If you are predicting if a person will be a top athlete, feature selection might keep "height" and "weight" but discard "age" if it's found to be less relevant to the prediction goal.
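
The contrast can be sketched with NumPy on synthetic data. Everything in the snippet is an assumption made for illustration: the generated `height`, `weight`, and `noise` columns, the made-up target `y`, and the use of absolute correlation with the target as a simple filter-style selection rule (one of many possible selection methods).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical data: two correlated features plus one pure-noise feature
height = rng.normal(170, 10, n_samples)
weight = 0.9 * height + rng.normal(0, 5, n_samples)
noise = rng.normal(0, 1, n_samples)
X = np.column_stack([height, weight, noise])
y = height + weight + rng.normal(0, 5, n_samples)   # made-up target

# Standardize so every feature has mean 0 and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Feature selection (simple filter): keep the 2 columns most correlated with y
correlations = [abs(np.corrcoef(X_std[:, j], y)[0, 1]) for j in range(X.shape[1])]
keep = np.sort(np.argsort(correlations)[-2:])
X_selected = X_std[:, keep]           # still the original (standardized) columns

# PCA via SVD: project onto the top 2 right singular vectors
_, _, Vt = np.linalg.svd(X_std, full_matrices=False)
X_pca = X_std @ Vt[:2].T              # each new column mixes all 3 features

print("Selected column indices:", keep)
print("PC1 loadings:", np.round(Vt[0], 3))
```

Both outputs have two columns, but `X_selected` contains unchanged input columns while each column of `X_pca` is a weighted sum of all three inputs.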





---
#Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Ans - In PCA, the eigenvectors of the data's covariance matrix give the directions of maximum variance, and the corresponding eigenvalues give the amount of variance along those directions. They are important because they allow PCA to reduce the dimensionality of data by identifying and keeping the components with the most variance (eigenvectors with the largest eigenvalues) while discarding the components with the least variance.

###Eigenvectors and Eigenvalues in PCA

- Eigenvectors: These are the eigenvectors of the covariance matrix of the centered data, and they define the principal components, i.e. the new axes of the data. The first eigenvector points in the direction of the largest variance, the second points in the direction of the largest remaining variance orthogonal to the first, and so on.
- Eigenvalues: These are scalar values that quantify the amount of variance captured by each corresponding eigenvector. A larger eigenvalue indicates that its eigenvector captures a greater amount of the data's variance. Dividing an eigenvalue by the sum of all eigenvalues gives that component's explained variance ratio.
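
A minimal NumPy sketch of this computation follows, assuming a tiny made-up 2-D dataset. It builds the covariance matrix, extracts eigenvalues and eigenvectors with `np.linalg.eigh`, and sorts them so the first component has the largest variance.

```python
import numpy as np

# Toy 2-D data (hypothetical values) with most of its spread along the diagonal
X = np.array([[2.0, 1.9], [0.5, 0.7], [3.1, 3.0], [1.2, 1.0], [4.0, 4.2]])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)            # 2 x 2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh is meant for symmetric matrices

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # columns are the principal axes

explained_variance_ratio = eigenvalues / eigenvalues.sum()
scores = X_centered @ eigenvectors                # data expressed in the new axes

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)
```

The covariance matrix of `scores` is diagonal with the eigenvalues on the diagonal, which is what "uncorrelated components" means in practice.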

###Why they are important

- Ranking of components: By ranking the eigenvalues from largest to smallest, you can rank the corresponding eigenvectors from most significant to least significant.

- Dimensionality reduction: You can achieve dimensionality reduction by keeping only the eigenvectors with the largest eigenvalues. This is because these eigenvectors capture the most important patterns and variance in the data, while those with very small eigenvalues capture very little information.

- Feature extraction: The projections of the data onto the eigenvectors (the principal component scores) become the new features for a model, and the eigenvalues indicate how much variance each new feature carries.

- Visualization and analysis: PCA can be used to reduce a high-dimensional dataset to two or three dimensions for visualization by keeping the top two or three principal components.




---

#Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Ans - When PCA and KNN are applied in a single pipeline, PCA acts as a crucial preprocessing step that addresses several limitations of KNN, primarily related to the curse of dimensionality, leading to significant improvements in computational efficiency and, often, classification accuracy.

###How PCA Complements KNN

- Dimensionality Reduction: KNN's performance deteriorates in high-dimensional spaces (the "curse of dimensionality") because the distance metrics become less meaningful, and all data points appear to be roughly the same distance from each other. PCA mitigates this by transforming the data into a lower-dimensional space while retaining most of the important variance, allowing KNN to operate in a more meaningful feature space.

- Improved Efficiency: Calculating distances between data points in high dimensions is computationally expensive and time-consuming. By reducing the number of features, PCA significantly speeds up the distance calculations required by KNN during both training (if applicable) and prediction phases.

- Noise Reduction: PCA helps to denoise the data by focusing on the principal components that capture the most significant patterns and effectively removing "background" or irrelevant variations (noise) present in the original data. A cleaner feature set helps KNN to find more relevant neighbors and make better classifications.

- Reduced Multicollinearity: The principal components generated by PCA are orthogonal (uncorrelated) by design. This removes multicollinearity among features, which can otherwise skew distance calculations in KNN.

- Enhanced Accuracy: In applications such as image or financial time series classification, combining PCA with KNN can give higher accuracy than KNN alone by discarding low-variance, noisy directions. This is not guaranteed: PCA is unsupervised, so the directions of highest variance are not necessarily the most discriminative ones, and the result should be checked by cross-validation.









---
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

Ans -

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


# 1. KNN WITHOUT FEATURE SCALING

knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)


# 2. KNN WITH FEATURE SCALING

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy WITHOUT scaling:", acc_no_scaling)
print("Accuracy WITH scaling:   ", acc_scaled)


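Accuracy WITHOUT scaling: 0.7222222222222222
Accuracy WITH scaling:    0.9444444444444444

On this split, standardizing the features raises test accuracy from about 0.72 to about 0.94, which is consistent with KNN's sensitivity to feature scale.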
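
A quick way to see why scaling matters here is to compare the spread of each feature. The sketch below assumes scikit-learn's bundled copy of the Wine data (the same `load_wine` used above) and prints each feature's standard deviation; features with a large spread, such as `proline`, dominate an unscaled Euclidean distance.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Print features from largest to smallest standard deviation
for j in np.argsort(stds)[::-1]:
    print(f"{wine.feature_names[j]:<30} std = {stds[j]:.2f}")
```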


#Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.


In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
X, y = load_wine(return_X_y=True)

# Standardize the features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()  # keep all components
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained variance ratio of each principal component:")
for i, variance in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {variance:.4f}")


Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
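
One common way to use these ratios is to accumulate them and pick the smallest number of components that reaches a target share of the variance. The sketch below copies the rounded ratios from the output above; the helper `components_needed` and the thresholds are illustrative assumptions.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above (rounded to 4 decimals)
ratios = [0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
          0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080]

cumulative = list(accumulate(ratios))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    for count, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return count
    return len(cumulative)

for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance -> {components_needed(threshold)} components")
# 80% of variance -> 5 components
# 90% of variance -> 8 components
# 95% of variance -> 10 components
```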


#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


# 1. KNN ON ORIGINAL (SCALED)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)

y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)


# 2. PCA (TOP 2 COMPONENTS)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("Accuracy on ORIGINAL dataset        :", acc_original)
print("Accuracy on PCA-transformed dataset :", acc_pca)


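Accuracy on ORIGINAL dataset        : 0.9444444444444444
Accuracy on PCA-transformed dataset : 0.9444444444444444

On this split, KNN on the top two principal components matches the test accuracy of KNN on all 13 scaled features (about 0.94), which suggests that much of the class structure is captured by the first two components.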


#Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results


In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale data (important for distance-based models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# 1. KNN with EUCLIDEAN distance

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)

y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)


# 2. KNN with MANHATTAN distance

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print Results
print("Accuracy with Euclidean distance:", acc_euclidean)
print("Accuracy with Manhattan distance:", acc_manhattan)


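Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9814814814814815

On this split, Manhattan distance scores slightly higher than Euclidean (about 0.98 versus 0.94). With 54 test samples that is a difference of two predictions, so the gap should be confirmed with cross-validation before preferring one metric.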


#Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:

- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data.

In [7]:

# PCA + KNN Pipeline for Gene Expression Data
# (Using Wine dataset as example placeholder)

import numpy as np
from sklearn.datasets import load_wine   # Replace with real gene expression data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# 1. Load dataset
X, y = load_wine(return_X_y=True)   # Replace this with your gene expression matrix

# 2. Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),    # retain 95% of variance
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# 3. Evaluate with Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print("Cross-Validation Accuracy Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Std Dev:", np.std(scores))

# 4. Inspect PCA variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca_full = PCA().fit(X_scaled)

print("\nExplained Variance Ratio by Component:")
print(pca_full.explained_variance_ratio_)


Cross-Validation Accuracy Scores: [0.97222222 0.94444444 0.97222222 0.94285714 0.97142857]
Mean Accuracy: 0.9606349206349206
Std Dev: 0.013879588722983607

Explained Variance Ratio by Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
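
For the "decide how many components to keep" and "evaluate the model" steps, one option is to treat the number of components and K as hyperparameters and tune them by cross-validation. The sketch below is an assumption about how that could be set up with `GridSearchCV`, not part of the original answer; the grid values are illustrative and the Wine data is again only a placeholder for a real gene expression matrix.

```python
from sklearn.datasets import load_wine   # placeholder data, as in the cell above
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA live inside the pipeline so they are refit on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid (an assumption); widen or narrow it for real data
param_grid = {
    "pca__n_components": [2, 5, 0.90, 0.95],   # ints = count, floats = variance share
    "knn__n_neighbors": [3, 5, 7, 9],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

Because the best score is selected on the same folds used for tuning, it is optimistically biased. On a small biomedical dataset, nested cross-validation or a held-out test set would give a fairer estimate to report to stakeholders.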
