**KNN & PCA**

1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

- K-Nearest Neighbors (KNN) is a non-parametric, supervised learning algorithm. It builds no explicit model: it stores the training data and predicts a new point from the 'k' training points closest to it under a distance metric such as Euclidean distance. For classification, KNN assigns the new point to the class most common among its 'k' nearest neighbors (a majority vote). For regression, it predicts a continuous value by averaging the target values of the 'k' nearest neighbors.
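The two prediction rules can be illustrated with a minimal from-scratch sketch. The helper `knn_predict` and the toy data are hypothetical and exist only for illustration; in practice scikit-learn's `KNeighborsClassifier` and `KNeighborsRegressor` would be used.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point using Euclidean distance (illustrative helper)."""
    X_train = np.asarray(X_train, dtype=float)
    # Distance from the query point to every stored training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy one-dimensional data (made up for this sketch)
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["low", "low", "low", "high", "high", "high"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_predict(X_toy, labels, [2.5], k=3)
predicted_value = knn_predict(X_toy, values, [2.5], k=3, task="regression")
print(predicted_class, predicted_value)
```

The query point 2.5 has the training points 1, 2 and 3 as its three nearest neighbors, so the vote and the average are both taken over those three points only.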

2. What is the Curse of Dimensionality and how does it affect KNN
performance?


- The Curse of Dimensionality describes how, as the number of features (dimensions) in a dataset increases, the data becomes sparse, so more data is needed to cover the space and algorithm performance degrades. This heavily impacts KNN because, in high-dimensional spaces, the distances from a point to its nearest and farthest neighbors tend to become nearly equal, which makes "closeness" less meaningful and reduces the algorithm's ability to find true neighbors. As a result, KNN needs exponentially more data to maintain performance, and it becomes more computationally expensive and more prone to overfitting in high dimensions.
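The loss of contrast between near and far points can be seen in a small simulation. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(max distance - min distance) / min distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# The contrast shrinks as the number of dimensions grows
contrasts = {n_dims: relative_contrast(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>4} dimensions: relative contrast = {contrast:.3f}")
```

A contrast close to zero means the nearest and farthest points are almost the same distance away, which is the situation in which a nearest-neighbor vote carries little information.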

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?


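- Principal Component Analysis (PCA) is a feature extraction technique: it reduces dimensionality by creating new, uncorrelated features (principal components), each a linear combination of the original features, ordered by how much of the data's variance they capture. Feature selection instead keeps a subset of the original features, chosen for their relevance to the prediction task, and discards the rest. PCA is unsupervised and its components are harder to interpret, while selected features keep their original meaning.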
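The difference can be sketched on the Wine dataset. The choice of `SelectKBest` with `f_classif` as the feature selection method is an assumption made for illustration; any other selector would make the same point.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 2 of the 13 original columns (uses the labels)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
selected_names = [
    name for name, keep in zip(wine.feature_names, selector.get_support()) if keep
]
print("Selected original features:", selected_names)

# Feature extraction: build 2 new columns from all 13 features (ignores the labels)
pca = PCA(n_components=2).fit(X_scaled)
print("PCA loadings shape (components x original features):", pca.components_.shape)
print("Weights of the first component on every original feature:")
for name, weight in zip(wine.feature_names, pca.components_[0]):
    print(f"  {name:<30} {weight:+.3f}")
```

The selector returns two named columns that already existed, whereas each PCA component is a weighted mix of all thirteen.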

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?


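- In PCA, the eigenvectors of the data's covariance matrix give the directions of maximum variance (the principal components), and each eigenvalue is a scalar giving the amount of variance along its eigenvector's direction. They are important because sorting the eigenvectors by eigenvalue ranks the components: keeping the eigenvectors with the largest eigenvalues retains the most variance, and each eigenvalue divided by the sum of all eigenvalues gives that component's explained variance ratio.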
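As a check on this relationship, the sketch below computes the eigen-decomposition of the covariance matrix with NumPy and compares it with what scikit-learn's `PCA` reports. It assumes standardized Wine features; eigenvectors are compared in absolute value because their sign is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (np.cov divides by n - 1 by default)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort from largest to smallest variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print("Eigenvalues match explained_variance_:",
      np.allclose(eigenvalues, pca.explained_variance_))
print("Ratios match explained_variance_ratio_:",
      np.allclose(explained_ratio, pca.explained_variance_ratio_))
print("First eigenvector matches first component up to sign:",
      np.allclose(np.abs(eigenvectors[:, 0]), np.abs(pca.components_[0]), atol=1e-6))
```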

5. How do KNN and PCA complement each other when applied in a single
pipeline?


- PCA complements KNN in a pipeline by performing dimensionality reduction, which mitigates the "curse of dimensionality" by transforming high-dimensional data into a lower-dimensional space. This reduces computational complexity, combats overfitting, and can improve KNN's performance by removing noise and redundant information. The resulting lower-dimensional features are then fed into KNN for more efficient and accurate classification or regression.
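One way to chain the two steps is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. The choice of 2 components, `n_neighbors=5` and 5-fold cross-validation is an assumption for illustration, not a recommendation.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale, reduce to 2 components, then classify with KNN
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training portion
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Wrapping the steps in a single pipeline means the scaler and PCA never see the held-out fold, which keeps the accuracy estimate honest.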

6. Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_no_scale = knn.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with scaling
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled:.4f}")

Accuracy without scaling: 0.7407
Accuracy with scaling: 0.9630


7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [3]:
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train a PCA model on the raw (unscaled) features
pca = PCA()
pca.fit(X)

# Print the explained variance ratio of each principal component
print("Explained variance ratio of each principal component:")
print(pca.explained_variance_ratio_)

Explained variance ratio of each principal component:
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
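The ratios above come from the raw features, whose scales differ by orders of magnitude, so the first component most likely reflects the largest-scale feature rather than structure shared across features. A sketch of the same analysis on standardized features, assuming `StandardScaler` is an acceptable preprocessing step here:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Give every feature zero mean and unit variance before PCA
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA().fit(X_scaled)

ratios_scaled = pca_scaled.explained_variance_ratio_
cumulative = np.cumsum(ratios_scaled)
for i, (ratio, total) in enumerate(zip(ratios_scaled, cumulative), start=1):
    print(f"PC{i:>2}: {ratio:.4f} (cumulative {total:.4f})")
```

After scaling, the variance is spread over several components instead of being concentrated almost entirely in the first one.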


8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [6]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA (retain top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"Accuracy with PCA (2 components): {accuracy_pca:.4f}")


Accuracy with PCA (2 components): 0.9815
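The cell above reports only the PCA accuracy. A minimal sketch of the side-by-side comparison, assuming the same split (`test_size=0.3`, `random_state=42`) and `n_neighbors=5`, and using all 13 scaled features as the baseline:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Two models that differ only in the PCA step
models = {
    "All 13 scaled features": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "PCA (2 components)": make_pipeline(
        StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.4f}")
```

With a test set of this size, a difference of one or two samples changes the accuracy by a few percentage points, so a small gap between the two models should not be over-interpreted.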


9. Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [8]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)
print(f"Accuracy with euclidean distance: {accuracy_euclidean:.4f}")

# Train KNN with manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)
print(f"Accuracy with manhattan distance: {accuracy_manhattan:.4f}")

Accuracy with euclidean distance: 0.9630
Accuracy with manhattan distance: 0.9630
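A single train/test split gives identical scores here, which says little about which metric is preferable. A sketch that compares the two metrics with 5-fold cross-validation instead, assuming the same `n_neighbors=5` and scaling inside a pipeline:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

cv_means = {}
for metric in ("euclidean", "manhattan"):
    # Scaling is refit on each training fold by the pipeline
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[metric] = scores.mean()
    print(f"{metric}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")
```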


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data


- To classify cancer types from high-dimensional gene expression data, you would use PCA to find the most informative latent features, determining the number of components to keep by examining the explained variance, then apply a K-Nearest Neighbors (KNN) classifier on the reduced data.

- 1. Use PCA for Dimensionality Reduction
  - Apply PCA: After the features are standardized, PCA transforms the original, high-dimensional gene expression data into a new set of orthogonal components, called principal components (PCs).
  - Preserve Variance: Each PC captures a share of the variance in the original data. The first few PCs typically capture most of the total variance, while later PCs capture progressively less.


- 2. Decide How Many Components to Keep
  - Explained Variance Plot: Generate a plot (known as a "scree plot") showing the individual or cumulative variance explained by each principal component.
  - Identify an "Elbow": The point where the explained variance per component drops off sharply, often called the "elbow," marks where further components add little to the total variance. Keeping the components up to the elbow, or enough components to reach a chosen cumulative variance, are both common rules of thumb.

- 3. Use KNN for Classification Post-Dimensionality Reduction
  - Train KNN on Reduced Data: After applying PCA to transform the data into the selected principal components, the reduced dataset is used to train the KNN classifier.
  - Classification: The KNN algorithm classifies new samples by finding their k nearest neighbors in the reduced feature space and assigning the most frequent class among those neighbors.

- 4. Evaluate the Model
  - Cross-Validation: A robust method like repeated k-fold cross-validation is crucial for estimating the model's generalization performance, especially with small datasets. Scaling and PCA should be fit on the training folds only, so that no information from the held-out samples leaks into the model.
  - Performance Metrics: Evaluate the model using metrics such as:
    - Accuracy: The overall proportion of correct predictions.
    - Precision: The proportion of true positive predictions among all positive predictions.

- 5. Justify this Pipeline to Stakeholders
  - Addresses Overfitting: Explain that high-dimensional gene expression data with few samples often leads to overfitting with traditional models. PCA reduces the number of features, significantly mitigating this risk.
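The five steps can be sketched end to end as follows. Everything specific here is an assumption for illustration: the data is a synthetic stand-in generated with `make_classification` (60 samples, 500 features, 3 classes), the 90% variance threshold and `n_neighbors=5` are arbitrary starting points, and real gene expression data would need its own preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data (assumption): few samples, many features
X, y = make_classification(
    n_samples=60,
    n_features=500,
    n_informative=15,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps enough components to explain that share of variance
    ("pca", PCA(n_components=0.90, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Repeated stratified k-fold: scaler and PCA are refit inside every training fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
cv_results = cross_validate(
    pipeline, X, y, cv=cv, scoring=["accuracy", "precision_macro"]
)

for metric in ("accuracy", "precision_macro"):
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because the whole pipeline is passed to `cross_validate`, the reported scores reflect how the complete procedure (scaling, PCA, KNN) behaves on samples it has not seen, which is the figure worth presenting to stakeholders.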