**KNN & PCA | Assignment**

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer - K-Nearest Neighbors (KNN) is a simple, non-parametric, supervised machine learning algorithm that can be used for both classification and regression tasks.

- How KNN Works:

1. For Classification:


When given a new data point, KNN computes its distance (commonly Euclidean) to every point in the training data and picks the K nearest neighbors.

It then assigns the new data point to the class that is most common among these K neighbors (a 'majority vote').


2. For Regression:

Similar to classification, it identifies the K nearest neighbors.


Instead of a majority vote, it takes the average (or median) of the target values of these K neighbors as the prediction for the new data point.
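The two prediction rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for scikit-learn: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data (made up for illustration): two well-separated clusters in 2D.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["A", "A", "A", "B", "B", "B"]
targets = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_classify(X, labels, (2.0, 2.0), k=3))   # A
print(knn_regress(X, targets, (8.2, 8.4), k=3))   # 51.0
```

Both functions share the same neighbor search; only the final aggregation step (vote versus average) differs.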


Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Answer - The Curse of Dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (many features/variables) that do not occur in low-dimensional settings. As the number of dimensions increases, the volume of the space increases so rapidly that the available data becomes sparse, making it difficult to find meaningful patterns or relationships.

- How the Curse of Dimensionality Affects KNN Performance:

In high-dimensional spaces, the distance between any two data points tends to become more uniform, making it difficult for KNN to effectively distinguish between 'near' and 'far' neighbors.

This can lead to:

1- Increased Computational Cost: Calculating distances in many dimensions is computationally expensive.

2- Reduced Performance: The concept of 'nearest neighbor' becomes less meaningful, as all points can appear nearly equidistant from each other. KNN may then treat points that are not truly similar as neighbors, which reduces its accuracy and predictive power.

3- More Data Required: To keep the same sampling density, the amount of data needed grows exponentially with the number of dimensions.
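The loss of contrast between near and far points can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(p, query) for p in points]
    return (max(dists) - min(dists)) / min(dists)

# Same number of points, increasing dimensionality.
for d in (2, 10, 100, 1000):
    print(f"{d:5d} dimensions: relative contrast = {relative_contrast(500, d):.3f}")
```

Under these assumptions the printed contrast should shrink as the dimension grows, which is the effect that makes the 'nearest' neighbor barely nearer than any other point.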


Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer - Principal Component Analysis (PCA) is a popular unsupervised dimensionality reduction technique. It transforms a dataset of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The goal is to retain as much of the original variance in the data as possible with a reduced number of dimensions.

- PCA vs. Feature Selection:

1- Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new, smaller set of uncorrelated features (principal components) while retaining most of the variance. It creates new features.

2- Feature Selection is a process that selects a subset of the original features based on their relevance or importance. It chooses from existing features, rather than creating new ones.
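The contrast can be made concrete with NumPy. In this sketch the data are synthetic, and the selection rule (keep the two highest-variance columns) is only one simple stand-in for feature selection, chosen for illustration. The point is that selection returns original columns unchanged, while PCA returns new columns that are linear combinations of all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 100 samples, 4 features,
# where feature 1 is almost a copy of feature 0.
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)

# Feature selection: keep 2 of the ORIGINAL columns (here, the highest-variance ones).
variances = X.var(axis=0)
selected_idx = np.sort(np.argsort(variances)[-2:])
X_selected = X[:, selected_idx]

# PCA: build 2 NEW columns by projecting onto the top-2 covariance eigenvectors.
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
X_pca = X_centered @ top2

print("Selected original columns:", selected_idx)
print("Shapes:", X_selected.shape, X_pca.shape)
print("Correlation between the two PCA columns:",
      np.corrcoef(X_pca, rowvar=False)[0, 1])
```

Both results have two columns, but only the PCA columns are guaranteed to be uncorrelated with each other.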

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

In Principal Component Analysis (PCA), eigenvalues and eigenvectors are fundamental mathematical concepts:


1- Eigenvectors: The eigenvectors of the data's covariance matrix give the directions (principal axes) along which the data varies the most. They define the orientation of the new axes in the transformed space, and each one captures a different aspect of the data's variance. In PCA, the first eigenvector points in the direction of the highest variance, the second in the direction of the second highest variance (orthogonal to the first), and so on.

2- Eigenvalues: These are scalar values corresponding to each eigenvector. An eigenvalue quantifies the amount of variance in the data along its corresponding eigenvector. A larger eigenvalue means that its eigenvector captures more variance from the dataset.


Importance in PCA: Eigenvalues and eigenvectors are crucial because they allow PCA to:


1- Determine Principal Components: The eigenvectors define the principal components, which are the new dimensions of the dataset.

2- Order Components: Eigenvalues allow us to rank the principal components by their importance, with larger eigenvalues indicating more significant components that retain more information.

3- Reduce Dimensionality: By selecting only the eigenvectors with the largest eigenvalues, we can reduce the dimensionality of the data while retaining the most significant variance, thus preserving the most important information.
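The relationship between eigenvalues, eigenvectors, and variance can be checked numerically. The following sketch uses synthetic 2D data (an assumption for illustration) and NumPy's `eigh` on the covariance matrix; it verifies that the variance of the data projected onto each eigenvector equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 2D data: stretched along one axis, then rotated by 30 degrees.
raw = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = raw @ rotation.T

# Center the data and compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()
projected = X_centered @ eigvecs             # coordinates along the principal axes

print("Eigenvalues:", eigvals)
print("Explained variance ratio:", explained_variance_ratio)
print("Variance of projections:", projected.var(axis=0, ddof=1))
```

Dividing each eigenvalue by the sum of all eigenvalues gives the explained variance ratio that scikit-learn reports as `explained_variance_ratio_`.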


Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) complement each other effectively in a single pipeline, primarily because PCA can mitigate some of KNN's inherent weaknesses, especially in high-dimensional datasets.

 Here’s how:

- Addressing the Curse of Dimensionality: As discussed, KNN suffers from the curse of dimensionality, where distances become less meaningful in high-dimensional spaces, leading to reduced accuracy and increased sparsity of data. PCA, being a dimensionality reduction technique, can project the high-dimensional data into a lower-dimensional subspace while retaining most of the significant variance.

- Improving Computational Efficiency: Calculating distances between data points is computationally expensive in high-dimensional spaces. By reducing the number of features with PCA, the distance calculations for KNN become much faster, leading to a more efficient algorithm.

- Enhancing KNN Performance: When low-variance, noisy directions are discarded and most of the variance is compressed into fewer principal components, KNN can often find more meaningful 'neighbors'. Because PCA is unsupervised, it keeps high-variance directions rather than explicitly class-discriminative ones, so improved accuracy and robustness are common but not guaranteed.

- Noise Reduction: PCA can also help in reducing noise in the data by discarding principal components that capture very little variance, which might mostly consist of noise.

In essence, PCA acts as a pre-processing step for KNN, transforming the data into a more manageable and informative representation, thereby allowing KNN to perform better, faster, and more reliably.
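One way to express this pre-processing role is a single scikit-learn `Pipeline`, sketched below on the Wine dataset. The step names, `n_components=2`, and `n_neighbors=5` are illustrative choices rather than tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling, projection and classification chained into one estimator.
knn_pca_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the scaler and PCA parameters from the training data only;
# score() applies the same fitted transforms to the test data before predicting.
knn_pca_pipeline.fit(X_train, y_train)
test_accuracy = knn_pca_pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {test_accuracy:.4f}")
```

Wrapping the steps this way keeps the scaler and PCA from ever seeing test data during fitting, which is easy to get wrong when the transforms are applied by hand.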



Use the Wine Dataset from sklearn.datasets.load_wine().


Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (142, 13)
X_test shape: (36, 13)
y_train shape: (142,)
y_test shape: (36,)


In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train KNN without scaling
knn_unscaled = KNeighborsClassifier()
knn_unscaled.fit(X_train, y_train)

# Make predictions on the unscaled test set
y_pred_unscaled = knn_unscaled.predict(X_test)

# Evaluate the accuracy
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"KNN Accuracy (without feature scaling): {accuracy_unscaled:.4f}")

KNN Accuracy (without feature scaling): 0.7222


In [3]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled using StandardScaler.")

Features scaled using StandardScaler.


In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train KNN with scaling
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred_scaled = knn_scaled.predict(X_test_scaled)

# Evaluate the accuracy
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"KNN Accuracy (with feature scaling): {accuracy_scaled:.4f}")

KNN Accuracy (with feature scaling): 0.9444


In [5]:
print("\n--- Comparison of KNN Accuracy ---")
print(f"KNN Accuracy (without feature scaling): {accuracy_unscaled:.4f}")
print(f"KNN Accuracy (with feature scaling): {accuracy_scaled:.4f}")

accuracy_difference = accuracy_scaled - accuracy_unscaled
print(f"\nDifference in accuracy (Scaled - Unscaled): {accuracy_difference:.4f}")

if accuracy_difference > 0:
    print(f"Conclusion: Feature scaling significantly improved the KNN model's accuracy by {accuracy_difference:.4f}.")
elif accuracy_difference < 0:
    print(f"Conclusion: Feature scaling decreased the KNN model's accuracy by {abs(accuracy_difference):.4f}.")
else:
    print(f"Conclusion: Feature scaling had no significant impact on the KNN model's accuracy.")


--- Comparison of KNN Accuracy ---
KNN Accuracy (without feature scaling): 0.7222
KNN Accuracy (with feature scaling): 0.9444

Difference in accuracy (Scaled - Unscaled): 0.2222
Conclusion: Feature scaling significantly improved the KNN model's accuracy by 0.2222.


Comparison of KNN Accuracies

After training K-Nearest Neighbors (KNN) classifiers on the Wine dataset both with and without feature scaling, the following accuracies were observed:

KNN Accuracy (without feature scaling): 0.7222
KNN Accuracy (with feature scaling): 0.9444

Summary:

Feature scaling with StandardScaler substantially improved the performance of the KNN classifier on the Wine dataset. Without scaling, the model achieved an accuracy of approximately 72.22% on the 36-sample test set.

After applying feature scaling, the accuracy rose to approximately 94.44%. KNN relies on distances, so without scaling the features with the largest numeric ranges dominate the distance metric. Standardizing puts every feature on a comparable scale so that each one contributes to the neighbor search.
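The way a large-range feature dominates the distance can be seen on three made-up points. The values below are hypothetical and only mimic one large-range feature and one small-range feature; `standardize` is a helper written for this sketch that applies z-scoring.

```python
import math

# Hypothetical two-feature samples: (large-range feature, small-range feature).
a = (1000.0, 0.6)
b = (1010.0, 1.6)   # close to `a` on the large feature, far on the small one
c = (1060.0, 0.6)   # far from `a` on the large feature, identical on the small one

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
print(f"Raw distances:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")

def standardize(points):
    """Z-score each feature: subtract the mean, divide by the population std."""
    columns = list(zip(*points))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

a_s, b_s, c_s = standardize([a, b, c])
scaled_ab, scaled_ac = math.dist(a_s, b_s), math.dist(a_s, c_s)
print(f"Scaled distances: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

On the raw values, `c` is about six times farther from `a` than `b` is, purely because of the large-range feature. After standardization the two distances are nearly equal, so the difference on the small-range feature is no longer ignored.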

Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [6]:
from sklearn.decomposition import PCA

# Initialize PCA. We'll start without specifying n_components to see the variance explained by all components.
pca = PCA()

# Fit PCA on the scaled training data
pca.fit(X_train_scaled)

# Print the explained variance ratio of each principal component
print('Explained variance ratio of each principal component:')
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f'Principal Component {i+1}: {ratio:.4f}')

Explained variance ratio of each principal component:
Principal Component 1: 0.3590
Principal Component 2: 0.1869
Principal Component 3: 0.1161
Principal Component 4: 0.0737
Principal Component 5: 0.0665
Principal Component 6: 0.0485
Principal Component 7: 0.0420
Principal Component 8: 0.0268
Principal Component 9: 0.0235
Principal Component 10: 0.0189
Principal Component 11: 0.0172
Principal Component 12: 0.0126
Principal Component 13: 0.0083


The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component. A higher ratio for a component means it captures more information (variance) from the original features.
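A cumulative view of these ratios is often easier to read than the individual values. The sketch below reuses the ratios printed above (copied by hand, so they carry four-decimal rounding) and a hypothetical helper, `components_needed`, to find how many leading components reach a given share of the variance.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above.
ratios = [0.3590, 0.1869, 0.1161, 0.0737, 0.0665, 0.0485, 0.0420,
          0.0268, 0.0235, 0.0189, 0.0172, 0.0126, 0.0083]

cumulative = list(accumulate(ratios))

def components_needed(cumulative_ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    for n, total in enumerate(cumulative_ratios, start=1):
        if total >= threshold:
            return n
    return len(cumulative_ratios)

for n, total in enumerate(cumulative, start=1):
    print(f"First {n:2d} components: {total:.4f}")

print("Components for 90% variance:", components_needed(cumulative, 0.90))   # 8
print("Components for 95% variance:", components_needed(cumulative, 0.95))   # 10
```

With these numbers, the first two components account for about 54.6% of the variance, and 8 and 10 components are needed to reach 90% and 95% respectively.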

Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

# Task
Train a K-Nearest Neighbors (KNN) classifier on the Wine dataset after applying Principal Component Analysis (PCA) to reduce the data to its top 2 components. Then, compare the accuracy of this PCA-transformed KNN model with the accuracies of KNN models trained on the unscaled and scaled original datasets (from Question 6), and summarize the impact of dimensionality reduction via PCA.

## Apply PCA with 2 Components

### Subtask:
Apply Principal Component Analysis (PCA) to the scaled training and testing data (`X_train_scaled`, `X_test_scaled`), retaining only the top 2 principal components. This will reduce the dimensionality of the dataset.


**Reasoning**:
This step initializes PCA with 2 components, fits it on the scaled training data only, and then transforms both the training and testing sets, storing the results in `X_train_pca` and `X_test_pca`.



In [7]:
from sklearn.decomposition import PCA

# 1. Initialize PCA with n_components=2
pca_2_components = PCA(n_components=2)

# 2. Fit PCA on the scaled training data
pca_2_components.fit(X_train_scaled)

# 3. Transform both training and testing data
X_train_pca = pca_2_components.transform(X_train_scaled)
X_test_pca = pca_2_components.transform(X_test_scaled)

print(f"Original training data shape: {X_train_scaled.shape}")
print(f"PCA transformed training data shape: {X_train_pca.shape}")
print(f"Original testing data shape: {X_test_scaled.shape}")
print(f"PCA transformed testing data shape: {X_test_pca.shape}")

Original training data shape: (142, 13)
PCA transformed training data shape: (142, 2)
Original testing data shape: (36, 13)
PCA transformed testing data shape: (36, 2)


In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN classifier
knn_pca = KNeighborsClassifier()

# Fit KNN on PCA-transformed training data
knn_pca.fit(X_train_pca, y_train)

# Make predictions on PCA-transformed test data
y_pred_pca = knn_pca.predict(X_test_pca)

# Calculate accuracy
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print(f"KNN Accuracy (on PCA-transformed data with 2 components): {accuracy_pca:.4f}")

KNN Accuracy (on PCA-transformed data with 2 components): 1.0000


In [9]:
print("\n--- Comparison of KNN Accuracy ---")
print(f"KNN Accuracy (without feature scaling): {accuracy_unscaled:.4f}")
print(f"KNN Accuracy (with feature scaling): {accuracy_scaled:.4f}")
print(f"KNN Accuracy (on PCA-transformed data with 2 components): {accuracy_pca:.4f}")



--- Comparison of KNN Accuracy ---
KNN Accuracy (without feature scaling): 0.7222
KNN Accuracy (with feature scaling): 0.9444
KNN Accuracy (on PCA-transformed data with 2 components): 1.0000


For this particular train/test split of the Wine dataset, the top 2 principal components explain only about 55% of the total variance (0.3590 + 0.1869), yet they retain enough class-separating structure for KNN to classify all 36 test samples correctly. With a test set this small, a perfect score should be read with caution rather than as proof that two components always suffice. Still, the result illustrates PCA's value as a pre-processing step for algorithms like KNN, especially when the data contain correlated features or noise.
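Because a single 36-sample test split gives a noisy accuracy estimate, a cross-validated comparison is a reasonable follow-up check. The sketch below is an addition to the original experiment: the 5 stratified folds and the fixed seed are assumptions, and no particular scores are claimed here.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Default KNN (k=5) wrapped in pipelines so that scaling and PCA
# are refit on the training portion of every fold.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_scaled = cross_val_score(scaled_knn, X, y, cv=cv)
scores_pca = cross_val_score(pca_knn, X, y, cv=cv)

print(f"Scaled KNN:         mean={scores_scaled.mean():.4f}, std={scores_scaled.std():.4f}")
print(f"Scaled + PCA(2) KNN: mean={scores_pca.mean():.4f}, std={scores_pca.std():.4f}")
```

Comparing the fold means and their spread indicates whether the advantage of the 2-component model holds across splits or is specific to one of them.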

Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.


In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN classifier with Euclidean distance metric
knn_euclidean = KNeighborsClassifier(metric='euclidean')

# Fit KNN on scaled training data
knn_euclidean.fit(X_train_scaled, y_train)

# Make predictions on scaled test data
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)

# Calculate accuracy
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

print(f"KNN Accuracy (Euclidean distance on scaled data): {accuracy_euclidean:.4f}")

KNN Accuracy (Euclidean distance on scaled data): 0.9444


In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN classifier with Manhattan distance metric
knn_manhattan = KNeighborsClassifier(metric='manhattan')

# Fit KNN on scaled training data
knn_manhattan.fit(X_train_scaled, y_train)

# Make predictions on scaled test data
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)

# Calculate accuracy
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"KNN Accuracy (Manhattan distance on scaled data): {accuracy_manhattan:.4f}")

KNN Accuracy (Manhattan distance on scaled data): 0.9444


In [12]:
print(f"KNN Accuracy (Euclidean distance on scaled data): {accuracy_euclidean:.4f}")
print(f"KNN Accuracy (Manhattan distance on scaled data): {accuracy_manhattan:.4f}")

KNN Accuracy (Euclidean distance on scaled data): 0.9444
KNN Accuracy (Manhattan distance on scaled data): 0.9444
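The two metrics compared above differ only in how the per-feature differences are combined. A minimal sketch of both, with helper names chosen for this example:

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences (L2 distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

The two metrics can rank neighbors differently, but on this scaled test split both produced the same accuracy (0.9444), so the choice of metric made no measurable difference here.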


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

1. Use PCA to reduce dimensionality:


Data Preprocessing: First, I would ensure the gene expression data is properly preprocessed. This typically involves standardization (e.g., z-scoring each gene with StandardScaler) to bring all genes to a similar scale, because PCA is sensitive to the scale of each feature and would otherwise be driven by the genes with the largest variance. Missing values would also need to be handled appropriately.

Applying PCA: I would then apply PCA to the scaled gene expression data. PCA will transform the original (likely correlated) gene expression features into a new set of orthogonal (uncorrelated) variables called Principal Components (PCs). These PCs capture the maximum variance in the data, with the first few PCs representing the most significant patterns.

2. Decide how many components to keep:


Scree Plot: I would generate a scree plot, which displays the eigenvalues (variance explained) for each principal component in descending order. I'd look for an 'elbow' point where the marginal gain in explained variance drops significantly, indicating that subsequent components contribute less information.

Cumulative Explained Variance: I would also examine the cumulative explained variance plot, which shows the total proportion of variance explained as more principal components are added. A common practice is to keep enough components to explain a high percentage of the total variance, such as 90% or 95%, while still reducing the number of features substantially. With a small sample size, PCA can produce at most n_samples - 1 components with non-zero variance, and keeping fewer still is important to avoid overfitting.

Cross-Validation: For more robustness, I could also use cross-validation to test the performance of the downstream KNN classifier with different numbers of principal components and select the number that yields the best generalization performance.
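The cross-validation approach can be sketched with `GridSearchCV` over a pipeline. Everything specific here is an assumption for illustration: the data come from `make_classification` as a stand-in for a real expression matrix, and the candidate values for `n_components` and `n_neighbors` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=80, n_features=500, n_informative=10, n_classes=2, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Each candidate n_components must stay below the size of a training fold (64 here).
param_grid = {
    "pca__n_components": [2, 5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, so the selected number of components is not influenced by the held-out samples.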

3. Use KNN for classification post-dimensionality reduction:


Transformed Data: Once the optimal number of principal components is determined, the original gene expression data would be transformed into this lower-dimensional PCA space. All subsequent steps, including training and testing the KNN model, would use this PCA-transformed data.

KNN Application: A K-Nearest Neighbors (KNN) classifier would then be trained on this reduced-dimensional data. For a new patient's gene expression profile, it would be projected into the same PCA space, and its class would be determined by the majority class among its K nearest neighbors in that space.

4. Evaluate the model:


Train-Test Split/Cross-Validation: The dataset would be split into training and testing sets (or utilize k-fold cross-validation, especially given the small sample size) before PCA is applied to prevent data leakage. PCA parameters (fitting) would be learned only on the training set.

Metrics: I would evaluate the model using appropriate metrics for classification, considering the potential for imbalanced classes in cancer datasets:

- Accuracy: Overall proportion of correct predictions.

- Precision, Recall, F1-score: To assess performance for each cancer type, especially for minority classes.

- ROC AUC: For binary classification, or for multi-class problems via one-vs-rest averaging, indicating the model's ability to distinguish between classes.

- Confusion Matrix: To visualize the types of correct and incorrect predictions.
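The per-class metrics listed above all derive from the confusion matrix. The sketch below computes them by hand on made-up labels for three hypothetical cancer types ("A", "B", "C"); in practice scikit-learn's `classification_report` and `confusion_matrix` provide the same quantities.

```python
from collections import Counter

# Hypothetical true and predicted labels (made-up data for illustration).
y_true = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A"]

labels = sorted(set(y_true))
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
confusion = [[pair_counts[(t, p)] for p in labels] for t in labels]

def per_class_metrics(label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    tp = pair_counts[(label, label)]
    predicted = sum(pair_counts[(t, label)] for t in labels)
    actual = sum(pair_counts[(label, p)] for p in labels)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

accuracy = sum(pair_counts[(l, l)] for l in labels) / len(y_true)
print("Confusion matrix:", confusion)   # [[3, 1, 0], [0, 2, 1], [1, 0, 2]]
print(f"Accuracy: {accuracy:.2f}")      # 0.70
for label in labels:
    precision, recall, f1 = per_class_metrics(label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the overall accuracy is 0.70, while class "A" reaches 0.75 and classes "B" and "C" about 0.67 on precision and recall, which shows how per-class metrics expose differences that a single accuracy figure hides.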

Overfitting Check: I'd compare training and test set performance. If training performance is significantly better, it indicates overfitting, and further adjustments to K in KNN or the number of PCA components might be needed.

5. Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data:


- Mitigating Overfitting: This pipeline directly addresses the problem of overfitting common in high-dimensional, small-sample biomedical datasets. PCA reduces the feature space, transforming many correlated genes into a few uncorrelated principal components, which makes the learning task simpler and reduces the chances of the model memorizing noise in the training data.

- Handling the Curse of Dimensionality: Biomedical data often suffers from the 'curse of dimensionality,' where distances become less meaningful. PCA effectively projects data into a lower-dimensional space where distance calculations for KNN are more reliable and meaningful, leading to better generalization.

- Interpretability (to some extent): While PCs themselves are abstract, the approach can highlight the most significant patterns in gene expression without requiring domain experts to select individual genes (which can be biased or miss complex interactions). We can analyze the loadings of the original genes on the top PCs to gain some biological insights.

- Computational Efficiency: Reducing dimensionality makes the KNN algorithm, which is distance-intensive, computationally faster, allowing for quicker analysis and predictions.

- Non-Parametric Nature of KNN: KNN makes no assumptions about the underlying data distribution, which can be beneficial for complex biomedical data where parametric assumptions might not hold.

- Proven Methodology: Both PCA and KNN are well-established and widely used techniques in machine learning and bioinformatics, providing a trustworthy and understandable framework for analysis. The combination offers a robust and effective approach to extracting meaningful signals from complex gene expression data, which is crucial for reliable cancer classification in a real-world clinical context.
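The loading analysis mentioned under interpretability can be sketched with NumPy. The data here are synthetic and much smaller than a real expression matrix, the gene names are hypothetical, and a shared signal is deliberately planted in the first five genes so that there is something for the first principal component to find.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for expression data: 100 samples x 50 genes (made up).
n_samples, n_genes = 100, 50
X = rng.normal(size=(n_samples, n_genes))
# Plant a shared signal in the first 5 genes so that they co-vary strongly.
signal = rng.normal(size=n_samples)
X[:, :5] += 4.0 * signal[:, None]
gene_names = [f"gene_{i}" for i in range(n_genes)]   # hypothetical identifiers

# Standardize each gene, then obtain the principal axes via SVD.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, components = np.linalg.svd(X_std, full_matrices=False)

# Each row of `components` is one principal axis; its entries are the gene loadings.
pc1_loadings = components[0]
top_genes = np.argsort(np.abs(pc1_loadings))[::-1][:5]
for idx in top_genes:
    print(f"{gene_names[idx]}: loading {pc1_loadings[idx]:+.3f}")
```

In this constructed example the genes with the largest absolute loadings on the first component are the five that share the planted signal. On real data, such a ranking is a starting point for biological follow-up rather than evidence of causation.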
