1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

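ans-

    The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised
    
    learning method that uses proximity to make predictions about an
    
    individual data point. To predict for a new point, it finds the k training
    
    points closest to it under a distance metric such as Euclidean distance.
    
    In classification, the prediction is the majority class among those k
    
    neighbors; in regression, it is the average of their target values.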
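
The neighbor lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for scikit-learn's estimators: the helper `knn_predict` and the toy arrays are hypothetical names and values made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (invented for illustration): two small clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 11.0, 12.0, 50.0, 51.0, 52.0])

query = np.array([1.1, 1.0])
predicted_label = knn_predict(X_train, y_class, query, k=3, task="classification")
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_label, predicted_value)  # prints: 0 11.0
```

The same three neighbors drive both predictions; only the aggregation step (vote versus mean) differs.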

 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
ans-

    The Curse of Dimensionality refers to various phenomena that arise when
    
    dealing with high-dimensional data.

**How it affects KNN performance:**


1- **Misleading Distances:**

    In high dimensions, all points seem far apart and roughly equidistant, so
    
    the "nearest" neighbors might not be truly close or representative, as
    
    noted in Towards AI.



2- **Increased Data Needs:**

    To maintain density and find reliable neighbors, KNN needs an exponentially
       
    larger dataset as dimensions increase, quickly becoming impractical.


3- **Overfitting:**


    With sparse data, KNN can latch onto irrelevant features (noise), leading
    
    to poor generalization on new data, as it relies heavily on local patterns.


4- **Computational Cost**:

    Calculating distances in high-dimensional spaces is computationally
    
    expensive and slow. Because KNN defers nearly all work to query time,
    
    this mainly increases prediction time rather than training time.


5- **Loss of Discriminative Power:**


    The core principle of KNN (finding similar neighbors) breaks down because all
    
    points become effectively "dissimilar" and distant, making classifications unreliable.
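
The loss of contrast between near and far points can be observed directly. The sketch below assumes uniformly random points in the unit hypercube (an arbitrary choice for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The helper `distance_contrast` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_points, n_dims):
    """Return (max - min) / min over distances from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Compare the contrast across increasing dimensionality
contrasts = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.3f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, which is why the "nearest" neighbor stops being meaningfully nearer than any other point.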

3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

ans-

    Principal Component Analysis (PCA) is a feature extraction method. It
    
    creates a smaller set of new, uncorrelated "artificial" features
    
    (components) that capture most of the variance in the data, which makes it
    
    well suited to dimensionality reduction.
    
    Feature selection, by contrast, directly picks the most relevant original
    
    features and keeps them intact, often using the target labels to assess
    
    importance.
    
    PCA is therefore transformative but less interpretable, whereas feature
    
    selection preserves the original meaning of the data but might miss
    
    complex patterns that span several features.
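
The contrast can be made concrete on the Wine dataset. This sketch uses `SelectKBest` with the ANOVA F-score as one possible feature selection method (an assumption for illustration; many other selectors exist) and keeps 3 features in both cases, an arbitrary number.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: PCA builds 3 new features and never looks at the labels
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Feature selection: keep the 3 original columns with the highest F-score
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
selected_idx = selector.get_support(indices=True)
selected_names = [wine.feature_names[i] for i in selected_idx]

print("PCA output shape:      ", X_pca.shape)
print("Selection output shape:", X_selected.shape)
print("Selected original features:", selected_names)
```

The selected columns still carry their original names, while the PCA columns are linear combinations of all 13 features and have no direct physical meaning.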

4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

ans-
    
    In PCA, the eigenvectors of the data's covariance matrix define the new
    
    axes (Principal Components), representing directions of maximum variance,
    
    while the eigenvalues quantify the amount of variance along each
    
    eigenvector, indicating its importance.




---


 **Why They Are Important in PCA**


**Dimensionality Reduction:**

    By sorting eigenvectors by their eigenvalues (largest first), we identify
    
    the most informative directions, allowing us to discard components with
    
    small eigenvalues, thus reducing dimensions.


**Information Preservation**:

    Focusing on components with high eigenvalues ensures we keep the most
    
    significant patterns and variability in the data.


**Data Transformation**:

    Projecting the data onto the retained eigenvectors transforms it from its
    
    original feature space to a new, smaller principal component space, making
    
    the data easier to visualize and process.
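
The link between eigenvalues and explained variance can be checked by hand. The sketch below computes the eigen-decomposition of the covariance matrix with NumPy and compares it with scikit-learn's `PCA` on the standardized Wine data; variable names such as `manual_ratio` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance carried by each eigenvector
manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
matches = np.allclose(manual_ratio, pca.explained_variance_ratio_)
print("Matches sklearn's explained_variance_ratio_:", matches)

# Project the (already centered) data onto the top 2 eigenvectors
X_projected = X_scaled @ eigenvectors[:, :2]
print(X_projected.shape)
```

The projected coordinates may differ from `PCA.transform` by a sign flip per component, since an eigenvector and its negation describe the same axis.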

    

5: How do KNN and PCA complement each other when applied in a single
pipeline?

ans -


KNN and PCA can be combined in a single pipeline to enhance the performance and efficiency of machine learning models [1]. The techniques complement each other in two primary ways:




---

1- **Noise Reduction:**


    PCA concentrates the most significant variance in the data into its leading

    components, so dropping the trailing components can filter out minor, random

    variations (noise) present in higher dimensions. KNN is sensitive to noise,

    so applying PCA first can lead to more robust and accurate classification or

    regression results by focusing on the underlying structure of the data.



2- **Dimensionality Reduction for Efficiency**:


    PCA reduces the number of dimensions in the dataset. This is highly
    
    beneficial for KNN because the cost of every distance computation, and the
    
    memory needed to store the training set, grow with the number of features.
    
    Fewer dimensions also soften the "curse of dimensionality", in which
    
    distances lose their discriminative power. By reducing the dimensionality,
    
    PCA makes the KNN algorithm run faster and with less memory overhead.
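
One way to wire the two techniques together is scikit-learn's `make_pipeline`, which refits the scaler and PCA inside every cross-validation fold so that no test-fold information leaks into the transformation. The choice of 5 components and 5 neighbors below is an arbitrary assumption for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce dimensions -> classify, treated as one estimator
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validation; each fold fits the scaler and PCA on its own training part
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```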

6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.



In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Case 1: KNN without Feature Scaling ---

# 3. Initialize and train the KNN classifier (without scaling)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)

# 4. Make predictions and evaluate
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# --- Case 2: KNN with Feature Scaling ---

# 5. Initialize the StandardScaler
scaler = StandardScaler()

# 6. Fit the scaler on the training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 7. Initialize and train the KNN classifier (with scaling)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# 8. Make predictions and evaluate
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# 9. Compare accuracies
print(f"Accuracy without scaling: {accuracy_unscaled:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")


Accuracy without scaling: 0.7407
Accuracy with scaling:    0.9630


 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [2]:
# 1. Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd # For better display of results

# 2. Load the Wine dataset
wine = load_wine()
X = wine.data
# y = wine.target # Target (wine type) for potential classification, but not needed for basic PCA variance

# 3. Standardize the data (Crucial for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 4. Initialize and Train PCA
# PCA() defaults to n_components=None, which keeps every component (13 here)
pca = PCA()
pca.fit(X_scaled)

# 5. Print Explained Variance Ratio
print("Explained Variance Ratio for each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

# Optional: Print cumulative variance
print("\nCumulative Explained Variance:")
cumulative_variance = pca.explained_variance_ratio_.cumsum()
for i, cum_ratio in enumerate(cumulative_variance):
    print(f"Up to PC{i+1}: {cum_ratio:.4f}")

# Optional: Display results in a DataFrame for clarity
results_df = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(pca.explained_variance_ratio_))],
    'Explained Variance Ratio': pca.explained_variance_ratio_,
    'Cumulative Variance': cumulative_variance
})
print("\n--- PCA Results ---")
print(results_df)


Explained Variance Ratio for each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

Cumulative Explained Variance:
Up to PC1: 0.3620
Up to PC2: 0.5541
Up to PC3: 0.6653
Up to PC4: 0.7360
Up to PC5: 0.8016
Up to PC6: 0.8510
Up to PC7: 0.8934
Up to PC8: 0.9202
Up to PC9: 0.9424
Up to PC10: 0.9617
Up to PC11: 0.9791
Up to PC12: 0.9920
Up to PC13: 1.0000

--- PCA Results ---
   Principal Component  Explained Variance Ratio  Cumulative Variance
0                  PC1                  0.361988             0.361988
1                  PC2                  0.192075             0.554063
2                  PC3                  0.111236             0.665300
3                  PC4                  0.070690             0.735990
4                  PC5                  0.065633             0.801623
5                  PC6                  0.049358             0.850981
6   
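
A common way to turn ratios like the ones printed above into a component count is a cumulative-variance threshold. The sketch below assumes a 95% target (an arbitrary choice, not taken from the original) and relies on the fact that scikit-learn's `PCA` accepts a float `n_components` between 0 and 1 for this purpose. Given the cumulative values printed above (0.9424 up to PC9, 0.9617 up to PC10), it should settle on 10 components.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# A float n_components asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)

print("Components kept:", pca_95.n_components_)
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.4f}")
```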

8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris # Example dataset

# 1. Load and Preprocess the Dataset
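# Note: this cell uses the Iris dataset as an example, not the Wine dataset
# from the earlier questions, so the accuracies printed below are for Iris.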
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the data (crucial for both PCA and KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Train KNN on the Original (Scaled) Dataset
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# 3. Apply PCA and Train KNN on Transformed Dataset
# Retain top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled) # Apply the same transformation to test set

# Train KNN on the PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# 4. Compare Accuracies
print(f"Accuracy on original dataset: {accuracy_original:.4f}")
print(f"Accuracy on PCA-transformed dataset (2 components): {accuracy_pca:.4f}")

# You can also print the explained variance ratio for context
print(f"Explained variance of the top 2 components: {pca.explained_variance_ratio_.sum():.4f}")


Accuracy on original dataset: 1.0000
Accuracy on PCA-transformed dataset (2 components): 0.9556
Explained variance of the top 2 components: 0.9521
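
For consistency with the earlier Wine questions, the same comparison can be run on the Wine dataset. This is a sketch that reuses the split and hyperparameters from the cell above (`test_size=0.3`, `random_state=42`, `n_neighbors=5`); variable names prefixed with `wine_` are chosen for this example, and no result is claimed here since the numbers depend on the split.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: KNN on all 13 scaled features
knn_full = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
wine_accuracy_full = accuracy_score(y_test, knn_full.predict(X_test_scaled))

# PCA fitted on the training data only, keeping the top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5).fit(X_train_pca, y_train)
wine_accuracy_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Wine accuracy, all scaled features: {wine_accuracy_full:.4f}")
print(f"Wine accuracy, 2 PCA components:    {wine_accuracy_pca:.4f}")
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.4f}")
```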


9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split the data into training and testing sets
# random_state gives reproducibility; stratify=y keeps class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Scale the features
# Feature scaling is crucial for distance-based algorithms like KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Function to train and evaluate KNN with a specific metric
def train_and_evaluate_knn(metric_name):
    # The default metric is 'minkowski' with p=2, which is equivalent to
    # 'euclidean'; 'manhattan' corresponds to p=1
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric_name)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# 4. Train and compare models
euclidean_accuracy = train_and_evaluate_knn('euclidean')
manhattan_accuracy = train_and_evaluate_knn('manhattan')

# 5. Display results
print(f"Accuracy with Euclidean distance: {euclidean_accuracy:.4f}")
print(f"Accuracy with Manhattan distance: {manhattan_accuracy:.4f}")

# Compare and summarise
if euclidean_accuracy > manhattan_accuracy:
    print("\nEuclidean distance performed better.")
elif manhattan_accuracy > euclidean_accuracy:
    print("\nManhattan distance performed better.")
else:
    print("\nBoth distance metrics performed equally well.")


Accuracy with Euclidean distance: 0.9444
Accuracy with Manhattan distance: 0.9815

Manhattan distance performed better.
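
With a 54-sample test set, the gap between the two accuracies above amounts to only two samples, so a single split may not be conclusive. A cross-validated comparison is one way to check it. The sketch below uses `GridSearchCV` over both metrics and a few values of `n_neighbors`; the grid values and step names (`scale`, `knn`) are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling lives inside the pipeline so each fold is scaled on its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__metric": ["euclidean", "manhattan"],
    "knn__n_neighbors": [3, 5, 7],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```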


10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
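
One possible shape for such a pipeline is sketched below. No gene expression data is available here, so the example generates a synthetic stand-in with `make_classification`; the sample count, feature count, 90% variance threshold, and `n_neighbors=5` are all arbitrary assumptions. The key design choice is that scaling and PCA sit inside the pipeline, so they are refit on each training fold and the evaluation is not contaminated by the held-out samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
# These sizes are illustrative assumptions, not taken from a real study.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=20,
    n_classes=3, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # Keep the smallest number of components explaining 90% of the variance
    ("pca", PCA(n_components=0.90)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds preserve class proportions, which matters with few samples
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Refit on all data to inspect how many components the threshold selects
pipeline.fit(X, y)
n_kept = pipeline.named_steps["pca"].n_components_
print(f"Components kept for 90% variance: {n_kept}")
```

Reporting the fold-to-fold spread alongside the mean gives stakeholders a sense of how stable the estimate is on a small cohort.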
