# Assignment Code: DA-AG-016

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
* K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. It is called a “lazy learner” because it does not build an explicit model during training; instead, it stores the training data and makes each prediction by comparing the new point against it.
* How it works:
     * For a new data point, find the K nearest training points using a distance measure (such as Euclidean distance).
     * Use those neighbors' labels or target values to make the prediction.
* KNN in Classification:
     * The algorithm uses a majority vote among the K neighbors.
     * Example: If K=7 and among the 7 closest neighbors, 4 belong to class A and 3 to class B, the new point is classified as class A.
* KNN in Regression:
     * Instead of voting, KNN takes the average (or weighted average) of the target values of the K nearest neighbors.
     * Example: If K=3 and the target values of the neighbors are [10, 12, 14], then the predicted value is their average, (10 + 12 + 14) / 3 = 12.
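
The two prediction rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for scikit-learn: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and ties in the vote are not handled.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to query (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest labels."""
    idx = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in idx)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    """Regression: plain (unweighted) average of the k nearest target values."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / k

# Toy 1-D data, invented for illustration
X = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
labels = ["A", "A", "B", "B", "B"]
targets = [10.0, 12.0, 14.0, 50.0, 52.0]

predicted_class = knn_classify(X, labels, (2.2,), k=3)
predicted_value = knn_regress(X, targets, (2.2,), k=3)
print(predicted_class, predicted_value)  # A 12.0
```

The three nearest points to 2.2 are 2.0, 3.0 and 1.0, so the vote is two A against one B, and the regression average is (12 + 14 + 10) / 3 = 12.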

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?
* The **curse of dimensionality** refers to the problems that arise when working with data that has a very large number of features (high dimensions).
* As dimensions increase:
        * Data points become sparser (spread out).
        * Distances between points become less meaningful.
        * Models that rely on distance (like KNN) struggle to find “true” nearest neighbors.
* Effect on KNN (since KNN depends on distance):
        * Nearest neighbors become hard to identify, because the nearest and farthest points end up at similar distances.
        * Accuracy tends to drop.
        * Computation cost increases, since every distance calculation touches more features.
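
The claim that distances become less meaningful can be checked with a small simulation. The sketch below assumes uniformly random points in a unit hypercube (an illustrative setup, not real data) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` is made up for this example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(p, query) for p in points]
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=500)
print(f"2 dimensions:   {contrast_low:.2f}")
print(f"500 dimensions: {contrast_high:.2f}")
```

In 2 dimensions the contrast is large, so the nearest neighbor is clearly distinguishable. In 500 dimensions it shrinks toward zero, which means the "nearest" point is barely closer than any other.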

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?
* PCA is a dimensionality reduction technique that transforms the original features into a new set of features called principal components.
    * These components are linear combinations of the original features.
    * The first few components capture as much of the data's variance as possible, so fewer dimensions are needed.

* Difference from Feature Selection:
    * Feature Selection: Keeps the most important original features.
    * PCA (Feature Extraction): Creates new features by combining original ones.

* So basically, PCA transforms features, while feature selection just picks features.
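
The difference can be seen side by side on the Wine data. This is a sketch under assumptions: `SelectKBest` with `f_classif` is just one of several possible feature selection methods, and keeping 2 features or components is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keeps 2 of the 13 original columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X_scaled)
print("Selected original features:", [data.feature_names[i] for i in kept])

# PCA: builds 2 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Weights of PC1 on the original features:", np.round(pca.components_[0], 2))
```

The selected columns are exact copies of two original features, while each PCA column draws on every feature through the weights in `pca.components_`.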


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?
* Eigenvectors: Directions (eigenvectors of the data's covariance matrix) along which the data varies; they define the principal components, with the first pointing in the direction of greatest variance.
* Eigenvalues: Numbers that show how much variance (information) lies along each eigenvector.
* Their Importance:
      * Eigenvectors decide the new feature axes (principal components).
      * Eigenvalues tell us the importance of each component, helping us choose how many components to keep.
* **Eigenvectors** give the directions of maximum variance, and **Eigenvalues** tell the amount of variance captured. Together, they form the core of PCA.
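
To make the link concrete, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix directly with NumPy and compares the result with scikit-learn's `PCA`. It assumes the standardized Wine dataset used later in this assignment; variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance carried by each eigenvector
variance_ratio = eigenvalues / eigenvalues.sum()
print(np.round(variance_ratio, 4))

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_scaled)
print(np.allclose(variance_ratio, pca.explained_variance_ratio_))
```

Each eigenvalue divided by the sum of all eigenvalues gives the explained variance ratio of the matching component, which is how the eigenvalues help decide how many components to keep.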

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?
* KNN and PCA
    * Problem: KNN struggles with high-dimensional data because of the curse of dimensionality.
    * Solution: PCA reduces dimensions by keeping only the most important information.

* How they complement each other:
    * PCA transforms high-dimensional data into a lower-dimensional space.
    * KNN is then applied to this compact data, where distances are more reliable.
    * This makes KNN faster and less memory-intensive, and often more accurate.

* Example:
     In image recognition, each image may have thousands of pixel features.
       * PCA reduces the data to a smaller set of components (e.g., 100 components instead of 10,000 pixels).
       * KNN then classifies images using these reduced features, which improves speed and often accuracy.
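
A single scikit-learn pipeline is one way to chain these steps. The sketch below is an assumption-laden illustration: it uses the small `load_digits` image dataset (64 pixel features) as a stand-in for larger images, and 16 components is an arbitrary choice rather than a tuned value.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 pixel features (stand-in for larger images)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale -> PCA -> KNN; each step is fitted on the training data only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=16),  # 16 is an arbitrary illustrative choice
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_train, y_train)

n_components = pipe.named_steps["pca"].n_components_
accuracy = pipe.score(X_test, y_test)
print("Features seen by KNN:", n_components)
print("Test accuracy:", round(accuracy, 2))
```

Wrapping the steps in a pipeline means the scaler and PCA are learned from the training split and then reused unchanged on the test split, so KNN always measures distances in the same reduced space.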


**Dataset**:
Use the Wine Dataset from sklearn.datasets.load_wine().

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# --- KNN without scaling ---
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc_no_scaling = accuracy_score(y_test, knn.predict(X_test))

# --- KNN with scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
acc_with_scaling = accuracy_score(y_test, knn.predict(X_test_scaled))

print("Accuracy without scaling:", round(acc_no_scaling, 2))
print("Accuracy with scaling:", round(acc_with_scaling, 2))


Accuracy without scaling: 0.74
Accuracy with scaling: 0.96
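
A likely reason for the gap is that the Wine features live on very different scales, so unscaled Euclidean distance is dominated by the features with the largest spread. The short check below prints the standard deviation of each feature; it is an illustrative diagnostic, not part of the required solution.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# List features from largest to smallest spread
for i in np.argsort(stds)[::-1]:
    print(f"{data.feature_names[i]:<30} std = {stds[i]:.2f}")

largest = data.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.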


Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

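# Load features only; labels are not needed for PCA
X, _ = load_wine(return_X_y=True)
# Standardize so that large-scale features do not dominate the variance
X_scaled = StandardScaler().fit_transform(X)

# Keep all 13 components to see the full variance breakdown
pca = PCA().fit(X_scaled)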

print("Explained variance ratio of each component:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {var:.4f}")


Explained variance ratio of each component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
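
The ratios are easier to use when accumulated. The sketch below takes the values printed above and finds how many components are needed to reach a chosen share of the variance; the 95% threshold is an illustrative target, not something the question prescribes.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above
ratios = [0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
          0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080]

cumulative = list(accumulate(ratios))
threshold = 0.95  # illustrative target

# First component count whose cumulative ratio reaches the threshold
n_keep = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

for i, c in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {c:.4f}")
print("Components needed for 95%:", n_keep)  # 10
```

The first two components together explain about 55% of the variance, and ten components are needed to pass 95%.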


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data and hold out 30% for testing
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled, X_test_scaled = scaler.fit_transform(X_train), scaler.transform(X_test)

# Baseline: KNN on all 13 scaled features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
acc_original = accuracy_score(y_test, knn.predict(X_test_scaled))

# Project onto the top 2 principal components (PCA fitted on the training split)
pca = PCA(n_components=2)
X_train_pca, X_test_pca = pca.fit_transform(X_train_scaled), pca.transform(X_test_scaled)

# Refit the same KNN on the 2-component data
knn.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn.predict(X_test_pca))

print("Accuracy on original dataset:", round(acc_original, 2))
print("Accuracy on PCA-transformed dataset (2 components):", round(acc_pca, 2))

Accuracy on original dataset: 0.96
Accuracy on PCA-transformed dataset (2 components): 0.98


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data and hold out 30% for testing
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize using statistics from the training split only
scaler = StandardScaler()
X_train_scaled, X_test_scaled = scaler.fit_transform(X_train), scaler.transform(X_test)

# KNN with Euclidean (straight-line) distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))

# KNN with Manhattan (sum of absolute differences) distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Accuracy with Euclidean distance:", round(acc_euclidean, 2))
print("Accuracy with Manhattan distance:", round(acc_manhattan, 2))

Accuracy with Euclidean distance: 0.96
Accuracy with Manhattan distance: 0.96
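
For reference, the two metrics differ only in how they combine per-feature differences. The plain-Python sketch below shows both formulas on a made-up pair of points; the function names are illustrative and do not mirror scikit-learn's internals.

```python
def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
d_euclidean = euclidean(p, q)
d_manhattan = manhattan(p, q)
print(d_euclidean, d_manhattan)  # 5.0 7.0
```

Both metrics gave the same accuracy here, which suggests that on the scaled Wine data they select largely the same neighbors for the test points.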


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:
* Use PCA to reduce dimensionality
* Decide how many components to keep
* Use KNN for classification post-dimensionality reduction
* Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

(Include your Python code and output in the code box below.)

**Answer**-
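* The Process:
      * Use PCA → standardize the features, then project the high-dimensional gene expression data onto a smaller set of principal components.
      * Decide components → keep enough components to explain ~95% of the variance, checked with the cumulative explained variance ratio.
      * Apply KNN → classify patients in the reduced space, where distances are more meaningful.
      * Evaluate → check accuracy on a held-out test split; with few samples, cross-validation gives a more stable estimate.
      * Justify → PCA+KNN reduces the risk of overfitting, is simple to explain, and is efficient for biomedical data.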


In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Simulate a high-dimensional gene expression dataset
X, y = make_classification(n_samples=200, n_features=1000, n_informative=50,
                           n_classes=2, random_state=42)

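# Scale features
# Caveat: the scaler and PCA are fitted on all samples before the split below,
# so test rows influence the preprocessing; fitting both on the training split
# only would give a stricter accuracy estimate.
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)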

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Original features:", X.shape[1])
print("Reduced features after PCA:", X_pca.shape[1])
print("Model Accuracy:", round(acc, 2))


Original features: 1000
Reduced features after PCA: 175
Model Accuracy: 0.58
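
In the code above, the scaler and PCA see every sample before the split, and a single 60-sample test set gives a noisy estimate. A possible alternative, sketched below on the same simulated data, wraps all steps in a pipeline and scores it with stratified 5-fold cross-validation so that preprocessing is refitted on each training fold. The fold count and the `random_state` values are illustrative assumptions, and no output is shown because the numbers will differ from the single-split result.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as above
X, y = make_classification(n_samples=200, n_features=1000, n_informative=50,
                           n_classes=2, random_state=42)

# Scale -> PCA (95% variance) -> KNN, refitted inside every fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(2))
print("Mean accuracy:", round(scores.mean(), 2))
```

Reporting the mean and the spread across folds gives stakeholders a more honest picture of how the pipeline might behave on new patients than one accuracy number.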
