
Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
Answer:
K-Nearest Neighbors (KNN) is a supervised, instance-based, non-parametric learning algorithm that makes predictions based on the K closest data points in the feature space.
How it works:
For Classification:
Choose a value for K.
Compute the distance (usually Euclidean) from the query point to every training point.
Find the K nearest neighbors.
Assign the most frequent class among those neighbors.
For Regression:
Find the K nearest neighbors in the same way.
Predict the average of their target values.
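
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using a brute-force K-nearest-neighbors search."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Hypothetical toy data: two well-separated clusters
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(X_toy, labels, [1.2, 1.1], k=3))                     # a
print(knn_predict(X_toy, values, [8.6, 8.6], k=3, task="regression"))  # 11.0
```

The only difference between the two tasks is the last step: a vote for classification, a mean for regression.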

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?
Answer:
The Curse of Dimensionality refers to problems that arise when working with high-dimensional data: as the number of features grows, the data become sparse and distance metrics become less meaningful.
Effect on KNN:
Distances between points become nearly equal.
The nearest neighbors are hard to distinguish from distant points.
Model accuracy tends to decrease.
Computational cost increases, since every distance involves more features.
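
A small experiment can make the "distances become similar" effect visible. The sketch below assumes uniformly random points, which is a simplification of real data, and the helper name `relative_contrast` is invented for the example. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min of the distances from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Contrast shrinks as the number of dimensions grows
contrasts = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, c in contrasts.items():
    print(f"{dims:>5} dims: relative contrast = {c:.3f}")
```

When the contrast is close to zero, the "nearest" neighbor is barely nearer than any other point, which is why KNN struggles in very high dimensions.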

Question 3: What is PCA? How is it different from feature selection?
Answer:
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into new orthogonal components, ordered by how much variance each one captures.
Difference:
PCA	Feature Selection
Creates new features	Selects existing features
Captures max variance	Retains most important original features
Produces uncorrelated components	Selected features may remain correlated
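
The contrast in the table can be seen directly on the Wine data. The sketch below uses `SelectKBest` with `f_classif` as one possible feature selection method; that choice, and keeping two features, are assumptions made only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection keeps a subset of the original columns
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA builds new columns; each one is a weighted mix of all 13 original features
pca_demo = PCA(n_components=2).fit(X_std)
print("PCA component weights shape:", pca_demo.components_.shape)
```

Both approaches end with two columns, but the selected features keep their original names and meaning, while each principal component is a combination of every input feature.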

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?
Answer:
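Eigenvectors: Directions of the new axes (principal components), obtained from the covariance matrix of the data.
Eigenvalues: Amount of variance captured along each eigenvector.
Importance:
A larger eigenvalue means the component captures more variance, so components are ranked by eigenvalue and those with the smallest eigenvalues are dropped first.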
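
To connect these definitions to code, the sketch below computes the eigenvalues and eigenvectors of the covariance matrix by hand and checks them against scikit-learn's `PCA`. It assumes standardized Wine features; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Sample covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total is the explained variance ratio
explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 explained variance ratios:", np.round(explained_ratio[:3], 4))

# The eigenvalues should match what PCA reports
pca_check = PCA().fit(X_std)
print("Matches sklearn PCA:", np.allclose(eigenvalues, pca_check.explained_variance_))
```

The columns of `eigenvectors` correspond to the rows of `pca_check.components_`, possibly with the sign flipped, since an eigenvector's direction is only defined up to sign.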

Question 5: How do KNN and PCA complement each other?
Answer:
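PCA reduces dimensionality and noise, and KNN, which depends entirely on distances, tends to perform better in lower dimensions.
Benefits:
Faster computation (fewer features per distance calculation)
Often improved accuracy
Reduced risk of overfitting
More meaningful distance measurements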

In [2]:
#Dataset:
#Use the Wine Dataset from sklearn.datasets.load_wine().
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature
#scaling. Compare model accuracy in both cases.

#Answer:-
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Without Scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred1 = knn.predict(X_test)
acc1 = accuracy_score(y_test, pred1)

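# With Scaling
# Fit the scaler on the training split only, then apply it to the test split,
# so that no test-set statistics leak into preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)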

knn.fit(X_train_scaled, y_train)
pred2 = knn.predict(X_test_scaled)
acc2 = accuracy_score(y_test, pred2)

print("Accuracy without scaling:", acc1)
print("Accuracy with scaling:", acc2)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
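
The accuracy gap above comes from features measured on very different scales. As a quick check, the sketch below (variable names are illustrative) lists the Wine features with the largest standard deviation; unscaled, these dominate the Euclidean distance.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Features sorted from largest to smallest spread
for idx in np.argsort(stds)[::-1][:3]:
    print(f"{wine.feature_names[idx]:<30} std = {stds[idx]:.2f}")

largest = wine.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

After `StandardScaler`, every feature has unit variance, so each contributes comparably to the distance.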


In [4]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance
#ratio of each principal component.
#(Include your Python code and output in the code box below.)

#Answer:-
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

# Load dataset
data = load_wine()
X = data.data

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
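
The ratios printed above are easier to use as a running total. The sketch below accumulates them and finds how many components are needed to reach a chosen threshold; the 95% cutoff is an assumed, commonly used value, not a rule.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
ratios = PCA().fit(X_std).explained_variance_ratio_

cumulative = np.cumsum(ratios)
threshold = 0.95  # assumed cutoff for this illustration

# Index of the first cumulative value that reaches the threshold, plus one
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("Cumulative explained variance:", np.round(cumulative, 3))
print(f"Components needed for {threshold:.0%}:", n_keep)
```

With the ratios shown in the output above, the running total first passes 0.95 at the tenth component, so ten of the thirteen components would be kept.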


In [5]:
#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
#components). Compare the accuracy with the original dataset.
#(Include your Python code and output in the code box below.)

#Answer:-
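from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# PCA (2 components), applied to X_scaled from Question 7.
# Note: PCA is fitted on the full dataset before the split, so the test
# samples influence the components; fitting on the training split only
# would give a stricter estimate.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Train-test split (same test_size and random_state as Question 6)
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)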

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
pred = knn.predict(X_test_pca)

# For comparison, KNN on all 13 scaled features scored 0.9630 in Question 6
print("Accuracy after PCA:", accuracy_score(y_test, pred))


Accuracy after PCA: 0.9814814814814815


In [6]:
#Question 9: Train a KNN Classifier with different distance metrics (euclidean,
#manhattan) on the scaled Wine dataset and compare the results.
#(Include your Python code and output in the code box below.)

#Answer:-
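from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Uses X_train_scaled and X_test_scaled from Question 6; n_neighbors defaults to 5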

# Euclidean
knn_eu = KNeighborsClassifier(metric='euclidean')
knn_eu.fit(X_train_scaled, y_train)
acc_eu = accuracy_score(y_test, knn_eu.predict(X_test_scaled))

# Manhattan
knn_man = KNeighborsClassifier(metric='manhattan')
knn_man.fit(X_train_scaled, y_train)
acc_man = accuracy_score(y_test, knn_man.predict(X_test_scaled))

print("Euclidean Accuracy:", acc_eu)
print("Manhattan Accuracy:", acc_man)


Euclidean Accuracy: 0.9629629629629629
Manhattan Accuracy: 0.9629629629629629
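
The two metrics give the same accuracy here, but they measure distance differently. A minimal numeric sketch with two made-up points shows the difference:

```python
import numpy as np

# Hypothetical points, chosen so the arithmetic is easy to follow
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-axis differences

print("Euclidean:", euclidean)  # 5.0
print("Manhattan:", manhattan)  # 7.0
```

Because the two metrics rank neighbors differently only in some cases, identical accuracy on a small, well-separated dataset such as Wine is not surprising.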


In [7]:
#Question 10: You are working with a high-dimensional gene expression dataset to
#classify patients with different types of cancer.
#Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:
#Use PCA to reduce dimensionality
#Decide how many components to keep
#Use KNN for classification post-dimensionality reduction
#Evaluate the model
#Justify this pipeline to your stakeholders as a robust solution for real-world
#biomedical data
#(Include your Python code and output in the code box below.)

#Answer:-
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

# Load dataset (Wine is used as a stand-in for a gene expression dataset)
data = load_wine()
X, y = data.data, data.target

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep enough components to explain 95% of the variance
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# The scaler and PCA are refit inside each fold, so no validation data leaks into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)

print("Cross-validation Accuracy:", scores.mean())


Cross-validation Accuracy: 0.9495238095238095
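
To decide how many components to keep and to evaluate the model in one step, the number of components and K can be tuned together with cross-validation. The sketch below is one possible approach, again using Wine as a stand-in; the candidate values in `param_grid` and the use of `StratifiedKFold` are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative assumptions
param_grid = {
    "pca__n_components": [2, 5, 0.95],
    "knn__n_neighbors": [3, 5, 7],
}

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 4))
```

For stakeholders, the main points are that every preprocessing step is learned only from training folds, that the reported accuracy is an average over held-out data, and that the reduced feature set makes the model less prone to overfitting when samples are scarce.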
