# Theory

1. What is K-Nearest Neighbors (KNN) and how does it work?
ans- K-Nearest Neighbors (KNN) is a non-parametric, lazy learning algorithm used for both classification and regression tasks.

  How it works:
  Training Phase (Lazy): There is no explicit training phase. The algorithm simply stores the entire training dataset.

  Prediction Phase:
  To classify a new data point, KNN finds the 'k' closest data points (neighbors) in the training dataset based on a chosen distance metric (e.g., Euclidean, Manhattan).
  For classification, the new data point is assigned the class label that is most frequent among its 'k' nearest neighbors (majority vote).
  For regression, the new data point is assigned the average (or weighted average) of the target values of its 'k' nearest neighbors.
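
The prediction steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_classify` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify one point with brute-force Euclidean KNN (illustrative helper)."""
    # "Training" is just having X_train and y_train stored.
    # Distance from x_new to every stored training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups (hypothetical values).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_near_first_group = knn_classify(X_train, y_train, np.array([1.2, 1.4]), k=3)
label_near_second_group = knn_classify(X_train, y_train, np.array([6.8, 6.9]), k=3)
print(label_near_first_group, label_near_second_group)
```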

2. What is the difference between KNN Classification and KNN Regression?

ans- The core difference lies in the type of output predicted and how the 'k' nearest neighbors' values are aggregated:

KNN Classification: Predicts a discrete class label. The output is determined by a majority vote among the class labels of the 'k' nearest neighbors. For example, if among 5 neighbors, 3 are 'Class A' and 2 are 'Class B', the new point is classified as 'Class A'.
KNN Regression: Predicts a continuous numerical value. The output is typically the mean (average) of the target values of the 'k' nearest neighbors. For example, if the target values of 5 neighbors are [10, 12, 11, 13, 9], the predicted value for the new point would be their average (11).
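
The two aggregation rules can be checked directly with the numbers from the examples above. The neighbor distances used for the weighted average are hypothetical and only there to show how inverse-distance weighting shifts the result.

```python
from collections import Counter
import numpy as np

# Classification: majority vote over the 5 neighbors' labels.
neighbor_labels = ["A", "A", "A", "B", "B"]
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]

# Regression: plain mean of the 5 neighbors' target values.
neighbor_targets = np.array([10, 12, 11, 13, 9])
predicted_value = neighbor_targets.mean()

# Weighted regression: closer neighbors count more (weights = 1 / distance).
neighbor_distances = np.array([0.5, 1.0, 1.0, 2.0, 2.0])  # hypothetical distances
weighted_value = np.average(neighbor_targets, weights=1.0 / neighbor_distances)

print(predicted_class, predicted_value, weighted_value)
```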

3. What is the role of the distance metric in KNN?

ans- The distance metric is fundamental to KNN because it defines "closeness" or "similarity" between data points. It determines how the 'k' nearest neighbors are identified. Different distance metrics can lead to different sets of neighbors, and therefore, different predictions. Common distance metrics include:

Euclidean Distance: The straight-line distance between two points in a Euclidean space (most common).
Manhattan Distance (City Block Distance): The sum of the absolute differences of their coordinates.
Minkowski Distance: A generalization of Euclidean and Manhattan distances.
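
As a small sketch of how these metrics relate, the Minkowski distance with p=1 gives the Manhattan distance and with p=2 gives the Euclidean distance. The helper `minkowski` below is written for illustration and is not a library function.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors (illustrative helper)."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

manhattan = minkowski(a, b, p=1)  # |1-4| + |2-6|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```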

4. What is the Curse of Dimensionality in KNN?

ans- The "Curse of Dimensionality" refers to various problems that arise when working with high-dimensional data. In the context of KNN:

As the number of features (dimensions) increases, the volume of the feature space grows exponentially.
Data points become increasingly sparse, meaning that even "close" neighbors might be far apart in high dimensions.
The concept of "distance" becomes less meaningful, as distances between data points tend to converge, making it difficult to distinguish between true neighbors and non-neighbors.
This leads to KNN becoming less effective and computationally more expensive in high-dimensional spaces.
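
One way to see distances "converging" is to measure the relative contrast between the farthest and the nearest point from a query as the dimension grows. The sketch below uses uniformly random points, which is an assumption made for illustration; real datasets with structure may behave less drastically.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# A smaller value means the nearest and farthest points are harder to tell apart.
contrasts = {n_dims: relative_contrast(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>5} dims: relative contrast = {contrast:.3f}")
```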

5. How can we choose the best value of K in KNN?

ans- Choosing an optimal 'K' is crucial for KNN performance. There's no single best 'K' for all datasets, but common approaches include:

Trial and Error / Grid Search: Test different values of 'K' (e.g., odd numbers like 1, 3, 5, 7, 9...) and evaluate the model's performance (e.g., using accuracy for classification or MSE for regression) on a validation set or using cross-validation.
Cross-Validation: This is the most robust method. The dataset is split into multiple folds. The model is trained on some folds and tested on the remaining fold, rotating through all folds. This provides a more reliable estimate of performance for each 'K'.
Rule of thumb: Sometimes, K = √N (where N is the number of samples) is suggested, but this is a rough guideline.
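Consider odd K for binary classification: With two classes, an odd 'K' rules out ties in the majority vote. With three or more classes, ties can still occur (e.g., K=5 with votes split 2-2-1), so a tie-breaking rule or distance weighting is still needed.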

6. What are KD Tree and Ball Tree in KNN?

ans- KD Trees and Ball Trees are data structures designed to speed up the search for nearest neighbors. They avoid comparing a query point against every training point, which matters most for datasets with many data points.

KD Tree (K-Dimensional Tree): A binary tree that partitions the data space by recursively splitting it along one of the feature axes. It's efficient for low-to-medium dimensional data.
Ball Tree: A binary tree where each node represents a "hyper-sphere" (ball) containing a subset of data points. It's often more efficient than KD Trees in higher-dimensional spaces and for arbitrary distance metrics because it doesn't rely on axis-aligned splits.

7. When should you use KD Tree vs. Ball Tree?

ans- KD Tree: Generally preferred for lower-dimensional data (typically less than 20 dimensions) and when using Euclidean distance (or other L_p norms). It's simpler to implement and often faster for these scenarios.
Ball Tree: Generally preferred for higher-dimensional data (above 20 dimensions where KD Trees become inefficient due to the curse of dimensionality) or when using non-Euclidean distance metrics where axis-aligned splits are less effective.
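
scikit-learn exposes both structures directly as `sklearn.neighbors.KDTree` and `sklearn.neighbors.BallTree`. The sketch below, on random 3-dimensional points chosen only for illustration, builds both trees and checks that they return the same neighbors as a brute-force search. The trees change how fast neighbors are found, not which neighbors are found.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
points = rng.random((200, 3))   # 200 random points in 3 dimensions
query = rng.random((1, 3))      # one query point

# Build both index structures on the same data.
kd_tree = KDTree(points, leaf_size=30)
ball_tree = BallTree(points, leaf_size=30)

# Ask each tree for the 5 nearest neighbors (Euclidean by default).
kd_dist, kd_idx = kd_tree.query(query, k=5)
ball_dist, ball_idx = ball_tree.query(query, k=5)

# Brute-force reference: sort all distances and keep the 5 smallest.
brute_idx = np.argsort(np.linalg.norm(points - query, axis=1))[:5]

print("KD Tree:  ", kd_idx[0])
print("Ball Tree:", ball_idx[0])
print("Brute:    ", brute_idx)
```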

8. What are the disadvantages of KNN?

ans- Computationally Expensive (Prediction Phase): For large datasets, finding the 'k' nearest neighbors for each new data point can be very slow, because a brute-force search computes the distance to every training point.
Memory Intensive: It needs to store the entire training dataset in memory, which can be an issue for very large datasets.
Sensitive to Irrelevant Features: Irrelevant or noisy features can disproportionately influence distance calculations, leading to poor performance.
Sensitive to Feature Scaling: Features with larger scales will have a greater impact on distance calculations than features with smaller scales.
Doesn't work well with High-Dimensional Data (Curse of Dimensionality): Distances become less meaningful, and computational cost increases.
Imbalanced Data: If the classes are heavily imbalanced, the majority class tends to dominate the 'k' neighbors, leading to biased predictions against the minority class.

9. How does feature scaling affect KNN?

ans- Feature scaling is critical for KNN. Since KNN relies on distance calculations to find neighbors:

Features with larger numerical ranges or higher magnitudes will inherently have a disproportionately larger impact on the distance calculation compared to features with smaller ranges.
This can lead to features with smaller ranges being effectively ignored, even if they are highly predictive.
Solution: Techniques like Standardization (Z-score normalization) or Normalization (Min-Max scaling) should be applied to bring all features to a similar scale before applying KNN.
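
A small made-up example shows how an unscaled feature can decide who the nearest neighbor is. The columns (age in years, income in currency units) and all values are hypothetical; the helper `nearest_to_first` is written only for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), income (currency units). Hypothetical values.
X = np.array([
    [25.0, 50_000.0],   # row 0: the reference person
    [55.0, 50_200.0],   # row 1: very different age, almost the same income
    [26.0, 51_000.0],   # row 2: almost the same age, slightly different income
    [40.0, 90_000.0],
    [35.0, 20_000.0],
])

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(X)  # income dominates the distance
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest to row 0 (unscaled):", nearest_raw)
print("Nearest to row 0 (standardized):", nearest_scaled)
```

On the raw data the 30-year age gap of row 1 is swamped by income differences measured in hundreds or thousands, so row 1 looks closest. After standardization both features contribute on a comparable scale and row 2 becomes the nearest neighbor.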

10. How does KNN handle missing values in a dataset?
ans- KNN itself does not inherently handle missing values. If a data point has missing values, its distance to other points cannot be accurately calculated using standard distance metrics. Common strategies to deal with missing values before applying KNN include:

Imputation: Replacing missing values with estimated values, such as:
Mean, median, or mode of the feature.
Using more sophisticated imputation methods (e.g., K-Nearest Neighbor Imputation, where missing values are imputed based on the non-missing values of the 'k' nearest neighbors).
Exclusion: Removing rows (data points) or columns (features) with missing values. This is only feasible if there are very few missing values.
Specific distance metrics: Some specialized distance metrics can be used that are designed to handle missing values, but these are less common in general KNN implementations.
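
As one example of such a metric, scikit-learn provides `nan_euclidean_distances`, which skips coordinates that are missing in either point and rescales the result by (total coordinates / coordinates present). It is also the default metric of `KNNImputer`. The tiny array below is made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X_nan = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 7.0],
])

# Pairwise distances that ignore missing coordinates and rescale.
D = nan_euclidean_distances(X_nan)
print(D)

# Rows 0 and 2 share only the first coordinate:
# sqrt((3 / 1) * (1 - 4)^2) = sqrt(27)
# Rows 1 and 2 share the first and third coordinates:
# sqrt((3 / 2) * ((1 - 4)^2 + (3 - 7)^2)) = sqrt(37.5)
```
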

Principal Component Analysis (PCA)

11. What is PCA (Principal Component Analysis)?

ans- Principal Component Analysis (PCA) is an unsupervised linear dimensionality reduction technique. Its primary goal is to transform a dataset with potentially correlated variables into a new set of uncorrelated variables called Principal Components (PCs), while retaining as much of the original variance as possible.

12. How does PCA work?

ans- PCA works by:

Standardizing the Data: (Optional but recommended) Scaling features to have zero mean and unit variance, as PCA is sensitive to feature scaling.
Calculating the Covariance Matrix: This matrix describes the relationships (covariance) between all pairs of features.
Computing Eigenvalues and Eigenvectors:
Eigenvectors represent the directions (axes) of maximum variance in the data. These are the Principal Components.
Eigenvalues represent the magnitude of variance along each eigenvector. A higher eigenvalue indicates that its corresponding eigenvector captures more variance.
Selecting Principal Components: The principal components are ranked by their corresponding eigenvalues in descending order. You choose the top 'k' eigenvectors (where 'k' is the desired number of dimensions) that capture a sufficient amount of variance.
Transforming the Data: Project the original data onto the subspace defined by the selected principal components. This creates a new, lower-dimensional representation of the data.
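
The steps above can be followed by hand with NumPy and compared against scikit-learn. This is a sketch under two assumptions: the data is only centered (not standardized, since that step is optional), and the Iris dataset is used as a convenient example. Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Step 1: center the data (PCA always needs zero-mean features).
X_centered = X_iris - X_iris.mean(axis=0)

# Step 2: covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigen-decomposition (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: rank components by eigenvalue, largest first, and keep the top k.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
top_components = eigenvectors[:, :k]

# Step 5: project the centered data onto the selected components.
X_projected = X_centered @ top_components

# Compare with scikit-learn.
manual_ratio = eigenvalues[:k] / eigenvalues.sum()
sk_pca = PCA(n_components=k).fit(X_iris)
print("Manual explained variance ratio: ", manual_ratio)
print("sklearn explained variance ratio:", sk_pca.explained_variance_ratio_)
```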

13. What is the geometric intuition behind PCA?

ans- Geometrically, PCA finds a new coordinate system (a set of orthogonal axes) for the data.

The first principal component is the direction along which the data varies the most (the line that best fits the data in a way that minimizes the perpendicular distances of points to the line).
The second principal component is orthogonal to the first principal component and captures the next largest amount of variance, and so on.
Essentially, PCA rotates the existing data axes to align with the directions of maximum variance in the data, thereby "spreading out" the data as much as possible along these new axes. This allows for projection onto a lower-dimensional space while preserving the most significant information (variance).
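
The "rotation" view can be checked numerically on a synthetic, correlated 2D cloud (generated here purely for illustration): the new axes are orthonormal, and in the rotated coordinates the two variables are no longer correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A correlated 2D point cloud: the second coordinate follows the first plus noise.
x = rng.normal(size=300)
cloud = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=300)])

pca_rot = PCA(n_components=2).fit(cloud)
axes = pca_rot.components_           # each row is one principal axis
rotated = pca_rot.transform(cloud)   # data expressed in the new axes

# Orthonormal axes: axes @ axes.T should be the identity matrix.
gram = axes @ axes.T

# In the new coordinates the covariance matrix is diagonal,
# with the larger variance on the first axis.
rotated_cov = np.cov(rotated, rowvar=False)

print(np.round(gram, 6))
print(np.round(rotated_cov, 6))
```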

14. What is the difference between Feature Selection and Feature Extraction?

ans- Feature Selection: This process involves choosing a subset of the original features from the dataset that are most relevant to the prediction task. It aims to eliminate redundant or irrelevant features. The selected features retain their original meaning. Examples: Variance Thresholding, Recursive Feature Elimination, correlation-based methods.
Feature Extraction: This process involves transforming the original features into a new, smaller set of features (components) while retaining most of the important information. The new features are often combinations or projections of the original features and may not have a direct, interpretable meaning like the original features. PCA is a prime example of feature extraction.
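
The difference shows up clearly in code. In this sketch, `SelectKBest` with the ANOVA F-score is used as one possible selection method (it is not one of the examples listed above), and PCA stands in for extraction. Selected columns are exact copies of original columns, while PCA columns are new combinations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris_data = load_iris()
X_fs, y_fs = iris_data.data, iris_data.target

# Feature selection: keep 2 of the 4 original columns.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_fs, y_fs)
X_selected = selector.transform(X_fs)
kept_indices = selector.get_support(indices=True)
kept_names = [iris_data.feature_names[i] for i in kept_indices]

# Feature extraction: build 2 new columns from all 4 original ones.
X_extracted = PCA(n_components=2).fit_transform(X_fs)

print("Selected original features:", kept_names)
print("Selected shape:", X_selected.shape, "| Extracted shape:", X_extracted.shape)
```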

15. What are Eigenvalues and Eigenvectors in PCA?

ans- In the context of PCA:

Eigenvectors: These are the directions or axes of the new feature space. They represent the principal components. The first eigenvector points in the direction of maximum variance; each later one points in the direction of maximum remaining variance while staying orthogonal to all earlier ones.
Eigenvalues: Each eigenvector has a corresponding eigenvalue. The eigenvalue quantifies the amount of variance captured by its corresponding eigenvector (principal component). Larger eigenvalues indicate that their corresponding eigenvectors capture more of the data's variance, making them more significant.
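
In symbols, the relationship can be written as follows. The notation is an assumption introduced here: X_c is the centered data matrix with n rows, C its covariance matrix, and (λ_i, v_i) the i-th eigenvalue and eigenvector pair.

```latex
C = \frac{1}{n-1} X_c^{\top} X_c, \qquad
C\,v_i = \lambda_i\,v_i, \qquad
v_i^{\top} v_j = 0 \quad (i \neq j), \qquad
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}
```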

16. How do you decide the number of components to keep in PCA?

ans- Deciding the number of components (k) to keep is crucial:

Scree Plot: Plot the eigenvalues in descending order. Look for an "elbow" point where the explained variance drops significantly. The components before the elbow are usually retained.
Explained Variance Ratio: Calculate the cumulative sum of the explained variance ratio for each principal component. Choose the number of components that explain a certain percentage of the total variance (e.g., 95% or 99%). This is often the most common and systematic approach.
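Kaiser's Rule: Keep only principal components whose eigenvalues are greater than 1. This applies to standardized data, where each original feature has variance 1, so a retained component explains more variance than any single original feature. (Less common in practice for high-dimensional data.)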
Downstream Task Performance: If PCA is used as a preprocessing step for another model, evaluate the performance of that model with different numbers of components.
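
For the explained-variance approach, scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and then keeps as many components as needed to reach that share of variance. The sketch below, on standardized Iris data chosen as an example, compares that shortcut with a manual cumulative sum and also counts components under Kaiser's rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Manual: fit all components, then find where the cumulative ratio reaches 95%.
pca_all = PCA().fit(X_std)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# Shortcut: let PCA pick the number of components for a 95% target.
pca_95 = PCA(n_components=0.95).fit(X_std)
n_auto = int(pca_95.n_components_)

# Kaiser's rule on standardized data: eigenvalues greater than 1.
n_kaiser = int((pca_all.explained_variance_ > 1).sum())

print(f"Manual 95% rule: {n_manual} | PCA(n_components=0.95): {n_auto} | Kaiser: {n_kaiser}")
```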

17. Can PCA be used for classification?

ans- No, PCA itself is not a classification algorithm. PCA is an unsupervised dimensionality reduction technique. It transforms the data into a lower-dimensional space.
However, PCA is very frequently used as a preprocessing step for classification algorithms. By reducing the dimensionality of the input data, PCA can:

Reduce computational cost and training time for subsequent classification models.
Mitigate the curse of dimensionality, potentially improving the performance of classifiers (like KNN) in high-dimensional spaces.
Help visualize high-dimensional data by reducing it to 2 or 3 components.

So, you would typically run PCA, then feed the PCA-transformed data into a classifier like Logistic Regression, SVM, or KNN.

18. What are the limitations of PCA?

ans- Linearity Assumption: PCA assumes linear relationships between variables. It may not perform well if the data has complex non-linear structures.
Sensitivity to Scaling: PCA is sensitive to the scaling of features. Features with larger variances will have a disproportionately larger influence on the principal components. Data standardization is often required.
Loss of Interpretability: The new principal components are linear combinations of the original features, which can make them difficult to interpret in terms of the original domain.
Information Loss: While it tries to retain maximum variance, some information is always lost during dimensionality reduction.
Outlier Sensitivity: PCA is sensitive to outliers, which can significantly affect the calculated principal components.
Unsupervised: It does not consider the class labels (target variable) during the transformation, which means it might not always find the optimal components for a supervised task.

19. How do KNN and PCA complement each other?

ans- KNN and PCA are often used together, with PCA serving as a powerful preprocessing step for KNN:

Addressing the Curse of Dimensionality for KNN: PCA reduces the number of features, mitigating the negative effects of high dimensionality on KNN (computational cost, sparsity, meaningful distance).
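Improving KNN Performance: By discarding low-variance directions, which often carry noise, PCA can improve the accuracy and efficiency of KNN, especially in high-dimensional datasets. Because PCA ignores the labels, a low-variance direction is not guaranteed to be irrelevant, so the effect should be checked empirically.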
Reducing Computational Load: With fewer dimensions, KNN's distance calculations become faster and less memory-intensive.
Enhancing Interpretability (Visualization): PCA can reduce data to 2 or 3 dimensions, making it possible to visualize clusters or relationships in the data before applying KNN.
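
One convenient way to combine the two is a scikit-learn pipeline, so that scaling and PCA are fitted only on the training folds during cross-validation. The choices below (Iris data, 2 components, k=5, 5 folds) are arbitrary settings for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_pipe, y_pipe = load_iris(return_X_y=True)

# Scale -> reduce dimensions -> classify, treated as a single estimator.
pca_knn_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part,
# so no information leaks from the held-out part.
cv_scores = cross_val_score(pca_knn_pipeline, X_pipe, y_pipe, cv=5)
print(f"Mean cross-validated accuracy: {cv_scores.mean():.4f}")
```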

20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

ans- Both PCA and LDA are dimensionality reduction techniques, but they have fundamental differences:

| Feature               | Principal Component Analysis (PCA)                               | Linear Discriminant Analysis (LDA)                                |
| :-------------------- | :--------------------------------------------------------------- | :---------------------------------------------------------------- |
| Type of Algorithm | Unsupervised                                                     | Supervised (requires class labels)                                |
| Goal | Maximize variance in the projected data. Find directions of max variance. | Maximize class separability. Find directions that best separate classes. |
| Information Used | Only feature data (X). Ignores class labels.                     | Feature data (X) AND class labels (y).                           |
| Objective | Transform data to a new space where components are uncorrelated. | Project data to maximize the ratio of between-class variance to within-class variance. |
| Number of Comp. | Up to min(n_samples - 1, n_features)                           | At most min(number of classes - 1, n_features)                  |
| Use Cases | General dimensionality reduction, noise reduction, visualization.  | Classification preprocessing, feature extraction for classification. |
| Limitations | Assumes linearity, sensitive to scaling, ignores class info.       | Assumes Gaussian distribution and equal covariance matrices per class. Less effective if classes are not well-separated linearly. |
| Interpretation | Components are difficult to interpret.                           | Components are discriminant functions, related to class separation. |
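
Two rows of the table can be seen directly in scikit-learn: PCA is fitted without labels while LDA needs them, and LDA on the 3-class Iris data allows at most 2 components. The sketch assumes a recent scikit-learn version, where asking LDA for too many components raises a `ValueError`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_cmp, y_cmp = load_iris(return_X_y=True)

# PCA: unsupervised, fitted on X only.
X_pca_cmp = PCA(n_components=2).fit_transform(X_cmp)

# LDA: supervised, fitted on X and y.
X_lda_cmp = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_cmp, y_cmp)

# With 3 classes, LDA cannot produce 3 components.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_cmp, y_cmp)
    lda_accepts_three = True
except ValueError:
    lda_accepts_three = False

print("PCA output shape:", X_pca_cmp.shape)
print("LDA output shape:", X_lda_cmp.shape)
print("LDA accepts 3 components on 3 classes:", lda_accepts_three)
```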

# Practical

In [None]:
%pip install scikit-learn numpy matplotlib seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, make_regression, make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from matplotlib.colors import ListedColormap

21: Train a KNN Classifier on the Iris dataset and print model accuracy.

In [None]:
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a KNN Classifier (default k=5)
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Make predictions
y_pred = knn_classifier.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Task 21: KNN Classifier Accuracy on Iris Dataset: {accuracy:.4f}")

22: Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE).

In [None]:
# Generate a synthetic regression dataset
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

# Split data into training and testing sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Train a KNN Regressor (default k=5)
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = knn_regressor.predict(X_test_reg)

# Calculate and print Mean Squared Error
mse = mean_squared_error(y_test_reg, y_pred_reg)
print(f"\nTask 22: KNN Regressor Mean Squared Error (MSE): {mse:.4f}")

# Optional: Visualize the regression
plt.figure(figsize=(8, 6))
plt.scatter(X_test_reg, y_test_reg, color='blue', label='Actual values')
plt.scatter(X_test_reg, y_pred_reg, color='red', label='Predicted values')
plt.title('Task 22: KNN Regression on Synthetic Dataset')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()

23: Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy.

In [None]:
# Use the Iris dataset (already loaded in Task 21)
# X_train, X_test, y_train, y_test are already defined

# KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"\nTask 23: KNN Accuracy (Euclidean Distance): {accuracy_euclidean:.4f}")
print(f"Task 23: KNN Accuracy (Manhattan Distance): {accuracy_manhattan:.4f}")

if accuracy_euclidean > accuracy_manhattan:
    print("           Euclidean distance performed better.")
elif accuracy_manhattan > accuracy_euclidean:
    print("           Manhattan distance performed better.")
else:
    print("           Both distances performed equally well.")

24: Train a KNN Classifier with different values of K and visualize decision boundaries.

In [None]:
# Use only the first two features for visualization
X_2d = X[:, :2] # Sepal Length, Sepal Width
y_2d = y

# Split data for 2D visualization
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(X_2d, y_2d, test_size=0.3, random_state=42)

# Define a color map for the classes
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

plt.figure(figsize=(15, 5))

# Iterate over different K values
k_values = [1, 5, 15]
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_2d, y_train_2d)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.subplot(1, len(k_values), i + 1)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

    # Plot also the training points
    plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train_2d, cmap=cmap_bold,
                edgecolor='k', s=20, label='Training points')
    plt.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_test_2d, cmap=cmap_bold,
                edgecolor='k', s=60, marker='X', label='Test points')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(f"Task 24: K-NN Classifier (K={k})")
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])

plt.suptitle("Decision Boundaries for different K values", y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

25: Apply Feature Scaling before training a KNN model and compare results with unscaled data.

In [None]:
# Use the Iris dataset (X, y from Task 21)
# X_train, X_test, y_train, y_test are already defined

# KNN on unscaled data (from Task 21)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"\nTask 25: KNN Accuracy (Unscaled Data): {accuracy_unscaled:.4f}")

# Apply Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Task 25: KNN Accuracy (Scaled Data): {accuracy_scaled:.4f}")

if accuracy_scaled > accuracy_unscaled:
    print("           Scaling improved accuracy.")
elif accuracy_unscaled > accuracy_scaled:
    print("           Scaling did not improve accuracy (or slightly reduced).")
else:
    print("           Scaling had no effect on accuracy.")

26: Train a PCA model on synthetic data and print the explained variance ratio for each component.

In [None]:
# Generate a synthetic dataset with more features
X_synthetic_pca, y_synthetic_pca = make_classification(
    n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=42
)

# Train a PCA model
# We'll reduce to fewer components than original features, e.g., 5
pca = PCA(n_components=5)
pca.fit(X_synthetic_pca)

# Print explained variance ratio
print(f"\nTask 26: Explained Variance Ratio for each PCA component (synthetic data):")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"  Component {i+1}: {ratio:.4f}")

print(f"Total Explained Variance (first {pca.n_components} components): {pca.explained_variance_ratio_.sum():.4f}")

27: Apply PCA before training a KNN Classifier and compare accuracy with and without PCA.

In [None]:
# Use the Iris dataset (X, y from Task 21)
# X_train, X_test, y_train, y_test are already defined

# --- KNN without PCA (from Task 21) ---
# knn_no_pca = KNeighborsClassifier(n_neighbors=5)
# knn_no_pca.fit(X_train, y_train)
# y_pred_no_pca = knn_no_pca.predict(X_test)
# accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)
# print(f"\nTask 27: KNN Accuracy (Without PCA): {accuracy_no_pca:.4f}")
# Using previous result for direct comparison
print(f"\nTask 27: KNN Accuracy (Without PCA): {accuracy:.4f} (from Task 21)")


# --- KNN with PCA ---
# Apply PCA (e.g., reduce to 2 components for visualization or 95% variance)
# For Iris (4 features), let's reduce to 2 or 3 components to see effect
pca_knn = PCA(n_components=2) # Reduce 4 features to 2
X_train_pca = pca_knn.fit_transform(X_train)
X_test_pca = pca_knn.transform(X_test)

knn_with_pca = KNeighborsClassifier(n_neighbors=5)
knn_with_pca.fit(X_train_pca, y_train)
y_pred_with_pca = knn_with_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_with_pca)
print(f"Task 27: KNN Accuracy (With PCA, n_components={pca_knn.n_components}): {accuracy_with_pca:.4f}")

if accuracy_with_pca > accuracy: # Comparing with accuracy from Task 21
    print("           PCA improved accuracy.")
elif accuracy > accuracy_with_pca:
    print("           PCA reduced accuracy.")
else:
    print("           PCA had no effect on accuracy.")

# Optional: Visualize PCA-transformed data (2D)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_train_pca[:, 0], y=X_train_pca[:, 1], hue=y_train, palette='viridis', legend='full')
plt.title('Task 27: PCA Transformed Iris Data (2 Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

28: Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.

In [None]:
# Use the Iris dataset (X, y from Task 21)
# X_train, X_test, y_train, y_test are already defined

# Define the parameter grid
param_grid = {'n_neighbors': np.arange(1, 21)} # Test K from 1 to 20

# Initialize GridSearchCV
grid_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5, # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1 # Use all available CPU cores
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print(f"\nTask 28: Best K for KNN Classifier (GridSearchCV): {grid_search.best_params_['n_neighbors']}")
print(f"Task 28: Best Cross-validation Accuracy (GridSearchCV): {grid_search.best_score_:.4f}")

# Evaluate on the test set with the best estimator
best_knn = grid_search.best_estimator_
y_pred_tuned = best_knn.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Task 28: Test Accuracy with Best K: {accuracy_tuned:.4f}")

29: Train a KNN Classifier and check the number of misclassified samples.

In [None]:
# Use the Iris dataset (X, y from Task 21)
# X_train, X_test, y_train, y_test are already defined

# Train a KNN Classifier (default k=5)
knn_misclassified = KNeighborsClassifier(n_neighbors=5)
knn_misclassified.fit(X_train, y_train)

# Make predictions
y_pred_misclassified = knn_misclassified.predict(X_test)

# Calculate misclassified samples
misclassified_samples = np.sum(y_pred_misclassified != y_test)
total_samples = len(y_test)

print(f"\nTask 29: Total Test Samples: {total_samples}")
print(f"Task 29: Number of Misclassified Samples: {misclassified_samples}")
print(f"Task 29: Classification Error Rate: {misclassified_samples / total_samples:.4f}")

30: Train a PCA model and visualize the cumulative explained variance.

In [None]:
# Use the Iris dataset (X, y from Task 21)
# X is the full dataset

# Train a PCA model with all components
pca_full = PCA(n_components=None) # n_components=None retains all components
pca_full.fit(X)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot the cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Task 30: Cumulative Explained Variance by Principal Components (Iris Dataset)')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.xticks(range(1, len(cumulative_variance) + 1))
plt.axhline(y=0.95, color='r', linestyle=':', label='95% explained variance')
plt.legend()
plt.show()

print(f"\nTask 30: Explained Variance Ratio for each component (Iris):")
for i, ratio in enumerate(pca_full.explained_variance_ratio_):
    print(f"  Component {i+1}: {ratio:.4f}")

print(f"Cumulative Explained Variance: {cumulative_variance}")

31: Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.

In [None]:
# Load Iris dataset (if not already loaded)
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN with uniform weights
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train, y_train)
y_pred_uniform = knn_uniform.predict(X_test)
accuracy_uniform = accuracy_score(y_test, y_pred_uniform)

# KNN with distance weights
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train, y_train)
y_pred_distance = knn_distance.predict(X_test)
accuracy_distance = accuracy_score(y_test, y_pred_distance)

print(f"Task 31: KNN Accuracy (Weights='uniform'): {accuracy_uniform:.4f}")
print(f"Task 31: KNN Accuracy (Weights='distance'): {accuracy_distance:.4f}")

if accuracy_distance > accuracy_uniform:
    print("           Distance weighting performed better.")
elif accuracy_uniform > accuracy_distance:
    print("           Uniform weighting performed better.")
else:
    print("           Both weighting schemes performed equally well.")

32: Train a KNN Regressor and analyze the effect of different K values on performance.

In [None]:
# Generate a synthetic regression dataset
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

k_values = range(1, 21) # Test K from 1 to 20
mse_scores = []

for k in k_values:
    knn_regressor = KNeighborsRegressor(n_neighbors=k)
    knn_regressor.fit(X_train_reg, y_train_reg)
    y_pred_reg = knn_regressor.predict(X_test_reg)
    mse_scores.append(mean_squared_error(y_test_reg, y_pred_reg))

plt.figure(figsize=(10, 6))
plt.plot(k_values, mse_scores, marker='o', linestyle='-')
plt.title('Task 32: Effect of K on KNN Regressor Performance (MSE)')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Mean Squared Error (MSE)')
plt.xticks(k_values)
plt.grid(True)
plt.show()

best_k_reg = k_values[np.argmin(mse_scores)]
min_mse = np.min(mse_scores)
print(f"\nTask 32: Best K for KNN Regressor (lowest MSE): {best_k_reg}")
print(f"Task 32: Minimum MSE observed: {min_mse:.4f}")
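
Picking K from a single test split, as above, ties the choice to that one split. A possible alternative is to pick K by cross-validation. The sketch below regenerates the same synthetic data under new, illustrative variable names (X_cv, y_cv, cv_mse are not part of the original notebook) and assumes 5 folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Same synthetic data as Task 32, regenerated so the example is self-contained
X_cv, y_cv = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

k_candidates = list(range(1, 21))
cv_mse = []
for k in k_candidates:
    # scikit-learn reports negated MSE so that "larger is better" holds for all scorers
    scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=k), X_cv, y_cv,
        cv=5, scoring="neg_mean_squared_error",
    )
    cv_mse.append(-scores.mean())

best_k_cv = k_candidates[int(np.argmin(cv_mse))]
print(f"Best K by 5-fold CV: {best_k_cv} (mean MSE = {min(cv_mse):.4f})")
```

The K chosen this way may differ from the single-split result; the cross-validated estimate averages over five different held-out folds.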

33: Implement KNN Imputation for handling missing values in a dataset.

In [None]:
# Create a small dataset with missing values (fixed values, so no random seed is needed)
X_missing = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 5.0],
    [np.nan, 6.0, 7.0],
    [8.0, 9.0, 10.0],
    [11.0, 12.0, np.nan]
])

print(f"\nTask 33: Original Dataset with Missing Values:\n{X_missing}")

# Initialize KNNImputer
# n_neighbors specifies the number of neighboring samples to use for imputation
imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data
X_imputed = imputer.fit_transform(X_missing)

print(f"\nTask 33: Dataset after KNN Imputation:\n{X_imputed}")
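
To make the imputation less of a black box, one imputed cell can be reproduced by hand. The sketch below assumes KNNImputer's default settings (NaN-aware Euclidean distance, uniform weights) and recomputes the missing third value of the first row; the names X_demo, donors, and manual_value are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 5.0],
    [np.nan, 6.0, 7.0],
    [8.0, 9.0, 10.0],
    [11.0, 12.0, np.nan],
])

# Pairwise distances that skip missing coordinates and rescale for the ones present
dist = nan_euclidean_distances(X_demo)

# Only rows with an observed value in column 2 can donate to row 0
donors = [i for i in range(len(X_demo)) if i != 0 and not np.isnan(X_demo[i, 2])]
nearest_two = sorted(donors, key=lambda i: dist[0, i])[:2]
manual_value = X_demo[nearest_two, 2].mean()

imputed_value = KNNImputer(n_neighbors=2).fit_transform(X_demo)[0, 2]
print(f"Donor rows used: {nearest_two}")
print(f"Manual value: {manual_value:.4f}, KNNImputer value: {imputed_value:.4f}")
```

The two nearest donor rows are averaged, which is all the imputer does for this cell under uniform weights.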

34: Train a PCA model and visualize the data projection onto the first two principal components.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Apply PCA to reduce to 2 components
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X)

# Visualize the projection
plt.figure(figsize=(8, 6))
# Color by species name so the legend shows class names instead of 0, 1, 2
sns.scatterplot(x=X_pca_2d[:, 0], y=X_pca_2d[:, 1], hue=iris.target_names[y], palette='viridis', legend='full')
plt.title('Task 34: Data Projection onto First Two Principal Components (Iris)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

35: Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN with KDTree algorithm
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kdtree.fit(X_train, y_train)
y_pred_kdtree = knn_kdtree.predict(X_test)
accuracy_kdtree = accuracy_score(y_test, y_pred_kdtree)

# KNN with BallTree algorithm
knn_balltree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
knn_balltree.fit(X_train, y_train)
y_pred_balltree = knn_balltree.predict(X_test)
accuracy_balltree = accuracy_score(y_test, y_pred_balltree)

print(f"\nTask 35: KNN Accuracy (Algorithm='kd_tree'): {accuracy_kdtree:.4f}")
print(f"Task 35: KNN Accuracy (Algorithm='ball_tree'): {accuracy_balltree:.4f}")

if accuracy_kdtree > accuracy_balltree:
    print("           KD Tree performed slightly better.")
elif accuracy_balltree > accuracy_kdtree:
    print("           Ball Tree performed slightly better.")
else:
    print("           Both algorithms resulted in similar accuracy for this dataset.")
print("           Note: Both are exact neighbor searches, so accuracy should match (distance ties aside);")
print("           the practical difference is build and query time on large datasets.")
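
Since KD Tree and Ball Tree are both exact searches, a more informative comparison is timing on a larger dataset. The sketch below is one way to do that; the dataset size, feature count, and variable names are assumptions chosen for illustration, and the timings will vary by machine.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Larger synthetic dataset (sizes are arbitrary choices for this sketch)
X_big, y_big = make_classification(
    n_samples=5000, n_features=8, n_informative=5, random_state=42
)
X_fit, y_fit = X_big[:4000], y_big[:4000]
X_query = X_big[4000:]

predictions = {}
for algo in ("kd_tree", "ball_tree"):
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    start = time.perf_counter()
    model.fit(X_fit, y_fit)                 # builds the tree
    predictions[algo] = model.predict(X_query)  # queries the tree
    elapsed = time.perf_counter() - start
    print(f"{algo:>9}: fit + predict took {elapsed:.4f} s")

same_predictions = np.array_equal(predictions["kd_tree"], predictions["ball_tree"])
print(f"Identical predictions: {same_predictions}")
```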

36: Train a PCA model on a high-dimensional dataset and visualize the Scree plot.

In [None]:
# Generate a high-dimensional synthetic dataset
X_high_dim, _ = make_classification(n_samples=100, n_features=50, random_state=42)

# Train a PCA model with all components
pca_scree = PCA(n_components=None)
pca_scree.fit(X_high_dim)

# Get explained variance for each component
explained_variance = pca_scree.explained_variance_

# Plot the Scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.title('Task 36: Scree Plot for High-Dimensional Data')
plt.xlabel('Principal Component Number')
plt.ylabel('Eigenvalue (Explained Variance)')
plt.grid(True)
plt.show()

print(f"\nTask 36: Explained Variance for first 5 components of high-dimensional data:")
for i in range(min(5, len(explained_variance))):
    print(f"  Component {i+1}: {explained_variance[i]:.4f}")
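
The y-axis of the scree plot is labeled as eigenvalues because explained_variance_ holds the eigenvalues of the sample covariance matrix. A minimal check of that relationship, using the same kind of synthetic data under illustrative names (X_demo, pca_demo):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(n_samples=100, n_features=50, random_state=42)
pca_demo = PCA().fit(X_demo)

# Sample covariance matrix (features as columns), then its eigenvalues, largest first
cov = np.cov(X_demo, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

matches = np.allclose(eigenvalues, pca_demo.explained_variance_)
print(f"Eigenvalues match explained_variance_: {matches}")
print("Largest three eigenvalues:", np.round(eigenvalues[:3], 4))
```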

37: Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a KNN Classifier
knn_metrics = KNeighborsClassifier(n_neighbors=5)
knn_metrics.fit(X_train, y_train)
y_pred_metrics = knn_metrics.predict(X_test)

# Print classification report
print(f"\nTask 37: Classification Report for KNN Classifier:\n")
print(classification_report(y_test, y_pred_metrics, target_names=iris.target_names))

# Optionally, print individual scores
print(f"Overall Precision (macro avg): {precision_score(y_test, y_pred_metrics, average='macro'):.4f}")
print(f"Overall Recall (macro avg): {recall_score(y_test, y_pred_metrics, average='macro'):.4f}")
print(f"Overall F1-Score (macro avg): {f1_score(y_test, y_pred_metrics, average='macro'):.4f}")
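
The per-class numbers in the classification report can be derived directly from the confusion matrix. The sketch below repeats the same split and model under new names and computes precision, recall, and F1 by hand; it assumes every class is predicted at least once, otherwise the precision division would hit zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
y_hat = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
true_positives = np.diag(cm)

precision_manual = true_positives / cm.sum(axis=0)  # TP / everything predicted as the class
recall_manual = true_positives / cm.sum(axis=1)     # TP / everything truly in the class
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Precision per class:", np.round(precision_manual, 4))
print("Recall per class:   ", np.round(recall_manual, 4))
print("F1 per class:       ", np.round(f1_manual, 4))
print("Macro F1:", round(f1_manual.mean(), 4), "vs sklearn:",
      round(f1_score(y_te, y_hat, average="macro"), 4))
```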

38: Train a PCA model and analyze the effect of different numbers of components on accuracy.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

n_components_list = range(1, X.shape[1] + 1) # From 1 to max features
accuracy_pca_effect = []

for n_comp in n_components_list:
    pca_effect = PCA(n_components=n_comp)
    X_train_pca_effect = pca_effect.fit_transform(X_train)
    X_test_pca_effect = pca_effect.transform(X_test)

    knn_pca_effect = KNeighborsClassifier(n_neighbors=5)
    knn_pca_effect.fit(X_train_pca_effect, y_train)
    y_pred_pca_effect = knn_pca_effect.predict(X_test_pca_effect)
    accuracy_pca_effect.append(accuracy_score(y_test, y_pred_pca_effect))

plt.figure(figsize=(10, 6))
plt.plot(n_components_list, accuracy_pca_effect, marker='o', linestyle='-')
plt.title('Task 38: Effect of Number of PCA Components on KNN Accuracy')
plt.xlabel('Number of Principal Components')
plt.ylabel('KNN Accuracy')
plt.xticks(n_components_list)
plt.grid(True)
plt.show()

optimal_n_comp = n_components_list[np.argmax(accuracy_pca_effect)]
max_accuracy_pca = np.max(accuracy_pca_effect)
print(f"\nTask 38: Optimal number of PCA components (max accuracy): {optimal_n_comp}")
print(f"Task 38: Maximum accuracy with PCA: {max_accuracy_pca:.4f}")

39: Train a KNN Classifier with different leaf_size values and compare accuracy.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

leaf_sizes = [1, 10, 30, 50] # Common leaf_size values
accuracy_leaf_size = []

for ls in leaf_sizes:
    knn_leaf = KNeighborsClassifier(n_neighbors=5, leaf_size=ls)
    knn_leaf.fit(X_train, y_train)
    y_pred_leaf = knn_leaf.predict(X_test)
    accuracy_leaf_size.append(accuracy_score(y_test, y_pred_leaf))
    print(f"Task 39: KNN Accuracy (leaf_size={ls}): {accuracy_leaf_size[-1]:.4f}")

# Note: leaf_size is only used by the tree-based algorithms (kd_tree, ball_tree).
# It changes tree build time, query time, and memory, not which neighbors are found,
# so accuracy is expected to be identical across these values.

40: Train a PCA model and visualize how data points are transformed before and after PCA.

In [None]:
# Generate a 2D synthetic dataset (e.g., elongated blob)
X_orig, y_orig = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2,
                                     n_clusters_per_class=1, random_state=42, class_sep=1.5)

# Stretch the data to make PCA more obvious
X_orig[:, 0] = X_orig[:, 0] * 3

# Plot original data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x=X_orig[:, 0], y=X_orig[:, 1], hue=y_orig, palette='viridis', legend='full')
plt.title('Task 40: Original Data (2 Features)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)

# Apply PCA
pca_transform = PCA(n_components=2)
X_transformed = pca_transform.fit_transform(X_orig)

# Plot transformed data
plt.subplot(1, 2, 2)
sns.scatterplot(x=X_transformed[:, 0], y=X_transformed[:, 1], hue=y_orig, palette='viridis', legend='full')
plt.title('Task 40: PCA Transformed Data (2 Principal Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.tight_layout()
plt.show()

print("\nTask 40: The original data's variance is spread across both feature axes.")
print("         After PCA the axes are the principal components, which are uncorrelated,")
print(f"         and PC1 captures the largest share of variance by construction: {pca_transform.explained_variance_ratio_[0]:.4f}")
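
What the transform does numerically can be checked in a few lines: PCA (without whitening) centers the data and projects it onto the component directions. The sketch below uses a small random dataset under illustrative names and also confirms that the transformed coordinates are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Stretched 2D cloud, similar in spirit to the dataset above
X_demo = rng.normal(size=(100, 2)) * np.array([3.0, 1.0])

pca_demo = PCA(n_components=2).fit(X_demo)

# Manual transform: subtract the mean, then project onto the principal axes
X_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
X_sklearn = pca_demo.transform(X_demo)

# Covariance of the transformed data should be diagonal
cov_transformed = np.cov(X_sklearn, rowvar=False)

print("Manual projection matches transform():", np.allclose(X_manual, X_sklearn))
print("Off-diagonal covariance after PCA:", round(float(cov_transformed[0, 1]), 10))
```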

41: Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.

In [None]:
# Load Wine dataset
wine = load_wine()
X_wine, y_wine = wine.data, wine.target

# Split data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42, stratify=y_wine  # Keep class proportions the same in both splits
)

# Scale the data (important for KNN)
scaler_wine = StandardScaler()
X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
X_test_wine_scaled = scaler_wine.transform(X_test_wine)

# Train a KNN Classifier
knn_wine = KNeighborsClassifier(n_neighbors=5)
knn_wine.fit(X_train_wine_scaled, y_train_wine)
y_pred_wine = knn_wine.predict(X_test_wine_scaled)

# Print classification report
print(f"\nTask 41: Classification Report for KNN Classifier on Wine Dataset:\n")
print(classification_report(y_test_wine, y_pred_wine, target_names=wine.target_names))
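
The comment "important for KNN" can be checked empirically. The sketch below compares cross-validated accuracy with and without scaling on the Wine data. Wrapping the scaler in a pipeline (make_pipeline is my choice here, not something the cell above uses) ensures it is refit on each training fold only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# KNN on raw features: large-valued features dominate the distance
unscaled_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_w, y_w, cv=cv)

# KNN on standardized features: every feature contributes on a comparable scale
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_scores = cross_val_score(scaled_model, X_w, y_w, cv=cv)

print(f"Mean CV accuracy without scaling: {unscaled_scores.mean():.4f}")
print(f"Mean CV accuracy with scaling:    {scaled_scores.mean():.4f}")
```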

42: Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.

In [None]:
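# Generate a multi-feature regression dataset for this comparison.
# With the single feature used in Task 32, Euclidean and Manhattan distances
# both reduce to |x1 - x2|, so the two regressors would give identical results.
# (This overwrites the Task 32 split variables.)
X_reg_multi, y_reg_multi = make_regression(n_samples=100, n_features=5, noise=20, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg_multi, y_reg_multi, test_size=0.3, random_state=42
)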

# KNN Regressor with Euclidean distance
knn_reg_euclidean = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_reg_euclidean.fit(X_train_reg, y_train_reg)
y_pred_reg_euclidean = knn_reg_euclidean.predict(X_test_reg)
mse_euclidean_reg = mean_squared_error(y_test_reg, y_pred_reg_euclidean)

# KNN Regressor with Manhattan distance
knn_reg_manhattan = KNeighborsRegressor(n_neighbors=5, metric='manhattan')
knn_reg_manhattan.fit(X_train_reg, y_train_reg)
y_pred_reg_manhattan = knn_reg_manhattan.predict(X_test_reg)
mse_manhattan_reg = mean_squared_error(y_test_reg, y_pred_reg_manhattan)

print(f"\nTask 42: KNN Regressor MSE (Euclidean Distance): {mse_euclidean_reg:.4f}")
print(f"Task 42: KNN Regressor MSE (Manhattan Distance): {mse_manhattan_reg:.4f}")

if mse_euclidean_reg < mse_manhattan_reg:
    print("           Euclidean distance performed better (lower MSE).")
elif mse_manhattan_reg < mse_euclidean_reg:
    print("           Manhattan distance performed better (lower MSE).")
else:
    print("           Both distances performed equally well.")
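
One caveat for metric comparisons like this: with a single feature, Euclidean and Manhattan distance are the same function, |x1 - x2|, so they select the same neighbors. The sketch below illustrates this on freshly generated data; the helper predictions_by_metric and the 70/30 slicing are illustrative choices, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X_1d, y_1d = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
X_5d, y_5d = make_regression(n_samples=100, n_features=5, noise=20, random_state=42)

def predictions_by_metric(X, y):
    """Fit on the first 70 rows and predict the last 30 rows for each metric."""
    preds = {}
    for metric in ("euclidean", "manhattan"):
        model = KNeighborsRegressor(n_neighbors=5, metric=metric).fit(X[:70], y[:70])
        preds[metric] = model.predict(X[70:])
    return preds

preds_1d = predictions_by_metric(X_1d, y_1d)
preds_5d = predictions_by_metric(X_5d, y_5d)

same_1d = np.allclose(preds_1d["euclidean"], preds_1d["manhattan"])
same_5d = np.allclose(preds_5d["euclidean"], preds_5d["manhattan"])
print(f"1 feature:  identical predictions = {same_1d}")
print(f"5 features: identical predictions = {same_5d}")
```

A meaningful comparison of the two metrics therefore needs at least two features.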

43: Train a KNN Classifier and evaluate using ROC-AUC score.

In [None]:
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Convert to a binary classification problem for ROC-AUC
# Positive class (1) = setosa (original class 0); negative class (0) = versicolor or virginica
y_binary = (y == 0).astype(int)

# Split data
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X, y_binary, test_size=0.3, random_state=42, stratify=y_binary
)

# Scale the data
scaler_bin = StandardScaler()
X_train_bin_scaled = scaler_bin.fit_transform(X_train_bin)
X_test_bin_scaled = scaler_bin.transform(X_test_bin)

# Train a KNN Classifier (need probabilities for ROC-AUC)
knn_roc = KNeighborsClassifier(n_neighbors=5)
knn_roc.fit(X_train_bin_scaled, y_train_bin)

# Get probability estimates for the positive class (class 1)
y_prob = knn_roc.predict_proba(X_test_bin_scaled)[:, 1]

# Calculate ROC-AUC score
auc_score = roc_auc_score(y_test_bin, y_prob)
print(f"\nTask 43: KNN Classifier ROC-AUC Score: {auc_score:.4f}")

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test_bin, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Task 43: Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
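
Collapsing Iris to two classes is not the only option. roc_auc_score also accepts the full probability matrix for multiclass targets via its multi_class parameter. The sketch below uses the one-vs-rest setting with macro averaging; the pipeline and variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_mc, y_mc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_mc, y_mc, test_size=0.3, random_state=42, stratify=y_mc
)

knn_mc = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_mc.fit(X_tr, y_tr)

# One probability column per class; each row sums to 1
proba_mc = knn_mc.predict_proba(X_te)

# One-vs-rest: compute an AUC for each class against the rest, then average
auc_ovr = roc_auc_score(y_te, proba_mc, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC-AUC on 3-class Iris: {auc_ovr:.4f}")
```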

44: Train a PCA model and visualize the variance captured by each principal component.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Train a PCA model with all components
pca_variance_plot = PCA(n_components=None)
pca_variance_plot.fit(X)

explained_variance_ratio = pca_variance_plot.explained_variance_ratio_

# Plot the explained variance captured by each component
plt.figure(figsize=(8, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.title('Task 44: Variance Captured by Each Principal Component (Iris)')
plt.xlabel('Principal Component Number')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, len(explained_variance_ratio) + 1))
plt.grid(axis='y')
plt.show()

print(f"\nTask 44: Explained Variance Ratio for each PCA component (Iris):")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"  Component {i+1}: {ratio:.4f}")

45: Train a KNN Classifier and perform feature selection before training.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- KNN without Feature Selection ---
knn_no_fs = KNeighborsClassifier(n_neighbors=5)
knn_no_fs.fit(X_train, y_train)
y_pred_no_fs = knn_no_fs.predict(X_test)
accuracy_no_fs = accuracy_score(y_test, y_pred_no_fs)
print(f"\nTask 45: KNN Accuracy (Without Feature Selection): {accuracy_no_fs:.4f}")

# --- KNN with Feature Selection ---
# Select top 2 features using f_classif (ANOVA F-value)
selector = SelectKBest(f_classif, k=2) # Choose top 2 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Print selected features (indices)
selected_features_indices = selector.get_support(indices=True)
print(f"Task 45: Selected Feature Indices: {selected_features_indices}")
print(f"Task 45: Corresponding Feature Names: {[iris.feature_names[i] for i in selected_features_indices]}")


# Train KNN on selected features
knn_fs = KNeighborsClassifier(n_neighbors=5)
knn_fs.fit(X_train_selected, y_train)
y_pred_fs = knn_fs.predict(X_test_selected)
accuracy_fs = accuracy_score(y_test, y_pred_fs)
print(f"Task 45: KNN Accuracy (With Feature Selection, k=2): {accuracy_fs:.4f}")

if accuracy_fs > accuracy_no_fs:
    print("           Feature selection improved accuracy.")
elif accuracy_no_fs > accuracy_fs:
    print("           Feature selection reduced accuracy.")
else:
    print("           Feature selection had no effect on accuracy.")
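
SelectKBest keeps the features with the highest scores, so it helps to look at the scores themselves. The sketch below calls f_classif directly on the full Iris data (the cell above fits on the training split only, so exact values will differ) and ranks the features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris_demo = load_iris()
X_fs, y_fs = iris_demo.data, iris_demo.target

# ANOVA F-statistic and p-value for each feature against the class label
f_scores, p_values = f_classif(X_fs, y_fs)

# Feature indices sorted from most to least discriminative
ranking = np.argsort(f_scores)[::-1]
for idx in ranking:
    print(f"{iris_demo.feature_names[idx]:<20} F = {f_scores[idx]:10.2f}   p = {p_values[idx]:.2e}")

top_two = sorted(int(i) for i in ranking[:2])
print("Top 2 feature indices:", top_two)
```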

46: Train a PCA model and visualize the data reconstruction error after reducing dimensions.

In [None]:
# Generate a synthetic 2D dataset
X_rec, y_rec = make_classification(n_samples=100, n_features=2, n_informative=2,
                                  n_redundant=0, n_clusters_per_class=1, random_state=42, class_sep=1.5)

# Reduce dimensions to 1 component, then reconstruct
pca_reconstruction = PCA(n_components=1)
X_reduced = pca_reconstruction.fit_transform(X_rec)
X_reconstructed = pca_reconstruction.inverse_transform(X_reduced)

# Calculate reconstruction error (Mean Squared Error between original and reconstructed)
reconstruction_error = mean_squared_error(X_rec, X_reconstructed)
print(f"\nTask 46: Data Reconstruction Error (MSE with 1 component): {reconstruction_error:.4f}")

# Visualize original vs. reconstructed data
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_rec[:, 0], y=X_rec[:, 1], color='blue', label='Original Data', alpha=0.6)
sns.scatterplot(x=X_reconstructed[:, 0], y=X_reconstructed[:, 1], color='red', marker='x', s=100, label='Reconstructed Data (1 PC)', alpha=0.8)
plt.title('Task 46: Original vs. Reconstructed Data after PCA (1 Component)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
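
Reconstruction error depends on how many components are kept. The sketch below repeats the reduce-then-reconstruct step on Iris (chosen here only because it has four features to sweep over) for every possible component count.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris_demo = load_iris().data

reconstruction_errors = []
for n_comp in range(1, X_iris_demo.shape[1] + 1):
    pca_demo = PCA(n_components=n_comp).fit(X_iris_demo)
    # Project down to n_comp dimensions, then map back to the original space
    X_back = pca_demo.inverse_transform(pca_demo.transform(X_iris_demo))
    mse = float(np.mean((X_iris_demo - X_back) ** 2))
    reconstruction_errors.append(mse)
    print(f"{n_comp} component(s): reconstruction MSE = {mse:.6f}")
```

The error shrinks as components are added and is zero, up to floating-point rounding, once all components are kept.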

47: Train a KNN Classifier and visualize the decision boundary.

In [None]:
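# Load Iris here so this cell does not depend on X, y, and iris from earlier cells
iris = load_iris()
X, y = iris.data, iris.target

# Use only the first two features for visualization
X_2d_db = X[:, :2]  # Sepal length, sepal width
y_2d_db = y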

# Split data for 2D visualization
X_train_2d_db, X_test_2d_db, y_train_2d_db, y_test_2d_db = train_test_split(
    X_2d_db, y_2d_db, test_size=0.3, random_state=42
)

# Define a color map for the classes
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Train a KNN Classifier
knn_db = KNeighborsClassifier(n_neighbors=5)
knn_db.fit(X_train_2d_db, y_train_2d_db)

# Plot the decision boundary
x_min, x_max = X_2d_db[:, 0].min() - 1, X_2d_db[:, 0].max() + 1
y_min, y_max = X_2d_db[:, 1].min() - 1, X_2d_db[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = knn_db.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

# Plot training and test points
plt.scatter(X_train_2d_db[:, 0], X_train_2d_db[:, 1], c=y_train_2d_db, cmap=cmap_bold,
            edgecolor='k', s=20, label='Training points')
plt.scatter(X_test_2d_db[:, 0], X_test_2d_db[:, 1], c=y_test_2d_db, cmap=cmap_bold,
            edgecolor='k', s=60, marker='X', label='Test points')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title('Task 47: KNN Classifier Decision Boundary')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend()
plt.show()

48: Train a PCA model and analyze the effect of different numbers of components on data variance.

In [None]:
# Load Iris dataset (X, y)
iris = load_iris()
X, y = iris.data, iris.target

print(f"\nTask 48: Effect of different numbers of PCA components on total explained variance:")

for n_comp in range(1, X.shape[1] + 1):
    pca_var_analysis = PCA(n_components=n_comp)
    pca_var_analysis.fit(X)
    total_explained_variance = np.sum(pca_var_analysis.explained_variance_ratio_)
    print(f"  With {n_comp} component(s): Total Explained Variance = {total_explained_variance:.4f}")

# Also show the cumulative plot from Task 30 again for visual understanding
# (Code duplicated for completeness of this task's output)
pca_full_reprise = PCA(n_components=None)
pca_full_reprise.fit(X)
cumulative_variance_reprise = np.cumsum(pca_full_reprise.explained_variance_ratio_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance_reprise) + 1), cumulative_variance_reprise, marker='o', linestyle='--')
plt.title('Task 48: Cumulative Explained Variance by Principal Components (Iris Dataset)')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.xticks(range(1, len(cumulative_variance_reprise) + 1))
plt.axhline(y=0.95, color='r', linestyle=':', label='95% explained variance')
plt.legend()
plt.show()
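
The 95% line on the plot can also be turned into a component count in code. The sketch below reads it off the cumulative sum, then lets PCA choose by passing a fraction as n_components. svd_solver="full" is set explicitly here because the fractional form relies on the full SVD solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris_demo = load_iris().data

# Option 1: read the count off the cumulative explained variance
cumulative = np.cumsum(PCA().fit(X_iris_demo).explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# Option 2: let PCA keep the smallest number of components reaching 95%
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_iris_demo)

print(f"Components needed for 95% variance (manual): {n_manual}")
print(f"Components chosen by PCA(n_components=0.95): {pca_95.n_components_}")
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.4f}")
```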