## Q1. What is K-Nearest Neighbors (KNN) and how does it work?
**Ans**- K-Nearest Neighbors is a supervised machine learning algorithm that can be used for classification or regression. It's one of the simplest, yet surprisingly effective algorithms.

**KNN Working**
1. Choose the number of neighbors
* We pick how many neighboring points to look at.
2. Calculate the distance
* Measure how far the new point is from all existing points using a distance metric:
  * Euclidean distance
  * Manhattan distance
  * Minkowski distance

Formula for Euclidean distance (in 2D):

    d = √((x₂-x₁)²+(y₂-y₁)²)
3. Find the K nearest neighbors
* Sort the dataset by distance and pick the top 'K' closest points.

4. Vote or average
* For classification:

Take a majority vote among the neighbors' labels.

* For regression:

Take the average of the neighbors' values.

5. Assign the result
The new point is given the label or value based on its neighbors.
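
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than an optimized implementation; the function names `euclidean` and `knn_predict` and the sample points are made up for this sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_targets, query, k=3, task="classification"):
    """Predict a label (majority vote) or a value (mean) from the k nearest points."""
    # Step 2: distance from the query to every training point
    distances = [(euclidean(p, query), t) for p, t in zip(train_points, train_targets)]
    # Step 3: keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    neighbor_targets = [t for _, t in nearest]
    # Step 4: vote or average
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return sum(neighbor_targets) / k

# Made-up 2D points, purely for illustration
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = ["A", "A", "B", "B"]
values = [10.0, 12.0, 30.0, 32.0]

predicted_label = knn_predict(points, labels, (1.2, 1.4), k=3)
predicted_value = knn_predict(points, values, (1.2, 1.4), k=3, task="regression")
```

With K=3 the query's neighbors are two "A" points and one "B" point, so the vote returns "A", while the regression call averages the three neighbor values.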

**Example (Classification)**

Imagine we have a dataset of fruits described by weight and color, each labeled as apple or orange.

We get a new fruit and want to classify it:
* Measure distance to all other fruits.
* Pick the 3 nearest.
* If 2 out of 3 are apples → classify as an apple.

**Advantages**
* Simple to understand and implement.
* No explicit training phase: KNN is a lazy learner that simply stores the training data.
* Works well with small, clean datasets.

**Disadvantages**
* Can be slow with large datasets.
* Sensitive to irrelevant features and different scales.
* Choice of K can affect accuracy.

## Q2. What is the difference between KNN Classification and KNN Regression?
**Ans** - While both use the K-Nearest Neighbors idea, they differ in what they do with the neighbors.

**KNN Classification vs KNN Regression**

|Aspect	|KNN Classification	|KNN Regression|
|-|-|-|
|Type of Output	|Categorical (like Apple, Orange, Yes, No)	|Continuous (like price, temperature, height)|
|Decision Rule	|Majority vote — whichever class is most common among the K neighbors	|Average (or weighted average) of the K neighbors' values|
|Use Case	|Spam detection, disease prediction, image recognition	|House price prediction, temperature forecasting|
|Example	|If 3 out of 5 neighbors are labeled Cat, classify as Cat	|If neighbor values are [10, 12, 14], predict average → 12|
|Evaluation Metrics	|Accuracy, Precision, Recall, F1-score	|Mean Squared Error (MSE), Mean Absolute Error (MAE), R² score|

**Visual Intuition**
* Classification:
  * New point → Look at K nearest points → Pick the class that appears most.
* Regression:
  * New point → Look at K nearest points → Average their values.

**Quick Example**

Dataset

|Weight	|Color Value	|Label (Classification)	|Value (Regression)|
|-|-|-|-|
|200g	|0.7 (greenish)	|Apple	|₹50|
|250g	|0.9 (reddish)	|Apple	|₹60|
|300g	|0.4 (orange)	|Orange	|₹70|

New fruit: 260g, 0.85 color
* KNN Classification:
  * Nearest neighbors: Apple, Apple, Orange
  * Result: Apple (2 votes vs 1)
* KNN Regression:
  * Nearest values: ₹60, ₹50, ₹70
  * Average: ₹60
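
The same toy table can be run through scikit-learn to see both behaviours side by side. This is only a sketch: the variable names are illustrative, and because the table has just three rows, K=3 means every row is a neighbor.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Rows from the table above: [weight in grams, color value]
X = [[200, 0.7], [250, 0.9], [300, 0.4]]
labels = ["Apple", "Apple", "Orange"]
prices = [50, 60, 70]

new_fruit = [[260, 0.85]]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, prices)

predicted_label = clf.predict(new_fruit)[0]   # majority vote of the 3 neighbors
predicted_price = reg.predict(new_fruit)[0]   # mean of the 3 neighbors' prices
```

Note that the features are left unscaled here, so weight dominates the distance; on a real dataset the features would normally be scaled first.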

## Q3. What is the role of the distance metric in KNN?
**Ans** - The distance metric is the mathematical formula KNN uses to measure how close or similar two data points are in feature space.
Since KNN is based on the nearest neighbors, what counts as "near" depends entirely on this metric.

Suppose we are trying to find our closest friends in a city based on how far their houses are from ours: what we consider "distance" (straight-line or along the streets) matters.

**Common Distance Metrics in KNN**
1. Euclidean Distance
  * Straight-line distance between two points.

Formula (in 2D)

    d = √((x₂−x₁)²+(y₂−y₁)²)

Good for
Continuous, real-valued features.

2. Manhattan Distance
  * Sum of absolute differences between coordinates

Formula

    d = |x₂-x₁|+|y₂-y₁|
Good for
Grid-like data or when movement is restricted to orthogonal directions.

3. Minkowski Distance
  * A generalization of both Euclidean and Manhattan.

Formula

    d = (Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ)^(1/p)

* p=1 → Manhattan
* p=2 → Euclidean
* Larger p → the largest single coordinate difference increasingly dominates; as p → ∞ it becomes the Chebyshev (maximum) distance

4. Hamming Distance
  * Counts the number of positions where values differ.
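
To make the formulas concrete, the metrics above can be written as two small helper functions. This is a plain-Python sketch; the names `minkowski` and `hamming` and the sample inputs are chosen here for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = (0, 0), (3, 4)
manhattan = minkowski(a, b, p=1)   # |3| + |4| = 7
euclid = minkowski(a, b, p=2)      # sqrt(9 + 16) = 5
cubic = minkowski(a, b, p=3)       # (27 + 64)^(1/3), about 4.50
differing = hamming("karolin", "kathrin")  # third, fourth and fifth letters differ
```

The same pair of points gets a different distance under each metric, which is why the choice of metric can change which neighbors count as "nearest".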

## Q4. What is the Curse of Dimensionality in KNN?
**Ans** - The Curse of Dimensionality refers to the weird, unintuitive problems that happen when our data has too many features.

KNN relies on measuring distances to find nearby points. But in high-dimensional spaces:
* All points tend to become equally far apart — meaning, "neighbors" stop feeling like neighbors.
* Distance-based algorithms like KNN struggle to find meaningful nearby points because everything seems far.

**Reason**

As we add more dimensions:
* The volume of the space increases exponentially.
* Data points become sparse.
* The concept of "closeness" breaks down — most points are about the same distance from each other.

Example:

Imagine placing points inside:
* A 1D line → points are clearly near or far
* A 2D square → points are still close
* A 10D cube → almost all points are near the edges, and very far from one another
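
A small simulation can make this effect visible. The sketch below draws random points in a unit cube and measures the relative contrast, (farthest − nearest) / nearest, from one random query point; the function name, point count, and seed are arbitrary choices made for this illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dim={dim:4d}  relative contrast={c:.2f}")
```

As the dimension grows, the contrast shrinks toward zero: the nearest and farthest points end up at almost the same distance from the query.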

**How It Affects KNN**
* In low dimensions
  * KNN can meaningfully identify nearest neighbors.
* In high dimensions
  * Distance differences become less significant
  * KNN predictions become unreliable
  * Model slows down because it has to compute many high-dimensional distances

**Summary**

|In Low Dimensions|In High Dimensions|
|-|-|
|Distances are meaningful	|Distances become similar (almost equal)|
|Neighbors are useful	|Neighbors are hard to define|
|KNN works well	|KNN performs poorly|

## Q5. How can you choose the best value of K in KNN?
**Ans** - 'K' is the number of neighbors we consider when making a prediction, and picking the right value is critical for balancing:
* Bias
* Variance

**Methods to Find the Best K**
1. Trial and Error with Cross-Validation
* Split our data into training and validation sets.
* Try several values of K.
* Calculate accuracy or error on the validation set.
* Pick the K with the best validation performance.

2. Elbow Method

For classification tasks
* Plot a graph of error rate / accuracy vs. K values.
* Look for the 'elbow point' where the error stops decreasing rapidly or levels off.
* Choose K at this turning point — it's usually a good balance.

Example:
* K=1 → low training error, high variance
* K too large → high bias
* Sweet spot in between → balanced.

3. Leave-One-Out Cross-Validation
* Use each data point as a validation set once, and the rest as training.
* Calculate error for each K.
* Pick K with the lowest average error.
* Good for small datasets, but can be slow for large ones.

4. Domain Knowledge
* Sometimes the nature of our problem might suggest a sensible K:
  * Very noisy data → Larger K smooths out the noise.
  * Well-separated data → Smaller K preserves local structure.

**Summary**

|K Value	|Effect|
|-|-|
|Small|Low bias, high variance|
|Large|High bias, low variance|
|Balanced|Good trade-off between bias and variance|

## Q6. What are KD Tree and Ball Tree in KNN?
**Ans** - They're both data structures used to speed up nearest neighbor searches in KNN, especially when we have a lot of data points.

In its basic form, KNN is a brute-force algorithm: it checks the distance from our query point to every other point.

**KD Tree**
* A binary search tree that splits data points along one feature at each level.
* It recursively divides the space into nested half-spaces.
* How it works:
  * Pick a dimension (say, the x-axis) and split the data at the median value.
  * At the next level, pick the next dimension (say, the y-axis) and split again.
  * Keep cycling through the dimensions at each level.
* Best for:
  * Low to moderate dimensional data.
  * Faster than brute-force for medium-size problems.
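
The median-split idea can be sketched in plain Python. This is a simplified illustration (a dictionary-based tree and a single-nearest-neighbor search), not how any particular library implements it; `build_kdtree`, `nearest`, and the sample points are names and data invented for this sketch.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning far half-spaces."""
    if node is None:
        return best
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best so far
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
query = (9, 2)
found = nearest(tree, query)
brute = min(pts, key=lambda p: math.dist(query, p))
```

The speed-up comes from the pruning step: whole branches are skipped whenever the splitting plane is farther away than the best point found so far, and the result still matches the brute-force answer.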

**Ball Tree**
* A tree-based structure where data points are organized into hyperspherical clusters.
* Each node represents a ball.
* It hierarchically divides data into smaller balls.

* How it works:
  * Data points are grouped into balls.
  * Each ball has a center and a radius.
  * The tree recursively partitions data so that points within a ball are closer to each other than to points in other balls.

* Best for:
  * Higher dimensional data.
  * Non-uniform or clustered data.
  * Sometimes faster and more flexible than KD Tree when the data isn't evenly spread.

**Comparison**

|Feature	|KD Tree	|Ball Tree|
|-|-|-|
|Structure	|Divides space with axis-aligned splits	|Divides space with hyperspheres|
|Best for	|Low to moderate dimensions (up to 20-30)	|Higher dimensions (30+)|
|Speed	|Fast for small-medium low-dimensional data	|Fast for high-dimensional or unevenly distributed data|
|Drawback	|Struggles in high-dimensional space	|Slightly more complex structure|

**In Practice (with scikit-learn)**

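In scikit-learn's `KNeighborsClassifier` or `KNeighborsRegressor`, the `algorithm` parameter selects the search structure:

```python
from sklearn.neighbors import KNeighborsClassifier

# Use a KD Tree for the neighbor search
kd_model = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

# Use a Ball Tree for the neighbor search
ball_model = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
```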
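
Whichever structure is used, the neighbors returned should be the same; only the search cost differs. A quick check on synthetic data, using scikit-learn's `NearestNeighbors` (the data shape and seed below are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((500, 3))          # synthetic data: 500 points, 3 features
query = rng.random((1, 3))

neighbor_sets = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    _, indices = nn.kneighbors(query)
    neighbor_sets[algorithm] = set(indices[0].tolist())
```

All three entries of `neighbor_sets` hold the same five indices, so switching `algorithm` is a performance decision rather than a modelling one.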

## Q7. When should you use KD Tree vs. Ball Tree?
**Ans** - Choosing between KD Tree and Ball Tree really depends on the characteristics of our data.

**Use KD Tree**
1. Low-Dimensional Data
* KD Tree is efficient when our data has fewer dimensions.
* It splits data along axis-aligned planes, which is very effective for lower-dimensional spaces.
* Example:
Data with features like height, weight, age, and income works really well with KD Tree.

2. Well-Distributed or Uniform Data
* KD Tree performs best when the data is uniformly distributed in the feature space.
* If our data points are evenly spread out, the tree will divide the space efficiently.
* Example:
A dataset of points spread relatively evenly across the 2D plane, like a scatter of cities on a map.

3. Faster Search for Smaller Datasets
* For small to medium-sized datasets, KD Tree can be faster in practice compared to other methods like brute force.

4. When Space is Regular
* KD Tree is excellent if the data space doesn't have highly irregular or clustered structures. The axis-aligned splitting works well here.

**Use Ball Tree**
1. High-Dimensional Data
* Ball Tree is more efficient when we have higher dimensions.
* In high dimensions, KD Tree starts to struggle because the splitting process becomes less effective. Ball Tree, with its hyperspherical partitions, handles this much better.

2. Non-Uniform or Clustered Data
* Ball Tree excels when data points are clustered in some regions and sparse in others.
* Instead of splitting based on axis-alignment, Ball Tree groups data into balls, making it well-suited for cases where data is naturally grouped.
* Example:
Data with dense regions and sparse areas, such as image data or geospatial data.

3. Irregularly Distributed Data
* If the data points are not uniformly distributed, Ball Tree will likely perform better because the tree doesn't split space in a rigid, axis-aligned manner.
* Example:
Data where there's density variation — for instance, predicting similar items in a recommendation system where user behavior clusters in certain areas.

4. Larger Datasets
* Ball Tree can scale better when working with larger datasets and high-dimensional spaces, especially when the dataset is not very sparse.

**When to Choose Each**

|Criteria	|Use KD Tree	|Use Ball Tree|
|-|-|-|
|Data Dimensionality|Low dimensions (up to 20-30)|High dimensions (30+)|
|Data Distribution	|Uniformly distributed	|Non-uniform or clustered data|
|Data Size	|Smaller to medium-sized datasets	|Large datasets, especially with high dimensions|
|Performance	|Fast for small, well-distributed data	|Better for large or sparse, high-dimensional data|
|Use Case	|Well-distributed 2D or 3D data (e.g., scatter plots)	|Clusters of data in high-dimensional spaces (e.g., image or text data)|

## Q8. What are the disadvantages of KNN?
**Ans** - **Disadvantages of K-Nearest Neighbors**
1. High Computational Cost
* Brute-force distance calculation:
KNN requires computing the distance between the query point and every other point in the dataset. This can be very slow for large datasets.
  * Time Complexity: O(n·d) per prediction with brute force, for n training points and d features.
  * As our dataset grows, predicting becomes slower since it involves searching through all data points for each new query.

2. Memory Intensive
* Storing the entire dataset:
KNN is a lazy learner — it doesn't learn a model beforehand but rather uses the entire dataset during the prediction phase.
  * Memory usage increases because the algorithm stores all training data in memory. For large datasets, this can become inefficient and require a lot of RAM.

3. Poor Performance with High Dimensions
* As the number of dimensions increases, the data becomes sparse, and distances between points become less meaningful.
* KNN relies heavily on distances to find neighbors, and in high-dimensional spaces, the distance between points becomes more similar across the entire dataset, making the algorithm less effective.

4. Sensitive to Noisy Data and Outliers
* Outliers can heavily affect the predictions because KNN doesn't have an internal model to smooth things out.
  * For example, if an outlier is close to the query point, it might incorrectly influence the classification or regression.
* Noise in the data can also disrupt the neighbor finding process and reduce accuracy.

5. Requires Feature Scaling
* Feature sensitivity:
KNN is highly sensitive to the scale of features.
  * Features with larger values can dominate the distance calculations, while features with smaller scales might be ignored.
* Solution: we need to scale/normalize our data to ensure fair distance computation.

6. Choice of K
* Selecting the right K is crucial:
  * Small K → Overfitting because the model is too sensitive to noise and outliers.
  * Large K → Underfitting because the model may oversmooth and lose local structure in the data.
* Finding the optimal value of K can sometimes be difficult, requiring techniques like cross-validation.

7. Not Ideal for Real-Time Predictions
* Since KNN needs to search the entire dataset for each prediction, it is not suitable for real-time predictions or applications that require fast responses.

8. Imbalanced Data Problems
* KNN can struggle with imbalanced datasets where some classes are underrepresented.
  * The algorithm might tend to favor the majority class, especially if K is large, because majority-class points then outnumber the others among the neighbors.
  * Solution: Weighted KNN can be used to give more importance to the closer neighbors, or we can balance the data using techniques like oversampling or undersampling.

**Summary of Disadvantages**

|Disadvantage	|Description|
|-|-|
|High Computational Cost	|Slow for large datasets due to distance calculation.|
|Memory Intensive	|Requires storing the entire dataset in memory.|
|Curse of Dimensionality	|Struggles with high-dimensional data.|
|Sensitive to Noise & Outliers	|Outliers and noise can heavily impact results.|
|Requires Feature Scaling	|Sensitive to feature scales, needing normalization.|
|Choice of K	|Selecting the optimal K value is not straightforward.|
|Not Ideal for Real-Time	|Slow predictions, especially for large datasets.|
|Imbalanced Data	|May be biased towards majority class in unbalanced data.|

## Q9. How does feature scaling affect KNN?
**Ans** - Feature scaling is critical in K-Nearest Neighbors because the algorithm heavily relies on measuring the distance between data points to identify their nearest neighbors. If the features are not scaled properly, the distance calculations can be biased, leading to inaccurate predictions.

**Why Feature Scaling Matters in KNN**

KNN calculates the distance between the data points to find the closest neighbors. If features are on different scales, features with larger values will dominate the distance calculation, and features with smaller values may have little influence, even though they could be important.

**Example:**

Imagine we have two features, height and income, with the following values for two data points:
* Point A: Height = 160 cm, Income = 50,000 dollars
* Point B: Height = 180 cm, Income = 200,000 dollars

Without scaling:
* Distance in height will be the difference between 160 and 180.
* Distance in income will be the difference between 50,000 and 200,000.

Since income has much larger values, it will dominate the distance metric, making the difference in height seem less significant. This could mislead the algorithm into focusing more on income than height, even if height is just as important in our problem.
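
Plugging the two points into the Euclidean formula shows how lopsided the contributions are. The snippet below is a small arithmetic check using only the numbers from this example.

```python
import math

point_a = (160, 50_000)    # (height in cm, income in dollars)
point_b = (180, 200_000)

height_gap = point_b[0] - point_a[0]      # 20
income_gap = point_b[1] - point_a[1]      # 150,000

distance = math.sqrt(height_gap**2 + income_gap**2)
height_share = height_gap**2 / (height_gap**2 + income_gap**2)

print(f"distance = {distance:.4f}")
print(f"share of squared distance from height = {height_share:.2e}")
```

The distance is practically identical to the income gap alone, and height accounts for a vanishingly small fraction of the squared distance.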

**How Feature Scaling Affects KNN's Performance**
1. Without Scaling:
* Features with larger numeric ranges will overshadow other features.
* KNN may become biased toward the larger-scale features, and distances may not reflect actual closeness in a meaningful way.

2. With Scaling:
* Normalization or Standardization puts all features on the same scale.
* Equal importance is given to all features, and KNN will treat all features fairly when calculating distances.
* This ensures that all features contribute equally to the decision-making process.

**Feature Scaling Methods for KNN**
1. Min-Max Scaling

Min-Max scaling rescales the features so they fall within a specific range, usually between 0 and 1.

Formula:

    X′ = (X-min(X))/(max(X)-min(X))

* Good for: When we need our data in a fixed range.
* Drawback: It's sensitive to outliers — extreme values can distort the scaling.

2. Z-score Scaling

Z-score scaling standardizes the data by subtracting the mean and dividing by the standard deviation. This makes the data have a mean of 0 and standard deviation of 1.

Formula:

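    X′ = (X-μ)/σ

Where:
* μ is the mean of the feature,
* σ is the standard deviation.

* Good for: Features with no natural bounds, or when we want the data centered around 0. It is less distorted by a single extreme value than Min-Max scaling, although outliers still shift the mean and standard deviation.
* Drawback: The scaled values are not confined to a fixed range, and strongly skewed features can still produce extreme scaled values.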
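
Both formulas are short enough to write by hand. The sketch below uses plain Python with made-up income values; in practice scikit-learn's `MinMaxScaler` and `StandardScaler` do the same job.

```python
import statistics

def min_max_scale(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score_scale(values):
    """Standardize to mean 0 and standard deviation 1: (x - mean) / std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mu) / sigma for x in values]

incomes = [10_000, 50_000, 120_000, 200_000]   # illustrative values only
scaled = min_max_scale(incomes)
standardized = z_score_scale(incomes)
```

After Min-Max scaling the values run from 0 to 1; after Z-score scaling they have mean 0 and standard deviation 1 but no fixed bounds.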

**Visual Example of the Effect of Feature Scaling in KNN**

Without Scaling:
* Feature 1: Age (0-100 years)
* Feature 2: Income (10,000-200,000)

Without scaling, the distance between two points would be dominated by Income, as the range for income is much larger.

With Scaling:
* After scaling, both Age and Income will be on the same scale. This ensures that both features are given equal weight in the distance calculation, and the nearest neighbors are selected more meaningfully.

## Q10. What is PCA (Principal Component Analysis)?
**Ans** - PCA is a linear transformation method that:
* Converts data into a new coordinate system where the axes represent the directions of maximum variance in the data.
* The first principal component captures the most variance, the second one captures the next most, and so on.
* It essentially projects data onto fewer dimensions without losing much information.

**PCA Working**
1. Center the Data:

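Before applying PCA, we subtract the mean of each feature from the data. This centers the data around the origin, so the components describe variation around the mean rather than the data's offset. Centering alone does not remove differences in scale; if features use different units, they should also be standardized.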

2. Calculate the Covariance Matrix:
* The covariance matrix shows how different features vary with each other.
* If we have features like height and weight, the covariance matrix captures the relationship between them.

3. Compute Eigenvalues and Eigenvectors:
* Eigenvectors are the directions in which the data has the most variance. They represent the principal components.
* Eigenvalues represent the magnitude of variance along each eigenvector. The higher the eigenvalue, the more significant the eigenvector.

4. Sort Eigenvectors by Eigenvalues:

The eigenvectors are sorted by their corresponding eigenvalues in descending order, so the first principal component has the highest variance, and the second principal component has the second-highest variance, and so on.

5. Project the Data:

The original data is projected onto the top principal components. This results in a new set of coordinates for the data, now expressed in fewer dimensions.

**PCA Intuition**

Imagine we have data points spread out in a 2D space:
* Without PCA, our data might be spread along some diagonal direction in the 2D space.
* With PCA, we find the axis that best captures the variance, then project the data along this new axis. This new axis could capture most of the spread of our data.

If we then move to 3D and apply PCA again, we can reduce dimensions further by projecting onto just 2 or 1 of the new principal components, still preserving the most significant parts of the data.

**Why Use PCA**
1. Dimensionality Reduction:
* Reduces the number of features in the data, which can speed up computations and make models more interpretable.
* Useful for high-dimensional data like text data or images.

2. Improves Performance:
* In some cases, PCA can improve the performance of machine learning algorithms by reducing overfitting. It removes noisy features and retains the most informative ones.

3. Visualize High-Dimensional Data:
* PCA is often used for visualizing high-dimensional data by projecting it to 2D or 3D.

4. Multicollinearity Reduction:
* PCA helps in reducing multicollinearity, which can improve the performance of models like linear regression.

**Applications of PCA:**
1. Image Compression:
* PCA represents an image with a small number of principal components instead of every pixel value, while preserving its key features.

2. Data Visualization:
* It helps visualize complex data, reducing it to 2D or 3D for better interpretation.

3. Feature Extraction:
* In machine learning, PCA is used to extract meaningful features and reduce noise in data.

4. Preprocessing:
* PCA is often applied as a preprocessing step for machine learning algorithms to improve their performance by discarding low-variance directions, which often carry mostly noise.

**PCA in Action**

Example: Imagine we have a dataset with 3 features: height, weight, age.
* Step 1: Standardize the data.
* Step 2: Compute the covariance matrix of the data.
* Step 3: Calculate the eigenvalues and eigenvectors of the covariance matrix.
* Step 4: Choose the top 2 principal components based on the highest eigenvalues.
* Step 5: Project the data onto these 2 components, effectively reducing our 3D data to 2D while preserving the most important variance.
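
With scikit-learn, these steps collapse into a few calls. The height, weight, and age values below are synthetic stand-ins generated for this sketch. Note also that scikit-learn's `PCA` finds the components with an SVD rather than an explicit covariance eigendecomposition; the resulting directions are the same up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for height (cm), weight (kg), age (years)
height = rng.normal(170, 10, size=100)
weight = 0.9 * height - 85 + rng.normal(0, 5, size=100)   # correlated with height
age = rng.normal(35, 8, size=100)
X = np.column_stack([height, weight, age])

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=2)                   # keep the top 2 components
X_2d = pca.fit_transform(X_std)             # find the components and project

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` reports the fraction of total variance captured by each kept component, which is a convenient way to judge how much was lost in the reduction.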

**Considerations:**
* PCA does not work well with non-linear relationships between features. If the data has non-linear patterns, we might need to look into Non-Linear PCA or other techniques like t-SNE or UMAP.
* Interpretability: PCA transforms the features into new components, which can be harder to interpret since the new components are combinations of the original features.
* Loss of Information: If we reduce to too few components, we might lose some important information, which could hurt model performance.

## Q11. How does PCA work?
**Ans** - Principal Component Analysis works by identifying the directions of greatest variance in our data (the principal components), and then transforming the data into a new set of axes aligned with these components. The goal is to reduce the dimensionality of our data while retaining as much variance as possible.

**PCA Working: Step-by-Step**
1. Standardize the Data
* The first step in PCA is to center the data by subtracting the mean of each feature. If features have different scales, they should also be standardized to have the same scale.

This is important because PCA is sensitive to the variance of features, and features with larger scales would dominate the principal components if the data is not standardized.

2. Calculate the Covariance Matrix
* Covariance tells us how two features change together. If features are highly correlated, their covariance will be large, meaning that they share a similar pattern of variation.
* The covariance matrix is computed to represent the covariance between all pairs of features in the data. For a dataset with n features and N samples, the covariance matrix is an n x n matrix where each entry is:

      Cov(X, Y) = (1/N) Σᵢ (Xᵢ − μ_X)(Yᵢ − μ_Y)

This matrix captures the relationships between all pairs of features in the dataset.

3. Calculate the Eigenvalues and Eigenvectors of the Covariance Matrix
* Eigenvectors and eigenvalues are central to PCA:
  * Eigenvectors determine the directions of the new feature axes.
  * Eigenvalues represent the magnitude of variance along the corresponding eigenvector.

Mathematically, the eigenvector v and eigenvalue λ satisfy the equation:

          Cv = λv
where
* C is the covariance matrix.

Why Eigenvectors and Eigenvalues?

Eigenvectors represent the directions of maximum variance in the data, and eigenvalues show how much variance each direction accounts for.

4. Sort the Eigenvectors by Eigenvalues
* Once we have the eigenvectors and eigenvalues, the next step is to sort them in descending order of the eigenvalues. This step ensures that the most significant principal components are chosen first.
* The first principal component corresponds to the eigenvector with the largest eigenvalue and captures the most variance in the data. The second principal component corresponds to the second largest eigenvalue, and so on.

5. Choose the Top k Principal Components
* After sorting, we can decide how many principal components to keep. Typically, we'll select the top k components based on their eigenvalues. The number of components we choose depends on how much variance we want to retain in the data.
* The goal is to reduce dimensionality by selecting just the most important components, while still retaining as much variance as possible.

6. Project the Data onto the New Principal Components
* Finally, we project the original data onto the top k principal components. This is done by multiplying the original data matrix by the matrix of the selected eigenvectors.

The result is a new set of k-dimensional data that is a linear combination of the original features, with the data now described by the most important features.
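
The six steps can be followed line by line in NumPy. This is a sketch on synthetic data (three features, one nearly a multiple of another), using the 1/N covariance convention from the formula above; the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, the second strongly tied to the first
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Step 1: center each feature
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix (features x features), using the 1/N convention
C = (X_centered.T @ X_centered) / X_centered.shape[0]

# Step 3: eigen-decomposition; eigh is meant for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: sort by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 5: keep the top k components
k = 2
W = eigenvectors[:, :k]

# Step 6: project the centered data onto them
X_reduced = X_centered @ W

explained = eigenvalues[:k].sum() / eigenvalues.sum()
```

Here `explained` is the fraction of total variance kept by the top k components, and the variance of the first column of `X_reduced` equals the largest eigenvalue.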

**PCA Visualized**

Imagine we have a 2D dataset:
* The data might look like a set of points spread out along some diagonal line.
* The first principal component would be the line that best fits this spread, capturing the maximum variance.
* The second principal component would be perpendicular to PC1 and capture the remaining variance.

By projecting the data onto just PC1, we have reduced the 2D data to 1D while retaining the maximum possible variance. If we keep both PC1 and PC2, nothing is discarded: the data is simply expressed in rotated axes, with the same information as the original 2D space.

**Why PCA Is Useful**
* Data Simplification: PCA helps to reduce the number of features by combining correlated features into a smaller set of uncorrelated principal components.
* Variance Retention: It helps retain as much information as possible with fewer features, which can improve model performance and visualization.
* Noise Reduction: By discarding components with low variance, PCA can filter out noise and make patterns in the data clearer.

**Example of PCA with a 3D Dataset**

Imagine we have a dataset with 3 features: height, weight, and age. Here's how PCA would reduce this 3D dataset:
1. Center the data by subtracting the mean of each feature.
2. Calculate the covariance matrix to see how the features correlate.
3. Compute the eigenvectors and eigenvalues.
4. Choose the top 2 eigenvectors based on the largest eigenvalues.
5. Project the original 3D data onto the 2D plane defined by the top 2 principal components.

Now, the data is represented in a 2D space that still captures most of the variance of the original 3D data.

## Q12. What is the geometric intuition behind PCA?
**Ans** - The geometric intuition behind Principal Component Analysis is rooted in the idea of finding the directions in the data space that capture the maximum variance or spread of the data. PCA essentially involves finding new axes to represent the data such that the most important patterns in the data are preserved in these new axes.

**1. Data as Points in Space**

Imagine we have a dataset with two features like height and weight. Each data point can be represented as a point in 2D space:
* The x-axis represents height.
* The y-axis represents weight.

If we have a set of points, the data will be scattered across this 2D space.

**2. Finding the "Best" Direction**

PCA's main task is to find a new direction in the data space where the data points are spread out the most. This direction is called the first principal component.

Geometrically:
* We want to find a straight line in the 2D space that best fits the distribution of the points.
* The goal is to find the line along which the data has the greatest variance, meaning the points are most spread out in that direction.

In 2D, this line is the line of best fit that minimizes the sum of squared perpendicular distances from the points to the line (unlike ordinary regression, which minimizes vertical distances). Minimizing that error is equivalent to maximizing the variance of the projected points, so this is the direction that captures the maximum variation in the data.

Example:
* Imagine our data is somewhat diagonal.
* The first principal component will be a line along this diagonal spread, capturing the maximum variance in the data along that direction.

**3. Perpendicular Direction**

Once we have the first principal component, the next step is to find a second direction that is perpendicular to the first one.
* This second direction captures the next largest variance in the data, but it is orthogonal to the first component.
* In 2D, the second principal component will be a line perpendicular to the first, and it captures the remaining variation in the data that isn't covered by PC1.

**4. Dimensionality Reduction**

Once we've found the first and second principal components, we can project our data onto these new axes, effectively reducing the dimensionality of our data:
* For example, if we start with 3D data, PCA can reduce it to just 2D or even 1D by focusing on the components that explain the most variance.
* In higher dimensions, the process is the same, but the axes we choose are in the higher-dimensional space, and the projection is onto those axes.

**5. How the Data is Projected**

The actual transformation PCA does is to rotate the original data along the new axes defined by the principal components. This involves two key geometric operations:
1. Centering the data: Subtract the mean of each feature to center the data around the origin.
2. Projecting the data onto the new axes: After finding the new axes, we project the original data points onto these axes. This projection results in lower-dimensional data but retains as much of the variance as possible.

**Geometric Intuition with an Example**

Imagine we have a set of data points in a 2D space:
* Original axes: The data is spread out in a certain direction, but it's not aligned with the axes.
* PCA step: PCA finds the direction of maximum spread, which would be the first principal component.
* Transformation: We then project the data onto this new direction, creating a new set of points along the first principal component.
* Perpendicular component: PCA also finds the second principal component, which is perpendicular to the first and represents the next largest variance in the data.

By rotating the data to align with these new axes, we're compressing the data into a more efficient representation without losing much information.

**Geometric Interpretation of Eigenvectors and Eigenvalues**
* The eigenvectors represent the directions of the new axes. They are vectors that point in the direction of maximum variance in the data.
* The eigenvalues represent the magnitude of variance along each of these new axes. Larger eigenvalues mean that the corresponding eigenvector captures more of the variance, making that component more important.

**Visual Example in 2D:**
1. Data Points: Suppose we have data that is not aligned with the axes, but has a clear pattern in some diagonal direction.
2. First Eigenvector: The first eigenvector will point along the direction of the diagonal, capturing the maximum variance in the data.
3. Second Eigenvector: The second eigenvector will point along the perpendicular direction to the first, capturing the remaining variance.
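
This picture can be checked numerically. The sketch below builds a synthetic 2D cloud stretched along the 45-degree diagonal and measures where the eigenvectors point; the data and variable names are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=500)
# Synthetic 2D cloud stretched along the 45-degree diagonal, with a little noise across it
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))
X_centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
pc1 = eigenvectors[:, np.argmax(eigenvalues)]
pc2 = eigenvectors[:, np.argmin(eigenvalues)]

# Angle of PC1 from the x-axis, folded into [0, 180) because an eigenvector's sign is arbitrary
angle_pc1 = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180
```

The first eigenvector lies close to 45 degrees, along the diagonal, and the second is orthogonal to it, matching the geometric description.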

## Q13. What is the difference between Feature Selection and Feature Extraction?
**Ans** - Feature Selection and Feature Extraction are both techniques used to reduce the number of features in a dataset, but they work in fundamentally different ways.

**Feature Selection:**
* Feature selection is the process of selecting a subset of the most relevant features from the original set, without transforming the features. We keep the original features but remove irrelevant, redundant, or noisy ones.

How it works:
* No transformation of the original features occurs.
* Features are selected based on certain criteria.
* It's about filtering out the less useful features.

Methods of Feature Selection:
1. Filter Methods: These are independent of any machine learning algorithms and rely on statistical measures to select the most relevant features.
  * Example: Using Pearson correlation to find pairs of features that are highly correlated with each other and dropping one feature from each pair.

2. Wrapper Methods: These methods evaluate subsets of features based on the performance of a machine learning model. They try to find the best subset of features by training a model and measuring its performance.
  * Example: Recursive Feature Elimination, where a model is trained multiple times and features are removed iteratively to improve performance.

3. Embedded Methods: These methods perform feature selection during the model training process. Algorithms like Lasso regression or Decision Trees automatically select important features during model fitting.
  * Example: Lasso regression applies L1 regularization and automatically shrinks the coefficients of less important features to zero.
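
The three families can be tried side by side in scikit-learn. This is a rough sketch, assuming the built-in diabetes regression dataset and arbitrary settings (`k=4`, `alpha=1.0`); `f_regression` is used as the filter score, which is a univariate statistic derived from the correlation between each feature and the target.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X_fs, y_fs = load_diabetes(return_X_y=True)  # 10 numeric features

# Filter: score each feature on its own, keep the 4 best (no model involved)
filter_sel = SelectKBest(score_func=f_regression, k=4).fit(X_fs, y_fs)

# Wrapper: repeatedly fit a model and drop the weakest feature until 4 remain
wrapper_sel = RFE(LinearRegression(), n_features_to_select=4).fit(X_fs, y_fs)

# Embedded: L1 regularization zeroes out some coefficients during fitting;
# SelectFromModel keeps the features whose coefficients stay non-zero
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X_fs, y_fs)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, "keeps columns", sel.get_support(indices=True))

# The selected columns are untouched copies of the original ones
X_filter = filter_sel.transform(X_fs)
```

The three methods do not have to agree on which columns to keep, which is one reason to compare them on the task at hand.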

**Advantages of Feature Selection:**
* Simplicity: The features we keep are the original features, so it's easier to interpret.
* Less computation: Reducing the number of features can speed up model training and reduce overfitting.

**Disadvantages of Feature Selection:**
* Might miss interactions: Feature selection only selects or rejects entire features, so we may miss some relationships or interactions between features that could have been useful.

**Feature Extraction:**
* Feature extraction is the process of transforming the data into a new set of features, typically by combining or deriving new features that represent the most important information in a reduced dimensional space.

How it works:
* Transformation happens, and the original features are combined or mapped into new features.
* New features are created, which might not directly resemble the original features.

Methods of Feature Extraction:
1. Principal Component Analysis: PCA transforms the data into new features called principal components. These components are linear combinations of the original features that capture the maximum variance in the data.
  * Example: Reducing 5 features into 2 principal components that capture most of the variance in the data.

2. Linear Discriminant Analysis: LDA is used for dimensionality reduction while maintaining class separability. It transforms features to maximize the separation between classes.
  * Example: Reducing dimensions while ensuring that different classes in the data remain separable.
3. Autoencoders: These are neural networks used for unsupervised learning. The network learns to compress the data into a lower-dimensional representation, and then reconstruct it back to the original features.
  * Example: A neural network trained to extract the most important features for image compression.

4. t-SNE or UMAP: These are techniques used for nonlinear dimensionality reduction that help in visualizing complex data in 2D or 3D while preserving the data's structure.
  * Example: Reducing high-dimensional data to 2D for visualization.
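
As a small illustration of the first two methods, the sketch below reduces the four Iris features to two with PCA and with LDA. The dataset and variable names are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_fe, y_fe = load_iris(return_X_y=True)  # 4 original features

# PCA is unsupervised: it only looks at X
pca_fe = PCA(n_components=2)
X_pca = pca_fe.fit_transform(X_fe)

# LDA is supervised: it also uses the class labels y
lda_fe = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda_fe.fit_transform(X_fe, y_fe)

print(X_fe.shape, X_pca.shape, X_lda.shape)

# Each row holds the weights that combine the 4 original features
# into one new feature
print(pca_fe.components_)
```

Unlike feature selection, none of the two output columns is an original measurement; each is a weighted mix of all four.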

**Advantages of Feature Extraction:**
* Reduced Dimensionality: Can significantly reduce the number of features while preserving important patterns.
* New Features: Can create new features that capture more complex relationships within the data.
* Better for Complex Data: Can handle non-linear data better.

**Disadvantages of Feature Extraction:**
* Loss of Interpretability: The new features may not be easily interpretable, as they are combinations or transformations of the original features.
* More computation: Some feature extraction methods can be computationally intensive and require more time.

**Differences:**

|Aspect	|Feature Selection	|Feature Extraction|
|-|||
|Nature	|Selects a subset of original features.	|Creates new features by transforming the original ones.|
|Output	|A smaller set of the original features.	|A new set of features, usually lower-dimensional.|
|Transformation	|No transformation of the features.	|The features are transformed into new combinations.|
|Interpretability	|Easy to interpret, as features remain unchanged.	|Often less interpretable, as new features are derived.|
|Computation	|Typically faster, as it involves selecting features.	|Can be computationally expensive, especially for complex methods.|
|Use Case	|Works well when we want to keep only the most important features.	|Works well when we need to reduce dimensionality or extract hidden patterns.|

## Q14. What are Eigenvalues and Eigenvectors in PCA?
**Ans** - Eigenvalues and eigenvectors play a central role in Principal Component Analysis. They are mathematical concepts used to transform the data into a new space, where the data can be analyzed more effectively, particularly when reducing dimensionality.

**Eigenvectors:**

Eigenvectors represent the directions in the original feature space along which the data is most spread out.
* Eigenvectors are vectors that define the axes of the new feature space.
* These vectors indicate the directions in which the data has the most variation, or the most important patterns, in the feature space.

**Geometric Intuition for Eigenvectors:**
* When we perform PCA, we're essentially rotating our data so that the new axes align with the directions of maximum variance.
* The first principal component corresponds to the eigenvector with the largest eigenvalue, which is the direction that captures the most variation in the data.
* The second principal component corresponds to the second eigenvector, which captures the second-most variation, and so on.

**Eigenvalues:**

Eigenvalues correspond to the magnitude or amount of variance captured by each eigenvector. They tell us how much variance is captured along each principal component.
* Eigenvalues represent how important each principal component is. A larger eigenvalue indicates that the corresponding eigenvector captures more variance, meaning that principal component is more important for explaining the structure of the data.
* Smaller eigenvalues indicate less variance, meaning those components are less useful in representing the data's underlying structure.

**Geometric Intuition for Eigenvalues:**
* The larger the eigenvalue, the more spread out the data is along the direction defined by its corresponding eigenvector.
* The eigenvalue reflects how much of the total variance in the dataset is captured by the principal component.

**Mathematical Explanation of Eigenvalues and Eigenvectors:**
Given a covariance matrix C of a dataset, the eigenvalue λ and eigenvector v satisfy the following equation:

    Cv=λv
* C is the covariance matrix of the dataset.
* v is an eigenvector, representing one of the principal directions of the data. The eigenvector with the largest λ is the direction of maximum variance.
* λ is the eigenvalue, representing the amount of variance along that direction.

In simpler terms:

* Eigenvector v: A direction along which the data is spread. The top eigenvector is the direction with the most spread or variance.
* Eigenvalue λ: The amount of variance along the eigenvector v.
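
The equation can be verified numerically. This is a minimal sketch, assuming the Iris data and NumPy's `eigh` routine for symmetric matrices; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

X_eig = load_iris().data

# Covariance matrix of the features (columns are features)
C = np.cov(X_eig, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
# and the eigenvectors are the columns of the second result
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Check C v = lambda v for every eigenvalue/eigenvector pair
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(C @ v, lam * v)

# The eigenvalues add up to the total variance (the trace of C)
total_variance = np.trace(C)
print(eigenvalues, total_variance)
```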

**Importance in PCA:**
* In PCA, we use the eigenvectors to determine the new axes for our data.
* The eigenvalues tell us the importance of these new axes. We can then rank the principal components by their eigenvalues and decide how many components we want to keep for dimensionality reduction.

**The Role of Eigenvalues and Eigenvectors in PCA:**
1. Data Transformation: PCA finds the eigenvectors of the covariance matrix of the data, which represent the directions in which the data has the most variance. Then, it transforms the data into this new space.
2. Variance Explained: The eigenvalues associated with each eigenvector give the amount of variance explained by that principal component. The larger the eigenvalue, the more information the corresponding eigenvector captures.
3. Dimensionality Reduction: By sorting the eigenvectors according to their eigenvalues, we can select the most important components that capture the most variance and discard less important ones. This is the key idea behind dimensionality reduction in PCA.
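
The three steps above can be sketched by hand and compared with scikit-learn. This is an illustrative sketch on the Iris data, not scikit-learn's internal implementation; the two results are compared up to sign because the direction of each eigenvector is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris().data
X_centered = X_raw - X_raw.mean(axis=0)

# 1. Eigen-decomposition of the covariance matrix
C = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 2. Sort components from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 3. Keep the top k directions and project the centered data onto them
k = 2
X_manual = X_centered @ eigenvectors[:, :k]

# Reference result from scikit-learn
pca_ref = PCA(n_components=k).fit(X_raw)
X_sklearn = pca_ref.transform(X_raw)

same_up_to_sign = np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6)
print("Matches scikit-learn up to sign:", same_up_to_sign)
```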

**Geometric Example:**

Imagine we have a dataset with two features, and we want to perform PCA. Here's a simplified geometric interpretation:
1. Data Distribution: Our data points may be spread out along a diagonal direction.
2. First Principal Component: The first eigenvector points in the direction that best captures the variance in the data. The corresponding eigenvalue is large because the data is widely spread along this diagonal direction.
3. Second Principal Component: The second eigenvector will be perpendicular to the first one and captures the remaining variance. Its corresponding eigenvalue will be smaller.

After applying PCA:
* We will have a new set of axes, with the first component capturing the most information, and the second component capturing less.
* We can project the data onto the first few components, which will reduce dimensionality but retain most of the data's variance.

## Q15. How do we decide the number of components to keep in PCA?
**Ans** - Deciding the number of components to keep in Principal Component Analysis is a critical step in dimensionality reduction. The goal is to retain as much variance in the data as possible while reducing the number of features to make the dataset more manageable for modeling and analysis.

Some common methods to decide how many principal components to retain:

**1. Explained Variance Threshold**

One of the most popular methods is to retain enough components to explain a certain percentage of the variance in the data. Typically, we choose a threshold for how much of the total variance we want to retain.
* Variance explained by each component: After performing PCA, each principal component has an associated eigenvalue that tells us how much variance it captures from the data.
* Cumulative explained variance: The cumulative sum of the explained variance tells us how much variance is captured by the first k components. We can plot the cumulative variance and select the number of components that reaches the desired threshold.

Steps:
1. Compute explained variance for each component: This can be done by dividing each eigenvalue by the sum of all eigenvalues.
2. Plot the cumulative explained variance: This shows the percentage of total variance captured by the first k components.
3. Set a threshold: Select the number of components such that the cumulative variance reaches a threshold.

Example:
* If we want to retain 95% of the variance, look for the smallest number k such that the cumulative variance is ≥ 0.95.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the four-feature Iris dataset
data = load_iris()
X = data.data

# Fit PCA keeping all components
pca = PCA()
pca.fit(X)

# Fraction of variance per component, and its running total
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

plt.plot(range(1, len(cumulative_variance)+1), cumulative_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.show()
```

* From the plot, we can determine how many components are needed to retain a specific percentage of variance.
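
The threshold can also be applied without reading it off a plot. The sketch below assumes a 95% target on the Iris data; scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and chooses the number of components for that variance fraction itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Manual rule: smallest k whose cumulative explained variance reaches 0.95
cumulative = np.cumsum(PCA().fit(X_iris).explained_variance_ratio_)
k_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Same idea, letting scikit-learn pick the number of components
pca_95 = PCA(n_components=0.95).fit(X_iris)

print("Manual k:", k_95)
print("scikit-learn k:", pca_95.n_components_)
```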

**2. Scree Plot**

A scree plot shows the eigenvalues associated with each principal component. It is a line plot that helps visualize how much variance is explained by each component.
* The elbow point of the scree plot often suggests an optimal number of components to retain.
* The idea is to retain the components before the "elbow" because they explain the most variance, and after that point, additional components contribute less to the variance.

**Steps:**
1. Plot the eigenvalues of the components.
2. Look for the elbow point where the eigenvalues start to decrease more slowly.
3. Retain the components before this elbow.

Example:

```python
# Reuses explained_variance from the previous cell
plt.plot(range(1, len(explained_variance)+1), explained_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Scree Plot')
plt.show()
```

* The "elbow" in the scree plot indicates a point where adding more components yields diminishing returns in terms of explained variance.

**3. Kaiser Criterion**

This method suggests that we should retain all components with eigenvalues greater than 1. The rule assumes the features have been standardized, so that each original feature has a variance of 1. An eigenvalue greater than 1 then indicates that the principal component explains more variance than an individual feature in the original dataset.

**Steps:**
1. Calculate the eigenvalues of the covariance matrix of the standardized data (equivalently, the correlation matrix).
2. Retain all components whose eigenvalue is greater than 1.

**Limitations:**
* While simple, this rule doesn't always provide the most optimal selection of components, especially in datasets with noise or where features have different scales.
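
A short sketch of the rule, assuming the Iris data and standardization with `StandardScaler` so that every feature starts with unit variance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: the "greater than 1" cut-off only makes sense
# when each original feature has a variance of about 1
X_std = StandardScaler().fit_transform(load_iris().data)

# explained_variance_ holds the eigenvalues of the covariance matrix
kaiser_eigenvalues = PCA().fit(X_std).explained_variance_

# Kaiser criterion: keep components whose eigenvalue exceeds 1
n_keep = int(np.sum(kaiser_eigenvalues > 1))

print("Eigenvalues:", kaiser_eigenvalues)
print("Components kept:", n_keep)
```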

**4. Cross-Validation or Model-Based Methods**

We can use cross-validation or model-based techniques to determine the optimal number of components based on model performance. For example:
* Perform cross-validation with different numbers of principal components.
* Evaluate model performance on a downstream task like classification or regression.
* Choose the number of components that provides the best performance or the best trade-off between variance retention and model performance.

**Steps:**
1. Apply PCA with a varying number of components.
2. Evaluate a machine learning model using cross-validation on the transformed data.
3. Select the number of components that gives the highest performance.
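
These steps map naturally onto a pipeline with a grid search. The sketch below is one possible setup, assuming the Iris data, a KNN classifier with `n_neighbors=5`, and 5-fold cross-validation; the step names (`"scale"`, `"pca"`, `"knn"`) are arbitrary labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# Scaling and PCA sit inside the pipeline, so they are re-fit on each
# training fold and never see the validation fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Try every possible number of components and score each with 5-fold CV
search = GridSearchCV(pipe, {"pca__n_components": [1, 2, 3, 4]}, cv=5)
search.fit(X_cv, y_cv)

best_k = search.best_params_["pca__n_components"]
print("Best number of components:", best_k)
print("Mean CV accuracy:", round(search.best_score_, 3))
```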

**5. Domain Knowledge and Interpretability**

Sometimes the number of components to retain can also be decided based on domain knowledge or the interpretability of the components.
* For example, if we are working with image data and we know that most of the important features are captured in the first few principal components, we may choose to retain those.
* Similarly, if we're working with text data, we might want to retain enough components to capture the most meaningful patterns, but not so many that the result becomes hard to interpret.

## Q16. Can PCA be used for classification?
**Ans** - Principal Component Analysis is primarily a dimensionality reduction technique rather than a direct classification method. However, it can be used as a preprocessing step in classification tasks, and in some cases, it can help improve the performance of classification algorithms.

**PCA as a Preprocessing Step for Classification:**
1. Reducing Dimensionality:
* High-dimensional datasets can suffer from problems like overfitting, high computational cost, and poor generalization. PCA helps by reducing the number of features while retaining most of the variance in the data.
* After applying PCA, we transform the data into a lower-dimensional space, which makes it easier for classification models to learn the underlying patterns in the data.

2. Improving Model Performance:
* Some classification algorithms can benefit from PCA, especially when the original dataset has a lot of noise, redundant features, or correlated features.
* By removing less important features, PCA can help the model focus on the most meaningful patterns, improving its accuracy and speed.

3. Visualizing Data:
* PCA is also commonly used for visualizing high-dimensional data. By projecting data onto the first two or three principal components, we can get an intuitive sense of how well the data can be separated for classification tasks.
* This is particularly useful for classification models like SVM or logistic regression, where we can visualize how well the decision boundary separates classes in the reduced feature space.

**How PCA Works in Classification:**
1. Apply PCA to the dataset to reduce dimensionality.
2. Train the classifier on the transformed dataset.
3. Evaluate the classifier's performance based on its ability to predict class labels.

Example with Python:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit PCA on the training data only, then apply the same projection to the test data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a linear SVM on the two principal components
clf = SVC(kernel='linear')
clf.fit(X_train_pca, y_train)

y_pred = clf.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```

* Explanation:
  * PCA is fit on the training data to reduce it to 2 components, and the same projection is then applied to the test data.
  * An SVM classifier is then trained on the reduced dataset.
  * The classifier's performance is evaluated using accuracy.

**When to Use PCA in Classification:**
1. High Dimensionality:
  * PCA is particularly useful when we have many features and suspect that not all features are equally important.
  * For example, datasets with thousands of features can benefit from PCA to reduce the feature space.

2. Collinearity:
  * When features are highly correlated, PCA can help by transforming the features into a set of uncorrelated components, which can improve the performance of classifiers like logistic regression or SVM.

3. Noise Reduction:
  * If the dataset contains a lot of noise or irrelevant features, PCA can help by removing the components with lower variance that often correspond to noise, leaving only the most informative components.
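
The collinearity point can be seen directly: after PCA, the components have no linear correlation with each other. A minimal sketch, assuming the Iris data (where petal length and petal width are strongly correlated):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_corr = load_iris().data

# Correlation between the original features (columns 2 and 3 are petal length/width)
corr_before = np.corrcoef(X_corr, rowvar=False)

# Covariance between the principal components
X_components = PCA().fit_transform(X_corr)
cov_after = np.cov(X_components, rowvar=False)

# Everything off the diagonal should be numerically zero
off_diagonal = cov_after - np.diag(np.diag(cov_after))
max_off = np.abs(off_diagonal).max()

print("Petal length vs. width correlation:", round(float(corr_before[2, 3]), 3))
print("Largest covariance between components:", max_off)
```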

**When Not to Use PCA in Classification:**
1. Interpretability:
* If we need to interpret the results in terms of the original features, PCA may not be ideal. After PCA, the principal components are linear combinations of the original features, so it's harder to interpret the impact of individual features.

2. Small Datasets:
* For smaller datasets, PCA may remove important features that could actually help in classification. In such cases, it's better to use classifiers directly on the original data or with minimal dimensionality reduction.

3. Loss of Information:
* When we reduce dimensions too much, we might lose important information that is needed for classification. The decision on how many components to keep should be carefully chosen based on the explained variance.

## Q17. What are the limitations of PCA?
**Ans** - Principal Component Analysis is a powerful and widely used technique for dimensionality reduction, but it does have some limitations and drawbacks. These limitations can affect its effectiveness in certain situations.

**1. Linear Assumption**
* Limitation: PCA assumes that the relationships between features are linear. It identifies the principal components by finding linear combinations of the original features that maximize the variance in the data.
* Impact: If the underlying structure of the data is non-linear, PCA may fail to capture important patterns or features of the data. In such cases, non-linear dimensionality reduction techniques may perform better.

**2. Sensitivity to Outliers**
* Limitation: PCA is sensitive to outliers because it is based on variance, and outliers can significantly increase variance in the data. Since PCA tries to find the directions with the maximum variance, an outlier can distort the directions of the principal components.
* Impact: Outliers can lead to a poor or misleading representation of the data in the lower-dimensional space. In some cases, it might result in principal components that do not represent the true structure of the data.
* Solution: We can pre-process the data by removing or handling outliers before applying PCA.

**3. Difficulty in Interpretation**
* Limitation: After performing PCA, the new features are linear combinations of the original features. This can make it difficult to interpret what each principal component represents, especially if the original data contains many features.
* Impact: If interpretability of the results is crucial, PCA might not be the best choice because the transformed features don't correspond directly to the original features.
* Solution: Techniques like Factor Analysis or Independent Component Analysis might be more suitable when we need both dimensionality reduction and interpretability.

**4. Loss of Information**
* Limitation: When we reduce the number of dimensions using PCA, we may lose some important information. Even though PCA tries to retain the most important variance in the data, some useful but subtle patterns might get discarded, especially if we reduce the dimensions too much.
* Impact: If we retain too few components, we might reduce the variance too much, leading to the loss of key information that could be valuable for tasks like classification or prediction.
* Solution: Carefully choose the number of components to retain based on the cumulative explained variance or cross-validation, ensuring that the retained components capture enough variance while minimizing information loss.

**5. Assumes Zero Mean Data**
* Limitation: PCA requires that the data be centered, meaning each feature should have a mean of zero. If the data is not centered, the results of PCA may be distorted.
* Impact: If the data is not preprocessed, the principal components will be shifted, and the results might not be meaningful.
* Solution: Always center the data before applying PCA, or use techniques that automatically handle centering.

**6. Assumes Features are on the Same Scale**
* Limitation: PCA is sensitive to the scale of the features. If the features in our dataset are on different scales, the principal components will be dominated by the features with larger scales, potentially distorting the results.
* Impact: Features with larger numeric values will disproportionately influence the principal components, leading to biased results.
* Solution: Standardize or normalize the data so that all features have the same scale before applying PCA.
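
The effect of scale is easy to demonstrate. The sketch below assumes scikit-learn's built-in wine dataset, where one feature (proline) takes values in the hundreds while others are below 10.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = load_wine().data  # 13 features on very different scales

# PCA on the raw data: the large-valued feature dominates the first component
ratio_raw = PCA().fit(X_wine).explained_variance_ratio_

# PCA after standardizing every feature to zero mean and unit variance
X_wine_scaled = StandardScaler().fit_transform(X_wine)
ratio_scaled = PCA().fit(X_wine_scaled).explained_variance_ratio_

print("First component share, raw data:   ", round(float(ratio_raw[0]), 3))
print("First component share, scaled data:", round(float(ratio_scaled[0]), 3))
```

On the raw data almost all of the variance lands on the first component, simply because one feature has much larger numbers. After scaling, the variance is shared across several components.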

**7. Difficulty in Handling Categorical Data**
* Limitation: PCA works best with continuous numerical data. It is not well-suited for datasets that contain categorical variables.
* Impact: If we apply PCA directly to categorical data, it won't capture meaningful patterns, as categorical features do not have an inherent numerical relationship.
* Solution: If we want to apply PCA to datasets with categorical data, we need to encode categorical variables into numerical values before applying PCA.

**8. Computational Complexity for Large Datasets**
* Limitation: PCA can be computationally expensive, especially for large datasets with many features and observations. The calculation of the covariance matrix and finding the eigenvectors and eigenvalues can become time-consuming as the size of the dataset increases.
* Impact: For datasets with millions of features or observations, performing PCA might take a considerable amount of time and memory.
* Solution: For very large datasets, we can use approximate methods for PCA, such as Incremental PCA or Randomized PCA, which scale better with large datasets.
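
Both alternatives are available in scikit-learn. This is a small sketch on the digits dataset, which is only a stand-in for a genuinely large dataset; the batch size and number of components are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA

X_big = load_digits().data  # 1797 samples x 64 features (stand-in for a large dataset)

# Incremental PCA processes the data in mini-batches, so the whole
# dataset does not need to be decomposed at once
ipca = IncrementalPCA(n_components=10, batch_size=200)
X_ipca = ipca.fit_transform(X_big)

# Randomized PCA approximates the top components with a randomized SVD
rpca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_rpca = rpca.fit_transform(X_big)

print(X_ipca.shape, X_rpca.shape)
```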

**9. No Guarantee of Improved Model Performance**
* Limitation: While PCA often helps reduce dimensionality and improve computation speed, reducing dimensionality doesn't always lead to better model performance. In some cases, especially if the data is already well-structured, PCA might actually hurt the performance by discarding relevant features.
* Impact: If the principal components don't capture the most important discriminative features for a model, PCA might lead to worse model performance, especially in tasks like classification and regression.
* Solution: Always evaluate the performance of a model with and without PCA using cross-validation to ensure that dimensionality reduction leads to better results.
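
A with-and-without comparison might look like the sketch below. It assumes the wine dataset, a KNN classifier, and two components for PCA; which variant scores higher depends on the data, so no outcome is claimed here.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cmp, y_cmp = load_wine(return_X_y=True)

# Same classifier and scaling in both pipelines; only the PCA step differs
without_pca = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))

score_without = cross_val_score(without_pca, X_cmp, y_cmp, cv=5).mean()
score_with = cross_val_score(with_pca, X_cmp, y_cmp, cv=5).mean()

print("Mean CV accuracy without PCA:", round(score_without, 3))
print("Mean CV accuracy with PCA:   ", round(score_with, 3))
```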

**10. Linear Combinations May Not Be Physically Meaningful**
* Limitation: The principal components obtained from PCA are linear combinations of the original features. In some domains, these components may not have an intuitive or meaningful physical interpretation, especially in fields like physics or biology.
* Impact: If we need to interpret the reduced features in terms of the original features, the results may not be easily interpretable.
* Solution: For domain-specific applications, we may need to look for techniques that provide more interpretable results.

## Q18. How do KNN and PCA complement each other?
**Ans** - K-Nearest Neighbors and Principal Component Analysis can work together in complementary ways to improve the performance of machine learning models, especially when dealing with high-dimensional datasets.

**1. PCA for Dimensionality Reduction → KNN for Classification**
* Problem with high-dimensional data: When we have a high number of features, KNN can become less effective. This is due to the curse of dimensionality, where the distance between points becomes less meaningful as the number of dimensions increases. High-dimensional spaces lead to more sparse data, making it harder for KNN to identify meaningful neighbors.
* How PCA helps: PCA reduces the dimensionality of the data by finding new axes that capture the most variance in the data. This results in a lower-dimensional representation of the data, where the most informative features are retained, and less important, redundant, or noisy features are discarded.
* Complementary Effect: By reducing the number of dimensions through PCA, the data becomes more compact, and the distance metrics used by KNN become more meaningful. This can improve the performance of KNN, making it faster and more accurate in high-dimensional spaces.

**2. Improved Efficiency in KNN Classification**
* KNN's computational cost: KNN can be computationally expensive, especially when the number of features is very high because it calculates the distance between the test point and all training points. For large datasets, this can be quite slow.
* How PCA helps: By reducing the number of features before applying KNN, PCA reduces the number of distance calculations that KNN has to perform. This speeds up the classification process, as fewer dimensions need to be considered when calculating distances.

**3. Better Handling of Noise and Redundancy in KNN**
* Noise and redundancy in high dimensions: In high-dimensional data, features can often be highly correlated or noisy, which may confuse KNN. KNN relies heavily on the distance metric, so redundant or noisy features can distort distance calculations and hurt its performance.
* How PCA helps: PCA eliminates redundancy by transforming the data into a new set of uncorrelated components. By selecting the components that capture the most variance, PCA removes noise and correlated features, leaving behind a cleaner, more informative representation of the data.
* Complementary Effect: This makes the distance calculations in KNN more reliable and meaningful, which can lead to better classification accuracy.

**4. Reducing Overfitting in KNN**
* Overfitting in KNN: KNN is prone to overfitting, especially when the dataset has many features and the model becomes too sensitive to noise or irrelevant data points.
* How PCA helps: By reducing the number of features, PCA helps to simplify the model, making it less likely to overfit the data. Fewer features mean fewer dimensions in which noise can distort the distances, so the model becomes less sensitive to variations that might not generalize well.
* Complementary Effect: The reduced feature space provided by PCA can help generalize the KNN model, leading to better performance on unseen data.

**5. Visualizing Data for KNN**
* High-dimensional data visualization: Data with more than three dimensions is difficult to visualize, which makes it hard to see the relationships between features and how KNN would separate the classes.
* How PCA helps: PCA can reduce the dimensionality to 2 or 3 components, allowing us to visualize the dataset in a lower-dimensional space. This makes it easier to understand the structure of the data and how KNN might classify different data points.
* Complementary Effect: Visualizing the data in 2D or 3D space can provide insights into the separability of different classes and guide the choice of K for the KNN model. It also allows for more intuitive analysis and debugging of the KNN classifier.

**6. PCA Before KNN for Feature Selection**
* Feature selection: Strictly speaking, PCA is feature extraction, because it keeps components rather than original features. Even so, keeping only the most important components can act like a form of feature selection, which is useful for focusing on the most relevant information for KNN classification.
* How PCA helps: By applying PCA first, we can focus on the top principal components that capture the most variance in the data. These components are often the most informative, and using them for classification can result in a more efficient and effective KNN model.
* Complementary Effect: This combination helps avoid overfitting due to irrelevant or noisy features while ensuring that KNN works with the most valuable information in the dataset.

**Example: How PCA and KNN Can Be Used Together**
1. Apply PCA - Reduce the dimensionality of the data, retaining enough components to explain most of the variance.

2. Apply KNN - Use the transformed data to train and evaluate the KNN classifier. Since the data is now lower-dimensional and cleaner, the KNN algorithm is more likely to perform well.
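
The two steps can be chained in a single pipeline. The following is a sketch under stated assumptions: the 64-feature digits dataset, a 95% variance target for PCA, and `n_neighbors=5`. A scaling step is added in front because both PCA and KNN are sensitive to feature scale.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_dig, y_dig = load_digits(return_X_y=True)  # 64 pixel features per image
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dig, y_dig, test_size=0.3, random_state=42, stratify=y_dig
)

# Scale -> PCA (keep 95% of the variance) -> KNN
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_tr, y_tr)

n_kept = model.named_steps["pca"].n_components_
test_accuracy = model.score(X_te, y_te)

print("Components kept:", n_kept, "out of", X_dig.shape[1])
print("Test accuracy:", round(test_accuracy, 3))
```

Because PCA lives inside the pipeline, it is fit only on the training split, and KNN computes distances in the reduced space for both training and test points.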

## Q19. How does KNN handle missing values in a dataset?
**Ans** - K-Nearest Neighbors does not inherently handle missing values in a dataset. Since KNN relies on calculating the distance between data points to determine neighbors, missing values can cause problems in these distance calculations. However, there are several strategies that can be used to handle missing values when using KNN:

**1. Imputation**

One of the most common ways to handle missing values in KNN is through imputation, which involves filling in the missing values with estimates based on the rest of the data.
* Simple Imputation Techniques:
  * Mean/Median Imputation: Replace missing values of a feature with the mean or median value of that feature across all instances.
    * For numerical features: Use the mean or median of the available data to fill missing values.
    * For categorical features: Use the mode of the available data.
* KNN Imputation: Use the KNN algorithm itself to predict and fill the missing values. This works by looking at the K-nearest neighbors and using their values to impute the missing ones. KNN finds similar instances to the one with the missing value, and the missing value is replaced by a weighted average or the most frequent value from the neighbors.

Example:

```python
from sklearn.impute import KNNImputer

# X is the feature matrix; entries stored as NaN are treated as missing
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X)
```

In this case, KNNImputer from scikit-learn is used to impute missing values based on the neighbors' values.
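
To see the imputer act on actual gaps, here is a tiny sketch with a made-up 4x3 array containing two missing entries; the values and `n_neighbors=2` are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries (np.nan)
X_missing = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that feature over the 2 nearest rows
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_missing)

# Mask of the entries that were present from the start
observed = ~np.isnan(X_missing)

print(X_filled)
```

Only the missing cells change; the values that were already present are passed through unchanged.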

**2. Removing Instances with Missing Values**
* Removing Rows: If the number of missing values is relatively small, we can simply drop the rows containing missing values.
  * This works well when the missing data is randomly distributed, and the removal of a few rows does not result in significant data loss.
  * However, if many rows have missing values, this approach can lead to a significant reduction in the dataset size, which may affect model performance.

Example:

```python
import pandas as pd
X_clean = pd.DataFrame(X).dropna()  # dropna() is a pandas method, so wrap X in a DataFrame
```

* Removing Columns: If a feature has a significant proportion of missing values, it may be best to drop the feature entirely. This can be useful if the feature is not crucial for prediction or if imputing the missing values doesn't make sense.

**3. Using a Distance Metric That Handles Missing Values**
* Some advanced distance metrics can handle missing values while calculating the distance between points.
  * For example, we could use a custom distance metric where only the available features are used to calculate the distance between two data points. This way, if one feature is missing for a point, the algorithm ignores that feature when calculating the distance.
  * Hamming Distance: For categorical variables, we can compute the Hamming distance, which measures the number of mismatched features between two points. In cases where a feature is missing, it can be treated as a mismatch or ignored.

However, this approach can be more complex to implement and may not always lead to better results.
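
A distance that uses only the available features could look like the sketch below. The function name `masked_euclidean` is hypothetical, and the rescaling by (total features / shared features) is one common convention, assumed here so that pairs sharing fewer features do not look artificially close.

```python
import numpy as np

def masked_euclidean(a, b):
    """Euclidean distance over the features observed in both points.

    The squared distance is rescaled by (total features / shared features).
    Returns NaN when the two points share no observed feature.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)  # features present in both points
    if not shared.any():
        return float("nan")
    squared = np.sum((a[shared] - b[shared]) ** 2)
    return float(np.sqrt(squared * a.size / shared.sum()))

p = [1.0, np.nan, 3.0]
q = [2.0, 5.0, np.nan]

dist_pq = masked_euclidean(p, q)                      # only feature 0 is shared
dist_full = masked_euclidean([0.0, 0.0], [3.0, 4.0])  # no missing values

print(dist_pq, dist_full)
```

With no missing values the function reduces to the ordinary Euclidean distance, which makes it easy to sanity-check.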

**4. Using Nearest Neighbor Search for Missing Value Imputation**
* Another advanced approach is to use KNN search algorithms to identify nearest neighbors that have non-missing values for the specific features in question. Then, the missing value is imputed based on the values of those nearest neighbors.
  * Local Imputation: For missing values in certain features, we use KNN to look for neighbors that have similar values for the other features. Once we find those neighbors, we impute the missing values based on the corresponding values from the neighbors.

**5. Use Models That Handle Missing Values**

Some models, including certain tree-based models, can handle missing values natively. These models can often make decisions even when some values are missing, by splitting data in ways that don't require complete information for all features.
  * While KNN doesn't handle missing values directly, gradient-boosted tree libraries such as XGBoost learn a default branch for missing values at each split. Whether a Random Forest accepts missing values depends on the implementation and its version, so check the library documentation before relying on it.

## Q20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?
**Ans** - Principal Component Analysis and Linear Discriminant Analysis are both dimensionality reduction techniques, but they are designed for different purposes and work in fundamentally different ways. Here are the key differences between PCA and LDA:

**1. Purpose**
* PCA:
  * PCA is an unsupervised technique used to reduce the dimensionality of data by finding new directions that capture the maximum variance in the data.
  * The primary goal is to capture the maximum variance in the data without considering class labels.
* LDA:
  * LDA is a supervised technique that seeks to find the directions that maximize the separation between different classes in the data.
  * The primary goal is to find linear combinations of features that best separate the classes, i.e., to maximize the class separability.

**2. Supervision Type**
* PCA: Unsupervised. PCA does not take class labels into account. It only looks at the overall variance in the data.
* LDA: Supervised. LDA takes class labels into account during the process of dimensionality reduction and focuses on maximizing the between-class variance while minimizing the within-class variance.

**3. Objective**
* PCA: The goal is to find the directions that explain the maximum variance in the data, irrespective of the class structure. This can be useful when we just want to reduce the dimensionality of the data without regard to class labels.
* LDA: The goal is to find a lower-dimensional representation that maximizes the separation between classes. LDA aims to find a subspace where the between-class variance is maximized and the within-class variance is minimized.
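
The two objectives can be written side by side. The notation below is standard textbook notation and an assumption of this note (none of these symbols are defined elsewhere in the document): $\Sigma$ is the covariance matrix of the centered data, $\mu$ the overall mean, $\mu_c$ and $n_c$ the mean and size of class $c$, $S_B$ the between-class scatter, and $S_W$ the within-class scatter.

```latex
w_{\mathrm{PCA}} = \arg\max_{\lVert w \rVert = 1} \; w^{\top} \Sigma \, w
\qquad\qquad
w_{\mathrm{LDA}} = \arg\max_{w} \; \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}

S_B = \sum_{c} n_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top}
\qquad
S_W = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
```

The PCA objective contains no class information at all, while the LDA ratio grows when class means move apart (numerator) and when each class becomes tighter (denominator).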

**4. Output Components**
* PCA: Produces orthogonal components that capture the maximum variance in the data. These components are ordered based on how much variance they explain. The first principal component explains the most variance, the second the second most, and so on.
* LDA: Produces discriminant axes that best separate the classes. The number of discriminant components is at most min(C-1, d), where C is the number of classes and d is the number of features.

**5. Data Structure**
* PCA: Works on continuous data and does not require any information about the class labels. It is purely driven by the variance in the data.
* LDA: Works on labeled data. It requires the classes to be known and tries to find a transformation that maximizes class separability.

**6. Assumptions**
* PCA:
  * Assumes that the directions of maximum variance are the most important features in the data.
  * Does not assume anything about the class structure of the data.
* LDA:
  * Assumes that the data from each class follows a Gaussian distribution and has the same covariance matrix for each class.
  * Assumes that the features contribute linearly to the class separability.

**7. Use Cases**
* PCA:
  * Dimensionality reduction for unsupervised learning problems.
  * Used in cases where we want to reduce the number of features without considering the class labels, such as data compression or when we have a large number of features in exploratory analysis.
* LDA:
  * Used for supervised learning problems, especially for classification tasks.
  * Used when we want to improve class separability and reduce the dimensionality while preserving the class structure, making it useful for classification problems with labeled data.

**8. Performance**
* PCA: Can be useful for reducing the dimensionality and speeding up the computation in unsupervised tasks, but it does not guarantee better classification performance, as it ignores the class labels.
* LDA: Tends to outperform PCA in classification tasks because it focuses on separating the classes, thus increasing class separability. However, LDA may not work well when the class distributions are not Gaussian or if the assumption of equal class covariances is violated.

**9. Example:**
* PCA: In a dataset with many features, PCA can be used to reduce the dimensionality of the feature space by finding new, uncorrelated features that capture the most variance. It is commonly used in tasks like data compression, noise reduction, and feature extraction in unsupervised settings.
* LDA: In a dataset with labeled classes, LDA could be used to reduce the dimensionality of the feature space while ensuring that the data points from different classes are as far apart as possible. It is commonly used in tasks like classification, where the class labels are crucial.

**Differences:**

|Aspect|PCA|LDA|
|-|-|-|
|Type of Technique|Unsupervised|Supervised|
|Objective|Maximize variance (data representation)|Maximize class separability|
|Input Data|No class labels required|Class labels required|
|Assumptions|Variance-driven, no class info|Gaussian distribution, equal covariance|
|Output|Principal components (max variance)|Discriminant axes (max separation)|
|Use Case|Dimensionality reduction in unsupervised tasks|Classification, class separability|
|Dimensionality|Up to min(n_samples, n_features) components, independent of the number of classes|At most min(C-1, n_features) components (C = number of classes)|
|Computational Complexity|Cost grows with the number of features (eigendecomposition or SVD of the data)|Cost also grows with the number of features (scatter matrices plus an eigen or SVD solve)|
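
The supervision and dimensionality rows of the comparison can be checked with a short sketch on the Iris dataset (3 classes, 4 features). It uses scikit-learn's `PCA` and `LinearDiscriminantAnalysis`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is fitted on X alone; LDA also needs the labels y
X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print("PCA shape:", X_pca.shape, "| LDA shape:", X_lda.shape)

# Iris has 3 classes, so LDA cannot produce more than 3 - 1 = 2 components
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
    lda_limit_enforced = False
except ValueError as err:
    lda_limit_enforced = True
    print("LDA with 3 components failed:", err)

# PCA is limited only by min(n_samples, n_features), which is 4 here
X_pca_4 = PCA(n_components=4).fit_transform(X)
print("PCA with 4 components:", X_pca_4.shape)
```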

# Practical

## Q21. Train a KNN Classifier on the Iris dataset and print model accuracy
**Ans** - The code below trains a KNN classifier (K = 3) on the Iris dataset and prints its accuracy on a held-out 20% test split.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("KNN Classifier Accuracy on Iris dataset:", accuracy)

**Sample Output**

In [None]:
KNN Classifier Accuracy on Iris dataset: 1.0

## Q22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)
**Ans** -

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)

y_pred = knn_regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("KNN Regressor Mean Squared Error on synthetic dataset:", mse)

**Sample Output**

In [None]:
KNN Regressor Mean Squared Error on synthetic dataset: 276.54

## Q23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy
**Ans** - The code below trains a KNN classifier on the Iris dataset using two different distance metrics and compares their test accuracy:
* Euclidean distance
* Manhattan distance

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Minkowski distance with p=2 is the Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Minkowski distance with p=1 is the Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("KNN Classifier Accuracy with Euclidean distance:", accuracy_euclidean)
print("KNN Classifier Accuracy with Manhattan distance:", accuracy_manhattan)

**Sample Output**

In [None]:
KNN Classifier Accuracy with Euclidean distance: 1.0
KNN Classifier Accuracy with Manhattan distance: 1.0

## Q24. Train a KNN Classifier with different values of K and visualize decision boundaries
**Ans** - The value of K controls how smooth KNN's decision boundaries are. To see this:

* Use a simple 2D synthetic classification dataset.
* Train KNN classifiers with different K values.
* Visualize the decision boundaries for each.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_clusters_per_class=1, n_classes=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

plt.figure(figsize=(18, 5))

k_values = [1, 5, 15]

for i, k in enumerate(k_values, 1):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.subplot(1, 3, i)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.4)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.coolwarm, edgecolor='k')
    plt.title(f'K = {k}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

* K = 1 : Very flexible, highly sensitive to noise — overfitting.
* K = 5 : Smoother, more generalized decision boundary.
* K = 15: Even smoother — may underfit if k is too large.
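
The overfitting and underfitting pattern described above can also be checked numerically by comparing training and test accuracy for each K. This is a minimal sketch that regenerates the same synthetic dataset as the plot; the `scores` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for k in [1, 5, 15]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns mean accuracy on the given data
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"K = {k:2d} | train accuracy: {scores[k][0]:.3f} | test accuracy: {scores[k][1]:.3f}")
```

With K = 1 every training point is its own nearest neighbor, so training accuracy is perfect by construction; a large gap between training and test accuracy is the usual sign of overfitting.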

## Q25. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA
**Ans** - The code below trains a KNN classifier on the Iris dataset twice and compares the classification accuracy:
* Without PCA
* With PCA

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_no_pca = KNeighborsClassifier(n_neighbors=5)
knn_no_pca.fit(X_train, y_train)
y_pred_no_pca = knn_no_pca.predict(X_test)
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

print("KNN Classifier Accuracy without PCA:", accuracy_no_pca)
print("KNN Classifier Accuracy with PCA (2 components):", accuracy_pca)

**Sample Output**

In [None]:
KNN Classifier Accuracy without PCA: 1.0
KNN Classifier Accuracy with PCA (2 components): 0.9667

* Without PCA → Uses all 4 original features, gets full accuracy.
* With PCA → Small drop in accuracy, but dimensionality is reduced by 50%, often worth it when dealing with high-dimensional, noisy, or computationally expensive data.

## Q28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier()

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'p': [1, 2]
}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Accuracy:", best_score)
print("Test Set Accuracy with Best Model:", test_accuracy)

**Sample Output**

In [None]:
Best Hyperparameters: {'n_neighbors': 7, 'p': 2}
Best Cross-Validation Accuracy: 0.9667
Test Set Accuracy with Best Model: 1.0
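
Beyond the single best combination, it can help to look at how every candidate scored. The sketch below reruns the same grid search and reads `cv_results_` into a pandas DataFrame; the selected columns are standard `GridSearchCV` result keys, while the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# One row per parameter combination, best rank first
results = pd.DataFrame(grid_search.cv_results_)
columns = ['param_n_neighbors', 'param_p', 'mean_test_score', 'std_test_score', 'rank_test_score']
table = results[columns].sort_values('rank_test_score')

print(table.to_string(index=False))
```

If several combinations score within one standard deviation of the best, the choice between them is mostly a matter of preference (for example, a larger K for smoother boundaries).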

## Q29. Train a KNN Classifier and check the number of misclassified samples.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

misclassified_count = np.sum(y_test != y_pred)

print("Number of misclassified samples:", misclassified_count)

**Sample Output**

In [None]:
Number of misclassified samples: 0
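
A count alone does not show which classes get confused. A confusion matrix gives that breakdown; the sketch below uses the same split and K = 5, and lists any misclassified test samples by index.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Off-diagonal entries are the misclassified samples
misclassified_idx = np.flatnonzero(y_test != y_pred)
print("Misclassified test indices:", misclassified_idx.tolist())
for i in misclassified_idx:
    print(f"  index {i}: true={iris.target_names[y_test[i]]}, "
          f"predicted={iris.target_names[y_pred[i]]}")
```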

## Q30. Train a PCA model and visualize the cumulative explained variance.

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
X = iris.data

pca = PCA()
pca.fit(X)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(8,5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA - Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(1.5, 0.96, '95% threshold', color='r')
plt.show()

* Fits a PCA model to the Iris dataset.
* Calculates cumulative variance explained by adding up the variance ratio of each principal component.
* Plots it so we can see how much variance is captured as we add more components.
* Adds a 95% threshold line to see how many components are needed to capture most of the variance.
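
The number of components needed to reach the 95% threshold can also be read off programmatically instead of from the plot. The sketch below shows two equivalent routes: searching the cumulative curve, and passing a float to `PCA(n_components=...)`, which scikit-learn interprets as a target share of explained variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

cumulative_variance = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# First position where the cumulative curve reaches 0.95, as a 1-based count
n_needed = int(np.argmax(cumulative_variance >= 0.95)) + 1

# A float in (0, 1) asks PCA to keep enough components for that variance share
pca_95 = PCA(n_components=0.95).fit(X)

print("Components needed (from curve):", n_needed)
print("Components kept by PCA(n_components=0.95):", pca_95.n_components_)
```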

## Q31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy
**Ans** - The weights parameter in KNN controls how neighbors contribute to the final prediction:
* uniform → all neighbors contribute equally
* distance → closer neighbors have more influence

Let's compare them side by side on the Iris dataset.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train, y_train)
y_pred_uniform = knn_uniform.predict(X_test)
accuracy_uniform = accuracy_score(y_test, y_pred_uniform)

knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train, y_train)
y_pred_distance = knn_distance.predict(X_test)
accuracy_distance = accuracy_score(y_test, y_pred_distance)

print("KNN Classifier Accuracy with uniform weights:", accuracy_uniform)
print("KNN Classifier Accuracy with distance weights:", accuracy_distance)

**Sample Output**

In [None]:
KNN Classifier Accuracy with uniform weights: 1.0
KNN Classifier Accuracy with distance weights: 1.0

**Interpretation:**
* On clean, small datasets like Iris, both methods often perform equally well.
* On noisier or imbalanced datasets, distance weighting often improves performance by reducing the influence of distant, potentially misleading neighbors.
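
Because both settings score the same on Iris, a tiny hand-made example makes the difference visible. The three 1D training points below are invented for illustration: one class-0 point sits very close to the query, and two class-1 points sit farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1D training set
X_train = np.array([[0.0], [2.5], [3.0]])
y_train = np.array([0, 1, 1])
query = np.array([[0.5]])  # distances to the training points: 0.5, 2.0, 2.5

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_train, y_train)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_train, y_train)

# uniform: class 1 wins the plain vote 2 to 1
pred_uniform = int(uniform.predict(query)[0])
# distance: class 0 gets weight 1/0.5 = 2.0, class 1 gets 1/2.0 + 1/2.5 = 0.9
pred_distance = int(distance.predict(query)[0])

print("uniform weights  ->", pred_uniform)
print("distance weights ->", pred_distance)
```

The two settings disagree here because distance weighting lets the single very close neighbor outvote the two distant ones.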

## Q32. Train a KNN Regressor and analyze the effect of different K values on performance
**Ans** -
* Training a KNN Regressor on a synthetic regression dataset
* Testing several values of k
* Evaluating performance using Mean Squared Error
* Plotting MSE vs. k to visualize the effect

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

X, y = make_regression(n_samples=300, n_features=1, noise=20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

k_values = range(1, 21)
mse_values = []

for k in k_values:
    knn_regressor = KNeighborsRegressor(n_neighbors=k)
    knn_regressor.fit(X_train, y_train)
    y_pred = knn_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

plt.figure(figsize=(8,5))
plt.plot(k_values, mse_values, marker='o', linestyle='--', color='blue')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Effect of K on KNN Regressor Performance')
plt.grid(True)
plt.show()

* Small k values (like 1-3) → Very flexible (low bias, high variance) → can overfit
* Large k values (like 15-20) → Smoother, more generalized predictions → can underfit
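* There is typically a sweet spot between these extremes where MSE is lowest. On a dataset like this one it often falls around 5-10, but the best k depends on the data and its noise level and is best chosen by validation.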
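
Choosing k from the test-set curve risks tuning to the test set. A hedged alternative is to select k by cross-validation on the training data only and then evaluate once on the test set; the sketch below does this with `GridSearchCV` on the same synthetic data (variable names are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=1, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scikit-learn maximizes scores, so MSE is exposed as its negative
search = GridSearchCV(KNeighborsRegressor(),
                      {'n_neighbors': list(range(1, 21))},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

best_k = search.best_params_['n_neighbors']
cv_mse = -search.best_score_
# search.predict uses the best estimator refitted on the full training set
test_mse = mean_squared_error(y_test, search.predict(X_test))

print("Best k by 5-fold CV:", best_k)
print("Cross-validated MSE:", round(cv_mse, 2))
print("Test MSE:", round(test_mse, 2))
```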

## Q33. Implement KNN Imputation for handling missing values in a dataset
**Ans** - KNN Imputation is an effective method for filling missing values in a dataset by using the K-Nearest Neighbors algorithm to predict missing values based on the nearest available neighbors.

The implementation below uses the KNNImputer class from sklearn.impute.

In [None]:
from sklearn.impute import KNNImputer
import numpy as np
import pandas as pd

data = {
    'Feature1': [1, 2, np.nan, 4, 5],
    'Feature2': [5, np.nan, 7, 8, 10],
    'Feature3': [10, 20, 30, np.nan, 50]
}

df = pd.DataFrame(data)

print("Original dataset with missing values:")
print(df)

imputer = KNNImputer(n_neighbors=2)

df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataset after KNN Imputation:")
print(df_imputed)

* Original Dataset: We have missing values represented by np.nan.
* KNNImputer: We’re filling the missing values using the 2 nearest neighbors (we can adjust n_neighbors).
* Imputed Dataset: The missing values are replaced with predictions based on the closest neighbors.

**Sample Output:**

In [None]:
Original dataset with missing values:
   Feature1  Feature2  Feature3
0       1.0       5.0      10.0
1       2.0       NaN      20.0
2       NaN       7.0      30.0
3       4.0       8.0       NaN
4       5.0      10.0      50.0

Dataset after KNN Imputation:
   Feature1  Feature2  Feature3
0       1.0       5.0      10.0
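1       2.0       6.5      20.0
2       3.0       7.0      30.0
3       4.0       8.0      40.0
4       5.0      10.0      50.0

**Explanation:**
* Distances are computed with a NaN-aware Euclidean metric that uses only the features both rows have observed. For each missing cell, the candidate neighbors are the rows where that feature is present.
* Feature1: The missing value in row 2 is imputed as the average of its 2 nearest neighbors (rows 1 and 3): (2 + 4) / 2 = 3.0.
* Feature2: The missing value in row 1 is imputed from rows 0 and 3: (5 + 8) / 2 = 6.5. Row 2 is not selected because the only feature it shares with row 1 (Feature3) differs by 10, which makes it farther away.
* Feature3: The missing value in row 3 is imputed using rows 2 and 4: (30 + 50) / 2 = 40.0.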
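
To see which rows act as neighbors for each missing cell, the selection can be reproduced by hand. The sketch below uses scikit-learn's `nan_euclidean_distances` and compares a manual average of the two nearest donor rows against `KNNImputer`; the `donors` dictionary and loop are illustrative and only mirror the imputer's default uniform weighting.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 5.0, 10.0],
    [2.0, np.nan, 20.0],
    [np.nan, 7.0, 30.0],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 50.0],
])

dist = nan_euclidean_distances(X)  # pairwise distances that skip missing coordinates
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

donors = {}
for row, col in zip(*np.where(np.isnan(X))):
    row, col = int(row), int(col)
    # Only rows that actually observe this column can donate a value
    candidates = np.flatnonzero(~np.isnan(X[:, col]))
    nearest = candidates[np.argsort(dist[row, candidates])[:2]]
    donors[(row, col)] = sorted(int(i) for i in nearest)
    manual = X[nearest, col].mean()
    print(f"row {row}, column {col}: donor rows {donors[(row, col)]}, "
          f"manual mean {manual:.2f}, KNNImputer {imputed[row, col]:.2f}")
```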

## Q34. Train a PCA model and visualize the data projection onto the first two principal components
**Ans** - The code below applies PCA to the Iris dataset and plots the data projected onto the first two principal components.

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)

plt.title('PCA - Projection onto First Two Principal Components')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar(label='Species')

plt.show()

* PCA reduces the original 4-dimensional data into 2 principal components.
* The scatter plot shows how the Iris dataset points are spread out in the new 2D space, with points colored according to their species.

**Interpretation:**
* The data points are grouped in different clusters (representing different Iris species).
* The plot helps us understand how well the 2 principal components capture the underlying structure of the dataset.

## Q35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance
**Ans** - The code below trains a KNN classifier with both the KD Tree and Ball Tree algorithms and compares the results. The two structures differ in how they index the training points and search for nearest neighbors, which mainly affects query speed, especially with higher-dimensional data.

We'll use the Iris dataset for simplicity and compare:
* KD Tree: Efficient for lower-dimensional data.
* Ball Tree: More efficient for higher-dimensional data.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_kd_tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd_tree.fit(X_train, y_train)
y_pred_kd_tree = knn_kd_tree.predict(X_test)
accuracy_kd_tree = accuracy_score(y_test, y_pred_kd_tree)

knn_ball_tree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
knn_ball_tree.fit(X_train, y_train)
y_pred_ball_tree = knn_ball_tree.predict(X_test)
accuracy_ball_tree = accuracy_score(y_test, y_pred_ball_tree)

print(f"Accuracy with KD Tree: {accuracy_kd_tree:.4f}")
print(f"Accuracy with Ball Tree: {accuracy_ball_tree:.4f}")

* KNeighborsClassifier: We train KNN classifiers using both the KD Tree and Ball Tree algorithms.
* Comparison: After training, we compare the performance using accuracy score.

**Sample Output:**

In [None]:
Accuracy with KD Tree: 1.0000
Accuracy with Ball Tree: 1.0000

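**Interpretation:**
* Both KD Tree and Ball Tree perform an exact nearest-neighbor search, so they find the same neighbors and produce the same accuracy. The identical scores are expected and are not specific to Iris.
* The practical difference is speed: KD Tree tends to be more efficient in lower dimensions, while Ball Tree tends to cope better as dimensionality grows.
* To compare their performance meaningfully, measure fit and prediction time on a larger, higher-dimensional dataset rather than accuracy on a small one.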
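
A rough timing comparison can be sketched as follows. The synthetic dataset size (5000 samples, 20 features) is an arbitrary assumption, and wall-clock timings vary between machines and runs, so treat the numbers as indicative only.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed sizes, chosen only to make the timing difference measurable
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

predictions, timings = {}, {}
for algorithm in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    start = time.perf_counter()
    knn.fit(X_train, y_train)          # builds the tree
    predictions[algorithm] = knn.predict(X_test)  # queries the tree
    timings[algorithm] = time.perf_counter() - start
    print(f"{algorithm}: fit + predict took {timings[algorithm]:.3f} s")

# Both searches are exact, so the predictions should match
same_predictions = np.array_equal(predictions['kd_tree'], predictions['ball_tree'])
print("Identical predictions:", same_predictions)
```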

## Q36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot
**Ans** - The code below fits a PCA model on a high-dimensional dataset and draws the Scree plot, which shows the explained variance of each principal component. The Scree plot helps us decide how many components to keep based on the explained variance.

In [None]:
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np

X, y = make_classification(n_samples=300, n_features=100, random_state=42)

pca = PCA()
pca.fit(X)

explained_variance = pca.explained_variance_ratio_

plt.figure(figsize=(10,6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--', color='b')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)
plt.show()

* Synthetic Dataset: We generate a high-dimensional dataset with 100 features using make_classification.
* PCA: We fit PCA with all components kept and read the explained variance ratio of each principal component.
* Scree Plot: We plot the explained variance ratio for each component.

**Sample Output:**
* The plot displays a declining curve showing the explained variance ratio of each component.
* When a few components explain most of the variance, the remaining ones can be dropped with little loss of information. With the make_classification defaults most of the 100 features are independent noise, so expect the curve to decline gradually after the first few components rather than collapse.

**Interpretation:**
* The Scree plot is read by looking for the "elbow": the point after which each additional component adds little explained variance and the curve flattens out.
* Components before the elbow are usually kept, while the rest contribute less and less.

## Q37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score
**Ans** - The code below trains a K-Nearest Neighbors classifier on the Iris dataset and evaluates it using macro-averaged Precision, Recall, and F1-Score with scikit-learn:

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-Score: {f1:.2f}')

print("\nClassification Report:\n", classification_report(y_test, y_pred))

**Example Output:**

In [None]:
Precision: 1.00
Recall: 1.00
F1-Score: 1.00

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

**Explanation:**
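* Precision: Of the samples predicted as a given class, the fraction that truly belong to it.
* Recall: Of the samples that truly belong to a given class, the fraction the model finds.
* F1-Score: Harmonic mean of precision and recall.
* With average='macro', each metric is computed per class and then averaged with equal weight per class.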
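
Since every metric is a perfect 1.00 on Iris, a small hand-made binary example shows how the three numbers are derived from the counts of true positives (TP), false positives (FP), and false negatives (FN). The label vectors are invented for illustration, and the results are checked against scikit-learn.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly predicted positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # negatives predicted as positive
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # positives that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Precision: {precision:.2f}  (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"Recall:    {recall:.2f}  (sklearn: {recall_score(y_true, y_pred):.2f})")
print(f"F1-Score:  {f1:.2f}  (sklearn: {f1_score(y_true, y_pred):.2f})")
```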

## Q38. Train a PCA model and analyze the effect of different numbers of components on accuracy

**Ans** - PCA (Principal Component Analysis) is a dimensionality reduction technique, and it is instructive to see how the number of retained components affects classification accuracy.

The example below combines PCA with KNN on the Iris dataset and tracks how test accuracy changes with the number of components.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

accuracies = []

for n_components in range(1, X.shape[1] + 1):

    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train_pca, y_train)

    y_pred = knn.predict(X_test_pca)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

for i, acc in enumerate(accuracies, start=1):
    print(f'Components: {i}, Accuracy: {acc:.2f}')

# Create the figure once, outside the loop
plt.figure(figsize=(8,5))
plt.plot(range(1, X.shape[1] + 1), accuracies, marker='o')
plt.title('Effect of PCA Components on KNN Accuracy')
plt.xlabel('Number of PCA Components')
plt.ylabel('Accuracy')
plt.grid(True)
plt.xticks(range(1, X.shape[1] + 1))
plt.show()

**Example Output**

In [None]:
Components: 1, Accuracy: 0.57
Components: 2, Accuracy: 0.97
Components: 3, Accuracy: 1.00
Components: 4, Accuracy: 1.00

**Interpretation:**
* With 1 component → big drop in accuracy.
* With 2 components → already very good (~97%).
* Adding more components (3 or 4) brings us to perfect classification.
* This shows PCA can reduce dimensionality (from 4D to 2D) without much loss in accuracy — and often helps with faster, more efficient models.

## Q39. Train a KNN Classifier with different leaf_size values and compare accuracy.

**Ans** - leaf_size is a KNN hyperparameter that is passed to the BallTree or KDTree used internally by the algorithm. It affects how fast the tree is built and queried and how much memory it needs, but not which neighbors are found, so accuracy should not change. It is still worth running an experiment to confirm this.

Let's train a KNN classifier on the Iris dataset while varying leaf_size, and then compare accuracies.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

leaf_sizes = range(5, 60, 5)
accuracies = []

for leaf_size in leaf_sizes:

    knn = KNeighborsClassifier(n_neighbors=3, leaf_size=leaf_size)
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

for ls, acc in zip(leaf_sizes, accuracies):
    print(f'Leaf Size: {ls}, Accuracy: {acc:.2f}')

# Create the figure once, outside the loop
plt.figure(figsize=(8,5))
plt.plot(leaf_sizes, accuracies, marker='o')
plt.title('Effect of leaf_size on KNN Accuracy')
plt.xlabel('Leaf Size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

**Example Output:**

In [None]:
Leaf Size: 5, Accuracy: 1.00
Leaf Size: 10, Accuracy: 1.00
Leaf Size: 15, Accuracy: 1.00
Leaf Size: 20, Accuracy: 1.00
Leaf Size: 25, Accuracy: 1.00
Leaf Size: 30, Accuracy: 1.00
Leaf Size: 35, Accuracy: 1.00
Leaf Size: 40, Accuracy: 1.00
Leaf Size: 45, Accuracy: 1.00
Leaf Size: 50, Accuracy: 1.00
Leaf Size: 55, Accuracy: 1.00

**Interpretation:**
* Changing leaf_size doesn't affect accuracy at all here. This is expected: the tree search is exact, so leaf_size changes how the neighbors are found, not which neighbors are found.
* On larger or higher-dimensional datasets, leaf_size can influence:
  * Tree construction time
  * Query speed (faster or slower lookup)
  * Memory usage

leaf_size is more about computational efficiency than model accuracy.
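
To see the efficiency side directly, here is a minimal sketch that times prediction for a few leaf sizes and checks that the predictions match. It assumes a synthetic dataset from make_classification and forces algorithm='kd_tree' so that leaf_size is actually used; the sizes and leaf values are arbitrary choices for illustration, and timings will vary by machine.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration, not part of the experiment above)
X, y = make_classification(n_samples=3000, n_features=8, n_informative=5, random_state=0)
X_train, y_train = X[:2500], y[:2500]
X_test = X[2500:]

predictions = {}
for leaf_size in (5, 30, 200):
    # Force a tree-based search so that leaf_size is actually used
    knn = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree', leaf_size=leaf_size)
    knn.fit(X_train, y_train)

    start = time.perf_counter()
    predictions[leaf_size] = knn.predict(X_test)
    elapsed = time.perf_counter() - start
    print(f'leaf_size={leaf_size:>3}: predict time {elapsed:.4f}s')

# The neighbor search is exact, so predictions should match across leaf sizes
same = all(np.array_equal(predictions[5], p) for p in predictions.values())
print('Identical predictions:', same)
```

Only the timing column should change between runs; the predictions stay the same.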

## Q40. Train a PCA model and visualize how data points are transformed before and after PCA

**Ans** - Visualizing data before and after PCA makes it clear how PCA works: it projects high-dimensional data into a lower-dimensional space while preserving as much variance as possible.

Let’s do this step by step using the Iris dataset and project it to 2 principal components so we can easily plot it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    plt.scatter(X[y == target, 0], X[y == target, 1], label=target_names[target], c=color)
plt.title('Original Data (First 2 Features)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

plt.subplot(1, 2, 2)
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], label=target_names[target], c=color)
plt.title('Data After PCA (2 Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()

plt.tight_layout()
plt.show()

* Left plot (Before PCA): raw data plotted on the first two features only, which may not separate the classes well.
* Right plot (After PCA): data projected onto the first two principal components. PCA is unsupervised and never sees the labels, but on Iris the directions of maximum variance happen to separate the classes more clearly.

**Interpretation:**
* PCA rotates the original axes into a new space where:
  * PC1 (Principal Component 1) captures the most variance.
  * PC2 captures the second most, orthogonal to PC1.
* On Iris, we’ll typically notice clearer clustering and separation in the PCA-transformed space than in the first two raw features.
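
The properties listed above can be checked numerically. The following is a small sketch, assuming the same Iris data and a 2-component PCA; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Rows of components_ are the new axes; they should be orthonormal
gram = pca.components_ @ pca.components_.T
print('Axes orthonormal:', np.allclose(gram, np.eye(2)))

# Variance along each new axis, largest first
pc_variances = X_pca.var(axis=0, ddof=1)
print('Variance along PC1, PC2:', pc_variances)
print('Explained variance ratio:', pca.explained_variance_ratio_)

# The projected coordinates should be uncorrelated with each other
pc_corr = np.corrcoef(X_pca, rowvar=False)[0, 1]
print('Correlation between PC1 and PC2 is ~0:', abs(pc_corr) < 1e-8)
```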

## Q41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report

**Ans** - The Wine dataset is a classic real-world dataset that works well with classification algorithms like KNN.

Let’s walk through a clean, practical example where we:
* Train a KNN Classifier on the Wine dataset
* Predict on the test set
* Print a detailed classification report (Precision, Recall, F1-Score, Support)

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

**Example Output:**

In [None]:
Classification Report:

              precision    recall  f1-score   support

     class_0       0.94      1.00      0.97        12
     class_1       0.89      0.89      0.89         9
     class_2       1.00      0.91      0.95        11

    accuracy                           0.94        32
   macro avg       0.94      0.93      0.94        32
weighted avg       0.94      0.94      0.94        32

**Interpretation:**
* Precision: Of all samples predicted as a class, the fraction that truly belong to it.
* Recall: Of all samples that truly belong to a class, the fraction that are predicted as that class.
* F1-score: Harmonic mean of precision and recall.
* Support: Number of actual occurrences of each class in the test set.
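
To make these definitions concrete, here is a short sketch that derives each metric from a confusion matrix and cross-checks it against scikit-learn. The labels are tiny hand-made arrays used purely for illustration, not Wine predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Tiny hand-made labels, purely illustrative
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)   # correct / everything predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct / everything actually in that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                # actual samples per class

print('precision:', precision.round(2))
print('recall:   ', recall.round(2))
print('f1-score: ', f1.round(2))
print('support:  ', support)

# Cross-check against scikit-learn's own computation
ref_precision, ref_recall, ref_f1, ref_support = precision_recall_fscore_support(y_true, y_pred)
print('Matches scikit-learn:', np.allclose(precision, ref_precision) and np.allclose(recall, ref_recall))
```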

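Overall accuracy is 94% in this example output. Treat the figures as illustrative: they depend on the split and on preprocessing, and because this code does not scale the features, large-valued features such as proline dominate the distance, so an actual run may score noticeably lower.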
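
Since KNN relies on distances, it is worth checking how much feature scaling matters on this dataset. The sketch below compares the same classifier with and without a StandardScaler step; wrapping both in make_pipeline is a choice made here for illustration, and the exact accuracies depend on the scikit-learn version and split.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

# Same classifier, once on raw features and once on standardized features
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

acc_raw = accuracy_score(y_test, raw_knn.predict(X_test))
acc_scaled = accuracy_score(y_test, scaled_knn.predict(X_test))

print(f'Test samples:             {X_test.shape[0]}')
print(f'Accuracy without scaling: {acc_raw:.3f}')
print(f'Accuracy with scaling:    {acc_scaled:.3f}')
```

The pipeline fits the scaler on the training split only, so no test-set statistics leak into the model.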

## Q42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

**Ans** - Exploring how different distance metrics affect a KNN Regressor's performance is a super insightful exercise. KNN regression predicts a continuous value by averaging the values of its nearest neighbors, and the choice of distance metric can influence which neighbors it considers "nearest."

Let's set this up cleanly and analyze the prediction error with different distance metrics.

**How We'll Do It:**
* Use the California Housing dataset via fetch_california_housing (the classic Boston Housing dataset was removed from scikit-learn in version 1.2)
* Train KNeighborsRegressor with different distance metrics:
  * Euclidean (p=2)
  * Manhattan (p=1)
  * Chebyshev (infinite norm)
* Compare prediction errors using Mean Squared Error (MSE).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

distance_metrics = {
    'Euclidean (p=2)': 2,
    'Manhattan (p=1)': 1,
    'Chebyshev (p=inf)': np.inf
}

mse_scores = {}

for name, p in distance_metrics.items():
    knn = KNeighborsRegressor(n_neighbors=5, p=p)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores[name] = mse
    print(f'{name} MSE: {mse:.3f}')

plt.figure(figsize=(8,5))
plt.bar(mse_scores.keys(), mse_scores.values(), color=['blue', 'green', 'red'])
plt.title('Effect of Distance Metric on KNN Regression Error')
plt.ylabel('Mean Squared Error (MSE)')
plt.xticks(rotation=20)
plt.grid(axis='y')
plt.show()

**Example Output:**

In [None]:
Euclidean (p=2) MSE: 0.544
Manhattan (p=1) MSE: 0.530
Chebyshev (p=inf) MSE: 0.610

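**Interpretation:**
* Manhattan (p=1) has the lowest MSE in this example output.
* Chebyshev (p=inf) has the highest. This is plausible, since it only considers the largest difference along any single dimension and ignores the rest.
* Euclidean (p=2) is close to Manhattan but slightly higher.
* The features are not scaled here, so large-valued features dominate all three distances; the ranking may change after standardization.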
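
To make the three metrics concrete, here is a small sketch that computes them by hand for two made-up points. The vectors and the minkowski helper are hypothetical and only for illustration; for these points the distances come out to 5, about 3.606, and 3.

```python
import numpy as np

# Two made-up points, purely illustrative
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
diff = np.abs(a - b)                      # per-dimension differences: 3, 2, 0

manhattan = diff.sum()                    # p = 1: sum of differences
euclidean = np.sqrt((diff ** 2).sum())    # p = 2: straight-line distance
chebyshev = diff.max()                    # p = inf: largest single difference


def minkowski(u, v, p):
    """General Minkowski distance; p=1 and p=2 recover the cases above."""
    return (np.abs(u - v) ** p).sum() ** (1.0 / p)


print('Manhattan:', manhattan)
print('Euclidean:', round(float(euclidean), 3))
print('Chebyshev:', chebyshev)
print('Minkowski with large p approaches Chebyshev:', round(float(minkowski(a, b, 50)), 3))
```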

## Q43. Train a KNN Classifier and evaluate using ROC-AUC score

**Ans** - ROC-AUC is a powerful way to evaluate a classifier, especially when we care about the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR).

ROC-AUC is most straightforward for binary classification problems, so we'll need to either:

* Use a binary dataset, or
* Convert a multi-class dataset (like Iris or Wine) to binary, for example by treating one class as positive and all the others as negative.

Let’s go with the Wine dataset, and convert it to a binary classification problem (e.g., classifying class_0 vs others). Then we'll train a KNN Classifier and compute the ROC-AUC score.

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

wine = load_wine()
X = wine.data
y = wine.target

y_binary = (y == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_prob = knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)

roc_auc = auc(fpr, tpr)
print(f'ROC-AUC Score: {roc_auc:.3f}')

plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='blue', label=f'KNN (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title('ROC Curve for KNN Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

**Example Output:**

In [None]:
ROC-AUC Score: 0.989

**Interpretation:**
* ROC-AUC close to 1.0 means excellent separation between positive and negative classes.
* In this example output, an AUC of 0.989 indicates that KNN separates class_0 from the other classes very well.
* The ROC curve plots TPR vs FPR at various thresholds, giving a full picture of model performance.
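
Two details are worth checking in a sketch: roc_auc_score gives the same number as roc_curve followed by auc, and with k=5 and uniform weights KNN can only output probabilities in steps of 0.2, so the ROC curve has just a handful of thresholds. The code below assumes the same Wine setup (class_0 vs the rest); the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()
y_binary = (wine.target == 0).astype(int)   # class_0 vs the rest
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, y_binary, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_prob = knn.predict_proba(X_test)[:, 1]

# Route 1: build the curve, then integrate it
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the score directly
auc_direct = roc_auc_score(y_test, y_prob)

# With k=5 and uniform weights, each probability is (votes for class 1) / 5
print('Distinct predicted probabilities:', np.unique(y_prob))
print('AUC via roc_curve + auc:', round(float(auc_from_curve), 3))
print('AUC via roc_auc_score:  ', round(float(auc_direct), 3))
```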

## Q44. Train a PCA model and visualize the variance captured by each principal component.

**Ans** - A useful step in PCA is to visualize how much variance each principal component captures. This helps us decide how many components to keep while retaining most of the information.

Let’s go through it step by step using the Wine dataset again.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

plt.figure(figsize=(8,5))
plt.bar(range(1, len(explained_variance_ratio)+1), explained_variance_ratio, alpha=0.7, align='center',
        label='Individual explained variance')
plt.step(range(1, len(cumulative_variance_ratio)+1), cumulative_variance_ratio, where='mid',
         label='Cumulative explained variance', color='red')
plt.xlabel('Principal Component Index')
plt.ylabel('Explained Variance Ratio')
plt.title('Variance Captured by Each Principal Component')
plt.legend(loc='best')
plt.grid(True)
plt.show()

**Interpretation:**
* Typically, we'll see a steep curve initially, capturing most variance in the first few components.
* For example:
  * PC1 might explain 36%
  * PC2 might add 19%
  * And so on…
* We can decide to keep components covering, say, 95% of the variance.
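
Picking the number of components for a target such as 95% can be done in two ways, sketched below on the standardized Wine data: read it off the cumulative curve, or pass the fraction directly as n_components and let scikit-learn choose. The 0.95 target is just the example value mentioned above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Option 1: fit all components, then find where the cumulative ratio first reaches 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Option 2: pass the target fraction directly and let PCA pick the count
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print('Components needed (from cumulative curve):', n_manual)
print('Components chosen by PCA(n_components=0.95):', pca_95.n_components_)
print('Variance retained:', round(float(pca_95.explained_variance_ratio_.sum()), 3))
```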

## Q45. Train a KNN Classifier and perform feature selection before training

**Ans** - Performing feature selection before training a KNN classifier can improve performance, reduce overfitting, and sometimes even speed up training and prediction.

We'll use the Wine dataset and SelectKBest for feature selection.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

selector = SelectKBest(score_func=f_classif, k=5)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

selected_features = selector.get_support(indices=True)
print("Selected feature indices:", selected_features)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_selected, y_train)

y_pred = knn.predict(X_test_selected)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=wine.target_names))

**Example Output:**

In [None]:
Selected feature indices: [0 6 9 10 12]

Accuracy: 0.972
Classification Report:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        12
     class_1       0.90      1.00      0.95         9
     class_2       1.00      0.91      0.95        11

**Interpretation:**
* Feature selection removes less informative features, which helps KNN because irrelevant or noisy features distort its distance computations.
* The top 5 features are selected using ANOVA F-values, which measure how strongly each feature is related to the target.
* In this example output, accuracy and F1-scores remain high, and they can sometimes be even better with fewer, more relevant features.
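
A single train/test split can be noisy, so one possible refinement is to put scaling, selection, and KNN into a Pipeline and cross-validate it, which re-fits the scaler and selector inside each fold. This is a sketch under assumptions: the step names and cv=5 are choices made here, not part of the example above.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

# Step names ('scale', 'select', 'knn') are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

# Scaler and selector are re-fit on the training part of each fold
scores = cross_val_score(pipe, wine.data, wine.target, cv=5)
print('Fold accuracies:', scores.round(3))
print('Mean accuracy:  ', round(float(scores.mean()), 3))

# Fit once on all data to see which features are kept, by name
pipe.fit(wine.data, wine.target)
mask = pipe.named_steps['select'].get_support()
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]
print('Selected features:', selected_names)
```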

## Q46. Train a PCA model and visualize the data reconstruction error after reducing dimensions

**Ans** - Analyzing the data reconstruction error after dimensionality reduction is a great way to understand how much information is lost when compressing data with PCA.

Let's build this step-by-step using the Wine dataset — we'll:
1. Reduce the number of dimensions with PCA
2. Reconstruct the original data from the reduced representation
3. Compute and visualize the reconstruction error

**Required Libraries:**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

components_list = range(1, X_scaled.shape[1] + 1)

reconstruction_errors = []

for n_components in components_list:

    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)

    X_reconstructed = pca.inverse_transform(X_pca)

    error = np.mean((X_scaled - X_reconstructed) ** 2)
    reconstruction_errors.append(error)

plt.figure(figsize=(8, 5))
plt.plot(components_list, reconstruction_errors, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Reconstruction Error (MSE)')
plt.title('PCA Data Reconstruction Error vs Number of Components')
plt.grid(True)
plt.show()

**Interpretation:**
* We can decide how many components to keep by looking for the elbow point in the curve, the spot where adding more components no longer reduces the error significantly.
* This gives a trade-off between dimensionality reduction and information loss.
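
The link between reconstruction error and explained variance can be made explicit. For standardized data (each feature has unit variance), the mean squared reconstruction error with k components should equal one minus the cumulative explained variance ratio of those k components. The sketch below checks this on Wine; it fits PCA once and truncates the components by hand, which is a shortcut chosen here rather than the loop used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

pca = PCA().fit(X_scaled)
Z = pca.transform(X_scaled)                       # coordinates in the full PCA basis
cumulative = np.cumsum(pca.explained_variance_ratio_)

errors = []
for k in range(1, X_scaled.shape[1] + 1):
    # Reconstruct using the first k components only
    X_rec = Z[:, :k] @ pca.components_[:k] + pca.mean_
    errors.append(np.mean((X_scaled - X_rec) ** 2))
errors = np.array(errors)

# For unit-variance features: MSE(k) = 1 - cumulative explained variance ratio(k)
print('Max absolute difference:', np.abs(errors - (1 - cumulative)).max())
```

This is why the error curve mirrors the cumulative variance curve: both describe the same discarded variance.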

## Q47. Train a KNN Classifier and visualize the decision boundary.

**Ans** - Visualizing the decision boundary of a KNN classifier is an excellent way to understand how the classifier is making decisions across different regions of the feature space. To make this visualization easier, let's reduce the number of features to two using PCA for visualization purposes, and then plot the decision boundary.

We'll use the Wine dataset, but for simplicity, we'll select only the first two principal components to reduce the data to two dimensions.

**Required Libraries:**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
y = wine.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', marker='o', s=100, cmap=plt.cm.RdYlBu, label="Train")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor='k', marker='s', s=100, cmap=plt.cm.RdYlBu, label="Test")

plt.title("KNN Classifier Decision Boundary (k=5) with PCA Transformed Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(loc="best")
plt.grid(True)
plt.show()

**Interpretation:**
* The KNN model makes local decisions based on the majority vote of the nearest neighbors, which is clearly visualized in the decision boundaries.
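* The shape of the boundary depends on k: a small k follows individual training points and gives jagged boundaries, while a larger k averages over more neighbors and gives smoother ones.
* Because the boundary is defined entirely by distances, KNN is also sensitive to feature scaling, which is why the data is standardized before PCA here.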
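
One way to see the effect of k without plotting is to compare training and test accuracy across several values. The sketch below reuses the same 2-component PCA setup; the k values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, wine.target, test_size=0.2, random_state=42)

train_acc, test_acc = {}, {}
for k in (1, 5, 15, 45):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc[k] = knn.score(X_train, y_train)
    test_acc[k] = knn.score(X_test, y_test)
    print(f'k={k:>2}: train accuracy {train_acc[k]:.3f}, test accuracy {test_acc[k]:.3f}')
```

With k=1 each training point is its own nearest neighbor, so training accuracy is perfect and the boundary wraps around every point; larger k trades that flexibility for smoother regions.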

## Q48. Train a PCA model and analyze the effect of different numbers of components on data variance.

**Ans** - Analyzing the effect of different numbers of components on data variance in PCA helps us understand how much information is retained as we reduce dimensionality.

We'll use the Wine dataset again and explore the following steps:
1. Apply PCA for varying numbers of components.
2. Compute and visualize the explained variance ratio for each component.
3. Analyze how the total variance captured by the components increases as we add more components.

**Required Libraries:**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
y = wine.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
plt.figure(figsize=(10, 6))

plt.subplot(1, 2, 1)
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, align='center')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Each Principal Component')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o', color='r')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')

plt.tight_layout()
plt.show()

**Interpretation:**
* The first few components capture the largest share of the variance. On the standardized Wine data, the first two together account for a bit over half of it.
* The cumulative variance curve helps us decide how many components to keep to retain a certain level of information.
* This is useful for dimensionality reduction, where we aim to reduce dimensions without losing too much information.
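
The code above fits PCA once with all components rather than once per component count. A quick sketch of why that is enough: fitting with n_components=k should give the same leading explained variance ratios as the first k entries of the full fit. The k values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Ratios from a single fit with all components
full_ratio = PCA().fit(X_scaled).explained_variance_ratio_

matches = []
for k in (2, 5, 8):
    # Re-fit keeping only k components and compare with the leading k ratios
    ratio_k = PCA(n_components=k).fit(X_scaled).explained_variance_ratio_
    match = bool(np.allclose(ratio_k, full_ratio[:k]))
    matches.append(match)
    print(f'k={k}: total variance retained {ratio_k.sum():.3f}, matches full fit: {match}')
```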