Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Ans:

Min-Max scaling, or normalization, transforms feature values to a fixed range, usually [0, 1]. The formula is:

X_scaled = (X - X_min) / (X_max - X_min)

where:

X is the original feature value
X_min is the minimum value of the feature
X_max is the maximum value of the feature

Example

Suppose you have house sizes in square feet as follows:

House 1: 800 sq ft
House 2: 1200 sq ft
House 3: 2000 sq ft

Identify Minimum and Maximum Values:

Minimum value: 800 sq ft
Maximum value: 2000 sq ft

Apply Min-Max Scaling:

For House 1:
(800 - 800) / (2000 - 800) = 0
For House 2:
(1200 - 800) / (2000 - 800) approx 0.33
For House 3:
(2000 - 800) / (2000 - 800) = 1

Scaled Values:

House 1: 0
House 2: 0.33
House 3: 1
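
The calculation above can be reproduced with NumPy. This is a minimal sketch, not a library routine: the function name `min_max_scale` and the handling of a constant feature (mapped to zeros) are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] using its own minimum and maximum."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # Assumption: a constant feature is mapped to all zeros to avoid 0/0.
        return np.zeros_like(values)
    return (values - v_min) / (v_max - v_min)

house_sizes = [800, 1200, 2000]  # square feet, from the example above
scaled_sizes = min_max_scale(house_sizes)
print(np.round(scaled_sizes, 2))  # approximately 0, 0.33, 1
```

The smallest house maps to 0 and the largest to 1, and every other value falls in between in proportion to its distance from the minimum.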

Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

Ans:

Unit Vector Technique in Feature Scaling

The Unit Vector technique, also known as vector normalization or L2 normalization, rescales a feature vector so that its length (norm) is 1. Each value is divided by the Euclidean norm of the vector.

Formula: For a feature vector X, the unit vector normalization is:

X_normalized = X / ||X||

where ||X|| is the Euclidean norm (or L2 norm) of the vector X, calculated as:

||X|| = sqrt(sum(X_i^2))

Differences from Min-Max Scaling

Range of Values:

Unit Vector: Scales values so that the vector has length 1. Each scaled value lies in [-1, 1], but the smallest and largest values are not pinned to fixed endpoints as they are in Min-Max scaling.
Min-Max Scaling: Transforms feature values to a fixed range, typically [0, 1], based on the minimum and maximum values.


Normalization Type:

Unit Vector: Divides every value by the same constant (the norm), so ratios between values are preserved, but the result depends on the magnitude of the whole vector.
Min-Max Scaling: Shifts and rescales values using the feature's minimum and maximum, so the smallest value maps to 0 and the largest to 1 while relative spacing is preserved.

Example
Suppose you have house sizes in square feet:

House 1: 800 sq ft
House 2: 1200 sq ft
House 3: 2000 sq ft

Calculate the Euclidean Norm of the Vector:

||X|| = sqrt(800^2 + 1200^2 + 2000^2)
||X|| = sqrt(640000 + 1440000 + 4000000)
||X|| = sqrt(6080000) approx 2465.8

Normalize Each Feature Value:

For House 1:
X_normalized_1 = 800 / 2465.8 approx 0.324
For House 2:
X_normalized_2 = 1200 / 2465.8 approx 0.487
For House 3:
X_normalized_3 = 2000 / 2465.8 approx 0.811

Normalized Values:

House 1: 0.324
House 2: 0.487
House 3: 0.811
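
The same steps can be checked numerically. This is a small sketch using `np.linalg.norm`, which computes the Euclidean (L2) norm of a 1-D array by default; the variable names are chosen here for illustration.

```python
import numpy as np

house_sizes = np.array([800, 1200, 2000], dtype=float)  # from the example above

# Euclidean norm: sqrt(800^2 + 1200^2 + 2000^2)
l2_norm = np.linalg.norm(house_sizes)

# Divide every value by the norm to obtain a vector of length 1.
unit_vector = house_sizes / l2_norm

print(round(float(l2_norm), 2))
print(np.round(unit_vector, 3))
print(np.linalg.norm(unit_vector))  # length of the scaled vector, 1 up to rounding
```

Because every value is divided by the same number, ratios are unchanged: House 2 is still 1.5 times House 1 after scaling.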

Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Ans:

PCA is a technique that reduces the dimensionality of data by transforming it into a new set of variables (principal components) that capture the most variance.

How PCA Works:

- Standardize the Data: Center and scale the data.
- Compute Covariance Matrix: Analyze how features vary with each other.
- Calculate Eigenvalues and Eigenvectors: Identify the principal components.
- Select Principal Components: Choose the top components based on eigenvalues.
- Transform the Data: Project the original data onto these components.

Example

Data: Heights (170 cm, 180 cm, 160 cm) and Weights (60 kg, 80 kg, 55 kg).

1. Standardize the Data: Rescale heights and weights to zero mean and unit variance.
2. Compute Covariance Matrix: Determine relationships between height and weight.
3. Calculate Eigenvalues and Eigenvectors: Find principal components.
4. Select Principal Component: Choose the one with the highest variance.
5. Transform Data: Reduce data to this principal component.
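
The five steps can be sketched with NumPy on the three height and weight pairs from the example. This is an illustrative sketch, not a production implementation: it assumes the sample standard deviation (`ddof=1`) for standardization, and the sign of the resulting component is arbitrary, so the projected values may come out negated.

```python
import numpy as np

# Rows are people, columns are (height in cm, weight in kg), from the example.
X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 55.0]])

# 1. Standardize: zero mean, unit variance per column.
#    Assumption: sample standard deviation (ddof=1), matching np.cov below.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features (columns are variables).
cov = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Pick the eigenvector with the largest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
first_component = eigenvectors[:, order[0]]

# 5. Project the data onto that component (2 features -> 1 feature).
projected = Z @ first_component

explained_ratio = eigenvalues[order[0]] / eigenvalues.sum()
print(projected)
print(explained_ratio)
```

Height and weight are strongly correlated in this tiny sample, so a single component retains most of the variance.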

Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

Ans:

PCA is a feature extraction technique that transforms original features into a new set of features (principal components) that capture the most variance in the data. It simplifies data by reducing dimensionality while retaining essential information.

How Can PCA Be Used for Feature Extraction:

- Transform Original Features: Convert original features into principal components.
- Select Principal Components: Choose components that capture the most variance.
- Reduce Dimensionality: Use the selected components as new features.

Example

Original Data: Three features (A, B, C).

- Standardize Data: Rescale features A, B, and C to zero mean and unit variance.
- Compute Covariance Matrix: Analyze feature relationships.
- Calculate Eigenvalues and Eigenvectors: Determine principal components.
- Select Principal Components: Choose the top components, e.g., the first two.
- Transform Data: Project data onto these two components, reducing dimensionality.
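
A hedged sketch of this workflow as a reusable function follows. Note the swapped-in technique: instead of an explicit covariance eigen-decomposition it uses the SVD of the standardized data, which yields the same principal directions. The function name `pca_extract` and the synthetic features A, B, C are hypothetical and exist only for illustration.

```python
import numpy as np

def pca_extract(X, n_components):
    """Project standardized data onto its top principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (assumes no column is constant).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, already sorted by
    # decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    # Squared singular values are proportional to the covariance eigenvalues.
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return Z @ Vt[:n_components].T, explained_ratio

# Hypothetical data: 100 samples where C is nearly A + B, so three
# features carry roughly two dimensions of information.
rng = np.random.default_rng(0)
A = rng.normal(size=100)
B = rng.normal(size=100)
C = A + B + 0.05 * rng.normal(size=100)
X = np.column_stack([A, B, C])

features_2d, explained_ratio = pca_extract(X, n_components=2)
print(features_2d.shape)          # (100, 2)
print(explained_ratio.round(3))
```

The two extracted columns are the new features; the third component is dropped because it explains almost none of the variance in this synthetic data.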

Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Ans (Q5):

Identify Minimum and Maximum Values:

Determine the minimum and maximum values for each feature (price, rating, delivery time).

Apply Min-Max Scaling Formula:

Use the formula: X_scaled = (X - X_min) / (X_max - X_min)
Here, X is the original value, X_min is the minimum value, and X_max is the maximum value for that feature.

Transform Each Feature:

Price: Scale all price values to [0, 1].
Rating: Scale all rating values to [0, 1].
Delivery Time: Scale all delivery time values to [0, 1].

Use Scaled Data:

Replace original feature values with their scaled values so that each feature contributes on a comparable scale to the recommendation system.


Example:

- Price: $10 (min), $50 (max)
- Rating: 3 (min), 5 (max)
- Delivery Time: 15 minutes (min), 60 minutes (max)
- Scaling Calculation:
  - For a price of $30: (30 - 10) / (50 - 10) = 0.5
  - For a rating of 4: (4 - 3) / (5 - 3) = 0.5
  - For a delivery time of 30 minutes: (30 - 15) / (60 - 15) approx 0.33
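
The per-feature scaling can be sketched as below. The minimum and maximum values come from the example; the `bounds` dictionary, the `scale_feature` helper, and the `order` record are hypothetical names introduced for illustration. In a real pipeline the bounds would be computed from the training data rather than hard-coded.

```python
import numpy as np

# Minimum and maximum per feature, taken from the example above.
bounds = {
    "price": (10.0, 50.0),
    "rating": (3.0, 5.0),
    "delivery_time": (15.0, 60.0),
}

def scale_feature(value, feature):
    """Min-Max scale one value (or array) of the named feature to [0, 1]."""
    f_min, f_max = bounds[feature]
    return (np.asarray(value, dtype=float) - f_min) / (f_max - f_min)

# One record with the values used in the example.
order = {"price": 30, "rating": 4, "delivery_time": 30}
scaled_order = {name: float(scale_feature(v, name)) for name, v in order.items()}
print(scaled_order)
```

After scaling, all three features share the range [0, 1], so a distance-based or similarity-based recommender does not weight a feature more heavily simply because its raw numbers are larger.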


Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

Ans:
import numpy as np

data = np.array([1, 5, 10, 15, 20])

data_min = np.min(data)
data_max = np.max(data)

new_min = -1
new_max = 1

scaled_data = new_min + (data - data_min) * (new_max - new_min) / (data_max - data_min)

print(scaled_data)


[-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
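
As a cross-check, the same mapping can be obtained with `np.interp`, which linearly maps the interval given by its second argument onto the interval given by its third. The sketch also inverts the transform to confirm that the original values are recoverable; the names `scaled_check` and `restored` are chosen here for illustration.

```python
import numpy as np

data = np.array([1, 5, 10, 15, 20])

# Map data.min() -> -1 and data.max() -> 1, linearly in between.
scaled_check = np.interp(data, (data.min(), data.max()), (-1, 1))

# Inverse transform: recover the original values from the scaled ones.
restored = data.min() + (scaled_check + 1) * (data.max() - data.min()) / 2

print(scaled_check)
print(restored)
```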


Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

Ans:

To perform feature extraction using PCA on a dataset with features like [height, weight, age, gender, blood pressure], follow these steps:

Standardize the Data:

Standardize each feature to have zero mean and unit variance. This is especially important when features are on different scales, because PCA would otherwise be dominated by the features with the largest numeric range.

Compute the Covariance Matrix:

Calculate the covariance matrix to understand how features vary with each other.

Calculate Eigenvalues and Eigenvectors:

Find the eigenvalues and eigenvectors of the covariance matrix. Eigenvectors represent the directions of maximum variance, and eigenvalues represent the magnitude of variance in those directions.

Sort and Select Principal Components:

Sort the eigenvectors by their corresponding eigenvalues in descending order.
Choose the top principal components based on the amount of variance they explain.

Determine the Number of Principal Components to Retain:

Cumulative Explained Variance: Compute the cumulative explained variance ratio for the principal components.

Choose Components: Retain enough principal components to cover a significant percentage of the total variance, typically 90% to 95%.

Example

Assuming the explained variance ratios for the principal components are:

- Principal Component 1: 40% variance
- Principal Component 2: 25% variance
- Principal Component 3: 15% variance
- Principal Component 4: 10% variance
- Principal Component 5: 10% variance

Steps to Choose Principal Components:

Calculate Cumulative Explained Variance:

- PC1 + PC2 = 40% + 25% = 65%
- PC1 + PC2 + PC3 = 65% + 15% = 80%
- PC1 + PC2 + PC3 + PC4 = 80% + 10% = 90%
- PC1 + PC2 + PC3 + PC4 + PC5 = 100%

Select Components:

To retain at least 90% of the variance, I would choose the first 4 principal components.
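
The selection rule can be sketched in a few lines. The percentages are the assumed values from the example above, kept as integers so that the threshold comparison is exact; `target_pct` is an illustrative name for the variance threshold.

```python
import numpy as np

# Explained variance per component in percent, from the example above.
explained_pct = np.array([40, 25, 15, 10, 10])
target_pct = 90  # retain at least this much of the total variance

# Running total: 40, 65, 80, 90, 100
cumulative_pct = np.cumsum(explained_pct)

# Position of the first component at which the running total reaches the
# target; argmax returns the first True in the boolean array.
n_components = int(np.argmax(cumulative_pct >= target_pct)) + 1

print(cumulative_pct)
print(n_components)  # 4
```

With real data the ratios would come from the eigenvalues, each divided by their sum, and the same cumulative-sum rule applies.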