# Feature Engineering-3

## Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling, also known as normalization, is a data preprocessing technique used to scale numerical features within a specific range, typically between 0 and 1. This transformation is performed to ensure that all features have the same scale, making them more comparable and preventing certain features from dominating others when training machine learning models. Min-Max scaling is defined by the following formula for each feature:


$$X_{scaled} = \frac{X - X_{min}} {X_{max} - X_{min}} $$

Where:

- $X_{scaled}$ is the scaled value of the feature.
- $X$ is the original value of the feature.
- $X_{min}$ is the minimum value of the feature in the dataset.
- $X_{max}$ is the maximum value of the feature in the dataset.

After applying Min-Max scaling, the values of the feature fall within the range [0, 1]: the minimum value of the original feature maps to 0, and the maximum value maps to 1.

Here's an example to illustrate Min-Max scaling:

Suppose you have a dataset containing the ages of people and their corresponding incomes, and you want to scale both features using Min-Max scaling.

Original Data:

- Age (in years): [25, 40, 30, 35, 50]
- Income (in thousands): [30, 60, 40, 50, 80]

For Age:
- $X_{min}$ = 25 (minimum age)
- $X_{max}$ = 50 (maximum age)

For Income:
- $X_{min}$ = 30 (minimum income)
- $X_{max}$ = 80 (maximum income)

Now, let's scale the data:

Scaled Age:
- $X_{scaled} = \frac{X - 25}{50 - 25}$ for each value in the Age feature.

Scaled Income:
- $X_{scaled} = \frac{X - 30}{80 - 30}$ for each value in the Income feature.

The scaled values will be between 0 and 1 for both Age and Income. This scaling allows you to compare and use these features in machine learning algorithms without one feature dominating the other due to differences in their original scales.
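The arithmetic for this example can be checked with a short NumPy sketch. The helper name `min_max_scale` is introduced here for illustration only, and it assumes the feature is not constant (otherwise the denominator would be zero).

```python
import numpy as np

def min_max_scale(values):
    """Scale a 1-D array to [0, 1] using its own min and max (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    x_min, x_max = values.min(), values.max()
    return (values - x_min) / (x_max - x_min)

# Values from the example above
age = [25, 40, 30, 35, 50]
income = [30, 60, 40, 50, 80]

scaled_age = min_max_scale(age)
scaled_income = min_max_scale(income)

print(scaled_age)
print(scaled_income)
```

Both features scale to 0, 0.6, 0.2, 0.4, and 1, because each income value sits at the same relative position within its range as the matching age value.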

## Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

The Unit Vector technique, also known as Unit Vector scaling or Vector normalization, is a feature scaling method that scales numerical features to have a unit norm. It differs from Min-Max scaling in that it doesn't scale the data to a specific range like [0, 1] but instead scales the data such that the resulting vector has a Euclidean norm (L2 norm) of 1. The L2 norm of a vector is the square root of the sum of the squares of its components.

The formula for scaling a feature using Unit Vector scaling is as follows:

$$X_{scaled} = \frac{X}{||X||_{2}}$$

Where:

- $X_{scaled}$ is the scaled vector.
- $X$ is the original vector (in the example below, one data point made up of its feature values).
- $||X||_{2}$ is the L2 norm of $X$.

Unit Vector scaling is often used in machine learning when the direction of the data vectors is more important than their actual magnitude. It is useful in cases where the magnitude of the features is not crucial, such as in some clustering or dimensionality reduction algorithms.

Here's an example to illustrate Unit Vector scaling:

Suppose you have a dataset with two features, representing the length and width of different objects. You want to scale these features using Unit Vector scaling.

Original Data:

- Length (in centimeters): [5, 8, 3, 10, 6]
- Width (in centimeters): [2, 4, 1, 5, 3]

To scale the data using Unit Vector scaling:

1. Calculate the L2 norm of each data point (each Length/Width pair): $\|x\|_{2} = \sqrt{Length^{2} + Width^{2}}$

2. Scale each feature value by dividing it by the L2 norm of its data point:

Scaled Length: $\frac{Length}{\|x\|_{2}}$

Scaled Width: $\frac{Width}{\|x\|_{2}}$

For example, for the first data point:

- Length: 5
- Width: 2

Calculate the L2 norm: $\|x\|_{2} = \sqrt{5^{2} + 2^{2}} = \sqrt{29} \approx 5.385$

Scaled Length: $\frac{5}{\sqrt{29}} \approx 0.928$

Scaled Width: $\frac{2}{\sqrt{29}} \approx 0.371$

Each scaled data point has a Euclidean norm of 1, meaning it falls on the unit circle in two-dimensional space. This scaling technique emphasizes the direction of the data vectors: it preserves the ratio between the features within each data point and discards the overall magnitude. It does not enforce a specific range for the features, unlike Min-Max scaling.
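As a minimal sketch, the per-data-point scaling from this example can be written with NumPy. The array name `X` and its row/column layout are assumptions made for illustration.

```python
import numpy as np

# Rows are data points; columns are Length and Width (values from the example)
X = np.array([[5, 2], [8, 4], [3, 1], [10, 5], [6, 3]], dtype=float)

# L2 norm of each row, kept as a column so it broadcasts across the features
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

print(X_unit[0])                       # first data point: 5/sqrt(29), 2/sqrt(29)
print(np.linalg.norm(X_unit, axis=1))  # every row now has norm 1
```

scikit-learn's `Normalizer` applies the same row-wise L2 scaling, which may be more convenient inside a preprocessing pipeline.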


## Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and statistics. Its goal is to transform high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss. PCA achieves this by identifying the principal components, which are the directions in the data that have the maximum variance.

The steps involved in PCA are as follows:

1. **Standardize the Data**: Ensure that the data is centered (subtract the mean) and standardized (divide by the standard deviation) to give all features equal importance.

2. **Compute the Covariance Matrix**: Calculate the covariance matrix for the standardized data.

3. **Compute Eigenvectors and Eigenvalues**: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues indicate the amount of variance captured by each principal component.

4. **Sort and Select Principal Components**: Sort the eigenvectors by their corresponding eigenvalues in descending order. Choose the top $ k $ eigenvectors to form the new feature subspace (where $k$ is the desired dimensionality of the reduced data).

5. **Project the Data**: Multiply the original standardized data by the selected eigenvectors to obtain the new reduced-dimensional data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Generate sample data
np.random.seed(42)
data = np.random.rand(5, 3)  # 5 samples, 3 features

# Step 1: Standardize the data (np.std uses the population formula, ddof=0)
mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)
standardized_data = (data - mean) / std_dev

print("Standardized Data:\n", standardized_data)

# Step 2: Compute the covariance matrix (columns are features).
# np.cov uses ddof=1, so the diagonal is n / (n - 1) = 1.25 rather than 1.
cov_matrix = np.cov(standardized_data, rowvar=False)

print("cov matrix:\n", cov_matrix)

# Step 3: Compute eigenvectors and eigenvalues (eigenvectors are the columns)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("eigenvalues:\n", eigenvalues)
print('******')
print("eigenvectors:\n", eigenvectors)
print('******')

# Step 4: Sort by eigenvalue (descending) and select the top 2 components
sorted_indices = np.argsort(eigenvalues)[::-1]
print('sorted_indices:\n', sorted_indices)
top_k_indices = sorted_indices[:2]

# Step 5: Project the data onto the selected eigenvectors
projected_data = standardized_data @ eigenvectors[:, top_k_indices]

# Cross-check: scikit-learn's PCA performs steps 2-5 internally and
# should agree with the manual projection up to the sign of each component
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(standardized_data)
assert np.allclose(np.abs(projected_data), np.abs(reduced_data))

# Print the standardized and reduced data
print("Standardized Data:\n", standardized_data)
print("\nReduced Data:\n", reduced_data)
```

```
Standardized Data:
 [[-0.51154143  1.31491696  0.64425338]
 [ 0.30841528 -0.73584035 -1.17636352]
 [-1.66932533  1.09676143  0.23057173]
 [ 0.70871634 -1.08533586  1.39625714]
 [ 1.16373513 -0.59050218 -1.09471874]]
cov matrix:
 [[ 1.25       -1.04670348 -0.3404206 ]
 [-1.04670348  1.25        0.27416587]
 [-0.3404206   0.27416587  1.25      ]]
eigenvalues:
 [2.4537255  0.20100754 1.09526696]
******
eigenvectors:
 [[-0.67001308  0.71338339  0.20534511]
 [ 0.66000875  0.6990729  -0.27511004]
 [ 0.33981014  0.04879775  0.93922726]]
******
sorted_indices:
 [0 2 1]
Standardized Data:
 [[-0.51154143  1.31491696  0.64425338]
 [ 0.30841528 -0.73584035 -1.17636352]
 [-1.66932533  1.09676143  0.23057173]
 [ 0.70871634 -1.08533586  1.39625714]
 [ 1.16373513 -0.59050218 -1.09471874]]

Reduced Data:
 [[ 1.42951997  0.13831095]
 [-1.09204359 -0.83910405]
 [ 1.92069255 -0.42795863]
 [-0.71671805  1.755521  ]
 [-1.54145089 -0.62676928]]
```


## Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

Principal Component Analysis (PCA) and Feature Extraction are closely related concepts used in data analysis and machine learning.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. It's a technique for dimensionality reduction that identifies a set of orthogonal axes, called principal components, that capture the maximum variance in the data. The main goal of PCA is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without using any knowledge of the target variable.

Feature Extraction, on the other hand, is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing. It's the name for methods that select and/or combine variables into features, reducing the amount of data that must be processed while still describing the original dataset accurately.

So, PCA is a type of feature extraction technique. It reduces the dimensionality of a dataset by constructing a new, smaller set of variables, each a linear combination of the original variables, that retains most of the information (variance) in the sample and can be used for regression and classification.

For example, consider a dataset with a large number of features. Applying PCA produces new features, called principal components, ordered by decreasing importance: Principal Component 1 (PC1) captures the most variance in the original dataset, followed by PC2, then PC3, and so on. Keeping only the first few components reduces the dimensionality of the dataset while preserving the most important patterns or relationships between the variables.
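To make the idea concrete, here is a minimal sketch of PCA used as a feature extractor. The data is synthetic and generated only for illustration (four features, two of which are noisy copies of the other two), and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 4 features,
# where features 3 and 4 are noisy copies of features 1 and 2
base = rng.normal(size=(200, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 2))])

# Standardize, then eigendecompose the (symmetric) covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))

# eigh returns eigenvalues in ascending order; reorder to descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Extract 2 new features: each is a linear combination of all 4 originals
extracted = X_std @ eigenvectors[:, :2]

print("explained variance ratio:", eigenvalues / eigenvalues.sum())
print("covariance of extracted features:\n", np.cov(extracted, rowvar=False))
```

The extracted features are new variables rather than a subset of the original columns, and their covariance matrix is diagonal, which is what "uncorrelated" means here.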

## Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

Min-Max scaling is a common technique used in data preprocessing to scale numerical features within a specific range, typically between 0 and 1. This technique is particularly useful when dealing with features that have different scales or units, as it ensures that all features contribute equally to the analysis and model training process. Here's how you could use Min-Max scaling to preprocess the data for your food delivery service recommendation system:

1. Identify numerical features: First, identify the numerical features in your dataset that you want to scale. In your case, features such as price, rating, and delivery time are numerical and may need scaling.

2. Understand the range of values: Before applying Min-Max scaling, it's essential to understand the range of values for each feature. This step helps you determine the appropriate scaling range.

3. Compute the minimum and maximum values: For each numerical feature, compute the minimum (min) and maximum (max) values in the dataset. These values will be used in the scaling formula.

4. Apply Min-Max scaling formula: Once you have identified the min and max values for each feature, you can apply the Min-Max scaling formula to scale the values within the desired range (usually 0 to 1). The formula is as follows:

$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Where:

- $X_{scaled}$ is the scaled value of the feature.
- $X$ is the original value of the feature.
- $X_{min}$ is the minimum value of the feature in the dataset.
- $X_{max}$ is the maximum value of the feature in the dataset.

5. Scale the data: Apply the Min-Max scaling formula to each numerical feature in your dataset, ensuring that the values are scaled within the desired range (0 to 1).

6. Standardization (optional alternative): If a model expects features centered around zero with unit variance, use standardization (z-score scaling) instead of Min-Max scaling: subtract the mean of each feature and divide by its standard deviation. This is a different technique rather than a follow-up step, because standardizing Min-Max-scaled values moves them out of the [0, 1] range.

7. Implement the scaled data in your recommendation system: Once you have scaled the numerical features, you can incorporate them into your recommendation system for food delivery, where they can be used as input features for your models or algorithms.
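The steps above can be sketched with scikit-learn's `MinMaxScaler`. The restaurant records below are hypothetical values invented for illustration, and the column order (price, rating, delivery time) is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical records: price (USD), rating (1-5), delivery time (minutes)
X_train = np.array([
    [12.0, 4.5, 30.0],
    [25.0, 3.8, 45.0],
    [8.0, 4.9, 20.0],
    [40.0, 4.2, 60.0],
])

# Learn the per-feature min and max from the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the same min and max for records seen later
X_new = np.array([[20.0, 4.0, 35.0]])
X_new_scaled = scaler.transform(X_new)

print(scaler.data_min_, scaler.data_max_)
print(X_train_scaled)
print(X_new_scaled)
```

Fitting on the training data and reusing the fitted scaler keeps later records on the same scale. A new value outside the training range maps outside [0, 1] unless the scaler is created with `clip=True`.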

## Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Principal Component Analysis (PCA) is a technique used for dimensionality reduction in datasets with many features. In the context of predicting stock prices, where the dataset contains numerous features like company financial data and market trends, PCA can be employed to streamline the data while retaining as much relevant information as possible. Here's how PCA could be applied:

1. Data Preprocessing:

    - Standardize the data: PCA is sensitive to feature scale, so center each feature at zero and scale it to unit variance before applying PCA. Otherwise, features measured in large units dominate the principal components.
    - Handle missing values: Take care of any missing values in the dataset. Impute or remove them based on the nature of the data and the extent of missingness.

2. Feature Selection (Optional):

    - If there are features that are known to be irrelevant for predicting stock prices or have very low variance, it might be beneficial to exclude them from the analysis before applying PCA. However, PCA inherently captures the variance in the data, so this step can be optional.

3. Applying PCA:

    - Calculate the covariance matrix of the standardized feature matrix. This matrix represents the relationships between different features.
    - Compute the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of maximum variance, and eigenvalues represent the magnitude of variance along these directions.
    - Sort the eigenvectors based on their corresponding eigenvalues in descending order. This ranking reflects the amount of variance each principal component captures.
    - Select the top k eigenvectors (principal components) that explain the most variance in the data. This choice depends on the desired level of dimensionality reduction.
    - Project the original data onto the selected principal components to obtain the reduced-dimensional representation of the dataset.

4. Interpretation:

    - Analyze the variance explained by each principal component. This analysis helps in understanding how much information is retained after dimensionality reduction.
    - Examine the loadings of the original features on each principal component. Loadings indicate the contribution of each feature to the principal components and can provide insights into the underlying structure of the data.

5. Modeling:

    - Use the reduced-dimensional dataset obtained from PCA as input for building predictive models. By reducing the number of features, PCA can help alleviate the curse of dimensionality and improve model performance.
    - Experiment with different algorithms and hyperparameters to find the best model for predicting stock prices.

6. Evaluation:

    - Assess the performance of the predictive models using appropriate evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or Mean Absolute Error (MAE).
    - Compare the performance of models trained on the original dataset versus the reduced-dimensional dataset obtained through PCA.
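A minimal sketch of the preprocessing and PCA steps, assuming scikit-learn is available. The feature table is synthetic (random factors plus noise), not real market data, and the 95% variance threshold and the 240/60 chronological split are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature table (not real market data):
# 300 time-ordered rows, 20 features driven by 3 hidden factors plus noise
factors = rng.normal(size=(300, 3))
loadings = rng.normal(size=(3, 20))
X = factors @ loadings + 0.1 * rng.normal(size=(300, 20))

# Chronological split, so the transform is learned from the past only
X_train, X_test = X[:240], X[240:]

# Standardize, then keep enough components to explain 95% of the variance
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_train_reduced = reducer.fit_transform(X_train)
X_test_reduced = reducer.transform(X_test)

pca = reducer.named_steps["pca"]
print("components kept:", pca.n_components_)
print("cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))
```

Fitting the scaler and PCA on the earlier rows only, then applying them to the later rows, avoids leaking future information into the features used for prediction.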

## Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

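To map values to a general range $[a, b]$, the Min-Max formula becomes $X_{scaled} = a + \frac{(X - X_{min})(b - a)}{X_{max} - X_{min}}$. Here $a = -1$ and $b = 1$.

```python
import numpy as np

data = [1, 5, 10, 15, 20]
a, b = -1, 1  # target range

x_min = np.min(data)
x_max = np.max(data)

# Rescale each value to [a, b]; round for readable output
scaled_data = [round(float(a + (x - x_min) * (b - a) / (x_max - x_min)), 4) for x in data]
print(scaled_data)
```

```
[-1.0, -0.5789, -0.0526, 0.4737, 1.0]
```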


## Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

To determine how many principal components (PCs) to retain for PCA feature extraction, you typically consider the cumulative explained variance ratio. The explained variance ratio tells you the proportion of the dataset's variance that lies along each principal component. Retaining enough principal components to explain a significant portion of the variance ensures that you capture most of the information in the original dataset while reducing dimensionality.

Here's how you would approach this for your dataset with features [height, weight, age, gender, blood pressure]:

1. Standardize the features: Before applying PCA, it's important to standardize the features to have a mean of 0 and a standard deviation of 1. This ensures that each feature contributes equally to the analysis. Gender is categorical, so it must first be encoded numerically (for example, as 0/1).

2. Apply PCA: Compute the covariance matrix of the standardized data and then perform eigendecomposition to obtain the principal components and their corresponding eigenvalues.

3. Determine the number of principal components to retain: Plot the cumulative explained variance ratio against the number of principal components. This plot will help you decide how many components to retain while ensuring that you capture a significant portion of the variance in the data.

Suppose that, after performing PCA, the cumulative explained variance ratio shows that around 95% of the variance in the data is explained by the first 3 principal components. In that case, you might choose to retain these three principal components.

Therefore, in this case, I would choose to retain 3 principal components because they explain a significant portion of the variance while reducing the dimensionality of the dataset. This decision strikes a balance between retaining enough information and reducing the complexity of the data.
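The selection rule can be sketched as follows. The records are synthetic and invented purely for illustration (the coefficients linking the features are arbitrary assumptions), so the number of components this prints says nothing about a real dataset. The 95% threshold is likewise a common convention, not a fixed rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic, made-up records for illustration only (not real measurements)
gender = rng.integers(0, 2, size=n)                        # encoded as 0/1
height = 165 + 12 * gender + rng.normal(0, 6, size=n)      # cm
weight = 0.9 * (height - 100) + rng.normal(0, 8, size=n)   # kg
age = rng.uniform(20, 70, size=n)                          # years
blood_pressure = 100 + 0.4 * age + 0.2 * weight + rng.normal(0, 8, size=n)

X = np.column_stack([height, weight, age, gender, blood_pressure])

# Standardize, fit PCA with all components, and accumulate the variance ratios
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components to retain:", k)
```

Plotting `cumulative` against the component index gives the cumulative explained variance curve described in step 3.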