
Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
application.

Min-Max scaling is a feature scaling technique used in data preprocessing to rescale each feature of a dataset to a fixed range, usually 0 to 1. It is particularly useful when the features have different scales and ranges.

The formula for Min-Max scaling is given as:

X_scaled = (X - X_min) / (X_max - X_min)

where X is the original feature value, X_min is the minimum value of the feature in the dataset, and X_max is the maximum value of the feature in the dataset.



Min-Max scaling is applied to each feature of the dataset independently. Because it is a linear transformation, it preserves the shape of each feature's distribution and the relative spacing of its values while bringing all features onto the same scale. It is sensitive to outliers, since a single extreme value determines X_min or X_max. It is often used as a preprocessing step for machine learning algorithms that are sensitive to feature scale, such as k-nearest neighbors, support vector machines, and artificial neural networks.

In [1]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([10, 20, 30, 40, 50])

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Apply Min-Max scaling to the feature values
X_scaled = scaler.fit_transform(X.reshape(-1, 1))

print(X_scaled)


[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
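As a sanity check, the formula can also be applied by hand and compared with `MinMaxScaler`. The sketch below additionally reuses the fitted scaler on new values; `X_new` is made up for illustration and is not part of the original example. It shows that values outside the fitted range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([10, 20, 30, 40, 50], dtype=float).reshape(-1, 1)

# Apply X_scaled = (X - X_min) / (X_max - X_min) directly
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same scaling through scikit-learn
scaler = MinMaxScaler()
from_sklearn = scaler.fit_transform(X)
print(np.allclose(manual, from_sklearn))

# Hypothetical new values, scaled with the min and max learned from X
X_new = np.array([[5.0], [30.0], [60.0]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled.ravel())
```

In practice the scaler is usually fitted on the training data only and then reused on new data, which is why out-of-range results like these can occur.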


Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?
Provide an example to illustrate its application.

Ans:
The Unit Vector technique, also known as L2 normalization or vector normalization, rescales each data point (row vector) so that its length becomes 1. Each data point is divided by its magnitude (Euclidean norm). After scaling, data points are compared by direction rather than by magnitude, which is useful when the relative proportions of the components matter more than their absolute size, for example when working with cosine similarity.

The formula for calculating the unit vector for each data point is as follows:

Unit Vector = Data Point / ||Data Point||

Where ||Data Point|| represents the magnitude of the data point, calculated as the square root of the sum of squares of each component of the vector.

Min-Max scaling, by contrast, works on each feature (column) rather than on each data point (row). It subtracts the feature's minimum value and divides by its range (the difference between the maximum and minimum values), mapping the feature to a fixed range, typically 0 to 1. In short, Unit Vector scaling fixes the length of every row, while Min-Max scaling fixes the range of every column.

In [2]:
import numpy as np

# Given dataset
data_points = np.array([[5, 7], [3, 9], [8, 6]])

# Step 1: Unit Vector Scaling
magnitudes = np.linalg.norm(data_points, axis=1)  # Calculate magnitudes using Euclidean norm
unit_vector_scaled = data_points / magnitudes[:, np.newaxis]  # Divide each data point by its magnitude
print("Unit Vector Scaled:")
print(unit_vector_scaled)

# Step 2: Min-Max Scaling
min_values = np.min(data_points, axis=0)
max_values = np.max(data_points, axis=0)
min_max_scaled = (data_points - min_values) / (max_values - min_values)
print("Min-Max Scaled:")
print(min_max_scaled)

Unit Vector Scaled:
[[0.58123819 0.81373347]
 [0.31622777 0.9486833 ]
 [0.8        0.6       ]]
Min-Max Scaled:
[[0.4        0.33333333]
 [0.         1.        ]
 [1.         0.        ]]
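The same row-wise scaling is available in scikit-learn as `sklearn.preprocessing.normalize`. The sketch below reuses the three data points from the cell above and checks that every row ends up with length 1.

```python
import numpy as np
from sklearn.preprocessing import normalize

data_points = np.array([[5, 7], [3, 9], [8, 6]], dtype=float)

# norm="l2" with axis=1 divides each row (sample) by its Euclidean norm
unit_rows = normalize(data_points, norm="l2", axis=1)

# Every row should now have length 1
row_norms = np.linalg.norm(unit_rows, axis=1)
print(unit_rows)
print(row_norms)
```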


Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
example to illustrate its application.

Ans:
PCA, which stands for Principal Component Analysis, is a popular dimensionality reduction technique used in various fields, including machine learning, data analysis, and pattern recognition. It helps in reducing the number of dimensions in a high-dimensional dataset while preserving the most important patterns and variations in the data. This is achieved by transforming the original features into a new set of orthogonal (uncorrelated) features, known as principal components. The first principal component captures the most significant variance in the data, and each subsequent principal component explains as much of the remaining variance as possible.

The steps involved in PCA are as follows:

1. Standardize the data: Center the data by subtracting the mean from each feature, and then scale the data to have a variance of 1.

2. Calculate the covariance matrix: Calculate the covariance matrix of the standardized data.

3. Compute eigenvectors and eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the corresponding eigenvalues represent the amount of variance explained by each principal component.

4. Choose the number of principal components: Select the top k eigenvectors with the highest eigenvalues to represent the dataset in a lower-dimensional space.

5. Project the data onto the new feature space: Transform the original data into the new feature space spanned by the selected principal components.

PCA is widely used for dimensionality reduction because it simplifies complex datasets, reduces computation time, and can improve the performance of machine learning models by removing noise and redundancy.
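The five steps above can be sketched directly in NumPy. This is a minimal illustration rather than a replacement for `sklearn.decomposition.PCA`: the helper name `pca_by_hand` and the random data are made up for the example, and the function assumes no feature is constant (so no standard deviation is zero).

```python
import numpy as np

def pca_by_hand(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: standardize each feature (assumes no constant columns)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigh is intended for symmetric matrices such as a covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: order by descending eigenvalue and keep the top k eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:k]]
    # Step 5: project the standardized data onto the selected components
    return Z @ components, eigenvalues

# Hypothetical data: two strongly related features plus one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    2 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

projected, eigenvalues = pca_by_hand(X, k=2)
print(projected.shape)
print(eigenvalues / eigenvalues.sum())  # share of variance per component
```

The signs of the projected columns are arbitrary, so results may differ from another PCA implementation by a factor of -1 per component.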

In [3]:
import numpy as np
from sklearn.decomposition import PCA

# Given dataset
data_points = np.array([[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]])

# Step 1: Standardize the data
mean = np.mean(data_points, axis=0)
std = np.std(data_points, axis=0)
centered_data = data_points - mean
scaled_data = centered_data / std

# Step 2: Perform PCA
pca = PCA(n_components=1)  # We want to reduce to 1 dimension
reduced_data = pca.fit_transform(scaled_data)

print("Original Data:")
print(data_points)
print("Reduced Data:")
print(reduced_data)

Original Data:
[[1 3]
 [2 4]
 [3 5]
 [4 6]
 [5 7]]
Reduced Data:
[[ 2.]
 [ 1.]
 [-0.]
 [-1.]
 [-2.]]


Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
Extraction? Provide an example to illustrate this concept.


PCA (Principal Component Analysis) is closely related to feature extraction in the context of dimensionality reduction. In fact, PCA can be seen as a feature extraction technique that transforms the original features into a new set of uncorrelated features called principal components. These principal components are linear combinations of the original features and are ranked in order of importance based on the variance they explain in the data.

The main steps of using PCA for feature extraction are:

1. Standardize the data: Center the data by subtracting the mean from each feature and scale the data to have a variance of 1.

2. Calculate the covariance matrix: Compute the covariance matrix of the standardized data.

3. Compute eigenvectors and eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the corresponding eigenvalues indicate the amount of variance explained by each principal component.

4. Choose the number of principal components: Select the top k eigenvectors with the highest eigenvalues to retain the most important information while reducing dimensionality.

5. Project the data onto the new feature space: Transform the original data into the new feature space spanned by the selected principal components.

The relationship between PCA and feature extraction lies in the fact that PCA extracts new features (principal components) from the original features in a way that maximizes the variance in the data. By doing so, PCA condenses the information in the original features into a smaller set of principal components, allowing for more efficient representation and visualization of the data.
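To make the "linear combinations" point concrete, the sketch below fits scikit-learn's `PCA` and then rebuilds the extracted features from the weights stored in `pca.components_`. The data is randomly generated for illustration only and is not taken from the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features driven by one latent factor, plus one noise feature
rng = np.random.default_rng(42)
latent = rng.normal(size=(100, 1))
X = np.hstack([
    latent + 0.2 * rng.normal(size=(100, 1)),
    -latent + 0.2 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
extracted = pca.fit_transform(X_scaled)

# Each row of components_ holds the weights of one principal component
print(pca.components_)

# Rebuild the extracted features as explicit linear combinations of the inputs
rebuilt = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(extracted, rebuilt))
```

Each extracted feature is therefore a weighted sum of all original features, which is what distinguishes feature extraction from feature selection.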

In [4]:
import numpy as np
from sklearn.decomposition import PCA

data_points = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]])

# Step 1: Standardize the data
mean = np.mean(data_points, axis=0)
std = np.std(data_points, axis=0)
centered_data = data_points - mean
scaled_data = centered_data / std

# Step 2: Perform PCA
pca = PCA(n_components=2)  # We want to reduce to 2 dimensions
extracted_features = pca.fit_transform(scaled_data)

print("Original Data:")
print(data_points)
print("Extracted Features:")
print(extracted_features)

Original Data:
[[1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]
 [5 6 7]]
Extracted Features:
[[ 2.44948974e+00  3.43990023e-16]
 [ 1.22474487e+00 -1.14663341e-16]
 [-0.00000000e+00 -0.00000000e+00]
 [-1.22474487e+00  1.14663341e-16]
 [-2.44948974e+00  2.29326682e-16]]

The second extracted feature is numerically zero (on the order of 1e-16) because the three original features in this example are perfectly correlated, so the first principal component captures all of the variance.


Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
preprocess the data.


Ans:
To preprocess the data for a food delivery recommendation system, we can use Min-Max scaling to bring the features onto a common scale. Price, rating, and delivery time are measured in different units and ranges, so Min-Max scaling maps each of them to a fixed range, typically 0 to 1. This keeps a feature with a large numeric range, such as price, from dominating distance or similarity calculations simply because of its units. Here's how Min-Max scaling can be applied to preprocess the data:

1. Gather the dataset: Collect the dataset containing the relevant features for the food items, such as price, rating, and delivery time.

2. Calculate the minimum and maximum values for each feature: Find the minimum and maximum values for each feature in the dataset. These values will be used to perform the scaling.

3. Apply Min-Max scaling: For each feature, apply the Min-Max scaling formula to transform the data into a specific range (e.g., [0, 1]).

The Min-Max scaling formula is given by:

Scaled Value = (Value - Min) / (Max - Min)

where:

"Value" is the original value of the feature.

"Min" is the minimum value of the feature in the dataset.

"Max" is the maximum value of the feature in the dataset.

4. Use the scaled data for the recommendation system: Once the data has been Min-Max scaled, the scaled values for price, rating, and delivery time all lie between 0 and 1. Each feature is then on the same scale, and no particular feature dominates the recommendation process. The scaled data can be used to build the recommendation system, for example with collaborative filtering or content-based filtering algorithms, to provide personalized food recommendations to users.

In [5]:
import numpy as np

data = np.array([
    [10, 4.5, 30],
    [20, 4.0, 45],
    [15, 4.2, 25]
])

# Step 2: Calculate the minimum and maximum values for each feature
min_values = np.min(data, axis=0)
max_values = np.max(data, axis=0)

# Step 3: Apply Min-Max scaling
scaled_data = (data - min_values) / (max_values - min_values)

print("Original Data:")
print(data)
print("Min-Max Scaled Data:")
print(scaled_data)

Original Data:
[[10.   4.5 30. ]
 [20.   4.  45. ]
 [15.   4.2 25. ]]
Min-Max Scaled Data:
[[0.   1.   0.25]
 [1.   0.   1.  ]
 [0.5  0.4  0.  ]]
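In a running system new items keep arriving, so it can help to fit the scaler once and reuse it. The sketch below does this with `MinMaxScaler`; the column order (price, rating, delivery time) follows the question, while `new_item` and the minute unit are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns assumed to be: price, rating, delivery time (minutes)
known_items = np.array([
    [10, 4.5, 30],
    [20, 4.0, 45],
    [15, 4.2, 25],
])

# Learn each feature's min and max from the known items
scaler = MinMaxScaler()
known_scaled = scaler.fit_transform(known_items)

# Hypothetical new item, scaled with the already fitted min and max
new_item = np.array([[12, 4.4, 35]])
new_scaled = scaler.transform(new_item)
print(new_scaled)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(new_scaled)
print(restored)
```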


Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
features, such as company financial data and market trends. Explain how you would use PCA to reduce the
dimensionality of the dataset.

To reduce the dimensionality of the dataset for predicting stock prices, you can use PCA (Principal Component Analysis). PCA will help you identify the most important patterns and variations in the data while reducing the number of features (dimensions) to a smaller set of uncorrelated features called principal components. By doing so, PCA simplifies the dataset and makes it more manageable while retaining most of the relevant information for predicting stock prices.

Here's how you can use PCA to reduce the dimensionality of the dataset:

1. Standardize the data: Center the data by subtracting the mean from each feature and scale the data to have a variance of 1. Standardization is important for PCA as it ensures that all features are on the same scale and have equal importance in the analysis.

2. Calculate the covariance matrix: Compute the covariance matrix of the standardized data. The covariance matrix shows the relationships between different features and helps identify how they vary together.

3. Compute eigenvectors and eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the corresponding eigenvalues indicate the amount of variance explained by each principal component.

4. Choose the number of principal components: Select the top k eigenvectors with the highest eigenvalues. The number of principal components you choose will determine the amount of variance retained in the reduced dataset. You can use techniques like explained variance or cumulative explained variance to decide on the number of principal components.

5. Project the data onto the new feature space: Transform the original data into the new feature space spanned by the selected principal components. This will result in a lower-dimensional dataset with reduced features.

Using PCA for dimensionality reduction can be particularly useful when you have a large number of features, and you want to focus on the most informative ones. It can also help in mitigating the "curse of dimensionality," where high-dimensional datasets can lead to overfitting and increased computational complexity.

Once you have the reduced dataset after PCA, you can use it as input for building your model to predict stock prices.

Keep in mind that PCA might not always be the best choice, and its effectiveness depends on the specific characteristics of the dataset and the problem at hand. Sometimes, other feature selection or feature engineering techniques might be more appropriate for predicting stock prices. It is essential to experiment with different approaches and evaluate their impact on the model's performance to make an informed decision.

In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Given dataset
data = np.array([
    [100, 10, 2, 0.8, 50],
    [150, 12, 3, 1.2, 55],
    [80, 8, 1.5, 0.6, 45],
    [120, 11, 2.5, 1.0, 60],
    [90, 9, 2, 0.9, 48]
])

# Separate features (X) and target variable (y)
X = data[:, :-1]
y = data[:, -1]

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Calculate the covariance matrix
cov_matrix = np.cov(X_scaled.T)

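# Step 3: Compute eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 4: Choose the number of principal components
# np.linalg.eig does not sort its results, so order the eigenvectors by descending eigenvalue first
order = np.argsort(eigenvalues)[::-1]
n_components = 2
top_eigenvectors = eigenvectors[:, order[:n_components]]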

# Step 5: Project the data onto the new feature space
X_reduced = np.dot(X_scaled, top_eigenvectors)

# Combine the reduced features (X_reduced) and the target variable (y)
reduced_data = np.column_stack((X_reduced, y))

print("Original Data:")
print(data)
print("PCA Reduced Data:")
print(reduced_data)

Original Data:
[[100.   10.    2.    0.8  50. ]
 [150.   12.    3.    1.2  55. ]
 [ 80.    8.    1.5   0.6  45. ]
 [120.   11.    2.5   1.   60. ]
 [ 90.    9.    2.    0.9  48. ]]
PCA Reduced Data:
[[ 6.07290513e-01 -3.07910713e-01  5.00000000e+01]
 [-3.08827623e+00 -1.44731333e-02  5.50000000e+01]
 [ 2.70653243e+00 -1.79196259e-01  4.50000000e+01]
 [-1.13970119e+00 -8.21737448e-02  6.00000000e+01]
 [ 9.14154473e-01  5.83753849e-01  4.80000000e+01]]
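scikit-learn can also pick the number of components from a variance target. The sketch below chains `StandardScaler` and `PCA` in a pipeline and passes a fraction as `n_components`; the 0.95 target is an assumption chosen for illustration, and the feature matrix is the one from the cell above without the target column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature columns from the example above (target column left out)
X = np.array([
    [100, 10, 2, 0.8],
    [150, 12, 3, 1.2],
    [80, 8, 1.5, 0.6],
    [120, 11, 2.5, 1.0],
    [90, 9, 2, 0.9],
])

# A float n_components asks PCA to keep enough components to explain that share of variance
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Wrapping both steps in one pipeline means the same scaling and projection can later be applied to new rows with `reducer.transform`.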


Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
values to a range of -1 to 1.

In [1]:
import numpy as np

data = np.array([1, 5, 10, 15, 20])

new_min = -1
new_max = 1

min_value = np.min(data)
max_value = np.max(data)

# Apply Min-Max scaling
scaled_data = (data - min_value) / (max_value - min_value) * (new_max - new_min) + new_min

print("Original Data:")
print(data)
print("Min-Max Scaled Data:")
print(scaled_data)

Original Data:
[ 1  5 10 15 20]
Min-Max Scaled Data:
[-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
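The cell above uses the general form of the formula for a target range [a, b]: X_scaled = a + (X - X_min) * (b - a) / (X_max - X_min). As a cross-check, the sketch below gets the same result from scikit-learn's `MinMaxScaler` through its `feature_range` parameter.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([1, 5, 10, 15, 20], dtype=float).reshape(-1, 1)

# feature_range sets the target interval [a, b]; here a = -1 and b = 1
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data).ravel()
print(scaled)
```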


Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
Feature Extraction using PCA. How many principal components would you choose to retain, and why?

Ans:

To perform feature extraction using PCA on the given dataset containing features: [height, weight, age, gender, blood pressure], we need to reduce the dimensionality of the data while retaining as much variance as possible. The number of principal components to retain is a crucial decision, as it impacts the amount of information preserved in the reduced dataset.

To determine the number of principal components to retain, we can follow these steps:

1. Standardize the data: Center the data by subtracting the mean from each feature and scale the data to have a variance of 1. Standardization is necessary for PCA as it ensures that all features are on the same scale and have equal importance in the analysis.

2. Compute the covariance matrix: Calculate the covariance matrix of the standardized data. The covariance matrix shows the relationships between different features and helps identify how they vary together.

3. Compute eigenvectors and eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the principal components, and eigenvalues indicate the amount of variance explained by each principal component.

4. Choose the number of principal components: Decide on the number of principal components to retain based on the cumulative explained variance or a threshold percentage of variance explained. For example, you can choose to retain the top k principal components, where k is determined based on the amount of variance you want to retain (e.g., 95% or 99%).

5. Project the data onto the new feature space: Transform the original data into the new feature space spanned by the selected principal components.

To decide on the number of principal components to retain, we can plot the cumulative explained variance against the number of principal components and visually inspect the point where the curve starts to level off.
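The selection rule can be sketched with scikit-learn's `explained_variance_ratio_`. Everything in the data below is an assumption made for illustration: the values are synthetic, some correlation is built in on purpose, and gender is encoded as 0/1 so that it can be included as a numeric column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with deliberately correlated features
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + rng.normal(70, 6, n)
age = rng.normal(40, 12, n)
gender = rng.integers(0, 2, n)  # categorical feature encoded as 0/1
blood_pressure = 0.5 * (age - 40) + 0.3 * (weight - 70) + rng.normal(120, 8, n)

X = np.column_stack([height, weight, age, gender, blood_pressure])
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the full variance profile
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)
print("Cumulative explained variance:", cumulative)
print("Components to retain:", k)
```

The 95% threshold is a common convention rather than a fixed rule; a lower target gives a more compact representation at the cost of discarded variance.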

In [4]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic dataset with 100 samples (height, weight, age, gender, blood pressure)
np.random.seed(0)
height = np.random.normal(loc=170, scale=10, size=100)
weight = np.random.normal(loc=70, scale=10, size=100)
age = np.random.normal(loc=30, scale=5, size=100)
gender = np.random.choice(['Male', 'Female'], size=100)
blood_pressure = np.random.normal(loc=120, scale=10, size=100)

# Create the dataset by stacking the numeric features horizontally.
# gender is categorical, so it is left out here; it would need numeric encoding before PCA.
data = np.column_stack((height, weight, age, blood_pressure))

# Step 1: Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Step 2: Calculate the covariance matrix
cov_matrix = np.cov(data_scaled.T)

# Step 3: Compute eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 4: Choose the number of principal components
# For this example, let's say we want to retain 95% of the total variance
total_variance = np.sum(eigenvalues)
explained_variance_ratio = eigenvalues / total_variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# np.linalg.eig does not sort its eigenvalues, so apply the threshold to the ratios in descending order
sorted_cumulative = np.cumsum(np.sort(explained_variance_ratio)[::-1])
num_components_to_retain = np.argmax(sorted_cumulative >= 0.95) + 1

# Step 5: Project the data onto the new feature space
pca = PCA(n_components=num_components_to_retain)
data_reduced = pca.fit_transform(data_scaled)

print("Original Data:")
print(data)

Original Data:
[[187.64052346  88.83150697  28.15409081 113.15989102]
 [174.00157208  56.52240939  28.80310411 136.59550796]
 [179.78737984  57.29515002  35.49829798 130.68509399]
 [192.40893199  79.69396708  33.27631865 115.46614196]
 [188.6755799   58.26876595  33.20065763 113.12162389]
 [160.2272212   89.43621186  21.91521978 107.85922597]
 [179.50088418  65.86381019  29.87836938 115.59077368]
 [168.48642792  62.52545189  26.30984545 117.19644505]
 [168.96781148  89.22942026  31.399623   116.35306456]
 [174.10598502  84.80514791  29.50924805 121.56703855]
 [171.44043571  88.6755896   34.55089454 125.78521498]
 [184.54273507  79.06044658  31.58609108 123.49654457]
 [177.61037725  61.38774315  33.93163981 112.35856076]
 [171.21675016  89.10064953  27.66790452 105.62208526]
 [174.43863233  67.31996629  25.27776872 133.64531848]
 [173.33674327  78.02456396  27.94975153 113.10550815]
 [184.94079073  79.47251968  29.91489793 113.477064  ]
 [167.94841736  68.44989907  31.89575868 114.78810

In [5]:
print("PCA Reduced Data:")
print(data_reduced)

PCA Reduced Data:
[[ 1.95306903e+00 -1.33569470e-01  1.54830281e+00 -3.76928447e-01]
 [-1.94346653e+00 -1.24627521e+00  4.16267976e-01 -1.81185524e-01]
 [-8.42943781e-01 -2.17557161e+00 -3.14095293e-01  1.28429850e-01]
 [ 1.77777136e+00 -1.42611129e+00  9.60429800e-01  4.04710155e-02]
 [ 4.86907825e-01 -1.68035242e+00  3.07187989e-01  1.56662919e+00]
 [ 9.30840470e-01  2.61474069e+00  6.41075249e-01 -4.86503358e-01]
 [ 2.36505486e-01 -5.79327592e-01  3.97873757e-01  7.66880535e-01]
 [-7.06326067e-01  3.99683339e-01  1.96726935e-01  7.29660493e-01]
 [ 1.35117719e+00  5.15239146e-01 -7.91921409e-02 -1.13901631e+00]
 [ 7.93865133e-01  8.33799326e-02  5.51476883e-01 -1.04136207e+00]
 [ 1.02361022e+00 -4.47854122e-01 -2.36373901e-01 -1.79960188e+00]
 [ 8.31677260e-01 -1.07592485e+00  8.41934949e-01 -6.13853655e-01]
 [ 3.91285260e-01 -9.54425252e-01 -4.73466132e-01  1.14149481e+00]
 [ 1.84389891e+00  1.28880189e+00  4.34059395e-01 -2.18514494e-01]
 [-1.33125622e+00 -4.07789644e-01  1.1162580

In [6]:
print("Number of Principal Components Retained:", num_components_to_retain)

Number of Principal Components Retained: 4


In [7]:
print("Explained Variance Ratio:", explained_variance_ratio[:num_components_to_retain])

Explained Variance Ratio: [0.33499486 0.18169156 0.23433278 0.24898081]


In [8]:
print("Cumulative Explained Variance:", cumulative_explained_variance[:num_components_to_retain])

Cumulative Explained Variance: [0.33499486 0.51668642 0.75101919 1.        ]

Note that `np.linalg.eig` does not return eigenvalues in sorted order, so the ratios printed above are not in descending order; sorted, they are roughly 0.335, 0.249, 0.234, and 0.182. Because the four synthetic features were generated independently, each component carries a similar share of the variance and all four are needed to reach the 95% threshold, so PCA offers no real reduction on this data. On real data, where features such as height and weight are typically correlated, fewer components would usually suffice. The gender column is not part of this example because it is categorical and would first need to be encoded numerically.
