Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
application.

Min-Max scaling, also known as Min-Max normalization, is a data preprocessing technique that rescales the values of each feature in a dataset to a range of 0 to 1. This is done by subtracting the minimum value of each feature from all of its values and then dividing by the difference between the maximum and minimum values.

Min-Max scaling is often used to normalize features before machine learning algorithms are applied. This is because many machine learning algorithms assume that the features are on a comparable scale. Without normalization, features with a wide range of values can dominate the learning process, leading to inaccurate results.
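
To see why scale matters, the sketch below compares Euclidean distances between samples before and after Min-Max scaling. The three samples and their two columns are made-up values chosen only for illustration; they are not taken from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical samples: column 0 has a wide range, column 1 a narrow one.
X = np.array([
    [20000.0, 1.0],
    [21000.0, 5.0],
    [60000.0, 1.2],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Min-Max scale each column: (x - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

# Raw distances are driven almost entirely by column 0.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# After scaling, both columns contribute on the same 0-to-1 scale.
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

On the raw values, sample 1 looks far closer to sample 0 than sample 2 does, only because column 0 is numerically large. After scaling, the large difference in column 1 counts just as much.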

In [1]:
import seaborn as sns 
import pandas as pd 


In [2]:
df = pd.read_csv('tips.csv')


In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
# Min-Max scaling
from sklearn.preprocessing import MinMaxScaler


In [8]:
min_max = MinMaxScaler()


In [9]:
min_max.fit(df[['total_bill', 'tip']])
## fit() learns the minimum and maximum of each column

## Xscale = (Xi - Xmin) / (Xmax - Xmin)

In [10]:
min_max.transform(df[['total_bill', 'tip']])
## transform() applies the formula to every data point


Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?

Unit vector technique is a data preprocessing technique that scales each feature vector to have a unit length. This is done by dividing each component of the feature vector by its Euclidean norm. The Euclidean norm is the square root of the sum of the squares of the components of the vector.

Unit vector technique differs from Min-Max scaling in two ways. First, it operates on a different axis: the unit vector technique rescales each sample (row) so that its feature vector has length 1, while Min-Max scaling rescales each feature (column) so that its values span the range 0 to 1. Second, it preserves different information: unit vector scaling keeps only the direction of each sample and discards its magnitude, which suits methods based on cosine similarity or dot products, while Min-Max scaling keeps the relative spacing of values within each feature and is sensitive to outliers, because a single extreme value sets the minimum or maximum.
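
The difference can be seen side by side in a small sketch. The 3x2 matrix below is a made-up example in which rows are samples and columns are features.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features.
X = np.array([
    [3.0, 4.0],
    [6.0, 8.0],
    [1.0, 0.0],
])

# Unit-vector scaling works row by row: divide each sample by its L2 norm.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Min-Max scaling works column by column: (x - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

print(X_unit)
print(X_minmax)
```

The first two rows point in the same direction, so unit-vector scaling maps them to the same vector, whereas Min-Max scaling keeps them distinct.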


In [11]:
# Import the necessary libraries
import numpy as np

# Create a feature vector
x = np.array([1, 2, 3])

# Calculate the Euclidean norm of the feature vector
norm = np.linalg.norm(x)

# Scale the feature vector to have a unit length
scaled_x = x / norm

# Print the scaled feature vector
print(scaled_x)

[0.26726124 0.53452248 0.80178373]


Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
example to illustrate its application.

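PCA (Principal Component Analysis) is a linear technique that re-expresses data along orthogonal directions of maximum variance, called principal components. It is often used in dimensionality reduction, which is the process of reducing the number of variables in a dataset while preserving as much information as possible. This can be useful for a variety of reasons, such as making the data easier to visualize or reducing the computational cost of machine learning algorithms.

For example, suppose a dataset records students' scores on five exams. We can use PCA to reduce it from five dimensions (one for each exam) to two. To do this, we first center the data and compute the covariance matrix, a square matrix that measures how each pair of variables varies together. The eigenvectors of the covariance matrix are the principal components, and the corresponding eigenvalues are the variances of the data along them. Keeping the two eigenvectors with the largest eigenvalues and projecting the data onto them gives the two-dimensional representation.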
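
A minimal NumPy sketch of this idea follows. The exam scores are synthetic, generated from a random seed purely for illustration, and the check at the end shows that the variance of the projected data along each component equals the matching eigenvalue.

```python
import numpy as np

# Synthetic stand-in for exam scores: 200 students, 5 exams (made-up data).
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
scores = 70 + 10 * ability + rng.normal(scale=5, size=(200, 5))

# Center the data and compute the 5x5 covariance matrix.
centered = scores - scores.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project onto the first two principal components (5 dimensions -> 2).
projected = centered @ eigenvectors[:, :2]

# The sample variance along each component matches its eigenvalue.
print(eigenvalues[:2])
print(projected.var(axis=0, ddof=1))
```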


The Iris dataset is a classic dataset in machine learning that contains measurements of sepal and petal width and length for 150 flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

We can use PCA to reduce the dimensionality of the Iris dataset from four features (sepal width, sepal length, petal width, and petal length) to two features. This can be done by finding the two principal components that explain the most variance in the data.

With the unscaled measurements used below, the first principal component (PC1) explains about 92.46% of the variance in the Iris dataset, while the second principal component (PC2) explains about 5.31%. This means that PC1 and PC2 together explain about 97.77% of the variance in the data, which is a very high percentage.

In [4]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()

# Create a PCA object
pca = PCA()

# Fit the PCA object to the Iris dataset
pca.fit(iris.data)

# Get the principal components
principal_components = pca.components_

# Get the eigenvalues
eigenvalues = pca.explained_variance_

# Print the eigenvalues
print("Eigenvalues:")
for eigenvalue in eigenvalues:
    print(eigenvalue)

# Project the data onto the principal components and plot PC1 against PC2
projected = pca.transform(iris.data)
plt.scatter(projected[:, 0], projected[:, 1], c=iris.target)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()


Eigenvalues:
4.228241706034863
0.24267074792863336
0.07820950004291938
0.02383509297344944


Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
Extraction? Provide an example to illustrate this concept.


PCA (Principal Component Analysis) is a dimensionality reduction technique that is used to reduce the number of features in a dataset while retaining as much information as possible. It does this by finding the directions of maximum variance in the data and projecting the data onto a new subspace with fewer dimensions.

Feature extraction is the process of creating new features from the original ones, typically fewer in number, that capture the information most relevant to the task at hand. It differs from feature selection, which keeps a subset of the original features unchanged. Feature extraction can be done manually or automatically, and there are a number of different techniques that can be used.

PCA is a feature extraction technique: each principal component is a new feature formed as a linear combination of all the original features. To use it, perform PCA on the dataset, rank the components by the variance they explain, and keep the top few as the extracted features.
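
As an illustration, the sketch below extracts two features from four using the singular value decomposition, which is one standard way to compute PCA. The data is synthetic and the names `latent` and `mixing` are hypothetical helpers used only to generate correlated columns.

```python
import numpy as np

# Synthetic data: 100 samples, 4 correlated features (made-up values).
rng = np.random.default_rng(42)
latent = rng.normal(size=(100, 2))
mixing = np.array([[1.0, 0.5, 0.2, 0.0],
                   [0.0, 0.3, 0.8, 1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(100, 4))

# Center the data; PCA directions are the right singular vectors
# of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the top k directions. Each row of `components` holds the weights
# that combine all 4 original features into one new feature.
k = 2
components = Vt[:k]
X_extracted = X_centered @ components.T

# The extracted features are uncorrelated with each other.
corr = np.corrcoef(X_extracted, rowvar=False)

print(X_extracted.shape)
print(np.round(corr, 6))
```

Each extracted feature mixes all four original columns, which is what separates extraction from simply selecting two of the original columns.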

Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
preprocess the data.

Min-Max scaling is a technique used to scale a dataset to a specific range, usually between 0 and 1. It is used to normalize the data and make sure that all the variables are on the same scale. This is important for recommendation systems that compare items by distance or similarity, because it keeps a feature such as price from dominating rating or delivery time merely because its values are numerically larger.

To use Min-Max scaling to preprocess the data for a food delivery recommendation system, you would follow these steps:

1. Calculate the minimum and maximum values for each feature.
2. Subtract the minimum value from each value in the feature.
3. Divide the resulting value by the difference between the maximum and minimum values.
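
The three steps above can be sketched by hand with NumPy. The rows below are hypothetical price, rating, and delivery-time values, not records from the dataset loaded in the following cells.

```python
import numpy as np

# Hypothetical rows: [price, rating, delivery_time_minutes].
data = np.array([
    [12.0, 4.5, 30.0],
    [25.0, 3.8, 45.0],
    [ 8.0, 4.9, 20.0],
    [18.0, 4.2, 60.0],
])

# Step 1: minimum and maximum of each feature (column).
feature_min = data.min(axis=0)
feature_max = data.max(axis=0)

# Step 2: subtract the minimum from each value.
shifted = data - feature_min

# Step 3: divide by the range (max - min).
scaled = shifted / (feature_max - feature_min)

# A new record must reuse the stored min and max rather than its own.
new_order = np.array([10.0, 4.0, 25.0])
new_scaled = (new_order - feature_min) / (feature_max - feature_min)

print(scaled)
print(new_scaled)
```

Keeping `feature_min` and `feature_max` from the original data is what `fit` does in scikit-learn's `MinMaxScaler`; applying them to new rows corresponds to `transform`.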

In [1]:
import pandas as pd 


In [2]:
df = pd.read_csv("food.csv")

In [3]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


Time_taken(min) Delivery_person_Ratings

In [4]:
from sklearn.preprocessing import MinMaxScaler

In [5]:
# create a scaler object
min_max = MinMaxScaler()


In [14]:
min_max.fit(df[['Delivery_person_Ratings' ]])

In [15]:
min_max.transform(df[['Delivery_person_Ratings']])

array([[0.78],
       [0.7 ],
       [0.68],
       ...,
       [0.78],
       [0.74],
       [0.78]])

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
features, such as company financial data and market trends. Explain how you would use PCA to reduce the
dimensionality of the dataset.

Principal component analysis (PCA) is a dimensionality reduction technique that can be used to reduce the number of features in a dataset while preserving as much of the information as possible. This can be useful for stock price prediction, where many financial and market features are strongly correlated: it reduces the computational cost of the model and may improve how well it generalizes.

To use PCA to reduce the dimensionality of a dataset for stock price prediction, you would follow these steps:

1. Standardize the features in the dataset. This means that you would subtract the mean from each feature and divide it by its standard deviation.
2. Calculate the covariance matrix of the standardized features. This is a matrix that shows how the features are correlated with each other.
3. Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors are the directions in which the data is most spread out, and the eigenvalues are the variances of the data along those directions.
4. Select the top k eigenvectors, where k is the desired number of dimensions.
5. Project the data onto the subspace spanned by the selected eigenvectors. This will reduce the dimensionality of the data from d to k.
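
A minimal sketch of these steps in NumPy is shown below. The data is synthetic: six made-up columns driven by one shared random factor stand in for real financial features, and the choice of k = 2 is arbitrary.

```python
import numpy as np

# Synthetic stand-in for a stock dataset: 250 days, 6 features (made-up data).
rng = np.random.default_rng(1)
market = rng.normal(size=(250, 1))
X = market @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(250, 6))
X = X * np.array([1.0, 10.0, 100.0, 0.1, 5.0, 50.0])  # mixed units

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# Eigen-decomposition, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the top k eigenvectors.
k = 2
W = eigenvectors[:, :k]

# Project from d = 6 dimensions down to k = 2.
X_reduced = Z @ W

print(X_reduced.shape)
print(eigenvalues[:k].sum() / eigenvalues.sum())
```

The last printed value is the fraction of total variance kept by the k components, which is a common basis for deciding whether k is large enough.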

Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
values to a range of -1 to 1.

In [16]:
import pandas as pd 



df = pd.DataFrame([1,5,10 ,15, 20])



In [17]:
from sklearn.preprocessing import MinMaxScaler

In [18]:
min_max = MinMaxScaler(feature_range=(-1, 1))

In [19]:
min_max.fit_transform(df)

array([[-1.        ],
       [-0.57894737],
       [-0.05263158],
       [ 0.47368421],
       [ 1.        ]])
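
For a target range [a, b], the Min-Max formula generalizes to a + (x - min) * (b - a) / (max - min). The short sketch below applies it by hand to the values from the question with a = -1 and b = 1.

```python
import numpy as np

values = np.array([1.0, 5.0, 10.0, 15.0, 20.0])

# Target range [a, b].
a, b = -1.0, 1.0

# General Min-Max formula: a + (x - min) * (b - a) / (max - min).
scaled = a + (values - values.min()) * (b - a) / (values.max() - values.min())

print(scaled)
```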

Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
Feature Extraction using PCA. How many principal components would you choose to retain, and why?

To perform feature extraction using PCA on a dataset with the features [height, weight, age, gender, blood pressure], follow these steps:

1. Encode and standardize the data. Gender is categorical, so encode it numerically first (for example as 0/1). Then subtract the mean from each feature and divide by its standard deviation. This is important because PCA is driven by variance, so features measured in larger units would otherwise dominate.
2. Calculate the covariance matrix. This is a matrix that shows how each feature is correlated with the other features.
3. Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors are the directions in which the data varies the most, and the eigenvalues are the variances of the data along those directions.
4. Choose the number of principal components to retain. This is usually done by choosing the smallest number of principal components that account for a certain percentage of the variance in the data, such as 95%. With five features the answer is at most five, and the exact number depends on how strongly the features are correlated in the data at hand.

In [20]:
import pandas as pd

In [24]:
df = pd.read_csv("cardio_train.csv", sep=";")


In [25]:
df.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [None]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the data (assumes a purely numeric, comma-separated file with no header row)
data = np.loadtxt("data.csv", delimiter=",")

# Standardize the data
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)

# Calculate the covariance matrix
covariance_matrix = np.cov(data_standardized.T)

# Find the eigenvalues and eigenvectors of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Sort the eigenvalues in descending order
eigenvalues = np.sort(eigenvalues)[::-1]

# Choose the smallest number of components that explains at least 95% of the variance
cumulative_variance = np.cumsum(eigenvalues) / eigenvalues.sum()
num_principal_components = int(np.searchsorted(cumulative_variance, 0.95) + 1)

# Create a PCA object
pca = PCA(n_components=num_principal_components)

# Transform the data to the principal component space
data_pca = pca.fit_transform(data_standardized)

# Print the variance explained by each principal component
print(pca.explained_variance_ratio_)
