Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Ans:


Label Encoding:

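Converts each category into a unique integer, usually in an arbitrary order (for example, alphabetical).
The integers are only identifiers and carry no meaningful order, although a model may still treat them as ordered numbers.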
Example: Types of fruits ("Apple," "Banana," "Cherry") where the order doesn’t matter.

Ordinal Encoding:

Converts each category into an integer based on an inherent order.
Implies a ranking or order among the categories.
Example: Customer satisfaction levels ("Low," "Medium," "High") where there is a meaningful order.

When to Choose:

Label Encoding for categories without order.

Ordinal Encoding for categories with a meaningful ranking.
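
The distinction can be sketched with scikit-learn, the library used in the later answers. This is a minimal illustration under stated assumptions, not the only way to do it: the `satisfaction_order` list and the sample values are made up for this example. `OrdinalEncoder` is given the category order explicitly, while `LabelEncoder` simply sorts whatever labels it sees.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Nominal feature: the integers are just identifiers (assigned in sorted order).
fruits = ["Cherry", "Apple", "Banana"]
fruit_codes = LabelEncoder().fit_transform(fruits)

# Ordinal feature: the order is stated explicitly (assumed order for this example).
satisfaction_order = ["Low", "Medium", "High"]
levels = np.array([["High"], ["Low"], ["Medium"]])  # OrdinalEncoder expects 2D input
level_codes = OrdinalEncoder(categories=[satisfaction_order]).fit_transform(levels)

# For comparison: LabelEncoder on the same values sorts alphabetically,
# so "High" gets the lowest code and the ranking is lost.
naive_codes = LabelEncoder().fit_transform(levels.ravel())

print(fruit_codes)          # [2 0 1]
print(level_codes.ravel())  # [2. 0. 1.]
print(naive_codes)          # [0 1 2]
```

With the explicit order, "Low" < "Medium" < "High" maps to 0 < 1 < 2, which is the ranking the ordinal feature is meant to express.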

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Ans:

Target Guided Ordinal Encoding assigns numerical values to categories based on their average target value. Here’s how it works:

1. Calculate the mean of the target variable for each category.
2. Sort categories by these mean values.
3. Assign integers based on this order (higher mean gets a higher number).
Example:

Feature: Contract Type with categories ["Month-to-Month," "One Year," "Two Year"].
Target: Churn rate.
Encoding: Higher churn rates get higher integers. For instance:
"Month-to-Month" -> 2
"One Year" -> 1
"Two Year" -> 0
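Use Case: When a categorical feature has no natural order of its own but its categories differ in their average target value, so that ordering them by the target mean lets the integers reflect that relationship instead of an arbitrary order.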
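
The steps above can be sketched with pandas. The rows below are hypothetical toy data made up for illustration (the column names `contract` and `churn` are assumptions); only the three contract types come from the example.

```python
import pandas as pd

# Hypothetical toy data: churn is 1 if the customer left, 0 otherwise.
df = pd.DataFrame({
    "contract": ["Month-to-Month", "Month-to-Month", "Month-to-Month",
                 "One Year", "One Year", "Two Year", "Two Year"],
    "churn":    [1, 1, 0, 1, 0, 0, 0],
})

# Step 1: mean of the target for each category.
mean_churn = df.groupby("contract")["churn"].mean()

# Step 2: sort the categories by that mean (ascending).
ordered = mean_churn.sort_values().index

# Step 3: assign integers in that order (higher mean -> higher integer).
mapping = {category: rank for rank, category in enumerate(ordered)}
df["contract_encoded"] = df["contract"].map(mapping)

print(mapping)
```

In a real project the mapping would normally be computed on the training split only and then applied to validation and test data, so that the target values of held-out rows do not leak into the feature.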

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans:

Covariance is a measure of the degree to which two random variables change together.

Positive Covariance: Both variables increase or decrease together.
Negative Covariance: One variable increases while the other decreases.
Zero Covariance: No linear relationship.
Importance:
Understanding Relationships: Helps in determining how two variables are related.
Modeling: Used in techniques like Principal Component Analysis (PCA).

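It is calculated from the paired observations of the two variables and the mean of each variable taken separately: subtract each variable's mean from its observations, multiply the paired deviations, sum the products, and divide by n - 1 (sample covariance) or n (population covariance).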
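
As a worked check of the calculation, the sample covariance cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1) can be computed by hand and compared with NumPy. This is a small sketch; the variable names are chosen for this example, and the values are the Age and Income figures from Q5.

```python
import numpy as np

# Example values (the same Age and Income figures used in Q5).
x = np.array([25, 30, 35, 40, 45])
y = np.array([50000, 60000, 70000, 80000, 90000])

# Deviation of each observation from the mean of its own variable.
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of the paired products divided by n - 1.
cov_manual = (dx * dy).sum() / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values should agree, which confirms that `np.cov` uses the n - 1 denominator by default.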

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
from sklearn.preprocessing import LabelEncoder

color_encoder = LabelEncoder()
size_encoder = LabelEncoder()
material_encoder = LabelEncoder()

colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
materials = ['wood', 'metal', 'plastic']

color_encoded = color_encoder.fit_transform(colors)
size_encoded = size_encoder.fit_transform(sizes)
material_encoded = material_encoder.fit_transform(materials)

print("Original Colors:", colors)
print("Encoded Colors:", color_encoded)
print("Original Sizes:", sizes)
print("Encoded Sizes:", size_encoded)
print("Original Materials:", materials)
print("Encoded Materials:", material_encoded)


Original Colors: ['red', 'green', 'blue']
Encoded Colors: [2 1 0]
Original Sizes: ['small', 'medium', 'large']
Encoded Sizes: [2 1 0]
Original Materials: ['wood', 'metal', 'plastic']
Encoded Materials: [2 0 1]


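LabelEncoder assigns integer values based on the sorted (alphabetical) order of the categories. For example, "blue" is encoded as 0 because it comes first alphabetically, and "red" is encoded as 2 because it comes last. Note that Size is encoded as large = 0, medium = 1, small = 2, which reverses its natural order, so the codes should not be read as a ranking.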
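
To see exactly which integer each category received, the fitted encoder can be inspected. The snippet below is a short sketch using the `classes_` attribute and `inverse_transform` method of `LabelEncoder`; the variable names are chosen for this example.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['small', 'medium', 'large'])

# classes_ holds the sorted categories; a category's position is its integer code.
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)  # ['small', 'medium', 'large']
```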

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [2]:
import numpy as np

age = np.array([25, 30, 35, 40, 45])
income = np.array([50000, 60000, 70000, 80000, 90000])
education_level = np.array([1, 2, 3, 2, 1])

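# Stack the variables as rows: np.cov treats each row as one variable by default.
data = np.vstack([age, income, education_level])

# bias=False (the default) divides by n - 1, giving the sample covariance.
cov_matrix = np.cov(data, bias=False)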

print("Covariance Matrix:\n", cov_matrix)


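Covariance Matrix:
 [[ 6.25000000e+01  1.25000000e+05 -1.11022302e-16]
 [ 1.25000000e+05  2.50000000e+08 -2.22044605e-13]
 [-1.11022302e-16 -2.22044605e-13  7.00000000e-01]]

- The diagonal entries are the variances: Age 62.5, Income 2.5e8, Education level 0.7.
- Age and Income: the covariance of 125,000 is positive, so income rises with age in this sample.
- Education level with Age or Income: the covariance is effectively zero (values such as -1.1e-16 are floating-point rounding), so there is no linear relationship in this sample.
- The magnitudes depend on the units of each variable, so they cannot be compared directly across pairs.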
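
A labelled version of the same matrix can be easier to read. The sketch below uses pandas, which is an addition not used in the cell above; `DataFrame.cov()` gives the sample covariance and `DataFrame.corr()` rescales it to a unit-free correlation.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 60000, 70000, 80000, 90000],
    "Education": [1, 2, 3, 2, 1],
})

# DataFrame.cov() computes the sample covariance (divides by n - 1), like np.cov's default.
cov_df = df.cov()
print(cov_df)

# DataFrame.corr() rescales each entry to the range [-1, 1], which removes the units.
corr_df = df.corr()
print(corr_df.round(3))
```

In this sample Income is exactly 2000 times Age, so their correlation is 1 even though the raw covariance (125,000) and the Age variance (62.5) look very different in size.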


Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Ans:

For the given categorical variables, the encoding methods should be chosen based on the nature of each variable and their relationships. Here’s the best approach for each:

1. Gender (Male/Female)
Encoding Method: One-Hot Encoding
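Reason: "Gender" is a nominal variable with no inherent order. One-hot encoding will create separate binary columns for each category without implying any ranking or order. Because there are only two categories here, a single 0/1 column carries the same information, so one of the two columns can be dropped.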
2. Education Level (High School/Bachelor's/Master's/PhD)
Encoding Method: Ordinal Encoding
Reason: "Education Level" has a natural ordinal relationship (there is a clear order from High School to PhD). Ordinal encoding will preserve this ranking by assigning increasing integers based on the education level.
3. Employment Status (Unemployed/Part-Time/Full-Time)
Encoding Method: Ordinal Encoding
Reason: "Employment Status" can be considered ordinal if we assume there's an inherent order in terms of employment engagement or status. However, if there’s no inherent ranking, one-hot encoding could also be used.
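
These choices can be sketched as follows. The three rows are hypothetical values made up for illustration, and treating "Employment Status" as ordered is the assumption discussed above; `pd.get_dummies` and `OrdinalEncoder` are one possible combination of tools, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows, made up for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Ordinal columns: pass the category order explicitly, one list per column.
ordinal_cols = ["Education Level", "Employment Status"]
orders = [
    ["High School", "Bachelor's", "Master's", "PhD"],
    ["Unemployed", "Part-Time", "Full-Time"],  # assumes this order is meaningful
]
ordinal_values = OrdinalEncoder(categories=orders).fit_transform(df[ordinal_cols])

# Nominal column: one-hot encode into one indicator column per category.
encoded = pd.get_dummies(df[["Gender"]], dtype=int)

# Attach the ordinal codes next to the indicator columns.
encoded["Education Level"] = ordinal_values[:, 0]
encoded["Employment Status"] = ordinal_values[:, 1]

print(encoded)
```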


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [4]:
import numpy as np

temperature = np.array([25, 22, 27, 30, 24])
humidity = np.array([60, 65, 55, 50, 70])

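# Categorical variables, already encoded as integer codes (the codes are arbitrary labels).
weather_condition = np.array([0, 1, 0, 2, 1])
wind_direction = np.array([0, 1, 2, 3, 0])

# Stack the variables as rows so np.cov treats each row as one variable.
data = np.vstack([temperature, humidity, weather_condition, wind_direction])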

cov_matrix = np.cov(data, bias=False)
print("Covariance Matrix:\n", cov_matrix)

Covariance Matrix:
 [[  9.3  -21.25   0.9    3.1 ]
 [-21.25  62.5   -1.25  -8.75]
 [  0.9   -1.25   0.7    0.55]
 [  3.1   -8.75   0.55   1.7 ]]


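- Temperature and Humidity (-21.25): negative; humidity tends to be lower when temperature is higher.
- Temperature and Weather Condition (0.9): positive.
- Temperature and Wind Direction (3.1): positive.
- Humidity and Weather Condition (-1.25): negative.
- Humidity and Wind Direction (-8.75): negative.
- Weather Condition and Wind Direction (0.55): positive.

Only the sign is directly interpretable: the magnitude of a covariance depends on the units and spread of each variable, so it does not by itself show whether a relationship is slight or strong. Weather Condition and Wind Direction are nominal variables encoded with arbitrary integers, so the covariances that involve them change if the codes are reassigned and should not be read as real relationships.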
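
One caution about the covariances that involve the encoded categorical variables can be checked directly. The sketch below reuses the temperature and wind direction values from the cell above and reassigns the four direction codes in reverse order; `wind_recoded` is a name introduced for this example.

```python
import numpy as np

temperature = np.array([25, 22, 27, 30, 24])
wind_direction = np.array([0, 1, 2, 3, 0])  # integer codes from the cell above

# Same observations, but with the four codes assigned in reverse (0<->3, 1<->2).
wind_recoded = 3 - wind_direction

cov_original = np.cov(temperature, wind_direction)[0, 1]
cov_recoded = np.cov(temperature, wind_recoded)[0, 1]

print(cov_original, cov_recoded)
```

The data are unchanged, yet the covariance flips sign (about 3.1 versus about -3.1), which shows that a covariance computed on arbitrary integer codes reflects the coding rather than the variable.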