In [1]:
# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.
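# Ordinal Encoding is used for categorical variables that have a natural order (e.g., Low, Medium, High). It assigns an integer to each category according to its rank, so the numeric order matches the category order.
# Label Encoding assigns each category a unique integer without regard to any order. scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order and is intended for encoding target labels rather than input features.
# Example: For a variable "Size" (Small, Medium, Large), Ordinal Encoding would assign 0, 1, 2 so that Small < Medium < Large is preserved. For a variable "Color" (Red, Green, Blue), Label Encoding assigns each color an arbitrary integer (Blue=0, Green=1, Red=2 with scikit-learn). The numbers carry no meaning, although a model that treats them as magnitudes may still read an order into them.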
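# A minimal sketch of the difference, assuming scikit-learn is installed. The sample values are made up for illustration.
# OrdinalEncoder is given the category order explicitly, while LabelEncoder falls back to alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative sample (not from a real dataset)
sizes = [["Small"], ["Large"], ["Medium"], ["Small"]]

# OrdinalEncoder with an explicit order: Small=0 < Medium=1 < Large=2
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(sizes).ravel()

# LabelEncoder sorts the classes alphabetically: Large=0, Medium=1, Small=2
label = LabelEncoder()
label_codes = label.fit_transform([row[0] for row in sizes])

print(f"Ordinal codes: {size_codes}")
print(f"Label codes:   {label_codes}")
```

# Only the ordinal codes respect Small < Medium < Large; the label codes would rank Large below Small.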

In [2]:
# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.
# Target Guided Ordinal Encoding uses the relationship between the categorical variable and the target variable to assign ordinal values. The mean of the target is computed for each category, the categories are sorted by that mean, and each category is replaced by its rank in the sorted order.
# Example: In a churn prediction project, the categories of a "Contract Type" feature, such as "Month-to-Month" and "Two-Year", can be ranked by the average churn rate observed for each contract type.
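# The steps above can be sketched with pandas. The DataFrame, its column names, and its values are hypothetical and chosen only to illustrate the ranking.
# In a real project the mapping should be learned from the training split only, so that target information does not leak into validation or test data.

```python
import pandas as pd

# Hypothetical churn data (1 = churned, 0 = stayed)
df = pd.DataFrame({
    "contract": ["Month-to-Month", "Two-Year", "One-Year", "Month-to-Month",
                 "One-Year", "Two-Year", "Month-to-Month", "One-Year"],
    "churn": [1, 0, 0, 1, 1, 0, 0, 0],
})

# Mean churn rate per category, sorted from lowest to highest
mean_churn = df.groupby("contract")["churn"].mean().sort_values()

# Rank the categories: lowest mean target gets 0, next gets 1, and so on
mapping = {category: rank for rank, category in enumerate(mean_churn.index)}
df["contract_encoded"] = df["contract"].map(mapping)

print(mapping)
```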

In [3]:
# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
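# Covariance measures how two numeric variables vary together. Its sign indicates the direction of their linear relationship; its magnitude depends on the units of the variables, so it does not by itself indicate strength.
# It is important because it shows whether two variables tend to move together (e.g., if one increases, does the other tend to increase or decrease?), and it underlies correlation and the covariance matrix.
# Formula (population): Cov(X, Y) = Σ[(Xi - mean(X)) * (Yi - mean(Y))] / n
# For a sample, divide by n - 1 instead of n; this is the default used by np.cov.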
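# A short worked example of the formula, compared against np.cov. The arrays x and y are made-up values for illustration.

```python
import numpy as np

# Illustrative values
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Products of the deviations from each mean
dev_products = (x - x.mean()) * (y - y.mean())

cov_population = dev_products.sum() / len(x)    # divide by n
cov_sample = dev_products.sum() / (len(x) - 1)  # divide by n - 1

print(f"Population covariance: {cov_population}")
print(f"Sample covariance:     {cov_sample}")
print(f"np.cov (default):      {np.cov(x, y)[0, 1]}")
print(f"np.cov (ddof=0):       {np.cov(x, y, ddof=0)[0, 1]}")
```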

In [4]:
# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.
from sklearn.preprocessing import LabelEncoder

# Sample data
color = ['red', 'green', 'blue']
size = ['small', 'medium', 'large']
material = ['wood', 'metal', 'plastic']

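# Use one encoder per variable so each fitted mapping (classes_) stays available
color_encoder = LabelEncoder()
size_encoder = LabelEncoder()
material_encoder = LabelEncoder()

# Apply Label Encoding
color_encoded = color_encoder.fit_transform(color)
size_encoded = size_encoder.fit_transform(size)
material_encoded = material_encoder.fit_transform(material)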

# Output
print(f"Color encoding: {color_encoded}")
print(f"Size encoding: {size_encoded}")
print(f"Material encoding: {material_encoded}")

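# Output explanation:
# LabelEncoder sorts the categories of each variable alphabetically and numbers them from 0.
# Color: blue=0, green=1, red=2, so ['red', 'green', 'blue'] becomes [2 1 0].
# Size: large=0, medium=1, small=2, so ['small', 'medium', 'large'] becomes [2 1 0]. This does not match the natural order small < medium < large.
# Material: metal=0, plastic=1, wood=2, so ['wood', 'metal', 'plastic'] becomes [2 0 1].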


Color encoding: [2 1 0]
Size encoding: [2 1 0]
Material encoding: [2 0 1]


In [5]:
# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.
import numpy as np

# Sample dataset
age = np.array([25, 30, 35, 40, 45])
income = np.array([30000, 35000, 40000, 45000, 50000])
education_level = np.array([1, 2, 3, 4, 5])  # Ordinal encoding for education level: 1=High School, 2=Bachelor's, 3=Master's, 4=PhD, 5=Post-Doc

# Covariance matrix
data = np.array([age, income, education_level]).T
cov_matrix = np.cov(data, rowvar=False)

print(f"Covariance matrix:\n{cov_matrix}")

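# Interpretation:
# The diagonal values are the variances: Var(Age) = 62.5, Var(Income) = 6.25e7, Var(Education level) = 2.5.
# The off-diagonal values are the covariances between pairs: Cov(Age, Income) = 62500, Cov(Age, Education level) = 12.5, Cov(Income, Education level) = 12500.
# All covariances are positive, so in this sample the three variables increase together. A negative value would mean one tends to increase as the other decreases.
# The magnitudes depend on the units (income is in the tens of thousands), so they cannot be compared across pairs to judge the strength of a relationship.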


Covariance matrix:
[[6.25e+01 6.25e+04 1.25e+01]
 [6.25e+04 6.25e+07 1.25e+04]
 [1.25e+01 1.25e+04 2.50e+00]]
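
# Because covariance depends on units, one way to compare the pairs is to rescale each covariance by the product of the two standard deviations, which gives the Pearson correlation.
# A sketch using the same sample arrays; np.corrcoef is used only as a cross-check.

```python
import numpy as np

age = np.array([25, 30, 35, 40, 45])
income = np.array([30000, 35000, 40000, 45000, 50000])
education_level = np.array([1, 2, 3, 4, 5])

data = np.array([age, income, education_level]).T
cov_matrix = np.cov(data, rowvar=False)

# Divide each covariance by the product of the two standard deviations
std = np.sqrt(np.diag(cov_matrix))
corr_from_cov = cov_matrix / np.outer(std, std)

# Cross-check against NumPy's built-in correlation matrix
corr_matrix = np.corrcoef(data, rowvar=False)

print(f"Correlation matrix:\n{corr_from_cov}")
```

# In this sample every variable is an exact linear function of the others, so every correlation comes out as 1 (up to floating-point rounding).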


In [6]:
# Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?
# For "Gender" (Male/Female), I would use Label Encoding because there are only two categories and no inherent order; the result is a single 0/1 column.
# For "Education Level" (High School/Bachelor's/Master's/PhD), I would use Ordinal Encoding since the levels have a natural order that the integers 0, 1, 2, 3 can preserve.
# For "Employment Status" (Unemployed/Part-Time/Full-Time), I would use One-Hot Encoding because the categories are treated as distinct states without an inherent order; one binary column per category avoids implying a ranking or numeric distances between them.
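# A minimal sketch of these three choices, assuming pandas and scikit-learn are available. The DataFrame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "education": ["Master's", "High School", "PhD"],
    "employment": ["Full-Time", "Unemployed", "Part-Time"],
})

# Gender: two categories, so a single 0/1 column is enough
df["gender_encoded"] = (df["gender"] == "Male").astype(int)

# Education Level: ordinal encoding with the order stated explicitly
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
encoder = OrdinalEncoder(categories=education_order)
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()

# Employment Status: one binary column per category
employment_dummies = pd.get_dummies(df["employment"], prefix="employment", dtype=int)

encoded = pd.concat([df[["gender_encoded", "education_encoded"]], employment_dummies], axis=1)
print(encoded)
```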


In [7]:
# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.
import numpy as np

temperature = np.array([30, 32, 28, 35, 33])
humidity = np.array([80, 75, 90, 60, 85])
weather_condition = np.array([1, 2, 3, 1, 2])  # Integer codes: 1=Sunny, 2=Cloudy, 3=Rainy (nominal; the numbering is arbitrary)
wind_direction = np.array([1, 2, 3, 4, 1])  # Integer codes: 1=North, 2=South, 3=East, 4=West (nominal; the numbering is arbitrary)

# Covariance calculation
data = np.array([temperature, humidity, weather_condition, wind_direction]).T
cov_matrix = np.cov(data, rowvar=False)

print(f"Covariance matrix:\n{cov_matrix}")

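# Interpretation:
# The diagonal holds the variances: 7.3 for Temperature and 132.5 for Humidity.
# Cov(Temperature, Humidity) = -24.75 is negative, so in this sample higher temperatures tend to go with lower humidity.
# The entries involving Weather Condition and Wind Direction depend on the arbitrary integer codes assigned to the categories. Their sign and size change if the categories are numbered differently, so they should not be read as real relationships.
# In general, a positive covariance means two variables change in the same direction, while a negative covariance means they change in opposite directions.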

Covariance matrix:
[[ 7.300e+00 -2.475e+01 -1.350e+00  8.500e-01]
 [-2.475e+01  1.325e+02  7.000e+00 -8.250e+00]
 [-1.350e+00  7.000e+00  7.000e-01  5.000e-02]
 [ 8.500e-01 -8.250e+00  5.000e-02  1.700e+00]]
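
# A small check of how much the covariance with a nominal variable depends on its integer codes.
# The alternative numbering below (4=North, 3=South, 2=East, 1=West) is a hypothetical recoding of the same observations.

```python
import numpy as np

temperature = np.array([30, 32, 28, 35, 33])
wind_direction = np.array([1, 2, 3, 4, 1])  # 1=North, 2=South, 3=East, 4=West

# Same observations, hypothetical reversed numbering: 4=North, 3=South, 2=East, 1=West
wind_recoded = 5 - wind_direction

cov_original = np.cov(temperature, wind_direction)[0, 1]
cov_recoded = np.cov(temperature, wind_recoded)[0, 1]

print(f"Covariance with original codes: {cov_original}")
print(f"Covariance with recoded values: {cov_recoded}")
```

# The sign flips even though the underlying data are unchanged, which is why covariances with arbitrarily coded categories are not interpretable.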
