### March 21, Feature Engineering-5, Assignment

#### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

#### Ans: 
The difference between Ordinal Encoding and Label Encoding:

- Ordinal Encoding: Maps each category to an integer that reflects a meaningful order or rank among the categories. The order comes either from an inherent property of the categories or from a ranking the analyst specifies. For example, "low," "medium," and "high" can be encoded as 1, 2, and 3.
- Label Encoding: Maps each category to a unique integer without regard to order. The integers are assigned arbitrarily or alphabetically. For example, "red," "green," and "blue" might become 0, 1, and 2 under an arbitrary assignment, or blue = 0, green = 1, red = 2 under an alphabetical one. In scikit-learn, `LabelEncoder` assigns integers in sorted (alphabetical) order and is intended for target labels, while `OrdinalEncoder` is its counterpart for input features.

When to choose one over the other:

- Ordinal Encoding is suitable when the categories have a clear, meaningful order, such as satisfaction ratings (poor < fair < good). It preserves the ordinal relationship between categories.
- Label Encoding can be chosen when the categories have no inherent order and only need distinct identifiers, such as the class labels of a classification target. For nominal input features, the integer codes can suggest a spurious order to linear or distance-based models, so one-hot encoding is often preferred there.
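
The contrast above can be sketched with scikit-learn. This is a minimal illustration, not a prescribed method: the `priority` and `color` values are invented sample data, and note that `OrdinalEncoder` numbers categories from 0 rather than 1.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, invented for illustration
priority = [["low"], ["high"], ["medium"], ["low"]]
color = ["red", "green", "blue", "green"]

# Ordinal encoding: the category order is stated explicitly
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
priority_encoded = ordinal_encoder.fit_transform(priority).ravel()

# Label encoding: integers follow sorted (alphabetical) order
label_encoder = LabelEncoder()
color_encoded = label_encoder.fit_transform(color)

print(priority_encoded)        # [0. 2. 1. 0.]
print(color_encoded)           # [2 1 0 1]
print(label_encoder.classes_)  # ['blue' 'green' 'red']
```

`OrdinalEncoder` expects a 2-D input (one column per feature) and returns floats, while `LabelEncoder` works on a 1-D sequence and returns integers.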

#### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

#### Ans:
Target Guided Ordinal Encoding:

- Target Guided Ordinal Encoding ranks the categories of a categorical variable by a summary statistic of the target, usually the mean or median of the target within each category, and then replaces each category with its rank. Categories associated with higher target values (for example, a higher rate of the positive outcome) receive higher integers, and categories associated with lower target values receive lower integers.
- This technique is useful when the categorical variable is strongly related to the target and either has no natural order of its own or has too many categories for one-hot encoding. For example, a "city" feature in a house-price model could be encoded by ranking cities by their mean sale price. The ranking should be computed from the training data only, so that target information does not leak into validation or test data.
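
The procedure can be sketched in plain Python as three steps: compute the target mean per category, rank the categories by that mean, and map each value to its rank. The `cities` and `prices` values below are invented for illustration, and ranking from 1 upward is an arbitrary choice.

```python
from collections import defaultdict

# Hypothetical training data, invented for illustration
cities = ["Austin", "Boston", "Austin", "Denver", "Boston", "Denver", "Boston"]
prices = [300, 520, 340, 410, 480, 390, 500]  # target, in thousands

# Step 1: mean of the target within each category
totals = defaultdict(float)
counts = defaultdict(int)
for city, price in zip(cities, prices):
    totals[city] += price
    counts[city] += 1
mean_price = {city: totals[city] / counts[city] for city in totals}

# Step 2: rank categories by target mean (lowest mean gets rank 1)
ranked = sorted(mean_price, key=mean_price.get)
city_rank = {city: rank for rank, city in enumerate(ranked, start=1)}

# Step 3: replace each category with its rank
encoded = [city_rank[city] for city in cities]

print(city_rank)  # {'Austin': 1, 'Denver': 2, 'Boston': 3}
print(encoded)    # [1, 3, 1, 2, 3, 2, 3]
```

In a real project the mapping in `city_rank` would be learned on the training split and then applied unchanged to new data.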

#### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

#### Ans:
Covariance:

Covariance measures how two variables vary together. It is positive when the variables tend to move in the same direction, negative when they tend to move in opposite directions, and close to zero when there is no linear relationship.

Covariance is important in statistical analysis because it reveals the direction of the linear relationship between variables, and it is a building block for correlation, regression, and principal component analysis. Its magnitude depends on the units of the variables, so it does not by itself indicate the strength of the relationship, and a covariance of zero does not imply that the variables are independent.

The sample covariance is calculated using the formula:

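$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, and $n$ is the number of observations.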
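
The formula translates directly into a few lines of Python. The function name `sample_covariance` and the sample values are hypothetical, chosen only to illustrate the calculation.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means, divided by n - 1
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical values, invented for illustration
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

print(sample_covariance(hours_studied, exam_score))     # 14.0
print(sample_covariance(hours_studied, hours_studied))  # 2.5
```

The second call shows that the covariance of a variable with itself is its sample variance.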

#### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
from sklearn.preprocessing import LabelEncoder

# Example dataset
colors = ['red', 'green', 'blue', 'green', 'red']
sizes = ['small', 'large', 'medium', 'medium', 'small']
materials = ['metal', 'wood', 'plastic', 'wood', 'metal']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

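# Fit and transform each variable. Every fit_transform call refits the encoder,
# so afterwards label_encoder.classes_ holds only the materials categories.
encoded_colors = label_encoder.fit_transform(colors)
encoded_sizes = label_encoder.fit_transform(sizes)
encoded_materials = label_encoder.fit_transform(materials)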

# Print the encoded values
print(encoded_colors)  # Output: [2 1 0 1 2]
print(encoded_sizes)   # Output: [2 0 1 1 2]
print(encoded_materials)  # Output: [0 2 1 2 0]


[2 1 0 1 2]
[2 0 1 1 2]
[0 2 1 2 0]


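The `LabelEncoder` from scikit-learn assigns each unique category an integer label, numbering the categories in sorted (alphabetical) order starting from 0:

- Color: blue = 0, green = 1, red = 2
- Size: large = 0, medium = 1, small = 2
- Material: metal = 0, plastic = 1, wood = 2

The integers are identifiers only. For Size in particular, the alphabetical codes do not follow the natural order small < medium < large.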
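
The mapping behind the integers can be read from the fitted encoder's `classes_` attribute. The sketch below uses a dedicated encoder for the Size variable; the names `size_encoder` and `size_mapping` are introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['small', 'large', 'medium', 'medium', 'small']

size_encoder = LabelEncoder()
encoded_sizes = size_encoder.fit_transform(sizes)

# classes_ holds the categories in sorted order; the position is the label
size_mapping = {str(category): code for code, category in enumerate(size_encoder.classes_)}
print(size_mapping)  # {'large': 0, 'medium': 1, 'small': 2}

# inverse_transform recovers the original categories from the integers
decoded_sizes = size_encoder.inverse_transform(encoded_sizes)
print(decoded_sizes)
```

Keeping one encoder per variable preserves each mapping, which is needed to decode predictions or to encode new data consistently.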

#### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [4]:
import numpy as np

# Example dataset
age = [25, 30, 35, 40, 45]
income = [50000, 60000, 70000, 80000, 90000]
education_level = [12, 16, 14, 18, 16]

## Create a numpy array from the variables
data = np.array([age, income, education_level])

## Calculate the covariance matrix
covariance_matrix = np.cov(data)

## Print the covariance matrix
print(covariance_matrix)


[[6.25e+01 1.25e+05 1.25e+01]
 [1.25e+05 2.50e+08 2.50e+04]
 [1.25e+01 2.50e+04 5.20e+00]]


Interpretation:
- The diagonal elements represent the variance of each variable (age, income, education_level).
- The off-diagonal elements represent the covariance between pairs of variables.
- For example, the covariance between age and income is 125,000, indicating a positive linear relationship: in this sample, higher age is associated with higher income. The large magnitude reflects the scale of the income values rather than an unusually strong relationship.
- The covariance between age and education_level is 12.5, which is also positive. Because covariance depends on the units of the variables, its magnitude does not indicate the strength of the relationship; the correlation coefficient is needed for that.
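
To compare the strength of these relationships on a common scale, the covariances can be normalized into correlation coefficients. A minimal sketch using `np.corrcoef` on the same example data:

```python
import numpy as np

age = [25, 30, 35, 40, 45]
income = [50000, 60000, 70000, 80000, 90000]
education_level = [12, 16, 14, 18, 16]

# Rows are variables, columns are observations (the np.corrcoef default)
data = np.array([age, income, education_level])
correlation_matrix = np.corrcoef(data)

print(np.round(correlation_matrix, 3))
```

In this sample the age and income correlation is 1.0, because income is an exact linear function of age (income = 2000 × age), while the age and education_level correlation is about 0.69.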

#### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

#### Ans:
Encoding methods for categorical variables:
- Gender: This is a nominal variable with two categories (Male/Female), so a single binary column (e.g., 0 for Male and 1 for Female) is sufficient. Label encoding produces exactly this, and no order is implied because there are only two values.
- Education Level: The categories have a natural order (High School < Bachelor's < Master's < PhD), so ordinal encoding (e.g., 0, 1, 2, 3) is appropriate because it preserves that order. One-hot encoding is an alternative if the model should not assume evenly spaced levels.
- Employment Status: The categories (Unemployed, Part-Time, Full-Time) can be treated as nominal, in which case one-hot encoding represents each category as a binary feature without implying an order. If the ordering by amount of work matters for the problem, ordinal encoding is also defensible.
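
One possible way to apply such encodings with scikit-learn is sketched below. It treats Education Level as ordered (High School < Bachelor's < Master's < PhD) and Employment Status as nominal, which is an assumption about the modeling goal; the sample rows are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows, invented for illustration (one column per variable)
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"]])
education = np.array([["Bachelor's"], ["PhD"], ["High School"], ["Master's"]])
employment = np.array([["Full-Time"], ["Unemployed"], ["Part-Time"], ["Full-Time"]])

# Binary variable: a single 0/1 column, with the category order given explicitly
gender_encoded = OrdinalEncoder(categories=[["Male", "Female"]]).fit_transform(gender)

# Ordered variable: integer codes follow the stated ranking
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_encoded = OrdinalEncoder(categories=[education_order]).fit_transform(education)

# Nominal variable: one binary column per category (columns in sorted order)
one_hot = OneHotEncoder()
employment_encoded = one_hot.fit_transform(employment).toarray()

print(gender_encoded.ravel())     # [0. 1. 1. 0.]
print(education_encoded.ravel())  # [1. 3. 0. 2.]
print(one_hot.categories_[0])     # ['Full-Time' 'Part-Time' 'Unemployed']
print(employment_encoded)
```

Each row of `employment_encoded` contains a single 1 in the column of its category, so no order among the employment categories is implied.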

#### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [5]:
import numpy as np

# Example dataset
temperature = [25, 30, 27, 22, 28]
humidity = [60, 65, 55, 70, 62]
weather_condition = [1, 2, 1, 3, 2]
wind_direction = [4, 4, 3, 2, 1]

# Create a numpy array from the variables
data = np.array([temperature, humidity, weather_condition, wind_direction])

# Calculate the covariance matrix
covariance_matrix = np.cov(data)

# Print the covariance matrix
print(covariance_matrix)

[[ 9.3  -6.45 -0.9   0.85]
 [-6.45 31.3   4.35 -1.65]
 [-0.9   4.35  0.7  -0.55]
 [ 0.85 -1.65 -0.55  1.7 ]]


Interpretation:

- The diagonal elements represent the variance of each variable (temperature, humidity, weather_condition, wind_direction).
- The off-diagonal elements represent the covariance between pairs of variables.
- For example, the covariance between temperature and humidity is -6.45, indicating a negative relationship in this sample: as temperature increases, humidity tends to decrease.
- The covariance between weather_condition and wind_direction is -0.55. This value has no meaningful interpretation, because both variables are nominal and the integer codes assigned to their categories are arbitrary; a different coding would change the covariance. Covariance is meaningful only for the numeric pair, and even there its sign gives the direction of the relationship while its magnitude does not measure strength.
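
The sketch below illustrates both points on the same example data: it computes the covariance and correlation for the two continuous variables, then shows how relabeling a nominal variable changes its covariance. The `swap` relabeling is hypothetical and chosen only for the demonstration.

```python
import numpy as np

temperature = [25, 30, 27, 22, 28]
humidity = [60, 65, 55, 70, 62]
weather_condition = [1, 2, 1, 3, 2]

# Covariance and correlation for the two continuous variables
cov_temp_humidity = float(np.cov(temperature, humidity)[0, 1])
corr_temp_humidity = float(np.corrcoef(temperature, humidity)[0, 1])
print(round(cov_temp_humidity, 2))   # -6.45
print(round(corr_temp_humidity, 3))

# Swap codes 1 and 3: the same categories under a different arbitrary labeling
swap = {1: 3, 2: 2, 3: 1}
recoded_weather = [swap[code] for code in weather_condition]

cov_original = float(np.cov(temperature, weather_condition)[0, 1])
cov_recoded = float(np.cov(temperature, recoded_weather)[0, 1])
print(round(cov_original, 2))  # -0.9
print(round(cov_recoded, 2))   # 0.9
```

The temperature and humidity correlation is about -0.38, a moderate negative relationship. The sign of the covariance with weather_condition flips under relabeling, which is why covariance on integer-coded nominal variables should not be interpreted.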