## Feature Engineering 5

Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ans:  
  
Ordinal Encoding and Label Encoding are both techniques used to convert categorical data into numerical values, but they are used in different scenarios based on the nature of the categorical variable.
  
**Ordinal Encoding:**  
Ordinal Encoding is used when the categorical variable has a meaningful order or ranking between its categories. The categories are converted into integers that reflect this order.  
Example:  
Consider a variable representing levels of education: High School, Bachelor's, Master's, PhD.  
We can rank them as follows:  
High School = 1  
Bachelor's = 2  
Master's = 3  
PhD = 4  

**Label Encoding:**    
Label Encoding assigns a unique integer to each category without regard to any inherent order between the categories. It is typically applied to nominal variables (i.e., no natural ordering), although a model that treats the feature as numeric may still read the integers as an order.  
Example:  
Consider a variable representing colors: Red, Green, Blue.  
In Label Encoding, you might assign:  
Red = 1  
Green = 2  
Blue = 3  

**Key Differences:**  
1. Order vs. No Order: Ordinal Encoding assigns integers that reflect the order of the categories, while Label Encoding assigns them arbitrarily, so the numbers carry no ranking.  
2. Use Case: Use Ordinal Encoding for ordered categories (e.g., education levels, ratings) and Label Encoding for nominal categories (e.g., colors, city names).  
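
As a minimal sketch of the difference, assuming pandas is available, the snippet below encodes the education example with an explicit order and the color example with arbitrary integer codes. The column names and the `education_order` mapping are illustrative choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["Master's", "High School", "PhD", "Bachelor's"],
    "Color": ["Red", "Green", "Blue", "Red"],
})

# Ordinal encoding: we choose the integers so that they match the real ranking
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["Education_encoded"] = df["Education"].map(education_order)

# Label encoding: codes are assigned in sorted (alphabetical) order,
# so they only identify the category and say nothing about rank
df["Color_encoded"] = df["Color"].astype("category").cat.codes

print(df)
```

Here Blue, Green, and Red receive 0, 1, and 2 purely because of alphabetical sorting, whereas the education codes follow the order we specified.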

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Ans:    

**Target Guided Ordinal Encoding** is a technique that extends ordinal encoding by using the target variable (i.e., the outcome variable) to decide the order of the categories. This approach is useful when the ordering that matters for prediction is the one implied by the target, for example a nominal feature with many categories, or an ordered feature whose encoding should reflect its relationship with the target.  
  
**How Target Guided Ordinal Encoding Works:**  
1. Calculate the Target Mean: For each category in the feature, calculate the mean of the target variable. This gives a sense of the target's average value for each category.  

2. Rank the Categories: Based on the mean target values, rank the categories. Categories with higher mean target values are assigned higher ordinal values.  

3. Assign Ordinal Values: Assign ordinal values to the categories based on their rank. The category with the highest mean target value gets the highest ordinal value, and so on.  

4. Replace the Original Categorical Feature: Replace the original categorical values in your dataset with the newly assigned ordinal values.
  
**Example Scenario:**  
Imagine you are working on a machine learning project to predict customer churn based on features including customer satisfaction levels. One of your features is Satisfaction Level, which can take on categorical values like:  
Very Dissatisfied,    
Dissatisfied,    
Neutral,    
Satisfied,    
Very Satisfied    
Your target variable is Churn, where 1 indicates that the customer has churned and 0 indicates that they have not.  
  
**Steps for Target Guided Ordinal Encoding:**  
1. Calculate Target Mean for Each Category:  
Compute the mean Churn value for each satisfaction level category.
Example results:  
Very Dissatisfied: Mean Churn = 0.8  
Dissatisfied: Mean Churn = 0.6  
Neutral: Mean Churn = 0.4  
Satisfied: Mean Churn = 0.2  
Very Satisfied: Mean Churn = 0.1  

2. Rank the Categories:  
Based on the mean Churn values, the categories are ranked:  
Very Dissatisfied (highest mean Churn) gets the highest rank.  
Very Satisfied (lowest mean Churn) gets the lowest rank.  

3. Assign Ordinal Values:  
Assign ordinal values based on the rank:  
Very Dissatisfied = 5  
Dissatisfied = 4  
Neutral = 3  
Satisfied = 2  
Very Satisfied = 1  

4. Replace the Original Categorical Feature:  
Replace each Satisfaction Level value in your dataset with its assigned ordinal value.  
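
The steps above can be sketched with pandas as follows. The toy dataset is hypothetical and was constructed only so that the per-category churn means match the example values; the column name `Satisfaction_encoded` is likewise an illustrative choice. In a real project the means would normally be computed on the training split only, so that the encoding does not leak target information from the test data.

```python
import pandas as pd

# Hypothetical toy data: category -> (number churned, number retained)
counts = {
    "Very Dissatisfied": (4, 1),
    "Dissatisfied": (3, 2),
    "Neutral": (2, 3),
    "Satisfied": (1, 4),
    "Very Satisfied": (1, 9),
}
rows = []
for level, (churned, retained) in counts.items():
    rows += [(level, 1)] * churned + [(level, 0)] * retained
df = pd.DataFrame(rows, columns=["Satisfaction Level", "Churn"])

# Step 1: mean of the target for each category
mean_churn = df.groupby("Satisfaction Level")["Churn"].mean()

# Steps 2 and 3: rank the categories by mean target (lowest mean gets 1)
ranks = mean_churn.rank(method="first").astype(int)

# Step 4: replace each category with its ordinal value
df["Satisfaction_encoded"] = df["Satisfaction Level"].map(ranks)

print(ranks.sort_values())
```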

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans:  
  
**Covariance** is a statistical measure that indicates the degree to which two variables change together. In other words, it quantifies how much two random variables vary in tandem. If the variables tend to increase and decrease together, their covariance is positive; if one tends to increase when the other decreases, the covariance is negative. If the variables do not show any consistent pattern of co-movement, the covariance is close to zero.
  
**Importance of Covariance in Statistical Analysis:**  
1. Relationship Insight: Covariance helps in understanding the relationship between two variables. Positive covariance indicates that the variables tend to move in the same direction, while negative covariance indicates they move in opposite directions.  

2. Basis for Correlation: Covariance is foundational for calculating correlation, a standardized measure of the relationship between two variables. Correlation adjusts covariance for the scale of the variables, providing a dimensionless measure of association.  

3. Portfolio Theory: In finance, covariance is used to assess the risk and return of investment portfolios. It helps in understanding how the returns on different assets in a portfolio move relative to each other, which is crucial for diversification strategies.  

4. Principal Component Analysis (PCA): Covariance matrices are used in PCA to determine the principal components of a dataset, which helps in reducing dimensionality while preserving as much variance as possible.  
  
**Calculation of Covariance:**  
To calculate the sample covariance between two variables X and Y, use the following formula:

$$
\text{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
$$

where:
- $n$ is the number of data points,
- $X_i$ and $Y_i$ are individual observations of variables $X$ and $Y$,
- $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, respectively.

```python
import pandas as pd

# Data
data = pd.DataFrame({
    'X': [2, 4, 6, 8],
    'Y': [3, 5, 7, 9]
})

# Calculate the sample covariance (pandas divides by n - 1)
cov_XY = data.cov().loc['X', 'Y']

print(f"Covariance between X and Y: {cov_XY}")
```

Output:

```
Covariance between X and Y: 6.666666666666666
```
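
The pandas result can be checked by applying the formula term by term. The sketch below uses plain Python on the same four data points; the variable names are illustrative.

```python
# Same toy data as in the pandas example
X = [2, 4, 6, 8]
Y = [3, 5, 7, 9]
n = len(X)

# Means of X and Y
x_bar = sum(X) / n  # 5.0
y_bar = sum(Y) / n  # 6.0

# Products of the deviations from the means
products = [(x - x_bar) * (y - y_bar) for x, y in zip(X, Y)]  # [9.0, 1.0, 1.0, 9.0]

# Divide the sum by n - 1 to get the sample covariance
cov_xy = sum(products) / (n - 1)

print(f"Covariance between X and Y: {cov_xy:.4f}")  # 6.6667
```

The deviation products sum to 20, and dividing by n - 1 = 3 gives about 6.67, which agrees with the pandas output.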


Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Create a DataFrame with categorical variables
data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

# Initialize the OrdinalEncoder for the ordinal 'Size' column
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# Fit and transform the 'Size' column (returns a 2-D float array)
data['Size_encoded'] = ordinal_encoder.fit_transform(data[['Size']])

# Initialize a LabelEncoder for each nominal column
color_encoder = LabelEncoder()
material_encoder = LabelEncoder()

# Fit and transform the 'Color' and 'Material' columns
data['Color_encoded'] = color_encoder.fit_transform(data['Color'])
data['Material_encoded'] = material_encoder.fit_transform(data['Material'])

# Display the encoded DataFrame
data
```

Output:

| | Color | Size | Material | Size_encoded | Color_encoded | Material_encoded |
|---|---|---|---|---|---|---|
| 0 | red | small | wood | 0.0 | 2 | 2 |
| 1 | green | medium | metal | 1.0 | 1 | 0 |
| 2 | blue | large | plastic | 2.0 | 0 | 1 |
| 3 | red | medium | wood | 1.0 | 2 | 2 |
| 4 | blue | small | metal | 0.0 | 0 | 0 |


Explanation:
1. Color column: Label encoding is applied because the categories have no ranking. `LabelEncoder` assigns integers in alphabetical order (blue: 0, green: 1, red: 2). One-Hot Encoding is also an option, but it increases the number of columns.
2. Size column: Ordinal encoding is applied because the categories are ranked:  
    - small: 0  
    - medium: 1  
    - large: 2
3. Material column: Label encoding is applied because the categories have no ranking (metal: 0, plastic: 1, wood: 2). One-Hot Encoding is also an option, but it increases the number of columns.
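
To make the trade-off with One-Hot Encoding concrete, here is a small sketch using `pandas.get_dummies` on the same Color and Material values. It is an illustration of the column growth, not part of the label-encoding answer itself.

```python
import pandas as pd

data = pd.DataFrame({
    "Color": ["red", "green", "blue", "red", "blue"],
    "Material": ["wood", "metal", "plastic", "wood", "metal"],
})

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(data, columns=["Color", "Material"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot.shape)
```

Two categorical columns with three categories each become six indicator columns, whereas label encoding keeps one integer column per variable.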

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

Ans:  

To calculate the covariance matrix for variables such as Age, Income, and Education level, you'll need to follow these steps:
  
1. Prepare the Data: Input your dataset with these variables.
2. Encode Categorical Variables: If Education level is categorical, encode it appropriately.
3. Compute the Covariance Matrix: Use a library like numpy or pandas to calculate the covariance matrix.
4. Interpret the Results: Understand the relationships between the variables based on the covariance values.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
pd.options.display.float_format = '{:.2f}'.format

# Create a DataFrame with the data
data = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 70000, 80000, 90000],
    'Education Level': ["Bachelor's", "Master's", "PhD", "Master's", "Bachelor's"]
})

# Define the order of the education levels
education_order = ["Bachelor's", "Master's", "PhD"]

# Initialize the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])

# Fit and transform the 'Education Level' column
data['Education Level_encoded'] = ordinal_encoder.fit_transform(data[['Education Level']])

# Drop the original categorical column
data_encoded = data.drop(columns=['Education Level'])

print(data_encoded)
```

Output:

```
   Age  Income  Education Level_encoded
0   25   50000                     0.00
1   30   60000                     1.00
2   35   70000                     2.00
3   40   80000                     1.00
4   45   90000                     0.00
```


```python
# Compute the covariance matrix (rowvar=False: each column is a variable)
cov_matrix = np.cov(data_encoded, rowvar=False)
cov_matrix_df = pd.DataFrame(cov_matrix, index=data_encoded.columns, columns=data_encoded.columns)

print(cov_matrix_df)
```

Output:

```
                              Age       Income  Education Level_encoded
Age                         62.50    125000.00                    -0.00
Income                  125000.00 250000000.00                    -0.00
Education Level_encoded     -0.00        -0.00                     0.70
```


**Interpretation:**  
Diagonal Elements (Variance):  
  
1. Age Variance: 62.50
2. Income Variance: 250000000 (large mainly because income is measured in much larger units than age; variance and covariance are scale-dependent)
3. Education Level Variance: 0.70 (variance of the encoded ordinal education levels)  

Off-Diagonal Elements (Covariance):  
  
1. Age and Income Covariance: 125000 (positive covariance, indicating that as age increases, income tends to increase as well)  
2. Age and Education Level Covariance: 0 (no linear relationship between age and education level in this sample; the encoded level rises and then falls with age, a pattern that covariance does not capture)  
3. Income and Education Level Covariance: 0 (likewise, no linear relationship between income and education level in this sample)  
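
Because covariance depends on the units of each variable, the raw values above are hard to compare with one another. One common way to put them on the same scale is the correlation matrix, which divides each covariance by the product of the two standard deviations. The sketch below rebuilds the encoded data by hand so that it runs on its own.

```python
import pandas as pd

# Encoded data from the example above, entered directly
data_encoded = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 60000, 70000, 80000, 90000],
    "Education Level_encoded": [0.0, 1.0, 2.0, 1.0, 0.0],
})

# Pearson correlation: covariance rescaled to the range [-1, 1]
corr_matrix = data_encoded.corr()

print(corr_matrix.round(2))
```

In this sample Age and Income are perfectly linearly related (correlation 1), while Education Level has a correlation of 0 with both.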

Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Ans:    
  
In a machine learning project, selecting an appropriate encoding method for each categorical variable matters because it can affect model performance. Here is which encoding method to use for each variable and why:  
1. Gender: A nominal variable with two categories here, so a single binary (0/1) column is enough. Label Encoding and One-Hot Encoding with one column dropped both produce this.  
2. Education Level: Ordinal Encoding, as it respects the inherent order of educational achievement: High School < Bachelor's < Master's < PhD.  
3. Employment Status: Ordinal Encoding if you treat the categories as ordered (Unemployed < Part-Time < Full-Time), or One-Hot Encoding if you treat them as nominal.  
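
A minimal sketch of these choices with pandas is shown below. The sample rows are made up for illustration, Employment Status is treated as nominal here, and the new column names are arbitrary.

```python
import pandas as pd

# Hypothetical sample rows, made up for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a single 0/1 column is enough
df["Gender_encoded"] = df["Gender"].map({"Male": 0, "Female": 1})

# Education Level: ordinal, so the integers follow the real ranking
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["Education_encoded"] = df["Education Level"].map(education_order)

# Employment Status: treated as nominal here, one indicator column per category
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)

print(df.drop(columns=["Gender", "Education Level"]))
```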

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Create a DataFrame with the data
data = pd.DataFrame({
    'Temperature': [30, 25, 28, 32, 27],
    'Humidity': [70, 80, 75, 60, 85],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'East', 'West', 'North', 'South']
})

# One-Hot Encoding for categorical variables
# drop='first' removes one dummy per variable to avoid perfectly collinear columns;
# sparse_output requires scikit-learn >= 1.2
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')

# Encode 'Weather Condition'
weather_encoded = one_hot_encoder.fit_transform(data[['Weather Condition']])
weather_encoded_df = pd.DataFrame(weather_encoded, columns=one_hot_encoder.get_feature_names_out(['Weather Condition']))

# Encode 'Wind Direction' (the encoder is refitted on the new column)
wind_encoded = one_hot_encoder.fit_transform(data[['Wind Direction']])
wind_encoded_df = pd.DataFrame(wind_encoded, columns=one_hot_encoder.get_feature_names_out(['Wind Direction']))

# Concatenate encoded columns with the original continuous columns
data_encoded = pd.concat([data[['Temperature', 'Humidity']], weather_encoded_df, wind_encoded_df], axis=1)

data_encoded
```

Output:

| | Temperature | Humidity | Weather Condition_Rainy | Weather Condition_Sunny | Wind Direction_North | Wind Direction_South | Wind Direction_West |
|---|---|---|---|---|---|---|---|
| 0 | 30 | 70 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 25 | 80 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 28 | 75 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 32 | 60 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 27 | 85 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


```python
# Compute the covariance matrix (rowvar=False: each column is a variable)
cov_matrix = np.cov(data_encoded, rowvar=False)
cov_matrix_df = pd.DataFrame(cov_matrix, index=data_encoded.columns, columns=data_encoded.columns)
cov_matrix_df
```

Output:

| | Temperature | Humidity | Weather Condition_Rainy | Weather Condition_Sunny | Wind Direction_North | Wind Direction_South | Wind Direction_West |
|---|---|---|---|---|---|---|---|
| Temperature | 7.3 | -23.25 | -0.1 | 1.3 | 1.3 | -0.35 | -0.1 |
| Humidity | -23.25 | 92.5 | 0.25 | -4.5 | -4.5 | 2.75 | 0.25 |
| Weather Condition_Rainy | -0.1 | 0.25 | 0.2 | -0.1 | -0.1 | -0.05 | 0.2 |
| Weather Condition_Sunny | 1.3 | -4.5 | -0.1 | 0.3 | 0.3 | -0.1 | -0.1 |
| Wind Direction_North | 1.3 | -4.5 | -0.1 | 0.3 | 0.3 | -0.1 | -0.1 |
| Wind Direction_South | -0.35 | 2.75 | -0.05 | -0.1 | -0.1 | 0.2 | -0.05 |
| Wind Direction_West | -0.1 | 0.25 | 0.2 | -0.1 | -0.1 | -0.05 | 0.2 |

Results:  
1. Temperature and Humidity have a negative relationship, with a covariance of -23.25.  
2. Temperature has a negative covariance of -0.10 with both Rainy weather and West wind, and of -0.35 with South wind.  
3. Temperature has a positive covariance of 1.30 with both Sunny weather and North wind.  
4. Humidity has a positive covariance with Rainy weather (0.25), South wind (2.75), and West wind (0.25).  
5. Humidity has a negative covariance of -4.50 with both Sunny weather and North wind.  

Because covariance depends on the scale of each variable, these magnitudes indicate direction only and are not directly comparable across pairs.  
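
Two details of these results can be checked with a short sketch. It uses `pandas.get_dummies` as a stand-in for the scikit-learn one-hot step (an assumption made to keep the example compact; it yields the same dummy columns for this data). It tests whether some dummy columns are identical in this five-row sample, which would explain the identical covariance rows, and it rescales the Temperature and Humidity covariance to a correlation.

```python
import pandas as pd

data = pd.DataFrame({
    "Temperature": [30, 25, 28, 32, 27],
    "Humidity": [70, 80, 75, 60, 85],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "East", "West", "North", "South"],
})

# One-hot encode, dropping the first category of each variable
encoded = pd.get_dummies(
    data, columns=["Weather Condition", "Wind Direction"], drop_first=True, dtype=float
)

# Check whether Sunny coincides with North, and Rainy with West, in every row
same_sunny_north = bool((encoded["Weather Condition_Sunny"] == encoded["Wind Direction_North"]).all())
same_rainy_west = bool((encoded["Weather Condition_Rainy"] == encoded["Wind Direction_West"]).all())
print(same_sunny_north, same_rainy_west)

# Pearson correlation between the two continuous variables
temp_humidity_corr = encoded["Temperature"].corr(encoded["Humidity"])
print(round(temp_humidity_corr, 2))
```

Both checks come out true for this sample, so the Sunny and North rows (and the Rainy and West rows) of the covariance matrix carry the same information. The Temperature and Humidity correlation is roughly -0.89, a strong negative linear relationship in this small sample.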