# Assignment (21st March): Feature Engineering - 5

### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

**ANS:** The difference between Ordinal Encoding and Label Encoding is as follows:

- **Ordinal Encoding**: Assigns integers to categories with a meaningful order.
  - **Example**: 'Low', 'Medium', 'High' -> 1, 2, 3
  - **When to Use**: When the categories have a natural, meaningful order (e.g., educational levels: 'High School', 'Bachelor's', 'Master's').

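- **Label Encoding**: Assigns an arbitrary integer to each category; the codes carry no ranking. scikit-learn's `LabelEncoder`, for instance, numbers the classes in sorted (alphabetical) order and is intended for encoding target labels.
  - **Example**: 'Blue', 'Green', 'Red' -> 0, 1, 2
  - **When to Use**: When the categories do not have an inherent order (e.g., 'Cat', 'Dog', 'Fish') and the model will not read the integers as a ranking, such as when encoding a classification target.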
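The contrast can be sketched with scikit-learn. This is a minimal illustration, assuming a hypothetical `Level` column; note that `OrdinalEncoder` produces 0-based codes (0, 1, 2) rather than the 1, 2, 3 used in the example above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical column with an ordered category.
df_demo = pd.DataFrame({"Level": ["Low", "High", "Medium", "Low"]})

# Ordinal encoding: the order Low < Medium < High is stated explicitly.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(df_demo[["Level"]]).ravel()

# Label encoding: classes are numbered in sorted (alphabetical) order,
# i.e. High -> 0, Low -> 1, Medium -> 2, which ignores the real ranking.
label = LabelEncoder()
label_codes = label.fit_transform(df_demo["Level"])

print(ordinal_codes)  # [0. 2. 1. 0.]
print(label_codes)    # [1 0 2 1]
```

The ordinal codes keep 'Low' below 'Medium' below 'High', while the label codes place 'High' lowest simply because it sorts first.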


### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

**ANS:** Target Guided Ordinal Encoding works as follows:
- It assigns ordinal values to categories based on their relationship with the target variable: the categories are sorted by the mean of the target within each category and then numbered in that order.
- **Example**: Encoding 'City' based on average house prices (target variable).
    - A 'City' with a higher average house price gets a higher ordinal value.

- **When to Use**: When you want the encoding to capture the relationship between the categories and the target variable. It is useful in regression and classification tasks where the target variable is continuous or binary. The category means should be computed on the training data only, so that target information does not leak into the validation or test data.
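The steps above can be sketched with pandas. The city names and prices below are made-up toy values, and `city_rank` and `City_encoded` are illustrative names, not part of any library.

```python
import pandas as pd

# Hypothetical toy data: 'Price' is the target variable.
houses = pd.DataFrame({
    "City": ["A", "B", "C", "A", "B", "C"],
    "Price": [300, 150, 200, 320, 170, 210],
})

# 1. Mean of the target per category, sorted from lowest to highest.
mean_price = houses.groupby("City")["Price"].mean().sort_values()

# 2. Number the categories in that order: lowest mean -> 0.
city_rank = {city: rank for rank, city in enumerate(mean_price.index)}

# 3. Map the ranks back onto the original column.
houses["City_encoded"] = houses["City"].map(city_rank)

print(city_rank)  # {'B': 0, 'C': 1, 'A': 2}
```

City 'A' has the highest mean price (310) and therefore receives the highest code. In a real project the mapping would be fitted on the training split and then applied to the other splits.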


### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

**ANS:** 
- **Definition**: Covariance measures the direction of the linear relationship between two variables, i.e., how the two variables vary together.
  - **Positive Covariance**: Variables increase together.
  - **Negative Covariance**: One variable increases while the other decreases.

- **Importance in Statistical Analysis**:
  - **Relationship Between Variables**: Its sign shows whether two variables tend to move together or in opposite directions; dividing it by the two standard deviations gives the correlation coefficient.
  - **Data Analysis**: Used in feature selection, principal component analysis (PCA), and other multivariate techniques.

- **Calculation**:
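  - Given two variables \(X\) and \(Y\) with means \(\bar{X}\) and \(\bar{Y}\):
    <p align="center">
    \[
    \text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
    \]
    </p>
  - Where \(n\) is the number of data points, and \(X_i\) and \(Y_i\) are the individual data points of \(X\) and \(Y\). This is the population covariance; the sample covariance divides by \(n - 1\) instead of \(n\), which is also the default used by pandas' `DataFrame.cov()`.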

**Example Calculation**:

Given \(X = [1, 2, 3]\) and \(Y = [4, 5, 6]\):

1. Calculate the means:

    <p align="center">
    \[
    \bar{X} = 2, \quad \bar{Y} = 5
    \]
    </p>

2. Compute the covariance:
    <p align="center">
    \[
    \text{Cov}(X, Y) = \frac{1}{3} [(1-2)(4-5) + (2-2)(5-5) + (3-2)(6-5)] = \frac{1}{3} [1 + 0 + 1] = \frac{2}{3}
    \]
    </p>
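The worked example can be checked with NumPy. This is a small sketch; it also shows the sample covariance (divisor \(n - 1\)), which is what `np.cov` returns by default and which differs from the \(\frac{2}{3}\) obtained with the divisor \(n\).

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Population covariance, computed directly from the formula (divide by n).
pop_cov = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns a 2x2 matrix; entry [0, 1] is Cov(x, y).
pop_cov_np = np.cov(x, y, bias=True)[0, 1]   # bias=True -> divide by n
sample_cov = np.cov(x, y)[0, 1]              # default   -> divide by n - 1

print(pop_cov)     # 0.666...
print(sample_cov)  # 1.0
```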
    

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Step 1: A sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

df = pd.DataFrame(data)

# Step 2: Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Step 3: Apply label encoding to each categorical variable
df['Color'] = label_encoder.fit_transform(df['Color'])
df['Size'] = label_encoder.fit_transform(df['Size'])
df['Material'] = label_encoder.fit_transform(df['Material'])

# Display the transformed dataset
print(df)

   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     2         0
4      2     0         2


**Explanation of the Output**:
- **Color**:
  - 'red' -> 2
  - 'green' -> 1
  - 'blue' -> 0

- **Size**:
  - 'small' -> 2
  - 'medium' -> 1
  - 'large' -> 0

- **Material**:
  - 'wood' -> 2
  - 'metal' -> 0
  - 'plastic' -> 1

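Each categorical variable is now encoded as integers, making the data suitable for machine learning algorithms that require numerical input. `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the codes do not reflect any real ranking: for Size, 'large' becomes 0 and 'small' becomes 2. For an ordered variable such as Size, an ordinal encoding with an explicit category order would preserve the ranking.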
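Because the code above refits a single `LabelEncoder` for each column, only the last mapping (Material) remains available afterwards. One possible variation, sketched here with illustrative names (`encoders`, `mappings`, and the `_code` suffix are assumptions), keeps one encoder per column so that each mapping can be inspected and reversed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_raw = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red"],
    "Size": ["small", "medium", "large", "small", "large"],
})

encoders = {}  # one fitted encoder per column
mappings = {}  # category -> integer code, read from classes_
for col in ["Color", "Size"]:
    enc = LabelEncoder()
    df_raw[col + "_code"] = enc.fit_transform(df_raw[col])
    encoders[col] = enc
    mappings[col] = {str(cls): code for code, cls in enumerate(enc.classes_)}

print(mappings["Size"])  # {'large': 0, 'medium': 1, 'small': 2}

# The stored encoder can turn the codes back into the original labels.
restored = encoders["Size"].inverse_transform(df_raw["Size_code"])
```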

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [2]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 55000, 60000, 65000, 70000],
    'Education level': [1, 2, 2, 3, 3]  # Encoded education levels
}

df = pd.DataFrame(data)

# Calculate covariance matrix
cov_matrix = df.cov()

print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
                      Age      Income  Education level
Age                 62.50     62500.0             6.25
Income           62500.00  62500000.0          6250.00
Education level      6.25      6250.0             0.70


**Interpretation of the Covariance Matrix**:

`DataFrame.cov()` returns sample covariances (divisor \(n - 1\)). The sign of each entry gives the direction of the relationship; the magnitude depends on the units of the variables, so values cannot be compared across pairs.

- **Diagonal Elements**:
  - **Age (62.50)**: Variance of Age.
  - **Income (62,500,000.00)**: Variance of Income.
  - **Education level (0.70)**: Variance of Education level.

- **Off-Diagonal Elements**:
  - **Covariance between Age and Income (62,500.00)**: Positive; as Age increases, Income tends to increase. In this sample, Income is an exact linear function of Age.
  - **Covariance between Age and Education level (6.25)**: Positive; Education level tends to rise with Age. The value looks small only because Education level is measured on a small scale (1 to 3); the corresponding correlation is about 0.94.
  - **Covariance between Income and Education level (6,250.00)**: Positive; as Income increases, Education level tends to increase.
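Since covariance depends on the units of each variable, a scale-free view is often easier to interpret. The sketch below rebuilds the same sample data (as `df_q5`, an illustrative name) and converts the covariance matrix into correlations by dividing each entry by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

df_q5 = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 55000, 60000, 65000, 70000],
    "Education level": [1, 2, 2, 3, 3],
})

cov = df_q5.cov()

# Correlation = Cov(X, Y) / (std_X * std_Y); the variances are on the diagonal.
std = np.sqrt(np.diag(cov))
corr_manual = cov / np.outer(std, std)

# pandas computes the same Pearson correlation matrix directly.
corr = df_q5.corr()
print(corr.round(2))
```

All three pairwise correlations are high here (1.0 for Age and Income, about 0.94 for the pairs involving Education level), even though the raw covariances differ by several orders of magnitude.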


### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

**ANS:** For the given categorical variables, we can use different encoding methods based on the nature of each variable:

1. **Gender (Binary Categorical Variable)**:
   - **Encoding Method**: **Label Encoding**
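   - **Reason**: Since "Gender" has only two categories (Male and Female), label encoding is sufficient and simple: a single 0/1 column, for example 'Male' -> 0 and 'Female' -> 1. (scikit-learn's `LabelEncoder` assigns codes alphabetically, which would give 'Female' -> 0 and 'Male' -> 1.) Because the variable is binary, no additional columns are necessary and no misleading order is introduced.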

2. **Education Level (Ordinal Categorical Variable)**:
   - **Encoding Method**: **Ordinal Encoding**
   - **Reason**: "Education Level" has a natural order (High School < Bachelor's < Master's < PhD). Ordinal encoding assigns integer values that preserve this order: High School -> 1, Bachelor's -> 2, Master's -> 3, PhD -> 4. This method respects the inherent ranking of the categories.

3. **Employment Status (Nominal Categorical Variable)**:
   - **Encoding Method**: **One-Hot Encoding**
   - **Reason**: "Employment Status" is a nominal variable with no inherent order (Unemployed, Part-Time, Full-Time). One-hot encoding creates a binary column for each category. This approach avoids implying any order or hierarchy and allows each category to be represented independently.
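The three choices can be sketched together on a small, made-up DataFrame (`people` and its rows are hypothetical). `OrdinalEncoder` yields 0-based codes, so High School maps to 0 rather than 1; the ordering is what matters.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows for illustration only.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Binary variable: one 0/1 column with an explicit mapping.
people["Gender"] = people["Gender"].map({"Male": 0, "Female": 1})

# Ordinal variable: state the category order explicitly.
edu_order = [["High School", "Bachelor's", "Master's", "PhD"]]
edu_encoder = OrdinalEncoder(categories=edu_order)
people["Education Level"] = (
    edu_encoder.fit_transform(people[["Education Level"]]).ravel().astype(int)
)

# Nominal variable: one binary column per category.
encoded = pd.get_dummies(people, columns=["Employment Status"], dtype=int)
print(encoded)
```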

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [3]:
import pandas as pd

# Sample dataset
data = {
    'Temperature': [30, 25, 28, 32, 29],
    'Humidity': [60, 65, 70, 55, 75],
    'Weather Condition': [0, 1, 2, 0, 2],  # Encoded values
    'Wind Direction': [0, 1, 2, 3, 0]    # Encoded values
}

df = pd.DataFrame(data)

# Calculate covariance matrix
cov_matrix = df.cov()

print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
                   Temperature  Humidity  Weather Condition  Wind Direction
Temperature               6.70     -10.0              -1.25            1.05
Humidity                -10.00      62.5               7.50           -5.00
Weather Condition        -1.25       7.5               1.00           -0.25
Wind Direction            1.05      -5.0              -0.25            1.70


**Interpretation of Covariance Values:**

1. **Temperature and Humidity (-10.0)**:
   - Negative covariance: in this dataset, higher temperatures tend to come with lower humidity.

2. **Temperature and Weather Condition (-1.25)**:
   - Negative and small: Temperature tends to be slightly lower for higher Weather Condition codes.

3. **Temperature and Wind Direction (1.05)**:
   - Positive and small: as the Wind Direction code increases, Temperature tends to increase slightly.

4. **Humidity and Weather Condition (7.5)**:
   - Positive covariance: higher Weather Condition codes (e.g., Rainy, if it is the category encoded as 2) are associated with higher Humidity.

5. **Humidity and Wind Direction (-5.00)**:
   - Negative covariance: as the Wind Direction code increases, Humidity tends to decrease.

6. **Weather Condition and Wind Direction (-0.25)**:
   - Negative and close to zero: the two encoded variables show almost no linear association.

Weather Condition and Wind Direction are nominal variables, so their integer codes are arbitrary. Any covariance that involves them changes when the categories are numbered differently and should not be read as a real relationship; only the Temperature and Humidity covariance has a direct interpretation. The diagonal of the matrix holds the variances (6.70 for Temperature, 62.5 for Humidity).
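One way to examine these relationships without relying on arbitrary integer codes is sketched below, using the same sample values: a correlation for the continuous pair and per-category means for a nominal variable against a continuous one. The variable names are illustrative.

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature": [30, 25, 28, 32, 29],
    "Humidity": [60, 65, 70, 55, 75],
    "Weather Condition": [0, 1, 2, 0, 2],  # encoded values, treated as labels
})

# Continuous pair: Pearson correlation is the scale-free form of covariance.
temp_hum_corr = weather["Temperature"].corr(weather["Humidity"])

# Nominal vs. continuous: compare group means instead of a covariance.
humidity_by_weather = weather.groupby("Weather Condition")["Humidity"].mean()

print(round(temp_hum_corr, 2))  # -0.49
print(humidity_by_weather)
```

The group means (57.5, 65.0, and 72.5 for codes 0, 1, and 2) describe how Humidity differs between weather conditions regardless of how the categories are numbered.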