# Feature Engineering 5

**Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.**

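Label Encoding assigns a unique integer to each category in a categorical variable, regardless of any inherent order. In scikit-learn, `LabelEncoder` numbers the categories in sorted (alphabetical) order and is intended for encoding the target variable.

Ordinal Encoding also assigns integers to categories, but the integers follow a meaningful order that you specify, so a larger code means a higher rank.

When to use which:

Label Encoding: Use when there's no inherent order in the categories (e.g., color, gender). The integers still imply an arbitrary order, so this suits the target variable or models that do not rely on numeric distance, such as tree-based models.

Ordinal Encoding: Use when there's a clear order or ranking among the categories (e.g., education level, product rating).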
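
The difference shows up when both scikit-learn encoders are applied to the same ranked variable. The following is a minimal sketch; the `ratings` values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical product ratings with a natural ranking: poor < good < excellent
ratings = np.array(["good", "poor", "excellent", "good"])

# LabelEncoder sorts the classes alphabetically: excellent=0, good=1, poor=2
label_codes = LabelEncoder().fit_transform(ratings)

# OrdinalEncoder accepts an explicit ranking: poor=0, good=1, excellent=2
ordinal = OrdinalEncoder(categories=[["poor", "good", "excellent"]])
ordinal_codes = ordinal.fit_transform(ratings.reshape(-1, 1)).ravel()

print(label_codes)    # [1 2 0 1]
print(ordinal_codes)  # [1. 0. 2. 1.]
```

Only the ordinal codes respect the ranking; the label codes place "excellent" below "poor" simply because of alphabetical sorting.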

**Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.**

Target Guided Ordinal Encoding assigns integer labels to categories based on the target variable. Compute the mean target value for each category, sort the categories by that mean, and use each category's rank as its code.

Example:
In a customer churn prediction model, you might use target guided ordinal encoding for a "tenure" feature that has been grouped into categories. Customers with higher tenure might be less likely to churn. By ordering the tenure categories based on churn rate, you can capture this information in the encoding.
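
The churn example can be sketched with pandas. The column names `tenure_group` and `churn` and all the values are hypothetical, chosen only to illustrate the steps.

```python
import pandas as pd

# Hypothetical toy data: tenure bucket and churn flag (1 = churned)
df = pd.DataFrame({
    "tenure_group": ["0-1y", "0-1y", "1-3y", "1-3y", "3y+", "3y+", "0-1y", "3y+"],
    "churn":        [1, 1, 1, 0, 0, 0, 0, 0],
})

# Mean target (churn rate) per category
churn_rate = df.groupby("tenure_group")["churn"].mean()

# Rank the categories from lowest to highest churn rate
ordered = churn_rate.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered)}

df["tenure_encoded"] = df["tenure_group"].map(mapping)
print(mapping)
print(df)
```

Because the mapping is derived from the target, it should be learned on the training split only and then applied to validation and test data, otherwise target information leaks into the features.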

**Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?**

Covariance measures how two variables vary together, that is, the direction of their linear relationship. A positive covariance indicates that the two variables tend to move in the same direction, while a negative covariance suggests they move in opposite directions. A covariance of zero implies no linear relationship, although a non-linear relationship may still exist.

Importance:

- Understanding relationships between variables
- Feature selection
- Portfolio management
- Risk assessment

Calculation:
`cov(X, Y) = E[(X - E[X])(Y - E[Y])]`

where:
- E[] is the expected value
- X and Y are the two variables
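
For a sample of n observations, the expectation is replaced by an average over the data and the sum is usually divided by n - 1: `cov(X, Y) = sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1)`. A minimal sketch comparing that formula with `np.cov`, using made-up numbers:

```python
import numpy as np

# Made-up sample values for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of products of deviations divided by (n - 1)
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also uses the n - 1 denominator by default.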

**Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.**

In [3]:
pip install scikit-learn



In [4]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'Color': ['red', 'green', 'blue', 'red'],
        'Size': ['small', 'medium', 'large', 'small'],
        'Material': ['wood', 'metal', 'plastic', 'wood']}
df = pd.DataFrame(data)

# Fit a fresh LabelEncoder per column so each column gets its own mapping
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      2     2         2

`LabelEncoder` sorts each column's categories alphabetically and numbers them from 0:

- Color: blue = 0, green = 1, red = 2
- Size: large = 0, medium = 1, small = 2
- Material: metal = 0, plastic = 1, wood = 2

Rows 0 and 3 are identical because both describe a small red wooden item. Note that the Size codes run against the natural order (small = 2, large = 0), which is why an ordinal encoding with an explicit order is the better fit for Size.


**Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.**

Note: The question does not supply data, and a covariance matrix needs actual numerical values for Age, Income, and Education level. The cell below therefore uses a small made-up sample in a pandas DataFrame:

In [16]:
import pandas as pd

# Small illustrative sample (made-up values)
data = {'Name': ['Jack', 'John', 'Jenny'],
        'Age': [28, 21, 35],
        'Income': [2000000, 1800000, 2400000],
        'Education_level': [12+4, 12+6, 12+9]}

df = pd.DataFrame(data)

# Sample covariance matrix (n - 1 denominator) for the numerical columns only
covariance_matrix = df[['Age', 'Income', 'Education_level']].cov()
print(covariance_matrix)


                       Age        Income  Education_level
Age                   49.0  2.100000e+06        10.500000
Income           2100000.0  9.333333e+10    566666.666667
Education_level       10.5  5.666667e+05         6.333333


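The output is a 3x3 symmetric matrix. The diagonal elements are the variances of each variable, and the off-diagonal elements are the covariances between pairs of variables.

Interpretation:

- Positive covariance: Two variables tend to increase or decrease together.
- Negative covariance: Two variables tend to move in opposite directions.
- Magnitude: Covariance is expressed in the product of the two variables' units, so its size reflects scale as much as strength. The Age and Income value (2.1e+06) is large mainly because Income is measured in millions. To compare the strength of relationships, use the correlation coefficient.

In this sample every off-diagonal entry is positive, so Age, Income, and Education_level tend to increase together. With only three rows, this is an illustration rather than evidence.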
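
Since covariance depends on the units of each variable, a scale-free comparison needs the correlation matrix. A minimal sketch on the same made-up sample as above, using pandas' `DataFrame.corr` (Pearson by default):

```python
import pandas as pd

# Same illustrative sample as in the covariance cell
df = pd.DataFrame({
    "Age": [28, 21, 35],
    "Income": [2000000, 1800000, 2400000],
    "Education_level": [16, 18, 21],
})

# Pearson correlation = covariance / (std_x * std_y), always within [-1, 1]
correlation_matrix = df.corr()
print(correlation_matrix.round(3))
```

On this sample, Age and Income correlate at about 0.98 and Age and Education_level at about 0.60, even though their covariances differ by several orders of magnitude.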

**Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?**

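- Gender: Label encoding. The variable has two categories with no inherent order, so it maps cleanly to 0/1.
- Education Level: Ordinal encoding. High School < Bachelor's < Master's < PhD is a clear hierarchy, and the integer codes should preserve it.
- Employment Status: Label encoding when the categories are treated as unordered. If the analysis instead reads them as a scale of working hours (Unemployed < Part-Time < Full-Time), ordinal encoding is a reasonable alternative.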
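
These choices can be sketched with scikit-learn as follows. The rows are hypothetical; only the column names and categories come from the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

encoded = pd.DataFrame(index=df.index)

# Education Level: explicit ranking, High School=0 up to PhD=3
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
ordinal = OrdinalEncoder(categories=education_order)
encoded["Education Level"] = (
    ordinal.fit_transform(df[["Education Level"]]).ravel().astype(int)
)

# Gender and Employment Status: alphabetical integer codes, no ranking intended
for col in ["Gender", "Employment Status"]:
    encoded[col] = LabelEncoder().fit_transform(df[col])

print(encoded)
```

The label-encoded columns get alphabetical codes (Female=0, Male=1; Full-Time=0, Part-Time=1, Unemployed=2), which carry no ranking and should not be read as one.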

**Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.**

In [None]:
import pandas as pd

# Same sample values as the NumPy cell below; the categorical columns hold integer codes
data = {'Temperature': [25, 22, 27, 30, 24],
        'Humidity': [60, 65, 55, 50, 70],
        'Weather Condition': [0, 1, 0, 2, 1],
        'Wind Direction': [0, 1, 2, 3, 0]}
df = pd.DataFrame(data)

# Covariance between the two continuous variables
covariance_matrix = df[['Temperature', 'Humidity']].cov()
print(covariance_matrix)

# Covariance between the two integer-coded categorical variables
covariance_matrix = df[['Weather Condition', 'Wind Direction']].cov()
print(covariance_matrix)

In [18]:
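import numpy as np

# Continuous variables (illustrative sample values)
temperature = np.array([25, 22, 27, 30, 24])
humidity = np.array([60, 65, 55, 50, 70])

# Categorical variables, already converted to arbitrary integer codes
weather_condition = np.array([0, 1, 0, 2, 1])
wind_direction = np.array([0, 1, 2, 3, 0])

# np.cov treats each row as one variable
data = np.vstack([temperature, humidity, weather_condition, wind_direction])

# bias=False (the default) gives the sample covariance with an n - 1 denominator
cov_matrix = np.cov(data, bias=False)
print("Covariance Matrix:\n", cov_matrix)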

Covariance Matrix:
 [[  9.3  -21.25   0.9    3.1 ]
 [-21.25  62.5   -1.25  -8.75]
 [  0.9   -1.25   0.7    0.55]
 [  3.1   -8.75   0.55   1.7 ]]


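Rows and columns are in the order Temperature, Humidity, Weather Condition, Wind Direction.

- Temperature and Humidity: Negative covariance (-21.25). Humidity tends to fall as temperature rises in this sample.
- Temperature and Weather Condition: Positive covariance (0.9).
- Temperature and Wind Direction: Positive covariance (3.1).
- Humidity and Weather Condition: Negative covariance (-1.25).
- Humidity and Wind Direction: Negative covariance (-8.75).
- Weather Condition and Wind Direction: Positive covariance (0.55).

Only the Temperature and Humidity value has a direct interpretation. Weather Condition and Wind Direction are nominal, so their integer codes are arbitrary, and a different coding would change the size and even the sign of every covariance that involves them. Covariance is also scale-dependent, so its magnitude alone does not show whether a relationship is slight, moderate, or strong; that comparison requires the correlation coefficient.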
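
Two checks help when reading these numbers: rescaling the Temperature and Humidity covariance into a correlation, and re-coding a categorical variable to see how its covariance reacts. A minimal sketch, where `wind_b` is a hypothetical relabelling of the same wind directions (0→2, 1→3, 2→0, 3→1):

```python
import numpy as np

temperature = np.array([25, 22, 27, 30, 24])
humidity = np.array([60, 65, 55, 50, 70])

# Pearson correlation rescales covariance to the range [-1, 1]
corr_temp_humidity = np.corrcoef(temperature, humidity)[0, 1]
print(round(corr_temp_humidity, 2))  # -0.88

# The same wind directions under two arbitrary integer codings
wind_a = np.array([0, 1, 2, 3, 0])  # coding used in the cell above
wind_b = np.array([2, 3, 0, 1, 2])  # hypothetical relabelling of the same categories

cov_a = np.cov(temperature, wind_a)[0, 1]
cov_b = np.cov(temperature, wind_b)[0, 1]
print(cov_a, cov_b)
```

The correlation of about -0.88 indicates a strong negative linear relationship in this sample. The covariance with wind direction flips from positive to negative under the second coding, which shows that it reflects the labelling rather than the weather.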