### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

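Ordinal Encoding and Label Encoding both convert categorical data into integers, but they differ in whether the order of those integers carries meaning.

- **Ordinal Encoding** is used when the categories have an inherent order or ranking, and the codes are assigned to follow that order. For example, the education levels 'High School', 'Bachelor's', 'Master's', and 'PhD' have a natural order, so they can be mapped to 0, 1, 2, and 3.
- **Label Encoding** assigns a unique integer to each category without regard to any order. It suits categorical variables with no inherent order, such as the colors 'red', 'green', and 'blue'. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order and is intended for target labels rather than input features; when it is applied to a feature, a model may read the arbitrary integers as a ranking.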

**Example:**

In [1]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

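data = {'Education': ['High School', 'Bachelor', 'Master', 'PhD']}
df = pd.DataFrame(data)
# Pass the levels from lowest to highest so the codes follow the ranking
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
# OrdinalEncoder expects 2D input, hence the double brackets
df['Education_encoded'] = encoder.fit_transform(df[['Education']])
print(df)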

     Education  Education_encoded
0  High School                0.0
1     Bachelor                1.0
2       Master                2.0
3          PhD                3.0
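
For contrast, here is a minimal sketch of what scikit-learn's `LabelEncoder` does with the same education values. It ignores the intended ranking and numbers the categories in sorted order, which is why an explicit `categories` list was passed to `OrdinalEncoder` above. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

levels = ['High School', 'Bachelor', 'Master', 'PhD']

le = LabelEncoder()
codes = le.fit_transform(levels)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): i for i, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Here 'Bachelor' receives a lower code than 'High School', so the codes do not reflect the educational ranking.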


### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

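Target Guided Ordinal Encoding orders the categories by a statistic of the target variable, usually the mean target value per category, and then assigns integer ranks in that order. It is useful when a nominal variable is related to the target, especially one with many categories, such as a city name in a price-prediction task. The mapping should be computed on the training data only, otherwise target information leaks into the features.

**Example:** the cell below computes the per-category mean and maps each category to that mean directly. Replacing the means with their rank order (here A < C < B) gives the ordinal codes.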

In [2]:
import pandas as pd
import numpy as np

# Sample Data
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C'],
        'Target': [1, 2, 3, 2, 3, 1]}
df = pd.DataFrame(data)

# Calculate mean target value per category
mean_target = df.groupby('Category')['Target'].mean()
df['Category_encoded'] = df['Category'].map(mean_target)
print(df)


  Category  Target  Category_encoded
0        A       1               1.5
1        B       2               2.5
2        C       3               2.0
3        A       2               1.5
4        B       3               2.5
5        C       1               2.0
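
The output above holds the per-category means themselves. To obtain ordinal codes, the categories can be sorted by their mean and numbered in that order. A minimal sketch on the same data, where `rank_map` and `Category_rank` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Target': [1, 2, 3, 2, 3, 1]})

# Mean target value per category
mean_target = df.groupby('Category')['Target'].mean()

# Sort categories by their mean and number them from 0 upwards
ordered_categories = mean_target.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered_categories)}

df['Category_rank'] = df['Category'].map(rank_map)
print(rank_map)
print(df)
```

In a real project the ranks would be learned from the training split and then applied unchanged to validation and test data.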


### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

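Covariance measures the degree to which two variables vary together. It is positive when they tend to increase together, negative when one tends to decrease as the other increases, and close to zero when there is no linear relationship. It matters in statistical analysis because it describes the direction of the relationship between variables and underlies correlation and linear regression. Its magnitude depends on the units of the variables, so values are not directly comparable across different pairs.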

Formula:
$$
\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
$$
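
The formula can be checked by hand. The sketch below applies it term by term to the Temperature and Humidity values used in Q7 and compares the result with NumPy's `np.cov`, which also divides by n − 1 by default. The variable names are illustrative.

```python
import numpy as np

x = np.array([30, 25, 27, 35, 20], dtype=float)   # Temperature values from Q7
y = np.array([70, 65, 75, 80, 60], dtype=float)   # Humidity values from Q7
n = len(x)

# Sum of products of deviations from the means, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```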

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample Data
data = {'Color': ['red', 'green', 'blue'],
        'Size': ['small', 'medium', 'large'],
        'Material': ['wood', 'metal', 'plastic']}
df = pd.DataFrame(data)

# Label Encoding
label_encoders = {}
for column in df.columns:
    le = LabelEncoder()
    df[column + '_encoded'] = le.fit_transform(df[column])
    label_encoders[column] = le

print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
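
Explanation of the output: `LabelEncoder` sorts the unique values of each column and numbers them from 0, so Color becomes blue=0, green=1, red=2, Size becomes large=0, medium=1, small=2, and Material becomes metal=0, plastic=1, wood=2. The codes are identifiers only; for Size they even run opposite to the natural order. A small sketch of how the mapping can be inspected and reversed for one column:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['red', 'green', 'blue'])

# classes_ lists the categories in sorted order; index = assigned code
classes = [str(c) for c in le.classes_]
print(classes)
print(codes.tolist())

# inverse_transform maps codes back to the original labels
restored = [str(c) for c in le.inverse_transform(codes)]
print(restored)
```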


### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [8]:
import pandas as pd

# Sample Data
data = {'Age': [25, 32, 47, 51],
        'Income': [50000, 60000, 120000, 150000],
        'Education': [12, 16, 18, 20]}
df = pd.DataFrame(data)

# Covariance Matrix
cov_matrix = df.cov()
print(cov_matrix)


                     Age        Income      Education
Age           150.916667  5.783333e+05      40.166667
Income     578333.333333  2.300000e+09  150000.000000
Education      40.166667  1.500000e+05      11.666667
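
Interpretation: every off-diagonal entry is positive, so in this sample Age, Income, and Education tend to increase together. The diagonal holds the variances, for example about 150.92 for Age. The entries cannot be compared with one another because each carries the units of its two variables, and Income, measured in much larger numbers, dominates the matrix. One common way to put the pairs on a common scale is to divide each covariance by the product of the two standard deviations, which gives the Pearson correlation. A sketch of that normalization, with `corr_from_cov` as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 47, 51],
                   'Income': [50000, 60000, 120000, 150000],
                   'Education': [12, 16, 18, 20]})

cov_matrix = df.cov()

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix.to_numpy()))

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov_matrix / np.outer(std, std)
print(corr_from_cov.round(3))
```

The result should agree with `df.corr()`, and every value lies between -1 and 1.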


### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

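- **Gender** (Male/Female): Label Encoding. The variable is binary with no inherent order, so a 0/1 code is enough.
- **Education Level** (High School/Bachelor's/Master's/PhD): Ordinal Encoding, because the levels have a natural order that the codes should preserve.
- **Employment Status** (Unemployed/Part-Time/Full-Time): Label Encoding if the statuses are treated as unordered. With three categories, linear or distance-based models can read the integer codes as a ranking, so One-Hot Encoding is the safer choice for a nominal reading. If the statuses are instead read as increasing working time, Ordinal Encoding with that explicit order also fits.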
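
The choices above could be applied as in the following sketch. The four sample rows are hypothetical and exist only to make the code runnable; the `_encoded` column names are likewise illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample rows, invented for illustration
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', "Master's", "Bachelor's", 'PhD'],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# Ordered variable: pass the levels from lowest to highest
edu_order = [['High School', "Bachelor's", "Master's", 'PhD']]
edu_encoder = OrdinalEncoder(categories=edu_order)
df['Education_encoded'] = edu_encoder.fit_transform(df[['Education Level']]).ravel()

# Variables treated as unordered: LabelEncoder numbers them alphabetically
for column in ['Gender', 'Employment Status']:
    df[column + '_encoded'] = LabelEncoder().fit_transform(df[column])

print(df)
```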


### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

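Covariance is defined for numeric variables, so it can be computed directly only for Temperature and Humidity. Weather Condition and Wind Direction are nominal: their categories have no numeric magnitude, so they would first have to be encoded (for example as one-hot indicator columns) before a covariance involving them is meaningful. The calculation below therefore covers the two continuous variables, using sample values.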

In [9]:
import numpy as np
import pandas as pd

# Sample Data
data = {'Temperature': [30, 25, 27, 35, 20],
        'Humidity': [70, 65, 75, 80, 60]}
df = pd.DataFrame(data)

# Covariance Calculation
cov_matrix = df.cov()
print(cov_matrix)


             Temperature  Humidity
Temperature         31.3      40.0
Humidity            40.0      62.5


Interpretation:

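The covariance between Temperature and Humidity is 40.0. It is positive, so in this sample higher temperatures tend to occur together with higher humidity. The diagonal entries are the variances: 31.3 for Temperature and 62.5 for Humidity. The value 40.0 depends on the units of both variables, so it shows the direction of the relationship rather than its strength.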
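
If a covariance involving a categorical variable is still wanted, one option is to one-hot encode it first and treat each indicator column as numeric. The sketch below does this for Weather Condition. The weather labels are hypothetical, invented only to make the example runnable, and pandas' `get_dummies` is one possible way to build the indicators.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [30, 25, 27, 35, 20],
    'Humidity': [70, 65, 75, 80, 60],
    # Hypothetical labels, not part of the original data
    'Weather Condition': ['Sunny', 'Cloudy', 'Cloudy', 'Sunny', 'Rainy'],
})

# One indicator column (0.0 or 1.0) per weather category
indicators = pd.get_dummies(df['Weather Condition'], dtype=float)
numeric = pd.concat([df[['Temperature', 'Humidity']], indicators], axis=1)

cov_matrix = numeric.cov()
print(cov_matrix.round(2))
```

A positive covariance between Temperature and an indicator column means that category tends to coincide with above-average temperatures in the sample.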