# Feature Engineering - 4

Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ans.

- **Ordinal Encoding** is used for categorical variables whose categories have a natural order. Each category is assigned an integer that reflects its rank. For example, a column named `size` with the values short, medium, and long has an inherent order (short < medium < long), so it can be encoded as 0, 1, 2.

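- **Label Encoding** assigns each unique category an arbitrary integer without regard to any ranking, so it suits a categorical variable with no natural order. For example, a column named `Color` with the values Red, Blue, and Green has no inherent order between the colors, so the integers act only as identifiers. Because a model may still read those integers as a ranking, label encoding is mostly used for target labels, and One Hot Encoding is often preferred for nominal input features.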
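The difference can be sketched with scikit-learn. This is a minimal illustration, assuming a small made-up DataFrame; the column names and values are hypothetical and only mirror the examples in the two bullets above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data mirroring the size/color examples above
df = pd.DataFrame({
    "size": ["short", "long", "medium", "short"],
    "color": ["Red", "Blue", "Green", "Red"],
})

# Ordinal encoding: the category order is passed explicitly,
# so short < medium < long becomes 0 < 1 < 2.
size_encoder = OrdinalEncoder(categories=[["short", "medium", "long"]])
df["size_encoded"] = size_encoder.fit_transform(df[["size"]]).ravel()

# Label encoding: codes follow the sorted (alphabetical) order of the
# category names and carry no ranking: Blue=0, Green=1, Red=2.
color_encoder = LabelEncoder()
df["color_encoded"] = color_encoder.fit_transform(df["color"])

print(df)
```

Note that `OrdinalEncoder` works on 2-D input (one or more feature columns), while `LabelEncoder` works on a single 1-D array.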

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Ans. Target Guided Ordinal Encoding orders the categories of a feature using a statistic of the target variable, usually the mean of the target within each category. It is appropriate for a categorical variable that is strongly associated with the target, for example when predicting income from education level.

In Target Guided Ordinal Encoding, we calculate the mean of the target variable for each category. The categories can then be ranked by that mean and replaced by their rank, or, as in the example below, each category can be replaced directly by its mean.

For example, consider the following sample dataset:

In [25]:
import pandas as pd

data = {
    "Age": [25, 30, 22, 45, 28, 35, 40, 26, 31, 50],
    "EducationLevel": ["High School", "Bachelor's", "Associate's", "Master's", "Bachelor's", 
                       "Master's", "PhD", "Associate's", "Bachelor's", "Master's"],
    "Income": [50000, 75000, 30000, 90000, 60000, 85000, 100000, 55000, 70000, 120000]
}

df = pd.DataFrame(data)

print('Our Sample Data is:')
df

Our Sample Data is:


| | Age | EducationLevel | Income |
|---|---|---|---|
| 0 | 25 | High School | 50000 |
| 1 | 30 | Bachelor's | 75000 |
| 2 | 22 | Associate's | 30000 |
| 3 | 45 | Master's | 90000 |
| 4 | 28 | Bachelor's | 60000 |
| 5 | 35 | Master's | 85000 |
| 6 | 40 | PhD | 100000 |
| 7 | 26 | Associate's | 55000 |
| 8 | 31 | Bachelor's | 70000 |
| 9 | 50 | Master's | 120000 |


In [33]:
mean_income = df.groupby('EducationLevel')['Income'].mean().to_dict()
mean_income

{"Associate's": 42500.0,
 "Bachelor's": 68333.33333333333,
 'High School': 50000.0,
 "Master's": 98333.33333333333,
 'PhD': 100000.0}

The dictionary above gives the mean income for each category of EducationLevel. To encode the column, we map each category to its mean, as shown:

In [36]:
df['Encoded_EducationLevel'] = df['EducationLevel'].map(mean_income)
df

| | Age | EducationLevel | Income | Encoded_EducationLevel |
|---|---|---|---|---|
| 0 | 25 | High School | 50000 | 50000.0 |
| 1 | 30 | Bachelor's | 75000 | 68333.333333 |
| 2 | 22 | Associate's | 30000 | 42500.0 |
| 3 | 45 | Master's | 90000 | 98333.333333 |
| 4 | 28 | Bachelor's | 60000 | 68333.333333 |
| 5 | 35 | Master's | 85000 | 98333.333333 |
| 6 | 40 | PhD | 100000 | 100000.0 |
| 7 | 26 | Associate's | 55000 | 42500.0 |
| 8 | 31 | Bachelor's | 70000 | 68333.333333 |
| 9 | 50 | Master's | 120000 | 98333.333333 |
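The rank-based variant of the same idea can be sketched as follows. This is a minimal illustration using the same sample values; the names `rank_map` and `EducationLevel_rank` are introduced here for illustration only. In a real project the means would normally be computed on the training split only, so that target information does not leak into validation data.

```python
import pandas as pd

df = pd.DataFrame({
    "EducationLevel": ["High School", "Bachelor's", "Associate's", "Master's", "Bachelor's",
                       "Master's", "PhD", "Associate's", "Bachelor's", "Master's"],
    "Income": [50000, 75000, 30000, 90000, 60000, 85000, 100000, 55000, 70000, 120000],
})

# Mean of the target for each category
mean_income = df.groupby("EducationLevel")["Income"].mean()

# Sort categories by mean target value and number them 1..k
ordered_categories = mean_income.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

# Replace each category by its rank instead of its raw mean
df["EducationLevel_rank"] = df["EducationLevel"].map(rank_map)

print(rank_map)
print(df)
```

Using ranks keeps the encoded values on a small integer scale while preserving the ordering implied by the target means.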


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans. Covariance measures how two variables vary together, and its sign gives the direction of their linear relationship. A positive covariance means that when one variable increases the other tends to increase, and when one decreases the other tends to decrease. A negative covariance means that one tends to increase as the other decreases, and a value near zero suggests no linear relationship. Covariance has no upper or lower limit and depends on the units of the variables, so its magnitude alone does not tell us how strong the relationship is. The correlation coefficient, which divides the covariance by both standard deviations, is used for that.

Covariance is important in statistical analysis because it helps us see how features vary together and how a feature relates to the target variable.

Formula for the sample covariance:
 $$ \operatorname{cov}\left(x,y\right)  = \frac{\sum\limits_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}) }{N-1} $$
 where,
- $\bar{x}$ is the mean of the $x$ data
- $\bar{y}$ is the mean of the $y$ data
- $N$ is the number of samples
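The formula can be checked numerically. The sketch below applies it term by term and compares the result with NumPy's `np.cov`, which also divides by $N-1$ by default. The helper name `sample_covariance` is introduced here for illustration, and the data are the Age and Income values used elsewhere in this notebook.

```python
import numpy as np

x = np.array([25, 30, 22, 45, 28, 35, 40, 26, 31, 50], dtype=float)  # Age
y = np.array([50000, 75000, 30000, 90000, 60000,
              85000, 100000, 55000, 70000, 120000], dtype=float)      # Income

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by N - 1."""
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

manual = sample_covariance(x, y)
library = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix

print(manual, library)
```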

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

Ans. A solution using Python is shown below:

In [3]:
import pandas as pd
df = pd.DataFrame({'color':['red','blue','green','green','red','blue'],
    'size':['small','medium','small','large','small', 'medium'],
    'material':['wood','metal','plastic','wood','metal','plastic']})
df

| | color | size | material |
|---|---|---|---|
| 0 | red | small | wood |
| 1 | blue | medium | metal |
| 2 | green | small | plastic |
| 3 | green | large | wood |
| 4 | red | small | metal |
| 5 | blue | medium | plastic |


In [11]:
from sklearn.preprocessing import LabelEncoder

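# fit_transform refits the encoder on every call, so one instance can be
# reused here; after the last call it only remembers the 'material' classes.
label = LabelEncoder()
df['color_encoded'] = label.fit_transform(df['color'])
df['size_encoded'] = label.fit_transform(df['size'])
df['material_encoded'] = label.fit_transform(df['material'])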

print('Label Encoding output : ')
df

Label Encoding output : 


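| | color | size | material | color_encoded | size_encoded | material_encoded |
|---|---|---|---|---|---|---|
| 0 | red | small | wood | 2 | 2 | 2 |
| 1 | blue | medium | metal | 0 | 1 | 0 |
| 2 | green | small | plastic | 1 | 2 | 1 |
| 3 | green | large | wood | 1 | 0 | 2 |
| 4 | red | small | metal | 2 | 2 | 0 |
| 5 | blue | medium | plastic | 0 | 1 | 1 |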


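Here, each unique category in a column is assigned an integer from 0 to n-1, where n is the number of unique categories in that column. In this example n=3 for every column, so the codes are 0, 1, 2. `LabelEncoder` assigns the codes in sorted (alphabetical) order of the category names, so for color, blue gets 0, green gets 1 and red gets 2, and so on. Note that for size this gives large=0, medium=1 and small=2, which does not match the natural order small < medium < large. In this way, our data gets encoded.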
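To inspect the mapping or decode the integers later, one option is to keep a separate fitted encoder per column. The sketch below assumes the same sample data; the `encoders` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue'],
                   'size': ['small', 'medium', 'small', 'large', 'small', 'medium'],
                   'material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic']})

# One encoder per column, so each fitted mapping stays available
encoders = {}
for column in ['color', 'size', 'material']:
    encoder = LabelEncoder()
    df[column + '_encoded'] = encoder.fit_transform(df[column])
    encoders[column] = encoder

# classes_ lists the categories in code order: index 0 is code 0, and so on
for column, encoder in encoders.items():
    print(column, list(encoder.classes_))

# Decode the integer codes back to the original category names
decoded_size = encoders['size'].inverse_transform(df['size_encoded'])
print(list(decoded_size))
```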

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

Ans. A solution using Python is shown below:

In [37]:
import pandas as pd

data = {
    "Age": [25, 30, 22, 45, 28, 35, 40, 26, 31, 50],
    "EducationLevel": ["High School", "Bachelor's", "Associate's", "Master's", "Bachelor's", 
                       "Master's", "PhD", "Associate's", "Bachelor's", "Master's"],
    "Income": [50000, 75000, 30000, 90000, 60000, 85000, 100000, 55000, 70000, 120000]
}

df = pd.DataFrame(data)

print('Our Sample Data is:')
df

Our Sample Data is:


| | Age | EducationLevel | Income |
|---|---|---|---|
| 0 | 25 | High School | 50000 |
| 1 | 30 | Bachelor's | 75000 |
| 2 | 22 | Associate's | 30000 |
| 3 | 45 | Master's | 90000 |
| 4 | 28 | Bachelor's | 60000 |
| 5 | 35 | Master's | 85000 |
| 6 | 40 | PhD | 100000 |
| 7 | 26 | Associate's | 55000 |
| 8 | 31 | Bachelor's | 70000 |
| 9 | 50 | Master's | 120000 |


In [20]:
#calculating covariance

df[['Age','Income']].cov()

| | Age | Income |
|---|---|---|
| Age | 84.177778 | 229222.2 |
| Income | 229222.222222 | 694722200.0 |


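Only Age and Income are numerical features, so only they can be used directly to calculate covariance. EducationLevel would first have to be converted to numbers, for example with ordinal encoding.

The diagonal entries are the variances of Age (about 84.18) and Income (about 6.95 × 10^8), and the off-diagonal entry is the covariance between Age and Income (about 229,222). This covariance is positive, so in this sample income tends to increase as age increases. Its large magnitude mainly reflects the scale of the Income values and does not by itself show how strong the relationship is.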
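One way to bring EducationLevel into the matrix is to encode it ordinally first. The sketch below assumes the order High School < Associate's < Bachelor's < Master's < PhD, which is an assumption made for illustration, as are the names `education_order` and `EducationLevel_encoded`. It also prints the correlation matrix, which is unit-free and easier to compare across pairs.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 22, 45, 28, 35, 40, 26, 31, 50],
    "EducationLevel": ["High School", "Bachelor's", "Associate's", "Master's", "Bachelor's",
                       "Master's", "PhD", "Associate's", "Bachelor's", "Master's"],
    "Income": [50000, 75000, 30000, 90000, 60000, 85000, 100000, 55000, 70000, 120000],
})

# Assumed ranking of the education levels (not given in the question)
education_order = {"High School": 0, "Associate's": 1, "Bachelor's": 2,
                   "Master's": 3, "PhD": 4}
df["EducationLevel_encoded"] = df["EducationLevel"].map(education_order)

columns = ["Age", "Income", "EducationLevel_encoded"]
cov_matrix = df[columns].cov()    # sample covariance, divides by N - 1
corr_matrix = df[columns].corr()  # Pearson correlation, always in [-1, 1]

print(cov_matrix)
print(corr_matrix)
```

The covariance values involving the encoded column depend on the integers chosen for each level, so they should be read for their sign rather than their size.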

Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Ans. The encoding methods which should be used are as follows:

- **Gender :** here the two values of gender are Male and Female. We should use One Hot Encoding because gender data has no order or ranking; both genders have equal value. With only two categories, this reduces to a single 0/1 column if one of the two dummy columns is dropped.

- **Education Level :** here the values are High School/Bachelor's/Master's/PhD. We note that these values are ordered: PhD is highest, Master's is second, and so on. Hence, in this case we should use Ordinal Encoding.

- **Employment Status :** here also the values Unemployed/Part-Time/Full-Time can be considered ordered, as there is an inherent order in them. Hence, in this case also we should use Ordinal Encoding.
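The three choices above can be sketched together as follows. The four sample rows are made up for illustration, and the category orders passed to `OrdinalEncoder` are the ones described in the bullets.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Part-Time", "Unemployed", "Full-Time"],
})

# One Hot Encoding for the unordered variable
gender_dummies = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)

# Ordinal Encoding for the two ordered variables, lowest category first
ordinal_columns = ["Education Level", "Employment Status"]
ordinal_encoder = OrdinalEncoder(categories=[
    ["High School", "Bachelor's", "Master's", "PhD"],
    ["Unemployed", "Part-Time", "Full-Time"],
])
ordinal_values = ordinal_encoder.fit_transform(df[ordinal_columns])
ordinal_df = pd.DataFrame(ordinal_values, columns=ordinal_columns, index=df.index)

encoded = pd.concat([gender_dummies, ordinal_df], axis=1)
print(encoded)
```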

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

Ans. The Python code below uses illustrative sample values:

In [21]:
import pandas as pd

data = {
    "Temperature": [25.5, 22.0, 28.3, 20.8, 23.7, 26.1, 24.4, 27.9, 21.6, 29.2],
    "Humidity": [60, 75, 45, 85, 50, 62, 68, 55, 80, 40],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Cloudy", "Sunny", "Rainy", "Cloudy", "Sunny", "Rainy", "Sunny"],
    "Wind Direction": ["North", "South", "East", "West", "North", "South", "East", "West", "North", "South"]
}

df = pd.DataFrame(data)

df

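| | Temperature | Humidity | Weather Condition | Wind Direction |
|---|---|---|---|---|
| 0 | 25.5 | 60 | Sunny | North |
| 1 | 22.0 | 75 | Cloudy | South |
| 2 | 28.3 | 45 | Rainy | East |
| 3 | 20.8 | 85 | Cloudy | West |
| 4 | 23.7 | 50 | Sunny | North |
| 5 | 26.1 | 62 | Rainy | South |
| 6 | 24.4 | 68 | Cloudy | East |
| 7 | 27.9 | 55 | Sunny | West |
| 8 | 21.6 | 80 | Rainy | North |
| 9 | 29.2 | 40 | Sunny | South |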


In [24]:
df[['Temperature','Humidity']].cov()

| | Temperature | Humidity |
|---|---|---|
| Temperature | 8.736111 | -39.4 |
| Humidity | -39.4 | 225.333333 |


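Only Temperature and Humidity are numerical features, so only they can be used directly to calculate covariance. Weather Condition and Wind Direction are nominal categories, so covariance is not defined for them unless they are first converted to numbers, for example with One Hot Encoding.

The covariance between Temperature and Humidity is negative (-39.4), so in this sample humidity tends to decrease as temperature increases. The diagonal entries are the variances of Temperature (about 8.74) and Humidity (about 225.33).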
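If the categorical variables must be included, one option is to One Hot Encode them and compute the covariance with the resulting 0/1 indicator columns. The sketch below does this for Weather Condition only, using the same sample values; the `Weather_` prefix is a name chosen here for illustration. A positive covariance between Temperature and an indicator means the temperature is above average, on average, when that condition holds.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25.5, 22.0, 28.3, 20.8, 23.7, 26.1, 24.4, 27.9, 21.6, 29.2],
    "Humidity": [60, 75, 45, 85, 50, 62, 68, 55, 80, 40],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Cloudy", "Sunny",
                          "Rainy", "Cloudy", "Sunny", "Rainy", "Sunny"],
})

# One 0/1 indicator column per weather condition
weather_dummies = pd.get_dummies(df["Weather Condition"], prefix="Weather", dtype=float)

# Covariance matrix over the numeric columns plus the indicators
numeric = pd.concat([df[["Temperature", "Humidity"]], weather_dummies], axis=1)
cov_matrix = numeric.cov()

print(cov_matrix.round(3))
```

Wind Direction could be handled the same way, although covariances with indicator columns are harder to interpret than those between continuous variables.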