Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

#### Answer -

**Ordinal Encoding -** Ordinal Encoding assigns an integer to each category according to an inherent order that you specify, so the numeric codes preserve the ranking (for example, 8th < 10th < 12th). Use it when the categories have a meaningful order.

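**Label Encoding -** Label Encoding assigns a unique integer to each category without regard to any ranking; scikit-learn's `LabelEncoder` simply numbers the categories in sorted (alphabetical) order. It suits nominal categories, where no order or ranking is present, and in scikit-learn it is intended for encoding target labels rather than input features.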

In [1]:
import pandas as pd
df1 = pd.DataFrame({"color":["Red","Blue","Green"]})
df2 = pd.DataFrame({"class":["10th","8th","12th"]})

In [2]:
df1

|   | color |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |


In [3]:
from sklearn.preprocessing import LabelEncoder

In [4]:
encoder = LabelEncoder()

In [5]:
## Label Encoding Example
encoder.fit_transform(df1["color"])   

array([2, 0, 1])

In [6]:
df2

|   | class |
|---|-------|
| 0 | 10th  |
| 1 | 8th   |
| 2 | 12th  |


In [7]:
from sklearn.preprocessing import OrdinalEncoder

In [8]:
# List the categories in their natural order so that 8th < 10th < 12th
encoder = OrdinalEncoder(categories=[["8th","10th","12th"]])

In [9]:
encoder.fit_transform(df2[["class"]])

array([[1.],
       [0.],
       [2.]])
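One practical reason to prefer `OrdinalEncoder` with an explicit category list for ordered data: `LabelEncoder` numbers categories in sorted order, which for strings is alphabetical and need not match the real ranking. A small sketch using the same class values as above (the variable names are chosen here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

classes = pd.DataFrame({"class": ["10th", "8th", "12th"]})

# LabelEncoder sorts the strings, giving "10th" < "12th" < "8th"
label_codes = LabelEncoder().fit_transform(classes["class"])

# OrdinalEncoder with an explicit order keeps 8th < 10th < 12th
ordinal = OrdinalEncoder(categories=[["8th", "10th", "12th"]])
ordinal_codes = ordinal.fit_transform(classes[["class"]]).ravel()

print(label_codes)    # [0 2 1]
print(ordinal_codes)  # [1. 0. 2.]
```

With label encoding, 8th receives the largest code even though it is the lowest class, so a model that reads the codes as magnitudes would be misled.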

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

#### Answer -

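Target Guided Ordinal Encoding encodes a categorical variable using its relationship with the target variable: compute a summary of the target (typically the mean) for each category, rank the categories by that value, and assign integers in rank order. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model. The example below computes the per-city mean price, which is the quantity the ranking is based on.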

In [10]:
df = pd.DataFrame({"city":["Mumbai","Banglore","Baikunthpur","Banglore","Mumbai"],
             "price":[500,800,300,900,600]})

In [11]:
df

|   | city        | price |
|---|-------------|-------|
| 0 | Mumbai      | 500   |
| 1 | Banglore    | 800   |
| 2 | Baikunthpur | 300   |
| 3 | Banglore    | 900   |
| 4 | Mumbai      | 600   |


In [12]:
av_price = df.groupby("city")["price"].mean()

In [13]:
av_price

city
Baikunthpur    300.0
Banglore       850.0
Mumbai         550.0
Name: price, dtype: float64

In [14]:
df["av_price"] = df["city"].map(av_price)

In [15]:
df

|   | city        | price | av_price |
|---|-------------|-------|----------|
| 0 | Mumbai      | 500   | 550.0    |
| 1 | Banglore    | 800   | 850.0    |
| 2 | Baikunthpur | 300   | 300.0    |
| 3 | Banglore    | 900   | 850.0    |
| 4 | Mumbai      | 600   | 550.0    |
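The mapping above replaces each city with its mean price. To complete the ordinal step, the cities can be ranked by that mean and given integer codes in rank order. A minimal sketch on the same data; `rank_map` and `city_encoded` are names introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Mumbai", "Banglore", "Baikunthpur", "Banglore", "Mumbai"],
                   "price": [500, 800, 300, 900, 600]})

# Mean of the target for each category
mean_price = df.groupby("city")["price"].mean()

# Rank categories from lowest to highest mean and number them 0, 1, 2, ...
ordered_cities = mean_price.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}

# Replace each city with its rank
df["city_encoded"] = df["city"].map(rank_map)
print(rank_map)
print(df)
```

In a real project the means would usually be computed on the training split only, so that information from the target does not leak into validation or test data.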


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

#### Answer -

Covariance is a statistical measure that indicates the extent to which two variables change together. If the greater values of one variable correspond with the greater values of the other variable, the covariance is positive. Conversely, if greater values of one variable correspond with the lesser values of the other variable, the covariance is negative. A covariance close to zero suggests that the variables are not linearly related.

### Importance of Covariance in Statistical Analysis
**Understanding Relationships:** Covariance helps in identifying the direction of the linear relationship between variables. Knowing whether the relationship is positive or negative can be crucial in fields like finance, economics, and social sciences.

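**Feature Selection:** In machine learning and data analysis, covariance can help in selecting relevant features. Pairs of features that vary strongly together may carry redundant information. Because covariance depends on the units of the variables, its raw magnitude should not be compared across pairs measured on different scales.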

**Portfolio Theory:** In finance, covariance is used to diversify risk in portfolio management. Assets with low or negative covariance can reduce the overall risk of the portfolio.

**Basis for Further Analysis:** Covariance is a building block for other statistical concepts like the correlation coefficient, which standardizes covariance by dividing it by the product of the standard deviations of the variables.
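### How Covariance Is Calculated
For $n$ paired observations, the sample covariance is the sum of the products of each variable's deviations from its mean, divided by $n - 1$:

$$\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

A minimal sketch with NumPy that applies this formula by hand and compares it with `np.cov`. The values of `x` and `y` are made up for illustration:

```python
import numpy as np

# Illustrative values only (not from a real dataset)
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Deviation of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of products of deviations divided by n - 1
cov_manual = (dx * dy).sum() / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also divides by $n - 1$ by default, as does `DataFrame.cov()` in pandas.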

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

#### Answer -

In [16]:
df = pd.DataFrame({"Color":["red","green","blue"],
              "Size":["small", "medium", "large"],
              "Material": ["wood", "metal", "plastic"]})

In [17]:
df

|   | Color | Size   | Material |
|---|-------|--------|----------|
| 0 | red   | small  | wood     |
| 1 | green | medium | metal    |
| 2 | blue  | large  | plastic  |


In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
encoder = LabelEncoder()

In [20]:
encoder.fit_transform(df["Color"])

array([2, 1, 0])

In [21]:
encoder.fit_transform(df["Size"])

array([2, 1, 0])

In [22]:
encoder.fit_transform(df["Material"])

array([2, 0, 1])
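To explain the output: `LabelEncoder` sorts the distinct values of a column and assigns 0, 1, 2, ... in that order, so the codes reflect alphabetical position, not any real ranking (note that `large` gets 0 and `small` gets 2). The sketch below reuses the same data and reads the mapping for each column from the fitted encoder's `classes_` attribute; the `mappings` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["red", "green", "blue"],
                   "Size": ["small", "medium", "large"],
                   "Material": ["wood", "metal", "plastic"]})

mappings = {}
for column in df.columns:
    encoder = LabelEncoder()
    encoder.fit(df[column])
    # classes_ holds the sorted categories; each position is the integer code
    mappings[column] = {str(c): i for i, c in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```

For example, Color maps `blue` to 0, `green` to 1, and `red` to 2, which is why the rows red, green, blue are encoded as `[2, 1, 0]`.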

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

#### Answer -

We assume "Education Level" is encoded as follows:

High School: 1      
Bachelor's: 2            
Master's: 3            
PhD: 4 

In [23]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 30, 35, 40],
    'Income': [50000, 60000, 80000, 90000],
    'Education Level': [1, 2, 3, 4]
}

df = pd.DataFrame(data)
df

|   | Age | Income | Education Level |
|---|-----|--------|-----------------|
| 0 | 25  | 50000  | 1               |
| 1 | 30  | 60000  | 2               |
| 2 | 35  | 80000  | 3               |
| 3 | 40  | 90000  | 4               |


In [24]:
df.cov()

|                 | Age       | Income       | Education Level |
|-----------------|-----------|--------------|-----------------|
| Age             | 41.67     | 116666.67    | 8.33            |
| Income          | 116666.67 | 333333333.33 | 23333.33        |
| Education Level | 8.33      | 23333.33     | 1.67            |
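Interpretation: the diagonal entries are the variances of each variable, and every off-diagonal entry is positive, so in this sample Age, Income, and Education Level all increase together. The magnitudes are not comparable across pairs because covariance carries the units of both variables; Income is measured in tens of thousands, so its entries dominate. One way to compare strengths is to rescale to correlations, sketched below on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Income": [50000, 60000, 80000, 90000],
    "Education Level": [1, 2, 3, 4],
})

cov_matrix = df.cov()    # sample covariance (divides by n - 1)
corr_matrix = df.corr()  # Pearson correlation: covariance rescaled to [-1, 1]

print(corr_matrix.round(3))
```

In this toy sample Age and Education Level rise in lockstep, so their correlation is 1, and Income is strongly correlated with both (about 0.99).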


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

#### Answer -

In [25]:
df = pd.DataFrame({"Gender" :["Male","Female","Male","Female"],
             "Education Level" :["High School","Bachelor's","Master's","PhD"],
             "Employment Status": ["Unemployed","Part-Time","Full-Time","Full-Time"]})

In [26]:
df

|   | Gender | Education Level | Employment Status |
|---|--------|-----------------|-------------------|
| 0 | Male   | High School     | Unemployed        |
| 1 | Female | Bachelor's      | Part-Time         |
| 2 | Male   | Master's        | Full-Time         |
| 3 | Female | PhD             | Full-Time         |


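**Gender -** Use Label Encoding. The categories are nominal with no inherent order, and because there are only two of them, a single 0/1 column represents the variable without suggesting a ranking. For a nominal variable with more than two categories, one-hot encoding would usually be the safer choice.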

In [27]:
encoder = LabelEncoder()

In [28]:
encoder.fit_transform(df["Gender"])

array([1, 0, 1, 0])

**Education Level -** Use Ordinal Encoding because the categories have a clear order based on the level of education.

In [29]:
encoder = OrdinalEncoder(categories=[["High School","Bachelor's","Master's","PhD"]])

In [30]:
encoder.fit_transform(df[["Education Level"]])

array([[0.],
       [1.],
       [2.],
       [3.]])

**Employment Status -** Use Ordinal Encoding because the categories can be ordered by level of employment (Unemployed < Part-Time < Full-Time).

In [31]:
encoder = OrdinalEncoder(categories=[["Unemployed","Part-Time","Full-Time"]])

In [32]:
encoder.fit_transform(df[["Employment Status"]])

array([[0.],
       [1.],
       [2.],
       [2.]])
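The two ordered columns can also be encoded in one step, since `OrdinalEncoder` accepts one category list per column. A sketch with the same data, where `ordered_cols` and `encoded` are names chosen here for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Education Level": ["High School", "Bachelor's", "Master's", "PhD"],
    "Employment Status": ["Unemployed", "Part-Time", "Full-Time", "Full-Time"],
})

ordered_cols = ["Education Level", "Employment Status"]

# One category list per column, each given from lowest to highest
encoder = OrdinalEncoder(categories=[
    ["High School", "Bachelor's", "Master's", "PhD"],
    ["Unemployed", "Part-Time", "Full-Time"],
])

encoded = encoder.fit_transform(df[ordered_cols])
print(encoded)
```

The category lists must appear in the same order as the columns passed to `fit_transform`.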

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

#### Answer -

#### Encode Categorical Variables

In [33]:
encoder = LabelEncoder()

In [34]:
encoder.fit_transform(["Sunny","Cloudy","Rainy"])

array([2, 0, 1])

In [35]:
encoder.fit_transform(["North","South","East","West"])

array([1, 2, 0, 3])

In [36]:
import pandas as pd

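# The categorical columns hold the LabelEncoder codes from above:
#   Weather Condition: Cloudy=0, Rainy=1, Sunny=2
#   Wind Direction:    East=0, North=1, South=2, West=3
data = {
    'Temperature': [30, 25, 20, 22, 28],
    'Humidity': [70, 65, 80, 75, 60],
    'Weather Condition': [2, 0, 1, 0, 1],
    'Wind Direction': [1, 2, 0, 3, 0]
}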

df = pd.DataFrame(data)

In [37]:
df

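|   | Temperature | Humidity | Weather Condition | Wind Direction |
|---|-------------|----------|-------------------|----------------|
| 0 | 30          | 70       | 2                 | 1              |
| 1 | 25          | 65       | 0                 | 2              |
| 2 | 20          | 80       | 1                 | 0              |
| 3 | 22          | 75       | 0                 | 3              |
| 4 | 28          | 60       | 1                 | 0              |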


In [38]:
df.cov()

|                   | Temperature | Humidity | Weather Condition | Wind Direction |
|-------------------|-------------|----------|-------------------|----------------|
| Temperature       | 17.0        | -23.75   | 2.0               | -1.0           |
| Humidity          | -23.75      | 62.5     | 0.0               | 1.25           |
| Weather Condition | 2.0         | 0.0      | 0.7               | -0.7           |
| Wind Direction    | -1.0        | 1.25     | -0.7              | 1.7            |
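Interpretation: Temperature and Humidity have a negative covariance (-23.75), so in this sample warmer readings tend to coincide with lower humidity. The diagonal entries are the variances of each column. The entries involving Weather Condition and Wind Direction depend on the arbitrary integer codes produced by label encoding, so their sign and size do not describe a real relationship; the Humidity and Weather Condition entry is effectively zero. A sketch that restricts the calculation to the two continuous variables, with `cov_th` and `corr_th` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 25, 20, 22, 28],
    "Humidity": [70, 65, 80, 75, 60],
})

# Sample covariance between the two continuous variables
cov_th = df["Temperature"].cov(df["Humidity"])

# Pearson correlation puts the same relationship on a unit-free scale
corr_th = df["Temperature"].corr(df["Humidity"])

print(cov_th, round(corr_th, 3))
```

The correlation of roughly -0.73 indicates a fairly strong negative linear relationship in these five observations, which is too small a sample to generalize from.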
