In [1]:
# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
# might choose one over the other.

# ANS-

# Label Encoding:--
# Label encoding assigns a unique integer to each category in a categorical feature, with no regard for any order among the categories.
# scikit-learn's LabelEncoder, for example, numbers the classes in sorted (alphabetical) order and is intended for encoding target labels rather than input features.
# Because the integers are arbitrary, a model may treat them as a ranking when they are not. For nominal input features (categories without an inherent order), one-hot encoding is usually the safer choice.

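# Ordinal Encoding:--
# Ordinal encoding, on the other hand, is designed for ordinal categorical data, where the categories have a meaningful order.
# Each category is assigned an integer that reflects its position in that order, so the numeric comparison (0 < 1 < 2) matches the real-world ranking.
# For instance, choose ordinal encoding for a size feature (small < medium < large), and label encoding for a class label such as a color name, where no order exists.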

# EXAMPLE- 

from sklearn.preprocessing import LabelEncoder

data = ['red', 'green', 'blue', 'green', 'red', 'blue']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)

print(encoded_data)


import pandas as pd

data = {'size': ['small', 'medium', 'large', 'medium', 'small']}
df = pd.DataFrame(data)

size_mapping = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_mapping)

print(df)


[2 1 0 1 2 0]
     size  size_encoded
0   small             0
1  medium             1
2   large             2
3  medium             1
4   small             0
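
# The difference matters when the same ordered feature is passed to both encoders.
# The sketch below is a minimal illustration, assuming scikit-learn's OrdinalEncoder with an
# explicit `categories` list; the variable names are chosen here and are not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = np.array(['small', 'medium', 'large', 'medium', 'small'])

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder takes the intended order explicitly: small=0, medium=1, large=2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal_codes = ordinal_encoder.fit_transform(sizes.reshape(-1, 1)).ravel()

print(label_codes)
print(ordinal_codes)
```

# With LabelEncoder, 'small' receives the largest code, which reverses the real ranking.
# Supplying the order explicitly avoids that.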


In [2]:
# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
# a machine learning project.

# ANS-

# Target Guided Ordinal Encoding is a technique that encodes a categorical feature based on the relationship between its categories and the target variable.
# It is most useful for nominal features, especially those with many categories, where no natural order exists and one-hot encoding would create too many columns. The order is derived from the target instead.
# Because the encoding uses the target, the mapping should be learned from the training data only, to avoid leaking target information into validation or test data.

# Calculate the Mean/Median of Target Variable for Each Category:--
# Calculate the mean (or median) value of the target variable for each category in the categorical feature. This gives you insight into the relationship between each category and the target.

# Order the Categories:--
# Sort the categories based on their mean (or median) target value. This establishes the order in which you will assign ordinal labels.

# Assign Ordinal Labels:--
# Assign ordinal labels (integer values) to the categories according to their order based on target variable means (or medians).

# Replace Categorical Values:--
# Replace the original categorical values in the feature column with the assigned ordinal labels.

# EXAMPLE- 

import pandas as pd

# Sample data
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'C'],
    'target': [10, 20, 15, 25, 30, 35, 12, 28]
}

df = pd.DataFrame(data)

# Calculate mean target for each category
mean_target_per_category = df.groupby('category')['target'].mean()

ordered_categories = mean_target_per_category.sort_values().index

# Create a mapping of ordered categories to ordinal labels
category_mapping = {category: label for label, category in enumerate(ordered_categories)}

# ordinal encoding based on target relationship
df['category_encoded'] = df['category'].map(category_mapping)

print(df)


  category  target  category_encoded
0        A      10                 0
1        B      20                 1
2        A      15                 0
3        C      25                 2
4        B      30                 1
5        C      35                 2
6        A      12                 0
7        C      28                 2
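
# In a real project the mapping is usually learned from the training rows only and then applied to held-out rows.
# The sketch below assumes a hypothetical split (first six rows for training, last two held out)
# and an arbitrary placeholder of -1 for categories that never appear in training.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'C'],
    'target': [10, 20, 15, 25, 30, 35, 12, 28],
})

# Hypothetical split: first six rows for training, last two held out
train = df.iloc[:6].copy()
test = df.iloc[6:].copy()

# Learn the category order from the training rows only
train_means = train.groupby('category')['target'].mean().sort_values()
mapping = {category: rank for rank, category in enumerate(train_means.index)}

train['category_encoded'] = train['category'].map(mapping)

# Categories never seen in training map to NaN; -1 is an arbitrary placeholder
test['category_encoded'] = test['category'].map(mapping).fillna(-1).astype(int)

print(mapping)
print(test)
```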


In [None]:
Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

ANS-

Covariance:
Covariance is a statistical measure that indicates the degree to which two variables change together. In other words, it measures the extent to which changes in one variable are associated with changes in another variable. Covariance can be positive, indicating that the two variables tend to increase or decrease together, or negative, indicating that one variable tends to increase while the other decreases.

Importance in Statistical Analysis:
Covariance is important in statistical analysis for several reasons:

Relationship Assessment:--
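The sign of the covariance shows the direction of the linear relationship between two variables: positive means they tend to increase together, negative means they tend to move in opposite directions. Its magnitude depends on the units of the variables, so it does not measure the strength of the relationship by itself. Correlation, which is covariance divided by the product of the two standard deviations, is used for that.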

Portfolio Management: --
In finance, covariance is used to assess the relationship between the returns of different assets in a portfolio. It helps in diversification, where assets with low or negative covariance can reduce overall portfolio risk.

Data Exploration:--
Covariance provides insights into the interactions between variables. It can help identify patterns and dependencies, which is crucial in exploratory data analysis.

Feature Selection:--
In machine learning, covariance (usually after scaling to correlation) can be used to identify redundant features. Two features that are strongly correlated, positively or negatively, carry similar information, so one could potentially be dropped without losing much.

Regression Analysis:--
Covariance underlies linear regression. In simple linear regression the slope equals cov(X, Y) / var(X), and with several predictors the coefficients are computed from the covariance matrix of the variables.

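Calculation -

For n paired observations, the sample covariance is:

cov(X, Y) = Σ((X_i - X̄) * (Y_i - Ȳ)) / (n - 1)

where X̄ and Ȳ are the sample means of X and Y. Dividing by n instead of n - 1 gives the population covariance.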
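
The formula can be checked with a few lines of plain Python. This is a minimal sketch: the function name `sample_covariance` and the `hours`/`score` values are made up here for illustration only.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)


# Illustrative values only
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

print(sample_covariance(hours, score))        # variables rise together
print(sample_covariance(hours, score[::-1]))  # one rises while the other falls
```

The first call gives a positive value and the second a negative one of the same size, matching the sign interpretation described above.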


In [3]:
# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
# large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
# Show your code and explain the output.

# ANS-

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic']
}

df = pd.DataFrame(data)

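# A single LabelEncoder is refit on every column, so only the last column's mapping remains in label_encoder.classes_
label_encoder = LabelEncoder()

# Apply label encoding to each column; classes are numbered in sorted (alphabetical) order
for column in df.columns:
    df[column + '_encoded'] = label_encoder.fit_transform(df[column])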

print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green  medium     wood              1             1                 2
4    red   small    metal              2             2                 0
5   blue  medium  plastic              0             1                 1
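
# Explanation of the output: within each column the categories are numbered alphabetically,
# so blue=0, green=1, red=2; large=0, medium=1, small=2; metal=0, plastic=1, wood=2.
# Note that Size ends up with small=2 and large=0, which does not reflect the real ordering.
# To keep every mapping available afterwards, one option is to fit a separate encoder per column.
# The sketch below assumes a dictionary named `encoders`, which is not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic'],
})

# One fitted encoder per column, so each mapping can be inspected or reversed later
encoders = {}
for column in ['Color', 'Size', 'Material']:
    encoders[column] = LabelEncoder()
    df[column + '_encoded'] = encoders[column].fit_transform(df[column])

# Show the category -> integer mapping learned for each column
for column, encoder in encoders.items():
    mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
    print(column, mapping)

# Recover the original labels from the integer codes
decoded_colors = encoders['Color'].inverse_transform(df['Color_encoded'])
```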


In [4]:
# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
# level. Interpret the results.

# ANS-


import numpy as np

# sample dataset
age = [25, 30, 40, 35, 28]
income = [50000, 60000, 75000, 65000, 55000]
education_level = [12, 16, 18, 14, 15]

# Create a data matrix
data_matrix = np.vstack((age, income, education_level))

# Calculate the covariance matrix (np.cov treats each row as a variable and divides by n - 1 by default)
covariance_matrix = np.cov(data_matrix)

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
[[3.530e+01 5.675e+04 1.000e+01]
 [5.675e+04 9.250e+07 1.750e+04]
 [1.000e+01 1.750e+04 5.000e+00]]
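
# Interpretation: the diagonal holds the variances (35.3 for Age, 9.25e7 for Income, 5.0 for Education level).
# Every off-diagonal entry is positive, so in this sample the three variables tend to rise together.
# The magnitudes are not comparable with each other because they depend on the units (Income is in the tens of thousands).
# Rescaling to correlations puts every pair on a -1 to 1 scale; the sketch below is one way to do that with NumPy.

```python
import numpy as np

age = [25, 30, 40, 35, 28]
income = [50000, 60000, 75000, 65000, 55000]
education_level = [12, 16, 18, 14, 15]

data_matrix = np.vstack((age, income, education_level))
covariance_matrix = np.cov(data_matrix)

# Divide each covariance by the product of the two standard deviations
std = np.sqrt(np.diag(covariance_matrix))
correlation_matrix = covariance_matrix / np.outer(std, std)

print(np.round(correlation_matrix, 3))
```

# The result should match np.corrcoef(data_matrix), which computes the same quantity directly.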


In [5]:
# Q6. You are working on a machine learning project with a dataset containing several categorical
# variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
# and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
# each variable, and why?

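# ANS-

# Gender: two categories with no order, so a binary 0/1 mapping is enough.
# Education Level: the categories have a natural order (High School < Bachelor's < Master's < PhD), so ordinal encoding is the natural choice. The one-hot encoding used below is a safe alternative that ignores the order.
# Employment Status: treated as nominal here, so one-hot encoding avoids implying a ranking.

import pandas as pd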

# Sample data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Education Level': ['High School', 'Bachelor\'s', 'Master\'s', 'Bachelor\'s', 'PhD'],
    'Employment Status': ['Unemployed', 'Part-Time', 'Full-Time', 'Part-Time', 'Full-Time']
}

df = pd.DataFrame(data)

# Encode 'Gender' using Label Encoding (Binary)
gender_mapping = {'Male': 0, 'Female': 1}
df['Gender_encoded'] = df['Gender'].map(gender_mapping)

# Encode 'Education Level' using One-Hot Encoding
education_encoded = pd.get_dummies(df['Education Level'], prefix='Education')
df = pd.concat([df, education_encoded], axis=1)

# Encode 'Employment Status' using One-Hot Encoding
employment_encoded = pd.get_dummies(df['Employment Status'], prefix='Employment')
df = pd.concat([df, employment_encoded], axis=1)

print(df)
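
# Since Education Level has a natural order, it can also be encoded as a single ordinal column.
# The sketch below assumes the ranking High School < Bachelor's < Master's < PhD and uses a pandas
# ordered categorical; the names `education_order` and `Education_encoded` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Education Level': ['High School', "Bachelor's", "Master's", "Bachelor's", 'PhD'],
})

# Assumed ranking, lowest to highest
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
education_dtype = pd.CategoricalDtype(categories=education_order, ordered=True)

# .cat.codes gives each value's position in education_order
df['Education_encoded'] = df['Education Level'].astype(education_dtype).cat.codes

print(df)
```

# This keeps one column instead of four and preserves the ranking, which can help models that benefit from ordered inputs.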


   Gender Education Level Employment Status  Gender_encoded  \
0    Male     High School        Unemployed               0   
1  Female      Bachelor's         Part-Time               1   
2    Male        Master's         Full-Time               0   
3    Male      Bachelor's         Part-Time               0   
4  Female             PhD         Full-Time               1   

   Education_Bachelor's  Education_High School  Education_Master's  \
0                     0                      1                   0   
1                     1                      0                   0   
2                     0                      0                   1   
3                     1                      0                   0   
4                     0                      0                   0   

   Education_PhD  Employment_Full-Time  Employment_Part-Time  \
0              0                     0                     0   
1              0                     0                     1   
2       

In [6]:
# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
# categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
# East/West). Calculate the covariance between each pair of variables and interpret the results.

# ANS-

# Covariance is defined for numeric variables, so it can be computed directly only for the continuous pair, "Temperature" and "Humidity".
# The categorical variables, "Weather Condition" and "Wind Direction", must first be encoded as numbers (for example, one-hot encoded) before their covariance with other variables can be computed, and the resulting values are harder to interpret.


import pandas as pd
import numpy as np

# Sample data
data = {
    'Temperature': [25, 28, 30, 22, 27],
    'Humidity': [60, 55, 70, 40, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)

# Calculate covariance between Temperature and Humidity
cov_temp_humidity = np.cov(df['Temperature'], df['Humidity'])[0, 1]

print(f"Covariance between Temperature and Humidity: {cov_temp_humidity:.2f}")


Covariance between Temperature and Humidity: 29.75
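
# Interpretation: the covariance of 29.75 is positive, so in this sample higher temperatures tend to come with higher humidity.
# To cover the remaining pairs, the categorical columns can be one-hot encoded first. The sketch below is one
# possible approach, assuming pd.get_dummies with integer indicator columns. Covariances involving an indicator
# column only show whether a variable tends to be above or below its mean when that category occurs.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [25, 28, 30, 22, 27],
    'Humidity': [60, 55, 70, 40, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North'],
})

# One 0/1 indicator column per category, so every column is numeric
encoded = pd.get_dummies(df, columns=['Weather Condition', 'Wind Direction'], dtype=int)

# Pairwise sample covariance (n - 1 denominator) between all columns
cov_matrix = encoded.cov()

print(cov_matrix.round(2))
```

# With only five rows, these values are illustrative and should not be read as real weather relationships.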
