In [1]:
#Ans 01:

In [2]:
# Ordinal encoding and label encoding are both techniques used in machine learning to convert categorical data into numerical
# format, but they are applied differently.

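# Label Encoding:
# In label encoding, each unique category is assigned a different integer, and the numeric order of those integers carries no meaning.
# For example, in a "Color" feature with categories Red, Green, and Blue, label encoding might assign Red as 0, Green as 1, and Blue as 2.
# (scikit-learn's LabelEncoder assigns the integers in sorted order of the category names and is designed for target labels.)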

# Ordinal Encoding:
# Ordinal encoding is similar to label encoding but specifically used when there's an inherent order or hierarchy among the categories.
# It assigns numbers to categories based on their rank or order.
# For instance, in a "Size" feature with categories Small, Medium, and Large, ordinal encoding might assign Small as 0, Medium as 1, and Large as 2,
# reflecting their relative order.

# When to Choose:

# Label Encoding: Use label encoding when the categorical data has no intrinsic order or when the order is irrelevant for the model, for
# instance when encoding types of fruit or city names. Linear and distance-based models can read the integers as an order that does not exist,
# so one-hot encoding is a common alternative for such nominal features.

# Ordinal Encoding: Opt for ordinal encoding when there's a clear order or hierarchy in the categories. Examples include encoding education levels
# (such as High School, Bachelor's, Master's) or ratings (Low, Medium, High) where the order matters.

# Choosing between the two depends on the nature of the categorical data and its relevance to the problem you're solving. If there's no meaningful
# order, label encoding is preferable. If there's a hierarchy, ordinal encoding would be more suitable.
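
# A minimal sketch of the difference, assuming scikit-learn's LabelEncoder and OrdinalEncoder and a made-up "Size" column.
# LabelEncoder picks codes alphabetically, while OrdinalEncoder can be given the intended order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy column used only for illustration
sizes = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

# LabelEncoder sorts the classes alphabetically: Large=0, Medium=1, Small=2
label_codes = LabelEncoder().fit_transform(sizes["Size"])

# OrdinalEncoder takes an explicit category order: Small=0, Medium=1, Large=2
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal_encoder.fit_transform(sizes[["Size"]]).ravel()

print(label_codes)
print(ordinal_codes)
```

# With the alphabetical codes, Small ends up "larger" than Large, which is why an explicit order matters for ordinal features.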

In [3]:
##########################################################################
#Ans 02:

In [4]:
# Target Guided Ordinal Encoding is a technique used when encoding categorical variables by considering the target variable's
# relationship with the categories. Here's how it generally works:

# Calculate the Mean/Median/Sum of the Target Variable: For each category in the categorical feature, calculate a statistical measure
# like mean, median, or sum of the target variable within that category.

# Order the Categories: Sort the categories based on the calculated measure (e.g., mean of the target variable). This establishes an ordinal
# relationship among the categories.

# Assign Ordinal Ranks: Assign ranks or numbers to the categories based on this order. The category with the lowest mean (or highest,
# depending on the problem) might get a lower rank, and the one with the highest mean gets a higher rank.

# Replace Categories with Ranks: Replace the original categories in the dataset with these ordinal ranks.

# This technique is most useful when a categorical feature has no natural order of its own but is clearly related to the target variable.
# For example:

# Marketing Campaigns: When categorizing customers based on their responsiveness to marketing campaigns, Target Guided Ordinal Encoding can assign
# ranks based on their historical response rates. High responders might get a higher rank, indicating their likelihood to respond positively to
# future campaigns.

# Risk Assessment: In credit risk analysis, categories representing different risk levels (low, medium, high) can be ordered based on historical
# default rates. This ordinal encoding can help predict the likelihood of default for new applicants.

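# By using the relationship between the categorical variable and the target variable, Target Guided Ordinal Encoding can give machine learning
# models additional information and may improve predictive performance in certain cases. Because the ranks are derived from the target, they
# should be computed on the training data only; otherwise target information leaks into the validation and test sets.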
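
# The four steps above can be sketched with pandas as follows. The data, the column names ('city', 'price') and the choice of the mean
# as the statistic are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: 'city' is the categorical feature, 'price' is the target
df = pd.DataFrame({
    "city": ["A", "B", "C", "A", "B", "C"],
    "price": [100, 300, 200, 120, 280, 220],
})

# Step 1: mean of the target within each category
category_means = df.groupby("city")["price"].mean()

# Step 2: order the categories by that mean (ascending)
ordered_categories = category_means.sort_values().index

# Step 3: assign ranks 0, 1, 2, ... following the sorted order
rank_map = {category: rank for rank, category in enumerate(ordered_categories)}

# Step 4: replace the categories with their ranks
df["city_encoded"] = df["city"].map(rank_map)

print(rank_map)
print(df)
```

# In a real project, rank_map would be built from the training split and then applied unchanged to the other splits.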

In [5]:
##########################################################################
#Ans 03:

In [6]:
# Covariance is a measure that indicates the extent to which two random variables change in relation to each other. It
# describes the direction of the linear relationship between two variables.

# Positive Covariance: When one variable tends to have high values when the other has high values, and low values when the other has
# low values, the covariance is positive.

# Negative Covariance: If one variable tends to have high values when the other has low values and vice versa, the covariance is negative.

# Importance in Statistical Analysis:

# Covariance is crucial in statistical analysis for various reasons:

# Relationship Assessment: It helps in understanding how two variables move relative to each other. For instance, in finance, it helps analyze
# how the prices of two stocks move together.

# Portfolio Diversification: In finance, understanding the covariance between different assets helps in constructing diversified portfolios.
# Assets with low or negative covariance can reduce overall portfolio risk.

# Linear Regression: Covariance underlies simple linear regression. The least-squares slope equals Cov(X, Y) / Var(X), so the sign of the
# covariance between predictor and response gives the direction of the fitted relationship, while its size depends on the variables' scales.

# Calculation of Covariance:

# The formula for calculating the covariance between two variables X and Y, given a dataset with 'n' observations, is:
    
#     Cov(X, Y) = (1 / (n − 1)) * Σ_{i=1}^{n} (X_i − X_bar) * (Y_i − Y_bar)
    
# Where:
# X_i and Y_i are individual data points
# X_bar and Y_bar are the means of X and Y, respectively

# This formula sums the products of each pair's deviations from their respective means and divides by n−1. Dividing by n−1 rather than n
# (Bessel's correction) makes the sample covariance an unbiased estimator of the population covariance.

# However, the magnitude of the covariance is not easily interpretable, as it depends on the scales of the variables. Therefore, standardized
# measures like the correlation coefficient are often used to gauge the strength and direction of the relationship between variables.
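
# As a small check of the formula, the sketch below computes the sample covariance by hand and compares it with numpy's np.cov.
# The arrays x and y are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of products of deviations from the means, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

# Rescaling x (for example, metres to centimetres) rescales the covariance by the same factor
rescaled_cov = np.cov(x * 100, y)[0, 1]

print(manual_cov, numpy_cov, rescaled_cov)
```

# The rescaled value is 100 times larger even though the relationship is unchanged, which is the scale dependence described above.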

In [7]:
##########################################################################
#Ans 04:

In [8]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'plastic']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

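# Apply LabelEncoder to each categorical column. fit_transform refits the encoder every time,
# so after the loop it only remembers the classes of the last column it saw.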
for col in df.columns:
    if df[col].dtype == 'object':  # Check if column contains categorical data
        df[col + '_encoded'] = label_encoder.fit_transform(df[col])

# Display the encoded DataFrame
print(df)

   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green   small     wood              1             2                 2
4    red  medium  plastic              2             1                 1


In [9]:
# This code snippet performs label encoding using scikit-learn's LabelEncoder. Here's a breakdown:

# 1. A sample dataset with three categorical columns (Color, Size, and Material) is created and converted into a Pandas DataFrame.
# 2. The LabelEncoder object is initialized.
# 3. The code iterates through each column in the DataFrame and checks if it contains categorical data. For such columns, it applies label
#    encoding using fit_transform() method of LabelEncoder.
# 4. Encoded versions of the categorical columns are added as new columns in the DataFrame with '_encoded' appended to their original names.
# 5. The resulting DataFrame is displayed, showing both the original categorical columns and their respective encoded versions.

# The output is a DataFrame where the original categorical columns (Color, Size, Material) are preserved, and new columns ('Color_encoded',
# 'Size_encoded', 'Material_encoded') are added, representing the label encoded versions of the original categorical data. Each unique category
# is replaced with an integer label assigned in alphabetical order: 'blue' becomes 0, 'green' 1, and 'red' 2. The same logic applies to Size
# ('large' 0, 'medium' 1, 'small' 2) and Material ('metal' 0, 'plastic' 1, 'wood' 2). Note that the Size codes do not follow the natural
# small < medium < large order, so they should not be treated as an ordinal scale.
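
# To see the mapping directly, one option is to keep a separate encoder per column and read its classes_ attribute. This is a sketch of
# that variant, not the code used above; the 'encoders' and 'mappings' dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red"],
    "Size": ["small", "medium", "large", "small", "medium"],
})

# One fitted encoder per column, so every mapping stays available afterwards
encoders = {}
for col in ["Color", "Size"]:
    encoders[col] = LabelEncoder()
    df[col + "_encoded"] = encoders[col].fit_transform(df[col])

# classes_ lists the categories in code order (index 0 is code 0, and so on)
mappings = {
    col: {category: code for code, category in enumerate(encoder.classes_)}
    for col, encoder in encoders.items()
}
print(mappings)

# inverse_transform converts integer codes back to the original labels
decoded = list(encoders["Color"].inverse_transform([2, 0]))
print(decoded)
```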

In [10]:
##########################################################################
#Ans 05:

In [11]:
import numpy as np
import pandas as pd

# Sample data
data = {
    'Age': [30, 40, 25, 35, 28],
    'Income': [50000, 60000, 45000, 70000, 55000],
    'Education_Level': [14, 16, 12, 18, 13]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate the covariance matrix
covariance_matrix = np.cov(df.T)

# Display the covariance matrix
print("Covariance Matrix:")
print(covariance_matrix)

Covariance Matrix:
[[3.530e+01 4.175e+04 1.180e+01]
 [4.175e+04 9.250e+07 2.175e+04]
 [1.180e+01 2.175e+04 5.800e+00]]


In [12]:
# In this code:

# 1. The sample data containing Age, Income, and Education Level is structured into a Pandas DataFrame.
# 2. The numpy library's cov() function is applied to the transposed DataFrame (df.T) to calculate the covariance matrix.

# The resulting output will display the covariance matrix, showing the pairwise covariances between Age, Income, and Education Level.
# The matrix will be a 3x3 matrix, where each element represents the covariance between two variables. The diagonal elements will represent
# the variances of each variable, and the off-diagonal elements will represent the covariances between pairs of variables.

# Interpreting the covariance matrix involves considering the relationships between variables:

# 1. Diagonal Elements: These are the variances of the individual variables: 35.3 for Age, 9.25e+07 for Income, and 5.8 for Education_Level.
# Income has by far the largest variance, largely because it is measured in much larger units than the other two variables.
# 2. Off-diagonal Elements: These are the covariances between pairs of variables. All three are positive here (Age-Income 41750,
# Age-Education_Level 11.8, Income-Education_Level 21750), so in this sample the variables tend to increase and decrease together. A negative
# covariance would suggest an inverse relationship. The magnitude depends on the units of the variables, so on its own it does not measure
# the strength of the relationship.

# However, interpreting the raw covariances might be challenging due to the differences in scales of the variables. For a clearer understanding
# of the relationships, it's common to normalize the covariances using correlation coefficients, which standardize the scale to range between -1
# and 1, facilitating easier comparison and interpretation.
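
# A short sketch of that normalization on the same sample data: each covariance is divided by the product of the two standard deviations.
# It uses pandas' DataFrame.cov and DataFrame.corr instead of np.cov; that substitution is a choice made here for labelled output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [30, 40, 25, 35, 28],
    "Income": [50000, 60000, 45000, 70000, 55000],
    "Education_Level": [14, 16, 12, 18, 13],
})

# Labelled covariance matrix (same values as np.cov(df.T))
cov = df.cov()

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov.to_numpy()))

# Correlation = covariance / (std_i * std_j)
corr_from_cov = cov / np.outer(std, std)

# pandas computes the same Pearson correlation matrix directly
corr = df.corr()

print(corr_from_cov)
```

# Every entry now lies between -1 and 1, and the diagonal is exactly 1, so the pairs can be compared regardless of units.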

In [13]:
##########################################################################
#Ans 06:

In [14]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education Level': ['High School', "Bachelor's", "Master's", 'PhD'],
    'Employment Status': ['Unemployed', 'Part-Time', 'Full-Time', 'Part-Time']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize encoders
label_encoder = LabelEncoder()

# Perform encoding for each variable
# Gender - Label Encoding (two unordered categories here, so a single 0/1 column is enough)
df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])

# Education Level - Ordinal Encoding (the categories have a natural order, made explicit in the mapping)
education_mapping = {
    'High School': 0,
    "Bachelor's": 1,
    "Master's": 2,
    'PhD': 3
}
df['Education_Level_encoded'] = df['Education Level'].map(education_mapping)

# Employment Status - One-Hot Encoding (more than two categories with no natural order)
df = pd.get_dummies(df, columns=['Employment Status'], prefix='Employment_Status')

# Display the encoded DataFrame
print(df)

   Gender Education Level  Gender_encoded  Education_Level_encoded  \
0    Male     High School               1                        0   
1  Female      Bachelor's               0                        1   
2    Male        Master's               1                        2   
3  Female             PhD               0                        3   

   Employment_Status_Full-Time  Employment_Status_Part-Time  \
0                        False                        False   
1                        False                         True   
2                         True                        False   
3                        False                         True   

   Employment_Status_Unemployed  
0                          True  
1                         False  
2                         False  
3                         False  


In [15]:
##########################################################################
#Ans 07:

In [16]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Temperature': [25, 28, 22, 30, 26],
    'Humidity': [60, 55, 70, 50, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Rainy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize encoders
label_encoder = LabelEncoder()

# Perform label encoding for categorical variables (codes follow alphabetical order, e.g. Cloudy=0, Rainy=1, Sunny=2)
df['Weather Condition Encoded'] = label_encoder.fit_transform(df['Weather Condition'])
df['Wind Direction Encoded'] = label_encoder.fit_transform(df['Wind Direction'])

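# Calculate the covariance matrix. Entries involving the encoded columns depend on the arbitrary integer codes,
# so only the Temperature and Humidity entries have a direct interpretation.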
covariance_matrix = np.cov(df[['Temperature', 'Humidity', 'Weather Condition Encoded', 'Wind Direction Encoded']].T)

# Display the covariance matrix
print("Covariance Matrix:")
print(covariance_matrix)

Covariance Matrix:
[[  9.2  -22.5   -1.7    3.4 ]
 [-22.5   62.5    3.75  -8.75]
 [ -1.7    3.75   0.7   -0.65]
 [  3.4   -8.75  -0.65   1.3 ]]


In [17]:
##########################################################################