In [2]:

### Q1: What is the Difference Between Ordinal Encoding and Label Encoding? Provide an Example of When You Might Choose One Over the Other.

''' Ordinal Encoding:
- Definition: Ordinal encoding assigns integer codes to categories according to their natural order. Use it when the categories have an inherent ranking.
- Example Use Case: Encoding education levels (e.g., High School < Bachelor's < Master's < PhD).

Label Encoding:
- Definition: Label encoding assigns each category a unique integer with no regard to any ranking. In scikit-learn, LabelEncoder numbers the categories in sorted (alphabetical) order and is intended mainly for target labels; for nominal input features, the integer codes can suggest an order that does not exist.
- Example Use Case: Encoding car brands (e.g., Toyota, Ford, BMW), where no brand ranks above another.

Example:
 '''
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({
    'education': ['High School', 'Bachelor', 'Master', 'PhD'],
    'car_brand': ['Toyota', 'Ford', 'BMW', 'Tesla']
})

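# Ordinal Encoding for education (categories listed from lowest to highest)
ordinal_encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
# fit_transform returns a 2D array; ravel() flattens it to one value per row
data['education_encoded'] = ordinal_encoder.fit_transform(data[['education']]).ravel()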

# Label Encoding for car brands
label_encoder = LabelEncoder()
data['car_brand_encoded'] = label_encoder.fit_transform(data['car_brand'])

print(data)



     education car_brand  education_encoded  car_brand_encoded
0  High School    Toyota                0.0                  3
1     Bachelor      Ford                1.0                  1
2       Master       BMW                2.0                  0
3          PhD     Tesla                3.0                  2
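The difference between the two encoders is easiest to see when the same column is passed to both. The following is a minimal sketch, assuming a small illustrative `education` column that is not part of the original example; it compares the code each encoder assigns to each category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data (an assumption for this sketch), deliberately out of order
education = pd.DataFrame({'education': ['PhD', 'High School', 'Master', 'Bachelor']})

# LabelEncoder sorts the labels, so the codes follow alphabetical order
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(education['education'])
label_mapping = {str(c): i for i, c in enumerate(label_encoder.classes_)}

# OrdinalEncoder with an explicit category list follows the order we supply
ordinal_encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
ordinal_codes = ordinal_encoder.fit_transform(education[['education']]).ravel()
ordinal_mapping = {str(c): i for i, c in enumerate(ordinal_encoder.categories_[0])}

print(label_mapping)    # alphabetical: Bachelor comes before High School
print(ordinal_mapping)  # ranked: High School comes before Bachelor
```

With the alphabetical codes, Bachelor would appear to rank below High School, which is why an explicit category order matters for ordinal features.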


In [3]:
### Q2: Explain How Target Guided Ordinal Encoding Works and Provide an Example of When You Might Use It in a Machine Learning Project.

'''Target Guided Ordinal Encoding:
- Definition: This method ranks the categories of a feature by a statistic of the target variable, typically the mean of the target within each category, and uses each category's rank as its integer code.
- Example Use Case: Encoding a nominal feature such as city when predicting price, so that the codes carry information about the target. The mapping should be learned from the training data only; computing it on the full dataset leaks target information into the features.

Example:
'''
import pandas as pd

# Sample data
data = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Boston'],
    'price': [300, 200, 400, 150]
})

# Calculate mean price for each city
city_mean_price = data.groupby('city')['price'].mean().sort_values()

# Create a mapping of city to mean price rank
city_mapping = {k: i for i, k in enumerate(city_mean_price.index, 0)}

# Apply Target Guided Ordinal Encoding
data['city_encoded'] = data['city'].map(city_mapping)

print(data)





            city  price  city_encoded
0       New York    300             2
1    Los Angeles    200             1
2  San Francisco    400             3
3         Boston    150             0
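In practice each category usually appears many times, and new data may contain categories that were never seen during fitting. The sketch below illustrates both points under stated assumptions: the extra rows, the `new_rows` frame, and the `-1` sentinel for unseen cities are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical training data with repeated cities
train = pd.DataFrame({
    'city': ['Boston', 'Boston', 'New York', 'New York', 'Los Angeles', 'San Francisco'],
    'price': [150, 170, 300, 320, 200, 400],
})

# Rank the cities by their mean price, learned from the training data only
city_means = train.groupby('city')['price'].mean().sort_values()
city_rank = {city: rank for rank, city in enumerate(city_means.index)}
train['city_encoded'] = train['city'].map(city_rank)

# Apply the same mapping to new rows; unseen cities map to NaN,
# which is replaced here with -1 (an assumed sentinel value)
new_rows = pd.DataFrame({'city': ['Boston', 'Chicago']})
new_rows['city_encoded'] = new_rows['city'].map(city_rank).fillna(-1).astype(int)

print(train)
print(new_rows)
```

Other ways of handling unseen categories, such as assigning the middle rank, are equally plausible; the sentinel is only one option.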


In [4]:
### Q3: Define Covariance and Explain Why It Is Important in Statistical Analysis. How Is Covariance Calculated?
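'''
Covariance:
- Definition: Covariance measures the direction of the linear relationship between two variables. A positive value means the variables tend to increase or decrease together; a negative value means one tends to decrease as the other increases.
- Importance: Covariance helps in understanding the relationship between variables and is used in portfolio theory, risk management, and other statistical analyses. Its magnitude depends on the units of the variables, so it indicates direction rather than strength.
- Calculation: For n paired observations, the sample covariance is cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1). Dividing by n instead of n - 1 gives the population covariance.
'''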
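A minimal sketch of the calculation, assuming two short illustrative arrays that are not from the original notebook: the products of the deviations from each mean are summed and divided by n - 1, and the result is checked against `np.cov`.

```python
import numpy as np

# Illustrative values (an assumption for this sketch)
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

def covariance(a, b, ddof=1):
    """Sum of products of deviations divided by (n - ddof)."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return float((a_dev * b_dev).sum() / (len(a) - ddof))

sample_cov = covariance(x, y)              # divides by n - 1
population_cov = covariance(x, y, ddof=0)  # divides by n

# np.cov returns the full 2x2 matrix; entry [0, 1] is cov(x, y)
numpy_cov = float(np.cov(x, y)[0, 1])

print(sample_cov, population_cov, numpy_cov)
```

`np.cov` uses the n - 1 divisor by default, so it agrees with the sample version.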


In [5]:
### Q4: For a Dataset with the Following Categorical Variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), Perform Label Encoding Using Python's scikit-learn Library. Show Your Code and Explain the Output.

# Label Encoding Example:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic']
})

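# Apply Label Encoding (the same encoder is refit for each column, so it only retains the classes of the last column fitted)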
label_encoder = LabelEncoder()
data['color_encoded'] = label_encoder.fit_transform(data['color'])
data['size_encoded'] = label_encoder.fit_transform(data['size'])
data['material_encoded'] = label_encoder.fit_transform(data['material'])

print(data)




   color    size material  color_encoded  size_encoded  material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
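The output can be explained by inspecting the classes each encoder learned. The sketch below is one way to do that, assuming a separate encoder per column (the `encoders` and `mappings` dictionaries are names introduced here for illustration) so that every mapping stays available and can be inverted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic'],
})

encoders = {}   # one fitted encoder per column
mappings = {}   # category -> integer code for each column
for column in ['color', 'size', 'material']:
    encoder = LabelEncoder()
    data[f'{column}_encoded'] = encoder.fit_transform(data[column])
    encoders[column] = encoder
    # classes_ holds the categories in sorted order; the position is the code
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)

# Each encoder can turn its codes back into the original labels
restored = encoders['size'].inverse_transform(data['size_encoded'])
print(list(restored))
```

The codes follow alphabetical order within each column, which is why `size` is encoded as large = 0, medium = 1, small = 2 rather than by actual size.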


In [7]:
### Q5: Calculate the Covariance Matrix for the Following Variables in a Dataset: Age, Income, and Education Level. Interpret the Results.

# Example Data and Covariance Matrix Calculation:
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'age': [25, 35, 45, 20, 35],
    'income': [50000, 60000, 70000, 40000, 50000],
    'education_level': [2, 3, 4, 1, 2]  # Assume 1: High School, 2: Bachelor's, 3: Master's, 4: PhD
})

# Calculate covariance matrix
cov_matrix = data.cov()

print(cov_matrix)

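'''
Interpretation:
- The diagonal entries are the variances of each variable (age 95, income 130,000,000, education_level 1.3).
- All off-diagonal entries are positive, so in this sample age, income, and education level tend to increase together.
- A negative covariance would indicate that one variable tends to decrease when the other increases, and a covariance of zero would indicate no linear relationship.
- The magnitudes depend on the units of each variable, so they cannot be compared across pairs; use correlation to compare strength.

'''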

                       age       income  education_level
age                  95.00     102500.0            10.25
income           102500.00  130000000.0         13000.00
education_level      10.25      13000.0             1.30
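Because covariance depends on the units of each variable, the entries above are hard to compare with one another. A minimal sketch of one common remedy, dividing each covariance by the product of the two standard deviations to obtain correlations, and checking the result against `DataFrame.corr`:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'age': [25, 35, 45, 20, 35],
    'income': [50000, 60000, 70000, 40000, 50000],
    'education_level': [2, 3, 4, 1, 2],
})

cov_matrix = data.cov()

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov_matrix / np.outer(std, std)

# pandas computes the same Pearson correlation directly
corr_direct = data.corr()

print(corr_from_cov.round(3))
```

Correlations lie between -1 and 1, so the strength of the age/income relationship can be compared directly with the others.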



In [1]:
### Q6: You Are Working on a Machine Learning Project with a Dataset Containing Several Categorical Variables, Including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which Encoding Method Would You Use for Each Variable, and Why?

''' Encoding Methods:
- Gender: Binary (0/1) Encoding (Male/Female).
  - Reason: With only two categories, a single 0/1 column captures all the information.
- Education Level: Ordinal Encoding (High School < Bachelor's < Master's < PhD).
  - Reason: Education levels have an inherent order that the integer codes preserve.
- Employment Status: One-Hot Encoding (Unemployed, Part-Time, Full-Time).
  - Reason: The categories are treated as nominal here, and one-hot encoding avoids imposing an order on them.

Example: '''

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'education_level': ['High School', 'Bachelor', 'Master', 'PhD'],
    'employment_status': ['Unemployed', 'Part-Time', 'Full-Time', 'Unemployed']
})

# Binary Encoding for Gender (LabelEncoder sorts the labels: Female -> 0, Male -> 1)
label_encoder = LabelEncoder()
data['gender_encoded'] = label_encoder.fit_transform(data['gender'])

# Ordinal Encoding for Education Level
ordinal_encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
data['education_level_encoded'] = ordinal_encoder.fit_transform(data[['education_level']])

# One-Hot Encoding for Employment Status
# (scikit-learn >= 1.2 uses sparse_output; the older sparse argument was removed in 1.4)
one_hot_encoder = OneHotEncoder(sparse_output=False)
employment_encoded = one_hot_encoder.fit_transform(data[['employment_status']])
employment_encoded_df = pd.DataFrame(employment_encoded, columns=one_hot_encoder.get_feature_names_out(['employment_status']))

# Concatenate one-hot encoded columns with original data
data = pd.concat([data, employment_encoded_df], axis=1).drop('employment_status', axis=1)

print(data)




   gender education_level  gender_encoded  education_level_encoded  \
0    Male     High School               1                      0.0   
1  Female        Bachelor               0                      1.0   
2  Female          Master               0                      2.0   
3    Male             PhD               1                      3.0   

   employment_status_Full-Time  employment_status_Part-Time  \
0                          0.0                          0.0   
1                          0.0                          1.0   
2                          1.0                          0.0   
3                          0.0                          0.0   

   employment_status_Unemployed  
0                           1.0  
1                           0.0  
2                           0.0  
3                           1.0  




In [8]:
### Q7: You Are Analyzing a Dataset with Two Continuous Variables, "Temperature" and "Humidity", and Two Categorical Variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the Covariance Between Each Pair of Variables and Interpret the Results.

# Covariance Calculation for Continuous Variables:

import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'temperature': [30, 25, 20, 35, 40],
    'humidity': [70, 65, 60, 75, 80]
})

# Calculate covariance matrix for continuous variables
cov_matrix = data.cov()

print(cov_matrix)

'''
Interpretation:
- The covariance between temperature and humidity is 62.5, which is positive: in this sample, humidity tends to rise when temperature rises and fall when it falls.
- It equals the variance of each variable because every humidity value here is exactly the temperature plus 40.
'''

# Encoding Categorical Variables and Calculating Covariance:

from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({
    'temperature': [30, 25, 20, 35, 40],
    'humidity': [70, 65, 60, 75, 80],
    'weather_condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'wind_direction': ['North', 'South', 'East', 'West', 'North']
})

# Encode categorical variables
label_encoder = LabelEncoder()
data['weather_condition_encoded'] = label_encoder.fit_transform(data['weather_condition'])
data['wind_direction_encoded'] = label_encoder.fit_transform(data['wind_direction'])

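# Calculate covariance matrix over the numeric columns, including the encoded categorical variables
# (numeric_only=True skips the original string columns, which pandas 2.x would otherwise reject)
cov_matrix = data.cov(numeric_only=True)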

print(cov_matrix)

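'''
Interpretation:
- The temperature and humidity entries are unchanged from the first matrix.
- Covariances involving the encoded columns depend on the integer codes, which LabelEncoder assigns alphabetically. Because weather condition and wind direction are nominal, the sign and size of those values (e.g., 0.00 for weather condition, 3.75 for wind direction) reflect the arbitrary coding rather than a real relationship and should not be interpreted.
'''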

             temperature  humidity
temperature         62.5      62.5
humidity            62.5      62.5
                           temperature  humidity  weather_condition_encoded  \
temperature                      62.50     62.50                       0.00   
humidity                         62.50     62.50                       0.00   
weather_condition_encoded         0.00      0.00                       1.00   
wind_direction_encoded            3.75      3.75                       0.25   

                           wind_direction_encoded  
temperature                                  3.75  
humidity                                     3.75  
weather_condition_encoded                    0.25  
wind_direction_encoded                       1.30  
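To see how much the covariance with a label-encoded nominal variable depends on the coding, the sketch below recomputes it under two different integer assignments. The `compass` mapping is a hypothetical alternative introduced for illustration; the one-hot version at the end is one possible way to get values that do not depend on an arbitrary ordering.

```python
import pandas as pd

data = pd.DataFrame({
    'temperature': [30, 25, 20, 35, 40],
    'wind_direction': ['North', 'South', 'East', 'West', 'North'],
})

# The alphabetical codes that LabelEncoder produces
alphabetical = {'East': 0, 'North': 1, 'South': 2, 'West': 3}
# A different but equally valid coding (hypothetical)
compass = {'North': 0, 'East': 1, 'South': 2, 'West': 3}

cov_alphabetical = data['temperature'].cov(data['wind_direction'].map(alphabetical))
cov_compass = data['temperature'].cov(data['wind_direction'].map(compass))

print(cov_alphabetical, cov_compass)  # the sign changes with the coding

# One indicator column per direction; each covariance now has a clear meaning:
# whether temperature tends to be higher when that direction is observed
dummies = pd.get_dummies(data['wind_direction'], dtype=float)
cov_one_hot = dummies.apply(lambda column: data['temperature'].cov(column))

print(cov_one_hot)
```

Since relabeling the categories flips the sign, the covariance with the encoded column says nothing reliable about the data itself.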

