In [1]:
# Question 5: Label Encoding vs One-Hot Encoding
# Task: Show the difference between Label Encoding and One-Hot Encoding on the Titanic dataset for the 'Sex' feature.





# Question 6: Combining Feature Scaling Techniques
# Task: Demonstrate combining Min-Max Scaling and Standardization on the same dataset and explain the results.
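# A minimal sketch for Question 6, assuming a small made-up sample of 'Age' and 'Fare'
# values stands in for the real Titanic columns (the numbers are illustrative only).
# "Combining" is read here as chaining the two scalers, which is one possible
# interpretation of the task, not the only one.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

q6_df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Min-Max Scaling: maps each column linearly onto the range [0, 1]
q6_minmax = MinMaxScaler().fit_transform(q6_df)

# Standardization: rescales each column to zero mean and unit variance
q6_standard = StandardScaler().fit_transform(q6_df)

# Chain the two: standardize first, then Min-Max scale the standardized values
q6_chained = MinMaxScaler().fit_transform(q6_standard)

print("Min-Max scaled:\n", q6_minmax.round(3))
print("Standardized:\n", q6_standard.round(3))
print("Standardized, then Min-Max scaled:\n", q6_chained.round(3))

# Both scalers are increasing linear maps applied per column, so the chained result
# matches Min-Max Scaling alone: the second scaler overrides the first.
print("Chained equals Min-Max alone:", np.allclose(q6_chained, q6_minmax))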





# Question 7: Handling Multiple Categorical Features
# Task: Handle multiple categorical features ('Sex', 'Embarked') from the Titanic dataset using One-Hot Encoding.
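# A minimal sketch for Question 7, assuming a small made-up sample with the Titanic
# column names 'Sex', 'Embarked' and 'Age' (the rows are illustrative, not real data).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

q7_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Embarked': ['S', 'C', 'S', 'Q', 'C'],
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0],
})

# Option 1: pandas get_dummies encodes the listed columns and keeps the rest unchanged
q7_dummies = pd.get_dummies(q7_df, columns=['Sex', 'Embarked'], dtype=int)
print(q7_dummies)

# Option 2: a ColumnTransformer applies OneHotEncoder only to the categorical columns
# and passes the remaining columns through untouched (convenient inside pipelines)
q7_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False),
         ['Sex', 'Embarked']),
    ],
    remainder='passthrough',
    verbose_feature_names_out=False,  # keep names like 'Sex_male' without a prefix
)
q7_encoded = q7_transformer.fit_transform(q7_df)
q7_encoded_df = pd.DataFrame(q7_encoded, columns=q7_transformer.get_feature_names_out())
print(q7_encoded_df)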




# Question 8: Ordinal Encoding for Ranked Categories
# Task: Ordinal encode 'Pclass' (Passenger class) from the Titanic dataset considering passenger class as a ranked feature.
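# A minimal sketch for Question 8, assuming a small made-up 'Pclass' sample.
# Which class should receive the highest code is a modelling choice; both
# directions are shown, and the new column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

q8_df = pd.DataFrame({'Pclass': [3, 1, 3, 2, 1]})

# scikit-learn: numeric categories must be listed in sorted order,
# so 1st class -> 0, 2nd class -> 1, 3rd class -> 2
q8_encoder = OrdinalEncoder(categories=[[1, 2, 3]], dtype=int)
q8_df['Pclass_Ordinal'] = q8_encoder.fit_transform(q8_df[['Pclass']]).ravel()

# pandas: an ordered Categorical accepts any explicit order,
# here 3rd class -> 0, 2nd class -> 1, 1st class -> 2
q8_ranked = pd.Categorical(q8_df['Pclass'], categories=[3, 2, 1], ordered=True)
q8_df['Pclass_Rank'] = q8_ranked.codes

print(q8_df)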





# Question 9: Impact of Scaling on Different Algorithms
# Task: Investigate the impact of different scaling techniques on a decision tree model and compare it with an SVM.
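# A minimal sketch for Question 9, assuming synthetic data from make_classification
# in place of the Titanic dataset, with one feature deliberately inflated so that the
# features sit on very different scales. Exact accuracies depend on the data and seeds.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

q9_X, q9_y = make_classification(n_samples=300, n_features=5, n_informative=3,
                                 random_state=0)
q9_X[:, 0] *= 1000.0  # put the first feature on a much larger scale than the others

q9_X_train, q9_X_test, q9_y_train, q9_y_test = train_test_split(
    q9_X, q9_y, test_size=0.3, random_state=0)

q9_scalers = {'none': None, 'standard': StandardScaler(), 'minmax': MinMaxScaler()}
q9_results = {}

for scaler_name, scaler in q9_scalers.items():
    if scaler is None:
        X_tr, X_te = q9_X_train, q9_X_test
    else:
        # Fit the scaler on the training split only, then reuse it on the test split
        X_tr = scaler.fit_transform(q9_X_train)
        X_te = scaler.transform(q9_X_test)
    for model_name, model in [('tree', DecisionTreeClassifier(random_state=0)),
                              ('svm', SVC())]:
        model.fit(X_tr, q9_y_train)
        q9_results[(model_name, scaler_name)] = model.score(X_te, q9_y_test)

for (model_name, scaler_name), accuracy in q9_results.items():
    print(f"{model_name:>4} | {scaler_name:<8} | accuracy = {accuracy:.3f}")

# Expectation to check against the printed numbers: tree splits depend only on the
# ordering of values within each feature, so scaling should leave the tree essentially
# unchanged, while the distance-based SVM is sensitive to feature scale.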



# Question 10: Custom Transformations for Categorical Features
# Task: Implement a custom transformation function for encoding high cardinality categorical features efficiently.
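# A minimal sketch for Question 10, assuming frequency encoding as the technique
# (one of several options for high-cardinality features). FrequencyEncoder is a
# hypothetical helper written for this example, not a scikit-learn class, and the
# 'Ticket' values below are made up.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with its relative frequency observed during fit."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # One lookup table per column: category -> share of rows with that category
        self.frequency_maps_ = {
            col: X[col].value_counts(normalize=True) for col in X.columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, frequencies in self.frequency_maps_.items():
            # Categories not seen during fit get a frequency of 0.0
            X[col] = X[col].map(frequencies).fillna(0.0)
        return X


q10_train = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1', 'B2']})
q10_new = pd.DataFrame({'Ticket': ['A1', 'C3', 'Z9']})

q10_encoder = FrequencyEncoder().fit(q10_train)
q10_encoded = q10_encoder.transform(q10_new)
print(q10_encoded)

# This yields a single numeric column regardless of how many distinct categories
# exist, whereas One-Hot Encoding would add one column per category.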

# Solution to Question 5: Label Encoding vs One-Hot Encoding
# Import the libraries used below
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Create a small sample DataFrame standing in for the Titanic 'Sex' feature
data = {'Sex': ['male', 'female', 'male', 'male', 'female', 'female', 'male']}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

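# --- Label Encoding ---
# Initialize the LabelEncoder. It assigns integer codes to the classes in sorted
# order, so here 'female' -> 0 and 'male' -> 1.
# Note: scikit-learn documents LabelEncoder as intended for target labels (y);
# OrdinalEncoder is the equivalent aimed at input features.
label_encoder = LabelEncoder()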

# Apply Label Encoding to the 'Sex' column
df['Sex_LabelEncoded'] = label_encoder.fit_transform(df['Sex'])

print("\nData after Label Encoding:")
print(df)

# --- One-Hot Encoding ---
# Using pandas get_dummies is a simple way for One-Hot Encoding
# prefix='Sex': adds 'Sex_' to the new column names
# dtype=int: ensures the new columns have integer values (0 or 1)
df_onehot = pd.get_dummies(df['Sex'], prefix='Sex', dtype=int)

print("\nData after One-Hot Encoding (using pandas get_dummies):")
print(df_onehot)

# Alternatively, use scikit-learn's OneHotEncoder (more suitable for pipelines)
# Initialize OneHotEncoder
# handle_unknown='ignore' encodes categories not seen during fit as all zeros
# instead of raising an error (useful for test sets)
# sparse_output=False returns a dense numpy array instead of a sparse matrix
# (this parameter name requires scikit-learn >= 1.2; older versions use sparse=False)
onehot_encoder_sk = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Select 'Sex' with double brackets to get a 2D DataFrame, as required by scikit-learn's encoder
sex_column = df[['Sex']]

# Fit and transform the 'Sex' column
sex_onehot_sk = onehot_encoder_sk.fit_transform(sex_column)

# Get the feature names for the one-hot encoded columns
# This is useful for creating a DataFrame with meaningful column names
onehot_feature_names = onehot_encoder_sk.get_feature_names_out(['Sex'])

# Create a DataFrame from the scikit-learn One-Hot encoded result
df_onehot_sk = pd.DataFrame(sex_onehot_sk, columns=onehot_feature_names)

print("\nData after One-Hot Encoding (using scikit-learn OneHotEncoder):")
print(df_onehot_sk)

# Note: The One-Hot encoded columns always sum to 1, so for linear models one of them
# is typically dropped to avoid perfect multicollinearity.
# For example, dropping the 'Sex_male' column:
# df_onehot_sk_dropped = df_onehot_sk.drop('Sex_male', axis=1)
# print("\nData after One-Hot Encoding (scikit-learn, 'Sex_male' dropped):")
# print(df_onehot_sk_dropped)
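
# The note above drops the column after encoding. As a hedged alternative sketch,
# the encoder itself can be told which category to drop through its `drop` parameter.
# The three-row sample is made up so that this snippet runs on its own.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

drop_demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})

# drop=['male'] removes the 'male' category for the single input feature
encoder_dropped = OneHotEncoder(drop=['male'], sparse_output=False)
sex_dropped = encoder_dropped.fit_transform(drop_demo[['Sex']])
dropped_names = encoder_dropped.get_feature_names_out(['Sex'])

df_dropped = pd.DataFrame(sex_dropped, columns=dropped_names)
print("\nOne-Hot Encoding with the 'male' category dropped by the encoder:")
print(df_dropped)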





Original Data:
      Sex
0    male
1  female
2    male
3    male
4  female
5  female
6    male

Data after Label Encoding:
      Sex  Sex_LabelEncoded
0    male                 1
1  female                 0
2    male                 1
3    male                 1
4  female                 0
5  female                 0
6    male                 1

Data after One-Hot Encoding (using pandas get_dummies):
   Sex_female  Sex_male
0           0         1
1           1         0
2           0         1
3           0         1
4           1         0
5           1         0
6           0         1

Data after One-Hot Encoding (using scikit-learn OneHotEncoder):
   Sex_female  Sex_male
0         0.0       1.0
1         1.0       0.0
2         0.0       1.0
3         0.0       1.0
4         1.0       0.0
5         1.0       0.0
6         0.0       1.0
