# Coding Exercise 3: Encoding Categorical Data for Machine Learning

1: Import the required libraries (pandas and NumPy) and the classes needed for this task: ColumnTransformer, OneHotEncoder, and LabelEncoder.

2: Load the Titanic dataset into a pandas DataFrame using the pd.read_csv function. The dataset file is named 'titanic.csv'.

3: Identify the categorical features in your dataset that need to be encoded. You can store these feature names in a list for easy access later.

4: To apply one-hot encoding to these categorical features, create an instance of the ColumnTransformer class. Pass it a transformer tuple that contains OneHotEncoder() and the list of categorical features.

5: Use the fit_transform method on the instance of ColumnTransformer to apply the OneHotEncoding.

6: Convert the output of the fit_transform method into a NumPy array for further use.

7: The 'Survived' column in your dataset is the dependent variable. This is a binary categorical variable that should be encoded using LabelEncoder.

8: Print the updated matrix of features and the dependent variable vector.

In [8]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('titanic.csv')
print(df.head())

# Split features (every column except 'Survived') and target ('Survived', column index 1)
X = df.iloc[:, [0] + list(range(2, df.shape[1]))]
y = df.iloc[:, 1]


In [None]:
print(X)

     PassengerId  Pclass                                               Name  \
0              2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
1              4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
2              7       1                            McCarthy, Mr. Timothy J   
3             11       3                    Sandstrom, Miss. Marguerite Rut   
4             12       1                           Bonnell, Miss. Elizabeth   
..           ...     ...                                                ...   
178          872       1   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)   
179          873       1                           Carlsson, Mr. Frans Olof   
180          880       1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)   
181          888       1                       Graham, Miss. Margaret Edith   
182          890       1                              Behr, Mr. Karl Howell   

        Sex   Age  SibSp  Parch    Ticket     Fare 

In [None]:
print(y)

0      1
1      1
2      0
3      1
4      1
      ..
178    1
179    0
180    1
181    1
182    1
Name: Survived, Length: 183, dtype: int64


In [None]:
print(df.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
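
The dtype listing above is a useful starting point for step 3. A minimal sketch of that idea follows, using a hypothetical toy frame with made-up values rather than titanic.csv itself. It assumes that non-numeric columns are the first candidates for encoding and that integer-coded categories (such as Pclass) have to be spotted separately, for example by counting unique values. Text columns where nearly every value is distinct, as appears to be the case for Name or Ticket, are usually poor one-hot candidates even though they are non-numeric.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few Titanic columns (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "female"],
    "Age": [38.0, 22.0, 27.0],
    "Embarked": ["C", "S", "S"],
})

# Columns that are not numeric are the obvious encoding candidates.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Integer-coded categories are not caught by dtype alone, so the number of
# distinct values per column is worth reviewing by hand.
unique_counts = toy.nunique()

print(non_numeric)
print(unique_counts.to_dict())
```

The final list of features to encode is still a judgment call; the sketch only narrows down where to look.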


In [None]:
print(df.isna().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


### Encoding Categorical Data

In [None]:
# Identify the categorical data
categorical_features = ['Sex', 'Embarked', 'Pclass']

# Create an instance of the ColumnTransformer class
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')

# Apply the fit_transform method to the feature matrix only, so the target ('Survived') stays out of X
X = ct.fit_transform(X)

# Convert the output into a NumPy array
X = np.array(X)

### Encoding the Dependent Variable

In [None]:
# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
y = le.fit_transform(y)

# Print the encoded dependent variable vector
print(y)

[1 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1
 1 1 1 0 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 1
 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1 1
 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1
 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1]
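
Because 'Survived' is already stored as the integers 0 and 1, LabelEncoder leaves its values unchanged here. The sketch below uses hypothetical string labels (made up for illustration) to show what the encoder does when the target is not numeric: it sorts the unique labels and replaces each one with its index.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, made up for illustration.
labels = np.array(["yes", "no", "yes", "yes", "no"])

toy_le = LabelEncoder()
codes = toy_le.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's index is its code.
print(toy_le.classes_)  # ['no' 'yes']
print(codes)            # [1 0 1 1 0]

# inverse_transform maps the integer codes back to the original labels.
restored = toy_le.inverse_transform(codes)
print(restored)
```

Keeping the fitted encoder around is what makes it possible to translate predictions back into the original labels later.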
