# Coding Exercise 3: Encoding Categorical Data for Machine Learning

Instructions:

- Import the required libraries (pandas and NumPy) and the classes needed for this task: ColumnTransformer, OneHotEncoder, and LabelEncoder.

- Start by loading the Titanic dataset into a pandas DataFrame using the pd.read_csv function. The file is named 'titanic.csv'.

- Identify the categorical features in your dataset that need to be encoded. You can store these feature names in a list for easy access later.

- To apply OneHotEncoding to these categorical features, create an instance of the ColumnTransformer class. Make sure to pass the OneHotEncoder() as an argument along with the list of categorical features.

- Use the fit_transform method on the instance of ColumnTransformer to apply the OneHotEncoding.

- The output of the fit_transform method should be converted into a NumPy array for further use.

- The 'Survived' column in your dataset is the dependent variable. This is a binary categorical variable that should be encoded using LabelEncoder.

- Print the updated matrix of features and the dependent variable vector.

# Data Preprocessing

## Importing the libraries

In [5]:
import pandas as pd 
import numpy as np 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

## Importing the dataset
To match the dataset used in the Udemy exercise, I drop the rows where Age or Cabin is null.

In [7]:
# Dataset: https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
dataset = pd.read_csv('titanic.csv')

# cabin_not_null = dataset[dataset['Cabin'].notnull()]  # Alternative: keep only rows where Cabin is not null
dataset = dataset.dropna(subset=['Age', 'Cabin'])
# print(dataset)
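
As a minimal sketch of what `dropna(subset=...)` does, the toy frame below (hypothetical rows, not taken from titanic.csv) shows that only the listed columns are checked. A missing value in any other column, such as Embarked, survives the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from titanic.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", "C123", None, "E46"],
    "Embarked": ["S", "C", "S", None],
})

# A row is dropped only when Age or Cabin is missing; Embarked is not checked
cleaned = toy.dropna(subset=["Age", "Cabin"])

print(cleaned.shape)
print(cleaned.isnull().sum())
```

This is why a missing-value check after the drop can still report nulls in columns that were not part of `subset`.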

## Identify the categorical data

In [9]:
categorical_features = ['Sex', 'Embarked', 'Pclass']  # Assuming these are the categorical features
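
The list above is written by hand. One possible way to cross-check it is to inspect column types, sketched here on a hypothetical toy frame. The cardinality threshold of 3 is an arbitrary assumption for this example, and the result should be treated as a list of candidates to review rather than a definitive answer.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
    "Fare": [71.28, 7.25, 13.0, 8.05],
    "Embarked": ["C", "S", "S", "Q"],
})

# Non-numeric columns are the obvious candidates for encoding
non_numeric = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

# Integer columns with few distinct values may be categories stored as numbers
# (the threshold of 3 is an arbitrary choice for this toy data)
low_cardinality_ints = [
    col for col in toy.columns
    if pd.api.types.is_integer_dtype(toy[col]) and toy[col].nunique() <= 3
]

candidates = non_numeric + low_cardinality_ints
print(candidates)
```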

## Implement an instance of the ColumnTransformer class


In [11]:
ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'  # Pass through any other columns
)
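
A fitted encoder may later meet a category it has not seen during fitting. The sketch below, on hypothetical toy data, uses `handle_unknown="ignore"` so that such a value is encoded as all zeros instead of raising an error. `sparse_threshold=0` is set only so that the result prints as a dense array; neither setting is part of the exercise.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'Q' never appears in the training rows
train = pd.DataFrame({"Embarked": ["S", "C", "S"], "Fare": [7.25, 71.28, 8.05]})
new = pd.DataFrame({"Embarked": ["Q"], "Fare": [7.75]})

ct_ignore = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["Embarked"])
    ],
    remainder="passthrough",  # Fare is appended after the encoded columns
    sparse_threshold=0,       # always return a dense array
)

ct_ignore.fit(train)
encoded_new = ct_ignore.transform(new)

# The unseen category becomes a row of zeros in the two one-hot columns (C, S)
print(encoded_new)
```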

## Apply the fit_transform method on the instance of ColumnTransformer



In [13]:
X = ct.fit_transform(dataset)
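
Because `dataset` still contains the `Survived` column, `remainder='passthrough'` carries the target into `X` together with the other columns. A common alternative, sketched here on hypothetical toy data, is to split the target off before fitting the transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows, not taken from titanic.csv
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Sex": ["female", "male", "female", "male"],
    "Fare": [71.28, 7.25, 53.1, 8.05],
})

# Separate the target so it cannot leak into the feature matrix
features = toy.drop(columns=["Survived"])
target = toy["Survived"]

ct_features = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Sex"])],
    remainder="passthrough",
)
X_toy = ct_features.fit_transform(features)

# Two one-hot columns for Sex plus the passed-through Fare column
print(X_toy.shape)
```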


## Convert the output into a NumPy array

In [37]:
X = np.array(X)
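
`ColumnTransformer` can return a SciPy sparse matrix when the combined output is mostly zeros, and `np.array` does not densify a sparse matrix. A more defensive conversion, sketched here with a hypothetical `Deck` column, checks for sparsity first.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with ten distinct values, so the one-hot output is mostly zeros
toy = pd.DataFrame({"Deck": list("ABCDEFGHIJ")})

ct_sparse = ColumnTransformer([("encoder", OneHotEncoder(), ["Deck"])])
encoded = ct_sparse.fit_transform(toy)

# Convert explicitly when the result is a SciPy sparse matrix
X_dense = encoded.toarray() if sparse.issparse(encoded) else np.asarray(encoded)

print(type(encoded).__name__, X_dense.shape)
```

In the notebook itself the passed-through columns make the output dense already, so `np.array(X)` leaves the data unchanged there.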

## Use LabelEncoder to encode binary categorical data

In [17]:
le = LabelEncoder()
y = le.fit_transform(dataset['Survived'])
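
`Survived` is already stored as 0 and 1, so `LabelEncoder` leaves its values unchanged. The effect is easier to see on string labels, as in this small sketch with hypothetical "no"/"yes" values, which also shows how to recover the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels standing in for a binary target
labels = np.array(["no", "yes", "yes", "no"])

le_toy = LabelEncoder()
encoded_labels = le_toy.fit_transform(labels)

print(le_toy.classes_)   # sorted unique labels; the index is the encoded value
print(encoded_labels)
print(le_toy.inverse_transform(encoded_labels))  # back to the original labels
```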

## Print the updated matrix of features and the dependent variable vector

In [19]:
print("Updated matrix of features (after OneHotEncoding):\n", X)
print("Dependent variable vector (encoded using LabelEncoder):\n", y)



Updated matrix of features (after OneHotEncoding):
 [[1.0 0.0 1.0 ... 'PC 17599' 71.2833 'C85']
 [1.0 0.0 0.0 ... '113803' 53.1 'C123']
 [0.0 1.0 0.0 ... '17463' 51.8625 'E46']
 ...
 [1.0 0.0 1.0 ... '11767' 83.1583 'C50']
 [1.0 0.0 0.0 ... '112053' 30.0 'B42']
 [0.0 1.0 1.0 ... '111369' 30.0 'C148']]
Dependent variable vector (encoded using LabelEncoder):
 [1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1
 1 1 1 1 0 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1
 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1
 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1
 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1]


In [55]:
missing_data = dataset.isnull().sum()
print("Missing data:\n", missing_data)

print("\n", np.array_equal(X, y))
print(X.shape)
print(y.shape)


Missing data:
 PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64

 False
(185, 18)
(185,)
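
The check above reports two missing values in `Embarked`. Depending on the scikit-learn version, `OneHotEncoder` may treat missing values as a category of their own or raise an error. One option, sketched here on hypothetical toy data, is to fill them with the most frequent value before encoding. Whether that is appropriate for this exercise is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ports of embarkation
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

# mode() ignores missing values by default; take the first (most frequent) entry
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common)
print(toy["Embarked"].isnull().sum())
```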
