## The Titanic Dataset

The Titanic dataset is a classic dataset used in data science and machine learning. It describes the passengers aboard the Titanic, with attributes such as age, sex, ticket class, and fare, along with whether each passenger survived the sinking.

Here are the details of the columns in the Titanic dataset:

1. **PassengerId**: A unique identifier for each passenger.
2. **Survived**: Indicates whether the passenger survived or not. This is the target variable. It is binary: 0 for not survived, 1 for survived.
3. **Pclass (Ticket class)**: The class of the ticket purchased by the passenger (1 = first, 2 = second, 3 = third), a proxy for socio-economic status.
4. **Name**: The name of the passenger.
5. **Sex**: The sex of the passenger (`male` or `female`).
6. **Age**: The age of the passenger in years. (Some entries may be missing, denoted by "NaN".)
7. **SibSp**: The number of the passenger's siblings or spouses aboard the Titanic.
8. **Parch**: The number of the passenger's parents or children aboard the Titanic.
9. **Ticket**: The ticket number.
10. **Fare**: The fare paid by the passenger.
11. **Cabin**: The cabin number occupied by the passenger. (Some entries may be missing, denoted by "NaN".)
12. **Embarked**: The port of embarkation for the passenger. It can take one of the following values: <br>
a. C: Cherbourg <br>
b. Q: Queenstown (now known as Cobh) <br>
c. S: Southampton



## Import the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load the Dataset

In [None]:
dataset = pd.read_csv("titanic.csv")

In [None]:
print("Total number of rows: " + str(dataset.shape[0]))

Total number of rows: 891


In [None]:
column_data_types = dataset.dtypes
categorical_column = column_data_types[column_data_types == "object"].index.tolist()
print(categorical_column)

['Sex', 'Embarked']


## Deleting the columns which are unnecessary for prediction

1. **PassengerId**: Just an identifier for each passenger.
2. **Name**: Name of the passenger.
3. **Ticket**: Ticket numbers are arbitrary identifiers.
4. **Cabin**: This attribute has a large number of missing values and may not be reliable or informative for predicting survival.

In [None]:
attributes_to_drop = ["PassengerId", "Name", "Ticket", "Cabin"]
dataset.drop(columns = attributes_to_drop, inplace = True)
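
The claim that a column has too many missing values to be useful can be checked before dropping it. The following is a minimal sketch on made-up toy rows (not taken from `titanic.csv`); the names `toy`, `missing_fraction`, and `threshold`, and the 0.5 cutoff, are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not taken from titanic.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Fraction of missing entries per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical cutoff: flag columns with more than half of their values missing.
threshold = 0.5
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()
print(sparse_columns)
```

Running the same two lines on the real `dataset` before the drop would show how sparse `Cabin` actually is.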

## Taking Care of Missing Values


In [None]:
null_count = dataset.isnull().sum()
print(null_count)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64


In [None]:
nan_count = dataset.isna().sum()
print(nan_count)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64


In [None]:
mode_value = dataset["Embarked"].mode()[0]
print(mode_value)

imputer = SimpleImputer(missing_values = np.nan, strategy = "constant", fill_value = mode_value)

embarked_column = dataset["Embarked"].values.reshape(-1, 1)
embarked_column = imputer.fit_transform(embarked_column)
dataset["Embarked"] = embarked_column.flatten()

S
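
Filling with the mode can also be done in one step, since `SimpleImputer` has a `most_frequent` strategy that computes the mode itself. The sketch below uses a made-up toy column rather than the real data; it is one possible alternative, not necessarily how the cell above has to be written.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up toy column, not taken from titanic.csv.
embarked = np.array([["S"], ["C"], [np.nan], ["S"], ["Q"]], dtype=object)

# "most_frequent" learns the mode during fit and uses it as the fill value.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(embarked).ravel()

print(imputer.statistics_)  # the learned fill value for each column
print(filled)
```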


In [None]:
nan_count = dataset.isna().sum()
print(nan_count)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64


In [None]:
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")

age_column = dataset["Age"].values.reshape(-1, 1)
age_column = imputer.fit_transform(age_column)
dataset["Age"] = age_column.flatten()
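
As a sanity check, mean imputation with `SimpleImputer` should give the same result as filling with the column mean directly in pandas. A small sketch on made-up ages (the values and the name `toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy ages, not taken from titanic.csv.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 36.0]})

# Route 1: scikit-learn imputer (expects 2-D input, returns a 2-D array).
via_imputer = SimpleImputer(strategy="mean").fit_transform(toy[["Age"]]).ravel()

# Route 2: plain pandas, filling with the mean of the observed values.
via_pandas = toy["Age"].fillna(toy["Age"].mean()).to_numpy()

print(via_imputer)
print(via_pandas)
```

Both routes replace the missing entry with the mean of the observed values, here 28.0.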

In [None]:
nan_count = dataset.isna().sum()
print(nan_count)

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64


In [None]:
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values

In [None]:
print(X.shape)

(891, 7)


In [None]:
print(X[:5, :])

[[3 'male' 22.0 1 0 7.25 'S']
 [1 'female' 38.0 1 0 71.2833 'C']
 [3 'female' 26.0 0 0 7.925 'S']
 [1 'female' 35.0 1 0 53.1 'S']
 [3 'male' 35.0 0 0 8.05 'S']]


In [None]:
print(y.shape)

(891,)


## Encoding Categorical Data

In [None]:
column_index = 0
pclass_column = X[:, column_index]
distinct_values = np.unique(pclass_column)
print(distinct_values)

[1 2 3]


In [None]:
column_index = 1
sex_column = X[:, column_index]
distinct_values = np.unique(sex_column)
print(distinct_values)

['female' 'male']


In [None]:
column_index = 6
embarked_column = X[:, column_index]
distinct_values = np.unique(embarked_column)
print(distinct_values)

['C' 'Q' 'S']


In [None]:
# Encode the Sex column as integers: 0 for 'female', 1 for 'male'

le_sex = LabelEncoder()
X[:, 1] = le_sex.fit_transform(X[:, 1])
print(np.unique(X[:, 1]))

[0 1]
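
To see which integer each label received, the fitted encoder's `classes_` attribute can be inspected; `LabelEncoder` assigns codes in sorted order of the labels. A minimal sketch on a made-up column (the names `sex`, `codes`, and `mapping` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up toy column, not taken from titanic.csv.
sex = np.array(["male", "female", "female", "male"], dtype=object)

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = le.inverse_transform(codes)
print(restored)
```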


In [None]:
# One-hot encode the Embarked column (index 6); all other columns pass through unchanged
ct = ColumnTransformer(transformers = [("encoder", OneHotEncoder(), [6])], remainder = "passthrough")
X = np.array(ct.fit_transform(X))

In [None]:
print(X[0:5, :])

[[0.0 0.0 1.0 3 1 22.0 1 0 7.25]
 [1.0 0.0 0.0 1 0 38.0 1 0 71.2833]
 [0.0 0.0 1.0 3 0 26.0 0 0 7.925]
 [0.0 0.0 1.0 1 0 35.0 1 0 53.1]
 [0.0 0.0 1.0 3 1 35.0 0 0 8.05]]
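
The one-hot columns are placed first in the output, followed by the passthrough columns, and their order follows the encoder's learned categories. The sketch below illustrates this on made-up rows with only three columns; `sparse_threshold=0` is added here only to force a dense result in the toy example and is not part of the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows: [Pclass, Sex (already label-encoded), Embarked].
toy_X = np.array([
    [3, 1, "S"],
    [1, 0, "C"],
    [3, 0, "Q"],
    [1, 0, "S"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [2])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
encoded = np.array(ct.fit_transform(toy_X))

# The fitted encoder records the category order behind the one-hot columns.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded[0])
```

In this toy example the first three output columns correspond to `C`, `Q`, and `S`, so a row that embarked at `S` starts with `0.0 0.0 1.0`.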


## Splitting the dataset into training and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
print(y_test[:10])

[1 0 0 1 1 1 1 0 1 1]


In [None]:
print(y_train[:10])

[0 0 0 0 0 0 0 0 0 1]
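
With `test_size = 0.2` and 891 rows, the split sizes and the effect of `random_state` can be checked on placeholder arrays of the same shape. This is a sketch on synthetic zeros, not the real features; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y above (values are synthetic).
X_demo = np.zeros((891, 9))
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same shuffle, hence the same split.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```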


## Feature Scaling

In [None]:
scaler = StandardScaler()

# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:] = scaler.transform(X_test[:, 3:])
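
A common convention is to learn the scaling statistics from the training split only and apply them unchanged to the test split, so that no information from the test data influences preprocessing. A minimal sketch on made-up numbers (the arrays `train` and `test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy features, not taken from titanic.csv.
train = np.array([[20.0, 7.0], [30.0, 9.0], [40.0, 11.0]])
test = np.array([[30.0, 9.0], [50.0, 13.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)
print(train_scaled.mean(axis=0))  # approximately zero per column
print(test_scaled)
```

A test row equal to the training mean maps to zeros, which shows that the test data is expressed in the training set's units.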