***This notebook tackles a binary classification task: predicting whether a passenger on the Titanic survived.
The target is the "Survived" column, which is 1 if the passenger survived and 0 otherwise. The input features are Pclass (passenger class), Sex, Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Fare, Embarked (port of embarkation), Cabin (reduced to its deck letter during preprocessing), and FamilySize (an engineered feature, SibSp + Parch + 1).
An XGBoost classifier is trained to learn the relationship between these features and the survival outcome, and the trained model is then used to predict outcomes for held-out passengers it has not seen.
The model can output either a survival probability or a binary prediction (0 or 1). It is evaluated by accuracy: the fraction of test-set passengers whose predicted outcome matches the actual one.***
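
***The relationship between a probability output and a binary prediction can be sketched as follows. The probabilities are made-up values rather than real model output, and the 0.5 cut-off is an assumed, commonly used threshold. With a fitted `XGBClassifier`, `predict_proba` returns probabilities and `predict` returns labels.***

```python
import numpy as np

# Hypothetical predicted survival probabilities for five passengers
# (illustrative values only, not output from the notebook's model)
probabilities = np.array([0.91, 0.12, 0.55, 0.49, 0.73])

# Turn each probability into a 0/1 prediction using an assumed 0.5 cut-off
threshold = 0.5
predictions = (probabilities >= threshold).astype(int)

print(predictions)
```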

In [1]:
#Importing the necessary modules
import pandas as pd
import plotly.express as px
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [2]:

# Loading the Titanic dataset
data = pd.read_csv('titanic.csv')

In [3]:
# Preprocessing the data
# Handling missing values in the 'Age' and 'Embarked' columns
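# Assigning the result back avoids chained inplace calls, which newer pandas versions warn about
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])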

# Handling missing values in the 'Cabin' column and reducing each cabin to its deck letter
# ('U' marks an unknown cabin)
data['Cabin'] = data['Cabin'].fillna('U').str[0]

# Creating a new feature 'FamilySize'
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

# Encoding categorical features
# Encoding 'Sex' and 'Embarked' columns
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

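# Encoding 'Cabin' (deck letter) column
# Note: any deck letter missing from this mapping (e.g. 'G') becomes NaN, which XGBoost treats as a missing value
data['Cabin'] = data['Cabin'].map({'U': 0, 'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'T': 7})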
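
***To see what the preprocessing steps do, here is a minimal sketch on a three-row sample. The sample frame and its values are hypothetical; only the column names follow the dataset. It applies the same imputation, deck-letter extraction, and FamilySize calculation, then checks that no missing values remain.***

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample using the same column names as the Titanic data
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0],
    'Embarked': ['S', np.nan, 'S'],
    'Cabin': ['C85', np.nan, 'E46'],
    'SibSp': [1, 0, 1],
    'Parch': [0, 0, 2],
})

# Fill missing ages with the column mean and missing ports with the most common port
sample['Age'] = sample['Age'].fillna(sample['Age'].mean())
sample['Embarked'] = sample['Embarked'].fillna(sample['Embarked'].mode()[0])

# Keep only the deck letter; 'U' marks an unknown cabin
sample['Cabin'] = sample['Cabin'].fillna('U').str[0]

# Family size counts the passenger plus siblings/spouses and parents/children
sample['FamilySize'] = sample['SibSp'] + sample['Parch'] + 1

# Count of missing values per column after preprocessing
print(sample.isna().sum())
```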


In [4]:

# Splitting the data into features and target
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Cabin', 'FamilySize']]
y = data['Survived']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
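
***The split above is random but not stratified, so the share of survivors can differ between the training and testing sets. As an optional variation (not what this notebook uses), `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The toy data below is hypothetical.***

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 passengers, 5 of whom survived (25%)
toy = pd.DataFrame({
    'Fare': range(20),
    'Survived': [1, 0, 0, 0] * 5,
})
X_toy = toy[['Fare']]
y_toy = toy['Survived']

# stratify=y_toy keeps the survivor share the same in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Sizes of the two parts, then the survivor share in each
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```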

In [5]:
# Training the XGBoost model
# Creating an XGBoost classifier model with the specified hyperparameters
model = XGBClassifier(objective='binary:logistic', n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)

# Fitting the model to the training data
model.fit(X_train, y_train)

# Evaluating the model
# Making predictions on the testing data
y_pred = model.predict(X_test)


In [6]:
# Calculating the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.82
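
***Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand, checks it against `accuracy_score`, and uses `confusion_matrix` to show how the errors split between the two classes. The label arrays are hypothetical and are not taken from this notebook's test set.***

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for eight passengers
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = (y_true == y_hat).mean()

# The same value from scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
print(cm)
```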


***Visualising the data, taking note of some key correlations***

In [7]:
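# Survival rate by gender (0 = male, 1 = female after encoding)
# Averaging 'Survived' per group gives a rate; plotting raw rows would stack survivor counts instead
rate_by_sex = data.groupby('Sex', as_index=False)['Survived'].mean()
fig = px.bar(rate_by_sex, x='Sex', y='Survived', title='Survival Rate by Gender')
fig.show()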

In [8]:
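# Survival rate by passenger class (mean of 'Survived' within each class)
rate_by_class = data.groupby('Pclass', as_index=False)['Survived'].mean()
fig = px.bar(rate_by_class, x='Pclass', y='Survived', title='Survival Rate by Passenger Class')
fig.show()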

In [9]:
# Age distribution split by survival outcome (a histogram shows counts, not rates)
fig = px.histogram(data, x='Age', color='Survived', title='Age Distribution by Survival Outcome')
fig.show()

In [10]:
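# Survival rate by family size (mean of 'Survived' within each family size)
rate_by_family = data.groupby('FamilySize', as_index=False)['Survived'].mean()
fig = px.bar(rate_by_family, x='FamilySize', y='Survived', title='Survival Rate by Family Size')
fig.show()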
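
***A bar chart drawn from raw rows with a 0/1 column on the y-axis stacks one segment per passenger, so bar height is the number of survivors rather than a rate. The sketch below, on hypothetical toy data, contrasts the per-group sum (a count) with the per-group mean (a rate).***

```python
import pandas as pd

# Hypothetical toy data: three first-class and five third-class passengers
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0],
})

# Sum of a 0/1 column per group is the number of survivors
counts = toy.groupby('Pclass')['Survived'].sum()

# Mean of a 0/1 column per group is the survival rate
rates = toy.groupby('Pclass')['Survived'].mean()

print(counts)
print(rates)
```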