<h2>Predicting Titanic Survivors with Machine Learning</h2>

<h3>Objective</h3>

The goal of this project is to develop a machine learning model to predict whether a passenger on the Titanic survived or not, based on various features such as age, gender, class, and ticket fare.

<h3>Dataset</h3>

The dataset used is the famous Titanic dataset, containing information about passengers, including whether they survived or not. It includes features such as age, sex, ticket class, and embarkation point.
The dataset can be downloaded here: https://www.kaggle.com/competitions/titanic/data

<h3>References</h3>

https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/notebook
https://www.kaggle.com/code/louyuechen0122/titanic-analysis
https://www.kaggle.com/code/eliassaker/titanic-survival-chances

<h3>About the data</h3>

<p><strong>According to the dataset owner (Kaggle):</strong></p>

<p>The <b>training set</b> should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.</p>

<p>The <b>test set</b> should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.</p>

<h2>Step 1 - Problem Identification</h2>

<p><b>Some relevant information about the disaster:</b></p>

<ul>
    <li>On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That translates to a survival rate of about 32%.</li>
    <li>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.</li>
    <li>Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.</li>
</ul>
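
<p>The survival rate quoted in the list can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two figures from the list; the variable names are illustrative.</p>

```python
# Figures quoted in the list above
total_on_board = 2224  # passengers and crew
deaths = 1502

# Everyone who did not die is counted as a survivor
survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"Survivors: {survivors}")
print(f"Survival rate: {survival_rate:.1%}")
```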

<p><b>Problem to be solved:</b> "Given a training set that lists which passengers survived the Titanic disaster and which did not, can our model predict, for a test set that lacks the survival information, whether each of those passengers survived?"</p>

<h2>Step 2 - Data Collection</h2>

<p>Before acquiring the data, we import the libraries needed for the project.</p>

In [2]:
# data mining and analysis
import pandas as pd  # DataFrames and tabular data handling
import numpy as np  # numerical arrays and math functions

# data visualization
import matplotlib.pyplot as plt  # plotting library

# machine learning algorithms (from the scikit-learn library)
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.linear_model import LinearRegression  # linear regression (a regressor, not a classifier)
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.neighbors import KNeighborsClassifier  # k-nearest neighbors classifier
from sklearn.svm import SVC, LinearSVC  # support vector classifiers
from sklearn.ensemble import RandomForestClassifier  # ensemble of decision trees
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes classifier
from sklearn.linear_model import Perceptron  # single-layer perceptron (linear classifier)
from sklearn.linear_model import SGDClassifier  # linear classifier trained with stochastic gradient descent

<p>With the libraries imported, we can load the data.</p>

In [4]:
# using the read_csv from Pandas to acquire the data
train_df = pd.read_csv('./datasets/train.csv')
test_df = pd.read_csv('./datasets/test.csv')

In [5]:
# Show the first 5 rows of train_df
train_df.head(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Show the first 5 rows of test_df
test_df.head(5)

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
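
<p>Comparing the two previews, the test set appears to carry the same columns as the training set except for the target. A minimal sketch of that check, with the column names copied from the previews (the list names are illustrative):</p>

```python
# Column names copied from the two previews above
train_columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_columns = ["PassengerId", "Pclass", "Name", "Sex", "Age",
                "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Columns present in the training set but absent from the test set
only_in_train = [col for col in train_columns if col not in test_columns]

# Columns present in the test set but absent from the training set
only_in_test = [col for col in test_columns if col not in train_columns]

print(only_in_train)
print(only_in_test)
```

<p>With the loaded DataFrames, the same comparison could be made with <code>train_df.columns.difference(test_df.columns)</code>.</p>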


<h2>Step 3 - Data Preparation</h2>

<p>To decide how to prepare the dataset, we first need to describe and inspect the data.</p>

<p>Let's look at the column names:</p>

In [7]:
display(train_df.columns.values)

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

<p>By looking at the features (column names), it is possible to separate the categorical data from the numerical data (continuous or discrete).</p>

<ul>
    <li><b>Categorical:</b> Survived, Sex, Embarked and Pclass (Pclass is also ordinal)</li>
    <li><b>Continuous:</b> Age, Fare</li>
    <li><b>Discrete:</b> SibSp, Parch</li>
</ul>
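
<p>The grouping above is a judgment about what each column means, not something pandas can infer on its own. The sketch below illustrates the difference on a small stand-in frame built from three rows of the training preview; <code>sample_df</code> and the list names are hypothetical and only serve the illustration.</p>

```python
import pandas as pd

# Hypothetical stand-in for train_df: three rows copied from the preview above
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Grouping from the list above, written out by hand
categorical = ["Survived", "Sex", "Embarked", "Pclass"]
continuous = ["Age", "Fare"]
discrete = ["SibSp", "Parch"]

# What pandas can tell from the dtypes alone
numeric_by_dtype = sample_df.select_dtypes(include="number").columns.tolist()
text_by_dtype = sample_df.select_dtypes(exclude="number").columns.tolist()

# Categorical features that are stored as numbers and would be missed by a dtype check
categorical_stored_as_numbers = [c for c in categorical if c in numeric_by_dtype]

print(numeric_by_dtype)
print(text_by_dtype)
print(categorical_stored_as_numbers)
```

<p>Survived and Pclass are stored as integers, so a dtype-based split would treat them as numerical even though they are categorical in meaning.</p>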

<p>Now, let's see the data type of each feature:</p>

In [12]:
train_df.info()  # info() prints the summary itself and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

<p>The Non-Null Count column shows that "Age", "Cabin" and "Embarked" contain null values.</p>
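
<p>How much is missing follows directly from the non-null counts. A minimal sketch using only the figures printed by <code>info()</code>; the dictionary names are illustrative:</p>

```python
# Figures copied from the info() output above
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing entries per column, and their share of all rows
missing = {col: total_rows - n for col, n in non_null.items()}
missing_share = {col: m / total_rows for col, m in missing.items()}

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_share[col]:.1%})")
```

<p>With the loaded DataFrame, <code>train_df.isnull().sum()</code> reports the same per-column counts.</p>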

<p>Let's use describe() to get summary statistics:</p>

In [19]:
display(train_df.describe())

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
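
<p>Because Survived is coded as 0 or 1, its mean is the share of survivors in the training set. A small sketch using only figures from the output above; the variable names are illustrative:</p>

```python
# Figures copied from the describe() output above
n_rows = 891
survived_mean = 0.383838  # mean of the 0/1 Survived column

# For a 0/1 column, mean * number of rows recovers the count of ones
survivors_in_train = round(survived_mean * n_rows)

print(f"Survivors in the training set: {survivors_in_train}")
print(f"Share of survivors: {survived_mean:.1%}")
```

<p>The training sample therefore has a somewhat higher survival share than the roughly 32% quoted for everyone on board in Step 1.</p>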


In [20]:
train_df.describe(include=['O'])

,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Gustafsson, Mr. Anders Vilhelm",male,1601,C23 C25 C27,S
freq,1,577,7,4,644
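
<p>The <code>top</code> and <code>freq</code> rows are easier to read as proportions. The sketch below is one way to do that, using only figures from the table above; the dictionary names are illustrative.</p>

```python
# Figures copied from the describe(include=['O']) output above
count = {"Sex": 891, "Ticket": 891, "Cabin": 204, "Embarked": 889}
top_freq = {"Sex": ("male", 577), "Embarked": ("S", 644)}
unique = {"Ticket": 681, "Cabin": 147}

# Share of non-null rows taken by the most frequent value
top_share = {col: freq / count[col] for col, (_, freq) in top_freq.items()}

# Distinct values per non-null row; values near 1 mean little repetition
unique_ratio = {col: n / count[col] for col, n in unique.items()}

for col, (value, _) in top_freq.items():
    print(f"{col}: '{value}' covers {top_share[col]:.1%} of non-null rows")
for col, ratio in unique_ratio.items():
    print(f"{col}: {ratio:.2f} distinct values per non-null row")
```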


<h2>Step 4 - Data Analysis</h2>

<h2>Step 5 - Model Building</h2>

<h2>Step 6 - Model Evaluation</h2>

<h2>Step 7 - Model Deployment</h2>