# **0. Introduction and Objectives:**

### Project Goal: use machine learning to build a model that predicts which passengers survived the Titanic shipwreck, and analyse the different factors that lead the model to its predictions

# **1. Data loading**

In [1]:
import pandas as pd
data=pd.read_csv("/kaggle/input/titanic/train_and_test2.csv")
data.head()

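| | Passengerid | Age | Fare | Sex | sibsp | zero | zero.1 | zero.2 | zero.3 | zero.4 | ... | zero.12 | zero.13 | zero.14 | Pclass | zero.15 | zero.16 | Embarked | zero.17 | zero.18 | 2urvived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 7.25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 2.0 | 0 | 0 | 0 |
| 1 | 2 | 38.0 | 71.2833 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 | 0 | 0 | 1 |
| 2 | 3 | 26.0 | 7.925 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 2.0 | 0 | 0 | 1 |
| 3 | 4 | 35.0 | 53.1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2.0 | 0 | 0 | 1 |
| 4 | 5 | 35.0 | 8.05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 2.0 | 0 | 0 | 0 |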


# **2. Exploratory Data Analysis**

## **2.0 EDA Objectives: Titanic Survivor Predictor**

The primary goal of this Exploratory Data Analysis (EDA) is to thoroughly understand the Titanic dataset in preparation for building a model that can predict passenger survival. Specifically, this EDA aims to:

* **Understand the Characteristics of Each Feature:**
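    * Analyze the distribution, central tendency, and spread of the individual features available in this file: `Passengerid`, `Pclass`, `Sex`, `Age`, `sibsp`, `Parch`, `Fare`, `Embarked`, and the target `2urvived` (referred to as `Survived` below). The `Name`, `Ticket`, and `Cabin` columns of the original Titanic data are not present here.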
    * Identify the data type of each feature and any potential inconsistencies.
    * Gain an initial understanding of the range and variability of values within each feature.

* **Identify Missing Values and Outliers:**
    * Determine the presence and extent of missing data in each column.
    * Analyze the patterns of missingness to understand if it's random or related to other features.
    * Detect potential outliers in numerical features like `Age` and `Fare` that might require further investigation or handling.

* **Explore Relationships Between Features:**
    * Investigate the relationships between different passenger features (e.g., `Pclass` and `Fare`, `Age` and `Pclass`, `SibSp` and `Parch`).
    * Analyze correlations between numerical features to identify potential dependencies.
    * Examine the relationship between categorical features using cross-tabulations.

* **Gain Insights into the Target Variable (`Survived`):**
    * Determine the overall survival rate in the dataset.
    * Analyze the distribution of survivors and non-survivors across different passenger demographics and characteristics (e.g., by `Sex`, `Pclass`, `Age`, `Embarked`).

* **Formulate Initial Hypotheses About Potential Predictors of Survival:**
    * Based on the observed relationships, develop initial hypotheses about which features are likely to be strong predictors of survival. For example, "Passengers in higher classes had a higher survival rate," or "Female passengers were more likely to survive."

* **Identify Potential Data Quality Issues:**
    * Uncover any inconsistencies, errors, or unusual patterns in the data that might need to be addressed during data preprocessing. This could include inconsistent formatting, illogical values, or data entry errors.
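
As a rough illustration of the survival-rate and cross-tabulation objectives above, the sketch below works on a tiny hand-made frame rather than the real data. The helper name `survival_rate_by` is hypothetical, and the toy rows are invented for the example. Only the column names follow this file.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, feature: str, target: str = "2urvived") -> pd.Series:
    """Return the mean of the 0/1 target within each level of `feature`."""
    return df.groupby(feature)[target].mean()


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Sex": [0, 0, 1, 1, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "2urvived": [0, 1, 1, 1, 0, 0],
})

# Overall survival rate: the mean of a 0/1 column is the share of ones.
overall_rate = toy["2urvived"].mean()

# Survival rate within each level of a categorical feature.
rate_by_sex = survival_rate_by(toy, "Sex")

# Cross-tabulation of two categorical features (row counts per combination).
class_by_sex = pd.crosstab(toy["Pclass"], toy["Sex"])

print(overall_rate, rate_by_sex, class_by_sex, sep="\n\n")
```

On the real frame the same calls would take `data` in place of `toy`. Whether a group difference is meaningful still depends on group sizes, which the cross-tabulation helps to check.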

data.tail()

In [2]:
data.columns

Index(['Passengerid', 'Age', 'Fare', 'Sex', 'sibsp', 'zero', 'zero.1',
       'zero.2', 'zero.3', 'zero.4', 'zero.5', 'zero.6', 'Parch', 'zero.7',
       'zero.8', 'zero.9', 'zero.10', 'zero.11', 'zero.12', 'zero.13',
       'zero.14', 'Pclass', 'zero.15', 'zero.16', 'Embarked', 'zero.17',
       'zero.18', '2urvived'],
      dtype='object')

Data Dictionary

    Passengerid: Unique identifier for each passenger.
    Age: Age of the passenger.
    Fare: Fare paid by the passenger.
    Sex: Sex of the passenger, stored as a 0/1 code (0 appears to denote male, 1 female).
    sibsp: Number of siblings/spouses aboard.
    zero: Placeholder column (potentially unused or for future data).
    zero.1: Placeholder column.
    zero.2: Placeholder column.
    zero.3: Placeholder column.
    zero.4: Placeholder column.
    zero.5: Placeholder column.
    zero.6: Placeholder column.
    Parch: Number of parents/children aboard.
    zero.7: Placeholder column.
    zero.8: Placeholder column.
    zero.9: Placeholder column.
    zero.10: Placeholder column.
    zero.11: Placeholder column.
    zero.12: Placeholder column.
    zero.13: Placeholder column.
    zero.14: Placeholder column.
    Pclass: Passenger class (1st, 2nd, or 3rd).
    zero.15: Placeholder column.
    zero.16: Placeholder column.
    Embarked: Port of embarkation, stored in this file as a numeric code from 0 to 2 (the original letters were C = Cherbourg; Q = Queenstown; S = Southampton).
    zero.17: Placeholder column.
    zero.18: Placeholder column.
    2urvived: Survival status (0 = No; 1 = Yes).
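
Since the placeholder columns carry no documented meaning, one option is to drop them before analysis and give the target a readable name. The sketch below is a minimal version of that idea on a toy frame. The helper `drop_constant_zero_columns` and the new name `Survived` are assumptions for illustration, not part of the original notebook. The helper removes a `zero*` column only after checking that every value in it is 0.

```python
import pandas as pd


def drop_constant_zero_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns named 'zero' or 'zero.N' whose values are all 0."""
    placeholder = [c for c in df.columns if c == "zero" or c.startswith("zero.")]
    # Guard: keep any placeholder column that unexpectedly holds non-zero data.
    all_zero = [c for c in placeholder if (df[c] == 0).all()]
    return df.drop(columns=all_zero)


# Toy frame that mimics the layout of the file (values are illustrative).
toy = pd.DataFrame({
    "Passengerid": [1, 2],
    "Age": [22.0, 38.0],
    "zero": [0, 0],
    "zero.1": [0, 0],
    "2urvived": [0, 1],
})

# Hypothetical rename of the target column to a more readable label.
tidy = drop_constant_zero_columns(toy).rename(columns={"2urvived": "Survived"})
print(list(tidy.columns))
```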


data.info()

In [3]:
data.describe()

| | Passengerid | Age | Fare | Sex | sibsp | zero | zero.1 | zero.2 | zero.3 | zero.4 | ... | zero.12 | zero.13 | zero.14 | Pclass | zero.15 | zero.16 | Embarked | zero.17 | zero.18 | 2urvived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | ... | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1309.0 | 1307.0 | 1309.0 | 1309.0 | 1309.0 |
| mean | 655.0 | 29.503186 | 33.281086 | 0.355997 | 0.498854 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.294882 | 0.0 | 0.0 | 1.492731 | 0.0 | 0.0 | 0.261268 |
| std | 378.020061 | 12.905241 | 51.7415 | 0.478997 | 1.041658 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.837836 | 0.0 | 0.0 | 0.814626 | 0.0 | 0.0 | 0.439494 |
| min | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 328.0 | 22.0 | 7.8958 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 655.0 | 28.0 | 14.4542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 75% | 982.0 | 35.0 | 31.275 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 |
| max | 1309.0 | 80.0 | 512.3292 | 1.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 |
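
The quartiles reported for `Fare` give a quick outlier check under the common 1.5 × IQR rule (one convention among several, chosen here as an assumption). With Q1 = 7.8958 and Q3 = 31.275, the IQR is 23.3792 and the upper fence is 31.275 + 1.5 × 23.3792 ≈ 66.34, far below the maximum of 512.3292. The sketch below applies the same rule to a small sample made of the five fares shown in `data.head()` plus the reported maximum. The helper name `iqr_fences` is hypothetical.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Small sample: the fares from the first five rows plus the reported maximum.
sample_fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

low, high = iqr_fences(sample_fare)

# Values outside the fences are flagged for inspection, not automatically removed.
outliers = sample_fare[(sample_fare < low) | (sample_fare > high)]
print(low, high, outliers.tolist())
```

The fences computed on this six-value sample differ from those of the full column. Running `iqr_fences(data["Fare"])` would use the full distribution instead.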


In [4]:
data["Sex"].nunique()

2

In [5]:
data.isnull().sum().sum()

2
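
The grand total above does not say where the gaps are. The summary statistics report a count of 1307 for `Embarked` against 1309 for the other displayed columns, which suggests that column accounts for both missing values. A per-column count confirms this kind of guess directly. The sketch below uses a toy frame, and the helper name `missing_by_column` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_by_column(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, omitting complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame for illustration: one complete column, one with two gaps.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Embarked": [2.0, np.nan, np.nan],
})

missing = missing_by_column(toy)
print(missing)
```

On the real data, `missing_by_column(data)` would list only the columns that need imputation or row removal.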

## **2.2 Univariate analysis**