# **TITANIC EXPLORATORY DATA ANALYSIS**

### **History of the `Titanic`**
<p>The RMS Titanic was a British ocean liner that sank on April 15, 1912, during its maiden voyage from Southampton, England, to New York City. It was the largest ship in the world at the time, measuring about 882 feet long and 92 feet wide, and displacing 52,310 long tons. The Titanic departed from Southampton on April 10, 1912, stopping at the French port of Cherbourg and the Irish port of Queenstown to pick up more passengers before setting off across the Atlantic Ocean.</p>
<p>Despite receiving several ice warnings, the Titanic continued at nearly full speed, and at about 11:40 p.m. on April 14, 1912, lookout Fred Fleet spotted an iceberg dead ahead. First Officer William Murdoch ordered the ship turned hard to port and signaled the engine room to reverse the engines, but the ship was too large and moving too fast, and the iceberg was too close. The Titanic struck the iceberg, and several compartments began filling with water.</p>
<p>The ship's design included a double bottom and 16 watertight compartments sealed by massive doors that could be triggered by a single electric switch on the bridge or automatically by electric water-sensors. However, the ship was not designed to withstand a collision that would flood more than four compartments, and the iceberg had caused five compartments to begin filling with water. The ship began to sink, and the crew began to get people aboard the lifeboats. There were not enough lifeboats for everyone on board, and many of the lifeboats left the Titanic only half full.</p>
<p>The Titanic sank at about 2:20 a.m. on April 15, 1912, with approximately 1,500 people still on board. The disaster led to the establishment of the International Ice Patrol and the first International Convention for the Safety of Life at Sea, which required every ship to have lifeboat space for each person embarked, hold lifeboat drills, and maintain a 24-hour radio watch.</p>

### **Objective**
- To conduct an `Exploratory Data Analysis` on the `Titanic` dataset. This includes `Univariate`, `Bivariate`, `Multivariate`, `Outlier`, and `Target` Analysis of the data.

### **Data Dictionary**
| **Variable**         |   **Definition**                                   |    **Categories (Optional)**          |
|----------------------|----------------------------------------------------|---------------------------------------|
| PassengerId          | Unique identifier of the passenger                 |                                       |
| Survived             | Survival                                           |     0 = No, 1 = Yes                   |
| Pclass               | Ticket class                                       |  1=Upper, 2=Middle, 3=Lower           |      
| Name                 | Name of the passenger                              |                                       |
| Sex                  | Gender of the passenger                            |   male, female                        |
| Age                  | Age in years                                       |                                       |
| SibSp                | Number of siblings / spouses aboard the Titanic    |                                       |
| Parch                | Number of parents / children aboard the Titanic    |                                       |
| Ticket               | Ticket number                                      |                                       |
| Fare                 | Passenger fare                                     |                                       |
| Cabin                | Cabin number                                       |                                       |
| Embarked             | Port of Embarkation                                |    S, C, and Q                        |

**Embarked**
| **Category** |    Name        |
|--------------|----------------|
|       S      | Southampton    |
|       C      | Cherbourg      |
|       Q      | Queenstown     |

In [None]:
# Import relevant libraries
import pandas as pd
import numpy as np

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

### **Data Loading**

In [None]:
# Loading data using pandas
# A Google Drive "/view" share link returns an HTML page, not the CSV itself,
# so the direct-download form of the same file id is used instead
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1zC89wtvmoMdsN8Va2kciRZmVS0sxs2a7")
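Google Drive `/view` share links serve an HTML viewer page rather than the raw file, which `pd.read_csv` cannot tokenize as CSV. The sketch below converts a share link into a direct-download link. It assumes the `uc?export=download&id=` URL pattern still applies and that the file is shared publicly; `drive_share_to_download_url` is a hypothetical helper name, not part of pandas.

```python
import re


def drive_share_to_download_url(share_url: str) -> str:
    """Convert a Google Drive '/file/d/<id>/view' link to a direct-download link."""
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"No file id found in: {share_url}")
    file_id = match.group(1)
    # Assumed URL pattern for direct downloads of publicly shared files
    return f"https://drive.google.com/uc?export=download&id={file_id}"


share_url = "https://drive.google.com/file/d/1zC89wtvmoMdsN8Va2kciRZmVS0sxs2a7/view?usp=sharing"
download_url = drive_share_to_download_url(share_url)
print(download_url)

# Reading the file needs network access, so it is left commented out here:
# df = pd.read_csv(download_url)
```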


In [None]:
# Check the first five rows of the dataset
df.head()

## **Data Preprocessing**

In [None]:
df.shape

**Comment**
- The dataset consists of 891 rows and 12 columns.

In [None]:
# Checking for columns in the dataset
df.columns.to_list()

In [None]:
# Check for irregularities in the data i.e. null values, data types, etc.
df.info()

**Comments**
- Presence of missing values in three columns: `Age`, `Cabin`, and `Embarked`.

In [None]:
# Percentage of missing values in each column
(df.isnull().sum() / len(df) * 100).round(2)

# df.isnull().mean().round(2) * 100

**Comments**
- `Age` has `19.87%` missing values, `Cabin`, `77.10%` missing values, and `Embarked`, `0.22%` missing values.

In [None]:
# Check for statistics
df.describe()

**Comments**
- Possible outliers in the `Age`, `SibSp`, `Parch`, and `Fare` columns: each has maximum values far above its 3rd quartile.

In [None]:
# Check for duplicates
df.duplicated().sum()

**Comment**
- No duplicate values in the dataset.

In [None]:
# Check values in `Embarked`
df['Embarked'].value_counts()

**Comments** \
The following were the ports at which `Titanic` passengers embarked:
- S: `Southampton`: `644` passengers.
- C: `Cherbourg`: `168` passengers.
- Q: `Queenstown`: `77` passengers.

In [None]:
# Percentage of passengers embarked from each port
(df['Embarked'].value_counts() / len(df) * 100).round(2)

**Comment**
- Passengers who boarded at `Southampton` accounted for `72.28%` of all passengers, `Cherbourg` for `18.86%`, and `Queenstown` for `8.64%`. The remaining `0.22%` have a missing `Embarked` value.

In [None]:
# Impute `Age` with the mean (assign back instead of a chained inplace call)
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [None]:
# Impute `Embarked` with the mode "S"
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
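Mean imputation is one option among several. As a hedged alternative, the sketch below fills each missing age with the median age of the passenger's class, which is less sensitive to extreme values. The `toy` DataFrame is made up for illustration and is not taken from the Titanic file; the column `Age_imputed` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 22.0, np.nan],
})

# Median age per class, broadcast back to every row of that class
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched
toy["Age_imputed"] = toy["Age"].fillna(class_median)
print(toy)
```

Whether a group-wise median is preferable to the overall mean depends on how strongly age varies by class, so it is worth comparing both before modeling.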

In [None]:
# Drop `Cabin` column due to high percentage of missing values
df.drop(columns=['Cabin'], inplace=True)

In [None]:
# Check for null values again
df.isnull().sum()

In [None]:
# Check for unique values in `Survived`
df['Survived'].value_counts()


In [None]:
# Percentage of passengers who survived
(df['Survived'].value_counts() / len(df) * 100).round(2)

**Comments**
- `342` passengers survived the accident, accounting for `38.38%` of the passengers in the dataset; the remaining `549` (`61.62%`) perished.

## **Exploratory Data Analysis**
**Objective**
- To analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

**Benefits**
- Helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

### Univariate Analysis
- The data being analyzed consists of just one variable.

In [None]:
# Visualizing the number of survivors
plt.figure(figsize=(10, 4))
sns.countplot(x='Survived', data=df, palette='Set1')
plt.title('Count of Survival')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

- The number of passengers who survived the accident was low compared to the passengers who perished.

In [None]:
# Visualizing the distribution of age
plt.figure(figsize=(8, 6))
sns.kdeplot(df['Age'], fill=True, color='cyan')
plt.title('Age Density Plot')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()


- Most ages are concentrated between `20` and `40`.
- The age distribution is positively (right) skewed.
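The skew seen in a density plot can also be checked numerically. The sketch below uses `Series.skew()` on a small made-up series of ages (not Titanic data); a positive value, together with a mean above the median, suggests a right skew.

```python
import pandas as pd

# Toy ages (made up): most values are clustered low, with a long tail to the right
ages = pd.Series([18, 20, 21, 22, 22, 24, 25, 27, 30, 35, 48, 71])

skewness = ages.skew()  # sample skewness; > 0 suggests a right (positive) skew
print(f"skew = {skewness:.2f}, mean = {ages.mean():.2f}, median = {ages.median():.2f}")
```

On the notebook's data the same check would be `df['Age'].skew()`, keeping in mind that mean imputation adds many identical values at the mean and can change the result.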

In [None]:
# Visualizing the distribution of `Embarked`
plt.figure(figsize=(10, 4))
sns.countplot(x='Embarked', data=df, palette='Set1')
plt.title('Count of Embarked Locations')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.show()

- The majority of passengers boarded at `Southampton`, compared to `Cherbourg` and `Queenstown`.

In [None]:
# Visualizing the distribution of `Pclass`
plt.figure(figsize=(10, 4))
sns.countplot(x='Pclass', data=df, palette='Set1')
plt.title('Count of Passenger Classes')
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.show()


- The majority of passengers travelled in `3rd` class, compared to `1st` and `2nd` class.

In [None]:
# Visualizing the distribution of ticket fare
plt.figure(figsize=(8, 4))
sns.kdeplot(df['Fare'], fill=True, color='cyan')
plt.title('Fare Density Plot')
plt.xlabel('Fare')
plt.ylabel('Density')
plt.show()

- The distribution is right-skewed: the majority of passengers paid a fare of less than `100`.

### Bivariate Analysis
- It allows you to assess the relationship between each variable in the dataset and the target variable of interest.

In [None]:
# Visualizing the relationship between `Pclass` and `Survived`
plt.figure(figsize=(10, 4))
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title('Survival Count by Passenger Class')
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.xticks([0, 1, 2], ['1st Class', '2nd Class', '3rd Class'])
plt.show()


- The majority of passengers in `1st` class survived the accident, unlike passengers in `2nd` and `3rd` class.
- Most passengers in `3rd` class died in the accident.
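Count plots show absolute numbers, which can hide differences in group size. A short sketch of the matching survival rates per class, using `groupby` with named aggregation on a made-up `toy` DataFrame (the values are illustrative, not Titanic figures):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Passengers, survivors, and survival rate for each class
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    survival_rate="mean",  # mean of a 0/1 column is the share of 1s
)
print(summary)
```

Applied to the notebook's `df`, the same call would put a rate next to each bar group in the count plot.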

In [None]:
# Visualizing the relationship between `Pclass` and `Fare`
plt.figure(figsize=(10, 4))
sns.boxplot(x='Pclass', y='Fare', data=df)
plt.title('Fare Distribution by Passenger Class')
plt.xlabel('Pclass')
plt.ylabel('Fare')
plt.show()

- `1st` class contains extreme `fare` values compared to the other classes; `2nd` and `3rd` class fares were generally lower than `1st` class fares.

In [None]:
# Visualizing the relationship between `Sex` and `Survived`
plt.figure(figsize=(10, 4))
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival Count by Sex')
plt.xlabel("Sex")
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.show();

- The majority of men died in the accident, unlike women.
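To express the comparison between two categorical columns as proportions, `pd.crosstab` with `normalize="index"` gives the share of each outcome within each row group. A minimal sketch on made-up data (the `toy` values are illustrative only):

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Each row sums to 1: share of non-survivors (0) and survivors (1) within each sex
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```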

In [None]:
# Visualizing the relationship between `Age` and `Survived`
plt.figure(figsize=(10, 4))
sns.boxplot(x='Survived', y='Age', data=df, palette='Set2')
plt.title('Age Distribution by Survival')
plt.xlabel('Survived')
plt.ylabel('Age')
plt.xticks([0, 1], ['Did not survive', 'Survived'])
plt.show();

- The survivors included passengers of extreme age, such as an `80`-year-old. This outlier could be investigated further.

In [None]:
# Visualizing the relationship between `Fare` and `Survived`
plt.figure(figsize=(10, 4))
sns.boxplot(x='Survived', y='Fare', data=df, palette='Set2')
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.xticks([0, 1], ['Did not survive', 'Survived'])
plt.show()

- There is a relationship between passengers who paid high fares and survival. The higher the `fare` price, the higher the chances of `survival`.
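One way to check a fare-survival relationship beyond a box plot is to bin fares into equal-sized groups and compare the survival rate per bin. The sketch below uses `pd.qcut` on made-up data; the `toy` values and the `FareBand` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Fare":     [7.0, 7.5, 8.0, 9.0, 13.0, 15.0, 26.0, 30.0, 52.0, 80.0, 150.0, 260.0],
    "Survived": [0,   0,   0,   1,   0,    1,    1,    0,    1,    0,    1,     1],
})

# Split fares into three equal-sized bands (terciles)
toy["FareBand"] = pd.qcut(toy["Fare"], q=3, labels=["low", "mid", "high"])

# Survival rate within each fare band
rate_by_band = toy.groupby("FareBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

A rate that rises from the lowest to the highest band would support the reading of the box plot, though fare is closely tied to passenger class, so the two effects are hard to separate.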

### Multivariate Analysis
- Used for mapping and understanding interactions between different fields in the data.

In [None]:
# Visualize the relationship between `Pclass`, `Age`, and `Survived`
plt.figure(figsize=(12, 6))
sns.boxplot(x='Pclass', y='Age', hue='Survived', data=df, palette='Set1')
plt.title('Age Distribution by Passenger Class and Survival')
plt.xlabel('Pclass')
plt.ylabel('Age')
plt.xticks([0, 1, 2], ['1st Class', '2nd Class', '3rd Class'])
plt.legend(title='Survived', loc='upper right')
plt.show();

- There is a relationship between passenger class, `age`, and `survival`.
- Survivors in `1st` class were, on average, in their `30s`, while survivors in `2nd` and `3rd` class were mostly `below 30`.
- Most `1st` class passengers who died were `above 30`, whereas most `2nd` and `3rd` class passengers who died were `below 30`.

In [None]:
# Visualize the relationship between `Pclass`, `Age`, and `Fare`
g = sns.FacetGrid(df, col="Pclass", height=5, aspect=1)
g.map(sns.scatterplot, "Age", "Fare", alpha=0.7)
g.set_axis_labels("Age", "Fare")
g.set_titles("Pclass {col_name}")
plt.show()

# plt.figure(figsize=(12, 6))
# sns.scatterplot(x='Age', y='Fare', hue='Pclass', data=df, palette='Set2')
# plt.title('Relationship Between Age, Fare, and Pclass')
# plt.xlabel('Age')
# plt.ylabel('Fare')
# plt.legend(title='Pclass')
# plt.show()

- Passengers who occupied `1st` class paid higher `fares` compared to passengers in the `2nd` and `3rd` classes who paid below `100`.

In [None]:
# Visualize the relationship between `Survived`, `Embarked`, and `Pclass`
heatmap_data = df.pivot_table(index='Embarked', columns='Pclass', values='Survived', aggfunc='mean')

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='Blues', fmt=".2f")
plt.title('Survival Rate by Embarked and Pclass')
plt.xlabel('Pclass')
plt.ylabel('Embarked')
plt.show()

**Comments**
- Passengers who travelled `1st` class and boarded at `Cherbourg` had a high survival rate: the majority of this group survived.
- Passengers who travelled `2nd` class and boarded at `Queenstown` also had a high survival rate, with the majority of this group surviving.
- Passengers who travelled `3rd` class and boarded at `Southampton` had a low survival rate: the majority of this group died.
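A mean in a pivot table says nothing about how many passengers sit behind each cell, so a high rate may rest on very few people. The sketch below builds a count table next to the rate table and masks cells with fewer than a chosen number of passengers. The `toy` data and the threshold of `2` are made up for illustration.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q", "Q", "S"],
    "Pclass":   [3,   3,   3,   1,   1,   1,   3,   3,   2,   1],
    "Survived": [0,   0,   1,   1,   1,   1,   0,   0,   1,   0],
})

# Survival rate per (Embarked, Pclass) cell
rates = toy.pivot_table(index="Embarked", columns="Pclass",
                        values="Survived", aggfunc="mean")

# Number of passengers behind each cell
counts = toy.pivot_table(index="Embarked", columns="Pclass",
                         values="Survived", aggfunc="count", fill_value=0)

# Keep only rates backed by at least 2 passengers (threshold chosen arbitrarily)
reliable = rates.where(counts >= 2)
print(counts)
print(reliable)
```

Passing `counts` as the `annot` argument of a second heatmap is one way to show the group sizes alongside the rates.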

### Outlier Detection and Handling

In [None]:
df.select_dtypes(include=['float64', 'int64']).columns

In [None]:
# Checking for outliers in the `Pclass`, `Age`, `SibSp`, `Parch`, and `Fare` columns

# Specify the columns to visualize
columns_to_plot = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# Create the figure and axes
fig, axes = plt.subplots(3, 2, figsize=(12, 10), sharey=False)

# Flatten the axes array for iteration
axes = axes.flatten()

# Plot box plots for each column
for ax, column in zip(axes, columns_to_plot):
    sns.boxplot(x=df[column], ax=ax, color='blue', width=0.5)
    ax.set_title(f'Box Plot of {column}')
    ax.set_xlabel(column)

# Remove any unused axes
for ax in axes[len(columns_to_plot):]:
    ax.remove()

plt.tight_layout()
plt.show()

**Comments**
- Presence of outliers in `Age`, `SibSp`, `Parch`, and `Fare` columns.

**Handling Outliers**
\
Outliers are retained rather than removed because:

- Removing them can discard valuable information and distort the dataset.
- They can reveal rare events, such as elderly passengers surviving the Titanic accident, as seen in the dataset.
- They can improve predictions during future modeling and support insights.
- Removing them can introduce bias, especially when an outlier is a meaningful observation and not an error.
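Keeping outliers does not rule out marking them. A minimal sketch of flagging values with the common 1.5 × IQR rule, which is also the rule box plots typically use for their whiskers; `iqr_outlier_mask` is a hypothetical helper and the `toy` fares are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy fares (made up for illustration)
toy = pd.DataFrame({"Fare": [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0]})

# Add a flag column instead of dropping rows, so no information is lost
toy["Fare_outlier"] = iqr_outlier_mask(toy["Fare"])
print(toy)
```

The flag column can later be used to compare results with and without the extreme rows.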

### Target Variable Exploration

A `target variable` is the variable that the user wants to predict from the rest of the dataset, especially in a supervised machine learning model.

In [None]:
(df['Survived'].value_counts(normalize=True) * 100).round(2)

In [None]:
# Analyze the `survived` column
survival_counts = df['Survived'].value_counts(normalize=True) * 100
survival_counts = survival_counts.rename({0: 'Did not survive', 1: 'Survived'})
# Create a bar plot for survival rates
plt.figure(figsize=(8, 5))
sns.barplot(x=survival_counts.index, y=survival_counts.values, palette='Set1')
plt.title('Survival Rates')
plt.xlabel('Survival Status')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=0)
plt.ylim(0, 100)
plt.show();

**Comments**
- `61.62%` of the passengers lost their lives in the Titanic accident, compared to `38.38%` who survived.

In [None]:
# Visualize the distribution of `Survived` by `Pclass`
plt.figure(figsize=(10, 4))
sns.countplot(x='Pclass', hue='Survived', data=df, palette='Set1')
plt.title('Survival Count by Passenger Class')
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.xticks([0, 1, 2], ['1st Class', '2nd Class', '3rd Class'])
plt.show()

**Comments**
- The largest group of survivors came from `1st` class, while the majority of passengers who died came from `3rd` class.
- In `2nd` class, the numbers of survivors and deaths were roughly equal.

In [None]:
# Visualize the distribution of `Survived` by `Sex`
plt.figure(figsize=(10, 4))
sns.countplot(x='Sex', hue='Survived', data=df, palette='Set1')
plt.title('Survival Count by Sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.xticks(rotation=0)
plt.show();

**Comments**
- More female passengers than male passengers survived the accident.

In [None]:
# Visualize the distribution of `survived` by `Embarked`
plt.figure(figsize=(10, 4))
sns.countplot(x='Embarked', hue='Survived', data=df, palette='Set1')
plt.title('Survival Count by Embarked Location')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.xticks(rotation=0)
plt.show();

**Comments**
- Passengers who boarded at `Southampton` account for the highest counts of both deaths and survivors, compared to `Cherbourg` and `Queenstown`.
- Most passengers who boarded at `Cherbourg` survived the accident.

In [None]:
# Visualize the distribution of `survived` by `Age`
plt.figure(figsize=(10, 4))
sns.boxplot(x='Survived', y='Age', data=df, palette='Set1')
plt.title('Age Distribution by Survival')
plt.xlabel('Survived')
plt.ylabel('Age')
plt.xticks([0, 1], ['Did not survive', 'Survived'])
plt.show();

**Comments**
- On average, passengers who died were in their `mid-20s` and passengers who survived were in their `mid-30s`.
- Outliers among the deaths included children below approximately `5` years and passengers above the age of `50`.
- Outliers among the survivors included passengers of approximately `60` years and above.

In [None]:
# Visualize the distribution of `survived` by `Fare`
plt.figure(figsize=(10, 4))
sns.boxplot(x='Survived', y='Fare', data=df, palette='Set1')
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.xticks([0, 1], ['Did not survive', 'Survived'])
plt.show();

**Comments**
- Passengers who paid higher `fares` were likely to survive the accident compared to the majority who paid below `100`.

In [None]:
# Visualize the relationship between `Pclass`, `Sex`, and `Survived`
# One panel per sex, so all three variables appear in the plot
g = sns.catplot(x='Pclass', hue='Survived', col='Sex', data=df,
                kind='count', palette='Set1', height=4, aspect=1.25)
g.set_axis_labels('Pclass', 'Count')
g.set_titles('Sex: {col_name}')
g.set_xticklabels(['1st Class', '2nd Class', '3rd Class'])
g.figure.suptitle('Survival Count by Passenger Class and Sex', y=1.03)
plt.show()

**Comments**
- The largest group of survivors was in `1st` class, compared to `2nd` and `3rd` class.
- `3rd` class experienced the highest death rate.

### **Conclusion**

In conclusion, most passengers did not survive the accident: `61.62%` of the passengers in the dataset died, compared to `38.38%` who survived. Most passengers aboard the Titanic were between `20` and `40` years old, i.e. young adults. The death rate was highest in 3rd class compared to 1st and 2nd class.\
A far larger share of men than women died in the accident. Most of the passengers who died had boarded at `Southampton`. However, the exceptional case of an elderly passenger who survived the accident warrants a closer look.
