## **Analyzing and Visualizing the Titanic Disaster with Pandas and matplotlib**
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone on board, resulting in the death of **1502** out of **2224** passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

We are going to use Pandas to explore the existing passenger data (i.e. name, age, gender, socio-economic class, etc.) and see how these attributes relate to survival.

This dataset describes the survival status of individual passengers on the Titanic. The `titanic` DataFrame does not include the crew, and ages are recorded for only some of the passengers (714 of the 891 rows, as the describe() output below shows). The principal source for data about Titanic passengers is the Encyclopedia Titanica. The columns are:

- **PassengerId**: Id of every passenger.
- **Survived**: Survival (0 = No; 1 = Yes).
- **Pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
- **Name**: Name of passenger.
- **Sex**: Gender of passenger.
- **Age**: Age of passenger.
- **SibSp**: Number of siblings and spouses of the passenger aboard.
- **Parch**: Number of parents and children of the passenger aboard.
- **Ticket**: Ticket number of passenger.
- **Fare**: Fare paid by the passenger.
- **Cabin**: The cabin of passenger.
- **Embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

In [2]:
import pandas as pd
pd.set_option('display.expand_frame_repr', False)


titanic = pd.read_csv("data/titanic.csv")

# shape returns (rows, columns)
titanic.shape


# Other pandas readers, besides read_csv:
# read_pickle, read_table, read_fwf, read_clipboard, read_excel, read_json, read_html, read_xml, read_hdf, read_feather,
# read_parquet, read_orc, read_sas, read_spss, read_sql_table, read_sql_query, read_sql, read_gbq, read_stata

(891, 12)

**describe()** summarizes the numeric columns of a DataFrame: count, mean, standard deviation, min, max, and the quartiles:

In [3]:

titanic.describe() 
#titanic.describe(include = 'all') 

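|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| ------------- |:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |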
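
Because **Survived** is coded as 0/1, its mean in the describe() output above is also the overall survival rate. A minimal sketch of that arithmetic, using only the count and mean reported there (the variable names are illustrative, not part of the notebook):

```python
# Values copied from the describe() output above
count = 891                 # rows with a Survived value
mean_survived = 0.383838    # mean of the 0/1 Survived column

# For a 0/1 column, mean = (number of ones) / count,
# so the number of survivors is mean * count (rounded to a whole person).
survivors = round(mean_survived * count)
died = count - survivors

print(f"Survivors: {survivors}, died: {died}")
print(f"Survival rate: {survivors / count:.1%}")
```

The same reasoning applies to any 0/1 column: its mean is the share of rows equal to 1.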


**info()** prints a summary of the DataFrame: the name, non-null count, and data type of each column (the **dtypes** attribute returns only the data types):

In [None]:
titanic.info()

In [None]:
titanic
# To render every row as plain text, see pandas.DataFrame.to_string

In [None]:
# Dropping columns:

columnsNotUseful = ['Ticket', 'Cabin']

titanic = titanic.drop(columnsNotUseful, axis=1, errors='ignore')

titanic
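
The `errors='ignore'` argument matters in a notebook, where a cell may be run more than once. The sketch below illustrates this on a toy frame with made-up rows (only the column names mirror the dataset; `toy` and `raised` are hypothetical names):

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Ticket": ["T1", "T2"],
    "Cabin": [None, "C1"],
})

columns_not_useful = ["Ticket", "Cabin"]

# First call removes both columns
toy = toy.drop(columns_not_useful, axis=1, errors="ignore")

# Second call finds nothing to drop; errors="ignore" makes it a no-op
toy = toy.drop(columns_not_useful, axis=1, errors="ignore")
print(list(toy.columns))

# Without errors="ignore", dropping missing columns raises KeyError
try:
    toy.drop(columns_not_useful, axis=1)
    raised = False
except KeyError:
    raised = True
print("KeyError without errors='ignore':", raised)
```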


In [None]:
titanic.head()

In [None]:
titanic.tail()

In [None]:
# Basic Filtering by column Age:

above_18 = titanic[titanic["Age"] > 18]
print('Qt. passengers above 18 yrs: ', len(above_18))
above_18.head()
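
Since some ages are missing, it may help to see how a comparison treats them: `NaN > 18` evaluates to False, so passengers with an unknown age are silently left out of `above_18`. A small sketch with made-up ages (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
toy = pd.DataFrame({"Age": [22.0, 4.0, np.nan, 38.0, 18.0]})

above_18 = toy[toy["Age"] > 18]       # NaN > 18 is False, so the NaN row is excluded
at_most_18 = toy[toy["Age"] <= 18]    # NaN <= 18 is also False
unknown_age = toy[toy["Age"].isna()]  # rows with no recorded age

# The three groups together account for every row
print(len(above_18), len(at_most_18), len(unknown_age))
```

Counting the unknown ages separately avoids reading "not above 18" as "18 or younger".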

### **Income and Prices in 1910 (wages):**

| Occupation  | Income |
| ------------- |:-------------:|
| Average of all Industries      | $ 574/year     |
| State and Local Government Workers      | $ 699/year     |
| Public School Teacher      | $ 492/year     |
| Building Trades      | ¢ 52/hour     |
| Medical Health Services Worker      | $ 338/year     |


In [None]:
# The Fare sum() among assessed passengers:

sumFares = titanic['Fare'].sum()
print('$', sumFares.round(2))

In [None]:
# Filtering by Fare
# No display options are needed here: the assignment itself prints nothing
higherFares = titanic.loc[titanic["Fare"] > 500]
print(len(higherFares))
higherFares.head()


In [None]:
# Verifying age min(), max() and mean():

print(f'Age min: {titanic.Age.min()}')

print(f'Age mean: {titanic.Age.mean().round(2)}')

print(f'Age max: {titanic.Age.max()}')

In [None]:
# Age mean() by Sex:

meanAgeGroupedBySex = titanic.groupby(['Sex']).Age.mean()

print('Women Age mean():', meanAgeGroupedBySex['female'].round(0))
print('Men Age mean():', meanAgeGroupedBySex['male'].round(0))
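
Several statistics can be computed per group in a single call with `agg`. The following is a sketch on made-up rows, not the notebook's data; `toy` and `summary` are illustrative names:

```python
import pandas as pd

# Made-up rows; one male age is missing on purpose
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Age": [30.0, 40.0, 20.0, None, 20.0],
})

# One row per group, one column per statistic; missing ages are skipped
summary = toy.groupby("Sex")["Age"].agg(["count", "mean", "min", "max"])
print(summary)
```

The `count` column shows how many non-missing ages each mean is based on.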



In [None]:
# Sorting our titanic DataFrame by column Age, ascending

with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    print(titanic.sort_values('Age', ascending=True))
    

In [None]:
# Adding 'Title' column to our DataFrame:

import re  # regular expressions, used to extract titles such as Mr, Mrs, Ms from names

def get_title(name):
    # Raw string so that \. reaches the regex engine as a literal dot
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

titanic["Title"] = titanic["Name"].apply(get_title)

titanic
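
As an alternative to `apply` with a Python function, the same pattern can be handed to pandas' vectorized `str.extract`. This is a sketch of that variant on a few sample name strings, not the notebook's own method:

```python
import pandas as pd

# Sample name strings in the dataset's "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "No Title Here",
])

# The capture group is the word before the first period;
# rows without a match become NaN and are replaced by ""
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False).fillna("")
print(titles.tolist())
```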

In [None]:
pd.crosstab(titanic['Title'], titanic['Sex'])

In [None]:
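# engine="python" lets query() evaluate the .str accessor regardless of the installed backends.
# Note that contains() treats the pattern as a regular expression, so "." matches any character.
titanic.query('Name.str.contains("Sir. ")', engine="python")

# Equivalent to:

# titanic[titanic['Name'].str.contains('Sir. ')]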



In [None]:
#Query with multiple conditions:

titanic.query('Fare>200 and Survived==0')
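
A `query` string with `and` corresponds to a boolean mask combined with `&`. The sketch below checks that equivalence on made-up rows (`toy`, `via_query`, and `via_mask` are hypothetical names):

```python
import pandas as pd

# Made-up fares and outcomes
toy = pd.DataFrame({
    "Fare": [250.0, 300.0, 10.0, 220.0],
    "Survived": [0, 1, 0, 0],
})

via_query = toy.query("Fare > 200 and Survived == 0")

# Each condition needs its own parentheses because & binds tighter than > and ==
via_mask = toy[(toy["Fare"] > 200) & (toy["Survived"] == 0)]

print(via_query.equals(via_mask))
```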

In [None]:
# Random sample selection:

titanic.sample(n = 15)


# **Visualizing with matplotlib:**

In [None]:
#pip install matplotlib

import matplotlib.pyplot as plt


### **Understanding passengers Age and Gender distribution:**

In [None]:
#Rendering a histogram chart from passengers Age:

plt.title('Titanic Passengers Age Histogram')
plt.ylabel('Count')
plt.xlabel('Age by Decade (years)')

# Ten-year bins, so that each bar covers exactly one decade
titanic['Age'].hist(bins=range(0, 91, 10))

plt.show()

In [None]:
#Calculating passengers' sex proportion:

# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# add them into a list called proportions
proportions = [males, females]

print(proportions)

# Create a pie chart
plt.pie(
    proportions,
    labels = ['Males', 'Females'],
    autopct = '%.2f%%'
)

# Set the title
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()

plt.show()

In [None]:
survival = titanic.groupby('Sex')['Survived'].value_counts().unstack()

#survival

survival.plot(kind='bar', stacked=True)
plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Gender/Sex')
plt.xlabel('Gender/Sex')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.show()

In [None]:
# Survived is 0/1, so its mean is the survival rate; multiply by 100 for a percentage
women = titanic.loc[titanic['Sex'] == 'female', 'Survived']
rate_women = round(women.mean() * 100, 2)
men = titanic.loc[titanic['Sex'] == 'male', 'Survived']
rate_men = round(men.mean() * 100, 2)
print("% of women who survived:", rate_women)
print("% of men who survived:", rate_men)
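
Both rates can also be obtained in one step by grouping on **Sex** and taking the mean of **Survived**. A minimal sketch on made-up rows (the numbers are illustrative and are not the dataset's results):

```python
import pandas as pd

# Made-up rows: 3 of 4 women and 1 of 4 men survive
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group, expressed as a percentage
rates = toy.groupby("Sex")["Survived"].mean().mul(100).round(2)
print(rates)
```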

### **Understanding passengers socio-economics and Class distribution:**

In [None]:
plt.rc('figure', figsize=(10, 5)) #sets runtime configuration figure size

# sort_index() orders the bars 1, 2, 3 instead of by count
titanic['Pclass'].value_counts().sort_index().plot(kind='bar')

plt.xticks(rotation=0)
plt.ylabel('Count')
plt.xlabel('Passenger Class (first class / second class / third class)')
plt.title('Passenger Class distribution')
plt.show()

In [None]:
plt.rc('figure', figsize=(10, 5))

passengersByClass = titanic.groupby('Pclass')['Survived'].value_counts().unstack()

passengersByClass.plot(kind='bar', stacked=True)

plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Class')
plt.xlabel('Passenger Class (first class / second class / third class)')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.show()
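
Stacked counts make the classes hard to compare because they differ in size. One possible complement is `pd.crosstab` with `normalize='index'`, which turns each row into proportions. A sketch on made-up rows (not the dataset's figures):

```python
import pandas as pd

# Made-up rows; the class sizes differ on purpose
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1
rate_by_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rate_by_class.round(2))
```

Column `1` of the result is then the survival rate within each class.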

# **Bonus features:**

A Pandas DataFrame can be converted to HTML with the **to_html()** method. This is useful if you need to send automated reports as HTML:

In [None]:
df_html = titanic.to_html() 
with open('titanic.html', 'w') as f:
    f.write(df_html)
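
For a report, the numeric row index is often unwanted. The sketch below shows the `index=False` option of `to_html()` on a one-row toy frame and keeps the result in a string instead of writing a file (the frame and its values are made up):

```python
import pandas as pd

# One made-up row, just to inspect the generated markup
toy = pd.DataFrame({"Name": ["Passenger A"], "Fare": [7.25]})

# index=False leaves out the column of row labels
html = toy.to_html(index=False)
print(html)
```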

**to_markdown()** renders a DataFrame as a Markdown table (this requires the optional `tabulate` package):

In [None]:
print(titanic.describe().to_markdown())