# Titanic Data Analysis

History has seen many great disasters, but the sinking of the Titanic ranks among the most notorious, so much so that later disasters have often been described as “Titanic in proportion,” implying huge losses. Anyone who has read about the Titanic knows that a combination of natural events and human error led to its sinking on its maiden voyage from Southampton to New York: the ship struck an iceberg late on April 14, 1912, and sank in the early hours of April 15.

Many questions have been asked about the cause(s) of the tragedy. Foremost among them: what made the ship sink? Even more intriguing: how could a 46,000-ton ship sink to a depth of 13,000 feet in under three hours?

There have been many investigations, many questions remain open, and many kinds of data analysis have been applied in search of a conclusion. This post is not about why the Titanic sank; it is about analyzing the passenger data available from the Titanic. This data is publicly available, and the data set is described below under the heading Data Set Description.

Using this dataset, we will perform some data analysis and draw out a few insights, such as the average age of male and female passengers for a given survival outcome, and the number of passengers who died or survived in each class, broken down by gender and age.

**Data Set Description:**

- **Column 1:** PassengerId

- **Column 2:** Survived  (died=0 & survived=1)

- **Column 3:** Pclass  (passenger class: 1, 2, or 3)

- **Column 4:** Name

- **Column 5:** Sex

- **Column 6:** Age

- **Column 7:** SibSp

- **Column 8:** Parch

- **Column 9:** Ticket

- **Column 10:** Fare

- **Column 11:** Cabin

- **Column 12:** Embarked

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('usecase_8').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

### Titanic Data

In [4]:
titanic_df = spark.read.format('csv').options(header=False, inferSchema=True).load('TitanicData.txt')

In [5]:
titanic_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: integer (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)



In [6]:
titanic_df.show(3)

+---+---+---+--------------------+------+----+---+---+----------------+-------+----+----+----+
|_c0|_c1|_c2|                 _c3|   _c4| _c5|_c6|_c7|             _c8|    _c9|_c10|_c11|_c12|
+---+---+---+--------------------+------+----+---+---+----------------+-------+----+----+----+
|  1|  0|  3|Braund Mr. Owen H...|  male|22.0|  1|  0|       A/5 21171|   7.25|null|   S|null|
|  2|  1|  1|Cumings Mrs. John...|female|38.0|  1|  0|        PC 17599|71.2833| C85|   C|null|
|  3|  1|  3|Heikkinen Miss. L...|female|26.0|  0|  0|STON/O2. 3101282|  7.925|null|   S|null|
+---+---+---+--------------------+------+----+---+---+----------------+-------+----+----+----+
only showing top 3 rows



In [7]:
titanic_df = titanic_df.drop('_c12')

In [8]:
headers = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']

In [9]:
# toDF renames all columns positionally in one call
titanic_df = titanic_df.toDF(*headers)

In [10]:
titanic_df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [11]:
titanic_df.show(3)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund Mr. Owen H...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings Mrs. John...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen Miss. L...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 3 rows



### Problem Statement 1: Average age of males and females who survived the Titanic tragedy

In [25]:
titanic_df.count()

891

In [27]:
titanic_df.filter(titanic_df['age'] > 0).count()

714

##### Average age of males who survived the Titanic tragedy

In [36]:
# Survived == 1 selects passengers who survived; age > 0 drops rows with a missing age
titanic_df.filter(titanic_df['age'] > 0).filter(titanic_df['Sex'] == 'male').filter(titanic_df['Survived'] == 1).groupBy().agg({'age':'avg'}).show()

+------------------+
|          avg(age)|
+------------------+
|27.276021505376345|
+------------------+



##### Average age of females who survived the Titanic tragedy

In [38]:
# Survived == 1 selects passengers who survived; age > 0 drops rows with a missing age
titanic_df.filter(titanic_df['age'] > 0).filter(titanic_df['Sex'] == 'female').filter(titanic_df['Survived'] == 1).groupBy().agg({'age':'avg'}).show()

+-----------------+
|         avg(age)|
+-----------------+
|28.84771573604061|
+-----------------+



### Problem Statement 2: Find the number of people who died or survived in each class, along with their gender and age.

In [55]:
from pyspark.sql.functions import desc

titanic_df.groupBy('Survived','Pclass','Sex','Age').count().orderBy(desc('count')).show(5)

+--------+------+------+----+-----+
|Survived|Pclass|   Sex| Age|count|
+--------+------+------+----+-----+
|       0|     3|  male|null|   85|
|       1|     3|female|null|   25|
|       0|     3|female|null|   17|
|       0|     1|  male|null|   16|
|       0|     3|  male|22.0|   13|
+--------+------+------+----+-----+
only showing top 5 rows



## Closing the Spark session

In [58]:
spark.stop()