# Titanic Dataset
## by Febi Andika Dani Fajar Suryawan

## Background
The dataset contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived (S), their age (A), their passenger-class (C), their sex (G) and the fare they paid (X). The table below shows the Data Dictionary.

## Load Library

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## 1. What is the dimension (col, row) of the data frame?

In [2]:
# Load dataset
df = pd.read_csv('titanic.csv')

In [3]:
# Show dataset
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [4]:
# Show dimension of the data frame
df.shape

(887, 8)

**There are 887 rows and 8 columns in the data frame.**

## 2. How to know data type of each variable?

In [5]:
# Show data type of each variable
df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

**There are 4 variables with integer data types, 2 variables with float data types, 2 variables with string data types.**

## 3. How many passengers survived (Survived=1) and not-survived (Survived=0)?

In [7]:
# Grouping the passengers by survived column
df.groupby('Survived').size()

Survived
0    545
1    342
dtype: int64

**There were 342 passengers survived and 545 passengers not-survived.**

## 4. How to drop column ‘Name’ from the data frame?

In [8]:
# Show data frame before drop column 'Name'
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [9]:
# Drop column 'Name' from the data frame
df.drop(columns=['Name'], inplace=True)

In [10]:
# Show data frame after drop column 'Name'
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05


## 5. Add one new column called ‘family’ to represent number of family-member aboard (hint: family = sibsp + parch)

In [11]:
# Add new column
df['family'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard']
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family
0,0,3,male,22.0,1,0,7.25,1
1,1,1,female,38.0,1,0,71.2833,1
2,1,3,female,26.0,0,0,7.925,0
3,1,1,female,35.0,1,0,53.1,1
4,0,3,male,35.0,0,0,8.05,0


## 6. As shown, columns ‘Age’ contains missing values. Please add new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for missing value and ‘NO’ for non-missing value).

In [12]:
# Checking if there are many missing value
df.isnull().sum()

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
family                     0
dtype: int64

In [13]:
# Make new column named 'Age_miss'
Age_miss = []
for i in range(len(df)):
    if df['Age'][i] == 0:
        Age_miss.append('Yes')
    else:
        Age_miss.append('No')

In [14]:
# Add column named 'Age_miss' to data frame
df['Age_miss'] = Age_miss
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No


## 7. Please fill Age missing value with means of existing Age values

In [16]:
# Fill the missing values 
df['Age'] = df['Age'].fillna(df['Age'].mean())

## 8. What is the maximum passenger Age who survived from the tragedy? 

In [17]:
# Check the maximum passenger age who survived from tragedy
df[df['Survived']==1]['Age'].max()

80.0

**The maximum passenger Age who survived from the tragedy is 80 years old.**

## 9. How many passengers survived from each ‘PClass’? 

In [18]:
# Grouping by 'Pclass' 
df[df['Survived']==1].groupby('Pclass')['Survived'].count()

Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

**There were 136 passengers survived from PClass=1, 87 passengers survived from PClass=2, and 119 passengers survived from PClass=3.**

## 10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for tttanic1 and 0.3 for titanic2 ?

In [19]:
# Split the data frame
titanic1 = df.sample(frac=0.7).reset_index()
titanic1.head()

Unnamed: 0,index,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
0,846,0,3,male,4.0,4,2,31.275,6,No
1,585,0,3,male,22.0,0,0,8.05,0,No
2,759,1,1,female,36.0,1,2,120.0,3,No
3,619,1,3,male,20.0,1,1,15.7417,2,No
4,166,0,3,female,45.0,1,4,27.9,5,No


In [20]:
# Split the data frame
titanic2 = df.drop(titanic1.index)
titanic2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
621,0,3,male,21.0,0,0,16.1,0,No
622,0,1,male,61.0,0,0,32.3208,0,No
623,0,2,male,57.0,0,0,12.35,0,No
624,1,1,female,21.0,0,0,77.9583,0,No
625,0,3,male,26.0,0,0,7.8958,0,No
