# TITANIC DATASET
## BY DWI SMARADAHANA INDRALOKA

## IMPORT LIBRARY

In [1]:
import pandas as pd
import numpy as np

## UPLOAD DATASET

In [2]:
data = pd.read_csv("titanic.csv")
data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


## 1. What is the dimension (row, col) of the data frame?
* The data frame has dimension (887, 8): 887 rows and 8 columns

In [3]:
data.shape

(887, 8)

## 2. How to know data type of each variable?
* To see the data type of each variable we can use ".dtypes". There are 4 variables of integer type, 2 of object type and 2 of float type

In [4]:
data.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
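
The per-type counts quoted above can also be computed instead of counted by eye. A minimal sketch, assuming a small hand-made frame (`toy`, a hypothetical stand-in for the real data):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame; values are illustrative only.
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# Convert each dtype to its name, then count how many columns share it.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

Applied to `data`, the same two calls should give one row per data type with the number of columns of that type.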

## 3. How many passengers survived (Survived = 1) and not-survived (Survived = 0)?
* 342 passengers survived and 545 passengers did not survive

In [5]:
data.groupby('Survived').size()

Survived
0    545
1    342
dtype: int64
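
An alternative to `groupby(...).size()` is `value_counts()`, which can also return proportions. A minimal sketch on a hypothetical five-row sample (not the real data):

```python
import pandas as pd

# Hypothetical sample of the Survived column; values are illustrative only.
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# Absolute counts per class, ordered by class label.
counts = survived.value_counts().sort_index()

# Share of each class (the shares sum to 1).
shares = survived.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

On the full data frame the equivalent call would be `data["Survived"].value_counts()`.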

## 4. How to drop column 'Name' from the data frame?
* To drop column 'Name' from the data frame we can use ".drop("Name", axis = 1)"

In [6]:
data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [7]:
data = data.drop("Name", axis =1)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05
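
Note that `drop` returns a new data frame, and running the same cell twice raises a `KeyError` because 'Name' is already gone. A minimal sketch of a re-runnable variant, assuming a hypothetical two-row frame (`toy`):

```python
import pandas as pd

# Hypothetical stand-in frame; values are illustrative only.
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# columns= is equivalent to axis=1; errors="ignore" skips labels that are absent.
trimmed = toy.drop(columns=["Name"], errors="ignore")

# Dropping again is now harmless instead of raising KeyError.
trimmed_again = trimmed.drop(columns=["Name"], errors="ignore")

print(list(trimmed.columns))
```

The original frame `toy` still contains 'Name'; only the returned frames lack it.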


## 5. Add one new column called 'Family' to represent the number of family members aboard

In [8]:
data["Family"] = data["Siblings/Spouses Aboard"] + data["Parents/Children Aboard"]
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family
0,0,3,male,22.0,1,0,7.25,1
1,1,1,female,38.0,1,0,71.2833,1
2,1,3,female,26.0,0,0,7.925,0
3,1,1,female,35.0,1,0,53.1,1
4,0,3,male,35.0,0,0,8.05,0


## 6. Please add new column named 'Age_miss' to indicate whether Age is missing or not

**Checking Missing Value**

In [9]:
data.isnull().sum()

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
Family                     0
dtype: int64

* There are no missing values in any column

**Add New Column Named 'Age_miss'**

In [10]:
# A missing age is stored as NaN, not 0, so test with isnull() rather than == 0
data["Age_miss"] = np.where(data["Age"].isnull(), "Yes", "No")
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
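
Because this dataset has no missing ages, every row is flagged "No". To see how a missing-value flag behaves when gaps exist, here is a minimal sketch on a hypothetical three-row frame (`toy`) that contains one NaN and one age of 0:

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one present, one missing (NaN), one genuinely zero.
toy = pd.DataFrame({"Age": [22.0, np.nan, 0.0]})

# isnull() is True only for NaN; an age of 0 is a real value, not a gap.
toy["Age_miss"] = np.where(toy["Age"].isnull(), "Yes", "No")

print(toy)
```

Only the NaN row is marked "Yes"; comparing against 0 would instead flag the third row and miss the second.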


## 7. Please fill Age missing value with means of existing Age values
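* To fill missing Age values with the mean of the existing Age values we can use ".fillna(data["Age"].mean())". Since the check in question 6 found no missing values, this call leaves the data unchanged here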

In [11]:
data["Age"] = data["Age"].fillna(data["Age"].mean())
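
To illustrate the effect of mean imputation when values really are missing, here is a minimal sketch on a hypothetical three-row frame (`toy`), not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# mean() skips NaN by default, so this is the mean of 20 and 40.
mean_age = toy["Age"].mean()

# Replace only the NaN entries; existing values are left as they are.
toy["Age"] = toy["Age"].fillna(mean_age)

print(toy)
```

The mean is computed before filling, so the imputed value depends only on the ages that were actually observed.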

## 8. What is the maximum passenger Age who survived from the tragedy?
* The oldest passenger who survived the tragedy was 80 years old

In [12]:
data[data["Survived"] == 1]["Age"].max()

80.0

## 9. How many passengers survived from each 'PClass'?
* 136 passengers survived from Pclass 1, 87 from Pclass 2 and 119 from Pclass 3

In [13]:
data[["Survived", "Pclass"]].groupby(["Pclass"]).sum()

Pclass,Survived
1,136
2,87
3,119
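
Summing 'Survived' works because the column only holds 0 and 1. The same grouping can also report class size and survival rate. A minimal sketch on a hypothetical six-row frame (`toy`); the output column names `survivors`, `passengers` and `rate` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical passengers; values are illustrative only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Named aggregation: sum of 0/1 flags = survivors, count = group size,
# mean of 0/1 flags = share of the class that survived.
summary = toy.groupby("Pclass")["Survived"].agg(
    survivors="sum",
    passengers="count",
    rate="mean",
)

print(summary)
```

Raw survivor counts alone can mislead when the classes differ in size, which is why the rate column is worth having next to them.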


## 10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for titanic1 and 0.3 for titanic2? 

* To randomly split the data frame into 2 parts, we can use ".sample(frac = 0.7)" to draw 70% of the rows for titanic1, then drop those rows from the data frame so that the remaining 30% form titanic2

**Titanic1**

In [14]:
titanic1 = data.sample(frac = 0.7)
titanic1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
321,1,2,female,22.0,1,1,29.0,2,No
33,0,2,male,66.0,0,0,10.5,0,No
17,1,2,male,23.0,0,0,13.0,0,No
667,1,2,female,40.0,1,1,39.0,2,No
337,0,1,male,45.0,0,0,35.5,0,No


**Titanic2**

In [15]:
titanic2 = data.drop(titanic1.index)
titanic2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
7,0,3,male,2.0,3,1,21.075,4,No
10,1,3,female,4.0,1,1,16.7,2,No
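
Without a seed, `sample` draws different rows on every run. A minimal sketch of a reproducible split, assuming a hypothetical ten-row frame (`toy`) and an arbitrary seed of 42:

```python
import pandas as pd

# Hypothetical stand-in frame with a unique default index 0..9.
toy = pd.DataFrame({"x": range(10)})

# random_state fixes the draw so the split is the same on every run.
part1 = toy.sample(frac=0.7, random_state=42)

# Everything not sampled into part1 goes to part2.
part2 = toy.drop(part1.index)

print(len(part1), len(part2))
```

Dropping by index relies on the index labels being unique, which holds for the default integer index created by `read_csv`.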


***
# THANK YOU