# TITANIC SURVIVOR ANALYSIS 

In [7]:
import pandas as pd
import numpy as np


**READING DATA USING PANDAS**

We use pandas read_csv function to read the csv file in python and dataFrame method to convert file into the data frame

In [8]:
df = pd.DataFrame(pd.read_csv('/content/train.csv'))
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
df.shape

(891, 12)

**Description of the attributes of the datast**

Pclass: Passenger Class(1 = 1st; 2 = 2nd; 3 = 3rd)

survival:Survival(0 = No; 1 = Yes)

name: Name

sex: Sex

age: Age

sibsp: Number of siblings/Spouses Aboard

parch: Number of parents/Children Aboard

ticket: Ticket Number

fare: Passenger Fare(British pound)

cabin: Cabin

embarked: Port of Embarkation(C = Cherbourg; Q = Queenstown; S = Southamton)

# ***HANDELING NULL VALUES***
The dataset may contains many rows and columns for which some values are missing,we can't leave those missing values as it is.

In such cases we have two option:



1.   Either drop the entire row or column
2.   Fill the missing values with some appropriate value let's say mean
 of all the values for that column may do the job.


In [10]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Seperating out the columns which have more thn 35% of the values missing in the dataset 

In [11]:
# df.isnull().sum() returns a pandas series with column name as the label index
# and total count of null values in the column as it's value
# And we are storing only those columns which have more than 35% of the data missing.

drop_col = df.isnull().sum()[df.isnull().sum()>(35/100 * df.shape[0])]
drop_col

Cabin    687
dtype: int64

In [12]:
drop_col.index

Index(['Cabin'], dtype='object')

In [13]:
df.drop(drop_col.index, axis=1, inplace=True)
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

In [14]:
df.fillna(df.mean(), inplace = True)
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

Because **Embarked** contains string values,we see the details of that column seperately from others as strings does not have mean and all.

In [15]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

For Embarked attribute,we fill the null values with the most frequent value in the column.

In [16]:
df['Embarked'].fillna('s',inplace=True)

In [17]:
df.isnull().sum()    #NOW ALL THE NULL VALUES HAVE BEEN FILLED

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [18]:
df.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
Age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
SibSp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


sibsp: Number of siblings/Spouses Aboard

parch: Number of parents/Children Aboard

So we can make a new column family_size by combining these two column

In [19]:
df['FamilySize'] = df['SibSp'] + df['Parch']
df.drop(['SibSp','Parch'], axis=1, inplace=True)
df.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,FamilySize
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilySize,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


**FamilySize in the ship does not have much correlance with survival rate**

Let's check if we weather the person was alonw or not can affect the survival rate.

In [20]:
df['Alone'] = [0 if df['FamilySize'][i]>0 else 1 for i in df.index]
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [21]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

If the person is alone he/she has less chance of surviving.


***The reason might be  the person who is  travelling with his family might be belonging to rich class and might be prioritized over other.***

In [22]:
df[['Alone','Fare']].corr()

Unnamed: 0,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0


So we can see if the person was not alone, the chance the ticket price is higher is high

In [23]:
df['Sex'] = [0 if df['Sex'][i]=='male' else 1 for i in df.index]  # 1 for female, 0 for male
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

It shows,female passengers have more chance of surviving than male ones.

It shows,women were prioritized over men.

In [24]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.336957
s    1.000000
Name: Survived, dtype: float64

In [25]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Fare,FamilySize,Alone
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.352413,29.699118,32.204208,0.904602,0.602694
std,257.353842,0.486592,0.836071,0.47799,13.002015,49.693429,1.613459,0.489615
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,22.0,7.9104,0.0,0.0
50%,446.0,0.0,3.0,0.0,29.699118,14.4542,0.0,1.0
75%,668.5,1.0,3.0,1.0,35.0,31.0,1.0,1.0
max,891.0,1.0,3.0,1.0,80.0,512.3292,10.0,1.0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   Ticket       891 non-null    object 
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    object 
 9   FamilySize   891 non-null    int64  
 10  Alone        891 non-null    int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 76.7+ KB


In [35]:
from IPython.display import Image
Image(url= "https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/5095eabce4b06cb305058603/5095eabce4b02d37bef4c24c/1352002236895/100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8.jpg")

# **How titanic sank?**

In [36]:
Image(url= "https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/t/5090b249e4b047ba54dfd258/1351660113175/TItanic-Survival-Infographic.jpg?format=1500w")