#  TITANIC SURVIVOR ANALYSIS 


Using Python libraries and the CSV file of passenger data, we estimate the survival chances of passengers in different categories to better understand what happened during the disaster.

# **Importing Libraries**

In [1]:
import numpy as np
import pandas as pd


# **Reading the CSV file given**

Here we read the given dataset and store it in a pandas DataFrame named titan.






In [2]:
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapping is needed
titan = pd.read_csv("/content/sample_data/train (1).csv")
titan.head()



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# **Handling NULL Values**

First we check for NULL values present in the given dataset.

In [3]:
# titan.isnull().sum() returns a pandas Series of NULL counts, indexed by column name
null_series = titan.isnull().sum()
null_series

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Now we check which columns have more than 40% of their values missing.

In [4]:
# titan.shape[0] is the total number of rows in the dataset (shape is an attribute, not a function)
red_col = null_series[null_series > (0.4 * titan.shape[0])]
red_col

Cabin    687
dtype: int64
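The 40% check above compares raw NULL counts against a fraction of the row count. A minimal sketch of the same idea, expressed directly as a missing fraction per column, is given here. The toy frame and its column names are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Titanic data): column 'b' is 60% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Averaging the True/False null flags gives the missing fraction per column.
missing_frac = toy.isnull().mean()

# Keep only the columns above the 40% threshold used in this notebook.
mostly_missing = missing_frac[missing_frac > 0.4]
print(mostly_missing)
```

Working with fractions keeps the threshold readable and independent of the number of rows.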

The 'Cabin' field has more than 40% of its entries missing, which leaves too little information to be useful, so we handle it by dropping the column.

In [5]:
titan.drop(red_col.index,axis=1,inplace=True)
titan.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

Now we can see that the 'Cabin' column has been removed, but there are still some null values in the 'Age' column. We can remove these by filling them with the average of the known ages.

In [6]:
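# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
titan['Age'] = titan['Age'].fillna(titan['Age'].mean())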
titan.isnull().sum()


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64
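Filling with the mean has a side effect worth keeping in mind for the later age analysis: the column mean stays the same, but the spread shrinks because every filled entry sits exactly on the mean. A small sketch with made-up ages (not the Titanic values) illustrates this.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (not the Titanic data).
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Same technique as the cell above: fill NULLs with the mean of the known values.
filled = ages.fillna(ages.mean())

mean_before, mean_after = ages.mean(), filled.mean()
std_before, std_after = ages.std(), filled.std()

print("mean:", mean_before, "->", mean_after)
print("std: ", std_before, "->", std_after)
```

If the reduced spread matters, filling with the median or with a per-group mean are common alternatives, each with its own trade-offs.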

Since 'Embarked' holds only string values, an average cannot be computed, so we fill it with the most frequent port of embarkation.

In [7]:
titan['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

The top value is the most frequent one, so we now fill the null entries with 'S'.

In [8]:
# Assign the result back instead of using inplace=True on a selected column
titan['Embarked'] = titan['Embarked'].fillna("S")
titan.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
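The fill value 'S' above was read off the describe() output by hand. As a hedged alternative, the most frequent value can be looked up in code so the fill does not depend on a hard-coded letter. The series below is a made-up stand-in for the 'Embarked' column.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with two missing entries (not the Titanic data).
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NULLs by default and returns a Series; take its first entry.
most_common = ports.mode().iloc[0]

# Fill the missing entries with the most frequent value.
ports_filled = ports.fillna(most_common)
print(most_common, ports_filled.tolist())
```

When several values tie for most frequent, mode() returns all of them in sorted order, so taking the first entry is an arbitrary but deterministic choice.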

# **Checking for a relation between Family size and Survival**

Family size is found by adding the 'SibSp' and 'Parch' columns, so a value of 0 means the passenger had no relatives aboard.

In [9]:
titan['F_size'] = titan['SibSp']+titan['Parch']
titan.drop(['SibSp','Parch'],axis=1,inplace=True)
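# numeric_only=True skips text columns such as Name and Sex, which recent pandas versions refuse to correlate
titan.corr(numeric_only=True)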

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,F_size
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
F_size,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


From the correlation table, the correlation between family size and survival is very low at 0.016639. This indicates that there is no clear linear relationship, so we cannot predict survival from family size using this measure alone.
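A correlation near zero rules out only a straight-line trend. The sketch below uses made-up numbers (not the Titanic data) in which survival is highest for mid-sized families: the correlation comes out at zero even though the per-group survival rates clearly differ.

```python
import pandas as pd

# Hypothetical data: 10 passengers per family size, survival peaks at F_size == 1.
toy = pd.DataFrame({
    "F_size":   [0] * 10 + [1] * 10 + [2] * 10,
    "Survived": [1] * 3 + [0] * 7      # F_size 0: 3 of 10 survived
              + [1] * 7 + [0] * 3      # F_size 1: 7 of 10 survived
              + [1] * 3 + [0] * 7,     # F_size 2: 3 of 10 survived
})

# Pearson correlation only measures linear association.
r = toy["F_size"].corr(toy["Survived"])

# The survival rate per family size shows the non-linear pattern.
rates = toy.groupby("F_size")["Survived"].mean()

print("correlation:", round(r, 6))
print(rates)
```

On the notebook's own frame, the analogous check would be `titan.groupby('F_size')['Survived'].mean()`. Whether it reveals a pattern there is not shown in this notebook.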

# Checking the survival rate if the person was alone in the ship  

In [10]:
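# Lone is 1 when F_size <= 1 and 0 otherwise (vectorized form of the original list comprehension).
# Note: a passenger with exactly one relative aboard (F_size == 1) is also flagged as Lone here.
titan['Lone'] = (titan['F_size'] <= 1).astype(int)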
titan.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,F_size,Lone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [11]:
titan.groupby(['Lone'])['Survived'].sum()

Lone
0     90
1    252
Name: Survived, dtype: int64

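The output gives survivor counts: 252 of the passengers flagged Lone = 1 survived, against 90 of those flagged Lone = 0. Counts alone do not show which group had the better odds, because the two groups need not be the same size. The survival rate of each group (the mean of 'Survived' rather than the sum) is needed before concluding that a person travelling alone was less likely to survive than one travelling with someone.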
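To compare groups of different sizes, the group size, the survivor count and the survival rate can be computed side by side. This is a minimal sketch on made-up data (not the Titanic data), where the larger group has more survivors but a lower survival rate.

```python
import pandas as pd

# Hypothetical data: 10 passengers flagged Lone = 1, 4 flagged Lone = 0.
toy = pd.DataFrame({
    "Lone":     [1] * 10 + [0] * 4,
    "Survived": [1] * 4 + [0] * 6      # Lone = 1: 4 of 10 survived
              + [1] * 3 + [0] * 1,     # Lone = 0: 3 of 4 survived
})

# count = group size, sum = number of survivors, mean = survival rate.
summary = toy.groupby("Lone")["Survived"].agg(
    passengers="count", survivors="sum", rate="mean"
)
print(summary)
```

On the notebook's frame, `titan.groupby('Lone')['Survived'].mean()` would give the rates needed for the comparison.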

In [12]:
titan[['Lone','Fare','Pclass']].corr()


Unnamed: 0,Lone,Fare,Pclass
Lone,1.0,-0.166672,-0.043796
Fare,-0.166672,1.0,-0.5495
Pclass,-0.043796,-0.5495,1.0


We see a negative correlation (-0.166672) between the 'Fare' and 'Lone' columns. This suggests that passengers flagged as travelling alone tended to buy lower-fare seats, although the relationship is weak.

# Checking Survival rate of people in different Classes of seats

In [13]:
titan[['Survived','Pclass']].corr()

Unnamed: 0,Survived,Pclass
Survived,1.0,-0.338481
Pclass,-0.338481,1.0


The correlation between 'Survived' and 'Pclass' is -0.338481. Since Pclass 1 is the most premium class, the negative sign means that the more premium the seat a person had bought, the higher his/her chances of survival.

# Checking Survival rate of Younger and Older people and that of Males and Females 

In [20]:
titan[['Survived','Age']].corr()

Unnamed: 0,Survived,Age
Survived,1.0,-0.069809
Age,-0.069809,1.0


The negative correlation suggests that younger people were slightly more likely to survive, but at -0.069809 it is too weak for age to count as a major factor in survival.

In [21]:
titan.groupby(['Survived'])['Age'].mean()

Survived
0    30.415100
1    28.549778
Name: Age, dtype: float64

In [22]:
titan["Age"].mean()

29.699117647058763

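This shows us that the average ages of the people who died and the people who survived were 30.42 and 28.55 respectively, a small difference, which again suggests that age did not have a major correlation with survival. Note that the 177 missing ages were filled with the overall mean, which pulls both group averages toward each other.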
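Group averages can hide differences between age bands, so a common follow-up is to bin ages and compare survival rates per band. The sketch below uses made-up ages and outcomes (not the Titanic data), and the band boundaries and labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical passengers (not the Titanic data).
toy = pd.DataFrame({
    "Age":      [2, 8, 15, 22, 30, 41, 55, 63, 70, 35],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Assumed band edges: (0, 12], (12, 18], (18, 60], (60, 120].
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate within each age band.
band_rates = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(band_rates)
```

Applied to the notebook's frame, the filled-in ages would all land in the band containing the mean, so that band should be read with extra caution.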

In [23]:
titan.groupby(['Sex'])['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

This shows us that females had a much higher survival rate (about 74%) than males (about 19%) in the disaster, which is consistent with women being prioritized.

# Checking for Survival Rates in people embarking from different regions 

In [24]:
titan.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

This shows us that passengers who embarked at Cherbourg (C) had the highest survival rate, at about 55%.

# *CONCLUSION*
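* Female passengers had a much higher survival rate, consistent with women being prioritized
* Passengers who embarked at Cherbourg survived in a higher proportion than the others
* People who bought premium seats had a higher survival rate
* People travelling alone appeared to have a lower survival rate, although only survivor counts, not rates, were computed for this group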


