### **Project = '  *TITANIC SURVIVAL ANALYSIS* '**

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912, after colliding with an iceberg during her maiden voyage from Southampton, UK, to New York City, US. The sinking resulted in the deaths of more than 1,500 passengers and crew, making it one of the deadliest commercial peacetime maritime disasters in modern history.

Using the provided dataset and the knowledge gained in ***SHAPEAI's PYTHON DATA ANALYSIS BOOTCAMP***, I’ll try to identify which factors made people more likely to survive.

I want to know whether survival chances differed between males and females, and between passengers of different classes and age groups. I am also curious whether the rule "Children and Women first" was followed on the Titanic.

For this project I'm using Jupyter Notebook, Python, and a couple of libraries (pandas and NumPy).

# 1.  Importing libraries

In [2]:
import pandas as pd
import numpy as np

# 2. Uploading Dataset

The provided dataset describes the passengers on the Titanic and some information about them. In this step I'm going to dig into the data and clean it where necessary.

Our data consists of the following variables:

*   PassengerId - A numerical id assigned to each passenger
*   Survived - Whether the passenger survived (1), or didn't (0). This is going to be the dependent variable of our project.
*   Pclass - The class the passenger was in - first class (1), second class (2), or third class (3). Pclass is going to be one of the independent variables in our project.
*   Name - The name of the passenger
*   Sex - The gender of the passenger - male or female. Sex is going to be one of the independent variables in our project.
*   Age - The age of the passenger. Can be fractional. Age (or age group) is going to be one of the independent variables in our project.
*   SibSp - The number of siblings and spouses the passenger had on board
*   Parch - The number of parents and children the passenger had on board
*   Ticket - The ticket number of the passenger
*   Fare - How much the passenger paid for the ticket
*   Cabin - Which cabin the passenger was in
*   Embarked - Where the passenger boarded the Titanic

The operational definition of the dependent variable is whether the passenger survived or perished in the catastrophe. Thinking about survival factors, I expect age, sex, and passenger class to be important variables.

In [5]:
df = pd.read_csv('/content/train (1).csv')   # read_csv already returns a DataFrame
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
df.shape

(891, 12)

In [7]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


# 3. Separating the columns which have more than 35% of the values missing in the dataset.


In [8]:
# df.isnull().sum() returns a pandas Series with the column names as its index
# and the count of null values in each column as its values.
# We keep only the columns that have more than 35% of their values missing.

x = df.isnull().sum()

drop_col = x[x > (35/100 * df.shape[0])]
drop_col

Cabin    687
dtype: int64

NOTE: There is no fixed share of missing values above which a column must be dropped; the 35% used here is a judgment call.
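
Since the cut-off is a judgment call, it can help to look at the share of missing values per column rather than the raw counts. The following is only a minimal sketch on a made-up toy frame (the `toy` data, `threshold`, and `cols_to_drop` names are my own assumptions, not part of the Titanic dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real passenger table
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_frac = toy.isnull().mean()

threshold = 0.35  # assumed cut-off; adjust to taste
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()

print(missing_frac)
print(cols_to_drop)
```

Working with fractions keeps the threshold readable and independent of the number of rows.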

In [9]:
drop_col.index

Index(['Cabin'], dtype='object')

In [10]:
df.drop(drop_col.index, axis=1, inplace=True)
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

In [11]:
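# numeric_only=True restricts the mean to numeric columns; recent pandas
# versions raise an error if text columns such as Name are included
df.fillna(df.mean(numeric_only=True), inplace=True)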
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64
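
Filling with the column mean only touches numeric columns, which is why Age is now complete while Embarked still has 2 missing values. A minimal sketch of that behaviour on a made-up toy frame (the `toy` data is hypothetical, and I am assuming a recent pandas version where `numeric_only=True` has to be passed explicitly):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one text column, each with a gap
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "C"],
})

# Means are computed for numeric columns only; text columns are skipped
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the value under the matching label
filled = toy.fillna(numeric_means)

print(filled)
print(filled.isnull().sum())
```

The text column keeps its gap because there is no entry for it in `numeric_means`.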

Because **Embarked** contains string values, it has no mean, so we look at the details of that column separately from the others.

In [12]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

# 4. For the Embarked attribute, we fill the NULL values with the most frequent value in the column.

In [13]:
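# Assign the result back instead of using inplace=True on a selected column,
# which is not reliable in recent pandas versions
df['Embarked'] = df['Embarked'].fillna('S')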

In [14]:
df.isnull().sum()      #NOW ALL NULL VALUES HAVE BEEN FILLED

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
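
Instead of typing the most frequent value by hand, it can also be looked up with `mode()`. This is just a sketch on a made-up toy column (the `toy` data and the `most_common` name are my own assumptions):

```python
import pandas as pd

# Hypothetical toy column with one missing port of embarkation
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# mode() ignores missing values and returns the most frequent value(s);
# [0] picks the first one if there is a tie
most_common = toy["Embarked"].mode()[0]

toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common)
print(toy["Embarked"].tolist())
```

This keeps the fill value in sync with the data if the dataset ever changes.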

In [15]:
df.corr(numeric_only=True)   # correlate numeric columns only; text columns cannot be correlated

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
Age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
SibSp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


# 5. Since SibSp and Parch both describe family aboard, we combine them into a new column ***FamilySize***.

SibSp: Number of siblings/spouses aboard

Parch: Number of parents/children aboard

In [16]:
df['FamilySize'] = df['SibSp'] + df['Parch']
df.drop(['SibSp', 'Parch'], axis=1, inplace=True)
df.corr(numeric_only=True)   # correlate numeric columns only

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,FamilySize
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilySize,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


FamilySize has very little correlation with survival (about 0.017).
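
A correlation close to zero only rules out a straight-line relationship, so it is worth looking at the survival rate per group as well. The sketch below uses made-up toy numbers (not Titanic results; the `toy` data is purely hypothetical) to show how a correlation of zero can hide a clear pattern:

```python
import pandas as pd

# Hypothetical toy data: mid-sized families survive, singles and large families do not
toy = pd.DataFrame({
    "FamilySize": [0, 0, 1, 1, 2, 2, 3, 3],
    "Survived":   [0, 0, 1, 1, 1, 1, 0, 0],
})

# Pearson correlation between the two columns
corr = toy["FamilySize"].corr(toy["Survived"])

# Survival rate for each family size
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()

print(corr)
print(rate_by_size)
```

In this toy example the correlation is zero even though survival clearly depends on family size, which is one reason to check group-wise rates before discarding a variable.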

# 6. Let's check whether being ***alone*** affects the survival rate.

In [17]:
# 1 if the passenger had no family aboard, else 0 (FamilySize is never negative)
df['Alone'] = (df['FamilySize'] == 0).astype(int)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [18]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

A passenger travelling alone had a lower chance of surviving (about 30% versus 51%).

One possible reason is that passengers travelling with family tended to belong to a wealthier class and may have been prioritized over others.

In [19]:
df[['Alone','Fare']].corr()

Unnamed: 0,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0


The negative correlation (about -0.27) shows that passengers who were not alone tended to pay higher fares.
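
Fare is only an indirect sign of wealth, so the same idea could also be checked against Pclass directly. Here is a minimal sketch with `pd.crosstab` on made-up toy data (the `toy` values are hypothetical and not taken from the Titanic dataset):

```python
import pandas as pd

# Hypothetical toy data: travelling alone (1) or not (0), and passenger class
toy = pd.DataFrame({
    "Alone":  [0, 0, 1, 1, 1, 1],
    "Pclass": [1, 2, 3, 3, 3, 1],
})

# normalize="index" turns the counts in each row into shares,
# i.e. the class mix within each Alone group
shares = pd.crosstab(toy["Alone"], toy["Pclass"], normalize="index")

print(shares)
```

Each row of `shares` sums to 1, so the two groups can be compared even if they differ in size.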

# 7. Checking whether the ***Sex*** of a passenger affects survival.

In [20]:
df['Sex'] = (df['Sex'] != 'male').astype(int)     # 1 for female and 0 for male
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

Female passengers had a much higher survival rate (about 74%) than male passengers (about 19%).

This suggests that women were prioritized over men.
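
The "Children" half of the rule could be checked in the same way by grouping Age into bands. This is only a sketch on made-up toy data (the `toy` values, the bin edges, and the labels are my own assumptions, not results from the dataset):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age":      [4.0, 10.0, 25.0, 40.0, 70.0],
    "Survived": [1, 1, 0, 1, 0],
})

# Assumed age bands; pd.cut uses right-closed intervals, e.g. (0, 12]
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# observed=True leaves out age bands that have no passengers
rate_by_age = toy.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_age)
```

One caveat: ages that were filled in with the column mean all land in the same band, so they can blur the result for that group.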

In [21]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

### **8**.  **CONCLUSIONS**


*   Female passengers had a much higher survival rate than male passengers, which suggests that women were prioritized.
*   Passengers in a higher class, or who paid a higher fare, had a higher survival rate than others. The class hierarchy might have been followed while saving the passengers.
*   Passengers travelling with their family had a higher survival rate than those travelling alone.
*   Passengers who boarded the ship at Cherbourg survived in a higher proportion than the others.
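
The first two conclusions could be looked at together, since sex and class may interact. A minimal sketch with `pivot_table` on made-up toy data (the `toy` values are hypothetical and do not reproduce the Titanic numbers):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Mean of Survived (the survival rate) for every class/sex combination
table = toy.pivot_table(values="Survived", index="Pclass",
                        columns="Sex", aggfunc="mean")

print(table)
```

Reading across a row compares the sexes within one class; reading down a column compares classes within one sex.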




