In [None]:
import numpy as np
import pandas as pd

**Reading data using pandas**

We use the `read_csv` function to read the dataset into a pandas DataFrame.


In [None]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**DESCRIPTION OF THE ATTRIBUTES:**

*   Pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
*   Survived: Survival (0 = No; 1 = Yes)
*   Name: Name
*   Sex: Sex
*   SibSp: Number of siblings/spouses aboard
*   Parch: Number of parents/children aboard
*   Ticket: Ticket number
*   Fare: Passenger fare (British pounds)
*   Cabin: Cabin
*   Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Handling Null Values:**

The dataset contains rows and columns in which some values are missing. There are two common ways to handle these missing values:
1. Dropping the entire row or column.
2. Replacing the missing value with the mean of the column.
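
The two options can be sketched on a small made-up frame. The column names and values below are illustrative assumptions, not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical data, not the Titanic set)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna(axis=0)

# Option 2: replace each missing value with the mean of its own column
filled = toy.fillna(toy.mean(numeric_only=True))

print(dropped.shape)
print(filled.isnull().sum().sum())
```

Dropping loses whole rows, while filling keeps every row at the cost of inventing values, so the choice depends on how much of a column is missing.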

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
print(df.shape)
                                                      

(891, 12)


**Separating the columns that have more than 35% of their values missing**

In [None]:
# Count nulls once, then keep only the columns above the 35% threshold
null_counts = df.isnull().sum()
drop_column = null_counts[null_counts > 0.35 * df.shape[0]]
drop_column


Cabin    687
dtype: int64

In [None]:
drop_column.index

Index(['Cabin'], dtype='object')
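
As an alternative sketch, the same threshold test can be written with the fraction of missing values per column instead of raw counts. The toy frame and the names `threshold`, `missing_fraction`, and `to_drop` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): "cabin" is mostly missing, "age" only partly
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

threshold = 0.35  # same 35% cut-off as in the cell above

# isnull().mean() gives the fraction of missing values in each column
missing_fraction = toy.isnull().mean()
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

print(to_drop)
```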

**Dropping the column**

In [None]:
df.drop(drop_column.index, axis=1, inplace=True)

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

**Filling the remaining null values in numeric columns with the mean of the column**

In [None]:
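# numeric_only=True skips string columns, which have no mean (required in pandas 2.0+)
df.fillna(df.mean(numeric_only=True), inplace=True)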
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

**Embarked still has 2 null values because it holds strings, and the mean of string values cannot be computed.**

In [None]:
df["Embarked"].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object
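
A minimal sketch of why a mean-based fill leaves string columns untouched, using a made-up two-column frame (the values are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one string column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", np.nan, "C"],
})

# Columns that are not numeric and therefore have no mean
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# The mean-based fill only covers the numeric column
filled = toy.fillna(toy.mean(numeric_only=True))
remaining = filled.isnull().sum()

print(non_numeric)
print(remaining)
```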

**We are filling in the null values in Embarked with the most frequent value in the column.**

In [None]:
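# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['Embarked'] = df['Embarked'].fillna('S')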
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
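
Instead of hard-coding the most frequent value, it can be looked up with `mode()`. This is a sketch on a made-up Series; `ports`, `most_frequent`, and `ports_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical port column with two missing entries
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores NaN by default and returns a Series; [0] takes the first mode
most_frequent = ports.mode()[0]
ports_filled = ports.fillna(most_frequent)

print(most_frequent)
print(ports_filled.isnull().sum())
```

On the notebook's frame the equivalent would be `df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])`.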

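**Finding the correlation**

In [None]:
# numeric_only=True restricts the matrix to numeric columns (required in pandas 2.0+)
df.corr(numeric_only=True)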

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
Age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
SibSp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


*   SibSp: Number of siblings/spouses aboard
*   Parch: Number of parents/children aboard

We combine them into a new column named FamilySize.





In [None]:
# Family size = siblings/spouses + parents/children travelling with the passenger
df['FamilySize'] = df["SibSp"] + df["Parch"]
df.drop(["SibSp", "Parch"], inplace=True, axis=1)
df.corr(numeric_only=True)

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,FamilySize
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilySize,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


**FamilySize has almost no linear correlation with survival (about 0.017).**


Now checking if being alone on the ship affected the survival rate.

In [None]:
# Alone = 1 when the passenger has no family aboard, else 0
df["Alone"] = (df["FamilySize"] == 0).astype(int)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [None]:
df.groupby(['Alone'])['Survived'].mean()


Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

0 = Not Alone

1 = Alone

So a passenger travelling alone had a lower chance of survival (about 30%, versus about 51% for passengers with family aboard).

**One possible reason is that passengers travelling with family tended to belong to a higher class and may have been prioritized over others.**


In [None]:
df[['Alone', 'Fare']].corr()

Unnamed: 0,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0


Alone and Fare are negatively correlated (about -0.27), so passengers who were not alone tended to pay higher fares.
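
When one of the two columns is a 0/1 flag, the sign of the correlation matches the difference between the two group means, so a group-wise mean gives a more direct reading. The sketch below uses invented fares to show the relationship; it does not reproduce the dataset's numbers.

```python
import pandas as pd

# Hypothetical fares: passengers flagged Alone = 1 pay less in this toy sample
toy = pd.DataFrame({
    "Alone": [1, 1, 1, 0, 0, 0],
    "Fare": [7.25, 8.05, 7.90, 71.28, 53.10, 21.00],
})

# Mean fare within each group of the flag
mean_fare = toy.groupby("Alone")["Fare"].mean()

# Pearson correlation between the flag and the fare
corr = toy["Alone"].corr(toy["Fare"])

print(mean_fare)
print(corr < 0)
```

On the notebook's frame, `df.groupby('Alone')['Fare'].mean()` would show the same comparison in fare units.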

In [None]:
# Encode Sex as an integer: 0 = male, 1 = female
df['Sex'] = (df['Sex'] != 'male').astype(int)
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

**This shows that female passengers had a higher chance of survival (about 74% versus 19% for male passengers).**

**It suggests that female passengers were prioritized over male passengers.**

In [None]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

**This shows that passengers who embarked at Cherbourg had the highest survival rate (about 55%).**



> **CONCLUSION**


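*   Female passengers survived at a much higher rate than male passengers, which suggests they were prioritized.
*   Passengers travelling with their families had a higher chance of survival.
*   Passengers who boarded at Cherbourg survived in a higher proportion than those from the other ports.
*   Upper-class passengers had a higher chance of survival (Pclass and Survived have a correlation of about -0.34, and class 1 is the upper class). This suggests a class hierarchy may have been followed while saving the passengers.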
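
The class conclusion rests on the Pclass/Survived correlation. A per-class survival rate would check it more directly, in the same way as the groupings by Alone, Sex, and Embarked. The helper `survival_rate_by` below is a hypothetical name, and the toy rows are invented for illustration; they are not results from the dataset.

```python
import pandas as pd


def survival_rate_by(frame, column):
    """Mean of the 0/1 'Survived' column within each group of `column`."""
    return frame.groupby(column)["Survived"].mean()


# Made-up rows, only to show the shape of the output
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_rate_by(toy, "Pclass")
print(rates)
```

On the notebook's frame the call would be `survival_rate_by(df, 'Pclass')`.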









