# 1 - Define the problem / Questions

# 2 - Data Collection

# 3 - Data Cleaning

- Handle missing values
- Identify and remove duplicates
- Handle outliers
- Transform the data into a suitable format

# 4 - EDA - Exploratory Data Analysis
# 5 - Data Visualization
# 6 - Analyze the insights & Report


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the Titanic dataset into the titanic variable

titanic = sns.load_dataset("titanic")

In [3]:
# Let's start the initial data inspection

# Check whether the data has loaded correctly

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
# Let's check the overall information about the data

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


# insights -
- The dataset has 891 passenger records and 15 features: two boolean, two categorical, two float, four integer, and five object features.

- Although there are 891 passengers, some features contain missing values (age, embarked, deck, and embark_town have fewer than 891 non-null entries).

- So let's check the exact numbers.

In [5]:
titanic.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


# insights

- The following features have missing values:
- age has 177 missing values; these should be imputed with the mean or median within each pclass and sex group.

- embarked and embark_town each have 2 missing values; these can be imputed with the mode.

- deck is mostly missing (only 203 of 891 values are present), so it is better to drop this column.
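The decision to drop a mostly empty column can be backed by the share of missing values per column. The following is a minimal sketch on a made-up toy frame (not the real Titanic data); the 50% cut-off is an assumed threshold for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed threshold: flag columns where more than half of the values are missing.
to_drop = missing_share[missing_share > 0.5].index.tolist()
print(to_drop)
```

On the real data the same two lines would be applied to `titanic` instead of `toy`.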

In [6]:
# Impute the missing age values with the median of each (pclass, sex) group

In [12]:
titanic["age"] = titanic.groupby(['pclass', 'sex'])['age'].transform(lambda x : x.fillna(x.median()))

# Note: groupby(['pclass', 'sex']) takes the grouping columns as a list,
# and ['age'] selects the column to work on within each group.
# Note: the lambda is an anonymous function that is applied once per group.
# Note: fillna(x.median()) fills that group's missing ages with the group's median.
# Note: transform returns a result aligned to the original rows, so it can be assigned straight back to titanic['age'].
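To see what the group-wise imputation does, here is a small sketch on a made-up toy frame. It uses `transform("median")` followed by `fillna`, which is an equivalent alternative to the lambda form above; the toy values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy data: two (pclass, sex) groups, each with one missing age.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("median") returns one value per original row:
# the median age of that row's (pclass, sex) group, ignoring NaN.
group_median = toy.groupby(["pclass", "sex"])["age"].transform("median")

# Fill only the missing ages; existing ages are left untouched.
toy["age"] = toy["age"].fillna(group_median)
print(toy)
```

The missing age in the first group becomes 35.0 (the median of 30 and 40), and the one in the second group becomes 22.0 (the median of 20 and 24).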

In [17]:
titanic['age'].isnull().sum()

np.int64(0)

In [18]:
# Replace the missing embarked values with the mode

In [23]:
titanic['embarked'].mode()[0]

'S'

In [25]:

# Assign the filled column back. Calling fillna(..., inplace=True) on
# titanic['embarked'] acts on an intermediate object; pandas raises a
# FutureWarning for that pattern and it will stop working in pandas 3.0.
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])
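The two non-chained ways of filling a single column can be compared on a toy frame. This is a minimal sketch with made-up values; `filled_a` and `filled_b` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing port code.
toy = pd.DataFrame({"embarked": ["S", "C", np.nan, "S", "Q"]})
most_common = toy["embarked"].mode()[0]

# Option 1: assign the filled column back to the DataFrame.
filled_a = toy.copy()
filled_a["embarked"] = filled_a["embarked"].fillna(most_common)

# Option 2: fill through the DataFrame with a {column: value} mapping.
filled_b = toy.fillna({"embarked": most_common})

print(filled_a["embarked"].tolist())
print(filled_a.equals(filled_b))
```

Both options operate on the DataFrame itself rather than on a temporary column object, so neither depends on `inplace=True`.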


In [27]:
titanic.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


In [28]:
# Let's drop the columns that are not needed
titanic.drop(columns=['deck', 'embark_town','who', 'alive', 'adult_male'], inplace=True)

In [29]:
titanic.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
alone,0


In [30]:
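# The data may contain duplicated rows, so let's check for them.
# Note: the output below comes from a later re-run (In [35]), after the
# duplicates had already been dropped, which is why it shows 0.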

In [35]:
titanic.duplicated().sum()

np.int64(0)

In [32]:
# The check found 118 duplicated rows (891 - 773 = 118), so let's drop them permanently
titanic.drop_duplicates(inplace= True)

In [33]:
# lets check it again
titanic.duplicated().sum()

np.int64(0)
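The duplicate check and removal can be followed on a toy frame. This is a sketch with made-up rows. Note that once identifying columns have been dropped, two rows that match on every remaining column are not necessarily the same passenger, so removing them is a modelling choice rather than a pure clean-up.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "pclass": [3, 3, 1, 3],
    "sex": ["male", "male", "female", "male"],
    "fare": [8.05, 8.05, 53.10, 8.05],
})

n_before = len(toy)

# duplicated() marks every row that is identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_before, n_dupes, len(deduped))
```

Counting the duplicates before dropping them, and comparing the row counts afterwards, gives a record of how many rows were removed even if cells are later re-run out of order.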

In [34]:
# so finally lets check the shape of data
titanic.shape

(773, 10)

In [37]:
# So the final dataset has 773 passenger records.
# Let's look at the summary statistics.

In [38]:
titanic.describe(include='all')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,alone
count,773.0,773.0,773,773.0,773.0,773.0,773.0,773,773,773
unique,,,2,,,,,3,3,2
top,,,male,,,,,S,Third,True
freq,,,482,,,,,562,400,436
mean,0.415265,2.247089,,29.558111,0.529107,0.421734,35.003315,,,
std,0.493087,0.85307,,13.988257,0.99128,0.84138,52.443053,,,
min,0.0,1.0,,0.42,0.0,0.0,0.0,,,
25%,0.0,1.0,,21.0,0.0,0.0,8.05,,,
50%,0.0,3.0,,28.0,0.0,0.0,16.1,,,
75%,1.0,3.0,,38.0,1.0,1.0,34.375,,,


# insights -
- Only about 41.5% of the passengers survived; about 58.5% did not.
- About half of the passengers (400 of 773) were in third class, which is the most common class.
- Most passengers were young adults: the median age is 28 (mean about 29.6). The maximum of 80 years may be an outlier.
- The mean fare is about 35, but the median is only 16.1 and the maximum is 512.33, so the fare column likely contains outliers.
- Most passengers were male (482 of 773) and were travelling alone (436 of 773).
- Most passengers (562 of 773) embarked at "S" (Southampton).
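One common way to check the suspected fare outliers is the 1.5 × IQR rule. The sketch below applies it to a handful of made-up fare values (a few are borrowed from the summary table above, but this is not the real column), so the flagged values are only illustrative. The IQR rule is a choice made for this example, not a method stated in the notebook.

```python
import pandas as pd

# Toy fares: mostly small values plus one very large one.
fare = pd.Series([7.25, 8.05, 16.1, 34.375, 53.1, 512.33], name="fare")

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

print(outliers)
```

On the real data, `fare` would be `titanic["fare"]`, and the same steps could be repeated for `titanic["age"]`.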