In [1]:
import pandas as pd

titanic_df = pd.read_csv("train.csv")

## Basic Information

In [4]:
titanic_df.shape

(891, 12)

In [5]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [6]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Specific Information

In [10]:
titanic_df[["Age","Fare"]].mean()

Age     29.699118
Fare    32.204208
dtype: float64

In [11]:
titanic_df[["Age","Fare"]].describe()

| | Age | Fare |
|---|---|---|
| count | 714.0 | 891.0 |
| mean | 29.699118 | 32.204208 |
| std | 14.526497 | 49.693429 |
| min | 0.42 | 0.0 |
| 25% | 20.125 | 7.9104 |
| 50% | 28.0 | 14.4542 |
| 75% | 38.0 | 31.0 |
| max | 80.0 | 512.3292 |


**Description:**
1. The mean age is about 29.7 years and the median is 28. The middle 50% of passengers with a recorded age were between roughly 20 and 38 years old, so most were young adults. Only 714 of the 891 rows have an age, so these figures exclude passengers whose age is missing.

2. The fare distribution is strongly right-skewed: the mean is about 32.20, but the median is only 14.45, so half of the passengers paid less than that. 75% paid 31 or less, while the maximum fare is about 512.33.

3. The minimum age is 0.42 years, indicating infants on board, and the maximum age is 80 years, showing the presence of elderly passengers.

4. The minimum fare is 0. Zero fares might correspond to crew members or complimentary tickets, but the data alone does not explain them.
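
The gap between the mean and the median fare can be checked with a few summary numbers. The following is a minimal sketch, assuming a hypothetical helper `skew_summary` (not part of this notebook) and a small illustrative sample of fare values copied from the outputs above rather than the full column:

```python
import pandas as pd


def skew_summary(values: pd.Series) -> dict:
    """Return simple indicators of right skew for a numeric column."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),  # a value above 0 suggests a long right tail
        "zero_count": int((values == 0).sum()),  # how many entries are exactly 0
    }


# Illustrative sample only: a handful of fares that appear in the outputs above.
sample_fares = pd.Series([0.0, 7.25, 7.925, 8.05, 14.4542, 53.1, 71.2833, 512.3292])

summary = skew_summary(sample_fares)
print(summary)
```

On the real data the same helper could be called as `skew_summary(titanic_df["Fare"])`; a mean well above the median together with a positive skew would support the reading given in point 2.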



In [13]:
titanic_df.value_counts("Survived")

Survived
0    549
1    342
Name: count, dtype: int64

In [14]:
titanic_df.value_counts("Age")

Age
24.0    30
22.0    27
18.0    26
19.0    25
30.0    25
        ..
53.0     1
66.0     1
70.5     1
74.0     1
80.0     1
Name: count, Length: 88, dtype: int64

In [15]:
titanic_df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [17]:
titanic_df.value_counts(["Sex","Survived"], normalize= True)

Sex     Survived
male    0           0.525253
female  1           0.261504
male    1           0.122334
female  0           0.090909
Name: proportion, dtype: float64

**Description:**

1. These proportions are shares of all 891 passengers, not survival rates within each sex. Males who did not survive make up about 52.53% of all passengers, while males who survived make up only 12.23%.

2. Females who survived make up 26.15% of all passengers, while females who did not survive make up only 9.09%.

3. Within each sex, this means roughly 74% of females survived compared with roughly 19% of males. Females had a much higher chance of survival, which is consistent with the “women and children first” evacuation policy followed during the disaster.
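
The output of `value_counts(..., normalize=True)` on two columns gives joint proportions, that is, shares of all passengers. Survival rates within each sex divide by that sex's total instead. A minimal sketch of the difference, rebuilding the Sex by Survived counts reported elsewhere in this notebook (81/233 for females, 468/109 for males) so that it runs without `train.csv`; the variable names are illustrative:

```python
import pandas as pd

# Counts of passengers by Sex (rows) and Survived (columns), as reported in this notebook.
counts = pd.DataFrame(
    {0: [81, 468], 1: [233, 109]},
    index=pd.Index(["female", "male"], name="Sex"),
)
counts.columns.name = "Survived"

# Joint proportions: each cell divided by the grand total (891).
joint = counts / counts.to_numpy().sum()

# Conditional proportions: each cell divided by its row total (per-sex survival rates).
conditional = counts.div(counts.sum(axis=1), axis=0)

print(joint.round(6))
print(conditional.round(6))
```

With the full DataFrame loaded, `pd.crosstab(titanic_df.Sex, titanic_df.Survived, normalize="index")` should give the same conditional table directly.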

In [19]:
titanic_df[titanic_df["Sex"] == "female"].value_counts("Survived", ascending=True, normalize = True)

Survived
0    0.257962
1    0.742038
Name: proportion, dtype: float64

Nearly three quarters (about 74.2%) of the female passengers survived.

In [27]:
titanic_df[titanic_df["Embarked"] == "S"].value_counts("Sex", ascending=True)


Sex
female    203
male      441
Name: count, dtype: int64

This cell counts the passengers who boarded at Southampton (`Embarked == "S"`), split by sex. Of the 644 passengers who boarded there, 203 were women and 441 were men.

In [28]:
titanic_df[titanic_df["Embarked"] == "S"].value_counts("Survived", ascending=True)


Survived
1    217
0    427
Name: count, dtype: int64

This cell counts survivors among the passengers who boarded at Southampton. Of those 644 passengers, 217 survived and 427 did not, a survival rate of about 33.7%.
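
To put the Southampton figure in context, the same rate can be computed for every port. A minimal sketch using counts only: the Southampton counts come from the cell above, and the Cherbourg and Queenstown counts are summed from the Embarked/Pclass crosstab at the end of this notebook. The column names `died`, `survived`, `total` and `survival_rate` are illustrative:

```python
import pandas as pd

# Survivor counts per port of embarkation, taken from outputs in this notebook.
port_counts = pd.DataFrame(
    {"died": [75, 47, 427], "survived": [93, 30, 217]},
    index=pd.Index(["C", "Q", "S"], name="Embarked"),
)

# Survival rate per port = survivors divided by everyone who boarded there.
port_counts["total"] = port_counts["died"] + port_counts["survived"]
port_counts["survival_rate"] = port_counts["survived"] / port_counts["total"]

print(port_counts)
```

With the full DataFrame loaded, `titanic_df.groupby("Embarked")["Survived"].mean()` should produce the same rates, since the mean of a 0/1 column is the share of ones.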

In [3]:
# unique() lists the distinct values (NaN included); nunique() counts them and skips NaN by default
titanic_df["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [4]:
titanic_df["Embarked"].nunique()

3

In [6]:
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [11]:
group_by_gender = titanic_df[["Sex","Fare","Age","Survived"]].groupby(by = "Sex")

In [12]:
group_by_gender.mean()

| Sex | Fare | Age | Survived |
|---|---|---|---|
| female | 44.479818 | 27.915709 | 0.742038 |
| male | 25.523893 | 30.726645 | 0.188908 |


In [14]:
group_by_gender.value_counts(["Survived"])

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: count, dtype: int64

In [4]:
# Group by port of embarkation and survival (the name reflects the grouping keys)
group_by_port_survival = titanic_df[["Embarked","Fare","Age","Survived"]].groupby(by = ["Embarked","Survived"])

In [5]:
group_by_port_survival.mean()

| Embarked | Survived | Fare | Age |
|---|---|---|---|
| C | 0 | 35.443335 | 33.666667 |
| C | 1 | 79.720926 | 28.973671 |
| Q | 0 | 13.335904 | 30.325 |
| Q | 1 | 13.182227 | 22.5 |
| S | 0 | 20.743987 | 30.203966 |
| S | 1 | 39.547081 | 28.113184 |


## Correlation and Covariance

In [7]:
titanic_df[["Age","Fare"]].cov()

| | Age | Fare |
|---|---|---|
| Age | 211.019125 | 73.84903 |
| Fare | 73.84903 | 2469.436846 |


In [8]:
titanic_df[["Age","Fare"]].corr()

| | Age | Fare |
|---|---|---|
| Age | 1.0 | 0.096067 |
| Fare | 0.096067 | 1.0 |


In [17]:
sampled = titanic_df.sample(n=100, random_state=99)

In [18]:
sampled[["Age","Fare"]].corr()

| | Age | Fare |
|---|---|---|
| Age | 1.0 | 0.120865 |
| Fare | 0.120865 | 1.0 |


In [21]:
pd.crosstab(titanic_df.Sex,titanic_df.Survived,margins=True)

| Sex | Survived = 0 | Survived = 1 | All |
|---|---|---|---|
| female | 81 | 233 | 314 |
| male | 468 | 109 | 577 |
| All | 549 | 342 | 891 |


In [22]:
pd.crosstab([titanic_df.Embarked,titanic_df.Pclass],[titanic_df.Sex,titanic_df.Survived],margins=True)

| Embarked | Pclass | female, Survived = 0 | female, Survived = 1 | male, Survived = 0 | male, Survived = 1 | All |
|---|---|---|---|---|---|---|
| C | 1 | 1 | 42 | 25 | 17 | 85 |
| C | 2 | 0 | 7 | 8 | 2 | 17 |
| C | 3 | 8 | 15 | 33 | 10 | 66 |
| Q | 1 | 0 | 1 | 1 | 0 | 2 |
| Q | 2 | 0 | 2 | 1 | 0 | 3 |
| Q | 3 | 9 | 24 | 36 | 3 | 72 |
| S | 1 | 2 | 46 | 51 | 28 | 127 |
| S | 2 | 6 | 61 | 82 | 15 | 164 |
| S | 3 | 55 | 33 | 231 | 34 | 353 |
| All | | 81 | 231 | 468 | 109 | 889 |
