## Import data

In [19]:
import pandas as pd
import seaborn

path_to_file = 'train_titanic.csv'

titanic_data = pd.read_csv(path_to_file, index_col='PassengerId')

# Context

## What is the Titanic dataset about?

The Titanic dataset contains features describing people who were on board the Titanic, such as their age, the port where they embarked, and whether they survived the disaster.

In [22]:
titanic_data.head(15)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,1,"Oconnor, Frankie",male,,2,0,209245,27.14,C12239,S
1,0,3,"Bryan, Drew",male,,0,0,27323,13.35,,S
2,0,3,"Owens, Kenneth",male,0.33,1,2,CA 457703,71.29,,S
3,0,3,"Kramer, James",male,19.0,0,0,A. 10866,13.04,,S
4,1,3,"Bond, Michael",male,25.0,0,0,427635,7.76,,S
5,0,2,"Sassano, Jonathan",male,35.0,0,0,13363,6.71,,S
6,0,3,"Conway, Jose",male,7.0,0,0,A/5,9.77,,S
7,1,3,"Werner, Linda",female,6.0,1,2,434426,31.5,,S
8,0,1,"Wardlaw, Michael",male,27.0,2,1,474849,73.02,A7253,S
9,0,2,"Greigo, Rudy",male,66.0,0,0,6981,9.14,D2969,C


## What is the unique id for each row of data?

Each row is identified by the 'PassengerId' column, which `read_csv` loads as the index of the DataFrame.

## Understanding the meaning of each variable

In [9]:
titanic_data.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Let's go through the meaning of the features (variables) in the dataset.

- 'Survived' - takes the value 0 or 1 and shows whether the person survived the sinking of the Titanic: 0 means the person died and 1 means the person survived

- 'Pclass' - the passenger's ticket class: 1, 2 or 3

# Data quality assessment

## Number of rows and columns

Let's check how many rows and columns the initial dataset has using the shape attribute.

In [12]:
titanic_data_shape = titanic_data.shape
titanic_data_shape

(100000, 11)

In other words, there are 100,000 rows and 11 columns. All 11 columns are features; the 'PassengerId' index is not counted in the shape.

## Types of columns

Let's inspect data types of columns

In [25]:
titanic_data.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [26]:
titanic_data.dtypes.value_counts()

object     5
int64      4
float64    2
dtype: int64

The dataset has 4 columns with integers, 2 columns with floats and 5 columns with object values.

## Categorizing data

Our features can be divided into two groups based on their data types:

1. Categorical features. All data that is NOT numerical (in our dataset these are the columns with the object type).

In [36]:
categorical_columns = list(titanic_data.select_dtypes(include=['object']).columns.values)
categorical_columns

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']


2. Quantitative features. In our dataset these are the columns with integer and float data types.

In [35]:
quantitative_columns = list(titanic_data.select_dtypes(include=['float64', 'int64']).columns.values)
quantitative_columns

['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

## Checking for duplicates
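
A minimal sketch of how duplicates could be checked with pandas. The small `toy` frame below is a made-up stand-in for `titanic_data` (its values are assumptions, not taken from the dataset); the same calls should work on the real frame.

```python
import pandas as pd

# Toy stand-in for titanic_data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 0, 1],
        "Pclass": [1, 3, 3, 2],
        "Sex": ["male", "female", "female", "male"],
    },
    index=pd.Index([0, 1, 2, 3], name="PassengerId"),
)

# duplicated() marks rows whose column values repeat an earlier row;
# the index (PassengerId) is not part of this comparison
n_duplicate_rows = int(toy.duplicated().sum())

# Repeated ids are checked on the index itself
n_duplicate_ids = int(toy.index.duplicated().sum())

print(n_duplicate_rows, n_duplicate_ids)
```

Two rows with identical feature values are not necessarily the same passenger, so a non-zero row count is a prompt to look closer rather than a reason to drop rows automatically.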

## Dealing with missing data

In [18]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Survived  100000 non-null  int64  
 1   Pclass    100000 non-null  int64  
 2   Name      100000 non-null  object 
 3   Sex       100000 non-null  object 
 4   Age       96708 non-null   float64
 5   SibSp     100000 non-null  int64  
 6   Parch     100000 non-null  int64  
 7   Ticket    95377 non-null   object 
 8   Fare      99866 non-null   float64
 9   Cabin     32134 non-null   object 
 10  Embarked  99750 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 9.2+ MB
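
The non-null counts above can be turned into missing counts directly (for example, 100000 - 32134 = 67866 missing 'Cabin' values). A minimal sketch of computing this per column, using a made-up `toy` frame in place of `titanic_data`:

```python
import pandas as pd

# Toy stand-in for titanic_data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 35.0, None],
        "Cabin": [None, None, None, "C85"],
        "Sex": ["male", "female", "female", "male"],
    }
)

# isna() gives a boolean frame; summing counts the missing cells per column
missing_count = toy.isna().sum()

# The mean of a boolean column is the share of True values
missing_share = toy.isna().mean() * 100

missing_summary = pd.DataFrame(
    {"missing": missing_count, "percent": missing_share}
).sort_values("missing", ascending=False)
print(missing_summary)
```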


## Outlier analysis (boxplots)
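
This section is empty in the notebook. As one possible starting point, the sketch below flags outliers with the 1.5 x IQR rule, which is the rule boxplot whiskers conventionally follow. The `fare` series is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for titanic_data['Fare'] (values are made up for illustration)
fare = pd.Series([7.76, 9.14, 9.77, 13.04, 13.35, 27.14, 31.5, 500.0], name="Fare")

# Interquartile range: distance between the 25th and 75th percentiles
q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

print(outliers)
```

With seaborn already imported, something like `seaborn.boxplot(x=titanic_data['Fare'])` would presumably draw the matching plot for the real column.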

## Dropping unnecessary columns

In [4]:
# Check how many non-null values each column contains
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Survived  100000 non-null  int64  
 1   Pclass    100000 non-null  int64  
 2   Name      100000 non-null  object 
 3   Sex       100000 non-null  object 
 4   Age       96708 non-null   float64
 5   SibSp     100000 non-null  int64  
 6   Parch     100000 non-null  int64  
 7   Ticket    95377 non-null   object 
 8   Fare      99866 non-null   float64
 9   Cabin     32134 non-null   object 
 10  Embarked  99750 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 9.2+ MB


We see that fewer than a third of the values in the 'Cabin' column are non-null. It would be impossible to restore the missing values, because the column holds identifiers. So let's drop this feature.

Let's also drop the 'Name' and 'Ticket' columns, because they don't carry information that would affect our conclusions.

In [5]:
unnecessary_columns = ['Name', 'Ticket', 'Cabin']
titanic_data_without_unnes_columns = titanic_data.drop(columns=unnecessary_columns)
titanic_data_without_unnes_columns

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,1,male,,2,0,27.14,S
1,0,3,male,,0,0,13.35,S
2,0,3,male,0.33,1,2,71.29,S
3,0,3,male,19.00,0,0,13.04,S
4,1,3,male,25.00,0,0,7.76,S
...,...,...,...,...,...,...,...,...
99995,1,2,female,62.00,0,0,14.86,C
99996,0,2,male,66.00,0,0,11.15,S
99997,0,3,male,37.00,0,0,9.95,S
99998,0,3,male,51.00,0,1,30.92,S


## Filling missing values

Let's fill the numerical NaN values with the median: the median is robust to outliers, so it should distort our results less than filling with the mean, 0 or another constant.

As for the categorical NaN values, let's fill them with the mode (the most frequent value).
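
The notebook does not show the code for this step. A minimal sketch of one way to do it, using a made-up `toy` frame in place of the real data (the column values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Age": [19.0, None, 25.0, 66.0],
        "Fare": [13.04, 7.76, None, 9.14],
        "Embarked": ["S", "S", None, "C"],
    }
)

filled = toy.copy()

# Numerical columns: fill each column's NaN with that column's median
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Remaining (categorical) columns: fill NaN with the most frequent value
categorical_cols = filled.columns.difference(numeric_cols)
for col in categorical_cols:
    # mode() can return several values on ties; take the first one
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Working on a copy keeps the original frame available for comparing distributions before and after the fill.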

# Data exploration

## Visualization and summary statistics for each variable

### 'Survived' column

In [41]:
titanic_data.Survived.describe()

count    100000.000000
mean          0.427740
std           0.494753
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           1.000000
Name: Survived, dtype: float64

In [43]:
titanic_data.Survived.value_counts()

0    57226
1    42774
Name: Survived, dtype: int64
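
Because 'Survived' is coded as 0/1, its mean is the share of survivors, which is why the `mean` reported by `describe()` equals 42774 / 100000. A small sketch of the same idea on a made-up series:

```python
import pandas as pd

# Toy stand-in for titanic_data.Survived (values are made up for illustration)
survived = pd.Series([0, 1, 0, 0, 1], name="Survived")

# normalize=True returns shares instead of raw counts
shares = survived.value_counts(normalize=True)

# For a 0/1 column the mean equals the share of ones
survival_rate = survived.mean()

print(shares)
print(survival_rate)
```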

## Assess relationships

In [6]:
# TODO: look at individual columns and how each relates to survival
# TODO: look at the relationship with survival for women only
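
The two planned checks could be sketched with `groupby` as shown below. The `toy` frame is made up for illustration, and grouping by 'Pclass' in the second step is an assumed choice of column, not something the notebook specifies.

```python
import pandas as pd

# Toy stand-in for titanic_data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 0, 1, 1, 0],
        "Sex": ["female", "male", "male", "female", "male", "female"],
        "Pclass": [1, 3, 3, 2, 1, 3],
    }
)

# Survival rate per category of a single column
survival_by_sex = toy.groupby("Sex")["Survived"].mean()

# The same kind of breakdown restricted to women
women = toy[toy["Sex"] == "female"]
women_survival_by_class = women.groupby("Pclass")["Survived"].mean()

print(survival_by_sex)
print(women_survival_by_class)
```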

# Summary