# Data understanding

We will analyze the *titanic* dataset:

* to find out what information we have (statistical units, variables)
* to check the quality and reliability of the data
* to understand the distributions of the variables and their relationships
* to suggest steps for data cleaning
* to suggest useful data transformations

## 0. What is our goal?

The analysis of data follows from the goal set during **business understanding**. So first we state that goal:

> We analyze Titanic data to find out how survival for each passenger can be predicted from his or her attributes.

Let's start with loading data and making a quick overview.

In [2]:
### Setup
%matplotlib inline
# should enable plotting without explicit call .show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme
sns.set_theme()

# Reading and inspecting data
df = pd.read_csv("titanic_train.csv")
df

,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q,13,,,1
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,,S,,,Croatia,0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.1500,,S,,,,0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0000,,S,4,,"Cornwall / Akron, OH",1
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0000,,S,,,"Barre, Co Washington, VT",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
845,158,1,"Hipkins, Mr. William Edward",male,55.0,0,0,680,50.0000,C39,S,,,London / Birmingham,0
846,174,1,"Kent, Mr. Edward Austin",male,58.0,0,0,11771,29.7000,B37,C,,258.0,"Buffalo, NY",0
847,467,2,"Kantor, Mrs. Sinai (Miriam Sternin)",female,24.0,1,0,244367,26.0000,,S,12,,"Moscow / Bronx, NY",1
848,1112,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,,S,,,,0


## 1. Basic overview of the data

1. Rows: How many? What are the statistical units? How can a unit be identified?
2. Columns: How many? What are their names, types, meanings? At first glance, do the values seem plausible? Are all of them useful for our purpose?

Summary: do we need to carry out any initial transformations? (e.g. to take a sample of rows or columns; to convert column names to lowercase; to add an ID column; to remove some columns etc.)

In [4]:
print(df.shape)
print(df.dtypes)

(850, 15)
passenger_id      int64
pclass            int64
name             object
sex              object
age             float64
sibsp             int64
parch             int64
ticket           object
fare            float64
cabin            object
embarked         object
boat             object
body            float64
home.dest        object
survived          int64
dtype: object
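
The questions about rows and identification can also be answered in code. The following is a minimal sketch, not part of the original notebook: the helper `overview` is a hypothetical name, and `sample` holds only a few rows and columns copied from the preview above so that the block runs on its own. In the notebook, the real `df` would be passed instead.

```python
import pandas as pd

# Toy frame: the first five rows from the preview above (selected columns only),
# used here just to make the sketch self-contained. Use the real `df` in practice.
sample = pd.DataFrame({
    "passenger_id": [1216, 699, 1267, 449, 576],
    "pclass": [3, 3, 3, 2, 2],
    "sex": ["female", "male", "female", "female", "male"],
    "age": [None, 38.0, 30.0, 54.0, 40.0],
    "survived": [1, 0, 0, 1, 0],
})


def overview(frame, id_col="passenger_id"):
    """Hypothetical helper: row/column counts and a check of the ID column."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        # True if every value in id_col occurs exactly once
        "id_is_unique": frame[id_col].is_unique,
        # True if any ID is missing
        "id_has_missing": bool(frame[id_col].isna().any()),
    }


info = overview(sample)
print(info)
```

If `id_is_unique` is true and no ID is missing, `passenger_id` can serve as the identifier of a statistical unit (one passenger per row).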


## 2. Checking the data quality

* Are there any duplicated rows (ignoring the ID column)?
* What are the counts and shares of missing values in the dataset columns?
* Are the counts of missing values expected and acceptable?
* Are any columns or rows (almost) empty, so that they may be removed as useless?
* In which columns should we consider fixing values (correcting or filling them)?

In [14]:
df.count(axis=1).value_counts()

12    265
11    189
13    169
14    121
10    106
dtype: int64
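
The cell above counts non-missing values per row. The first two questions in the checklist (duplicates and missing values per column) can be sketched as follows. This is only one possible approach: `duplicated_ignoring_id` and `missing_summary` are hypothetical helper names, and `sample` is again a handful of rows copied from the preview, not the full dataset.

```python
import pandas as pd

# Toy frame: the first five rows from the preview (selected columns only).
# Use the real `df` in the notebook.
sample = pd.DataFrame({
    "passenger_id": [1216, 699, 1267, 449, 576],
    "sex": ["female", "male", "female", "female", "male"],
    "age": [None, 38.0, 30.0, 54.0, 40.0],
    "cabin": [None, None, None, None, None],
    "boat": ["13", None, None, "4", None],
    "survived": [1, 0, 0, 1, 0],
})


def duplicated_ignoring_id(frame, id_col="passenger_id"):
    """Number of rows that repeat an earlier row once the ID column is dropped."""
    return int(frame.drop(columns=[id_col]).duplicated().sum())


def missing_summary(frame):
    """Count and share of missing values per column, most incomplete first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share": n_missing / len(frame),
    })
    return summary.sort_values("share", ascending=False)


n_duplicates = duplicated_ignoring_id(sample)
summary = missing_summary(sample)
print(n_duplicates)
print(summary)
```

Columns with a share close to 1 are candidates for removal; columns with a small share are candidates for filling. The shares printed for `sample` describe only these five rows and say nothing about the full dataset.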

After all these checks we can summarize the data quality and make recommendations for preprocessing (cleaning, fixing) the data. Some of the steps can be done immediately if they are necessary or useful for the analysis.

## 3. Checking variable distributions

It's a good idea to start with the most important variables: the target (*survived*) and those we expect to be highly informative about the target while being complete (*sex*, *pclass*, *fare*, *embarked*). Then we move on to variables that are more complicated or need fixing (*age*).

For each of the six variables above, try to do the following:

* Compute descriptive statistics of the distribution and draw a suitable graph.
* Consider whether the distribution is as expected and seems plausible (no strange or obviously invalid values).
* If the variable has missing values, try to figure out the reasons for them and suggest a fix, if necessary.
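
The first step in the list can be sketched with pandas alone. The helpers `describe_categorical` and `describe_numeric` are hypothetical names, and `sample` is a toy frame of five rows copied from the preview; the statistics it yields are for illustration only.

```python
import pandas as pd

# Toy frame: the first five rows from the preview (selected columns only).
# Use the real `df` in the notebook.
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "female", "male"],
    "pclass": [3, 3, 3, 2, 2],
    "fare": [7.7333, 8.6625, 24.15, 23.0, 13.0],
    "age": [None, 38.0, 30.0, 54.0, 40.0],
    "survived": [1, 0, 0, 1, 0],
})


def describe_categorical(series):
    """Counts and shares of each category, with missing values as their own row."""
    counts = series.value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / len(series)})


def describe_numeric(series):
    """Standard describe() output plus the number of missing values."""
    stats = series.describe()
    stats["n_missing"] = series.isna().sum()
    return stats


sex_table = describe_categorical(sample["sex"])
age_stats = describe_numeric(sample["age"])
print(sex_table)
print(age_stats)
```

For the graph, seaborn's `sns.countplot(data=df, x="sex")` suits categorical variables such as *sex*, *pclass*, *embarked* and *survived*, while `sns.histplot(data=df, x="age")` suits continuous ones such as *age* and *fare*.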

## 4. Analysis of relationships

The last part of this practice section is to analyze the relationships between variables. Check how survival (*survived*) is related to each of the five remaining variables considered in the previous part (*sex*, *pclass*, *fare*, *embarked*, *age*).
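
For categorical predictors, one simple way to look at the relationship is the share of survivors in each group. The sketch below assumes that approach; `survival_rate_by` is a hypothetical helper, and `sample` contains only five rows copied from the preview, so the printed rates are not representative of the dataset.

```python
import pandas as pd

# Toy frame: the first five rows from the preview (selected columns only).
# Use the real `df` in the notebook.
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "female", "male"],
    "pclass": [3, 3, 3, 2, 2],
    "survived": [1, 0, 0, 1, 0],
})


def survival_rate_by(frame, column, target="survived"):
    """Share of survivors and group size for each value of `column`."""
    grouped = frame.groupby(column, dropna=False)[target].agg(["mean", "count"])
    return grouped.rename(columns={"mean": "survival_rate", "count": "n"})


by_sex = survival_rate_by(sample, "sex")
by_pclass = survival_rate_by(sample, "pclass")
print(by_sex)
print(by_pclass)
```

Because *survived* is coded as 0/1, its mean within a group equals the survival rate of that group. Continuous variables such as *fare* and *age* could first be binned (for example with `pd.cut`) or compared between survivors and non-survivors with a box plot.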