# __Titanic dataset exploratory analysis__

##### In this project, I want to practice data analysis in Python with the following libraries:
* Pandas
* Seaborn
* Matplotlib.pyplot


##### I don't have specific questions to ask of this data before getting started, so I will explore as many as I can along the way. Some of the analysis may be redundant or of little practical use in a real-world setting, because the main focus of this project is working with data visualization tools and improving my ability to understand and handle data in Python.


### Part #1: Cleansing Data

#### __Let's start by importing libraries__

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### __Next, we want to import our file using Pandas__
* We'll check what we've imported using the ".head()" method

In [None]:
titanic = pd.read_csv('Titanic.csv')

titanic.head(10)

#### __Also, we want to check the types of data__
* The ".info()" method reveals some inconsistencies in this dataset: the non-null counts differ between columns
* The 'Age', 'Cabin', and 'Embarked' columns have NA values

In [None]:
titanic.info()
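To see exactly how many values are missing, we can count them per column instead of reading them off ".info()". The sketch below uses a small made-up frame called `toy` (hypothetical values, not the real Titanic data) so that it runs on its own; with the real data, `titanic` would take its place.

```python
import pandas as pd

# Hypothetical toy frame standing in for Titanic.csv (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Number of missing values in each column
na_counts = toy.isna().sum()

# Share of missing values in each column, in percent
na_share = (toy.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": na_counts, "percent": na_share}))
```

Columns with a non-zero "missing" count are the ones that need attention before plotting.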

#### __We can try dropping all NA values and see what is left__
* We see that we lost ~79.5% of the rows, so dropping every row that contains any NA value is too aggressive and we need to modify our code chunk

In [None]:
titanic.dropna().info()
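The share of lost rows can also be computed directly rather than worked out by hand from the ".info()" output. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical), so the percentage it prints is not the one from the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, "E46", None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

rows_before = len(toy)
# dropna() with no arguments removes every row that has at least one NA value
rows_after = len(toy.dropna())

# Percentage of rows removed
lost_pct = (1 - rows_after / rows_before) * 100
print(f"{rows_before} -> {rows_after} rows, {lost_pct:.1f}% lost")
```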

#### __This time, we will attempt to exclude NA values from the "Age" and the "Embarked" columns__
* We see that we lost ~20% of the rows overall, while the "Cabin" column lost only ~10% of its non-null values

In [None]:
titanic.dropna(subset=['Age','Embarked']).info()
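To compare the loss column by column, we can divide the non-null counts after the drop by the counts before it. The frame below is a made-up stand-in (`toy` and its values are hypothetical), so the numbers only illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", "B20", "E46", None],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Drop only the rows where "Age" or "Embarked" is missing
cleaned = toy.dropna(subset=["Age", "Embarked"])

# count() returns the non-null values per column, so the ratio shows
# how much of each column survived the drop
lost_per_column = ((1 - cleaned.count() / toy.count()) * 100).round(1)
print(lost_per_column)
```

A column that already had many NA values, like "Cabin" here, can lose a smaller share of its non-null values than the frame loses rows.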

#### __Let's create our first bar chart to visualize the difference between the original data and the data after dropping NA values from the "Age" and the "Embarked" columns__
* I couldn't find the proper way to place two bar charts next to each other, so I added labels to each bar
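* Rows lost after dropping NA values, men: ~23% of those who died, ~15% of those who survived
* Rows lost after dropping NA values, women: ~21% of those who died, ~16% of those who survived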
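Assuming the percentages above describe the share of rows lost in each group after dropping NA values, they can be computed with a group-by instead of being read off the bar labels. The sketch uses a made-up frame (`toy` and its values are hypothetical), so its output differs from the figures quoted above.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Survived": ["No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Age": [22.0, None, 30.0, 41.0, 28.0, 35.0, 19.0, 24.0, None, 50.0],
    "Embarked": ["S", "S", "C", "S", None, "Q", "S", "C", "S", "S"],
})

# Group sizes before and after dropping rows with missing "Age" or "Embarked"
before = toy.groupby(["Sex", "Survived"]).size()
after = toy.dropna(subset=["Age", "Embarked"]).groupby(["Sex", "Survived"]).size()

# Keep groups that disappear completely by filling them with 0
after = after.reindex(before.index, fill_value=0)

# Percentage of rows lost in each Sex/Survived group
lost_pct = ((1 - after / before) * 100).round(1)
print(lost_pct)
```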

In [None]:
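# Map the 0/1 flag to readable labels once, on the original frame.
# Plain column assignment avoids the dtype warning that .loc[:, ...] raises
# when strings are written into an integer column.
titanic['Survived'] = titanic['Survived'].map({0: 'No', 1: 'Yes'})

# Take an explicit copy so the cleaned frame is independent of the original
titanic_c = titanic.dropna(subset=['Age', 'Embarked']).copy()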

# Count plot of survival split by sex for the original data
g1 = sns.catplot(x='Survived', data=titanic, kind='count', hue='Sex', legend=True)
# Iterate through every Axes in the grid
for ax in g1.axes.ravel():
    # Label each bar with its count
    for c in ax.containers:
        labels = [f'{v.get_height():.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    # Leave headroom above the bars so the labels are not clipped
    ax.margins(y=0.2)

g1.fig.subplots_adjust(top=0.9)
g1.fig.suptitle('Titanic original data')

# Same count plot for the data with NA values dropped
g2 = sns.catplot(x='Survived', data=titanic_c, kind='count', hue='Sex')
# Iterate through every Axes in the grid
for ax in g2.axes.ravel():
    # Label each bar with its count
    for c in ax.containers:
        labels = [f'{v.get_height():.0f}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    # Leave headroom above the bars so the labels are not clipped
    ax.margins(y=0.2)

g2.fig.subplots_adjust(top=0.9)
g2.fig.suptitle('Titanic "cleared" data')

plt.show()
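One possible way to place the two bar charts next to each other is to create the axes with `plt.subplots` and draw on them with `sns.countplot`, which accepts an `ax` argument. `sns.catplot` is a figure-level function that always creates its own figure, which is why the two charts above end up in separate figures. The sketch below uses a made-up frame (`toy` and its values are hypothetical) so that it runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Survived": ["No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Age": [22.0, None, 30.0, 41.0, 28.0, 35.0, 19.0, 24.0, None, 50.0],
    "Embarked": ["S", "S", "C", "S", None, "Q", "S", "C", "S", "S"],
})
toy_c = toy.dropna(subset=["Age", "Embarked"])

# One row, two columns; sharey keeps both charts on the same scale
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

for ax, data, title in zip(axes, [toy, toy_c], ["Original data", "After dropping NA"]):
    # countplot is axes-level, so it can draw on a specific subplot
    sns.countplot(data=data, x="Survived", hue="Sex",
                  order=["No", "Yes"], hue_order=["male", "female"], ax=ax)
    # Label each bar with its count
    for c in ax.containers:
        ax.bar_label(c)
    ax.set_title(title)
    # Leave headroom above the bars for the labels
    ax.margins(y=0.2)

fig.tight_layout()
```

With the real data, `titanic` and `titanic_c` would replace `toy` and `toy_c`. Fixing `order` and `hue_order` keeps the bars in the same position in both charts, which makes them easier to compare.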