
    
## <center> [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course



# <center> Topic 1. Exploratory data analysis with Pandas
## <center>Practice. Analyzing "Titanic" passengers


In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Graphics in SVG format are sharper and more legible
%config InlineBackend.figure_format = 'svg'
pd.set_option("display.precision", 2)

**Read data into a Pandas DataFrame**

In [2]:
data = pd.read_csv("../../data/titanic_train.csv", index_col="PassengerId")


**First 5 rows**

In [3]:
data.head(5)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
data.describe()

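| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |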


**Let's select the passengers who embarked in Cherbourg (`Embarked == "C"`) and paid more than 200 pounds for their ticket (`Fare > 200`).**

Make sure you understand how this construction actually works.

In [5]:
data[(data["Embarked"] == "C") & (data.Fare > 200)].head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.52 | B58 B60 | C |
| 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.33 | NaN | C |
| 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.52 | B58 B60 | C |
| 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.38 | B57 B59 B63 B66 | C |
| 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5 | C82 | C |
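
To see how the filtering construction works, the sketch below builds it step by step on a small toy frame (three rows with values copied from the outputs above; the variable names are illustrative, not part of the assignment). Each comparison yields a boolean Series, `&` combines them element-wise, and indexing with the result keeps only the rows marked `True`. The parentheses in the one-line version are needed because `&` binds more tightly than `==` and `>`.

```python
import pandas as pd

# Toy data: three rows with values copied from the outputs above.
toy = pd.DataFrame(
    {"Embarked": ["S", "C", "C"], "Fare": [7.25, 71.28, 512.33]},
    index=pd.Index([1, 2, 259], name="PassengerId"),
)

from_cherbourg = toy["Embarked"] == "C"  # boolean Series, one value per row
expensive = toy["Fare"] > 200            # boolean Series, one value per row
mask = from_cherbourg & expensive        # element-wise logical AND

selected = toy[mask]                     # keeps only rows where mask is True
print(selected)
```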


**We can sort these people by Fare in descending order.**

In [None]:
data[(data["Embarked"] == "C") & (data["Fare"] > 200)].sort_values(
    by="Fare", ascending=False
).head()

**Let's create a new feature.**

In [6]:
def age_category(age):
    """
    < 30 -> 1
    >= 30, <55 -> 2
    >= 55 -> 3
    """
    if age < 30:
        return 1
    elif age < 55:
        return 2
    elif age >= 55:
        return 3

In [7]:
age_categories = [age_category(age) for age in data.Age]
data["Age_category"] = age_categories

**Another way is to do it with `apply`.**

In [8]:
data["Age_category"] = data["Age"].apply(age_category)
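
For binning of this kind, `pd.cut` is a vectorized alternative. The sketch below is one possible way to reproduce the same thresholds, shown on made-up ages rather than the Titanic data. It assumes `right=False`, so that each bin includes its left edge and matches `< 30`, `>= 30, < 55`, and `>= 55`. Missing ages stay missing.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 30.0, 54.9, 55.0, np.nan])  # made-up values

# Bins are [-inf, 30), [30, 55), [55, inf) because right=False.
age_cat = pd.cut(
    ages,
    bins=[-np.inf, 30, 55, np.inf],
    right=False,
    labels=[1, 2, 3],
)
print(age_cat.tolist())
```

The result has a categorical dtype; `age_cat.astype("float")` would turn it into plain numbers if that is more convenient.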

**1. How many men/women were there onboard?**
- 577 men and 314 women

In [26]:
data['Sex'].value_counts().get('male'), data['Sex'].value_counts().get('female')

(577, 314)

In [22]:
#Alternative method
male_count=(data['Sex']=='male').sum()
female_count=(data['Sex']=='female').sum()
print(f'There were {male_count} men onboard.')
print(f'There were {female_count} women onboard.')



There were 577 men onboard.
There were 314 women onboard.


**2. Print the distribution of the `Pclass` feature. Then the same, but for men and women separately. How many men from second class were there onboard?**
- 108

In [29]:
pd.crosstab(data['Pclass'],data['Sex'], margins=True)

| Pclass | female | male | All |
|---|---|---|---|
| 1 | 94 | 122 | 216 |
| 2 | 76 | 108 | 184 |
| 3 | 144 | 347 | 491 |
| All | 314 | 577 | 891 |
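
The same counts can also be obtained with `groupby`. The sketch below compares the two approaches on a made-up six-row frame (not the Titanic data): `size()` counts the rows in each (`Pclass`, `Sex`) group, and `unstack` moves `Sex` into the columns.

```python
import pandas as pd

# Made-up toy data, only to compare the two approaches.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3, 3],
        "Sex": ["female", "male", "male", "female", "male", "male"],
    }
)

by_crosstab = pd.crosstab(toy["Pclass"], toy["Sex"])
by_groupby = toy.groupby(["Pclass", "Sex"]).size().unstack(fill_value=0)

print(by_crosstab)
print(by_groupby)
```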


**3. What are the median and standard deviation of `Fare`? Round to two decimals.**
- The median is 14.45 and the standard deviation is 49.69

In [32]:
# Series.std() uses ddof=1 (sample standard deviation), which matches describe()
fare_std = data['Fare'].std()
fare_median = data['Fare'].median()
print(f'The standard deviation of the fare is {fare_std:.2f} and the median of the fare is {fare_median:.2f}.')

The standard deviation of the fare is 49.69 and the median of the fare is 14.45.
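
Note that `np.std` and pandas' `Series.std` use different defaults. NumPy divides by n (`ddof=0`, the population standard deviation), while pandas divides by n - 1 (`ddof=1`, the sample standard deviation), which is also what `describe()` reports. The two can therefore disagree in the second decimal. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values, mean is 5

population_std = np.std(values.to_numpy())          # ddof=0, divides by n
sample_std = values.std()                           # ddof=1, divides by n - 1
numpy_sample_std = np.std(values.to_numpy(), ddof=1)  # NumPy with ddof=1

print(population_std, sample_std, numpy_sample_std)
```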


**4. Is it true that the mean age of survivors is higher than that of passengers who died?**
- No


In [39]:
survived_mean_age = data.loc[data['Survived'] == 1, 'Age'].mean()
died_mean_age = data.loc[data['Survived'] == 0, 'Age'].mean()
print(f'Survived mean age: {survived_mean_age:.2f}, died mean age: {died_mean_age:.2f}')

Survived mean age: 28.34, died mean age: 30.63


**5. Is it true that passengers younger than 30 survived more frequently than those older than 60? What are the shares of survivors among young and old passengers?**

- 40.6% among young and 22.7% among old


In [58]:
young_30_survived=((data['Survived']==1)&(data['Age']<30)).sum()
young_30_survived_died=(data['Age']<30).sum()
young_survival_frequency=(young_30_survived/young_30_survived_died)*100
young_survival_frequency


40.625

In [59]:
old_60_survived=((data['Survived']==1)&(data['Age']>60)).sum()
old_60_survived_died=((data['Age']>60)).sum()
old_survival_frequency=(old_60_survived/old_60_survived_died)*100
old_survival_frequency

22.727272727272727

In [73]:
#Alternative Method
young_survived = data.loc[data["Age"] < 30, "Survived"]
young_survived.mean()




0.40625

In [75]:
old_survived = data.loc[data["Age"] > 60, "Survived"]
old_survived.mean()

0.22727272727272727
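
The `.mean()` shortcut works because `Survived` holds only 0 and 1, so its mean equals the number of ones divided by the number of rows, which is the share of survivors. A small sketch on made-up flags (not the Titanic data):

```python
import pandas as pd

survived = pd.Series([1, 0, 0, 1, 0])  # made-up 0/1 flags

share_via_counts = (survived == 1).sum() / len(survived)  # count of ones / rows
share_via_mean = survived.mean()                          # same quantity

print(share_via_counts, share_via_mean)
```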

**6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?**

- 18.9% among men and 74.2% among women

In [78]:
men_survived = data.loc[data['Sex'] == 'male', 'Survived'].mean()
women_survived = data.loc[data['Sex'] == 'female', 'Survived'].mean()
print(men_survived)
print(women_survived)

0.18890814558058924
0.7420382165605095


**7. What's the most popular first name among male passengers?**
- William


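In [86]:
# Take the first word after the title, e.g. "Braund, Mr. Owen Harris" -> "Owen"
data['first_name'] = [name.split(',')[1].split()[1] for name in data['Name']]

In [89]:
# Restrict to male passengers, as the question asks
data.loc[data['Sex'] == 'male', 'first_name'].value_counts().idxmax()

'William'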
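
The same extraction can be written with pandas' vectorized string methods instead of a list comprehension. The sketch below uses two names copied from the first rows of the dataset; it assumes every name has the form `"Surname, Title First ..."`, which is what the list comprehension above relies on as well.

```python
import pandas as pd

# Two names copied from the first rows shown earlier.
names = pd.Series(["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"])

# Part after the comma -> split on whitespace -> second token (after the title).
first_names = names.str.split(",").str[1].str.split().str[1]
print(first_names.tolist())
```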

**8. How does the average age of men/women depend on `Pclass`? Choose all correct statements:**
- On average, men of the 1st class are older than 40
- Men of all classes are on average older than women of the same class
- On average, passengers of the 1st class are older than those of the 2nd class, who are older than passengers of the 3rd class

In [98]:
data.groupby(['Pclass', 'Sex'])['Age'].aggregate(Average='mean')


| Pclass | Sex | Average |
|---|---|---|
| 1 | female | 34.61 |
| 1 | male | 41.28 |
| 2 | female | 28.72 |
| 2 | male | 30.74 |
| 3 | female | 21.75 |
| 3 | male | 26.51 |
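
To check the first two statements side by side, the grouped means can be reshaped with `unstack`. The sketch below rebuilds the table by hand (the numbers are copied from the output above, not recomputed from the data) and compares the columns; the variable names are illustrative.

```python
import pandas as pd

# Mean ages copied from the output above.
index = pd.MultiIndex.from_product(
    [[1, 2, 3], ["female", "male"]], names=["Pclass", "Sex"]
)
mean_age = pd.Series(
    [34.61, 41.28, 28.72, 30.74, 21.75, 26.51], index=index, name="Average"
)

wide = mean_age.unstack("Sex")  # rows: Pclass, columns: female / male
men_older_in_every_class = (wide["male"] > wide["female"]).all()
first_class_men_over_40 = wide.loc[1, "male"] > 40

print(wide)
print(men_older_in_every_class, first_class_men_over_40)
```

The third statement concerns all passengers of a class regardless of sex, so it would need `data.groupby('Pclass')['Age'].mean()` on the full dataset rather than this table.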
