
In [28]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('../..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

In [27]:
#dependencies
import pandas as pd
titanic = pd.read_csv('../../data/titanic.csv')

#columns from NewVar script
titanic['child'] = (titanic['age'] < 18).astype(int)
titanic['embarked_city'] = titanic['embarked'].map({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
titanic['surname'] = titanic['name'].str.split(',',expand=True)[0]


toggle_code(title='dependencies')

# 4. Descriptive Statistics

There are two broad types of data in our dataframes at the moment that we want to look at - numerical data (i.e. ints and floats) and text data (i.e. strings, which pandas slightly confusingly labels as objects).

In this section, we will explore some basic univariate descriptive statistics.
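
To make the two broad types concrete, here is a minimal sketch that splits the columns of a dataframe into numerical and non-numerical ones with `select_dtypes()`. The `toy` dataframe is hypothetical stand-in data, not the real titanic file. Note that text columns show up as `object` in the outputs in this notebook, but some newer pandas configurations may report a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical toy data standing in for a few titanic columns
toy = pd.DataFrame({
    'age': [22.0, 35.0, 58.0],
    'pclass': [3, 1, 1],
    'surname': ['Connolly', 'Ward', 'Cardeza'],
})

# Columns holding ints or floats
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Everything else (here, the text column)
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```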

## 4.1 Describing Numerical Data
Let's start with the numerical data, because that's the easiest to work with. Pandas even has a built-in method called `.describe()`, which provides descriptive statistics for all the numerical columns in a dataframe.

In [3]:
# Describe the titanic dataframe
titanic.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,child
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,1309.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,0.117647
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,0.322313
min,1.0,0.0,0.1667,0.0,0.0,0.0,0.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,0.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,0.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,0.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,1.0


As it happens, the descriptive statistics output for our titanic dataset is also a dataframe!

Make sure you understand what each row means in this table:
* **count** - the number (count) of entries in the given column.
* **mean** - the average (arithmetic mean) data value in the given column.
* **std** - the standard deviation (spread) of values in the given column.
* **min** - the smallest value in the given column.
* **25%** - the value of the data at the lower quartile (i.e. after the first 25% of data, ordered from smallest to largest).
* **50%** - the middle value of the data (aka the median), half the values are larger than this value, and half smaller.
* **75%** - the value of the data at the upper quartile (i.e. after the first 75% of data, ordered from smallest to largest).
* **max** - the maximum data value recorded.
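
Because the result of `.describe()` is itself a dataframe, individual statistics can be picked out of it with `.loc`. The sketch below illustrates this on a hypothetical `toy` dataframe (assumed data, not the titanic file).

```python
import pandas as pd

# Hypothetical toy data; one age is missing on purpose
toy = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, None],
    'fare': [10.0, 20.0, 30.0, 40.0],
})

summary = toy.describe()

# Rows are the statistics, columns are the original column names
age_count = summary.loc['count', 'age']    # missing values are not counted
age_mean = summary.loc['mean', 'age']
fare_median = summary.loc['50%', 'fare']

print(age_count, age_mean, fare_median)
```

Here the `age` count is 3 rather than 4, which is the same signal of missing data that appears in the titanic output above.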

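We can get a sense of the data from these descriptive statistics. For instance, the mean of `survived` is about 0.38, so roughly 38% of the passengers in this dataset survived. The median `fare` (about 14.45) is well below the mean (about 33.30), which suggests that a small number of very expensive tickets pull the mean upwards. The `age` count of 1046 out of 1309 shows that age is missing for 263 passengers.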

## 4.2 Descriptive Statistics for Numerical Data

`.describe()` is great to get an overview, but what if we just wanted particular statistics and not the whole lot?

Well, pandas will let you run a range of statistics individually! Some examples are given in the code below. 

In [4]:
# count() can be defined for all datatypes, so all columns are computed. Note which columns have some missing data.
titanic.count()

pclass           1309
survived         1309
name             1309
sex              1309
age              1046
sibsp            1309
parch            1309
ticket           1309
fare             1308
cabin             295
embarked         1307
child            1309
embarked_city    1307
surname          1309
dtype: int64

In [5]:
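# mean() is only defined for numeric columns.
# numeric_only=True skips the text columns, which newer pandas versions would otherwise reject with an error
titanic.mean(numeric_only=True)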

pclass       2.294882
survived     0.381971
age         29.881135
sibsp        0.498854
parch        0.385027
fare        33.295479
child        0.117647
dtype: float64

In [6]:
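# std() is also only defined for numeric columns.
# numeric_only=True skips the text columns, which newer pandas versions would otherwise reject with an error
titanic.std(numeric_only=True)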

pclass       0.837836
survived     0.486055
age         14.413500
sibsp        1.041658
parch        0.865560
fare        51.758668
child        0.322313
dtype: float64

In [7]:
# min() has a definition for numeric and text data.
# The minimum value of a text field is the text which is first alphabetically.
titanic.min() 

pclass                        1
survived                      0
name        Abbing, Mr. Anthony
sex                      female
age                      0.1667
sibsp                         0
parch                         0
ticket                   110152
fare                          0
child                         0
surname                  Abbing
dtype: object

In [8]:
# max() has a definition for numeric and text data.
# The maximum value of a text field is the text which is last alphabetically.
titanic.max() 

pclass                                3
survived                              1
name        van Melkebeke, Mr. Philemon
sex                                male
age                                  80
sibsp                                 8
parch                                 9
ticket                        WE/P 5735
fare                            512.329
child                                 1
surname                   van Melkebeke
dtype: object

In [9]:
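# quantile() allows you to specify quantiles, such as 0.25 (lower quartile), 0.5 (median), and 0.75 (upper quartile)
# for convenience median() also exists
# numeric_only=True skips the text columns, which newer pandas versions would otherwise reject with an error
titanic.quantile(0.25, numeric_only=True) # 25% - lower quartile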

pclass       2.0000
survived     0.0000
age         21.0000
sibsp        0.0000
parch        0.0000
fare         7.8958
child        0.0000
Name: 0.25, dtype: float64
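
As a small illustration of the quantile comment above, `quantile()` also accepts a list of quantiles, and `median()` gives the same answer as `quantile(0.5)`. The `fares` series is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy fares
fares = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Passing a list returns one value per requested quantile
quartiles = fares.quantile([0.25, 0.5, 0.75])
print(quartiles)

# median() is a shortcut for the 0.5 quantile
print(fares.median() == quartiles.loc[0.5])
```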

In [10]:
# sum() works to concatenate text, producing a curious output.
titanic.sum()

pclass                                                   3004
survived                                                  500
name        Allen, Miss. Elisabeth WaltonAllison, Master. ...
sex         femalemalefemalemalefemalemalefemalemalefemale...
age                                                   31255.7
sibsp                                                     653
parch                                                     504
ticket      2416011378111378111378111378119952135021120501...
fare                                                  43550.5
child                                                     154
surname     AllenAllisonAllisonAllisonAllisonAndersonAndre...
dtype: object

In [11]:
# Hopefully though it is obvious that these methods could be called on selected columns, too.
titanic['fare'].sum()

43550.4869
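
If several particular statistics are wanted for selected columns, one option is the `.agg()` method, which takes a list of statistic names. This is a minimal sketch on a hypothetical `toy` dataframe, not the only way to do it.

```python
import pandas as pd

# Hypothetical toy data standing in for two titanic columns
toy = pd.DataFrame({
    'age': [22.0, 35.0, 58.0],
    'fare': [7.75, 512.3292, 263.0],
})

# Several statistics for one column: the result is a Series indexed by statistic name
fare_stats = toy['fare'].agg(['min', 'max', 'mean'])
print(fare_stats)

# Several statistics for several columns: the result is a DataFrame
both_stats = toy[['age', 'fare']].agg(['min', 'max'])
print(both_stats)
```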

In [12]:
# As it happens, Python has built-in sum, min and max functions which do a similar job.
# However, the pandas sum copes better with missing data:
sum(titanic['fare'])

nan

In [13]:
# Try this instead
sum(titanic[titanic['fare'].notnull()]['fare'])

43550.4869000002

The above cell introduces a new filter condition for working with missing data: `notnull()`. It returns `True` for rows that have a valid value and `False` for rows where the value is missing. The opposite of `notnull()` is `isnull()`.
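
The difference between the built-in `sum` and the pandas `sum`, and the effect of the `notnull()` filter, can be reproduced on a tiny example. The `fares` series below is hypothetical toy data with one missing value.

```python
import math
import pandas as pd

# Hypothetical toy fares with one missing value
fares = pd.Series([7.75, None, 30.0])

builtin_total = sum(fares)    # the missing value turns the whole total into nan
pandas_total = fares.sum()    # pandas skips missing values by default

mask = fares.notnull()        # True where a value is present
filtered_total = sum(fares[mask])

n_missing = fares.isnull().sum()  # True counts as 1, so this counts the missing rows

print(math.isnan(builtin_total), pandas_total, filtered_total, n_missing)
```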

This method of selection is similar to making conditional statements with object methods that return a Boolean, e.g.
```python
if string_variable.islower():
    # Do something
```
The same principle applies in other contexts. For instance, the `Series` object has a large number of string methods collected under the `.str` accessor. Calling `titanic['name'].str.contains('Mr.', regex=False)` returns `True` or `False` for each row, depending on whether the name contains the substring 'Mr.'.
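
The `regex=False` argument matters here. As a sketch on a few hypothetical names: when the pattern is treated as a regular expression, the `.` matches any character, so 'Mr.' would also match the 'Mrs' in 'Mrs.'.

```python
import pandas as pd

# Hypothetical toy names in the same "Surname, Title. Forename" layout
names = pd.Series([
    'Kelly, Mr. James',
    'Cardeza, Mrs. James',
    'Ward, Miss. Anna',
])

# Literal substring match: only the first name contains 'Mr.'
literal_match = names.str.contains('Mr.', regex=False)

# Regular expression match: '.' matches any character, so 'Mrs' matches too
regex_match = names.str.contains('Mr.', regex=True)

print(literal_match.tolist())
print(regex_match.tolist())
```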

In [14]:
# Select passengers with title Mr. and get mean fare
titanic[titanic['name'].str.contains('Mr.', regex=False)]['fare'].mean()

24.796184788359792

In [15]:
# Select passengers with title Mrs. and get mean fare
titanic[titanic['name'].str.contains('Mrs.', regex=False)]['fare'].mean()

50.5607233502538

## 4.3 Describing Text (or Categorical) Data

We can still use `.describe()` to look at text data; however, we need to specify that we're looking at object (text) data types.

Really, the descriptive statistics below are for categorical data. They don't work very well if every value in a field is a different piece of text!

In [16]:
# In the describe parameters we're only choosing to include object datatypes, given by 'O'.
# The 'O' is in a list, because we could include other data types in the list if we wanted to.
titanic.describe(include=['O'])

Unnamed: 0,name,sex,ticket,cabin,embarked,embarked_city,surname
count,1309,1309,1309,295,1307,1307,1309
unique,1307,2,929,186,3,3,875
top,"Connolly, Miss. Kate",male,CA. 2343,C23 C25 C27,S,Southampton,Andersson
freq,2,843,11,6,914,914,11


When you describe object columns, you get different summary statistics from those for numerical data:

* **count** - as before, a count of the values present in each column.
* **unique** - a count of the number of unique values in each column.
* **top** - the most common value (aka the mode).
* **freq** - the frequency of occurrence of the most common value.
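
These four statistics are easy to verify by hand on a small example. The sketch below uses a hypothetical `toy` dataframe whose column is explicitly stored with the object dtype, so that `include=['O']` selects it.

```python
import pandas as pd

# Hypothetical toy data, explicitly stored as object dtype; one value is missing
toy = pd.DataFrame({'embarked': ['S', 'S', 'C', None]}, dtype=object)

summary = toy.describe(include=['O'])
print(summary)

count = summary.loc['count', 'embarked']    # non-missing values
unique = summary.loc['unique', 'embarked']  # distinct non-missing values
top = summary.loc['top', 'embarked']        # most common value
freq = summary.loc['freq', 'embarked']      # how often the most common value occurs
```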

Let's dig a bit deeper into some of these columns.

In [17]:
# Interestingly there are 2 James Kellys, however they don't appear to be duplicates.
titanic[titanic['name'] == 'Kelly, Mr. James']

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname
924,3,0,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0,Queenstown,Kelly
925,3,0,"Kelly, Mr. James",male,44.0,0,0,363592,8.05,,S,0,Southampton,Kelly


In [18]:
# One way we could check for other name duplicates is by taking the mode.
# As mode can be non-unique it returns a series
# Looks like Kate Connolly is another possible duplicate.
titanic['name'].mode()

0    Connolly, Miss. Kate
1        Kelly, Mr. James
dtype: object

In [19]:
# Again, these appear to be different people!
titanic[titanic['name'] == 'Connolly, Miss. Kate']

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname
725,3,1,"Connolly, Miss. Kate",female,22.0,0,0,370373,7.75,,Q,0,Queenstown,Connolly
726,3,0,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q,0,Queenstown,Connolly
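
Another way to look for repeated names, sketched here on hypothetical toy data, is the `duplicated()` method. With `keep=False` it marks every row whose value occurs more than once, and calling it on the whole dataframe checks whether entire rows are repeated.

```python
import pandas as pd

# Hypothetical toy data: the same name appears twice with different ages
toy = pd.DataFrame({
    'name': ['Kelly, Mr. James', 'Ward, Miss. Anna', 'Kelly, Mr. James'],
    'age': [34.5, 35.0, 44.0],
})

# Rows whose name occurs more than once
repeated_names = toy[toy['name'].duplicated(keep=False)]
print(repeated_names)

# Rows that are identical in every column (none here, since the ages differ)
exact_duplicates = toy[toy.duplicated(keep=False)]
print(len(exact_duplicates))
```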


In [20]:
# the unique() function will give us all the unique objects in a column.
titanic['embarked_city'].unique()

array(['Southampton', 'Cherbourg', nan, 'Queenstown'], dtype=object)

In [21]:
# the value_counts() function gives a count for each unique value in a chosen column.
titanic['embarked_city'].value_counts()
# Most people embarked in Southampton.

Southampton    914
Cherbourg      270
Queenstown     123
Name: embarked_city, dtype: int64
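
Notice that `unique()` listed `nan` but `value_counts()` did not count it. As a minimal sketch on a hypothetical `cities` series, passing `dropna=False` makes `value_counts()` include the missing values as well.

```python
import pandas as pd

# Hypothetical toy data with one missing city
cities = pd.Series(['Southampton', 'Cherbourg', None, 'Southampton'], dtype=object)

# Default behaviour: missing values are left out of the counts
counts = cities.value_counts()
print(counts)

# dropna=False adds a row for the missing values
counts_with_missing = cities.value_counts(dropna=False)
print(counts_with_missing)
```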

## 4.4 Sorting Data

Sorting data is straightforward in pandas. A simple sort on one column uses the dataframe method `.sort_values()`:
```python
titanic.sort_values('age')
```
The default is to sort in ascending order, from smallest to largest value. Set the ascending parameter to `False` for a descending sort:
```python
titanic.sort_values('age', ascending = False)
```
This approach sorts and returns the entire DataFrame. If you want to sort a single column on its own values, `.sort_values()` works similarly:
```python
titanic['age'].sort_values()
```
More complicated sorting behaviours can be managed by passing a list, in the order you would like the sort to occur:
```python
titanic.sort_values(['pclass','age'], ascending = [True, False])
```
In the above code I sort first by 'pclass' and then by 'age'. In addition, I pass a list to `ascending`, indicating that 'pclass' is to be sorted in ascending order and 'age' in descending order.
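
The effect of a two-column sort is easiest to see on a handful of rows. This sketch uses a hypothetical `toy` dataframe and also shows that `.sort_values()` returns a new dataframe, leaving the original order untouched.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'pclass': [3, 1, 1, 3],
    'age': [22.0, 35.0, 58.0, 30.0],
})

# pclass ascending, then age descending within each class
sorted_toy = toy.sort_values(['pclass', 'age'], ascending=[True, False])
print(sorted_toy)

# The original dataframe keeps its order; the sorted copy carries the old index labels
print(toy.index.tolist())
print(sorted_toy.index.tolist())
```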

In [22]:
# sort fare descending
titanic.sort_values('fare', ascending = False).head(8)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname
183,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,0,Cherbourg,Lesurer
302,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,0,Cherbourg,Ward
49,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,0,Cherbourg,Cardeza
50,1,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C,0,Cherbourg,Cardeza
113,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S,0,Southampton,Fortune
114,1,0,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S,0,Southampton,Fortune
115,1,0,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S,0,Southampton,Fortune
116,1,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60.0,1,4,19950,263.0,C23 C25 C27,S,0,Southampton,Fortune


In [23]:
# sort by sex ascending, then age descending
titanic.sort_values(['sex','age'], ascending = [True, False]).head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname
61,1,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S,0,Southampton,Cavendish
78,1,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C,0,Cherbourg,Compton
83,1,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabe...",female,64.0,1,1,112901,26.55,B26,S,0,Southampton,Crosby
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,0,Southampton,Andrews
286,1,0,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63.0,1,0,PC 17483,221.7792,C55 C57,S,0,Southampton,Straus


In [24]:
# sort by sex descending, then age descending
titanic.sort_values(['sex','age'], ascending = False).head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarked_city,surname
14,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S,0,Southampton,Barkworth
1235,3,0,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S,0,Southampton,Svensson
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0,Cherbourg,Artagaveytia
135,1,0,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,0,Cherbourg,Goldschmidt
727,3,0,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q,0,Queenstown,Connors


# Exercise 5

1. How old is the oldest passenger in the dataset?
2. How many men and women are in the dataset?
    * Check the pd.Series.value_counts() docstring and figure out how to get proportions of men and women.


In [25]:
## Question 1

#print("The oldest passenger is {} years old.\n".format(titanic['age'].max()))

## Question 2

#print(titanic['sex'].value_counts(),'\n')

## Question 2b

#print(titanic['sex'].value_counts(normalize = True),'\n')


toggle_code()

# Homework

Please complete the **DescStatHomework.ipynb** notebook found in the **homework_tasks** folder.