# 1. Introduction

Welcome to the second part of this course. Now that you (presumably) have a solid grasp of numerical computing in NumPy, we move on to data management in Python. Data is most commonly stored in **tabular** format (i.e. in a table), as in a relational database. The most widely used library for in-memory, database-like data handling is **Pandas**. Pandas is well suited for:

* **Tabular** data with heterogeneously-typed columns, such as in an SQL database or Excel spreadsheet.
* Ordered and unordered **time-series** data.
* Arbitrary **matrix** data with row and column labels.

Some of the interesting features include:

* Handling missing data fluently
* Size mutability
* Easy-to-use *data alignment*
* Label-based *slicing*, *fancy indexing* and *subsetting*
* Intuitive *merging* and *joining* of datasets by label
* Hierarchical labelling of axes
* Decent IO tools for importing from an array of different formats
* Flexible reshaping and *pivoting* of tables

In [1]:
import pandas as pd

**Pandas** is broken down into two primary classes:

1. **Series**: think of this as a one-dimensional, ordered array of any type with a labelled index; a generalized *NumPy array*.
2. **DataFrame**: think of this as a 2-D heterogeneous table with a *Series* for each column.

## Series

In [2]:
counts = pd.Series([644, 1276, 3554, 154])
counts

0     644
1    1276
2    3554
3     154
dtype: int64

If we don't specify an index, a default sequence of integers (a `RangeIndex`, analogous to `np.arange()`) is assigned as the index. A NumPy array holds the values of the *Series*, while the index is another *Pandas* object:

In [3]:
counts.values

array([ 644, 1276, 3554,  154])

In [4]:
counts.index

RangeIndex(start=0, stop=4, step=1)

We can assign meaningful labels to the series, as:

In [5]:
foods = pd.Series([644, 1276, 3554, 154], index=['Oranges', 'Apples', 'Melons', 'Pumpkins'])
foods

Oranges      644
Apples      1276
Melons      3554
Pumpkins     154
dtype: int64

A useful way to think of a *Series* is as a set of **key-value** pairs, so it can be built directly from a dictionary:

In [6]:
food_d = {
    'Oranges': 644,
    'Apples': 1276,
    'Melons': 3554,
    'Pumpkins': 154
}

pd.Series(food_d)

Apples      1276
Melons      3554
Oranges      644
Pumpkins     154
dtype: int64

This can also be achieved via separate lists:

In [7]:
labels = ['Oranges', 'Apples', 'Melons', 'Pumpkins']
counts = [644, 1276, 3554, 154]
pd.Series(dict(zip(labels,counts)))

Apples      1276
Melons      3554
Oranges      644
Pumpkins     154
dtype: int64
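The dictionary analogy extends to lookup and arithmetic. The sketch below uses two hypothetical Series (`shop_a` and `shop_b`, with made-up values) to show label-based access and how arithmetic aligns on index labels rather than on position:

```python
import pandas as pd

# Hypothetical stock counts for two shops (illustrative values only)
shop_a = pd.Series({'Oranges': 644, 'Apples': 1276, 'Melons': 3554})
shop_b = pd.Series({'Apples': 100, 'Melons': 200, 'Pumpkins': 154})

# Label-based lookup, just like a dictionary
apples = shop_a['Apples']

# Addition matches values by label; a label missing from either
# side produces NaN in the result
total = shop_a + shop_b
print(total)
```

Labels present in only one Series (`Oranges`, `Pumpkins`) come out as NaN, which is the *data alignment* feature listed in the introduction.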

## DataFrame

One of the really nice aspects of DataFrames, particularly in a Jupyter notebook, is the HTML/JavaScript that is generated automatically when displaying tables:

In [8]:
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                                'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,1130
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


For most datasets it is impractical to display every value. The `head()` method shows only the first $n$ rows (5 by default):

In [9]:
data.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433


We can extract the column names as:

In [10]:
data.columns

Index(['patient', 'phylum', 'value'], dtype='object')

### Reading and Writing Files

Pandas provides a number of powerful functions for reading and writing files. For instance, if our data is in an Excel file:

In [11]:
titanic = pd.read_excel("titanic.xlsx")
titanic.head()

Unnamed: 0,PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
0,1,22.0,,Southampton,7.25,"Braund, Mr. Owen Harris",0,3rd class,male,1,0,A/5 21171
1,2,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599
2,3,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282
3,4,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803
4,5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450


We can also read from CSV or any other delimited flat-file format. The delimiter is specified with the `sep` argument in a call to `read_csv` or `read_table`.
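As a minimal sketch of the `sep` argument, the example below parses a small semicolon-delimited text held in memory. The contents of `raw` are invented for illustration, and `io.StringIO` stands in for a real file path:

```python
import io
import pandas as pd

# A tiny semicolon-delimited "file" (hypothetical contents)
raw = "patient;phylum;value\n1;Firmicutes;632\n1;Proteobacteria;1638\n"

# sep tells read_csv which character separates the fields
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df)
```

For a file on disk, the first argument would simply be the path to that file.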

Checking the size of the dataset is a priority:

In [12]:
titanic.shape

(1309, 12)

So is determining the number of missing values in each column:

In [13]:
titanic.isnull().sum()

PassengerId         0
Age               263
Cabin            1014
Port Embarked       2
Fare                1
Name                0
n_parents           0
Pclass              0
Sex                 0
n_siblings          0
Survived            0
Ticket              0
dtype: int64
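Once the missing values have been counted, two common follow-ups are dropping incomplete rows or filling the gaps. The sketch below uses a small hypothetical frame `toy` rather than the Titanic data, and filling with the median is only one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (values are illustrative)
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                    'Fare': [7.25, 71.28, np.nan, 8.05]})

n_missing = toy.isnull().sum()   # missing values per column

# Keep only the rows with no missing values at all
complete = toy.dropna()

# Or fill the gaps in one column, here with that column's median
filled = toy.fillna({'Age': toy['Age'].median()})
```

Note that `dropna()` and `fillna()` return new objects and leave `toy` unchanged.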

We can select a column using square-bracket notation `[]` or dot notation:

In [14]:
titanic.Age            # dot notation (not displayed: only the last expression is shown)
titanic['Age'].head()  # square-bracket notation

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
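The two notations are not fully interchangeable. A sketch on a hypothetical two-row frame `toy`: attribute access only works for column names that are valid Python identifiers, so a name containing a space needs brackets, and a list of names inside the brackets returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical frame mirroring two of the Titanic column names
toy = pd.DataFrame({'Age': [22.0, 38.0],
                    'Port Embarked': ['Southampton', 'Cherbourg']})

age_dot = toy.Age              # attribute access
age_bracket = toy['Age']       # equivalent bracket access

# 'Port Embarked' contains a space, so only brackets work here
port = toy['Port Embarked']

# A list of column names selects several columns as a DataFrame
subset = toy[['Age', 'Port Embarked']]
```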

As with NumPy arrays, we can slice a Series or a DataFrame using `start:stop:step`:

In [15]:
titanic.Age[:5]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [16]:
titanic[2:10:2]

Unnamed: 0,PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
2,3,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282
4,5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450
6,7,54.0,E46,Southampton,51.8625,"McCarthy, Mr. Timothy J",0,1st class,male,0,0,17463
8,9,27.0,,Southampton,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,3rd class,female,0,1,347742


Given that each row of this dataset describes one passenger, it makes sense to set `PassengerId` as the index:

In [17]:
titanic = titanic.set_index("PassengerId")
titanic.head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
1,22.0,,Southampton,7.25,"Braund, Mr. Owen Harris",0,3rd class,male,1,0,A/5 21171
2,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599
3,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282
4,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803
5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450


### Querying, Selection

We can select a passenger by row label using `.loc[]`. `.loc[]` accepts both a row and a column selector, so it can return a single value, a Series or a DataFrame.

In [18]:
titanic.loc[3]

Age                                  26
Cabin                               NaN
Port Embarked               Southampton
Fare                              7.925
Name             Heikkinen, Miss. Laina
n_parents                             0
Pclass                        3rd class
Sex                              female
n_siblings                            0
Survived                              1
Ticket                 STON/O2. 3101282
Name: 3, dtype: object

Or a single value, by also specifying a column label:

In [19]:
titanic.loc[3, 'Age']

26.0

We can quickly subset the dataset using boolean conditions.

These can be written directly with comparison operators, such as:

    [pandas.Series] < $value
    
Or method-chained using the predefined comparison methods:

    - pandas.Series.lt (less than)
    - pandas.Series.gt (greater than)
    - pandas.Series.eq (equal to)
    - pandas.Series.ne (not equal to)
    - pandas.Series.le (less than or equal to)
    - pandas.Series.ge (greater than or equal to)

In [22]:
titanic.Age.lt(30).values

array([ True, False,  True, ..., False, False, False])

In [20]:
titanic[titanic.Age > 30].head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
2,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599
4,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803
5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450
7,54.0,E46,Southampton,51.8625,"McCarthy, Mr. Timothy J",0,1st class,male,0,0,17463
12,58.0,C103,Southampton,26.55,"Bonnell, Miss. Elizabeth",0,1st class,female,0,1,113783


With `.loc[]` we can also slice by label, selecting every column between two named columns (both endpoints included):

In [23]:
titanic.loc[:3, "Cabin":"Fare"]

PassengerId,Cabin,Port Embarked,Fare
1,,Southampton,7.25
2,C85,Cherbourg,71.2833
3,,Southampton,7.925


Alternatively, we can index by integer *position* using `.iloc[]`.

In [24]:
titanic.iloc[1, 2]

'Cherbourg'

In [25]:
titanic.iloc[1]

Age                                                             38
Cabin                                                          C85
Port Embarked                                            Cherbourg
Fare                                                       71.2833
Name             Cumings, Mrs. John Bradley (Florence Briggs Th...
n_parents                                                        0
Pclass                                                   1st class
Sex                                                         female
n_siblings                                                       1
Survived                                                         1
Ticket                                                    PC 17599
Name: 2, dtype: object
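Because the index here starts at 1, labels and positions differ by one, which is easy to trip over. A minimal sketch on a hypothetical three-row frame `toy` contrasts the two accessors, including their different slicing rules:

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1, like PassengerId
toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0]},
                   index=pd.Index([1, 2, 3], name='PassengerId'))

by_label = toy.loc[2, 'Age']     # row labelled 2  -> 38.0
by_position = toy.iloc[2, 0]     # third row       -> 26.0

# Label slices include both endpoints; position slices exclude the stop
label_slice = toy.loc[1:2]       # labels 1 and 2  -> two rows
pos_slice = toy.iloc[1:2]        # position 1 only -> one row
```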

We can use the `isin()` method to search if a value or values exist within a Series:

In [26]:
titanic['Port Embarked'].isin(['Cherbourg']).head()

PassengerId
1    False
2     True
3    False
4    False
5    False
Name: Port Embarked, dtype: bool

The `where()` method keeps the values that satisfy a condition and replaces the rest with NaN, so the result retains the shape of the original DataFrame. This is crucial when alignment is required:

In [27]:
import numpy as np
x = pd.DataFrame(np.random.rand(5,7))
x.where(x < 0.5)

Unnamed: 0,0,1,2,3,4,5,6
0,0.489699,0.276745,,0.078895,0.302194,0.129713,
1,,,,,0.212922,,
2,,0.022741,,0.437566,,0.28884,
3,,0.291442,0.229133,0.395242,,,
4,,0.024332,0.332566,0.469705,,,0.318429


Alternatively, instead of just getting NaNs, we could place a value or an array into the 'other' argument of `where`.

In [28]:
x.where(x < 0.5, other=-x)

Unnamed: 0,0,1,2,3,4,5,6
0,0.489699,0.276745,-0.579479,0.078895,0.302194,0.129713,-0.854664
1,-0.631276,-0.804616,-0.591556,-0.871399,0.212922,-0.673123,-0.985896
2,-0.518755,0.022741,-0.8168,0.437566,-0.590415,0.28884,-0.589929
3,-0.650238,0.291442,0.229133,0.395242,-0.57214,-0.539013,-0.667999
4,-0.615861,0.024332,0.332566,0.469705,-0.760968,-0.814236,0.318429


In [29]:
x.where(x > 0.5, other=lambda y: y**3-1)

Unnamed: 0,0,1,2,3,4,5,6
0,-0.882568,-0.978805,0.579479,-0.999509,-0.972403,-0.997818,0.854664
1,0.631276,0.804616,0.591556,0.871399,-0.990347,0.673123,0.985896
2,0.518755,-0.999988,0.8168,-0.916222,0.590415,-0.975903,0.589929
3,0.650238,-0.975245,-0.98797,-0.938257,0.57214,0.539013,0.667999
4,0.615861,-0.999986,-0.963218,-0.896373,0.760968,0.814236,-0.967712


Selection using `query()` feels an awful lot like SQL. Local Python variables can be referenced inside the query string by prefixing them with `@`:

In [30]:
n_parents = 2
titanic.query("(Age < 25) & ((Pclass == '1st class') | (n_parents == @n_parents))").head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
28,19.0,C23 C25 C27,Southampton,263.0,"Fortune, Mr. Charles Alexander",2,1st class,male,3,0,19950
44,3.0,,Cherbourg,41.5792,"Laroche, Miss. Simonne Marie Anne Andree",2,2nd class,female,1,1,SC/Paris 2123
59,5.0,,Southampton,27.75,"West, Miss. Constance Mirium",2,2nd class,female,1,1,C.A. 34651
60,11.0,,Southampton,46.9,"Goodwin, Master. William Frederick",2,3rd class,male,5,0,CA 2144
64,4.0,,Southampton,27.9,"Skoog, Master. Harald",2,3rd class,male,3,0,347088
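Column names that contain spaces, such as `Port Embarked`, cannot be written bare in a query string. The sketch below assumes a reasonably recent Pandas release (backtick quoting was added in version 0.25) and uses a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical frame with a column name containing a space
toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                    'Port Embarked': ['Southampton', 'Cherbourg', 'Southampton']})

port = 'Southampton'

# Backticks quote the awkward column name; @port refers to the local variable
result = toy.query("(Age < 30) & (`Port Embarked` == @port)")
```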


#### Aggregation

The toys of NumPy are back in a similar form: max, min, mean, sum etc.

In [31]:
titanic.sum()

Age                                                     31255.7
Fare                                                    43550.5
Name          Braund, Mr. Owen HarrisCumings, Mrs. John Brad...
n_parents                                                   504
Pclass        3rd class1st class3rd class1st class3rd class3...
Sex           malefemalefemalefemalemalemalemalemalefemalefe...
n_siblings                                                  653
Survived                                                    494
Ticket        A/5 21171PC 17599STON/O2. 31012821138033734503...
dtype: object

In [32]:
titanic.Age.mean()

29.881137667304014

In [33]:
titanic.describe()

Unnamed: 0,Age,Fare,n_parents,n_siblings,Survived
count,1046.0,1308.0,1309.0,1309.0,1309.0
mean,29.881138,33.295479,0.385027,0.498854,0.377387
std,14.413493,51.758668,0.86556,1.041658,0.484918
min,0.17,0.0,0.0,0.0,0.0
25%,21.0,7.8958,0.0,0.0,0.0
50%,28.0,14.4542,0.0,0.0,0.0
75%,39.0,31.275,0.0,1.0,1.0
max,80.0,512.3292,9.0,8.0,1.0


We could check the correlation between two factors.

In [34]:
titanic.Fare.corr(titanic.Age)

0.17873985599964118

Or generate the full correlation matrix, whose diagonal (each variable's correlation with itself) equals 1.

In [35]:
titanic.corr()

Unnamed: 0,Age,Fare,n_parents,n_siblings,Survived
Age,1.0,0.17874,-0.150917,-0.243699,-0.053695
Fare,0.17874,1.0,0.221539,0.160238,0.233622
n_parents,-0.150917,0.221539,1.0,0.373587,0.108919
n_siblings,-0.243699,0.160238,0.373587,1.0,0.00237
Survived,-0.053695,0.233622,0.108919,0.00237,1.0


In [36]:
titanic.agg(['min','max'])

Unnamed: 0,Age,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket
min,0.17,0.0,"Abbing, Mr. Anthony",0,1st class,female,0,0,110152
max,80.0,512.3292,"van Melkebeke, Mr. Philemon",9,3rd class,male,8,1,WE/P 5735


In [37]:
titanic.agg({'Fare': ['mean','std'], 'Age': ['min', 'max']})

Unnamed: 0,Fare,Age
max,,80.0
mean,33.295479,
min,,0.17
std,51.758668,


We can also apply a function that Pandas does not provide, whether from NumPy or one of our own:

In [38]:
titanic[['Age','Fare','n_parents','n_siblings']].dropna().apply(np.median)

Age           28.00
Fare          15.75
n_parents      0.00
n_siblings     0.00
dtype: float64

In [39]:
def age_fare_ratio(x):
    if (x.Fare > 0.):
        return x.Age / x.Fare
    else:
        return 0.

titanic['Age_Fare_rat'] = titanic.apply(age_fare_ratio, axis=1)
titanic.head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1,22.0,,Southampton,7.25,"Braund, Mr. Owen Harris",0,3rd class,male,1,0,A/5 21171,3.034483
2,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599,0.533084
3,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282,3.280757
4,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803,0.659134
5,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450,4.347826
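Row-wise `apply` calls the Python function once per row, which can be slow on large tables. As a sketch of an alternative (not necessarily how the course intends it), the same ratio can be computed on whole columns and the zero-fare case handled with `where`. The frame `toy` is hypothetical:

```python
import pandas as pd

# Hypothetical data including a zero fare (values are illustrative)
toy = pd.DataFrame({'Age': [22.0, 38.0, 30.0],
                    'Fare': [7.25, 0.0, 60.0]})

def age_fare_ratio(x):
    # Same logic as the row-wise function above
    return x.Age / x.Fare if x.Fare > 0. else 0.

by_apply = toy.apply(age_fare_ratio, axis=1)

# Column-wise version: divide everything, then keep the result only
# where Fare > 0 and use 0 elsewhere
by_columns = (toy.Age / toy.Fare).where(toy.Fare > 0, 0.)
```

Both versions give the same values here; the column-wise form avoids the per-row Python call.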


One of the most powerful forms of aggregation is **groupby**. It splits the rows into groups according to the values of one or more columns and applies an aggregation function to each group separately, allowing us to control for different factors:

In [40]:
titanic.groupby(['Sex',"Pclass"]).agg(['mean', 'std'])

Sex,Pclass,Age mean,Age std,Fare mean,Fare std,n_parents mean,n_parents std,n_siblings mean,n_siblings std,Survived mean,Survived std,Age_Fare_rat mean,Age_Fare_rat std
female,1st class,37.037594,14.27246,109.412385,82.885854,0.472222,0.774998,0.555556,0.666667,0.979167,0.143325,0.537688,0.479183
female,2nd class,27.499223,12.911747,23.234827,11.239817,0.650943,0.862361,0.5,0.636209,0.943396,0.232182,1.508919,1.040279
female,3rd class,22.185329,12.205254,15.32425,11.786512,0.731481,1.262014,0.791667,1.446126,0.666667,0.4725,2.036906,1.404201
male,1st class,41.029272,14.578529,69.888385,74.079427,0.27933,0.653571,0.340782,0.541597,0.251397,0.435033,0.903206,0.792811
male,2nd class,30.81538,13.9774,19.904946,14.775149,0.192982,0.488886,0.327485,0.55126,0.099415,0.300097,1.920682,1.301327
male,3rd class,25.962264,11.682415,12.415462,11.261638,0.255578,0.788377,0.470588,1.218775,0.095335,0.293975,2.883671,1.709359
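To make the split-apply-combine idea concrete, here is a small sketch on a hypothetical five-row frame `toy`. Selecting a single column after `groupby` restricts the aggregation to that column, which may also be necessary in recent Pandas versions, where aggregating text columns with `mean` is likely to raise an error:

```python
import pandas as pd

# Hypothetical passengers (values are illustrative)
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Pclass': ['1st class', '3rd class', '1st class', '3rd class', '3rd class'],
    'Survived': [1, 1, 0, 0, 1],
})

# Mean survival for every (Sex, Pclass) combination
rate = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()

# unstack pivots one index level into columns for a compact table
table = rate.unstack('Pclass')
```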


#### Sorting, Ranking

In [41]:
titanic.sort_values(by='Age', ascending=False).head(3)

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
631,80.0,A23,Southampton,30.0,"Barkworth, Mr. Algernon Henry Wilson",0,1st class,male,0,1,27042,2.666667
988,76.0,C46,Southampton,78.85,"Cavendish, Mrs. Tyrell William (Julia Florence...",0,1st class,female,1,1,19877,0.963855
852,74.0,,Southampton,7.775,"Svensson, Mr. Johan",0,3rd class,male,0,0,347060,9.517685


In [42]:
titanic.sort_index(ascending=False).head(3)

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1309,,,Cherbourg,22.3583,"Peter, Master. Michael J",1,3rd class,male,1,0,2668,
1308,,,Southampton,8.05,"Ware, Mr. Frederick",0,3rd class,male,0,0,359309,
1307,38.5,,Southampton,7.25,"Saether, Mr. Simon Sivertsen",0,3rd class,male,0,0,SOTON/O.Q. 3101262,5.310345


In [43]:
titanic.sort_values(by=['n_parents','Fare'], ascending=[False,True]).head()

PassengerId,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1234,,,Southampton,69.55,"Sage, Mr. John George",9,3rd class,male,1,0,CA. 2343,
1257,,,Southampton,69.55,"Sage, Mrs. John (Annie Bullen)",9,3rd class,female,1,1,CA. 2343,
679,43.0,,Southampton,46.9,"Goodwin, Mrs. Frederick (Augusta Tyler)",6,3rd class,female,1,0,CA 2144,0.916844
1031,40.0,,Southampton,46.9,"Goodwin, Mr. Charles Frederick",6,3rd class,male,1,0,CA 2144,0.852878
886,39.0,,Queenstown,29.125,"Rice, Mrs. William (Margaret Norton)",5,3rd class,female,0,0,382652,1.339056


We can `rank()` each value relative to the others if desired:

In [44]:
titanic.Fare.rank().head()

PassengerId
1     108.5
2    1155.5
3     349.0
4    1091.5
5     391.5
Name: Fare, dtype: float64
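The fractional ranks above come from ties: by default, equal values share the average of the positions they occupy. A short sketch with a hypothetical Series `fares` shows this default alongside the `method='min'` alternative:

```python
import pandas as pd

# Hypothetical fares with one tie (values are illustrative)
fares = pd.Series([7.25, 8.05, 7.25, 53.1])

# Default method='average': the two 7.25 fares occupy positions 1 and 2,
# so both receive rank 1.5
avg_rank = fares.rank()

# method='min' gives tied values the lowest position instead
min_rank = fares.rank(method='min')
```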

### Counts

We can count how many times each unique value occurs in a column with `value_counts()`, which is incredibly useful!

In [45]:
titanic.Survived.value_counts()

0    815
1    494
Name: Survived, dtype: int64

In [46]:
titanic.Sex.value_counts()

male      843
female    466
Name: Sex, dtype: int64
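Two optional arguments are often handy here. The sketch below, on a hypothetical Series `sex` that includes one missing entry, shows `normalize=True` for proportions and `dropna=False` to count missing values as well; `nunique()` is included for contrast, since it returns only the number of distinct values:

```python
import pandas as pd

# Hypothetical column with one missing entry
sex = pd.Series(['male', 'female', 'male', 'male', None])

counts = sex.value_counts()                    # occurrences per value, NaN ignored
shares = sex.value_counts(normalize=True)      # proportions instead of counts
with_missing = sex.value_counts(dropna=False)  # include the missing entry
n_distinct = sex.nunique()                     # number of distinct non-missing values
```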

### Handling Complex String columns

We may wish to break down the `Name` column into surname, title and forename.

In [47]:
titanic.Name.head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

In [48]:
complex_names = titanic.Name.str.extract(r"(?P<Surname>[a-zA-Z]+),\s(?P<Title>[a-zA-Z]+).\s(?P<Forename>[a-zA-Z]+)",
                         expand=True)
complex_names.head()

PassengerId,Surname,Title,Forename
1,Braund,Mr,Owen
2,Cumings,Mrs,John
3,Heikkinen,Miss,Laina
4,Futrelle,Mrs,Jacques
5,Allen,Mr,William


In [49]:
# or alternatively, split the string on a common character, here a space
titanic.Name.str.split(" ", expand=True).head()

PassengerId,0,1,2,3,4,5,6,7,8,9,10,11,12,13
1,"Braund,",Mr.,Owen,Harris,,,,,,,,,,
2,"Cumings,",Mrs.,John,Bradley,(Florence,Briggs,Thayer),,,,,,,
3,"Heikkinen,",Miss.,Laina,,,,,,,,,,,
4,"Futrelle,",Mrs.,Jacques,Heath,(Lily,May,Peel),,,,,,,
5,"Allen,",Mr.,William,Henry,,,,,,,,,,


In [50]:
# make a new titanic with names appended!
titanic = pd.concat([ complex_names, titanic ], axis=1)
titanic.head()

PassengerId,Surname,Title,Forename,Age,Cabin,Port Embarked,Fare,Name,n_parents,Pclass,Sex,n_siblings,Survived,Ticket,Age_Fare_rat
1,Braund,Mr,Owen,22.0,,Southampton,7.25,"Braund, Mr. Owen Harris",0,3rd class,male,1,0,A/5 21171,3.034483
2,Cumings,Mrs,John,38.0,C85,Cherbourg,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1st class,female,1,1,PC 17599,0.533084
3,Heikkinen,Miss,Laina,26.0,,Southampton,7.925,"Heikkinen, Miss. Laina",0,3rd class,female,0,1,STON/O2. 3101282,3.280757
4,Futrelle,Mrs,Jacques,35.0,C123,Southampton,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1st class,female,1,1,113803,0.659134
5,Allen,Mr,William,35.0,,Southampton,8.05,"Allen, Mr. William Henry",0,3rd class,male,0,0,373450,4.347826


# Tasks

The titanic dataset is a rich dataset to begin using Pandas with, as it is not too large or cumbersome and has plenty of categorical variables to work with.

### Task 1.

Select all of the passengers who are under the age of 30 and calculate the mean Fare that they paid, by gender.

In [51]:
# your answer here
titanic.query("Age < 30").groupby("Sex").Fare.mean()

Sex
female    37.062521
male      22.954652
Name: Fare, dtype: float64

### Task 2.

Calculate the number of people that embarked from each port; by sex, survival and class.

In [59]:
# your answer here
titanic.groupby(["Sex","Survived","Pclass","Port Embarked"]).count().Forename

Sex     Survived  Pclass     Port Embarked
female  0         1st class  Cherbourg          1
                             Southampton        2
                  2nd class  Southampton        5
                  3rd class  Cherbourg          7
                             Queenstown         9
                             Southampton       54
        1         1st class  Cherbourg         69
                             Queenstown         2
                             Southampton       66
                  2nd class  Cherbourg         11
                             Queenstown         2
                             Southampton       77
                  3rd class  Cherbourg         22
                             Queenstown        47
                             Southampton       70
male    0         1st class  Cherbourg         53
                             Queenstown         1
                             Southampton       80
                  2nd class  Cherbourg         15
       

### Task 3. 

The **Gini coefficient** is a measure of dispersion, usually used to represent the distribution of income or wealth between individuals. It always lies between 0 and 1.

A Gini coefficient of 0 expresses perfect equality, whereas a coefficient of 1 expresses maximal inequality among values.

For a sample sorted in non-decreasing order ($y_i \leq y_{i+1}$), $i=1,2,\dots,n$, the Gini coefficient is calculated as:

$$
G=\frac{1}{n}\left(n+1-2\frac{\sum_{i=1}^n\left(n+1-i\right)y_i}{\sum_{i=1}^n y_i}\right)
$$

where $y_i$ refers to a continuous value representing wealth, and $i$ refers to the index.
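As a quick check of the formula, consider a hypothetical sorted sample $y = (1, 2, 3, 4)$ (values chosen purely for illustration), so that $n = 4$:

$$
\sum_{i=1}^{4}\left(n+1-i\right)y_i = 4\cdot 1 + 3\cdot 2 + 2\cdot 3 + 1\cdot 4 = 20, \qquad \sum_{i=1}^{4} y_i = 10
$$

$$
G=\frac{1}{4}\left(4+1-2\cdot\frac{20}{10}\right) = 0.25
$$

A sample in which every value is equal gives $G = 0$ by the same calculation.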

Write a function `gini` that calculates the Gini coefficient for an array, and use it to measure the inequality of the Titanic's Fare distribution.

In [88]:
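def gini(x):
    # x must be sorted in non-decreasing order and contain no missing values
    n = len(x)
    i = np.arange(1, n + 1)  # 1-based index, as in the formula above
    return (1. / n) * (n + 1 - 2 * ((n + 1 - i) * x).sum() / x.sum())

In [89]:
fares = titanic.Fare.dropna().sort_values().reset_index(drop=True)
round(gini(fares), 4)

0.5793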