# Pandas crash course: things you can do with a DataFrame

This magic command reduces the noise in exception tracebacks.

In [1]:
%xmode Plain

Exception reporting mode: Plain


## Summary

These notes demonstrate what you can do with the two key Pandas data types: `Series` and `DataFrame`.  For a fuller treatment, see Chapter 3 of [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas.

## Warming up

Before we look at Pandas, we'll quickly look at two data types that are built into Python: `list` and `dict`.

### `list`

A Python list is an ordered collection of items of any type.  Items in a list don't have to be the same type, but they usually are.  (More about types later.)

We can define a list directly in our code.  For instance, here's a list of the names of some Oxford colleges:

In [2]:
colleges = ["St Anne's", "St Antony's", "St Benet's", "St Catherine's", "St Cross", "St Edmund", "St Hilda's", "St Hugh's", "St John's", "St Peter's"]
colleges

["St Anne's",
 "St Antony's",
 "St Benet's",
 "St Catherine's",
 'St Cross',
 'St Edmund',
 "St Hilda's",
 "St Hugh's",
 "St John's",
 "St Peter's"]

Or we can get a list as the result of calling a function that returns one.  For instance, here's a list of all the files in one of the directories in the OpenPrescribing codebase (on my laptop):

In [3]:
import os
filenames = os.listdir('/Users/inglesp/work/ebmdatalab/openprescribing/deploy')
filenames

['clean_up_bq_test_data.sh',
 'crontab-openprescribing',
 'fetch_and_import_ncso_concessions.sh',
 'fetch_drug_tariff.sh',
 'run_pipeline_e2e_tests.sh']

Once we have a list, there are several things we can do with it.

We can find out how many items are in it:

In [4]:
len(colleges)

10

We can find out whether a given item is in the list:

In [5]:
"St Anne's" in colleges

True

In [6]:
"St Stephen's" in colleges

False

We can access the item at a given position (called an "index") in the list:

In [7]:
colleges[2]

"St Benet's"

Note that the first index is zero:

In [8]:
colleges[0]

"St Anne's"

If we try to access an item by an index that is too big, Python raises an `IndexError`:

In [9]:
colleges[10]

IndexError: list index out of range

We can access items by counting back from the end of a list:

In [10]:
colleges[-3]

"St Hugh's"

We can also slice a list, to get a subset of its items.  `l[lower:upper]` returns a list of items in `l` with indexes between `lower` (inclusive) and `upper` (exclusive).  For instance:

In [11]:
colleges[2:5]

["St Benet's", "St Catherine's", 'St Cross']

If we want the first part of a list, `lower` is 0, and we can omit it altogether:

In [12]:
colleges[:5]

["St Anne's", "St Antony's", "St Benet's", "St Catherine's", 'St Cross']

Similarly, if we want the last part of the list, we can omit `upper`:

In [13]:
colleges[5:]

['St Edmund', "St Hilda's", "St Hugh's", "St John's", "St Peter's"]

It's common to need to iterate over a list, and do something with each element.  For instance, here, we're displaying how many characters are in each college's name:

In [14]:
for c in colleges[:5]:
    print('There are {} characters in "{}"'.format(len(c), c))

There are 9 characters in "St Anne's"
There are 11 characters in "St Antony's"
There are 10 characters in "St Benet's"
There are 14 characters in "St Catherine's"
There are 8 characters in "St Cross"


We can modify a list by updating individual items:

In [15]:
colleges[1] = "St Anthony's"
colleges

["St Anne's",
 "St Anthony's",
 "St Benet's",
 "St Catherine's",
 'St Cross',
 'St Edmund',
 "St Hilda's",
 "St Hugh's",
 "St John's",
 "St Peter's"]

We can also add or remove items from a list.  Take a look at the [Python documentation](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists) for details of how to do this, and more.
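As a minimal sketch of adding and removing items, here are four of the list methods described in that documentation.  The short `halls` list is a hypothetical example, used so that the snippet stands alone and leaves `colleges` untouched.

```python
# A hypothetical, shorter list so that the example stands alone.
halls = ["St Anne's", "St Cross", "St Hilda's"]

halls.append("St Hugh's")       # add one item to the end
halls.insert(1, "St Benet's")   # add an item at position 1
halls.remove("St Cross")        # remove the first item equal to this value
last = halls.pop()              # remove and return the last item

print(halls)
print(last)
```

Note that `remove` raises a `ValueError` if the value is not in the list, so it is worth checking with `in` first if you are not sure.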

### `dict`

A Python dictionary is another kind of collection of items.  Rather than looking up items by their position in the collection (as with a list), we look them up by an associated key.  Historically, dictionaries were not ordered.  Since Python 3.6 they preserve insertion order in CPython, and the language guarantees this from Python 3.7.  If you need an ordered dictionary with an older version of Python, you can use `collections.OrderedDict`.

We can define a dictionary directly in our code.  For instance, here's a dictionary of names of things in the BNF, keyed by BNF code:


In [16]:
presentations = {
    '010101000BBABA0': 'Langdales_Cinnamon Tab',
    '010101000BBADA0': 'Mylanta 11_Tab',
    '010101000BBAEA0': 'Mylanta 11_Liq',
    '010101000BBAFA0': 'Rennie Plus_Tab',
    '010101000BBAIA0': 'Sab Simplex_Susp',
}
presentations

{'010101000BBABA0': 'Langdales_Cinnamon Tab',
 '010101000BBADA0': 'Mylanta 11_Tab',
 '010101000BBAEA0': 'Mylanta 11_Liq',
 '010101000BBAFA0': 'Rennie Plus_Tab',
 '010101000BBAIA0': 'Sab Simplex_Susp'}

Given a dictionary, we can look up items by key:

In [17]:
presentations['010101000BBABA0']

'Langdales_Cinnamon Tab'

If a key is not present, Python raises a `KeyError`:

In [18]:
presentations['23990001111']

KeyError: '23990001111'

Dictionaries are mutable, and we can add, remove, and update items:

In [19]:
presentations['010101000BBAJA0'] = 'Boots_Indigest Mix'
del presentations['010101000BBABA0']
presentations['010101000BBADA0'] = 'Mylanta 11 Tablet'

presentations

{'010101000BBADA0': 'Mylanta 11 Tablet',
 '010101000BBAEA0': 'Mylanta 11_Liq',
 '010101000BBAFA0': 'Rennie Plus_Tab',
 '010101000BBAIA0': 'Sab Simplex_Susp',
 '010101000BBAJA0': 'Boots_Indigest Mix'}

We can iterate over the items in a dictionary:

In [20]:
for bnf_code, name in presentations.items():
    print('{} has BNF code {}'.format(name, bnf_code))

Mylanta 11 Tablet has BNF code 010101000BBADA0
Mylanta 11_Liq has BNF code 010101000BBAEA0
Rennie Plus_Tab has BNF code 010101000BBAFA0
Sab Simplex_Susp has BNF code 010101000BBAIA0
Boots_Indigest Mix has BNF code 010101000BBAJA0


(Remember that before Python 3.6, dictionaries are not ordered, so the order of iteration will not be predictable.)
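To avoid a `KeyError` when a key might be missing, we can test for the key with `in`, or use the dictionary's `.get()` method, which returns a default instead of raising.  Here is a small sketch; the `codes` dictionary is a cut-down, hypothetical copy of `presentations`.

```python
# A cut-down, hypothetical copy of the presentations dictionary.
codes = {
    '010101000BBADA0': 'Mylanta 11_Tab',
    '010101000BBAEA0': 'Mylanta 11_Liq',
}

has_code = '010101000BBADA0' in codes           # membership is tested against keys
missing = codes.get('23990001111')              # None, rather than a KeyError
fallback = codes.get('23990001111', 'unknown')  # or a default of our choosing

print(has_code, missing, fallback)
```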

### Types and `type()`

Often, we will call a Python function and will want to use the return value of the function.  What we can do with a value depends on what type of thing it is.  There is a built-in function, `type`, that tells us what type of thing something is:

In [21]:
type(123)

int

In [22]:
type(123.0)

float

In [23]:
type('123')

str

In [24]:
type([123])

list

In [26]:
type({123: '123'})

dict

## `Series`

We're now ready to start talking about Pandas.

:panda_face:

In [27]:
import pandas as pd

We'll begin by talking about `Series`, which behaves a bit like a cross between `list` and `dict`.

Here's a dictionary whose keys are practice identifiers, and whose values are quantities of something (I've lost my original query) prescribed by each practice in a given month:

In [28]:
quantity_by_practice = {
    'A81002': 70,
    'A81004': 112,
    'A81005': 28,
    'A81006': 56,
    'A81007': 56,
    'A81013': 56,
    'A81016': 28,
    'A81020': 56,
    'A81023': 28,
    'A81029': 112,
}

We can create a `Series` from this dictionary.  (There are several other ways to create a `Series`, but most often you'll use them as a component of a `DataFrame`.)

In [29]:
s = pd.Series(quantity_by_practice)
s

A81002     70
A81004    112
A81005     28
A81006     56
A81007     56
A81013     56
A81016     28
A81020     56
A81023     28
A81029    112
dtype: int64

Let's check what kind of thing it is:

In [30]:
type(s)

pandas.core.series.Series

Pandas has a useful way of showing us the first part of a `Series`, so that we can get a flavour for what it contains.  (This is particularly useful for large datasets.)

In [31]:
s.head()

A81002     70
A81004    112
A81005     28
A81006     56
A81007     56
dtype: int64

The `.dtype` attribute tells us what kind of thing the `Series` contains.  (Note that it's a numpy data type, which isn't the same as a Python type.)

In [32]:
s.dtype

dtype('int64')
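To see the difference between a numpy data type and a Python type, we can pull a single item out of a `Series` and inspect it.  This is only a sketch, using a hypothetical two-item `Series` called `s_int`.

```python
import numpy as np
import pandas as pd

# A hypothetical two-item Series of integers.
s_int = pd.Series({'A81002': 70, 'A81004': 112})

item = s_int.iloc[0]          # the first item, looked up by position
print(type(item))             # a numpy integer type, not Python's int

is_numpy_int = isinstance(item, np.integer)
is_python_int = isinstance(item, int)

as_python = int(item)         # convert to a plain Python int when needed
as_floats = s_int.astype(float)   # convert the whole Series to another dtype
print(as_floats.dtype)
```

The conversion with `int()` is occasionally needed when passing values to code that insists on built-in Python types.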

### Indexing and slicing (and `.loc` and `.iloc`)

We said that `Series` behaves a bit like `list`.  `Series` actually wraps a numpy `array`, which is like an efficient implementation of `list`.  We can access the underlying `array`:

In [33]:
s.values

array([ 70, 112,  28,  56,  56,  56,  28,  56,  28, 112])

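And as with `list`, we can access individual items by position.  (This relies on Pandas falling back to a positional lookup when the keys are not integers.  Recent versions of Pandas deprecate that fallback in favour of `.iloc`, which we'll meet shortly.)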

In [34]:
s[0]

70

In [35]:
s[10]

IndexError: index out of bounds

But a `Series` also has keys, which, as with `dict`, we can use to access items:

In [36]:
s['A81004']

112

In [37]:
s['Y05364']

KeyError: 'Y05364'

To save a few keystrokes, Pandas allows us to access items as if they were attributes using the dot notation:

In [38]:
s.A81004

112

Question: What problems might this cause?
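One possible answer, as a sketch: a key can clash with the name of an existing attribute or method of `Series`, in which case the dot notation quietly returns the attribute rather than our item.  The `Series` below is hypothetical and chosen to provoke the clash.

```python
import pandas as pd

# A hypothetical Series whose keys clash with Series attributes.
s2 = pd.Series({'size': 1, 'count': 2, 'A81004': 112})

by_key = s2['size']     # the item stored under the key 'size'
by_attr = s2.size       # the .size attribute: the number of items

print(by_key, by_attr)
print(s2.A81004)        # no clash here, so we get the item
print(s2.count)         # a bound method, not the item stored under 'count'
```

Keys that are not valid Python identifiers (for instance, ones containing spaces) can't be reached with the dot notation at all, which is another reason to prefer square brackets.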

Using the square bracket notation to access items in a `Series` by position and by key might cause problems if the keys of the series are integers.

Here's an example:

In [43]:
s1 = pd.Series({1: 1, 3: 9, 5: 25, 7: 49})
s1

1     1
3     9
5    25
7    49
dtype: int64

In [44]:
s1[3]

9

Question: in the line above, is `3` a position or a key?

To resolve the ambiguity, a `Series` has a pair of attributes to access items by key (`.loc`) or by position (`.iloc`).  The "i" in "iloc" is for "integer": the Pandas documentation describes `.iloc` as integer-location based indexing.

Here are some examples:

In [46]:
s1.loc[3]

9

In [47]:
s1.iloc[3]

49

In [48]:
s1.loc[2]

KeyError: 'the label [2] is not in the [index]'

In [49]:
s1.iloc[5]

IndexError: single positional indexer is out-of-bounds

And here are `.loc` and `.iloc` with our original `Series`:

In [50]:
s.loc['A81004']

112

In [51]:
s.loc[1]

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>

In [52]:
s.iloc[1]

112

In [53]:
s.iloc['A81004']

TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [A81004] of <class 'str'>

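As with `list`, we can slice a `Series`.  Note that a slice by key includes both endpoints, whereas a slice by position excludes the upper bound: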

In [54]:
s['A81005':'A81013']

A81005    28
A81006    56
A81007    56
A81013    56
dtype: int64

In [55]:
s.loc['A81005':'A81013']

A81005    28
A81006    56
A81007    56
A81013    56
dtype: int64

In [56]:
s.iloc[2:6]

A81005    28
A81006    56
A81007    56
A81013    56
dtype: int64

When the keys are integers, slicing without `.loc` or `.iloc` is done by position, not by key:

In [57]:
s1

1     1
3     9
5    25
7    49
dtype: int64

In [58]:
s1[1:3]

3     9
5    25
dtype: int64

In [59]:
s1.loc[1:3]

1    1
3    9
dtype: int64

In [60]:
s1.iloc[1:3]

3     9
5    25
dtype: int64

We can also index a `Series` by a `list`.

Question: what does this do?

In [61]:
s[['A81005', 'A81013']]

A81005    28
A81013    56
dtype: int64

In [63]:
type(s[['A81005', 'A81013']])

pandas.core.series.Series

Another way to index a `Series` by a `list` is to use a list of boolean values.

Question: what does this do?

In [64]:
s[[True, False, True, False, True, False, True, False, True, False]]

A81002    70
A81005    28
A81007    56
A81016    28
A81023    28
dtype: int64

In [65]:
s1

1     1
3     9
5    25
7    49
dtype: int64

In [66]:
s1[[True, True, False, False]]

1    1
3    9
dtype: int64

### Universal functions ("ufuncs")

Here's our `Series`:

In [67]:
s

A81002     70
A81004    112
A81005     28
A81006     56
A81007     56
A81013     56
A81016     28
A81020     56
A81023     28
A81029    112
dtype: int64

Here's some magic: we can perform an operation on every item in the `Series` at once:

In [68]:
s / 28

A81002    2.5
A81004    4.0
A81005    1.0
A81006    2.0
A81007    2.0
A81013    2.0
A81016    1.0
A81020    2.0
A81023    1.0
A81029    4.0
dtype: float64

In [69]:
s + 10

A81002     80
A81004    122
A81005     38
A81006     66
A81007     66
A81013     66
A81016     38
A81020     66
A81023     38
A81029    122
dtype: int64

In [70]:
s * s

A81002     4900
A81004    12544
A81005      784
A81006     3136
A81007     3136
A81013     3136
A81016      784
A81020     3136
A81023      784
A81029    12544
dtype: int64

In [71]:
-s

A81002    -70
A81004   -112
A81005    -28
A81006    -56
A81007    -56
A81013    -56
A81016    -28
A81020    -56
A81023    -28
A81029   -112
dtype: int64

In [72]:
s > 60

A81002     True
A81004     True
A81005    False
A81006    False
A81007    False
A81013    False
A81016    False
A81020    False
A81023    False
A81029     True
dtype: bool

Functions or operators that operate on a whole `Series` in one go are called "universal functions", and they come from numpy.  Read more about them [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html).
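As a small sketch of what this means in practice, we can pass a `Series` straight to a numpy function such as `np.sqrt` and get a `Series` back, with the keys preserved.  When an operation involves two `Series`, Pandas lines the items up by key rather than by position.  All the names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# A hypothetical Series of quantities.
quantities = pd.Series({'A81002': 70, 'A81004': 112, 'A81005': 28})

roots = np.sqrt(quantities)   # a numpy ufunc applied to every item
packs = quantities // 28      # integer division of every item

# Two Series with only one key ('y') in common.
a = pd.Series({'x': 1, 'y': 2})
b = pd.Series({'y': 10, 'z': 20})
total = a + b                 # items are matched by key; unmatched keys give NaN

print(roots)
print(packs)
print(total)
```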

### Masking

We saw how we could use a `list` of boolean values to filter a `Series`.  We can use the same principle to filter a `Series` by a condition.

In [73]:
s[s > 60]

A81002     70
A81004    112
A81029    112
dtype: int64

In [74]:
s[s == 70]

A81002    70
dtype: int64

In [75]:
s[s < 30]

A81005    28
A81016    28
A81023    28
dtype: int64

This is called "masking".  Again, it comes from numpy, and you can read more [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html).
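Conditions can be combined, too.  The sketch below uses a hypothetical four-item `Series`; note that we have to use `&`, `|` and `~` rather than Python's `and`, `or` and `not`, and that each condition needs its own parentheses.

```python
import pandas as pd

# A hypothetical Series of quantities.
quantities = pd.Series({'A81002': 70, 'A81004': 112, 'A81005': 28, 'A81006': 56})

# Both conditions must hold.
mid_range = quantities[(quantities > 30) & (quantities < 100)]

# Either condition may hold.
extremes = quantities[(quantities < 30) | (quantities > 100)]

# Invert a condition.
not_56 = quantities[~(quantities == 56)]

print(mid_range)
print(extremes)
print(not_56)
```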

## `DataFrame`

Now we can talk about Pandas' main data type: `DataFrame`.  A `DataFrame` behaves a bit like a `list` of `list`s, or a `dict` of `dict`s.

Here we have several `dict`s with data about the same ten practices.  We can create a `DataFrame` to combine this data.  Each row corresponds to a practice, and each column to the type of data.

In [76]:
names = {
    'A81002': 'Queens Park Medical Centre',
    'A81004': 'Bluebell Medical Centre',
    'A81005': 'Springwood Surgery',
    'A81006': 'Tennant Street Medical Practice',
    'A81007': 'Bankhouse Surgery',
    'A81013': 'Brotton Surgery',
    'A81016': 'Park Surgery',
    'A81020': 'Martonside Medical Centre',
    'A81023': 'The Endeavour Practice',
    'A81029': 'Prospect Surgery',
}

quantities = {
    'A81002': 70,
    'A81004': 112,
    'A81005': 28,
    'A81006': 56,
    'A81007': 56,
    'A81013': 56,
    'A81016': 28,
    'A81020': 56,
    'A81023': 28,
    'A81029': 112,
}

costs = {
    'A81002': 15,
    'A81004': 56.96,
    'A81005': 14.24,
    'A81006': 28.48,
    'A81007': 28.48,
    'A81013': 28.47,
    'A81016': 14.24,
    'A81020': 28.48,
    'A81023': 14.24,
    'A81029': 56.96,
}

df = pd.DataFrame({'name': names, 'quantity': quantities, 'cost': costs}, columns=['name', 'quantity', 'cost'])

df

|  | name | quantity | cost |
| --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 |
| A81004 | Bluebell Medical Centre | 112 | 56.96 |
| A81005 | Springwood Surgery | 28 | 14.24 |
| A81006 | Tennant Street Medical Practice | 56 | 28.48 |
| A81007 | Bankhouse Surgery | 56 | 28.48 |
| A81013 | Brotton Surgery | 56 | 28.47 |
| A81016 | Park Surgery | 28 | 14.24 |
| A81020 | Martonside Medical Centre | 56 | 28.48 |
| A81023 | The Endeavour Practice | 28 | 14.24 |
| A81029 | Prospect Surgery | 112 | 56.96 |


Again, we can check what type of thing it is:

In [77]:
type(df)

pandas.core.frame.DataFrame

And we can see just the head:

In [78]:
df.head()

|  | name | quantity | cost |
| --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 |
| A81004 | Bluebell Medical Centre | 112 | 56.96 |
| A81005 | Springwood Surgery | 28 | 14.24 |
| A81006 | Tennant Street Medical Practice | 56 | 28.48 |
| A81007 | Bankhouse Surgery | 56 | 28.48 |


The `.shape` attribute gives us some useful information about how many rows and columns the `DataFrame` has:

In [79]:
df.shape

(10, 3)

And `.dtypes` tells us the `dtype` of each column:

In [80]:
df.dtypes

name         object
quantity      int64
cost        float64
dtype: object

### Indexing and slicing (and `.loc` and `.iloc`)

Like `Series`, `DataFrame` is a wrapper around a numpy object.  Specifically, it's a wrapper around an `array` of `array`s.  We can access this with `.values`:

In [81]:
df.values

array([['Queens Park Medical Centre', 70, 15.0],
       ['Bluebell Medical Centre', 112, 56.96],
       ['Springwood Surgery', 28, 14.24],
       ['Tennant Street Medical Practice', 56, 28.48],
       ['Bankhouse Surgery', 56, 28.48],
       ['Brotton Surgery', 56, 28.47],
       ['Park Surgery', 28, 14.24],
       ['Martonside Medical Centre', 56, 28.48],
       ['The Endeavour Practice', 28, 14.24],
       ['Prospect Surgery', 112, 56.96]], dtype=object)

Now you might think that, because of this, you can access the first row by using square brackets and index `0`.  But:

In [82]:
df[0]

KeyError: 0

Instead, the square brackets let us access a column:

In [83]:
df['name']

A81002         Queens Park Medical Centre
A81004            Bluebell Medical Centre
A81005                 Springwood Surgery
A81006    Tennant Street Medical Practice
A81007                  Bankhouse Surgery
A81013                    Brotton Surgery
A81016                       Park Surgery
A81020          Martonside Medical Centre
A81023             The Endeavour Practice
A81029                   Prospect Surgery
Name: name, dtype: object

What kind of thing is a column?

In [84]:
type(df['name'])

pandas.core.series.Series

In [85]:
df['name']['A81002']

'Queens Park Medical Centre'

In [86]:
df['name'][0]

'Queens Park Medical Centre'

We can also access columns by attribute lookup (but again, this is generally not a good idea):

In [87]:
df.name

A81002         Queens Park Medical Centre
A81004            Bluebell Medical Centre
A81005                 Springwood Surgery
A81006    Tennant Street Medical Practice
A81007                  Bankhouse Surgery
A81013                    Brotton Surgery
A81016                       Park Surgery
A81020          Martonside Medical Centre
A81023             The Endeavour Practice
A81029                   Prospect Surgery
Name: name, dtype: object

We can look up multiple columns at once:

In [88]:
df[['name', 'quantity']]

|  | name | quantity |
| --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 |
| A81004 | Bluebell Medical Centre | 112 |
| A81005 | Springwood Surgery | 28 |
| A81006 | Tennant Street Medical Practice | 56 |
| A81007 | Bankhouse Surgery | 56 |
| A81013 | Brotton Surgery | 56 |
| A81016 | Park Surgery | 28 |
| A81020 | Martonside Medical Centre | 56 |
| A81023 | The Endeavour Practice | 28 |
| A81029 | Prospect Surgery | 112 |


In [89]:
type(df[['name', 'quantity']])

pandas.core.frame.DataFrame

If we want to access a row by key, we use `.loc`:

In [90]:
df.loc['A81002']

name        Queens Park Medical Centre
quantity                            70
cost                                15
Name: A81002, dtype: object

In [91]:
type(df.loc['A81002'])

pandas.core.series.Series

We can use a slice with `.loc`:

In [92]:
df.loc['A81005':'A81013']

|  | name | quantity | cost |
| --- | --- | --- | --- |
| A81005 | Springwood Surgery | 28 | 14.24 |
| A81006 | Tennant Street Medical Practice | 56 | 28.48 |
| A81007 | Bankhouse Surgery | 56 | 28.48 |
| A81013 | Brotton Surgery | 56 | 28.47 |


And we can use `.iloc` to access rows by position:

In [93]:
df.iloc[0]

name        Queens Park Medical Centre
quantity                            70
cost                                15
Name: A81002, dtype: object

If we want a slice of rows from a `DataFrame`, we don't need to use `.loc` or `.iloc` (but `.iloc` is more explicit):

In [94]:
df[0:5]

|  | name | quantity | cost |
| --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 |
| A81004 | Bluebell Medical Centre | 112 | 56.96 |
| A81005 | Springwood Surgery | 28 | 14.24 |
| A81006 | Tennant Street Medical Practice | 56 | 28.48 |
| A81007 | Bankhouse Surgery | 56 | 28.48 |


As with `Series`, we can filter a `DataFrame` by indexing with a `list` of boolean values:

In [95]:
df[[True, False, True, False, True, False, True, False, True, False]]

|  | name | quantity | cost |
| --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 |
| A81005 | Springwood Surgery | 28 | 14.24 |
| A81007 | Bankhouse Surgery | 56 | 28.48 |
| A81016 | Park Surgery | 28 | 14.24 |
| A81023 | The Endeavour Practice | 28 | 14.24 |
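Both `.loc` and `.iloc` also accept a row selector and a column selector together, separated by a comma, which lets us pick out a single cell or a rectangular block in one step.  Here's a sketch using a hypothetical three-row `DataFrame` called `practices`, built so that the snippet stands alone.

```python
import pandas as pd

# A hypothetical three-row DataFrame, keyed by practice identifier.
practices = pd.DataFrame(
    {
        'name': ['Queens Park Medical Centre', 'Bluebell Medical Centre', 'Springwood Surgery'],
        'quantity': [70, 112, 28],
        'cost': [15.0, 56.96, 14.24],
    },
    index=['A81002', 'A81004', 'A81005'],
)

# A single cell, by row key and column name.
one_value = practices.loc['A81004', 'cost']

# A block: a slice of rows (both ends included) and a list of columns.
block = practices.loc['A81002':'A81004', ['name', 'quantity']]

# A single cell, by row position and column position.
by_position = practices.iloc[0, 1]

print(one_value)
print(block)
print(by_position)
```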


### Ufuncs

As `DataFrame` columns are `Series` objects, we can use ufuncs:

In [96]:
df['cost'] / df['quantity']

A81002    0.214286
A81004    0.508571
A81005    0.508571
A81006    0.508571
A81007    0.508571
A81013    0.508393
A81016    0.508571
A81020    0.508571
A81023    0.508571
A81029    0.508571
dtype: float64

In [97]:
type(df['cost'] / df['quantity'])

pandas.core.series.Series

It is common to assign the result of a ufunc to a new column in a `DataFrame`:

In [98]:
df['unit cost'] = df['cost'] / df['quantity']

In [99]:
df

|  | name | quantity | cost | unit cost |
| --- | --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 | 0.214286 |
| A81004 | Bluebell Medical Centre | 112 | 56.96 | 0.508571 |
| A81005 | Springwood Surgery | 28 | 14.24 | 0.508571 |
| A81006 | Tennant Street Medical Practice | 56 | 28.48 | 0.508571 |
| A81007 | Bankhouse Surgery | 56 | 28.48 | 0.508571 |
| A81013 | Brotton Surgery | 56 | 28.47 | 0.508393 |
| A81016 | Park Surgery | 28 | 14.24 | 0.508571 |
| A81020 | Martonside Medical Centre | 56 | 28.48 | 0.508571 |
| A81023 | The Endeavour Practice | 28 | 14.24 | 0.508571 |
| A81029 | Prospect Surgery | 112 | 56.96 | 0.508571 |


### Masking

Masking works in much the same way as with `Series`:

In [100]:
df['quantity'] > 60

A81002     True
A81004     True
A81005    False
A81006    False
A81007    False
A81013    False
A81016    False
A81020    False
A81023    False
A81029     True
Name: quantity, dtype: bool

In [101]:
df[df['quantity'] > 60]

|  | name | quantity | cost | unit cost |
| --- | --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 | 0.214286 |
| A81004 | Bluebell Medical Centre | 112 | 56.96 | 0.508571 |
| A81029 | Prospect Surgery | 112 | 56.96 | 0.508571 |


The expression used to do the masking can be arbitrarily complicated:

In [102]:
df[(df['cost'] / df['quantity']) < 0.3]

|  | name | quantity | cost | unit cost |
| --- | --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 | 0.214286 |


But it's probably better (for readability) to assign some or all of the expression to a variable:

In [103]:
unit_cost = df['cost'] / df['quantity']
df[unit_cost < 0.3]

|  | name | quantity | cost | unit cost |
| --- | --- | --- | --- | --- |
| A81002 | Queens Park Medical Centre | 70 | 15.0 | 0.214286 |
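A mask can also be passed to `.loc` along with a list of columns, so that we filter the rows and choose the columns in one step.  This is a sketch on a hypothetical three-row `DataFrame` called `practices`, with the mask built up from named parts.

```python
import pandas as pd

# A hypothetical three-row DataFrame, keyed by practice identifier.
practices = pd.DataFrame(
    {
        'name': ['Queens Park Medical Centre', 'Bluebell Medical Centre', 'Springwood Surgery'],
        'quantity': [70, 112, 28],
        'cost': [15.0, 56.96, 14.24],
    },
    index=['A81002', 'A81004', 'A81005'],
)

# Build the mask in named steps for readability.
is_expensive = (practices['cost'] / practices['quantity']) > 0.3
is_large = practices['quantity'] > 50

# Rows where both conditions hold, showing only two of the columns.
result = practices.loc[is_expensive & is_large, ['name', 'cost']]
print(result)
```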


## The Zen of Python

In [104]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
