# Pandas - Data Science with Python

NumPy arrays are our tool of choice for numeric data that resembles vectors, matrices, and higher-dimensional tensors.

Where data is gathered from experiments, where we want to extract meaning from the combination of different data sources, and where data is often incomplete, the pandas library offers a number of useful tools. It has become a standard tool for data scientists.

In this section, we introduce the basics of Pandas.

In particular, we introduce the two key data types in Pandas: the ``Series`` and the ``DataFrame`` objects.

By convention, the `pandas` library is imported under the name `pd` (the same way that `numpy` is imported under the name `np`):

In [1]:
import pandas as pd

## Motivational example


Imagine we are working on software for a greengrocer or supermarket, and need to track the number of apples (10), oranges (3) and bananas (22) that are available in the supermarket.

We could use a Python list (or a NumPy array) to track these numbers:

In [2]:
stock = [10, 3, 22]

However, we would need to remember separately that the entries are in the order of apples, oranges, and bananas. This could be achieved through a second list: 

In [3]:
stocklabels = ['apple', 'orange', 'banana']

In [4]:
assert len(stocklabels) == len(stock)  # check labels and 
                                       # stock are consistent
for label, count in zip(stocklabels, stock):
    print('{:10s} : {:4d}'.format(label, count))

apple      :   10
orange     :    3
banana     :   22


The two-list solution above is a little awkward in two ways. Firstly, we have to use two lists to describe one set of data (and thus need to be careful to update them simultaneously, for example). Secondly, access to the data given a label is inconvenient: we need to find the index of the label in one list, then use this as the index into the other list, for example:

In [12]:
index = stocklabels.index('banana')
bananas = stock[index]
print("There are {} bananas [index={}].".format(bananas, index))

There are 22 bananas [index=2].


We have come across similar examples in the section on dictionaries, and indeed a dictionary is a more convenient solution:

In [13]:
stock_dic = {'apple': 10, 
             'orange': 3,
             'banana': 22}

In a way, the keys of the dictionary contain the stock labels and the values contain the stock counts:

In [14]:
stock_dic.keys()

dict_keys(['banana', 'apple', 'orange'])

In [15]:
stock_dic.values()

dict_values([22, 10, 3])

To retrieve (or change) the value for `apple`, we use `apple` as the key and retrieve the value through the dictionary's indexing notation:

In [16]:
stock_dic['apple']

10
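Changing a value uses the same indexing notation on the left-hand side of an assignment. The following is a minimal sketch; the new count of 12 and the `'pear'` label are made up for illustration and are not part of the stock data above.

```python
stock_dic = {'apple': 10, 'orange': 3, 'banana': 22}

# change an existing value through its key (12 is an illustrative number)
stock_dic['apple'] = 12

# indexing with an unknown key raises KeyError;
# dict.get returns a default value instead ('pear' is a hypothetical label)
pears = stock_dic.get('pear', 0)

print(stock_dic['apple'], pears)
```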

And we can summarise the stock as follows:

In [17]:
for label in stock_dic:
    print('{:10s} : {:4d}'.format(label, stock_dic[label]))

banana     :   22
apple      :   10
orange     :    3


This is a vast improvement over the two-list solution: (i) we only maintain one structure, which contains a value for every key, so we don't need to check that the lists have the same length; (ii) we can access individual elements through the label (using it as a key for the dictionary).

The Pandas `Series` object addresses the requirements above. It is similar to a dictionary, but with improvements for the given problem:

* the order of the items is maintained
* all values have the same type (which allows higher execution performance)
* a (large) number of convenience functions, for example to deal with missing data
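The three points in the list above can be sketched in a few lines. This example is only an illustration under assumptions: the missing count for oranges is invented, and the `Series` is built from a list with an explicit `index` argument (one of several ways to create a `Series`) so that the label order is stated directly.

```python
import numpy as np
import pandas as pd

# np.nan marks a missing count (a hypothetical gap in the data)
counts = pd.Series([10, np.nan, 22], index=['apple', 'orange', 'banana'])

labels = list(counts.index)            # labels come back in the order given
common_dtype = counts.dtype            # one dtype for all values (float, because of NaN)
n_missing = int(counts.isna().sum())   # convenience function: count missing entries
filled = counts.fillna(0)              # convenience function: replace missing entries

print(labels, common_dtype, n_missing)
```

`isna` and `fillna` are just two of the convenience functions for missing data; which replacement value makes sense depends on the application.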

## Pandas `Series`

We can create a `Series` object - for example - from a dictionary:

In [18]:
stock = pd.Series({'apple': 10, 
                   'orange': 3,
                   'banana': 22})

The default presentation shows the entries one per row, with the label on the left, and the value on the right. 

In [20]:
stock

apple     10
banana    22
orange     3
dtype: int64

The items on the left are referred to as the `index` of the Series, and are available as the `index` attribute of the `Series` object:

In [21]:
stock.index

Index(['apple', 'banana', 'orange'], dtype='object')

In [22]:
type(stock.index)

pandas.indexes.base.Index
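The labels and the values can also be pulled out as plain Python and NumPy objects, which may help when passing them to other code. A small sketch, assuming a reasonably recent pandas version (`to_numpy` is not available before pandas 0.24); the `Series` is recreated here so that the snippet runs on its own.

```python
import pandas as pd

stock = pd.Series([10, 22, 3], index=['apple', 'banana', 'orange'])

labels = stock.index.tolist()   # plain Python list of the index labels
counts = stock.to_numpy()       # NumPy array holding the values

print(labels)
print(counts)
```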

Regarding data access, the `Series` object behaves like a dictionary:

In [24]:
stock['apple']

10

In [27]:
stock['potato'] = 101    # adding more values
stock['cucumber'] = 1


In [28]:
print(stock)

apple        10
banana       22
orange        3
potato      101
cucumber      1
dtype: int64


In [26]:
stock

apple        10
banana       22
orange        3
potato      101
cucumber      1
dtype: int64
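The dictionary-like behaviour extends beyond indexing. The sketch below shows membership tests, lookup with a default, and iteration over label-value pairs; the `'pear'` label is hypothetical, and the `Series` is recreated so that the snippet is self-contained.

```python
import pandas as pd

stock = pd.Series([10, 22, 3], index=['apple', 'banana', 'orange'])

has_apple = 'apple' in stock      # membership tests the index labels ...
has_ten = 10 in stock             # ... not the values, so this is False
pears = stock.get('pear', 0)      # default value instead of a KeyError

# iterate over (label, value) pairs, as with dict.items()
summary = ['{:10s} : {:4d}'.format(label, int(count))
           for label, count in stock.items()]
print('\n'.join(summary))
```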

In [5]:
s = pd.Series([10, 20, 400, -42, 15.],
              index=["dog", "cat", "mouse", "wasp", "beea"])

In [8]:
s['cat']

20.0
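The result `20.0` is a float although `20` was entered as an integer. A likely explanation is that all values in a `Series` share one type, so a single float (`15.`) in the input makes pandas store every value as a float. A minimal sketch to check this, using unlabelled data (pandas then numbers the entries 0, 1, 2, ...):

```python
import pandas as pd

ints = pd.Series([10, 20, 400])               # integers only
mixed = pd.Series([10, 20, 400, -42, 15.])    # one float among the integers

print(ints.dtype)    # an integer dtype
print(mixed.dtype)   # a floating point dtype
print(mixed[1])      # the integer 20 is stored as a float
```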