# 1. Introduction

Welcome to the second part of this course. Now that you (presumably) have a solid grasp of the principles of numerical computing in NumPy, we will move on to data management in Python. The most common way to manage data is in **tabular** format (i.e. in a table), as in a relational database. The most widely used library for in-memory, database-like data handling is **Pandas**. Pandas is well suited for:

* **Tabular** data with heterogeneously-typed columns, such as in an SQL database or Excel spreadsheet.
* Ordered and unordered **time-series** data.
* Arbitrary **matrix** data with row and column labels.

Some of the interesting features include:

* Handling missing data fluently
* Size mutability
* Easy-to-use *data alignment*
* Label-based *slicing*, *fancy indexing* and *subsetting*
* Intuitive *merging* and *joining* of datasets by label
* Hierarchical labelling of axes
* Decent IO tools for importing from an array of different formats
* Flexible reshaping and *pivoting* of tables

In [None]:
import pandas as pd

**Pandas** is broken down into two primary classes:

1. **Series**: think of this as a one-dimensional, any-type (templated) array whose elements are labelled by an index. A generalized *NumPy array*.
2. **DataFrame**: think of this as a 2-D heterogeneous table with a *Series* for each column.

## Series

In [None]:
counts = pd.Series([644, 1276, 3554, 154])
counts

If we don't specify an index, a default sequence of integers starting at 0 (a `RangeIndex`, similar to `np.arange()`) is assigned as the index. A NumPy array holds the values of the *Series*, while the index is another *Pandas* object:

In [None]:
counts.values
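
As a small sketch of the two parts of a *Series* described above, the snippet below inspects the default index and the underlying values. It uses `to_numpy()`, which returns the same kind of array as `.values`.

```python
import numpy as np
import pandas as pd

counts = pd.Series([644, 1276, 3554, 154])

# With no explicit index, pandas builds a RangeIndex (0, 1, 2, ...).
print(type(counts.index).__name__)
print(list(counts.index))

# The values are exposed as a NumPy array.
values = counts.to_numpy()
print(type(values).__name__)
```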

We can assign meaningful labels to the series, as:

In [None]:
foods = pd.Series([644, 1276, 3554, 154], index=['Oranges', 'Apples', 'Melons', 'Pumpkins'])
foods

A useful way to think of a *Series* is as a set of **key-value** pairs, i.e. it can be built directly from a dictionary:

In [None]:
food_d = {
    'Oranges': 644,
    'Apples': 1276,
    'Melons': 3554,
    'Pumpkins': 154
}

pd.Series(food_d)

This can also be achieved via separate lists:

In [None]:
labels = ['Oranges', 'Apples', 'Melons', 'Pumpkins']
counts = [644, 1276, 3554, 154]
pd.Series(dict(zip(labels,counts)))
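
Once a *Series* has labels, values can be looked up by label, and arithmetic between two Series aligns on those labels rather than on position. The sketch below illustrates this; the second series `extra` is a made-up example and not part of the course data.

```python
import pandas as pd

foods = pd.Series({'Oranges': 644, 'Apples': 1276, 'Melons': 3554, 'Pumpkins': 154})

# Look up a single value by label, or several labels at once.
apples = foods['Apples']
subset = foods[['Melons', 'Oranges']]

# Arithmetic aligns on labels; a label missing from either side gives NaN.
extra = pd.Series({'Apples': 10, 'Pears': 5})   # hypothetical second series
total = foods + extra
print(total)
```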

## DataFrame

One of the really nice aspects of DataFrames, particularly in a Jupyter notebook, is the HTML/JavaScript rendering that is generated automatically when a table is displayed:

In [None]:
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                                'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})
data

For most datasets it is impractical to display all the values. There are methods to view only the first $n$ rows: `head()` shows the first 5 rows by default.

In [None]:
data.head()

We can extract the column names as:

In [None]:
data.columns
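
A brief sketch of these inspection tools on a shortened version of the table above (only four rows are kept here for brevity). It also shows that selecting one column returns a *Series*.

```python
import pandas as pd

data = pd.DataFrame({'value': [632, 1638, 569, 115],
                     'patient': [1, 1, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria',
                                'Firmicutes', 'Proteobacteria']})

first_two = data.head(2)     # first n rows
last_one = data.tail(1)      # last n rows
column = data['value']       # a single column comes back as a Series

print(list(data.columns))
print(type(column).__name__)
```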

### Reading and Writing Files

There are a number of powerful functions for reading files into a DataFrame, such as `read_excel()` for Excel spreadsheets:

In [None]:
titanic = pd.read_excel("titanic.xlsx")
titanic.head()
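
Writing works in much the same way as reading. The sketch below round-trips a small, made-up table through CSV using `to_csv()` and `read_csv()`; an in-memory buffer stands in for a file on disk so the example does not depend on `titanic.xlsx`.

```python
import io
import pandas as pd

df = pd.DataFrame({'PassengerId': [1, 2, 3],
                   'Age': [22.0, None, 26.0]})   # made-up rows

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind and read it back; the missing value comes back as NaN.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a filename such as `"out.csv"` (a hypothetical path) in place of the buffer would write to disk instead.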

Checking the size of the dataset is a priority:

In [None]:
titanic.shape

As well as determining the number of missing values from each column:

In [None]:
titanic.isnull().sum()
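
To make the missing-value check concrete, here is a minimal sketch on a made-up table. Besides the per-column count, it shows the fraction of missing rows and one simple way of dropping incomplete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                   'Cabin': ['C85', None, None, None]})   # made-up values

missing_per_column = df.isnull().sum()    # number of missing entries per column
missing_fraction = df.isnull().mean()     # share of rows that are missing
complete_rows = df.dropna()               # keep only rows without any gaps

print(missing_per_column)
```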

We can select a column using square-bracket notation `[]` or dot notation:

In [None]:
titanic.Age
titanic['Age'].head()

As in NumPy, we can index and slice using similar syntax:

In [None]:
titanic.Age[:5]

In [None]:
titanic[2:10:2]

Given that each row of this dataset is a passenger, it would be wise to set `PassengerId` as the index:

In [None]:
titanic = titanic.set_index("PassengerId")
titanic.head()

### Querying, Selection

We can select passengers by index label (row) using `.loc[]`:

In [None]:
titanic.loc[3]

Or a single value by also including a column label:

In [None]:
titanic.loc[3, 'Age']

We can quickly subset the dataset using a boolean condition:

In [None]:
titanic[titanic.Age > 30].head()

Or select a range of columns between two named columns (both ends are included):

In [None]:
titanic.loc[:3, "Cabin":"Fare"]

Alternatively, we can index using the absolute *position* using `iloc[]`.

In [None]:
titanic.iloc[1, 2]

In [None]:
titanic.iloc[1]
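
The difference between label-based and position-based indexing is easiest to see with slices. In this sketch on a made-up table, a `.loc[]` slice includes its end label, whereas an `.iloc[]` slice excludes its end position, as in NumPy.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 38, 26, 35],
                   'Fare': [7.25, 71.28, 7.92, 53.10]},
                  index=pd.Index([1, 2, 3, 4], name='PassengerId'))   # made-up rows

by_label = df.loc[2:3]       # labels 2 and 3: the end label is included
by_position = df.iloc[2:3]   # position 2 only: the end position is excluded

single = df.loc[3, 'Age']    # row label 3, column 'Age'
same_cell = df.iloc[2, 0]    # third row, first column: the same cell
```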

We can use the `isin()` method to check whether each element of a Series matches any of a list of values:

In [None]:
titanic['Port Embarked'].isin(['Cherbourg']).head()
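
The boolean Series returned by `isin()` is typically used as a mask to filter rows. A minimal sketch on made-up data, reusing the `Port Embarked` column name from above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Port Embarked': ['Cherbourg', 'Southampton',
                                     'Queenstown', 'Cherbourg']})   # made-up rows

mask = df['Port Embarked'].isin(['Cherbourg', 'Queenstown'])
selected = df[mask]    # rows where the mask is True
others = df[~mask]     # ~ negates the mask
```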

The `where()` method keeps the values that satisfy a condition and replaces the rest with NaN, so the result retains the shape of the original DataFrame, which is crucial when alignment is required:

In [None]:
import numpy as np
x = pd.DataFrame(np.random.rand(5,7))
x.where(x < 0.5)

Instead of replacing values with NaN, we can pass a value or function as `other` to fill the entries that do not satisfy the condition:

In [None]:
x.where(x < 0.5, other=-x)

In [None]:
x.where(x > 0.5, other=lambda y: y**3-1)
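
Because the cells above use random numbers, the effect is easier to follow with fixed values. The sketch below uses a small hand-written table and also shows `mask()`, which replaces the entries where the condition holds instead.

```python
import pandas as pd

x = pd.DataFrame({'a': [0.1, 0.7], 'b': [0.9, 0.3]})   # fixed values for clarity

kept = x.where(x < 0.5)                # entries failing the condition become NaN
negated = x.where(x < 0.5, other=-x)   # entries failing the condition become -x
masked = x.mask(x < 0.5)               # the inverse: entries passing become NaN

print(kept)
```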

Selection using `query()` feels an awful lot like SQL. The query string can reference Python variables by prefixing them with `@`:

In [None]:
n_parents = 2
titanic.query("(Age < 25) & ((Pclass == '1st class') | (n_parents == @n_parents))").head()
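
For comparison, the sketch below runs the same query on a made-up table and checks it against the equivalent boolean mask. Inside the query string, `n_parents` refers to the column, while `@n_parents` refers to the Python variable.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 38, 4, 19],
                   'Pclass': ['3rd class', '1st class', '3rd class', '1st class'],
                   'n_parents': [0, 0, 2, 1]})   # made-up rows

n_parents = 2   # local Python variable, referenced with @ inside the query

via_query = df.query("(Age < 25) & ((Pclass == '1st class') | (n_parents == @n_parents))")

# The same selection written as a boolean mask.
via_mask = df[(df.Age < 25) & ((df.Pclass == '1st class') | (df.n_parents == n_parents))]
```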

### Aggregation

The toys of NumPy are back in a similar form: max, min, mean, sum etc.

In [None]:
titanic.sum()

In [None]:
titanic.Age.mean()

In [None]:
titanic.describe()

We could check the correlation between two factors.

In [None]:
titanic.Fare.corr(titanic.Age)

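Or generate the full correlation matrix, whose diagonal is 1 because each variable is perfectly correlated with itself.

In [None]:
# numeric_only=True skips text columns, which recent pandas versions otherwise reject
titanic.corr(numeric_only=True)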

In [None]:
titanic.agg(['min','max'])

In [None]:
titanic.agg({'Fare': ['mean','std'], 'Age': ['min', 'max']})
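
When `agg()` is given a dictionary, each column gets its own list of functions, and the result has one row per function with NaN where a function was not requested for that column. A small sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                   'Fare': [10.0, 70.0, 10.0]})   # made-up values

summary = df.agg({'Fare': ['mean', 'std'], 'Age': ['min', 'max']})
print(summary)
```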

Or we can apply an operation that is not built into Pandas, either one from NumPy or one of our own:

In [None]:
titanic[['Age','Fare','n_parents','n_siblings']].dropna().apply(np.median)

In [None]:
def age_fare_ratio(x):
    """Return Age / Fare for one row, or 0 when Fare is not positive."""
    if x.Fare > 0.:
        return x.Age / x.Fare
    else:
        return 0.

titanic['Age_Fare_rat'] = titanic.apply(age_fare_ratio, axis=1)
titanic.head()
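
Row-wise `apply()` calls a Python function once per row, which can be slow on large tables. As a sketch of one possible alternative, the same ratio can be computed on whole columns with `np.where()`; the table here is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 30.0],
                   'Fare': [7.25, 0.0, 15.0]})   # made-up rows, one with a zero fare

def age_fare_ratio(row):
    return row.Age / row.Fare if row.Fare > 0 else 0.0

row_wise = df.apply(age_fare_ratio, axis=1)

# Column-wise version: take Age / Fare where Fare is positive, else 0.
vectorised = pd.Series(np.where(df.Fare > 0, df.Age / df.Fare, 0.0), index=df.index)
```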

One of the most powerful forms of aggregation is **groupby**. This splits the rows into groups that share the same values in one or more columns and applies an aggregation function within each group, allowing us to control for different factors:

In [None]:
# select numeric columns first; mean and std are not defined for text columns
titanic.groupby(['Sex', 'Pclass'])[['Age', 'Fare']].agg(['mean', 'std'])
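
A minimal sketch of the split-then-aggregate idea on a made-up table: rows are grouped by the combination of `Sex` and `Pclass`, and the mean fare is computed within each group. The result is indexed by the group keys, and `reset_index()` turns those keys back into ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'female'],
                   'Pclass': ['1st', '1st', '3rd', '3rd', '1st'],
                   'Fare': [50.0, 80.0, 8.0, 7.0, 60.0]})   # made-up rows

mean_fare = df.groupby(['Sex', 'Pclass'])['Fare'].mean()
print(mean_fare)

flat = mean_fare.reset_index()   # group keys become regular columns again
```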

### Sorting, Ranking

In [None]:
titanic.sort_values(by='Age', ascending=False).head(3)

In [None]:
titanic.sort_index(ascending=False).head(3)

In [None]:
titanic.sort_values(by=['n_parents','Fare'], ascending=[False,True]).head()

We can `rank()` each value relative to the others if desired:

In [None]:
titanic.Fare.rank().head()
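
How `rank()` treats ties depends on its `method` argument. The sketch below uses made-up fares with one tie: by default tied values share the average of their ranks, while `method='min'` gives them the lowest rank of the tie.

```python
import pandas as pd

fare = pd.Series([7.25, 71.28, 7.25, 53.10])   # made-up fares with one tie

average_rank = fare.rank()                 # [1.5, 4.0, 1.5, 3.0]
min_rank = fare.rank(method='min')         # [1.0, 4.0, 1.0, 3.0]
descending = fare.rank(ascending=False)    # [3.5, 1.0, 3.5, 2.0]
```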

### Counts

We can count how many times each unique value occurs in a column with `value_counts()` - incredibly useful!

In [None]:
titanic.Survived.value_counts()

In [None]:
titanic.Sex.value_counts()
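
A few variations on counting, sketched on a made-up column: `normalize=True` returns proportions instead of counts, `dropna=False` includes missing values as their own entry, and `nunique()` gives just the number of distinct values.

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male', 'female', None])   # made-up column

counts = sex.value_counts()                     # occurrences of each value
shares = sex.value_counts(normalize=True)       # proportions of non-missing values
with_missing = sex.value_counts(dropna=False)   # missing values counted too
n_distinct = sex.nunique()                      # number of distinct non-missing values
```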

### Handling Complex String columns

We may wish to break down the `Name` column into title, first and last names.

In [None]:
titanic.Name.head()

In [None]:
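# raw string so \s and \. reach the regex engine unchanged; \. matches the literal full stop after the title
complex_names = titanic.Name.str.extract(r"(?P<Surname>[a-zA-Z]+),\s(?P<Title>[a-zA-Z]+)\.\s(?P<Forename>[a-zA-Z]+)",
                         expand=True)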
complex_names.head()
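
To show what the named groups produce, here is a sketch on a few sample strings. It assumes names follow a "Surname, Title. Forename" layout, as the pattern implies; the third string is a deliberately non-matching, made-up value.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Heikkinen, Miss. Laina',
                   'no comma here'])   # sample strings in the assumed layout

pattern = r"(?P<Surname>[a-zA-Z]+),\s(?P<Title>[a-zA-Z]+)\.\s(?P<Forename>[a-zA-Z]+)"

# Each named group becomes a column; rows that do not match are filled with NaN.
parts = names.str.extract(pattern, expand=True)
print(parts)
```

Note that the character class `[a-zA-Z]+` stops at hyphens, apostrophes and accented letters, so names containing these are only partly captured or not matched at all.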

In [None]:
# or alternatively, split the string on a common character, such as a space
titanic.Name.str.split(" ", expand=True).head()

In [None]:
# make a new titanic with names appended!
titanic = pd.concat([ complex_names, titanic ], axis=1)
titanic.head()
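
With `axis=1`, `pd.concat()` places tables side by side and matches rows by index label rather than by position. A small sketch with two made-up tables whose indices only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'Surname': ['Braund', 'Heikkinen']}, index=[1, 3])   # made-up
right = pd.DataFrame({'Age': [22, 38, 26]}, index=[1, 2, 3])              # made-up

# Rows are matched on index labels; labels missing from one table give NaN.
combined = pd.concat([left, right], axis=1)
print(combined)
```

In the cell above this works cleanly because `complex_names` was derived from `titanic` and so shares its index.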