# Advanced Pandas

Here we touch on some of the more complex areas of Pandas, continuing from the Intermediate topics covered previously, to give you as complete an experience of using `pandas` as possible.

In [2]:
import pandas as pd
import numpy as np

## `pandas.melt`: From wide to long

A number of packages require data to be in *long-form*. This often means that columns contain many repeated values, which is memory- and disk-intensive, so it is far more common to keep data in wide-form. When we do need to convert data with many similar columns into *long-form*, `pd.melt` is one of the best functions in Pandas for the job.

Take the `cdystonia` dataset for example.

In [47]:
cdystonia = pd.read_csv("datasets/cdystonia.csv")
print(cdystonia.shape)
cdystonia.head(3)

(631, 9)


Unnamed: 0,patient,obs,week,site,id,treat,age,sex,twstrs
0,1,1,0,1,1,5000U,65,F,32
1,1,2,2,1,1,5000U,65,F,30
2,1,3,4,1,1,5000U,65,F,24


Using the methods covered previously, we can expand the `twstrs` response column into multiple columns using a *pivot*. Here we use `week` for the columns (it carries the same information as the observation number `obs`), and use a set difference on the column names to drop `twstrs`, `obs` and `week` from the index while keeping all the other columns.

In [46]:
cdystonia_wide = cdystonia.pivot_table("twstrs", index=cdystonia.columns.difference(["twstrs","obs","week"]).tolist(), columns="week")
print(cdystonia_wide.shape)
cdystonia_wide.head()

(109, 6)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,week,0,2,4,8,12,16
age,id,patient,sex,site,treat,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
26,8,75,F,7,Placebo,42.0,48.0,26.0,37.0,37.0,43.0
31,10,22,M,2,Placebo,44.0,40.0,32.0,36.0,42.0,43.0
34,11,90,F,8,10000U,49.0,25.0,30.0,49.0,55.0,58.0
35,4,50,F,5,10000U,50.0,50.0,,46.0,50.0,57.0
35,12,38,M,3,5000U,29.0,42.0,35.0,24.0,29.0,42.0


You can see that the long-form shape $(631,9)$ has far more rows than the wide-form shape $(109,6)$. To go back from wide to long, we specify the columns we want to keep as identifiers; `pd.melt` then takes every other column and collapses them into a single column, which we name `twstrs` again:

In [87]:
cdystonia_long = pd.melt(cdystonia_wide.reset_index(), id_vars=["age","id","patient","sex","site","treat"], value_name="twstrs", var_name="week")
cdystonia_long.head(3)

Unnamed: 0,age,id,patient,sex,site,treat,week,twstrs
0,26,8,75,F,7,Placebo,0,42.0
1,31,10,22,M,2,Placebo,0,44.0
2,34,11,90,F,8,10000U,0,49.0
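
To make the mechanics of `pd.melt` easier to follow, here is a minimal sketch on a tiny, made-up wide table. The values and the two-patient layout are assumptions chosen for illustration; they are not taken from the `cdystonia` file.

```python
import pandas as pd

# Hypothetical wide-form table: one row per patient, one column per week.
wide = pd.DataFrame({
    "patient": [1, 2],
    "treat": ["Placebo", "5000U"],
    0: [32.0, 60.0],
    2: [30.0, 26.0],
})

# id_vars stay as identifier columns; every other column is stacked
# into (week, twstrs) pairs, one row per original cell.
long = pd.melt(wide, id_vars=["patient", "treat"],
               var_name="week", value_name="twstrs")

print(long.shape)  # 2 patients x 2 weeks -> (4, 4)
print(long)
```

Each identifier row is repeated once per melted column, which is why long-form tables have more rows than their wide-form counterparts.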


## Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

If we recall from NumPy, one of the key advantages was the *vectorization* of mathematical operations, such as this:

In [88]:
x=np.array([2,3,5,7,11,13])
x**2

array([  4,   9,  25,  49, 121, 169])

Whereas for arrays of strings, NumPy does not provide such simple access, and we have to fall back to using a Pythonic list comprehension:

In [89]:
x=np.array(['peter','Paul','mary','guido'])
[s.capitalize() for s in x]

['Peter', 'Paul', 'Mary', 'Guido']

In [90]:
x.capitalize()

AttributeError: 'numpy.ndarray' object has no attribute 'capitalize'

In addition, this Pythonic method will break in cases where there is missing data:

In [91]:
x=np.array(['peter','Paul',None,'mary','guido'])
[s.capitalize() for s in x]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the `str` attribute of `pd.Series` and `pd.Index` objects containing string information. 

In [93]:
names = pd.Series(["Jeff", "alan", "Steve", "gUIDO", None, "job", None])
names

0     Jeff
1     alan
2    Steve
3    gUIDO
4     None
5      job
6     None
dtype: object

We can now call a single method to capitalize the entries, as follows:

In [94]:
names.str.capitalize()

0     Jeff
1     Alan
2    Steve
3    Guido
4     None
5      Job
6     None
dtype: object

### Available methods in `pandas.str`

Nearly all of the Python built-in string methods are mirrored by Pandas vectorized string methods. Here is a list:

| & | & | &  | &|
|------- | ----------- | ----------- | ------------- |
| `len()` | `lower()` | `translate()` | `islower()` |
| `ljust()` | `rjust()` | `isupper()` | `upper()` | 
| `startswith()` | `endswith()` | `find()` | `isnumeric()` |
| `center()` | `rfind()` | `isalnum()` | `isdecimal()` | 
| `zfill()` | `index()` | `isalpha()` | `split()` |
| `strip()` | `rindex()` | `isdigit()` | `rsplit()` |
| `rstrip()` | `capitalize()` | `isspace()` | `partition()` |
| `lstrip()` | `swapcase()` | `istitle()` | `rpartition()` |

Note that the return types vary: for instance, `lower()` returns strings, `len()` returns integers, `startswith()` returns boolean values, and so on.
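
The difference in return types can be checked directly. The following is a small sketch on a made-up Series; the exact dtypes reported, and how missing entries appear in the boolean result, can differ between pandas versions, so no output is shown.

```python
import pandas as pd

names = pd.Series(["Jeff", "alan", None])

lowered = names.str.lower()          # strings; the missing entry stays missing
lengths = names.str.len()            # numeric lengths; the missing entry becomes NaN
starts = names.str.startswith("J")   # booleans for the non-missing entries

for result in (lowered, lengths, starts):
    print(result.dtype, result.tolist())
```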

### Additional methods using regular expressions

This is where the true power of Pandas comes in: not only can we do direct matching and string manipulation, we can also examine the content of each element using a regular expression. Some of the methods we can use are:

| **Method** | **Description** |
| ---------- | -------------------------------- |
| `match()` | Calls `re.match()` on each element, returning a boolean |
| `extract()` | Calls `re.search()` on each element, returning matched groups as strings |
| `findall()` | Calls `re.findall()` on each element |
| `replace()` | Replaces occurrences of pattern with some other string |
| `contains()` | Calls `re.search()` on each element, returning a boolean |
| `count()` | Count occurrences of pattern |
| `split()` | Calls `str.split()`, but accepts regular expressions |
| `rsplit()` | Calls `str.rsplit()` but accepts regular expressions |
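
As a quick illustration of three of the methods in the table above, here is a sketch on a shortened, hypothetical list of names. `regex=True` is passed explicitly to `replace()` because its default has changed between pandas versions.

```python
import pandas as pd

monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam"])

# contains(): does the pattern match anywhere in the element?
has_double_l = monte.str.contains(r"ll")

# count(): number of non-overlapping matches per element
n_vowels = monte.str.count(r"[aeiou]")

# replace(): abbreviate the first word to its initial using a capture group
initials = monte.str.replace(r"^(\w)\w*", r"\1.", regex=True)

print(has_double_l.tolist())  # [False, False, True]
print(n_vowels.tolist())      # [4, 4, 4]
print(initials.tolist())      # ['G. Chapman', 'J. Cleese', 'T. Gilliam']
```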

With these, we have a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of the element:

In [120]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'], name="names")

In [121]:
monte.str.extract("([A-Za-z]+)", expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
Name: names, dtype: object

Note that if we pass `expand=True`, we get a one-column `pd.DataFrame`; otherwise we get a `pd.Series`. We could also do something more complicated, like finding all the names that start and end with a consonant, making use of the start-of-string (^) and end-of-string (\$) regular expression characters:

In [122]:
monte.str.findall(r"^[^AEIOU].*[^aeiou]$")

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
Name: names, dtype: object

### Miscellaneous methods

Finally, there are a number of convenient operations which Pandas uniquely provides that can be invaluable when *function chaining*:

| **Method** | **Description** |
| ----------- | ----------------------------- |
| `get()` | Index each element |
| `slice()` | Slice each element |
| `slice_replace()` | Replace slice in each element with passed value |
| `cat()` | Concatenate strings |
| `repeat()` | Repeat values |
| `normalize()` | Return Unicode normal form of the string |
| `pad()` | Add whitespace to the left, right or both sides of a string |
| `wrap()` | Split long strings into lines of length less than a given width |
| `join()` | Join strings in each element of the Series with passed separator |
| `get_dummies()` | Extract dummy variables as DataFrame |
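
A few of the methods in the table above, sketched on a small made-up Series. The last line shows how they combine when chained; the strings are illustrative only.

```python
import pandas as pd

s = pd.Series(["ab", "cd", "ef"])

first_chars = s.str.get(0)                        # index each element -> 'a', 'c', 'e'
joined = s.str.cat(sep="-")                       # concatenate the whole Series into one string
padded = s.str.pad(4, side="left", fillchar="*")  # pad each element to width 4
doubled = s.str.repeat(2)                         # repeat each element twice

# Chaining: upper-case first, then pad on both sides
chained = s.str.upper().str.pad(4, side="both", fillchar=".")

print(joined)            # ab-cd-ef
print(padded.tolist())   # ['**ab', '**cd', '**ef']
print(chained.tolist())  # ['.AB.', '.CD.', '.EF.']
```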

### Vectorized item access and slicing

The `get()` and `slice()` operations enable vectorized element access within each string. For example:

In [123]:
monte.str[:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
Name: names, dtype: object

This is equivalent to:

In [124]:
monte.str.slice(0,3)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
Name: names, dtype: object

In [125]:
monte.str.split(" ", expand=True)

Unnamed: 0,0,1
0,Graham,Chapman
1,John,Cleese
2,Terry,Gilliam
3,Eric,Idle
4,Terry,Jones
5,Michael,Palin
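
Item access also combines with `split()`. As a sketch, assuming we want only the last name of each entry (using a shortened, hypothetical copy of the names), we can split on whitespace and then take the final piece with `get(-1)`:

```python
import pandas as pd

monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam"], name="names")

# split() without expand gives one list per element;
# get(-1) then picks the last item of each list
last_names = monte.str.split().str.get(-1)

print(last_names.tolist())  # ['Chapman', 'Cleese', 'Gilliam']
```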


### Indicator Variables

Another method that requires a bit of extra explanation is the `get_dummies()` method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [126]:
info=pd.Series(["B|C|D","B|D","A|C","B|D","B|C", "B|C|D"])
info.name="info"

In [128]:
full_monte = pd.concat([monte, info],axis=1)

The `get_dummies` routine lets you split out indicator variables into a new DataFrame:

In [129]:
full_monte["info"].str.get_dummies("|")

Unnamed: 0,A,B,C,D
0,0,1,1,1
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,1,0
5,0,1,1,1


## Categorical Types

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, fixed number of possible values. Examples include gender, blood type, country or rating. Categorical data may be ordered, but numerical operations are not possible on them.

All of the values in categorical data are either in categories or `np.nan`. Order is defined by the order of *categories*, not the lexical order of the values. Using a categorical data type has a number of **advantages**:

- A string variable consisting of only a few different values can be *efficiently* stored internally as each string is represented by an integer, and only unique strings are in the categories array.
- Sorting through an ordered categorical variable is substantially faster.
- Provides valuable metadata to Pandas when it comes to smart plotting, operations, etc.
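
The storage advantage in the first point can be checked with `memory_usage`. This is a rough sketch on a made-up column of four repeated strings; the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: four distinct strings repeated many times
s_plain = pd.Series(["air", "water", "fire", "earth"] * 25_000)
s_cat = s_plain.astype("category")

plain_bytes = s_plain.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(plain_bytes, cat_bytes)

# The categorical keeps one small integer code per row plus the 4 categories
print(s_cat.cat.codes.dtype)        # int8
print(list(s_cat.cat.categories))   # ['air', 'earth', 'fire', 'water']
```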

Much of this material is drawn from the Pandas documentation, which is extensive and found [here](https://pandas.pydata.org/pandas-docs/stable/categorical.html). 

In [130]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'c'])
c

[a, b, b, c, a, b, a, a, a, c]
Categories (3, object): [a, b, c]

In [131]:
c.describe()

Unnamed: 0_level_0,counts,freqs
categories,Unnamed: 1_level_1,Unnamed: 2_level_1
a,5,0.5
b,3,0.3
c,2,0.2


In [132]:
c.codes

array([0, 1, 1, 2, 0, 1, 0, 0, 0, 2], dtype=int8)

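You can provide information as to the ordering of the categories. Note that `as_ordered()` returns a new, ordered categorical and leaves `c` itself unchanged, which is why `c.dtype` below still reports `ordered=False`: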

In [133]:
c.as_ordered()

[a, b, b, c, a, b, a, a, a, c]
Categories (3, object): [a < b < c]

In [134]:
c.dtype

CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
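
Because order is defined by the categories rather than by the lexical order of the values, an explicit category list changes how sorting and comparisons behave. A minimal sketch, with made-up labels:

```python
import pandas as pd

sizes = pd.Categorical(
    ["medium", "low", "high", "low"],
    categories=["low", "medium", "high"],   # explicit, non-alphabetical order
    ordered=True,
)
s = pd.Series(sizes)

sorted_vals = s.sort_values().tolist()   # follows the category order
largest = s.max()                        # min/max need an ordered categorical
above_low = (s > "low").tolist()         # comparisons use the category order

print(sorted_vals)  # ['low', 'low', 'medium', 'high']
print(largest)      # high
print(above_low)    # [True, False, True, False]
```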

Converting an existing 'object' feature into a category:

In [135]:
s = pd.Series(["air", "water", "fire", "fire", "water", "earth", "fire", "fire", "water", "air"])
s.astype("category")

0      air
1    water
2     fire
3     fire
4    water
5    earth
6     fire
7     fire
8    water
9      air
dtype: category
Categories (4, object): [air, earth, fire, water]

## Time-Series Data

Pandas as a tool was initially developed in the context of financial modelling, so as you might expect, there is a rather large suite of tools for working with dates, times and time-indexed data. There are a number of different formats that date data can come in:

- *Time stamps* reference particular moments in time (e.g., Dec 25, 2011 at 7:45pm).
- *Time intervals* and periods reference a length of time with a beginning and end point.
- *Time deltas* or durations reference an exact length of time (e.g., a duration of 22.56 seconds).
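
These three notions map onto `pd.Timestamp`, `pd.Period` and `pd.Timedelta`. The sketch below reuses the example values from the list; choosing a monthly period is an assumption made purely for illustration.

```python
import pandas as pd

stamp = pd.Timestamp("2011-12-25 19:45")   # a single moment in time
period = pd.Period("2011-12", freq="M")    # the whole month of December 2011
delta = pd.Timedelta(seconds=22.56)        # an exact duration

# A period has a beginning and an end point
print(period.start_time, period.end_time)

# Adding a duration to a time stamp gives another time stamp
print(stamp + delta)

# The time stamp falls inside the period
print(period.start_time <= stamp <= period.end_time)  # True
```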

## Method Chaining

You may have noticed that several of the examples above chain method calls one after another to get the result we wanted.

Let's say we wanted to perform a series of different operations on this data to obtain a more useful column/metric and output:

In [1]:
(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90], right=False))
    .groupby(['age_group','sex']).twstrs.mean()  # select twstrs first so non-numeric columns are never averaged
    .unstack("sex")
    .fillna(0.0)
    .plot.barh(figsize=(10,5)))

## Pipes

One of the problems with method chaining is that it requires all of the functionality you need for data processing to be implemented as methods that return a DataFrame, so that the chain can continue. Occasionally we want to apply custom manipulations to our data; this is what `pipe` is for.

For example, we may wish to express `twstrs` as *proportions*, so that we can compare the response in proportional terms across time and across groups of patients, such as age groups.

In [None]:
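def to_proportions(df, axis=1):
    # Total of each row (axis=1) or each column (axis=0)
    totals = df.sum(axis=axis)
    # Divide along the other axis so that each row (or column) sums to 1
    return df.div(totals, axis=1 - axis)

(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90], right=False))
    .groupby(["week","age_group"]).twstrs.mean()  # select twstrs first so non-numeric columns are never averaged
    .unstack("age_group")
    .pipe(to_proportions, axis=1))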

We can now see the proportion of response variable across the age groups, per week.
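
To see what `pipe` does without needing the dataset, here is a self-contained sketch on a tiny made-up table. The scores and group names are hypothetical, and `to_proportions` is re-declared so the block runs on its own.

```python
import pandas as pd

def to_proportions(df, axis=1):
    # Divide every entry by the total of its row (axis=1) or column (axis=0)
    totals = df.sum(axis=axis)
    return df.div(totals, axis=1 - axis)

# Hypothetical mean scores: rows are weeks, columns are groups
scores = pd.DataFrame({"young": [10.0, 30.0], "old": [30.0, 10.0]}, index=[0, 2])

# .pipe(f, axis=1) calls f(<the DataFrame so far>, axis=1), keeping the chain readable
props = (scores
         .rename_axis("week")
         .pipe(to_proportions, axis=1))

print(props)
print(props.sum(axis=1).tolist())  # each row sums to 1: [1.0, 1.0]
```

Writing `to_proportions(scores.rename_axis("week"), axis=1)` gives the same result; `pipe` simply lets a custom function sit in the chain alongside the built-in methods.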

## Data Transformation

We have several options for *transforming* labels and other columns into more useful features:

In [None]:
cdystonia.treat.replace({'Placebo': 0, "5000U": 1, "10000U": 2}).head(10)

In [None]:
cdystonia.treat.astype("category").head(10)

In [None]:
pd.cut(cdystonia.age, [20,40,60,80], labels=["Young","Middle-Aged","Old"])[-25:]

We can use `pd.qcut` to automatically divide our data into equal-sized $q$-quantiles. For example, $q=4$ gives quartiles.

In [None]:
pd.qcut(cdystonia.age, 4)[-10:]
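
The difference between `pd.cut` and `pd.qcut` is easiest to see on a small example. The ages below are made up, and the quartile labels `Q1` to `Q4` are names chosen for this sketch.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 47, 52, 64, 79])

# cut: fixed bin edges, so group sizes can differ
by_edges = pd.cut(ages, [20, 40, 60, 80], labels=["Young", "Middle-Aged", "Old"])

# qcut: edges are chosen from the data so each bin holds about the same count
by_quartile = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(by_edges.value_counts().sort_index().tolist())     # [4, 2, 2]
print(by_quartile.value_counts().sort_index().tolist())  # [2, 2, 2, 2]
```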

## Sparse Dataframes

*Sparse* versions of Series and DataFrame are implemented in Pandas. They are not sparse in the typical sense; rather, these objects are **compressed**: any data matching a specific value (`NaN`/missing by default) is omitted. A special `SparseIndex` object tracks where data has been *sparsified*. The `to_sparse()` method and the `SparseDataFrame` class found in older versions of Pandas were removed in pandas 1.0, so the examples here use `pd.SparseDtype` and the `.sparse` accessor instead. See this example:

In [2]:
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.astype(pd.SparseDtype("float", np.nan))
sts

The `fill_value` argument of `pd.SparseDtype` allows us to omit a value other than `NaN`:

In [3]:
ts.fillna(0.).astype(pd.SparseDtype("float", fill_value=0))

These sparse objects are mostly useful for memory efficiency. Suppose you had a mostly `NaN` DataFrame:

In [4]:
df = pd.DataFrame(np.random.rand(100,100))
df_sp = df.where(df < 0.02).astype(pd.SparseDtype("float", np.nan))
print(df_sp.sparse.density)
df_sp.head()

In [None]:
print("Memory usage [sparse]: %d bytes\nMemory usage [dense]: %d bytes" % (df_sp.memory_usage().sum(), df.memory_usage().sum()))

Pandas also supports creating sparse dataframes directly from `scipy.sparse` matrices via `pd.DataFrame.sparse.from_spmatrix`. It is worth mentioning that the input matrix must be convertible to CSC format.

In [None]:
from scipy import sparse

scip_sps = sparse.coo_matrix(np.random.choice([0,1], size=(1000,1000), p=(.95, .05)))
scip_sps

In [None]:
sdf = pd.DataFrame.sparse.from_spmatrix(scip_sps)
sdf.head()

## Tasks

Recipe Database