# 2. Intermediate

We have covered a number of features within Pandas, such as the basics of **Series** and **DataFrame**, reading files, checking for missing data, querying/selection, aggregation, sorting/ranking and handling strings.

In this section we're going to cover further data types (categorical and datetime), reshaping DataFrames, method chaining with pipes, and data transformation.

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline

### Categorical Types

In [None]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'c'])
c

In [None]:
c.describe()

In [None]:
c.codes

In [None]:
# as_ordered() returns a copy whose categories are treated as ordered (here a < b < c)
c.as_ordered()

In [None]:
c.dtype
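To see what ordering buys us, here is a minimal sketch. The severity labels are invented for illustration and are not part of any dataset used in this notebook. Once the categories are declared in order, comparisons and `max()` follow that order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical labels, made up for this example
severity = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],  # explicit order: low < medium < high
    ordered=True,
)

# Comparisons against a category label use the declared order
is_above_low = severity > "low"

# max() is only defined for ordered categoricals
highest = severity.max()

# Each value is stored as an integer code pointing into the categories
codes = severity.codes
```
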

### DateTime Types

In [None]:
dates = pd.date_range("1/1/2016", periods=70, freq="D")
dates

In [None]:
y = pd.Series(np.random.randn(70), index=dates)
y.head()

In [None]:
y.cumsum().plot()

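When a Series has a datetime index, the date components are available as attributes of the index. (For a datetime *column* in a DataFrame, the same information is reached through the special `.dt` accessor.)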

In [None]:
y.index.day

In [None]:
# DatetimeIndex.week was removed in pandas 2.0; isocalendar() needs pandas 1.1+
y.index.isocalendar().week
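For datetime values stored in a column rather than in the index, the `.dt` accessor exposes the same components. The sketch below uses a small made-up frame; the column name `when` is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical event dates, invented for this example
events = pd.DataFrame(
    {"when": pd.to_datetime(["2016-01-01", "2016-01-15", "2016-02-29"])}
)

# .dt gives element-wise access to datetime components of a column
days = events["when"].dt.day
months = events["when"].dt.month
weekday_names = events["when"].dt.day_name()
```
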

In [None]:
y.loc["2016-01-01":"2016-01-05"]

Our time series is daily. We can resample it to a lower frequency (weeks, months, years and so on) by aggregating the values in each period, or move to a higher frequency and fill or interpolate the gaps.

In [None]:
y.resample("W").mean()

When moving to a higher frequency, the new timestamps have no data, so we can optionally fill in the missing values:

In [None]:
y.asfreq("H", method='ffill').head()
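The difference between leaving gaps, forward-filling and interpolating is easier to see on a tiny series. This is a sketch on made-up values sampled every two days, not on the random series above.

```python
import pandas as pd

# Hypothetical readings taken every second day
sparse = pd.Series(
    [0.0, 4.0, 2.0],
    index=pd.date_range("2016-01-01", periods=3, freq="2D"),
)

# Upsampling to daily frequency inserts the missing days as NaN
upsampled = sparse.asfreq("D")

# Forward-fill repeats the last known value
filled = upsampled.ffill()

# Linear interpolation draws a straight line between known values
interpolated = upsampled.interpolate()
```

Forward-filling is the safer choice when the value is a state that persists until the next reading; interpolation suits quantities that change smoothly.
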

Lagging a time series with `shift` or smoothing it with a rolling window is straightforward:

In [None]:
y.shift(1).head()

In [None]:
hours = pd.date_range("1/1/2017", periods=1000, freq="H")
y_cum = pd.Series(np.random.randn(1000), index=hours).cumsum()
y_cum.plot()
# overlay the 30-hour rolling mean on the same axes
y_cum.rolling(window=30).mean().plot()
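Because the series above is random, the effect of `shift` and `rolling` is hard to check by eye. The following sketch uses a short deterministic series (values chosen for illustration) so that each result can be verified by hand.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2017-01-01", periods=5, freq="D"),
)

# shift(1) moves every value one step later; the first slot becomes NaN
lagged = s.shift(1)

# Day-on-day change, built from the lagged series
change = s - lagged

# A window of 3 needs three observations, so the first two results are NaN
rolling_mean = s.rolling(window=3).mean()
```
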

### Reshaping DataFrame objects

Within a single DataFrame, we are often interested in rearranging the layout of our data. This matters particularly for machine learning, where prediction algorithms expect the $X$ and $y$ inputs in a strict layout.

To illustrate this, we will work with a dataset from *Statistical Methods for the Analysis of Repeated Measurements* by Charles S. Davis, pp. 161-163 (Springer, 2002), which contains data from a controlled trial of botulinum toxin type B (BoTB) in patients with cervical dystonia.
* Response variable: `twstrs`, measuring the severity, pain and disability caused by cervical dystonia.
* Measured multiple times per patient, in weeks 0, 2, 4, 8, 12 and 16.

In [None]:
cdystonia = pd.read_csv("cdystonia.csv")
cdystonia.head(10)

We could use the *stack()* method to rotate the DataFrame, so that the columns become an inner level of the row index:

In [None]:
cdystonia.stack().head(20)

We could instead create a **hierarchical index** from the patient and observation columns, which makes the data easier to read:

In [None]:
cdystonia2 = cdystonia.set_index(['patient','obs']).drop("id",axis=1)
cdystonia2.head(10)

We could keep only the response variable and unstack it, so that each patient is a row and each observation is a column:

In [None]:
twstrs_wide = cdystonia2.twstrs.unstack("obs")
twstrs_wide.head()
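`stack()` and `unstack()` are inverse operations, which the sketch below shows on a toy frame. The patients and scores here are invented and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, obs)
long_df = pd.DataFrame(
    {
        "patient": [1, 1, 2, 2],
        "obs": [1, 2, 1, 2],
        "score": [32, 30, 60, 26],
    }
).set_index(["patient", "obs"])

# unstack moves the "obs" index level into the columns: one row per patient
wide = long_df["score"].unstack("obs")

# stack moves the columns back into the index, recovering the long Series
back = wide.stack()
```
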

Now we could merge this back onto the per-patient columns of the original dataset, giving a wide format with one row per patient and one `twstrs` column per observation:

In [None]:
cdystonia_wide = (cdystonia[['patient','site','treat','age','sex']]
     .drop_duplicates()
     .merge(twstrs_wide, right_index=True, left_on="patient", how='inner'))
cdystonia_wide.head()

We can revert to long format using `melt()`:

In [None]:
pd.melt(cdystonia_wide, id_vars=["patient","site","treat","age","sex"], var_name="obs", value_name="twstrs").head()
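The roles of `id_vars`, `var_name` and `value_name` can be seen on a small wide table. The column names `obs_1` and `obs_2` and the values are made up for this sketch.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per observation
wide = pd.DataFrame(
    {
        "patient": [1, 2],
        "sex": ["F", "M"],
        "obs_1": [32, 60],
        "obs_2": [30, 26],
    }
)

# id_vars are repeated on every row; all other columns are "melted" into
# a (variable, value) pair named by var_name and value_name
long_df = pd.melt(wide, id_vars=["patient", "sex"], var_name="obs", value_name="score")

# melt orders rows column by column; sort to group each patient's rows together
long_df = long_df.sort_values(["patient", "obs"]).reset_index(drop=True)
```
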

Alternatively we can use pivots:

In [None]:
cdystonia.pivot(index="patient", columns="week", values="twstrs").head()

`pivot_table` also accepts several keys, which gives a hierarchical index and hierarchical columns:

In [None]:
cdystonia.pivot_table(index=["patient","id"], columns=["week","obs"], values="twstrs").head()
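One practical difference between the two is worth a sketch: `pivot` only reshapes and fails if an index/column pair occurs more than once, whereas `pivot_table` aggregates duplicates (using the mean unless told otherwise). The data below is invented, with a deliberate duplicate reading.

```python
import pandas as pd

# Hypothetical data: patient 2 has two readings in week 2
df = pd.DataFrame(
    {
        "patient": [1, 1, 2, 2, 2],
        "week": [0, 2, 0, 2, 2],
        "score": [32, 30, 60, 26, 30],
    }
)

# pivot cannot decide which duplicate to keep, so it raises ValueError
try:
    df.pivot(index="patient", columns="week", values="score")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates by aggregating them
table = df.pivot_table(index="patient", columns="week", values="score", aggfunc="mean")
```
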

### Method Chaining

You may have noticed that in the example above, where we merged the wide-format table back into the dataset, we used method chaining to get what we wanted.

Let's say we wanted to perform a series of different operations on this data to obtain a more useful column/metric and output:

In [None]:
(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90], right=False))
    # select twstrs before averaging: mean() over the text columns fails in pandas 2.0+
    .groupby(['age_group','sex'], observed=False).twstrs.mean()
    .unstack("sex")
    .fillna(0.0)
    .plot.barh(figsize=(10,5)))
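The same pattern can be followed step by step on a toy frame without the plot. The ages, bins and scores below are invented for illustration. Passing a `lambda` to `assign` is one way to refer to the intermediate DataFrame inside a chain instead of repeating the variable name.

```python
import pandas as pd

# Hypothetical data, made up for this example
df = pd.DataFrame(
    {
        "age": [25, 34, 47, 52, 68, 71],
        "sex": ["F", "M", "F", "M", "F", "M"],
        "score": [30.0, 40.0, 50.0, 20.0, 60.0, 10.0],
    }
)

summary = (
    df
    # the lambda receives the frame as it exists at this point in the chain
    .assign(age_group=lambda d: pd.cut(d["age"], [0, 40, 60, 90], right=False))
    # average the score within each (age_group, sex) cell
    .groupby(["age_group", "sex"], observed=True)["score"].mean()
    # move "sex" into the columns: one row per age group
    .unstack("sex")
)
```
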

### Pipes

One of the problems with method chaining is that every processing step must exist as a method that returns a DataFrame (or Series), otherwise the chain breaks. Occasionally we want to apply our own custom function to the data; the *pipe* method solves this by passing the object to any function we supply.

For example, we may wish to express the mean *twstrs* of each age group as a *proportion* of the total across age groups, week by week, so that the groups can be compared in relative rather than absolute terms.

In [None]:
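def to_proportions(df, axis=1):
    # sum along `axis`, then divide along the other axis so the labels align
    totals = df.sum(axis=axis)
    return df.div(totals, axis=1 - axis)

(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90], right=False))
    # select twstrs before averaging: mean() over the text columns fails in pandas 2.0+
    .groupby(["week","age_group"], observed=False).twstrs.mean()
    .unstack("age_group")
    .pipe(to_proportions, axis=1))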

We can now see, for each week, what proportion of the response variable each age group accounts for.
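As a sanity check, the sketch below applies a proportions helper to a tiny made-up table, where the expected result is easy to work out by hand. It also shows that `df.pipe(f, ...)` is simply another way of writing `f(df, ...)`.

```python
import pandas as pd

def to_proportions(df, axis=1):
    # axis=1: each row sums to 1; axis=0: each column sums to 1
    totals = df.sum(axis=axis)
    return df.div(totals, axis=1 - axis)

# Hypothetical counts, invented for this example
counts = pd.DataFrame({"a": [1.0, 1.0], "b": [3.0, 1.0]}, index=["x", "y"])

by_row = counts.pipe(to_proportions, axis=1)
by_col = counts.pipe(to_proportions, axis=0)

# pipe just calls the function with the DataFrame as first argument
same_as_direct_call = by_row.equals(to_proportions(counts, axis=1))
```
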

### Data Transformation

We have several options for *transforming* labels and other columns into more useful features:

In [None]:
cdystonia.treat.replace({'Placebo': 0, "5000U": 1, "10000U": 2}).head(10)

In [None]:
cdystonia.treat.astype("category").head(10)
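One caveat when converting labels to categories: by default, pandas sorts the categories, so for strings the integer codes follow alphabetical order rather than dose order. The sketch below, on a short made-up Series of treatment labels, shows one way to fix the order explicitly with `CategoricalDtype`.

```python
import pandas as pd

# Hypothetical treatment labels, using the same names as the dataset
treat = pd.Series(["Placebo", "5000U", "10000U", "Placebo"])

# Default conversion: categories are sorted as strings
default_codes = treat.astype("category").cat.codes

# Explicit, ordered categories: the codes now follow increasing dose
dose_dtype = pd.CategoricalDtype(categories=["Placebo", "5000U", "10000U"], ordered=True)
ordered_codes = treat.astype(dose_dtype).cat.codes

# An explicit mapping gives the same integers without a categorical
mapped = treat.map({"Placebo": 0, "5000U": 1, "10000U": 2})
```
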

In [None]:
pd.cut(cdystonia.age, [20,40,60,80], labels=["Young","Middle-Aged","Old"])[-25:]

We can use `qcut` to divide our data automatically into $q$ quantile bins, each holding roughly the same number of observations. For example, $q=4$ gives quartiles.

In [None]:
pd.qcut(cdystonia.age, 4)[-20:]
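The contrast between `cut` (fixed bin edges) and `qcut` (equal-sized groups) can be sketched on a handful of made-up ages. The labels `Q1` to `Q4` are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical ages, invented for this example
ages = pd.Series([21, 25, 30, 38, 45, 52, 61, 79])

# cut: we choose the edges, so the bins can hold different numbers of people
fixed = pd.cut(ages, [20, 40, 60, 80], labels=["Young", "Middle-Aged", "Old"])

# qcut: pandas chooses the edges so that each bin holds the same share of the data
quartiles = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

fixed_counts = fixed.value_counts().to_dict()
quartile_counts = quartiles.value_counts().to_dict()
```
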