<h1>Advanced Pandas</h1>

<h3>Categorical Data</h3>

Frequently, a column in a table contains repeated instances of a smaller set of distinct values. We have already seen functions like <b>unique</b> and <b>value_counts</b>, which let us extract the distinct values from an array and compute their frequencies, respectively:

In [1]:
import numpy as np
import pandas as pd

In [2]:
values = pd.Series(['apple', 'orange', 'apple', 'apple']*2)

In [3]:
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
values.unique()

array(['apple', 'orange'], dtype=object)

In [5]:
values.value_counts()

apple     6
orange    2
dtype: int64

In [6]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [7]:
pd.value_counts(values)

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called <b>dimension tables</b> containing distinct values and storing the primary observations as integer keys referencing the dimension table:

In [8]:
values = pd.Series([0,1,0,0,] * 2)

In [9]:
dim = pd.Series(['apple', 'orange'])

In [10]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [11]:
dim

0     apple
1    orange
dtype: object

In [12]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the <b>categorical</b> or <b>dictionary-encoded</b> representation. The array of distinct values can be called the <b>categories, dictionary, or levels</b> of the data. The integer values that reference the categories are called the <b>category codes</b> or simply <b>codes</b>.
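As a small illustration of dictionary encoding, the sketch below uses pd.factorize to split a Series into integer codes and distinct values, then rebuilds the original data from the two parts. The variable names (codes, uniques, rebuilt) are chosen here for illustration only.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(values)

# Taking the distinct values at the code positions rebuilds the data
rebuilt = pd.Series(uniques.take(codes))

print(codes)
print(list(uniques))
```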

We can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
<ul>
    <li>Renaming categories</li>
    <li>Appending a new category without changing the order or position of the existing categories</li>
</ul>
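Both transformations can be sketched with the rename_categories and add_categories methods of pandas.Categorical. This is a minimal example on made-up data; the point is that the codes array is identical before and after each change.

```python
import pandas as pd

cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Renaming categories only rewrites the small categories array
renamed = cat.rename_categories({'apple': 'APPLE', 'orange': 'ORANGE'})

# Appending a category keeps existing positions, so codes stay valid
extended = cat.add_categories(['pear'])

print(cat.codes, renamed.codes, extended.codes)
print(list(extended.categories))
```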

<h3>Categorical Type in Pandas</h3>

pandas has a special <b>Categorical</b> type for holding data that uses the integer-based categorical representation or encoding.

In [13]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [14]:
N = len(fruits)

In [15]:
df = pd.DataFrame({
    'fruit': fruits,
    'basket_id': np.arange(N),
    'count': np.random.randint(3,15,size = N),
    'weight': np.random.uniform(0,4,size=N)
    },
    columns = ['basket_id', 'fruit', 'count', 'weight']
)

In [16]:
df

   basket_id   fruit  count    weight
0          0   apple      5  0.432444
1          1  orange      3  0.185635
2          2   apple      3  2.979993
3          3   apple      6  2.057781
4          4   apple     13  1.694966
5          5  orange      5  1.451326
6          6   apple      7  1.296764
7          7   apple      9  0.893377


In [17]:
fruit_cat = df['fruit'].astype('category')

In [18]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:

In [19]:
c = fruit_cat.values

In [20]:
c

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In [21]:
type(c)

pandas.core.arrays.categorical.Categorical

The <b>Categorical</b> object has categories and codes attributes:

In [22]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [23]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

We can convert a DataFrame column to categorical by assigning the converted result:

In [24]:
df['fruit'] = df['fruit'].astype('category')

In [25]:
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

We can also create a <b>pandas.Categorical</b> directly from other types of Python sequences:

In [26]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [27]:
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [28]:
my_categories.categories

Index(['bar', 'baz', 'foo'], dtype='object')

In [29]:
my_categories.codes

array([2, 0, 1, 2, 0], dtype=int8)

If we have obtained categorical encoded data from another source, we can use the alternative <b>from_codes</b> constructor:

In [30]:
categories =['foo','bar', 'baz']

In [31]:
codes = [0,1,2,0,0,1]

In [32]:
my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [33]:
my_cats_2

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

Unless explicitly specified, categorical conversions assume no specific ordering of the categories, so the categories array may be in a different order depending on the ordering of the input data. When using <b>from_codes</b> or any of the other constructors, we can indicate that the categories have a meaningful ordering:

In [34]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

In [35]:
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

The output ['foo' < 'bar' < 'baz'] indicates that 'foo' precedes 'bar' in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered:

In [36]:
my_cats_2.as_ordered()

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

<h3>Computations with Categoricals</h3>

Using <b>Categorical</b> in pandas generally behaves the same way as using the non-encoded version (like an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.
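As one illustration of the ordered flag, the sketch below builds an ordered categorical from made-up size labels and uses min, max, a comparison against a category, and sorting. All of these follow the declared category order rather than alphabetical order. The data and variable names are assumptions for illustration.

```python
import pandas as pd

sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

# min and max follow the category order, not alphabetical order
smallest = sizes.min()
largest = sizes.max()

# Comparisons against a category are allowed for ordered categoricals
at_least_medium = sizes >= 'medium'

# Sorting also follows the category order
sorted_sizes = pd.Series(sizes).sort_values()

print(smallest, largest)
print(list(at_least_medium))
print(list(sorted_sizes))
```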

Let's consider some random numeric data and use the <b>pandas.qcut</b> binning function. This returns a <b>pandas.Categorical</b>; we used <b>pandas.cut</b> in earlier chapters but glossed over the details of how categoricals work:

In [37]:
np.random.seed(12345)

In [38]:
draws = np.random.randn(1000)

In [39]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

Let's compute a quartile binning of this data and extract some statistics:

In [40]:
bins = pd.qcut(draws, 4)

In [41]:
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

While useful, the exact sample quartiles may be less useful for producing a report than quartile names. We can achieve this with the labels argument to qcut:

In [42]:
bins = pd.qcut(draws, 4, labels = ['Q1', 'Q2', 'Q3', 'Q4'])

In [43]:
bins

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [44]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

The labeled bins categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics:

In [45]:
bins = pd.Series(bins, name='quartile')

In [56]:
bins

0      Q2
1      Q3
2      Q2
3      Q2
4      Q4
       ..
995    Q3
996    Q2
997    Q1
998    Q3
999    Q4
Name: quartile, Length: 1000, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [46]:
results = (pd.Series(draws).groupby(bins)
          .agg(['count', 'min', 'max'])
          .reset_index())

In [47]:
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


The <b>'quartile'</b> column in the result retains the original categorical information, including ordering, from bins:

In [48]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

<h3>Better Performance with Categoricals</h3>

If we do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. Let's consider some Series with 10 million elements and a small number of distinct categories:

In [49]:
N = 10000000

In [50]:
draws = pd.Series(np.random.randn(N))

In [51]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux']*(N//4))

Now we convert labels to categorical:

In [52]:
categories = labels.astype('category')

Now we note that labels uses significantly more memory than categories:

In [53]:
labels.memory_usage()

80000128

In [54]:
categories.memory_usage()

10000320
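For object-dtype strings, memory_usage() without arguments counts only the 8-byte references, not the string objects themselves. The sketch below, on a smaller Series than the one above, compares the default figure with deep=True. Exact byte counts depend on the pandas version and string storage, so none are shown.

```python
import pandas as pd

n = 100_000  # smaller than the 10 million above, to keep the sketch quick
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
categories = labels.astype('category')

# Default: for object dtype this counts only the references
shallow_bytes = labels.memory_usage()

# deep=True also counts the memory held by the string values
deep_bytes = labels.memory_usage(deep=True)

# The categorical stores one small integer code per row plus the categories
cat_bytes = categories.memory_usage(deep=True)

print(shallow_bytes, deep_bytes, cat_bytes)
```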

The conversion to category is not free, of course, but it is a one-time cost:

In [55]:
%time _ = labels.astype('category')

Wall time: 719 ms


GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
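One rough way to check this on our own machine is to time the same aggregation with string keys and with categorical keys. The sketch below uses time.perf_counter and a smaller dataset. Timings vary by machine and pandas version, so it only prints them. observed=True is passed explicitly because the default for categorical groupers has changed across pandas versions.

```python
import time

import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)
draws = pd.Series(rng.standard_normal(n))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
categories = labels.astype('category')

# Group by the raw strings
start = time.perf_counter()
by_string = draws.groupby(labels).mean()
string_seconds = time.perf_counter() - start

# Group by the categorical version of the same keys
start = time.perf_counter()
by_category = draws.groupby(categories, observed=True).mean()
category_seconds = time.perf_counter() - start

print(f'strings: {string_seconds:.4f}s, categoricals: {category_seconds:.4f}s')
```

Both calls compute the same group means; only the key representation differs.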

<h3>Categorical Methods</h3>

Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes. Consider the Series:

In [57]:
s = pd.Series(['a', 'b', 'c', 'd']*2)

In [58]:
s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: object

In [59]:
cat_s = s.astype('category')

The special attribute cat provides access to categorical methods:

In [65]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [69]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [70]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the <b>set_categories</b> method to change them:

In [71]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [72]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [73]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, value_counts respects the categories, if present:

In [74]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [75]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After we filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the <b>remove_unused_categories</b> method to trim unobserved categories:

In [76]:
cat_s3 = cat_s[cat_s.isin(['a','b'])]

In [77]:
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [78]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']

![alt Text](Images/AdvancedPandas/cat_series.png)

<h3>Creating dummy variables for modeling</h3>

When we're using statistics or machine learning tools, we'll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

In [79]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

As mentioned earlier in Chapter 7, the <b>pandas.get_dummies</b> function converts this one-dimensional categorical data into a DataFrame containing the dummy variables:

In [81]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
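Because get_dummies works from the categories rather than the observed values, a category that never occurs still gets its own all-zero column. The sketch below shows this on a small made-up Series. Recent pandas versions may return boolean columns by default, so dtype=int is passed here to get the 1/0 layout.

```python
import pandas as pd

# 'c' is a declared category that never appears in the data
observed = pd.Series(
    pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
)

dummies = pd.get_dummies(observed, dtype=int)
print(dummies)
```

Keeping the unobserved column can be useful when a model expects the same set of features on every batch of data.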


<h3>Advanced GroupBy Use</h3>

While we've already discussed using the groupby method for Series and DataFrame in depth in Chapter 10, there are some additional techniques that we may find of use:

<h4>Group Transforms and "Unwrapped" GroupBys</h4>

In Chapter 10, we looked at the apply method in grouped operations for performing transformations. There is another built-in method called <b>transform</b>, which is similar to apply but imposes more constraints on the kind of function we can use:
<ul>
    <li>It can produce a scalar value to be broadcast to the shape of the group</li>
    <li>It can produce an object of the same shape as the input group</li>
    <li>It must not mutate its input</li>
</ul>

Let's consider a simple example for illustration:

In [84]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                  'value': np.arange(12.)})

In [85]:
df

   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0


Here are the group means by key:

In [86]:
g = df.groupby('key').value

In [87]:
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as df['value'] but with values replaced by the average grouped by 'key'. We can pass the function lambda x: x.mean() to transform:

In [90]:
g.transform(lambda x : x.mean())

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

For built-in aggregation functions, we can pass a string alias as with the GroupBy agg method:

In [92]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

Like <b>apply</b>, <b>transform</b> works with functions that return Series, but the result must be the same size as the input. For example, we can square the values in each group using a lambda function:

In [94]:
g.transform(lambda x: x**2)

0       0.0
1       1.0
2       4.0
3       9.0
4      16.0
5      25.0
6      36.0
7      49.0
8      64.0
9      81.0
10    100.0
11    121.0
Name: value, dtype: float64

As a more complicated example, we can compute the ranks in descending order for each group:

In [95]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

Consider a group transformation function composed from simple aggregations:


In [96]:
def normalize(x):
    return (x-x.mean())/x.std()

We can obtain equivalent results in this case, either using transform or apply:

In [97]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [98]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function. These also have a 'fast path' when used with transform. This allows us to perform a so-called <b>unwrapped</b> group operation:

In [99]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [101]:
normalized = (df['value'] - g.transform('mean'))/g.transform('std')

In [102]:
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

While an unwrapped group operation may involve multiple group aggregations, the overall benefit of vectorized operations often outweighs this.
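To see the trade-off on a larger input, the sketch below times the function-based and unwrapped versions of the same normalization on made-up data with about a thousand groups, and checks that they agree. Timings depend on the machine and pandas version, so they are only printed.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'key': rng.integers(0, 1000, size=200_000),
                    'value': rng.standard_normal(200_000)})
g = big.groupby('key')['value']


def normalize(x):
    return (x - x.mean()) / x.std()


# One Python-level function call per group
start = time.perf_counter()
wrapped = g.transform(normalize)
wrapped_seconds = time.perf_counter() - start

# Two built-in group aggregations plus vectorized arithmetic
start = time.perf_counter()
unwrapped = (big['value'] - g.transform('mean')) / g.transform('std')
unwrapped_seconds = time.perf_counter() - start

print(f'wrapped: {wrapped_seconds:.4f}s, unwrapped: {unwrapped_seconds:.4f}s')
```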

<h3>Grouped Time Resampling</h3>

For time series data, the <b>resample</b> method is semantically a group operation based on a time intervalization. Here's a small example table:

In [103]:
N = 15

In [104]:
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

In [105]:
times

DatetimeIndex(['2017-05-20 00:00:00', '2017-05-20 00:01:00',
               '2017-05-20 00:02:00', '2017-05-20 00:03:00',
               '2017-05-20 00:04:00', '2017-05-20 00:05:00',
               '2017-05-20 00:06:00', '2017-05-20 00:07:00',
               '2017-05-20 00:08:00', '2017-05-20 00:09:00',
               '2017-05-20 00:10:00', '2017-05-20 00:11:00',
               '2017-05-20 00:12:00', '2017-05-20 00:13:00',
               '2017-05-20 00:14:00'],
              dtype='datetime64[ns]', freq='T')

In [106]:
df = pd.DataFrame({
    'time': times,
    'value': np.arange(N)
})

In [107]:
df

                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14


Here, we can index by 'time' and then resample:

In [120]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
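When a table holds several time series distinguished by an extra key column, one possible approach is to group by that key together with a pd.Grouper object that describes the time interval. The sketch below is an assumption-laden illustration: the 'key' column and the df2 name are made up here, and the time column is moved into the index before grouping.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

# Three interleaved series, one per key, sharing the same timestamps
df2 = pd.DataFrame({
    'time': times.repeat(3),
    'key': np.tile(['a', 'b', 'c'], N),
    'value': np.arange(N * 3.),
})

# pd.Grouper describes the 5-minute bins on the DatetimeIndex
time_key = pd.Grouper(freq='5min')

resampled = (df2.set_index('time')
                .groupby(['key', time_key])
                .sum())
print(resampled)
```

The result has one row per (key, 5-minute window) pair, which is the grouped analogue of the resample call above.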


<h3>Techniques for Method Chaining</h3>

When we apply a sequence of transformations to a dataset, we may find ourselves creating numerous temporary variables that are never used in our analysis. Consider this example, for instance:

LEARN ABOUT METHOD CHAINING SEPARATELY
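A minimal sketch of the idea, using a small hypothetical DataFrame (the column names key and col1 are assumptions for illustration): the step-by-step version needs a temporary df2, while the chained version uses assign with a callable so that each step operates on the result of the previous one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'col1': [1.0, 2.0, 3.0, 4.0]})

# Step-by-step version: df2 exists only to carry the intermediate column
df2 = df.copy()
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
stepwise = df2.groupby('key')['col1_demeaned'].std()

# Chained version: assign returns a new DataFrame, and the callable
# receives the DataFrame as it exists at that point in the chain
chained = (df.assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean())
             .groupby('key')['col1_demeaned']
             .std())

print(chained)
```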

<h3>The Pipe Method</h3>

We can accomplish a lot with built-in functions and the approaches to method chaining with callables that we just looked at. However, sometimes we need to use our own functions or functions from third-party libraries. This is where the <b>pipe</b> method comes in.

Consider a sequence of function calls:

<pre>
a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)
</pre>

When using functions that accept and return Series or DataFrame objects, we can rewrite this using calls to pipe:

<pre>
result = (df.pipe(f, arg1=v1)
            .pipe(g, v2, arg3=v3)
            .pipe(h, arg4=v4))
</pre>

The statements f(df) and df.pipe(f) are equivalent, but pipe makes chained invocation easier.

A potentially useful pattern for pipe is to generalize sequences of operations into reusable functions. As an example, let's consider subtracting group means from a column:

<pre>
    g = df.groupby(['key1', 'key2'])
    df['col1'] = df['col1'] - g['col1'].transform('mean')
</pre>

Suppose that we wanted to be able to demean more than one column and easily change the group keys. Additionally, we might want to perform this transformation in a method chain. Here is an example implementation:

<pre>
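def group_demean(df, by, cols):
    # Work on a copy so the caller's DataFrame is left untouched
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        # Subtract each row's group mean from the column
        result[c] = df[c] - g[c].transform('mean')
    return result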
</pre>

Then it is possible to write:

<pre>
result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
</pre>
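To try this end to end, the sketch below restates group_demean and runs it through pipe on a small made-up DataFrame. The demo data and the names demo and demeaned are assumptions for illustration.

```python
import pandas as pd


def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result


demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b'],
    'key2': ['x', 'x', 'y', 'y', 'x', 'y'],
    'col1': [-1.0, -3.0, -2.0, -6.0, 5.0, -4.0],
})

# Filter first, then demean col1 within each (key1, key2) group
demeaned = (demo[demo.col1 < 0]
            .pipe(group_demean, ['key1', 'key2'], ['col1']))
print(demeaned)
```

After demeaning, the mean of col1 within each group is zero, which is a quick way to sanity-check the result.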