## Aggregates in Python (NumPy and Pandas)

### University of Virginia
### DS 5100: Programming for Data Science

### Filename: Module 10 - Aggregates (Part II: Pandas)
---  


# Aggregates using Pandas:

### Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.

The following examples illustrate aggregations in Pandas. The previous Aggregates file illustrated aggregations in NumPy. 

#### Simple Aggregation in Pandas

Previously we explored some of the data aggregations available for NumPy arrays. As with a one-dimensional NumPy array, the aggregates of a Pandas Series return a single value.

In [None]:
import pandas as pd
import numpy as np

X = np.random.RandomState(42)     # create a RandomState   ( <mtrand.RandomState at 0x252c464b3f0> )
ser = pd.Series(X.rand(5))        # 5 random float values in a Pandas Series
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [None]:
theSum = ser.sum()
print(theSum)

theMean = ser.mean()
print(theMean)

theMin = ser.min()
print(theMin)

2.811925491708157
0.5623850983416314
0.15601864044243652


Pandas Series and DataFrames include all of the common aggregates mentioned previously (in relation to NumPy). In addition, there is a convenient method called 'describe()' that computes several common aggregates for each column and returns the result.

In [None]:
ser.describe()

count    5.000000
mean     0.562385
std      0.308748
min      0.156019
25%      0.374540
50%      0.598658
75%      0.731994
max      0.950714
dtype: float64
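
Because 'describe()' works column by column, a small sketch on a DataFrame may help. The frame name `scores` and its values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical two-column DataFrame; the values are chosen only for illustration
scores = pd.DataFrame({'math': [70, 80, 90],
                       'art': [60, 75, 90]})

# describe() returns a DataFrame: one row per statistic, one column per input column
summary = scores.describe()
print(summary)

# Individual statistics can be looked up by row label and column name
print(summary.loc['mean', 'math'])
```

The result is itself a DataFrame, so any single statistic can be pulled out with ordinary label-based indexing.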

#### For a DataFrame, by default the aggregates return results within each column

In [None]:
# Create a DataFrame with two columns
df = pd.DataFrame({'A': X.rand(5),
                   'B': X.rand(5)})
df

          A         B
0  0.155995  0.020584
1  0.058084  0.969910
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825


In [None]:
df.mean()
# Note, the output is the mean of each column ("A" and "B" in this case)

A    0.477888
B    0.443420
dtype: float64

#### By specifying the axis argument, you can instead aggregate within each row

In [None]:
df.mean(axis='columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64
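
The two axis choices can be compared side by side in a minimal sketch. The frame `small` and its values are assumptions chosen so the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical DataFrame with easy-to-verify values
small = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [3.0, 4.0, 5.0]})

# axis='index' (equivalently axis=0, the default): collapse the rows, one value per column
col_means = small.mean(axis='index')

# axis='columns' (equivalently axis=1): collapse the columns, one value per row
row_means = small.mean(axis='columns')

print(col_means)
print(row_means)
```

The axis argument names the dimension that is collapsed, which is why axis='columns' yields one result per row.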

#### Some Built-in Pandas aggregations:

count() - Total number of items

first(), last() - First and last item

mean(), median() - Mean and median

min(), max() - Minimum and maximum

std(), var() - Standard deviation and variance

prod() - Product of all items

sum() - Sum of all items
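
Several of these aggregations can be requested at once with 'agg()'. The sketch below uses a made-up Series named `values`. It also assumes that first() and last() are meant in their grouped sense (first and last item within each group), so they are shown on a small hypothetical GroupBy object.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand
values = pd.Series([2, 4, 4, 10])

# agg() accepts a list of aggregation names and returns one result per name
stats = values.agg(['count', 'sum', 'prod', 'min', 'max', 'mean', 'median'])
print(stats)

# first() and last() shown on grouped data (hypothetical frame)
frame = pd.DataFrame({'key': ['A', 'A', 'B'],
                      'val': [1, 2, 3]})
grouped_val = frame.groupby('key')['val']
first_items = grouped_val.first()   # first item within each group
last_items = grouped_val.last()     # last item within each group
print(first_items)
print(last_items)
```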

---

## GroupBy: Split, Apply, Combine

Simple aggregations give you a flavor of the dataset at hand, but they can only take you so far. Often it is preferable to aggregate conditionally on some label or index; this is implemented in the "groupby" operation. Under the hood, it performs three steps: "split", "apply", and "combine."

#### Split, Apply, Combine

In the canonical example of the split-apply-combine operation, the "apply" step is a summation aggregation. The three steps work as follows:

The *split* step involves breaking up and grouping a DataFrame depending on the value of the specified key.

The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

The *combine* step merges the results of these operations into an output array.
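
The three steps can be sketched by hand to show what a grouped sum computes. This is only an illustration under stated assumptions, not how Pandas implements groupby internally; the names `toy`, `pieces`, `sums`, and `manual` are hypothetical.

```python
import pandas as pd

# Hypothetical input: a 'key' column with repeated labels and a 'data' column 0..5
toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

# Split: one sub-DataFrame per distinct key value
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: sum the 'data' column within each piece
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into a single Series indexed by key
manual = pd.Series(sums, name='data')
manual.index.name = 'key'
print(manual)

# The built-in groupby performs the same computation in one call
builtin = toy.groupby('key')['data'].sum()
print(builtin)
```

The manual version makes each step visible; the groupby call gives the same per-key sums without building the intermediate pieces explicitly.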

In [None]:
# Create the input DataFrame (two columns, "key" and "data")
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],      # key has some repetitions
                   'data': range(6)}, columns=['key', 'data']) # data is very simply numbers 0 through 5
df

  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5


In [None]:
# Using groupby
# Passing the name of the desired key column:
df.groupby('key')

# Notice the output! What is returned is not a set of DataFrames, but a DataFrameGroupBy object.
# The object is where the magic is: you can think of it as a special view of the DataFrame

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027A68A634A8>

In [None]:
# To produce a result, we can apply an aggregate to this DataFrameGroupBy object, which will 
# perform the appropriate apply/combine steps to produce the desired result:
df.groupby('key').sum()  # For each key, calculate the sum

# The sum() method is just one possibility here; you can apply virtually any common Pandas or NumPy 
# aggregation function, as well as virtually any valid DataFrame operation


     data
key
A       3
B       5
C       7


The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it's a collection of DataFrames, and it does the difficult things under the hood. 
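
One way to see the "collection of DataFrames" view is to iterate over the GroupBy object, which yields each group label together with its sub-DataFrame. The sketch below rebuilds a frame with the same key/data layout under the hypothetical name `toy`.

```python
import pandas as pd

# Hypothetical frame with the same layout as the example above
toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})
grouped = toy.groupby('key')

# Iterating yields (group label, sub-DataFrame) pairs
sizes = {}
for label, group in grouped:
    sizes[label] = len(group)
    print(label)
    print(group)

# A single group can also be retrieved by its label
only_a = grouped.get_group('A')
print(only_a)
```

Each `group` here is an ordinary DataFrame, so any DataFrame operation can be applied to it inside the loop.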