# Aggregation Functions

A very common task in numerical computing and data analysis is to compute one or more aggregate functions of an array. Especially common examples of aggregates include sums, products, averages, variances, and so on. By using `numpy`'s vectorized tools, it is often possible to construct sophisticated mathematical functions that incorporate aggregations, using very simple code. 

In [1]:
import numpy as np

In [2]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here is a quick tour of some of the most useful aggregation functions. Most of them are available either as top-level `numpy` functions (e.g. `np.sum()`) or as methods of the array itself (e.g. `a.sum()`).

In [3]:
np.sum(a), a.sum()

(45, 45)

In [4]:
a.mean()

4.5

In [8]:
a.min(), a.max()

(0, 9)

In [9]:
# `a.prod()` (or `np.prod(a)`) computes the product of all the elements in the array
a.prod() # the result is 0 because the first element is 0

0

In [7]:
(a+1).prod()

3628800
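As a small sketch of the function-versus-method point made above, the following compares both forms for a few aggregates on the same array `a`. The dictionary `pairs` and the variable names are only illustrative. It also shows one aggregate, `np.median`, that is available as a function but not as an array method.

```python
import numpy as np

a = np.arange(10)

# Function form and method form agree for the common aggregates.
pairs = {
    "sum": (np.sum(a), a.sum()),
    "mean": (np.mean(a), a.mean()),
    "std": (np.std(a), a.std()),
    "min": (np.min(a), a.min()),
    "max": (np.max(a), a.max()),
}
for name, (from_function, from_method) in pairs.items():
    print(name, from_function, from_method)

# Some aggregates exist only as functions: arrays have no `median` method.
median = np.median(a)
has_median_method = hasattr(a, "median")
print(median, has_median_method)
```

Which form to use is mostly a matter of taste; the method form is shorter, while the function form also accepts plain Python lists.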

### Example: Variance

The *variance* of a set of numbers $x_1, \ldots, x_n$ is 

$$\mathrm{var}(x_1,\ldots,x_n) = \frac{1}{n}\left[(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]\;,$$

where 

$$\bar{x} = \frac{1}{n}\left(x_1 + \cdots + x_n\right)$$

is the *mean*. Data sets with large variance are more "spread out" or "variable." 

While the formula for variance looks somewhat complicated, we can compute it as a one-liner using `numpy`'s vectorization and aggregation tools. 

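Writing $\mu$ for the mean $\bar{x}$ and $\sigma^2$ for the variance, the same quantity can also be written as

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \mu^2\;.$$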


In [10]:
x = np.random.rand(100)
v = np.mean((x - np.mean(x))**2)
v

0.08263552727934466

As it turns out, there's also an `np.var` function that would have given us the same answer. 

In [11]:
np.var(x), x.var()

(0.08263552727934466, 0.08263552727934466)
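The variance can equivalently be computed as the mean of the squares minus the square of the mean. The sketch below checks this numerically against the definition and against `np.var`. The seeded generator and the variable names are assumptions made here for reproducibility, so the data differ from the random draw shown above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility only
x = rng.random(100)

# Definition: mean squared deviation from the mean.
v_definition = np.mean((x - np.mean(x)) ** 2)

# Shortcut: mean of the squares minus the square of the mean.
v_shortcut = np.mean(x ** 2) - np.mean(x) ** 2

# np.var divides by n by default (ddof=0), matching the formula above;
# ddof=1 divides by n - 1 instead (the sample variance).
v_numpy = np.var(x)
v_sample = np.var(x, ddof=1)

print(np.isclose(v_definition, v_shortcut), np.isclose(v_definition, v_numpy))
print(np.isclose(v_sample, v_definition * len(x) / (len(x) - 1)))
```

Floating-point results from the two formulas can differ in the last few digits, which is why the comparison uses `np.isclose` rather than `==`.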

## Boolean Aggregations

There are two especially useful aggregation functions for working with boolean arrays. `np.any()` returns `True` if ANY of the array elements are `True`. `np.all()` returns `True` if ALL of the array elements are `True`. 

In [12]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
a > 8

array([False, False, False, False, False, False, False, False, False,
        True])

In [14]:
np.any(a > 8)

True

In [15]:
np.all(a < 10)

True
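Because `True` behaves like 1 and `False` like 0 in arithmetic, the ordinary aggregations can also be applied to boolean arrays. The following is a minimal sketch using a threshold of 5, which is an arbitrary choice for illustration.

```python
import numpy as np

a = np.arange(10)
mask = a > 5

# Summing a boolean array counts its True entries.
n_large = np.sum(mask)

# np.count_nonzero gives the same count.
n_large_alt = np.count_nonzero(mask)

# The mean of a boolean array is the fraction of True entries.
frac_large = np.mean(mask)

print(n_large, n_large_alt, frac_large)
```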

## Aggregation Along Multiple Dimensions

Often we don't just want the sum of all the numbers in an array -- we want the sum of all numbers *per row*. For another example, maybe we want the largest entry *in each column*. `numpy` makes it easy to accomplish these tasks via the `axis` argument. 

In [16]:
A = np.reshape(np.arange(15), (3, 5))
A

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [17]:
# by default, A.sum() adds up all the numbers
A.sum()

105

In [19]:
# the zeroth axis corresponds to ROWS. 
# summing over this axis gives the totals in each COLUMN. 

A.sum(axis = 0) # one total per column, so the result has length 5

array([15, 18, 21, 24, 27])

In [20]:
# the first axis corresponds to COLUMNS. 
# summing over this axis gives the totals in each ROW. 

A.sum(axis = 1) # one total per row, so the result has length 3

array([10, 35, 60])

Here's a helpful way to remember how to use the `axis` argument. In our case, `A.shape = (3,5)`. The sum `A.sum(axis = 0)` "eliminates" the zeroth dimension of size 3, leaving the dimension of size 5. So, `A.sum(axis = 0)` should be a 1d array of size 5, which is what we saw above. Similarly, `A.sum(axis = 1)` eliminates the first dimension of size 5, leaving the zeroth dimension of size 3. 
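The "eliminated dimension" rule can be checked directly by inspecting shapes. The sketch below also uses the `keepdims` argument, which is not discussed above: it keeps the eliminated axis as a dimension of size 1, which can be convenient when the result is combined with the original array.

```python
import numpy as np

A = np.reshape(np.arange(15), (3, 5))

col_totals = A.sum(axis=0)   # axis 0 (size 3) is eliminated
row_totals = A.sum(axis=1)   # axis 1 (size 5) is eliminated
print(col_totals.shape, row_totals.shape)

# keepdims=True keeps the eliminated axis, but with size 1.
row_means = A.mean(axis=1, keepdims=True)
print(row_means.shape)

# The (3, 1) result lines up with the rows of A, so subtracting it
# centers each row around zero.
centered = A - row_means
print(centered.mean(axis=1))
```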

Other aggregation functions also accept the `axis` argument.

In [22]:
A.min(axis = 0)

array([0, 1, 2, 3, 4])

In [23]:
A.max(axis = 1)

array([ 4,  9, 14])

In [24]:
A.mean(axis = 1)

array([ 2.,  7., 12.])

In [25]:
A.cumsum(axis = 1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])

## Dealing with NaN Values

What happens when there is a NaN value in your array? Since NaNs propagate, aggregations involving NaNs will generate more NaNs:

In [24]:
A = 1.0*A # to convert to float
A[0,:] = np.nan
A

array([[nan, nan, nan, nan, nan],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [25]:
A.mean(axis = 0)

array([nan, nan, nan, nan, nan])

In [26]:
A.mean(axis = 1)

array([nan,  7., 12.])

In some cases, we might want to completely ignore NaN values. In this case, we can use the `nan*` versions of `numpy`'s aggregation functions, like `np.nansum()`, `np.nanmean()`, `np.nanmax()`, `np.nanmin()`, and so on:

In [27]:
np.nanmean(A, axis = 0)

array([ 7.5,  8.5,  9.5, 10.5, 11.5])

In [28]:
np.nanmean(A, axis = 1)

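array([nan,  7., 12.])

Because the first row of `A` is entirely NaN, `np.nanmean` has no values left to average there. It still returns `nan` for that row and emits a `RuntimeWarning`.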
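The `nan*` functions differ in how they treat a slice that contains only NaNs. The sketch below rebuilds the same array and records any warnings instead of printing them. The use of `warnings.catch_warnings` and the variable names are choices made for this illustration.

```python
import warnings
import numpy as np

A = np.reshape(np.arange(15), (3, 5)).astype(float)
A[0, :] = np.nan  # first row is entirely NaN

# Record warnings so they can be inspected rather than printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    row_means = np.nanmean(A, axis=1)

print(row_means)
print([str(w.message) for w in caught])

# np.nansum treats NaNs as zero, so the all-NaN row sums to 0.0.
row_sums = np.nansum(A, axis=1)
print(row_sums)

# Counting the non-NaN entries per row shows which results are backed by data.
valid_counts = np.sum(~np.isnan(A), axis=1)
print(valid_counts)
```

Checking `valid_counts` before trusting a `nan*` result is one way to distinguish "no data" from a genuine value such as a sum of zero.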

In [34]:
# indeterminate forms such as 0/0 and inf/inf evaluate to nan

print(np.array(0.0) / 0)
print(np.array(np.inf) / (np.inf))

nan
nan




In [26]:
# dividing a nonzero number (or inf) by zero evaluates to inf
print(np.array(1.0) / 0)
print(np.array(np.inf) / 0)

inf
inf
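By default, NumPy reports divisions like `1.0 / 0` and `0.0 / 0` with a `RuntimeWarning` while still returning `inf` or `nan`. As a hedged sketch, the `np.errstate` context manager can be used to silence these warnings locally or to turn them into exceptions; the variable names below are illustrative.

```python
import numpy as np

# Silence the warnings inside this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    zero_over_zero = np.array(0.0) / 0
    inf_over_inf = np.array(np.inf) / np.inf
    one_over_zero = np.array(1.0) / 0

print(np.isnan(zero_over_zero), np.isnan(inf_over_inf), np.isinf(one_over_zero))

# Alternatively, ask NumPy to raise an exception on division by zero.
try:
    with np.errstate(divide="raise"):
        np.array([1.0]) / 0
    raised = False
except FloatingPointError:
    raised = True
print(raised)
```

Raising can be useful when an unexpected `inf` or `nan` would otherwise propagate silently through later aggregations.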


