# CS 5010
## Learning Python (Python version: 3)
### Topics:
###   - Aggregates in Python (NumPy and Pandas)

### Filename: Module 10 - Aggregates (Part I: NumPy)

# Aggregates using NumPy:

#### Use of 'reduce' method with another function.

To reduce an array with a particular operation, use the reduce method
of the corresponding NumPy universal function (ufunc). Calling reduce
repeatedly applies the given operation to the elements of the array until
only a single result remains.

In [1]:
import numpy as np

In [2]:
x = np.arange(4)

In [3]:
np.add.reduce(x) # np.ufunc.reduce (ufunc: universal function)

6

In [4]:
# reduce with add function
x = np.arange(1,6) # Create an array of the integers 1 through 5
print("The list:")
print(x)
print("Using add and reduce")
print( np.add.reduce(x) )

The list:
[1 2 3 4 5]
Using add and reduce
15


In [5]:
# reduce with multiply function
print("Using multiply and reduce")
print( np.multiply.reduce(x) )

Using multiply and reduce
120
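
To make the "repeatedly apply an operation" idea concrete, here is a minimal sketch that performs the same reductions by hand. The names `total` and `product` are illustrative only, and `functools.reduce` from the standard library stands in for the folding step; this is not a description of how NumPy implements reduce internally.

```python
import functools
import operator

import numpy as np

x = np.arange(1, 6)  # the integers 1 through 5

# A hand-written reduction: fold the elements together one at a time.
total = 0
for value in x:
    total = total + value

# functools.reduce performs the same kind of folding with multiplication.
product = functools.reduce(operator.mul, x)

print(total, np.add.reduce(x))         # both values are 15
print(product, np.multiply.reduce(x))  # both values are 120
```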


#### Use of 'accumulate' method with another function for keeping intermediate results

To keep all the intermediate results of the computation, use the ufunc's 'accumulate' method instead.

In [6]:
# accumulate with add function (showing intermediate results) 
print("Using add and accumulate")
print( np.add.accumulate(x) ) # np.ufunc.accumulate

# accumulate with multiply function (showing intermediate results)
print("Using multiply and accumulate")
print( np.multiply.accumulate(x) )

Using add and accumulate
[ 1  3  6 10 15]
Using multiply and accumulate
[  1   2   6  24 120]


In [7]:
np.cumsum(x)

array([ 1,  3,  6, 10, 15])
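
As a quick check on how these pieces relate, the sketch below compares the accumulate results with `np.cumsum` and `np.cumprod` for the same small array. The variable names `running_sum` and `running_product` are assumptions made for illustration.

```python
import numpy as np

x = np.arange(1, 6)  # the integers 1 through 5

running_sum = np.add.accumulate(x)
running_product = np.multiply.accumulate(x)

# np.cumsum and np.cumprod give the same running results for this array.
print(np.array_equal(running_sum, np.cumsum(x)))
print(np.array_equal(running_product, np.cumprod(x)))

# The last element of each running result equals the full reduction.
print(running_sum[-1] == np.add.reduce(x))
print(running_product[-1] == np.multiply.reduce(x))
```

All four lines should print True.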

#### Computing summary statistics

Common summary statistics include mean and standard deviation, but other aggregates are useful as well, such as:
      sum, product, median, minimum, maximum, etc...

#### Summing the values in an array

To compute the sum of all values in an array, use Python's built-in 'sum' function.

In [8]:
L = np.random.random(100)
small = np.random.random(5)*10  # An example that's easier to view/visualize
print ( small )
#print( L ) # Can print out L just to see what it looks like
print( sum(L) )
print( sum(small) )

# Note, quite similar to NumPy's 'sum' function (result is the same)
print( np.sum(small) )

# NumPy's version of the operation is computed much more quickly. Note, however,
# that the sum function and the np.sum function are not identical (e.g. their
# optional arguments have different meanings, and np.sum can work with
# multiple array dimensions).

[7.56096289 3.91295114 0.60734587 6.68775061 5.19679336]
57.54738963674549
23.965803866653207
23.965803866653207
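
The difference in optional arguments is easy to miss, so here is a small sketch using a 2-D array chosen for illustration (the array `m` is not part of the original notebook). For the built-in `sum`, the second argument is a start value; for `np.sum`, it is the axis to collapse.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

# Built-in sum: the second argument is a start value added to the total.
# Iterating over a 2-D array yields its rows, so the rows are added together.
builtin_result = sum(m, 1)   # 1 + [1, 2] + [3, 4]

# np.sum: the second argument is the axis to collapse.
numpy_result = np.sum(m, 1)  # sum within each row

print(builtin_result)  # [5 7]
print(numpy_result)    # [3 7]
print(np.sum(m))       # 10 (no axis given, so every element is summed)
```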


#### Minimum and Maximum

Python has built-in min and max functions that find the minimum or maximum value of any given array. NumPy's corresponding functions have similar syntax but are much more efficient.

In [9]:
big_array = np.random.rand(1000000) # 1 million elements
print( min(big_array), max(big_array) )         # built-in functions
print( np.min(big_array), np.max(big_array) )   # NumPy functions 

2.7362703503008845e-07 0.9999996924776672
2.7362703503008845e-07 0.9999996924776672


#### Let's run an experiment to time the built-in function against the NumPy version of the same function

In [10]:
print("Let's time these two functions...")

import timeit
# Any setup for the statements to be able to run:
SETUP = '''
import numpy as np
big_array = np.random.rand(1000000)'''

# Lines of code to time:
CODE_TO_TIME1 = '''min(big_array)'''
CODE_TO_TIME2 = '''np.min(big_array)'''

# Time these lines of code to compare the difference (use timeit)
# (NumPy version should execute faster!)

# ** May take a little while to execute ** 

t1 = timeit.timeit(setup = SETUP, stmt = CODE_TO_TIME1, number = 100) 
t2 = timeit.timeit(setup = SETUP, stmt = CODE_TO_TIME2, number = 100) 

print('min function time: {}'.format(t1)) 
print('np.min function time: {}'.format(t2)) 
print("===================")

Let's time these two functions...
min function time: 8.565271394000035
np.min function time: 0.04193960499998184
===================


#### Using multidimensional aggregates

Aggregation operations can also be applied along a row or column of a multidimensional array.

In [11]:
M = np.random.random((3,4)) # row=3; col=4
print(M)

print( M.sum() )  # NumPy aggregate function over entire array

[[0.45555803 0.0710914  0.42646826 0.4188181 ]
 [0.42232175 0.66459923 0.66782649 0.51632432]
 [0.23195422 0.57090933 0.8000835  0.04066027]]
5.2866149034049625


In [12]:
# min on an axis - takes an additional argument
print( M.min(axis=0) )  # returns 4 values, min for each of the 4 columns

# max on an axis
print( M.max(axis=1) )  # returns 3 values, max for each of the 3 rows

[0.23195422 0.0710914  0.42646826 0.04066027]
[0.45555803 0.66782649 0.8000835 ]


Important note about the additional 'axis' argument:
The way the axis is specified can be confusing. The axis keyword specifies the DIMENSION of the array that will be collapsed, rather than the dimension that will be returned. 

So, specifying axis=0 means that the FIRST axis will be collapsed:
for two-dimensional arrays, this means that the values within each COLUMN will be aggregated. 

Ref: https://books.google.com/books?id=xYmNDQAAQBAJ&pg=PA60&lpg=PA60&dq=The+way+the+axis+is+specified+can+be+confusing.+The+axis+keyword+specifies+the+DIMENSION+of+the+array+that+will+be+collapsed,+rather+than+the+dimension+that+will+be+returned.&source=bl&ots=XqeLj5nk2N&sig=ACfU3U3dogcaPpVlJMiWunOM05BsLYo48Q&hl=en&sa=X&ved=2ahUKEwj5s8Cl6NblAhVKpFkKHU2nAu0Q6AEwAHoECAkQAQ#v=onepage&q=The%20way%20the%20axis%20is%20specified%20can%20be%20confusing.%20The%20axis%20keyword%20specifies%20the%20DIMENSION%20of%20the%20array%20that%20will%20be%20collapsed%2C%20rather%20than%20the%20dimension%20that%20will%20be%20returned.&f=false
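
One way to see which dimension is collapsed is to look at the shape of the result. The sketch below uses a small integer array instead of random values so the sums are easy to verify by eye; that substitution, and the use of `keepdims`, are illustrative choices rather than part of the original notebook.

```python
import numpy as np

M = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0 through 11

# axis=0 collapses the first dimension (rows): one result per column.
print(M.sum(axis=0).shape)  # (4,)
print(M.sum(axis=0))        # [12 15 18 21]

# axis=1 collapses the second dimension (columns): one result per row.
print(M.sum(axis=1).shape)  # (3,)
print(M.sum(axis=1))        # [ 6 22 38]

# keepdims=True keeps the collapsed dimension with length 1.
print(M.sum(axis=0, keepdims=True).shape)  # (1, 4)
```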

#### Many other aggregate functions are available in NumPy

For example:
np.mean, np.std, np.var, np.argmin, np.argmax, np.median, np.percentile, ...

Review what each of these does.
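
As a starting point for that review, here is a short sketch that calls each function on a small hand-picked array. The array `data` is an assumption chosen so that the results come out to round numbers.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.mean(data))            # 5.0 (arithmetic mean)
print(np.var(data))             # 4.0 (population variance by default)
print(np.std(data))             # 2.0 (square root of the variance)
print(np.median(data))          # 4.5 (middle of the sorted values)
print(np.percentile(data, 50))  # 4.5 (the 50th percentile is the median)
print(np.argmin(data))          # 0 (index of the smallest value)
print(np.argmax(data))          # 7 (index of the largest value)
```

Note that np.argmin and np.argmax return positions, not values, which is what distinguishes them from np.min and np.max.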