# Module 5 - Numerical Python II


NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays (source: https://en.wikipedia.org/wiki/NumPy).

NumPy is a Python library, so it comes as a collection of Python modules.
- More about numpy: https://www.w3schools.com/python/numpy/default.asp



In [1]:
import numpy as np
np.set_printoptions(precision=2)

## Aggregation Functions and Statistics

An aggregation function maps a collection of values to a single value.

Examples:

- Mean
- Max and min
- Count (distinct)
- ...

In [2]:
# creating an array
v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])
v

array([1., 2., 1., 3., 2., 4., 6., 5., 6.])

- We can calculate different aggregation functions very easily:

In [3]:
# mean() is available as a method on the array itself
v.mean()

3.3333333333333335

In [4]:
np.mean(v)

3.3333333333333335

In [5]:
np.max(v)

6.0

In [6]:
np.min(v)

1.0

In [7]:
len(np.unique(v))

6
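As a small extension of the distinct count above, the following sketch shows how `np.unique` can also report how often each distinct value occurs. The variable names `values`, `counts`, and `n_distinct` are illustrative choices, not part of the original notebook.

```python
import numpy as np

v = np.array([1, 2, 1, 3, 2, 4, 6, 5, 6.])

# return_counts=True also reports how often each distinct value occurs
values, counts = np.unique(v, return_counts=True)
n_distinct = len(values)

print(values)      # the distinct values, sorted
print(counts)      # occurrences of each distinct value
print(n_distinct)  # 6
```

The counts add up to the number of elements in `v`, which is a quick sanity check on the result.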

### Aggregating higher dimensional ndarrays

In [8]:
# create a 2D array with 3 rows and 5 columns
M = np.arange(15).reshape(3,5)
M

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

The matrix $M$ has two axes. This creates multiple possible aggregations.

1. Treat $M$ as a single collection of values.
2. Aggregate each row to a single value. This means we aggregate *along* the column axis.
3. Aggregate each column to a single value. This means we aggregate *along* the row axis.

In [9]:
# Sum up the whole matrix
np.sum(M)

105

In [10]:
# Sum up along the row axis
np.sum(M, 0)

array([15, 18, 21, 24, 27])

In [11]:
# Sum up along the column axis
np.sum(M, 1)

array([10, 35, 60])

### Making sense of the aggregation along an axis

- Shape of M: `(3, 5)`
- `Y = np.sum(M, 0)`: collapses the row axis (axis 0), leaving one value per column. So the resulting shape is `(5,)`.
- `Y = np.sum(M, 1)`: collapses the column axis (axis 1), leaving one value per row. The output shape is `(3,)`.
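The shape bookkeeping can be checked directly. The sketch below also uses the `keepdims=True` argument of `np.sum`, which is not covered in the text above: it keeps the collapsed axis with length 1 instead of removing it. The variable names are illustrative.

```python
import numpy as np

M = np.arange(15).reshape(3, 5)

col_sums = np.sum(M, axis=0)                       # row axis removed
row_sums = np.sum(M, axis=1)                       # column axis removed
col_sums_kept = np.sum(M, axis=0, keepdims=True)   # row axis kept with length 1

print(M.shape)              # (3, 5)
print(col_sums.shape)       # (5,)
print(row_sums.shape)       # (3,)
print(col_sums_kept.shape)  # (1, 5)
```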

**Challenge**

NumPy allows us to aggregate along multiple axes. Can you make sense of the following?

In [12]:
np.sum(M, ()) # empty tuple of axes: no axis is collapsed, so the shape is unchanged

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [45]:
np.sum(M, (0, 1)) # collapses both axes, so the output has zero axes: a single value

105

Let's make sure we understand the transformation:

- shape of M is `(3,5)`.
- `np.sum(M, ())` does not collapse any of the axes, so the output shape is unchanged: `(3,5)`.
- `np.sum(M, (0, 1))` collapses both axes, so the output shape has zero axes `()`, which is just a single value.
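The same reasoning carries over to arrays with more than two axes. The sketch below uses a hypothetical 3D array `T` (not part of the original notebook) and collapses only two of its three axes.

```python
import numpy as np

# A hypothetical 3D array with shape (2, 3, 4)
T = np.arange(24).reshape(2, 3, 4)

partial = np.sum(T, axis=(0, 2))   # collapse axes 0 and 2, keep axis 1
total = np.sum(T, axis=(0, 1, 2))  # collapse every axis

print(partial.shape)  # (3,)
print(partial)        # [ 60  92 124]
print(total)          # 276
```

Only the axes named in the tuple disappear from the output shape; the remaining axes keep their original order.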

## Statistics

A statistic of a collection of samples from a population is an aggregated value that reflects the underlying nature of the whole population.

- Minimum and maximum
- Percentile
- Median
- Mean
- Standard deviation and variance
 
**Minimum and maximum**
- Find the extreme value in a collection.
- Find the index of the extremum.

In [14]:
nums = np.random.randn(10)
nums

array([ 0.04,  1.17,  1.14,  0.25, -1.83,  1.15,  1.88, -0.22,  1.66,
        0.12])

In [15]:
# What is the minimum and maximum?
print("minimum =", nums.min())
print("maximum =", nums.max())

minimum = -1.8278614079393756
maximum = 1.8818538463152292


In [16]:
# But where are they?
i_min = nums.argmin()
i_max = nums.argmax()
print("nums[{}] = {:.2f}".format(i_min, nums[i_min]))
print("nums[{}] = {:.2f}".format(i_max, nums[i_max]))

nums[4] = -1.83
nums[6] = 1.88


**Argmin and argmax for multiple axes**


In [24]:
X = np.random.rand(3 * 5).reshape(3,5) * 100
X

array([[82.36,  4.49, 60.88, 69.85,  4.72],
       [38.31, 93.07, 29.27, 98.95, 71.  ],
       [25.42, 13.44, 75.04, 57.73,  0.2 ]])

In [25]:
# By default, argmin flattens the ndarray
np.argmin(X) # finding the index with the minimum out of the entire array (no axis specified)

14

In [26]:
# We can also compute min-locations if we
# aggregate along an axis.

np.argmin(X, 0) # finding the indices of the minimum along axis 0 (one per column of the matrix)

array([2, 0, 1, 2, 2])
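Two follow-ups may help here, sketched below with the rounded values of `X` printed above. First, the flat index returned by `np.argmin(X)` can be converted back to a (row, column) pair with `np.unravel_index`, which the original notebook does not use. Second, aggregating along axis 1 gives the column index of the minimum in each row.

```python
import numpy as np

# Rounded copy of the X printed above
X = np.array([[82.36,  4.49, 60.88, 69.85,  4.72],
              [38.31, 93.07, 29.27, 98.95, 71.  ],
              [25.42, 13.44, 75.04, 57.73,  0.2 ]])

flat_idx = np.argmin(X)                          # index into the flattened array
row, col = np.unravel_index(flat_idx, X.shape)   # convert to (row, column)
row_argmin = np.argmin(X, axis=1)                # column index of the minimum in each row

print(flat_idx)    # 14
print(row, col)    # 2 4
print(row_argmin)  # [1 2 4]
```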

**Percentile**

1. Let $X$ be a collection of $N$ numbers $\{x_1, x_2, x_3, ..., x_N\}$.
2. Let $r$ be a percentage: $0 \leq r \leq 100$.

An $r$-percentile is a value $c$ such that $\frac{r}{100} \cdot N$ of the values in $X$ are less than or equal to $c$:

$\frac{|\{x \in X : x \leq c\}|}{N} = \frac{r}{100}$

In [30]:
np.set_printoptions(suppress=True)
X = np.random.rand(100)*100
X

array([88.64, 96.74, 16.61, 57.36,  2.03, 27.81, 46.04, 51.81,  2.75,
       33.9 , 59.21, 84.16, 56.54,  3.29, 87.15, 85.73, 10.91, 60.87,
       96.59, 13.01, 40.86, 51.86, 99.25, 87.05, 67.04, 36.74, 97.63,
       50.34,  8.59, 23.96, 28.12, 30.27, 64.27, 49.34, 36.44, 66.53,
       98.32, 54.38, 87.77, 29.78,  5.  , 12.34, 13.35,  9.46, 46.63,
       95.49, 25.87,  6.58, 70.29, 42.14, 32.23, 45.66, 16.27, 47.37,
       25.91, 20.47, 82.66, 75.3 ,  0.21, 65.32, 38.08, 22.37, 30.47,
       56.8 , 35.73, 70.44, 47.52, 14.14, 98.42, 20.39, 85.62, 28.51,
       26.53,  5.77, 74.29, 78.08, 58.94, 94.62, 29.31, 14.  , 96.35,
       46.84, 16.65, 65.75,  1.4 , 57.26,  3.75, 12.15,  7.81, 23.96,
       10.54, 32.68, 29.72, 12.43, 84.3 , 46.44, 61.65, 48.51, 28.02,
       22.07])

In [31]:
np.percentile(X, 20)

14.10891834165329

In [32]:
np.sum(X <= np.percentile(X, 20))

20
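The percentile returned by NumPy need not be one of the data values: by default `np.percentile` interpolates linearly between neighboring sorted values. A minimal sketch on a hypothetical four-element array `data` illustrates this, along with the fraction of values at or below the result.

```python
import numpy as np

# Hypothetical small data set, chosen so the arithmetic is easy to follow
data = np.array([10., 20., 30., 40.])

p50 = np.percentile(data, 50)        # halfway between 20 and 30
frac_below = np.mean(data <= p50)    # fraction of values <= p50

print(p50)         # 25.0
print(frac_below)  # 0.5
```

Taking the mean of a boolean array is a compact way to compute the fraction of `True` entries.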

**Did you know?**

- 0-percentile is the minimum.
- 100-percentile is the maximum.

In [33]:
print("0-percentile = {:.2f}".format(np.percentile(X, 0)))
print("100-percentile = {:.2f}".format(np.percentile(X, 100)))


0-percentile = 0.21
100-percentile = 99.25


In [34]:
print("min = {:.2f}".format(X.min()))
print("max = {:.2f}".format(X.max()))

min = 0.21
max = 99.25


**Median vs Mean**
- The median is another name for the 50-percentile.
- The mean is the average.

In [35]:
np.median(X)

41.49592329989936

In [36]:
np.percentile(X, 50)

41.49592329989936

In [37]:
np.mean(X)

44.66533326981542
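The median and the mean can differ substantially when the data contains extreme values. The sketch below uses a hypothetical array `data` with a single large value to make the difference visible.

```python
import numpy as np

# Hypothetical data with one very large value
data = np.array([1., 2., 3., 4., 100.])

median = np.median(data)   # the middle value after sorting
mean = np.mean(data)       # the sum divided by the number of values

print(median)  # 3.0
print(mean)    # 22.0
```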

**Variations in Data**

There is a family of statistics of samples that measure the fluctuation in the data.

- Variance:

$var(X)=\frac{1}{n}\sum_{i=1}^{n}{(x_i-mean(X))^2}$

- Standard Deviation:

$std(X)= \sqrt{var(X)}$

In [38]:
X

array([88.64, 96.74, 16.61, 57.36,  2.03, 27.81, 46.04, 51.81,  2.75,
       33.9 , 59.21, 84.16, 56.54,  3.29, 87.15, 85.73, 10.91, 60.87,
       96.59, 13.01, 40.86, 51.86, 99.25, 87.05, 67.04, 36.74, 97.63,
       50.34,  8.59, 23.96, 28.12, 30.27, 64.27, 49.34, 36.44, 66.53,
       98.32, 54.38, 87.77, 29.78,  5.  , 12.34, 13.35,  9.46, 46.63,
       95.49, 25.87,  6.58, 70.29, 42.14, 32.23, 45.66, 16.27, 47.37,
       25.91, 20.47, 82.66, 75.3 ,  0.21, 65.32, 38.08, 22.37, 30.47,
       56.8 , 35.73, 70.44, 47.52, 14.14, 98.42, 20.39, 85.62, 28.51,
       26.53,  5.77, 74.29, 78.08, 58.94, 94.62, 29.31, 14.  , 96.35,
       46.84, 16.65, 65.75,  1.4 , 57.26,  3.75, 12.15,  7.81, 23.96,
       10.54, 32.68, 29.72, 12.43, 84.3 , 46.44, 61.65, 48.51, 28.02,
       22.07])

In [39]:
np.var(X)

862.6855687270435

In [40]:
np.std(X)

29.371509473076856

**Note:** `std(X)` has the same unit as `X`, and thus is easier to reason about.
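The two formulas can be checked by hand against `np.var` and `np.std`. The sketch below uses a hypothetical eight-element array `x`. It also shows the `ddof=1` argument, which divides by $n-1$ instead of $n$; that variant is not covered in the text above and is included only for comparison.

```python
import numpy as np

# Hypothetical data with mean 5
x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

mean = x.mean()
var_manual = np.mean((x - mean) ** 2)   # (1/n) * sum of squared deviations
std_manual = np.sqrt(var_manual)

print(var_manual, np.var(x))   # 4.0 4.0
print(std_manual, np.std(x))   # 2.0 2.0

# ddof=1 divides by n - 1 instead of n
var_sample = np.var(x, ddof=1)
print(var_sample)
```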

**Cumulative Sum and Product**

Given a series of values $x_1,x_2,x_3,...,x_n$, the cumulative sum is another series defined as:
- $y_1 = x_1$
- $y_2 = x_1 + x_2$
- $\vdots$
- $y_k = \sum_{i=1}^{k}{x_i}$
- $\vdots$

$[y_1, y_2, ..., y_n]$ is called the cumulative sum of $[x_1,x_2,x_3,...,x_n]$.

The cumulative product is defined similarly.
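As a rough illustration of the definition, the running totals can be built with an explicit loop and compared against `np.cumsum` and `np.cumprod`. The names `running_sum`, `running_prod`, `sums`, and `prods` are illustrative.

```python
import numpy as np

xs = np.arange(1, 6)   # [1, 2, 3, 4, 5]

running_sum, running_prod = 0, 1
sums, prods = [], []
for x in xs:
    running_sum += x       # add the next value to the running total
    running_prod *= x      # multiply the next value into the running product
    sums.append(running_sum)
    prods.append(running_prod)

print(sums)   # running totals: 1, 3, 6, 10, 15
print(prods)  # running products: 1, 2, 6, 24, 120

# The loop results agree with the NumPy functions
print(np.array_equal(sums, np.cumsum(xs)))    # True
print(np.array_equal(prods, np.cumprod(xs)))  # True
```

The last entry of each series equals the plain aggregate: `np.sum(xs)` and `np.prod(xs)` respectively.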

In [41]:
xs = np.arange(10) + 1
xs

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [42]:
np.cumsum(xs)

array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55])

In [43]:
np.cumprod(xs)

array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800])