# NumPy: Numerical Python

NumPy (short for Numerical Python) is an open-source Python library for scientific computing.<br><br>
It provides a multidimensional array object and fast mathematical operations on it.<br>
The library contains a long list of useful mathematical functions, including routines for linear algebra, the Fourier transform (FT), and random number generation (RNG).

## Importing NumPy

In [None]:
## although not obligatory, by convention the alias np is assigned when importing the numpy library:
import numpy as np

In [None]:
## from here on, libraries are used more freely; for instance, the inspect library is useful for introspection
from inspect import getmembers, isfunction

In [None]:
## now it is possible to list all the functions defined in numpy using
[o[0] for o in getmembers(np) if isfunction(o[1])]

## The High Level Overview

Linear algebra is at the heart of scientific computing.<br>
Fast and efficient ways of manipulating multidimensional arrays are at the heart of linear algebra.<br>
NumPy, developed by a large number of contributors, provides Python with a fast and efficient way to operate on arrays.<br><br>

NumPy provides the following:
* A powerful N-dimensional array object
* Sophisticated (broadcasting) functions
* Tools for integrating C/C++ and Fortran code
* Useful linear algebra, Fourier transform, and random number capabilities
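
As a rough illustration of the last bullet, the sketch below touches the linear algebra, Fourier transform, and random number modules once each. The matrix, the test signal, and the seed are made-up values chosen only for the example.

```python
import numpy as np

# Linear algebra: solve the small system A z = b
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
z = np.linalg.solve(A, b)
print(z)

# Fourier transform: a sine with exactly 3 cycles over 64 samples
# has its spectral peak in frequency bin 3
n = 64
t = np.arange(n)
signal = np.sin(2 * np.pi * 3 * t / n)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))
print(peak_bin)

# Random numbers: a seeded generator gives reproducible draws in [0, 1)
rng = np.random.default_rng(42)  # the seed value is arbitrary
draws = rng.random(3)
print(draws)
```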

Almost all libraries that do serious number crunching are built on top of NumPy (they use its array data type and the functions that manipulate it).<br>
For instance:
* SciPy: provides additional algorithms used in scientific computing
* Pandas: provides DataFrame and Series objects for working with tabular data
* SciKit-Learn: provides a wealth of machine learning algorithms accessible through a single clean API

## Arrays

NumPy’s main object is the homogeneous multidimensional array.<br>
It is a table of elements (usually numbers), all of the **same type**, indexed by a tuple of non-negative integers.<br>
In NumPy dimensions are called axes.
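
The terms above (axes, shape, a single shared element type, tuple indexing) can be inspected directly on an array. The following is a small sketch; the names `grid`, `mixed`, and `col_sums` are made up for the illustration.

```python
import numpy as np

# A 2 x 3 array: axis 0 has length 2 (rows), axis 1 has length 3 (columns)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)    # number of axes
print(grid.shape)   # length of each axis
print(grid.dtype)   # the single element type shared by all entries

# Elements are indexed by a tuple of non-negative integers, one per axis
print(grid[1, 2])   # row 1, column 2

# Mixing ints and floats still yields one common type (a float type here)
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Reductions accept an axis argument: axis=0 collapses the rows
col_sums = grid.sum(axis=0)
print(col_sums)
```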

In [None]:
lst = [ix+1 for ix in range(1000)]
## to create a 1d array
array1d = np.array(lst)
print(type(array1d), array1d.shape)

In [None]:
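import sys
## note: sys.getsizeof(lst) counts only the list's array of references, not the int objects it points to
print(f'Size of a numpy ndarray   = {sys.getsizeof(array1d)}')
print(f'Size of the original list = {sys.getsizeof(lst)}')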

In [None]:
## most operations on an array are 'vectorized', meaning they are automatically executed for each element
(array1d[:10] % 2)==0

In [None]:
## this is often orders of magnitude faster: the interpreter sets up the vectorized operation and is not involved in each step
%timeit np.sum((array1d % 2)==0)

In [None]:
## the same logic using a list
%timeit sum([1 for ix in lst if (ix%2)==0])

In [None]:
## to create a 2d array, do:
array2d = np.array([[2, 0], [0, 1]])
print(array2d.shape)
array2d + array2d

In [None]:
# 2d arrays are basically matrices, to do matrix multiplication use @ or np.dot
print(array2d @ array2d)
print(np.dot(array2d,array2d))

In [None]:
## to create a 3d array
array3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
print(array3d, '\n', '-'*100)

scale2 = np.identity(2)*2
print(array3d @ scale2)
print('\n', '-'*100)

## an ndarray can be reshaped
array3d = array3d.reshape([1,2,6])
print(array3d.shape)
print(array3d)

## Data Types

NumPy supports a much greater variety of numerical types than Python does. The following table shows different scalar data types defined in NumPy.

| Data Type | Description
| :----- | :-----
| bool_ | Boolean (True or False) stored as a byte
| int_ | Default integer type (same as C long; normally either int64 or int32)
| intc | Identical to C int (normally int32 or int64)
| intp | Integer used for indexing (same as C ssize_t; normally either int32 or int64)
| int8 | Byte (-128 to 127)
| int16 | Integer (-32768 to 32767)
| int32 | Integer (-2147483648 to 2147483647)
| int64 | Integer (-9223372036854775808 to 9223372036854775807)
| uint8 | Unsigned integer (0 to 255)
| uint16 | Unsigned integer (0 to 65535)
| uint32 | Unsigned integer (0 to 4294967295)
| uint64 | Unsigned integer (0 to 18446744073709551615)
| float_ | Shorthand for float64 (this alias was removed in NumPy 2.0; use float64)
| float16 | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
| float32 | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
| float64 | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
| complex_ | Shorthand for complex128 (this alias was removed in NumPy 2.0; use complex128)
| complex64 | Complex number, represented by two 32-bit floats (real and imaginary components)
| complex128 | Complex number, represented by two 64-bit floats (real and imaginary components)
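
The sizes and ranges in the table can be checked from NumPy itself. The sketch below uses `np.dtype(...).itemsize`, `np.iinfo`, and `np.finfo`; the array `small` is a made-up example.

```python
import numpy as np

# itemsize reports the number of bytes used per element
for dt in (np.int8, np.int16, np.int32, np.int64,
           np.float16, np.float32, np.float64, np.complex128):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

# iinfo exposes the limits of integer types, finfo those of float types
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)
print(np.finfo(np.float16).max)

# nbytes is the size of the data buffer: number of elements times itemsize
small = np.arange(5, dtype=np.int8)
print(small.nbytes)
```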

In [None]:
## the values 1..1000 do not fit in int8 (-128 to 127), so int16 is used here
print(f'Size of a numpy ndarray of int16 = {sys.getsizeof(np.array(lst, dtype=np.int16))}')
print(f'Size of the original list = {sys.getsizeof(lst)}')

In [None]:
## np.complex was removed from NumPy; np.complex128 is the explicit equivalent
arr_cmplx = np.array([1,2,3,4,5],dtype=np.complex128)
arr_cmplx

In [None]:
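## take the real part explicitly; casting complex to float directly discards the imaginary part with a warning
arr_float16 = arr_cmplx.real.astype(np.float16)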

In [None]:
## array indexing works pretty much like it does with lists
arr_float16[2:4] += 10
arr_float16

## Missing in NumPy: Missing Integer Values

In [None]:
## floating point numbers have a missing value (NaN) in NumPy
arr_float16[4] = np.nan
arr_float16

In [None]:
## integer data types do not have a missing value / NaN in NumPy, which is problematic
## as a consequence, libraries built directly on NumPy arrays lack native support for missing integer values
## you have to (1) convert to float, or (2) use a masked array
## worse yet, when converting from floating point to integer, NaN does not raise an error but becomes an arbitrary integer (often 0; the result is platform-dependent)
arr_int16 = arr_float16.astype(np.int16)
arr_int16

In [None]:
## this raises a ValueError: NaN cannot be converted to an integer
arr_int16[4] = np.nan

There is just no way we can set an element of a normal numpy array of some integer dtype to missing!<br>
Masked arrays are arrays that may have missing or invalid entries.<br>
The numpy.ma module provides a nearly work-alike replacement for numpy that supports data arrays with masks.
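
A minimal sketch of the idea, assuming some hypothetical integer readings where one entry is known to be missing: the mask is supplied explicitly, the integer dtype is kept, and reductions skip the masked entry.

```python
import numpy as np

# Hypothetical integer readings; the entry at index 2 is treated as missing
readings = np.array([1, 2, 0, 4, 5], dtype=np.int16)
missing = np.array([False, False, True, False, False])

# True in the mask marks an entry as missing / invalid
masked = np.ma.masked_array(readings, mask=missing)

print(masked.dtype)    # the integer dtype is preserved
print(masked.count())  # number of unmasked entries
print(masked.mean())   # mean over the unmasked entries 1, 2, 4, 5

# filled() returns a plain ndarray with masked entries replaced by a chosen value
filled = masked.filled(-1)
print(filled)
```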

In [None]:
## let's say we have an array where 99 is used as the missing value
ar = [1,2,99,4,5]
## and we want to mask the missing values
ma = np.ma.masked_equal(ar, 99)
ma

In [None]:
## for the sum we use
ma.sum()

In [None]:
# for the dot product we use:
np.ma.dot(ma,np.ones(ma.shape[0]))

In [None]:
np.sum(ma*np.ones(ma.shape[0]))

## Creating Arrays

In [None]:
np.zeros((3,3), dtype=np.int8)

In [None]:
np.ones((3,))

In [None]:
np.identity(2)

In [None]:
## using list comprehensions
np.array([[r*10+c for c in range(3)] for r in range(3)])

## Stacking & Reshaping

In [None]:
## stacking
print(np.hstack([np.identity(2),np.identity(2)]))
print('-'*20)
print(np.vstack([np.identity(2),np.identity(2)]))

In [None]:
np.array([1,2,3,4,5,6]).reshape((2,-1))
## this is the same as order='C' --> loop over the 'right-most' index first (for a 2D array: columns)
np.array([1,2,3,4,5,6]).reshape((2,-1),order='C')

In [None]:
## order = 'F' uses the Fortran rule: loop over the 'left-most' index first (for a 2D array: rows)
np.array([1,2,3,4,5,6]).reshape((2,-1),order='F')

## Broadcasting

In [None]:
## the shapes of the vectors line up
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])
v1 + v2

In [None]:
## what happens when we add a column vector and a row vector
## linear algebra does not define this
## --> numpy broadcasts the shapes into something that is valid and defined mathematically
v1 = np.array([ 1, 2, 3]).reshape((-1, 1))
v2 = np.array([10,20,30]).reshape(( 1,-1))
v1 + v2

Above, the column vector is 'expanded' into a matrix whose columns are copies of the column vector (one copy per element of the row vector).<br>
The row vector is 'expanded' into a matrix whose rows are copies of the row vector (one copy per element of the column vector).
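
The expansion described above can be made explicit with `np.broadcast_to`, which returns a read-only view with the stretched shape. This is only a sketch to show the equivalence; the names `col`, `row`, `explicit`, and `implicit` are made up here.

```python
import numpy as np

col = np.array([1, 2, 3]).reshape((-1, 1))     # shape (3, 1)
row = np.array([10, 20, 30]).reshape((1, -1))  # shape (1, 3)

# Stretch both operands to the common shape (3, 3) by hand
col_expanded = np.broadcast_to(col, (3, 3))    # columns are copies of col
row_expanded = np.broadcast_to(row, (3, 3))    # rows are copies of row

explicit = col_expanded + row_expanded
implicit = col + row                           # NumPy broadcasts automatically

print(explicit)
print(np.array_equal(explicit, implicit))
```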

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
* **Rule 1**: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
* **Rule 2**: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
* **Rule 3**: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
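
The three rules can be written down as a small function that works on shape tuples only. `broadcast_shape` is a hypothetical helper written for this illustration, not a NumPy function; its result is compared with the shape NumPy actually produces.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the three broadcasting rules to two shapes."""
    # Rule 1: pad the shorter shape with ones on its left side
    ndim = max(len(shape_a), len(shape_b))
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b:
            result.append(size_a)
        elif size_a == 1:
            result.append(size_b)  # Rule 2: stretch a along this axis
        elif size_b == 1:
            result.append(size_a)  # Rule 2: stretch b along this axis
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f'cannot broadcast {shape_a} with {shape_b}')
    return tuple(result)

print(broadcast_shape((2, 3), (3,)))
print((np.ones((2, 3)) * np.arange(3)).shape)  # NumPy's own result
print(broadcast_shape((3, 1), (1, 3)))
```

Calling `broadcast_shape((3, 2), (3,))` raises a `ValueError`, because after padding the last axes have sizes 2 and 3.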

In [None]:
# lets work through an example
M = np.ones((2, 3))
a = np.arange(3)
M * a

shape M = (2, 3) & shape a = (3,)<br>
**rule 1** pad a to the left with ones<br>
shape M = (2, 3) & shape a = (1,3) ==> a= [[0,1,2]]<br>
**rule 2** the first dimension disagrees and has size 1 in a, so we 'stretch' the first dimension of a to 2 (replicating the row)<br>
shape M = (2, 3) & shape a = (2,3) ==> a= [[0,1,2],[0,1,2]]<br>
We are now good to go!!

In [None]:
M = np.ones((3, 2))
a = np.arange(3)
## a becomes (1) [[0,1,2]] with shape (1,3); (2) its first dimension could be stretched to 3
## but the last dimensions disagree (2 vs 3) and neither is 1 --> rule 3: a ValueError is raised
M * a

### np.newaxis

In [None]:
## to multiply each column of M by a, state explicitly where the new axis goes using np.newaxis
## a[:, np.newaxis] has shape (3,1), which broadcasts against (3,2)
M * a[:,np.newaxis]

## Functions On Arrays

In [None]:
## numpy has many functions defined that do the right thing on arrays
x = np.linspace(0,2*np.pi,num=5)
np.sin(x)

In [None]:
## what is the index where the max occurs
np.argmax(np.sin(x))

In [None]:
2 * np.random.random(20)

### Let's implement least squares regression

In [None]:
x = np.linspace(1,10,20)
y = 12 + 0.5 * x + 1 * np.random.rand(20)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y)

If $y^* = a + b x$ and $\beta = [\beta_1, \beta_2] = [a, b]$, then the least squares estimate boils down to:<br>
minimize $(y - y^*)^T (y - y^*) = (y - (\beta_1 + \beta_2 x))^T (y - (\beta_1 + \beta_2 x))$ by setting the derivative with respect to $\beta$ to $0$, resulting in:<br>
$(X^T X)\beta = X^T y$ or $\beta = (X^T X)^{-1} (X^T y)$, where $X$ is the matrix with a column of ones and a column holding $x$<br>
This is most efficiently solved using **np.linalg.solve**
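
As a cross-check on the normal equations, the sketch below solves the same kind of problem twice: once with `np.linalg.solve` on $(X^T X)\beta = X^T y$ and once with `np.linalg.lstsq`, NumPy's general least squares routine. The data are simulated with an arbitrary seed, and the `_demo` names are made up so they do not clash with the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
x_demo = np.linspace(1, 10, 20)
y_demo = 12 + 0.5 * x_demo + rng.random(20)

# Design matrix: a column of ones (intercept) and a column holding x
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# General least squares solver applied to the same system
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(
    X_demo, y_demo, rcond=None
)

print(beta_normal)
print(beta_lstsq)
print(np.allclose(beta_normal, beta_lstsq))
```

Both approaches should agree up to floating point error for a well-conditioned problem like this one.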

In [None]:
def least_squares_reg(x, y):
    ## add column of ones for the intercept
    X = np.vstack([np.ones(len(x)),x]).T
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return(b)
least_squares_reg(x, y)

In [None]:
## Could also use the scikit-learn library
from sklearn import linear_model
reg = linear_model.LinearRegression()
lm  = reg.fit(x.reshape(-1,1), y)
print(f'intercept = {lm.intercept_}')
print(f'slope = {lm.coef_[0]}')

## Random Numbers and Permutations

In [None]:
## a lot of cool functionality for random sampling is provided in numpy.random
dir(np.random)

In [None]:
## to generate a bunch of pseudo-random numbers from the uniform distribution on [0.0,1.0), use:
np.random.rand(3,5) ## 3 rows & 5 columns

In [None]:
## random integer
np.random.randint(low=1, high=10, size=(3,5))

In [None]:
## random numbers from the F-distribution (with df numerator = 101 and df denominator = 35)
np.random.f(101,35,size=(2,5)) ## 2 rows & 5 columns

In [None]:
np.random.choice(['A','B','C','D'], size=(3,7), replace=True, p=[0.1,0.1,0.1,0.7])

In [None]:
## shuffling
lst = [1,2,3,4,5,6,7,8,9]
np.random.shuffle(lst)
lst

### Let's implement a nonparametric test for group means

In [None]:
g1 = np.random.binomial(20,0.48,size= 5);
g2 = np.random.binomial(20,0.52,size=10);
print(f'mean group 1 {np.mean(g1)} & mean group 2 {np.mean(g2)} --> difference = {np.mean(g2) - np.mean(g1)}')

In [None]:
df_same = np.concatenate([g1,g2])
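def mean_dif(df_same):
    ## shuffle in place, then split into groups of 5 and 10; same sign convention as above (group 2 minus group 1)
    np.random.shuffle(df_same)
    return np.mean(df_same[5:]) - np.mean(df_same[:5])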

dist_mean_diff = [mean_dif(df_same) for _ in range(100000)]

In [None]:
import seaborn as sns
sns.histplot(dist_mean_diff, bins=100);

In [None]:
print(f'fraction of permutations where the difference in means exceeds 1.3 = {np.sum(np.array(dist_mean_diff) > 1.3) / len(dist_mean_diff)}')

# SciPy: Scientific Python