# Numpy - multidimensional data arrays

J.R. Johansson (jrjohansson at gmail.com)

The latest version of this [IPython notebook](http://ipython.org/notebook.html) lecture is available at [http://github.com/jrjohansson/scientific-python-lectures](http://github.com/jrjohansson/scientific-python-lectures).

The other notebooks in this lecture series are indexed at [http://jrjohansson.github.io](http://jrjohansson.github.io).

## Introduction

The `numpy` package (module) is used in almost all numerical computation using Python. It is a package that provides high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized (formulated with vectors and matrices), performance is very good.

In [1]:
import numpy as np

## Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example:

* from a Python list or tuple
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* reading data from files

### From lists

For example, to create new vector and matrix arrays from Python lists we can use the `numpy.array` function.

In [2]:
# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])

v

array([1, 2, 3, 4])

In [3]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])

M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

The number of elements in the array is available through the `ndarray.size` property:

In [7]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [8]:
np.shape(M)

(2, 2)

In [9]:
np.size(M)

4

The number of dimensions of an array is available through the function `numpy.ndim` or the `ndarray.ndim` property:

In [10]:
print(np.ndim(v),np.ndim(M))

1 2


In [11]:
print(v.ndim,M.ndim)

1 2


So far the `numpy.ndarray` looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type?

There are several reasons:

* Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
* Numpy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created.
* Numpy arrays are memory efficient.
* Because of the static typing, mathematical functions such as multiplication and addition of `numpy` arrays can be implemented efficiently in a compiled language (C and Fortran are used).
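
The points about homogeneous typing and memory use can be illustrated with a minimal sketch. The array names `mixed` and `ints` are made up for this example:

```python
import numpy as np

# Mixing ints and floats in the input upcasts every element to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# With a fixed element type, the memory footprint is simply size * itemsize.
ints = np.array([1, 2, 3, 4], dtype=np.int64)
print(ints.itemsize, ints.nbytes)
```

Because every element has the same size, NumPy can store the data in one contiguous block instead of a collection of separate Python objects.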

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [12]:
M.dtype

dtype('int32')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

In [13]:
M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'
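
Not every mismatched assignment raises an error, though. As a small sketch (the array `counts` is a made-up example), assigning a float to an element of an integer array is accepted, and the value is converted to the array's type:

```python
import numpy as np

counts = np.array([1, 2, 3])   # integer array
counts[0] = 2.7                # no error: the fractional part is discarded
print(counts)
print(counts.dtype)            # still an integer type
```

If fractional values matter, it is safer to create the array with a floating-point `dtype` from the start.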

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [15]:
M = np.array([[1, 2], [3, 4]], dtype=complex)

M

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

Common data types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. Note that `float128` is only available on some platforms.

### Using array-generating functions

For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in `numpy` that generate arrays of different forms. Some of the more common are:

#### arange

In [16]:
# create a range

x = np.arange(0, 10, 1) # arguments: start, stop, step

x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [17]:
x = np.arange(-1, 1, 0.1)

x

array([-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01, -2.22044605e-16,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01])
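
The `-2.22044605e-16` entry in the output above is floating-point rounding error from repeatedly adding a non-integer step. One common alternative, sketched here with the made-up name `grid`, is to specify the number of points with `linspace` instead of the step:

```python
import numpy as np

# Same 20 grid points as np.arange(-1, 1, 0.1), specified by count instead of step.
grid = np.linspace(-1, 1, 20, endpoint=False)
print(len(grid))
print(np.allclose(grid, np.arange(-1, 1, 0.1)))
```

With `linspace` the number of elements is explicit, so it does not depend on how the step rounds.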

#### linspace and logspace

In [18]:
# using linspace, both end points ARE included
np.linspace(0, 10, 25)

array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])

In [19]:
# print(np.exp(10))
np.logspace(0, 10, 10, base=np.e) # 10 values e**p, with p evenly spaced from 0 to 10

array([1.00000000e+00, 3.03773178e+00, 9.22781435e+00, 2.80316249e+01,
       8.51525577e+01, 2.58670631e+02, 7.85771994e+02, 2.38696456e+03,
       7.25095809e+03, 2.20264658e+04])
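
To make the relation between `logspace` and `linspace` explicit, the following sketch checks that `logspace` raises the base to evenly spaced exponents. The variable names are chosen for this example only:

```python
import numpy as np

exponents = np.linspace(0, 10, 10)              # evenly spaced exponents
direct = np.exp(exponents)                      # e ** exponent, computed directly
via_logspace = np.logspace(0, 10, 10, base=np.e)
print(np.allclose(direct, via_logspace))
```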

#### mgrid

In [20]:
x, y = np.mgrid[0:5, 0:5] # similar to meshgrid in MATLAB

In [21]:
x

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

In [22]:
y

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
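
NumPy also has a `meshgrid` function. As a sketch of how the two relate, `mgrid` matches `meshgrid` when the latter is called with `indexing="ij"` (the names `xx` and `yy` are made up here):

```python
import numpy as np

gx, gy = np.mgrid[0:5, 0:5]
xx, yy = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
print(np.array_equal(gx, xx), np.array_equal(gy, yy))
```

Without `indexing="ij"`, `meshgrid` uses Cartesian (`"xy"`) ordering, which swaps the roles of the first two axes.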

#### random data

In [23]:
from numpy import random

In [24]:
# uniform random numbers in [0, 1)
random.rand(5,5)

array([[0.23677247, 0.08166005, 0.05689641, 0.59090804, 0.32632728],
       [0.50321869, 0.30642142, 0.16203125, 0.57146682, 0.58704888],
       [0.64898187, 0.14721152, 0.65291354, 0.28193689, 0.15767935],
       [0.48967827, 0.53522195, 0.47656061, 0.39147882, 0.50916167],
       [0.65742986, 0.21526856, 0.63953496, 0.94817564, 0.58537754]])

In [25]:
# standard normal distributed random numbers
random.randn(5,5)

array([[ 0.69745229,  0.24878153,  1.79888556, -0.20407845,  1.09245863],
       [-0.10208846, -0.15855962, -0.31678497, -1.09653835,  0.24200574],
       [-1.47130382,  0.31920979,  2.11336168, -0.04743381, -0.50996396],
       [-0.11477273, -0.38474887, -1.01990719, -0.65076785,  0.33264072],
       [ 0.49402714, -0.68844188,  0.69889645,  0.30365611,  0.86845306]])
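
Recent NumPy versions (1.17 and later) also provide a generator-based interface, which makes it easy to get reproducible results. A minimal sketch, where the seed value is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)       # the seed value is an arbitrary choice
uniform = rng.random((5, 5))               # uniform samples in [0, 1)
normal = rng.standard_normal((5, 5))       # standard normal samples
print(uniform.shape, normal.shape)
```

Creating a second generator with the same seed reproduces the same sequence of numbers.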

#### diag

In [26]:
# a diagonal matrix
np.diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [27]:
# diagonal with offset from the main diagonal
np.diag([1,2,3], k=1) 

array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])

#### zeros and ones

In [28]:
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [29]:
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

## Manipulating arrays

### Indexing

We can index elements in an array using square brackets and indices:

In [30]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [31]:
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]

(4+0j)

If we omit an index of a multidimensional array it returns the whole row (or, in general, an (N-1)-dimensional array):

In [32]:
M

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

In [33]:
M[1]

array([3.+0.j, 4.+0.j])

The same thing can be achieved by using `:` instead of an index:

In [34]:
M[1,:] # row 1

array([3.+0.j, 4.+0.j])

In [35]:
M[:,1] # column 1

array([2.+0.j, 4.+0.j])

In [36]:
np.shape(M[:,1])

(2,)

We can assign new values to elements in an array using indexing:

In [37]:
M[0,0] = 1

In [38]:
M

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

In [39]:
# also works for rows and columns
M[1,:] = 0
M[:,1] = -1

In [40]:
M

array([[ 1.+0.j, -1.+0.j],
       [ 0.+0.j, -1.+0.j]])

### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

In [41]:
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [42]:
A[1:3]

array([2, 3])

Array slices are *mutable*: if they are assigned a new value the original array from which the slice was extracted is modified:

In [43]:
A[1:3] = [-2,-3]

A

array([ 1, -2, -3,  4,  5])
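
The reason is that a basic slice is a view on the original data rather than a copy. A short sketch with made-up names (`arr`, `view`, `independent`):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
view = arr[1:3]                  # basic slicing returns a view, not a copy
view[0] = 99
print(arr)                       # arr now holds 99 at index 1

independent = arr[1:3].copy()    # an explicit copy owns its own data
independent[0] = -1
print(arr)                       # unchanged by the write to the copy
```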

We can omit any of the three parameters in `M[lower:upper:step]`:

In [44]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [45]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [46]:
A[:3] # first three elements

array([ 1, -2, -3])

In [47]:
A[3:] # elements from index 3

array([4, 5])

Negative indices count from the end of the array (positive indices from the beginning):

In [48]:
A = np.array([1,2,3,4,5])

In [49]:
A[-1] # the last element in the array

5

In [50]:
A[-3:] # the last three elements

array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:

In [51]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])

A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [52]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [53]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

### Fancy indexing

Fancy indexing is the name for when an array or list is used in place of an index:

In [54]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [55]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]

array([11, 22, 34])
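
Note that two index lists are paired element by element, which is why only three values are returned above. If the full block of all row and column combinations is wanted, one option is `np.ix_`. A sketch, with the array rebuilt under the made-up name `grid`:

```python
import numpy as np

grid = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, -1]

pairs = grid[rows, cols]            # elements (1,1), (2,2) and (3,-1)
block = grid[np.ix_(rows, cols)]    # every combination of rows and cols
print(pairs)
print(block)
```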

We can also use index masks: If the index mask is a Numpy array of data type `bool`, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:

In [56]:
B = np.array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [57]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

array([0, 2])

In [58]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [59]:
x = np.arange(0, 10, 0.5)
x

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
       6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])

In [60]:
mask = (5 < x) & (x < 7.5) # element-wise AND of two boolean arrays

mask

array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True, False, False, False,
       False, False])

In [61]:
x[mask]

array([5.5, 6. , 6.5, 7. ])
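
Boolean masks can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are needed because these operators bind more tightly than `<` and `>`. A sketch with made-up names:

```python
import numpy as np

values = np.arange(0, 10, 0.5)
inside = (values > 5) & (values < 7.5)   # element-wise AND
outside = ~inside                        # element-wise NOT
edges = (values < 1) | (values > 9)      # element-wise OR
print(values[inside])
print(values[edges])
print(outside.sum())                     # number of elements not selected by `inside`
```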

The function `np.where(condition, x, y)` returns elements chosen from `x` or `y` depending on `condition`:

In [62]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [63]:
np.where(a < 5, a, 10*a)

array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])

In [64]:
a = np.array([[0, 1, 2],
              [0, 2, 4],
              [0, 3, 6]])
np.where(a < 4, a, -1)  # -1 is broadcast

array([[ 0,  1,  2],
       [ 0,  2, -1],
       [ 0,  3, -1]])

### Broadcasting

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. 

In [65]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

array([2., 4., 6.])

We can think of the scalar `b` being stretched during the arithmetic operation into an array with the same shape as `a`.

In [66]:
A = np.array([[1, 2, 3], [1, 2, 3]])
print('A:',A,'\n')
b = np.array([1, 2, 3])
print('b:',b,'\n')
C=A+b
print('C:',C)

A: [[1 2 3]
 [1 2 3]] 

b: [1 2 3] 

C: [[2 4 6]
 [2 4 6]]


In [67]:
b = np.array([[1],[2]])
print('b:',b,'\n')
C=A+b
print('C:',C)

b: [[1]
 [2]] 

C: [[2 3 4]
 [3 4 5]]


In [68]:
b = np.array([[1],[2],[3]])
print('b:',b,'\n')
C=A+b
print('C:',C)

b: [[1]
 [2]
 [3]] 



ValueError: operands could not be broadcast together with shapes (2,3) (3,1) 
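
The failure follows from the broadcasting rule: shapes are compared starting from the last dimension, and two dimensions are compatible when they are equal or one of them is 1. For `(2, 3)` and `(3, 1)` the last dimensions (3 and 1) are fine, but the first ones (2 and 3) are not. The sketch below reproduces the error and shows one possible fix; the names `left` and `col` are made up:

```python
import numpy as np

left = np.array([[1, 2, 3], [1, 2, 3]])   # shape (2, 3)
col = np.array([[1], [2], [3]])           # shape (3, 1)

try:
    left + col
except ValueError as err:
    print("broadcast failed:", err)

# Transposing `left` gives shape (3, 2), which is compatible with (3, 1).
result = left.T + col
print(result)
```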

## Linear algebra

### Matrix and Vector Operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [69]:
A * A 

array([[1, 4, 9],
       [1, 4, 9]])

In [70]:
b * b

array([[1],
       [4],
       [9]])

In [71]:
b/b

array([[1.],
       [1.],
       [1.]])

Matrix multiplication:

In [None]:
np.dot(A, b)

In [None]:
# create matrices
matrix_a = np.array([[1, 1, 1],
                     [1, 2, 0]])

matrix_b = np.array([[1, 3],
                     [1, 2],
                     [0, 3]])

# multiply two matrices
np.dot(matrix_a, matrix_b)
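
For two-dimensional arrays, the `@` operator (Python 3.5 and later) gives the same result as `np.dot`. A short sketch using the same two matrices:

```python
import numpy as np

matrix_a = np.array([[1, 1, 1],
                     [1, 2, 0]])
matrix_b = np.array([[1, 3],
                     [1, 2],
                     [0, 3]])

product = matrix_a @ matrix_b      # (2, 3) times (3, 2) gives (2, 2)
print(product)
print(np.array_equal(product, np.dot(matrix_a, matrix_b)))
```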

Transposing

In [None]:
A.T

Transposing a one-dimensional array has no effect, because it has only a single axis:

In [None]:
a=np.array([1, 2, 3, 4, 5, 6])
print(a)
print(a.T)
a.shape

Inner (scalar) product

In [None]:
np.inner(a,a)

In [None]:
b=np.array([[1, 2, 3, 4, 5, 6]])
b.T

In [None]:
b.shape

In [None]:
np.inner(b,b)

In [None]:
np.inner(b,b).shape

In [None]:
np.dot(b,b.T)

In [None]:
np.dot(b.T,b)

### Matrix computations

#### Inverse

In [None]:
C=np.array([[1,2],[0,1]])
np.linalg.inv(C) # for np.matrix objects (not ndarrays) this is equivalent to C.I
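
A quick way to check an inverse is to multiply it with the original matrix. The sketch below also shows `np.linalg.solve`, which is often preferable to forming the inverse when the goal is to solve a linear system. The right-hand side `rhs` is a made-up example:

```python
import numpy as np

mat = np.array([[1.0, 2.0], [0.0, 1.0]])
mat_inv = np.linalg.inv(mat)
print(np.allclose(mat @ mat_inv, np.eye(2)))   # inverse times matrix is the identity

rhs = np.array([5.0, 1.0])                     # hypothetical right-hand side
solution = np.linalg.solve(mat, rhs)           # solves mat @ solution = rhs
print(solution)
```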

#### Determinant

In [None]:
np.linalg.det(C)

#### Trace

In [None]:
C.trace()

#### Rank

In [None]:
# create matrix
C = np.array([[1, 1, 1],
              [1, 1, 10],
              [1, 1, 15]])

# return matrix rank
np.linalg.matrix_rank(C)

#### Eigenvalues and Eigenvectors

In [None]:
eigenvalues, eigenvectors = np.linalg.eig(C)

In [None]:
eigenvalues

In [None]:
eigenvectors
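
The eigenvectors are returned as the columns of the second array, so column `i` belongs to eigenvalue `i`. This can be verified on a small example; the symmetric matrix `sym` below is a made-up test case, not the matrix used above:

```python
import numpy as np

sym = np.array([[2.0, 1.0], [1.0, 2.0]])   # hypothetical symmetric test matrix
vals, vecs = np.linalg.eig(sym)

# Column i of `vecs` is the eigenvector belonging to vals[i].
for i in range(len(vals)):
    print(np.allclose(sym @ vecs[:, i], vals[i] * vecs[:, i]))
```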

### Data processing

In [None]:
A=random.randn(5,5)
A

#### mean

In [None]:
# mean of the values in column 3
np.mean(A[:,3])

#### standard deviations and variance

In [None]:
np.std(A[:,3]), np.var(A[:,3])

#### min and max

In [None]:
# smallest value in column 3
A[:,3].min()

In [None]:
# largest value in column 3
A[:,3].max()

When functions such as `min`, `max`, etc. are applied to multidimensional arrays, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis.

In [None]:
m = random.rand(3,3)
m

In [None]:
# global max
m.max()

In [None]:
# max in each column
m.max(axis=0)

In [None]:
# max in each row
m.max(axis=1)
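
Since the random matrix above changes on every run, a fixed example may make the `axis` argument easier to follow. The array `table` is made up for this sketch:

```python
import numpy as np

table = np.array([[1, 5, 3],
                  [4, 2, 6]])
print(table.max())          # global maximum: 6
print(table.max(axis=0))    # maximum of each column: [4 5 6]
print(table.max(axis=1))    # maximum of each row: [5 6]
```

The `axis` argument names the dimension that is collapsed: `axis=0` reduces over the rows and leaves one value per column.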

## Reshaping, resizing and stacking arrays

The shape of a Numpy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.

In [None]:
A

In [None]:
n, m = A.shape
print(n,m)

In [None]:
B = A.reshape((1,n*m))
B

In [None]:
B[0,0:5] = 5 # modify the array

B

In [None]:
A # and the original variable is also changed. B is only a different view of the same data

We can also use the function `flatten` to make a higher-dimensional array into a vector. But this function creates a copy of the data.

In [None]:
B = A.flatten()

B

In [None]:
B[0:5] = 10

B

In [None]:
A # now A has not changed, because B's data is a copy of A's, not referring to the same data
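
Whether two arrays refer to the same data can also be checked directly with `np.shares_memory`. A small sketch with made-up names:

```python
import numpy as np

base = np.arange(6).reshape(2, 3)
reshaped = base.reshape(1, 6)    # a view on the same data
flat = base.flatten()            # an independent copy
print(np.shares_memory(base, reshaped))   # True
print(np.shares_memory(base, flat))       # False
```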

#### Adding a new dimension: newaxis

With `newaxis`, we can insert new dimensions in an array, for example converting a vector to a column or row matrix:

In [None]:
v = np.array([1,2,3])

In [None]:
np.shape(v)

In [None]:
# make a column matrix of the vector v
v[:, np.newaxis]

In [None]:
# column matrix
v[:,np.newaxis].shape

In [None]:
# row matrix
v[np.newaxis,:].shape
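
One typical use of `newaxis` is to combine a column and a row through broadcasting. As a sketch (the names `vec` and `products` are made up), this builds the table of all pairwise products, which matches `np.outer`:

```python
import numpy as np

vec = np.array([1, 2, 3])
products = vec[:, np.newaxis] * vec[np.newaxis, :]   # shapes (3, 1) * (1, 3) -> (3, 3)
print(products)
print(np.array_equal(products, np.outer(vec, vec)))
```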

#### Stacking and repeating arrays

Using the functions `repeat`, `tile`, `vstack`, `hstack`, and `concatenate` we can create larger vectors and matrices from smaller ones:

#### tile and repeat

In [None]:
a = np.array([[1, 2], [3, 4]])

In [None]:
# repeat each element 3 times
np.repeat(a, 3)

In [None]:
# tile the matrix 3 times 
np.tile(a, 3)
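
The difference between the two functions is easier to see with the results written out. The sketch below uses the made-up name `small` and also shows the optional `axis` argument of `repeat` and a tuple argument to `tile`:

```python
import numpy as np

small = np.array([[1, 2], [3, 4]])
print(np.repeat(small, 3))           # flattened: [1 1 1 2 2 2 3 3 3 4 4 4]
print(np.repeat(small, 3, axis=0))   # each row repeated 3 times, shape (6, 2)
print(np.tile(small, 3))             # [[1 2 1 2 1 2], [3 4 3 4 3 4]]
print(np.tile(small, (2, 1)))        # whole matrix stacked twice vertically, shape (4, 2)
```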

#### concatenate

In [None]:
b = np.array([[5, 6]])

In [None]:
np.concatenate((a, b), axis=0)

In [None]:
np.concatenate((a, b.T), axis=1)

#### hstack and vstack

In [None]:
np.vstack((a,b))

In [None]:
np.hstack((a,b.T))

### Copy and "deep copy"

To achieve high performance, assignments in Python usually do not copy the underlying objects. This is important, for example, when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (technical term: pass by reference).

In [None]:
A = np.array([[1, 2], [3, 4]])

A

In [None]:
# now B is referring to the same array data as A 
B = A 

In [None]:
# changing B affects A
B[0,0] = 10

B

In [None]:
A

If we want to avoid this behavior, so that we get a new, completely independent object `B` copied from `A`, we need to do a so-called "deep copy" using the function `copy`:

In [None]:
B = np.copy(A)

In [None]:
# now, if we modify B, A is not affected
B[0,0] = -5

B

In [None]:
A
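
The same applies when arrays are passed to functions. In the following sketch, `zero_first` is a hypothetical helper written only for this illustration:

```python
import numpy as np

def zero_first(arr):
    """Hypothetical helper that modifies its argument in place."""
    arr[0] = 0

data = np.array([1, 2, 3])
zero_first(data)                 # the caller's array is modified
print(data)

protected = np.array([1, 2, 3])
zero_first(protected.copy())     # only the temporary copy is modified
print(protected)
```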

### Vectorizing functions

As mentioned several times by now, to get good performance we should try to avoid looping over elements in our vectors and matrices, and instead use vectorized algorithms. The first step in converting a scalar algorithm to a vectorized algorithm is to make sure that the functions we write work with vector inputs.

In [None]:
def Theta(x):
    """
    Scalar implementation of the Heaviside step function.
    """
    if x >= 0:
        return 1
    else:
        return 0

In [None]:
Theta(np.array([-3,-2,-1,0,1,2,3]))

OK, that didn't work because we didn't write the `Theta` function so that it can handle a vector input... 

To get a vectorized version of Theta we can use the Numpy function `vectorize`. In many cases it can automatically vectorize a function:

In [None]:
Theta_vec = np.vectorize(Theta)

In [None]:
Theta_vec(np.array([-3,-2,-1,0,1,2,3]))

We can also implement the function to accept a vector input from the beginning (requires more effort but might give better performance):

In [None]:
def Theta(x):
    """
    Vector-aware implementation of the Heaviside step function.
    """
    return 1 * (x >= 0)

In [None]:
Theta(np.array([-3,-2,-1,0,1,2,3]))

In [None]:
# still works for scalars as well
Theta(-1.2), Theta(2.6)
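
Two further ways to write the same step function without an explicit loop are `np.where` and the built-in `np.heaviside`. This is only a sketch of alternatives, with made-up variable names:

```python
import numpy as np

samples = np.array([-3, -2, -1, 0, 1, 2, 3])
step_where = np.where(samples >= 0, 1, 0)     # 1 where the condition holds, else 0
step_builtin = np.heaviside(samples, 1.0)     # second argument: value returned at 0
print(step_where)
print(step_builtin)
```

Note that `np.heaviside` returns floating-point values, while the `np.where` version returns integers.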

### Using arrays in conditions

When using arrays in conditions, for example `if` statements and other boolean expressions, one needs to use `any` or `all`, which require that any or all elements in the array evaluate to `True`:

In [None]:
M

In [None]:
if (M > 5).any():
    print("at least one element in M is larger than 5")
else:
    print("no element in M is larger than 5")

In [None]:
if (M > 5).all():
    print("all elements in M are larger than 5")
else:
    print("not all elements in M are larger than 5")
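
The reason `any` or `all` is needed is that an array with more than one element has no single truth value. A minimal sketch (the array `flags` is made up) of what happens otherwise:

```python
import numpy as np

flags = np.array([1, 7, 3]) > 5      # array([False,  True, False])
try:
    if flags:                        # ambiguous: which element should decide?
        pass
except ValueError as err:
    print("ValueError:", err)

print(flags.any(), flags.all())
```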

### Type casting

Since Numpy arrays are *statically typed*, the type of an array does not change once created. But we can explicitly cast an array of some type to another using the `astype` method (see also the similar `asarray` function). This always creates a new array of the new type:

In [None]:
M.dtype

In [None]:
M2 = M.astype(float)

M2

In [None]:
M2.dtype

In [None]:
M3 = M.astype(bool)

M3
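
Casting can lose information. As a small sketch with made-up names, converting floats to integers discards the fractional part, while the original array is left unchanged:

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.0])
ints = floats.astype(int)      # fractional part is discarded (truncation toward zero)
print(ints)
print(floats)                  # the original array is unchanged
```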

### Saving Data in Binary Files

[Nelli - Python Data Analytics with Pandas, NumPy and Matplotlib [2nd ed.] (2018, Apress)]

Once you have an array to save, for example, one that contains the results of your
data analysis processing, you simply call the save() function and specify as arguments
the name of the file and the array. The file will automatically be given the .npy extension.

In [None]:
import os
os.chdir('D:\\Python_projects\\2020_Machine_learning')

In [None]:
data = np.array([[0.86466285, 0.76943895, 0.22678279],
                 [0.12452825, 0.54751384, 0.06499123],
                 [0.06216566, 0.85045125, 0.92093862],
                 [0.58401239, 0.93455057, 0.28972379]])
np.save('saved_data', data)

When you need to recover the data stored in a .npy file, you use the load() function
by specifying the filename as the argument, this time adding the extension .npy.

In [None]:
loaded_data = np.load('saved_data.npy')
loaded_data
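
The save and load steps can be tried without touching the working directory by writing to a temporary folder. This is only a sketch: the data values and the names `results` and `restored` are made up.

```python
import os
import tempfile

import numpy as np

results = np.array([[0.1, 0.2], [0.3, 0.4]])     # hypothetical data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_data")   # np.save appends ".npy"
    np.save(path, results)
    restored = np.load(path + ".npy")

print(np.array_equal(results, restored))
```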

### Reading Files with Tabular Data

`genfromtxt()` is called here with three main arguments: the name of the file containing the data, the character that separates the
values from each other (`delimiter`), and whether the data contain column headers (`names`).

In [None]:
nyse_1=np.genfromtxt('NYSE_1.csv',delimiter=',',dtype=float,names=True)
nyse_1

In [None]:
nyse_1.shape

In [None]:
nyse_1[0]

In [None]:
nyse_1[0][0]

In [None]:
nyse_1['ahp'].shape

In [None]:
nyse_2=np.genfromtxt('NYSE_1.csv',delimiter=',',skip_header=1)

In [None]:
nyse_2.shape

In [None]:
nyse=np.genfromtxt('NYSE.txt')
nyse

In [None]:
nyse.shape
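
Since the NYSE files are not included here, the behavior of `genfromtxt` with `names=True` can be sketched on an in-memory file instead. The CSV content and the column names `open` and `close` are made up for this example:

```python
import io

import numpy as np

# Hypothetical two-column CSV with a header row; the column names are made up.
csv_text = "open,close\n10.5,11.0\n11.0,10.8\n"

table = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)
print(table.dtype.names)    # column names taken from the header row
print(table["close"])       # one column, selected by name
print(table.shape)          # one entry per data row
```

With `names=True` the result is a one-dimensional structured array, so columns are accessed by name rather than by a second index.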

## Further reading

* http://numpy.scipy.org
* http://scipy.org/Tentative_NumPy_Tutorial
* http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.