# Introduction to Numpy

Python lists:

* are very flexible
* don't require uniform numerical types
* are very easy to modify (inserting or appending objects).

However, flexibility often comes at the cost of performance, and lists are not the ideal object for numerical calculations.

This is where **Numpy** comes in. Numpy is a Python module that defines a powerful n-dimensional array object that uses C and Fortran code behind the scenes to provide high performance.

In [2]:
import time
import math

a = range(10000000)

def func(a):
    return 1e-6*(4*a**2.) #7+1./(a+3)**3.4-(23*a)**4.2)-20

# measure how long it takes in seconds
start_time = time.time()

new_a = []
for val in a:
    new_a.append(func(val))
    
print(f'{time.time()-start_time} seconds')

1.6883940696716309 seconds


In [3]:
a

range(0, 10000000)

In [4]:
import numpy

start_time = time.time()
a = numpy.array(a)
new_a = func(a)
print(f'{time.time()-start_time} seconds')

0.5598809719085693 seconds


In [7]:
L = range(1000)
%timeit [i**2 for i in L]

a = numpy.arange(1000)
%timeit a**2

209 µs ± 3.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
700 ns ± 7.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


The downside of Numpy arrays is that they have a more rigid structure, and require a single numerical type (e.g. floating point values), but for a lot of scientific work, this is exactly what is needed.

The Numpy module is imported with:

In [None]:
import numpy

However, in the rest of this course, and in many packages, the following convention is used:

In [6]:
import numpy as np

Numpy is used so often that the shorter alias ``np`` saves a lot of typing compared to ``numpy``.

## Creating Numpy arrays

The easiest way to create an array is from a Python list, using the ``array`` function:

In [8]:
a = np.array([10, 20.1, 30, 40])

In [9]:
b = list()
for n in range(5):
    b.append(n)
c = np.array(b)
c

array([0, 1, 2, 3, 4])

In [10]:
type(a)

numpy.ndarray

Numpy arrays have several attributes that give useful information about the array:

In [12]:
a.ndim  # number of dimensions

1

In [11]:
a.shape  # shape of the array

(4,)

In [13]:
a.dtype  # numerical type

dtype('float64')
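
The ``float64`` result above comes from type promotion: the list mixes integers with ``20.1``, so every element is stored as a float. The following is a minimal sketch of how the type is chosen and how it can be set explicitly; the variable names are only illustrative.

```python
import numpy as np

ints = np.array([10, 20, 30, 40])       # all integers -> an integer dtype
mixed = np.array([10, 20.1, 30, 40])    # one float promotes the whole array

print(ints.dtype, mixed.dtype)

# The type can also be requested explicitly, or converted afterwards.
as_float = np.array([10, 20, 30, 40], dtype=float)
truncated = mixed.astype(int)           # conversion to int drops the fraction

print(as_float.dtype, truncated)
```

The exact integer type (for example ``int64``) depends on the platform, which is why it is not hard-coded here.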

There are several other ways to create arrays. For example, there is an ``arange`` function that can be used similarly to the built-in Python ``range`` function, with the exception that it can take floating-point input:

In [17]:
d = np.arange(1, 6, 1)
d
# np.arange(0, 10, 1)

array([1, 2, 3, 4, 5])

In [25]:
d.dtype

dtype('int64')

In [None]:
np.arange(30.5)

In [None]:
np.arange(3, 12, 2)

In [20]:
b = np.arange(1.2, 4.4, 0.1)


In [21]:
b

array([1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4,
       2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
       3.8, 3.9, 4. , 4.1, 4.2, 4.3])

In [22]:
b[3]

1.5000000000000002
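
The value ``1.5000000000000002`` is a sign of floating-point rounding. With a non-integer step, the same rounding also decides whether the end point is included, so the length of an ``arange`` result can be hard to predict. A small sketch of the usual alternative, ``linspace``, where the number of points is given explicitly (the name ``b_lin`` is only illustrative):

```python
import numpy as np

# With a float step, rounding decides whether the end point sneaks in,
# so the length is not obvious from the arguments.
b = np.arange(1.2, 4.4, 0.1)
print(len(b))

# linspace takes the number of points explicitly, so the length and
# both end points are fixed by the arguments.
b_lin = np.linspace(1.2, 4.4, 33)
print(len(b_lin), b_lin[0], b_lin[-1])
```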

Another useful function is ``linspace``, which can be used to create linearly spaced values between and including limits:

In [23]:
np.linspace(11, 12, 10)

array([11.        , 11.11111111, 11.22222222, 11.33333333, 11.44444444,
       11.55555556, 11.66666667, 11.77777778, 11.88888889, 12.        ])

A similar function, ``logspace``, creates logarithmically spaced values between and including limits. The limits are given as exponents of 10, so ``1.`` and ``4.`` below mean 10 and 10000:

In [12]:
np.logspace(1., 4., 7)

array([   10.        ,    31.6227766 ,   100.        ,   316.22776602,
        1000.        ,  3162.27766017, 10000.        ])
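
To make the relation between the two functions concrete, here is a short check that ``logspace`` matches 10 raised to the corresponding ``linspace`` values (a sketch, with ``exponents`` as an illustrative name):

```python
import numpy as np

exponents = np.linspace(1., 4., 7)        # 1.0, 1.5, ..., 4.0
by_hand = 10.0 ** exponents               # raise 10 to each exponent
from_logspace = np.logspace(1., 4., 7)

print(np.allclose(by_hand, from_logspace))
```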

Finally, the ``zeros`` and ``ones`` functions can be used to create arrays initially set to ``0`` and ``1`` respectively:

In [31]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [34]:
np.ones([5, 3, 5])[0, :, 1:3]

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

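In summary:

* ``np.array([X, X, X, X])``
* ``np.arange(start, finish, step)`` (``finish`` is excluded)
* ``np.linspace(start, finish, number_of_elements)`` (``finish`` is included)
* ``np.zeros([dim, dim])``, or ``np.zeros(X)`` for a 1D array
* ``np.ones([dim, dim])``
* ``np.empty([dim, dim])``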

In [36]:
a 

array([10. , 20.1, 30. , 40. ])

In [37]:
b

array([1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4,
       2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
       3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4])

In [28]:
c = np.vstack([a, a]) ### stack, vstack, hstack
c.shape

(2, 4)

In [40]:
np.shape(c)

(2, 4)

In [42]:
d = np.vstack([a, a])
d

array([[10. , 20.1, 30. , 40. ],
       [10. , 20.1, 30. , 40. ]])

In [45]:
e = np.hstack([a, b])

In [46]:
np.shape(e)

(37,)
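
The three stacking functions mentioned in the comment above differ in how they treat dimensions. A minimal sketch with small illustrative arrays (``rows``, ``flat`` and ``cols`` are names chosen for this example):

```python
import numpy as np

a = np.array([10, 20.1, 30, 40])
b = np.arange(3)

rows = np.vstack([a, a])          # stack as rows          -> shape (2, 4)
flat = np.hstack([a, b])          # join end to end        -> shape (7,)
cols = np.stack([a, a], axis=1)   # new axis at position 1 -> shape (4, 2)

print(rows.shape, flat.shape, cols.shape)
```

``vstack`` and ``stack`` need inputs of matching length, whereas ``hstack`` on 1D arrays simply concatenates them.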

## Exercise

Create an array which contains the value 2 repeated 10 times

In [36]:
np.ones(10) + 1 # * 2

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [38]:
np.array([2 for i in range(10)])

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [39]:
np.zeros(10) + 2

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])
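
Two further ways to solve this exercise, not used elsewhere in this notebook, are ``np.full`` and ``np.repeat``. A short sketch:

```python
import numpy as np

twos_full = np.full(10, 2)      # array of length 10 filled with the value 2
twos_repeat = np.repeat(2, 10)  # the value 2 repeated 10 times

print(twos_full)
print(twos_repeat)
```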

Create an array containing the values from 1 to 90 in steps of 1, followed by the values 95, 99, 99.9 and 99.99.

In [40]:
v1 = np.arange(1, 90.1, 1)  
v2 = [95, 99, 99.9, 99.99]
np.hstack([v1, v2])

array([ 1.  ,  2.  ,  3.  ,  4.  ,  5.  ,  6.  ,  7.  ,  8.  ,  9.  ,
       10.  , 11.  , 12.  , 13.  , 14.  , 15.  , 16.  , 17.  , 18.  ,
       19.  , 20.  , 21.  , 22.  , 23.  , 24.  , 25.  , 26.  , 27.  ,
       28.  , 29.  , 30.  , 31.  , 32.  , 33.  , 34.  , 35.  , 36.  ,
       37.  , 38.  , 39.  , 40.  , 41.  , 42.  , 43.  , 44.  , 45.  ,
       46.  , 47.  , 48.  , 49.  , 50.  , 51.  , 52.  , 53.  , 54.  ,
       55.  , 56.  , 57.  , 58.  , 59.  , 60.  , 61.  , 62.  , 63.  ,
       64.  , 65.  , 66.  , 67.  , 68.  , 69.  , 70.  , 71.  , 72.  ,
       73.  , 74.  , 75.  , 76.  , 77.  , 78.  , 79.  , 80.  , 81.  ,
       82.  , 83.  , 84.  , 85.  , 86.  , 87.  , 88.  , 89.  , 90.  ,
       95.  , 99.  , 99.9 , 99.99])

In [44]:
np.array([i for i in range(1, 91)])

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
       86, 87, 88, 89, 90])

## Numerical operations with arrays

Numpy arrays can be combined element by element using the standard ``+``, ``-``, ``*``, ``/`` and ``**`` operators:

In [45]:
x1 = np.array([1,2,3])
y1 = np.array([4,5,6])

In [46]:
y1

array([4, 5, 6])

In [47]:
2 * y1

array([ 8, 10, 12])

In [48]:
(x1 + 2) * y1

array([12, 20, 30])

In [49]:
x1 ** y1

array([  1,  32, 729])

Note that this differs from lists, where ``*`` repeats the list and ``+`` concatenates:

In [50]:
x = [1,2,3]
y = [4,5,6]

In [51]:
y.append('hola')
y

[4, 5, 6, 'hola']

In [54]:
3 * y

[4, 5, 6, 'hola', 4, 5, 6, 'hola', 4, 5, 6, 'hola']

In [72]:
2 * y

[4, 5, 6, 'hola', 4, 5, 6, 'hola']

In [69]:
x + 2 * y

[1, 2, 3, 4, 5, 6, 'hola', 4, 5, 6, 'hola']
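
To get element-wise arithmetic from data held in lists, convert the lists to arrays first. A minimal side-by-side sketch (the variable names are illustrative):

```python
import numpy as np

x_list = [1, 2, 3]
y_list = [4, 5, 6]

concatenated = x_list + y_list                  # lists: + joins them
summed = np.array(x_list) + np.array(y_list)    # arrays: + adds element-wise

print(concatenated)
print(summed)
```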

## Accessing and Slicing Arrays

Similarly to lists, items in arrays can be accessed individually:

In [73]:
x = np.array([9,8,7])

In [74]:
x

array([9, 8, 7])

In [75]:
x[0]

9

In [76]:
x[1]

8

and arrays can also be **sliced** by specifying the start and end of the slice (where the last element is exclusive):

In [77]:
y = np.arange(10, 20)
y

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [79]:
y[0:5:2] # Slices [start:end:step] [:]

array([10, 12, 14])

optionally specifying a step:

In [None]:
y[0:10:2]

As for lists, the start, end, and step are all optional, and default to ``0``, ``len(array)``, and ``1`` respectively:

In [80]:
y[:5]

array([10, 11, 12, 13, 14])

In [81]:
y[::2]

array([10, 12, 14, 16, 18])
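
As with lists, negative indices count from the end and a negative step walks backwards. A short sketch on the same kind of array:

```python
import numpy as np

y = np.arange(10, 20)

last = y[-1]            # last element
tail = y[-3:]           # last three elements
backwards = y[::-1]     # whole array in reverse order
odd_back = y[7:2:-2]    # from index 7 down to (not including) 2, step -2

print(last, tail, backwards, odd_back)
```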

## Exercise

Given an array ``x`` with 20 elements, find the array ``dx`` containing 19 values where ``dx[i] = x[i+1] - x[i]``. Do this without loops!

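For comparison, a loop-based version for a small list looks like this:

x = [3, 6, 7, 8]

# Option 1: enumerate over all elements but the last
l = list()
for ii, xi in enumerate(x[:-1]):
    l.append(x[ii+1] - xi)

# Option 2: loop over the indices (start again from an empty list)
l = list()
for ii in range(len(x)-1):
    l.append(x[ii+1] - x[ii])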

In [58]:
x = np.random.randint(10, 100, 5)
x

array([88, 88, 81, 79, 56])

In [62]:
x[1:] - x[:-1]

array([  0,  -7,  -2, -23])

In [63]:
np.diff(x)

array([  0,  -7,  -2, -23])
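
The cells above use a 5-element array. For the 20-element case asked for in the exercise, a sketch that also checks the slicing solution against ``np.diff``:

```python
import numpy as np

x = np.random.randint(10, 100, 20)   # 20 random integers in [10, 100)

# Shifted slices: every element except the first, minus every element
# except the last, gives x[i+1] - x[i] for all i at once.
dx = x[1:] - x[:-1]

print(dx.shape)
print(np.array_equal(dx, np.diff(x)))
```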

## Multi-dimensional arrays

<center> <img src="img/numpy_indexing.png" width="1600"/> </center>

In [104]:
y = np.ones([3,2,3])  # ones takes the shape of the array, not the values

In [108]:
y.shape

(3, 2, 3)

In [109]:
y

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

Multi-dimensional arrays can be sliced differently along different dimensions:

In [106]:
z = np.ones([6,6,6])

In [107]:
z[::3, 1:4, :]

array([[[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1.]]])
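
With an array of ones it is hard to see which elements a slice selects. The sketch below uses ``np.arange`` with ``reshape`` so that every element is distinct (the shape ``(2, 3, 4)`` is chosen only for illustration):

```python
import numpy as np

z = np.arange(24).reshape(2, 3, 4)   # values 0..23 in 2 blocks of 3 x 4

single = z[0, 1, 2]        # one index per dimension -> a single value
column = z[1, :, 0]        # block 1, all rows, first column
sub = z[:, 1:, ::2]        # all blocks, rows 1-2, every second column

print(single)
print(column)
print(sub.shape)
```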

## Constants

NumPy also provides some useful constants, so you should never need to type these in manually. Other libraries such as SciPy and MetPy have their own sets of constants that are more domain-specific.

In [110]:
np.pi

3.141592653589793

In [111]:
np.e

2.718281828459045

In [112]:
1 + np.pi

4.141592653589793

## Functions

In addition to an array class, Numpy contains a number of **vectorized** functions, which means functions that can act on all the elements of an array, typically much faster than could be achieved by looping over the array.

For example:

In [64]:
theta = np.linspace(0., 2. * np.pi, 10)

In [67]:
theta

array([0.        , 0.6981317 , 1.3962634 , 2.0943951 , 2.7925268 ,
       3.4906585 , 4.1887902 , 4.88692191, 5.58505361, 6.28318531])

In [116]:
np.shape(theta)

(10,)

In [66]:
theta.max()

6.283185307179586

In [68]:
np.max(theta)

6.283185307179586

In [75]:
tt = np.vstack([theta, theta])
tt.shape

(2, 10)

In [74]:
tt.max()

6.283185307179586

In [None]:
aa = np.array([np.linspace(0., 2. * np.pi, 10), np.linspace(0., 2. * np.pi, 10)])

In [79]:
pp = [np.linspace(0., 2. * np.pi, 10), np.linspace(0., 2. * np.pi, 10)]
pp

[array([0.        , 0.6981317 , 1.3962634 , 2.0943951 , 2.7925268 ,
        3.4906585 , 4.1887902 , 4.88692191, 5.58505361, 6.28318531]),
 array([0.        , 0.6981317 , 1.3962634 , 2.0943951 , 2.7925268 ,
        3.4906585 , 4.1887902 , 4.88692191, 5.58505361, 6.28318531])]

In [None]:
type(pp)

In [80]:
aa = np.asarray(pp)

In [None]:
aa.shape

In [76]:
np.sin(tt)

array([[ 0.00000000e+00,  6.42787610e-01,  9.84807753e-01,
         8.66025404e-01,  3.42020143e-01, -3.42020143e-01,
        -8.66025404e-01, -9.84807753e-01, -6.42787610e-01,
        -2.44929360e-16],
       [ 0.00000000e+00,  6.42787610e-01,  9.84807753e-01,
         8.66025404e-01,  3.42020143e-01, -3.42020143e-01,
        -8.66025404e-01, -9.84807753e-01, -6.42787610e-01,
        -2.44929360e-16]])

In [81]:
np.sin(aa[0, 0])

0.0
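
To see what "vectorized" replaces, the sketch below computes the same sines once with an explicit loop over ``math.sin`` and once with a single call to ``np.sin``:

```python
import math
import numpy as np

theta = np.linspace(0., 2. * np.pi, 10)

# Loop version: math.sin only accepts one number at a time.
looped = np.array([math.sin(t) for t in theta])

# Vectorized version: np.sin acts on the whole array in one call.
vectorized = np.sin(theta)

print(np.allclose(looped, vectorized))
```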

Another useful package is the ``np.random`` sub-package, which can be used to generate random numbers fast:

In [120]:
# uniform distribution between 0 and 1
np.random.random(10)

array([0.48849716, 0.52766269, 0.83970123, 0.85602101, 0.64405742,
       0.54420396, 0.65176724, 0.44591857, 0.1544168 , 0.2488368 ])

In [121]:
# 10 values from a gaussian distribution with mean 3 and sigma 1
np.random.normal(3., 1., 10)

array([2.61840403, 2.16555013, 4.04433245, 5.64950756, 3.73826699,
       2.84408076, 2.61227934, 2.99571071, 2.28552334, 3.71846935])
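
The numbers above change on every run. When reproducible results are needed, one option is a seeded generator. The sketch below assumes NumPy 1.17 or later, where ``np.random.default_rng`` is available; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)        # generator with a fixed seed

uniform = rng.random(10)               # uniform distribution between 0 and 1
gaussian = rng.normal(3., 1., 10)      # mean 3, sigma 1, 10 values

# A second generator with the same seed reproduces the same sequence.
again = np.random.default_rng(42).random(10)
print(np.array_equal(uniform, again))
```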

In [123]:
a = np.arange(12).reshape(3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [124]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [127]:
np.sum(a, axis=0)

array([12, 15, 18, 21])

In [None]:
a.shape

In [None]:
np.sum(a, axis=1)
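
The ``axis`` argument names the dimension that is summed over and therefore disappears from the result. A self-contained sketch with the same 3 x 4 array:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

col_sums = np.sum(a, axis=0)   # collapse the 3 rows    -> 4 values
row_sums = np.sum(a, axis=1)   # collapse the 4 columns -> 3 values
total = np.sum(a)              # no axis: sum of everything

print(col_sums, row_sums, total)
```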

Another very useful function in Numpy is [numpy.loadtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html), which makes it easy to read column-based data from a text file. For example, to read the file ``data/columns.txt``:

In [1]:
import numpy as np

In [3]:
from pathlib import Path

dir_data = Path('data')
data = np.loadtxt(dir_data / 'columns.txt')
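
The contents of ``columns.txt`` are not shown here, so the sketch below uses a small made-up text (the column names and values are hypothetical) wrapped in ``io.StringIO``, which ``np.loadtxt`` accepts in place of a file name. It illustrates the ``unpack`` and ``usecols`` keywords.

```python
import io
import numpy as np

# Hypothetical file contents, standing in for a real text file.
text = "# time height\n0.0 1.5\n1.0 1.7\n2.0 1.6\n"

# Lines starting with '#' are treated as comments and skipped.
data = np.loadtxt(io.StringIO(text))
print(data.shape)

# unpack=True returns one array per column.
time, height = np.loadtxt(io.StringIO(text), unpack=True)

# usecols selects only some columns (here the second one).
second = np.loadtxt(io.StringIO(text), usecols=1)
print(time, height, second)
```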

In [6]:
np.loadtxt?

Signature:
np.loadtxt(
    fname,
    dtype=<class 'float'>,
    comments='#',
    delimiter=None,
    converters=None,
    skiprows=0,
    usecols=None,
    unpack=False,
    ndmin=0,
    encoding='bytes',
    max_rows=None,

## Masking

The index notation ``[...]`` is not limited to single element indexing, or multiple element slicing, but one can also pass a discrete list/array of indices:

In [9]:
x = np.array([1,6,4,7, 9, 8, 10])
#x[[True, False, True, False]]

In [10]:
x

array([ 1,  6,  4,  7,  9,  8, 10])

which returns a new array composed of elements 1, 2, 4, etc. from the original array.
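
A minimal sketch of indexing with a list of positions, using the same ``x`` as above; the particular indices ``[1, 2, 4]`` are chosen only for illustration:

```python
import numpy as np

x = np.array([1, 6, 4, 7, 9, 8, 10])

picked = x[[1, 2, 4]]      # elements at positions 1, 2 and 4
repeated = x[[0, 0, -1]]   # positions may repeat or be negative

print(picked)
print(repeated)
```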

Alternatively, one can also pass a boolean array of ``True/False`` values, called a **mask**, indicating which items to keep:

In [12]:
y = np.array([3, 4, 5])

In [13]:
y

array([3, 4, 5])

In [15]:
mask = np.array([True, False, False])

In [16]:
mask

array([ True, False, False])

In [19]:
y[mask]

array([3])

Written out by hand like this, a mask is verbose and not very useful. However, carrying out a comparison with an array returns exactly such a boolean array:

In [141]:
x

array([ 1,  6,  4,  7,  9,  8, 10])

In [21]:
mask = x > 3.4

In [23]:
x[mask]

array([ 6,  4,  7,  9,  8, 10])

It is therefore possible to extract subsets from an array using the following simple notation:

In [25]:
mask = x > 3.4
x[x > 3.4]
# equivalent to x[mask]

array([ 6,  4,  7,  9,  8, 10])

In [26]:
x[mask]

array([ 6,  4,  7,  9,  8, 10])

In [27]:
x[x <= 3.4]

array([1])

In [28]:
x[~mask]

array([1])

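Conditions can be combined. In ordinary Python conditionals (for example inside loops), conditions are combined with ``and`` and ``or``. For masks on Numpy arrays, use the element-wise operators ``&`` (and) and ``|`` (or) instead, with each condition wrapped in parentheses: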

In [152]:
x

array([ 1,  6,  4,  7,  9,  8, 10])

In [32]:
x[((x > 3.4) & (x < 9.5)) | (x<2)]

array([1, 6, 4, 7, 9, 8])
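
The sketch below shows why ``and`` cannot be used for masks: Python asks the whole array for a single truth value, which Numpy refuses to give. The element-wise ``&`` works, and ``np.logical_and`` is an equivalent function form.

```python
import numpy as np

x = np.array([1, 6, 4, 7, 9, 8, 10])

try:
    x[(x > 3.4) and (x < 9.5)]          # plain Python 'and' on arrays
except ValueError as err:
    print("and fails:", err)

both = (x > 3.4) & (x < 9.5)            # element-wise 'and'
same = np.logical_and(x > 3.4, x < 9.5)

print(x[both])
print(np.array_equal(both, same))
```

The parentheses matter because ``&`` binds more tightly than comparisons such as ``>``.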

Of course, the boolean **mask** can be derived from a different array to ``x`` as long as it is the right size:

In [46]:
x = np.linspace(-1., 1., 14)
y = np.array([1,6,4,7,9,3,1,5,6,7,3,4,4,3])

In [155]:
y.shape

(14,)

In [156]:
x.shape

(14,)

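In [34]:
y2 = y + 7
y2

array([ 8, 13, 11, 14, 16, 10,  8, 12, 13, 14, 10, 11, 11, 10])

In [37]:
mask = y2 > 10

In [39]:
mask

array([False,  True,  True,  True,  True, False, False,  True,  True,
        True, False,  True,  True, False])

In [40]:
y[mask]

array([6, 4, 7, 9, 5, 6, 7, 4, 4])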

In [160]:
y[(x > -0.5) | (x < 0.4)]

array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])

Since the mask itself is an array, it can be stored in a variable and used as a mask for different arrays:

In [159]:
keep = (x > -0.5) & (x < 0.4)
x_new = x[keep]
y_new = y[keep]

In [None]:
keep

In [None]:
x_new

In [None]:
y_new
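
The cells above are shown without output. A self-contained sketch of the same steps, using the ``x`` and ``y`` defined earlier in this section:

```python
import numpy as np

x = np.linspace(-1., 1., 14)
y = np.array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])

# One mask, derived from x, applied to both arrays so that the
# selected elements stay paired up.
keep = (x > -0.5) & (x < 0.4)
x_new = x[keep]
y_new = y[keep]

print(keep.sum(), "elements kept")
print(x_new)
print(y_new)
```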

We can also use this conditional indexing to assign new values to certain positions within our array, somewhat like a masking operation.

In [161]:
y

array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])

In [163]:
mask = y>5
mask

array([False,  True, False,  True,  True, False, False, False,  True,
        True, False, False, False, False])

In [43]:
y[mask] = 999

In [44]:
y

array([  1, 999,   4, 999, 999,   3,   1,   5, 999, 999,   3,   4,   4,
         3])

In [66]:
y = y + 0.1
y

array([  1.1, 999.1,   4.1, 999.1, 999.1,   3.1,   1.1,   5.1, 999.1,
       999.1,   3.1,   4.1,   4.1,   3.1])

In [53]:
mm = y > 3

In [54]:
mm

array([False,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True])

In [65]:
y[mm] = np.nan
y

array([1.1, nan, nan, nan, nan, nan, 1.1, nan, nan, nan, nan, nan, nan,
       nan])

In [62]:
pp = np.array([np.nan, 3, 4, np.nan])
pp

array([nan,  3.,  4., nan])

In [None]:
y[y > 5] = 3
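
One detail matters when the assigned value is NaN: NaN is a floating-point value, so it cannot be stored in an integer array. A minimal sketch of the failure and of the conversion that avoids it (``y_float`` is an illustrative name):

```python
import numpy as np

y = np.array([1, 6, 4, 7])         # integer array

try:
    y[y > 5] = np.nan              # NaN does not fit into an integer
except ValueError as err:
    print("assignment failed:", err)

y_float = y.astype(float)          # convert to float first
y_float[y_float > 5] = np.nan      # now the assignment works

print(y_float)
```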

### NaN values

In arrays, some of the values are sometimes NaN - meaning *Not a Number*. If you multiply a NaN value by another value, you get NaN, and if there are any NaN values in a summation, the total result will be NaN. One way to get around this is to use ``np.nansum`` instead of ``np.sum`` in order to find the sum:

In [68]:
x = np.array([1, 2, 3, np.nan])
x

array([ 1.,  2.,  3., nan])

In [69]:
np.sum(x)

nan

In [171]:
np.nan  # preferred spelling; the np.NaN alias was removed in NumPy 2.0

nan

In [70]:
np.nansum(x)

6.0

In [175]:
np.nanmax(x)

3.0

You can also use ``np.isnan`` to tell you where values are NaN. For example, ``array[~np.isnan(array)]`` will return all the values that are not NaN (because ~ means 'not'):

In [71]:
np.isnan(x)

array([False, False, False,  True])

In [177]:
x[np.isnan(x)]

array([nan])

In [178]:
x[~np.isnan(x)]

array([1., 2., 3.])
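
``np.isnan`` is needed because NaN never compares equal to anything, including itself. The sketch below shows this, and also counts and replaces the NaN values (the fill value ``0.0`` is an arbitrary choice for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, np.nan])

print(x == np.nan)                       # all False: == cannot find NaN

n_missing = np.isnan(x).sum()            # True counts as 1 in a sum
filled = np.where(np.isnan(x), 0.0, x)   # 0.0 where NaN, x elsewhere

print(n_missing, filled)
```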

### Statistics --> Scipy

In [None]:
import numpy.random as rnd


g = rnd.normal(loc=0, scale=1, size=1000000)

print(numpy.mean(g), numpy.median(g), numpy.std(g))

# specifying axis of operation gives different results:
a = [[1,1,1], [2,2,2], [3,3,3]]
print(numpy.mean(a))         # mean of all numbers in the array
print(numpy.mean(a, axis=0)) # mean along axis 0 (first axis = outermost axis = along columns)
print(numpy.mean(a, axis=1)) # mean along axis 1 (second axis = along rows)

# operations that ignore nans
b = [1, 2, 3, numpy.nan, 4, 5]
print(numpy.mean(b))    # returns nan
print(numpy.nanmean(b)) # ignores nan; see nanmedian, nanstd, ...

# determine percentiles
print(numpy.percentile(g, 50)) # the same as median
print(numpy.percentile(g, 68.27)-numpy.percentile(g, 31.73))

# create a histogram
hist, bins = numpy.histogram(g, bins=numpy.arange(-5, 6, 1))
print(bins)
print(hist)

## Exercise

The [data/SIMAR_gaps.txt](data/SIMAR_gaps.txt) data file contains wave climate data for the Mediterranean Sea.

Read in the file using ``np.loadtxt``. The data contains bad values, which you can identify by looking at the minimum and maximum values of the array. Use masking to get rid of the bad values.

In [86]:
from pathlib import Path
dir_data = Path('data')
data = np.loadtxt(dir_data / 'SIMAR_gaps.txt', skiprows=1)
#np.genfromtxt()
var = data[:, 4]

Let's find the bad values by examining the max, min and mean and looking for values that do not make sense:

In [74]:
np.nanmin(var)

-99.9

In [75]:
np.nanmax(var)

3.5

In [87]:
np.nanmean(var)

0.4804713804713802

So the bad values are those equal to -99.9. Define a mask that is ``True`` for the "good" values:

In [90]:
mask = var > 0  # mask contains True for "good values"

1. Remove the bad values: create a new variable that contains only the "good" ones. This changes the length of the array.

In [80]:
var_good = var[mask]

In [None]:
var_good

2. Assign NaN to the "bad" values, creating a gap. The length of the array stays the same.

In [92]:
var[~mask] = np.nan
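
The two strategies can be compared on a small synthetic array. The values below are made up and only share the ``-99.9`` flag with the real file. This sketch also selects the bad values by matching the flag itself rather than with ``var > 0``, so that a genuine value of ``0.0`` would be kept; that is a design choice of this example, not part of the original solution.

```python
import numpy as np

# Hypothetical data with -99.9 marking bad values.
var = np.array([0.5, 1.2, -99.9, 0.0, 3.5, -99.9])

bad = np.isclose(var, -99.9)            # True where the flag value appears

var_good = var[~bad]                    # option 1: shorter array, no gaps
var_nan = np.where(bad, np.nan, var)    # option 2: same length, NaN gaps

print(len(var_good), len(var_nan))
print(np.nanmin(var_nan), np.nanmax(var_nan))
```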

### Linear Algebra

In [180]:
import numpy
import numpy.linalg as la

a = [1,2,3] 
b = [4,5,6]

print(numpy.dot(a,b)) # dot product
print(numpy.inner(a,b)) 
print(numpy.outer(a,b)) 

i = numpy.diag([1,2,3])
print(la.eig(i)) # return eigenvalues and eigenvectors
print(la.det(i)) # return determinant

# solve linear equations
a = [[3,2,-1], [2,-2,4], [-1,0.5,-1]]
b = [1,-2,0]
print(la.solve(a,b)) 

32
32
[[ 4  5  6]
 [ 8 10 12]
 [12 15 18]]
(array([1., 2., 3.]), array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]]))
6.0
[ 1. -2. -2.]
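
A quick way to check a solution from ``solve`` is to multiply it back with the matrix product operator ``@`` and compare with the right-hand side. A sketch using the same system as above (``A`` and ``x`` are illustrative names):

```python
import numpy as np

A = np.array([[3, 2, -1], [2, -2, 4], [-1, 0.5, -1]])
b = np.array([1, -2, 0])

x = np.linalg.solve(A, b)       # solve A @ x = b

# Multiplying back should reproduce b, up to rounding.
print(x)
print(np.allclose(A @ x, b))
```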
