# DSC200 Lecture 7 

## An introduction to NumPy

## NumPy

NumPy provides an efficient representation of multidimensional datasets such as vectors and matrices, together with tools for linear algebra and general matrix manipulation. These are essential building blocks of virtually all technical computing.

Typically NumPy is imported as `np`:

In [None]:
import numpy as np

NumPy, at its core, provides a powerful array object.  Let's start by exploring how the NumPy array differs from a Python list.  

We start by creating a simple Python list and a NumPy array with identical contents:

In [None]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])
print(lst)
print(arr)

### Element indexing

Elements of a one-dimensional array are accessed with the same syntax as a list:

In [None]:
print(lst[0], arr[0])

In [None]:
print(lst[-1], arr[-1])

In [None]:
print(lst[2:], arr[2:])

### Differences between arrays and lists

The first difference to note between lists and arrays is that arrays are *homogeneous*; i.e. all elements of an array must be of the same type.  In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:

In [None]:
lst[-1] = 'a string inside a list'
lst

But the same cannot be done with an array; the assignment raises an error:

In [None]:
arr[-1] = 'a string inside an array'

Caveat: it can be done, but really *don't do it*; lists are generally better suited to non-homogeneous collections.
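
As a hedged illustration of the caveat above, the sketch below shows two things: NumPy coerces mixed numeric input to one common dtype, and explicitly requesting `dtype=object` is one way to make an array hold arbitrary Python objects. The variable names `mixed_numbers` and `obj_arr` are illustrative and not part of the lecture code.

```python
import numpy as np

# Mixed int/float input is upcast to a single common dtype.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)   # float64

# dtype=object stores references to arbitrary Python objects,
# which gives up most of NumPy's speed and memory advantages.
obj_arr = np.array([10, 20, 30, 40], dtype=object)
obj_arr[-1] = 'a string inside an array'
print(obj_arr)
```

An object array behaves much like a fixed-length list, which is why a plain list is usually the better choice for mixed content.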

## Array Properties and Methods

The following provide basic information about the size, shape and data in the array:

In [None]:
print('Data type                :', arr.dtype)
print('Total number of elements :', arr.size)
print('Number of dimensions     :', arr.ndim)
print('Shape (dimensionality)   :', arr.shape)
print('Memory used (in bytes)   :', arr.nbytes)

Arrays also have many useful statistical/mathematical methods:

In [None]:
print('Minimum and maximum             :', arr.min(), arr.max())
print('Sum and product of all elements :', arr.sum(), arr.prod())
print('Mean and standard deviation     :', arr.mean(), arr.std())
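
One detail worth knowing about `std`: by default it computes the population standard deviation (dividing by N). The short sketch below, which uses the same four values as `arr`, contrasts this with the sample standard deviation obtained through the `ddof` argument. The names `population_std` and `sample_std` are illustrative.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

# Default (ddof=0): population standard deviation, divides by N.
population_std = arr.std()

# ddof=1: sample standard deviation, divides by N - 1.
sample_std = arr.std(ddof=1)

print(population_std, sample_std)
```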

### Data types

The information about the type of an array is contained in its `dtype` attribute.

In [None]:
arr.dtype

Once an array has been created, its `dtype` is fixed (in this case a 64-bit signed integer, i.e. 8 bytes per element, on most platforms) and it can only store elements of the same type.

For this example where the `dtype` is integer, if we try storing a floating point number in the array, its fractional part is silently discarded and the value is stored as an integer:

In [None]:
arr[-1] = 1.234
arr
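
A small sketch of what this conversion does in practice, assuming an integer array like `arr`: the fractional part is dropped (truncation toward zero) rather than rounded, so rounding has to be requested explicitly.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

arr[0] = 1.999    # stored as 1: the fractional part is dropped
arr[1] = -1.999   # stored as -1: truncation is toward zero
print(arr)

# To round to the nearest integer, do so explicitly before assigning.
arr[2] = np.round(2.6)
print(arr[2])     # 3
```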

NumPy comes with most of the common data types (and some uncommon ones too).

The most used (and portable) dtypes are:

* bool
* uint8
* int (machine dependent)
* int8
* int32
* int64
* float (machine dependent)
* float32
* float64

Full details can be found at http://docs.scipy.org/doc/numpy/user/basics.types.html.

What are the limits of the common NumPy integer types?

In [None]:
np.array(256, dtype=np.uint8)
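
The limits themselves can be queried directly with `np.iinfo`, the integer counterpart of `np.finfo`. The loop below is a minimal sketch over a few of the dtypes listed above. Note that what happens to an out-of-range value such as 256 in a `uint8` depends on the NumPy version: it may wrap around silently or raise an error.

```python
import numpy as np

# Print the representable range of several integer dtypes.
for int_type in (np.uint8, np.int8, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(f'{info.dtype}: min={info.min}, max={info.max}')
```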

In [None]:
float_info = ('{finfo.dtype}: max={finfo.max:<18}, '
              'approx decimal precision={finfo.precision};')
print(float_info.format(finfo=np.finfo(np.float32)))
print(float_info.format(finfo=np.finfo(np.float64)))

## Pop quiz

What is the output of the code below?

In [None]:
np.array(128, dtype=np.int8)

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head"></th>
<th class="head">Output</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">A</span></tt></td>
<td>-128</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">B</span></tt></td>
<td>-127</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">C</span></tt></td>
<td>0</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">D</span></tt></td>
<td>128</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">E</span></tt></td>
<td>An error</td>
</tr>
</tbody>
</table>

Floating point precision is covered in detail at http://en.wikipedia.org/wiki/Floating_point.

Although the `dtype` of an existing array is fixed, we can convert an array from one type to another with the ``astype`` method, which returns a new array:

In [None]:
np.array(1, dtype=np.uint8).astype(np.float32)
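
A brief sketch of how `astype` behaves, using illustrative names `ints` and `floats`: the conversion produces a new array and leaves the original untouched, and converting floats to integers truncates the fractional part.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.uint8)
floats = ints.astype(np.float32)   # new array; `ints` is left unchanged

print(ints.dtype, floats.dtype)    # uint8 float32

# Converting floats to integers drops the fractional part.
truncated = np.array([1.7, -1.7]).astype(np.int32)
print(truncated)
```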

## Creating Arrays

Above we created an array from an existing list. Now let's look into other ways in which we can create arrays.

A common need is to have an array initialized with a constant value. Very often this value is 0 or 1.

`zeros` creates arrays of all zeros, with any desired dtype:

In [None]:
np.zeros(5, dtype=np.float64)

In [None]:
np.zeros(3, dtype=np.int32)

and similarly for `ones`:

In [None]:
print('5 ones:', np.ones(5, dtype=np.int32))

If we want an array initialized with an arbitrary value, we can create an empty array and then use the fill method to put the value we want into the array:

In [None]:
a = np.empty(4, dtype=np.float32)
a.fill(5.5)
a
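
NumPy also provides `np.full`, which does the allocation and the fill in a single call. The sketch below is an alternative to the `empty` plus `fill` pattern, not a replacement for it; the names `b` and `c` are illustrative.

```python
import numpy as np

# Allocate and fill in one step.
b = np.full(4, 5.5, dtype=np.float32)
print(b)

# np.full_like copies the shape and dtype of an existing array.
c = np.full_like(b, 2.0)
print(c)
```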

Alternatives, such as:

 * ``np.ones(4) * 5.5``
 * ``np.zeros(4) + 5.5``

are generally less efficient, but are also reasonable.

### Filling arrays with sequences

NumPy also offers the `arange` function, which works like the built-in `range` but returns an array and also accepts non-integer start, stop and step values:

In [None]:
np.arange(10, dtype=np.float64)

In [None]:
np.arange(5, 7, 0.1)

The `linspace` and `logspace` functions create linearly and logarithmically spaced grids respectively, with a fixed number of points that include both ends of the specified interval:

In [None]:
print("A linear grid between 0 and 1:")
print(np.linspace(0, 1, 5))

In [None]:
print("A logarithmic grid between 10**2 and 10**4:")
print(np.logspace(2, 4, 3))
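
The sketch below illustrates three properties of these functions: `linspace` includes both endpoints by default, `endpoint=False` gives a half-open grid similar to `arange`, and `logspace` is equivalent to raising 10 to the points of a `linspace`. The variable names are illustrative.

```python
import numpy as np

# 21 points from 5 to 7 inclusive, i.e. a spacing of 0.1.
grid = np.linspace(5, 7, 21)
print(grid[0], grid[-1], len(grid))

# endpoint=False leaves out the upper bound, as arange does.
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)

# logspace(a, b, n) is 10 raised to the points of linspace(a, b, n).
log_grid = np.logspace(2, 4, 3)
print(np.allclose(log_grid, 10.0 ** np.linspace(2, 4, 3)))
```

When the number of points matters more than the exact step, `linspace` is usually the safer choice, because the length of an `arange` with a floating point step can be affected by rounding.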

### Creating random arrays

Finally, it is often useful to create arrays of random numbers that follow a specific distribution.  The `np.random` module contains a number of functions for this purpose.
It is already available as `np.random` once NumPy has been imported, so the explicit import below is optional:

In [None]:
import numpy as np
import numpy.random

To produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1):

In [None]:
print(np.random.randn(5))

For an array of 5 samples from the normal distribution with a mean of 10 and a standard deviation of 3 (the second argument of `normal` is the standard deviation, not the variance):

In [None]:
norm10 = np.random.normal(10, 3, 5)
print(norm10)
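
Random output differs from run to run, which makes results hard to reproduce. One option, sketched below, is NumPy's newer `Generator` interface created with `np.random.default_rng` and a fixed seed. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

standard = rng.standard_normal(5)               # mean 0, standard deviation 1
shifted = rng.normal(loc=10, scale=3, size=5)   # mean 10, standard deviation 3
print(standard)
print(shifted)

# Re-creating the generator with the same seed reproduces the same samples.
rng_again = np.random.default_rng(seed=42)
repeated = rng_again.standard_normal(5)
print(np.array_equal(standard, repeated))
```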

## Indexing with other arrays

Above we saw how to index NumPy arrays with single numbers and slices, just like Python lists.  

Arrays also allow for a more sophisticated kind of indexing that is very powerful: you can index an array with another array, and in particular with an array of boolean values.  

This is particularly useful to extract information from an array that matches a certain condition.

Consider for example that in the array `norm10` we want to replace all values above 9 with the value 0.  We can do so by first finding the *mask* that indicates where this condition is true or false:

In [None]:
mask = norm10 > 9
mask

A related caveat: comparisons with NaN using operators such as `>` or `<=` always evaluate to False, so a NaN entry is excluded both by a mask and by its apparent opposite. Filtering with `np.isnan` first avoids this:

In [None]:
nanarray = np.array([-1., 0., 1., np.nan])

In [None]:
nanarray > 0.5

In [None]:
nanarray <= 0.5

In [None]:
nanarray[~np.isnan(nanarray)] > 0.5

Now that we have this mask, we can use it to either read those values or to reset them to 0:

In [None]:
print('Values above 9:', norm10[mask])

In [None]:
print('Resetting all values above 9 to 0...')
norm10[mask] = 0
print(norm10)
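
Because `norm10` is random, the result above changes on every run. The sketch below repeats the idea on fixed, made-up data, and also shows `np.where`, which builds a new array instead of modifying the original, and how masks can be combined. All names and values here are illustrative.

```python
import numpy as np

values = np.array([8.5, 9.5, 10.2, 7.1])   # made-up data
mask = values > 9

print(values[mask])   # the selected values: 9.5 and 10.2

# np.where builds a new array and leaves `values` untouched.
cleaned = np.where(mask, 0, values)
print(cleaned)

# Masks can be combined with & (and), | (or) and ~ (not).
in_range = (values > 8) & (values < 10)
print(values[in_range])
```

The parentheses around each comparison are required, because `&` binds more tightly than `>` and `<`.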

Whilst beyond the scope of this course, it is also worth knowing that a specific masked array object exists in NumPy.
Further details are available at http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html

## Arrays with more than one dimension

Up until now all our examples have used one-dimensional arrays.  NumPy can also create arrays of arbitrary dimensions, and all the methods illustrated in the previous section work on arrays with more than one dimension.

A list of lists can be used to initialize a two dimensional array:

In [None]:
lst2 = [[1, 2, 3], [4, 5, 6]]
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
print(arr2.shape)

With two-dimensional arrays we start seeing the power of NumPy: while nested lists can be indexed by repeatedly using the `[ ]` operator, multidimensional arrays support a much more natural indexing syntax using a single `[ ]` and a set of indices separated by commas:

In [None]:
print(lst2[0][1])
print(arr2[0, 1])
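
The comma-separated syntax also accepts slices in each position. The sketch below, which rebuilds `arr2` so that it runs on its own, shows a few common patterns.

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2[1])        # second row
print(arr2[1, :])     # the same row, with the column slice written out
print(arr2[:, 1:])    # every row, columns 1 onwards
print(arr2[-1, -1])   # last row, last column
```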

Question: Why do the following two expressions produce different results?

In [None]:
print(lst2[0:2][1])
print(arr2[:, 1])