# Lecture 2: NumPy & Pandas

## Part I: NumPy - Numerical Python via Arrays

## James Percival <j.percival@imperial.ac.uk>

### Slides based on the NumPy tutorials and work by Dr. Parastoo Salah

## Part 1: NumPy - multidimensional arrays
## Introduction

- NumPy is a Python library for numerical computing.
- Provides support for:
  - multidimensional arrays, matrices & linear algebra
  - arrays of random numbers
  - Fourier transforms
  - Polynomials
  - Lots more useful stuff

## Before we start

- Explaining NumPy properly needs a bit of terminology, e.g.
    - arrays,
    - contiguous memory,
    - data types,
    - dimensions.
- If you're not familiar with any of these terms, don't worry!
- We'll explain them as we go along, and you can also ask questions during the exercises.

## Where to start?

As with any package, we start by importing it.
The convention is to import NumPy as `np`.

In [None]:
import numpy as np

You may see examples online with

```python
from numpy import *
```

This is dangerous; it replaces the built-in Python functions with NumPy functions, and not all functions have the same behaviour (e.g. `sum` or `max`).

In [None]:
max(-1, 0) # builtin `max` finds maximum of the inputs

In [None]:
np.max(-1, 0)  # NumPy treats the first argument as the array and the second as the axis, so this raises an error

In [None]:
from numpy import *
max(-1, 0)

## What is an array?

An array is a collection of items stored at contiguous memory locations.

Python already has lists, which store collections of items. So why do we need arrays?

- Arrays are faster & more efficient than lists (if the contents are all the same type of "thing").
- Sometimes we want to have an ordering in more than one dimension (e.g. a matrix).
- Arrays can be used to represent vectors and matrices for linear algebra.
- We can expect arrays to support mathematical operations (e.g. addition, multiplication, determinants).
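
As a small illustration of the last point, the sketch below compares how `+` and `*` behave on a plain list and on a NumPy array. The variable names and values are made up for this example.

```python
import numpy as np

values = [1, 2, 3]              # hypothetical example data
as_array = np.array(values)

list_sum = values + values      # lists: `+` concatenates
array_sum = as_array + as_array # arrays: `+` adds element by element

list_twice = values * 2         # lists: `*` repeats the list
array_twice = as_array * 2      # arrays: `*` multiplies every element

print(list_sum)     # [1, 2, 3, 1, 2, 3]
print(array_sum)    # [2 4 6]
print(list_twice)   # [1, 2, 3, 1, 2, 3]
print(array_twice)  # [2 4 6]
```

The list operations are still useful, but they are not arithmetic, which is one reason arrays are the natural choice for numerical work.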

Python has a built-in `array` module, so why not use that?

- The `array` module is limited to one-dimensional arrays.
- Its way of specifying the element type is less flexible and less user-friendly than NumPy's.
- Base Python has been kept relatively `C`-like.

Why not build NumPy into base Python?

- NumPy is a large package with many dependencies, and not all of them are needed for every project.
- NumPy is a separate package, so it can be updated independently of the base Python.
- Would make the base Python package larger and slower to load, as well as limiting the equipment on which it can be run.
- Large penalty to update code for no real benefit.

- Some useful features (e.g. `math.isclose`) have made their way into the standard library.

## Creating a NumPy array

The most basic object in NumPy is the `ndarray` object.

In [None]:
my_list = [1, 2, 3, 4, 5]
np.array(my_list)

In [None]:
my_matrix = [[1,2,3], [4,5,6], [7,8,9]]
np.array(my_matrix)

All the elements at a particular level (_dimension_) must be the same length.

## Functions which create arrays

NumPy has a number of functions which create arrays based on a few parameters:

- `np.zeros` - creates an array of zeros of a specified size
- `np.ones` - creates an array of ones of a specified size
- `np.arange` - creates an array with a range of values
- `np.linspace` - creates an array with a range of values, with a specified number of points
- `np.eye` - creates an identity matrix

In [None]:
arr1 = np.zeros(3)
print(arr1)

In [None]:
arr2 = np.ones((3,3))
print(arr2)

In [None]:
arr3 = np.arange(start=0, stop=10, step=2)
print(repr(arr3)) # Note endpoint *isn't* included here

In [None]:
arr4 = np.linspace(0, 10, 6)
print(repr(arr4)) # Note endpoint *is* included here

In [None]:
arr5 = np.eye(3)
print(arr5)

## Shapes and dimensions

NumPy arrays have attributes which describe their shape and dimension.
- `ndim` - the number of dimensions
  - e.g. `np.eye(3).ndim` is `2`.
- `shape` - the size of the array in each dimension
  - e.g. `np.eye(3).shape` is `(3, 3)`.
- `size` - the total number of elements in the array
  - e.g. `np.eye(3).size` is `9`, i.e. `np.prod(np.eye(3).shape)`.
- `nbytes` - the total memory (in bytes) used by NumPy for the array data
  - e.g. `np.eye(3).nbytes` is `72`, since each float is 8 bytes.
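
The same attributes can be checked on an array with more than two dimensions. This is a minimal sketch; the array `cube` is just an illustrative name, and `itemsize` (the size of one element in bytes) is an extra attribute not listed above.

```python
import numpy as np

cube = np.zeros((2, 3, 4), dtype=np.float64)  # hypothetical 3D example

print(cube.ndim)      # 3
print(cube.shape)     # (2, 3, 4)
print(cube.size)      # 24
print(cube.itemsize)  # 8
print(cube.nbytes)    # 192

# nbytes is the number of elements times the size of each element
print(cube.nbytes == cube.size * cube.itemsize)  # True
```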

We can use the `ndarray.reshape` method to change the shape of an array.

- Must have the same number of elements in the new shape as the old shape.
- I.e. the `size` cannot change.
- Equivalently `np.prod(old_shape) == np.prod(new_shape)`
- Use `-1` in one dimension to infer the size from the other dimensions.

In [None]:
arr = np.arange(0, 12)
print(arr.reshape(4, 3))

In [None]:
print(arr.reshape(2, 6))

In [None]:
print(arr.reshape(3, -1))

The `ndarray.ravel` method flattens an array (makes it 1D)

In [None]:
arr = np.arange(0, 12).reshape(3,4)
arr = np.array(arr, order='C') # Using 'C' style ordering (the default)

print(arr)
print(arr.ravel())

In [None]:
# Print the elements in the order they are stored in memory
print(arr.ravel(order='K'))

The "flattened" version with `order='K'` is (roughly) how the array is stored in memory by your computer.
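
To see why the memory order matters, the sketch below stores the same values in row-major ('C') and column-major ('F') layouts and flattens both with `order='K'`. It assumes `np.asfortranarray` is an acceptable way to get a column-major copy; the names `c_arr` and `f_arr` are illustrative.

```python
import numpy as np

c_arr = np.arange(0, 12).reshape(3, 4)  # row-major ('C') layout
f_arr = np.asfortranarray(c_arr)        # same values, column-major ('F') layout

# The two arrays compare equal element by element...
print(np.array_equal(c_arr, f_arr))     # True

# ...but their elements sit in memory in a different order.
print(c_arr.ravel(order='K'))  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
print(f_arr.ravel(order='K'))  # [ 0  4  8  1  5  9  2  6 10  3  7 11]
```

With the default `order='C'`, `ravel` returns the same row-by-row result for both arrays, whatever the underlying layout.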

The `ndarray.T` attribute transposes an array.

In [None]:
arr = np.arange(0, 12).reshape(3,4)

print(arr)
print(arr.T)

## Data types & `dtype`

NumPy arrays have a `dtype` attribute which describes the type of the elements in the array.

If this type is of a fixed size, the array will be stored in a contiguous block of memory. This is more efficient than (e.g.) a Python list, where each element is a separate object.

NumPy can infer a `dtype` from the input data:

In [None]:
arr = np.array([1, 2, 3])
print(arr.dtype)

In [None]:
arr = np.array([1, 2.5, 3])
print(arr.dtype)

In [None]:
arr = np.array([1, 2, "Hello, World!"])
print(arr.dtype)

You can also specify the `dtype` when creating an array:

In [None]:
arr = np.array([1, 2.5, 3], dtype=np.int32)
print(arr.dtype)

In [None]:
print(arr)

In [None]:
arr = np.array([1, 2.5, 3], dtype=np.float64)
print(arr.dtype)
print(repr(arr))

Watch out when inserting values with the "wrong" data type:

In [None]:
int_arr = np.ones(3, dtype=np.int32)

print(int_arr)

In [None]:
int_arr[1] = 3.5
print(int_arr)

## Arithmetic and methods

NumPy arrays support maths:

- `+`, `-`, `*`, `/`, `**`, `%` (element-wise)
- `np.add`, `np.subtract`, `np.multiply`, `np.divide`, `np.power`, `np.mod` (element-wise)
- `@`, `np.matmul`, `np.dot` (matrix multiplication)
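
A short sketch of the difference between element-wise and matrix multiplication, using two small matrices chosen only for illustration:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

elementwise = A * B     # multiplies matching entries
matrix_product = A @ B  # rows of A times columns of B

print(elementwise)      # [[0 2]
                        #  [3 0]]
print(matrix_product)   # [[2 1]
                        #  [4 3]]

# The operators and the named functions give the same results
print(np.array_equal(A + B, np.add(A, B)))              # True
print(np.array_equal(matrix_product, np.matmul(A, B)))  # True
```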

## Broadcasting

To be user-friendly, NumPy arrays can be added, subtracted, etc. even if they are different shapes.

- If the arrays are different shapes, NumPy will "broadcast" the smaller array to the larger array's shape.
- The shapes must be compatible: comparing dimensions from the last one backwards, each pair must either be equal or one of them must be 1 (missing leading dimensions are treated as 1).
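
The sketch below shows broadcasting in a few simple cases, including one that fails. The array names are made up for this example.

```python
import numpy as np

grid = np.zeros((3, 4))
row = np.arange(4)                # shape (4,)
col = np.arange(3).reshape(3, 1)  # shape (3, 1)

plus_row = grid + row  # `row` is repeated down the 3 rows
plus_col = grid + col  # `col` is repeated across the 4 columns
table = col + row      # (3, 1) and (4,) broadcast to (3, 4)

print(plus_row.shape, plus_col.shape, table.shape)
print(table)

# Shapes (3, 4) and (3,) do not match in the last dimension,
# and neither size is 1, so this raises a ValueError.
incompatible = False
try:
    grid + np.arange(3)
except ValueError as err:
    incompatible = True
    print("Could not broadcast:", err)
```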

## Indexing, slicing & selection

NumPy arrays can be indexed, sliced and selected in much the same way as Python lists.

In [None]:
arr = np.arange(0, 24).reshape(3, 8)

print(arr[0, 0]) # Note can index multiple dimensions at once

In [None]:
print(arr[1:3, 1:6:2]) # Can also slice multiple dimensions at once

In [None]:
print(arr[slice(1, 3), slice(1, 6, 2)]) # Long form of the above

We can use a slice to assign values to part of an array.

In [None]:
arr = np.arange(0, 12).reshape(3,4)
arr[1:3, 1:3] = 0 # Uses broadcasting rules
print(arr)

In [None]:
arr[1, :2] = np.array([1, 2]) # More broadcasting
print(arr)

Watch out when slicing and changing values: a slice is a view onto the original data, not a copy!

In [None]:
old_arr = np.arange(0, 12).reshape(3,4)
new_arr = old_arr[1:3, 1:3]

new_arr[0, 0] = 100

print(old_arr)

To avoid this, use the `ndarray.copy` method.

In [None]:
old_arr = np.arange(0, 12).reshape(3,4)
new_arr = old_arr[1:3, 1:3].copy()  # or np.array(old_arr[1:3, 1:3]), which copies by default

new_arr[0, 0] = 100

print(old_arr)  # unchanged
print(new_arr)
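
If it is unclear whether two arrays share data, `np.shares_memory` can be used to check. A minimal sketch, with illustrative variable names:

```python
import numpy as np

base = np.arange(0, 12).reshape(3, 4)
view = base[1:3, 1:3]                # a slice: shares data with `base`
independent = base[1:3, 1:3].copy()  # a copy: has its own data

print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False

independent[0, 0] = 100
print(base[1, 1])  # still 5, the copy did not touch `base`
```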

## Boolean indexing

We can use boolean arrays to index/select NumPy arrays.

In [None]:
arr = np.arange(0, 12)

print(arr[arr > 5])

In [None]:
arr[arr % 3 == 0] = -1

print(arr)
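
Boolean arrays can also be combined before indexing. The sketch below uses `&` (and) and `~` (not); note the parentheses around each comparison, which are needed because `&` binds more tightly than `>` or `==`. The names `data` and `mask` are illustrative.

```python
import numpy as np

data = np.arange(0, 12)

# Elements that are greater than 2 AND even
mask = (data > 2) & (data % 2 == 0)

print(mask.dtype)              # bool
print(data[mask])              # [ 4  6  8 10]
print(data[~mask])             # everything else
print(np.count_nonzero(mask))  # 4
```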

## More linear algebra

The `numpy.linalg` submodule has more linear algebra tools
- `det()` for determinants
- `eig()` etc. for eigenvalues and eigenvectors
- `norm()` for lengths/magnitudes of vectors & matrices
- `inv()` for _direct_ matrix inverses.
- `solve()` to _directly_ solve matrix equations

In [None]:
A = np.array([[1, 1],
              [0, 1]])

In [None]:
np.linalg.det(A)

In [None]:
np.linalg.inv(A)
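
The other tools in the list can be tried on the same matrix. This is a sketch only; the right-hand side `b` is a made-up example vector.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
b = np.array([3.0, 2.0])  # hypothetical right-hand side

# Solve A x = b directly, without forming the inverse
x = np.linalg.solve(A, b)
print(x)                      # [1. 2.]
print(np.allclose(A @ x, b))  # True

# Multiplying by the inverse gives the same answer here
x_via_inverse = np.linalg.inv(A) @ b
print(np.allclose(x, x_via_inverse))  # True

# Eigenvalues and eigenvectors (as columns), and the length of b
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)
print(np.linalg.norm(b))
```

For solving equations, `solve` is usually preferred to computing `inv(A)` and multiplying, since it avoids building the full inverse.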

## Record arrays

NumPy has a `np.recarray` class which allows you to access the fields of an array (think of them like dimensions) by name.

In [None]:
arr = np.rec.array([(1, 2.5, "Hello"),
                    (2, 10.5, "World")], 
                   dtype=[('val_1', np.int32), ('val_2', np.float64), ('note', 'U10')])

print(arr)

In [None]:
print(arr['val_1'])
print(arr.val_2[1])

We're going to stop here for now, but keep the idea of having names for different bits of data in mind.

We'll revisit this idea when we talk about Pandas in the next 2 sessions today.

## Any Questions?