# 2. NumPy Basics

NumPy (Numerical Python) is the core library for scientific computing in Python. It provides a high-performance multi-dimensional array object and tools for working with these arrays. The arrays are implemented in C, with Python providing the front end, which makes operations across a whole array *considerably faster* on large datasets than the equivalent loops over Python lists.

Because the elements are stored in a typed C buffer, all elements of a NumPy array **must** share the same datatype (float, int, etc.).
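A minimal sketch of what the single-datatype rule means in practice; the variable names here are only illustrative. Mixed inputs are upcast to one common type, and values assigned into an existing array are cast to that array's type.

```python
import numpy as np

# Mixed ints and floats are upcast to a single common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# A plain Python list, by contrast, keeps each element's own type.
as_list = [1, 2, 3.5]
print([type(v).__name__ for v in as_list])   # ['int', 'int', 'float']

# Assigning a float into an integer array truncates it to fit the dtype.
ints = np.array([1, 2, 3])
ints[0] = 9.7
print(ints)   # [9 2 3]
```

The silent truncation in the last step is worth remembering: if fractional values are needed, create the array with a float dtype from the start.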

The flow of this notebook is as follows:
1. Creating an array
2. Creating zeros, ones, linspace, identity matrix...
3. Generating random numbers
4. Data types
5. Inspecting the array
6. Arithmetic operations
7. Aggregation
8. Subsetting, slicing, indexing
9. Array manipulation


We use the conventional alias **np** when importing NumPy:

In [None]:
import numpy as np

## Creating an array

In [None]:
# 1-d floats
a = np.array([6.0, -1.0, 5.0, -3.0])
a

In [None]:
# 2-d ints
b = np.array([[3.0, 2.0],[1.0, 2.0]], dtype=int)
b

## Creating zeros, ones, linspace, identity matrix...

In [None]:
c = np.zeros((4,2,3))
c

In [None]:
d = np.ones((2,2))
d

In [None]:
e = np.arange(0, 10, .5, dtype=float)
e

In [None]:
f = np.linspace(0, 5, 10)
f

In [None]:
g = np.eye(4)
g

## Generating Random Numbers

In [None]:
# uniform
h = np.random.rand(5)
h

In [None]:
# normal distribution
i = np.random.randn(10)
i

In [None]:
j = np.random.randint(4,10,(4,4))
j

## Data Types

In [None]:
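print(np.int64)
print(np.float64)
# np.bool and np.string_ were removed in recent NumPy releases;
# np.bool_ and np.bytes_ work across versions
print(np.bool_)
print(np.bytes_)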

## Inspecting the array

In [None]:
print(a.shape)
print(j.shape)
print(c.shape)

In [None]:
b.ndim

In [None]:
print(a.dtype)
print(b.dtype)

In [None]:
# cast
b.astype(np.float64)

## Arithmetic Operations

Elementwise addition, subtraction, multiplication and division!

In [None]:
a = np.arange(16).reshape(4,4)
a

In [None]:
b = np.eye(4) * 3
b

In [None]:
a + b

In [None]:
c = np.linspace(0,9,4)
c

In [None]:
# broadcasting: c is multiplied elementwise into every row of a - this is not a dot product!
a * c

In [None]:
np.sin(a)

In [None]:
np.dot(a,c)
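The difference between the broadcast product `a * c` and `np.dot(a, c)` can be checked directly. The sketch below rebuilds the same two arrays under the illustrative names `m` and `v`, so it does not overwrite the notebook's variables.

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
v = np.linspace(0, 9, 4)      # values 0, 3, 6, 9

# Broadcasting: v is stretched across every row, so column k is scaled by v[k].
elementwise = m * v
print(elementwise[1])         # row [4, 5, 6, 7] becomes 0, 15, 36, 63

# The matrix-vector product instead sums each row of the elementwise result.
product = m @ v               # same as np.dot(m, v)
print(np.allclose(product, elementwise.sum(axis=1)))   # True
```

In other words, the dot product is the broadcast product followed by a sum over the last axis.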

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
x = np.linspace(-np.pi*3,np.pi*3,100)
y = np.sin(x)
plt.plot(x,y)

In [None]:
b = y+1
c = y*2
d = -y
e = np.sin(2*x)
plt.plot(x,y,'k-',label="norm")
plt.plot(x,b,'r--',label="y+1")
plt.plot(x,c,'g--',label="y*2")
plt.plot(x,d,'b--',label="-y")
plt.plot(x,e,'x--',label="sin(2x)")
plt.legend()

## Aggregation

Aggregating values along rows or columns (for example, taking a mean or sum over one axis) is one of the most common array operations.

In [None]:
x = np.random.randn(1000)
x.mean()

In [None]:
x.std()

In [None]:
a.mean(axis=0)

In [None]:
a.mean(axis=1)
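The `axis` argument is easy to get backwards, so here is a small sketch on a 4x4 array (named `m` here to leave the notebook's `a` untouched). The `keepdims` option shown at the end is an extra that this notebook does not otherwise use.

```python
import numpy as np

m = np.arange(16).reshape(4, 4)

# axis=0 collapses the rows, leaving one value per column.
col_means = m.mean(axis=0)
print(col_means)              # column means: 6, 7, 8, 9

# axis=1 collapses the columns, leaving one value per row.
row_means = m.mean(axis=1)
print(row_means)              # row means: 1.5, 5.5, 9.5, 13.5

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts back against m, e.g. to subtract each row's mean from that row.
centred = m - m.mean(axis=1, keepdims=True)
print(centred.mean(axis=1))   # every row now averages to zero
```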

In [None]:
a.cumsum()

In [None]:
f = np.random.randn(1000)

In [None]:
np.corrcoef(x,f)

In [None]:
plt.scatter(x,f, alpha=.3)

In [None]:
# inducing correlation with a mixing matrix
g = np.random.randn(1000,2)
c = np.array([[2., 1.],[1., 2.]])
# dot the independent random values with the mixing matrix;
# the covariance of the result is c.T @ c, not c itself
h = np.dot(g,c)
print(np.corrcoef(h[:,0],h[:,1]))
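As a rough check of what the matrix `c` does to the samples, the sketch below compares the sample covariance of `h` with `c.T @ c`, which is the covariance expected when the columns of `g` are independent with unit variance. It uses a seeded `np.random.default_rng` generator and a larger sample than the cell above, purely so the numbers are repeatable; both choices are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only so the sketch is repeatable
g = rng.standard_normal((100_000, 2))    # independent, unit-variance columns
c = np.array([[2., 1.], [1., 2.]])
h = g @ c

# For h = g @ c with uncorrelated unit-variance g, Cov(h) = c.T @ c.
expected_cov = c.T @ c                   # [[5, 4], [4, 5]]
sample_cov = np.cov(h, rowvar=False)     # rowvar=False: columns are the variables
print(expected_cov)
print(sample_cov.round(2))

# Implied correlation: 4 / sqrt(5 * 5) = 0.8
corr = np.corrcoef(h[:, 0], h[:, 1])[0, 1]
print(corr)
```

So the off-diagonal value printed by `np.corrcoef` should sit near 0.8, with some scatter for the smaller sample of 1000 points.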

In [None]:
plt.scatter(h[:,0], h[:,1], alpha=.5, s=20.)

In [None]:
print("max: %f, min: %f, std: %f, sum: %f" % (h.max(), h.min(), h.std(), h.sum()))

In [None]:
s = np.random.randint(0,100,(30,))
np.sort(s)

In [None]:
# error bars
N = 100
P = 10
# normal(mean, standard deviation, size)
s = np.random.normal(1.0, 0.5, size=(N,P))
t = np.arange(N)
sm = s.mean(axis=1)
sd = s.std(axis=1)
plt.plot(t,sm,'r-')
plt.fill_between(t, sm + sd, sm - sd, color='r', alpha=.4)
plt.xlabel("t")
plt.ylabel(r"$\epsilon$")

## Subsetting, Slicing, Indexing

In [None]:
a

In [None]:
a[0,:]

In [None]:
a[:2,1]

In [None]:
# use boolean mask selection
a[(a < 7) | (a > 10)]

In [None]:
# reverse, reverse!
a[::-1]

### Copies vs Views

A common **gotcha** moment.

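When we use slice notation to look at part of an array, it produces a *view*, meaning it points to the same memory as the original array, so changes made through the slice also change the original. If we use *fancy indexing* (indexing with an array of integers or booleans), we get a *copy* instead, and changes to the result leave the original untouched.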

#### example:

In [None]:
x = np.arange(10)
print(x)

In [None]:
y = x[::2]
print(y)

In [None]:
y[3] = 100
print(y)
print(x)
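To contrast the two cases side by side, here is a short sketch using `np.shares_memory` to test whether two arrays overlap in memory. The names `view`, `copy` and `safe` are only illustrative.

```python
import numpy as np

x = np.arange(10)

view = x[::2]            # basic slice -> view onto x's memory
copy = x[[0, 2, 4]]      # fancy (integer-array) index -> independent copy

print(np.shares_memory(x, view))   # True
print(np.shares_memory(x, copy))   # False

copy[0] = -1             # does not touch x
view[0] = 100            # writes through to x[0]
print(x[0])              # 100

# To slice without aliasing the original, take an explicit copy.
safe = x[::2].copy()
safe[1] = -5
print(x[2])              # still 2
```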

## Array Manipulation

Many manipulations can be applied to both vectors and matrices; we will explore the common ones here:

In [None]:
# transpose
print("{} \n\n {}".format(a, a.T))

In [None]:
a.ravel()

In [None]:
a.reshape(2,8)

In [None]:
np.vstack((
    a.reshape(2,8),
    a.reshape(2,8)
))

In [None]:
np.hstack((a,a))

In [None]:
b = np.eye(4)

In [None]:
np.concatenate((a,b), axis=0)

In [None]:
np.concatenate((a,b), axis=1)

### Fast-Fourier Transforms (FFTs)

In [None]:
# sample spacing
T = 1. / 300.
# n points to sample
N = 150
freq = 30.

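# sample times spaced exactly T apart (endpoint=False keeps the spacing at T)
x = np.linspace(0, N*T, N, endpoint=False)
# calculate a sine wave with a frequency of 30 Hz
y = np.sin(freq * 2.0*np.pi*x)
# fourier transform
yf = np.fft.fft(y)
# frequency of each bin, from the number of points N and sample spacing T;
# keep only the non-negative half
xf = np.fft.fftfreq(N, T)[:N//2]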

fig,ax = plt.subplots(ncols=2, figsize=(14,4))

# plot
ax[0].plot(x,y)
ax[1].plot(xf, 2./N * np.abs(yf[0:int(N/2)]))
# ax[1].plot([freq,freq],[0,1])
ax[0].set_xlabel("$x$")
ax[0].set_ylabel(r"$\sin 60\pi x$")
ax[1].set_xlabel("$Hz$")
ax[1].set_ylabel("$y_f$")
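As a sanity check on the spectrum, the dominant frequency can be read back numerically rather than from the plot. This sketch uses `np.fft.rfft` and `np.fft.rfftfreq`, which are alternatives to the `np.fft.fft` call above for real-valued signals; the names `t` and `peak` are illustrative.

```python
import numpy as np

T = 1.0 / 300.0          # sample spacing (s), i.e. 300 samples per second
N = 150                  # number of samples
freq = 30.0              # signal frequency (Hz)

t = np.arange(N) * T     # sample times spaced exactly T apart
y = np.sin(2.0 * np.pi * freq * t)

# rfft returns only the non-negative frequencies of a real signal;
# rfftfreq gives the matching frequency of each bin.
yf = np.fft.rfft(y)
xf = np.fft.rfftfreq(N, d=T)

# The bin with the largest magnitude should sit at the signal frequency.
peak = xf[np.argmax(np.abs(yf))]
print(peak)              # approximately 30 Hz
```

With these settings the bins are 2 Hz apart, so the 30 Hz signal lands exactly on a bin and the peak is sharp.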


### Is it actually faster than using lists?

NumPy is considerably faster than built-in Python objects such as lists for elementwise numerical work. Let's time it.

In [None]:
our_list = list(range(10000))
np_list = np.arange(10000)

# test the list
%timeit [i**2 for i in our_list]

In [None]:
# test numpy
%timeit np_list ** 2
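The `%timeit` magic only works inside IPython or Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard-library `timeit` module. The repeat count of 200 is an arbitrary choice, and the timings depend on the machine.

```python
import timeit
import numpy as np

our_list = list(range(10000))
np_list = np.arange(10000)

# Total time for 200 runs of each approach.
t_list = timeit.timeit(lambda: [i**2 for i in our_list], number=200)
t_numpy = timeit.timeit(lambda: np_list ** 2, number=200)

print(f"list comprehension: {t_list:.4f} s")
print(f"numpy:              {t_numpy:.4f} s")
print(f"speed-up:           {t_list / t_numpy:.1f}x")

# Both approaches compute the same values.
assert [i**2 for i in our_list] == (np_list ** 2).tolist()
```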

# Tasks

One of the areas of interest in *population genetics* is the study of mutation, selection and crossover within genetic populations. We will be exploring the use of the **Fisher-Wright model** (more commonly known as the Wright-Fisher model). In this example, we will consider the change of an allele in the genetic population from state *A* (normal) to state *B* (mutant). We make some assumptions:
1. *B* has a selective advantage of $1+s$.
2. *A* mutates to *B* with forward mutation rate $\mu$.
3. *B* mutates to *A* with backward mutation rate $\nu$.
4. The population size $P$ is finite.

To evolve the population over time $t$, we start with $n_0=0$ mutants and apply a three-stage process at each time step, where $n$ is the current number of mutants:

- We calculate the proportion of mutant seeds $p_s$ as:

$$
    p_s =\frac{(1+s)n}{P+sn}
$$

- With the proportion of mutant seeds, we can estimate the proportion of mutants $p_{sm}$ produced from those mutant seeds:

$$
    p_{sm}=(1-\nu)p_s + \mu(1-p_s)
$$

- Using the total population $P$ and the proportion of mutants $p_{sm}$, we can draw the next number of mutants $n_t$ from the binomial distribution. This can be done with np.random.binomial(P, p_sm).

The simulation runs while the number of mutants $n_t<P$ (that is, until the mutants take over the whole population) and stops early if $t$ exceeds some maximum time $T_{max}$.
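The three stages above can be sketched as a single-generation update. This is only one possible starting point, not a full solution to the tasks: the helper name `generation_step` and the seeded `np.random.default_rng` generator are assumptions of this sketch, and the loop over time is left to you.

```python
import numpy as np

def generation_step(n, P, s, mu, nu, rng):
    """Advance the mutant count n by one generation (hypothetical helper)."""
    # 1. selection: proportion of mutant seeds
    p_s = (1 + s) * n / (P + s * n)
    # 2. mutation: B -> A at rate nu, A -> B at rate mu
    p_sm = (1 - nu) * p_s + mu * (1 - p_s)
    # 3. sampling: draw the next mutant count from a binomial distribution
    return rng.binomial(P, p_sm)

rng = np.random.default_rng(42)   # seeded only for repeatability
n1 = generation_step(n=0, P=500, s=0.1, mu=0.05, nu=0.05, rng=rng)
# With n=0 we have p_s=0, so p_sm equals mu and n1 is a draw from Binomial(500, 0.05).
print(n1)
```

Note that with $n=0$ the only source of mutants is the forward mutation rate $\mu$, which is why the first step typically yields around $\mu P$ mutants.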

### Task 1. 

Write a function *fisher_wright(P, s, mu, nu, Tmax)* that returns the time series of the number of mutants as a NumPy array. Run it with population size $P=500$, selective advantage $s=0.1$, forward mutation rate $\mu=0.05$, backward mutation rate $\nu=0.05$ and maximum time $T_{max}=10^4$.

### Task 2.

Plot time $t$ against the number of mutants $n$ using plt.plot. Remember to label your axes.

### Task 3.

Modify *fisher_wright()* to take an additional argument, $N_r$, the number of realisations (independent runs). Re-run the function with $N_r=1000$, take the mean over the $N_r$ realisations and plot time $t$ against mean $n$ with error bars (one standard deviation), using plt.fill_between().