# Chapter 4. NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundation packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy's array objects as the lingua franca for data exchange.

Here are some of the things you'll find in NumPy:

* ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
* Mathematical functions for fast operations on entire arrays of data without having to write loops
* Tools for reading/writing array data to disk and working with memory-mapped files
* Linear algebra, random number generation, and Fourier transform capabilities
* A C API for connecting NumPy with libraries written in C, C++, or Fortran

Because NumPy provides an easy-to-use C API, it is straightforward to pass data to external libraries written in a low-level language and for those libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and easy-to-use interface.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, much more effectively.

For most data analysis applications, the main areas of functionality we will focus on are:

* Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations.

* Common array algorithms like sorting, unique and set operations.

* Efficient descriptive statistics and aggregating/summarizing data.

* Data alignment and relational data manipulations for merging and joining together heterogeneous datasets.

* Expressing conditional logic as array expressions instead of loops with if-elif-else branches

* Group-wise data manipulations (aggregation, transformation, function application)

While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation.

One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this:

* NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms, written in C, can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.

* NumPy operations perform complex computations on entire arrays without the need for Python for loops.
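The memory claim in the list above can be checked with a rough sketch. The names `packed`, `boxed`, `array_bytes`, and `list_bytes` are illustrative and not part of the chapter, and the list figure is only an estimate because CPython shares some small integer objects:

```python
import sys
import numpy as np

n = 1000
packed = np.arange(n, dtype=np.int64)   # one contiguous buffer of 8-byte integers
boxed = list(range(n))                  # a table of pointers to separate int objects

# The array's data buffer is exactly n * itemsize bytes.
array_bytes = packed.nbytes

# Rough estimate for the list: the pointer table plus every int object it refers to.
list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)

print(packed.flags["C_CONTIGUOUS"])   # whether the elements sit side by side in memory
print(array_bytes, list_bytes)
```

The exact list size depends on the Python build, so only the array figure should be treated as precise.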

To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:

In [None]:
import numpy as np

my_arr = np.arange(10**6)

my_list = list(range(10**6))

Let's multiply each sequence by 2:

In [None]:
%time for _ in range(10): my_arr2 = my_arr * 2

In [None]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

## The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

To give you a flavor of how NumPy enables batch computations with syntax similar to that of scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:

In [None]:
import numpy as np 

data = np.random.randn(2, 3)

data

I then write mathematical operations with data:

In [None]:
data * 10

In [None]:
data + data

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.

In [None]:
data.shape

In [None]:
data.dtype

### Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [None]:
data1 = [6, 7.5, 8, 0, 1]

arr1 = np.array(data1)
arr1

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

arr2 = np.array(data2)
arr2

In [None]:
arr2.ndim

In [None]:
arr2.shape

Unless explicitly specified, np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object. For example, in the previous two examples we have:

In [None]:
arr1.dtype

In [None]:
arr2.dtype

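In addition to np.array, there are a number of other functions for creating new arrays. For example, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values, so its contents are arbitrary. To create a higher-dimensional array with these functions, pass a tuple for the shape: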

In [None]:
np.zeros(10)

In [None]:
np.zeros((3, 6))

In [None]:
np.empty((2, 3, 2))

arange is an array-valued version of the built-in Python range function:

In [None]:
np.arange(15)
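A few related creation functions are worth a quick look. This is a small sketch rather than an exhaustive list; the variable names (`template`, `zeros_same`, and so on) are made up for illustration:

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

zeros_same = np.zeros_like(template)   # zeros with template's shape and dtype
ones_same = np.ones_like(template)     # ones with template's shape and dtype
sevens = np.full((2, 2), 7.0)          # every element set to the fill value
identity = np.eye(3)                   # 3 x 3 identity matrix
stepped = np.arange(0, 10, 2)          # start, stop (exclusive), step

print(zeros_same.dtype, sevens, identity, stepped, sep="\n")
```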

### Data types for ndarrays

The data type or dtype is a special object containing the information (metadata) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

dtypes are a source of NumPy's flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.

You can explicitly convert or cast an array from one dtype to another using ndarray's astype method:

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

In this example, integers were cast to floating point. If I cast some floating-point numbers to an integer dtype, the decimal part will be truncated.
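The truncation can be seen in a short sketch. The array values and the names `floats` and `ints` are chosen here purely for illustration:

```python
import numpy as np

floats = np.array([3.7, -1.2, -2.6, 0.5, 12.9])
ints = floats.astype(np.int32)   # fractional part is dropped, not rounded

print(ints.tolist())             # [3, -1, -2, 0, 12]

# astype returns a new array, so the original floats are untouched.
print(floats.tolist())
```

Note that negative values move toward zero (-2.6 becomes -2), which differs from flooring.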

### Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this *vectorization*. Any arithmetic operation between equal-size arrays applies the operation element-wise:

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr * arr

In [None]:
arr - arr

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [None]:
1 / arr

In [None]:
arr ** 0.5

Comparisons between arrays of the same size yield boolean arrays:

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2 > arr

Evaluating operations between differently sized arrays is called broadcasting.
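As a minimal sketch of broadcasting, assuming a two-dimensional array combined with a one-dimensional array whose length matches the last axis (the names `grid`, `row`, `shifted`, and `centered` are illustrative):

```python
import numpy as np

grid = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                  # shape (3,)

shifted = grid + row            # row is added to each of the two rows of grid
col_means = grid.mean(axis=0)   # shape (3,): one mean per column
centered = grid - col_means     # every column now has mean zero

print(shifted)
print(centered)
```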

### Basic indexing and slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [None]:
arr = np.arange(10)
arr[5]

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12

In [None]:
arr

An important first distinction from Python's built-in lists is that array slices are *views* on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

To give an example of this, I first create a slice of arr:

In [None]:
arr_slice = arr[5:8]
arr_slice

In [None]:
arr_slice[1] = 12345
arr

The bare slice [:] will assign to all values in an array:

In [None]:
arr_slice[:] = 64
arr
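If a copy of a slice is wanted instead of a view, the slice can be copied explicitly with the copy method. The sketch below uses fresh names (`base`, `view`, `snapshot`) so it does not disturb arr:

```python
import numpy as np

base = np.arange(10)
view = base[5:8]              # shares memory with base
snapshot = base[5:8].copy()   # independent copy of the same elements

view[0] = -1       # visible through base
snapshot[1] = 99   # not visible through base

print(base.tolist())                      # [0, 1, 2, 3, 4, -1, 6, 7, 8, 9]
print(np.shares_memory(base, view))       # True
print(np.shares_memory(base, snapshot))   # False
```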

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

In [None]:
arr2d[0][2]

In [None]:
arr2d[0, 2]

In multidimensional arrays, if you omit later indices, the returned object will be a lower-dimensional ndarray consisting of all the data along the higher dimensions. So in the $2 \times 2 \times 3$ array arr3d, arr3d[0] is a $2 \times 3$ array:

In [None]:
arr3d = np.array([[[1, 2, 3], [1, 2, 3]] , [[7, 8, 9], [10, 11, 12]]])
arr3d

In [None]:
arr3d[0]
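Both scalar values and arrays can be assigned to such a sub-array. The following sketch uses a separate array named `cube` (an illustrative name, with values chosen for the example) to show the idea:

```python
import numpy as np

cube = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

saved = cube[0].copy()    # keep the original 2 x 3 block
cube[0] = 42              # the scalar is written to every element of the block
filled = cube[0].copy()   # record the block while it holds 42s
cube[0] = saved           # an array of matching shape restores the block

row = cube[1, 0]          # same as cube[1][0]: a 1-D array of length 3
print(filled)
print(row)
```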

### Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [None]:
arr

In [None]:
arr[1:6]

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

In [None]:
arr2d

In [None]:
arr2d[:2]

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."

You can pass multiple slices just like you can pass multiple indexes:

In [None]:
arr2d[:2, 1:]
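Integer indexes and slices can also be mixed, which yields lower-dimensional results. A brief sketch, using a stand-in array `grid2d` with the same values as arr2d (the result names are illustrative):

```python
import numpy as np

grid2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

second_row_head = grid2d[1, :2]   # integer + slice -> 1-D array
third_col_top = grid2d[:2, 2]     # slice + integer -> 1-D array
first_col = grid2d[:, :1]         # slice + slice   -> still 2-D, shape (3, 1)

print(second_row_head.tolist())   # [4, 5]
print(third_col_top.tolist())     # [3, 6]
print(first_col.shape)            # (3, 1)
```

A colon by itself takes the entire axis, which is why `grid2d[:, :1]` keeps all three rows.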

### Boolean indexing

Let's consider an example where we have some data in an array and an array of names with duplicates. I'm going to use the randn function in numpy.random here to generate some random normally distributed data:

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
names

In [None]:
data

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with the corresponding name 'Bob'. Like arithmetic operations, comparisons with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:

In [None]:
names == 'Bob'

This boolean array can be passed when indexing the array:

In [None]:
data[names == 'Bob']

The boolean array must be of the same length as the array axis it's indexing. You can even mix and match boolean arrays with slices or integers. In these examples, I select from the rows where names == 'Bob' and index the columns, too:

In [None]:
data[names == 'Bob', 2:]

In [None]:
data[names == 'Bob', 3]

To select everything but 'Bob' you can either use != or negate the condition using ~:

In [None]:
names != 'Bob'

In [None]:
data[~(names == 'Bob')]

To select two of the three names by combining multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):

In [None]:
mask = (names == 'Bob') | (names == 'Will')
mask

In [None]:
data[mask]

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.
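The copy behavior can be confirmed with a small sketch. The arrays `labels` and `values` below are stand-ins invented for the example:

```python
import numpy as np

labels = np.array(["Bob", "Joe", "Will", "Bob"])
values = np.arange(8).reshape(4, 2)

picked = values[labels == "Bob"]   # new array holding rows 0 and 3
picked[:] = -1                     # changes only the copy

print(np.shares_memory(values, picked))   # False
print(values.tolist())                    # [[0, 1], [2, 3], [4, 5], [6, 7]]
```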

Setting values with boolean arrays works in a common-sense way. To set all of the negative values in data to 0, we need only do:

In [None]:
data[data < 0] = 0
data

### Fancy indexing

*Fancy indexing* is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 × 4 array:

In [None]:
arr = np.empty((8, 4))

In [None]:
for i in range(8):
    arr[i] = i
arr

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

In [None]:
arr[[4, 3, 0, 6]]

In [None]:
arr = np.arange(32).reshape((8, 4))
arr

In [None]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

The code above does something different. It extracts a one-dimensional array of the elements corresponding to the index pairs (1, 0), (5, 3), (7, 1), and (2, 2).
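To get the rectangular block formed by a subset of rows and columns instead, two approaches are commonly used: indexing in two steps, or building the index with np.ix_. The sketch below assumes the same 8 × 4 layout; the names `table`, `rows`, `cols`, `pairs`, `block_a`, and `block_b` are illustrative:

```python
import numpy as np

table = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = table[rows, cols]             # 1-D: one element per (row, col) pair
block_a = table[rows][:, cols]        # 2-D: pick the rows, then reorder the columns
block_b = table[np.ix_(rows, cols)]   # same block via an open mesh of indexes

print(pairs.tolist())                 # [4, 23, 29, 10]
print(block_a)
```

Fancy indexing, unlike slicing, copies the data into a new array in each of these cases.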

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying. Arrays have the transpose method and also the special T attribute:

In [None]:
arr = np.arange(15).reshape((3, 5))
arr.T

When doing matrix computations, you may do this very often, for example when computing the inner matrix product using np.dot:

In [None]:
arr = np.random.randn(6, 3)
arr

In [None]:
np.dot(arr.T, arr)

For higher-dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending):

In [None]:
arr = np.arange(16).reshape((2, 2, 4))
arr

In [None]:
arr.transpose((1, 0, 2))
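For the axis swapping named in this section's title, ndarray also has a swapaxes method, which takes a pair of axis numbers and exchanges them. A minimal sketch, with `stack` and `swapped` as illustrative names:

```python
import numpy as np

stack = np.arange(16).reshape((2, 2, 4))
swapped = stack.swapaxes(1, 2)   # exchange the last two axes

print(swapped.shape)                      # (2, 4, 2)
print(np.shares_memory(stack, swapped))   # True: a view, no data is copied
```

Swapping axes 1 and 2 here is equivalent to `stack.transpose((0, 2, 1))`.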

## Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results. Many ufuncs are simple element-wise transformations, like sqrt or exp:

In [None]:
arr = np.arange(10)
arr

In [None]:
np.sqrt(arr)

In [None]:
np.exp(arr)

These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:

In [None]:
x = np.random.randn(8)
y = np.random.randn(8)
x

In [None]:
y

In [None]:
np.maximum(x, y)

Here, numpy.maximum computed the element-wise maximum of the elements in x and y.

While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod; it returns the fractional and integral parts of a floating-point array:

In [None]:
arr = np.random.randn(7) * 5
arr

In [None]:
remainder, whole_part = np.modf(arr)

In [None]:
remainder

In [None]:
whole_part

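Ufuncs accept an optional out argument that allows them to operate in-place on arrays. (Because arr contains negative values here, np.sqrt produces nan for those entries and emits a RuntimeWarning.)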

In [None]:
arr

In [None]:
np.sqrt(arr)

In [None]:
np.sqrt(arr, arr)
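The out argument can also be passed by keyword and can point at a separate preallocated array. A short sketch with non-negative inputs, using the illustrative names `vals`, `buffer`, and `result`:

```python
import numpy as np

vals = np.array([0., 1., 4., 9., 16.])
buffer = np.empty_like(vals)

result = np.sqrt(vals, out=buffer)   # writes into buffer and returns that same array
print(result is buffer)              # True

np.sqrt(vals, out=vals)              # in place: vals itself is overwritten
print(vals.tolist())                 # [0.0, 1.0, 2.0, 3.0, 4.0]
```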

## Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as *vectorization*. In general, vectorized array operations will often be one or two orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations.

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays.

In [None]:
points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
ys

In [None]:
xs

Now, evaluating the function is a matter of writing the same expression you would write with two points:

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)

In [None]:
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray);
plt.colorbar();
# Raw string so that the backslash in \sqrt is not treated as an escape sequence.
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")

### Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression *x if condition else y*. Suppose we had a boolean array and two arrays of values:

In [None]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this might look like:

In [None]:
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result

This has multiple problems. First, it will not be very fast for large arrays (because all the work is being done in interpreted Python code). Second, it will not work with multidimensional arrays. With np.where you can write this very concisely:

In [None]:
result = np.where(cond, xarr, yarr)
result
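The second and third arguments to np.where do not need to be arrays; one or both can be scalars. A small sketch on a fixed array (the names `grid`, `signs`, and `clipped` are illustrative):

```python
import numpy as np

grid = np.array([[ 1.5, -0.3,  0.0],
                 [-2.0,  4.1, -0.7]])

signs = np.where(grid > 0, 2, -2)       # both branches are scalars
clipped = np.where(grid > 0, 2, grid)   # 2 where positive, original value otherwise

print(signs.tolist())     # [[2, -2, -2], [-2, 2, -2]]
print(clipped.tolist())   # [[2.0, -0.3, 0.0], [-2.0, 2.0, -0.7]]
```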

### Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations like sum, mean and std either by calling the array instance method or using the top-level NumPy function.

Here I generate some normally distributed random data and compute some aggregate statistics:

In [None]:
arr = np.random.randn(5, 4)
arr

In [None]:
arr.mean()

In [None]:
np.mean(arr)

In [None]:
arr.sum()

Functions like mean and sum take an optional axis argument that computes the statistic over the given axis, resulting in an array with one fewer dimension:

In [None]:
arr.mean(axis=1)

In [None]:
arr.sum(axis=0)

Here, arr.mean(1) means "compute mean across the columns", whereas arr.sum(0) means "compute sum down the rows".

Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()
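In multidimensional arrays, accumulation functions like cumsum return an array of the same size, with the partial aggregates computed along the indicated axis. A sketch on a small 3 × 3 array (the names are illustrative):

```python
import numpy as np

small = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

down_rows = small.cumsum(axis=0)        # running sum down each column
across_cols = small.cumprod(axis=1)     # running product along each row

print(down_rows.tolist())     # [[0, 1, 2], [3, 5, 7], [9, 12, 15]]
print(across_cols.tolist())   # [[0, 0, 0], [3, 12, 60], [6, 42, 336]]
```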

### Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:

In [None]:
arr = np.random.randn(100)
(arr > 0).sum()

There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True:

In [None]:
bools = np.array([False, False, True, False])
bools.any()

In [None]:
bools.all()

### Sorting

Like Python's built-in list type, NumPy arrays can be sorted in-place with the sort method:

In [None]:
arr = np.random.randn(6)
arr

In [None]:
arr.sort()
arr

You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort:

In [None]:
arr = np.random.randn(5, 3)
arr

In [None]:
arr.sort(1)
arr

In [None]:
large_arr = np.random.randn(1000)
large_arr.sort()
# Quick-and-dirty 5% quantile: the value 5% of the way into the sorted array.
large_arr[int(0.05 * len(large_arr))]
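The in-place sort method has a top-level counterpart, np.sort, which returns a sorted copy and leaves its input alone. A brief sketch, with `original`, `ordered`, and `descending` as illustrative names:

```python
import numpy as np

original = np.array([3.0, -1.0, 2.0, 0.5])
ordered = np.sort(original)    # new sorted array; original is left unchanged
descending = ordered[::-1]     # reversed view gives descending order

print(original.tolist())       # [3.0, -1.0, 2.0, 0.5]
print(ordered.tolist())        # [-1.0, 0.5, 2.0, 3.0]
print(descending.tolist())     # [3.0, 2.0, 0.5, -1.0]
```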

### Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. A commonly used one is np.unique, which returns the sorted unique values in an array:

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)

In [None]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

Contrast np.unique with the pure Python alternative:

In [None]:
sorted(set(names))

Another function, np.in1d, tests membership of the values in one array in another, returning a boolean array (newer NumPy releases recommend np.isin for the same task):

In [None]:
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])
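NumPy has a few more set functions along the same lines. The sketch below shows some of them on two made-up arrays, `left` and `right`:

```python
import numpy as np

left = np.array([3, 1, 2, 3, 5])
right = np.array([5, 2, 8, 2])

common = np.intersect1d(left, right)    # sorted values present in both arrays
combined = np.union1d(left, right)      # sorted union of the two arrays
only_left = np.setdiff1d(left, right)   # values in left that are not in right
flags = np.isin(left, right)            # element-wise membership test

print(common.tolist())      # [2, 5]
print(combined.tolist())    # [1, 2, 3, 5, 8]
print(only_left.tolist())   # [1, 3]
print(flags.tolist())       # [False, False, True, False, True]
```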

## File Input and Output with Arrays

NumPy is able to save and load data to and from disk either in text or binary format. In this section I only discuss NumPy's built-in binary format, since most users will prefer pandas and other tools for loading text or tabular data.

np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy:

In [None]:
arr = np.arange(10)
np.save('some_array', arr)

If the file path does not already end in .npy, the extension will be appended. The array on disk can then be loaded with np.load:

In [None]:
np.load('some_array.npy')

You can save multiple arrays in an uncompressed archive using np.savez and passing the arrays as keyword arguments:

In [None]:
np.savez('array_archive.npz', a=arr, b=arr)

When loading an .npz file, you get back a dict-like object that loads the individual arrays lazily:

In [None]:
arch = np.load('array_archive.npz')
arch['b']

If your data compresses well, you may wish to use numpy.savez_compressed instead:

In [None]:
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)
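A round trip through a compressed archive might look like the sketch below. It writes to a temporary directory so nothing is left on disk; the file name `demo_archive.npz` and the keys `first` and `doubled` are invented for the example:

```python
import os
import tempfile
import numpy as np

payload = np.arange(10)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_archive.npz")
    np.savez_compressed(path, first=payload, doubled=payload * 2)

    # np.load on an .npz returns a dict-like object; the with block closes the file.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        restored = archive["doubled"]   # the array is read when the key is accessed

print(keys)                 # ['doubled', 'first']
print(restored.tolist())
```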

## Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and other square matrix math, is an important part of any array library. Unlike some languages like MATLAB, multiplying two two-dimensional arrays with * is an element-wise product instead of a matrix dot product. Thus, there is a function dot, both an array method and a function in the numpy namespace, for matrix multiplication:

In [None]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1., 7.], [8., 9.]])
x

In [None]:
y

In [None]:
x.dot(y)

A matrix product between a two dimensional array and a suitably sized one-dimensional array results in a one-dimensional array:

In [None]:
np.dot(x, np.ones(3))

The @ symbol also works as an infix operator that performs matrix multiplication:

In [None]:
x @ np.ones(3)

numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood via the same industry-standard linear algebra libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly Intel MKL (Math Kernel Library):

In [None]:
from numpy.linalg import inv, qr

X = np.random.randn(5, 5)

mat = X.T.dot(X)
inv(mat)

In [None]:
mat.dot(inv(mat))

In [None]:
q, r = qr(mat)
r
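Because floating-point results are only exact up to rounding, np.allclose is a convenient way to check them. This sketch uses a fixed, invertible 3 × 3 matrix `A` and right-hand side `b`, both chosen here for illustration:

```python
import numpy as np
from numpy.linalg import inv, qr, solve

A = np.array([[4., 1., 2.],
              [1., 3., 0.],
              [2., 0., 5.]])
b = np.array([1., 2., 3.])

q, r = qr(A)
identity_ok = np.allclose(A @ inv(A), np.eye(3))   # inverse holds up to rounding
qr_ok = np.allclose(q @ r, A)                      # Q times R rebuilds A
x = solve(A, b)                                    # solves A x = b without forming inv(A)

print(identity_ok, qr_ok)
print(np.allclose(A @ x, b))
```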

## Pseudorandom Number Generation

The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions. For example, you can get a 4 × 4 array of samples from the standard normal distribution using normal:

In [None]:
samples = np.random.normal(size=(4, 4))
samples

Python's built-in random module, by contrast, only samples one value at a time. As you can see from this benchmark, numpy.random is well over an order of magnitude faster for generating very large samples.

In [None]:
from random import normalvariate

N = 1000000

%timeit samples = [normalvariate(0, 1) for _ in range(N)]

In [None]:
%timeit np.random.normal(size=N)

We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the seed of the random number generator. You can change NumPy's random number generation seed using np.random.seed:

In [None]:
np.random.seed(1234)

The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from others:

In [None]:
rng = np.random.RandomState(1234)
rng.randn(10)
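Newer NumPy versions (1.17 and later) also offer a Generator interface, created with np.random.default_rng, which likewise keeps its state separate from the global one. A sketch under that assumption, with illustrative variable names:

```python
import numpy as np

rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

draws_a = rng_a.standard_normal(5)
draws_b = rng_b.standard_normal(5)     # same seed, so the same stream of values

coin = rng_a.integers(0, 2, size=10)   # integers in [0, 2), that is, 0 or 1

print(np.array_equal(draws_a, draws_b))   # True
print(coin)
```

The values drawn from a Generator differ from those of RandomState with the same seed, since the two use different underlying algorithms.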

## Example: Random Walks

The simulation of random walks provides an illustrative application of utilizing array operations. Let's first consider a simple random walk starting at 0 with steps of 1 and -1 occurring with equal probability.

Here is a pure Python way to implement a simple random walk with 1,000 steps using the built-in random module:

In [None]:
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
plt.plot(walk[:100])

You might make the observation that walk is simply the cumulative sum of the random steps and could be evaluated as an array expression. Thus, I use the np.random module to draw 1,000 coin flips at once, set these to 1 and -1, and compute the cumulative sum:

In [None]:
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()

From this we can begin to extract statistics like the minimum and maximum value along the walk's trajectory:

In [None]:
walk.min()

In [None]:
walk.max()

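A more complicated statistic is the first crossing time, the step at which the random walk reaches a particular value. Here we might want to know how long it took the random walk to get at least 10 steps away from the origin 0 in either direction. np.abs(walk) >= 10 gives a boolean array indicating where the walk has reached or exceeded 10, and argmax returns the first index of the maximum value in that boolean array (True is the maximum value):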

In [None]:
(np.abs(walk) >= 10).argmax()
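One caveat with this idiom: if the walk never reaches the level, the boolean array is all False and argmax returns 0, which looks like a crossing at the first step. A guarded version might look like the sketch below, where `first_crossing` is a hypothetical helper and `demo_walk` is a hand-written walk, neither part of the chapter's code:

```python
import numpy as np

def first_crossing(walk, level):
    """Index of the first step with |walk| >= level, or -1 if never reached (hypothetical helper)."""
    hit = np.abs(walk) >= level
    return int(hit.argmax()) if hit.any() else -1

demo_walk = np.array([1, 2, 1, 2, 3, 4, 3])

print(first_crossing(demo_walk, 3))    # 4
print(first_crossing(demo_walk, 10))   # -1 (plain argmax would report 0)
```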

### Simulating many random walks at once

If your goal was to simulate many random walks at once, let's say 5,000 of them, you can generate all of the random walks with minor modifications to the preceding code. If passed a 2-tuple, the numpy.random functions will generate a two-dimensional array of draws, and we can compute the cumulative sum across the rows to compute all 5,000 random walks in one shot:

In [None]:
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps))
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
walks

Now we can compute the maximum and minimum values obtained over all of the walks:

In [None]:
walks.max()

In [None]:
walks.min()

Out of these walks, let's compute the minimum crossing time to 30 or -30. This is slightly tricky because not all 5,000 of them reach 30. We can check this using the any method:

In [None]:
hits30 = (np.abs(walks) >= 30).any(1)
hits30

In [None]:
hits30.sum()

We can use this boolean array to select out the rows of walks that actually cross the absolute 30 level and call argmax across axis 1 to get the crossing times:

In [None]:
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times

Lastly, we compute the average minimum crossing time:

In [None]:
crossing_times.mean()