# NumPy

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.

In [1]:
import numpy as np
from numpy import random
data = np.array([1,2,3]); data

array([1, 2, 3])

In [2]:
print(data.shape); data.dtype

(3,)


dtype('int64')

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [3]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2); print(arr2)
arr2.shape

[[1 2 3 4]
 [5 6 7 8]]


(2, 4)

In [4]:
print(np.zeros((2,3)).shape); np.eye(4,2).shape

(2, 3)


(4, 2)

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In [5]:
float_arr = arr2.astype(np.float64); float_arr

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.]])
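
As a small sketch of what casting does to the values themselves, the example below converts floats to integers. The array names are illustrative and not part of the notebook above. Note that the fractional part is truncated, not rounded, and that astype returns a new array instead of modifying the original.

```python
import numpy as np

float_vals = np.array([3.7, -1.2, 0.5, 10.0])
int_vals = float_vals.astype(np.int32)  # fractional part is discarded

print(int_vals.dtype)    # int32
print(int_vals.tolist()) # [3, -1, 0, 10]

# astype builds a new array; the original keeps its dtype
print(float_vals.dtype)  # float64
```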

arange is an array-valued version of the built-in Python range function:

In [6]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
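
Like range, arange also accepts start, stop and step arguments. The sketch below uses made-up variable names and shows one difference from range: the step may be a float.

```python
import numpy as np

evens = np.arange(0, 10, 2)    # start, stop (exclusive), step
halves = np.arange(0, 2, 0.5)  # non-integer steps are allowed, unlike range

print(evens)   # [0 2 4 6 8]
print(halves)  # [0.  0.5 1.  1.5]
```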

Arrays are important because they enable you to express batch operations on data without writing any for loops. This is usually called vectorization. Any arithmetic operation between equal-size arrays applies the operation elementwise.

Arithmetic operations with scalars are as you would expect, propagating the value to each element.

In [7]:
print(arr2-arr2); arr2/2

[[0 0 0 0]
 [0 0 0 0]]


array([[ 0.5,  1. ,  1.5,  2. ],
       [ 2.5,  3. ,  3.5,  4. ]])
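
To make the "no for loops" point concrete, here is a minimal sketch that computes the same result twice, once with an explicit loop and once with an array expression. The names values, looped and vectorized are illustrative.

```python
import numpy as np

values = np.arange(6, dtype=np.float64)

# Explicit Python loop over the elements
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * values[i] + 1.0

# Same computation as a single array expression
vectorized = values * values + 1.0

print(np.array_equal(looped, vectorized))  # True
```

Both versions give identical results; the array expression is shorter and runs the loop inside NumPy's compiled code.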

## Basic indexing and slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select a subset.

In [8]:
arr = np.arange(10)
arr[5]

5

In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays.

In multidimensional arrays, if you omit later indices, the returned object will be a lower-dimensional ndarray consisting of all the data along the higher dimensions. In other words, if you index only the first dimension, NumPy returns everything along the remaining ones.

In [9]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2] # third row
arr2d[0, 2] # first row, third column
arr2d[:, :1] # All rows and the first column

array([[1],
       [4],
       [7]])
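
The effect of omitting later indices is easier to see with three dimensions. The sketch below builds a hypothetical 2 x 3 x 4 array with reshape (not used elsewhere in these notes) and checks the shape after each extra index.

```python
import numpy as np

arr3d = np.arange(24).reshape(2, 3, 4)  # 2 blocks, each with 3 rows and 4 columns

print(arr3d[0].shape)     # (3, 4): one index leaves a 2-D array
print(arr3d[0, 1].shape)  # (4,):   two indices leave a 1-D array
print(arr3d[0, 1, 2])     # 6:      three indices give a scalar
```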


Boolean indexing: the boolean array must have the same length as the axis it indexes. You can combine boolean indexing with integer indexes or slices.

In [10]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = random.randn(7, 4)
data[names == 'Bob'] # select the rows where the boolean mask names == 'Bob' is True

array([[-3.20911711, -1.38884606, -0.83156027, -0.52604923],
       [-2.01027204, -0.99008213,  0.306977  , -0.81474763]])

To select everything but 'Bob', you can either use != or negate the condition using ~:

In [11]:
data[~(names == 'Bob')]

array([[-0.58005747, -1.77111516,  0.71177701, -0.61684185],
       [ 0.23885563, -0.19405812,  0.8859226 ,  0.68956778],
       [ 1.22058672, -0.59902224, -0.75147394, -0.64215527],
       [-0.4185361 ,  0.55639793,  0.05265754,  0.59972979],
       [-1.33901997,  1.92209559,  0.01856184, -0.07996356]])
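
The following sketch shows a boolean mask combined with an integer index and with a slice. It uses a deterministic stand-in array called table (an assumption made here so the results are reproducible) in place of the random data above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
table = np.arange(28).reshape(7, 4)  # deterministic stand-in for the random data

mask = names == 'Bob'   # one boolean per row
print(table[mask, 3])   # last column of Bob's rows: [ 3 15]
print(table[mask, 2:])  # last two columns of Bob's rows

# Combine conditions with & (and) and | (or);
# the Python keywords `and` and `or` do not work on boolean arrays
bob_or_will = (names == 'Bob') | (names == 'Will')
print(table[bob_or_will].shape)  # (4, 4)
```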

In [12]:
print(data.shape)
data.T.shape # Transpose

(7, 4)


(4, 7)

A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays.

In [13]:
np.abs(data); # unary ufunc: takes one array
np.greater(data, data) # binary ufunc: elementwise data > data, which is always False

array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

In [14]:
x = random.randn(8)
y = random.randn(8)
np.maximum(x, y) # element-wise maximum

array([-0.03833925,  1.67362578,  1.11794244,  1.16783911,  0.7743553 ,
        0.87360944,  1.94866076,  1.68289127])
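
For a reproducible view of the unary and binary cases, here is a short sketch on fixed inputs. The arrays vals and other are made up for illustration.

```python
import numpy as np

vals = np.array([1.0, 4.0, 9.0, 16.0])
other = np.array([2.0, 2.0, 10.0, 1.0])

print(np.sqrt(vals))            # unary:  [1. 2. 3. 4.]
print(np.add(vals, other))      # binary: same result as vals + other
print(np.minimum(vals, other))  # binary: elementwise minimum
```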

## Vectorization

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization.

The numpy.where function is a vectorized version of the ternary expression x if condition else y. That is, it evaluates the condition for every element and returns an array holding the same values that a for loop over the elements would have produced.

A typical use of where in data analysis is to produce a new array of values based on another array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with -2. This is very easy to do with np.where:

In [15]:
arr = random.randn(4, 4)
print(np.where(arr > 0, 2, -2))
np.where(arr > 0, 2, arr) # set only positive values to 2

[[-2  2  2  2]
 [ 2  2 -2 -2]
 [-2 -2 -2 -2]
 [ 2 -2  2  2]]


array([[-0.09628097,  2.        ,  2.        ,  2.        ],
       [ 2.        ,  2.        , -0.3344366 , -0.37126261],
       [-0.64207862, -0.74524815, -0.70817645, -1.40906301],
       [ 2.        , -0.94185651,  2.        ,  2.        ]])
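
The correspondence between np.where and the ternary expression can be checked directly. In this sketch the arrays xs, ys and cond are illustrative; the list comprehension is the pure-Python loop that np.where replaces.

```python
import numpy as np

xs = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
ys = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Pure-Python version: one ternary expression per element
looped = [x if c else y for x, y, c in zip(xs, ys, cond)]

# Vectorized version: take from xs where cond is True, else from ys
picked = np.where(cond, xs, ys)

print(picked)                     # [1.1 2.2 1.3 1.4 2.5]
print(looped == picked.tolist())  # True
```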

## Mathematical functions

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as array methods. A two-dimensional array has two axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1).

In [16]:
arr = np.random.randn(5, 4) # normally-distributed data
print(arr)
arr.mean(axis = 0) # mean of each column (4 values); not displayed, since only the last expression is shown
arr.mean(axis = 1) # mean of each row (5 values)

[[-0.7933788   1.98878101  0.37703882 -0.58284269]
 [-1.51780756  0.50344228  0.35800617  0.98988589]
 [-1.54078163  0.1206137  -1.86086707  0.78009989]
 [ 1.54711692 -2.01198926 -0.95134599 -0.09529819]
 [ 0.69480458  0.40682105 -0.17722335 -1.17703781]]


array([ 0.24739958,  0.08338169, -0.62523378, -0.37787913, -0.06315888])
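
Because the random data above changes on every run, a small fixed array makes the axis argument easier to verify by hand. The name grid is illustrative.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])  # 2 rows, 3 columns

print(grid.sum())         # 21: whole array
print(grid.sum(axis=0))   # [5 7 9]: collapses the rows, one value per column
print(grid.mean(axis=1))  # [2. 5.]: collapses the columns, one value per row
print(grid.cumsum(axis=1))  # running total along each row
```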

These methods also work on boolean arrays: True is treated as 1 and False as 0, so sum counts the True values.

In [17]:
arr = random.randn(100)
(arr > 0).sum() # Number of positive values

49
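
Two further methods, any and all, are particularly useful on boolean arrays. A minimal sketch with an illustrative array:

```python
import numpy as np

bools = np.array([False, False, True, False])

print(bools.sum())  # 1: number of True values
print(bools.any())  # True: at least one value is True
print(bools.all())  # False: not every value is True
```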

In [18]:
large_arr = random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile

-1.6943747374984801
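
The quantile cell above relies on the sort method reordering the array in place. The sketch below, on a small made-up array, contrasts that with np.sort, which returns a sorted copy, and repeats the same position-based lookup.

```python
import numpy as np

samples = np.array([0.3, -1.2, 2.5, 0.0, -0.7])

sorted_copy = np.sort(samples)  # returns a new sorted array
print(samples[0])               # 0.3: original order is untouched

samples.sort()                  # sorts in place and returns None
print(samples[0])               # -1.2: the array itself is now sorted

# Same indexing trick as above: the element 20% of the way through the sorted data
print(samples[int(0.2 * len(samples))])  # -0.7
```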

NumPy has some basic set operations for one-dimensional ndarrays. Probably the most commonly used one is np.unique, which returns the sorted unique values in an array:

In [19]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)

array(['Bob', 'Joe', 'Will'], 
      dtype='<U4')
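
Beyond np.unique, a few other set routines are worth knowing. The sketch below uses np.isin, np.intersect1d and np.union1d on two illustrative integer arrays.

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])
wanted = np.array([2, 3, 6])

print(np.unique(values))               # [0 2 3 5 6]
print(np.isin(values, wanted))         # membership test, one boolean per element
print(np.intersect1d(values, wanted))  # [2 3 6]: sorted common elements
print(np.union1d(values, wanted))      # [0 2 3 5 6]: sorted union
```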

## Linear Algebra

In [20]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[1., 2.], [3, 4], [5., 6.]])
x.dot(y)

array([[ 22.,  28.],
       [ 49.,  64.]])

In [21]:
mat = x.T.dot(x)
mat

array([[ 17.,  22.,  27.],
       [ 22.,  29.,  36.],
       [ 27.,  36.,  45.]])
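
The dot method used above performs matrix multiplication, not elementwise multiplication. As a brief sketch, the same product can also be written with the @ operator; the variable name product is illustrative.

```python
import numpy as np

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[1., 2.], [3., 4.], [5., 6.]])

product = x @ y       # same as x.dot(y) or np.dot(x, y)
print(product.shape)  # (2, 2): (2, 3) times (3, 2)
print(np.array_equal(product, x.dot(y)))  # True

# x * y would raise a ValueError here: elementwise multiplication
# needs shapes that match (or broadcast), and (2, 3) vs (3, 2) do not
```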

## Random

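NumPy's random module generates whole arrays of samples in a single call, which is typically much faster than drawing values one at a time with Python's built-in random module. For example, let's generate many random walks.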

In [None]:
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size = (nwalks, nsteps) )
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis = 1) # Cumulative sum for each row, that is, each walk
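
Once the walks are stored as rows of one array, statistics over all of them are again plain array expressions. The sketch below is one possible follow-up, not part of the original cell: it rebuilds the walks with a seeded generator (np.random.default_rng, an assumption made here for reproducibility, in place of np.random.randint) and asks which walks ever get 30 steps away from the origin. The threshold of 30 and the names hits30 and crossing_times are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12345)  # seeded generator; the seed is arbitrary
nwalks, nsteps = 5000, 1000
draws = rng.integers(0, 2, size=(nwalks, nsteps))  # each draw is 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)  # one walk per row

print(walks.shape)               # (5000, 1000)
print(walks.max(), walks.min())  # extreme positions over all walks

# Which walks ever reach a distance of 30 from the origin?
hits30 = (np.abs(walks) >= 30).any(axis=1)

# For those walks, argmax returns the index of the first True,
# i.e. the first step at which the threshold is reached
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)
print(hits30.sum(), crossing_times.mean())
```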