Computers are useless. They can only give answers.
—Pablo Picasso

We will cover the following data structures:

| Object type | Meaning | Used for |
|------------|---------|-----------|
| ndarray (regular) | n-dimensional array object | Large arrays of numerical data |
| ndarray (record) | 2-dimensional array object | Tabular data organized in columns |

# Arrays of Data

## Arrays with Python Lists

A simple list can already be considered a one-dimensional array.

In [2]:
v = [0.1, 0.2, 7, 989]
type(v)

list

In [7]:
m = [v, v, v]
m[1][0]

0.1

Note: Combining objects in the way just presented generally works with reference
pointers to the original objects. This means that:

In [10]:
v[0] = 'dhairya kantawala'
m

[['dhairya kantawala', 0.2, 7, 989],
 ['dhairya kantawala', 0.2, 7, 989],
 ['dhairya kantawala', 0.2, 7, 989]]

Even though m was not modified directly, its contents change because it only holds references to v. This can be avoided by using the deepcopy() function of the copy module.

In [15]:
from copy import deepcopy
v = [1,2,3,4,5,6]
m = 3*[deepcopy(v)]

In [18]:
v[0] = 'dhairya'
m

[[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]]
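
One caveat is worth a small sketch. `3*[deepcopy(v)]` makes a single copy and repeats a reference to it three times, so the rows are independent of v but not of each other. Assuming independent rows are wanted, a list comprehension creates one copy per row (the names `rows`, `shared` and `independent` are illustrative only):

```python
from copy import deepcopy

v = [1, 2, 3]

# One deep copy, referenced three times: the rows are independent of v,
# but they are all the same list object.
m = 3 * [deepcopy(v)]
m[0][0] = 'changed'
shared = m[1][0]          # the change is visible in every row

# One deep copy per row: every row is a separate list.
rows = [deepcopy(v) for _ in range(3)]
rows[0][0] = 'changed'
independent = rows[1][0]  # the other rows are untouched

print(shared, independent)
```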

## The Python array Class

This module defines an object type which can compactly represent an array of basic
values: characters, integers, floating point numbers. Arrays are sequence types and
behave very much like lists, except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character.
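
As a small sketch of what the type code controls (the variable names `single` and `double` are illustrative), the following compares the 'f' and 'd' type codes. It assumes the usual C sizes, where 'f' is a single-precision float and 'd' a double-precision float:

```python
import array

# 'f' stores C floats (single precision), 'd' stores C doubles.
single = array.array('f', [0.222222])
double = array.array('d', [0.222222])

print(array.typecodes)                    # type codes supported by this build
print(single.typecode, single.itemsize)   # bytes per element for 'f'
print(double.typecode, double.itemsize)   # bytes per element for 'd'

# The value is rounded to single precision when stored with 'f', so reading
# it back does not return exactly 0.222222; 'd' keeps the Python float as is.
print(single[0] == 0.222222, double[0] == 0.222222)
```

This rounding is why values such as 0.222222 are displayed with extra digits once they are stored in an array of type 'f'.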

In [44]:
v = [1,2,3,4,5,6]
import array
a = array.array('f', v) #float type of array

In [45]:
a.append(0.222222)

In [46]:
a

array('f', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.2222220003604889])

In [47]:
a.extend([1010, 9382.21])

In [48]:
2*a

array('f', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.2222220003604889, 1010.0, 9382.2099609375, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.2222220003604889, 1010.0, 9382.2099609375])

Trying to append an object of a different data type than the one specified raises a
TypeError.

In [49]:
a.append('string')

TypeError: must be real number, not str

In [50]:
b = a.tolist()
b.append('dhairya')
print(b)

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.2222220003604889, 1010.0, 9382.2099609375, 'dhairya']


An advantage of the array class is that it has built-in storage and retrieval functionality.

In [51]:
f = open('array.apy', 'wb')
a.tofile(f)
f.close()

In [57]:
with open('array.apy', 'wb') as f: #writes binary data
    a.tofile(f) #this is an alternative method

In [58]:
b = array.array('f')
with open('array.apy', 'rb') as f: #reads binary data
    b.fromfile(f, 5) #reads the first 5 elements from the file
b

array('f', [1.0, 2.0, 3.0, 4.0, 5.0])

# Regular NumPy Arrays

## The Basics

numpy.ndarray is a class built with the specific goal of handling n-dimensional arrays both conveniently and efficiently, i.e., in a highly performant manner.

In [59]:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 8])
a

array([0.1, 0.2, 0.3, 8. ])

In [60]:
type(a)

numpy.ndarray

In [62]:
a = np.array(['a', 'b', 'c'])
a

array(['a', 'b', 'c'], dtype='<U1')

In [63]:
a = np.arange(2, 20, 2)
a

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

In [67]:
a = np.arange(8, dtype=np.float64)
a

array([0., 1., 2., 3., 4., 5., 6., 7.])

In [69]:
print(a[5:])
print(a[:2])

[5. 6. 7.]
[0. 1.]


In [79]:
total = 0.0 #avoid the name sum, which would shadow the built-in function
for i in range(8):
    total += i
print(total)
print(a.sum())

28.0
28.0


In [82]:
print(a.std()) #this is the standard deviation

2.29128784747792


In [83]:
a.cumsum()

array([ 0.,  1.,  3.,  6., 10., 15., 21., 28.])

## Vectorized Operations

In [92]:
print(2*a)
print(a**2)
print(a**a) #raises every element to the power of itself

[ 0.  2.  4.  6.  8. 10. 12. 14.]
[ 0.  1.  4.  9. 16. 25. 36. 49.]
[1.00000e+00 1.00000e+00 4.00000e+00 2.70000e+01 2.56000e+02 3.12500e+03
 4.66560e+04 8.23543e+05]


Universal functions are another important feature of the NumPy package. They are
“universal” in the sense that they in general operate on ndarray objects as well as on
basic Python data types. However, when applying universal functions to, say, a
Python float object, one needs to be aware of the reduced performance compared to
the same functionality found in the math module.

In [96]:
print(np.exp(a))
print(np.sqrt(a))
print(np.sqrt(2.5))

[1.00000000e+00 2.71828183e+00 7.38905610e+00 2.00855369e+01
 5.45981500e+01 1.48413159e+02 4.03428793e+02 1.09663316e+03]
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131]
1.5811388300841898


In [97]:
import math
math.sqrt(a)

TypeError: only length-1 arrays can be converted to Python scalars

In [98]:
%timeit np.sqrt(2.5)
%timeit math.sqrt(2.5)

547 ns ± 7.16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
28.2 ns ± 0.0126 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


## Multiple Dimensions

In [106]:
b = np.array([a, a*2])
print(b[1])

[ 0.  2.  4.  6.  8. 10. 12. 14.]


In [111]:
print(b[0, 2]) #selects the 1st row, 3rd column element

2.0


In [112]:
print(b[:, 1]) #selects the second column, both rows

[1. 2.]


In [115]:
b.sum()

np.float64(84.0)

In [116]:
b.sum(axis=0) #sums along the first axis (down the rows), giving one total per column

array([ 0.,  3.,  6.,  9., 12., 15., 18., 21.])

In [117]:
b.sum(axis=1) #sums along the second axis (across the columns), giving one total per row

array([28., 56.])

The array-creation functions used in the following cells, such as np.zeros() and np.ones(), accept these arguments:

shape : Either an int, a sequence of int objects, or a reference to another ndarray

dtype : A NumPy-specific data type for the elements of the ndarray object

order : The order in which to store elements in memory: C for C-like (i.e., row-wise) or F for Fortran-like (i.e., column-wise)

In [121]:
c = np.zeros((2, 3), dtype='i', order='C')
c

array([[0, 0, 0],
       [0, 0, 0]], dtype=int32)

In [123]:
c = np.ones((2, 3, 4), dtype='i', order='C')
c

array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int32)

In [126]:
d = np.zeros_like(c, dtype='f', order='C')
d

array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]], dtype=float32)

In [128]:
d = np.ones_like(c, dtype='f', order='C')
d

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]], dtype=float32)

In [130]:
e = np.empty((2, 3, 2))
e

array([[[0.0078125, 0.0078125],
        [0.0078125, 0.0078125],
        [0.0078125, 0.0078125]],

       [[0.0078125, 0.0078125],
        [0.0078125, 0.0078125],
        [0.0078125, 0.0078125]]])

In [132]:
f = np.empty_like(c)
f

array([[[1065353216, 1065353216, 1065353216, 1065353216],
        [1065353216, 1065353216, 1065353216, 1065353216],
        [1065353216, 1065353216, 1065353216, 1065353216]],

       [[1065353216, 1065353216, 1065353216, 1065353216],
        [1065353216, 1065353216, 1065353216, 1065353216],
        [1065353216, 1065353216, 1065353216, 1065353216]]], dtype=int32)

In [135]:
np.eye(5) #5x5 identity matrix

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [137]:
g = np.linspace(5, 15, 12) #start, end and number of elements
g

array([ 5.        ,  5.90909091,  6.81818182,  7.72727273,  8.63636364,
        9.54545455, 10.45454545, 11.36363636, 12.27272727, 13.18181818,
       14.09090909, 15.        ])

| dtype | Description | Example |
|-------|-------------|---------|
| ? | Boolean | ? (True or False) |
| i | Signed integer | i8 (64-bit) |
| u | Unsigned integer | u8 (64-bit) |
| f | Floating point | f8 (64-bit) |
| c | Complex floating point | c32 (256-bit) |
| m | timedelta | m (64-bit) |
| M | datetime | M (64-bit) |
| O | Object | O (pointer to object) |
| U | Unicode | U24 (24 Unicode characters) |
| V | Raw data (void) | V12 (12-byte data block) |

## Metainformation

Every ndarray object provides access to a number of useful attributes.

In [141]:
print(g.size) #total number of elements
print(g.itemsize) #number of bytes used to represent one element
print(g.ndim) #number of dimensions
print(g.shape) #size along each dimension
print(g.dtype) #dtype of the elements
print(g.nbytes) #total number of bytes used in memory

12
8
1
(12,)
float64
96


## Reshaping and Resizing

In [145]:
g = np.arange(15)
g

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [147]:
g.shape

(15,)

In [153]:
h = g.reshape(5, 3)

In [154]:
h

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [155]:
h.T #transposes

array([[ 0,  3,  6,  9, 12],
       [ 1,  4,  7, 10, 13],
       [ 2,  5,  8, 11, 14]])

In [157]:
h.transpose() #equivalent to h.T

array([[ 0,  3,  6,  9, 12],
       [ 1,  4,  7, 10, 13],
       [ 2,  5,  8, 11, 14]])

During a reshaping operation, the total number of elements in the ndarray object is
unchanged. During a resizing operation, this number changes: it either decreases
(“down-sizing”) or increases (“up-sizing”). Here are some examples of resizing.

In [159]:
np.resize(g, (3, 1))

array([[0],
       [1],
       [2]])

In [160]:
np.resize(g, (1, 5))

array([[0, 1, 2, 3, 4]])

In [165]:
np.resize(g, (3, 10))

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14,  0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14]])

In [175]:
h = np.resize(g, (3, 5))
r = np.resize(g, (3, 6))
f = np.resize(g, (1, 5))

Stacking is a special operation that allows the horizontal or vertical combination of
two ndarray objects. However, the size of the “connecting” dimension must be the
same.

In [176]:
h

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [177]:
r

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14,  0,  1,  2]])

In [178]:
f

array([[0, 1, 2, 3, 4]])

In [171]:
np.hstack((h, r)) #horizontally placing both side by side

array([[ 0,  1,  2,  3,  4,  0,  1,  2,  3,  4,  5],
       [ 5,  6,  7,  8,  9,  6,  7,  8,  9, 10, 11],
       [10, 11, 12, 13, 14, 12, 13, 14,  0,  1,  2]])

In [182]:
np.vstack((h, f)) #vertically stacking one on top of the other

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [ 0,  1,  2,  3,  4]])
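
The requirement on the “connecting” dimension can be sketched as follows. The arrays are rebuilt here so the example stands alone, and the flag `vstack_ok` is an illustrative name:

```python
import numpy as np

g = np.arange(15)
h = np.resize(g, (3, 5))   # 3 rows, 5 columns
r = np.resize(g, (3, 6))   # 3 rows, 6 columns

# Horizontal stacking joins along the columns, so only the row counts must match.
side_by_side = np.hstack((h, r))
print(side_by_side.shape)

# Vertical stacking joins along the rows, so the column counts must match;
# 5 columns versus 6 columns does not fit.
try:
    np.vstack((h, r))
    vstack_ok = True
except ValueError as err:
    vstack_ok = False
    print('vstack failed:', err)
```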

Another special operation is the flattening of a multidimensional ndarray object to a
one-dimensional one. One can choose whether the flattening happens row-by-row (C
order) or column-by-column (F order).

In [183]:
h

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [184]:
h.flatten() #by default row-by-row

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [185]:
h.flatten(order='C')

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [186]:
h.flatten(order='F')

array([ 0,  5, 10,  1,  6, 11,  2,  7, 12,  3,  8, 13,  4,  9, 14])

In [187]:
for i in h.flat: #the flat attribute provides a flat iterator (C order)
    print(i, end=', ')


0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 

In [188]:
for j in h.ravel(order='F'): #this ravel() method is an alternative to flatten()
    print(j, end=', ')

0, 5, 10, 1, 6, 11, 2, 7, 12, 3, 8, 13, 4, 9, 14, 
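
One practical difference between the two methods, sketched here with illustrative names: flatten() always returns a copy, whereas ravel() returns a view where possible. For a C-contiguous array flattened in the default order, that means writes through the result of ravel() change the original array.

```python
import numpy as np

h = np.arange(15).reshape(3, 5)

flat_copy = h.flatten()   # always a new, independent array
flat_view = h.ravel()     # a view here, because h is C-contiguous

print(np.shares_memory(h, flat_copy))
print(np.shares_memory(h, flat_view))

# Writing to the copy leaves h alone; writing through the view changes h.
flat_copy[0] = -1
flat_view[1] = 99
print(h[0, :2])
```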

## Boolean Arrays

Comparison and logical operations in general work on ndarray objects the same way,
element-wise, as on standard Python data types.

In [189]:
h

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [191]:
h > 8

array([[False, False, False, False, False],
       [False, False, False, False,  True],
       [ True,  True,  True,  True,  True]])

In [192]:
h <= 7

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True, False, False],
       [False, False, False, False, False]])

In [194]:
(h == 5).astype(int) # Present True and False as integer values 0 and 1

array([[0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [195]:
(h > 4) & (h <= 12)

array([[False, False, False, False, False],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True, False, False]])

Such Boolean arrays can be used for indexing and data selection. Notice that the following operations flatten the data.

In [196]:
h[h > 8]

array([ 9, 10, 11, 12, 13, 14])

In [197]:
h[(h > 4) & (h <= 12)]

array([ 5,  6,  7,  8,  9, 10, 11, 12])

In [199]:
h[(h < 4) | (h >= 12)]

array([ 0,  1,  2,  3, 12, 13, 14])

A powerful tool in this regard is the np.where() function, which allows the definition
of actions/operations depending on whether a condition is True or False. The result
of applying np.where() is a new ndarray object of the same shape as the original one.

In [200]:
np.where(h > 7, 1, 0)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [1, 1, 1, 1, 1]])

In [201]:
np.where(h%2 == 0, 'even', 'odd')

array([['even', 'odd', 'even', 'odd', 'even'],
       ['odd', 'even', 'odd', 'even', 'odd'],
       ['even', 'odd', 'even', 'odd', 'even']], dtype='<U4')

In [204]:
np.where(h <= 7, h, h-7)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 1, 2],
       [3, 4, 5, 6, 7]])

## Speed Comparison 

In [215]:
import random
I = 5000

%time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]

CPU times: user 10.5 s, sys: 1.48 s, total: 11.9 s
Wall time: 12.9 s


In [218]:
mat[0][:5]

[0.5919640104629647,
 0.0920696364000629,
 1.2420580712924127,
 0.30792034887759984,
 -0.26537041805866757]

In [228]:
%time np.sum([np.sum(l) for l in mat]) #pass a list; calling np.sum() on a generator is deprecated



CPU times: user 897 ms, sys: 337 ms, total: 1.23 s
Wall time: 1.42 s


np.float64(-178.50063992055544)

In [230]:
import sys
np.sum([sys.getsizeof(l) for l in mat]) #size of the inner list containers only, not of the float objects they point to

np.int64(209400000)

In [233]:
%time mat = np.random.standard_normal((I, I))

CPU times: user 464 ms, sys: 22.4 ms, total: 487 ms
Wall time: 488 ms


In [234]:
%time mat.sum()

CPU times: user 17.7 ms, sys: 36.5 ms, total: 54.2 ms
Wall time: 55.8 ms


np.float64(-6242.784867715764)

In [235]:
mat.nbytes

200000000

In [236]:
sys.getsizeof(mat)

200000128

The use of NumPy for array-based operations and algorithms generally results in compact, easily readable code and significant performance improvements over pure Python code.
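
The list-based memory figure above counts only the inner list containers. The following sketch uses a much smaller matrix and assumes CPython, where sys.getsizeof() reports per-object sizes. It adds the float objects to the estimate and compares the result with the ndarray (the names `nested`, `containers` and `with_floats` are illustrative):

```python
import random
import sys

import numpy as np

n = 100
nested = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

# Inner list containers only: each stores pointers to its elements.
containers = sum(sys.getsizeof(row) for row in nested)

# Add the float objects that those pointers refer to.
with_floats = containers + sum(sys.getsizeof(x) for row in nested for x in row)

# The ndarray stores the raw 8-byte values in one contiguous block.
arr = np.array(nested)

print(containers, with_floats, arr.nbytes)
```

Under these assumptions the nested list needs noticeably more memory than the equivalent ndarray once the float objects are included.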

# Structured NumPy Arrays

NumPy provides structured ndarray and record recarray objects that allow you to
have a different dtype per column.

In [238]:
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'), ('Height', 'f'), ('Children/Pets', 'i4', 2)])
dt

dtype([('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/Pets', '<i4', (2,))])

In [240]:
dt = np.dtype({'names': ['Name', 'Age', 'Height', 'Children/Pets'],
               'formats':'O int float int,int'.split()})
dt #alternative syntax; note that the resulting types differ (object for Name, 64-bit numbers, two named integer subfields)

dtype([('Name', 'O'), ('Age', '<i8'), ('Height', '<f8'), ('Children/Pets', [('f0', '<i8'), ('f1', '<i8')])])

In [247]:
s = np.array([('Dhairya', 20, 1.78, (0, 2)),
              ('Shivam', 21, 1.80, (0,0))], dtype=dt)

In [250]:
print(s)
print(type(s))

[('Dhairya', 20, 1.78, (0, 2)) ('Shivam', 21, 1.8 , (0, 0))]
<class 'numpy.ndarray'>


The single columns can now be easily accessed by their names and the rows by their index values.

In [251]:
s['Name']

array(['Dhairya', 'Shivam'], dtype=object)

In [255]:
print(s['Height'].mean())

1.79


In [256]:
print(s[0])

('Dhairya', 20, 1.78, (0, 2))


In [257]:
print(s[1]['Age'])

21


One advantage of structured arrays is that a single element of a column can be
another multidimensional object and does not have to conform to the basic NumPy
data types.
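
As a minimal sketch of this point, the following uses a hypothetical record layout (the field names `Name` and `Scores` and the data are made up for illustration) in which every record carries a 2x2 block of floats:

```python
import numpy as np

# Hypothetical layout: a name plus a 2x2 block of floats per record
dt = np.dtype([('Name', 'U10'), ('Scores', 'f8', (2, 2))])

s = np.array([('A', [[1.0, 2.0], [3.0, 4.0]]),
              ('B', [[5.0, 6.0], [7.0, 8.0]])], dtype=dt)

print(s['Scores'].shape)               # one 2x2 block per record
print(s[1]['Scores'][0, 1])            # single element of the second record's block
print(s['Scores'].mean(axis=(1, 2)))   # mean of each block
```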

# Vectorization of Code

Vectorization is a strategy to get more compact code that is possibly executed faster.
The fundamental idea is to conduct an operation on or to apply a function to a complex object “at once” and not by looping over the single elements of the object. In Python, functional programming tools such as map() and filter() provide some basic means for vectorization. However, NumPy has vectorization built in deep down in its core.
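
To make the contrast concrete, here is a small sketch that performs the same two tasks with map()/filter() and with NumPy expressions (all variable names are illustrative):

```python
import numpy as np

values = list(range(8))
arr = np.array(values)

# Functional style: map() applies the function element by element in Python.
linear_map = list(map(lambda x: 2 * x + 1, values))

# NumPy style: the same expression is applied to the whole array at once.
linear_np = 2 * arr + 1

# filter() versus Boolean indexing to keep the even numbers
evens_filter = list(filter(lambda x: x % 2 == 0, values))
evens_np = arr[arr % 2 == 0]

print(linear_map, linear_np)
print(evens_filter, evens_np)
```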

## Basic Vectorization

In [258]:
np.random.seed(100)
r = np.arange(12).reshape((4, 3))
s = np.arange(12).reshape((4, 3))*0.5

In [259]:
r

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [260]:
s

array([[0. , 0.5, 1. ],
       [1.5, 2. , 2.5],
       [3. , 3.5, 4. ],
       [4.5, 5. , 5.5]])

In [261]:
r + s

array([[ 0. ,  1.5,  3. ],
       [ 4.5,  6. ,  7.5],
       [ 9. , 10.5, 12. ],
       [13.5, 15. , 16.5]])

In [262]:
r + 3

array([[ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [263]:
2 * r

array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16],
       [18, 20, 22]])

In [264]:
2*r + 3

array([[ 3,  5,  7],
       [ 9, 11, 13],
       [15, 17, 19],
       [21, 23, 25]])

In [266]:
s = np.arange(0,12,4)
s

array([0, 4, 8])

In [268]:
r

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [267]:
r + s

array([[ 0,  5, 10],
       [ 3,  8, 13],
       [ 6, 11, 16],
       [ 9, 14, 19]])

In [270]:
s = np.arange(0, 12, 3)
r + s

ValueError: operands could not be broadcast together with shapes (4,3) (4,) 

In [271]:
s

array([0, 3, 6, 9])

In [273]:
r.T + s

array([[ 0,  6, 12, 18],
       [ 1,  7, 13, 19],
       [ 2,  8, 14, 20]])

In [278]:
sr = s.reshape(-1, 1) #reshape from (4,) into a (4, 1) column
sr.shape

(4, 1)

In [279]:
r + sr

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14],
       [18, 19, 20]])
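
The cells above rely on NumPy's broadcasting rules: shapes are compared from the last dimension backwards, and each pair of sizes must either be equal or contain a 1. The sketch below checks the three shape combinations used above with np.broadcast_shapes(), which is assumed to be available (it was added in NumPy 1.20); `col` is an illustrative name.

```python
import numpy as np

r = np.arange(12).reshape((4, 3))
s = np.arange(0, 12, 3)                      # shape (4,)

# Compatible: (4, 3) with (3,) and (4, 3) with (4, 1)
print(np.broadcast_shapes((4, 3), (3,)))
print(np.broadcast_shapes((4, 3), (4, 1)))

# Not compatible: the last dimensions are 3 and 4
try:
    np.broadcast_shapes((4, 3), (4,))
    compatible = True
except ValueError as err:
    compatible = False
    print('not broadcastable:', err)

# s[:, np.newaxis] is an alternative to s.reshape(-1, 1)
col = s[:, np.newaxis]
print((r + col).shape)
```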

Often, custom-defined Python functions work with ndarray objects as well. If the
implementation allows, arrays can be used with functions just as int or float objects
can.

In [280]:
def f(x):
    return 3*x + 5

In [281]:
f(0.5)

6.5

In [282]:
f(r)

array([[ 5,  8, 11],
       [14, 17, 20],
       [23, 26, 29],
       [32, 35, 38]])
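
Whether “the implementation allows” this depends on what the function calls internally. In the sketch below, the hypothetical functions `g_math` and `g_np` differ only in using math.sin() versus np.sin(): the first accepts single numbers only, the second accepts numbers and arrays alike.

```python
import math

import numpy as np

r = np.arange(12).reshape((4, 3))

def g_math(x):
    # math.sin() expects a single number, so this fails for multi-element arrays
    return math.sin(x) + 1

def g_np(x):
    # np.sin() is a universal function and works for scalars and arrays alike
    return np.sin(x) + 1

try:
    g_math(r)
    math_ok = True
except TypeError:
    math_ok = False

result = g_np(r)
print(math_ok, result.shape)
```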

On the NumPy level, looping
over the ndarray object is taken care of by optimized code, most of it written in C
and therefore generally faster than pure Python. This explains the "secret" behind
the performance benefits of using NumPy for array-based use cases.

## Memory Layout

Array-creation functions such as np.zeros() accept an optional order argument for the memory layout, as used in the section on multiple dimensions. This
argument specifies, roughly speaking, which elements of an array get stored in memory next to each other (contiguously). When working with small arrays, this has
hardly any measurable impact on the performance of array operations. However,
when arrays get large, and depending on the (financial) algorithm to be implemented
on them, the story might be different. This is when memory layout comes into play.

In [298]:
x = np.random.standard_normal((1000000, 5))
y = 2*x + 3
C = np.array((x, y), order='C')
F = np.array((x, y), order='F')

In [314]:
%timeit C.sum(axis=0)
%timeit C.sum(axis=1)
%timeit F.sum(axis=0)
%timeit F.sum(axis=1)

10.6 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
18.4 ms ± 389 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
43 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37 ms ± 13.8 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
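
The two arrays hold the same values and differ only in how those values are arranged in memory. As a sketch on a smaller array (1,000 rows instead of 1,000,000, to keep it quick), the flags and strides attributes make the difference visible:

```python
import numpy as np

x = np.random.standard_normal((1000, 5))
y = 2 * x + 3
C = np.array((x, y), order='C')   # row-wise (C-like) layout
F = np.array((x, y), order='F')   # column-wise (Fortran-like) layout

print(C.shape, F.shape)
print(C.flags['C_CONTIGUOUS'], C.flags['F_CONTIGUOUS'])
print(F.flags['C_CONTIGUOUS'], F.flags['F_CONTIGUOUS'])

# Strides: number of bytes to step in memory to reach the next element
# along each axis. In C the last axis is contiguous, in F the first one.
print(C.strides)
print(F.strides)
```

Summing along an axis whose elements lie close together in memory tends to use the cache better, which is one plausible reason for the timing differences measured above.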
