Today we'll cover:

1. [Array input and output](#Array-input-and-output)
2. [Linear algebra](#Linear-algebra)

# Array input and output

Sometimes you want to store your arrays on disk so that you can read them back later. `numpy.save` and `numpy.load` are the basic functions for doing that.

In [1]:
import numpy as np

In [2]:
array2d = np.array([[1, 2], [3, 4]]); print array2d

[[1 2]
 [3 4]]


In [3]:
np.save('my_array', array2d)  # stores the array on disk as my_array.npy (the extension .npy is added if not provided)

In [4]:
from glob import glob  # python module for finding files whose names match a pattern

In [5]:
glob('*.npy')  # find all files in the current directory with extension .npy

['my_array.npy']

In [6]:
del array2d  # del is a Python statement (not a function) that deletes the binding of a name to an object

In [7]:
print array2d  # now raises an error since we deleted the name array2d

NameError: name 'array2d' is not defined

In [8]:
array2d = np.load('my_array.npy')  # need to supply the extension .npy when using np.load

In [9]:
print array2d

[[1 2]
 [3 4]]
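
The round trip above can be checked programmatically. The following is a minimal sketch, assuming a throwaway directory created with `tempfile.mkdtemp` (the directory and the variable names `original` and `restored` are illustrative, not part of the session above). It uses `print(...)` with a single argument.

```python
import os
import tempfile

import numpy as np

# Work in a throwaway directory so that no files are left behind.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'my_array.npy')

original = np.array([[1, 2], [3, 4]])
np.save(path, original)   # the name already ends in .npy, so nothing is appended
restored = np.load(path)

# Values, shape and dtype all survive the round trip.
same_values = np.array_equal(original, restored)
same_dtype = original.dtype == restored.dtype
print(same_values and same_dtype)

os.remove(path)
os.rmdir(tmp_dir)
```

Unlike a text file, the `.npy` format stores the dtype and shape alongside the data, which is why nothing has to be specified when loading.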


You can also save multiple arrays via `numpy.savez` and later load them via `numpy.load`.

In [10]:
random_2d = np.random.random((3, 4)); print random_2d  # 12 random values (3 x 4) drawn uniformly from [0,1)

[[ 0.94675603  0.78921939  0.67859779  0.86219206]
 [ 0.07040027  0.48954732  0.54008541  0.71729253]
 [ 0.02510995  0.75441661  0.29881107  0.74281693]]


In [11]:
log_2d = np.log(random_2d); print log_2d

[[-0.05471384 -0.23671094 -0.38772669 -0.14827723]
 [-2.65355825 -0.71427416 -0.61602798 -0.33227153]
 [-3.6844911  -0.28181052 -1.20794379 -0.29730566]]


In [12]:
exp_2d = np.exp(random_2d); print exp_2d

[[ 2.57733529  2.20167709  1.97111187  2.36834656]
 [ 1.07293755  1.63157746  1.71615344  2.04887841]
 [ 1.02542786  2.12637067  1.34825487  2.10184794]]


Now let us save the last three arrays we created into a file.

In [13]:
np.savez('three_arrays', random_2d, log_2d, exp_2d)  # An .npz extension is supplied if not already given

In [14]:
loaded_arrays = np.load('three_arrays.npz')   # for .npz files, np.load returns a dictionary-like object

In [15]:
print loaded_arrays['arr_0']  # the first stored array can be accessed via the key 'arr_0'

[[ 0.94675603  0.78921939  0.67859779  0.86219206]
 [ 0.07040027  0.48954732  0.54008541  0.71729253]
 [ 0.02510995  0.75441661  0.29881107  0.74281693]]


In [16]:
print loaded_arrays['arr_1']  # the second stored array can be accessed via the key 'arr_1' (and so on ...)

[[-0.05471384 -0.23671094 -0.38772669 -0.14827723]
 [-2.65355825 -0.71427416 -0.61602798 -0.33227153]
 [-3.6844911  -0.28181052 -1.20794379 -0.29730566]]


In [17]:
print loaded_arrays['arr_2']

[[ 2.57733529  2.20167709  1.97111187  2.36834656]
 [ 1.07293755  1.63157746  1.71615344  2.04887841]
 [ 1.02542786  2.12637067  1.34825487  2.10184794]]


If you want to save the arrays under meaningful names, pass them as keyword arguments to `numpy.savez`.

In [18]:
np.savez('three_arrays', random=random_2d, logs=log_2d, exps=exp_2d)

In [19]:
loaded_arrays = np.load('three_arrays.npz')

In [20]:
for key in loaded_arrays.keys():
    print "loaded_arrays['" + key + "'] is the following:"
    print loaded_arrays[key]

loaded_arrays['exps'] is the following:
[[ 2.57733529  2.20167709  1.97111187  2.36834656]
 [ 1.07293755  1.63157746  1.71615344  2.04887841]
 [ 1.02542786  2.12637067  1.34825487  2.10184794]]
loaded_arrays['random'] is the following:
[[ 0.94675603  0.78921939  0.67859779  0.86219206]
 [ 0.07040027  0.48954732  0.54008541  0.71729253]
 [ 0.02510995  0.75441661  0.29881107  0.74281693]]
loaded_arrays['logs'] is the following:
[[-0.05471384 -0.23671094 -0.38772669 -0.14827723]
 [-2.65355825 -0.71427416 -0.61602798 -0.33227153]
 [-3.6844911  -0.28181052 -1.20794379 -0.29730566]]
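
The object returned by `np.load` for an `.npz` file also lists the stored names in its `files` attribute. Here is a small sketch of the keyword-argument workflow, assuming a temporary file and the illustrative names `values` and `logs`:

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'named_arrays.npz')

values = np.array([1.0, 2.0, 4.0])
np.savez(path, values=values, logs=np.log(values))

archive = np.load(path)
names = sorted(archive.files)     # names of the stored arrays
restored_logs = archive['logs']   # an array is read from disk when its key is accessed
archive.close()                   # release the underlying file handle

print(names)

os.remove(path)
os.rmdir(tmp_dir)
```

Sorting the names makes the order predictable; as the loop output above shows, the keys need not come back in the order in which they were saved.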


When dealing with data destinations or sources outside of Python, the ability to save and load text files is useful. The functions `numpy.savetxt` and `numpy.loadtxt` offer basic functionality for doing that.

In [21]:
two_decimals = np.arange(0, .4, .01).reshape((5, 8)); print two_decimals 

[[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07]
 [ 0.08  0.09  0.1   0.11  0.12  0.13  0.14  0.15]
 [ 0.16  0.17  0.18  0.19  0.2   0.21  0.22  0.23]
 [ 0.24  0.25  0.26  0.27  0.28  0.29  0.3   0.31]
 [ 0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39]]


In [22]:
np.savetxt('two_decimals.csv', two_decimals, delimiter=',')  # default delimiter is space

In IPython (the enhanced interactive Python environment used to create this notebook), commands starting with an exclamation mark (!) are passed on to the shell for execution and the resulting output is shown. Do not use this in Python scripts meant to be run using the standard Python interpreter! Python will flag a syntax error for lines that start with an exclamation mark.

In [23]:
!cat two_decimals.csv  # we did create a file with comma separated values but the float representation is too long

0.000000000000000000e+00,1.000000000000000021e-02,2.000000000000000042e-02,2.999999999999999889e-02,4.000000000000000083e-02,5.000000000000000278e-02,5.999999999999999778e-02,7.000000000000000666e-02
8.000000000000000167e-02,8.999999999999999667e-02,1.000000000000000056e-01,1.100000000000000006e-01,1.199999999999999956e-01,1.300000000000000044e-01,1.400000000000000133e-01,1.499999999999999944e-01
1.600000000000000033e-01,1.700000000000000122e-01,1.799999999999999933e-01,1.900000000000000022e-01,2.000000000000000111e-01,2.099999999999999922e-01,2.200000000000000011e-01,2.300000000000000100e-01
2.399999999999999911e-01,2.500000000000000000e-01,2.600000000000000089e-01,2.700000000000000178e-01,2.800000000000000266e-01,2.899999999999999800e-01,2.999999999999999889e-01,3.099999999999999978e-01
3.200000000000000067e-01,3.300000000000000155e-01,3.400000000000000244e-01,3.500000000000000333e-01,3.599999999999999867e-01,3.699999999999999956e-01,3.800000000000000044e-01,3.900000000000000133e

In [24]:
np.savetxt('two_decimals.csv', two_decimals, fmt='%.2f', delimiter=',')  # we'll now use a better format string

In [25]:
!cat two_decimals.csv

0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07
0.08,0.09,0.10,0.11,0.12,0.13,0.14,0.15
0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23
0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31
0.32,0.33,0.34,0.35,0.36,0.37,0.38,0.39
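
In a plain Python script, where the `!` syntax is not available, the file contents can be inspected with ordinary file I/O instead of `cat`. A minimal sketch, assuming a temporary directory and rebuilding the same 5 x 8 array from integers:

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'two_decimals.csv')

two_decimals = np.arange(40).reshape((5, 8)) / 100.0
np.savetxt(path, two_decimals, fmt='%.2f', delimiter=',')

# Plain-Python replacement for `!cat two_decimals.csv`.
with open(path) as csv_file:
    contents = csv_file.read()
print(contents)

os.remove(path)
os.rmdir(tmp_dir)
```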


Let us now delete the local variable `two_decimals` and then load the csv file.

In [26]:
del two_decimals

In [27]:
two_decimals = np.loadtxt('two_decimals.csv', delimiter=','); print two_decimals

[[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07]
 [ 0.08  0.09  0.1   0.11  0.12  0.13  0.14  0.15]
 [ 0.16  0.17  0.18  0.19  0.2   0.21  0.22  0.23]
 [ 0.24  0.25  0.26  0.27  0.28  0.29  0.3   0.31]
 [ 0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39]]


Let us now add a header line to `two_decimals.csv`.

In [28]:
!echo "This is a header." > header

In [29]:
!cat header two_decimals.csv > two_decimals_header.csv

In [30]:
!cat two_decimals_header.csv

This is a header.
0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07
0.08,0.09,0.10,0.11,0.12,0.13,0.14,0.15
0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23
0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31
0.32,0.33,0.34,0.35,0.36,0.37,0.38,0.39


In [31]:
two_decimals = np.loadtxt('two_decimals_header.csv', delimiter=',', skiprows=1)  # skip the first row

In [32]:
print two_decimals

[[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07]
 [ 0.08  0.09  0.1   0.11  0.12  0.13  0.14  0.15]
 [ 0.16  0.17  0.18  0.19  0.2   0.21  0.22  0.23]
 [ 0.24  0.25  0.26  0.27  0.28  0.29  0.3   0.31]
 [ 0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39]]


In [33]:
!echo "# This is a comment." >> two_decimals_header.csv

In [34]:
!cat two_decimals_header.csv

This is a header.
0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07
0.08,0.09,0.10,0.11,0.12,0.13,0.14,0.15
0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23
0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31
0.32,0.33,0.34,0.35,0.36,0.37,0.38,0.39
# This is a comment.


In [35]:
two_decimals = np.loadtxt('two_decimals_header.csv', delimiter=',', skiprows=1)  # np.loadtxt automatically skips comment lines (by default, those starting with '#')

In [36]:
print two_decimals

[[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07]
 [ 0.08  0.09  0.1   0.11  0.12  0.13  0.14  0.15]
 [ 0.16  0.17  0.18  0.19  0.2   0.21  0.22  0.23]
 [ 0.24  0.25  0.26  0.27  0.28  0.29  0.3   0.31]
 [ 0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39]]
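
The comment marker is configurable through the `comments` parameter of `np.loadtxt`. The sketch below assumes a small hypothetical file that uses `%` for comments, and combines that with `skiprows` to drop the header line:

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'with_header.csv')

with open(path, 'w') as csv_file:
    csv_file.write('This is a header.\n')      # dropped by skiprows=1
    csv_file.write('0.00,0.01\n')
    csv_file.write('% This is a comment.\n')   # dropped by comments='%'
    csv_file.write('0.02,0.03\n')

data = np.loadtxt(path, delimiter=',', skiprows=1, comments='%')
print(data)

os.remove(path)
os.rmdir(tmp_dir)
```

Only the two numeric rows remain, so `data` is a 2 x 2 array.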


# Linear algebra

We have already seen one linear algebra method, namely `numpy.dot()` (and the related ndarray method `numpy.ndarray.dot()`). For vectors, it computes inner products. For 2-d arrays, it computes matrix multiplication (we have already seen an example involving 2-d rotation matrices).

In [37]:
vec1 = np.array([1, 2])

In [38]:
vec2 = np.array([3, 4])  # inner product of vec1 and vec2 is 11

In [39]:
print np.dot(vec1, vec2)  # compute it using the function np.dot

11


In [40]:
print vec1.dot(vec2)  # compute it using the method dot of ndarray objects

11


In [41]:
my_arr = np.array([[1, 1], [2, 2]]); print my_arr

[[1 1]
 [2 2]]


In [42]:
vec1.dot(my_arr)  # dot also works for 1-d and 2-d arrays

array([5, 5])

In [43]:
my_arr.dot(vec1)  # of course, the order is important

array([3, 6])

In [44]:
vec1.dot(my_arr.T)  # same as previous

array([3, 6])
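
To see why the order matters, it can help to write the two products out as explicit sums. This is an illustrative sketch only (the names `vec` and `mat` mirror `vec1` and the first `my_arr` above): a 1-d array on the left behaves like a row vector, and on the right like a column vector.

```python
import numpy as np

vec = np.array([1, 2])
mat = np.array([[1, 1], [2, 2]])

# vec on the left acts as a row vector: entry j is sum_i vec[i] * mat[i, j].
row_times_mat = np.array([sum(vec[i] * mat[i, j] for i in range(2))
                          for j in range(2)])

# vec on the right acts as a column vector: entry i is sum_j mat[i, j] * vec[j].
mat_times_col = np.array([sum(mat[i, j] * vec[j] for j in range(2))
                          for i in range(2)])

print(np.array_equal(vec.dot(mat), row_times_mat))
print(np.array_equal(mat.dot(vec), mat_times_col))
```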

`numpy.inner` is the same as `numpy.dot` for 1-d arrays.

In [45]:
np.inner(vec1, vec2)

11

We can also compute the outer product via `numpy.outer`.

In [46]:
np.outer(vec1, vec2)

array([[3, 4],
       [6, 8]])

In [47]:
my_arr = np.column_stack((vec1, vec2)); print my_arr

[[1 3]
 [2 4]]


In [48]:
np.kron(my_arr, np.ones((2, 2)))  # 2nd array scaled by entries of first (Kronecker product)

array([[ 1.,  1.,  3.,  3.],
       [ 1.,  1.,  3.,  3.],
       [ 2.,  2.,  4.,  4.],
       [ 2.,  2.,  4.,  4.]])
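
The block structure of the Kronecker product can be verified directly: block $(i, j)$ of `np.kron(a, b)` is `a[i, j] * b`. A short sketch, with `a` and `b` standing in for the two arrays used above:

```python
import numpy as np

a = np.array([[1, 3], [2, 4]])
b = np.ones((2, 2))
k = np.kron(a, b)

# Block (i, j) of the result is a[i, j] * b.
blocks_match = all(
    np.array_equal(k[2*i:2*i + 2, 2*j:2*j + 2], a[i, j] * b)
    for i in range(2) for j in range(2)
)
print(blocks_match)
```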

In [49]:
from numpy import linalg as LA  # import the linear algebra subpackage of numpy

In [50]:
right_shift = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]]); print right_shift

[[0 0 1]
 [1 0 0]
 [0 1 0]]


In [51]:
print right_shift.dot(np.arange(3))  # [0 1 2] should become [2 0 1]

[2 0 1]


In [52]:
shift_cubed = LA.matrix_power(right_shift, 3)  # this should be identity

In [53]:
print shift_cubed

[[1 0 0]
 [0 1 0]
 [0 0 1]]


Let us look at functions for solving linear equations.

In [54]:
gauss_arr = np.random.randn(3, 3); print gauss_arr  # 3 x 3 random matrix

[[ 1.35774596  1.51740506 -2.13840576]
 [ 0.67557852  0.7203102  -0.47362006]
 [-0.53569361 -1.23349909 -0.00779148]]


In [55]:
gauss_arr_inv = LA.inv(gauss_arr); print gauss_arr_inv  # compute its inverse

[[-1.07436085  4.82614386  1.49662052]
 [ 0.47173012 -2.10585227 -1.46012365]
 [-0.81504805  1.56997634 -0.08584365]]


In [56]:
print gauss_arr_inv.dot(gauss_arr) == np.eye(3)  # these should return all True's, right?

[[False False False]
 [False False False]
 [False  True False]]


In [57]:
print np.allclose(gauss_arr_inv.dot(gauss_arr), np.eye(3))  # check for equality within some tolerance

True
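
The exact comparison fails because floating-point arithmetic introduces tiny rounding errors. The sketch below shows the same effect on scalars and spells out the tolerance test used by `np.allclose`; the specific numbers are illustrative.

```python
import numpy as np

total = 0.1 + 0.2
exact_equal = (total == 0.3)          # exact comparison is defeated by rounding
close_enough = np.isclose(total, 0.3)
print(exact_equal)
print(close_enough)

# np.allclose(a, b) tests |a - b| <= atol + rtol * |b| elementwise,
# with defaults rtol=1e-05 and atol=1e-08.
a = np.array([1.0, 2.0])
b = a + 1e-9
default_tolerance = np.allclose(a, b)
strict_tolerance = np.allclose(a, b, rtol=0, atol=1e-12)
print(default_tolerance)
print(strict_tolerance)
```

With the default tolerances the two arrays count as equal; with the much stricter absolute tolerance they do not.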


In [58]:
my_vec = np.arange(3); print my_vec

[0 1 2]


In [59]:
A, b = gauss_arr, gauss_arr.dot(my_vec)  # set up a system of linear equations

In [60]:
print LA.solve(A, b)  # and solve it

[ -1.63539139e-15   1.00000000e+00   2.00000000e+00]


In [61]:
print np.allclose(my_vec, LA.solve(A, b))  # check if solution is correct

True


In [62]:
print np.allclose(my_vec, LA.inv(A).dot(b))  # of course we could solve via inverse

True


But computing the inverse to solve a system of linear equations is not recommended: it is slower, and it is generally less accurate numerically.

In [63]:
import timeit

In [64]:
timeit.timeit('numpy.linalg.solve(A, b)',
              'import numpy; A = numpy.random.randn(100, 100); b = A.dot(numpy.arange(100))', number=10**4)

1.6778178215026855

In [65]:
timeit.timeit('numpy.linalg.inv(A).dot(b)',
              'import numpy; A = numpy.random.randn(100, 100); b = A.dot(numpy.arange(100))', number=10**4)

2.9792518615722656
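
One situation where the inverse may look tempting is solving several systems that share the same matrix. `np.linalg.solve` handles this directly when the right-hand sides are stacked as the columns of a 2-d array. A minimal sketch with a small hand-picked matrix (the values are assumptions for illustration):

```python
import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
X_true = np.array([[1.0, 0.0, 2.0],
                   [-1.0, 5.0, 0.5]])   # each column is one solution vector
B = A.dot(X_true)                        # each column is one right-hand side

X = np.linalg.solve(A, B)                # solves all three systems in one call
print(np.allclose(X, X_true))
```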

In [66]:
p = 3  # number of parameters
beta_star = 1.0 / (np.arange(p) + 1); print beta_star  # choose the parameters in a linear model

[ 1.          0.5         0.33333333]


In [67]:
n = 15  # no. of iid samples
X = np.random.randn(n, p)  # covariates are iid Gaussians (n = 15 samples, p = 3 parameters)

In [68]:
y = X.dot(beta_star) + .1*np.random.randn(n); print y  # y = X * beta_star + noise

[-1.72404544 -0.29992334  0.50871671 -1.08435593 -1.24036359 -0.45439872
 -0.66976543 -0.53414116 -0.12959675  0.39529946  1.27934105 -1.4021718
  1.15791082  0.70540931  1.02349763]


In [69]:
beta_hat = LA.lstsq(X, y)[0]; print beta_hat  # lstsq returns a 4-tuple, the first element of which is the solution

[ 1.06131117  0.46729599  0.39026741]
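
The least-squares solution satisfies the normal equations $X^T X \beta = X^T y$, which gives a way to cross-check `lstsq`. The following sketch assumes a fixed seed (chosen arbitrarily, so the numbers differ from those above) and passes `rcond=None`, which is accepted by reasonably recent NumPy versions.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, chosen arbitrarily
n, p = 15, 3
beta_star = 1.0 / (np.arange(p) + 1)
X = rng.randn(n, p)
y = X.dot(beta_star) + 0.1 * rng.randn(n)

beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(np.allclose(beta_lstsq, beta_normal))

# Equivalently, the residual is orthogonal to every column of X.
residual = y - X.dot(beta_lstsq)
print(np.allclose(X.T.dot(residual), 0, atol=1e-10))
```

For well-conditioned problems like this one the two approaches agree; forming $X^T X$ squares the condition number, so `lstsq` is usually the safer choice.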


Let us now do a QR decomposition. Recall that the QR decomposition of a matrix $A$ gives two matrices $Q$ and $R$ such that:

* $A = QR$
* $Q$ is orthogonal ($Q^T Q = I$)
* $R$ is upper triangular

In [70]:
q, r = LA.qr(gauss_arr)

In [71]:
np.allclose(q.dot(r), gauss_arr)

True

In [72]:
np.allclose(q.T.dot(q), np.eye(3))  # check q^T q = I

True

In [73]:
print r  # upper triangular

[[-1.60836814 -1.99435327  2.00153522]
 [ 0.          0.60451627  0.68715772]
 [ 0.          0.         -0.56464759]]
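
One use of the QR decomposition is solving $Ax = b$: since $Q^T Q = I$, the system reduces to the triangular system $Rx = Q^T b$. A sketch under the assumption of a small, hand-picked invertible matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
x_true = np.array([1.0, -2.0, 3.0])
b = A.dot(x_true)

q, r = np.linalg.qr(A)

# A x = b  <=>  R x = Q^T b, because Q^T Q = I.
# np.linalg.solve is a general solver; it does not exploit that r is triangular.
x = np.linalg.solve(r, q.T.dot(b))

print(np.allclose(x, x_true))
print(np.allclose(r, np.triu(r)))   # r has zeros below the diagonal
```

A dedicated triangular solver (back substitution) would be the natural choice for the last step; the general solver is used here only to keep the sketch within NumPy.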


In [74]:
evals = np.diag(np.arange(-1, 2)); print evals  # np.diag puts a 1-d array on the diagonal or extracts the diag

[[-1  0  0]
 [ 0  0  0]
 [ 0  0  1]]


In [75]:
my_arr = q.dot(evals).dot(q.T)  # my_arr = q evals q^T

In [76]:
w, v = LA.eigh(my_arr)  # eigh computes eigenvalues/vectors of symmetric (Hermitian) matrices

In [77]:
print w  # very close to [-1, 0, 1]

[ -1.00000000e+00  -1.25227943e-17   1.00000000e+00]


In [78]:
np.allclose(w, np.arange(-1, 2))

True

In [79]:
print v

[[-0.84417611  0.27489803 -0.46021492]
 [-0.42003973  0.19420059  0.88648336]
 [ 0.33306654  0.94165663 -0.04847141]]


In [80]:
print q  # has same columns as v (up to sign flips)

[[-0.84417611 -0.27489803  0.46021492]
 [-0.42003973 -0.19420059 -0.88648336]
 [ 0.33306654 -0.94165663  0.04847141]]
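
The defining property of the output of `eigh` can be checked on a matrix whose eigenvalues are known by hand. In this sketch the 2 x 2 matrix is an assumption chosen for illustration; its eigenvalues are 1 and 3.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, with eigenvalues 1 and 3

w, v = np.linalg.eigh(A)     # w is returned in ascending order

# Column k of v is the eigenvector for w[k], i.e. A v = v diag(w).
# Broadcasting v * w scales column k of v by w[k].
print(np.allclose(A.dot(v), v * w))

# The eigenvectors are orthonormal, so A = v diag(w) v^T.
print(np.allclose((v * w).dot(v.T), A))
```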


In [81]:
left2 = np.random.randn(10, 2)

In [82]:
right2 = np.random.randn(15, 2)

In [83]:
low_rank = np.outer(left2[:, 0], right2[:, 0]) + np.outer(left2[:, 1], right2[:, 1])

In [84]:
low_rank.shape  # we have a 10 x 15 rank-2 matrix

(10, 15)

In [85]:
low_rank_noisy = low_rank + .01*np.random.randn(10, 15)  # add some noise to each entry

In [86]:
LA.matrix_rank(low_rank_noisy)

10
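
The noise makes the matrix full rank in the strict numerical sense. `matrix_rank` accepts a `tol` argument, and singular values below it are treated as zero. The sketch below assumes a deterministic rank-2 matrix (hypothetical data, not the matrix above) plus seeded noise of the same scale, and a threshold of 0.5 picked by hand.

```python
import numpy as np

# A deterministic 10 x 15 rank-2 matrix built from two orthogonal pairs of vectors.
u1, u2 = np.ones(10), np.tile([1.0, -1.0], 5)
v1, v2 = np.ones(15), np.arange(15) - 7.0
low_rank = np.outer(u1, v1) + np.outer(u2, v2)

rng = np.random.RandomState(0)   # fixed seed, chosen arbitrarily
noisy = low_rank + 0.01 * rng.randn(10, 15)

rank_default = np.linalg.matrix_rank(noisy)        # counts the noise directions too
rank_tol = np.linalg.matrix_rank(noisy, tol=0.5)   # singular values below 0.5 count as zero
print(rank_default)
print(rank_tol)
```

Choosing the threshold requires some knowledge of the noise level; here the gap between the signal and the noise singular values is wide, so any value between them gives the same answer.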

In [87]:
U, s, V = LA.svd(low_rank_noisy)  # compute the SVD (Singular Value Decomposition); note that the third return value holds V^T, not V

Recall that the SVD of an $m \times n$ matrix $A$ is given by $A = U \Sigma V^T$ where:

* $U$ is an $m \times m$ orthogonal matrix
* $V$ is an $n \times n$ orthogonal matrix
* $\Sigma$ is an $m \times n$ diagonal matrix with $\min\{m, n\}$ (non-negative) singular values on its diagonal

In [88]:
print s  # we see two large singular values and 8 smaller ones

[  8.90492256e+00   7.01349502e+00   6.05692541e-02   4.28823534e-02
   3.56422249e-02   2.78391154e-02   2.65230138e-02   1.76266258e-02
   1.62795730e-02   7.25359008e-03]


Finally, just to clean up temporary files...

In [89]:
!rm my_array.npy three_arrays.npz header two_decimals.csv two_decimals_header.csv