[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/akshayrb22/playing-with-data/blob/master/data_analysis/Intro-to-Numpy/Intro-to-Numpy.ipynb)

<h1>Numpy</h1>

<h3>Introduction</h3>

The numpy package (module) is used in almost all numerical computation using Python. It provides high-performance vector, matrix, and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized (formulated with vectors and matrices), performance is very good.

To use numpy you need to import the module, using for example:

In [11]:
from numpy import *

In the numpy package, the term used for vectors, matrices, and higher-dimensional data sets is array.
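A common alternative to the star import, sketched here, is to import the package under an alias. The alias `np` is a widespread convention rather than something this notebook relies on; the point is that a star import replaces some Python built-ins such as `sum` with numpy versions, while an alias keeps the two apart.

```python
import numpy as np

# With an explicit namespace it is clear which functions come from numpy.
v = np.array([1, 2, 3, 4])

total_builtin = sum([1, 2, 3, 4])  # Python's built-in sum
total_numpy = np.sum(v)            # numpy's sum, reached through the alias

print(total_builtin, total_numpy)
```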

<h3>Creating Numpy Arrays</h3>

There are a number of ways to initialize new numpy arrays, for example:
<ul>
    <li>from a Python list or tuple</li>
    <li>using functions dedicated to generating numpy arrays, such as arange, linspace, etc.</li>
</ul>

<h3>From lists</h3>

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.

In [2]:
# a vector: the argument to the array function is a Python list
v = array([1, 2, 3, 4])

# a matrix: the argument to the array function is a nested Python list
M = array([[1, 2], [3, 4]])

type(v), type(M)

(numpy.ndarray, numpy.ndarray)
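Both objects have the same type, so the difference between a vector and a matrix lies in their attributes. The sketch below (using the `np` alias, which is an assumption and not the notebook's star import) inspects `shape`, `ndim`, and `dtype`, and shows the optional `dtype` argument of `array`.

```python
import numpy as np

v = np.array([1, 2, 3, 4])
M = np.array([[1, 2], [3, 4]])

print(v.shape, M.shape)  # (4,) (2, 2)
print(v.ndim, M.ndim)    # 1 2

# The element type is inferred from the list, or can be requested explicitly.
M_float = np.array([[1, 2], [3, 4]], dtype=float)
print(M_float.dtype)     # float64
```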

<h3>Using array-generating functions</h3>

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in numpy that generate arrays of different forms. Some of the more common are:

<h4>arange</h4>

In [5]:
# create a range

arange(0, 10, 1)  # arguments: start, stop, step

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
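Note that `arange` stops before the `stop` value, like Python's `range`. A minimal sketch of that behaviour, assuming the `np` alias, with a non-integer step and a `linspace` call for contrast:

```python
import numpy as np

a = np.arange(0, 10, 1)        # stop value 10 is excluded
b = np.arange(0, 1, 0.25)      # non-integer steps are allowed
c = np.linspace(0, 1, 5)       # linspace includes the end point by default

print(a[-1])  # 9
print(b)      # [0.   0.25 0.5  0.75]
print(c)      # [0.   0.25 0.5  0.75 1.  ]
```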

<h4>linspace and logspace</h4>

In [4]:
# using linspace, both end points ARE included
linspace(0, 10, 25)

array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])

In [5]:
logspace(0, 10, 10, base=e)

array([1.00000000e+00, 3.03773178e+00, 9.22781435e+00, 2.80316249e+01,
       8.51525577e+01, 2.58670631e+02, 7.85771994e+02, 2.38696456e+03,
       7.25095809e+03, 2.20264658e+04])
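The arguments of `logspace` are exponents, not end values: the result runs from `base**start` to `base**stop`. The following sketch (with the `np` alias as an assumption) checks this by raising the base to the output of `linspace`.

```python
import numpy as np

exponents = np.linspace(0, 10, 10)          # evenly spaced exponents
by_hand = np.e ** exponents                 # raise the base to each exponent
direct = np.logspace(0, 10, 10, base=np.e)  # same values in one call

print(np.allclose(by_hand, direct))  # True
```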

<h4>Random data</h4>

In [6]:
# uniform random numbers in [0, 1)
M = random.rand(3, 3)
M

array([[0.01374967, 0.76297072, 0.74490932],
       [0.97704595, 0.75879003, 0.85131658],
       [0.8853672 , 0.31418022, 0.24428077]])

In [7]:
# standard normal distributed random numbers
random.randn(5, 5)

array([[ 0.49590591, -0.69997314, -0.02328732,  1.06125235, -1.53267048],
       [-1.76897536, -0.01358953,  0.33104626,  0.27781036,  1.23605737],
       [-0.15643886,  0.15304074,  0.62825784,  0.30255194, -0.29063891],
       [-0.70774784,  0.36728399, -0.49434748,  0.27477914,  0.64066817],
       [-0.0126891 , -0.69457878,  1.74197063,  2.4239445 ,  0.51155678]])
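The random values shown here differ on every run. When reproducible numbers are needed, one option is a seeded generator, sketched below with `numpy.random.default_rng`. The seed 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

uniform_a = rng_a.random((3, 3))          # uniform samples in [0, 1)
uniform_b = rng_b.random((3, 3))
normal_a = rng_a.standard_normal((5, 5))  # standard normal samples

print(np.array_equal(uniform_a, uniform_b))  # True
```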

<h4>zeros and ones</h4>

In [6]:
print(zeros((3, 3)))
ones((3, 3))

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

<h3>Manipulating arrays</h3>

<h4>Indexing</h4>

We can index elements in an array using square brackets and indices:

In [9]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [10]:
# M is a matrix, or a 2 dimensional array, taking two indices
M[1, 1]

0.7587900260161953

In [11]:
M[0]

array([0.01374967, 0.76297072, 0.74490932])

In [12]:
M[1, :]  # row 1

array([0.97704595, 0.75879003, 0.85131658])

In [13]:
M[:, 1]  # column 1

array([0.76297072, 0.75879003, 0.31418022])
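Because `M` above holds random values, the indexing rules are easier to verify on a small fixed matrix. This sketch (the `np` alias and the example values are assumptions) shows that a single index selects a whole row, exactly like `M[1, :]`.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

element = M[1, 1]   # row 1, column 1
row = M[1]          # a single index selects an entire row
same_row = M[1, :]  # explicit form of the same selection
column = M[:, 1]    # every row, column 1

print(element, row, column)  # 5 [4 5 6] [2 5 8]
```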

<h4>Index slicing</h4>

Index slicing is the technical name for the syntax M[lower:upper:step], which extracts part of an array:

In [14]:
A = array([1, 2, 3, 4, 5])
A

array([1, 2, 3, 4, 5])

In [15]:
A[1:3]

array([2, 3])


Array slices are mutable: if they are assigned a new value, the original array from which the slice was extracted is modified:

In [16]:
A[1:3] = [-2, -3]

A

array([ 1, -2, -3,  4,  5])
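The same holds when a slice is stored in a variable: the variable is a view onto the original data, not a separate array. A minimal sketch, assuming the `np` alias, that also shows `copy` as the way to get independent data:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

view = A[1:3]      # a view: shares memory with A
view[0] = 100      # writes through to A

independent = A[1:3].copy()  # an explicit copy owns its own data
independent[1] = -1          # A is not affected

print(A)            # [  1 100   3   4   5]
print(independent)  # [100  -1]
```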

We can omit any of the three parameters in `M[lower : upper : step]`:

In [17]:
A[::]  # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [18]:
A[::2]  # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [19]:
A[:3]  # first three elements

array([ 1, -2, -3])

In [20]:
A[3:]  # elements from index 3

array([4, 5])

In [21]:
A[-1]  # the last element in the array

5

In [22]:
A[-3:]  # the last three elements

array([-3,  4,  5])

In [23]:
A = array([[n + m * 10 for n in range(5)] for m in range(5)])

A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [24]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [25]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])
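The step may also be negative, which walks the axis backwards, and negative bounds count from the end along each axis. A short sketch under the same `np` alias assumption:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

reversed_rows = A[::-1]          # rows in reverse order
last_two_rows = A[-2:, :]        # the final two rows, all columns
reversed_vector = np.arange(5)[::-1]

print(reversed_rows[0])   # [40 41 42 43 44]
print(reversed_vector)    # [4 3 2 1 0]
```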

<h2>Linear algebra</h2>

Vectorizing code is the key to writing efficient numerical calculations with Python/Numpy. That means that as much of a program as possible should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.

<h3>Scalar-array operations</h3>

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays by scalar numbers. The same operators applied to two arrays of equal shape act element by element.

In [26]:
v1 = arange(0, 5)

v1 * 2  # evaluated, but a notebook cell displays only its last expression
v1 + 2

array([2, 3, 4, 5, 6])

In [27]:
v1 * v1

array([ 0,  1,  4,  9, 16])
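Note that `*` between two arrays multiplies element by element; it is not an inner product. The sketch below (using the `np` alias as an assumption) puts the scalar, element-wise, and inner-product results side by side.

```python
import numpy as np

v1 = np.arange(0, 5)

doubled = v1 * 2        # scalar applied to every element
shifted = v1 + 2
squared = v1 * v1       # element-wise product of two arrays
inner = np.dot(v1, v1)  # inner product: sum of the element-wise products

print(doubled)  # [0 2 4 6 8]
print(squared)  # [ 0  1  4  9 16]
print(inner)    # 30
```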

<h3>Matrix algebra</h3>

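What about matrix multiplication? There are two ways. The first is the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments; the second, shown after it, is to cast the array to the matrix type, for which the * operator performs matrix multiplication: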

In [28]:
A = array([[n + m * 10 for n in range(5)] for m in range(5)])

dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [29]:
dot(A, v1)

array([ 30, 130, 230, 330, 430])

In [30]:
M = matrix(A)
M * M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])
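A third option, not used in this notebook, is the `@` operator available since Python 3.5, which performs matrix multiplication on plain arrays. The numpy documentation no longer recommends the matrix class, so this sketch (with the `np` alias as an assumption) stays with ordinary arrays.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

product = A @ A         # matrix-matrix multiplication
matrix_vector = A @ v1  # matrix-vector multiplication

print(np.array_equal(product, np.dot(A, A)))  # True
print(matrix_vector)                          # [ 30 130 230 330 430]
```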

<h3>Matrix Computation</h3>

<h4>Inverse</h4>

In [31]:
C = matrix([[1, 2], [3, 4]])

In [32]:
linalg.inv(C)  # equivalent to C.I

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

In [33]:
C.I * C

matrix([[1.00000000e+00, 0.00000000e+00],
        [2.22044605e-16, 1.00000000e+00]])
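The off-diagonal entry `2.22044605e-16` is floating-point rounding error, not a sign that the inverse is wrong. One way to check the result, sketched here with plain arrays and the `np` alias (both assumptions), is to compare against the identity matrix with a tolerance rather than exactly.

```python
import numpy as np

C = np.array([[1.0, 2.0], [3.0, 4.0]])
C_inv = np.linalg.inv(C)

product = C_inv @ C

# allclose compares within a small tolerance, which absorbs rounding error.
is_identity = np.allclose(product, np.eye(2))
print(is_identity)  # True
```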

<h4>Determinant</h4>

In [34]:
linalg.det(C)

-2.0000000000000004

In [35]:
linalg.det(C.I)

-0.49999999999999967
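The two results are related: the determinant of an inverse is the reciprocal of the determinant, so -2 becomes -0.5 up to rounding. A small numerical check, assuming the `np` alias and a plain array in place of the matrix type:

```python
import numpy as np

C = np.array([[1.0, 2.0], [3.0, 4.0]])

det_C = np.linalg.det(C)
det_inv = np.linalg.det(np.linalg.inv(C))

# Compare with a tolerance because both values carry rounding error.
print(np.isclose(det_inv, 1.0 / det_C))  # True
```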

<h3>Data Processing</h3>

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.

In [14]:
B = array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])

In [15]:
shape(B)

(2, 5)

<h4>mean</h4>

In [38]:
mean(B)

3.5

<h4>standard deviations and variance</h4>

In [39]:
std(B), var(B)

(1.5, 2.25)

<h4>min and max</h4>

In [40]:
# the method form works whether or not the star import replaced Python's built-in min and max
B.min(), B.max()

(1, 6)
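The functions above reduce over the whole array. For a dataset stored as rows and columns, the `axis` argument restricts the reduction to one dimension. A minimal sketch, assuming the `np` alias:

```python
import numpy as np

B = np.array([[1, 2, 3, 4, 5],
              [2, 3, 4, 5, 6]])

column_means = B.mean(axis=0)  # collapse the rows: one value per column
row_means = B.mean(axis=1)     # collapse the columns: one value per row
row_max = B.max(axis=1)

print(column_means)  # [1.5 2.5 3.5 4.5 5.5]
print(row_means)     # [3. 4.]
print(row_max)       # [5 6]
```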

<h3>Reshaping and Resizing</h3>

The shape of a Numpy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.

In [41]:
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [42]:
n, m = A.shape

In [43]:
B = A.reshape((1, n * m))
B

array([[ 0,  1,  2,  3,  4, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30,
        31, 32, 33, 34, 40, 41, 42, 43, 44]])

In [44]:
B[0, 0:5] = 5  # modify the array

B

array([[ 5,  5,  5,  5,  5, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30,
        31, 32, 33, 34, 40, 41, 42, 43, 44]])

In [45]:
A  # and the original variable is also changed. B is only a different view of the same data

array([[ 5,  5,  5,  5,  5],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])
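Whether two arrays refer to the same data can also be checked directly instead of by modifying one and inspecting the other. The sketch below uses `np.shares_memory` for this, with the `np` alias as an assumption, and also shows `-1` as a placeholder that lets `reshape` infer a dimension.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
n, m = A.shape

B = A.reshape((1, n * m))
shares = np.shares_memory(A, B)  # True: B is a view of A's data

flat = A.reshape(-1)  # -1 asks numpy to work out the length itself

print(shares, flat.shape)  # True (25,)
```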

We can also use the function flatten to turn a higher-dimensional array into a vector. Unlike reshape, this function creates a copy of the data.

In [46]:
B = A.flatten()

B

array([ 5,  5,  5,  5,  5, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30, 31,
       32, 33, 34, 40, 41, 42, 43, 44])

In [47]:
B[0:5] = 10

B

array([10, 10, 10, 10, 10, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30, 31,
       32, 33, 34, 40, 41, 42, 43, 44])

In [48]:
A  # now A has not changed, because B's data is a copy of A's, not referring to the same data

array([[ 5,  5,  5,  5,  5],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])
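When a copy is not wanted, `ravel` is a related method that returns a view whenever the memory layout allows it and falls back to a copy otherwise. A short comparison, with the `np` alias and the small example array as assumptions:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

copied = A.flatten()  # always a copy
viewed = A.ravel()    # a view here, because A is stored contiguously

viewed[0] = 99        # writes through to A
copied[1] = -1        # leaves A untouched

print(np.shares_memory(A, copied))  # False
print(np.shares_memory(A, viewed))  # True
print(A[0])                         # [99  1  2]
```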