### Import Numpy

In [6]:
import numpy

### If Numpy is not installed

In [8]:
!python3 -m pip install numpy



In [9]:
import numpy as np

### Why use NumPy?

* To run mathematical operations on every element of an array at once
* To benefit from the efficiency of C (NumPy's core routines are implemented in C)

In [10]:
a = [1, 2, 3, 4]
print(a * 2)

[1, 2, 3, 4, 1, 2, 3, 4]


In [11]:
b = np.array([1, 2, 3, 4])
print(b * 2)

[2 4 6 8]


### Why is Python (and the native Python list) slow?

Python is a dynamically typed language: the type of a variable depends on the value it currently holds, so the interpreter has to work out the type at run time. Languages like C are statically typed; the type is fixed at compile time, so nothing has to be figured out while the program runs.

Another reason Python is not that fast is that everything in Python is an object. Take a list as an example: it stores references to objects rather than the values themselves, so reading any value means following a reference every time.

NumPy avoids much of this overhead: it stores values of a single type in one contiguous block of memory and runs the element-wise loops in compiled C code.
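The difference in storage can be made visible with a small sketch. The variable names are illustrative, and the exact byte counts for the list are specific to CPython and the platform, so treat the printed numbers as indicative only.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list holds references; every int is a separate Python object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array holds the raw values side by side, all of one type.
print(list_bytes)                          # platform-dependent, much larger
print(arr.dtype, arr.itemsize, arr.nbytes)  # 8 bytes per element
```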

### Create a NumPy array

In [13]:
a = np.arange(1, 10)
a = np.arange(1, 10, 2)

### Reshape the array

In [19]:
b = np.arange(1, 13)
b.reshape(3, 4)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

### Create 2D Numpy array

In [21]:
d = np.array([[1, 2], [3, 4]])
d

array([[1, 2],
       [3, 4]])

### Numpy arrays have a specific type.

In [22]:
np.array([-1, 0, 1.0, 100])

array([ -1.,   0.,   1., 100.])

### Declaring the array type

In [25]:
a = np.array([-1, 0, 1.0, 100], dtype="int8")
print(a*2)
b = a.astype("float32")
print(b*2)

[ -2   0   2 -56]
[ -2.   0.   2. 200.]
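The `-56` in the first output line is an integer overflow: `int8` can only hold values from -128 to 127, so 200 wraps around to 200 - 256 = -56. A minimal sketch of the same effect, with illustrative variable names:

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)         # limits of the int8 type

a = np.array([100], dtype=np.int8)
wrapped = a * 2                   # result stays int8 and wraps around
widened = a.astype(np.int16) * 2  # a wider type holds the true value

print(wrapped, widened)
```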


### Indexing and Slicing

In [28]:
a = np.arange(12)
b = a[2:7]
b[0] = 100
print(a)
print(b)
print(a[2:7])
print(a[:7])
print(a[:])
print(a[::2])
print()

[  0   1 100   3   4   5   6   7   8   9  10  11]
[100   3   4   5   6]
[100   3   4   5   6]
[  0   1 100   3   4   5   6]
[  0   1 100   3   4   5   6   7   8   9  10  11]
[  0 100   4   6   8  10]



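NumPy arrays have a fixed size, much like arrays in C, so you cannot insert or append elements in place. Also, unlike slicing a Python list, slicing a NumPy array does not copy anything: the slice shares memory with the original, which is why changing `b[0]` above also changed `a[2]`.

### Indexing and Slicing on a 2D array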
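A short sketch of both points, using illustrative names: `np.append` exists, but it returns a new array rather than growing the original, and an array slice is a view whereas a list slice is a copy.

```python
import numpy as np

a = np.arange(5)
grown = np.append(a, 99)   # new array; a itself is unchanged
print(a.size, grown.size)

lst = [0, 1, 2, 3, 4]
lst_slice = lst[1:3]       # list slicing copies
lst_slice[0] = 100
print(lst)                 # original list is untouched

arr_slice = a[1:3]         # array slicing creates a view
arr_slice[0] = 100
print(a)                   # a[1] has changed as well
print(np.shares_memory(a, arr_slice))
```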

In [30]:
b = np.arange(12).reshape(3, 4)
print(b[0, 1:3])
print(b[:, ::2])

[1 2]
[[ 0  2]
 [ 4  6]
 [ 8 10]]


If we select just a single column, it is returned as a 1-D vector. A vector has no vertical or horizontal orientation, so there is nothing to worry about when it is printed as a row.

 1-D array - Vector \
 2-D array - Matrix \
 N-D array - Tensor
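The shapes make this concrete. In the sketch below (names are illustrative), indexing with an integer drops the dimension, while a one-element slice keeps it.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

col = b[:, 1]       # integer index: the column comes back as a 1-D vector
col_2d = b[:, 1:2]  # slice of width 1: the result stays 2-D

print(col, col.shape)
print(col_2d.shape)
```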

### 3-dimensional array in a NumPy array

In [32]:
c = np.arange(24).reshape(2, 3, 4) # the first dimension acts as the depth: 2 matrices of shape (3, 4)
print(c[0,:, :]) # prints the entire first matrix
print(c[0]) # does the same job as above

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


### Vectorization (Broadcasting)

In [36]:
a = np.arange(1, 6)
print(a * 2) 

[ 2  4  6  8 10]


This behaviour is called vectorisation, and NumPy is called a vectorised library. Now consider applying a comparison operator, such as `a > 2`, to a NumPy array.

Some of you might have thought that a comparison acts as a filter. Instead, the condition is checked against every element and the result is an array of type "bool" with the same shape. In other words, the comparison is vectorised as well, which makes it equivalent to the mapping that we just studied.

Another name for this is broadcasting: the scalar operand is broadcast to each and every element of the array.
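A minimal sketch of a vectorised comparison, with illustrative names. The last lines go one step further than the text and show broadcasting between two arrays of different shapes.

```python
import numpy as np

a = np.arange(1, 6)

mask = a > 2                                  # element-wise comparison
mapped = [x > 2 for x in [1, 2, 3, 4, 5]]     # the equivalent mapping on a list
print(mask, mask.dtype)

# Broadcasting also stretches arrays against each other:
column = np.arange(3).reshape(3, 1)
row = np.arange(4)
table = column + row                          # result has shape (3, 4)
print(table.shape)
```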

### Dot operation on NumPy arrays

In [40]:
a = np.arange(12) # notice that this is a 1-D array
b = a.reshape(1, 12) # notice that it is now a 2-D array
c = a.reshape(12, 1)

b.dot(c)

array([[506]])

`a` is a vector, whereas `b` and `c` are 2-D matrices.

In [47]:
a.dot(np.arange(12,24))

1298

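Calling `dot` on two vectors is different from calling it on two matrices: two vectors give their inner product as a scalar, whereas two matrices give their matrix product as another matrix.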
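The result shapes show the difference. This is a small sketch with illustrative names, reusing the arrays defined in the cell above.

```python
import numpy as np

a = np.arange(12)       # 1-D vector
b = a.reshape(1, 12)    # row matrix
c = a.reshape(12, 1)    # column matrix

inner = a.dot(a)        # vector . vector -> scalar
row_col = b.dot(c)      # (1, 12) x (12, 1) -> (1, 1) matrix
col_row = c.dot(b)      # (12, 1) x (1, 12) -> (12, 12) matrix

print(inner, row_col.shape, col_row.shape)
```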

### Shape and Transpose

In [51]:
print(a.shape)
print(b.shape)

(12,)
(1, 12)


Notice that `a` has just one dimension, so its shape is a one-element tuple, `(12,)`, with nothing after the comma.

### Transpose of a Matrix

In [55]:
print(a.T.shape, a.shape)

(12,) (12,)


Notice that the transpose of a vector is the vector itself. This won't be the case for `b`, as `b` is a matrix.

In [59]:
b.shape, b.T.shape

((1, 12), (12, 1))
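Since transposing a 1-D array changes nothing, a column vector has to be created by adding a second dimension explicitly. Two common ways are sketched below; the variable names are illustrative.

```python
import numpy as np

a = np.arange(12)

col = a.reshape(-1, 1)      # 12 rows, 1 column
col_alt = a[:, np.newaxis]  # same result via a new axis

print(col.shape, col_alt.shape, col.T.shape)
```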

In [57]:
A = np.arange(12).reshape(3, 4)
print(A)
print(A.T)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]]


### Masking

Earlier we saw that vectorisation is the same as mapping over a Python list. Now we will see that masking in NumPy is similar to filtering.

In [63]:
a = np.arange(30)
print(a > 15) # this is similar to mapping
print(a[a > 15]) # this is filtering

[False False False False False False False False False False False False
 False False False False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
[16 17 18 19 20 21 22 23 24 25 26 27 28 29]


Note that this can't be done with Python lists: indexing a list with a list of booleans raises a `TypeError`.

#### Masking on 2-D arrays.

In [74]:
b = np.arange(30).reshape(5, 6)
print(b[b > 15])

[16 17 18 19 20 21 22 23 24 25 26 27 28 29]


Notice that the output is a vector. There is no placeholder to put in the positions whose values were filtered out, so the original shape cannot be preserved.
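If the shape does need to be preserved, one option is to supply the placeholder yourself. The sketch below uses `np.where` with 0 as the fill value; the choice of 0 is an assumption made for illustration.

```python
import numpy as np

b = np.arange(30).reshape(5, 6)

# Keep values greater than 15, put 0 everywhere else.
kept = np.where(b > 15, b, 0)

print(kept.shape)
print(kept)
```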

#### Masking with multiple conditions

In [None]:
print(b[(b > 15) | (b <= 15)])

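Notice that `|` and `&` are used instead of `or` and `and`, because the Python keywords don't work element-wise on arrays. The parentheses around each condition are required, since `|` and `&` bind more tightly than the comparisons. Although no value is filtered out in this case, the result is still a vector.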
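A sketch of a mask that actually filters something, along with what happens when `and` is used by mistake. The names are illustrative.

```python
import numpy as np

b = np.arange(30).reshape(5, 6)

both = b[(b > 15) & (b % 2 == 0)]  # greater than 15 AND even
not_big = b[~(b > 15)]             # ~ negates a mask

and_error = None
try:
    b[(b > 15) and (b % 2 == 0)]   # the keyword needs a single True/False
except ValueError as err:
    and_error = str(err)

print(both)
print(and_error)
```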

### How are NumPy arrays stored in memory?

A variable doesn't point to the data directly but to a header (holding the shape, dtype and strides), which in turn points to the data. When you reshape the array and store the result in a new variable `b`, NumPy, where possible, just creates a new header with the new shape; it doesn't copy the data to a new memory location.

The downside is that **if we update any value, the change is reflected in the other variables as well**.

In [78]:
a = np.arange(30)
print(a.flags) # OWNDATA tells whether this array owns its data buffer
b = a.reshape(5, 6)
print(b.flags)
c = a[:15]
print(c.flags)

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False



We notice that for both slicing and reshape, OWNDATA is False, which means the memory locations containing the actual data are shared. Let's try changing a value in `a` and see what happens.

In [80]:
a[0] = 100
print(b)
print(c)

[[100   1   2   3   4   5]
 [  6   7   8   9  10  11]
 [ 12  13  14  15  16  17]
 [ 18  19  20  21  22  23]
 [ 24  25  26  27  28  29]]
[100   1   2   3   4   5   6   7   8   9  10  11  12  13  14]


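**But masking doesn't share memory like this.** The selected elements are generally not evenly spaced in memory, so they cannot be described by a header alone and NumPy has to copy them into a new array.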

In [81]:
a = a[a > 15] # masking creates a new copy of the data
print(a.flags)

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
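Instead of reading the flags, sharing can also be checked directly with `np.shares_memory`, and an explicit `.copy()` gives an independent array when sharing is not wanted. A minimal sketch with illustrative names:

```python
import numpy as np

a = np.arange(30)
view = a.reshape(5, 6)                # shares data with a
masked = a[a > 15]                    # masking copies
independent = a.reshape(5, 6).copy()  # explicit copy

a[0] = 100                            # only arrays sharing memory see this

print(np.shares_memory(a, view), np.shares_memory(a, masked))
print(view[0, 0], independent[0, 0])
```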



### Reshaping re-visited

In [83]:
b = a.reshape(15, -1)

NumPy works out the remaining dimension by itself. Of course, there can't be more than one occurrence of -1, as that would be ambiguous.
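A sketch of how `-1` is resolved, and of the error raised when it appears twice. The names are illustrative.

```python
import numpy as np

a = np.arange(12)

two_d = a.reshape(3, -1)        # -1 becomes 4, since 12 / 3 = 4
three_d = a.reshape(-1, 2, 3)   # -1 becomes 2, since 12 / (2 * 3) = 2
print(two_d.shape, three_d.shape)

raised = False
try:
    a.reshape(-1, -1)           # two unknowns cannot be resolved
except ValueError as err:
    raised = True
    print(err)
```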

### Flatten any array to a vector.

In [85]:
print(a.flatten())
print(a.reshape(np.prod(a.shape)))
print(a.reshape(-1))

[100  16  17  18  19  20  21  22  23  24  25  26  27  28  29]
[100  16  17  18  19  20  21  22  23  24  25  26  27  28  29]
[100  16  17  18  19  20  21  22  23  24  25  26  27  28  29]


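All three produce the same result. The difference is that `flatten()` always returns a copy, whereas `reshape` returns a view when it can.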
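The results look identical, but they differ in whether memory is shared with the original. A small sketch with illustrative names:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_copy = a.flatten()    # always a copy
flat_view = a.reshape(-1)  # a view here, since a is contiguous

flat_copy[0] = 100         # does not touch a
flat_view[1] = 200         # writes through to a

print(a)
```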

### Special arrays

In [86]:
np.zeros(5)
np.ones(5)

array([1., 1., 1., 1., 1.])

If we have to create a multi-dimensional array, the shape has to be passed as a tuple.

In [89]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

We can now create an array filled with any single value by simply multiplying `np.ones` by that value.
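A minimal sketch of this trick, with illustrative names. `np.full` is shown as a direct alternative that does the same thing in one call.

```python
import numpy as np

sevens = np.ones(5) * 7         # float array filled with 7.0
sevens_2d = np.full((2, 3), 7)  # same idea, shape and value given directly

print(sevens)
print(sevens_2d)
```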

#### Empty array

In [93]:
np.empty((2, 2))

array([[1.49166815e-154, 1.29074076e-231],
       [1.48219694e-323, 4.17201348e-309]])

#### Identity array

In [98]:
np.eye(2)
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

#### Array over a given range with a given number of elements

In [100]:
np.linspace(1, 5, 9)

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

The above command creates an array of 9 evenly spaced elements between 1 and 5, with both endpoints included.
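The spacing that `linspace` chose can be inspected with `retstep=True`, and `endpoint=False` leaves the upper bound out. A short sketch with illustrative names:

```python
import numpy as np

values, step = np.linspace(1, 5, 9, retstep=True)
print(values.size, step)  # 9 values, spaced 0.5 apart

# Excluding the endpoint: 8 values from 1 up to, but not including, 5.
no_end = np.linspace(1, 5, 8, endpoint=False)
print(no_end)
```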

### Axis

In [102]:
a = np.arange(12)
print(a.sum())
b = a.reshape(3, 4)
print(b.sum())

66
66


No matter what the shape of the array is, `sum()` gives the sum of all elements. Now what if we want to sum only along a specific dimension?

In [106]:
b.sum(axis=0)

array([12, 15, 18, 21])

In [107]:
b.sum(axis=1)

array([ 6, 22, 38])

The `axis` argument names the dimension that gets collapsed. With `axis=0` you move down through the rows, collapsing the row dimension and leaving one value per column. This is confusing at first, because to get the column-wise sum we have to sum along the row axis. Don't worry about the result being printed horizontally; as we discussed, vectors are 1-dimensional.
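Looking at the shapes is a quick way to see which dimension was collapsed. The sketch below uses illustrative names; `keepdims=True` keeps the collapsed axis with length 1.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

per_column = b.sum(axis=0)                # shape (3, 4) -> (4,)
per_row = b.sum(axis=1)                   # shape (3, 4) -> (3,)
per_row_2d = b.sum(axis=1, keepdims=True) # shape (3, 4) -> (3, 1)

c = np.arange(24).reshape(2, 3, 4)
across_depth = c.sum(axis=0)              # shape (2, 3, 4) -> (3, 4)

print(per_column.shape, per_row.shape, per_row_2d.shape, across_depth.shape)
```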

### Statistical Analysis

#### Calculate maximum

In [112]:
b = np.arange(12).reshape(3, 4)
print(b)
print(b.max(axis=1)) # collapses the column axis: one maximum per row
print(np.max(b, axis=1)) # prefer this way of writing

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[ 3  7 11]
[ 3  7 11]


If two arrays of the same shape are given, `np.maximum(a1, a2)` gives the element-wise maximum of the two. (`np.max(a1, a2)` does not do this; its second positional argument is the axis.)
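For comparing two arrays element by element, NumPy provides `np.maximum`. A minimal sketch with illustrative arrays:

```python
import numpy as np

a1 = np.array([1, 8, 3])
a2 = np.array([4, 2, 6])

elementwise = np.maximum(a1, a2)  # larger value at each position
print(elementwise)
```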

#### Calculate mean

In [114]:
print(b.mean()) # gives the overall mean
print(b.mean(axis=0)) # finds column-wise mean, counter-intuitive
print(b.mean(axis=1)) # finds row-wise mean

5.5
[4. 5. 6. 7.]
[1.5 5.5 9.5]


#### Calculate median

In [115]:
np.median(b)

5.5

`b.median()` actually raises an error, because arrays have no `median` method; use `np.median(b)` instead.

Let's take the example of a Gaussian distribution. Let’s say the mean is 100 and the standard deviation is 15. The range [85, 115], one standard deviation either side of the mean, contains about 68% of the data. Now let’s talk about 2 standard deviations around the mean. This covers about 95% of the data; the remaining 5% lies beyond 2 sigma. The larger the standard deviation, the more spread out the data. Variance is the average squared distance of each value from the mean, and the standard deviation is its square root.


In [116]:
np.std(b)

3.452052529534663
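The one-sigma and two-sigma figures for a Gaussian can be checked empirically. The sketch below assumes a seeded generator and a sample size of 100,000, both chosen only for illustration, and also computes the variance by hand from its definition.

```python
import numpy as np

mu, sigma = 100, 15
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
s = rng.normal(mu, sigma, 100_000)

within_1 = np.mean(np.abs(s - mu) <= sigma)      # fraction in [85, 115]
within_2 = np.mean(np.abs(s - mu) <= 2 * sigma)  # fraction in [70, 130]
print(within_1, within_2)

# Variance: average squared distance from the mean.
variance = np.mean((s - s.mean()) ** 2)
print(variance, np.sqrt(variance), np.std(s))
```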

### Random Uniform Distributions

In [117]:
np.random.rand(5)

array([0.5410006 , 0.62158035, 0.50241116, 0.5774448 , 0.19036092])

`np.random.rand(5)` fills an array with 5 random samples from a uniform distribution over [0, 1).

In [119]:
print(90 + np.random.rand(2, 3)*10)

[[91.91159386 99.57714027 92.97599012]
 [91.23257466 96.18614193 99.79015587]]


This fills the array with random samples from a uniform distribution over [90, 90 + 10).

In [122]:
print(np.random.randint(50, 60, 5))

[53 57 57 56 53]


`randint` returns random integers from the discrete uniform distribution over [50, 60); the upper bound is excluded.
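The same draws can also be written with NumPy's `Generator` interface, where the bounds are passed directly instead of scaling by hand. This is an alternative to the calls above, not what the notebook used; the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily

floats = rng.uniform(90, 100, size=(2, 3))  # uniform over [90, 100)
ints = rng.integers(50, 60, size=5)         # integers in [50, 60)

print(floats)
print(ints)
```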

### Random Normal Distribution

In [124]:
mu = 100
sigma = 15
s = np.random.normal(mu, sigma, 100) # generates 100 values
print(np.mean(s))
print(np.std(s))

99.4479199667913
15.725100527006338


### Sorting

In [133]:
d = np.random.randint(1, 9, (3, 4))
print(d)
d.sort()
print(d)

[[2 4 7 3]
 [1 2 1 6]
 [6 2 2 2]]
[[2 3 4 7]
 [1 1 2 6]
 [2 2 2 6]]


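Note that `.sort()` doesn't follow the functional programming paradigm: it sorts the array in place (each row separately, along the last axis) and returns `None`. Use `np.sort(d)` to get a sorted copy instead.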
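The in-place method and the copying function can be compared side by side. The sketch uses a small fixed array instead of random values so that the result is predictable; the names are illustrative.

```python
import numpy as np

d = np.array([[2, 4, 7, 3],
              [1, 2, 1, 6]])

sorted_copy = np.sort(d)         # new array, each row sorted; d unchanged
unchanged = d.tolist()           # snapshot of d before the in-place sort
by_column = np.sort(d, axis=0)   # sort down each column instead

result = d.sort()                # sorts d itself and returns None

print(result)
print(d)
```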