# Introduction to NumPy

## What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

## Why NumPy?

Put simply, it's fast at performing numeric operations. This is because its core is written in C.

To speed up calculations, NumPy uses vectorisation and broadcasting. In plain English, operations are applied to whole arrays at once rather than through Python loops, which can slow down processing, especially with large datasets.
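
As a rough sketch of what avoiding loops means in practice, the snippet below doubles every value twice: once with a Python loop and once with a single vectorised expression. The names `values`, `doubled_loop` and `doubled_vectorised` are made up for this illustration.

```python
import numpy as np

values = np.arange(100_000)

# Loop version: Python visits each element one at a time.
doubled_loop = np.empty_like(values)
for i in range(values.size):
    doubled_loop[i] = values[i] * 2

# Vectorised version: one expression, with the looping done inside NumPy's compiled code.
doubled_vectorised = values * 2

# Both approaches give the same result; only the speed differs.
print(np.array_equal(doubled_loop, doubled_vectorised))
```

Both versions produce identical arrays. Wrapping each one in `%timeit` in a notebook is one way to compare how long they take.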

Finally, NumPy is also the backbone of other Python scientific packages, such as pandas, SciPy and scikit-learn.

## NumPy DataTypes and Attributes

!["NumPy Array Types"](../../assets/images/notes/016-numpy-array.png)

In [76]:
# --- Import NumPy, pandas and matplotlib's imread (needed for later on):
import numpy as np
import pandas as pd
from matplotlib.image import imread

In [2]:
# --- NumPy uses ndarray (n-dimensional array) for its main datatype.
# --- Create a simple one-dimensional array, also called a vector.
# --- Note: This has a shape of (3,), a single axis holding three elements:
sample_array_1 = np.array([1,2,3])
sample_array_1

array([1, 2, 3])

In [3]:
# --- Create a two-dimensional array.
# --- Note 1: This has a shape of 2,3 (two rows, three columns).
# --- Note 2: As there is a float in the array, all the numbers will be converted to float:
sample_array_2 = np.array([[1, 2.0, 3.3],
                           [4, 5, 6.5]])
sample_array_2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])
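
The conversion to float can be confirmed through the `dtype` attribute. This is a small sketch; `int_array` and `mixed_array` are illustrative names that do not appear elsewhere in these notes.

```python
import numpy as np

int_array = np.array([1, 2, 3])
mixed_array = np.array([1, 2.0, 3.3])

# An all-integer list gives an integer dtype (the exact width depends on the platform).
print(int_array.dtype)

# A single float is enough to make the whole array float64.
print(mixed_array.dtype)

# Casting back to integers drops the fractional part.
print(mixed_array.astype(int))
```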

In [4]:
# --- Create a multi-dimensional array.
# --- Note: This has a shape of 2, 3, 3 (two matrices deep, three rows and three columns per matrix):
sample_array_3 = np.array([[[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]],
                          [[10, 11, 12],
                           [13, 14, 15],
                           [16, 17, 18]]])
sample_array_3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

In [5]:
# --- Show the shape and size of each sample array:
print(f"sample array 1. shape: {sample_array_1.shape}, size: {sample_array_1.size}")
print(f"sample array 2. shape: {sample_array_2.shape}, size: {sample_array_2.size}")
print(f"sample array 3. shape: {sample_array_3.shape}, size: {sample_array_3.size}")

sample array 1. shape: (3,), size: 3
sample array 2. shape: (2, 3), size: 6
sample array 3. shape: (2, 3, 3), size: 18


In [6]:
# --- Show the number of dimensions for each sample array:
sample_array_1.ndim, sample_array_2.ndim, sample_array_3.ndim

(1, 2, 3)

In [7]:
# --- Create a pandas dataframe from an ndarray:
sample_df_2 = pd.DataFrame(sample_array_2)
sample_df_2


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.3 |
| 1 | 4.0 | 5.0 | 6.5 |


## Creating NumPy Arrays

In [8]:
# --- Create a 1 x 3 ndarray with values of 1 using the ones function.
# --- Note: The default datatype for each 1 is float64 so they will be 1. instead.
# --- You can change that with dtype = int:
ones = np.ones(shape=(1, 3))
ones

array([[1., 1., 1.]])

In [9]:
# --- Create a 2 x 3 ndarray with values of 0 using the zeros function.
# --- Note: The default datatype for each 0 is float64 so they will be 0. instead.
# --- You can change that with dtype = int:
zeros = np.zeros(shape=(2, 3))
zeros

array([[0., 0., 0.],
       [0., 0., 0.]])

In [10]:
# --- Create an ndarray with a range starting at 0, up to (but not including) 10, in steps of 2:
range_array = np.arange(0, 10, 2)
range_array

array([0, 2, 4, 6, 8])

In [11]:
# --- Create an ndarray with 3 rows and five random integers per row:
random_array = np.random.randint(low = 0, high = 100, size = (3, 5))
print(random_array)
print(f"\nrandom_array size: {random_array.size}\nrandom_array shape: {random_array.shape}")


[[36 65 66 71 57]
 [49 79 66 27 31]
 [92 14 75 29 52]]

random_array size: 15
random_array shape: (3, 5)


Note: NumPy's random numbers are pseudo-random. In short, they look random to us, but they are produced by a deterministic algorithm.

You can give NumPy's random number generator a fixed starting point with the `np.random.seed()` function.

By default, the seed is `None`. In that case NumPy seeds the generator from an unpredictable source (such as the operating system), so each run produces different numbers.

If you pass a value to the seed function, the generator starts from the same point every time, so the same sequence of numbers is generated after each call to `np.random.seed()` with that value:

In [12]:
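# --- Note: seed() returns None, so this prints None. Calling it with no argument also re-seeds the generator:
print(np.random.seed())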

None


In [13]:
# --- Set the seed value to None and generate a random array of ints.
np.random.seed(seed=None)
random_array_seed_1 = np.random.randint(0, 10, size=(3, 5))
print(f"random_array_seed_1\n{random_array_seed_1}")

# --- Set the seed value to 1 and generate an array with random ints:
np.random.seed(seed=1)
random_array_seed_2 = np.random.randint(0, 10, size=(3, 5))
print(f"\nrandom_array_seed_2\n{random_array_seed_2}")

# --- The result should be this each time:
# [[5 8 9 5 0]
# [0 1 7 6 9]
# [2 4 5 2 4]]

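# --- Note: np.random.seed() sets NumPy's global random state, which carries over between Jupyter cells.
# --- To get the same numbers again, call np.random.seed() again before generating them.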

random_array_seed_1
[[5 2 2 5 6]
 [4 1 4 5 0]
 [0 6 8 4 9]]

random_array_seed_2
[[5 8 9 5 0]
 [0 1 7 6 9]
 [2 4 5 2 4]]
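
A minimal sketch of how the seed behaves across several calls, assuming the legacy `np.random.seed()` interface used above. The names `first_draw`, `second_draw` and `third_draw` are illustrative only.

```python
import numpy as np

np.random.seed(1)
first_draw = np.random.randint(0, 10, size=(3, 5))

# Without re-seeding, the generator carries on from where it left off,
# so this draw is a continuation of the sequence rather than a repeat.
second_draw = np.random.randint(0, 10, size=(3, 5))

# Re-seeding with the same value restarts the sequence from the beginning.
np.random.seed(1)
third_draw = np.random.randint(0, 10, size=(3, 5))

print(np.array_equal(first_draw, third_draw))  # True
```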


## Viewing Arrays and Matrices

In [14]:
# --- Show the unique numbers in an ndarray:
np.unique(random_array_seed_2)

array([0, 1, 2, 4, 5, 6, 7, 8, 9])

In [15]:
# --- Show the first item in a 1-D array:
sample_array_1[0]

1

In [16]:
# --- Show the first item of the first row in a 2-D matrix:
sample_array_2[0,0]

1.0

In [17]:
# --- Show the first item of the first row of the first matrix in a 3-D array:
# --- The index order is [depth, row, column]:
sample_array_3[0,0,0]

1

In [18]:
# --- Using the first two matrices (:2), from the first three rows (:3) of each of the
# --- two matrices, get the first two numbers (:2).
# --- This is using Python slices (:):
sample_array_3[:2, :3, :2]

array([[[ 1,  2],
        [ 4,  5],
        [ 7,  8]],

       [[10, 11],
        [13, 14],
        [16, 17]]])

## Manipulating Arrays

### Arithmetic

In [19]:
# --- View the arrays we will work with
print(f"Ones Array:     {ones}")
print(f"Sample Array 1: {sample_array_1}\n")
print(f"Sample Array 2:\n{sample_array_2}\n")
print(f"Sample Array 3:\n{sample_array_3}")


Ones Array:     [[1. 1. 1.]]
Sample Array 1: [1 2 3]

Sample Array 2:
[[1.  2.  3.3]
 [4.  5.  6.5]]

Sample Array 3:
[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]]


In [20]:
# --- Perform some basic maths with a 1-D array and the 1 x 3 ones array:
print(f"Addition:    {sample_array_1 + ones}")
print(f"Subtraction: {sample_array_1 - ones}")
print(f"Multiplied:  {sample_array_1 * ones}")
print(f"Squared By:  {sample_array_1 ** 2}")
print(f"Divided By:  {sample_array_1 / ones}")

Addition:    [[2. 3. 4.]]
Subtraction: [[0. 1. 2.]]
Multiplied:  [[1. 2. 3.]]
Squared By:  [1 4 9]
Divided By:  [[1. 2. 3.]]


In [21]:
# --- Perform some basic maths with the 1 x 3 ones array and a 2-D array.
# --- To mix things up, use the built-in numpy functions for arithmetic this time.
# --- What will happen is that all of the rows in sample_array_2 will be acted on by the values
# --- in ones where each column position is matched. This is called shape broadcasting:
print(f"Addition:    {np.add(sample_array_2, ones)}")
print(f"Subtraction: {np.subtract(sample_array_2, ones)}")
print(f"Multiplied:  {np.multiply(sample_array_2, ones)}")
print(f"Squared By:  {np.square(sample_array_2)}")
print(f"Divided By:  {np.divide(sample_array_2, ones)}")

Addition:    [[2.  3.  4.3]
 [5.  6.  7.5]]
Subtraction: [[0.  1.  2.3]
 [3.  4.  5.5]]
Multiplied:  [[1.  2.  3.3]
 [4.  5.  6.5]]
Squared By:  [[ 1.    4.   10.89]
 [16.   25.   42.25]]
Divided By:  [[1.  2.  3.3]
 [4.  5.  6.5]]


In [22]:
# --- Try adding a 2-D array to a 3-D array:
# --- This will not work. Broadcasting compares shapes from the last axis backwards:
# --- (2, 3, 3) against (2, 3) matches on the last axis (3 and 3) but not on the next one (3 and 2).
# --- This breaks broadcasting rules. The fix is to make the shapes compatible.
# print(f"Addition:    {sample_array_3 + sample_array_2}")

In [23]:
# --- To get the array shapes to match, you can either recreate one to match the other
# --- or use the reshape function. For example, reshape sample_array_2 as a new array:
sample_array_4 = sample_array_2.reshape(2,1,3)

# --- Let's try adding array 4 to array 3:
print(f"Addition:\n{sample_array_3 + sample_array_4}")


Addition:
[[[ 2.   4.   6.3]
  [ 5.   7.   9.3]
  [ 8.  10.  12.3]]

 [[14.  16.  18.5]
  [17.  19.  21.5]
  [20.  22.  24.5]]]
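
The broadcasting rule behind the failed and the successful addition can be sketched with placeholder arrays of the same shapes. NumPy lines shapes up from the last axis backwards, and two axes are compatible when they are equal or one of them is 1. `array_3d`, `array_2d` and `outcome` are names invented for this example.

```python
import numpy as np

array_3d = np.zeros((2, 3, 3))
array_2d = np.ones((2, 3))

# (2, 3, 3) vs (2, 3): the last axes match (3 and 3), the next pair does not (3 and 2).
try:
    array_3d + array_2d
    outcome = "worked"
except ValueError as error:
    outcome = "failed"
    print(f"Broadcast error: {error}")

# (2, 3, 3) vs (2, 1, 3): every pair of axes is either equal or contains a 1.
reshaped = array_2d.reshape(2, 1, 3)
result = array_3d + reshaped
print(result.shape)  # (2, 3, 3)
```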


### Aggregation

Aggregation is taking a group of values and reducing them to a single summary value, such as their sum, mean, minimum or maximum.

In [24]:
# --- Sum the values of sample_array_1 with Python's built-in sum() and with NumPy's np.sum():
print(sum(sample_array_1))
print(np.sum(sample_array_1))

6
6


Whilst you can use `sum()` on NumPy ndarrays, it is recommended to do the following:
- For Python datatypes (dicts, lists, sets and tuples), use Python functions (`sum()`).
- For NumPy datatypes (ndarrays), use NumPy functions (`np.sum()`).
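
Speed is not the only difference. On arrays with more than one dimension the two functions give different answers, as this small sketch shows (`grid` is a made-up example array):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Built-in sum() iterates over the first axis, so it adds the rows together.
print(sum(grid))             # [5 7 9]

# np.sum() adds every element unless an axis is given.
print(np.sum(grid))          # 21
print(np.sum(grid, axis=0))  # [5 7 9]  (down each column)
print(np.sum(grid, axis=1))  # [ 6 15]  (across each row)
```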

In [25]:
# --- Create an array with 10,000 random numbers and show the first 10 items:
sample_array_4 = np.random.randint(low = 1, high = 100, size = 10000)
sample_array_4[:10]

array([12, 29, 30, 15, 51, 69, 88, 88, 95, 97])

In [26]:
# --- Check the time (%timeit) it takes each method to run the same aggregation function.
# --- Spoilers: NumPy is WAAAAAAY faster:
%timeit sum(sample_array_4)
%timeit np.sum(sample_array_4)

612 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
7.42 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [27]:
# --- Additional examples of NumPy aggregate methods:
print(f"Mean:    {np.mean(sample_array_4)}")
print(f"Median:  {np.median(sample_array_4)}")
print(f"Min:     {np.min(sample_array_4)}")
print(f"Max:     {np.max(sample_array_4)}")
print(f"Var:     {np.var(sample_array_4)}")
print(f"Std Dev: {np.std(sample_array_4)}")

Mean:    50.086
Median:  50.0
Min:     1
Max:     99
Var:     815.134604
Std Dev: 28.550562236145193


### Variance (np.var)

Variance is the measure of the average degree to which each number is different to the mean.

- Higher variance = wider range of numbers
- Lower variance = narrower range of numbers
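
To make the definition concrete, here is a small worked sketch that computes the variance by hand as the mean of the squared differences from the mean, then compares it with `np.var`. The array `numbers` is an invented example.

```python
import numpy as np

numbers = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = numbers.mean()                        # 5.0
squared_differences = (numbers - mean) ** 2  # how far each value is from the mean, squared
variance = squared_differences.mean()        # 4.0

print(variance)
print(np.isclose(variance, np.var(numbers)))  # True
```

Note that `np.var` divides by the number of values by default (the population variance), which is what the manual calculation does too.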

### Standard Deviation (np.std)

Standard deviation is a measure of how spread out a group of numbers is from the mean.

Put another way, the standard deviation is the square root of the variance.

In [28]:
# --- Just to show that the above is correct, show the std dev and the sqrt of var for
# --- sample_array_4:
print(f"Std Dev:  {np.std(sample_array_4)}")
print(f"Sqrt Var: {np.sqrt(np.var(sample_array_4))}")

Std Dev:  28.550562236145193
Sqrt Var: 28.550562236145193


### Reshape

Reshape allows you to change the shape of an existing array to whatever shape you need, as long as the data fits into it exactly (the total number of elements must stay the same).
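
A minimal sketch of that rule, using an invented six-element array called `flat`:

```python
import numpy as np

flat = np.arange(6)  # six elements: [0 1 2 3 4 5]

two_by_three = flat.reshape(2, 3)
three_by_two = flat.reshape(3, -1)  # -1 asks NumPy to work out the missing dimension

print(two_by_three.shape)  # (2, 3)
print(three_by_two.shape)  # (3, 2)

# 4 x 2 needs eight values but there are only six, so this raises a ValueError.
try:
    flat.reshape(4, 2)
except ValueError as error:
    print(f"Reshape failed: {error}")
```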

In [29]:
# --- Try adding a 2-D array to a 3-D array:
# --- This will not work. Broadcasting compares shapes from the last axis backwards:
# --- (2, 3, 3) against (2, 3) matches on the last axis (3 and 3) but not on the next one (3 and 2).
# --- This breaks broadcasting rules. The fix is to make the shapes compatible.
# print(f"Addition:    {sample_array_3 + sample_array_2}")

In [30]:
# --- To get the array shapes to match, you can either recreate one to match the other
# --- or use the reshape function. For example, reshape sample_array_2 as a new array:
sample_array_5 = sample_array_2.reshape(2,1,3)
print(f"Array 5 Shape: {sample_array_5.shape}\n")
print(f"Array 5 Reshaped from Array 2:\n{sample_array_5}\n\n")

# --- Let's try adding array 5 to array 3:
print(f"Add Array 3 to Array 5:\n{sample_array_3 + sample_array_5}")

Array 5 Shape: (2, 1, 3)

Array 5 Reshaped from Array 2:
[[[1.  2.  3.3]]

 [[4.  5.  6.5]]]


Add Array 3 to Array 5:
[[[ 2.   4.   6.3]
  [ 5.   7.   9.3]
  [ 8.  10.  12.3]]

 [[14.  16.  18.5]
  [17.  19.  21.5]
  [20.  22.  24.5]]]


### Transpose

The transpose method reverses the order of an array's axes, and therefore its shape. For example, if an array has a shape of 2, 3, running transpose on it will change the shape to 3, 2.

In [31]:
# --- Have a look at sample array 3:
print(sample_array_3)
print(sample_array_3.shape)

# --- To transpose, you can use either transpose() or simply T.
# --- The array's shape will now be reversed:
print(sample_array_3.transpose())
print(sample_array_3.T.shape)

[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]]
(2, 3, 3)
[[[ 1 10]
  [ 4 13]
  [ 7 16]]

 [[ 2 11]
  [ 5 14]
  [ 8 17]]

 [[ 3 12]
  [ 6 15]
  [ 9 18]]]
(3, 3, 2)
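
The 2, 3 case mentioned in the description is easier to follow by eye than the 3-D one. A small sketch, with `matrix` as an invented example:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)    # (2, 3)
print(matrix.T.shape)  # (3, 2)

# Rows become columns: matrix[i, j] ends up at matrix.T[j, i].
print(matrix.T)
```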


## Element-Wise vs. Dot Product

!['Element-Wise vs. Dot Product'](../../assets/images/notes/014-element-vs-dot.png)

### Element-Wise

Element-wise multiplication takes two arrays (let's say two arrays with a shape of 5, 3 each), matches up the elements in the same positions and multiplies each pair together. If the arrays have different shapes that cannot be broadcast together, it will fail with a broadcast error.

In [32]:
# --- For example, take two arrays of 5,3 and multiply them together:
element_array_1 = np.random.randint(low = 1, high = 10, size = (5,3))
print(f"Element Array 1:\n{element_array_1}\n")

element_array_2 = np.random.randint(low = 1, high = 10, size = (5,3))
print(f"Element Array 2:\n{element_array_2}\n")

# --- Multiply the two arrays together:
print(f"Element Multiplication:\n{np.multiply(element_array_1, element_array_2)}")

Element Array 1:
[[6 5 2]
 [7 5 8]
 [4 5 7]
 [6 1 9]
 [1 7 6]]

Element Array 2:
[[2 9 5]
 [4 3 3]
 [9 1 1]
 [5 9 6]
 [9 7 5]]

Element Multiplication:
[[12 45 10]
 [28 15 24]
 [36  5  7]
 [30  9 54]
 [ 9 49 30]]


In [33]:
# Multiply all elements in one array by a scalar (single value):
print(f"Element Scalar Multiplication:\n{np.multiply(element_array_1, 5)}")

Element Scalar Multiplication:
[[30 25 10]
 [35 25 40]
 [20 25 35]
 [30  5 45]
 [ 5 35 30]]


### Dot Product

Dot product is a method to take two arrays and do a row x column multiplication and addition. It will multiply the values in a row by the values in a column, add them up and put the result in a new cell in a new matrix. 

Once it has done the first row by the first column and added the values up into the new cell, it will then do the first row by the second column and so on until all the columns are done. It will then move onto the next row and repeat.

One rule: the number of columns in the first matrix must match the number of rows in the second matrix. For example:

Matrix a's size is 3 x (3) and matrix b's size is (3) x 2. Both the columns of a and the rows of b equal 3.

The result will be a matrix with a size of 3 x 2, taken from the two remaining numbers.

The below image will help explain it much better:

!["Dot Product Example"](../../assets/images/notes/015-dot-product.png)

In [34]:
# --- For example, take two arrays, one that is 5, 3 & another that is 3, 2 and use dot:
dot_array_1 = np.random.randint(low = 1, high = 10, size = (5,3))
print(f"Dot Array 1:\n{dot_array_1}\n")

dot_array_2 = np.random.randint(low = 1, high = 10, size = (3,2))
print(f"Dot Array 2:\n{dot_array_2}\n")

# Dot the two arrays together:
print(f"Dot Product:\n{np.dot(a = dot_array_1, b = dot_array_2)}")

Dot Array 1:
[[2 2 2]
 [8 8 5]
 [3 3 5]
 [4 8 6]
 [1 7 4]]

Dot Array 2:
[[7 6]
 [9 6]
 [8 7]]

Dot Product:
[[ 48  38]
 [168 131]
 [ 88  71]
 [148 114]
 [102  76]]


Some examples of valid sizes:
- 5,(3) x (3),4 = 5x4
- 5,(4) x (4),5 = 5x5
- 5,(5) x (5),5 = 5x5
- 7,(10) x (10),5 = 7x5

As long as the numbers in the brackets match, that's what matters. The resulting matrix size will be the numbers not in the brackets.
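
The row-by-column procedure described in this section can be sketched with explicit loops and checked against `np.dot`. This is for illustration only; the names `a`, `b` and `manual` are invented, and in practice you would call `np.dot` (or use the `@` operator) directly.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
b = np.array([[7, 8],
              [9, 10],
              [11, 12]])    # shape (3, 2)

rows, inner_a = a.shape
inner_b, cols = b.shape
assert inner_a == inner_b, "columns of a must match rows of b"

# The result takes its shape from the two outer numbers: (2, 2).
manual = np.zeros((rows, cols), dtype=a.dtype)
for i in range(rows):
    for j in range(cols):
        # Multiply row i of a by column j of b, then add the products up.
        manual[i, j] = np.sum(a[i, :] * b[:, j])

print(manual)
print(np.array_equal(manual, np.dot(a, b)))  # True
print(np.array_equal(manual, a @ b))         # True
```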

### Example of Using NumPy and Pandas

Use Case: Work out the line price total for sales of products for each day with the price in a different array / list.

In [35]:
np.random.seed(1)

# --- Create a random ndarray of items sold:
sales_qty = np.random.randint(30, size = (5,3))

In [36]:
# --- Create a pandas dataframe from the sales_qty array:
weekly_sales = pd.DataFrame(data = sales_qty,
                            index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
                            columns = ["Bread", "Chicken", "Peppers"])

In [37]:
# --- Create an ndarray of prices and then make a pandas dataframe from it:
prices = np.array([2, 8, 3])
item_prices = pd.DataFrame(data = prices.reshape(1,3),
                           index = ["Price"],
                           columns = ["Bread", "Chicken", "Peppers"])

In [38]:
# --- Calculate the total for each day (quantity sold x price, summed across the items):
item_totals = np.dot(a = sales_qty, 
                     b = prices)

In [39]:
# --- Insert the item_totals results to the weekly sales dataframe:
weekly_sales.insert(value = item_totals, column = "Day Total", loc = 3)
weekly_sales

|   | Bread | Chicken | Peppers | Day Total |
|---|---|---|---|---|
| Monday | 5 | 11 | 12 | 134 |
| Tuesday | 8 | 9 | 11 | 121 |
| Wednesday | 5 | 15 | 0 | 130 |
| Thursday | 16 | 1 | 12 | 76 |
| Friday | 7 | 13 | 28 | 202 |


## Comparing and Sorting Arrays

### Comparison Operators

In [44]:
print(f"{sample_array_1}\n")
print(f"{sample_array_2}")

[1 2 3]

[[1.  2.  3.3]
 [4.  5.  6.5]]


In [54]:
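# --- Comparison operators are applied element by element (with broadcasting) and return boolean arrays.
# --- Note: `is` is not element-wise. It checks whether both names refer to the same object. For example: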
print(f"{sample_array_1 > sample_array_2}\n")
print(f"{sample_array_1 < sample_array_2}\n")
print(f"{sample_array_1 <= sample_array_2}\n")
print(f"{sample_array_1 >= sample_array_2}\n")
print(f"{sample_array_1 is sample_array_2}\n")

# --- Compare a specific value in each ndarray (in this case 2 from sa1 and 2. from sa2):
print(f"{sample_array_1[1] == sample_array_2[0][1]}\n")

[[False False False]
 [False False False]]

[[False False  True]
 [ True  True  True]]

[[ True  True  True]
 [ True  True  True]]

[[ True  True False]
 [False False False]]

False

True
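
The difference between comparing values and comparing identity can be sketched as follows. `first` and `second` are invented arrays holding the same values.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([1, 2, 3])

print(first == second)                # [ True  True  True], compared element by element
print(first is second)                # False, they are two separate objects
print(np.array_equal(first, second))  # True, same shape and same values

# A comparison result can also be used as a mask to pick out the matching values.
print(first[first > 1])               # [2 3]
```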



### Sorting Arrays

In [64]:
# --- Create a new array to work with:
unsorted_array_1 = np.random.randint(10, size=(3, 5))
unsorted_array_1

array([[3, 6, 5, 1, 9],
       [3, 4, 8, 1, 4],
       [0, 3, 9, 2, 0]])

In [68]:
# --- Sort the numbers down each column (axis=0):
np.sort(unsorted_array_1, axis=0)

array([[0, 3, 5, 1, 0],
       [3, 4, 8, 1, 4],
       [3, 6, 9, 2, 9]])

In [67]:
# --- Sort the numbers along each row (axis=1):
np.sort(unsorted_array_1, axis=1)

array([[1, 3, 5, 6, 9],
       [1, 3, 4, 4, 8],
       [0, 0, 2, 3, 9]])

In [69]:
# --- Get the indexes that would sort each row (lowest to highest). This shows the index numbers, not the values:
np.argsort(unsorted_array_1)

array([[3, 0, 2, 1, 4],
       [3, 0, 1, 4, 2],
       [0, 4, 3, 1, 2]])
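
One way to see what those index numbers mean is to use them to reorder the original array, which should give the same result as `np.sort`. This sketch re-creates the array above under the name `unsorted` so that it runs on its own.

```python
import numpy as np

unsorted = np.array([[3, 6, 5, 1, 9],
                     [3, 4, 8, 1, 4],
                     [0, 3, 9, 2, 0]])

# The indexes that would sort each row.
order = np.argsort(unsorted, axis=1)

# Pick the values out of each row in that order.
reordered = np.take_along_axis(unsorted, order, axis=1)

print(reordered)
print(np.array_equal(reordered, np.sort(unsorted, axis=1)))  # True
```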

In [75]:
# --- Get the index of the lowest and highest values in the entire matrix.
# --- Note: With no axis given, the index refers to the flattened (1-D) version of the array:
print(f"Min Value Index: {np.argmin(unsorted_array_1)}")
print(f"Max Value Index: {np.argmax(unsorted_array_1)}")

Min Value Index: 10
Max Value Index: 4
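
A flattened index such as 10 can be turned back into a row and column with `np.unravel_index`. The sketch below re-creates the array above under the name `unsorted` so that it runs on its own.

```python
import numpy as np

unsorted = np.array([[3, 6, 5, 1, 9],
                     [3, 4, 8, 1, 4],
                     [0, 3, 9, 2, 0]])

flat_index = np.argmin(unsorted)  # 10, counting along the rows from the top left

# Convert the flat position into (row, column) coordinates for this shape.
row, column = np.unravel_index(flat_index, unsorted.shape)

print(row, column)            # 2 0
print(unsorted[row, column])  # 0, the lowest value
```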


In [None]:
# --- Get the index of the lowest and highest values by columns:
print(f"Min Value Index: {np.argmin(unsorted_array_1, axis = 0)}")
print(f"Max Value Index: {np.argmax(unsorted_array_1, axis = 0)}")

Min Value Index: [2 2 0 0 2]
Max Value Index: [0 0 2 2 0]


In [74]:
# --- Get the index of the lowest and highest values by rows:
print(f"Min Value Index: {np.argmin(unsorted_array_1, axis = 1)}")
print(f"Max Value Index: {np.argmax(unsorted_array_1, axis = 1)}")

Min Value Index: [3 3 0]
Max Value Index: [4 2 2]


## Converting Images To NumPy Arrays

!["Panda"](../../assets/images/notes/017-panda.png)

What will happen when the image is converted to an ndarray? Simply put, each pixel is stored as three numbers (its red, green and blue values), so the array has a shape of height, width, 3. For PNG files, matplotlib's `imread` returns these values as floats between 0 and 1.

In [83]:
# --- Turn an image into a NumPy array:
panda = imread("../../assets/images/notes/017-panda.png")

# --- Let's have a look at some details about the array:
print(f"Type: {type(panda)}\nSize: {panda.size}\nShape: {panda.shape}\nDimensions: {panda.ndim}")

Type: <class 'numpy.ndarray'>
Size: 24465000
Shape: (2330, 3500, 3)
Dimensions: 3
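
The same layout can be explored without an image file. The sketch below builds a hypothetical 2 x 2 image by hand (`tiny_image` is an invented name) and reads single pixels and a single colour channel from it.

```python
import numpy as np

# A made-up 2 x 2 image: red and green on the top row, blue and white on the bottom row.
tiny_image = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                       [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]], dtype=np.float32)

print(tiny_image.shape)     # (2, 2, 3): height, width, colour channels
print(tiny_image[0, 1])     # the top-right pixel (green): [0. 1. 0.]
print(tiny_image[:, :, 0])  # the red channel for every pixel
```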


In [87]:
# --- Let's have a look at the first 3 rows of pixels:
panda[:3]

array([[[0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        ...,
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765]],

       [[0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        ...,
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765]],

       [[0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        [0.05490196, 0.10588235, 0.06666667],
        ...,
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765],
        [0.16470589, 0.12941177, 0.09411765]]], dtype=float32)