# Numpy

From their website:

"NumPy is the fundamental package for scientific computing with Python."

The name is a portmanteau of "numerical Python".

[Numpy website](https://numpy.org/)

## Why?

For most of software development, we care about variables, objects, functions, etc. Sometimes we care about arrays, but they are usually pretty simple.

However, the fundamental units of data science are arrays, vectors, and matrices. Applying mathematical operations to these in pure python is slow, frustrating, and ugly. Languages like Matlab were designed so that arrays are first-class citizens, but they don't have the extensibility of python.

To understand why numpy is so useful, let's take an example. Let's define a 3x500x500 array (three 500x500 layers) and multiply it by 3.

In [1]:
%%timeit
# in pure python

# first, define the matrix
A = []
for z_ii in range(3):
    A_xy = []
    for x_ii in range(500):
        A_y = []
        for y_ii in range(500):
            A_y.append(z_ii+1)
        A_xy.append(A_y)
    A.append(A_xy)

# now multiply by three
A3 = []
for z_ii in range(3):
    A3_xy = []
    for x_ii in range(500):
        A3_y = []
        for y_ii in range(500):
            A3_y.append(A[z_ii][x_ii][y_ii]*3)
        A3_xy.append(A3_y)
    A3.append(A3_xy)

10 loops, best of 3: 143 ms per loop


Now let's do the same in numpy

In [2]:
import numpy as np

In [3]:
%%timeit
# first create an empty array of the desired shape
A = np.empty((3,500,500))

# now assign the values in the array
A[0] = 1
A[1] = 2
A[2] = 3

# finally, multiply by 3

A3 = 3*A

The slowest run took 5.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 909 µs per loop


This is roughly 150 times faster than the pure python approach (143 ms versus 909 µs per loop).

Not only that, look how much more readable and intuitive the code is.

## An aside: vectorization

What does it mean to vectorize something?

In high-level languages that prioritize arrays, vectorizing means converting an action that's done repeatedly to data into an action that's only called once on the entire array (or sections of it).

Basically, if you have a for-loop operating on an array, vectorize it!

#### Why is it faster to vectorize?

The short answer is that numpy has routines written in very efficient lower-level languages. When you "vectorize" a problem, you're able to take advantage of these optimized routines without having to dig into the lower-level code.
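As a small illustration of the idea, here is a sketch that converts temperatures with a for-loop and then with a single vectorized expression. The temperature data and the variable names are made up for this example.

```python
import numpy as np

# hypothetical example data: 1,000 evenly spaced temperatures in Celsius
celsius = np.linspace(-40.0, 100.0, 1000)

# loop version: one python-level operation per element
fahrenheit_loop = np.empty_like(celsius)
for ii in range(celsius.size):
    fahrenheit_loop[ii] = celsius[ii] * 9.0 / 5.0 + 32.0

# vectorized version: one expression applied to the whole array
fahrenheit_vec = celsius * 9.0 / 5.0 + 32.0

# both approaches give the same numbers
print(np.allclose(fahrenheit_loop, fahrenheit_vec))  # True
```

The vectorized line does the same arithmetic, but the loop over elements runs inside numpy's compiled code instead of the python interpreter.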

---

## Some basic numpy syntax

### Creating arrays

In [4]:
# define an array
A = np.zeros((50,50))
# creates an array of zeros with shape 50x50

In [5]:
A

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [6]:
# or literally
A = np.array([0,1,2,3,4,5])

In [7]:
# from a python list
A_py = [0,1,2,3,4,5,6,7,8,9]
A = np.asarray(A_py)

#### other useful array creation tools:

In [8]:
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [9]:
# np.empty does not initialize its entries, so the values shown are arbitrary
np.empty((2,3))

array([[0.0e+000, 4.9e-324, 9.9e-324],
       [1.5e-323, 2.0e-323, 2.5e-323]])

In [10]:
np.arange(4,10)

array([4, 5, 6, 7, 8, 9])

In [11]:
# np.float was removed in newer numpy releases; the builtin float works everywhere
np.arange(4,10, dtype=float)

array([4., 5., 6., 7., 8., 9.])

In [12]:
np.linspace(1.1, 4.5, 10)

array([1.1       , 1.47777778, 1.85555556, 2.23333333, 2.61111111,
       2.98888889, 3.36666667, 3.74444444, 4.12222222, 4.5       ])

### Manipulating arrays

In [13]:
A = np.array([0,1,2,3,4,5,6,7,8,9])

In [14]:
# reshape returns a new array; A itself is left unchanged
A.reshape((2,5))
A

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
B = A.reshape((2,5))

In [16]:
C = B*5

In [17]:
B*C

array([[  0,   5,  20,  45,  80],
       [125, 180, 245, 320, 405]])

In [18]:
C.T

array([[ 0, 25],
       [ 5, 30],
       [10, 35],
       [15, 40],
       [20, 45]])

In [19]:
np.sqrt(C)

array([[0.        , 2.23606798, 3.16227766, 3.87298335, 4.47213595],
       [5.        , 5.47722558, 5.91607978, 6.32455532, 6.70820393]])

In [20]:
# be careful of the shapes
A + B

ValueError: operands could not be broadcast together with shapes (10,) (2,5) 

In [21]:
B + C

array([[ 0,  6, 12, 18, 24],
       [30, 36, 42, 48, 54]])
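Shapes don't have to be identical for arithmetic to work. Numpy lines the shapes up from the last axis backwards, and two axes are compatible when they are equal or one of them has length 1. A minimal sketch, with `row` and `col` as made-up example arrays:

```python
import numpy as np

B = np.arange(10).reshape((2, 5))

row = np.array([10, 20, 30, 40, 50])   # shape (5,)
col = np.array([[100], [200]])         # shape (2, 1)

print((B + row).shape)   # (2, 5): (5,) lines up with the last axis of B
print((B + col).shape)   # (2, 5): the length-1 axis is stretched to length 5
print(B + col)
# [[100 101 102 103 104]
#  [205 206 207 208 209]]
```

The `A + B` example above fails because the last axes have lengths 10 and 5, which are neither equal nor 1.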

In [22]:
# flatten a multi-dimensional array into one dimension
B.ravel()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Indexing arrays

Indexing arrays is fairly intuitive with numpy. As with python lists, slices are half-open, [start, end), and negative indices count from the end. Unlike nested lists, a multi-dimensional array can be indexed along every dimension inside a single pair of brackets.

In [23]:
A = np.arange(27)

In [24]:
A

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26])

In [25]:
A[0]

0

In [26]:
A[5:9]

array([5, 6, 7, 8])

In [27]:
A[-10:]

array([17, 18, 19, 20, 21, 22, 23, 24, 25, 26])

In [28]:
B = A.reshape((3,3,3))

In [29]:
B

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

In [30]:
B[0]
# or B[0,:,:]
# or B[0,...]

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [31]:
B[1,2,0]

15
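Slices can be mixed with single indices along any dimension. The following sketch rebuilds the same 3x3x3 array and pulls out a few sub-arrays:

```python
import numpy as np

B = np.arange(27).reshape((3, 3, 3))

print(B[:, 0, 0])      # first element of the first row in every block: [ 0  9 18]
print(B[1, :, -1])     # last column of the middle block: [11 14 17]
print(B[0, ::2, ::2])  # every other row and column of the first block
# [[0 2]
#  [6 8]]
```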

In [32]:
index = (B%2 == 0)

In [33]:
index

array([[[ True, False,  True],
        [False,  True, False],
        [ True, False,  True]],

       [[False,  True, False],
        [ True, False,  True],
        [False,  True, False]],

       [[ True, False,  True],
        [False,  True, False],
        [ True, False,  True]]])

In [34]:
B[index]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26])

### Random numbers in numpy

#### Uniform floats

In [35]:
np.random.uniform(10,20,5) #(low-inclusive, high-exclusive, array shape)

array([19.83780601, 19.24649296, 19.58120912, 14.6177204 , 11.01000972])

In [36]:
np.random.uniform(1,2,(2,2))

array([[1.57692805, 1.61245116],
       [1.82888635, 1.24782498]])

#### Uniform integers

In [37]:
np.random.randint(5,25,(2,3))

array([[12,  9, 11],
       [ 9, 19, 12]])

#### Normal (Gaussian)

In [38]:
np.random.normal(5,2,(3,3)) #(mean, std, size)

array([[5.74850805, 4.9884178 , 5.76269844],
       [2.43692522, 7.6132695 , 5.72125032],
       [8.66914657, 5.57828196, 5.19450285]])

#### ASIDE: Central Limit Theorem

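The central limit theorem says that the sum (or mean) of many independent random samples tends toward a normal distribution, even when the individual samples are not normally distributed. This is why so many aggregate quantities look roughly normal, and why a normal distribution is often a reasonable first approximation.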
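A quick way to see this is to average many uniform draws and check that the averages behave like a normal distribution. This is only a sketch: the seed, the sample sizes, and the use of `np.random.default_rng` (the newer generator interface, rather than the `np.random.*` functions used above) are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# 10,000 experiments, each averaging 50 draws from a uniform distribution on [0, 1)
draws = rng.uniform(0.0, 1.0, size=(10000, 50))
sample_means = draws.mean(axis=1)

# theory: the means cluster around 0.5 with std sqrt(1/12) / sqrt(50)
expected_std = np.sqrt(1.0 / 12.0) / np.sqrt(50)
print(sample_means.mean(), sample_means.std(), expected_std)

# for a normal distribution, about 68% of values lie within one std of the mean
within_one_std = np.abs(sample_means - 0.5) < expected_std
print(within_one_std.mean())
```

Each individual draw is uniform, yet the fraction of averages within one standard deviation of 0.5 should come out close to the 68% expected for a normal distribution.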

### Other useful numpy things

#### Checking NaNs

NaN is not equal to anything, including itself, so `==` cannot be used to find NaNs. Use `np.isnan` instead.

In [39]:
# checking NaNs
np.nan == np.nan

False

In [40]:
np.isnan(np.nan)

True
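`np.isnan` also works element-wise on arrays, which makes it useful as a mask. A small sketch with made-up measurements:

```python
import numpy as np

# hypothetical measurements with two missing values
data = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

missing = np.isnan(data)   # boolean mask, True where the value is NaN
print(missing.sum())       # number of NaNs: 2
print(data[~missing])      # keep only the valid values: [1. 3. 5.]
print(np.nanmean(data))    # mean that ignores NaNs: 3.0
```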

#### Using multiple boolean indexes

In [41]:
A = np.arange(100)

In [42]:
index_1 = (A%2 == 0)

In [43]:
index_2 = (A < 75)

In [44]:
index_3 = (A > 20)

In [45]:
A[index_1]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])

In [46]:
A[np.logical_and(index_1, index_2)]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74])

In [47]:
# what about all three?
# multiplying boolean arrays acts as an element-wise "and"
A[index_1*index_2*index_3]

array([22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54,
       56, 58, 60, 62, 64, 66, 68, 70, 72, 74])
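The `&`, `|`, and `~` operators are another common way to combine boolean indexes. This sketch rebuilds the three indexes and checks that `&` selects the same elements as the multiplication:

```python
import numpy as np

A = np.arange(100)
index_1 = (A % 2 == 0)
index_2 = (A < 75)
index_3 = (A > 20)

# & is the element-wise "and" for boolean arrays
combined = index_1 & index_2 & index_3
print(np.array_equal(A[combined], A[index_1 * index_2 * index_3]))  # True

# | is "or" and ~ is "not"; the parentheses are required because
# these operators bind more tightly than comparisons
odd_or_large = (A % 2 == 1) | (A >= 75)
print(np.array_equal(A[~odd_or_large], A[index_1 & index_2]))  # True
```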

#### np.where

`np.where(condition)` returns the indices where the condition is true, as a tuple with one index array per dimension.

In [48]:
A = np.random.randint(0,100,40)

In [49]:
np.where(A>50)

(array([ 0,  2,  5,  6,  7, 10, 11, 12, 13, 14, 19, 22, 26, 28, 32, 34, 35,
        37, 38]),)
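The indices can be used to pull the matching values back out. `np.where` also has a three-argument form that chooses between two values element-wise. A short sketch on a small fixed array, so the output is predictable:

```python
import numpy as np

A = np.array([10, 60, 30, 80, 55])

# unpack the single index array from the returned tuple
(indices,) = np.where(A > 50)
print(indices)      # positions where the condition holds: [1 3 4]
print(A[indices])   # the matching values: [60 80 55]

# with three arguments, np.where picks element-wise between two choices
capped = np.where(A > 50, 50, A)
print(capped)       # [10 50 30 50 50]
```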

#### Adding and removing axes

In [None]:
A = np.random.randint(0,20,(3,3))

In [None]:
A

In [None]:
A.shape

In [None]:
B = A[np.newaxis,...]

In [None]:
B

In [None]:
B.shape

This is often useful for feeding data into machine learning algorithms. For example, if you try to make a prediction on a single (512x512x3) image in tensorflow, the image needs to be expanded to have dimensions (1,512,512,3).
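For the image case, a sketch might look like the following. The all-zeros `image` is a stand-in for real data, and `np.expand_dims` is shown as an equivalent to indexing with `np.newaxis`.

```python
import numpy as np

# hypothetical image with height 512, width 512 and 3 color channels
image = np.zeros((512, 512, 3))

batch_a = image[np.newaxis, ...]          # add a leading axis by indexing
batch_b = np.expand_dims(image, axis=0)   # same result with a function call
print(batch_a.shape, batch_b.shape)       # (1, 512, 512, 3) (1, 512, 512, 3)

# np.newaxis can go in any position, e.g. a trailing axis
column = np.arange(5)[:, np.newaxis]
print(column.shape)                       # (5, 1)
```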

In [None]:
# squeeze the singleton axis
B.squeeze()

In [None]:
B.squeeze().shape

This is useful when plotting arrays. When you pass in a bunch of singleton axes, the plotter will often not know how to handle them. "Squeeze" them to remove them from the array.
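By default `squeeze` removes every singleton axis. Passing `axis` removes only the one you name, which can be safer when some length-1 axes are meaningful. A sketch with a made-up array called `stack`:

```python
import numpy as np

# hypothetical array with two singleton axes
stack = np.zeros((1, 3, 1, 3))

print(stack.squeeze().shape)         # all singleton axes removed: (3, 3)
print(stack.squeeze(axis=0).shape)   # only the first axis removed: (3, 1, 3)
```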

#### Be careful about (N,1) arrays

In [None]:
A = 5*np.ones((10,1))

In [None]:
B = 2*np.ones(10)

In [None]:
A

In [None]:
B

In [None]:
A*B

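Instead of 10 values, the result has shape (10,10): broadcasting pairs every row of the (10,1) array with every element of the (10,) array.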

In [None]:
# taking the transpose gives one product per element, with shape (1,10)
# (A.ravel()*B would give the same values with shape (10,))
A.T*B

This has caused many headaches for ML engineers. If you are calculating accuracy, loss, or any other metric and don't think about these (N,1) arrays, you will sometimes get numbers that are completely useless.
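Here is a sketch of how this can silently corrupt an accuracy calculation. The `labels` and `predictions` arrays are made up; the point is only the difference in shapes.

```python
import numpy as np

# hypothetical labels with shape (4,) and predictions with shape (4, 1)
labels = np.array([1, 0, 1, 1])
predictions = np.array([[1], [0], [0], [1]])

# broadcasting compares every prediction with every label: shape (4, 4)
wrong = (predictions == labels)
print(wrong.shape, wrong.mean())   # (4, 4) 0.5

# flattening the predictions first gives the intended element-wise comparison
right = (predictions.ravel() == labels)
print(right.shape, right.mean())   # (4,) 0.75
```

No error is raised in the first case, so checking `.shape` before computing a metric is a cheap safeguard.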

In [None]:
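import datetime

HOURLY_RATE = 5  # dollars per hour, as in the original wage calculation

def parse_hhmm(text):
    # convert a 24hr HHMM string such as "0930" into a datetime
    return datetime.datetime.strptime(text.strip(), "%H%M")

def start_end():
    # ask for the clock-in and clock-out times and return both strings
    start_time = input("Hi Nana, enter current time to clock-in using 24hr HHMM format : \n")
    print("You clocked in at \n", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

    end_time = input("Hi Nana, enter current time to clock-out using 24hr HHMM format : \n")
    print("You clocked out at \n", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    return start_time, end_time

start_time, end_time = start_end()

# hours worked, computed from the two HHMM entries
hours = (parse_hhmm(end_time) - parse_hhmm(start_time)).total_seconds() / 3600
wages = hours * HOURLY_RATE
text = "You have earned ${:.2f}"
print(text.format(wages))

# append the record to the log file
with open("hours_worked.txt", "a") as f:
    f.write("{} {} {:.2f}\n".format(start_time, end_time, wages))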