### NumPy Universal Functions

Earlier in this section we discussed universal functions and vectorization.

Recall that universal functions (**ufunc** for short) are functions that operate on an array element by element - and the vectorization simply means that the loop and operation is done at the underlying C level, not the Python level.

We have already encountered a number of universal functions, like the logic functions. But we also used other universal functions, disguised as operators:

In [1]:
import numpy as np

In [2]:
arr_1 = np.array([1, 2, 3, 4, 5])
arr_2 = np.array([5, 4, 3, 2, 1])

In [3]:
arr_1 + arr_2

array([6, 6, 6, 6, 6])

In fact, this `+` operator uses the universal function `np.add`:

In [4]:
np.add(arr_1, arr_2)

array([6, 6, 6, 6, 6])

And so it is with the other arithmetic operators too:

In [5]:
np.multiply(arr_1, arr_2)

array([5, 8, 9, 8, 5])

In [6]:
np.subtract(arr_1, arr_2)

array([-4, -2,  0,  2,  4])

And even the floor division (`//`), mod (`%`) and exponent (`**`) operators:

In [7]:
arr_1 // arr_2

array([0, 0, 1, 2, 5])

In [8]:
np.floor_divide(arr_1, arr_2)

array([0, 0, 1, 2, 5])

In [9]:
arr_1 % arr_2

array([1, 2, 0, 0, 0])

In [10]:
np.mod(arr_1, arr_2)

array([1, 2, 0, 0, 0])

In [11]:
arr_1 ** arr_2

array([ 1, 16, 27, 16,  5])

In [12]:
np.power(arr_1, arr_2)

array([ 1, 16, 27, 16,  5])

These universal functions also work between an array and a scalar (such as an `int`, `float`, etc), not just two arrays:

In [13]:
arr_1

array([1, 2, 3, 4, 5])

In [14]:
arr_1 * 2

array([ 2,  4,  6,  8, 10])

In [15]:
arr_1 ** 2

array([ 1,  4,  9, 16, 25])

Of course, NumPy provides many many more universal functions than just these ones.

For example, we have the trig functions:

In [16]:
arr = np.linspace(-2 * np.pi, 2 * np.pi, 10)
arr

array([-6.28318531, -4.88692191, -3.4906585 , -2.0943951 , -0.6981317 ,
        0.6981317 ,  2.0943951 ,  3.4906585 ,  4.88692191,  6.28318531])

In [17]:
np.sin(arr)

array([ 2.44929360e-16,  9.84807753e-01,  3.42020143e-01, -8.66025404e-01,
       -6.42787610e-01,  6.42787610e-01,  8.66025404e-01, -3.42020143e-01,
       -9.84807753e-01, -2.44929360e-16])

In [18]:
np.cos(arr)

array([ 1.        ,  0.17364818, -0.93969262, -0.5       ,  0.76604444,
        0.76604444, -0.5       , -0.93969262,  0.17364818,  1.        ])

You can find more universal functions documented here:

https://numpy.org/doc/stable/reference/ufuncs.html

#### Performance Considerations

Recall that when we fist looked at ufuncs and vectorization, I said that these provide a huge speed improvement over using a Python loop and standard Python functions/operators to perform the calculations.

Let's take a look at some code and see how good the performance improvement really is.

As a first example, we're going to calculate the multiplicative inverse (`1/x`) of every element in a list (using Python) and in an array (using NumPy).

In [19]:
from time import perf_counter

In [20]:
l = list(range(1, 1_000_000))

First we'll time creating a new list containing the multiplicative inverse of each element using a standard loop/append technique:

In [21]:
start = perf_counter()
new_list = []
for el in l:
    new_list.append(1 / el)
end = perf_counter()
print('Elapsed:', end - start)

Elapsed: 0.12296572399999994


Next, we'll do the same thing, but using a list comprehension:

In [22]:
start = perf_counter()
new_list = [1 / el for el in l]
end = perf_counter()
print('Elapsed:', end - start)

Elapsed: 0.06841258400000005


So, the comprehension approach was quite a bit faster already, but we're still using Python lists and a Python loop.

Let's try a NumPy array and a vectorized ufunc:

In [23]:
np_l = np.array(l)

In [24]:
np_l.dtype

dtype('int64')

In [25]:
start = perf_counter()
new_arr = 1 / arr
end = perf_counter()
print('Elapsed:', end - start)

Elapsed: 8.306700000004774e-05


As you can see, there was quite a speed difference between the two.

Let's look at a slightly more complicated example.

Suppose we have a matrix of Open/High/Low/Close data for some equity over time.

For simplicity, and to get some practice, I'm going to create dummy data, and I'm going to leave out the time column.

I want to end up with a matrix with `10_000_000` rows and `4` columns (OHLC), and I'll use random numbers between `100` and `200`.

Python way first:

In [26]:
import random

In [27]:
num_rows = 10_000_000

random.seed(0)

start = perf_counter()
data = [
    [
        random.randint(120, 180),
        random.randint(180, 200),
        random.randint(100, 120),
        random.randint(120, 180)
    ]
    for _ in range(num_rows)
]
end = perf_counter()
print(data[:2])
print('Elapsed', end - start)

[[174, 192, 113, 122], [136, 196, 115, 145]]
Elapsed 35.039486808


Next, let's create a new list that contains the daily price variation percentage (rounded to an int) for each row.

So, for each row, we need to calculate:
```
round((high - low) / close * 100)
```

In [28]:
start = perf_counter()
var = [ 
    round((row[1] - row[2]) / row[3] * 100)
    for row in data
]
print(var[:5])
end = perf_counter()
print('Elapsed:', end - start)

[65, 56, 52, 55, 51]
Elapsed: 2.2438287670000037


Now, let's do the same thing, but using NumPy.

We could just use our existing Python list and turn it into a NumPy array:

In [29]:
start = perf_counter()
data_np = np.array(data)
end=perf_counter()
print(data_np[:2])
print('Elapsed:', end - start)

[[174 192 113 122]
 [136 196 115 145]]
Elapsed: 4.217142658999997


And then we can perform our calculations:

In [30]:
start = perf_counter()
var = np.round((data_np[:, 1] - data_np[:, 2]) / data_np[:, 3] * 100)
end = perf_counter()
print(var[:5])
print('Elapsed:', end - start)

[65. 56. 52. 55. 51.]
Elapsed: 0.17590856700000046


Quite a performance improvement!

We opted to create the NumPy array using the Python list, but we could also just created directly using NumPy.

To do that, we are going to create separate arrays for OHLC values and then stack them horizontally (`hstack`) to combine all the columns into a single 2-D array.

In [31]:
np.random.seed(0)
start = perf_counter()
data_np = np.hstack(
    [
        np.random.randint(120, 180, (num_rows, 1)),
        np.random.randint(180, 200, (num_rows, 1)),
        np.random.randint(100, 120, (num_rows, 1)),
        np.random.randint(120, 180, (num_rows, 1))
    ]
)
end = perf_counter()
print(data_np[:5])
print('Elapsed:', end - start)

[[164 181 112 125]
 [167 199 110 136]
 [173 189 106 122]
 [120 180 105 143]
 [123 199 104 122]]
Elapsed: 0.7351292549999968


Now that was quite a bit faster than generating the same random integers using Python!

Let's put all this together to fully compare the Python vs NumPy performance difference.

Python first:

In [32]:
num_rows = 10_000_000

In [33]:
random.seed(0)
start = perf_counter()
data = [
    [
        random.randint(120, 180),
        random.randint(180, 200),
        random.randint(100, 120),
        random.randint(120, 180)
    ]
    for _ in range(num_rows)
]
var = [ 
    round((row[1] - row[2]) / row[3] * 100)
    for row in data
]
end = perf_counter()
print('Python Elapsed:', end - start)

Python Elapsed: 36.517827052


And then the NumPy way:

In [34]:
np.random.seed(0)
start = perf_counter()
data_np = np.hstack(
    [
        np.random.randint(120, 180, (num_rows, 1)),
        np.random.randint(180, 200, (num_rows, 1)),
        np.random.randint(100, 120, (num_rows, 1)),
        np.random.randint(120, 180, (num_rows, 1))
    ]
)
var = np.round((data_np[:, 1] - data_np[:, 2]) / data_np[:, 3] * 100)
end = perf_counter()
print('NumPy Elapsed:', end - start)

NumPy Elapsed: 0.6275155139999953


As you can see, NumPy was quite a bit faster.

#### Broadcasting

We can also use these universal functions with arrays that do not necessarily have the same shape, through a technique NumPy refers to as **broadcasting*.

https://numpy.org/doc/stable/user/basics.broadcasting.html

We are not going to study broadcasting in this course, but some simple cases are quite easy to understand.

Suppose we have an array of numbers, representing the number of inventory sold for different widgets.

Each row represents an order, and each column represents a specific widget, and the value in the array is the number sold for that particular widget:

In [35]:
sales = np.array(
    [
        [10, 0, 5, 3],
        [0, 0, 0, 10],
        [1, 1, 0, 0],
        [3, 0, 4, 5]
    ]
)

The next array is the sale price of each widget (in the same order as the sold columns):

In [36]:
unit_price = np.array([100, 50, 20, 10])

And this array represents the cost of each widget:

In [37]:
unit_cost = np.array([80, 10, 5, 1])

Our goal is to calculate the total profit generated from those sales.

Through broadcasting we can calculate the total sale price for each widget in each order:

In [38]:
sales * unit_price

array([[1000,    0,  100,   30],
       [   0,    0,    0,  100],
       [ 100,   50,    0,    0],
       [ 300,    0,   80,   50]])

As you can see the `unit_price` array was **broadcast** as many times as the number of rows, and then the multiplication happened.

In [39]:
total_sales = sales * unit_price

We can do the same with the costs:

In [40]:
total_cost = sales * unit_cost
total_cost

array([[800,   0,  25,   3],
       [  0,   0,   0,  10],
       [ 80,  10,   0,   0],
       [240,   0,  20,   5]])

Next we can calculate the net:

In [41]:
order_net = total_sales - total_cost
order_net

array([[200,   0,  75,  27],
       [  0,   0,   0,  90],
       [ 20,  40,   0,   0],
       [ 60,   0,  60,  45]])

To calculate the sum of all those elements, we could use some loops to sum everything up, but NumPy has a better way of doing this. The `sum` function, which will sum up every element in an array (even a multi-dimensional array):

In [42]:
np.sum(order_net)

617

There's a ton more functionality to NumPy that an introductory course cannot cover - but you should have some basic ideas of how NumPy works, and be able to read the NumPy documentation to look for functionality that you may need for your specific problems.

One of the reaons why we study NumPy, is that another library, Pandas, is built on top of NumPy. That library is one that again has a ton of functionality, but focused on data sets, which offer more functionality than just plain multi dimensional arrays. We'll look at the Pandas library a littler later in this course.