In [7]:
import numpy as np

## Cumulative Average

We all know how to take the average of a list of n numbers: add them all up, and divide the sum by n.

However, what if you don't have the entire list at hand, and instead get the numbers given to you one by one?

For example, let's say we have the list of integers from 1 to 5:

In [1]:
l = [1, 2, 3, 4, 5]

We can find the average the conventional way pretty easily:

In [2]:
avg = sum(l) / len(l)
print(avg)

3.0


But what if we only start by knowing the first number in the list, 1? Obviously, the average is just 1.

Then, when we get the next number, 2, the average becomes `(1 + 2) / 2 = 3 / 2 = 1.5`.

And, when 3 is introduced, using the average formula will give us `2`.

However, notice the inefficiency in using the average formula here: every time we see a new number, we add up all the numbers again and divide by the count to find the average, even though we had already computed most of that sum to find the previous average.

There's also the issue of memory space: every time a new number is introduced, we add it to a growing list of values.

Isn't there any way for us to take advantage of work we have already done to compute the previous average, so that computing the new one is easier?

Of course there is: we can use the cumulative average formula.

In [3]:
# let's say we start with the average of 1,2,3:
n = 3 # 3 total numbers
avg = 2 # average is 2
new_val = 4
# now, if we introduce 4, how can we update the avg without having to recompute the sum of 1,2,3?
# well, we can simply get the previous sum by multiplying the previous average by the previous count:
previous_sum = avg * n # will give 6, which is correct: 1 + 2 + 3 = 6
# then, we can simply add the new value, 4, to get the new sum:
new_sum = previous_sum + new_val # this will be 10
# and lastly, to get the new average, we divide the new sum by the new count, which increased by 1
new_avg = new_sum / (n + 1)
print(new_avg) # this will correctly give 2.5, the average of the integers 1-4

2.5


So, in conclusion, the formula for cumulative average is:

`new_average = ((old_average * old_count) + new_value) / new_count`

Using this formula, you don't even need to keep track of all the numbers you're averaging over time. The only variables you need to keep in memory are the old average and the old count (technically the new value also, but that's the input variable).

In [4]:
# if we wrote it as a function:
def cum_avg(old_avg, old_count, new_val):
    return ((old_avg * old_count) + new_val) / (old_count + 1)

In [5]:
# let's show it in action, by using it on a list of integers to create a list of cumulative averages:
l = [1,2,3,4,5] # original integers
c = [0] * len(l) # output list of cumulative averages, we start with a list of all zeros the same size as input
c[0] = l[0] # base case: the average of the first element is just that element, no math needed
# for each element after that, use the cum_avg function to get the new average
for n in range(1, len(l)):
    # here, n is the old count, c[n-1] is the old average,
    # l[n] is the new value, and c[n] is going to be the new average
    # apply the function
    c[n] = cum_avg(c[n-1], n, l[n])

# check the results:
print(c)

[1, 1.5, 2.0, 2.5, 3.0]
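The same idea extends to a true stream, where values arrive one at a time and are never stored in a list. Here is a minimal sketch, assuming a hypothetical helper named `running_averages` (not part of the original notebook). It uses an algebraically equivalent form of the formula, `new_average = old_average + (new_value - old_average) / new_count`, so the previous sum never has to be rebuilt.

```python
def running_averages(stream):
    """Yield the cumulative average after each value in `stream`.

    Only the current average and the count are kept in memory.
    """
    avg = 0.0
    count = 0
    for value in stream:
        count += 1
        # same result as ((avg * (count - 1)) + value) / count
        avg += (value - avg) / count
        yield avg


# works on any iterable, including an iterator that can only be consumed once
print(list(running_averages(iter([1, 2, 3, 4, 5]))))  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

Because it is a generator, each average is available as soon as its value arrives, and nothing but `avg` and `count` is retained between steps.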


## Finding the maximum value in an unsorted list

Given an unsorted list of numbers, return the maximum value in the list.

The most basic way to do this is a linear scan, which you can think of as a simple **greedy search algorithm**: you loop through each element in the list once, and keep track of the highest value you've seen so far. If a new value you come across is higher than the current maximum, you replace the current max with that value. This is why it's called a greedy algorithm: you're running through the list, and greedily grabbing the biggest thing you see as you go.

In [6]:
# implementation:
def get_max(l):
    current_max = l[0] # choose the first element as our current max
    # loop through the rest of the list, starting at the second element
    for elem in l[1:]:
        # if we found an element greater than our current maximum, we take it as the new max
        if elem > current_max:
            current_max = elem
    # once we've gone through the entire list, current_max will hold the absolute max value in the list
    return current_max

In [14]:
# create a random array of integers
a = np.random.randint(0,100, size=10)
print(a)
print(get_max(a))

[34 40  9 90 89 73 19 32 16 55]
90


Let's make it a bit more challenging: what if you don't want the actual maximum value, but the *index* of the maximum value (i.e. its position in the array)?

This is not too difficult to do: we simply have to keep track of another variable, the current maximum's index, and update it whenever the current maximum updates.

In [15]:
# implementation
def get_max_index(l):
    current_max = l[0]
    max_index = 0 # make a variable to hold the index of the current_max element
    # this time, we can't just loop through the elements, we need to loop through the index values
    for i in range(1,len(l)):
        elem = l[i]
        if elem > current_max:
            current_max = elem
            max_index = i # when we update current_max, also update max_index
    # return both
    return current_max, max_index

In [16]:
# test it:
a = np.random.randint(0,100, size=10)
print(a)
print(get_max_index(a))

[22 43 72 85 46 19 94 96 97 12]
(97, 8)
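As a variation, the index loop can be written with Python's built-in `enumerate`, which hands back the index and the element together. The sketch below uses a hypothetical name, `get_max_index_enumerate`, and adds an explicit check for empty input, which the version above does not have. Because the comparison is a strict `>`, ties resolve to the first occurrence of the maximum.

```python
def get_max_index_enumerate(values):
    """Return (max value, index of its first occurrence) for a non-empty sequence."""
    if len(values) == 0:
        raise ValueError("cannot take the max of an empty sequence")
    current_max, max_index = values[0], 0
    # enumerate yields (index, element) pairs, so no manual indexing is needed
    for i, elem in enumerate(values):
        if elem > current_max:
            current_max, max_index = elem, i
    return current_max, max_index


print(get_max_index_enumerate([3, 9, 2, 9]))  # (9, 1): the first 9 wins the tie
```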


Side note: just like with most common mathematical operations, numpy has built-in functions for getting both the maximum value and the index of the maximum value (we've seen both of these functions before):

In [19]:
a = np.random.randint(0,100, size=10)
print(a)
print(np.max(a)) # get max value
print(np.argmax(a)) # get index of max value
# np also has min and argmin for minimum values

[37 57 33 47  4 42  0 96 93 51]
96
7


However, you should definitely know how to do this manually, because for the final project you will need to use this concept on a **dictionary** instead of an array, and numpy's functions don't work on dictionaries. If you look around online, you will find one-liners that accomplish this task for a dictionary. If you get that kind of code to work, it's fine by me.

In [25]:
d = {i:a[i] for i in range(len(a))}
print(d)
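print(np.max(d)) # Nope: no error, but it just hands back the whole dictionary (see output)
print(np.argmax(d)) # Nope: no error, but it prints 0, not the key of the largest value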

{0: 37, 1: 57, 2: 33, 3: 47, 4: 4, 5: 42, 6: 0, 7: 96, 8: 93, 9: 51}
{0: 37, 1: 57, 2: 33, 3: 47, 4: 4, 5: 42, 6: 0, 7: 96, 8: 93, 9: 51}
0
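For reference, here is a minimal sketch of the manual approach applied to a dictionary. The function name `get_max_key` and the sample dictionary `scores` are made up for illustration, and the function assumes the dictionary is non-empty. The last line shows one of the one-liners you may find online, which relies on the `key` argument of Python's built-in `max`.

```python
def get_max_key(d):
    """Return (key, value) for the largest value in a non-empty dict."""
    items = iter(d.items())
    # start with the first (key, value) pair as the current maximum
    max_key, max_val = next(items)
    for key, val in items:
        if val > max_val:
            max_key, max_val = key, val
    return max_key, max_val


scores = {"a": 37, "b": 96, "c": 93}  # hypothetical data
print(get_max_key(scores))          # ('b', 96)
print(max(scores, key=scores.get))  # b
```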


## Apply Along Axis

### How does numpy's apply_along_axis function even work?

Numpy's `apply_along_axis` function takes an input array, and carves out 1D slices from this array along whichever axis you give it. Each of these 1D slices is passed to the function you specified, and the result takes the place of the original slice in the output (so a function that returns a single number, like `np.sum`, collapses that axis).

The hardest part is remembering which value for the axis argument applies the function along the rows or columns. Is axis=0 for rows, or is it for columns?

Well, I can tell you that even after using this function for a long time, I still get confused half of the time. So, here's the easiest way to get rid of the confusion: simply try it out on a small test array.

In [26]:
test = np.arange(12).reshape((3,4))
print(test)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [29]:
# remember the order of arguments: function, axis, data
axis0 = np.apply_along_axis(np.sum, 0, test)
axis1 = np.apply_along_axis(np.sum, 1, test)
print(axis0)
print(axis1)

[12 15 18 21]
[ 6 22 38]


As you can see, it looks like axis=0 applied the sum function to each **column**, and axis=1 applied the function to each **row**. Will I remember this off the top of my head next time? Probably not. Will I know how to easily figure it out? Definitely.
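The same experiment works with your own function, as long as it accepts a 1D array. Below is a small sketch using a made-up helper, `value_range`, that returns the spread (max minus min) of each slice. The final line cross-checks the `np.sum` result above against the array's own `sum(axis=0)` method.

```python
import numpy as np


def value_range(slice_1d):
    """Spread of a 1D slice: largest value minus smallest value."""
    return slice_1d.max() - slice_1d.min()


test = np.arange(12).reshape((3, 4))

per_column = np.apply_along_axis(value_range, 0, test)  # one result per column
per_row = np.apply_along_axis(value_range, 1, test)     # one result per row
print(per_column)  # [8 8 8 8]
print(per_row)     # [3 3 3]

# axis=0 in apply_along_axis matches axis=0 in the array's own sum method
print(np.array_equal(np.apply_along_axis(np.sum, 0, test), test.sum(axis=0)))  # True
```

The output shapes are a quick sanity check too: a 3x4 array gives 4 results for axis=0 (one per column) and 3 results for axis=1 (one per row).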

## Normalizing Data

Real life data comes in a variety of magnitudes and units: rainfall is measured in inches and centimeters, sales and profits are measured in thousands to millions of dollars, interstellar distances are measured in hundreds to billions of light years, and so on.

**Normalization** is the process of taking data of any scale and rescaling it to a fixed range of values, usually 0-1. Generally, the easiest way to do this is by choosing a lower and upper bound, and then proportionally scaling all the data values in that range such that the lower and upper bounds map to 0 and 1, while every other value falls somewhere in between. NOTE: for this to work, the lower bound MUST be lower than or equal to the minimum of the data, the upper bound must be greater than or equal to the maximum, and the two bounds must not be equal to each other (otherwise you would be dividing by zero).

In [30]:
# coding it is pretty simple:
def normalize(data, lower, upper):
    scale = upper - lower # the total range of the data
    # subtract the lower bound from each data point to get its distance above the lower bound,
    # then divide by the scale. As long as the data lies within the bounds, the results fall between 0 and 1
    return (data - lower) / scale

In [31]:
# try it out:
a = np.random.randint(0,100, size=10)
n = normalize(a, 0, 100)
print(a)
print(n)

[85 81 47 87 76  8 25  0 22 14]
[0.85 0.81 0.47 0.87 0.76 0.08 0.25 0.   0.22 0.14]


In [32]:
# well, that wasn't the best example, because it's easy to normalize a range of 0 to 100
a = np.random.randint(50,150, size=10)
n = normalize(a, 50, 150)
print(a)
print(n)

[120  76 145  85 144  54  98  50 116 106]
[0.7  0.26 0.95 0.35 0.94 0.04 0.48 0.   0.66 0.56]


In [None]:
# This second example shows more clearly what normalizing actually means:
# values closer to the upper bound end up closer to 1 (a value exactly equal to the upper bound becomes 1),
# and values closer to the lower bound end up closer to 0 (a value equal to the lower bound becomes 0).
# You can think of it like this:
# if x is a value, a is the lower bound, and b is the upper bound,
# the normalization of x answers 'what fraction of the distance from a to b has x covered?'
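When no natural bounds are known in advance, a common choice is to use the data's own minimum and maximum as the bounds. The sketch below assumes that choice; the name `normalize_minmax` is made up for illustration, and the explicit error for constant data is an added safeguard against dividing by zero.

```python
import numpy as np


def normalize_minmax(data):
    """Scale data to [0, 1] using its own min and max as the bounds."""
    data = np.asarray(data, dtype=float)
    lower, upper = data.min(), data.max()
    if upper == lower:
        # every value is identical, so the scale would be zero
        raise ValueError("cannot normalize data whose values are all equal")
    return (data - lower) / (upper - lower)


print(normalize_minmax([50, 75, 100]).tolist())  # [0.0, 0.5, 1.0]
```

With this choice, the smallest value always maps to 0 and the largest to 1, so the result describes each value's position within the observed data rather than within a fixed external range.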