### Random Numbers

https://docs.python.org/3/library/random.html

Let's look at the fundamental random number generator in the `random` module:

In [1]:
import random

In [2]:
for _ in range(5):
    print(random.random())

0.5897636861996454
0.1520479293234812
0.7959830394681118
0.3406638638113163
0.13146976976784708


This generated a sequence of 5 (pseudo) random floats in `[0.0, 1.0)`.

The pseudo-random number generator (PRNG) is deterministic: the sequence it produces is completely determined by a seed value.

If we use the **same** seed value, then repeated calls to `random()` will produce the same sequence of numbers.

In [3]:
random.seed(0)

for _ in range(5):
    print(random.random())

0.8444218515250481
0.7579544029403025
0.420571580830845
0.25891675029296335
0.5112747213686085


and if we do this again:

In [4]:
random.seed(0)

for _ in range(5):
    print(random.random())

0.8444218515250481
0.7579544029403025
0.420571580830845
0.25891675029296335
0.5112747213686085


You see we get the same sequence of generated random numbers.

But if we change the seed, then the sequence will be different:

In [5]:
random.seed(1)

for _ in range(5):
    print(random.random())

0.13436424411240122
0.8474337369372327
0.763774618976614
0.2550690257394217
0.49543508709194095


But again, for the same seed, we keep the same sequence:

In [6]:
random.seed(1)

for _ in range(5):
    print(random.random())

0.13436424411240122
0.8474337369372327
0.763774618976614
0.2550690257394217
0.49543508709194095


If we do not set a seed the way we did, or just call `seed()` or `seed(None)`, then Python seeds the generator from a randomness source provided by the operating system if one is available, and falls back to the current system time otherwise. If we never call `seed` ourselves, this happens automatically when the `random` module is first imported.

This means that every time your program runs, as long as you do not set a seed, the seed will be different, and hence your program will, for all practical purposes, not generate the same sequence twice.

Setting the seed to a fixed value can be useful for testing and debugging - but once you are done with that, you should either set the seed yourself to a non-constant value, or simply let Python pick the seed for you.

In this module, I am going to set the seed, so that you and I will see the same random numbers and behavior.
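As a small aside, here is a minimal sketch of the same determinism using separate generator objects instead of the module-level functions. The `random.Random` class is part of the standard library, but it is not used elsewhere in these notes, and the seed values `42` and `43` are arbitrary choices for illustration:

```python
import random

# Two independent generators created with the same (arbitrary) seed.
rng_a = random.Random(42)
rng_b = random.Random(42)

seq_a = [rng_a.random() for _ in range(5)]
seq_b = [rng_b.random() for _ in range(5)]

# A third generator created with a different seed.
rng_c = random.Random(43)
seq_c = [rng_c.random() for _ in range(5)]

print(seq_a == seq_b)  # same seed: the two sequences match
print(seq_a == seq_c)  # different seed: the sequences differ
```

Each generator carries its own state, so seeding one does not disturb the other.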

Let's look at another (uniform) random number generator, this time for integers in some half-open interval `[a, b)` (the upper bound `b` is excluded):

In [7]:
random.seed(0)

for _ in range(5):
    print(random.randrange(1, 6))

4
4
1
3
5


Because I reset the seed to some fixed number `0`, I can reproduce the same sequence over and over again:

In [8]:
random.seed(0)

for _ in range(5):
    print(random.randrange(1, 6))

4
4
1
3
5


We also have a function to generate random integers in some closed interval `[a, b]` (both endpoints are included):

In [9]:
random.seed(0)

for _ in range(5):
    print(random.randint(1, 6))

4
4
1
3
5
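To make the difference between the two interval conventions concrete, here is a quick sketch. It relies on the documented fact that `randint(a, b)` behaves like `randrange(a, b + 1)`. The sample size of `1_000` is just an arbitrary choice that is large enough for every possible value to show up:

```python
import random

# randint includes both endpoints...
random.seed(0)
from_randint = [random.randint(1, 6) for _ in range(1_000)]

# ...which is the same as randrange with the upper bound pushed up by one.
random.seed(0)
from_randrange = [random.randrange(1, 7) for _ in range(1_000)]

print(from_randint == from_randrange)
print(min(from_randint), max(from_randint))

# randrange(1, 6) excludes the upper bound, so 6 never appears.
random.seed(0)
half_open = [random.randrange(1, 6) for _ in range(1_000)]
print(min(half_open), max(half_open))
```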


But remember how I said that all these random number generators are built on the same base generator that `random()` uses?

What this means is that I can set the seed once (before I start using any functions in `random`), and **all** the operations thereafter will be repeatable - even if the operations are mixed up:

In [10]:
random.seed(0)

for _ in range(3):
    print(random.randint(1, 6))
    print(random.randrange(1, 6))
    
for _ in range(3):
    print(random.randint(1, 3))
    
for _ in range(3):
    print(random.random())

4
4
1
3
5
4
2
2
2
0.3580493746949883
0.8916606598206824
0.2184427269152317


And if we reset the seed and start again:

In [11]:
random.seed(0)

for _ in range(3):
    print(random.randint(1, 6))
    print(random.randrange(1, 6))
    
for _ in range(3):
    print(random.randint(1, 3))
    
for _ in range(3):
    print(random.random())

4
4
1
3
5
4
2
2
2
0.3580493746949883
0.8916606598206824
0.2184427269152317


we get the exact same result.

So you only need to set the seed once for repeatable results in your own code.

In these notebooks I am going to be resetting the seed multiple times - this is only because you and I may both be trying things out, and generating random numbers - which means our operations may get "out of sync", and by the time I generate a random number, we may no longer see the same result. This in no way is saying that you constantly need to reset your seed in a real program when testing or debugging.

Although we have not seen them yet, the same holds for operations that perform sampling on sequences, or shuffle sequences. Again, as long as you start with the same seed, all those results will be repeatable.
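As a preview, here is a minimal sketch of that repeatability for shuffling and sampling. The helper name `shuffle_and_sample` and the ten-element list are made up for this illustration. `random.shuffle`, `random.sample` and `random.choice` are standard library functions:

```python
import random

def shuffle_and_sample(seed):
    """Seed the generator, then shuffle a list and draw from it."""
    random.seed(seed)
    deck = list(range(10))
    random.shuffle(deck)             # shuffles the list in place
    picks = random.sample(deck, 3)   # 3 distinct elements
    choice = random.choice(deck)     # a single element
    return deck, picks, choice

first = shuffle_and_sample(0)
second = shuffle_and_sample(0)

print(first)
print(first == second)  # same seed, same shuffle and same picks
```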

So far, all the random number sequences have been uniformly distributed.

Let's play around with this a bit so we can visualize what that means - and at the same time start writing some more complex Python code.

Our end goal is to display a frequency chart for a sequence of random integers.

We have not covered charting yet, so we're going to hack together a poor man's version. It takes a list of values and their relative frequencies as percentages, like this:

In [12]:
data_1 = [
    (1, 12.3),
    (2, 30.7),
    (3, 20.5),
    (4, 36.5)
]

i.e., the value `1` occurred `12.3%` of the time, `2` occurred `30.7%` of the time, etc.

Think of it as a sequence of key/value pairs - just as a list of tuples, not a dict (you'll see why I want to avoid a dictionary for this later).

We'll generate a chart that looks something like this:

```
     1| ************
     2| *******************************
     3| *********************
     4| ****************************************
```

We can do that by rounding the relative frequency percentage, and printing that many `*` characters.
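One thing to keep in mind when counting stars: Python's built-in `round` rounds ties to the nearest even integer, so a percentage like `20.5` rounds down to `20`, not up to `21`. A quick check (the value `21.5` is added here only to show a tie that rounds up):

```python
values = [12.3, 30.7, 20.5, 36.5, 21.5]

# round() with a single argument returns an int,
# and ties (x.5) go to the nearest even integer.
rounded = [round(v) for v in values]

for v, r in zip(values, rounded):
    print(f"{v} -> {r}")
```

So the exact number of `*` characters in a bar can be one less than you might expect from "round half up".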

The "key" values don't have to be numbers, they could also be strings, or anything else for that matter:

In [13]:
data_2 = [
    ('a', 12.3),
    ('b', 30.7),
    ('c', 20.5),
    ('d', 36.5)
]

So with that assumption let's try to chart the data:

In [14]:
for k, v in data_2:
    print(f"{k}| {'*' * round(v)}")

a| ************
b| *******************************
c| ********************
d| ************************************


That seems like it's going to work - but it's going to look ugly for data like this:

In [15]:
data_3 = [
    ('abc', 12.3),
    ('d', 30.7),
    ('ef', 20.5),
    ('ghij', 36.5)
]

for k, v in data_3:
    print(f"{k}| {'*' * round(v)}")

abc| ************
d| *******************************
ef| ********************
ghij| ************************************


What we need to do is somehow pad the strings `abc|`, `d|` with some spaces to the left, so they are all of equal length.

We can easily left-pad a string (this is called right-justifying it) using the `rjust` method of strings:

In [16]:
s = 'abc'

s.rjust(10)

'       abc'

(We can actually specify a different padding character than the default space):

In [17]:
s.rjust(10, '-')

'-------abc'

But for us, spaces will work just fine.

But the question is, what should the padding amount be?

We don't want to have insufficient padding:

In [18]:
for k, v in data_3:
    print(f"{k.rjust(3)}| {'*' * round(v)}")

abc| ************
  d| *******************************
 ef| ********************
ghij| ************************************


Or too much:

In [19]:
for k, v in data_3:
    print(f"{k.rjust(15)}| {'*' * round(v)}")

            abc| ************
              d| *******************************
             ef| ********************
           ghij| ************************************


What we really want is to pad to the length of the longest key. We mean its length as displayed text, so if a "key" is an integer we'll have to convert it to a string first.

We can easily do this by extracting all the "keys", making sure they are all strings (just the way they get printed in our output), using the `max` function to find the longest item, and use that length as padding:

In [20]:
keys = [str(el[0]) for el in data_3]

In [21]:
keys

['abc', 'd', 'ef', 'ghij']

In [22]:
key_lengths = [len(str(el[0])) for el in data_3]
key_lengths

[3, 1, 2, 4]

In [23]:
pad = max(key_lengths)
pad

4

In [24]:
for k, v in data_3:
    print(f"{k.rjust(pad)}| {'*' * round(v)}")

 abc| ************
   d| *******************************
  ef| ********************
ghij| ************************************
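As an alternative sketch, the same right-justification can be written with a format specification inside the f-string instead of calling `rjust`. This is just another way to spell it, not something the rest of these notes depend on. The `data` list below is a copy of `data_3`:

```python
data = [('abc', 12.3), ('d', 30.7), ('ef', 20.5), ('ghij', 36.5)]

# A generator expression works with max() just as well as a list.
pad = max(len(str(k)) for k, _ in data)

# Version using str.rjust, as in the cell above.
with_rjust = [f"{str(k).rjust(pad)}| {'*' * round(v)}" for k, v in data]

# Version using a format spec: !s converts the key to a string,
# > right-aligns it, and {pad} supplies the width.
with_spec = [f"{k!s:>{pad}}| {'*' * round(v)}" for k, v in data]

for line in with_spec:
    print(line)

print(with_rjust == with_spec)
```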


OK, so let's bundle all this up in a function:

In [25]:
def chart_freq(data):
    pad = max([len(str(el[0])) for el in data])
    for k, v in data:
        print(f"{str(k).rjust(pad)}| {'*' * round(v)}")

In [26]:
chart_freq(data_1)

1| ************
2| *******************************
3| ********************
4| ************************************


In [27]:
chart_freq(data_2)

a| ************
b| *******************************
c| ********************
d| ************************************


In [28]:
chart_freq(data_3)

 abc| ************
   d| *******************************
  ef| ********************
ghij| ************************************


Ok, now that we have that, we are going to go back to our original goal - generate a sequence of random integers, and look at the distribution:

In [29]:
random.seed(0)

data = [random.randint(1, 10) for _ in range(5)]

In [30]:
data

[7, 7, 1, 5, 9]

We now need to create a dictionary whose keys are the distinct values in `data`, and whose values are the number of times each key occurs in the `data` sequence. For the example above, we want to generate this:

In [31]:
freq = {
    1: 1,
    5: 1,
    7: 2,
    9: 1
}

A simple way to do this is to build up a dictionary as we iterate through the elements of `data`:
- if the element is **not** in the dictionary, add it with a count of 1
- if the element **is** in the dictionary, add 1 to the count (the associated value) in the dict

In [32]:
freq = {}
for el in data:
    freq[el] = freq.get(el, 0) + 1
    
print(freq)

{7: 2, 1: 1, 5: 1, 9: 1}


Let's package this up into a function, whose argument will just be an iterable (but finite!):

In [33]:
def freq_distribution(data):
    freq = {}
    for el in data:
        freq[el] = freq.get(el, 0) + 1
    return freq

In [34]:
freq = freq_distribution(data)
print(freq)

{7: 2, 1: 1, 5: 1, 9: 1}
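For comparison, the standard library's `collections.Counter` does the same counting for us. We'll keep using our own function in these notes, but here is a quick sketch showing that the two agree (the function is repeated so the snippet runs on its own):

```python
from collections import Counter

def freq_distribution(data):
    freq = {}
    for el in data:
        freq[el] = freq.get(el, 0) + 1
    return freq

data = [7, 7, 1, 5, 9]

manual = freq_distribution(data)
counted = Counter(data)  # Counter is a dict subclass

print(counted)
print(dict(counted) == manual)
```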


The next thing we want is to calculate relative frequencies (as percentages).

To calculate the relative frequency we simply divide each frequency by the sum of the frequencies:

In [35]:
sum_freq = sum(freq.values())
print(sum_freq)

5


and the relative frequencies can be calculated as follows (creating a new dictionary, not mutating the existing one).

One approach is to make a copy of the dictionary, and mutate it:

In [36]:
relative_freq = freq.copy()

and then just update each value:

In [37]:
for k in relative_freq:
    relative_freq[k] = relative_freq[k] / sum_freq * 100
    
print(relative_freq)

{7: 40.0, 1: 20.0, 5: 20.0, 9: 20.0}


Another way, probably more Pythonic, is to use a dictionary comprehension to create the new dictionary:

In [38]:
relative_freq = {
    k: v / sum_freq * 100 for k, v in freq.items()
}

print(relative_freq)

{7: 40.0, 1: 20.0, 5: 20.0, 9: 20.0}


Let's package this up as a function, that will receive a frequency distribution dict as an argument:

In [39]:
def relative_freq(freq_dist):
    sum_freq = sum(freq_dist.values())
    return {
        k: v / sum_freq * 100 for k, v in freq_dist.items()
    }
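A simple sanity check for a function like this is that the percentages add up to 100. Here is a minimal sketch of that check. It uses `math.isclose` rather than `==`, since floating point sums are not guaranteed to come out exact in general (the function is repeated so the snippet runs on its own):

```python
import math

def relative_freq(freq_dist):
    sum_freq = sum(freq_dist.values())
    return {
        k: v / sum_freq * 100 for k, v in freq_dist.items()
    }

freq = {7: 2, 1: 1, 5: 1, 9: 1}
rel = relative_freq(freq)

total = sum(rel.values())
print(rel)
print(math.isclose(total, 100.0))
```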

Now, we are almost ready to put all this together:

In [40]:
data

[7, 7, 1, 5, 9]

In [41]:
freq = freq_distribution(data)
print(freq)

{7: 2, 1: 1, 5: 1, 9: 1}


In [42]:
rel = relative_freq(freq)
print(rel)

{7: 40.0, 1: 20.0, 5: 20.0, 9: 20.0}


So we can't chart that quite yet - our charting function requires a list of tuples (`key, value`).

We could just use the `items()` method of the dictionary:

In [43]:
chart_freq(rel.items())

7| ****************************************
1| ********************
5| ********************
9| ********************


But notice how the keys are in the order they were first encountered in the original `data` - we want to sort that first.

In [44]:
sorted_items = sorted(rel.items(), key=lambda x: x[0])

In [45]:
sorted_items

[(1, 20.0), (5, 20.0), (7, 40.0), (9, 20.0)]

Ok, so now we can use that data for charting:

In [46]:
chart_freq(sorted_items)

1| ********************
5| ********************
7| ****************************************
9| ********************
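The `key` argument is what decides the sort order, so the same approach lets us order the chart differently. Here is a small sketch that sorts by frequency instead, highest first. The variable names are just for illustration:

```python
rel = {7: 40.0, 1: 20.0, 5: 20.0, 9: 20.0}

# Tuples compare element by element, so sorting the items
# without a key also orders them by the dictionary key.
by_key = sorted(rel.items())

# Sort by the percentage (second element), largest first.
# The sort is stable, so ties keep their original relative order.
by_freq = sorted(rel.items(), key=lambda item: item[1], reverse=True)

print(by_key)
print(by_freq)
```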


Finally, let's put together a function that will generate `n` random numbers in some range `[a, b]`, and chart it:

In [47]:
def analyze_randint(n, a, b):
    data = [random.randint(a, b) for _ in range(n)]
    
    freq = freq_distribution(data)
    rel = relative_freq(freq)
    
    sorted_items = sorted(rel.items(), key=lambda x: x[0])
    chart_freq(sorted_items)

Before we run this, let's all get on the same page and reset our random generator:

In [48]:
random.seed(0)

In [49]:
analyze_randint(10, 1, 10)

1| **********
5| ********************
6| **********
7| ******************************
8| ********************
9| **********


Now let's increase the total number of random numbers, and see if we get closer to a uniform distribution:

In [50]:
analyze_randint(10_000, 1, 10)

 1| **********
 2| **********
 3| **********
 4| **********
 5| **********
 6| **********
 7| **********
 8| **********
 9| *********
10| **********


Much closer to a uniform distribution.
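We can also put a rough number on "closer". The sketch below measures the largest gap, in percentage points, between an observed relative frequency and the ideal `100 / (b - a + 1)`. The helper name `max_deviation` is made up for this illustration, and it uses `collections.Counter` for the counting:

```python
import random
from collections import Counter

def max_deviation(n, a, b):
    """Largest gap (in percentage points) from a perfectly uniform chart."""
    counts = Counter(random.randint(a, b) for _ in range(n))
    expected = 100 / (b - a + 1)
    return max(
        abs(counts.get(value, 0) / n * 100 - expected)
        for value in range(a, b + 1)
    )

random.seed(0)
small = max_deviation(10, 1, 10)
large = max_deviation(10_000, 1, 10)

print(f"n=10:     off by up to {small:.1f} percentage points")
print(f"n=10_000: off by up to {large:.1f} percentage points")
```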

We have other random number generators, which will approximate other distributions.

For example, there is one for normal distributions. (There are actually two, `gauss` and `normalvariate`, but unless you are writing multi-threaded applications, which we definitely won't in this course, use the `gauss` function - it is slightly faster.)

In [51]:
random.gauss(0, 1)

-0.2035356944005242

If we calculate frequencies for a sequence of numbers generated this way, we should get a normal distribution centered at `0` with a standard deviation of `1`.
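Before charting anything, a quick numerical sanity check is to look at the sample mean and standard deviation. Here is a minimal sketch using the standard library `statistics` module. The sample size of `10_000` is an arbitrary choice:

```python
import random
import statistics

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(10_000)]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)

# Both should land close to the requested mu=0 and sigma=1.
print(f"mean:  {mean:.3f}")
print(f"stdev: {stdev:.3f}")
```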

Technically, to see this, we should bucket the data (create small intervals of floats and count how many numbers fall in each interval).

I'm going to hack this - we'll generate the normally distributed random numbers, multiply each one by `10`, and round the result. Then we count how many times each rounded number occurs. Essentially we are creating buckets of scaled values like `[3.5, 4.5)` for the bucket `4`, etc.

In [52]:
data = [round(10 * random.gauss(0, 1)) for _ in range(10)]
print(data)

[-9, -9, 4, 12, 5, -19, 26, -6, 9, 16]


So let's write a function to do this:

In [53]:
def analyze_normal(n, mu, sigma):
    data = [round(10 * random.gauss(mu, sigma)) for _ in range(n)]
    
    freq = freq_distribution(data)
    rel = relative_freq(freq)
    
    sorted_items = sorted(rel.items(), key=lambda x: x[0])
    chart_freq(sorted_items)

In [54]:
random.seed(0)

analyze_normal(100_000, 0, 1)

-42| 
-41| 
-38| 
-37| 
-36| 
-35| 
-34| 
-33| 
-32| 
-31| 
-30| 
-29| 
-28| 
-27| 
-26| 
-25| 
-24| 
-23| 
-22| 
-21| 
-20| 
-19| *
-18| *
-17| *
-16| *
-15| *
-14| **
-13| **
-12| **
-11| **
-10| **
 -9| ***
 -8| ***
 -7| ***
 -6| ***
 -5| ***
 -4| ****
 -3| ****
 -2| ****
 -1| ****
  0| ****
  1| ****
  2| ****
  3| ****
  4| ****
  5| ****
  6| ***
  7| ***
  8| ***
  9| ***
 10| **
 11| **
 12| **
 13| **
 14| *
 15| *
 16| *
 17| *
 18| *
 19| *
 20| *
 21| 
 22| 
 23| 
 24| 
 25| 
 26| 
 27| 
 28| 
 29| 
 30| 
 31| 
 32| 
 33| 
 34| 
 35| 
 36| 
 37| 
 38| 
 39| 
 40| 


We could clean this up a bit by removing entries that will result in zero stars:

In [55]:
def analyze_normal(n, mu, sigma):
    data = [round(10 * random.gauss(mu, sigma)) for _ in range(n)]
    
    freq = freq_distribution(data)
    rel = relative_freq(freq)
    
    filtered_items = {
        k: v
        for k, v in rel.items()
        if round(v) > 0
    }
    sorted_items = sorted(filtered_items.items(), key=lambda x: x[0])
    chart_freq(sorted_items)

In [56]:
random.seed(0)

analyze_normal(100_000, 0, 1)

-19| *
-18| *
-17| *
-16| *
-15| *
-14| **
-13| **
-12| **
-11| **
-10| **
 -9| ***
 -8| ***
 -7| ***
 -6| ***
 -5| ***
 -4| ****
 -3| ****
 -2| ****
 -1| ****
  0| ****
  1| ****
  2| ****
  3| ****
  4| ****
  5| ****
  6| ***
  7| ***
  8| ***
  9| ***
 10| **
 11| **
 12| **
 13| **
 14| *
 15| *
 16| *
 17| *
 18| *
 19| *
 20| *
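For reference, here is a rough sketch of what "proper" bucketing could look like, in contrast to the multiply-and-round hack used above. The helper name `bucket_counts` and the bucket width of `0.5` are assumptions made up for this illustration. Each value is assigned to the half-open interval `[k * width, (k + 1) * width)` by taking the floor of `value / width`:

```python
import math
import random
from collections import Counter

def bucket_counts(values, width):
    """Count how many values fall in each [k * width, (k + 1) * width)."""
    return Counter(math.floor(v / width) for v in values)

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(10_000)]
counts = bucket_counts(samples, 0.5)

for k in sorted(counts):
    low = k * 0.5
    share = counts[k] / len(samples) * 100
    if round(share) > 0:
        print(f"[{low:5.1f}, {low + 0.5:5.1f}) | {'*' * round(share)}")
```

With wider buckets there are fewer rows and each bar is longer, which makes the bell shape easier to see in a text chart.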
