Chapter 6
# Random Numbers

We don't need true randomness in machine learning.  Instead we can use pseudorandomness, i.e. a sequence of numbers that looks close to random but is generated by a deterministic process.

Shuffling data and initialising coefficients with random values both rely on pseudorandom number generators.

The numbers are generated in a sequence.  The sequence is deterministic and is seeded with an initial number.  The same seeding of the process will result in the same sequence of random numbers.

If you do not explicitly seed the pseudorandom number generator, it may use the current system time as the seed.

# Python: Seeding
Python provides a module called random offering a suite of functions for generating random numbers.  Python uses the Mersenne Twister as its pseudorandom number generator.

If no seed is specified, Python seeds the generator from a source of randomness provided by the operating system where one is available, and falls back to the current system time otherwise.

It can be useful to control the randomness by setting the seed to ensure that your code produces the same result each time, such as in a production model.

For running experiments where randomisation is used to control for confounding variables, a different seed may be used for each experimental run.

In [1]:
# seed the pseudorandom number generator
from random import seed
from random import random

# seed random number generator (the argument is an integer)
seed(1)

# generate some random numbers
print(random(), random(), random())

# reset the seed
seed(1)

# generate some random numbers - the same sequence of numbers is generated
print(random(), random(), random())

0.13436424411240122 0.8474337369372327 0.763774618976614
0.13436424411240122 0.8474337369372327 0.763774618976614
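
Using a different seed for each experimental run can be sketched as follows. This is a minimal illustration, not a prescribed method: `run_experiment` is a hypothetical function, and each run uses its own `random.Random` instance so that the global generator is left untouched.

```python
from random import Random

def run_experiment(seed_value):
    """Hypothetical experiment: draw three values from a private generator."""
    # a separate generator instance, independent of the module-level one
    rng = Random(seed_value)
    return [rng.random() for _ in range(3)]

# use a different seed for each experimental run
results = {s: run_experiment(s) for s in (1, 2, 3)}
for s, values in results.items():
    print(s, values)

# repeating a run with the same seed reproduces it exactly
print(run_experiment(1) == results[1])
```

Recording the seed alongside each result makes it possible to re-run any single experiment later.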


# Python: Random Floating Point Values
The random() function generates values in the range [0, 1), i.e. 0 is included in the range but 1 is excluded.  Values are drawn from a uniform distribution, so each value has an equal chance of being drawn.

Floating point values can be rescaled to a desired range by:
- multiplying them by the size of the new range
- adding the min value

scaledvalue = min + (value x (max - min))

In [3]:
# generate random floating point values
from random import seed
from random import random

# seed random number generator
seed(1)

# generate random numbers between 0-1
for _ in range(10):
    value = random()
    print(value)

0.13436424411240122
0.8474337369372327
0.763774618976614
0.2550690257394217
0.49543508709194095
0.4494910647887381
0.651592972722763
0.7887233511355132
0.0938595867742349
0.02834747652200631
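
The rescaling formula given in this section can be sketched in code as below. The helper name `scale` and the target range of 5 to 15 are assumptions chosen for illustration.

```python
from random import seed, random

def scale(value, low, high):
    """Rescale a value from [0, 1) to the range starting at low and ending at high."""
    return low + value * (high - low)

# seed random number generator
seed(1)

# generate values in [0, 1) and rescale each to the range 5 to 15
scaled = [scale(random(), 5.0, 15.0) for _ in range(5)]
print(scaled)
```

The standard library function random.uniform(a, b) applies the same arithmetic, so the helper is only needed when the unscaled values are also required.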


# Python: Random Integer Values
Random integer values can be generated using the randint() function, specifying the start and end of the range.  Integer values are generated in the range [start, end] i.e. both start and end values are included.

In [8]:
# generate random integer values
from random import seed
from random import randint

# seed random number generator
seed(1)

# generate some integers
for _ in range(10):
    value = randint(0, 10)
    print(value)

2
9
1
4
1
7
7
7
10
6


# Python: Random Gaussian Values
Random floating point values can be drawn from a Gaussian (aka normal) distribution using the gauss() function, specifying the mean and standard deviation.

In [9]:
# generate random Gaussian values
from random import seed
from random import gauss

# seed random number generator
seed(1)

# generate some Gaussian values
for _ in range(10):
    value = gauss(0, 1)
    print(value)

1.2881847531554629
1.449445608699771
0.06633580893826191
-0.7645436509716318
-1.0921732151041414
0.03133451683171687
-1.022103170010873
-1.4368294451025299
0.19931197648375384
0.13337460465860485


# Python: Randomly Choosing from a List
The choice() function can be used to randomly select an item from a list.  Selections are made with uniform likelihood.

In [10]:
# choose a random element from a list
from random import seed
from random import choice

# seed random number generator
seed(1)

# prepare a sequence
sequence = [i for i in range(20)]
print(sequence)

# make choices from the sequence
for _ in range(5):
    selection = choice(sequence)
    print(selection)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
4
18
2
8
3


# Python: Random Subsample from a List
We may be interested in repeating the random selection of items from a list to create a randomly chosen subset.  Once an item is selected from the list and added to the subset, it should not be added again i.e. selection without replacement.  This is achieved using the sample() function, specifying the list and the size of the subset.

In [11]:
# select a random sample without replacement
from random import seed
from random import sample

# seed random number generator
seed(1)

# prepare a sequence
sequence = [i for i in range(20)]
print(sequence)

# select a subset without replacement
subset = sample(sequence, 5)
print(subset)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[4, 18, 2, 8, 3]
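
To make the idea of selection without replacement concrete, the sketch below contrasts sample() with the standard library function choices(), which selects with replacement. The contrast is added here for illustration only.

```python
from random import seed, sample, choices

# seed random number generator
seed(1)

# prepare a sequence
sequence = [i for i in range(20)]

# without replacement: every selected item is distinct
without = sample(sequence, 5)
print(without)

# with replacement: the same item may be selected more than once
with_replacement = choices(sequence, k=5)
print(with_replacement)
```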


# Python: Randomly Shuffle a List
The shuffle() function can be used to shuffle a list in place (rather than returning a shuffled copy of the original list).

In [12]:
# randomly shuffle a sequence
from random import seed
from random import shuffle

# seed random number generator
seed(1)

# prepare a sequence
sequence = [i for i in range(20)]
print(sequence)

# randomly shuffle the sequence
shuffle(sequence)
print(sequence)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[11, 5, 17, 19, 9, 0, 16, 1, 15, 6, 10, 13, 14, 12, 7, 3, 8, 2, 18, 4]


# NumPy: Seeding
NumPy's legacy random functions used in this chapter (numpy.random.seed(), rand(), randint() and so on) also use the Mersenne Twister pseudorandom number generator.  However, note that seeding the Python pseudorandom number generator does not affect the NumPy pseudorandom number generator.

In [13]:
# seed the pseudorandom number generator
from numpy.random import seed
from numpy.random import rand

# seed random number generator, specifying an integer value
seed(1)

# generate some random numbers
print(rand(3))

# reset the seed
seed(1)

# generate some random numbers - the same sequence of numbers is generated
print(rand(3))

[4.17022005e-01 7.20324493e-01 1.14374817e-04]
[4.17022005e-01 7.20324493e-01 1.14374817e-04]


# NumPy: Array of Random Floating Point Values
The rand() NumPy function is used to generate an array of random floating point values, specifying the size of the array.  If no argument is provided, a single random value is created.

In [14]:
# generate random floating point values
from numpy.random import seed
from numpy.random import rand

# seed random number generator
seed(1)

# generate random numbers between 0-1
values = rand(10)
print(values)

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
 3.96767474e-01 5.38816734e-01]


# NumPy: Array of Random Integer Values
The randint() NumPy function can be used to generate an array of random integers, specifying the lower end of the range, the upper end of the range, and the number of integer values to generate.  Random integers are drawn from the range [lower, upper), i.e. including the lower value but excluding the upper value.

In [15]:
# generate random integer values
from numpy.random import seed
from numpy.random import randint

# seed random number generator
seed(1)

# generate some integers
values = randint(0, 10, 20)
print(values)

[5 8 9 5 0 0 1 7 6 9 2 4 5 2 4 2 4 7 7 9]


# NumPy: Array of Random Gaussian Values
An array of random Gaussian values can be generated using the randn() NumPy function, specifying the size of the resulting array.  Values are drawn from a standard Gaussian distribution with mean 0.0 and standard deviation 1.0.

Values can be rescaled to a Gaussian distribution with a different mean and standard deviation by:
- multiplying the value by the standard deviation of the desired distribution
- adding the mean of the desired distribution

scaledvalue = mean + ( value x stdev )

In [16]:
# generate random Gaussian values
from numpy.random import seed
from numpy.random import randn

# seed random number generator
seed(1)

# generate some Gaussian values
values = randn(10)
print(values)

[ 1.62434536 -0.61175641 -0.52817175 -1.07296862  0.86540763 -2.3015387
  1.74481176 -0.7612069   0.3190391  -0.24937038]
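
The scaling of standard Gaussian values can be sketched as below. The target mean of 50 and standard deviation of 5 are arbitrary values assumed for illustration; with a large enough sample, the sample statistics should land close to them.

```python
from numpy.random import seed, randn

# seed random number generator
seed(1)

# parameters of the desired distribution (illustrative values)
mean, stdev = 50.0, 5.0

# draw standard Gaussian values, then scale and shift the whole array at once
values = mean + randn(1000) * stdev

# the sample mean and standard deviation approximate the chosen parameters
print(values.mean(), values.std())
```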


# NumPy: Shuffle NumPy Array
A NumPy array can be randomly shuffled in place using the shuffle() NumPy function.  The function also accepts a Python list, which is what the example here shuffles.

In [17]:
# randomly shuffle a sequence
from numpy.random import seed
from numpy.random import shuffle

# seed random number generator
seed(1)

# prepare a sequence
sequence = [i for i in range(20)]
print(sequence)

# randomly shuffle the sequence
shuffle(sequence)
print(sequence)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[3, 16, 6, 10, 2, 14, 4, 17, 7, 1, 13, 0, 19, 18, 9, 15, 8, 12, 11, 5]
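
For completeness, here is a minimal sketch of the same shuffle applied to an actual NumPy array rather than a Python list. The array is built with arange(), which is an assumption made for this example.

```python
from numpy import arange
from numpy.random import seed, shuffle

# seed random number generator
seed(1)

# prepare a NumPy array holding 0..19
array = arange(20)
print(array)

# shuffle the array in place; nothing is returned
shuffle(array)
print(array)
```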


# When to Seed the Random Number Generator
There are times during a predictive modeling project when you should consider seeding the random number generator, e.g.
- Data Preparation - this must be consistent so that the data is always prepared in the same way during fitting, evaluation and when making predictions with the final model
- Data Splits - splits for train/test or k-fold must be made consistently to ensure that each algorithm is trained and evaluated in the same way on the same subsamples of data
- Demonstrations - when demonstrating an algorithm in a tutorial environment
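
A consistent data split can be sketched as follows. This is a simplified illustration, not a library API: `split_indices` is a hypothetical helper that shuffles row indices with a private, seeded generator so that the same seed always yields the same train and test sets.

```python
from random import Random

def split_indices(n_rows, n_test, seed_value):
    """Hypothetical helper: return (train, test) row indices for a seeded split."""
    indices = list(range(n_rows))
    # a private generator, so the split does not depend on global random state
    Random(seed_value).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# two calls with the same seed give the same split
train_a, test_a = split_indices(10, 3, seed_value=1)
train_b, test_b = split_indices(10, 3, seed_value=1)
print(train_a, test_a)
print(train_a == train_b and test_a == test_b)
```

Every algorithm being compared can then be trained and evaluated on exactly the same rows.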

# How to Control for Randomness
A stochastic machine learning algorithm will learn slightly differently each time it is run on the same data, resulting in a model with slightly different performance each time it is trained.
One option is to fit the model using the same sequence of random numbers each time, by fixing the seed.  When evaluating a model this is bad practice, as it hides the inherent uncertainty within the model.
A better approach is to evaluate the algorithm in such a way that the reported performance includes the measured uncertainty in the performance of the algorithm.  This can be done by repeating the evaluation of the algorithm multiple times with different sequences of random numbers.
There are two aspects of uncertainty to consider:
- Data Uncertainty - evaluating an algorithm on multiple splits of the data gives insight into how the algorithm's performance varies with changes to the train and test data
- Algorithm Uncertainty - evaluating an algorithm multiple times on the same splits of data gives insight into how the algorithm's performance varies due to its own randomness alone

In general, both of these sources of uncertainty should be reported.
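
Repeating an evaluation with different sequences of random numbers and reporting the spread can be sketched as below. Note that `evaluate_model` is a hypothetical stand-in that simulates a noisy score; in a real project it would fit and score a model using the given seed.

```python
from random import Random
from statistics import mean, stdev

def evaluate_model(seed_value):
    """Hypothetical stand-in for fitting and scoring a stochastic model."""
    rng = Random(seed_value)
    # simulated accuracy of about 0.80 with some run-to-run noise
    return 0.80 + rng.gauss(0, 0.02)

# repeat the evaluation with a different seed each time
scores = [evaluate_model(s) for s in range(10)]

# report the average performance together with its variability
print('mean=%.3f stdev=%.3f (n=%d)' % (mean(scores), stdev(scores), len(scores)))
```

Reporting the mean together with the standard deviation exposes the uncertainty that a single seeded run would hide.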

In summary:
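- there are times when randomness requires careful control, e.g. seeding so that data preparation and splits are reproducible
- there are times when randomness needs to be controlled for, e.g. repeating an evaluation with different sequences of random numbers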

# Extensions

In [18]:
# Confirm that seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator
from random import seed
from numpy.random import randint

# seed random number generator using Python
seed(1)

# generate some integers using NumPy
values = randint(0, 10, 10)
print(values)

# seed random number generator using Python
seed(1)

# generate some integers using NumPy - a different sequence will be generated
values = randint(0, 10, 10)
print(values)

[2 4 2 4 7 7 9 1 7 0]
[6 9 9 7 6 9 1 0 1 8]


In [28]:
# Practice generating random numbers between different ranges
from numpy.random import randint

print('NumPy random integer between 0 and 10')
values = randint(0, 10, 10)
print(values)
print('')

print('NumPy random integer between 10 and 20')
values = randint(10, 20, 10)
print(values)
print('')

print('Python random integer between 0 and 10')
from random import randint
for _ in range(10):
    value = randint(0, 10)
    print(value)
print('')

print('Python random integer between 10 and 20')
for _ in range(10):
    value = randint(10, 20)
    print(value)
print('')

print('Python float (to 2 dp) between 0 and 1000')
from random import random
for _ in range(10):
    value = (random() * 1000)
    print('%.2f' % value)

NumPy random integer between 0 and 10
[3 8 3 5 6 7 5 1 7 0]

NumPy random integer between 10 and 20
[12 18 12 11 14 10 14 11 17 13]

Python random integer between 0 and 10
3
9
10
6
9
3
7
1
10
6

Python random integer between 10 and 20
14
18
17
10
15
19
16
14
10
12

Python float (to 2 dp) between 0 and 1000
200.85
327.74
987.05
782.70
339.10
213.03
674.46
837.70
932.19
343.85


In [33]:
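# Locate the equation for and implement a very simple pseudorandom number generator
# This is a linear congruential generator: x[i] = (multiplier * x[i-1] + increment) % modulus
def get_random_sequence(seed, modulus, multiplier, increment):
    # the sequence starts with the seed itself
    values = [seed]
    # each new value depends only on the previous value
    for i in range(1, 10):
        values.append(((multiplier * values[i-1]) + increment) % modulus)
    return values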

print(get_random_sequence(21, 88, 51, 42))
print(get_random_sequence(21, 89, 51, 42))
print(get_random_sequence(21, 88, 52, 42))
print(get_random_sequence(21, 88, 51, 43))

[21, 57, 45, 49, 77, 9, 61, 73, 69, 41]
[21, 45, 23, 58, 63, 51, 62, 0, 42, 48]
[21, 78, 50, 2, 58, 66, 42, 26, 74, 18]
[21, 58, 9, 62, 37, 82, 1, 6, 85, 66]
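
The same recurrence can be turned into a source of floating point values in [0, 1) by dividing each state by the modulus. The sketch below is an illustrative variation and not part of the original exercise; the name `lcg_floats` is hypothetical and the parameters are reused from the first call above.

```python
from itertools import islice

def lcg_floats(seed, modulus, multiplier, increment):
    """Yield an endless stream of values in [0, 1) from a linear congruential generator."""
    state = seed
    while True:
        # advance the state using the same recurrence as get_random_sequence
        state = (multiplier * state + increment) % modulus
        # the state is always below the modulus, so the ratio lies in [0, 1)
        yield state / modulus

# take the first five values from the stream
first = list(islice(lcg_floats(21, 88, 51, 42), 5))
print(first)
```

With such small parameters the sequence repeats quickly, so this is only suitable as a teaching example.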
