# GMIT_Programming_for_data_analysis_Assignment
Contains the practical assignment for the programming for data analysis module

### Problem statement
The following assignment concerns the numpy.random package in Python [1]. You are
required to create a Jupyter [2] notebook explaining the use of the package, including
detailed explanations of at least five of the distributions provided for in the package.
There are four distinct tasks to be carried out in your Jupyter notebook.
1. Explain the overall purpose of the package.
2. Explain the use of the “Simple random data” and “Permutations” functions.
3. Explain the use and purpose of at least five “Distributions” functions.
4. Explain the use of seeds in generating pseudorandom numbers.





### 1. Explain the overall purpose of the package.

There is an interesting analogy between the numpy package itself and its child package, numpy.random [3]. Just as numpy is much more than a package for efficiently creating multi-dimensional arrays, numpy.random is much more than a package for efficiently creating random numbers (or rather pseudorandom numbers; pseudorandomness is discussed below, under the 'SeedSequence' object). Just as numpy allows the user to easily and efficiently *do* things with matrices, numpy.random allows one to *do* things with random numbers, namely, generate them according to what are called *probability distributions*. In this sense, numpy.random lets the user generate numbers that are all equally likely as well as numbers where some values are more probable than others. numpy.random also has further use cases, such as randomly choosing a selection of items from an array or shuffling the contents of an array.

The 'structure' of the numpy.random package is as follows:

1. The object that the user is expected to interact with is the 'Generator' [4]. This is the object that can call the various functions which generate numbers according to a certain probability distribution (as well as other functions such as shuffling).


2. This 'Generator' object itself uses another lower-level object that the user generally does not need to interact with, as this object can be created and handled in the background when the user creates the 'Generator' object. This lower-level object is the 'BitGenerator' [5], and it is responsible for actually generating the random stream of bits which the 'Generator' then takes up and transforms according to a probability distribution. The 'BitGenerator' can be thought of as the conveyor belt that supplies the numpy.random factory with random bits, and the 'Generator' as the machine that 'decides' which bits to keep and which to discard according to the probability distribution that the user, the machine operator, has selected.


3. There is one further object in the numpy.random package, the 'SeedSequence' [6]. This object is used to determine the initial entropy that will be used by the 'BitGenerator' object to generate its pseudorandom bitstream. When a 'BitGenerator' object is created without first specifying the 'SeedSequence' to be used, the entropy is taken from the Operating System itself. The only case where this is not acceptable is where the pseudorandom bitstreams (and thus the pseudorandom numbers generated by the 'Generator' object) need to be repeatable for whatever reason. In this case, the user can create a 'SeedSequence' object, note the entropy of that object (SeedSequence.entropy), and then use that same initial entropy every time a 'BitGenerator' object needs to be created. This will allow the user to obtain identical sets of pseudorandom numbers (hence their *pseudorandomness* - they are actually created from a *specified* initial entropy and a *specified* algorithm).

Although one could first create a SeedSequence object using the SeedSequence constructor, then pass that to a BitGenerator constructor to create a BitGenerator object, and finally pass that to the Generator constructor to create a Generator object that can call the various functions that generate the random numbers etc., the numpy documentation recommends simply creating a Generator object directly using the **default_rng** constructor [11], which handles the bit stream generation in the background. I will use this default constructor except where I wish to demonstrate what manually creating a SeedSequence object can achieve.
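
The two routes described above can be compared side by side. The following is a minimal sketch, assuming that default_rng uses PCG64 as its BitGenerator (as the current documentation states); the seed value and the variable names are illustrative only and are not taken from the notebook.

```python
import numpy as np

seed = 12345  # arbitrary illustrative seed (an assumption, not from the notebook)

# Explicit chain: SeedSequence -> BitGenerator (PCG64) -> Generator
seed_seq = np.random.SeedSequence(seed)
bit_gen = np.random.PCG64(seed_seq)
rng_manual = np.random.Generator(bit_gen)

# Shortcut: default_rng builds the same chain in the background
rng_default = np.random.default_rng(seed)

manual_draws = rng_manual.random(3)
default_draws = rng_default.random(3)
print(manual_draws)
print(default_draws)
```

Under that assumption both lines print the same three floats, which suggests the shortcut loses nothing compared with the manual chain.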

### 2. Explain the use of the “Simple random data” and “Permutations” functions.


There are four functions that can be called by a Generator object and fall under the heading, 'Simple random data':

1. **integers**(low[, high, size, dtype, endpoint]) - returns random integers from low (inclusive) to high (exclusive), or if endpoint=True, low (inclusive) to high (inclusive). [7]


2. **random**([size, dtype, out]) - returns random floats in the half-open interval [0.0, 1.0). [8]


3. **choice**(a[, size, replace, p, axis, shuffle]) - generates a random sample from a given array. [9]


4. **bytes**(length) - returns random bytes. [10]


What each of these functions does is clearly explained in the docs, and in any case there is not much that needs explaining. The **integers** function generates random integers, the **random** function generates random floating point numbers, the **choice** function randomly selects a given number of items from an inputted array, and the **bytes** function generates random bytes of a length specified by the user.






1. The **integers** function has one mandatory parameter, **low**. If **high** is not given, values are drawn from 0 (inclusive) up to **low** (exclusive). The parameters are:


* int or tuple of ints, **size** - in the case of an int, this determines the length of the array returned; in the case of a tuple of ints, it determines the shape of the array returned
* int or array of ints, **low** & **high** - the lower and upper bounds of the values to draw from. If arrays are given, they are broadcast against each other and each element of the result is drawn using the corresponding bounds
* dtype, **dtype** - the dtype to be returned (the default is np.int64, although other integer types, such as unsigned 8, 16, 32 and 64 bit integers, can also be specified)
* boolean, **endpoint** - if True, high is included in the range of possible values; the default is False

In [1]:
import numpy as np
rng = np.random.default_rng()

print(rng.integers(2, size=10)) # random
print(rng.integers(1, size=10)) # only one value to choose from!

rng.integers([1, 3, 5, 7], high=[[10], [20]], dtype=np.uint16)

[1 0 1 1 1 0 1 0 1 0]
[0 0 0 0 0 0 0 0 0 0]


array([[ 8,  8,  6,  9],
       [11, 14,  7,  9]], dtype=uint16)
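
The effect of the endpoint parameter is easiest to see with a die roll. This is a minimal sketch; the seed and the variable names are illustrative assumptions, and the seed is there only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Default behaviour: high is exclusive, so the values fall in 1..5
rolls_exclusive = rng.integers(1, 6, size=1000)

# endpoint=True makes high inclusive, so the values fall in 1..6 (a six-sided die)
rolls_inclusive = rng.integers(1, 6, size=1000, endpoint=True)

print(rolls_exclusive.min(), rolls_exclusive.max())
print(rolls_inclusive.min(), rolls_inclusive.max())
```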

2. The **random** function returns random floats in the half-open interval [0.0, 1.0), and takes the same size parameter as **integers**. The dtype defaults to np.float64, and np.float32 can also be specified. There is also an out parameter that specifies an existing array in which to place the result. Because the floats returned lie between 0.0 and 1.0, one can widen the range by multiplying the result by the desired width, and shift the minimum by adding the desired minimum to the result. The maximum then follows automatically as the minimum plus the width. In the cell below, the width is 5 and the minimum is -5, so the values lie in [-5.0, 0.0).

In [2]:
5 * rng.random((3, 2)) - 5

array([[-3.93846145, -3.89072437],
       [-2.68820086, -3.20615176],
       [-4.87392263, -4.75143474]])
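
The scaling rule can be written for any interval [low, high). The sketch below uses the same bounds as the cell above; the variable names are illustrative, and Generator.uniform is shown as an alternative that draws from the interval directly.

```python
import numpy as np

rng = np.random.default_rng()

low, high = -5.0, 0.0  # illustrative bounds matching the cell above

# Scale by the width of the interval, then shift by the lower bound
scaled = (high - low) * rng.random((3, 2)) + low

# Generator.uniform draws from [low, high) without the manual arithmetic
direct = rng.uniform(low, high, size=(3, 2))

print(scaled)
print(direct)
```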

3. The **choice** function has one mandatory parameter, which is the array or integer that a 'choice' will be made from (in the case of an integer, the integer is first passed to numpy.arange()). There are a number of other non-mandatory options, including size, as above:


* boolean, **replace** - whether the chosen values are put back into the original array so that they can be chosen again, the default is True
* array-like, **p** - an array, equal in length to the inputted array / arange, of floats between 0.0 and 1.0 that sum to 1, representing the probability that the corresponding item in the inputted array will be chosen; if omitted, every item is equally likely
* int, **axis** - which dimension values should be selected from (0, the default, represents rows, 1 columns)
* boolean, **shuffle** - whether the sample is shuffled when sampling without replacement; the default is True, and it has no effect when replace is True

In [3]:
print(rng.choice(5))
print(rng.choice([0,1,2,3,4]))
print(rng.choice(5, p=[0,0,0,0,1])) # here 4 will always be chosen
print(rng.choice(5, size=(4,4), axis=1))
print(rng.choice(10, size=10, replace=False)) # note that each value will only be chosen once
print(rng.choice(10, size=10))
print(rng.choice(10, 10, shuffle=False)) # shuffle only applies when replace=False, so it has no effect here

1
4
4
[[2 3 1 4]
 [4 0 0 1]
 [1 2 1 2]
 [4 0 1 1]]
[1 5 7 9 4 0 8 2 6 3]
[0 2 9 9 4 8 9 6 6 9]
[4 1 5 7 7 1 9 1 7 5]
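
The p, replace and axis options can also be tried on arrays other than an arange. This is a minimal sketch with hypothetical data (the colours array and the grid are illustrative and not from the notebook).

```python
import numpy as np

rng = np.random.default_rng()

colours = np.array(["red", "green", "blue"])  # hypothetical data

# Weighted sampling: p needs one entry per item and must sum to 1
weighted = rng.choice(colours, size=1000, p=[0.7, 0.2, 0.1])
values, counts = np.unique(weighted, return_counts=True)
print(dict(zip(values, counts)))  # "red" should dominate

# Without replacement, every item appears at most once
draw = rng.choice(10, size=10, replace=False)
print(draw)

# With a 2-D array, axis=1 samples whole columns
grid = np.arange(12).reshape(3, 4)
cols = rng.choice(grid, size=2, axis=1, replace=False)
print(cols.shape)  # (3, 2)
```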


4. The **bytes** function takes only one parameter, length, which determines the number of bytes returned. This parameter is mandatory.

In [4]:
# print(rng.bytes()) will result in a TypeError
# print(rng.bytes([1])) will result in a TypeError
print(rng.bytes(1))
print(rng.bytes(5))
print(rng.bytes(10))

b']'
b'\xe5\x15\xd7XW'
b'\x0bL\xb7\xd1\\\x96\xe0\xd6H\x9d'
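
The raw bytes are often easier to read once converted. The sketch below shows two standard Python conversions on the returned bytes object; the variable names are illustrative. As with everything in numpy.random, these values are not suitable for security purposes.

```python
import numpy as np

rng = np.random.default_rng()

token = rng.bytes(8)                            # 8 random bytes as a Python bytes object
as_hex = token.hex()                            # two hexadecimal characters per byte
as_int = int.from_bytes(token, byteorder="big") # the same bytes read as one unsigned integer

print(len(token))  # 8
print(as_hex)
print(as_int)
```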


#### 'Permutations'

The two Generator functions that fall under the heading 'Permutations' differ from each other only slightly. **shuffle** [12] shuffles the contents of an array *in-place*, i.e. it changes the original array. **permutation** [13], on the other hand, copies the array and then shuffles the copy, leaving the original array unchanged. Because **shuffle** does not create a new array, it does not return anything, unlike **permutation**, which returns the new array. Where a multi-dimensional array is passed to either function, an optional parameter, **axis**, can be specified. This must be an int, and determines which dimension is shuffled: 0 for rows (the default) and 1 for columns.

In [7]:
array = np.arange(9)
rng.permutation(array)
print(array) # the array has not changed
rng.shuffle(array)
print(array) # the array has changed

[0 1 2 3 4 5 6 7 8]
[3 5 1 2 6 8 7 4 0]
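
The axis parameter only matters for multi-dimensional input. This is a minimal sketch on a small 2-D array; the array and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

grid = np.arange(12).reshape(3, 4)

# permutation returns a shuffled copy; axis=1 reorders the columns
shuffled_cols = rng.permutation(grid, axis=1)

# shuffle works in place and returns None; axis=0 (the default) reorders the rows
in_place = grid.copy()
result = rng.shuffle(in_place, axis=0)

print(shuffled_cols)
print(in_place)
print(result)  # None
```

In both cases whole rows or whole columns move together, so the values within each row or column keep their relative positions.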


### 3. Explain the use and purpose of at least five “Distributions” functions.

### 4. Explain the use of seeds in generating pseudorandom numbers.

The supposedly random numbers generated by the Generator functions are termed pseudorandom because they are generated using a combination of an initial entropy state and an algorithm. If one knows the initial entropy state and the algorithm, then one knows what numbers will be generated, hence the not-quite-random nature of the numbers. Numpy uses the random.SeedSequence class to determine the initial entropy state, and a random.BitGenerator class to determine the algorithm and combine it with a SeedSequence object to create the random bit stream that will be used by the Generator class to generate random numbers according to a particular probability distribution. It is important to note, however, that each call to a Generator function consumes more of the bit stream, so successive calls on the same Generator return different values. Reproducibility means that the whole sequence of results repeats when a new Generator is created from the same entropy and algorithm and the same calls are made in the same order.

This SeedSequence object is used to determine the initial entropy that will be used by the 'BitGenerator' object to generate its pseudorandom bitstream. When a 'BitGenerator' object is created without first specifying the 'SeedSequence' to be used, the entropy is taken from the Operating System itself. The only case where this is not acceptable is where the pseudorandom bitstreams (and thus the pseudorandom numbers generated by the 'Generator' object) need to be repeatable for whatever reason, for example where the results of an algorithm need to be reproducible so that they can be verified by others who are running the code implementing the algorithm. In this case, the user can create a 'SeedSequence' object, note the entropy of that object (SeedSequence.entropy), and then use that same initial entropy every time a 'BitGenerator' object needs to be created. This will allow the user to obtain identical sets of pseudorandom numbers (hence their pseudorandomness - they are actually created from a specified initial entropy and a specified algorithm).

Although one could first create a SeedSequence object using the SeedSequence constructor, then pass that to a BitGenerator constructor to create a BitGenerator object, and finally pass that to the Generator constructor to create a Generator object that can call the various functions that generate the random numbers etc., the numpy documentation recommends simply creating a Generator object directly using the default_rng constructor [11], which handles the bit stream generation in the background.

PCG64, the bit stream generating algorithm that default_rng uses by default, guarantees that a fixed seed will always produce the same random integer stream, which is what we expect of a pseudorandom series of numbers [14].

In [18]:
sq1 = np.random.SeedSequence() # generate a SeedSequence with entropy from the OS
sq1.entropy # the entropy is a large integer that can be recorded and reused
sq2 = np.random.SeedSequence(sq1.entropy) # recreate the SeedSequence from that entropy
np.all(sq1.generate_state(10) == sq2.generate_state(10)) # np.all checks that every element-wise comparison is True

bg1 = np.random.PCG64(sq1) # PCG64 is the default BitGenerator algorithm
bg2 = np.random.PCG64(sq2) # built from the recreated SeedSequence
print(bg1.random_raw()) # see that the BitGenerators' bit streams are the same
print(bg2.random_raw())

rg = np.random.Generator(bg1) # create our Generator with our BitGenerator
print(rg.integers(20)) # successive calls on the same Generator give different results
print(rg.integers(20)) # because each call consumes more of the bit stream


7223919406056205795
7223919406056205795
3
10
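
In everyday use the seed is usually passed straight to default_rng. The following is a minimal sketch of that pattern, assuming the same NumPy version is used for every run; the seed value and the variable names are illustrative only.

```python
import numpy as np

seed = 2020  # arbitrary illustrative seed (an assumption, not from the notebook)

rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

# Same seed and same sequence of calls: identical results
draws_a = rng_a.integers(20, size=5)
draws_b = rng_b.integers(20, size=5)
print(np.array_equal(draws_a, draws_b))  # True

# Both generators have advanced by the same amount, so they stay in step
normal_a = rng_a.normal(size=3)
normal_b = rng_b.normal(size=3)
print(np.array_equal(normal_a, normal_b))  # True

# An unseeded generator takes fresh entropy from the operating system
rng_c = np.random.default_rng()
print(rng_c.integers(20, size=5))
```

Recording the seed alongside the results is what allows someone else to rerun the code and check them.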


### References
[1] NumPy developers. Numpy. http://www.numpy.org/.

[2] Project Jupyter. Project jupyter home. http://jupyter.org/.

[3] https://numpy.org/doc/stable/reference/random/index.html

[4] https://numpy.org/doc/stable/reference/random/generator.html

[5] https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.BitGenerator.html#numpy.random.BitGenerator

[6] https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.SeedSequence.html#numpy.random.SeedSequence

[7] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.integers.html#numpy.random.Generator.integers

[8] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.random.html#numpy.random.Generator.random

[9] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice

[10] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.bytes.html#numpy.random.Generator.bytes

[11] https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng

[12] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.shuffle.html#numpy.random.Generator.shuffle

[13] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.permutation.html#numpy.random.Generator.permutation

[14] https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html