<img src="https://user-images.githubusercontent.com/50221806/85330638-a467b280-b489-11ea-8e64-7e7390afea32.png" style="float: left" width=500 />

# Programming for Data Analysis
***
## Assignment - numpy.random

#### Create a notebook that explains the use of the NumPy package including detailed explanations of at least five of the distributions provided for in the package.

#### There are four distinct tasks to be carried out in your Jupyter notebook
<ol>
    <li>Explain the overall purpose of the package.</li>
    <li>Explain the use of the “Simple random data” and “Permutations” functions.</li>
    <li>Explain the use and purpose of at least five “Distributions” functions.</li>
    <li>Explain the use of seeds in generating pseudorandom numbers.</li>
</ol>

## NumPy and numpy.random
***
### NumPy Background

`NumPy`, short for Numerical Python, is considered a foundational Python package for numerical computations [1]. The NumPy package was established by Travis Oliphant in 2005. It was meant as a successor to two earlier scientific Python libraries, Numeric and Numarray, with the goal of bringing a fragmented scientific computing community together around a single framework [2][3].

NumPy is an open-source external Python module that provides common mathematical and numerical routines in pre-compiled, fast functions for manipulating large arrays and matrices of numeric data [4]. NumPy does not come pre-installed with the Python standard library but can be installed using a package manager such as `pip`. Alternatively, NumPy comes pre-installed with the `Anaconda distribution`, which is recommended as the simplest way to get started with scientific computing and data science [5].

As it is an external package, NumPy must be imported with an import statement before its functions are accessible. This can be done as in the code shown below. The NumPy package is imported **as np** to save time when writing multiple commands: you then only need to call np.x instead of numpy.x. The **numpy as np** convention is widely used, so following it helps other users reading the code understand it [4][6].

```python
import numpy as np

x = np.random.default_rng()
```

If you only require a function or sub-package from NumPy you can import that package directly into the current Python namespace using a from statement as below. This allows you to call the function without having to call np.package.function. You can also import the function directly as seen in the second example below [4][6].

```python
from numpy import random

x = random.default_rng()
```
or

```python
from numpy.random import default_rng

x = default_rng()
```

### NumPy's random package
***
NumPy's random sub-module is designed to generate pseudo-random numbers and sequences from different statistical distributions [7]. The numpy.random package has functions for efficiently generating arrays of sample values.

In the example below, the np.random.random function is used to create a single pseudorandom number between 0 and 1 (returned as a one-element array) and assign it to the variable x. It can also be used to efficiently create an array of values: in the second example we create an array with three rows and three columns of values and assign it to the variable y. This is an improvement on the Python built-in random module, which only samples one value at a time [8].

In [None]:
import numpy as np

x = np.random.random(size=1)
y = np.random.random(size=(3,3))

print(x)
print(y)
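
To make the comparison with the built-in `random` module concrete, the sketch below draws five values with each approach. The variable names and the seed value 42 are illustrative choices only and are not part of the example above.

```python
import random  # Python's built-in module

import numpy as np

# Built-in module: random.random() returns one float per call,
# so a loop (here a list comprehension) is needed for several values
random.seed(42)
py_values = [random.random() for _ in range(5)]

# NumPy: a single call returns a whole array of samples
rng = np.random.default_rng(42)
np_values = rng.random(size=5)

print(py_values)
print(np_values)
```

The two libraries use different algorithms, so the values differ even though the same seed is used.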

#### Pseudo random number generation
***
It is extremely difficult for computers and computer programs to generate truly random numbers. Instead what programs do, including NumPy's random module, is they create what are called `pseudo random numbers`. These programs use algorithms with defined deterministic behaviour to generate a value or range of values that seem random to an observer [8][9].

NumPy's recommended Pseudo Random Number Generator (PRNG), `Generator`, uses O’Neill’s permutation congruential generator algorithm, `(PCG64)`, as the default method for generating random numbers [10]. Legacy versions of numpy.random, as well as Python's `stdlib random module`, use the Mersenne Twister algorithm, `(MT19937)` [11]. Historically NumPy had a strict backwards compatibility policy for its random number generation functions. This restricted upgrades and improvements to its processes, as any changes had to comply with this policy. More recent releases are not strictly compatible with previous versions, allowing changes such as the move to the PCG64 algorithm [12]. The PCG family of algorithms is considered faster, more efficient and less predictable than most other generators, including the Mersenne Twister [13].

As shown below, it is still possible to access the Mersenne Twister algorithm to generate random numbers but in the latest version of numpy the default algorithm is PCG64. When default_rng and PCG64() are used with the same seed value they produce identical outputs.


In [None]:
from numpy.random import Generator, PCG64, MT19937, default_rng

sd = 1234 # create common seed value

z = default_rng(sd) # Initialise a random number generator using numpy's default RNG
x = Generator(PCG64(sd)) # Initialise Generator object using PCG64 algorithm
y = Generator(MT19937(sd)) # Initialise Generator object using Mersenne Twister algorithm

print(z.random(size=3))
print(x.random(size=3))
print(y.random(size=3))
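
As a quick check of which algorithm sits behind `default_rng`, the sketch below inspects the generator's `bit_generator` attribute. It assumes a NumPy release in which PCG64 is the default, as described above; the variable name `bit_gen_name` is illustrative.

```python
from numpy.random import default_rng

rng = default_rng()

# A Generator wraps a bit generator object that implements the algorithm
bit_gen_name = type(rng.bit_generator).__name__
print(bit_gen_name)
```

On such a release this is expected to print the class name `PCG64`.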

#### Seed values for numpy.random
***
As PRNGs are deterministic by design, their inputs or initial state dictate what their outputs will be. Two identical PRNGs with the same initial state will produce the same sequence of values. While these are not true random values, they are statistically similar to random values [4]. This behaviour can be very useful, especially in fields where reproducibility is key. PRNGs in the numpy.random package allow you to set this initial state through a parameter called `seed` [14].

In the example above **z and x** use the same algorithm and both have their initial seed value set to an arbitrary value of 1234 and therefore produce the same sequence of values: `0.97669977, 0.38019574, & 0.92324623`.

If no value is passed as the seed parameter of default_rng(), data will be pulled from the operating system and the generator will produce unpredictable values [10]. In the past it was popular to set the global state of a program using `numpy.random.seed()`. While this function is still available, it is recommended not to use it, as setting a global state can cause other problems [12]. Instead, each generator object should be initialised with a seed as a parameter. If the generator needs to be reseeded, it is now best practice to create a new generator rather than reseeding with numpy.random.seed(), as shown below.
```python
rng = np.random.default_rng(1234)
# reseed later as
rng = np.random.default_rng(5678)
```
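
The behaviour described in this section can be checked directly. The following is a minimal sketch, with illustrative variable names, comparing two generators created with the same seed against two created without any seed.

```python
import numpy as np

# Two generators created with the same seed start from the same state
first = np.random.default_rng(1234).random(3)
second = np.random.default_rng(1234).random(3)
same_seed_matches = np.array_equal(first, second)

# Generators created without a seed take their state from the operating system
unseeded_a = np.random.default_rng().random(3)
unseeded_b = np.random.default_rng().random(3)
unseeded_matches = np.array_equal(unseeded_a, unseeded_b)

print(same_seed_matches)  # True
print(unseeded_matches)   # almost certainly False
```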

In summary, seed values are important, especially where repeatability and reproducibility are needed in the outputs of the code. In this case a seed should be passed to each generator when it is initialised to set its initial state. If that state needs to be reset later, a new generator should be created with a new seed. If it is required that the results are not easy to reproduce or predict, the seed value should be selected in a nondeterministic fashion and kept hidden [14].






### Simple random data
***
NumPy's random module has a number of functions available to generate simple random data.
The four main methods for generating simple random data are:

1. **integers**
2. **random**
3. **choice**
4. **bytes**

Previously used methods like `randint` and `random_integers` have been consolidated into the generator method `integers` to clean up the code base and reduce duplication [7].

In more recent versions of NumPy it is recommended that new code move away from the formerly used RandomState and instead initialise a Generator object and call methods on that generator, as in the code below [7].

In [None]:
import numpy as np
# Not recommended to call method in this fashion 
x = np.random.randint(5)
print(x)

# Instead initialise a generator like rng below 
rng = np.random.default_rng()

# Call method integers on rng to produce sample data
y = rng.integers(5)
print(y)
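
As a rough guide to how the legacy calls map onto `integers`, the sketch below simulates a six-sided die in two ways. The seed and variable names are illustrative. The legacy `randint` excluded the upper bound while `random_integers` included it, and `integers` covers both cases through its `endpoint` parameter.

```python
import numpy as np

rng = np.random.default_rng(2020)  # seed chosen only for illustration

# Like legacy randint(1, 7): the upper bound 7 is excluded
half_open = rng.integers(1, 7, size=1000)

# Like legacy random_integers(1, 6): the upper bound 6 is included
closed = rng.integers(1, 6, size=1000, endpoint=True)

# Both calls draw from the values 1 to 6
print(half_open.min(), half_open.max())
print(closed.min(), closed.max())
```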

#### integers
***
The `integers()` method returns random integers from the `discrete uniform distribution` within a *low* to *high* range. The method has 5 parameters, 4 of which are optional for the user [16]. The discrete uniform distribution means that each value in the range has an equal probability of being selected [17].
```python
integers(low, high=None, size=None, dtype=np.int64, endpoint=False)
```
- **low**: The low parameter is required and sets the lowest integer value that can be drawn from the distribution. The exception is when `high=None`: in that case the value passed as low is used as the upper bound and the lower bound is set to 0. low can be an int or an array of ints.
- **high** (optional): By default high is set to None. If a high value is included then high is the upper bound of the distribution. The range is up to but excluding the high value, unless endpoint=True. high can be an int or an array of ints.
- **size** (optional): By default size is set to None. This means only a single value is returned. If a value is provided for size then that value governs the shape of the output. size can be an int or a tuple of ints.
- **dtype** (optional): This parameter decides the data type of the output. By default it is set to np.int64.
- **endpoint** (optional): By default endpoint is set to False. This means the range of values is exclusive of the high value. If `endpoint = True` then the range of values is inclusive of high. [16]

As we see from the examples below, it is possible to customise `integers()` to generate the random value or range of values that we need, from any range of ints we want and with the type and shape of output we wish to create.


In [None]:
from numpy.random import default_rng
# initialise generator object rng
rng = default_rng()

# 1. low and high value provided
print("eg.1: ", rng.integers(0,5)) # generates random int between low 0 and high 5 exclusive

# 2. low value only becomes high value as high=None, low becomes 0
print("eg.2: ", rng.integers(5, endpoint=True)) # generates random int between 0 and 5 inclusive

# 3. returns a 3 value array between 0-5 exclusive
print("eg.3: ", rng.integers(5, size=3))

# 4. returns a 2x2 array with 2 different lower and upper bounds
print("eg.4:\n ", rng.integers([1, 3], [3, 6], size=(2, 2)))
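
The claim that every value in the range is equally likely can be checked empirically. The sketch below is a simple illustration, with an arbitrary seed and sample size, that counts how often each value appears in a large sample.

```python
import numpy as np

rng = np.random.default_rng(1234)  # seeded so the counts are repeatable

# Draw 60,000 integers from the range 0 to 5 (high=6 is excluded)
samples = rng.integers(0, 6, size=60_000)

# Count how often each of the values 0..5 occurs
counts = np.bincount(samples, minlength=6)
proportions = counts / samples.size

print(counts)
print(proportions.round(3))  # each should be close to 1/6
```

With a discrete uniform distribution each proportion should sit near 1/6, roughly 0.167, with only small differences due to sampling noise.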

#### random
***
The `random()` method returns random floats from the `continuous uniform distribution` over the half-open interval from 0 (inclusive) to 1 (exclusive). This distribution ensures that each value in that interval is equally likely to occur [18][19]. The random method takes over from the legacy RandomState methods rand and random_sample when using the recommended Generator object [7].

The random method has three optional parameters which affect its output.
```python
random(size=None, dtype=np.float64, out=None)
```
- **size**: Controls the output shape. By default = None and returns a single value. Can be an int or a tuple of ints
- **dtype**: Sets the data type of the output. float64 and float32 supported, np.float64 by default.
- **out**: An existing array in which to store the result. If `size=None`, the output takes the shape of the array provided. If size is not None, it must match the shape of that array. [18]

As we can see from the below examples, we can customise the outputs of the random method by altering the input parameters. We can change the shape of the output and assign the return values to an output ndarray using the `size` and `out` parameters respectively. We can also perform calculations on the output if we require values that differ from the standard 0 to 1 output. 

In [1]:
from numpy.random import default_rng
from numpy import ndarray

rng = default_rng()

# 1. Single random float between 0 and 1
print("eg.1:")
print(rng.random())

# 2. Generate a 2x3 array of floats between 0 and 1
print("eg.2:")
print(rng.random((2,3)))

# 3. Generate a 2x2 array of floats between 5 and 10
x = rng.random((2,2)) * 5 + 5
print("eg.3:")
print(x)

# 4. Assign output to ndarray arr
arr = ndarray((2,2))
rng.random(size=(2,2), out=arr)
print("eg.4")
print(arr)

eg.1:
0.49812336356253306
eg.2:
[[0.71339548 0.02298002 0.6283706 ]
 [0.74237677 0.26685438 0.27154551]]
eg.3:
[[8.41976304 6.58396081]
 [9.50078173 6.92627107]]
eg.4
[[0.01108749 0.73467482]
 [0.13520784 0.81057304]]
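
The calculation used in eg.3 generalises to any interval: multiplying by `(high - low)` and adding `low` maps values from 0 to 1 onto values from low to high. The sketch below uses illustrative bounds and an arbitrary seed, and compares the result with `Generator.uniform`, a distribution method that draws from such an interval directly.

```python
import numpy as np

low, high = 5, 10  # illustrative bounds

rng = np.random.default_rng(1234)

# Rescale floats from [0, 1) to [low, high)
scaled = (high - low) * rng.random(1000) + low

# Generator.uniform draws from the interval between low and high directly
direct = rng.uniform(low, high, size=1000)

print(scaled.min(), scaled.max())
print(direct.min(), direct.max())
```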


#### choice
***
The `choice()` method returns a random sample from a given array. The output can be a single value or an array of values. By default choice applies a uniform distribution to the array but this is customizable using the parameter **p** [20].

The choice method has 6 parameters 5 of which are optional for the user.
```python
choice(a, size=None, replace=True, p=None, axis=0, shuffle=True)
```
- **a**: The array from which a random sample is generated. If **a** is an int, a sample array is generated using np.arange(a).
- **size** (optional): Dictates output shape. By default size=None which returns a single value. size can be an int or a tuple of ints which allows for multi dimensional array output.
- **replace** (optional): Controls whether a selected item is returned to the pool after selection. By default replace=True. If replace=False, an item that has been selected is not available for future selection.
- **p** (optional): Assigns probabilities to each value in array a. By default p=None which assumes a uniform distribution of each element in a. If p is provided it must be a 1-D array with same number of elements as a and all elements of p must add up to 1.
- **axis** (optional): The axis along which the selection is made. By default axis=0, which selects by row. If **a** is a 2-D array, axis decides whether rows or columns are selected.
- **shuffle** (optional): Controls whether the output is shuffled when replace=False. shuffle=False provides a speedup to the running of the code. [20]

In the examples below we can see that the choice method is great for selecting a value or range of values from a given array. 
In the main, the parameters a, size, replace and p would be the most utilised while axis and shuffle may be needed in some less common scenarios. 



In [83]:
from numpy.random import default_rng
from numpy import ndarray

rng = default_rng()

# 1. select one random value from sample array [0 1 2 3 4]
print("eg.1")
print(rng.choice(5))

# 2. selects random 2x2 array from sample array [0 1 2 3 4]
print("eg.2")
print(rng.choice(5, size=(2,2)))

# 3. selects random value from sample array [0 1 2 3 4], 
# value cannot be selected twice as it is not replaced after selection
print("eg.3")
print(rng.choice(5, size=3, replace=False))

# 4. assign probability of outcome. from array [0 1 2 3 4] 0 has 60% chance of being selected
print("eg.4")
print(rng.choice(5, p=[.6, .1, .1, .1, .1]))

# Selecting axis
arr = [[1,2,3],[7,8,9]] # 2-D array arr
x = rng.choice(arr, size=1, axis=0) # selects by row
y = rng.choice(arr, size=1, axis=1) # selects by col
print("eg.5")
print(x, y)

# shuffle=True & shuffle=False
x = rng.choice(5, size=5, replace=False, shuffle=True) # Shuffle values, returns in random order
y = rng.choice(5, size=5, replace=False, shuffle=False) # returns array in order [0 1 2 3 4]

print("eg.6")
print(x, y)

eg.1
2
eg.2
[[0 3]
 [3 2]]
eg.3
[4 2 3]
eg.4
0
eg.5
[[1 2 3]] [[1]
 [7]]
eg.6
[4 2 0 1 3] [0 1 2 3 4]
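
The effect of the **p** parameter is easier to see over many draws than in a single selection. The following sketch, using the same probabilities as eg.4 with an arbitrary seed and sample size, compares the observed frequencies with the probabilities requested.

```python
import numpy as np

rng = np.random.default_rng(1234)
p = [0.6, 0.1, 0.1, 0.1, 0.1]  # same probabilities as eg.4

# Draw 100,000 values from [0 1 2 3 4] using the weights in p
draws = rng.choice(5, size=100_000, p=p)

# Observed frequency of each value
observed = np.bincount(draws, minlength=5) / draws.size

print(observed.round(3))  # should be close to p
```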


#### bytes
***
The `bytes()` method takes a single input `length` and returns a `bytes` object containing `length` random bytes [21].
```python
bytes(length)
```
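Random bytes are used for purposes such as encryption or salting in security and authentication [22]. Note, however, that the NumPy documentation describes its generators as designed for statistical modelling and simulation and not suitable for security or cryptographic purposes.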


In [90]:
from numpy.random import default_rng

rng = default_rng()

x = rng.bytes(1)
y = rng.bytes(10)

print("x: ", x)
print("y: ", y)

x:  b'\xca'
y:  b'\xa5\xd5\xca\xd5X\xc5\xbd|\xc7='
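
The escaped output above can be hard to read. As a small sketch, with illustrative variable names, the raw bytes can be viewed as hexadecimal text or as an array of integers between 0 and 255.

```python
import numpy as np

rng = np.random.default_rng()

raw = rng.bytes(4)

# View the raw bytes as hexadecimal text (two characters per byte)
hex_text = raw.hex()

# Or interpret each byte as an unsigned 8-bit integer (0-255)
as_ints = np.frombuffer(raw, dtype=np.uint8)

print(type(raw), len(raw))
print(hex_text)
print(as_ints)
```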


### Permutations
***
The permutation functions in numpy.random refer to methods for the arrangement or rearrangement of elements within a set [23].

numpy.random has two such permutation functions that we will look at in more detail:

1. shuffle
2. permutation

#### shuffle
***
The `shuffle()` method modifies a sequence by randomly shuffling its elements. shuffle takes one required parameter and one optional parameter listed below [24]:
```python
shuffle(x, axis=0)
```
- **x**: The array or list to be shuffled
- **axis** (optional): Dictates the axis for x to be shuffled on. Only applicable to ndarray objects.

The return value of shuffle is **None**. This means it cannot be used to produce a shuffled copy of the original array **x**; instead it modifies the array in place. This behaviour is seen in eg.1 below. Shuffle can also be used on the characters of a string, provided they are first placed in an array-like structure, as in eg.2 below.


In [97]:
# eg.1
from numpy.random import default_rng
from numpy import arange

# initialise generator object rng
rng = default_rng()

arr = arange(9) # create array [0,9] exclusive
print(arr) # array in order

rng.shuffle(arr) # shuffle the array
print(arr) # new array order printed

x = rng.shuffle(arr) # assign shuffle to new variable returns a None value
print(x)

[0 1 2 3 4 5 6 7 8]
[2 0 7 1 8 4 6 5 3]
None


In [109]:
# eg.2
from numpy.random import default_rng

rng = default_rng()

text = "just a few words"  # named text so the built-in str is not shadowed
y = list(text)  # y is a list of chars from the string text

print(y) # print original list
rng.shuffle(y) # shuffle elements in list

print(y) # print shuffled list


['j', 'u', 's', 't', ' ', 'a', ' ', 'f', 'e', 'w', ' ', 'w', 'o', 'r', 'd', 's']
[' ', 'w', 'e', 'd', 'a', 't', ' ', 'f', 'w', 'u', 's', ' ', 'o', 'j', 's', 'r']
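
The **axis** parameter only matters for multi-dimensional arrays. The sketch below is an illustration on a hypothetical 3x4 array, showing that axis=0 reorders whole rows while axis=1 reorders whole columns.

```python
import numpy as np

rng = np.random.default_rng()

grid = np.arange(12).reshape(3, 4)
original = grid.copy()  # kept for comparison, since shuffle works in place

rng.shuffle(grid)  # axis=0 (default): rows change order, each row stays intact
rows_shuffled = grid.copy()

grid = original.copy()
rng.shuffle(grid, axis=1)  # axis=1: columns change order, each column stays intact
cols_shuffled = grid.copy()

print(rows_shuffled)
print(cols_shuffled)
```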


#### permutation
***
The `permutation()` method returns a sequence by randomly altering the arrangement of an input sequence or range **x**. permutation takes one required parameter and one optional parameter listed below [25]:
```python
permutation(x, axis=0)
```
- **x**: input int or array. If x is an int, that int is passed to np.arange to create a range of values. If x is an array, a copy is made and the elements are shuffled randomly.
- **axis** (optional): Dictates the axis for x to be shuffled on. Default is 0

Unlike the shuffle method, permutation leaves the original array intact and returns a new array which is a shuffled copy of the original.

In [112]:
from numpy.random import default_rng
from numpy import arange

rng = default_rng() # initialise generator object rng

# eg.1
a = rng.permutation(9) # int 9 passed to np.arange to create array
print(a, "values 0-9 exclusive shuffled\n")

# eg. 2
arr = arange(9) #  create array [0,9] exclusive 
print(arr, "arr in its original format")

rng.permutation(arr) # unlike shuffle this does not alter original array

x = rng.permutation(arr) # new array x created, permuted copy of arr

print(x, "x is a shuffled copy of arr")
print(arr, "arr remains unchanged")

[4 0 7 2 3 6 8 1 5] values 0-9 exclusive shuffled

[0 1 2 3 4 5 6 7 8] arr in its original format
[0 7 3 4 8 6 1 2 5] x is a shuffled copy of arr
[0 1 2 3 4 5 6 7 8] arr remains unchanged
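
Because permutation returns a new array, one possible use is to permute a set of indices and apply the same order to several related arrays. The sketch below is one such illustration; the arrays `values` and `labels` are hypothetical data.

```python
import numpy as np

rng = np.random.default_rng()

values = np.array([10, 20, 30, 40, 50])
labels = np.array(["a", "b", "c", "d", "e"])

# A random ordering of the indices 0..4
order = rng.permutation(len(values))

# Apply the same ordering to both arrays so that the pairs stay aligned
shuffled_values = values[order]
shuffled_labels = labels[order]

print(order)
print(shuffled_values)
print(shuffled_labels)
```

The original `values` and `labels` arrays are left unchanged, which would not be the case if each were passed to shuffle.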


### numpy.random Distributions
***


## References
***

1. McKinney, W., (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second Edition. p85
2. McKinney, W., (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second Edition. p86  
3. numpy.org, (2020). About Numpy. https://numpy.org/doc/stable/about.html  
4. M. Scott Shell, (2019). An introduction to Numpy and Scipy, p2. p19. https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf  
5. numpy.org, Installing Numpy. https://numpy.org/install/  
6. numpy.org, (2020), NumPy: the absolute basics for beginners, https://numpy.org/devdocs/user/absolute_beginners.html  
7. numpy.org, (2020), Random sampling (numpy.random), https://numpy.org/doc/stable/reference/random/index.html?  
8. McKinney, W., (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second Edition. p118
9. w3schools, (2020). w3schools.com. https://www.w3schools.com/python/numpy_random.asp
10. numpy.org, (2020). Random Generator. https://numpy.org/doc/stable/reference/random/generator.html
11. numpy.org, (2020). Legacy Random Generation. https://numpy.org/doc/stable/reference/random/legacy.html#numpy.random.RandomState
12. numpy.org, (2019). Random Number Generation Policy. https://numpy.org/neps/nep-0019-rng-policy.html
13. pcg-random.org, (2018). PCG, A Family of Better Random Number Generators. https://www.pcg-random.org/
14. O'Neill, Melissa (2014). PCG: A Family of Simple Fast, Space-Efficient Statistically, Good Algorithms for Random Number Generation, p7-8. https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf
15. numpy.org, (2020). numpy.random.seed. https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html
16. numpy.org, (2020). numpy.random.integers. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.integers.html
17. Weisstein, Eric W. Discrete Uniform Distribution. https://mathworld.wolfram.com/DiscreteUniformDistribution.html
18. numpy.org, (2020). numpy.random.random. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.random.html
19. Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin. https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Continous-Random-Variables/Continuous-Uniform-Distribution/index.html
20. numpy.org, (2020). numpy.random.Generator.choice. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html
21. numpy.org, (2020). numpy.random.Generator.bytes. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.bytes.html
22. avelx.co.uk, (2015). random bytes and random int functions. https://www.avelx.co.uk/random-bytes-and-random-int-functions/
23. w3schools, (2020). Random Permutations. https://www.w3schools.com/python/numpy_random_permutation.asp
24. numpy.org, (2020). numpy.random.Generator.shuffle. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.shuffle.html
25. numpy.org, (2020). numpy.random.Generator.permutation. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.permutation.html







    
    
   