# Programming for Data Analysis Assignment 2019

## Problem Statement
The assignment concerns the `numpy.random` package in Python.
Explain the use of the package including detailed explanations of at least five of the distributions provided for in the package.
There are four distinct tasks to be carried out in the Jupyter Notebook.
1. Explain the overall purpose of the package.
2. Explain the use of the “Simple random data” and “Permutations” functions. 
3. Explain the use and purpose of at least five “Distributions” functions.
4. Explain the use of seeds in generating pseudorandom numbers.


## 1. Explain the overall purpose of the package.

### Some quick notes here
- What is the stated aim of the package?
- Why would you want to generate random numbers?
- Python's built-in `random` module can also generate random numbers, so what is the difference?
- How are random numbers generated?

## What is numpy.random?

`numpy.random` is a sub-package of NumPy that generates random numbers. Strictly speaking these are only pseudorandom numbers, because a computer following a deterministic algorithm cannot produce truly random values on its own.

Python has a built-in `random` module which implements pseudorandom number generators for various distributions. See [python docs/random library](https://docs.python.org/3/library/random.html). Most functions in this built-in `random` module return a single value per call.
`numpy.random` supplements the built-in module with functions that can efficiently generate whole arrays of sample values from various probability distributions, rather than one value at a time. For large samples this is much faster than looping in pure Python.
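
As a rough illustration of the difference, the sketch below draws five uniform values with each module. The variable names are only illustrative; `random.random` and `np.random.rand` are the calls being compared.

```python
import random
import numpy as np

# Built-in module: one float per call, so a list needs a Python-level loop.
py_values = [random.random() for _ in range(5)]

# numpy.random: a single call returns a whole ndarray of samples.
np_values = np.random.rand(5)

print(py_values)   # a list of 5 floats in [0.0, 1.0)
print(np_values)   # an ndarray of 5 floats in [0.0, 1.0)
```

The values differ on every run because no seed has been set.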


The `numpy.random` package is used to generate random (or rather pseudorandom) numbers, so the first question is why you would want to generate random numbers at all.

NumPy is a data manipulation module for python with tools that operate on arrays of numbers.
NumPy is one of the most important packages for numerical computing in python and many packages that are used for data analytics are based on this package.

NumPy functions can operate on large arrays of numbers and are therefore very useful for statistics and data analytics. NumPy is used for manipulating data in datasets that can be very large.

NumPy has a `random` package which can be used to draw random samples from a dataset, something that is done frequently in data analysis and machine learning projects. Being able to randomly select elements also has many uses in applications such as gaming.

`numpy.random` works with the multi-dimensional `ndarray` arrays that are the key data structure in the NumPy package, which is why it can be far more efficient and faster than looping with the built-in Python module.

Given an input array of elements, random functions will allow you to select a random sample of elements from the array.

You can also create random samples with the NumPy random package which allows you to generate arrays of numbers from a particular probability distribution and this could be useful for creating a toy dataset in the absence of a real dataset or for testing or even just for learning!
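
Both uses can be sketched in a few lines. The dataset and the distribution parameters here are made up purely for illustration.

```python
import numpy as np

# Hypothetical dataset: the integers 10 to 19.
data = np.arange(10, 20)

# Select a random sample of 4 elements (with replacement by default).
subset = np.random.choice(data, size=4)

# Generate a toy dataset: 100 rows, 2 columns, drawn from a normal
# distribution with mean 50 and standard deviation 5 (arbitrary values).
toy = np.random.normal(loc=50.0, scale=5.0, size=(100, 2))

print(subset)
print(toy.shape)   # (100, 2)
```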

Python has a built-in module called `random` which is part of the Python standard library, and this module provides a number of tools for working with random numbers. NumPy's random module is aimed more at generating random series of data rather than the scalar values which are generated by the Python `random` module.

Python's built-in `random` module implements pseudorandom number generators for various distributions. See [python docs/random library](https://docs.python.org/3/library/random.html). However, most functions in this built-in `random` module return one value per call, so building a long sequence of random numbers means calling them repeatedly in a loop.
`numpy.random` provides functions that can efficiently generate arrays of sample values from various probability distributions rather than one value at a time, which is much faster for large samples.

### numpy.random functions
There are a great many functions in the `numpy.random` package for generating random numbers from different probability distributions, so it is not really practical to know all of them without referring to the documentation.
Plots can illustrate the differences between probability distributions.
See also the random generator functions: `RandomState`, `seed`, `get_state` and `set_state`.

### Generating random numbers
Numbers generated by either the built-in python `random` module or the `numpy.random` package are not actually random at all but *pseudorandom* numbers. This is because computers cannot actually generate random numbers but what they do is generate sequences that look like random numbers, so that someone else wouldn't be able to predict the next number generated without a key piece of information which is called the **seed**. If you know the seed then you can predict the next number to be generated in a sequence and therefore the numbers generated cannot be actually random as such.

If you do not supply a seed, the generator is seeded automatically when it is first set up, typically from the system time or from a source of randomness provided by the operating system. Sometimes you may want to recreate the exact same sequence of random numbers, for example when testing code, demonstrating something or teaching, so that the output can be reproduced. If, instead of relying on the default, you provide or **set** the seed yourself, then the exact sequence can be reproduced.

Random (pseudo-random) numbers are drawn from a probability distribution. The numbers generated depend on the *seed* used and are generated according to some deterministic algorithm from that seed. 
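
A minimal sketch of this behaviour, using the legacy `np.random.seed` function that the NumPy 1.16 documentation describes. The seed value 42 is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)            # set the seed (any integer would do)
first = np.random.rand(3)

np.random.seed(42)            # reset to the same seed
second = np.random.rand(3)

# The same seed gives exactly the same sequence.
print(np.array_equal(first, second))   # True
```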



### First a quick overview of the NumPy package

`NumPy` is a specialist Python package that works well with large arrays of numbers. Matrix operations written in plain Python, using nested lists and loops, are quite inefficient compared to NumPy, which works easily with multi-dimensional arrays and performs operations on them far more efficiently.
Matrix operations are very commonly used when analysing data.

While NumPy provides a computational foundation for working with numbers, it does not have the modelling or scientific functionality of other computational packages in Python but these other packages do use NumPy's array objects. NumPy is typically used through other packages. 

The NumPy package has many algorithms for dealing with numerical operations on arrays. Operations can be performed on NumPy arrays very quickly. 

NumPy can also be used to simulate data. While csv data is commonly imported into NumPy for analysis or into a package such as pandas that uses NumPy, it is often handy to be able to simulate data to analyse before the real data may be collected or available.

The [numpy quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html) provides a good overview of the NumPy package.

[what is NumPy](https://docs.scipy.org/doc/numpy/user/whatisnumpy.html)

- NumPy is short for Numerical Python. 
- NumPy is one of the most important foundational packages for numerical computing in Python. 
- NumPy's main data structure is the `ndarray`, a homogeneous multidimensional array. An *ndarray* is also known by the alias *array*.
- A NumPy array is like a table of elements that are all of the same type. The elements, which are usually numbers, are indexed by a tuple of non-negative integers. Axes refer to the dimensions of a NumPy array.

- NumPy has many mathematical functions that avoid the need to use loops as they work on entire arrays of data.
- NumPy is designed for efficiency on large arrays of data, which is partly due to the way NumPy stores data in a contiguous block of memory. NumPy's algorithms can operate on this memory without any type checking. NumPy arrays use much less memory than Python's own built-in sequences.
- NumPy arrays use *vectorisation*. This is where batch operations can be performed on arrays without using loops.
- Arithmetic operations are carried out *elementwise* on an array and result in a new array being created.
- Some operations, such as `+=` and `-=`, modify an array in place and do not create a new one.
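
The last three points can be seen in a short sketch. The array values are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Vectorised, elementwise arithmetic: no explicit loop, and a new array is created.
c = a + b
d = a * 2

# In-place operation: `a` itself is modified and no new array is created.
a_id = id(a)
a += b

print(c)              # [11. 22. 33.]
print(d)              # [2. 4. 6.]
print(a)              # [11. 22. 33.]
print(id(a) == a_id)  # True, still the same object
```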
 


#### NumPy's ndarray object. 
NumPy's *ndarray* is an N-dimensional or multi-dimensional array object. It is a fast and flexible container for larger datasets in Python. Mathematical operations can be performed on entire arrays using the same kind of syntax as the equivalent operations on scalar values.
An ndarray can be created using the NumPy `array` function, which takes any sequence-like object such as a list or nested list. Other NumPy functions that create new arrays include `zeros`, `ones`, `empty`, `arange`, `full` and `eye`, among others. All these functions create an array object.

Some key points about the ndarray objects include:
- The data in an *ndarray* must be homogeneous, that is, all of its data elements must be of the same type.
- Arrays have a *ndim* attribute for the number of axes or dimensions of the array.
- Arrays have a *shape* attribute which is a tuple that indicates the size of the array in each dimension. The length of the *shape* tuple is the number of axes that the array has.
- Arrays have a *dtype* attribute which is an object that describes the data type of the array.
- The `size` attribute is the total number of elements in the array.
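
These attributes can be inspected on a small example array. The values are arbitrary.

```python
import numpy as np

# A 2-D array built from a nested list: 2 rows, 3 columns.
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)    # 2, the number of axes
print(arr.shape)   # (2, 3), the size along each axis
print(arr.size)    # 6, the total number of elements
print(arr.dtype)   # an integer type (the exact width depends on the platform)

# Creation functions such as zeros return float64 arrays by default.
zeros = np.zeros((2, 3))
print(zeros.dtype)  # float64
```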


## 2. Explain the use of the “Simple random data” and “Permutations” functions.

See [Random Sampling using numpy.random](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#random-sampling-numpy-random) of the numpy documentation which for NumPy version 1.16 has 4 main sections:

- Simple random data
- Permutations
- Distributions
- Random Generator


### Simple random data

**Note that this is intermediate work, as the aim is not simply to rehash the documentation.**
  
- `rand(d0, d1, …, dn)` Return random values in a given shape.
- `randn(d0, d1, …, dn)` Return a sample (or samples) from the “standard normal” distribution.
- `randint(low[, high, size, dtype])` Return random integers from low (inclusive) to high (exclusive).
- `random_integers(low[, high, size])` Random integers of type np.int between low and high, inclusive (deprecated; use `randint` instead).
- `random_sample([size])` Return random floats in the half-open interval [0.0, 1.0).
- `random([size])` Return random floats in the half-open interval [0.0, 1.0).
- `ranf([size])` Return random floats in the half-open interval [0.0, 1.0).
- `sample([size])` Return random floats in the half-open interval [0.0, 1.0).
- `choice(a[, size, replace, p])` Generate a random sample from a given 1-D array.
- `bytes(length)` Return random bytes.
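
A brief sketch of how several of these functions are called. The shapes, ranges, colours and probabilities are made up for illustration.

```python
import numpy as np

uniform = np.random.rand(2, 3)            # floats in [0, 1), shape (2, 3)
normal = np.random.randn(2, 3)            # standard normal samples, shape (2, 3)
ints = np.random.randint(1, 7, size=10)   # integers 1 to 6, like ten dice rolls
floats = np.random.random_sample(4)       # 4 floats in [0, 1)

# choice can weight the options with the p argument (must sum to 1).
picks = np.random.choice(["red", "green", "blue"], size=5, p=[0.5, 0.3, 0.2])

raw = np.random.bytes(8)                  # 8 random bytes

print(uniform.shape, normal.shape)
print(ints)
print(picks)
print(len(raw))                           # 8
```

Note that `rand` and `randn` take the dimensions as separate arguments, whereas the other functions take a single `size` argument.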

In [11]:
import random
random.random()

0.19447119301697013

In [15]:
import numpy as np
import numpy.random
rand1 = np.random.rand(1,2,3)


In [20]:
rand1.dtype

dtype('float64')
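
The “Permutations” section of the NumPy 1.16 documentation lists two functions, `shuffle` and `permutation`. A minimal sketch of the difference between them, using an arbitrary array of ten integers:

```python
import numpy as np

deck = np.arange(10)
original = deck.copy()

# permutation returns a shuffled copy and leaves the input untouched.
shuffled_copy = np.random.permutation(deck)
unchanged = np.array_equal(deck, original)

# shuffle rearranges the array in place and returns None.
result = np.random.shuffle(deck)

print(unchanged)       # True
print(result)          # None
print(shuffled_copy)   # the same ten values in a random order
print(deck)            # also the same ten values in a random order
```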

## 3. Explain the use and purpose of at least five “Distributions” functions.

Show how random numbers can be drawn from different probability distributions. Use some plots and statistics here to show the differences between the types of random numbers that would be generated from the different probability distributions.
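
As a starting point for the statistics, the sketch below draws samples from five distributions and prints summary statistics for each. The choice of distributions, their parameters, the sample size and the seed are all assumptions made for illustration, not the final selection for the assignment.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed so the summary is repeatable
n = 10_000          # sample size, chosen arbitrarily

samples = {
    "uniform(0, 1)": np.random.uniform(0.0, 1.0, n),
    "normal(0, 1)": np.random.normal(0.0, 1.0, n),
    "binomial(10, 0.5)": np.random.binomial(10, 0.5, n),
    "poisson(3)": np.random.poisson(3.0, n),
    "exponential(2)": np.random.exponential(2.0, n),
}

# Compare the centre, spread and range of each sample.
for name, values in samples.items():
    print(f"{name:>18}: mean={values.mean():.3f} std={values.std():.3f} "
          f"min={values.min():.3f} max={values.max():.3f}")
```

The sample means should land close to the theoretical means (0.5, 0, 5, 3 and 2 respectively), while the ranges differ markedly: the uniform sample is bounded, the normal sample is symmetric around zero, and the binomial and Poisson samples take only integer values.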

## 4. Explain the use of seeds in generating pseudorandom numbers.

I will show how to use a seed to generate a pseudorandom sequence of numbers that is reproducible.
First of all explain what a **seed** is in this context. 
Show that setting a seed will produce the same sequence of random numbers each time.
This means that the functions follow a very particular set of instructions to generate the so-called random numbers which is what **deterministic** means.

See the section https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html#random-generator



Random Generator.
- `RandomState([seed])`
- `seed([seed])`
- `get_state()`
- `set_state(state)`


`get_state` captures the state of the generator at any time, returning a tuple that represents its internal state.
`set_state` sets the internal state of the generator from such a tuple, and is used if for any reason you want to manually reset the internal state of the "Mersenne Twister" generator. (The equivalent functions in Python's built-in `random` module are named `getstate` and `setstate`.)

The tuple returned from `get_state` can be passed to `set_state` to resume generation from exactly that point.
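
A minimal sketch of capturing and restoring the state with the legacy `numpy.random` functions. The seed value is arbitrary.

```python
import numpy as np

np.random.seed(1)                 # arbitrary starting point
state = np.random.get_state()     # capture the generator's internal state

a = np.random.rand(3)             # advances the generator

np.random.set_state(state)        # rewind to the captured state
b = np.random.rand(3)             # replays the same three numbers

print(state[0])                   # MT19937, the Mersenne Twister
print(np.array_equal(a, b))       # True
```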


### What exactly is a seed?

A random **seed** is a number that is used to initialise a pseudorandom number generator. This number does not need to be random. When you set the seed yourself, it replaces the seed the generator would otherwise have chosen, and the numbers are still generated in a pseudorandom manner. If you reinitialise a random number generator with the same seed then the same sequence of numbers will be produced.

According to [statisticshowto](https://www.statisticshowto.datasciencecentral.com/random-seed-definition/), a random seed specifies the start point when a computer generates a random number. The random seed can be any number but it usually comes from the computer system's clock, which counts the seconds since January 1, 1970 (known as Unix time). This ensures that the same random sequence won't be repeated unless you actually want it to.


Computers generate random numbers in a *deterministic* way - that is by following a set of rules. Randomness can be imitated by specifying a set of rules to follow. The algorithms behind computer number generation are based on patterns which generate numbers that follow a particular probability distribution.

https://pynative.com/python-random-seed/
>Random number or data generated by Python’s random module is not truly random, it is pseudo-random(it is PRNG), i.e. deterministic. It produces the numbers from some value. This value is nothing but a seed value. i.e. The random module uses the seed value as a base to generate a random number.

>Generally, the seed value is the previous number generated by the generator. However, When the first time you use the random generator, there is no previous value. So by-default current system time is used as a seed value.



#### Pseudorandom number generators
Random samples can be drawn with replacement or without replacement. With replacement means that a number that has been drawn is placed back in and could be selected again.
Without replacement means that once a number is chosen it is not placed back in the pot, so it cannot be selected again in the same sample and there is no possibility of duplicates.
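
In `numpy.random` this choice is exposed through the `replace` argument of `choice`. A small sketch with a hypothetical pot of the numbers 1 to 10:

```python
import numpy as np

pot = np.arange(1, 11)   # the numbers 1 to 10

# With replacement: a number can be drawn more than once.
with_repl = np.random.choice(pot, size=10, replace=True)

# Without replacement: every draw removes that number from the pot.
without_repl = np.random.choice(pot, size=10, replace=False)

print(with_repl)                      # duplicates are likely
print(without_repl)                   # each number appears exactly once
print(len(np.unique(without_repl)))   # 10
```

Without replacement, asking for more values than the pot contains raises a `ValueError`.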

Python's `random` module and `numpy.random` are examples of pseudorandom number generators. A true random number generator would involve hardware, while pseudorandom number generators involve only software.

Computers can't really generate random numbers, and even humans show patterns when asked to think of random numbers. Computers generate numbers that look random, so that someone else wouldn't be able to predict the next number generated.
We are really only looking at *pseudo* random numbers. If you do have a key piece of information, the **seed**, then you can predict the next random number. Unless you set it yourself, the seed is typically taken from the system time (to the microsecond) or from a source of randomness provided by the operating system when the generator is first set up.


Example of **pi** from the lecture. Pi is the ratio of the circumference of a circle to its diameter, and it has a decimal expansion that never ends and never repeats.
Therefore you will only ever see an approximation of pi because it has an unending expansion.
The digits never settle into a repeating pattern. While you might find the same pairs of digits more than once, these pairs do not appear periodically.
If you wanted to generate a random digit between 0 and 9, you could go out to a point in the decimal expansion of pi and not tell anyone where you started. The position where you started from is the seed. If no-one knows where you start from, then no-one can predict where you go next.
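
The pi analogy can be turned into a toy "generator" in a few lines. This is only an illustration of the idea of a seed as a starting position, not how any real generator works; the function name `pi_digit_sequence` is made up for this sketch.

```python
# The first 50 digits of pi after the decimal point.
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def pi_digit_sequence(seed, count):
    """Return `count` digits of pi starting at position `seed`."""
    return [int(d) for d in PI_DIGITS[seed:seed + count]]

# The same seed (starting position) always gives the same "random" digits.
print(pi_digit_sequence(seed=10, count=5))   # [8, 9, 7, 9, 3]
print(pi_digit_sequence(seed=10, count=5))   # [8, 9, 7, 9, 3]

# A different seed gives a different sequence.
print(pi_digit_sequence(seed=0, count=5))    # [1, 4, 1, 5, 9]
```

Anyone who knows the starting position can predict every digit that follows, which is exactly the sense in which a seeded sequence is deterministic.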

For various applications you will want to set the seed. For example, when testing code, if you want the same (pseudo-random) numbers generated each time, you need to set the seed.
This tells Python not to choose its own seed at the start (for example from the time to the microsecond on the machine) and instead to start from the seed you provide, so that you get the very same output another time.

Pseudorandom number generators can be seeded, which makes them deterministic: the series of random values can be recreated and predicted. A seed is like a starting point for the random number generation process. The computer's system time is often used for the seed, and this is fed into an algorithm to generate some (pseudo) random values. There are times, however, when you need to be able to generate exactly the same sequence of random numbers, such as for testing, demonstrating or teaching.
You can then provide a seed to the process.








### How are random numbers generated?

According to [NumPy random module](https://numpy.org/devdocs/reference/random/index.html?highlight=random#module-numpy.random) 
>Numpy’s random number routines produce pseudo random numbers using combinations of a BitGenerator to create sequences and a Generator to use those sequences to sample from different statistical distributions.

It goes on to describe how 
- **BitGenerators** are objects that generate random numbers, which are typically unsigned integer words filled with sequences of either 32 or 64 random bits.
- **Generators** are objects that transform these sequences of random bits from the BitGenerator into sequences of numbers that follow a specific probability distribution within a specified interval.

See [numpy docs on random sampling](https://numpy.org/devdocs/reference/random/index.html?highlight=random#random-sampling-numpy-random) on the changes since NumPy version 1.17.0. This ties in with the differences in the documentation I noticed between versions 1.16 and 1.17.

>Since Numpy version 1.17.0 the Generator can be initialized with a number of different BitGenerators. It exposes many different probability distributions. See NEP 19 for context on the updated random Numpy number routines. The legacy RandomState random number routines are still available, but limited to a single BitGenerator.
>For convenience and backward compatibility, a single RandomState instance’s methods are imported into the numpy.random namespace, see Legacy Random Generation for the complete list.
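
A short sketch of how a `Generator` is paired with a BitGenerator, assuming NumPy 1.17 or later is installed. The seed 12345 and the distribution parameters are arbitrary.

```python
import numpy as np
from numpy.random import Generator, PCG64, MT19937

# A Generator can be built on top of different BitGenerators.
rng_pcg = Generator(PCG64(12345))     # PCG64 bit generator
rng_mt = Generator(MT19937(12345))    # Mersenne Twister bit generator

# default_rng is a convenience constructor that picks a BitGenerator for you.
rng_default = np.random.default_rng(12345)

# The Generator turns the random bits into draws from a distribution.
normal_draws = rng_pcg.normal(loc=0.0, scale=1.0, size=3)
int_draws = rng_mt.integers(0, 10, size=5)    # integers from 0 to 9
uniform_draws = rng_default.random(4)         # floats in [0, 1)

print(normal_draws)
print(int_draws)
print(uniform_draws)
```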


### Pseudorandom numbers vs random numbers
>Pseudorandom  numbers are generated by an algorithm with deterministic behaviour based on the seed of the random number generator.


Numbers generated by either the built-in `random` module or the `numpy.random` package are not actually random at all but *pseudorandom* numbers. Computers cannot actually generate random numbers, but they do generate sequences that look random, so that someone else wouldn't be able to predict the next number generated without a key piece of information. This is the **seed**. If you know the seed then you can predict the next number to be generated in a sequence, and therefore the numbers generated are not random as such. Unless you set it yourself, the seed is typically taken from the system time (to the microsecond) or from a source of randomness provided by the operating system when the generator is first set up.

The lecture demonstrated the example of **pi**, which is the ratio of the circumference of a circle to its diameter. Pi has a decimal expansion that never ends and never repeats. Therefore you only ever see an approximation of pi because it has an unending expansion.
The digits in pi never settle into a repeating pattern, and while you might find the same pairs of digits more than once, these pairs do not appear periodically.
Therefore if you wanted to generate a random digit between 0 and 9, you could go out to a point in the decimal expansion and not tell anyone where you started. The position where you started from is the seed. If no-one knows where you start from, then no-one can predict where you go next. If someone does know where in the decimal expansion you started from, then they could predict the next number in the "random" sequence.

For various applications, you may wish to set a seed so that the exact sequence of random numbers can be generated. For example when testing code or for teaching purposes where you want the output to be reproducible you can set the seed. This tells Python not to generate a random seed at the start (from the time to the microsecond on the machine) and instead you provide it with a seed to start from to get the very same output another time.

In summary, random (pseudorandom) numbers are drawn from a probability distribution. The numbers generated depend on the *seed* used and are generated according to some deterministic algorithm from that seed. 

### References
- Python for Data Analysis - chapter 4 NumPy Basics: Arrays and Vectorised Computation by Wes McKinney
- Python Data Science Handbook by Jake VanderPlas
- [numpy quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html)
- [NumPy random module](https://numpy.org/devdocs/reference/random/index.html?highlight=random#module-numpy.random)
- [python - random library](https://docs.python.org/3/library/random.html#module-random)
- Section 4.6 of Python for Data Analysis by Wes McKinney

### The numpy.random package documentation on docs.scipy.org

I noticed that the documentation for the latest version, 1.17, does not seem to list the `numpy.random.rand()` function in the same way as the documentation for NumPy version 1.16 (or NumPy versions 1.15 and 1.14).

[numpy-1.15.1/reference.routines.random](https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html) has `numpy.random.rand(d0,d1,...,dn)`
The [numpy/reference/random](https://docs.scipy.org/doc/numpy/reference/random) documentation has a Random sampling (`numpy.random`) section with a [quick start guide](https://docs.scipy.org/doc/numpy/reference/random/index.html#quick-start). It outlines some changes since the videos were made and refers to the **Generator** as a replacement for `RandomState`.

I'll have to go through the documentation and see what the changes are.

- https://docs.scipy.org/doc/numpy/reference/random/generated/numpy.random.Generator.random.html#numpy.random.Generator.random

Note that the documents under this URL refer to `numpy.random.Generator`, which I think is just the newer interface.
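
The difference between the two interfaces can be sketched as follows, assuming NumPy 1.17 or later. The main change visible here is how the output shape is passed.

```python
import numpy as np

# Legacy interface (NumPy 1.16 docs): dimensions as separate arguments.
legacy = np.random.rand(2, 3)

# Newer interface (NumPy 1.17 docs): create a Generator, then pass the shape as a tuple.
rng = np.random.default_rng()
modern = rng.random((2, 3))

print(legacy.shape)   # (2, 3)
print(modern.shape)   # (2, 3)
```

Both calls return floats in the half-open interval [0.0, 1.0).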





In [2]:
import random
random.randrange(300,500)

472