# Programming for Data Analysis Assignment 2019

## Problem Statement
The assignment concerns the `numpy.random` package in Python.
Include a Jupyter notebook (this one) to explain the use of the package including detailed explanations of at least five of the distributions provided for in the package.
There are four distinct tasks to be carried out in the Jupyter Notebook.
1. Explain the overall purpose of the package.
2. Explain the use of the “Simple random data” and “Permutations” functions. 
3. Explain the use and purpose of at least five “Distributions” functions.
4. Explain the use of seeds in generating pseudorandom numbers.

## Documents required for the submission
A git repository containing:
- a README
- a git ignore file
- a Jupyter Notebook

## Requirements
A good submission will be clearly organised and contain concise explanations of the particularities of the dataset. The analysis contained within the notebook will be well conceived, interesting, and well researched. Note that part of this assignment is about the use of Jupyter notebooks and so you should make use of all the functionality available in the software including images, links, code and plots. You may use any Python libraries that you wish, whether they have been discussed in class or not.



### instructions as per the video - delete later. 
This is just for me to refer to.
The documentation for the NumPy package at [numpy.random docs (1.16.1 reference/routines)](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html) is broken down into four sections.

- [simple random data](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#simple-random-data)  
- [permutations](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#permutations)  
- [distributions](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#distributions)    
- [random generator](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#random-generator)  

My understanding of this assignment from the video is that we are to look at the `numpy.random` package and, instead of just going through every function in the package (as this is already well documented), describe it in our own words, in less robotic terms. There is no need to go through every single function. Instead, take a handful of the functions of `numpy.random` and describe them using things like plots, descriptive statistics or anything else we fancy, rather than just noting the parameters, arguments and outputs of each function as in the documentation. Give a more real, descriptive flavour of what the package is about.




# First a quick overview of the NumPy package

Here I will start by making some notes from the lectures on the NumPy package, the package documentation and the Python for Data Analysis book chapter 4 by Wes McKinney. 

The [numpy quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html) provides a good overview of the NumPy package.
I won't go into too much detail here as this assignment is just about the `numpy.random` package but just a few points to refer to!

[what is numpy](https://docs.scipy.org/doc/numpy/user/whatisnumpy.html)

`NumPy` is a Python package that is very quick at dealing with numbers in very long lists or even lists of lists. Plain Python has no built-in matrix type, and looping over nested lists is slow compared with NumPy's array operations. NumPy is a specialist package that deals with multi-dimensional arrays that resemble lists within lists.

NumPy arrays are rectangular in shape. In a 2-dimensional array, for example, the number of rows and columns can be different but each row must contain the same number of elements and each column must contain the same number of elements.
Multi-dimensional arrays with more than two dimensions can be created. Two-dimensional arrays are probably the most commonly used.
Operations can be performed on NumPy arrays very quickly. NumPy has many algorithms for dealing with numerical operations on arrays.
NumPy can be used to simulate data. While csv data is commonly imported into NumPy for analysis (or more likely through the pandas package, which uses NumPy in the background), it is often handy to be able to simulate data to analyse before the real data has been collected or made available.

NumPy is typically used through other packages. It is a specialist package for dealing with arrays of numbers. While NumPy provides a computational foundation for working with numbers, it does not have the modelling or scientific functionality of other computational packages in Python, but these packages use NumPy's array objects.
 
NumPy works well with arrays of numbers and performs operations on matrices far more efficiently than plain Python loops over nested lists. Matrix operations are very commonly used in computer applications, including analysing data, resizing or compressing images, or converting CDs into MP3s.


- NumPy is short for Numerical Python. 
- NumPy is one of the most important foundational packages for numerical computing in Python. 
- NumPy's main data structure is the `ndarray`, a homogeneous multidimensional array. An *ndarray* is also known by the alias *array*.
- A NumPy array is like a table of elements that are all of the same type. The elements, which are usually numbers, are indexed by a tuple of non-negative integers. Axes refer to the dimensions of a NumPy array.

- NumPy has many mathematical functions that avoid the need to use loops as they work on entire arrays of data.
- NumPy is designed for efficiency on large arrays of data, which is partly due to the way NumPy stores data in a contiguous block of memory. NumPy's algorithms can operate on this memory without any type checking. NumPy arrays use much less memory than Python's own built-in sequences.
- NumPy arrays use *vectorisation*. This is where batch operations can be performed on arrays without using loops. 
- Arithmetic operations are carried out *elementwise* on an array and result in a new array being created.
- Some operations on arrays, such as `+=` and `-=`, modify an array in place and do not create a new one.
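
To make the last few points concrete, here is a minimal sketch of elementwise arithmetic and in-place modification. The array names `a`, `b`, `c` and `doubled` are just illustrative choices of mine.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Elementwise arithmetic: no explicit loop, and a new array is created.
c = a + b
doubled = a * 2

# In-place operators modify the existing array instead of creating a new one.
id_before = id(a)
a += 1
same_object = id(a) == id_before

print(c)            # [11 22 33 44]
print(doubled)      # [2 4 6 8]
print(a)            # [2 3 4 5]
print(same_object)  # True
```

Note that `doubled` was computed before the in-place update, so it still reflects the original values of `a`.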
 
## NumPy's ndarray object. 
The *ndarray* is an N-dimensional or multi-dimensional array object. It is a fast and flexible container for larger datasets in Python. You can perform mathematical operations on entire arrays using the same kind of syntax as the equivalent operations on scalar values.
- The data in an *ndarray* must be homogeneous, meaning that all its data elements must be of the same type.
- Arrays have a *ndim* attribute for the number of axes or dimensions of the array.
- Arrays have a *shape* attribute which is a tuple that indicates the size of the array in each dimension.
The length of the *shape* tuple is the number of axes that the array has.
- Arrays have a *dtype* attribute which is an object that describes the data type of the array.
- The `size` attribute is the total number of elements in the array.


An ndarray can be created using the NumPy `array` function, which takes any sequence-like object such as a list or nested list. Other NumPy functions that create new arrays include `zeros`, `ones`, `empty`, `arange`, `full` and `eye`, among others. All these functions create an array object.
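
As a quick illustration of the attributes and creation functions mentioned above, the following sketch builds a few small arrays. The shapes and fill values are arbitrary examples, not anything prescribed by the documentation.

```python
import numpy as np

# Build a 2-D array from a nested list (2 rows, 3 columns).
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)   # 2  -> number of axes
print(arr.shape)  # (2, 3) -> size along each axis
print(arr.size)   # 6  -> total number of elements
print(arr.dtype)  # an integer dtype; the exact width depends on the platform

# Other functions that create new arrays.
zeros = np.zeros((2, 3))       # 2x3 array of 0.0
ones = np.ones(4)              # 1-D array of four 1.0 values
steps = np.arange(0, 10, 2)    # [0 2 4 6 8]
sevens = np.full((2, 2), 7)    # 2x2 array filled with 7
identity = np.eye(3)           # 3x3 identity matrix
```

The length of `arr.shape` equals `arr.ndim`, which matches the description of the *shape* tuple above.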



## numpy.random

- [NumPy random module](https://numpy.org/devdocs/reference/random/index.html?highlight=random#module-numpy.random)
- [python - random library](https://docs.python.org/3/library/random.html#module-random)
- Section 4.6 of Python for Data Analysis by Wes McKinney

`numpy.random` is a sub-package or module of the NumPy package that generates random numbers. There are many different ways to generate random numbers, although strictly these are pseudorandom numbers because they are produced by a deterministic algorithm.

Python has a built-in **random** module which implements pseudo-random numbers for various distributions. See [python docs/random library](https://docs.python.org/3/library/random.html). This built-in `random` module only samples one value at a time.
`numpy.random` supplements the built-in `random` module with functions that can efficiently generate whole arrays of sample values from various probability distributions, rather than one value at a time. For large samples this is typically much faster.
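
The difference between sampling one value at a time and sampling a whole array can be sketched as follows. This is only an illustration of the two calling styles, not a benchmark, and the variable names are my own.

```python
import random

import numpy as np

# Built-in module: each call returns a single Python float in [0.0, 1.0).
single = random.random()

# To get several values you have to loop (here with a list comprehension).
looped = [random.random() for _ in range(5)]

# numpy.random: one call returns a whole ndarray of samples.
batch = np.random.random_sample(5)

print(type(single))
print(len(looped))
print(type(batch), batch.shape)
```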


### How are random numbers generated?

According to [NumPy random module](https://numpy.org/devdocs/reference/random/index.html?highlight=random#module-numpy.random) 
>Numpy’s random number routines produce pseudo random numbers using combinations of a BitGenerator to create sequences and a Generator to use those sequences to sample from different statistical distributions.

It goes on to describe how 
- **BitGenerators** are objects that generate random numbers, which are typically unsigned integer words filled with sequences of either 32 or 64 random bits.
- **Generator** objects then transform these sequences of random bits from the BitGenerator into sequences of numbers that follow a specific probability distribution within a specified interval.

See [numpy docs on random sampling](https://numpy.org/devdocs/reference/random/index.html?highlight=random#random-sampling-numpy-random) on the changes since NumPy version 1.17.0. This ties in with the differences in the documentation I noticed between versions 1.16 and 1.17.

>Since Numpy version 1.17.0 the Generator can be initialized with a number of different BitGenerators. It exposes many different probability distributions. See NEP 19 for context on the updated random Numpy number routines. The legacy RandomState random number routines are still available, but limited to a single BitGenerator.
For convenience and backward compatibility, a single RandomState instance’s methods are imported into the numpy.random namespace, see Legacy Random Generation for the complete list.
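
The split between a BitGenerator and a Generator can be sketched as below. This assumes NumPy 1.17 or later, where `Generator`, `PCG64` and `default_rng` are available. The seed value 12345 is an arbitrary choice for illustration.

```python
import numpy as np
from numpy.random import Generator, PCG64

# Assumption: NumPy >= 1.17 (the Generator / BitGenerator interface).

# The BitGenerator produces the raw stream of random bits.
bit_gen = PCG64(12345)

# The Generator turns those bits into samples from a chosen distribution.
rng = Generator(bit_gen)
uniform_sample = rng.random(3)                          # floats in [0.0, 1.0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=3)  # normal distribution

# Convenience constructor that builds a Generator for you.
rng2 = np.random.default_rng(12345)
another_sample = rng2.random(3)

# The legacy interface is still available through RandomState.
legacy = np.random.RandomState(12345)
legacy_sample = legacy.rand(3)

print(uniform_sample)
print(normal_sample)
print(legacy_sample)
```

The legacy `RandomState` and the newer `Generator` use different underlying algorithms, so the same seed should not be expected to give the same numbers in both.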


### Pseudorandom  numbers vs random numbers.
>Pseudorandom  numbers are generated by an algorithm with deterministic behaviour based on the seed of the random number generator.

Numbers generated by either the built-in `random` module or the `numpy.random` package are not actually random at all but *pseudorandom* numbers. Computers running a deterministic algorithm cannot generate truly random numbers, but they can generate sequences that look random, so that someone else wouldn't be able to predict the next number without a key piece of information. This is the **seed**. If you know the seed then you can predict the next number to be generated in a sequence, and therefore the numbers generated are not random as such. If you do not supply a seed, one is chosen for you, typically from the time (to the microsecond) on the computer or from a source of randomness provided by the operating system. This default seed is set when the generator is first initialised, for example when the module is imported into a Jupyter notebook or Python script.

The lecture demonstrated the example of **pi**, which is the ratio of the circumference of a circle to its diameter. Pi has a decimal expansion that never ends and never repeats. Therefore you only ever see an approximation of pi because it has an unending expansion.
The digits of pi never settle into a repeating pattern, and while you might find repeated pairs of digits, these pairs do not appear periodically.
Therefore if you wanted to generate a random digit between 0 and 9, you could go out to a point in the decimal expansion and not tell anyone where you started. The position where you started from is the seed. If no-one knows where you start from, then no-one can predict where you go next. If someone knows where in the decimal expansion you started from, then they could predict the next number in the "random" sequence.
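
The pi analogy can be turned into a toy example. This is only a sketch of the idea of a seed as a starting position, using the first 50 decimal digits of pi hard-coded as a string. It is not how NumPy actually generates numbers, and `digits_from` is a hypothetical helper of my own.

```python
# First 50 digits of pi after the decimal point.
PI_DIGITS = "14159265358979323846264338327950288419716939937510"


def digits_from(seed, count):
    """Return `count` digits of pi starting at 0-based position `seed`."""
    return [int(d) for d in PI_DIGITS[seed:seed + count]]


# Anyone who knows the seed (the starting position) gets the same "random" digits.
print(digits_from(0, 5))   # [1, 4, 1, 5, 9]
print(digits_from(10, 5))  # [8, 9, 7, 9, 3]
print(digits_from(10, 5) == digits_from(10, 5))  # True
```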

For various applications, you may wish to set a seed so that the exact sequence of random numbers can be generated. For example when testing code or for teaching purposes where you want the output to be reproducible you can set the seed. This tells Python not to generate a random seed at the start (from the time to the microsecond on the machine) and instead you provide it with a seed to start from to get the very same output another time.
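
A minimal sketch of setting a seed for reproducible output, using the legacy `np.random.seed` function. The seed value 42 is arbitrary.

```python
import numpy as np

# Set the seed, then draw three numbers.
np.random.seed(42)
first = np.random.rand(3)

# Reset to the same seed: the very same three numbers come out again.
np.random.seed(42)
second = np.random.rand(3)

# Carrying on without reseeding gives the next numbers in the sequence.
third = np.random.rand(3)

print(np.array_equal(first, second))  # True
print(np.array_equal(second, third))
```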


### numpy.random functions
There are many functions in the `numpy.random` package that relate to generating random numbers from different probability distributions, and therefore it is not really possible to know all of these without referring to the documentation.
Plots can illustrate the difference between different probability distributions.
See the random generator functions: `RandomState`, `seed`, `set_state` and `get_state`.

### Random sampling / numpy.random
https://docs.scipy.org/doc/numpy/reference/random/index.html?highlight=numpy.random#module-numpy.random

The assumptions made about the type of random numbers you want depend on the distribution.
- `permutation`
- `shuffle`

`shuffle` changes an array in place, while `permutation` doesn't and returns a shuffled copy instead.
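
The difference between the two functions can be sketched as follows. The seed is only set so that the output is repeatable, and the variable names are illustrative.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the example is repeatable

arr = np.arange(10)

# permutation returns a shuffled copy and leaves the original untouched.
shuffled_copy = np.random.permutation(arr)
original_untouched = np.array_equal(arr, np.arange(10))

# shuffle rearranges the array itself and returns None.
result = np.random.shuffle(arr)

print(shuffled_copy)
print(original_untouched)  # True
print(result)              # None
print(arr)                 # arr itself is now rearranged
```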


## The random generator section

Random Generator.
- `RandomState([seed])`
- `seed([seed])`
- `get_state()`
- `set_state(state)`
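
A short sketch of how these four functions fit together, assuming the legacy `RandomState` interface. The seed 2019 is an arbitrary choice.

```python
import numpy as np

# Create a generator with its own seed.
rs = np.random.RandomState(2019)

# Save the internal state before drawing any numbers.
state = rs.get_state()
first = rs.rand(3)

# Restoring the saved state replays exactly the same numbers.
rs.set_state(state)
replay = rs.rand(3)

# Reseeding with the original seed has the same effect as starting again.
rs.seed(2019)
reseeded = rs.rand(3)

print(np.array_equal(first, replay))    # True
print(np.array_equal(first, reseeded))  # True
```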


### What exactly is a seed?

A random seed is a number used to initialise a pseudorandom number generator. This number does not need to be random. When you set the seed yourself, the default seed is not used, and the numbers are still generated in a pseudorandom manner. If you reinitialise a random number generator with the same seed then the same sequence of numbers will be produced.

According to [statisticshowto](https://www.statisticshowto.datasciencecentral.com/random-seed-definition/), a random seed specifies the start point when a computer generates a random number. The random seed can be any number but it usually comes from the computer system's clock, which counts in seconds from January 1, 1970 (known as Unix time). This ensures that the same random sequence won't be repeated unless you actually want it to be.
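
The idea of using the clock as a seed can be sketched by hand. This is only an illustration of the principle. As far as I understand, NumPy's legacy `RandomState` prefers a source of randomness from the operating system when no seed is given and only falls back to the clock. The names `clock_seed`, `sample` and `replay` are my own.

```python
import time

import numpy as np

# Whole seconds since 1 January 1970 (Unix time), used here as the seed.
clock_seed = int(time.time())

rs = np.random.RandomState(clock_seed)
sample = rs.rand(3)

# A different run (at a different second) would get a different seed,
# but keeping a record of the seed lets this exact run be replayed.
replay = np.random.RandomState(clock_seed).rand(3)

print(clock_seed)
print(np.array_equal(sample, replay))  # True
```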


Computers generate random numbers in a *deterministic* way, that is, by following a set of rules. Randomness can be imitated by specifying a set of rules to follow. The algorithms behind computer number generation are based on patterns which generate numbers that follow a particular probability distribution.

https://pynative.com/python-random-seed/
>Random number or data generated by Python’s random module is not truly random, it is pseudo-random(it is PRNG), i.e. deterministic. It produces the numbers from some value. This value is nothing but a seed value. i.e. The random module uses the seed value as a base to generate a random number.

>Generally, the seed value is the previous number generated by the generator. However, When the first time you use the random generator, there is no previous value. So by-default current system time is used as a seed value.



###  The numpy.random package documentation on docs.scipy.org

I noticed that the documentation for the latest version, 1.17, does not seem to list the `numpy.random.rand()` function in the same way as the documentation for NumPy version 1.16 (or NumPy versions 1.15 and 1.14).

[numpy-1.15.1/reference.routines.random](https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html) has `numpy.random.rand(d0,d1,...,dn)`.
The [numpy/reference/random](https://docs.scipy.org/doc/numpy/reference/random) page has a "Random sampling (`numpy.random`)" section which has a [quick start guide](https://docs.scipy.org/doc/numpy/reference/random/index.html#quick-start). It outlines some changes since the videos were made and refers to the **Generator** as a replacement for `RandomState`.

I'll have to go through the documentation and see what the changes are.

- https://docs.scipy.org/doc/numpy/reference/random/generated/numpy.random.Generator.random.html#numpy.random.Generator.random

Note that the documents under this URL refer to `numpy.random.Generator`, but I think this is just the newer form.


### numpy.random.rand()
Pass the dimensions into the function. If you don't pass any dimensions you get a scalar.
See [numpy.random.rand](https://docs.scipy.org/doc/numpy/reference/random/generated/numpy.random.Generator.random.html#numpy.random.Generator.random).

You get an array in which every row has the same number of elements and every column has the same number of elements. The number of rows can differ from the number of columns.
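
A small sketch of how the dimensions passed to `rand` determine what comes back. The last two lines assume NumPy 1.17 or later, where the newer `Generator.random` method takes the shape as a single tuple instead of separate arguments.

```python
import numpy as np

scalar = np.random.rand()         # no dimensions: a single Python float
vector = np.random.rand(4)        # 1-D array with 4 elements
matrix = np.random.rand(2, 3)     # 2 rows, 3 columns
cube = np.random.rand(1, 2, 3)    # 3-D array

print(type(scalar))
print(vector.shape)  # (4,)
print(matrix.shape)  # (2, 3)
print(cube.shape)    # (1, 2, 3)

# Newer interface (assumes NumPy >= 1.17): the shape is passed as a tuple.
rng = np.random.default_rng()
new_matrix = rng.random((2, 3))
print(new_matrix.shape)  # (2, 3)
```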

## [Random Sampling using numpy.random](https://docs.scipy.org/doc/numpy-1.16.1/reference/routines.random.html#random-sampling-numpy-random)

This section of the documentation for NumPy version 1.16 has 4 main sections. 
- Simple random data
- Permutations
- Distributions
- Random Generator

I will start with Simple random data next. Below are the main functions. We are to go through a few and describe in our own words. Do not just repeat the documentation!!

### Simple random data


- `rand(d0, d1, …, dn)` Random values in a given shape.
- `randn(d0, d1, …, dn)` Return a sample (or samples) from the “standard normal” distribution.
- `randint(low[, high, size, dtype])` Return random integers from low (inclusive) to high (exclusive).
- `random_integers(low[, high, size])` Random integers of type np.int between low and high, inclusive.
- `random_sample([size])` Return random floats in the half-open interval [0.0, 1.0).
- `random([size])` Return random floats in the half-open interval [0.0, 1.0).
- `ranf([size])` Return random floats in the half-open interval [0.0, 1.0).
- `sample([size])` Return random floats in the half-open interval [0.0, 1.0).
- `choice(a[, size, replace, p])` Generates a random sample from a given 1-D array.
- `bytes(length)` Return random bytes.
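
To get a feel for a few of these functions, here is a sketch that calls several of them through the legacy `np.random` namespace. The seed, the sizes, the colour names and the probabilities are arbitrary values chosen for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only so the example is repeatable

# 1000 samples from the standard normal distribution.
normals = np.random.randn(1000)

# Ten rolls of a six-sided die: low is inclusive, high is exclusive.
dice = np.random.randint(1, 7, size=10)

# A 2x2 array of floats in the half-open interval [0.0, 1.0).
floats = np.random.random_sample((2, 2))

# Five picks from a list, with replacement and unequal probabilities.
colours = ["red", "green", "blue"]
picks = np.random.choice(colours, size=5, replace=True, p=[0.5, 0.3, 0.2])

# Four random bytes.
raw = np.random.bytes(4)

# Descriptive statistics: mean should be near 0 and standard deviation near 1.
print(normals.mean(), normals.std())
print(dice)
print(floats)
print(picks)
print(raw)
```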

`numpy.random.rand(d0, d1, ..., dn)` returns random values in a given shape.

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

#### Parameters: d0, d1, …, dn : int, optional
The dimensions of the returned array, should all be positive. If no argument is given a single Python float is returned.
#### Returns:
out : ndarray, shape (d0, d1, ...,

In [11]:
import random
random.random()

0.19447119301697013

In [15]:
import numpy as np
# Array of shape (1, 2, 3) filled with uniform random floats in [0, 1)
rand1 = np.random.rand(1, 2, 3)


In [20]:
rand1.dtype

dtype('float64')
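
Beyond the dtype, some descriptive statistics give a better feel for what `rand` produces. The sketch below draws a larger sample and counts how many values fall into five equal-width bins. With a uniform distribution over [0, 1) the counts should be roughly equal. The seed and sample size are arbitrary.

```python
import numpy as np

np.random.seed(2019)  # arbitrary seed, only so the example is repeatable

sample = np.random.rand(10_000)

# All values lie in [0, 1) and the mean should be close to 0.5.
print(sample.min(), sample.max())
print(sample.mean())

# Count the values falling into five equal-width bins between 0 and 1.
counts, edges = np.histogram(sample, bins=5, range=(0.0, 1.0))
print(edges)
print(counts)  # each count should be roughly 2000
```

Passing the same `sample` to a plotting function such as a histogram plot would show the same flat shape visually.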

### References
- Python for Data Analysis - chapter 4 NumPy Basics: Arrays and Vectorised Computation by Wes McKinney
- Python Data Science Handbook by Jake VanderPlas

In [10]:
np?

Type:        module
String form: <module 'numpy' from '/Users/angelacorkery/anaconda3/lib/python3.7/site-packages/numpy/__init__.py'>
File:        ~/anaconda3/lib/python3.7/site-packages/numpy/__init__.py
Docstring:
NumPy
=====

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
`the NumPy homepage <https://www.scipy.org>`_.

We recommend exploring the docstrings using
`IPython <https://ipython.org>`_, an advanced Python shell with
TAB-completion and introspection capabilities.  See below for further
instructions.

The docstring examples assume that `numpy` has been imported as `np`::

  >>> import numpy as np

Code snippets are indicat