<div align="center">
    
# Programming for Data Analysis    
## Assignment 1: Analysis of numpy.random Package
---
<div align="center">

![Image description](https://github.com/colmhiggs11/ProandS_Project/blob/master/Pictures%20for%20README/GMIT.png?raw=true) 

![Name and Id number](https://github.com/colmhiggs11/ProandS_Project/blob/master/Pictures%20for%20README/Name%20number%20box.PNG?raw=true)

<div align="left">
    

## Table of Contents
---

* [Introduction](#introduction)
* [1 - Numpy.random package](#1__numpy_random_package)
* [2 - Simple random Data & Permutations](#2__simple_random_data___permutations)
    * [2.1 - Simple random data](#2_1__sinple_random_data)
    * [2.2 - Permutations](#2_2__permutations)
* [3 - Distribution Functions](#3__distribution_functions)
    * [3.1 - Binomial Distribution](#binomial_distribution)
    * [3.2 - Poisson Distribution](#poisson_distribution)
    * [3.3 - Uniform Distribution](#uniform_distribution)
    * [3.4 - Normal Distribution](#section_3_4)
    * [Section 3.5](#section_3_5)
* [4 - Seeds in generation of pseudorandom numbers](#4__seeds_in_generation_of_pseudorandom_numbers)
    * [Section 4.1](#section_4_1)
    * [Section 4.2](#section_4_2)
* [5 - References](#5__references)


## Introduction <a class="anchor" id="introduction"></a>
---

This Jupyter notebook contains an analysis of the numpy.random package in Python. Four distinct tasks are carried out in the notebook, each under its own numbered heading. The tasks listed below correspond to the numbered headings and can be accessed by clicking on the title of the task or on the link in the Table of Contents.

* [1. Explain the overall purpose of the package](#1__numpy_random_package)
* [2. Explain the use of the “Simple random data” and “Permutations” functions](#2__simple_random_data___permutations)
* [3. Explain the use and purpose of at least five “Distributions” functions](#3__distribution_functions)
* [4. Explain the use of seeds in generating pseudorandom numbers](#4__seeds_in_generation_of_pseudorandom_numbers)


## 1 - Numpy.random package <a class="anchor" id="1__numpy_random_package"></a>
---

* [Numpy](https://numpy.org/) is a numerical Python library built around arrays. Data held in lists can be converted to arrays so that calculations can be carried out on the whole array at once. Matrix multiplication is an example of a calculation that numpy handles directly and that would otherwise be difficult to code by hand. Numpy is used in many areas, the main ones being file compression, analysing datasets and, as mentioned, performing matrix operations that are cumbersome to carry out on plain lists.

The numpy.random package changed considerably between v1.15 and v1.19. The main differences are explained [here](https://numpy.org/doc/stable/reference/random/new-or-different.html#new-or-different). The following sections also explain and show examples of some of the newer features in v1.19.

* [Numpy.random](https://numpy.org/doc/stable/reference/random/index.html) is a sub package of the Numpy Python library. It contains various functions that generate and show random data in different forms. It is especially useful in instances where data cannot be collected and you need to generate arrays of data to complete an analysis of said data.

Numpy.random uses Generators and Bit Generators to generate pseudorandom numbers. The Bit Generator creates sequences of random bits, and the Generator converts them into sequences of numbers sampled from probability distributions. In the new version of Numpy the Generator can be used with several different Bit Generators. Pseudorandom number generators (PRNGs) are algorithms that produce sequences of numbers that seem random but are actually generated from mathematical formulas. They are ideal for testing code or modelling because, if a seed is used as the initial starting point, the same sequence of numbers is returned on every run. True random numbers are required for applications such as data encryption and gambling. The table below shows the type of random numbers suitable for each application.

<div align="center">

|![Pseudo Random Numbers vs True Random numbers](https://github.com/colmhiggs11/numpy-random-assignment/blob/main/Images/Pseudo%20V%20True%20random%20number%20comparison.PNG?raw=true)|
|:--:| 
| *PRNG vs TRNG*  |


<div align="left">
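
The relationship between a Generator and its Bit Generator described above can be sketched as follows. This is a minimal illustration, assuming numpy v1.17 or later; the seed value 42 and the variable names are arbitrary choices made for the example.

```python
import numpy as np

# default_rng() wraps the default Bit Generator (PCG64) in a Generator
rng = np.random.default_rng()
default_bitgen_name = type(rng.bit_generator).__name__

# A Generator can also be built explicitly on top of a chosen Bit Generator,
# here the older Mersenne Twister (MT19937); the seed 42 is an arbitrary example
mt_rng = np.random.Generator(np.random.MT19937(42))

# Both Generators expose the same distribution methods
pcg_sample = rng.normal(size=3)
mt_sample = mt_rng.normal(size=3)

print(default_bitgen_name)
print(pcg_sample, mt_sample)
```

The Bit Generator only supplies the random bits; the distribution methods such as normal belong to the Generator, which is why they are available whichever Bit Generator is used.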


## 2 - Simple random Data & Permutations <a class="anchor" id="2__simple_random_data___permutations"></a>
---
### 2.1 - Simple random data <a class="anchor" id="2_1__sinple_random_data"></a>

The simple random data section in the numpy documentation has **four** main functions. Examples of how to use these functions will be discussed further in this section. The updates in v1.19 mean it is possible to perform operations on multi-dimensional arrays for methods like choice, permutation and shuffle.

<div align="center">

|![v1.15 Integers](https://github.com/colmhiggs11/numpy-random-assignment/blob/main/Images/v1.15%20Simple%20Random%20Data.PNG?raw=true)|![v1.19 Integers](https://github.com/colmhiggs11/numpy-random-assignment/blob/main/Images/v1.19%20Simple%20Random%20Data.PNG?raw=true)|
|:--:|:--:|
| *v1.15 Functions in Simple Random Data*  | *v1.19 Functions in Simple Random Data* |


<div align="left">

To use any of the functions in the numpy.random package first numpy needs to be imported.

#### 2.1.1 Integers 

Integers is the updated method in v1.19 for generating random integers from a discrete uniform distribution. As seen in the comparison image in the previous section, the older versions also offered rand, randn, randint and random_integers; integers replaces randint and random_integers, while rand and randn (which return floats) are covered by random and standard_normal in the new API. The updated method allows the user to specify an endpoint. The minimum input required is a single integer, which is used as the exclusive upper limit of the range. The cell below shows the value "2" being passed to the generator.


In [1]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng()
rng.integers(2)

0

The random numbers that can be generated from this are limited to "0" and "1" because the high value is excluded from the range. (When no high value is given, the single value passed is treated as the high value and the output ranges from "0" up to, but not including, that value.)
default_rng is called to get a new instance of a Generator.

There are other inputs that can shape the users output. These are shown below:

   **Inputs**
   
    (low, high=None, size=None, dtype=np.int64, endpoint=False)
    
   **Output** 
    
    int or ndarray of ints
    size-shaped array of random integers from the appropriate distribution, or a single such random int if size not provided

* low and high are both integers and provide the lowest and highest values in the range of numbers to be generated.
* size sets the shape of the output. If it is included, the output becomes an array of integers within the limits; passing a tuple gives a multi-dimensional array.
* dtype allows the user to change the data type of the output. If nothing is entered, int64 is the default.
* endpoint controls whether the high value is included. With endpoint=False (the default) the range is [low, high); with endpoint=True it is [low, high].

With the integers function it is expected that all numbers will be generated with the same probability (uniform distribution).

In [2]:
import numpy as np
rng = np.random.default_rng()
rng.integers(1,4, size=(5,5) , endpoint=True)

array([[2, 4, 2, 3, 3],
       [2, 2, 1, 1, 4],
       [3, 4, 4, 4, 3],
       [4, 1, 2, 4, 2],
       [4, 2, 4, 3, 1]], dtype=int64)
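
The expectation that every integer is equally likely can be checked empirically. The sketch below is only an illustration: the sample size of 100,000 is an arbitrary choice, and the proportions will vary slightly from run to run.

```python
import numpy as np

rng = np.random.default_rng()

# Draw many integers from 1 to 4 inclusive (endpoint=True includes the high value)
draws = rng.integers(1, 4, size=100_000, endpoint=True)

# Count how often each value appears and convert the counts to proportions
values, counts = np.unique(draws, return_counts=True)
proportions = counts / draws.size

print(values)       # the distinct values that were drawn
print(proportions)  # each proportion should be close to 0.25
```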

#### 2.1.2 Random

Random is the updated method in v1.19 for generating random floats from a continuous uniform distribution. As seen in the comparison image in the previous section, it replaces ranf, random_sample and sample from older versions.

Returns random floats in the half-open interval [0.0, 1.0)

   **Inputs**
          
    (size=None, dtype=np.float64, out=None)

   **Output** 
    
    float or ndarray of floats
 
The size input determines the shape of the output and can be a single integer or a tuple. The cell below shows an array being generated from size "2","4": there are 2 rows of 4 values. The random floats are always generated between 0 and 1; there is no parameter to change this range, so to cover a different interval the output has to be scaled and shifted (or the uniform method used instead).


In [3]:
import numpy as np

rng = np.random.default_rng()
rng.random((2,4))



array([[0.76682592, 0.011335  , 0.2272334 , 0.73258854],
       [0.49469804, 0.17287565, 0.86935786, 0.56086916]])
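
Floats in a range other than [0.0, 1.0) can be obtained by scaling and shifting the output of random. The sketch below assumes hypothetical bounds of 5 and 10, chosen only for illustration, and also shows the uniform method, which takes the bounds directly.

```python
import numpy as np

rng = np.random.default_rng()

low, high = 5.0, 10.0  # hypothetical bounds chosen for illustration

# random() always samples from [0.0, 1.0); scale and shift to cover [low, high)
scaled = (high - low) * rng.random((2, 4)) + low

# uniform() draws from the interval between low and high directly
direct = rng.uniform(low, high, size=(2, 4))

print(scaled)
print(direct)
```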

#### 2.1.3 Choice
The choice() method allows the user to generate a random sample from a given array, which can hold various data types, so strings and other values can be sampled as well as numbers. If an integer is passed instead of an array, the sample is drawn from np.arange of that integer. The sample is generated with replacement by default; using "replace = False" samples without replacement (i.e. once a value is generated it cannot be generated again). The method can generate both uniform and non-uniform samples by assigning a probability to each of the entries in the array through the p parameter. If no probabilities are assigned, a uniform distribution over all of the entries is assumed. The shuffle parameter controls whether the values are shuffled when sampling without replacement.

   **Inputs**
          
    (a, size=None, replace=True, p=None, axis=0, shuffle=True)

   **Output** 
    
    samples : single item or ndarray
        The generated random samples


In [4]:
rng = np.random.default_rng()
rng.choice(10, 3, replace = False, p=[0.05, 0, 0.05, 0.6, 0,0.3,0,0,0,0], shuffle = True)

array([3, 5, 2], dtype=int64)
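
The same method also works on non-numeric data. The following sketch uses a hypothetical list of colours and hypothetical weights, neither of which comes from the numpy documentation, to illustrate weighted sampling with replacement and uniform sampling without replacement.

```python
import numpy as np

rng = np.random.default_rng()

colours = ["red", "green", "blue"]   # hypothetical example data
weights = [0.7, 0.2, 0.1]            # probabilities must sum to 1

# Sampling with replacement: values may repeat, and "red" should dominate
with_replacement = rng.choice(colours, size=1000, p=weights)

# Sampling without replacement: each value can appear at most once
without_replacement = rng.choice(colours, size=3, replace=False)

red_share = np.mean(with_replacement == "red")
print(red_share)            # should be close to 0.7
print(without_replacement)  # the three colours in a random order
```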

#### 2.1.4 Bytes

This method returns random bytes. The input required is the number of bytes wanted by the user.

   **Inputs**
          
    (length)

   **Output** 
    
    out : str
    String of length length.

In [5]:
import numpy as np
np.random.default_rng().bytes(3)


b'\x97\x16\xb2'
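
The returned value is an ordinary Python bytes object, so it can be inspected with standard Python tools. A small sketch (the variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng()

raw = rng.bytes(3)     # a bytes object containing 3 random bytes
as_ints = list(raw)    # each byte viewed as an integer from 0 to 255
as_hex = raw.hex()     # the same bytes written as hexadecimal text

print(raw, as_ints, as_hex)
```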

### 2.2 - Permutations <a class="anchor" id="2_2__permutations"></a>

<div align="center">

|![Permutations](https://github.com/colmhiggs11/numpy-random-assignment/blob/main/Images/Permutations.PNG?raw=true)|
|:--:| 
| *Functions within the Permutations section* |


<div align="left">
    
    
#### 2.2.1 Shuffle

Shuffle takes an array or list of data and shuffles it in place into different positions, so nothing is returned. With a multi-dimensional array it gets a little more complicated. Using the np.arange function with the shuffle function, it can be seen below that by default the order of the sub arrays is changed but their contents remain the same. Changing the axis input parameter to 1 instead shuffles the contents within the sub arrays and not their positions.


   **Inputs**
          
    (x, axis=0)

   **Output** 
    
    None




In [6]:
#Single array of 10 values that are shuffled
rng = np.random.default_rng()
arr = np.arange(10)
rng.shuffle(arr)
arr

array([7, 5, 3, 6, 0, 1, 8, 9, 2, 4])

In [7]:
# The sub arrays positions are shuffled but their contents remain the same
arr = np.arange(9).reshape((3, 3))
rng.shuffle(arr)
arr

array([[3, 4, 5],
       [6, 7, 8],
       [0, 1, 2]])

In [8]:
# The position of the sub arrays remains and the contents are shuffled 
arr = np.arange(9).reshape((3, 3))
rng.shuffle(arr,axis=1)
arr

array([[2, 0, 1],
       [5, 3, 4],
       [8, 6, 7]])
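
The behaviour with axis=1 can be examined more closely. The sketch below, which assumes a numpy version whose shuffle accepts the axis parameter as used above, checks that shuffle returns None and that whole columns are moved together, so every row is reordered in the same way.

```python
import numpy as np

rng = np.random.default_rng()

original = np.arange(9).reshape((3, 3))
arr = original.copy()

# shuffle works in place and returns None
result = rng.shuffle(arr, axis=1)

# With axis=1 whole columns are moved, so every row is reordered the same way
# and each column of the shuffled array is still a column of the original
original_columns = sorted(tuple(col) for col in original.T)
shuffled_columns = sorted(tuple(col) for col in arr.T)

print(result)                                # None
print(original_columns == shuffled_columns)  # True
```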


#### 2.2.2 Permutation

This method randomly reorders the data or range passed in. Unlike shuffle, when a single integer is passed as an input parameter the np.arange function does not have to be called. 

    np.random.default_rng().permutation

This will automatically create an array and shuffle the contents. If an array is passed then the permutation function will make a copy and shuffle the contents.

   **Inputs**
          
    (x, axis=0)

   **Output** 
    
    out : ndarray
        Permuted sequence or array range

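In [9]:
import numpy as np
rng = np.random.default_rng()
# 9 values fit a 3 x 3 array (a 3 x 2 shape would raise a ValueError)
arr = np.arange(9).reshape((3, 3))
# permutation returns a shuffled copy with the rows reordered; arr itself is unchanged
rng.permutation(arr)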
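
The difference between permutation and shuffle can be sketched as follows. This is a minimal illustration with arbitrary variable names: permutation returns a new array and leaves its input untouched, and an integer argument is treated as a range.

```python
import numpy as np

rng = np.random.default_rng()

arr = np.arange(9).reshape((3, 3))

# permutation returns a new, shuffled array and leaves the input untouched
shuffled_copy = rng.permutation(arr)
input_unchanged = np.array_equal(arr, np.arange(9).reshape((3, 3)))

# Passing an integer n is equivalent to permuting np.arange(n)
perm_of_range = rng.permutation(10)

print(input_unchanged)  # True
print(shuffled_copy)
print(perm_of_range)
```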

## 3 - Distribution Functions <a class="anchor" id="3__distribution_functions"></a>
---

Probability distributions can be classed as either discrete or continuous, depending on the type of variable they describe. If a variable can take an infinite number of values between any two values then it is a continuous variable and it follows a continuous probability distribution (heights and temperatures are continuous variables). Discrete variables are variables that cannot take values between two adjacent values (coin toss outcomes and dice rolls are discrete variables).

The distributions analysed in this section are as follows:

**Discrete** | **Continuous**
:--------------|:--
Binomial distribution | Normal distribution
Poisson distribution |
Uniform distribution |


* https://stattrek.com/probability-distributions/discrete-continuous.aspx
* https://www.datacamp.com/community/tutorials/probability-distributions-python
* https://www.kaggle.com/hamelg/python-for-data-22-probability-distributions
* https://statisticsbyjim.com/basics/probability-distributions/
* https://www.investopedia.com/terms/u/uniform-distribution.asp


### 3.1 Binomial Distribution <a class="anchor" id="binomial_distribution"></a>
<div align="center">

The **binomial distribution** is used to model binary data, such as coin tosses.


|![Distributions](https://www.onlinemathlearning.com/image-files/binomial-distribution-formula.png?raw=true)|
|:--:| 
| *Binomial Distribution Formula* |


<div align="left">



As the Binomial distribution only allows for two possible outcomes per trial, the flip of a coin is a good way to demonstrate the probability of "success", here the number of times heads occurs in a certain number of flips. The data below shows "1000" trials with "10" flips per trial. The probability is set to "0.5" as heads and tails are equally likely. The distribution plot is almost symmetrical and peaks at "5", so the most likely result is 5 heads and 5 tails.

The second graph shows the difference when the probability of the outcome is changed. For example, if a coin were weighted on one side so that there was a 75% chance of it landing on heads, the distribution shows the probability of each outcome. The distribution is clearly skewed, and out of the 10 flips you would expect 7 or 8 heads.


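In [None]:
%matplotlib inline
rng = np.random.default_rng()
# 1000 trials of 10 fair coin flips; each value is the number of heads in a trial
P = rng.binomial(10, 0.5, 1000)
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(P, color='r', discrete=True, stat='probability')
plt.title('Binomial Distribution (n=10, p=0.5)')
plt.xlabel('Number of heads in 10 flips')
plt.ylabel('p(x)')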


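In [None]:
%matplotlib inline
rng = np.random.default_rng()
# 1000 trials of 10 flips of a weighted coin with a 75% chance of heads
P = rng.binomial(10, 0.75, 1000)
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(P, color='b', discrete=True, stat='probability')
plt.title('Binomial Distribution (n=10, p=0.75)')
plt.xlabel('Number of heads in 10 flips')
plt.ylabel('p(x)')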
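
The probabilities discussed in this section can also be computed exactly from the binomial formula and compared with simulated frequencies. The sketch below is an illustration only: the helper binomial_pmf is a hypothetical name written for this example (it uses math.comb, which needs Python 3.8 or later), and the sample size of 100,000 is arbitrary.

```python
import math
import numpy as np

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

rng = np.random.default_rng()
n, trials = 10, 100_000

for p in (0.5, 0.75):
    # Simulated number of heads in each trial of n flips
    heads = rng.binomial(n, p, trials)
    simulated = np.bincount(heads, minlength=n + 1) / trials
    # Exact probabilities for 0, 1, ..., n heads
    exact = np.array([binomial_pmf(k, n, p) for k in range(n + 1)])
    print(p, int(np.argmax(exact)), np.round(exact, 3), np.round(simulated, 3))
```

For p = 0.5 the most likely count is 5 heads, and for p = 0.75 it is 8 heads, with 7 heads only slightly less likely.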


### 3.2 Poisson Distribution <a class="anchor" id="poisson_distribution"></a>

**Poisson distribution**

The Poisson distribution gives the probability of a given number of events occurring in a fixed time period. To use this distribution, the rate of occurrence should be constant, the average rate of occurrence should be known and the probability of the event occurring should be proportional to the length of the time period.

Examples where it could be used are counting the number of library books checked out per hour, the number of calls an office receives per hour or the number of goals scored in an All Ireland final.


<div align="center">

|![Distributions](https://www.onlinemathlearning.com/image-files/poisson-distribution-formula.png?raw=true)|
|:--:| 
| *Poisson Distribution Formula* |


<div align="left">
    
    
    
The code below shows the Poisson distribution of goals scored in a particular match where the average number of goals scored is "2". As shown in the distribution plot, the most likely numbers of goals lie around the average entered for the time period (the average goals scored), and the probability falls away for higher numbers of goals.

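In [None]:
import numpy as np
rng = np.random.default_rng()
# 3600 simulated matches with an average of 2 goals per match
s = rng.poisson(2, 3600)
# distplot is deprecated in recent seaborn releases; histplot is its replacement
ax = sns.histplot(s, color = 'b', discrete=True, stat='probability')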
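
The simulated proportions can be compared with the exact values from the Poisson formula. In the sketch below, poisson_pmf is a hypothetical helper written for this example and the sample size of 100,000 is an arbitrary choice.

```python
import math
import numpy as np

def poisson_pmf(k, lam):
    """Exact probability of observing k events when the average is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

rng = np.random.default_rng()
lam, matches = 2, 100_000

# Simulated number of goals in each match
goals = rng.poisson(lam, matches)

for k in range(6):
    simulated = np.mean(goals == k)
    print(k, round(poisson_pmf(k, lam), 4), round(float(simulated), 4))
```

With an average of 2, the formula gives the same probability for 1 goal and for 2 goals, which is why the plot peaks around those two values.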

### 3.3 Uniform Distribution <a class="anchor" id="uniform_distribution"></a>
The **uniform distribution** is used to model events that all have the same probability, such as the outcomes of rolling a die.
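
A die roll can be simulated with the integers method from section 2.1.1. The sketch below is only one possible illustration (the number of rolls is arbitrary); it compares the sample mean and variance with the theoretical values for a fair six-sided die, 3.5 and 35/12.

```python
import numpy as np

rng = np.random.default_rng()

# Simulate 60,000 rolls of a fair six-sided die (the high value 7 is exclusive)
rolls = rng.integers(1, 7, size=60_000)

sample_mean = rolls.mean()
sample_variance = rolls.var()

print(sample_mean)      # theoretical mean is 3.5
print(sample_variance)  # theoretical variance is 35/12, roughly 2.917
```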


### 3.4 Normal Distribution <a class="anchor" id="section_3_4"></a>

The most well-known continuous distribution is the **normal distribution**, which is also known as the Gaussian distribution or the “bell curve.”
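
Samples from a normal distribution can be drawn with the normal method of the Generator. The sketch below uses a hypothetical mean of 170 and standard deviation of 10 (for example, heights in centimetres); these values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng()

mean, std_dev = 170.0, 10.0   # hypothetical values, e.g. heights in centimetres

# loc is the mean and scale is the standard deviation of the distribution
samples = rng.normal(loc=mean, scale=std_dev, size=100_000)

# Share of samples within one standard deviation of the mean (about 68% in theory)
within_one_sd = np.mean(np.abs(samples - mean) < std_dev)

print(samples.mean(), samples.std(), within_one_sd)
```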


### 3.5  <a class="anchor" id="section_3_5"></a>


## 4 - Seeds in generation of pseudorandom numbers <a class="anchor" id="4__seeds_in_generation_of_pseudorandom_numbers"></a>

---

Between numpy v1.15 and v1.19 the default algorithm used to generate pseudorandom numbers changed from the Mersenne Twister (MT19937) to PCG64, the 64-bit Permuted Congruential Generator, which the numpy documentation describes as having better statistical properties than the Mersenne Twister.

Passing a seed when the generator is created fixes its starting state, so the same sequence of numbers is produced on every run and the results of the code are reproducible. The Quick Start section of the numpy documentation is a good place to start.

* https://www.datacamp.com/community/tutorials/numpy-random
* https://www.pcg-random.org/other-rngs.html
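
Reproducible results can be sketched as follows. The seed value 12345 is an arbitrary example and the variable names are illustrative only.

```python
import numpy as np

seed = 12345  # arbitrary example seed

# Two Generators created with the same seed produce identical sequences
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)
first = rng_a.integers(0, 100, size=5)
second = rng_b.integers(0, 100, size=5)

# A Generator created without a seed is initialised from the operating system,
# so its output will generally differ from run to run
unseeded = np.random.default_rng().integers(0, 100, size=5)

print(np.array_equal(first, second))  # True
print(unseeded)
```

Fixing the seed is useful when testing code or sharing an analysis, because anyone who runs the notebook with the same seed and the same numpy version should see the same numbers.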

### Section 4.1 <a class="anchor" id="section_4_1"></a>




## 5 - References <a class="anchor" id="5__references"></a>

---

1. https://numpy.org/
2. https://numpy.org/doc/stable/reference/random/index.html
3. https://numpy.org/doc/stable/reference/random/generator.html#simple-random-data
4. https://numpy.org/doc/stable/reference/random/generator.html#permutations
5. https://numpy.org/doc/stable/reference/random/generator.html#distributions
6. https://www.programmer-books.com/wp-content/uploads/2019/04/Python-for-Data-Analysis-2nd-Edition.pdf
7. https://stattrek.com/probability-distributions/discrete-continuous.aspx
8. https://www.datacamp.com/community/tutorials/probability-distributions-python
9. https://www.kaggle.com/hamelg/python-for-data-22-probability-distributions
10. https://numpy.org/doc/stable/reference/random/new-or-different.html#new-or-different
11. https://www.w3schools.com/python/numpy_random.asp
