
<img src="Images/NumPy_logo.png" width="400">

# Programming for Data Analysis Assignment

## numpy.random package
***

**Assignment objectives:**

1. Explain the overall purpose of the package.
2. Explain the use of the “Simple random data” and “Permutations” functions. 
3. Explain the use and purpose of at least five “Distributions” functions.
4. Explain the use of seeds in generating pseudorandom numbers.

## NumPy and random number generators

NumPy, which stands for Numerical Python, "is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more." [numpy.org](https://numpy.org/devdocs/user/whatisnumpy.html)

NumPy was created in 2005 by Travis Oliphant, who incorporated features of Numarray into Numeric with extensive modifications. NumPy is open-source software and has many contributors. [Wikipedia](https://en.wikipedia.org/wiki/NumPy)

**`numpy.random`** is a module within the NumPy package that generates pseudo-random values and provides a range of tools to manipulate them. It is similar to the `random` module in the Python standard library, but it works with NumPy arrays. It lets you create arrays of random numbers drawn from various probability distributions and randomly sample from arrays or lists. The module is frequently used to simulate data, which is an important technique in data analysis, scientific research, machine learning and many other areas. Simulated data is very handy because it can be used to test methods before applying them to real data.

The standard library's `random` module samples only one value at a time, while `numpy.random` can efficiently generate whole arrays of sample values and offers additional probability distributions. NumPy's random module is much faster than the standard library's when working with many samples. For simpler purposes, such as drawing a single value, the standard `random` module is sufficient and can even be more efficient.
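
As a rough illustration of the difference in style, the sketch below draws the same number of normally distributed samples with both modules. The variable names (`n`, `py_samples`, `np_samples`, `cmp_rng`) are chosen here for illustration only, and no timing claims are made.

```python
import random

import numpy as np

n = 1_000

# Standard library: one value per call, collected in a Python list.
py_samples = [random.gauss(0.0, 1.0) for _ in range(n)]

# numpy.random: a single call returns a whole array of samples.
cmp_rng = np.random.default_rng()
np_samples = cmp_rng.normal(loc=0.0, scale=1.0, size=n)

print(type(py_samples), len(py_samples))
print(type(np_samples), np_samples.shape)
```

The NumPy version returns an `ndarray`, so the result can be passed straight to vectorised operations without a conversion step.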

Computers cannot generate truly random numbers: they need a set of instructions to produce an output, so that output is predictable and reproducible, which is beneficial when creating and testing code. Both random modules therefore produce what are called pseudorandom numbers, which appear random but are not truly so.
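
A minimal sketch of this reproducibility, assuming an arbitrary seed value of 42: two generators started from the same seed produce exactly the same sequence.

```python
import numpy as np

# Two independent generators started from the same (arbitrary) seed.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = rng_a.random(3)
second = rng_b.random(3)

# The sequences are identical because the starting state is identical.
print(np.array_equal(first, second))  # True
```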

“Random number generators have applications in gambling, statistical sampling, computer simulation, cryptography, completely randomized design, and other areas where producing an unpredictable result is desirable.” Monte Carlo simulation uses random numbers to simulate real-world problems. It is often used to assess the risk of a given trading strategy for options or stocks. A Monte Carlo simulator makes it possible to visualize most of the potential outcomes, which gives a much better idea of the risk of a decision. Random numbers are also used in cryptography, so long as the seed is secret: sender and receiver can generate the same set of numbers automatically to use as keys.
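
To make the Monte Carlo idea concrete, here is a small sketch that is not taken from the sources above: it estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle. The seed and sample size are arbitrary choices.

```python
import numpy as np

mc_rng = np.random.default_rng(seed=12345)  # arbitrary seed, for repeatability
n_points = 100_000

# Random points in the unit square [0, 1) x [0, 1).
x = mc_rng.random(n_points)
y = mc_rng.random(n_points)

# Fraction of points inside the quarter circle of radius 1
# approximates its area, pi / 4.
inside = (x**2 + y**2) <= 1.0
pi_estimate = 4.0 * inside.mean()

print(pi_estimate)
```

The estimate gets closer to π as `n_points` grows, which is the general pattern of Monte Carlo methods: more samples give a more precise answer.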

In machine learning, random sampling of actual datasets is often used to test and evaluate analytical methods and algorithms. “Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.” The machine learning algorithms in the scikit-learn package use `numpy.random` in the background.
[scikit-learn](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)
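
The split described in the quote can be sketched with `numpy.random` alone. This is only an illustration under assumed values (10 observations, an 80/20 split, an arbitrary seed); it is not how scikit-learn implements its own splitting utilities.

```python
import numpy as np

split_rng = np.random.default_rng(seed=0)  # arbitrary seed

data = np.arange(10)  # stand-in for 10 observations

# Shuffle the row indices, then cut them into two groups.
shuffled_idx = split_rng.permutation(len(data))
n_train = 8
train = data[shuffled_idx[:n_train]]
test = data[shuffled_idx[n_train:]]

print("train:", train)
print("test: ", test)
```

Every observation ends up in exactly one of the two sets, and fixing the seed makes the split repeatable.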

“Numpy’s random number routines produce pseudo random numbers using combinations of a BitGenerator to create sequences and a Generator to use those sequences to sample from different statistical distributions” [Random sampling](https://numpy.org/doc/stable/reference/random/index.html)
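
The two parts named in the quote can also be combined by hand. The sketch below builds a `PCG64` BitGenerator and wraps it in a `Generator`; the seed value and the variable names are illustrative assumptions.

```python
import numpy as np
from numpy.random import PCG64, Generator

# The BitGenerator produces the underlying stream of random bits.
bit_gen = PCG64(seed=2024)  # arbitrary seed

# The Generator turns that stream into samples from distributions.
explicit_rng = Generator(bit_gen)

sample = explicit_rng.standard_normal(3)
print(type(explicit_rng.bit_generator).__name__)
print(sample)
```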

Let's create an instance of a generator using the `numpy.random` package. Before we proceed, we need to do all the necessary imports:


In [1]:
# importing NumPy module
import numpy as np

# for data analysis
import pandas as pd

# importing matplotlib and seaborn for plotting and visualisations
import matplotlib.pyplot as plt
import seaborn as sns

# default_rng() is a constructor that returns a new instance of a random number generator (Generator).
rng = np.random.default_rng()
rng

Generator(PCG64) at 0x7F88C63007C0

"PCG is a family of simple fast space-efficient statistically good algorithms for random number generation. Unlike many general-purpose RNGs, they are also hard to predict." [pcg-random.org](https://www.pcg-random.org/index.html)

## Simple random data

There are four methods in Simple random data, which we are going to explore below: **random, integers, choice** and **bytes.**

It is very easy to create random numbers using `numpy.random`. As a first example, let's create 5 random floats between 0 and 1 using the random number generator instance created above.

<span style="color:Blue; font-weight:bold;">random</span>([size, dtype, out]) returns random floats in the half-open interval [0.0, 1.0).

In [2]:
# returns an array of 5 floats (if no parameters are specified, returns just one number).
rng.random(5)

array([0.2704329 , 0.93821978, 0.20795936, 0.42395747, 0.30845896])

In [3]:
# returns an array of 2x3 floats
rng.random((2,3))

array([[0.1866022 , 0.01702412, 0.50558976],
       [0.58298432, 0.51331038, 0.41610758]])

In [4]:
# Three-by-two array of random numbers from [-3, 0):
3 * rng.random((3, 2)) - 3

array([[-1.13206963, -1.91145406],
       [-0.46144399, -0.26768358],
       [-2.60056685, -2.77426842]])
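
The scaling used in the last cell generalises to any half-open interval [a, b) as `(b - a) * random() + a`. A short sketch, with `a`, `b` and `scale_rng` as illustrative names; `uniform` is a Generator method that draws from such an interval directly.

```python
import numpy as np

scale_rng = np.random.default_rng()

a, b = -3.0, 0.0

# Manual scaling of samples from [0, 1) to [a, b).
scaled = (b - a) * scale_rng.random((3, 2)) + a

# The same kind of sample drawn directly with uniform().
direct = scale_rng.uniform(low=a, high=b, size=(3, 2))

print(scaled)
print(direct)
```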

<span style="color:Blue; font-weight:bold;">integers</span>(low[, high, size, dtype, endpoint]) returns random integers from low (inclusive) to high (exclusive), or if endpoint=True, low (inclusive) to high (inclusive).

In [5]:
# returns an integer between 0-9
rng.integers(10)

1

In [6]:
# returns an array of 10 integers from 5 to 7 (the upper bound 8 is excluded)
rng.integers(5, 8, 10)

array([5, 5, 5, 7, 5, 5, 6, 7, 7, 5])
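
The `endpoint` parameter mentioned above makes the upper bound inclusive. As a small sketch, simulating ten rolls of a six-sided die (the name `dice_rng` is illustrative):

```python
import numpy as np

dice_rng = np.random.default_rng()

# With endpoint=True the value 6 can be returned as well.
rolls = dice_rng.integers(1, 6, size=10, endpoint=True)

print(rolls)
```

Without `endpoint=True`, the same call would only ever return values from 1 to 5.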

<span style="color:Blue; font-weight:bold;">bytes</span>(length) returns random bytes.

In [None]:
# returns 10 random bytes
rng.bytes(10)
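
The result is a Python `bytes` object, which is often easier to read in hexadecimal form. A brief sketch (the names `bytes_rng` and `token` are illustrative):

```python
import numpy as np

bytes_rng = np.random.default_rng()

token = bytes_rng.bytes(10)

print(type(token), len(token))
# Each byte is shown as two hexadecimal digits.
print(token.hex())
```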

<span style="color:Blue; font-weight:bold;">choice</span>(a[, size, replace, p, axis, shuffle]) generates a random sample from a given array.

In [None]:
# choose 5 numbers at random from the range 0-19
rng.choice(20, 5)
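
By default `choice` samples with replacement, so the same value can appear more than once. Passing `replace=False` guarantees distinct values, as in this short sketch (the variable names are illustrative):

```python
import numpy as np

sample_rng = np.random.default_rng()

# Default: sampling with replacement, duplicates are possible.
with_repl = sample_rng.choice(20, 5)

# replace=False: every selected value is distinct.
without_repl = sample_rng.choice(20, 5, replace=False)

print(with_repl)
print(without_repl)
```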

We can use the **choice()** method with other data types, for example strings:

In [8]:
# create a list of strings
colors = ["green", "red", "blue", "orange", "white"]

# choose 2 items from the list
rng.choice(colors, size=2)

array(['green', 'blue'], dtype='<U6')

We can 'weight' the probability for each item in the list. In the example below, the 5th and the 2nd items in the list are respectively 5 times and 2 times more likely to be returned than each of the other items.

In [None]:
rng.choice(colors, p=[0.1, 0.2, 0.1, 0.1, 0.5], size=6)
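
With only 6 draws the weights are hard to see. The sketch below, assuming an arbitrary seed and a larger sample of 10,000 draws, counts how often each colour appears so that the observed frequencies can be compared with the weights.

```python
import numpy as np

freq_rng = np.random.default_rng(seed=1)  # arbitrary seed

colors = ["green", "red", "blue", "orange", "white"]
weights = [0.1, 0.2, 0.1, 0.1, 0.5]

draws = freq_rng.choice(colors, p=weights, size=10_000)

# Count each colour and convert the counts to relative frequencies.
values, counts = np.unique(draws, return_counts=True)
freqs = dict(zip(values.tolist(), (counts / draws.size).tolist()))

print(freqs)
```

The frequencies should land close to the values in `weights`, with "white" near 0.5 and "red" near 0.2.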