# Assignment No. 3 Part(F)

## Introduction

The scipy.stats subpackage is a compilation of:
•	numerous random variable objects (densities, cumulative distributions, random sampling, etc.)
•	some estimation procedures
•	some statistical tests

This module contains a large number of probability distributions as well as a growing library of statistical functions.
 
## NumPy Versus SciPy (Random Variables & Distributions)

In NumPy, numpy.random provides functions for generating random variables:


In [2]:
import numpy as np

In [3]:
np.random.beta(5, 5, size=3)

array([ 0.50653096,  0.34596042,  0.68939241])

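scipy.stats provides the same distribution as an object that can also evaluate densities, quantiles, and more. Here’s an example of usage: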

In [19]:
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations
grid = np.linspace(0.01, 0.99, 100)

fig, ax = plt.subplots()
ax.hist(obs, bins=40, density=True)   # density=True replaces the removed normed=True
ax.plot(grid, q.pdf(grid), 'k-', linewidth=2)
plt.show()



In this code we created a so-called rv_frozen object via the call q = beta(5, 5).
The “frozen” part of the name implies that q represents a particular distribution with a particular set of parameters.
Once we’ve done so, we can generate random numbers, evaluate the density, etc., all from this fixed distribution.


In [11]:
q = beta(5, 5)      # Beta(a, b), with a = b = 5
q.cdf(0.4)      # Cumulative distribution function




0.26656768000000003

In [12]:
q.pdf(0.4)      # Density function

2.0901888000000013

In [13]:
q.ppf(0.8)      # Quantile (inverse cdf) function

0.63391348346427079

In [14]:
q.mean()

0.5
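
As a minimal sketch of what freezing means, the same values can be obtained by passing the shape parameters on every call to the unfrozen beta object. The variable names below are chosen for illustration only.

```python
import numpy as np
from scipy.stats import beta

# Frozen distribution: the shape parameters a = b = 5 are fixed once.
q = beta(5, 5)

# Unfrozen distribution: the shape parameters are passed on every call.
cdf_frozen = q.cdf(0.4)
cdf_unfrozen = beta.cdf(0.4, 5, 5)

pdf_frozen = q.pdf(0.4)
pdf_unfrozen = beta.pdf(0.4, 5, 5)

print(np.isclose(cdf_frozen, cdf_unfrozen))
print(np.isclose(pdf_frozen, pdf_unfrozen))
```

Freezing is mainly a convenience: it avoids repeating the parameters and keeps them consistent across calls.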

## Distribution Classes
Two general distribution classes have been implemented for the statistical analysis of random variables:
(a) Continuous random variables, with at least 80 distributions.
(b) Discrete random variables, with at least 10 distributions.

All the statistical functions are contained in scipy.stats, and a complete listing is available via info(stats).

# Common Methods

The main public methods for continuous RVs are:
•	rvs: Random Variates
•	pdf: Probability Density Function
•	cdf: Cumulative Distribution Function
•	sf: Survival Function (1-CDF)
•	ppf: Percent Point Function (Inverse of CDF)
•	isf: Inverse Survival Function (Inverse of SF)
•	stats: Return mean, variance, (Fisher’s) skew, or (Fisher’s) kurtosis
•	moment: non-central moments of the distribution
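
The relationships between these methods can be checked on the standard normal distribution. This is a small sketch; the evaluation point x = 1.5 is an arbitrary choice.

```python
import numpy as np
from scipy import stats

x = 1.5

# sf is the survival function, defined as 1 - cdf.
sf_value = stats.norm.sf(x)
cdf_value = stats.norm.cdf(x)

# ppf inverts cdf, and isf inverts sf.
x_from_ppf = stats.norm.ppf(cdf_value)
x_from_isf = stats.norm.isf(sf_value)

# stats returns the requested moments: mean, variance, skew, kurtosis.
mean, var, skew, kurt = stats.norm.stats(moments="mvsk")

# moment(2) is the second non-central moment, E[X**2].
second_moment = stats.norm.moment(2)

print(sf_value + cdf_value)        # sums to 1 up to rounding
print(x_from_ppf, x_from_isf)      # both recover x
print(mean, var, skew, kurt)
print(second_moment)
```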


# Common Examples

The following three tasks, common in scientific research, are discussed with examples:

(a) Histogram and probability density function
(b) Percentiles
(c) Statistical tests

## Histogram and probability density function

Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability
density function):


In [24]:
>>> import numpy as np
>>> a = np.random.normal(size=1000)
>>> bins = np.arange(-4, 5)
>>> bins


array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [25]:
>>> histogram = np.histogram(a, bins=bins, density=True)[0]
>>> bins = 0.5*(bins[1:] + bins[:-1])
>>> bins


array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])

In [26]:
>>> from scipy import stats
>>> b = stats.norm.pdf(bins) # norm is a distribution
>>> plt.plot(bins, histogram)


[<matplotlib.lines.Line2D at 0x900f1aed30>]

In [27]:
>>> plt.plot(bins, b)

[<matplotlib.lines.Line2D at 0x900f1aed68>]

If we know that the random process belongs to a given family of random processes, such as normal processes,
we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution.
Here we fit a normal process to the observed data:


In [28]:
>>> loc, std = stats.norm.fit(a)
>>> loc


0.03655603973611387

In [29]:
>>> std

0.99442840998252524
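
For the normal family the maximum-likelihood estimates have a closed form: the sample mean, and the standard deviation computed with a 1/n normalization. The check below is a small sketch; the seed and the true parameters (2.0 and 3.0) are arbitrary choices made for reproducibility.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # seed chosen arbitrarily
sample = rng.normal(loc=2.0, scale=3.0, size=1000)

loc_hat, scale_hat = stats.norm.fit(sample)

# The maximum-likelihood estimates coincide with these sample statistics.
print(loc_hat, sample.mean())
print(scale_hat, sample.std(ddof=0))
```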

## Percentiles


The median is the value with half of the observations below, and half above:


In [30]:
>>> np.median(a)



0.049591346022477688


It is also called the 50th percentile, because 50% of the observations are below it:


In [31]:
>>> stats.scoreatpercentile(a, 50)


0.049591346022477688

Similarly, we can calculate the 90th percentile:

In [32]:
>>> stats.scoreatpercentile(a, 90)

1.3217137727779449
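
The empirical percentile can be compared with NumPy's equivalent function and with the theoretical quantile of the distribution that generated the data. This sketch assumes the default interpolation settings of both functions and uses an arbitrary seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # seed chosen arbitrarily
sample = rng.normal(size=1000)

p90_scipy = stats.scoreatpercentile(sample, 90)
p90_numpy = np.percentile(sample, 90)

# For a large sample, the empirical percentile approaches the
# theoretical quantile of the underlying distribution.
p90_theory = stats.norm.ppf(0.90)

print(p90_scipy, p90_numpy, p90_theory)
```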

## Statistical tests

A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are
generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are
significantly different:


In [33]:
>>> a = np.random.normal(0, 1, size=100)
>>> b = np.random.normal(1, 1, size=10)
>>> stats.ttest_ind(a, b)


Ttest_indResult(statistic=-3.5004007860478636, pvalue=0.00067636571958768496)

The resulting output is composed of:
• The T statistic value: a number whose sign matches the sign of the difference between the two sample means and whose magnitude is related to the significance of this difference.
• The p value: the probability of observing a T statistic at least this extreme if both processes had the same mean. It is not the probability that the processes are identical. The closer it is to zero, the stronger the evidence that the processes have different means; a value close to 1 means the data give no evidence of a difference.
See also:
The chapter on statistics introduces much more elaborate tools for statistical testing and statistical data loading
and visualization outside of scipy.
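
To make the two outputs concrete, the sketch below recomputes them by hand for the default equal-variance T-test and compares the result with stats.ttest_ind. The seed and the helper variable names are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # seed chosen arbitrarily
a = rng.normal(0, 1, size=100)
b = rng.normal(1, 1, size=10)

n_a, n_b = len(a), len(b)

# Pooled variance, assuming both samples share the same variance.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# T statistic: difference of means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p value from Student's t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

t_scipy, p_scipy = stats.ttest_ind(a, b)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```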


## Specific Points for Discrete Distributions
•	Discrete distributions have mostly the same basic methods as the continuous distributions. However, pdf is replaced by the probability mass function pmf, no estimation methods such as fit are available, and scale is not a valid keyword parameter. The location parameter, keyword loc, can still be used to shift the distribution.
•	The computation of the cdf requires some extra attention. For a continuous distribution, the cumulative distribution function is in most standard cases strictly monotonically increasing on the bounds (a, b) and therefore has a unique inverse. The cdf of a discrete distribution, however, is a step function, so the inverse cdf, i.e., the percent point function, requires a different definition:

ppf(q) = min{x : cdf(x) >= q, x integer}
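
These points can be illustrated with a binomial distribution. The sketch below uses 10 fair coin flips as an arbitrary example and checks the ppf definition directly.

```python
import numpy as np
from scipy import stats

coin = stats.binom(10, 0.5)    # number of heads in 10 fair coin flips

# pmf replaces pdf for discrete distributions.
p_five = coin.pmf(5)

# ppf(q) is the smallest integer x with cdf(x) >= q.
k = coin.ppf(0.5)
print(k, coin.cdf(k - 1), coin.cdf(k))

# loc shifts the support: with loc=1 the value 6 plays the role of 5.
p_shifted = stats.binom.pmf(6, 10, 0.5, loc=1)
print(p_five, p_shifted)
```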


## Performance Issues and Cautionary Remarks

The performance of the individual methods, in terms of speed, varies widely by distribution and method. The results of a method are obtained in one of two ways: either by explicit calculation, or by a generic algorithm that is independent of the specific distribution.

Explicit calculation, on the one hand, requires that the method is directly specified for the given distribution, either through analytic formulas or through special functions in scipy.special or numpy.random for rvs. These are usually relatively fast calculations.

The generic methods, on the other hand, are used if the distribution does not specify any explicit calculation. To define a distribution, only one of pdf or cdf is necessary; all other methods can be derived using numeric integration and root finding. However, these indirect methods can be very slow. As an example, rgh = stats.gausshyper.rvs(0.5, 2, 2, 2, size=100) creates random variables in a very indirect way and takes about 19 seconds for 100 random variables on my computer, while one million random variables from the standard normal or from the t distribution take just above one second.
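
The generic path can be seen by defining a distribution from its pdf alone. The following is a minimal sketch, assuming a subclass of stats.rv_continuous; the class name LinearDensity and the density 2x on [0, 1] are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

class LinearDensity(stats.rv_continuous):
    """Hypothetical distribution with density f(x) = 2x on [0, 1]."""

    def _pdf(self, x):
        # Only the density is specified; everything else is derived.
        return 2.0 * x

dist = LinearDensity(a=0.0, b=1.0, name="linear_density")

# cdf is obtained by numeric integration of the pdf (exact value: 0.25).
cdf_half = dist.cdf(0.5)

# ppf is obtained by root finding on the cdf (exact value: 0.5).
ppf_quarter = dist.ppf(0.25)

# rvs falls back to a generic sampler, which is why it can be slow.
samples = dist.rvs(size=5, random_state=0)

print(cdf_half, ppf_quarter)
print(samples)
```

Supplying explicit _cdf or _ppf methods in such a subclass is the usual way to avoid the slow generic algorithms.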

## Remaining Issues

The distributions in scipy.stats have recently been corrected and improved and have gained a considerable test suite; however, a few issues remain:
•	the distributions have been tested over some range of parameters; however, in some corner ranges, a few incorrect results may remain.
•	the maximum likelihood estimation in fit does not work with default starting parameters for all distributions, and the user needs to supply good starting parameters. Also, for some distributions, using a maximum likelihood estimator might inherently not be the best choice.
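
Starting parameters are passed to fit after the data. The sketch below uses a gamma distribution as an arbitrary example; the true parameters, the starting guesses, and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # seed chosen arbitrarily
data = stats.gamma.rvs(3.0, loc=0.0, scale=2.0, size=2000, random_state=rng)

# Positional arguments after the data are starting guesses for the shape
# parameters; the loc and scale keywords are starting guesses as well.
a_hat, loc_hat, scale_hat = stats.gamma.fit(data, 2.5, loc=0.0, scale=1.5)
print(a_hat, loc_hat, scale_hat)

# Fixing a parameter (here the location, via floc) reduces the search space.
a_fixed, loc_fixed, scale_fixed = stats.gamma.fit(data, 2.5, floc=0.0)
print(a_fixed, loc_fixed, scale_fixed)
```

Fixing parameters that are known in advance is often more robust than tuning the starting guesses.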
