# Distribution
> the way in which something is shared out among a group or spread over an area

### Random Variable
> a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability [wiki](https://en.wikipedia.org/wiki/Random_variable)

**Types**

1. Discrete Random Variables <br>
    E.g.: Gender of each shoe buyer
2. Continuous Random Variables <br>
    E.g.: Shoe sales in a quarter
    
### Probability Distribution
> Assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference. [wiki](https://en.wikipedia.org/wiki/Probability_distribution)

#### Probability Mass Function
A probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value.

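#### Discrete probability distribution (cumulative distribution function)
A discrete probability distribution is characterized by a probability mass function. Its cumulative distribution function is the running sum of the pmf: the probability that the variable takes a value less than or equal to a given value.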

#### Probability Density Function
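A probability density function (pdf) describes the relative likelihood of a continuous random variable taking on a given value. The density at a point is not itself a probability; the probability of falling in an interval is the area under the pdf over that interval.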

#### Continuous probability distribution (cumulative distribution function)
The cumulative distribution function (cdf) gives the probability that the variable takes a value less than or equal to `x`.
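
To make the four definitions above concrete, here is a minimal sketch using `scipy.stats`. The fair die and the standard normal variable are assumptions chosen for illustration; they do not come from the dataset used later.

```python
import numpy as np
from scipy import stats

# Discrete example: a fair six-sided die (values 1..6), assumed for illustration
die = stats.randint(low=1, high=7)   # `high` is exclusive
pmf_3 = die.pmf(3)                   # P(X == 3)
cdf_3 = die.cdf(3)                   # P(X <= 3), the sum of the pmf over 1, 2, 3

# Continuous example: a standard normal variable
z = stats.norm(loc=0, scale=1)
pdf_0 = z.pdf(0.0)                         # density at 0, not a probability
cdf_0 = z.cdf(0.0)                         # P(Z <= 0)
prob_interval = z.cdf(1.0) - z.cdf(-1.0)   # P(-1 <= Z <= 1), area under the pdf

print(pmf_3, cdf_3, pdf_0, cdf_0, prob_interval)
```

Note that a probability for a continuous variable always comes from a difference of cdf values (an area), never from the pdf at a single point.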

### Central Limit Theorem
Given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution. [wiki](https://en.wikipedia.org/wiki/Central_limit_theorem)
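
A small simulation can make the theorem tangible. The sketch below averages draws from an exponential distribution, which is strongly skewed, and checks that the sample means behave roughly like a normal variable. The sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

sample_size = 50        # observations averaged per sample (assumed)
n_samples = 10_000      # number of sample means to collect (assumed)

# Exponential with scale 1 has mean 1 and variance 1, and a long right tail
draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = draws.mean(axis=1)

# CLT prediction: the means are approximately Normal(1, 1 / sqrt(sample_size))
predicted_sd = 1.0 / np.sqrt(sample_size)
within_one_sd = np.mean(np.abs(sample_means - 1.0) <= predicted_sd)

print("mean of sample means:", sample_means.mean())
print("sd of sample means:", sample_means.std(), "predicted:", predicted_sd)
print("fraction within one predicted sd:", within_one_sd)
```

For a normal distribution about 68% of values lie within one standard deviation of the mean, so the last figure should land close to that.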

#### Normal Distribution
A bell-shaped distribution, also called the Gaussian distribution.

<img style="float: left;" src="../../images/prob/normaldist.png" height="220" width="220">
<br>
<br>
<br>
<br>



**PDF**
<br>
<br>
<img style="float: left;" src="../../images/prob/normal_pdf.png" height="320" width="320">
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

**CDF**
<br>
<br>


<img style="float: left;" src="../../images/prob/normal_cdf.png" height="320" width="320">

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
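
The PDF and CDF plotted above can also be evaluated numerically. As a rough sketch, the function below writes out the closed-form normal density and compares it with `scipy.stats.norm`; the helper name `normal_pdf` and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x, written out from the closed form."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

mu, sigma = 2.5, 1.7              # illustrative parameters
x = np.linspace(-3.0, 8.0, 5)     # a few points at which to evaluate the density

manual = normal_pdf(x, mu, sigma)
library = stats.norm(loc=mu, scale=sigma).pdf(x)
cdf_at_mean = stats.norm(loc=mu, scale=sigma).cdf(mu)  # half the mass lies below the mean

print(np.allclose(manual, library), cdf_at_mean)
```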


#### Skewness
Measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. [wiki](https://en.wikipedia.org/wiki/Skewness)

<img style="float: left;" src="../../images/prob/skewness.png" height="620" width="620">
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
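
As a quick check of the sign convention, the sketch below computes sample skewness with `scipy.stats.skew` on simulated data. The three samples are assumptions chosen so that one is symmetric, one has a long right tail, and one has a long left tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

symmetric = rng.normal(size=50_000)            # skewness near 0
right_skewed = rng.exponential(size=50_000)    # long right tail, positive skewness
left_skewed = -right_skewed                    # mirror image, negative skewness

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, stats.skew(data))
```
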
#### Kurtosis
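Measure of the "tailedness" of the probability distribution of a real-valued random variable. It is often described as "peakedness", but it mainly reflects how heavy the tails are relative to a normal distribution. [wiki](https://en.wikipedia.org/wiki/Kurtosis)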
<br>
<br>
<img style="float: left;" src="../../images/prob/kurtosis.png" height="420" width="420">
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
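
The sketch below compares the kurtosis of three simulated samples. It relies on `scipy.stats.kurtosis`, which by default reports *excess* kurtosis, so a normal sample scores near 0. The choice of Laplace and uniform samples is an assumption made to show one heavier-tailed and one lighter-tailed case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

normal = rng.normal(size=100_000)     # reference: excess kurtosis near 0
heavy = rng.laplace(size=100_000)     # heavier tails: positive excess kurtosis
light = rng.uniform(size=100_000)     # no tails: negative excess kurtosis

k_normal = stats.kurtosis(normal)     # default fisher=True gives excess kurtosis
k_heavy = stats.kurtosis(heavy)
k_light = stats.kurtosis(light)

print(k_normal, k_heavy, k_light)
```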

#### Binomial Distribution

The binomial distribution with parameters `n` and `p` is the discrete probability distribution of the number of successes in a sequence of `n` independent yes/no experiments, each of which yields success with probability `p`. A success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when `n = 1`, the binomial distribution is a Bernoulli distribution. [wiki](https://en.wikipedia.org/wiki/Binomial_distribution)
<br>
<br>
<img style="float: left;" src="../../images/prob/binomial_pmf.png" height="420" width="420">
<br>
<br>
<br>
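
To connect the definition to numbers, the sketch below computes the binomial pmf directly from the counting formula, `C(n, k) * p**k * (1 - p)**(n - k)`, and compares it with `scipy.stats.binom`. The helper `binomial_pmf` and the values of `n` and `p` are illustrative assumptions.

```python
from math import comb

import numpy as np
from scipy import stats

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 15, 0.4
manual = [binomial_pmf(k, n, p) for k in range(n + 1)]   # k = 0, 1, ..., n
library = stats.binom.pmf(range(n + 1), n, p)

print(np.allclose(manual, library), sum(manual))

# With a single trial the binomial reduces to a Bernoulli distribution
print(binomial_pmf(1, 1, p), stats.bernoulli.pmf(1, p))
```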


#### Exponential Distribution
Probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It has the key property of being memoryless. [wiki](https://en.wikipedia.org/wiki/Exponential_distribution)
<br>
<br>
<img style="float: left;" src="../../images/prob/exponential_pdf.png" height="420" width="420">
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
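
The memoryless property says that having already waited `s` units of time does not change the distribution of the remaining wait: `P(T > s + t | T > s) = P(T > t)`. A minimal numerical check with `scipy.stats.expon`, using an assumed rate and assumed values of `s` and `t`:

```python
from scipy import stats

rate = 0.5                             # events per unit time (assumed)
wait = stats.expon(scale=1.0 / rate)   # scipy parameterises by scale = 1 / rate

s, t = 2.0, 3.0
# sf is the survival function, P(T > x) = 1 - cdf(x)
p_conditional = wait.sf(s + t) / wait.sf(s)   # P(T > s + t | T > s)
p_fresh = wait.sf(t)                          # P(T > t)

print(p_conditional, p_fresh, wait.mean())
```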

#### Uniform distribution
All values in the interval are equally likely, so the density is constant over the interval and zero outside it. [wiki](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))


<br> 
<br>
<img style="float: left;" src="../../images/prob/uniform.png" height="420" width="420">
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
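
As a small illustration, the sketch below builds a continuous uniform distribution on an assumed interval `[2, 6]` with `scipy.stats.uniform`. Note that scipy parameterises it with `loc` (the lower bound) and `scale` (the width), not with the two endpoints.

```python
from scipy import stats

a, b = 2.0, 6.0                           # interval endpoints (assumed)
u = stats.uniform(loc=a, scale=b - a)     # loc = lower bound, scale = width

density_inside = u.pdf(3.0)    # constant 1 / (b - a) inside the interval
density_outside = u.pdf(7.0)   # zero outside the interval
prob_below_3 = u.cdf(3.0)      # (3 - a) / (b - a)

print(density_inside, density_outside, prob_below_3, u.mean())
```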




### 6-sigma philosophy
<img style="float: left;" src="../../images/prob/6sigma.png" height="520" width="520">

### Histograms

The histogram is the most commonly used representation of a distribution.

Let's plot the distribution of high-quality weed prices in the dataset.

In [None]:
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline

In [None]:
#Import the data
weed_pd = pd.read_csv("../../images/prob/Weed_Price.csv", parse_dates=["date"])

In [None]:
weed_pd.head()

In [None]:
#histplot replaces the deprecated distplot; it draws a plain histogram by default
sns.histplot(weed_pd.HighQ)

In [None]:
weed_pd["month"] = weed_pd["date"].dt.month
weed_pd["year"] = weed_pd["date"].dt.year

In [None]:
weed_jan2015_summarized = weed_pd.loc[(weed_pd.month==1) & (weed_pd.year==2015), ["State", "HighQ"]].groupby("State").mean().reset_index()

In [None]:
weed_jan2015_summarized

**Question: If you landed at random in the USA, with an equal chance of landing in any of the states, what is the probability that the price of weed is more than 340? (Bin the prices by $10.)**

In [None]:
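#histplot replaces the deprecated distplot; kde and density scaling match distplot's defaults
sns.histplot(weed_jan2015_summarized.HighQ, bins=list(range(0, 500, 10)), kde=True, stat="density")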
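
Because every state is equally likely, the answer to the question above is the share of states whose binned price is 340 or higher. The sketch below shows one way to compute that share; the helper `probability_above` and the example prices are hypothetical and are not taken from the dataset.

```python
import numpy as np

def probability_above(prices, threshold, bin_width=10):
    """Share of observations that fall in bins starting at or above `threshold`."""
    prices = np.asarray(prices, dtype=float)
    # Floor each price to the lower edge of its bin, e.g. 347.8 -> 340 for width 10
    bin_lower_edges = np.floor(prices / bin_width) * bin_width
    return np.mean(bin_lower_edges >= threshold)

# Hypothetical state-level mean prices, for illustration only
example_prices = [305.4, 338.9, 341.2, 352.0, 366.7, 329.5, 347.8, 401.3]
print(probability_above(example_prices, threshold=340))  # 5 of 8 prices -> 0.625
```

In the notebook, the same function could be called as `probability_above(weed_jan2015_summarized.HighQ, 340)`.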

In [None]:
#Using `scipy` to work with distributions

In [None]:
from scipy import stats
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
#Generate random numbers that are normally distributed
random_normal = np.random.randn(100)  #sp.randn is no longer available in recent SciPy releases
plt.scatter(range(100), random_normal)

In [None]:
print("mean:", random_normal.mean(), " variance:", random_normal.var())

In [None]:
#Create a normal distribution with mean 2.5 and standard deviation 1.7
n = stats.norm(loc=2.5, scale=1.7)

In [None]:
#Generate random number from that distribution
n.rvs()

In [None]:
#for the above normal distribution, what is the pdf at 0.3?
n.pdf(0.3)

In [None]:
#Binomial distribution with `p` = 0.4 and number of trials as 15

In [None]:
#pmf for k = 0..15 successes in n = 15 trials with p = 0.4
stats.binom.pmf(range(16), 15, 0.4)

### Standard Error

It is a measure of how far we expect the estimate to be off, on average. More technically, it is the standard deviation of the sampling distribution of a statistic (most often the mean). Do not confuse it with the *standard deviation*: the standard deviation measures the variability of the observed quantity, whereas the standard error describes the variability of the estimate.

To illustrate this, let's do the following.

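Not everyone who buys weed reports the price on the site, so the observed prices are only a sample. Let's assume that the actual mean price of high-quality weed in California for January 2015 was 243.7. We will compute the standard deviation of the observed prices and the standard error of the mean.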

In [None]:
#Weed prices of California for the month of Jan 2015
weed_ca_jan2015 = weed_pd[(weed_pd.State=="California") & (weed_pd.month==1) & (weed_pd.year==2015)]
weed_ca_jan2015.head()

In [None]:
#Mean and standard deviation of the price of high quality weed in California
print("Sample Mean:", weed_ca_jan2015.HighQ.mean(), "\n", "Sample Standard Deviation:", weed_ca_jan2015.HighQ.std())


In [None]:
print(weed_ca_jan2015.HighQ.max(), weed_ca_jan2015.HighQ.min())

We'll follow the same procedure as in `resampling.ipynb`: bootstrap samples from the observed data 10,000 times and compute the difference between each sample mean and the actual mean. The root mean squared error of those differences gives the standard error.

In [None]:
def squared_error(bootstrap_sample, actual_mean):
    #Squared difference between one bootstrap sample's mean and the assumed actual mean
    return np.square(bootstrap_sample.mean() - actual_mean)

def experiment_for_computing_standard_error(observed_prices, number_of_times, actual_mean):
    #Each row is one bootstrap sample with as many observations as the observed data
    bootstrap_sample = np.random.choice(observed_prices, size=[number_of_times, observed_prices.size], replace=True)
    #Apply along axis 1 so that every row (one bootstrap sample) yields one squared error
    bootstrap_squared_error = np.apply_along_axis(squared_error, 1, bootstrap_sample, actual_mean)
    #Root mean squared error across all bootstrap samples
    return np.sqrt(bootstrap_squared_error.mean())

In [None]:
#Standard error of the estimate for mean, using 10,000 bootstrap samples
experiment_for_computing_standard_error(np.array(weed_ca_jan2015.HighQ), 10000, 243.7)
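
For comparison, the standard error of the mean also has a well-known closed form: the sample standard deviation divided by the square root of the sample size. The sketch below checks that a bootstrap estimate agrees with it. It uses simulated prices as a stand-in for the California data, so the numbers are illustrative only. Unlike the function above, it measures the spread of the bootstrap means around their own average rather than around an assumed actual mean, so it does not include any offset between the sample mean and that assumed value.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical price observations standing in for the California data
prices = rng.normal(loc=245.0, scale=2.0, size=400)

# Analytic standard error of the mean: sample standard deviation / sqrt(n)
sample_sd = prices.std(ddof=1)
analytic_se = sample_sd / np.sqrt(prices.size)

# Bootstrap: resample with replacement, one row per resample, then take each row's mean
n_resamples = 5_000
resamples = rng.choice(prices, size=(n_resamples, prices.size), replace=True)
bootstrap_se = resamples.mean(axis=1).std(ddof=1)

print("standard deviation:", sample_sd)
print("analytic standard error:", analytic_se)
print("bootstrap standard error:", bootstrap_se)
```

The standard deviation stays roughly the same as more data is collected, while the standard error shrinks with the square root of the sample size.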