# In-class notebook: 2024-01-10

In this notebook, we will get familiar with the different definitions in descriptive statistics (how we describe a PDF short of actually spelling out the PDF). We will then look at a number of common distributions that you might encounter.

This notebook is intended to support Chapter 3.2-3.3 of the textbook, and material is taken from the following scripts (from astroML):
* https://github.com/astroML/astroML-notebooks/blob/main/chapter3/astroml_chapter3_Descriptive_Statistics.ipynb
* https://github.com/astroML/astroML-notebooks/blob/main/chapter3/astroml_chapter3_Univariate_Distribution_Functions.ipynb

In the example below, we show distributions with different skewness (top panel) and kurtosis (bottom panel). In the top panel, we plot a Gaussian, a modified Gaussian, and a log-normal distribution with $\sigma = 1.2$. The modified Gaussian is a normal distribution multiplied by a Gram-Charlier series, $h(x) = N(\mu,\sigma)\sum_{k=0}^\infty a_k H_k(z)$, where $z = (x-\mu)/\sigma$ and the $H_k$ are Hermite polynomials, with $a_0 = 2$, $a_1 = 1$, $a_2 = 0.5$, and all other $a_k = 0$ (the code multiplies the sum by 0.5 so that the PDF integrates to one). In the kurtosis panel, we plot a uniform, Laplace, cosine, and Gaussian distribution.


We can find the skewness and kurtosis of each distribution by importing it from `scipy.stats`. For example, after `from scipy.stats import uniform`, calling `uniform.stats(moments='sk')` returns the skewness and (excess) kurtosis of a uniform distribution. For the modified Gaussian, however, we hard-code $\Sigma = -0.36$.
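
The hard-coded value can be checked numerically. The sketch below is an addition rather than part of the astroML notebook: it evaluates the modified-Gaussian PDF from the plotting cell on a fine grid (the grid limits and resolution are arbitrary choices) and computes the skewness from the central moments.

```python
import numpy as np
from scipy.stats import norm

# Fine grid wide enough that the tails contribute nothing measurable
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]

# Modified Gaussian, same expression as in the plotting cell
h = 0.5 * norm.pdf(x) * (2 + x + 0.5 * (x**2 - 1))

def integrate(f):
    """Simple Riemann sum on the uniform grid."""
    return np.sum(f) * dx

norm_h = integrate(h)                      # should be 1 for a valid PDF
mean_h = integrate(x * h)                  # first moment
var_h = integrate((x - mean_h)**2 * h)     # second central moment
skew_h = integrate((x - mean_h)**3 * h) / var_h**1.5

print(f'normalization = {norm_h:.4f}, skewness = {skew_h:.2f}')
```

The result should agree with the $\Sigma = -0.36$ used in the legend to two decimal places.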

In [None]:
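from scipy.stats import uniform, norm, laplace, cosine, lognorm

# (excess) kurtosis of the distributions in the bottom panel
uni = float(uniform.stats(moments='k'))
lap = int(laplace.stats(moments='k'))
cos = float(cosine.stats(moments='k'))
# skewness of the log-normal distribution with shape parameter 1.2
log = float(lognorm.stats(1.2, moments='s'))

# skewness and kurtosis of the Gaussian (both are zero)
gauss = norm.stats(moments='sk')
skew_gauss = int(gauss[0])
kurt_gauss = int(gauss[1])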

In [None]:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(7.5, 10))
fig.subplots_adjust(right=0.95, hspace=0.05, bottom=0.07, top=0.95)

# First show distributions with different skew
ax = fig.add_subplot(211)
x = np.linspace(-8, 8, 1000)
N = stats.norm(0, 1)

# raw strings keep '\S' from being read as an (invalid) escape sequence
l1, = ax.plot(x, N.pdf(x), '-k',
              label=r'Gaussian, $\Sigma={}$'.format(skew_gauss))
l2, = ax.plot(x, 0.5 * N.pdf(x) * (2 + x + 0.5 * (x * x - 1)),
              '--k', label=r'mod. Gauss, $\Sigma=-0.36$')
l3, = ax.plot(x[499:], stats.lognorm(1.2).pdf(x[499:]), '-.k',
              label=r'log normal, $\Sigma={}$'.format(round(log, 1)))

ax.set_xlim(-5, 5)
ax.set_ylim(0, 0.7001)
ax.set_ylabel('$p(x)$', fontsize = 12)
ax.xaxis.set_major_formatter(plt.NullFormatter())

# trick to show multiple legends
leg1 = ax.legend([l1], [l1.get_label()], loc=1, fontsize = 12)
leg2 = ax.legend([l2, l3], (l2.get_label(), l3.get_label()), loc=2, fontsize = 12)
ax.add_artist(leg1)
ax.set_title(r'Skew $\Sigma$ and Kurtosis $K$', fontsize = 16)

# next show distributions with different kurtosis
ax = fig.add_subplot(212)
x = np.linspace(-5, 5, 1000)
l1, = ax.plot(x, stats.laplace(0, 1).pdf(x), '--k',
              label='Laplace, K={}'.format(lap))
l2, = ax.plot(x, stats.norm(0, 1).pdf(x), '-k',
              label='Gaussian, K={}'.format(kurt_gauss))
l3, = ax.plot(x, stats.cosine(0, 1).pdf(x), '-.k',
              label='Cosine, K={}'.format(round(cos, 2)))
l4, = ax.plot(x, stats.uniform(-2, 4).pdf(x), ':k',
              label='Uniform, K={}'.format(round(uni, 1)))

ax.set_xlim(-5, 5)
ax.set_ylim(0, 0.55)
ax.set_xlabel('$x$', fontsize = 12)
ax.set_ylabel('$p(x)$', fontsize = 12)

# trick to show multiple legends
leg1 = ax.legend((l1, l2), (l1.get_label(), l2.get_label()), loc=2, fontsize = 12)
leg2 = ax.legend((l3, l4), (l3.get_label(), l4.get_label()), loc=1,fontsize = 12)
ax.add_artist(leg1)

### Useful NumPy and SciPy functions

The cell below computes several descriptive statistics of a one-dimensional array **x**.

In [None]:
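import numpy as np
import scipy.stats
x = np.random.random(100)  # 100 random numbers, uniform on [0, 1)

q25, q50, q75 = np.percentile(x, [25, 50, 75])  # quartiles
mean = np.mean(x)
median = np.median(x)
variance = np.var(x)             # ddof=0 by default (divides by N)
standard_deviation = np.std(x)   # ddof=0 by default (divides by N)
skew = scipy.stats.skew(x)
kurtosis = scipy.stats.kurtosis(x)  # excess kurtosis (a Gaussian gives 0)
mode = scipy.stats.mode(x)  # most frequent value; of limited use for continuous data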
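
One detail worth knowing about the cell above: `scipy.stats.kurtosis` returns the *excess* (Fisher) kurtosis by default, so a Gaussian gives a value near 0 rather than 3. The short sketch below, an addition with an arbitrarily chosen seed and sample size, compares the two conventions on a Gaussian sample.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
x = rng.normal(size=100_000)     # large Gaussian sample

# Fisher definition (default): a Gaussian gives approximately 0
excess = scipy.stats.kurtosis(x)
# Pearson definition: a Gaussian gives approximately 3
pearson = scipy.stats.kurtosis(x, fisher=False)

print(f'excess kurtosis  = {excess:.3f}')
print(f'Pearson kurtosis = {pearson:.3f}')
```

The two values differ by exactly 3, which is the convention behind the $K$ values quoted in the figure legend.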

## Data-based estimates of descriptive statistics

In [None]:
uncorrected = []
corrected = []
for _ in range(5000):
    # draw a small sample (N = 5) from a Gaussian with sigma = 1
    samples = np.random.normal(loc=5.0, scale=1.0, size=5)
    uncorrected.append(np.std(samples, ddof=0))  # divides by N
    corrected.append(np.std(samples, ddof=1))    # divides by N - 1

fig, ax = plt.subplots(1,2)
fig.set_size_inches(15,6)   
ax[0].hist(uncorrected);
ax[1].hist(corrected);
ax[0].text(1.65, 1200, f'mean = {np.mean(uncorrected):.2}',
        bbox={'facecolor':'white', 'alpha':0.5, 'pad':1, 'boxstyle':"round"})
ax[1].text(1.85, 1200, f'mean = {np.mean(corrected):.2}',
        bbox={'facecolor':'white', 'alpha':0.5, 'pad':1, 'boxstyle':"round"})
ax[0].set_title("Uncorrected Standard Deviations");
ax[1].set_title("Corrected Standard Deviations");
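
The two histograms differ only in the `ddof` argument. As a minimal sketch of what that argument does (an addition, with an arbitrary seed), the block below writes both estimators out by hand and compares them with `np.std`. Dividing by $N-1$ instead of $N$ is the correction that makes the sample *variance* unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
samples = rng.normal(loc=5.0, scale=1.0, size=5)
N = samples.size
xbar = samples.mean()

# Sum of squared deviations from the sample mean
ssd = np.sum((samples - xbar)**2)

s_uncorrected = np.sqrt(ssd / N)        # same as np.std(samples, ddof=0)
s_corrected = np.sqrt(ssd / (N - 1))    # same as np.std(samples, ddof=1)

print(f'uncorrected: {s_uncorrected:.4f}  (np.std ddof=0: {np.std(samples, ddof=0):.4f})')
print(f'corrected:   {s_corrected:.4f}  (np.std ddof=1: {np.std(samples, ddof=1):.4f})')
```

For $N = 5$ the corrected value is larger by a factor $\sqrt{N/(N-1)} \approx 1.12$, which is why the right-hand histogram sits at larger values.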

### Uncertainty in our estimators $\overline{x}$ and $s$

When $N$ is large (at least ten or so), and if the variance of $h(x)$ is finite, we expect from the central limit theorem (Chapter 3.4) that $\overline{x}$ and $s$ will be distributed around their values given by eqs. 1 and 2 according to Gaussian distributions with widths (standard errors) equal to

$$ \sigma_{\overline{x}} = \frac{s}{\sqrt{N}},$$

which is called *the standard error of the mean*, and

$$ \sigma_s = \frac{s}{\sqrt{2(N-1)}} = \frac{1}{\sqrt{2}}\sqrt{\frac{N}{N-1}}\, \sigma_{\overline{x}},$$
 
which is the *error of the standard deviation.*

Note that for large $N$, the uncertainty of the location parameter is about 40% larger than the uncertainty of the scale parameter ($\sigma_{\overline{x}} \sim \sqrt{2}\,\sigma_s$). Note also that for small $N$, $\sigma_s$ is not much smaller than $s$ itself.
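
These two widths can be checked with a quick Monte Carlo experiment. The sketch below is an addition, not part of the original notebook: it draws many Gaussian samples of size $N$, measures the scatter of $\overline{x}$ and $s$ across samples, and compares it with the formulas above. The sample size, number of trials, and seed are arbitrary, and the true $\sigma$ is used in place of $s$ in the predictions.

```python
import numpy as np

rng = np.random.default_rng(42)           # arbitrary seed
N, n_trials, sigma = 100, 20_000, 1.0     # illustrative values

# Each row is one sample of size N
data = rng.normal(loc=0.0, scale=sigma, size=(n_trials, N))
xbar = data.mean(axis=1)                  # sample means
s = data.std(axis=1, ddof=1)              # corrected sample standard deviations

# Measured scatter of the two estimators across trials
scatter_xbar = xbar.std()
scatter_s = s.std()

# Predictions from the formulas in the text
pred_xbar = sigma / np.sqrt(N)
pred_s = sigma / np.sqrt(2 * (N - 1))

print(f'sigma_xbar: measured {scatter_xbar:.4f}, predicted {pred_xbar:.4f}')
print(f'sigma_s:    measured {scatter_s:.4f}, predicted {pred_s:.4f}')
print(f'ratio sigma_xbar / sigma_s = {scatter_xbar / scatter_s:.2f}')
```

The printed ratio should come out close to $\sqrt{2} \approx 1.41$, matching the "about 40% larger" statement.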

### Example of standard deviation vs. standard error

**Standard deviation**: Imagine we flip a coin 16 times; this is one trial. As the number of trials goes to infinity, the histogram of the number of heads per trial converges to a binomial distribution, which is approximately normal with mean $\mu = 8$ and standard deviation $\sigma = 2$. No matter how many trials we perform, the standard deviation will not shrink, because it is an intrinsic property of the coin-flipping process.

**Standard error**: The expected number of heads in 16 flips is eight, since each flip has a 50/50 chance of landing heads. With enough trials, the error on our estimate of the mean number of heads can become arbitrarily small.

Below is the distribution of the number of heads in a 16-flip trial for $N = 5000$ and $N = 15000$ trials.

In [None]:
import numpy as np
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1,2) 
fig.set_size_inches(10,4)   
np.random.seed(42)
N = [5000, 15000]

for m, k in enumerate(N):
    y = np.random.binomial(n=16, p=0.5, size=k)  # k trials of a 16-flip toss
    x = np.arange(0, 17)  # possible numbers of heads, 0 through 16
    # count how many trials produced each number of heads
    values = [np.count_nonzero(y == i) for i in x]
    ax[m].bar(x, values)
    print(f'Mean = {np.mean(y):.2f}, Std = {np.std(y):.2f}')

In [None]:
from scipy.stats import sem
errors = []
for n_trials in [3, 30, 100, 1000]:
    sample = np.random.binomial(n=16, p=0.5, size=n_trials)
    # standard error of the mean for this number of trials
    errors.append(round(sem(sample), 2))
print(errors)
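
As a cross-check on the cell above, `scipy.stats.sem` with its default `ddof=1` is the corrected sample standard deviation divided by $\sqrt{N}$. The sketch below is an addition with an arbitrary seed, and it verifies this on one sample of 1000 trials.

```python
import numpy as np
from scipy.stats import sem

rng = np.random.default_rng(0)   # arbitrary seed
sample = rng.binomial(n=16, p=0.5, size=1000)   # 1000 trials of 16 flips

# Standard error of the mean, written out by hand: s / sqrt(N)
by_hand = np.std(sample, ddof=1) / np.sqrt(sample.size)
from_scipy = sem(sample)

print(f'by hand: {by_hand:.4f}, scipy.stats.sem: {from_scipy:.4f}')
```

Since $\sigma = 2$ for this experiment, the value is close to $2/\sqrt{1000} \approx 0.063$, while the standard deviation of the sample itself stays near 2.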

### Robust descriptive statistics

In [None]:
import numpy as np
from astroML import stats
np.random.seed(0)
x = np.random.normal(size=1000) # 1000 normally distributed points 
stats.sigmaG(x)
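
`sigmaG` is a rank-based width estimate built from the interquartile range, $\sigma_G = 0.7413\,(q_{75} - q_{25})$. The factor rescales the interquartile range so that $\sigma_G = \sigma$ for a Gaussian. The sketch below reproduces the estimate with NumPy and SciPy alone. It is a reconstruction from that definition rather than astroML's own source, so small differences from `stats.sigmaG` (for example in how percentiles are interpolated) are possible.

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)                   # same seed as the cell above
x = np.random.normal(size=1000)     # 1000 normally distributed points

# Interquartile range of the data
q25, q75 = np.percentile(x, [25, 75])

# For a unit Gaussian the interquartile range is about 1.349,
# so its inverse (about 0.7413) converts an IQR into a sigma
factor = 1.0 / (norm.ppf(0.75) - norm.ppf(0.25))
sigma_g = factor * (q75 - q25)

print(f'factor = {factor:.4f}, sigmaG = {sigma_g:.3f}, np.std = {np.std(x):.3f}')
```

Because it depends only on the quartiles, this estimate barely moves when a few extreme outliers are added, unlike `np.std`.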

In [None]:
import numpy as np
from astroML import stats
from matplotlib import pyplot as plt
from scipy.stats import cauchy
  
normal = np.random.normal(loc=8.0, scale=1.0, size=100) # 100 samples from a Gaussian

a = np.random.normal(loc=8.0, scale=1.0, size=95) #95 samples from a Gaussian
b = cauchy.rvs(loc=8.0, scale=20, size=5) #5 samples from a Cauchy
normal_with_outliers = np.concatenate([a, b]) # combine to create Gaussian with outliers

In [None]:
labels = ['no outliers', 'with outliers']
means = [np.mean(normal),np.mean(normal_with_outliers)]
standard_deviations = [np.std(normal),np.std(normal_with_outliers)]
medians = [np.median(normal), np.median(normal_with_outliers)]
sigmaG = [stats.sigmaG(normal), stats.sigmaG(normal_with_outliers)]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(1,2)
fig.set_size_inches(14,6)   
rects1 = ax[0].bar(x - width/2, means, width, label='mean', color = 'steelblue')
rects2 = ax[0].bar(x + width/2, standard_deviations, width, label='std', color = 'darkorange')
rects3 = ax[1].bar(x - width/2, medians, width, label='median', color = 'olivedrab')
rects4 = ax[1].bar(x + width/2, sigmaG, width, label='sigmaG', color = 'gold')

titles = ["means and std's","medians and sigmaG's"]
for i in [0,1]:
    ax[i].set_ylabel('value')
    ax[i].set_xticks(x);
    ax[i].set_xticklabels(labels)
    ax[i].legend()
    ax[i].set_title(titles[i])

combined = means + standard_deviations + medians + sigmaG
ax[1].set_ylim([0, np.max(combined)+0.5])

ax[0].bar_label(rects1, padding=3)
ax[0].bar_label(rects2, padding=3)
ax[1].bar_label(rects3, padding=3)
ax[1].bar_label(rects4, padding=3)
fig.tight_layout()


## Univariate distribution functions

In the code below, we first show the distributions themselves, and then show how to calculate some of their descriptive statistics using SciPy's built-in functions.
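
As a minimal sketch of that workflow (an addition, using a unit Gaussian as the example), a `scipy.stats` distribution can be "frozen" with fixed parameters and then queried for its moments and quantiles. Any of the distributions listed below could be substituted for `stats.norm`, with that distribution's own parameters.

```python
from scipy import stats

# A "frozen" distribution: parameters are fixed once, then reused
dist = stats.norm(loc=0, scale=1)

# Mean, variance, skewness, and excess kurtosis in one call
mean, var, skew, kurt = dist.stats(moments='mvsk')

# Median and quartiles from the inverse CDF (percent point function)
median = dist.median()
q25, q75 = dist.ppf([0.25, 0.75])

print(f'mean = {float(mean)}, variance = {float(var)}')
print(f'skewness = {float(skew)}, excess kurtosis = {float(kurt)}')
print(f'median = {median}, q25 = {q25:.4f}, q75 = {q75:.4f}')
```

The same object also provides `pdf`, `cdf`, and `rvs` for plotting the distribution and drawing random samples from it.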

### Uniform

### Gaussian

### Binomial

### Poisson

### Cauchy (Lorentzian)

### Laplace (exponential)

### $\chi^2$

### Student’s t

### Fisher’s F

### Beta

### G