# Statistics


## Introduction

In this chapter, you'll learn how to do statistics with code. We already saw some statistics in the chapter on probability and random processes; here we'll focus on computing basic statistics and using statistical tests.

### Notation

Greek letters, like $\beta$, denote the truth, that is, the true value of a parameter. Modified Greek letters denote an estimate of the truth, for example $\hat{\beta}$. Letters from the Latin alphabet denote the values of data, for instance $x$ for a variable or vector. Modified Latin alphabet letters denote computations performed on data, for instance $\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i$, where $n$ is the number of samples.
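
To make the notation concrete, here is a minimal sketch that draws data $x$ from a normal distribution and computes $\bar{x}$ from it. The values of `mu` and `sigma` and the seed are hypothetical choices made only for illustration; in real data the true $\mu$ is unknown, and $\bar{x}$ serves as an estimate of it.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for this illustration
mu, sigma = 2.0, 1.5

rng = np.random.default_rng(seed=42)
x = rng.normal(loc=mu, scale=sigma, size=10_000)  # the data, x

# A computation performed on the data: the sample mean, x-bar
x_bar = np.mean(x)

print(f"true mu = {mu}, x_bar = {x_bar:.3f}")
```

With this many samples, $\bar{x}$ should land close to, but not exactly on, the true $\mu$.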

### Imports

First we need to import the packages we'll be using:

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
import pingouin as pg

## Basic statistics

Let's start by computing the simplest statistics you can think of, using some synthetic data. Many of the functions have lots of extra options that we won't explore here (like weights or normalisation); remember that you can see these using the `help()` function, for example `help(np.mean)`.

We'll generate a vector with 100 entries:

In [None]:
data = np.array(range(100))
data

In [None]:
from myst_nb import glue
import sympy

dict_fns = {'mean': np.mean(data),
            'std': np.std(data),
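            # np.ravel copes with both the array result (older scipy) and the scalar result (newer scipy)
            'mode': np.ravel(stats.mode([0, 1, 2, 3, 3, 3, 5])[0])[0],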
            'median': np.median(data)}

for name, eval_fn in dict_fns.items():
    glue(name, f'{eval_fn:.1f}')


# Set max rows displayed for readability
pd.set_option('display.max_rows', 6)
# Plot settings
plot_style = {'xtick.labelsize': 20,
              'ytick.labelsize': 20,
              'font.size': 22,
              'figure.autolayout': True,
              'figure.figsize': (10, 5.5),
              'axes.titlesize': 22,
              'axes.labelsize': 20,
              'lines.linewidth': 4,
              'lines.markersize': 6,
              'legend.fontsize': 16,
              'mathtext.fontset': 'stix',
              'font.family': 'STIXGeneral',
              'legend.frameon': False}
plt.style.use(plot_style)

Okay, let's see how some basic statistics are computed. The mean is `np.mean(data)=`{glue:}`mean`, the standard deviation is `np.std(data)=`{glue:}`std`, and the median is given by `np.median(data)=`{glue:}`median`. The mode is given by `stats.mode([0, 1, 2, 3, 3, 3, 5])[0]=`{glue:}`mode` (access the counts using `stats.mode(...)[1]`).
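
As a rough sketch of the calls described above, the cell below computes the same statistics directly. The variable names are illustrative, and the mode is found with `np.unique` rather than `stats.mode`, simply as an alternative that does not depend on the scipy version. Note that `np.std` divides by $n$ by default; passing `ddof=1` divides by $n - 1$ instead.

```python
import numpy as np

x = np.arange(100)  # same values as the 100-entry vector above

mean = np.mean(x)
median = np.median(x)
std_pop = np.std(x)             # default ddof=0: divide by n
std_sample = np.std(x, ddof=1)  # ddof=1: divide by n - 1

# Mode via value counts: the most frequent value and how often it appears
values, counts = np.unique([0, 1, 2, 3, 3, 3, 5], return_counts=True)
mode = values[np.argmax(counts)]
mode_count = counts.max()

print(f"mean={mean:.1f}, median={median:.1f}, "
      f"std (n)={std_pop:.3f}, std (n-1)={std_sample:.3f}, "
      f"mode={mode} (count {mode_count})")
```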

Quantiles other than the median are computed with `np.quantile`; for example, for $q=0.25$:

In [None]:
np.quantile(data, 0.25)
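
`np.quantile` also accepts a sequence of quantiles, so several can be computed in one call. The sketch below (with illustrative variable names) computes the three quartiles of the same 100-entry vector; by default, numpy interpolates linearly between neighbouring data points.

```python
import numpy as np

x = np.arange(100)  # same values as the vector above

# Lower quartile, median, and upper quartile in a single call
quartiles = np.quantile(x, [0.25, 0.5, 0.75])
print(quartiles)
```

For this vector the quartiles are 24.75, 49.5, and 74.25, none of which is itself a data point.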

As with **pandas**, **numpy** and **scipy** work on scalars, vectors, matrices, and tensors; you just need to specify the axis along which you'd like to apply a function:

In [None]:
data = np.fromfunction(lambda i, j: i + j, (3, 6), dtype=int)
data

In [None]:
np.mean(data, axis=0)
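
To see what the `axis` argument does, it can help to compare the possible choices side by side. This is a small sketch using the same 3 by 6 array (here called `grid`, an illustrative name): `axis=0` collapses the rows and returns one value per column, `axis=1` collapses the columns and returns one value per row, and omitting `axis` averages over every entry.

```python
import numpy as np

grid = np.fromfunction(lambda i, j: i + j, (3, 6), dtype=int)

col_means = np.mean(grid, axis=0)  # collapse the 3 rows: one value per column
row_means = np.mean(grid, axis=1)  # collapse the 6 columns: one value per row
overall = np.mean(grid)            # no axis given: mean over all 18 entries

print(col_means.shape, row_means.shape, overall)
```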

Remember that, for discrete data points, the $k$th (unnormalised) central moment is

$$
m_k = \frac{1}{n}\displaystyle\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^k
$$

To compute this, use scipy's `stats.moment(a, moment=1)`. For instance, the fourth central moment ($k=4$), which gives the kurtosis once it is divided by the squared variance, is

In [None]:
stats.moment(data, moment=4, axis=1)
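
As a sanity check, the moment formula can be evaluated by hand and compared with scipy. The sketch below (variable names are illustrative) also normalises the fourth central moment by the squared second central moment to obtain the kurtosis, and compares that with `stats.kurtosis` using `fisher=False` so that no constant is subtracted. The moment order is passed positionally here.

```python
import numpy as np
from scipy import stats

grid = np.fromfunction(lambda i, j: i + j, (3, 6), dtype=int)

# Central moments by hand, row by row: mean of (x_i - x_bar)^k
dev = grid - grid.mean(axis=1, keepdims=True)
m2 = np.mean(dev**2, axis=1)
m4 = np.mean(dev**4, axis=1)

# The same fourth central moment from scipy
m4_scipy = stats.moment(grid, 4, axis=1)

# Kurtosis is the fourth central moment normalised by the squared variance
kurt = m4 / m2**2
kurt_scipy = stats.kurtosis(grid, axis=1, fisher=False)

print(m4, m4_scipy)
print(kurt, kurt_scipy)
```

Each row of this array is the same sequence shifted by a constant, so all three rows share the same central moments.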

Covariances are found using `np.cov`.

In [None]:
np.cov(np.array([[0, 1, 2], [2, 1, 0]]))

Note that, as expected, the off-diagonal $C_{01}$ term is $-1$ because the two vectors are perfectly anti-correlated.
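
The covariance depends on the scale of the data, so it is often easier to read the correlation instead. As a brief sketch (with illustrative variable names), `np.corrcoef` returns the correlation matrix directly, and the same matrix can be recovered by dividing each covariance by the product of the two standard deviations. Note that `np.cov` treats each row as a variable and divides by $n - 1$ by default.

```python
import numpy as np

pair = np.array([[0, 1, 2], [2, 1, 0]])

cov = np.cov(pair)        # rows are variables; default normalisation is n - 1
corr = np.corrcoef(pair)  # correlation matrix

# Correlation by hand: C_ij / (s_i * s_j), with s_i the square root of C_ii
stds = np.sqrt(np.diag(cov))
corr_manual = cov / np.outer(stds, stds)

print(cov)
print(corr)
```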

## Review

In this very short introduction to statistics with code, you should have learned how to:

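- [x] compute basic statistics, such as the mean, standard deviation, median, mode, quantiles, moments, and covariances, using **numpy** and **scipy**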