# Announcements
- Homework 6 will be posted today; it covers finite difference methods and probability
- Outlook: statistical inference (this week and next), then an intro to machine learning


This notebook is based on course material developed together with Phil Marshall and Adam Mantz.

## Goal for the next two weeks

In a nutshell, this module is about *how data are turned into conclusions*.

Most examples and problems are taken from astrophysics, but otherwise the content is extremely general.

The scientific process:
* Propose observations
* Collect and "reduce" data
* **Explore and summarize the data**
* **Hypothesize and test**
* **Interpret, conclude**, speculate
* Report

"Turning data into conclusions" broadly refers to the **bold** items.

## Some key ideas

* i) **All data we collect include some degree of randomness**
* ii) **Any conclusions we draw must therefore incorporate uncertainty**
This means we should describe both the data and conclusions in the language of mathematical probability.

Our *conclusion* will take the form: the probability that something is true in light of (given) the data we collected.
$p(\mathrm{thing}|\mathrm{data})$

By the basic laws of probability, this can be written
$p(\mathrm{thing}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{thing}) p(\mathrm{thing})}{p(\mathrm{data})}$

We'll unpack this much more later, but importantly it means that
* iii) **There is a correct answer**
Just like in physics, the theory tells us the solution. The challenge is in *evaluating it*.

Within this framework,

* iv) **Data are constants**
Even though they are generated randomly by the Universe, *data that we have already collected are fixed numbers*.

Much of our job boils down to building a model that predicts (probabilistically) what data we might have gotten.

* v) **Things we don't know with perfect precision can be mathematically described as "random"**
That is, we use probabilities to model things that are uncertain, even if they are not truly random.

Again,
* There is a correct answer
* Unknowns, including *potential* data and our conclusions, are (mathematically) random
* *Collected* data are constants

# Review of Probability

Goals:
* Review the bits of mathematical probability that will be most important for us later.
* Prime our brains for probabilistic reasoning.

This will be quick -- the key concepts will already be familiar from other courses.

### Some terminology
* Sample space ($\Omega$): the set of all possible answers/outcomes for a given question/experiment.
* Event ($E$): any subset of $\Omega$.

The probability of an event will be a real function satisfying certain requirements...

### Axioms of probability:
* $\forall E: 0 \leq P(E) \leq 1$
* $P(\Omega) = P\left(\bigcup_{\mathrm{all~}i} E_i\right) = 1$
* If $E_i$ are mutually exclusive, $P\left(\bigcup_i E_i\right) = \sum_i P(E_i)$

This dry definition provides a function with the right properties to describe our intuitive understanding of probability.

### A familiar example

Let $\Omega$ be the set of states available to a system of fixed energy, e.g. a box full of gas particles.

With one additional (physics) assumption, that it's equally probable for the system to occupy any state in $\Omega$, this is the microcanonical ensemble in statistical mechanics.

### Discrete vs. continuous sample spaces
Very often the type of event we're interested in lives in a continuous sample space.

Our axioms mostly translate straightforwardly; for a real-valued quantity $X$, $P(\Omega)=1$ becomes the normalization condition

$\int_{-\infty}^{\infty} p(X=x)\,dx = 1$

We can always describe the discrete case as a continuous one where $p$ is a sum of Dirac delta functions.

### More definitions
If $X$ takes real values, then $p(X=x)$ is a **probability density** function, or PDF.
* $p(X=x)$ is *not* a probability! But the infinitesimal $p(X=x)\,dx$ and integrals like $P(x_0 < X < x_1) = \int_{x_0}^{x_1} p(X=x)\,dx$ are.
* We will rapidly become lazy and denote $p(X=x)$ incorrectly as $p(X)$ or $p(x)$. You have been warned.

The first bullet is highly relevant if we ever want to change variables, e.g. $x\rightarrow y(x)$
* $p(y) \neq p[x(y)]$; rather $p(y) = p(x) \left|dx/dy\right|$
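A quick numerical check of the change-of-variables rule, as a minimal sketch. The choice of $x\sim\mathrm{Uniform}(0,1)$ and $y=x^2$ is ours, made only for illustration; in that case $p(x)=1$ and $\left|dx/dy\right| = 1/(2\sqrt{y})$, which integrates to $\sqrt{b}-\sqrt{a}$ over a bin $[a,b]$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumption: x ~ Uniform(0, 1), so p(x) = 1 on [0, 1].
x = rng.uniform(0.0, 1.0, size=200_000)
y = x**2  # the change of variables y(x) = x^2

# Empirical probability of y landing in each of 10 equal-width bins.
edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(y, bins=edges)
frac = counts / y.size

# Prediction from p(y) = p(x) |dx/dy| = 1 / (2 sqrt(y)), integrated over each bin.
predicted = np.sqrt(edges[1:]) - np.sqrt(edges[:-1])

# Naive (incorrect) guess p(y) = p[x(y)] = 1, integrated over each bin.
naive = np.diff(edges)

for lo, f, p, n in zip(edges[:-1], frac, predicted, naive):
    print(f"bin starting at {lo:.1f}: empirical={f:.3f}  with Jacobian={p:.3f}  naive={n:.3f}")
```

The empirical bin fractions should track the column that includes the Jacobian factor, not the naive one.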

The **cumulative distribution function** (CDF) is the probability that $X\leq x$.
* Usually written: $F(x) = P(X \leq x) = \int_{-\infty}^x p(X=x')dx'$.
* Conversely, the PDF is the derivative of the CDF.
* (The CDF is sometimes referred to just as the distribution function.)
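The PDF/CDF relationship can be checked numerically. This sketch assumes a standard normal PDF (our choice, purely for illustration), integrates it on a grid to get the CDF, and then differentiates the CDF to recover the PDF.

```python
import math
import numpy as np

# Grid wide enough that the PDF is negligible at the edges.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

# Illustrative choice: standard normal PDF.
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# CDF as the running integral of the PDF (cumulative trapezoid rule).
cdf = np.concatenate([[0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * dx)])

# Closed-form CDF of the standard normal, for comparison.
cdf_exact = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

# Conversely, the PDF is the derivative of the CDF.
pdf_back = np.gradient(cdf, x)

print("max |CDF - exact|      :", np.max(np.abs(cdf - cdf_exact)))
print("max |dCDF/dx - PDF|    :", np.max(np.abs(pdf_back - pdf)))
print("CDF at the upper edge  :", cdf[-1])
```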

### Ridiculous example: an unfair coin toss
We flip a coin which is weighted to land on heads a fraction $q$ of the time. To make things numeric, let $X=0$ stand for an outcome of tails and $X=1$ for heads.

X | PDF $p(X)$ &nbsp; &nbsp; | CDF $F(X)$ &nbsp; &nbsp; 
:---: | :------------: | :------------:
0 |  $1-q$    |  $1-q$
1 | $q$ |  1

### Joint probability distributions
Things get more interesting when we deal with joint distributions of multiple events, $p(X=x$ and $Y=y)$, or just $p(x,y)$.

<table>
    <tr>
        <td><img src="graphics/prob_joint_correlated.png" width=75%> (Usually visualized as contours of <i>p</i>) </td>
    </tr>
</table>

The **marginal probability** of $y$, $p(y)$, means the probability of $y$ *irrespective* of what $x$ is.
* $p(y) = \int dx ~ p(x,y)$

<table>
    <tr>
        <td><img src="graphics/prob_joint_correlated.png" width=75%></td>
        <td><img src="graphics/prob_joint_marginal.png" width=75%></td>
    </tr>
</table>

The **conditional probability** of $y$ *given* a value of $x$, $p(y|x)$, is most easily understood this way
* $p(x,y) = p(y|x)\,p(x)$

i.e., $p$ of getting $x$ AND $y$ can be *factorized* into the product of
* $p$ of getting $x$ regardless of $y$, *and*
* $p$ of getting $y$ given $x$.

$p(y|x)$ is a (normalized) slice through $p(x,y)$ rather than an integral.

<table>
    <tr>
        <td><img src="graphics/prob_joint_correlated.png" width=75%></td>
        <td><img src="graphics/prob_joint_conditional.png" width=75%></td>
    </tr>
</table>
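To make "integral" versus "normalized slice" concrete, here is a minimal sketch on a grid. The joint PDF is assumed to be a correlated bivariate Gaussian with unit variances and correlation `rho = 0.8`; that choice, and the slice location $x=1$, are ours and are not taken from the figures.

```python
import numpy as np

rho = 0.8  # illustrative correlation coefficient (an assumption)

x = np.linspace(-6.0, 6.0, 601)
y = np.linspace(-6.0, 6.0, 601)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")  # joint[i, j] = p(x_i, y_j)

# Joint PDF: zero-mean, unit-variance bivariate Gaussian with correlation rho.
joint = np.exp(-(X**2 - 2.0 * rho * X * Y + Y**2) / (2.0 * (1.0 - rho**2)))
joint /= 2.0 * np.pi * np.sqrt(1.0 - rho**2)

# Marginals: integrate the joint over the other variable.
p_y = joint.sum(axis=0) * dx
p_x = joint.sum(axis=1) * dy

# Conditional p(y | x = x0): take a slice at fixed x, then normalize it.
i0 = np.argmin(np.abs(x - 1.0))
p_y_given_x = joint[i0, :] / (joint[i0, :].sum() * dy)

mean_marginal = np.sum(y * p_y) * dy
mean_conditional = np.sum(y * p_y_given_x) * dy
print("mean of p(y)        :", mean_marginal)
print("mean of p(y | x=1)  :", mean_conditional)

# The factorization p(x, y) = p(y | x) p(x), checked along the slice.
print("factorization holds :", np.allclose(joint[i0, :], p_y_given_x * p_x[i0]))
```

Because $x$ and $y$ are correlated here, the conditional mean is shifted away from the marginal mean.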

$x$ and $y$ are **independent** if $p(y|x) = p(y)$.

Equivalently, $p(x,y) = p(x)\,p(y)$.

<table>
    <tr>
        <td><img src="graphics/prob_joint_independent.png" width=75%></td>
    </tr>
</table>

## Exercise
Take the coin tossing example from earlier, where $P(\mathrm{heads})=q$ and $P(\mathrm{tails})=1-q$ for a given toss, and toss the coin twice. Assume that these probabilities hold independently for each toss.

Find:

1. The conditional probability that both tosses are heads, given that the first toss is heads.
2. The conditional probability that both tosses are heads, given that at least one of the tosses is heads.

## Exercise
Say we keep on tossing this coin, still assuming independence, a total of $N$ times. Work out the probability that exactly $n$ of these turn out to be heads.

### How to count things
The answer to the previous exercise is the PDF of the binomial distribution

$P(n|q,N) = {N \choose n} q^n (1-q)^{N-n}$

To introduce some notation, we might write this as

$n \sim \mathrm{Binom}(q,N)$

Here the squiggle means "is a random variable that is distributed as" (as opposed to "has the same order of magnitude as" or "scales with", the common usages in physics).
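As a sanity check on the binomial formula, the sketch below simulates many sets of $N$ independent tosses and compares the observed frequency of each $n$ with $P(n|q,N)$. The values of `q`, `N`, and the number of simulated experiments are arbitrary illustrative choices.

```python
import math
import numpy as np

def binom_pmf(n, q, N):
    """P(n | q, N) = C(N, n) q^n (1 - q)^(N - n)."""
    return math.comb(N, n) * q**n * (1.0 - q) ** (N - n)

q, N = 0.3, 10  # illustrative values
rng = np.random.default_rng(0)

# Each row is one experiment of N independent tosses; True means heads.
tosses = rng.random((100_000, N)) < q
heads = tosses.sum(axis=1)

empirical = np.bincount(heads, minlength=N + 1) / heads.size
exact = np.array([binom_pmf(n, q, N) for n in range(N + 1)])

for n in range(N + 1):
    print(f"n={n:2d}  simulated={empirical[n]:.4f}  formula={exact[n]:.4f}")
```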

Recall that a key assumption was that each toss (trial) was independent. If we write the mean number of heads as $\mu=qN$ and also assume that $q$ is small while $N$ is large, then a series of irritating limits and substitutions yields the __Poisson distribution__

$P(n|\mu) = \frac{\mu^n e^{-\mu}}{n!}$


This is an extremely important result, given that most astronomy and physics experiments boil down to counting events that are rare compared with the number of time intervals in which they might happen (and be recorded).
* E.g., most obviously, the number of photons from some source hitting a particular CCD pixel during an observation.
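Rather than working through the limits by hand, we can watch the binomial distribution approach the Poisson numerically. This sketch holds $\mu = qN$ fixed (at an arbitrary illustrative value of 3) while $N$ grows, and records the largest difference between the two distributions over $n = 0, \ldots, 10$.

```python
import math

def binom_pmf(n, q, N):
    """P(n | q, N) for the binomial distribution."""
    return math.comb(N, n) * q**n * (1.0 - q) ** (N - n)

def poisson_pmf(n, mu):
    """P(n | mu) for the Poisson distribution."""
    return mu**n * math.exp(-mu) / math.factorial(n)

mu = 3.0  # illustrative mean number of heads, held fixed
max_diff = {}
for N in (10, 100, 10_000):
    q = mu / N  # q shrinks as N grows, so that q * N stays equal to mu
    max_diff[N] = max(
        abs(binom_pmf(n, q, N) - poisson_pmf(n, mu)) for n in range(11)
    )
    print(f"N={N:6d}  q={q:.4f}  max |Binom - Pois| = {max_diff[N]:.2e}")
```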

The Poisson distribution has the following (probably familiar) properties:
* Expectation value (mean) $\langle n\rangle = \mu$
* Variance $\left\langle \left(n-\langle n \rangle\right)^2 \right\rangle = \mu$
* Additivity: $n_1+n_2\sim \mathrm{Pois}(\mu_1+\mu_2)$ if $n_i\sim\mathrm{Pois}(\mu_i)$
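These three properties are easy to check by drawing samples. The means `mu1` and `mu2` below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = 2.5, 4.0  # illustrative Poisson means

n1 = rng.poisson(mu1, size=500_000)
n2 = rng.poisson(mu2, size=500_000)
total = n1 + n2  # should behave like a Poisson draw with mean mu1 + mu2

print("n1:    mean =", n1.mean(), " variance =", n1.var())
print("n2:    mean =", n2.mean(), " variance =", n2.var())
print("n1+n2: mean =", total.mean(), " variance =", total.var())
```

For each line, the sample mean and sample variance should agree with each other and with the corresponding $\mu$ to within sampling noise.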

### The central limit theorem

Another important theorem states, in its most common form:
* If $X_i$ are independent and drawn from the same PDF, with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the sum of $n$ $X$'s tends to the normal (Gaussian) distribution with mean $n\,\mu$ and variance $n\,\sigma^2$.
* Alternatively, the average $\sum_i X_i/n$ tends to normal with mean $\mu$ and variance $\sigma^2/n$.

Among other things, this implies that a Poisson distribution with large enough $\mu$ closely resembles a Gaussian.
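A small experiment shows the theorem at work. As an illustrative assumption, we draw the $X_i$ from an exponential distribution with $\mu=\sigma=1$, which is strongly skewed, and track the skewness of the average of $n$ draws. A Gaussian has zero skewness, so this is one rough measure of how "Gaussian" the average has become.

```python
import numpy as np

rng = np.random.default_rng(2)

def standardized_means(n, trials=50_000):
    """Averages of n exponential draws, shifted and scaled to mean 0, variance 1."""
    draws = rng.exponential(1.0, size=(trials, n))  # mu = 1, sigma^2 = 1
    return (draws.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

def skewness(z):
    """Sample skewness: third central moment over the cubed standard deviation."""
    return np.mean((z - z.mean()) ** 3) / z.std() ** 3

skew = {n: skewness(standardized_means(n)) for n in (1, 4, 64)}
for n, s in skew.items():
    print(f"n={n:3d}  skewness of the average = {s:.3f}")
```

The skewness shrinks as $n$ grows but does not vanish at any finite $n$.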

#### Cautions

This is a powerful result, but we need to keep some things in mind.
1. It doesn't tell us, in general, how big $n$ needs to be for things to become "Gaussian enough" for a given purpose. This would need to be determined by more careful analysis.
2. It's tempting to bin up data (e.g. Poisson counts in adjacent pixels/channels/integrations) enough to justify using the simple Gaussian distribution, but this risks throwing away key information in the data set (e.g. spatial/spectral/temporal structure).

# Generative Models

Goals:
* Introduce generative models in the context of mocking data and inference
* Introduce probabilistic graphical models as a tool for model visualization
* Practice building some simple models

## Optional further reading
(Don't buy these for this course; they are just pointers if you'd like to learn more.)
* Ivezic et al, 'Statistics, Data Mining, and Machine Learning in Astronomy', Sections 3.3 and 3.7
* Bishop, 'Pattern Recognition and Machine Learning,' Sections 8.1 and 8.2

A **generative model** formalizes our understanding of how a data set comes to exist, including
* physical processes happening out there in the Universe
* instrumental effects and the measurement process
* any computations done prior to calling the result a "data set"

In other words, it's what we need in order to generate a mock data set.

To actually generate mock data, we need to specify the **sampling distribution**, $p(\mathrm{data}|\mathrm{model})$. This PDF is the mathematical expression of our generative model.

* The assumed "$\mathrm{model}$" specifies the form and parameters of the sampling distribution
* A random draw from $P(\mathrm{data}|\mathrm{model})$ is a dataset, "$\mathrm{data}$"

<!--
* It shows up directly in Bayes Theorem, and ideally (as a function of model parameters) as the likelihood function in maximum-likelihood fitting.
-->

What are generative models useful for?

* Performing inference: constructing the *sampling distribution* or *likelihood function*
* Testing inference: does our analysis, run on mock data, recover the input model?
* Checking inferences: do mock data generated from a fitted model resemble the real data?

A **probabilistic graphical model** (PGM) is a very useful way of visualizing a generative model.
* They sketch out the procedure for how one would generate mock data in practice.
* They illustrate the interdependence of model parameters, and the dependence of data on parameters.
* _They also (therefore) represent a conditional factorization of the PDF for all the data and model parameters._

**Many, many** mistakes can be avoided by sketching out a PGM at the outset of a statistical analysis.

Technically, a PGM is a type of *directed acyclic graph*, where **nodes** and **edges** represent parts of the model.

Let's look at a very simple example...

Here's an image (and a zoom-in):

<table><tr width=90%>
<td><img src="graphics/tour_cluster_image.png" height=300></td>
<td><img src="graphics/tour_cluster_image_zoom.png" height=300></td>
</tr></table>

Our measurement is the number of counts in each pixel. Here is a generative model:
* There's an object emitting light, whose properties are parametrized by $\theta$.
* From $\theta$, we can determine the average flux falling on a given pixel $k$, $F_k$.
* Given the exposure time of our observation, $T$, and some conversion factors, $F_k$ determines the average number of counts expected, $\mu_k$.
* The number of counts measured, $N_k$, is a Poisson draw, given the average $\mu_k$.

Notice that the model was described in terms of conditional relationships.
* $\theta \Rightarrow F_k$
* $F_k,T \Rightarrow \mu_k$
* $N_k \sim \mathrm{Poisson}(\mu_k)$

The PGM will do the same, visually.

This is what it looks like:
<img src="graphics/pgms_pixelcounts.png">

Ingredients of a PGM:
* **Nodes** represent PDFs for parameters
* **Edges** represent conditional relationships
* **Plates** represent repeated model components whose contents are conditionally independent

Types of nodes:
* **Circles** represent a PDF. This parameter is a *stochastic* function of the parameters feeding into it.
* **Points** represent a delta-function PDF. This parameter is a *deterministic* function of the parameters feeding into it.
* **Double circles** (or shading) indicate measured data. They are stochastic in the context of generating mock data, but fixed in the context of parameter inference.

### Exercise
How are these PGMs different, and what does the difference mean?

<table><tr><td>
<img src="graphics/pgms_pixelcounts.png">
</td><td>
<img src="graphics/pgms_pixelcounts2.png">
</td></tr></table>

 
 

By mapping the conditional dependences of a model, PGMs illustrate how to factorize (and hence draw samples from) the joint PDF for all variables:

$p(\theta,T,\{F_k, \mu_k, N_k\}) = p(\theta)p(T) \prod_k P(N_k|\mu_k)p(\mu_k|F_k,T)p(F_k|\theta)$

<img src="graphics/pgms_pixelcounts.png">

In this case, some PDFs are delta functions, so we can straightforwardly marginalize over such _deterministic_ variables:

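$p(\theta,T,\{N_k\}) = \int \prod_k dF_k\,d\mu_k\; p(\theta)\,p(T) \prod_k P(N_k|\mu_k)\,p(\mu_k|F_k,T)\,p(F_k|\theta)$

$= p(T) ~ \underbrace{p(\theta)}_{\mathrm{prior}(\theta)} ~ \underbrace{\prod_k P\left(N_k|\mu_k(\theta,T)\right)}_{\mathrm{sampling~distribution~of~}\vec{N}}$

Treating the exposure time $T$ as known (i.e. conditioning on it) removes the factor $p(T)$, leaving the prior for $\theta$ times the sampling distribution of $\vec{N}$.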
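The conditional relationships $\theta \Rightarrow F_k$, then $F_k,T \Rightarrow \mu_k$, then $N_k \sim \mathrm{Poisson}(\mu_k)$ translate directly into a recipe for generating mock data. The sketch below follows that recipe. Everything specific in it is a hypothetical stand-in rather than part of the lecture's model: the circular Gaussian source, the uniform prior ranges, the image size, the exposure time, and the unit conversion factor.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_theta():
    """Draw theta = (amplitude, width) from an assumed, purely illustrative prior."""
    amplitude = rng.uniform(0.5, 2.0)  # central flux, in counts / s / pixel
    width = rng.uniform(2.0, 6.0)      # source size, in pixels
    return amplitude, width

def flux_per_pixel(theta, shape=(32, 32)):
    """Deterministic F_k(theta): a hypothetical circular Gaussian source."""
    amplitude, width = theta
    yy, xx = np.indices(shape)
    cy, cx = (shape[0] - 1) / 2.0, (shape[1] - 1) / 2.0
    r2 = (xx - cx) ** 2 + (yy - cy) ** 2
    return amplitude * np.exp(-0.5 * r2 / width**2)

T = 100.0                  # exposure time in seconds, treated as known
theta = draw_theta()       # theta ~ p(theta)
F = flux_per_pixel(theta)  # theta => F_k
mu = F * T                 # F_k, T => mu_k (conversion factor of 1 assumed)
N = rng.poisson(mu)        # N_k ~ Poisson(mu_k), independently for each pixel

print("theta (amplitude, width):", theta)
print("expected total counts   :", mu.sum())
print("mock total counts       :", N.sum())
```

Only the last step is stochastic once $\theta$ has been drawn; the two steps before it are deterministic.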

### Exercise

<table width=60%><tr>
    <td><img src="graphics/pgms_a-c-d.png"></td>
    <td><img src="graphics/pgms_c-y-d.png"></td>
</tr></table>

* On your own, write down the probability expressions illustrated by these two graphs. 
* When you're ready, raise your hand (in Zoom) to report back to the class.


### Take-home messages

* Both simulation of mock data and model inference from data require a model for how the Universe (or our computer) generates data.
* PGMs are a helpful way of visualizing the conditional dependences of a model (how the probability expressions factorize).

Note: the `daft` Python package is useful for making pretty PGMs.

### Exercise: linear regression

Your data is a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data. To start exploring this idea, you decide to generate some simulated data, to compare with your real dataset.

In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that the "error" on $y_k$ is Gaussian (mean of $a+bx_k$ and standard deviation $\sigma_k$).

1. Draw the PGM, and write down the corresponding probability expressions, for this problem.

2. What (unspecified) assumptions, if any, would you have to make to actually generate data? Which assumptions do you think are unlikely to hold in practice? Choose one (or more) of these assumptions and work out how to generalize the PGM/generative model to avoid making it.