In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
%matplotlib notebook

# Understanding Data

## (or, lies I've been told about statistics)

#### Version 0.1

-----

By AA Miller (Northwestern/Adler Planetarium)

5 Sep 2019

According to the [DSFP curriculum](https://astrodatascience.org/curriculum), this entire week is going to be devoted to statistics (a topic that is an entire field unto itself). 

In practice, however, statistics is too big a topic to fully tackle in a week. 

A more apt description for this week's theme is **models**, and this is a good thing, because everything in astronomy is a model (more on this in just a bit).

Broadly speaking, we build models for two purposes:

*forecasting*

and

*inference*.

Forecasting is about predicting the outcome of some future, as-yet-unobserved event. 

This can be extremely useful if you have measured the velocity dispersion of some galaxy and you would like to know the mass of the black hole at the center of that galaxy.

(*aside* –– machine learning is essentially all about forecasting)

Inference is about understanding the fundamental relationship between different parameters. Inference can be used for forecasting (the other way around isn't always true). 

Inference is necessary for understanding physics. While the following statement can be incredibly useful, "based on 10 measurements of a force applied to this ball, I predict that a newly applied force, $F_\mathrm{new}$, will produce acceleration $F_\mathrm{new}/m_\mathrm{ball}$," it is far more useful to say $F = ma$ for any object with mass $m$.

(*aside* –– machine learning is not useful for inference)

Before we start any discussion of models, we need to back up and zoom out a bit, because models require...

## Problem 1) Data

At the core of everything we hope to accomplish with the DSFP stands a single common connection: data.

Using a vast array of telescopes across the planet and in outer space, we are able to collect data that allow us to unravel the mysteries of the universe!

This naturally leads to the question:

**Problem 1a**

What is data?

*Take a few min to discuss this with your partner*

**Solution 1a**

While we just discussed several different ideas about the nature of data, the main thing I want to emphasize is the following: data are *constants*.

Hypothetical question: what if a data reduction pipeline changes (for the better) and, as a result, the end products (i.e. data output) change? 

Doesn't this mean that the data have changed?

This is a subtle point, but, in this case the data have not changed. The output from the pipeline, which we often think of as "data," is actually a *model*. Adjustments in the pipeline result in changes in the model, and any downstream analysis changes because the model has changed.

Models vary, data don't.

To clearly reiterate this point: imagine you want to determine the "turn-off" age for an open cluster. To do this, you would need to know the flux of each star in the cluster, which could then be compared to stellar evolution models.

But wait!

The stellar fluxes are themselves a model. Before LSST can report some flux plus or minus some uncertainty in a database, the project has to do the following: bias-subtract the raw CCD image (model #1), flat-field correct the bias-subtracted image (model #2), identify the pixel location of all the stars in the image (model #3), align the relative pixel-positions with a standard astrometric reference frame (model #4), determine the relative flux of every star and galaxy in the image (model #5), and convert those relative fluxes to absolute values by identifying "standard" stars with known absolute flux to calibrate the relative flux measurements (model #6).

So really, the flux of a star, which we would normally refer to as a data point for the stellar evolution model, is in fact a model of several models.

This brings us back to the earlier claim: everything in astronomy is models. (it's models all the way down...)

This should also make you a little nervous –– have the uncertainties been properly estimated for each of those models, and if they have, have they been propagated to the final numbers that are available in the database?
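To make the "model of models" point concrete, here is a minimal sketch of the first two steps (bias subtraction and flat-fielding) with first-order uncertainty propagation. The pixel values, the uncertainties, and the function name `calibrate` are invented for illustration; a real pipeline is far more involved than this.

```python
import numpy as np

def calibrate(raw, raw_unc, bias, bias_unc, flat, flat_unc):
    """Toy calibration: (raw - bias) / flat, with first-order error propagation.

    All inputs are hypothetical pixel values; independent Gaussian
    uncertainties are assumed, which is itself a modeling choice.
    """
    debiased = raw - bias                                  # model #1
    debiased_unc = np.sqrt(raw_unc**2 + bias_unc**2)
    science = debiased / flat                              # model #2
    # relative uncertainties add in quadrature for a ratio
    science_unc = np.abs(science) * np.sqrt((debiased_unc / debiased)**2
                                            + (flat_unc / flat)**2)
    return science, science_unc

raw = np.array([1200.0, 1500.0, 900.0])
bias = np.array([100.0, 100.0, 100.0])
flat = np.array([1.00, 0.95, 1.05])
science, science_unc = calibrate(raw, np.sqrt(raw), bias, 2.0, flat, 0.01)
print(science)
print(science_unc)
```

Even in this toy version, the final uncertainty is larger than the raw counting noise alone, and it is only correct if every intermediate uncertainty was itself estimated correctly.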

## Problem 2) Linear Models and Least-Squares Fitting

Execute the cell below in order to populate your namespace with some data that we will use for the subsequent analysis.

In [14]:
# from ____ import ____
x, y, y_unc = pollute_namespace()

You now have some data $x$ and $y$ (ignore all the other data for now). 

**Problem 2a**

As a good data scientist, what is the first thing you should do with this data?

In [15]:
# complete

**Solution 2a**

I intentionally misled with the previous question. 

The most important thing to do with *any* new data is understand where the data came from and what they represent. While the data are constants, they represent measurements of some kind. Thus, I would argue the most important thing to do with these data is to understand where they came from (others may disagree).

In this case, we have "toy" data that were randomly generated for today's lecture. In that sense, there are no units or specific measurements that otherwise need to be understood. 

**Problem 2b**

Make a scatter plot of the data to understand the relationship between $x$ and $y$.

In [16]:
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_xlabel('x', fontsize=14)
ax.set_ylabel('y', fontsize=14)
fig.tight_layout()


There is a very good chance, though I am not specifically assuming anything, that upon making the previous plot you had a thought along the lines of "these points fall on a line" or "these data represent a linear relationship."  

**Problem 2c** 

Is the assumption of linearity valid for the above data?

Is it convenient?

*Take a few min to discuss this with your partner*

**Solution 2c**

One of the lessons from this lecture is the following: *assumptions are dangerous*! In general, a linear relationship between data should only be assumed if there is a very strong theoretical motivation for such a relationship. Otherwise, the relationship could be just about anything, and inference based on an assumption of linearity may lead to dramatically incorrect conclusions (forecasting, on the other hand, may be just fine).

That being said, assuming the data represent (are drawn from) a linear relationship is often very convenient. There is a large host of tools designed to solve this very problem.

Let us proceed with convenience and assume the data represent a linear relationship. In that case, to understand the underlying relationship between the data, we need to fit a line to the data. 

Nearly all statistical texts or courses start with the "standard" [least-squares fitting](https://en.wikipedia.org/wiki/Least_squares) procedure. In brief, least-squares minimizes the sum of the squared residuals between the data and the fitting function.

Furthermore, for models that are linear in their parameters, such as polynomials, straightforward matrix algebra gives the exact answer: the model parameters that minimize the sum of the squares (we will skip the derivation for now).

Instead, we will rely on Google, which informs us that the easiest way to perform a linear least-squares fit to the above data is with [`np.polyfit`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html), which performs a least-squares polynomial fit to two `numpy` arrays.
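For the curious, here is a minimal sketch of the matrix approach. For a line, the least-squares parameters solve the normal equations $(A^T A)\,\theta = A^T y$, where $A$ is the design matrix. The toy data below are made up for illustration (they are not the lecture data), and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(42)
x_toy = rng.uniform(0, 10, 30)
y_toy = 2.0 * x_toy + 5.0 + rng.normal(0, 1, x_toy.size)

# design matrix for a line: one column for x, one column of ones
A = np.vstack([x_toy, np.ones_like(x_toy)]).T

# solve (A^T A) theta = A^T y rather than forming the inverse explicitly
theta = np.linalg.solve(A.T @ A, A.T @ y_toy)

print("normal equations: slope={:.4f}, intercept={:.4f}".format(*theta))
print("np.polyfit:       slope={:.4f}, intercept={:.4f}".format(*np.polyfit(x_toy, y_toy, 1)))
```

The two approaches should agree to numerical precision, since both minimize the same sum of squared residuals.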

**Problem 2d**

Use `np.polyfit()` to fit a line to the data. Overplot the best-fit line on the data.

In [28]:
p = np.polyfit(x, y, 1)
p_eval = np.poly1d(p)

fig = plt.figure(figsize=(6,5))
ax = plt.subplot2grid((3,1), (0, 0), rowspan=2)
ax_res = plt.subplot2grid((3,1), (2, 0), sharex=ax)

ax.plot(x, y, 'o')
ax.plot([0,1000], p_eval([0,1000]))
ax.set_ylabel('y', fontsize=14)

ax_res.plot(x, y - p_eval(x), 'o')
ax_res.axhline(color='C1')
ax_res.set_ylabel('residuals', fontsize=14)
ax_res.set_xlabel('x', fontsize=14)
plt.setp(ax.get_xticklabels(), visible=False)

fig.tight_layout()


There is a very good chance, though, again, I am not specifically assuming anything, that for the previous plots you plotted `x` along the abscissa and `y` along the ordinate. 

[Honestly, there's no one to blame if this is the case; this has essentially been drilled into all of us from the moment we started making plots. In fact, in `matplotlib` we cannot change the name of the abscissa label without adjusting the `xlabel`.]

This leads us to an important question, however. What if we flip `y` and `x`? Does that in any way change the results for the fit?

**Problem 2e**

Perform a linear least-squares fit to `x` vs. `y` (or if you already fit this, then reverse the axes). As above, plot the data and the best-fit model.

To test if the relation is the same between the two fits, compare the predicted `y` value for both models corresponding to `x = 50`.

In [38]:
p_yx = np.polyfit(y, x, 1)
p_yx_eval = np.poly1d(p_yx)

fig = plt.figure(figsize=(6,5))
ax = plt.subplot2grid((3,1), (0, 0), rowspan=2)
ax_res = plt.subplot2grid((3,1), (2, 0), sharex=ax)

ax.plot(y, x, 'o')
ax.plot([-150,2000], p_yx_eval([-150,2000]))
ax.set_ylabel('x', fontsize=14)

ax_res.plot(y, x - p_yx_eval(y), 'o')
ax_res.axhline(color='C1')
ax_res.set_ylabel('residuals', fontsize=14)
ax_res.set_xlabel('y', fontsize=14)
plt.setp(ax.get_xticklabels(), visible=False)

fig.tight_layout()

print("For y vs. x, we find y = {:.4f}x + {:.4f}".format(p[0], p[1]))
print("\t for x=50 we would forecast y={:.2f}".format(p_eval(50)))
print("For x vs. y, we find x = {:.4f}y + {:.4f}".format(p_yx[0], p_yx[1]))
print("\t for x=50 we would forecast y={:.2f}".format((50 - p_yx[1])/p_yx[0]))

For y vs. x, we find y = 1.6729x + 83.8515
	 for x=50 we would forecast y=167.49
For x vs. y, we find x = 0.5549y + -12.7899
	 for x=50 we would forecast y=113.16


We have now uncovered lie #1! The relationship between $x$ and $y$ is *not* the same as the relationship between $y$ and $x$ (as far as least-squares fitting is concerned). 

Wait?! This is sorta like saying $F = ma$, but $a \neq F/m$. 

How can this be?

There are a couple of essential assumptions that go into standard least-squares fitting:

1. There is one dimension along which the data have negligible uncertainties
2. Along the other dimension **all** of the uncertainties can be described via Gaussians of known variance

These two conditions are *rarely* met for astronomical data. While condition 1 can be satisfied (e.g., time series data where there is essentially no uncertainty on the time of the observations), I contend that condition 2 is rarely, if ever, satisfied.
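The asymmetry between the two fits can be checked numerically on made-up data. Regressing $y$ on $x$ minimizes the vertical residuals, while regressing $x$ on $y$ minimizes the horizontal residuals. For ordinary least squares the product of the two slopes equals the squared correlation coefficient, $r^2$, so the two lines coincide only when the data are perfectly correlated. The data and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 1000, 50)
y_toy = 1.5 * x_toy + 20 + rng.normal(0, 150, x_toy.size)

slope_yx, intercept_yx = np.polyfit(x_toy, y_toy, 1)   # y as a function of x
slope_xy, intercept_xy = np.polyfit(y_toy, x_toy, 1)   # x as a function of y

r = np.corrcoef(x_toy, y_toy)[0, 1]

print("slope of y on x:          {:.4f}".format(slope_yx))
print("inverted slope of x on y: {:.4f}".format(1 / slope_xy))
print("product of slopes: {:.4f}, r^2: {:.4f}".format(slope_yx * slope_xy, r**2))
```

Because $r^2 \le 1$, the inverted $x$-on-$y$ line is always at least as steep as the $y$-on-$x$ line; which one (if either) is appropriate depends on where the uncertainties actually live.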

## Problem 3) Lies, Damned Lies, and Statistics

While I may have stretched the definition of lie above, there are several misconceptions associated with model fitting (some of which are a direct consequence of using least-squares) that we should address.

**Lie #2**

For a 2-dimensional data set with one dependent variable (i.e. $y$ depends on $x$) and $n$ observations, a polynomial of the $n^\mathrm{th}$ degree can "perfectly" fit the data.

In [44]:
n_obs_n_poly()


This "lie" can only be true if the data are single-valued at every value of the independent variable. Polynomials of the form $y = \sum_n a_n x^n$ cannot take on multiple values of $y$ at a single value of $x$.
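A small illustration with made-up numbers: five observations at distinct $x$ values are reproduced exactly by a 4th-degree polynomial, but as soon as two observations share an $x$ value while having different $y$ values, no polynomial of any degree can pass through both. The arrays below are hypothetical.

```python
import numpy as np

y_obs = np.array([1.0, 3.0, 2.0, 5.0, 9.0])

# five distinct x values: a degree-4 polynomial interpolates all five points
x_distinct = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
resid_distinct = y_obs - np.polyval(np.polyfit(x_distinct, y_obs, 4), x_distinct)

# two observations share x = 4: only four distinct x values remain, so
# degree 3 is the highest degree that the data determine uniquely
x_repeat = np.array([1.0, 2.0, 3.0, 4.0, 4.0])
resid_repeat = y_obs - np.polyval(np.polyfit(x_repeat, y_obs, 3), x_repeat)

print("max |residual|, distinct x:", np.abs(resid_distinct).max())
print("max |residual|, repeated x:", np.abs(resid_repeat).max())
```

In the repeated case the least-squares polynomial passes through the mean of the two $y$ values at the shared $x$, leaving residuals that no increase in degree can remove.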

**Lie #3**

The model that minimizes the $\chi^2$ is best.

(or a slightly more nuanced version of this statement - the model where the $\chi^2$ per degree of freedom is closest to 1 is best)

In [117]:
chi2_example()

In this case we see that by increasing the order of the polynomial fit to the data we can lower the "reduced" $\chi^2$, getting a value that is much, much closer to 1. 

However, a "chi-by-eye" examination of the two models presented here suggests that there is absolutely no way that the $14^\mathrm{th}$-order polynomial provides a better explanation of the relationship between $y$ and $x$. Certainly, none of us would use this model for forecasting.

This should make you very wary of any $\chi^2$-minimization techniques as a method to identify the "best-fit" model.

(You will nevertheless find this in all sorts of papers throughout the literature.)

(Full disclosure - I have been guilty of this in my past)

This raises an important and critical question: how does one decide which model is "best"?

(This is well beyond the scope of this lecture, but we will talk about this on Thursday and Friday)

**Lie #4**

You cannot fit a model with more free parameters than there are data points.

Suppose you have the following data:

In [344]:
noisy_plot()


If you also suppose that $y$ varies linearly with $x$ (perhaps you have some theory that guides you towards this intuition), but that these data are produced via some noisy process, then you could write down the probability for any individual observation $y_i$ as:

$$P(y_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - m x_i - b)^2}{2\sigma_i^2}\right)$$

where $m$ is the slope of the line, $b$ is the intercept, and $\sigma_i$ is the *unknown* uncertainty on the $i^\mathrm{th}$ observation.

(note - I am being a little sloppy in my notation)

The probability of all of the observations is then the product of the probability for each observation:

$$P = \prod_i P(y_i)$$

and the model for the data now has 20 parameters ($m$, $b$, and 1 $\sigma_i$ for each of the 18 observations) and only 18 observations.
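For numerical work it is usually more convenient to take the logarithm of this product, which turns it into a sum: $\ln P = \sum_i \left[-\frac{1}{2}\ln(2\pi\sigma_i^2) - \frac{(y_i - m x_i - b)^2}{2\sigma_i^2}\right]$. A minimal sketch, with made-up data and a hypothetical function name `ln_prob`:

```python
import numpy as np

def ln_prob(m, b, sigmas, x, y):
    """Log of the product of independent Gaussian probabilities."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigmas**2)
                  - (y - m * x - b)**2 / (2 * sigmas**2))

x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([26.5, 28.0, 33.1, 34.2])
sigmas_toy = np.array([3.0, 2.5, 3.5, 3.0])   # one sigma_i per observation

# direct evaluation of the product, for comparison
p_each = (np.exp(-(y_toy - 3 * x_toy - 23)**2 / (2 * sigmas_toy**2))
          / np.sqrt(2 * np.pi * sigmas_toy**2))

print(ln_prob(3, 23, sigmas_toy, x_toy, y_toy), np.log(np.prod(p_each)))
```

The two printed numbers should agree; the sum form is preferred because the product of many small probabilities can underflow.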

Surely this is hopeless and we cannot possibly constrain $m$ and $b$ right? ...

**right?**

In [355]:
nuissance_model()

Fitting a 20 parameter model with 18 data points...
    Hopefully this does not destroy the universe...
        We are approaching the singularity...
            I wish my last day on Earth wasn't with this Adam bozo...



This is somewhat amazing.

We have some data, we have no idea what the uncertainties are on the individual observations, and yet(!), we are able to recover the correct slope and intercept of the line.

(Note - we will discuss the particulars of the above model in greater detail later this week. I would rarely recommend the adopted model in practice!)

I hope at this stage I've convinced you that some common tropes regarding least-squares fitting do not always apply. 

I also hope that I've raised some concerns on your end about the best methods for determining the slope and intercept of linearly related data.

Now on to another major problem for least-squares fitting...

## Problem 4) Outliers

## Helper Functions

In [9]:
def pollute_namespace():

    np.random.seed(212)

    x = np.random.uniform(0,1000,56)
    sigma = 150
    y = 1.75*x + 41 + np.random.normal(0,sigma,len(x))
    y_unc = sigma*np.ones_like(x)
    
    return x, y, y_unc

In [11]:
x, y, y_unc = pollute_namespace()

In [12]:
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
ax.set_xlabel('x', fontsize=14)
ax.set_ylabel('y', fontsize=14)
fig.tight_layout()


In [13]:
fig, ax = plt.subplots()
ax.errorbar(x, y, y_unc, fmt='o')
ax.set_xlabel('x', fontsize=14)
ax.set_ylabel('y', fontsize=14)
fig.tight_layout()


In [45]:
def n_obs_n_poly():
    np.random.seed(212)
    
    x = np.ones(5)*4
    y = np.random.uniform(0,100,5)
    
    fig, ax = plt.subplots()
    ax.plot(x, y, 'o')
    ax.set_xlabel('x', fontsize=14)
    ax.set_ylabel('y', fontsize=14)
    ax.set_xlim(0,10)
    fig.tight_layout()

In [46]:
n_obs_n_poly()


In [115]:
def chi2_example():
    np.random.seed(212)
    
    x = np.random.uniform(0,100,20)
    sigma = 15
    y = 2.1*x + 41 + np.random.normal(0,sigma,len(x))
    y_unc = sigma/3*np.ones_like(x)

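    # first-order (linear) fit: 2 free parameters
    p1 = np.polyfit(x, y, 1)
    p1_eval = np.poly1d(p1)
    chi2_1 = np.sum((y - p1_eval(x))**2/y_unc**2)/(len(x)-2)
    # high-order fit: a polynomial of order npoly has npoly + 1 free parameters
    npoly = 14
    p10 = np.polyfit(x, y, npoly)
    p10_eval = np.poly1d(p10)
    chi2_10 = np.sum((y - p10_eval(x))**2/y_unc**2)/(len(x)-(npoly+1))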
    
    
    fig = plt.figure(figsize=(7,8))
    ax = plt.subplot2grid((4,1), (0, 0), rowspan=2)
    ax_res1 = plt.subplot2grid((4,1), (2, 0), sharex=ax)
    ax_res10 = plt.subplot2grid((4,1), (3, 0), sharex=ax)

    ax.errorbar(x, y, y_unc, fmt='o')
    ax.plot(np.linspace(0,100,1000), p1_eval(np.linspace(0,100,1000)), 
            label=r'$1^\mathrm{st}$ order polynomial; $\chi^2_\nu = $' + '{:.4f}'.format(chi2_1))
    ax.plot(np.linspace(0,100,1000), p10_eval(np.linspace(0,100,1000)), 
            label=r'$14^\mathrm{th}$ order polynomial; $\chi^2_\nu = $' + '{:.4f}'.format(chi2_10))
    ax.set_ylim(-2700, 3200)
    ax.legend(fancybox=True)

    ax_res1.errorbar(x, y - p1_eval(x), y_unc, fmt='o')
    ax_res1.axhline(color='C1')
    ax_res1.set_ylabel('residuals', fontsize=14)

    ax_res10.errorbar(x, y - p10_eval(x), y_unc, fmt='o')
    ax_res10.axhline(color='C2')
    ax_res10.set_ylabel('residuals', fontsize=14)
    ax_res10.set_xlabel('x', fontsize=14)
    plt.setp(ax.get_xticklabels(), visible=False)
    plt.setp(ax_res1.get_xticklabels(), visible=False)
    fig.tight_layout()
    

In [116]:
chi2_example()

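One thing worth noticing in `chi2_example()` is that the quoted uncertainties are `sigma/3`, i.e., a factor of 3 smaller than the scatter used to generate the data. Because $\chi^2$ scales as the inverse square of the uncertainties, this inflates the reduced $\chi^2$ of the linear model by a factor of 9. The sketch below reuses the same generating parameters but is otherwise my own illustration; `reduced_chi2` is a hypothetical helper.

```python
import numpy as np

def reduced_chi2(y, y_model, y_unc, n_params):
    """Chi-squared per degree of freedom (hypothetical helper)."""
    return np.sum((y - y_model)**2 / y_unc**2) / (len(y) - n_params)

np.random.seed(212)
x_toy = np.random.uniform(0, 100, 20)
sigma = 15
y_toy = 2.1 * x_toy + 41 + np.random.normal(0, sigma, len(x_toy))

line = np.poly1d(np.polyfit(x_toy, y_toy, 1))

chi2_true_unc = reduced_chi2(y_toy, line(x_toy), sigma * np.ones_like(x_toy), 2)
chi2_small_unc = reduced_chi2(y_toy, line(x_toy), sigma / 3 * np.ones_like(x_toy), 2)

print("reduced chi2 with the true uncertainties:   {:.4f}".format(chi2_true_unc))
print("reduced chi2 with uncertainties of sigma/3: {:.4f}".format(chi2_small_unc))
```

In other words, a reduced $\chi^2$ far from 1 may say more about the quoted uncertainties than about the model.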

In [341]:
def noisy_plot():
    np.random.seed(212)

    x = np.random.uniform(0,10,18)
    sigma = 3
    y = 3*x + 23 + np.random.normal(0,sigma,len(x))

    fig, ax = plt.subplots()
    ax.plot(x, y, 'o')
    ax.set_xlabel('x', fontsize=14)
    ax.set_ylabel('y', fontsize=14)
    fig.tight_layout()

In [342]:
noisy_plot()


In [354]:
import emcee
import corner
from multiprocessing import Pool, cpu_count

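def lnlike(theta, x, y):
    # theta = [m, b, sigma_1, ..., sigma_N]
    m = theta[0]
    b = theta[1]
    sigmas = theta[2:]
    
    # Gaussian log-likelihood; the normalization is 1/sqrt(2*pi*sigma^2)
    lnl = np.sum(-0.5*np.log(2*np.pi*sigmas**2) - (y - m*x - b)**2/(2*sigmas**2))
    return lnl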

def nuissance_model():
    np.random.seed(212)
    x = np.random.uniform(0,10,18)
    sigma = 3
    y = 3*x + 23 + np.random.normal(0,sigma,len(x))

    ndim = 20
    nwalkers = 2500
    guess_0 = np.array([3,25]+[3]*18)
    nfac = [10**(-1)]*ndim

    pos = guess_0*[1 + nfac*np.random.randn(ndim) for i in range(nwalkers)]

    ncores = cpu_count()
    with Pool(ncores) as pool:


        sampler = emcee.EnsembleSampler(nwalkers, ndim, lnlike, args=[x, y],
                                        pool=pool, moves=emcee.moves.KDEMove())
        for sample in sampler.sample(pos, 
                                     iterations=100, 
                                     progress=False):
            if sampler.iteration == 10:
                print('Fitting a 20 parameter model with 18 data points...')
            elif sampler.iteration == 30:
                print('    Hopefully this does not destroy the universe...')
            elif sampler.iteration == 60:
                print('        We are approaching the singularity...')
            elif sampler.iteration == 90:
                print("            I wish my last day on Earth wasn't with this Adam bozo...")
            else:
                continue


    tau = np.mean(sampler.get_autocorr_time(tol=0)[0:2])
    samples = sampler.get_chain(discard=3*int(tau), thin=int(tau), flat=True)[:,0:2]

    _ = corner.corner(samples, 
                      labels = ['$m$', '$b$'], truths=[3, 23], 
                      show_titles=True, quantiles=[0.1, 0.5, 0.9])

In [388]:
def gen_mix_model():
    np.random.seed(212)
    
    x = np.random.uniform(0,100,25)
    y = np.empty_like(x)
    y_unc = np.empty_like(x)
    
    for i in range(25):
        rand = np.random.uniform()
        if rand > 0.2:
            sigma = np.random.uniform(5,10)
            y[i] = 0.8*x[i] + 13 + np.random.normal(0,sigma)
            y_unc[i] = sigma
        else:
            sigma = np.random.uniform(5,10)
            y[i] = 0.8*x[i] + 13 - np.random.normal(0,30)
            y_unc[i] = sigma
            
    fig, ax = plt.subplots()
    ax.errorbar(x, y, y_unc, fmt='o')
    ax.set_xlabel('x', fontsize=14)
    ax.set_ylabel('y', fontsize=14)
    fig.tight_layout()
    
    
    return

In [389]:
gen_mix_model()


In [364]:
np.random.seed(212)

x = np.random.uniform(0,100,20)
y = np.empty_like(x)
y_unc = np.empty_like(x)

for i in range(20):
    rand = np.random.uniform()
    if rand > 0.15:
        sigma = np.random.uniform(0,7)
        y[i] = 0.8*x[i] + 13 + np.random.normal(0,sigma)
        y_unc[i] = sigma
    else:
        sigma = np.random.uniform(7,20)
        y_unc[i] = sigma
        y[i] = 35 + np.random.normal(0,sigma)


In [365]:
sigma

0.9916346190917059