In [2]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib notebook

# Introduction to Statistics:
An Aperitif for DSFP Session 4
========

#### Version 0.1

***
By AA Miller 2017 Sep 14

An [Introduction to Statistics](https://github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions/blob/master/Session1/Day2/IntroStat.pdf) was covered during Session 1 of the DSFP. Typically, this initial lecture is used as a means of providing a brief overview of the Session 1 material, but that does not make sense in the context of Session 4, as half of you have not seen this lecture.

Instead, today we will focus on a relatively simple problem, while highlighting several challenges for the standard astronomical workflow, as a way of setting up the various lectures that will happen over the course of this week. 

A lot of the lessons in this lecture are inspired by the paper [Data Analysis Recipes: Fitting a Model to Data](https://arxiv.org/abs/1008.4686) by Hogg, Bovy, & Lang. [This paper has been mentioned previously in the DSFP, though today we will only be able to scratch the surface of its content.]

In some sense - the goal right now is to make you really nervous about the work that you've previously done. 

(Though this lecture should not be met with too much consternation: by the end of the week you will have a toolkit to deal with all the issues that we raise.)

## Problem 1) Data

At the core of everything we hope to accomplish with the DSFP stands a single common connection: data.

There are many things we (may) want to do with these data: reduce them, visualize them, model them, develop predictions from them, use them to infer fundamental properties of the universe to gain a unique understanding that no one else in the history of our planet has ever had (!).

Before we dive into that really fun stuff, we should start with some basics:

**Problem 1a**

What is data?

*Take a few min to discuss this with your partner*

**Solution 1a**

While we just discussed several different ideas about the nature of data, the main thing I want to emphasize is the following: data are *constants*. 

**Need some more text on this to elaborate**

**Problem 1b**

Below, I provide some data (in the form of `numpy` arrays). As good data scientists, what is the first thing you should do with these data?

Feel free to create a new cell if necessary.

In [3]:
y = np.array([203, 58, 210, 202, 198, 158, 
              165, 201, 157, 131, 166, 160, 
              186, 125, 218, 146])
x = np.array([495, 173, 479, 504, 510, 416, 
              393, 442, 317, 311, 400, 337, 
              423, 334, 533, 344])

**Solution 1b**

I was intentionally misleading with the previous question. 

The most important thing to do with *any* new data is to understand where the data came from and what they represent. While the data are constants, they represent measurements of some kind. Thus, I would argue the most important thing to do with these data is to understand where they came from (others may disagree). 

In the case of the arrays, the answer is that they are "toy" data that were generated for illustrative purposes in the Hogg, Bovy, & Lang paper discussed above. In that sense, there are no units or specific measurements that otherwise need to be understood. 

**Problem 1c**

[You may have already done this] Now that we understand the origin of the data, make a scatter plot showing their distribution.

In [12]:
plt.scatter(x,y)
plt.xlabel("x")
plt.ylabel("y")


## Problem 2) Fitting a Line to Data

There is a very good chance, though I am not specifically assuming anything, that upon making the previous plot you had a thought along the lines of "these points fall on a line" or "these data represent a linear relationship."  

**Problem 2a** 

Is the assumption of linearity valid for the above data?

Is it convenient?

**Solution 2a**

One of the primary lessons from this lecture is the following: *assumptions are dangerous*! In general, a linear relationship between data should only be assumed if there is a very strong theoretical motivation for such a relationship. Otherwise, the relationship could be just about anything, and inference based on an assumption of linearity may lead to dramatically incorrect conclusions (Friday's talk by Adam will cover Model Selection).

That being said, assuming the data represent (are drawn from) a linear relationship is often very convenient. There is a large host of tools designed to solve this very problem.

Let us proceed with convenience and assume the data represent a linear relationship. In that case, in order to make predictions for future observations, we need to fit a line to the data. 

The "standard" procedure for doing so is [least-squares fitting](https://en.wikipedia.org/wiki/Least_squares). In brief, least-squares minimizes the sum of the squared residuals between the data and the fitting function.

I've often joked that all you need to be a good data scientist is [google](https://www.google.com) and [stack overflow](https://stackoverflow.com). Via those two tools, we can quickly deduce that the easiest way to perform a linear least-squares fit to the above data is with [`np.polyfit`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html), which performs a least-squares polynomial fit to two `numpy` arrays.
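To make the quantity being minimized concrete, here is a minimal sketch. The helper `sum_squared_residuals` is an illustrative name of my own, not a library function, and the Problem 1b arrays are repeated so the cell stands alone. Any line other than the least-squares line should give a larger sum:

```python
import numpy as np

# same toy data as in Problem 1b, repeated so this cell is self-contained
y = np.array([203, 58, 210, 202, 198, 158, 165, 201,
              157, 131, 166, 160, 186, 125, 218, 146])
x = np.array([495, 173, 479, 504, 510, 416, 393, 442,
              317, 311, 400, 337, 423, 334, 533, 344])

def sum_squared_residuals(m, b, x, y):
    """Sum of squared residuals between y and the line m*x + b."""
    return np.sum((y - (m*x + b))**2)

m_ls, b_ls = np.polyfit(x, y, 1)   # unweighted least-squares slope and intercept

ssr_best = sum_squared_residuals(m_ls, b_ls, x, y)
ssr_steeper = sum_squared_residuals(1.1*m_ls, b_ls, x, y)   # perturb the slope
ssr_shifted = sum_squared_residuals(m_ls, b_ls + 10, x, y)  # perturb the intercept

print("least-squares line: {:.1f}".format(ssr_best))
print("10% steeper slope:  {:.1f}".format(ssr_steeper))
print("intercept + 10:     {:.1f}".format(ssr_shifted))
```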

**Problem 2b**

Use `np.polyfit()` to fit a line to the data. Overplot the best-fit line on the data.

In [13]:
p = np.polyfit(x, y, 1)
p_eval = np.poly1d(p)

plt.scatter(x,y)
plt.plot([0,600], p_eval([0,600]))
plt.xlabel("x")
plt.ylabel("y")


There is a very good chance, though I am not specifically assuming anything, that for the previous plots you plotted `x` along the abscissa and `y` along the ordinate. 

[Honestly, there's no one to blame if this is the case; this has essentially been drilled into all of us from the moment we started making plots. In fact, in `matplotlib` we cannot change the name of the abscissa label without adjusting the `xlabel`.]

This leads us to an important question, however. What if `y` does not depend on `x` and instead `x` depends on `y`? Does that in any way change the results for the fit?

**Problem 2c**

Perform a linear least-squares fit to `x` vs. `y` (or if you already fit this, then reverse the axes). As above, plot the data and the best-fit model.

To test if the relation is the same between the two fits, compare the predicted `y` value for both models corresponding to `x = 50`.

In [19]:
p_yx = np.polyfit(y, x, 1)
p_yx_eval = np.poly1d(p_yx)

plt.scatter(y,x)
plt.plot([0,250], p_yx_eval([0,250]))
plt.xlabel("y")
plt.ylabel("x")

print("For y vs. x, then x=50 would predict y={}".format(p_eval(50)))
print("For x vs. y, then x=50 would predict y={}".format((50 - p_yx[1])/p_yx[0]))

For y vs. x, then x=50 would predict y=24.80311423398867
For x vs. y, then x=50 would predict y=9.544353223417994


So we have now uncovered one of the peculiarities of least-squares. Fitting `y` vs. `x` is *not* the same as fitting `x` vs. `y`.
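One way to see why the two fits differ: for unweighted least squares, the slope of `y` vs. `x` is $\mathrm{cov}(x, y)/\mathrm{var}(x)$, while the slope of `x` vs. `y` is $\mathrm{cov}(x, y)/\mathrm{var}(y)$. Their product is therefore the squared correlation coefficient $r^2$, and the two lines coincide only if the data are perfectly correlated. A short check, with the Problem 1b arrays repeated so the cell stands alone (the variable names are my own):

```python
import numpy as np

# same toy data as in Problem 1b
y = np.array([203, 58, 210, 202, 198, 158, 165, 201,
              157, 131, 166, 160, 186, 125, 218, 146])
x = np.array([495, 173, 479, 504, 510, 416, 393, 442,
              317, 311, 400, 337, 423, 334, 533, 344])

m_yx = np.polyfit(x, y, 1)[0]   # slope of y as a function of x
m_xy = np.polyfit(y, x, 1)[0]   # slope of x as a function of y
r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient

# if both fits described the same line, this product would be exactly 1
print("m_yx * m_xy = {:.4f}".format(m_yx*m_xy))
print("r^2         = {:.4f}".format(r**2))
```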

There are several assumptions that go into standard least-squares fitting:

1. There is one dimension along which the data have negligible uncertainties
2. Along the other dimension **all** of the uncertainties can be described via Gaussians of known variance

These two conditions are *rarely* met for astronomical data. While condition 1 can be satisfied (e.g., time series data where there is essentially no uncertainty on the time of the observations), I contend that condition 2 is rarely, if ever, satisfied.

Speaking of uncertainties(1), we have not utilized any thus far. [I hope this has raised some warning bells.]

We will now re-organize our data to match what is originally in Hogg, Bovy, & Lang (previously `x` and `y` were swapped).

(1) There is an amazing footnote in Hogg, Bovy, & Lang about "errors" vs. "uncertainties" - I suggest everyone read this.

In [20]:
x = np.array([203, 58, 210, 202, 198, 158, 
              165, 201, 157, 131, 166, 160, 
              186, 125, 218, 146])
y = np.array([495, 173, 479, 504, 510, 416, 
              393, 442, 317, 311, 400, 337, 
              423, 334, 533, 344])
sigma_y = np.array([21, 15, 27, 14, 30, 16, 
                    14, 25, 52, 16, 34, 31, 
                    42, 26, 16, 22])

plt.errorbar(x, y, sigma_y, fmt = "o")


We are now assuming that `x` has negligible uncertainties and that `y` has uncertainties that can be perfectly described by Gaussians of known variance.

A portion of the appeal of least-squares is that it provides a deterministic method for determining the best fit. To understand that, we now need to do a little linear algebra.

We can arrange the data in the following matrices:

$$ \mathbf{Y} = \left[ {\begin{array}{c}
            y_1 \\
            y_2 \\
            \vdots \\
            y_N
            \end{array}
           }
            \right] , $$

$$ \mathbf{A} = \left[ {\begin{array}{cc}
            1 & x_1 \\
            1 & x_2 \\
            \vdots & \vdots \\
            1 & x_N
            \end{array}
           }
           \right] ,
           $$
           
$$ \mathbf{C} = \left[ {\begin{array}{cccc}
            \sigma_{y_1}^2 & 0 & \dots & 0 \\
            0 & \sigma_{y_2}^2 & \dots & 0 \\
            \vdots & \vdots & \ddots & \vdots \\
            0 & 0 & \dots & \sigma_{y_N}^2 \\
            \end{array}
           }
           \right] ,
           $$
           
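where $\mathbf{Y}$ is the vector of measurements, $\mathbf{A}$ is the design matrix, and $\mathbf{C}$ is the covariance matrix (diagonal here, because the uncertainties on the individual measurements are treated as independent). 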

Ultimately, we need to solve the equation

$$\mathbf{Y} = \mathbf{A}\mathbf{X}.$$

I am skipping the derivation, but the solution to this equation is:

$$ \left[ {\begin{array}{c}
            b \\
            m \\
            \end{array}
           }
            \right] = \mathbf{X} = \left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{Y}\right].$$



As noted in Hogg, Bovy, & Lang, this procedure minimizes the $\chi^2$ function, which is the total squared error, after appropriately scaling by the uncertainties:

$$ \chi^2 = \sum_{i = 1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_{y_i}^2} = \left[ \mathbf{Y}  - \mathbf{A}\mathbf{X}\right]^{T} \mathbf{C}^{-1} \left[ \mathbf{Y} - \mathbf{A} \mathbf{X}\right].$$
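
For completeness, the skipped derivation can be sketched in one line, assuming $\mathbf{C}$ is symmetric and invertible. Setting the gradient of $\chi^2$ with respect to $\mathbf{X}$ to zero gives

$$ \frac{\partial \chi^2}{\partial \mathbf{X}} = -2\, \mathbf{A}^T \mathbf{C}^{-1} \left[ \mathbf{Y} - \mathbf{A}\mathbf{X} \right] = 0 \quad \Rightarrow \quad \mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}\, \mathbf{X} = \mathbf{A}^T \mathbf{C}^{-1} \mathbf{Y}, $$

and multiplying both sides by $\left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}\right]^{-1}$ recovers the expression for $\mathbf{X}$ given above.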

**Problem 2d** 

Using the linear algebra equations above (i.e. avoid `np.polyfit` or any other similar functions), determine the weighted least-squares best-fit values for $b$ and $m$, the intercept and slope, respectively.

Plot the results of the best-fit line. How does this compare to the above estimates?

In [45]:
Y = y.reshape(-1,1)
A = np.vstack((np.ones_like(x), x)).T
C = np.diag(sigma_y**2)

X = np.linalg.inv(A.transpose()@np.linalg.inv(C)@A) @ (A.transpose()@np.linalg.inv(C)@Y)

best_fit = np.poly1d(X[::-1,0])

plt.errorbar(x, y, sigma_y, fmt = "o")
plt.plot([0,300], best_fit([0,300]))

print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format(X[1][0], X[0][0]))

The best-fit value for the slope and intercept are: 2.2399 and 34.0477
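
A variant of the same calculation, offered as a sketch rather than the required solution: because $\mathbf{C}$ is diagonal, $\mathbf{C}^{-1}$ is simply a diagonal matrix of $1/\sigma_{y_i}^2$, and `np.linalg.solve` can replace the explicit inverse of $\mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}$, which is generally the more numerically stable choice. Under the assumptions stated above (negligible uncertainties in `x`, Gaussian uncertainties of known variance in `y`), $\left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}\right]^{-1}$ is also the covariance matrix of the best-fit parameters, so the square roots of its diagonal give the standard uncertainties on $b$ and $m$. The variable names (`ATCinvA`, `param_cov`, and so on) are my own.

```python
import numpy as np

# data from Problem 2, repeated so this cell is self-contained
x = np.array([203, 58, 210, 202, 198, 158, 165, 201,
              157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([495, 173, 479, 504, 510, 416, 393, 442,
              317, 311, 400, 337, 423, 334, 533, 344])
sigma_y = np.array([21, 15, 27, 14, 30, 16, 14, 25,
                    52, 16, 34, 31, 42, 26, 16, 22])

A = np.vstack((np.ones_like(x), x)).T      # design matrix with columns [1, x]
Cinv = np.diag(1.0/sigma_y**2)             # inverse of the diagonal covariance matrix

ATCinvA = A.T @ Cinv @ A
ATCinvY = A.T @ Cinv @ y

# solve ATCinvA @ X = ATCinvY for X = [b, m] without forming an explicit inverse
b, m = np.linalg.solve(ATCinvA, ATCinvY)

# covariance matrix of [b, m], valid only under the assumptions stated above
param_cov = np.linalg.inv(ATCinvA)
sigma_b, sigma_m = np.sqrt(np.diag(param_cov))

print("m = {:.4f} +/- {:.4f}".format(m, sigma_m))
print("b = {:.4f} +/- {:.4f}".format(b, sigma_b))
```

These uncertainties inherit every assumption of the fit: if the `sigma_y` values are wrong or the scatter is not Gaussian, they are not trustworthy either.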


**Problem 2e**

Confirm the results of this fit are the same as those from `np.polyfit`.

*Hint - be sure to include the uncertainties.*

In [49]:
p = np.polyfit(x, y, 1, w = 1/sigma_y)
print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format(p[0], p[1]))

The best-fit value for the slope and intercept are: 2.2399 and 34.0477
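
The hint hides a subtlety worth spelling out: `np.polyfit` multiplies each residual by its weight *before* squaring, so the weights that reproduce the $\chi^2$ defined above are `1/sigma_y`, not `1/sigma_y**2`. The sketch below (data repeated so the cell stands alone; `p_right` and `p_wrong` are illustrative names) compares the two choices. Only the first should match the matrix solution from Problem 2d.

```python
import numpy as np

# data from Problem 2
x = np.array([203, 58, 210, 202, 198, 158, 165, 201,
              157, 131, 166, 160, 186, 125, 218, 146])
y = np.array([495, 173, 479, 504, 510, 416, 393, 442,
              317, 311, 400, 337, 423, 334, 533, 344])
sigma_y = np.array([21, 15, 27, 14, 30, 16, 14, 25,
                    52, 16, 34, 31, 42, 26, 16, 22])

# minimizes sum(((y - f(x))/sigma_y)**2), i.e. the chi^2 defined above
p_right = np.polyfit(x, y, 1, w=1/sigma_y)

# minimizes sum(((y - f(x))/sigma_y**2)**2), which is a different objective
p_wrong = np.polyfit(x, y, 1, w=1/sigma_y**2)

print("w = 1/sigma_y:    slope = {:.4f}, intercept = {:.4f}".format(*p_right))
print("w = 1/sigma_y**2: slope = {:.4f}, intercept = {:.4f}".format(*p_wrong))
```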


## Problem 3) Are the Uncertainties Actually Gaussian?

Previously we noted that there are two essential assumptions that are required for least-squares fitting to be correct. We are now going to examine the latter requirement, namely, that the uncertainties can be perfectly described as Gaussians with known variance.

Earlier I stated this assumption is rarely satisfied. Why might this be the case? 

In my experience (meaning this is hardly universal), if it's astro, it's got systematics. While I cannot prove this, I contend that systematic uncertainties are rarely Gaussian. If you are lucky enough to be in a regime where you can be confident that the systematics are Gaussian, I further contend that it is extremely difficult to be certain that the variance of that Gaussian is known.

Then there's another (astro-specific) challenge: in many circumstances, we aren't actually working with data, but rather with the results of other models applied to the data.

Let's take an optical astronomy (biased, but this is LSST after all) example. What are the data? In many cases inference is based on measurements of brightness, but the true data in this case are simply a bunch of electron counts in a CCD. The brightness (or mag) comes from a model (e.g., PSF, aperture, Kron) that is applied to the data. Thus, to assume that a flux (or mag) measurement has Gaussian uncertainties with known variance is to assume that whatever flux-measurement model has been applied always produces perfectly Gaussian uncertainties (and a lot of different assumptions go into flux-measurement models...).

As a demonstration associated with the challenges of these assumptions, we will examine the data set presented in Hogg, Bovy, & Lang. 

In [50]:
x = np.array([201, 201, 287, 166,  58, 157, 146, 218, 203, 186, 160,  47, 210,
       131, 202, 125, 158, 198, 165, 244])
y = np.array([592, 442, 402, 400, 173, 317, 344, 533, 495, 423, 337, 583, 479,
       311, 504, 334, 416, 510, 393, 401])
sigma_y = np.array([61, 25, 15, 34, 15, 52, 22, 16, 21, 42, 31, 38, 27, 16, 14, 26, 16,
       30, 14, 25])

**Problem 3a**

Using the least-squares methodology developed in Problem 2, determine the best-fit slope and intercept for a line fit to the data above. 

Make a scatter plot of the data, and overplot the best-fit line. What, if anything, do you notice about the data and the fit?

In [64]:
Y = y.reshape(-1,1)
A = np.vstack((np.ones_like(x), x)).T
C = np.diag(sigma_y**2)

X = np.linalg.inv(A.transpose()@np.linalg.inv(C)@A) @ (A.transpose()@np.linalg.inv(C)@Y)

best_fit = np.poly1d(X[::-1,0])

plt.errorbar(x, y, sigma_y, fmt = "o")
plt.plot([0,300], best_fit([0,300]))

print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format(X[1][0], X[0][0]))

The best-fit value for the slope and intercept are: 1.0767 and 213.2735


Unlike the data in Problems 1 and 2, there appear to be some significant outliers (of course - this appearance of outliers is entirely dependent upon the assumption of linearity, there may actually be no outliers and a complex relation between `x` and `y`). As such, it does not appear (to me) as though the best-fit line provides a good model for the data.
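
One way to make this impression slightly more quantitative, still under the assumption of linearity, is to look at the residuals in units of the quoted uncertainties, i.e. the individual terms of the $\chi^2$ sum from Problem 2. This is a sketch rather than a formal outlier test; the variable names are illustrative, and `np.polyfit` with `w = 1/sigma_y` stands in for the matrix solution.

```python
import numpy as np

# data from Problem 3, repeated so this cell is self-contained
x = np.array([201, 201, 287, 166,  58, 157, 146, 218, 203, 186,
              160,  47, 210, 131, 202, 125, 158, 198, 165, 244])
y = np.array([592, 442, 402, 400, 173, 317, 344, 533, 495, 423,
              337, 583, 479, 311, 504, 334, 416, 510, 393, 401])
sigma_y = np.array([61, 25, 15, 34, 15, 52, 22, 16, 21, 42,
                    31, 38, 27, 16, 14, 26, 16, 30, 14, 25])

m, b = np.polyfit(x, y, 1, w=1/sigma_y)   # weighted least-squares line

z = (y - (m*x + b))/sigma_y               # residuals in units of sigma_y
chi2 = np.sum(z**2)
dof = len(x) - 2                          # N data points minus 2 fitted parameters

print("chi^2 = {:.1f} for {} degrees of freedom".format(chi2, dof))
for i in np.argsort(-np.abs(z))[:4]:      # the four largest |z|
    print("x = {:3d}, y = {:3d}: {:+.1f} sigma".format(x[i], y[i], z[i]))
```

If both the line and the uncertainties were correct, $\chi^2$ would be expected to be comparable to the number of degrees of freedom; a value far larger than that says the model, the uncertainties, or both are wrong.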

**Problem 3b**

Perform a least-squares 2nd order polynomial fit to the data. Overplot the best-fit curve.

How does this compare to the linear model fit?

In [66]:
Y = y.reshape(-1,1)
A = np.vstack((np.ones_like(x), x, x**2)).T
C = np.diag(sigma_y**2)

X = np.linalg.inv(A.transpose()@np.linalg.inv(C)@A) @ (A.transpose()@np.linalg.inv(C)@Y)

best_fit = np.poly1d(X[::-1,0])

plt.errorbar(x, y, sigma_y, fmt = "o")
plt.plot(np.linspace(0,300,300), best_fit(np.linspace(0,300,300)))


By eye (a metric that is hardly meaningful, but nevertheless worth developing because talks never provide all of the details), the quadratic fit appears "better" than the linear fit.

But, there are still "outliers" and in the realm of polynomial fitting, it is always possible to get a better fit by adding more degrees to the polynomial. Should we keep going here, or should we stop? (Again - we will discuss model selection on Friday)

[As a reminder - in machine learning we'd call this low training error, but the generalization error is likely huge]
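
To make that concrete, the sketch below fits polynomials of increasing degree to the Problem 3 data and reports $\chi^2$ for each. Because each polynomial contains the lower-degree ones as special cases, $\chi^2$ can only decrease (or stay the same) as the degree grows, which is why it cannot, by itself, tell us when to stop. Rescaling `x` before fitting is my own choice to keep the higher-degree fits well conditioned; it does not change the family of curves being fit.

```python
import numpy as np

# data from Problem 3, repeated so this cell is self-contained
x = np.array([201, 201, 287, 166,  58, 157, 146, 218, 203, 186,
              160,  47, 210, 131, 202, 125, 158, 198, 165, 244])
y = np.array([592, 442, 402, 400, 173, 317, 344, 533, 495, 423,
              337, 583, 479, 311, 504, 334, 416, 510, 393, 401])
sigma_y = np.array([61, 25, 15, 34, 15, 52, 22, 16, 21, 42,
                    31, 38, 27, 16, 14, 26, 16, 30, 14, 25])

x_scaled = (x - x.mean())/x.std()   # affine rescaling; same polynomial family

chi2 = []
for degree in range(1, 6):
    p = np.polyfit(x_scaled, y, degree, w=1/sigma_y)   # weighted least squares
    resid = (y - np.polyval(p, x_scaled))/sigma_y      # residuals in units of sigma_y
    chi2.append(np.sum(resid**2))
    print("degree {}: chi^2 = {:.1f}".format(degree, chi2[-1]))
```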