# Feigelson Chapter 7

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In this pre-class assignment, we're going to examine data from the SDSS DR5 quasar catalog ([Schneider et al. 2007](http://adsabs.harvard.edu/abs/2007AJ....134..102S)).  This dataset has 77,429 quasars and quite a bit of information on each one, including magnitudes in 5 bands as well as a cosmological redshift.  Unlike the last assignment, we're going to use magnitudes in various bands rather than the redshift!

In [None]:
# reads in i and z band magnitudes and errors
QSO_i_mag, QSO_i_errors, QSO_z_mag, QSO_z_errors = np.loadtxt("SDSS_QSO.dat",skiprows=1,usecols=[8,9,10,11],unpack=True)

# remove what appears to be bad data from the sample (i-band magnitude below 16)
good_data = QSO_i_mag >= 16.0   # boolean mask, reused for all four arrays
QSO_i_reduced = QSO_i_mag[good_data]
QSO_z_reduced = QSO_z_mag[good_data]
QSO_i_errors_reduced = QSO_i_errors[good_data]
QSO_z_errors_reduced = QSO_z_errors[good_data]


First, plot all of the data (both the raw data and, separately, the reduced data) as a scatter plot to see what it looks like.

Now, let's plot some of the error bars to get a sense of the errors in the data.  There are far too many data points (about 77,000 in the reduced dataset) to plot the errors for every quasar and make sense of them.  So, let's pick a subset of the points - say 0.1% of them - and plot the errors of those using the pyplot [errorbar](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.errorbar) function.

Hint: you can do this without loops by using [numpy.random.random](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.random.html#numpy.random.random) to generate a boolean array, as follows:

```
bool_array = np.random.random(QSO_i_reduced.size) < 0.001
```

And use that to subselect data in the way that I did in the previous cell.
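As a rough illustration of the hint, the sketch below selects about 0.1% of the points with a boolean array and passes the subset to `errorbar`. The arrays here are synthetic stand-ins (an assumption, so the cell runs without the data file), and the marker styling is arbitrary; replace them with the reduced QSO arrays in your own solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in arrays (an assumption; replace with the reduced QSO arrays).
rng = np.random.default_rng(42)
n_points = 50000
i_mag = rng.uniform(16.0, 21.0, size=n_points)
z_mag = i_mag + rng.normal(0.0, 0.1, size=n_points)
i_err = rng.uniform(0.01, 0.1, size=n_points)
z_err = rng.uniform(0.01, 0.1, size=n_points)

# Keep each point with probability 0.001, so roughly 0.1% of them survive.
bool_array = rng.random(n_points) < 0.001

# Plot only the selected points, with error bars in both directions.
fig, ax = plt.subplots()
ax.errorbar(i_mag[bool_array], z_mag[bool_array],
            xerr=i_err[bool_array], yerr=z_err[bool_array],
            fmt="o", markersize=3, linewidth=0.8)
ax.set_xlabel("i magnitude")
ax.set_ylabel("z magnitude")
```

Because the selection is random, the number of points plotted changes from run to run unless the random seed is fixed.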

Why do you think you see the pattern that you see in the errors?

**put your answer here!**

Now, we're going to implement the Ordinary Least Squares method to find the best-fit line to the data (equation 7.6 in Feigelson).  Do so without using loops - recall that numpy arrays have built-in methods to get mean values, sums, etc., and you can even use them when you multiply arrays together.  So, for example, for an array ```a``` that contains a bunch of values, you can get the mean value, subtract it from the array ```a```, and then sum the square of the resulting array in this way:

```
amean = a.mean()
((a-amean)**2).sum()
```

Which is equivalent to:

$\bar{a} = \frac{1}{N}\sum_{i=1}^{N} a_i$

$\sum_{i=1}^{N} (a_i - \bar{a})^2$

After you've implemented that method, ensure that your values for the slope and intercept give you reasonable values by plotting them!
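One possible loop-free layout is sketched below. It assumes the standard OLS estimators (slope $= S_{XY}/S_{XX}$, intercept $= \bar{Y} - \mathrm{slope}\,\bar{X}$), which you should check against equation 7.6 yourself. The function name `ols_fit` and the synthetic data are hypothetical, chosen only so the example runs on its own.

```python
import numpy as np

# Synthetic stand-in data (an assumption; swap in QSO_i_reduced and QSO_z_reduced).
rng = np.random.default_rng(0)
x = rng.uniform(16.0, 21.0, size=1000)
y = 0.9 * x + 1.5 + rng.normal(0.0, 0.1, size=x.size)

def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    xbar, ybar = x.mean(), y.mean()
    s_xx = ((x - xbar) ** 2).sum()          # sum of squared deviations in x
    s_xy = ((x - xbar) * (y - ybar)).sum()  # sum of cross deviations
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar
    return slope, intercept

slope, intercept = ols_fit(x, y)
print(slope, intercept)
```

The fitted line can then be overplotted on the scatter plot as `slope * x + intercept` to check that it follows the data.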

Now, use the scipy [linear regression method](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html) on the same data.  Do you get the same slope and intercept?
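For reference, a minimal call might look like the sketch below. The data are synthetic stand-ins (an assumption), and the result is unpacked as a five-element tuple, which should work across scipy versions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption; swap in QSO_i_reduced and QSO_z_reduced).
rng = np.random.default_rng(0)
x = rng.uniform(16.0, 21.0, size=1000)
y = 0.9 * x + 1.5 + rng.normal(0.0, 0.1, size=x.size)

# linregress returns the slope, intercept, correlation coefficient,
# p-value, and standard error of the slope.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(slope, intercept)
```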

Now, we're going to do a weighted least-squares regression, as described in Section 7.4 of Feigelson.  How do the resulting values for the slope and intercept differ?  Put them both on the same plot along with your data and comment on the difference.

**IMPORTANT NOTE:** The expressions for $\bar{X}_{wt}$ and $\bar{Y}_{wt}$ are incorrect in Feigelson equation 7.37.  The expression for $\bar{X}_{wt}$ should be:

$\bar{X}_{wt} = \frac{\sum_{i=1}^{N}\frac{X_i}{\sigma^2_{Y,i}} }{ \sum_{i=1}^{N}\frac{1}{\sigma^2_{Y,i}} }$

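And the expression for $\bar{Y}_{wt}$ is similarly missing a term in the denominator.  By analogy, it should be:

$\bar{Y}_{wt} = \frac{\sum_{i=1}^{N}\frac{Y_i}{\sigma^2_{Y,i}} }{ \sum_{i=1}^{N}\frac{1}{\sigma^2_{Y,i}} }$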
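A loop-free sketch of the weighted fit follows. It assumes weights $w_i = 1/\sigma^2_{Y,i}$ (errors in $Y$ only) and the usual weighted estimators built from the weighted means above; verify these against Section 7.4 before relying on them. The name `weighted_fit` and the synthetic data are hypothetical.

```python
import numpy as np

# Synthetic stand-in data with a different error on every point
# (an assumption; swap in the reduced QSO magnitudes and z-band errors).
rng = np.random.default_rng(1)
x = rng.uniform(16.0, 21.0, size=1000)
sigma_y = rng.uniform(0.02, 0.2, size=x.size)
y = 0.9 * x + 1.5 + rng.normal(0.0, sigma_y)

def weighted_fit(x, y, sigma_y):
    """Return (slope, intercept) of a weighted least-squares line, weights 1/sigma_y**2."""
    w = 1.0 / sigma_y**2
    xbar_wt = (w * x).sum() / w.sum()   # weighted mean of x
    ybar_wt = (w * y).sum() / w.sum()   # weighted mean of y
    slope = (w * (x - xbar_wt) * (y - ybar_wt)).sum() / (w * (x - xbar_wt) ** 2).sum()
    intercept = ybar_wt - slope * xbar_wt
    return slope, intercept

slope_wt, intercept_wt = weighted_fit(x, y, sigma_y)
print(slope_wt, intercept_wt)
```

Points with small $\sigma_{Y,i}$ dominate both the weighted means and the slope, so the weighted line can differ noticeably from the unweighted one when the errors vary systematically with magnitude.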

Comment on the difference here!