In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Section 1 (Chapter 5)

In this section, we're going to compare the K-band magnitudes of globular clusters in the Milky Way and Andromeda galaxies.  These data are from [Nantais et al. 2006](http://adsabs.harvard.edu/abs/2006AJ....131.1416N).  We are going to examine the data in several different ways.  Note that this is based on the exercises at the end of Chapter 5 of Feigelson (although we're going to use Python and [scipy](https://www.scipy.org/) rather than the [R language](https://www.r-project.org/)).

In [None]:
# Andromeda (M31) globular cluster K-band magnitudes
M31_magnitudes = np.loadtxt("GlobClus_M31.dat", skiprows=1, usecols=[1], unpack=True)

# Milky Way globular cluster K-band magnitudes
MW_magnitudes = np.loadtxt("GlobClus_MWG.dat", skiprows=1, usecols=[1], unpack=True)

First, in the cell below use Pyplot's [hist](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) function to plot both the M31 and Milky Way globular cluster distributions on the same histogram (you can do this by repeated calls to *hist()*).  Choose the bin size for both so that each cluster system is represented by a reasonable number of bins.
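As a rough illustration of repeated *hist()* calls on one set of axes, here is a minimal sketch.  It uses made-up Gaussian samples rather than the real catalogs, and the names `sample_a` and `sample_b`, the sample sizes, and the bin count are all placeholders chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two samples (NOT the real catalog values).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=-10.0, scale=1.5, size=80)   # absolute-magnitude-like
sample_b = rng.normal(loc=14.5, scale=1.2, size=360)   # apparent-magnitude-like

fig, ax = plt.subplots()
# Each call adds one distribution to the same axes; alpha keeps overlaps visible.
counts_a, edges_a, _ = ax.hist(sample_a, bins=15, alpha=0.6, label="sample A")
counts_b, edges_b, _ = ax.hist(sample_b, bins=15, alpha=0.6, label="sample B")
ax.set_xlabel("K-band magnitude")
ax.set_ylabel("Number of clusters")
ax.legend()
```

Passing an integer to `bins` gives each sample its own set of equal-width bins spanning its own range; passing an array of bin edges instead would force both samples onto a common grid.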

It's clear that the magnitudes are very different; this is because the Milky Way globular clusters are stored as absolute magnitudes, whereas the Andromeda GCs are stored as apparent magnitudes.  So, the next thing to do is estimate the [distance modulus](https://en.wikipedia.org/wiki/Distance_modulus) so that we can compare the two populations directly.  We will do this by (1) assuming that the two globular cluster systems have identical absolute magnitudes, and then (2) calculating the difference between their central values to get the distance modulus.  Estimate the distance modulus using the mean magnitude of each system of globular clusters, the median value of each, and the trimean (TM) of each (see equations 5.9 and 5.11 of Feigelson).  By how much do these values differ?
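The three estimators can be sketched as follows.  This assumes the trimean takes the usual Tukey form, (Q1 + 2 x median + Q3) / 4; check that against equation 5.11 in the text before relying on it.  The `trimean` helper and the synthetic `apparent` and `absolute` arrays are hypothetical names introduced only for this example.

```python
import numpy as np

def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4 (assumed definition)."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q1 + 2.0 * q2 + q3) / 4.0

# Synthetic stand-ins (NOT the real catalog values).
rng = np.random.default_rng(1)
apparent = rng.normal(loc=14.5, scale=1.2, size=360)
absolute = rng.normal(loc=-10.0, scale=1.5, size=80)

# Distance modulus = apparent magnitude - absolute magnitude,
# evaluated with three different estimators of the central value.
estimates = {
    "mean": np.mean(apparent) - np.mean(absolute),
    "median": np.median(apparent) - np.median(absolute),
    "trimean": trimean(apparent) - trimean(absolute),
}
for name, value in estimates.items():
    print(f"{name:>8s}: {value:.3f}")
```

The median and trimean are less sensitive to outliers than the mean, so the spread among the three numbers gives a feel for how much a skewed or heavy tail is pulling on the estimate.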

**Put your answer here!**

Now, create a new array for the M31 globular clusters that have been corrected using this distance modulus, and make two separate histograms of the MW and M31 clusters together - the first using the standard histogram behavior, and the second using a **normalized, cumulative histogram** (look at the [documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) for hist to see how to do this; the normalization option is called `density` in current versions of matplotlib and `normed` in older ones).  Do the two distributions look like they might be the same?
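A minimal sketch of the two plots, again on synthetic stand-in data.  The variable names, the choice of the median as the central value, and the 30 shared bins are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (NOT the real catalog values).
rng = np.random.default_rng(2)
mw_like = rng.normal(loc=-10.0, scale=1.5, size=80)
m31_apparent_like = rng.normal(loc=14.5, scale=1.2, size=360)

# Shift the apparent magnitudes onto the absolute scale (median-based here).
distance_modulus = np.median(m31_apparent_like) - np.median(mw_like)
m31_corrected = m31_apparent_like - distance_modulus

# Shared bin edges so the two samples are directly comparable.
combined = np.concatenate([mw_like, m31_corrected])
shared_bins = np.linspace(combined.min(), combined.max(), 31)

fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(mw_like, bins=shared_bins, alpha=0.6, label="MW-like")
ax_hist.hist(m31_corrected, bins=shared_bins, alpha=0.6, label="M31-like, corrected")
ax_hist.legend()

# density=True normalizes each histogram; cumulative=True accumulates it,
# so each curve rises from 0 to 1 like an empirical CDF.
cdf_mw, _, _ = ax_cdf.hist(mw_like, bins=shared_bins, density=True,
                           cumulative=True, histtype="step", label="MW-like")
cdf_m31, _, _ = ax_cdf.hist(m31_corrected, bins=shared_bins, density=True,
                            cumulative=True, histtype="step",
                            label="M31-like, corrected")
ax_cdf.legend(loc="upper left")
```

Normalizing matters here because the two samples have very different sizes; without it the larger sample dominates the plot regardless of shape.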

Next, we're going to use the [scipy Statistical functions library (scipy.stats)](https://docs.scipy.org/doc/scipy-0.15.1/reference/stats.html) to compare the two distributions more quantitatively.  First, use [stats.probplot](https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.probplot.html) to generate quantiles for a probability plot, and use pyplot to show both of the distributions on the same plot (hint: look at the examples at the bottom of that last link).
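One way to overlay two probability plots is sketched below, using synthetic samples with hypothetical names.  With its default settings, `stats.probplot` compares against a normal distribution and returns the theoretical quantiles, the ordered data values, and a least-squares line fit.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-ins (NOT the real catalog values).
rng = np.random.default_rng(3)
sample_a = rng.normal(loc=-10.0, scale=1.5, size=80)
sample_b = rng.normal(loc=-10.0, scale=1.2, size=360)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
(osm_a, osr_a), (slope_a, intercept_a, r_a) = stats.probplot(sample_a, dist="norm")
(osm_b, osr_b), (slope_b, intercept_b, r_b) = stats.probplot(sample_b, dist="norm")

fig, ax = plt.subplots()
ax.plot(osm_a, osr_a, "o", ms=4, label=f"sample A (r = {r_a:.3f})")
ax.plot(osm_b, osr_b, "s", ms=3, label=f"sample B (r = {r_b:.3f})")
ax.set_xlabel("Theoretical normal quantiles")
ax.set_ylabel("Ordered magnitudes")
ax.legend()
```

Points that fall along a straight line indicate a roughly normal sample; the slope of that line tracks the sample's spread and the intercept its center, so two samples from the same distribution should trace nearly the same line.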

Finally, we'll do some **hypothesis testing** (as in Feigelson Section 5.4) to see how likely it is that the magnitudes in these two globular cluster systems come from the same underlying distribution.  Since we don't know what the underlying distribution actually *is*, we have to use tests that don't require this.  One way to compare two distributions is the [two-sample Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test), which has a [scipy.stats function](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html).  A second way is the [Mann-Whitney-Wilcoxon test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) (also known as the Wilcoxon rank-sum test), which also has a [scipy function](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.stats.ranksums.html).  Try both of these tests on the two datasets.  Do you get comparable answers?  (Note: be careful in interpreting the P-values - their meanings need to be considered carefully!)
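The calling pattern for both tests is sketched below on synthetic samples: one pair drawn from the same distribution and one pair deliberately offset.  The sample names, sizes, and the size of the offset are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (NOT the real catalog values).
rng = np.random.default_rng(4)
reference = rng.normal(loc=-10.0, scale=1.5, size=200)
same_parent = rng.normal(loc=-10.0, scale=1.5, size=300)   # same distribution
shifted = rng.normal(loc=-8.5, scale=1.5, size=300)        # offset by 1.5 mag

results = {}
for label, other in [("same parent", same_parent), ("shifted", shifted)]:
    ks_stat, ks_p = stats.ks_2samp(reference, other)   # two-sample KS test
    rs_stat, rs_p = stats.ranksums(reference, other)   # Wilcoxon rank-sum test
    results[label] = {"ks_p": ks_p, "ranksum_p": rs_p}
    print(f"{label:>12s}: KS D = {ks_stat:.3f}, p = {ks_p:.3g}; "
          f"rank-sum z = {rs_stat:.2f}, p = {rs_p:.3g}")
```

In both cases the p-value is the probability of seeing a test statistic at least this extreme if the two samples really did share a parent distribution; it is not the probability that they do.  The KS test responds to any difference between the two cumulative distributions, while the rank-sum test is mainly sensitive to a shift in location, so the two need not agree.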

# Section 2 (Chapter 6)

In this section, we're going to examine data from the SDSS DR5 quasar catalog ([Schneider et al. 2007](http://adsabs.harvard.edu/abs/2007AJ....134..102S)).  This dataset has 77,429 quasars and quite a bit of information on each one, including magnitudes in 5 bands as well as a cosmological redshift.  In this particular assignment, we are ignoring everything except for the redshift of each quasar.

In [None]:
# Read in only the redshift column of the quasar catalog
SDSS_QSO_redshifts = np.loadtxt("SDSS_QSO.dat", skiprows=1, usecols=[1], unpack=True)

First, we're going to plot several histograms of the data, varying the number of bins from a very small number (say 5) to a very large number (say 1000), choosing a small handful of intermediate values.  Do all of these on the same plot.  Approximately how many bins seem to strike a happy medium between capturing the essential features of the data and having too much noise (i.e., too few values per bin)?  Note: you should experiment with the 'density' ('normed' in older versions of matplotlib), 'alpha', and 'histtype' options in pyplot's hist() function to get histograms that are directly comparable to each other and that are easy to read.  **Do both** a standard histogram and, separately, a cumulative histogram - does the number of bins affect the two in the same way?
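A possible starting point for the comparison is sketched below.  The skewed synthetic sample stands in for the redshifts and is not meant to resemble the real distribution; the particular bin counts and styling are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, positively skewed stand-in for the redshifts (NOT the real data).
rng = np.random.default_rng(5)
z_like = rng.gamma(shape=3.0, scale=0.5, size=5000)

fig, ax = plt.subplots()
histograms = {}
for n_bins in (5, 20, 100, 1000):
    # density=True puts every histogram on the same vertical scale;
    # histtype="step" draws outlines so the curves can be overlaid.
    counts, edges, _ = ax.hist(z_like, bins=n_bins, density=True,
                               histtype="step", alpha=0.8,
                               label=f"{n_bins} bins")
    histograms[n_bins] = (counts, edges)
ax.set_xlabel("Redshift")
ax.set_ylabel("Probability density")
ax.legend()
```

Adding `cumulative=True` to the same call produces the cumulative version; it is worth checking whether the bin-to-bin noise that appears at large bin counts survives the accumulation.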

**In this cell,** put your observations about the effects of bin size.  Is the choice of uniform-width bins sensible here?

Next, use the scipy [stats.gaussian_kde](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde) class to calculate a smoothed version of this histogram, using a kernel density estimate (KDE) of the probability density function (PDF).  Try all of the smoothing methods that are available by changing the bw_method argument of gaussian_kde, and make sure you look at the example code at the bottom of the page!  Note that setting bw_method to a scalar constant (rather than 'scott' or 'silverman') uses that scalar directly as the bandwidth factor, which multiplies the standard deviation of the data; this effectively creates a constant width in redshift space over which you smooth to get the estimated PDF.  What happens as you vary this scalar from 0.01 to 1.0?  Plot all of the various PDFs on top of each other with the same redshift values on the x-axis (created using np.linspace()) for comparison purposes.
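The loop over bandwidth choices might look something like the sketch below.  It runs on a synthetic skewed sample rather than the real redshifts, and the grid limits, sample size, and the specific scalar values are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic, positively skewed stand-in for the redshifts (NOT the real data).
rng = np.random.default_rng(6)
z_like = rng.gamma(shape=3.0, scale=0.5, size=2000)

# Common evaluation grid, padded beyond the data so the tails are included.
pad = 3.0 * z_like.std()
z_grid = np.linspace(z_like.min() - pad, z_like.max() + pad, 400)
dz = z_grid[1] - z_grid[0]

fig, ax = plt.subplots()
factors, integrals = {}, {}
for bw in ("scott", "silverman", 0.01, 0.1, 1.0):
    kde = stats.gaussian_kde(z_like, bw_method=bw)
    pdf = kde(z_grid)                    # evaluate the estimated PDF on the grid
    factors[bw] = kde.factor             # bandwidth factor actually used
    integrals[bw] = np.sum(pdf) * dz     # crude check that the PDF integrates to ~1
    ax.plot(z_grid, pdf, label=f"bw_method={bw!r} (factor {kde.factor:.3f})")
ax.set_xlabel("Redshift")
ax.set_ylabel("Estimated probability density")
ax.legend()
```

Printing `kde.factor` alongside each curve makes it easier to see where the 'scott' and 'silverman' rules land relative to the hand-picked scalars.  Evaluating the KDE scales with the number of data points times the number of grid points, so the full catalog may take noticeably longer than this small sample.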