# Day 17 project - non-linear regression

In today's in-class assignment, we're going to experiment with nonlinear models - specifically, using some models of galaxy surface brightness.

It's typical to model elliptical galaxies using a [Sérsic profile](https://en.wikipedia.org/wiki/Sersic_profile), which expresses the surface brightness of a galaxy as a function of impact parameter (projected radius) from its center.  The mathematical form is as follows:

$\log_{10}I(r) = \log_{10} I_e - b_n [(r/r_e)^{1/n} - 1]$,

with I$_e$ being the intensity at the effective radius, r$_e$ being that effective radius (usually measured in arc seconds), and the parameter n, called the "Sérsic index," controlling the curvature of the profile (with profiles having larger values of n being more centrally concentrated).  In this expression, $b_n \simeq 0.868n -0.142$ for ranges of n that are relevant for most elliptical galaxies.  The observed surface brightness profile is calculated from the Sérsic profile as follows:

$\mu(r) = \mu_0 - 2.5 \log_{10}I(r)$ mag arcsec$^{-2}$

where $\mu_0$ is a quantity that encapsulates the distance modulus and other quantities (see [this Wikipedia page](https://en.wikipedia.org/wiki/Magnitude_(astronomy)) to remind yourself about how magnitudes work).
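
Substituting the first expression into the second gives the surface brightness directly in terms of the four parameters.  This is only the two equations above combined, not an additional model:

$\mu(r) = \mu_0 - 2.5 \log_{10} I_e + 2.5\, b_n [(r/r_e)^{1/n} - 1]$ mag arcsec$^{-2}$

At $r = r_e$ the bracket vanishes, so $\mu(r_e) = \mu_0 - 2.5 \log_{10} I_e$.  Written this way, $\mu_0$ and I$_e$ only appear through that one combination, so the data constrain the combination rather than each quantity separately.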

We're going to use a SciPy nonlinear regression tool to calculate the parameters for the Sérsic profile for three elliptical galaxies:  
[Messier 49 (NGC 4472)](https://en.wikipedia.org/wiki/Messier_49),
[Messier 86 (NGC 4406)](https://en.wikipedia.org/wiki/Messier_86), and
[NGC 4551](https://en.wikipedia.org/wiki/NGC_4551).  Data files are included for these three galaxies that record the surface brightness (in units of mag arcsec$^{-2}$) as a function of radius on the sky in arc seconds.

First, make a line plot of the surface brightness profiles of the three galaxies, and make sure that they make sense to you.  Use NumPy's [genfromtxt()](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.genfromtxt.html) function to read the data from each file into two separate arrays!
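
As a rough illustration of the reading step, here is a minimal sketch that feeds `genfromtxt()` an in-memory stand-in for a data file.  The contents of `fake_file`, the column order (radius first, surface brightness second), and the header line are all assumptions made for the example; check the real files before relying on any of them.

```python
import io
import numpy as np

# Stand-in for one of the data files. The real file names and column layout
# are not specified here, so this two-column table is made up for illustration.
fake_file = io.StringIO(
    "# r[arcsec]  mu[mag/arcsec^2]\n"
    "1.0   16.5\n"
    "5.0   18.2\n"
    "20.0  20.1\n"
    "80.0  22.9\n"
)

# Lines starting with '#' are treated as comments by default.
# unpack=True transposes the table so each column comes back as its own array.
radius, mu = np.genfromtxt(fake_file, unpack=True)

print(radius)
print(mu)
```

With a real file, the `io.StringIO` object would be replaced by the file's path.  When plotting, keep in mind that smaller magnitudes are brighter, so it can help to flip the vertical axis (for example with matplotlib's `invert_yaxis()`).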

In [None]:
# put your code here!



Before we do any optimization, write a function that takes in an array of radii, as well as values for the four parameters above ($\mu_0$, I$_e$, $n$, and r$_e$), and returns an array of surface brightness magnitudes.  Make your best guess at the parameter values for one of the three elliptical galaxies above, and see how good a job you can do manually fitting the data using a "chi-by-eye" fit by twiddling the parameters.  (Note that this is not an approved method of fitting models in 2018, but bear with me.)
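
One possible shape for such a function is sketched below.  The name `sersic_mu`, the argument order, and the trial parameter values are all illustrative choices rather than anything prescribed by the assignment; the body simply evaluates the two equations given earlier.

```python
import numpy as np

def sersic_mu(r, mu0, I_e, n, r_e):
    """Surface brightness (mag / arcsec^2) of a Sersic profile at radii r."""
    b_n = 0.868 * n - 0.142  # approximation quoted in the text
    log_I = np.log10(I_e) - b_n * ((r / r_e) ** (1.0 / n) - 1.0)
    return mu0 - 2.5 * log_I

# Made-up radii and trial parameters, just to show the call.
r = np.array([1.0, 10.0, 30.0, 100.0])
mu_guess = sersic_mu(r, mu0=20.0, I_e=1.0, n=4.0, r_e=30.0)
print(mu_guess)
```

Putting the radius array first and the parameters after it as separate arguments is convenient, because that is the calling convention `curve_fit()` expects of a model function.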


In [None]:
# put your code here!



Next, use SciPy's [curve_fit()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) function (which is a part of the incredibly useful [optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html) module, which does optimization and root finding) to fit the Sérsic profile for all three galaxies.  Use the function you wrote above, and the default fitting method.  You may need to give the function some reasonable bounds for the various quantities as well!
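
The call pattern might look like the sketch below, which fits synthetic data generated from known parameters rather than the real galaxy profiles.  The "true" values, noise level, starting guess `p0`, and bounds are all invented for the example, and `sersic_mu` is the hypothetical helper name used for the model function.  Note that when bounds are supplied, `curve_fit()` switches its default method from `'lm'` to `'trf'`.

```python
import numpy as np
from scipy.optimize import curve_fit

def sersic_mu(r, mu0, I_e, n, r_e):
    """Surface brightness (mag / arcsec^2) of a Sersic profile at radii r."""
    b_n = 0.868 * n - 0.142
    log_I = np.log10(I_e) - b_n * ((r / r_e) ** (1.0 / n) - 1.0)
    return mu0 - 2.5 * log_I

# Synthetic "observations": made-up true parameters plus small Gaussian noise.
rng = np.random.default_rng(0)
r = np.logspace(0, 2, 40)  # 1 to 100 arcsec
mu_obs = sersic_mu(r, 20.0, 1.0, 4.0, 30.0) + rng.normal(0.0, 0.02, r.size)

# Starting guess and (lower, upper) bounds, in the order mu0, I_e, n, r_e.
# These particular numbers are assumptions chosen for the synthetic data.
p0 = [19.0, 1.0, 3.0, 20.0]
lower = [10.0, 1e-3, 0.5, 1.0]
upper = [30.0, 1e3, 10.0, 300.0]

popt, pcov = curve_fit(sersic_mu, r, mu_obs, p0=p0, bounds=(lower, upper))
mu_fit = sersic_mu(r, *popt)

print("best-fit parameters:", popt)
print("rms residual:", np.sqrt(np.mean((mu_fit - mu_obs) ** 2)))
```

Because $\mu_0$ and I$_e$ enter the model only through $\mu_0 - 2.5 \log_{10} I_e$, their individual best-fit values (and the corresponding entries of `pcov`) should be interpreted with caution; the fitted curve itself is what the data pin down.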

In [None]:
# put your code here!



Now, calculate the coefficient of determination of this best-fit model, a statistic measuring the fraction of the variance in the data that the model accounts for.  The coefficient of determination (defined by equation 7.61 in Feigelson and Babu) is:

$R^2 = 1 - \frac{\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}$

where $Y_i$ are the data values, $\hat{Y}_i$ are the model values at each point, $N$ is the number of data points (not to be confused with the Sérsic index $n$), and $\bar{Y} \equiv \sum_{i=1}^{N}Y_i/N$, i.e., the mean of the response variable.  A successful model has $R^2$ approaching 1.

Calculate this coefficient for both your best-fit model parameters and the model where you fit them by hand.  How do they compare?  Also, plot the fractional difference between the data and your best-fit and fitted-by-hand models as a function of radius (fractional difference is defined as $f \equiv \frac{\mathrm{data}-\mathrm{model}}{\mathrm{data}}$).  Does this plot generally agree with your $R^2$ values?
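
Both quantities translate almost directly into NumPy.  The sketch below uses hypothetical helper names (`r_squared`, `fractional_difference`) and a handful of made-up data and model values purely to show the arithmetic.

```python
import numpy as np

def r_squared(y, y_model):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y - y_model) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fractional_difference(y, y_model):
    """(data - model) / data, evaluated point by point."""
    return (y - y_model) / y

# Made-up data and model values, for illustration only.
y = np.array([16.5, 18.2, 20.1, 22.9])
y_model = np.array([16.4, 18.4, 20.0, 23.0])

print(r_squared(y, y_model))
print(fractional_difference(y, y_model))
```

A model that reproduces the data exactly gives $R^2 = 1$, while a model that just predicts the mean of the data everywhere gives $R^2 = 0$.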

In [None]:
# put your code here!



**put your answer here!**

If you have time, do a cross-validation of your model by withholding a random fraction of your dataset, doing the regression with the remaining data, and then measuring the residual from the withheld sample.  Perform a bootstrap-like resampling by doing this several times, withholding 20% of the data points each time.  What is the distribution of parameters that you find?
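
One way this loop could be organized is sketched below, again on synthetic data so that it runs on its own.  The number of trials, the noise level, the bounds, and the helper name `sersic_mu` are all assumptions for the example; this is a repeated random hold-out, not a formal bootstrap.

```python
import numpy as np
from scipy.optimize import curve_fit

def sersic_mu(r, mu0, I_e, n, r_e):
    """Surface brightness (mag / arcsec^2) of a Sersic profile at radii r."""
    b_n = 0.868 * n - 0.142
    log_I = np.log10(I_e) - b_n * ((r / r_e) ** (1.0 / n) - 1.0)
    return mu0 - 2.5 * log_I

# Synthetic profile with made-up parameters and noise.
rng = np.random.default_rng(1)
r = np.logspace(0, 2, 50)
mu_obs = sersic_mu(r, 20.0, 1.0, 4.0, 30.0) + rng.normal(0.0, 0.02, r.size)

p0 = [19.0, 1.0, 3.0, 20.0]
bounds = ([10.0, 1e-3, 0.5, 1.0], [30.0, 1e3, 10.0, 300.0])

n_trials = 20                 # number of random splits (arbitrary choice)
n_held = int(0.2 * r.size)    # withhold 20% of the points each time

params = np.empty((n_trials, 4))
rms_held = np.empty(n_trials)
for k in range(n_trials):
    idx = rng.permutation(r.size)
    held, kept = idx[:n_held], idx[n_held:]
    # Fit using only the retained points.
    popt, _ = curve_fit(sersic_mu, r[kept], mu_obs[kept], p0=p0, bounds=bounds)
    params[k] = popt
    # Measure how well that fit predicts the withheld points.
    resid = mu_obs[held] - sersic_mu(r[held], *popt)
    rms_held[k] = np.sqrt(np.mean(resid ** 2))

print("parameter means:", params.mean(axis=0))
print("parameter spreads:", params.std(axis=0))
print("mean rms on withheld points:", rms_held.mean())
```

Histograms of the columns of `params` then give a picture of the parameter distribution, and comparing `rms_held` with the residuals on the retained points indicates whether the model generalizes to data it was not fit to.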

In [None]:
# put your code here!

