# Project 2: What do galaxies that host active galactic nuclei look like?

**A**ctive **g**alactic **n**uclei (AGN) are powered by accretion of matter onto supermassive black holes at the centers of galaxies. The presence of AGN can strongly influence the evolution of galaxies, affecting star formation, morphology, and other properties.

In this project, you will explore how AGN activity correlates with host galaxy properties using data from the [MPA-JHU value-added catalog](http://wwwmpa.mpa-garching.mpg.de/SDSS/). This catalog is based on the Sloan Digital Sky Survey (SDSS) and provides emission-line fluxes, stellar masses, and other derived properties for a large sample of galaxies, including some that are thought to host AGN.

---

## Data

The Sloan Digital Sky Survey (SDSS) is one of the most ambitious and influential astronomical surveys ever conducted. Using a dedicated 2.5-meter telescope at Apache Point Observatory in New Mexico, SDSS has imaged large swaths of the sky, creating detailed maps of millions of objects (stars, galaxies, and quasars). It has also obtained spectra of these objects, allowing astronomers to study their physical properties, like chemical composition, distance, and velocity.

The MPA-JHU value-added catalog is a specialized dataset derived from SDSS observations. Researchers at the Max Planck Institute for Astrophysics (MPA) and Johns Hopkins University (JHU) used the public SDSS data (primarily the spectra) to measure additional properties not directly provided in the standard SDSS catalogs, including:

1. Emission-line fluxes: These are measurements of how bright specific emission lines (like Hα, Hβ, [O III], [N II]) appear in a galaxy’s spectrum. 
2. Stellar masses: Estimates of how massive each galaxy is in terms of the total mass of its stars.
3. Star formation rates: Indicators of how actively the galaxy is forming new stars.

Emission-line fluxes are a particularly important indicator of what's going on in a given galaxy. (There's a comprehensive review [here](https://ui.adsabs.harvard.edu/abs/2019ARA%26A..57..511K/abstract) if you want to learn more!) Emission lines are generally produced by hot gas. The gas might have been heated by young stars (indicating ongoing star formation), or by the accretion disk around a supermassive black hole (the engine of an AGN). By combining color and emission-line information, astronomers can determine how the gas in a galaxy is being heated, investigate how and why galaxies transition between star-forming and quiescent states, and explore relationships between a galaxy’s properties and potential AGN activity.

The MPA-JHU catalog is described on [this website](https://wwwmpa.mpa-garching.mpg.de/SDSS/DR7/), and all the data is aggregated [on this server](https://wwwmpa.mpa-garching.mpg.de/SDSS/DR7/Data/). You will need to retrieve the following files: 

1. `gal_info_dr7_v5_2.fit.gz`: Basic information and photometric magnitudes
2. `gal_line_dr7_v5_2.fit.gz`: Emission-line fluxes
3. `gal_totsfr_dr7_v5_2.fits.gz`: Total star formation rates; **note that this filename ends with .fits.gz, not .fit.gz like the others!**
4. `totlgm_dr7_v5_2.fit.gz`: Total stellar masses

Note that these files are compressed (you can tell because of the `.gz` at the end of the filename). If you use `astropy.table.Table.read()` to load in the files, you don't need to uncompress them -- Astropy will do this automatically. 
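
As a minimal sketch of this loading step, assuming the four files have already been downloaded into a local folder: the `data` folder name and the `load_tables` helper below are hypothetical and not part of the project template.

```python
from pathlib import Path

# Filenames as listed above; note the single ".fits.gz" among the ".fit.gz" files.
FILENAMES = {
    "info": "gal_info_dr7_v5_2.fit.gz",
    "lines": "gal_line_dr7_v5_2.fit.gz",
    "sfr": "gal_totsfr_dr7_v5_2.fits.gz",
    "mass": "totlgm_dr7_v5_2.fit.gz",
}


def load_tables(data_dir="data"):
    """Read the four catalog files from data_dir (a hypothetical folder name)."""
    paths = {key: Path(data_dir) / name for key, name in FILENAMES.items()}

    # Fail early with a readable message if a download is missing.
    missing = [path.name for path in paths.values() if not path.exists()]
    if missing:
        raise FileNotFoundError(f"Missing from {data_dir}/: {missing}")

    # Imported here so the path check above runs even without astropy installed.
    from astropy.table import Table

    # Table.read() handles the gzip compression itself.
    return {key: Table.read(str(path)) for key, path in paths.items()}
```

Calling `tables = load_tables()` would then give one table per file, for example `tables["info"]` and `tables["lines"]`.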

---

## Analysis tasks

### 1. Retrieve and clean the MPA-JHU catalog

Navigate to the MPA-JHU [data server](https://wwwmpa.mpa-garching.mpg.de/SDSS/DR7/Data/) and retrieve the four files listed above. Each of these files contains a different set of properties for the galaxy sample you'll be working with. I have provided a helper function that aggregates the four files to produce a neat table that you can use for your analysis. To use this function, simply read in the four files with Astropy's `Table.read()` function and store each table in a variable. Then pass the four variables as arguments to the provided function and store the result of the function in a new variable. 

For this project, we only want to use galaxies that have reliable estimates for all of our parameters of interest: 
- Fluxes for the following emission lines:
    - `OIII_FLUX`: The [OIII] line at 5007 Angstroms
    - `NII_6584_FLUX`: The [NII] line at 6584 Angstroms
    - `H_ALPHA_FLUX`: The Hα line at 6563 Angstroms
    - `H_BETA_FLUX`: The Hβ line at 4861 Angstroms
- `MASS_MEDIAN`: The total stellar mass
- `SFR_MEDIAN`: The total star formation rate
- `PLUG_MAG`: Magnitudes in the u, g, r, i, and z SDSS filters
- `VDISP`: The galaxy's [velocity dispersion](https://en.wikipedia.org/wiki/Velocity_dispersion)

Examine the values in the table and remove any rows with invalid values for any of these parameters. Invalid values are often denoted as negative values, which are unphysical. Sometimes, specific values are chosen; for this dataset, the authors used -9999 for the magnitudes, -1.0 for the masses, and -99 for the star formation rates. Negative emission line fluxes and velocity dispersions (if any) should also be considered invalid.

You should also remove rows where the fluxes of the emission lines of interest have errors greater than 10%. (The columns with the flux errors are denoted with `_ERR` at the end, such as `OIII_FLUX_ERR`.) If `X_FLUX_ERR`/`X_FLUX` > 0.1 (equivalently, `X_FLUX`/`X_FLUX_ERR` < 10) for a given emission line X, you should remove that row.
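
A relative error of 10% corresponds to a flux-to-error ratio of 10, so removing rows with errors above 10% means keeping rows where the error divided by the flux is at most 0.1. Here is a minimal sketch on made-up numbers (the arrays are illustrative, not catalog values):

```python
import numpy as np

# Made-up fluxes and errors for one emission line (illustrative values only).
flux = np.array([120.0, 35.0, 8.0, 50.0])
flux_err = np.array([6.0, 7.0, 0.5, 4.0])

# Relative error = error / flux; keep rows where it is at most 10%.
# This is equivalent to requiring flux / flux_err >= 10.
relative_err = flux_err / flux
keep = relative_err <= 0.10

print(relative_err)
print(keep)
```

For the real table, one such mask would be needed for each of the four emission lines, combined with `&` so that a row survives only if every line passes.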

*Note: The magnitudes are stored in the `PLUG_MAG` column as a `numpy` array of 5 numbers. These numbers correspond to magnitudes in each of the 5 SDSS filters (in order: u, g, r, i, and z). You should make sure that you're discarding rows where any of these values are equal to -9999. The function `numpy.any()` might be useful for this task.*
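
The sketch below shows one way the flag values could be turned into a row mask, using small made-up arrays in place of the real columns. Only the specific flag values are tested here; negative fluxes and velocity dispersions would need a separate `< 0` check.

```python
import numpy as np

# Toy stand-ins for catalog columns (illustrative values only).
plug_mag = np.array([
    [19.1, 17.8, 17.0, 16.6, 16.3],
    [-9999.0, 18.2, 17.5, 17.1, 16.9],  # flagged u magnitude
    [18.7, 17.4, 16.8, -9999.0, 16.1],  # flagged i magnitude
    [20.0, 18.5, 17.9, 17.5, 17.2],
    [19.5, 18.0, 17.3, 16.9, 16.7],
    [19.8, 18.3, 17.6, 17.2, 16.9],
])
mass = np.array([10.5, 10.9, 11.2, -1.0, 9.8, 10.1])  # -1.0 is the mass flag
sfr = np.array([0.3, 0.2, 0.1, 0.0, -99.0, -0.5])     # -99 is the SFR flag

# np.any(..., axis=1) is True for a row if any of its five magnitudes is flagged.
bad_mag = np.any(plug_mag == -9999, axis=1)
bad_mass = mass == -1.0
bad_sfr = sfr == -99

# The last row has a negative SFR value that is not the -99 flag, so it is kept.
keep = ~(bad_mag | bad_mass | bad_sfr)
print(keep)
```

A boolean mask like `keep` can be used to index an Astropy table directly, as in `table[keep]`.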

### 2. Construct the BPT diagram

The BPT diagram is a famous plot used to classify galaxies as star-forming or AGN-hosting. It's named after the three astronomers who came up with it in 1981 -- Baldwin, Phillips, and Terlevich. If you'd like to read their original paper, you can check it out [here](https://ui.adsabs.harvard.edu/abs/1981PASP...93....5B/abstract).

BPT diagrams are made by plotting **the logarithm** of two important emission line flux ratios against each other. On the x-axis is $\log{\left(\frac{\mathrm{[NII]}}{\mathrm{Hα}}\right)}$, and on the y-axis is $\log{\left(\frac{\mathrm{[OIII]}}{\mathrm{Hβ}}\right)}$. Create this plot yourself for the galaxies in your cleaned sample. Set reasonable boundaries for your x- and y-axes.
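
A rough sketch of the two axes, taking the logarithm to be base 10 (the usual convention for this diagram) and using made-up fluxes rather than catalog values. The axis limits are an assumption chosen by eye and should be adjusted to the real sample.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up line fluxes for a handful of galaxies (illustrative values only).
nii = np.array([30.0, 80.0, 150.0, 20.0])
h_alpha = np.array([100.0, 120.0, 140.0, 200.0])
oiii = np.array([40.0, 300.0, 500.0, 25.0])
h_beta = np.array([35.0, 40.0, 45.0, 70.0])

# Base-10 logarithms of the two flux ratios.
x = np.log10(nii / h_alpha)
y = np.log10(oiii / h_beta)

fig, ax = plt.subplots()
ax.scatter(x, y, s=5, alpha=0.5)  # small, semi-transparent points suit large samples
ax.set_xlabel(r"$\log([\mathrm{NII}]/\mathrm{H}\alpha)$")
ax.set_ylabel(r"$\log([\mathrm{OIII}]/\mathrm{H}\beta)$")
# Axis limits chosen by eye for this sketch.
ax.set_xlim(-2.0, 1.0)
ax.set_ylim(-1.5, 1.5)
```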

### 3. Classify the galaxies in your sample

Taking $x = \log{\left(\frac{\mathrm{[NII]}}{\mathrm{Hα}}\right)}$ and $y = \log{\left(\frac{\mathrm{[OIII]}}{\mathrm{Hβ}}\right)}$, we can define two important relations that are used to classify galaxies on the BPT diagram:

1. **Kewley line:** $y = \frac{0.61}{x - 0.47} + 1.19$
2. **Kauffmann line:** $y = \frac{0.61}{x - 0.05} + 1.3$

Galaxies below the Kauffmann line (from [Kauffmann+2003](https://ui.adsabs.harvard.edu/abs/2003MNRAS.346.1055K/abstract)) can confidently be classified as star-forming, while galaxies above the Kewley line (from [Kewley+2001](https://ui.adsabs.harvard.edu/abs/2001ApJ...556..121K/abstract)) can confidently be classified as AGN hosts. Galaxies between the two lines are called *composite* or *transition* systems, and don't have a clear classification.

Write two functions, called `kewley` and `kauffman`, that take in an array of x-values and return the appropriate y-values. Use these functions to plot the two lines on your BPT diagram. Make sure you choose reasonable ranges of x-values -- both of these functions will show asymptotic behavior at certain values beyond the range of our data. Also make sure to include a legend so that you can tell the lines apart! 
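
One possible shape for these two functions is sketched below, together with x-grids that stop short of each curve's vertical asymptote (at x = 0.47 and x = 0.05). The lower limit of -2 and the 0.05 margin are arbitrary choices for illustration.

```python
import numpy as np


def kewley(x):
    """Return y = 0.61 / (x - 0.47) + 1.19 for an array of x-values."""
    x = np.asarray(x, dtype=float)
    return 0.61 / (x - 0.47) + 1.19


def kauffman(x):
    """Return y = 0.61 / (x - 0.05) + 1.3 for an array of x-values."""
    x = np.asarray(x, dtype=float)
    return 0.61 / (x - 0.05) + 1.3


# Stop each grid just before its asymptote so the curve never blows up.
margin = 0.05  # arbitrary gap for this sketch
x_kewley = np.linspace(-2.0, 0.47 - margin, 200)
x_kauffman = np.linspace(-2.0, 0.05 - margin, 200)

y_kewley = kewley(x_kewley)
y_kauffman = kauffman(x_kauffman)
```

Each pair of arrays can then be passed to a plotting call such as `ax.plot(x_kewley, y_kewley, label="Kewley")`, followed by `ax.legend()`.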

Finally, add a column to your data table that contains a string denoting the type of each galaxy: 'SF' for star-forming, 'C' for composite, and 'AGN' for AGN-hosting. Determine these classifications based on the location of each galaxy relative to the Kewley and Kauffmann lines as described above.
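
The labelling could be sketched as follows. Two choices here are assumptions of this sketch rather than part of the assignment: points at or beyond a curve's asymptote are treated as lying above that curve, and 'SF' takes precedence in the region at low x (around x = -1.3) where the two curves cross.

```python
import numpy as np


def kewley(x):
    return 0.61 / (x - 0.47) + 1.19


def kauffman(x):
    return 0.61 / (x - 0.05) + 1.3


def classify(x, y):
    """Return an array of 'SF', 'C', or 'AGN' labels for BPT coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Each curve only bounds points to the left of its asymptote, so the
    # comparisons are combined with explicit checks on x.
    with np.errstate(divide="ignore"):
        below_kauffman = (x < 0.05) & (y < kauffman(x))
        above_kewley = (x >= 0.47) | (y > kewley(x))

    labels = np.full(x.shape, "C", dtype="<U3")  # default: composite
    labels[above_kewley] = "AGN"
    labels[below_kauffman] = "SF"  # assigned last, so SF wins where both apply
    return labels


x = np.array([-0.5, -0.2, 0.0, 0.6])
y = np.array([-0.5, 0.0, 1.0, -1.0])
print(classify(x, y))  # ['SF' 'C' 'AGN' 'AGN']
```

The returned array can be stored as a new table column, for example `table["CLASS"] = classify(x, y)`, where the column name `CLASS` is only a suggestion.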

### 4. Investigate correlations between galaxy properties and classification

#### 4a. Basic analysis

Plot a series of histograms that compare how host galaxy properties change between the SF, C, and AGN samples. The properties you should examine are the velocity dispersion (`VDISP`), the stellar mass (`MASS_MEDIAN`), and the star formation rate (`SFR_MEDIAN`). For each property, create a histogram that shows the three populations on the same plot (and with the same binning). Make sure to add legends to your plots, and consider setting `density=True` so that the histograms are normalized to the same scale even though each one represents a different total number of galaxies. Comment on any differences or similarities you see.
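
As an illustration of shared binning, the sketch below builds one set of bin edges from all three samples and reuses it for each histogram. The samples are synthetic random numbers, and the choice of 30 bins is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-ins for one property split by class (not real catalog values).
samples = {
    "SF": rng.normal(10.0, 0.5, size=3000),
    "C": rng.normal(10.5, 0.4, size=1000),
    "AGN": rng.normal(10.8, 0.4, size=500),
}

# One set of bin edges spanning all three samples, so the histograms line up.
all_values = np.concatenate(list(samples.values()))
bins = np.linspace(all_values.min(), all_values.max(), 31)  # 31 edges = 30 bins

fig, ax = plt.subplots()
for label, values in samples.items():
    # density=True normalizes each histogram to unit area.
    ax.hist(values, bins=bins, density=True, histtype="step", label=label)
ax.set_xlabel("property value")
ax.set_ylabel("normalized count")
ax.legend()
```

Outlined (`histtype="step"`) histograms are used here so that overlapping populations stay visible; filled bars with a low `alpha` would be another option.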

#### 4b. Color-mass distribution

One of the most fundamental relationships in galaxy evolution is the [connection between a galaxy’s stellar mass and its color](https://mosdef.astro.berkeley.edu/for-the-public/public/galaxy-masses/#:~:text=For%20example%2C%20very%20massive%20galaxies,indicative%20of%20older%20stellar%20populations.). Star-forming galaxies tend to be bluer, and more massive galaxies that have stopped forming stars tend to be redder. However, galaxies that host AGN often occupy an intermediate region between these two populations, suggesting that AGN activity may play a role in the transition from star-forming to quiescent galaxies. 

Create individual scatterplots for your SF and AGN samples, with mass on the x-axis and (u - r) color on the y-axis. To calculate (u - r) color, you'll need the u and r magnitudes from the `PLUG_MAG` column. Comment on any trends that you notice, including differences between the two plots. 
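
Because the five magnitudes are ordered u, g, r, i, z, the u and r values sit in the first and third positions of each row. A small sketch with a made-up array in place of the real `PLUG_MAG` column:

```python
import numpy as np

# Toy PLUG_MAG-like array: one row per galaxy, columns ordered u, g, r, i, z.
plug_mag = np.array([
    [19.1, 17.8, 17.0, 16.6, 16.3],
    [20.4, 18.6, 17.7, 17.3, 17.0],
])

u = plug_mag[:, 0]  # first column: u band
r = plug_mag[:, 2]  # third column: r band
u_minus_r = u - r   # larger values mean redder galaxies

print(u_minus_r)
```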

Finally, use [`scipy.optimize.curve_fit()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) to fit a line of the form $y = mx + b$ to each of the scatterplots. Report the best-fit slope and intercept values and show the best-fit lines on your plots. Comment on any similarities or differences that you notice.
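
A minimal sketch of such a fit on synthetic data drawn around a known line, so the recovered values can be checked by eye. The variable names and the noise level are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, m, b):
    """Straight-line model y = m * x + b."""
    return m * x + b


rng = np.random.default_rng(1)

# Synthetic data scattered around m = 0.5, b = -3 (not catalog values).
mass = rng.uniform(9.0, 11.5, size=500)
color = line(mass, 0.5, -3.0) + rng.normal(0.0, 0.2, size=500)

# popt holds the best-fit (m, b); pcov is their covariance matrix.
popt, pcov = curve_fit(line, mass, color)
m_fit, b_fit = popt
m_err, b_err = np.sqrt(np.diag(pcov))  # 1-sigma uncertainties

print(f"m = {m_fit:.3f} +/- {m_err:.3f}, b = {b_fit:.3f} +/- {b_err:.3f}")
```

The best-fit line can be drawn over a scatterplot by evaluating `line(x_grid, m_fit, b_fit)` on a grid such as `x_grid = np.linspace(mass.min(), mass.max(), 100)`.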

---

## Reflection

Write a brief (1-2 paragraphs) interpretation of the results you found above. Link it back to your original research question and key concepts from your literature review. (For this project in particular, you might consider thinking about why any correlations you discovered between galaxy properties exist.)

Then, write a brief (1-2 paragraphs) reflection on the limitations of your analysis. Are there any caveats or assumptions in your analysis? Could more data or a different method provide more robust results?

---

## Extending your analysis (optional)

Are there additional aspects of the dataset that you’d like to explore? Do you have ideas for refining the methods used in this notebook? Or maybe you’ve noticed an interesting pattern in your results that raises new questions? If you answered yes to any of these questions, I encourage you to extend your analysis! Feel free to reach out to me via email or visit office hours to discuss your ideas. If you're interested in diving deeper but aren’t sure where to start, I’m also happy to brainstorm with you. This is a great opportunity to practice developing your own research questions and exploring a dataset in a way that interests you.

---

In [None]:
def make_neat_data_table(info, lines, sfr, mass):
    '''
    Helper function to make a neat catalog of all the info you'll need for this project out of the four individual
    tables you downloaded. 
    
    The four arguments are all expected to be astropy Table objects. If you use Table.read() to load in your four
    files and store them in variables, you should just be able to feed the tables directly into this function. It
    will aggregate these tables into a neat format and return a single table that you can use for this project.
    
    The arguments map to the filenames of the tables like this:
    info: gal_info_dr7_v5_2.fit.gz
    lines: gal_line_dr7_v5_2.fit.gz
    sfr: gal_totsfr_dr7_v5_2.fits.gz
    mass: totlgm_dr7_v5_2.fit.gz
    '''
    from astropy.table import hstack
    info, lines, sfr, mass = info.copy(), lines.copy(), sfr.copy(), mass.copy()
    info.keep_columns(['PLATEID', 'FIBERID', 'RA', 'DEC', 'PLUG_MAG', 'TARGETTYPE', 'SPECTROTYPE', 'SUBCLASS', 
                      'V_DISP', 'V_DISP_ERR', 'SN_MEDIAN'])
    info.rename_column('V_DISP', 'VDISP')
    info.rename_column('V_DISP_ERR', 'VDISP_ERR')
    lines.keep_columns([col for col in lines.colnames if '_FLUX' in col or '_FLUX_ERR' in col])
    sfr.keep_columns(['MEDIAN', 'P16', 'P84'])
    sfr['ERR'] = (sfr['P84'] - sfr['P16'])/2
    for col in sfr.colnames:
        sfr.rename_column(col, f'SFR_{col}')
    mass.keep_columns(['MEDIAN', 'P16', 'P84'])
    mass['ERR'] = (mass['P84'] - mass['P16'])/2
    for col in mass.colnames:
        mass.rename_column(col, f'MASS_{col}')
    return hstack([info, lines, sfr, mass])

In [None]:
#Read in each table and store it in a variable

#Then feed the four tables into the function below to get a neater table you can work with for this project
#This function might print out some warnings in pink when you run it, but those can be safely ignored!
full_table = make_neat_data_table(info, lines, sfr, mass)

#After creating the neat table, you may want to delete the individual tables so they don't take up too much space
#You can do this with del(var_name)
#BEWARE: del() is a permanent action! You'll have to create the variables again to get them back
