# Project 3: How fast is the universe expanding?

One of the most fundamental discoveries in modern astronomy is that the universe is expanding. This expansion is described by **Hubble’s Law**, which states that the velocity at which a galaxy recedes from us is proportional to its distance: 

$v = H_0 d$

where:
- $v$ is the recession velocity (in km/s), which can be estimated using redshift $z$
- $d$ is the luminosity distance (in megaparsecs, Mpc)
- $H_0$ is the Hubble constant, which describes the rate of expansion of the universe (in km/s/Mpc)
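
As a quick worked example of Hubble's Law, the sketch below converts a redshift into a recession velocity using the low-redshift approximation $v \approx cz$, then into a distance. The redshift and the value of $H_0$ are illustrative placeholders, not measurements.

```python
# Illustrative values only: the redshift and H0 below are assumptions, not results.
c = 2.998e5         # speed of light in km/s
H0_assumed = 70.0   # assumed Hubble constant in km/s/Mpc
z = 0.05            # example redshift of a nearby galaxy

v = c * z           # recession velocity in km/s (only valid for z << 1)
d = v / H0_assumed  # distance in Mpc, from v = H0 * d

print(f"v = {v:.0f} km/s, d = {d:.1f} Mpc")
```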

To measure the distances to far-away galaxies, astronomers rely on [standard candles](https://astronomy.swin.edu.au/cosmos/s/Standard+Candle): astronomical objects with known intrinsic brightnesses, like Type Ia supernovae (SN). In astronomy, we measure an object's brightness in two ways: **apparent magnitude**, which describes how bright an object appears from Earth, and **absolute magnitude**, which describes how bright the object would be at a standard distance of 10 parsecs. Since all Type Ia supernovae have nearly the same intrinsic brightness (absolute magnitude), we can compare their observed brightness (apparent magnitude) to infer their distances. By measuring the distances and redshifts of a large sample of Type Ia SN, we can investigate the expansion rate of the universe and estimate the value of the Hubble constant.

---

## Data

For this project, we will use data from the Pantheon+SH0ES supernova catalog, which contains derived properties for over 1,500 Type Ia supernovae across a range of [redshifts](https://science.nasa.gov/mission/hubble/science/science-behind-the-discoveries/hubble-cosmological-redshift/). We will focus on a subset of nearby supernovae (z < 0.1) to measure the Hubble constant.

The specific table that we'll be using for this project is from [Brout+2022](https://ui.adsabs.harvard.edu/abs/2022ApJ...938..110B/abstract) and can be found [here](https://github.com/PantheonPlusSH0ES/DataRelease/blob/main/Pantheon%2B_Data/4_DISTANCES_AND_COVAR/Pantheon%2BSH0ES.dat) on GitHub. The columns in this table are described in full [here](https://github.com/PantheonPlusSH0ES/DataRelease/tree/main/Pantheon%2B_Data/4_DISTANCES_AND_COVAR), but we're primarily interested in the following parameters:

1. `CID`: A unique identifier for each Type Ia SN
2. `zHD`: The redshift of each Type Ia SN
3. `MU_SH0ES`: The corrected and standardized [distance modulus](https://lco.global/spacebook/distance/what-is-distance-modulus/) (apparent magnitude - absolute magnitude) of each Type Ia SN
4. `zHDERR` and `MU_SH0ES_ERR_DIAG`: The errors associated with the redshift and the distance modulus, respectively

These properties were painstakingly measured from photometric and spectroscopic data in a multi-decade effort by the Pantheon+SH0ES team. Though they've done a lot of the "heavy lifting" for us, there's still a lot to learn about measuring $H_0$ by playing with the data ourselves!



---



## Analysis tasks

### 1. Download and read in the Pantheon+SH0ES data

Navigate to the [Pantheon+SH0ES data](https://github.com/PantheonPlusSH0ES/DataRelease/blob/main/Pantheon%2B_Data/4_DISTANCES_AND_COVAR/Pantheon%2BSH0ES.dat) on GitHub and download a copy of the table to your computer. Then read the table into this notebook in a format that's easy to work with -- Astropy's `Table` is strongly recommended.

Note that this file is delimited by spaces, meaning that each piece of information in a given row is separated by one blank space. Note also that the first line of the file is a header line, with names for each of the columns. Astropy's `Table.read()` function should be able to handle this format, but you'll need to tweak the default parameters a bit. Check out the documentation [here](https://docs.astropy.org/en/latest/api/astropy.io.ascii.read.html#astropy.io.ascii.read) to identify which parameters you'll need to change. (For the table format, you can use `'ascii'`, which essentially means that the table is made up of standard text characters.)

Once you've read in the data, filter the table so that it only contains SN for which `zHD` < 0.1. For help filtering tables, look back at the `intro_to_gaia_data` notebook from week 7.

### 2. Calculate luminosity distances

The `MU_SH0ES` column of the Pantheon+SH0ES table contains measurements of each supernova's distance modulus ($\mu$). The distance modulus is calculated from the absolute magnitude $M$ and the apparent magnitude $m$ as $\mu = m - M$. Once you have the distance modulus, you can convert it to [luminosity distance](https://en.wikipedia.org/wiki/Luminosity_distance) $d_L$ with the following formula:

$d_L = 10^{(\mu - 25)/5} \text{ Mpc}$

Using this formula, compute the luminosity distance for each SN and add a new column to your data table that stores the results.
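
As a minimal sketch of this conversion, the snippet below applies the formula to a NumPy array of example distance moduli. The values are chosen for easy checking and are not catalog entries.

```python
import numpy as np

# Example distance moduli (illustrative values, not catalog entries).
mu = np.array([25.0, 30.0, 35.0])

# d_L = 10^((mu - 25) / 5), in Mpc; NumPy applies this element-wise.
d_L = 10 ** ((mu - 25.0) / 5.0)
print(d_L)
```

With an Astropy `Table`, the same expression can be applied to a column and the result assigned to a new column name of your choosing, e.g. `table["d_L"] = ...` (the name `d_L` is only a suggestion).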

### 3. Plot the Hubble diagram

The Hubble diagram is a famous plot showing the correlation between velocity and distance for astronomical objects. (It dates all the way back to 1929 -- check out Hubble's original paper [here](https://ui.adsabs.harvard.edu/abs/1929PNAS...15..168H/abstract) on ADS!) Classically, this plot shows distance on the x-axis and velocity on the y-axis, but the Pantheon+SH0ES team visualize it slightly differently: redshift ($z$) is on the x-axis and distance modulus ($\mu$) is on the y-axis (see Figure 4 of [Brout+2022](https://ui.adsabs.harvard.edu/abs/2022ApJ...938..110B/abstract) for an example). For this task, you'll take a similar approach.

Create a scatterplot of redshift (x-axis) vs luminosity distance (y-axis) for the Pantheon+SH0ES SN. For visualization purposes, change both axes to a log scale. 
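
A log-log scatterplot of this kind might be set up as in the sketch below. It uses synthetic stand-in data generated with an assumed $H_0$ of 70 km/s/Mpc and 5% scatter, so the points are not real supernovae; the figure size, marker size, and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: distances follow d = c z / H0 with an assumed
# H0 of 70 km/s/Mpc and 5% random scatter. Not real measurements.
rng = np.random.default_rng(0)
z = rng.uniform(0.01, 0.1, size=200)
d_L = (2.998e5 / 70.0) * z * rng.normal(1.0, 0.05, size=z.size)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(z, d_L, s=10, alpha=0.6)

# Switch both axes to a logarithmic scale.
ax.set_xscale("log")
ax.set_yscale("log")

ax.set_xlabel("Redshift $z$")
ax.set_ylabel("Luminosity distance $d_L$ (Mpc)")
ax.set_title("Hubble diagram (synthetic data)")
```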

### 4. Measure the Hubble constant

For the nearby universe, second-order cosmological effects like the acceleration of the expansion can be ignored, and the Hubble constant can be related to the luminosity distance $d_L$ and the redshift $z$ as:

$d_L = \left(\frac{c}{H_0}\right) z$

where $c \approx 2.998 \times 10^5 \text{ km/s}$ is the speed of light. Note that this is just a linear equation of the form $y = mx + b$, where $m = \left(\frac{c}{H_0}\right)$ is the slope and the intercept $b = 0$. Use [`scipy.optimize.curve_fit`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) to fit a line of this form to your data and report the best-fit value for $H_0$. Replot your Hubble diagram and show the best-fit line on your plot.
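
The sketch below shows one way such a fit could look, using synthetic data generated with an assumed $H_0$ of 70 km/s/Mpc so that the recovered value can be checked. The function name `line_through_origin` is hypothetical, and the data are not real supernovae.

```python
import numpy as np
from scipy.optimize import curve_fit

c = 2.998e5  # speed of light in km/s

# Synthetic stand-in data generated with an assumed H0 of 70 km/s/Mpc
# and 5% random scatter. Not real measurements.
rng = np.random.default_rng(0)
z = rng.uniform(0.01, 0.1, size=200)
d_L = (c / 70.0) * z * rng.normal(1.0, 0.05, size=z.size)

def line_through_origin(z, slope):
    """Linear model with zero intercept: d_L = slope * z."""
    return slope * z

# popt holds the best-fit parameters; pcov is their covariance matrix.
popt, pcov = curve_fit(line_through_origin, z, d_L)
slope = popt[0]

# The slope equals c / H0, so invert it to recover H0.
H0_fit = c / slope
print(f"Best-fit H0 = {H0_fit:.1f} km/s/Mpc")
```

Fitting the slope and inverting it afterwards keeps the model linear in its free parameter, which makes the fit insensitive to the starting guess.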

### 5. Estimate the error of your measurement

Bootstrapping is a statistical technique that allows us to estimate the uncertainty in our measurements by resampling the data multiple times. Instead of relying on a single fit to the data, we create many different datasets by randomly selecting data points from the original dataset. Each resampled dataset is then used to compute a new estimate of $H_0$. By repeating this process many times, we obtain a distribution of $H_0$ values, from which we can calculate a mean and uncertainty. 

To implement bootstrapping, write a loop that performs the following operations a large number of times (e.g. n=1000): 

1. **Resample the dataset:** Randomly select data points from the original dataset, allowing the same points to be chosen multiple times (this is called resampling with replacement). The new dataset should be the same length as the old dataset. You might find [`numpy.random.choice`](https://numpy.org/doc/2.1/reference/random/generated/numpy.random.choice.html) helpful.
2. **Fit the linear model:** Use the resampled dataset to fit the linear model from step 4. Store the resulting estimate of $H_0$ in a list that can be accessed once the loop finishes. (You don't need to record what iteration the estimate came from; you can put all 1000 different estimates in the same list.)

Plot a histogram of your distribution of $H_0$ values. You should see a roughly Gaussian shape. Calculate the mean and standard deviation of the distribution and report your final $H_0$ estimate as $H_0 = \text{mean} \pm \text{standard deviation}$.
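
The bootstrap procedure described above can be sketched as follows. The example again uses synthetic data with an assumed $H_0$ of 70 km/s/Mpc, and it draws random indices with NumPy's `Generator.choice` rather than `numpy.random.choice`; both resample with replacement. The variable names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

c = 2.998e5  # speed of light in km/s

# Synthetic stand-in data (assumed H0 = 70 km/s/Mpc, 5% scatter).
rng = np.random.default_rng(0)
z = rng.uniform(0.01, 0.1, size=200)
d_L = (c / 70.0) * z * rng.normal(1.0, 0.05, size=z.size)

def line_through_origin(z, slope):
    """Linear model with zero intercept: d_L = slope * z."""
    return slope * z

n_boot = 1000
H0_samples = []
for _ in range(n_boot):
    # Resample row indices with replacement, keeping the dataset length fixed.
    idx = rng.choice(z.size, size=z.size, replace=True)
    # Refit the line to the resampled points and store the implied H0.
    popt, pcov = curve_fit(line_through_origin, z[idx], d_L[idx])
    H0_samples.append(c / popt[0])

H0_samples = np.array(H0_samples)
H0_mean = H0_samples.mean()
H0_std = H0_samples.std(ddof=1)  # sample standard deviation
print(f"H0 = {H0_mean:.2f} +/- {H0_std:.2f} km/s/Mpc")
```

Resampling indices instead of values keeps each redshift paired with its own distance, which is what the fit needs.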

### 6. Compare your measurement to literature values for $H_0$

Finally, let's compare your measurement of $H_0$ to measurements in the literature. Each of the below measurements uses a different method to estimate $H_0$: 

1. $H_0 = 67.4 \pm 0.50$ km/s/Mpc (from [Planck Collaboration+2020](https://ui.adsabs.harvard.edu/abs/2020A%26A...641A...6P/abstract); measured from the cosmic microwave background)
2. $H_0 = 73.04 \pm 1.04$ km/s/Mpc (from [Riess+2022](https://ui.adsabs.harvard.edu/abs/2022ApJ...934L...7R/abstract); measured with Type Ia SN and [Cepheid variable stars](https://astrobites.org/2019/03/08/leavitt-variable-stars/))
3. $H_0 = 69.8 \pm 1.71$ km/s/Mpc (from [Freedman 2021](https://ui.adsabs.harvard.edu/abs/2021ApJ...919...16F/abstract); measured with [tip of the red giant branch stars](https://en.wikipedia.org/wiki/Tip_of_the_red-giant_branch))

To determine whether your measured $H_0$ significantly differs from these values, perform a **t-test**. First, calculate the t-statistic $t$ for your $H_0$ value paired with each of the above measurements:

$t = \frac{| H_{0, \text{yours}} - H_{0, \text{published}} |}{\sqrt{\sigma_{\text{yours}}^2 + \sigma_{\text{published}}^2}}$
 
Note that each $\sigma$ here is just the uncertainty on the corresponding measurement: $\sigma_{\text{yours}}$ is the standard deviation from step 5, and $\sigma_{\text{published}}$ is the uncertainty quoted in the list above. Once you have the calculated statistics, use the function `pt_test` defined below to calculate the probability that your measurement of $H_0$ agrees with each of the above measurements; a small probability indicates a significant disagreement. (To do this, just feed in the t-statistics that you calculated one at a time.) Print out your results and comment on how your measurement from step 5 compares with values from the literature.
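
As an illustration of the arithmetic, the sketch below computes the t-statistic against each literature value for a placeholder measurement of $72.0 \pm 1.0$ km/s/Mpc. That measurement is invented for the example and should be replaced with your own result. The probability is computed directly with `scipy.stats.t`, which should match what `pt_test` returns for a non-negative t-statistic with its default `df=1000`.

```python
import numpy as np
from scipy.stats import t as tdist

# Placeholder measurement, for illustration only; substitute your own result.
H0_mine, sigma_mine = 72.0, 1.0

# Literature values quoted above: (H0, uncertainty) in km/s/Mpc.
literature = {
    "Planck Collaboration+2020": (67.4, 0.50),
    "Riess+2022": (73.04, 1.04),
    "Freedman 2021": (69.8, 1.71),
}

t_stats = {}
p_values = {}
for name, (H0_pub, sigma_pub) in literature.items():
    # Difference between the two values in units of their combined uncertainty.
    t_stats[name] = abs(H0_mine - H0_pub) / np.sqrt(sigma_mine**2 + sigma_pub**2)
    # One-tailed probability from a t-distribution with 1000 degrees of freedom.
    p_values[name] = tdist(1000).sf(t_stats[name])
    print(f"{name}: t = {t_stats[name]:.2f}, p = {p_values[name]:.3g}")
```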

---

## Reflection

Write a brief (1-2 paragraphs) interpretation of the results you found above. Link it back to your original research question and key concepts from your literature review. (For this project in particular, you might consider thinking about why the correlation you found between redshift and distance exists.)

Then, write a brief (1-2 paragraphs) reflection on the limitations of your analysis. Are there any caveats or assumptions in your analysis? Could more data or a different method provide more robust results?

---

## Extending your analysis (optional)

Are there additional aspects of the dataset that you’d like to explore? Do you have ideas for refining the methods used in this notebook? Or maybe you’ve noticed an interesting pattern in your results that raises new questions? If you answered yes to any of these questions, I encourage you to extend your analysis! Feel free to reach out to me via email or visit office hours to discuss your ideas. If you're interested in diving deeper but aren’t sure where to start, I’m also happy to brainstorm with you. This is a great opportunity to practice developing your own research questions and exploring a dataset in a way that interests you.

---

In [None]:
from scipy.stats import t as tdist


def pt_test(t, df=1000):
    '''
    Performs a one-tailed t-test on the given t-statistic.

    t: the t-statistic that the test is being performed on
    df: the degrees of freedom of the distribution that the test will use (assumed 1000)

    Returns the one-tailed probability that results from the t-test.
    '''
    t_dist = tdist(df)
    if t > 0:
        # Upper tail: sf(t) equals 1 - cdf(t) but stays accurate for large t
        return t_dist.sf(t)
    else:
        # Lower tail for zero or negative t
        return t_dist.cdf(t)