In [None]:
import numpy as np
from astropy.table import Table
import pandas as pd
import matplotlib.pyplot as plt

Our goal is to create a new Hertzsprung-Russell (HR) diagram, or color-magnitude diagram, based on data from the Gaia satellite.

# 1. Read and examine the data

Query the Gaia archive <url>https://gea.esac.esa.int/archive/ </url> and ask for the Gaia DR3 number, the parallax and its error, the magnitude of the star in G, and the Bp-Rp color (in magnitudes). Only get the data for stars with parallax > 20 (Gaia reports parallaxes in milliarcseconds) that also have high-quality parallaxes and high-quality photometry. Here, I am defining high-quality parallaxes as those with an uncertainty of less than 4%, ruwe < 1.4, and g.astrometric_excess_noise < 1.8. High-quality photometry is defined as flux errors in G, Rp, and Bp of less than 1%. Download the result as a CSV file.

### 1.a. Read in the data. Print out the first few entries of the table.


In [None]:
# Your code here

### 1.b. Calculate the distance and absolute G magnitudes. 

You'll need to use the online documentation to figure out which columns you need: <url>https://gea.esac.esa.int/archive/documentation/GEDR3/Gaia_archive/chap_datamodel/sec_dm_main_tables/ssec_dm_gaia_source.html</url> (I found this webpage with, drumroll please, a Google search for "gaia edr3 columns".)
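As a reminder of the relations involved, here is a minimal sketch using made-up numbers rather than the Gaia table. It assumes the parallax is in milliarcseconds, so the distance in parsecs is 1000 / parallax, and that extinction is negligible, so the absolute magnitude follows from the distance modulus. The variable names are illustrative only, and simply inverting the parallax is reasonable only when the fractional parallax uncertainty is small.

```python
import numpy as np

# Toy values (NOT real Gaia measurements): parallaxes in mas, apparent magnitudes
parallax_mas = np.array([100.0, 50.0, 20.0])
apparent_mag = np.array([5.0, 8.0, 12.0])

# Distance in parsecs: d [pc] = 1000 / parallax [mas]
distance_pc = 1000.0 / parallax_mas

# Distance modulus (ignoring extinction): M = m - 5 log10(d) + 5
absolute_mag = apparent_mag - 5.0 * np.log10(distance_pc) + 5.0

print(distance_pc)
print(absolute_mag)
```

A quick sanity check: a star at exactly 10 pc (parallax of 100 mas) should have an absolute magnitude equal to its apparent magnitude.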

In [None]:
# Your code here

### 1.c. Plot the data

Don't forget the usual axis labels/clarity requirements.

In [None]:
# Your code here

# 2. Investigating the data

Incoming data are never perfect. The plot we made above looks <i>kind of</i> like an HR diagram. You should be able to identify the main sequence and a white dwarf sequence. There aren't evolved stars, mostly because I only looked at the nearest stars, and evolved stars are rare (<i>Why does that make sense?</i>). But then there's this hockey stick where the fainter stars start getting bluer again; we know from other observations that this isn't correct. The reason is that some of the data are of poor quality. Although the result still won't be perfect, we can do a lot of good by removing data that don't have good signal-to-noise in the brightness measurements.

### 2.a. The flux error column.

The first column we want to look at is `phot_bp_mean_flux_over_error`. What information is contained in this column? What do higher vs. lower values mean?




<span style="color:blue">
    Your answer here
</span>


### 2.b. Make plots to examine this column.

Make two plots:
1) a plot of Bp-Rp color vs `phot_bp_mean_flux_over_error`
2) a plot of Bp flux vs `phot_bp_mean_flux_over_error`

It may help to use log scaling for some of your axes: test it out and see which ones convey the information best.
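If you haven't switched an axis to log scaling before, the following is a small sketch with randomly generated toy data (not the Gaia columns). It assumes the standard matplotlib `Axes.set_xscale` call; which axis to scale in your own plots is for you to decide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data (made up): one quantity spanning several orders of magnitude
rng = np.random.default_rng(0)
snr = 10 ** rng.uniform(0, 4, size=200)
color = rng.normal(1.5, 0.5, size=200)

fig, ax = plt.subplots()
ax.scatter(snr, color, s=5)
ax.set_xscale('log')  # log-scale only the axis that spans many decades
ax.set_xlabel('S/N (toy values)')
ax.set_ylabel('Color (toy values)')
```

The corresponding call for the vertical axis is `ax.set_yscale('log')`.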


In [None]:
# Your code here

### 2.c. Interpretation

What did your plots show? What's a reasonable threshold for removing the low-S/N measurements?

<span style="color:blue">
    Your answer here
</span>

# 3. Cleaning the data

## 3.a. Creating and using boolean arrays

You can select parts of an array using boolean logic, which is to say, asking whether something is True or False. For example, `good = df['column'] > 15` will produce a <b>boolean</b> array filled with True (where the condition is met, i.e. the entry is greater than 15) or False (where the condition is not met). Below I've created a boolean array `bp_good`, which is True where the S/N of the Bp measurement is high; these are the data we want to keep.
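To see the idea on a small scale first, here is a sketch with a made-up table. The column names `snr` and `color` and all of the values are hypothetical.

```python
import pandas as pd

# Toy table (hypothetical values, not Gaia data)
toy = pd.DataFrame({'snr': [3.0, 20.0, 15.0, 40.0],
                    'color': [0.2, 1.1, 0.8, 2.3]})

# Boolean array: True where the condition is met
good = toy['snr'] > 15
print(good.tolist())      # [False, True, False, True]

# True counts as 1, so the sum is the number of rows that pass
print(good.sum())         # 2

# Use the boolean array to keep only the rows that pass
selected = toy.loc[good, 'color']
print(selected.tolist())  # [1.1, 2.3]
```

Note that the row with `snr` equal to exactly 15 is excluded, because the comparison is strictly greater than.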

### 3.a.i. Add a similar line to create an array `rp_good` 

In [None]:
bp_good = df['phot_bp_mean_flux_over_error'] > 15

# Your code here

### 3.a.ii. Examine the arrays you've created. 

You can now select entries from the table using the boolean arrays. The code below will produce an array of `bp_rp` that contains every entry where `bp_good` is True (and excludes those where `bp_good` is False). Examine the output by plotting the data, to verify that the behavior is what you expect.

In [None]:
df['bp_rp'][bp_good]

Explain why your table and/or plot demonstrates the code is working.

<span style="color:blue">
    Your answer here
</span>

## 3.b. Combining boolean arrays

We can combine two boolean arrays. Below I've created a small example where you can explore this ability. 

In [None]:
# two boolean arrays to select based upon
bool1 = np.array([True, True, False, False, True])
bool2 = np.array([False, True, True, False, False])
# some arrays of values
number = [1, 2, 3, 4, 5]
letter = ['a', 'b', 'c', 'd', 'e']

# create an astropy table containing this information, then convert it to a pandas DataFrame
data = Table([bool1, bool2, number, letter], 
             names=['bool1','bool2','number','letter']).to_pandas()

# select only the data where bool1 is True:
data[bool1]

### 3.b.i. Select data from the table based on where `bool1` is True.

In [None]:
# Your code here

### 3.b.ii. Select data from the table based on where `bool1` AND `bool2` are True. 

You can combine boolean arrays like so: `bool1 & bool2` 

In [None]:
# Your code here

If instead you wanted the case where `bool1` OR `bool2` is True, you could use: `bool1 | bool2` 
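One detail that often trips people up: when you build the conditions inline rather than from pre-made arrays, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` in Python. A small sketch with a made-up table (the columns `a` and `b` are hypothetical):

```python
import pandas as pd

# Toy table (hypothetical values)
toy = pd.DataFrame({'a': [10, 20, 30, 40],
                    'b': [40, 30, 20, 10]})

# AND: both conditions must hold (note the parentheses around each comparison)
both = (toy['a'] > 15) & (toy['b'] > 15)

# OR: at least one condition must hold
either = (toy['a'] > 15) | (toy['b'] > 15)

# NOT: invert a boolean array with ~
not_a = ~(toy['a'] > 15)

print(toy[both])
print(either.tolist())
print(not_a.tolist())
```

The Python keywords `and`, `or`, and `not` do not work element by element on arrays, which is why the symbols `&`, `|`, and `~` are used here.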

# 4. A better HR diagram

### 4.a. Make an HR diagram using only those stars for which both Bp and Rp have sufficient S/N.

In [None]:
# Your code here

### 4.b. Brainstorm ways to present the data differently.

What are some ways you could present these data differently? For example, you could use different plotting tools, like the ones we used together in class when looking at the star formation rate data.


<span style="color:blue">
    Your answer here
</span>

### 4.c. Make two different versions of this plot.

Pick two of your ideas from the previous part and make them! Then, comment: What did you want to convey with the idea? Do you think it worked? 

In [None]:
# Your code here

<span style="color:blue">
    Your answer here
</span>