# ExoStat Lab 08: Probability of another Earth

**Administrative details:**

- This Lab will be turned in for credit.

- Some questions in this lab are the same as the Practice questions found on the main [YData website](http://ydata123.org/sp19/).

- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [Canvas site](https://canvas.yale.edu).

**Deadline:**

This assignment is due Monday, April 1st at 11:59 P.M. Late work will not be accepted as per the course policies (see the Syllabus and Course policies on [Canvas](https://canvas.yale.edu)).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

#### Today's ExoStat Lab

1. Probability of Another Earth

**Submission:**

Submit your assignment both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

from sklearn.decomposition import PCA
from sklearn import preprocessing

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

We are going to go back to using the data on confirmed exoplanets from the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu). The file is called `confirmed_planets.csv`.  Run the cell below to load the dataset.

In [None]:
exoplanets = Table.read_table("confirmed_planets.csv", skiprows = 71) 
exoplanets

Estimating the probability of another Earth is a very challenging question.  Even defining what we mean by "another Earth" is difficult.  Sometimes people mean an Earth-size planet (e.g. with a mass and radius similar to Earth's) orbiting a Sun-like star.  Sometimes people mean more, such as a planet orbiting in the so-called [habitable zone](https://en.wikipedia.org/wiki/Circumstellar_habitable_zone), where the planet can support liquid water on its surface.

We are going to simplify the question a little bit: we will consider "Earth-like" to mean that the planet has a mass and a radius similar to those of Earth.

**Question 1.1.**  To begin, create a table that includes a column for planet mass (in Earth masses) and a column for planet radius (in Earth radii).  Note that you will have to convert the units.  You will likely need to search online to figure out how to convert from Jupiter masses and Jupiter radii to Earth masses and Earth radii.

You will also notice that there are some missing values (i.e. `nan`'s).  We will address this in the next question.
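As a rough illustration of the unit conversion, the sketch below rescales a few made-up values from Jupiter units to Earth units with NumPy. The conversion factors (roughly 317.8 Earth masses per Jupiter mass and roughly 11.2 Earth radii per Jupiter radius) are approximate, and the array names are hypothetical; they are not column names from `confirmed_planets.csv`.

```python
import numpy as np

# Approximate conversion factors (assumed values; look up a reference for more digits).
EARTH_MASSES_PER_JUPITER_MASS = 317.8
EARTH_RADII_PER_JUPITER_RADIUS = 11.2

# Made-up example values in Jupiter units (not taken from the dataset).
mass_jupiter = np.array([1.0, 0.003146, np.nan])
radius_jupiter = np.array([1.0, 0.0892, 0.5])

# Converting is a simple rescaling; nan entries stay nan.
mass_earth = mass_jupiter * EARTH_MASSES_PER_JUPITER_MASS
radius_earth = radius_jupiter * EARTH_RADII_PER_JUPITER_RADIUS

print(mass_earth)
print(radius_earth)
```

Because the conversion is a multiplication, any `nan` in the input carries through to the output unchanged.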

In [None]:
...

**Question 1.2.**  A number of observations are missing and appear as `nan`.  Dealing with missing data is a very tricky statistical issue.  In this lab, we are just going to remove the rows that have a missing mass or radius, but in practice you do not usually want to take this approach.

For this question, remove the `nan`'s.  You can develop your own approach for this, but one idea is to create a function that returns `False` if a `nan` is present in any of the input values, and `True` otherwise.  Then you can use `apply` to run the function on your Table.
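One possible shape for such a check is sketched below on a small NumPy array of made-up (mass, radius) rows rather than on the lab's Table. The function name `has_no_nan` and the toy data are hypothetical.

```python
import numpy as np

def has_no_nan(values):
    """Return True if none of the numeric values is nan, False otherwise."""
    return not np.any(np.isnan(values))

# Toy rows of (mass, radius); the numbers are made up.
rows = np.array([[1.0, 1.0],
                 [np.nan, 2.5],
                 [5.2, np.nan],
                 [0.8, 0.9]])

# One boolean per row: True means the row is complete and should be kept.
keep = np.array([has_no_nan(row) for row in rows])
clean_rows = rows[keep]

print(keep)
print(clean_rows)
```

Note that `nan == nan` is `False` in Python, which is why the sketch relies on `np.isnan` instead of an equality comparison.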

In [None]:
...

**Question 1.3.** Plot the remaining masses and radii on the `log` scale by using `plt.xscale('log')` and `plt.yscale('log')`.  Be sure to add axis labels.

In [None]:
...

**Question 1.4.**  We now need to figure out how many of the planets in our sample are "Earth-like" based on their mass and radius.  Let's create a function called `is_earthlike` that takes a mass and radius as the input and determines if the planet is "Earth-like" (`True`) or not (`False`).

Let's define "Earth-like" to be exoplanets that have a mass between 0.5 and 1.5 Earth masses, and a radius that is between 0.5 and 1.5 Earth radii.  
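A minimal sketch of this kind of range check is shown below. The helper names `within` and `both_within` are hypothetical, and treating the endpoints as included is an assumption, since the definition above only says "between".

```python
def within(value, low, high):
    """Return True if low <= value <= high (endpoints included by assumption)."""
    return low <= value <= high

def both_within(mass, radius, low=0.5, high=1.5):
    """Return True only if both mass and radius fall inside [low, high]."""
    return within(mass, low, high) and within(radius, low, high)

print(both_within(0.75, 1.25))
print(both_within(2.0, 1.0))
```

Comparisons against `nan` always evaluate to `False`, so a check written this way would classify a row with a missing value as not in range instead of raising an error.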

In [None]:
def is_earthlike(mass_rad):
    """Let mass_rad be an array with two components: mass and radius"""
    ...

#Use the below to check your function...try different values for
#mass and radius to be sure it works!
is_earthlike(make_array(.75, 1.25))

**Question 1.5.**  Apply your `is_earthlike` function to your table to get the indices for the rows with planets that are "Earth-like".  Then use `.where` on the output to see the values for the rows.  What proportion of exoplanets are "Earth-like" based on our definition?
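For the proportion itself, one common approach is to count the `True` values in a boolean array and divide by its length. The sketch below uses a made-up boolean array, not results from the exoplanet table.

```python
import numpy as np

# Made-up results of a True/False check applied to five rows.
flags = np.array([True, False, False, True, False])

# Proportion of True values: count them and divide by the number of rows.
proportion = np.count_nonzero(flags) / len(flags)

# np.mean treats True as 1 and False as 0, so it gives the same number.
same_proportion = np.mean(flags)

print(proportion, same_proportion)
```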

In [None]:
...

**Question 1.6.** The previous question gave us an estimate of the proportion of exoplanets in our sample that are "Earth-like" based on our definition.  Now we want to get a sense of the variability of this estimate through bootstrap sampling.  To begin, create a function that generates a sample (with replacement) of the rows from your mass-radius table (without missing values) of the same size as the input table.  Call this function `bootstrap_sample`.
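Sampling rows with replacement can be sketched in plain NumPy by drawing random row indices, as below. The toy array and the seed are assumptions made only so the example is small and repeatable; the `datascience` Table method `sample` also draws rows with replacement by default, and either route should work for this lab.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

# Toy (mass, radius) rows; the numbers are made up.
rows = np.array([[1.0, 1.0],
                 [17.1, 3.9],
                 [0.8, 0.9],
                 [317.8, 11.2]])

n_rows = rows.shape[0]

# Draw n_rows indices from 0..n_rows-1; repeats are allowed (with replacement).
idx = rng.integers(0, n_rows, size=n_rows)
resample = rows[idx]

print(idx)
print(resample)
```

Some original rows will typically appear more than once in the resample while others are left out, which is what makes each bootstrap sample slightly different.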

In [None]:
...

#This is added as a check (you need to add your table name):
bootstrap_sample(...) 

**Question 1.7.**  Next, create a function that generates a bootstrap sample and returns the proportion of planets that are "Earth-like" based on their mass and radius as defined above.  Call this function `get_proportion`.

In [None]:
def get_proportion(table):
    ...
    

#Use the code below for checking your function (add table name)
get_proportion(...)

**Question 1.8.** Now bring it all together to draw 5,000 bootstrap estimates of the proportion of exoplanets that are "Earth-like".  Create a function called `bootstrap_proportion` that takes a table with a column for mass and a column for radius as the first input, and the number of bootstrap replications as the second input.  Then run it to get your 5,000 bootstrap proportions.

In [None]:
def bootstrap_proportion(original_table, replications):
    """Returns an array of bootstrapped sample proportions:
    original_table: table containing the original sample
    replications: number of bootstrap samples
    """
    proportions = make_array()
    for i in np.arange(replications):
        ...
        
    return proportions

boot_sample = bootstrap_proportion(..., ...)

**Question 1.9.** Create a histogram of your `boot_sample`.

In [None]:
...

**Question 1.10.** Give a 95% bootstrap confidence interval for the proportion of exoplanets that are "Earth-like" based on our definition.
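One common way to form such an interval is the percentile method: take the 2.5th and 97.5th percentiles of the bootstrap proportions. The sketch below shows this on a made-up boolean array rather than on the exoplanet data, with a hypothetical seed and replication count.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded only so the sketch is repeatable

# Toy indicator array: True marks an item with the property of interest (made-up data).
flags = np.array([True] * 12 + [False] * 188)

replications = 2000
proportions = np.empty(replications)
for i in range(replications):
    # Resample the indicators with replacement and record the proportion of True values.
    resample = rng.choice(flags, size=flags.size, replace=True)
    proportions[i] = resample.mean()

# The middle 95% of the bootstrap proportions gives the percentile interval.
left, right = np.percentile(proportions, [2.5, 97.5])
print(left, right)
```

`np.percentile` interpolates between values by default, so its endpoints can differ slightly from percentile functions that always return an element of the array.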

In [None]:
left = ...
right = ...

make_array(left, right)

**Question 1.11.**  Did we actually get a reasonable estimate for the probability of another Earth?  What might make our estimate biased?

[Add your response here]

**Submission**: Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb. Then submit the two files through Canvas.