# ExoStat Lab 1: Confirmed Exoplanet Data

**Administrative details:**

- This Lab will be turned in for credit.

- Some questions in this lab are the same as the Practice 02 questions found on the main [YData website](http://ydata123.org/sp19/).  I highly recommend that you work on the Practice exercises each week, and especially the rest of Practice 02.  

- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or the instructor, or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [Canvas site](https://canvas.yale.edu).

**Deadline:**

This assignment is due Monday, January 28th at 11:59 P.M. Late work will not be accepted as per the course policies (see the Syllabus and Course policies on [Canvas](https://canvas.yale.edu)).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

#### Today's ExoStat Lab

In today's exercises, you'll learn how to:

1. Import code (modules)
2. Work with datasets in Python -- collections of data
3. Explore characteristics of confirmed exoplanets

**Submission:**

Submit your assignment both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

# 0.  Set up the environment

Run the cell below without changing it.  At this point we will just use the info in the cell below without worrying about what it means.

In [None]:
#Run this to set up your environment
from datascience import *
import numpy as np
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

# 1. Importing code

> What has been will be again,  
> what has been done will be done again;  
> there is nothing new under the sun.

Most programming involves work that is very similar to work that has been done before.  Since writing code is time-consuming, it's good to rely on others' published code when you can.  Rather than copy-pasting, Python allows us to **import** other code, creating a **module** that contains all of the names created by that code.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example. The `math` module is extremely useful in computing mathematical expressions in Python. 

Suppose we want to very accurately compute the area of a circle with radius 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module has `pi` defined for us:

In [None]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

`pi` is defined inside `math`, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:

    <module name>.<name>
    
In order to use a module at all, we must first write the statement `import <module name>`.  That statement creates a module object with things like `pi` in it and then assigns the name `math` to that module.  Above we have done that for `math`.
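
As a further illustration of the `<module name>.<name>` pattern, here is a small sketch that uses `math.tau` (the constant $2\pi$, available in Python 3.6 and later) to compute a circumference. The variable names are our own choices for this example.

```python
import math

radius = 5  # meters, the same radius as in the circle example above

# math.tau is the constant 2*pi; we reach it with <module name>.<name>
circumference = math.tau * radius
print(circumference)
```

The same dot notation works for every name a module defines, whether it is a constant or a function.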

**Question 1.1.** The module `math` also provides the name `e` for the base of the natural logarithm, which is roughly 2.72.  Compute $e^{\pi}-\pi$, giving it the name `near_twenty`.

In [None]:
near_twenty = ...
near_twenty

## 1.1. Importing functions

**Modules** can provide other named things, including **functions**.  For example, `math` provides the name `sin` for the sine function.  Having imported `math` already, we can write `math.sin(3)` to compute the sine of 3.  (Note that this sine function considers its argument to be in [radians](https://en.wikipedia.org/wiki/Radian), not degrees.  180 degrees are equivalent to $\pi$ radians.)
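
Because `math.sin` expects radians, an angle given in degrees needs converting first. A minimal sketch, using the standard `math.radians` function and a 30-degree angle chosen only for illustration:

```python
import math

angle_in_degrees = 30
angle_in_radians = math.radians(angle_in_degrees)  # 30 degrees is pi/6 radians

sine_of_thirty_degrees = math.sin(angle_in_radians)
print(sine_of_thirty_degrees)  # very close to 0.5
```

The result is not exactly 0.5 because computers store decimal numbers with limited precision.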

**Question 1.1.1.** A $\frac{\pi}{4}$-radian (45-degree) angle forms a right triangle with equal base and height, pictured below.  If the hypotenuse (the radius of the circle in the picture) is 1, then the height is $\sin(\frac{\pi}{4})$.  Compute that using `sin` and `pi` from the `math` module.  Give the result the name `sine_of_pi_over_four`.

<img src="http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif">
(Source: [Wolfram MathWorld](http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif))

In [None]:
sine_of_pi_over_four = ...
sine_of_pi_over_four

For your reference, here are some more examples of functions from the `math` module.

Note how different functions take different numbers of arguments. Often, the documentation of the module will provide information on how many arguments are required for each function.

In [None]:
# Calculating factorials.  5! = 5*4*3*2*1
math.factorial(5)

In [None]:
# Calculating logarithms (the logarithm of 8 in base 2).
# The result is 3 because 2 to the power of 3 is 8.
math.log(8, 2)

In [None]:
# Calculating square roots.
math.sqrt(5)

In [None]:
# Calculating cosines.
math.cos(math.pi)

There are many variations on how we can import functions from outside sources. For example, we can import just a specific function from a module, we can rename a module we import, and we can import every name from a whole module. 

In [None]:
#Importing just cos and pi from math.
#Notice that we don't have to write math. beforehand for cos and pi
from math import cos, pi
print(cos(pi))
#We do have to write it in front of other functions from math, though
math.log(pi)

In [None]:
#We can nickname math as something else, if we don't want to type math
import math as m
m.log(m.pi)

In [None]:
#Lastly, we can import everything from math
from math import *
log(pi)

##### A function that displays a picture
People have written Python functions that do very cool and complicated things, like crawling web pages for data, transforming videos, or doing machine learning with lots of data.  Now that you can import things, when you want to do something with code, first check to see if someone else has done it for you.

Let's see an example of a function that's used for downloading and displaying pictures.

The module `IPython.display` provides a function called `Image`.  The `Image` function takes a single argument, a string that is the URL of the image on the web.  It returns an *image* value that this Jupyter notebook understands how to display.  To display an image, make it the value of the last expression in a cell, just like you'd display a number or a string.

**Question 1.1.2.** In the next cell, import the module `IPython.display` and use its `Image` function to display the image at this URL:

    https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/David_-_The_Death_of_Socrates.jpg/1024px-David_-_The_Death_of_Socrates.jpg

Give the name `art` to the output of the call to `Image`.  (It might take a few seconds to load the image.  It's a painting called *The Death of Socrates* by Jacques-Louis David, depicting events from a philosophical text by Plato.)

*Hint*: A link isn't a special data type in Python. You can't just write a link into your code and expect it to work; you need to enter the link as a value of a specific data type. Which one makes the most sense?

In [None]:
# Import the module IPython.display. Watch out for capitalization.
import IPython.display
# Replace the ... with a call to the Image function
# in the IPython.display module, which should produce
# a picture.
art = ...
art

# 2. Arrays

Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

<img src="excel_array.jpg">

## 2.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. Execute the following cell so that all the names from the `datascience` module are available to you.

In [None]:
from datascience import *

Now, to create an array, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 2.1.1.** Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [None]:
small_numbers = ...
small_numbers

**Question 2.1.2.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  *Hint:* How did you get the values $\pi$ and $e$ earlier?  You can refer to them in exactly the same way here.

In [None]:
interesting_numbers = ...
interesting_numbers

**Question 2.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the things in the array are strings.

In [None]:
hello_world_components = ...
hello_world_components

The `join` method of a string takes an array of strings as its argument and puts all of the elements together into one string. Try it:

In [None]:
'/'.join(make_array('7', '1', '1997'))

**Question 2.1.4.** Assign `separator` to a string so that the name `hello` is bound to the string `'Hello, world!'` in the cell below.

In [None]:
separator = ...
hello = separator.join(hello_world_components)
hello

### 2.1.1.  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 6, 2)` is an array with elements 1, 3, and 5 -- it starts at 1 and counts up by 2, then stops before 6.  In other words, it's equivalent to `make_array(1, 3, 5)`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
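
A short sketch of the same rules with different numbers (the values here are arbitrary and chosen only for illustration):

```python
import numpy as np

# Start at 0, count up by 5, and stop before 20 is reached.
by_fives = np.arange(0, 20, 5)
print(by_fives)        # [ 0  5 10 15]
print(len(by_fives))   # 4
```

Notice that 20 itself is not in the result, even though it is a multiple of 5, because the stop value is never included.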

**Question 2.1.1.1.** Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [None]:
...
multiples_of_99 = ...
multiples_of_99

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 2.1.1.2.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.

In [None]:
collection_times = ...
collection_times

## 2.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the [US Census Bureau website](https://www.census.gov/popclock/).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that soon.

In [None]:
# Don't worry too much about what goes on in this cell.
from datascience import *
population = Table.read_table("world_population.csv").column("Population")
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [None]:
population.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
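
To make the counting convention concrete, here is a minimal sketch on a tiny made-up array. It builds the array with `np.array([...])`, which plays the same role here as `make_array(...)`; the name `letters` is hypothetical.

```python
import numpy as np

# A small made-up array, used only to illustrate indexing.
letters = np.array(["a", "b", "c", "d"])

first = letters.item(0)    # index 0: no elements come before it
fourth = letters.item(3)   # index 3: three elements come before it
print(first, fourth)       # a d
print(len(letters))        # 4, so the valid indices are 0 through 3
```

In general, an array with `n` elements has valid indices from 0 through `n - 1`.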

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [None]:
# The third element in the array is the population
# in 1952.
population_1952 = population.item(2)
population_1952

In [None]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population.item(12)
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population.item(65)
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population.item(66)
population_2016

In [None]:
# Since make_array returns an array, we can call .item(3)
# on its output to get its 4th element, just like we
# "chained" together calls to the method "replace" earlier.
make_array(-1, -3, 4, -2).item(3)

**Question 2.2.1.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population` using `item`.

In [None]:
population_1973 = ...
population_1973

## 2.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
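
A quick sketch of that behavior, using `math.log10` on a few round numbers picked for illustration:

```python
import math

# Each multiplication by 10 adds 1 to the base-10 logarithm.
for number in [1, 10, 100, 1000]:
    print(number, math.log10(number))

# A number between 1,000 and 10,000 has a logarithm between 3 and 4.
magnitude_of_2500 = math.log10(2500)
print(magnitude_of_2500)
```

So a logarithm between 3 and 4 tells us the number has four digits before the decimal point.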

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [None]:
import math

population_1950_magnitude = math.log10(population.item(0))
population_1951_magnitude = math.log10(population.item(1))
population_1952_magnitude = math.log10(population.item(2))
population_1953_magnitude = math.log10(population.item(3))
# Etc.

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

**Question 2.3.1.** Use it to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [None]:
population_magnitudes = ...
population_magnitudes

<img src="array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
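
As a sketch of elementwise application with a different function, `np.sqrt` takes the square root of each element separately. The array `areas` is made up for this example.

```python
import numpy as np

areas = np.array([1.0, 4.0, 9.0, 16.0])  # made-up values

# np.sqrt is applied separately to each element of the array.
side_lengths = np.sqrt(areas)
print(side_lengths)  # [1. 2. 3. 4.]
```

The result has the same length as the input, with one output element per input element.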

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [None]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="array_multiplication.jpg">

**Question 2.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip.  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`.

In [None]:
total_charges = ...
total_charges

**Question 2.3.3.** The file `more_restaurant_bills.csv` contains 100,000 bills!  Compute the total charge for each one.  

In [None]:
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = ...
more_total_charges

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).

**Question 2.3.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [None]:
sum_of_bills = ...
sum_of_bills

**Question 2.3.5.** The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

*Hint:* If your kernel breaks, you likely have the incorrect solution. Try restarting your kernel, rerunning the cells above, and trying a different solution!

In [None]:
powers_of_2 = ...
powers_of_2

# 3. Confirmed Exoplanet Data



In this section, we are going to begin looking at some real exoplanet data!  We'll be able to explore some characteristics of the exoplanets.

## 3.1.  The Data

Let's begin by reading in data on confirmed exoplanets.  The file is called `confirmed_planets.csv` and was collected from the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu).  You can explore the link and find other data sources and options to visualize the data directly on the website!

If we try to read in the data using the usual `Table.read_table` function with the name of the file as the input, you will see that an error appears:

In [None]:
exoplanets = Table.read_table("confirmed_planets.csv")
exoplanets

The problem is that the `confirmed_planets.csv` file contains a block of descriptive information in its first rows, before the table itself.  We just need to tell Python to skip those rows to get to the table we are interested in.  We can use the `skiprows` option and set it to the number of rows to skip. In this case there are 71 rows to leave out!

In [None]:
exoplanets = Table.read_table("confirmed_planets.csv", skiprows = 71)
exoplanets
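
To see what skipping rows accomplishes, here is a minimal sketch using only Python's standard `csv` module on a tiny made-up file held in memory. The file contents, the two note lines, and the column names are invented for illustration; this is not how `Table.read_table` is implemented, only the same idea on a small scale.

```python
import csv
import io

# A made-up file: two lines of notes, then the actual table.
raw_text = """# This file was produced for illustration
# COLUMN name: Planet name
name,radius
Planet A,1.2
Planet B,0.4
"""

rows_to_skip = 2
lines = raw_text.splitlines()[rows_to_skip:]   # drop the leading notes

reader = csv.DictReader(io.StringIO("\n".join(lines)))
table_rows = list(reader)
print(table_rows[0]["name"], table_rows[0]["radius"])  # Planet A 1.2
print(len(table_rows))                                  # 2
```

Without the skip, the first note line would be mistaken for the header row, which is the kind of confusion behind the error above.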

If you scroll through the table, you will see quite a few `nan`'s.  The cells that have `nan` are missing the value.  The value is likely missing because the detection method(s) used for that planet did not provide that information.  We'll be talking later in the semester about the exoplanet characteristics that can be determined by the different detection methods.

**Question 3.1.1.** How many confirmed exoplanets can be found in this Table?  Use Python code to calculate the number of rows of the `exoplanets` table.

In [None]:
exoplanets_number=...
exoplanets_number

**Question 3.1.2.**  How many columns does the `exoplanets` table have?

In [None]:
# Put answer below


The columns of `exoplanets` have names that may not be sufficiently descriptive.  Let's read in the first part of the original `confirmed_planets.csv` to see how the variables are defined.  In the `Table.read_table` function, we can specify the number of rows from the bottom to skip using `skipfooter`.

In [None]:
exoplanets_info = Table.read_table("confirmed_planets.csv", skipfooter = 3873, engine='python')
exoplanets_info.show()

If you are ever unsure about what the columns of our `exoplanets` table mean, you can come back to the `exoplanets_info` table.

Let's explore the data a bit more.  How about we try to figure out how many exoplanets were discovered by each method?  The column `pl_discmethod` specifies the discovery method:

In [None]:
exoplanets.select("pl_discmethod")

Great!  We can see all the different discovery methods now!  But how do we count the number for each method?  One approach is to use the Table method `.where()` and specify our criteria.  For example:

In [None]:
exoplanets.where("pl_discmethod", "Radial Velocity")

And then calculate the number of rows:

In [None]:
exoplanets.where("pl_discmethod", "Radial Velocity").num_rows

Try that for the other discovery methods.  Fortunately, there is an easy way for us to find all the unique values in an array using `np.unique()`.  Run the cell below to see all the detection methods that have led to discoveries of confirmed exoplanets.

In [None]:
#Run this cell
np.unique(exoplanets["pl_discmethod"])
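
For a sense of what `np.unique` returns, here is a minimal sketch on a small made-up array of labels with repeats. The array `labels` is hypothetical and is not read from the exoplanet file.

```python
import numpy as np

# Made-up labels with repeats, standing in for a column of a table.
labels = np.array(["Transit", "Imaging", "Transit", "Transit", "Imaging"])

unique_labels = np.unique(labels)
print(unique_labels)  # ['Imaging' 'Transit']
```

Each distinct value appears once in the result, and the values come back in sorted order.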

**Question 3.1.3.**  Calculate the number of exoplanets that have been discovered by each method using the `.where()` method and calculating the number of rows.  Also, verify that the sum of all these counts equals the total number of exoplanets.

In [None]:
# Put answer here


Okay...so that took a lot of typing.  There is a faster way we could have done this.  Let's go back a few cells and give a name to the array of the different methods:

In [None]:
detection_methods = np.unique(exoplanets["pl_discmethod"])
detection_methods

Now let's set up something called a "for loop."  For loops are great because Python does the repetitive work for us: it runs the same block of code once for each value in a sequence.  Let's look at a simple example of a for loop:

In [None]:
#Run this cell - but try to guess what is going to happen first
for i in np.arange(0,4,1):
    print(i**2)

In words: for each `i` in our array, which in the example above contains 0, 1, 2, and 3, Python prints the value of `i` squared.  We could have used any valid name instead of `i`.
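
A for loop is not limited to numbers. As a small sketch, the loop below steps through a made-up list of words, with a loop name chosen to describe what each value is:

```python
# A made-up list of words; the loop name "word" is our own choice.
for word in ["mass", "radius", "period"]:
    print(word, len(word))
```

The indented line runs three times, once for each word in the list.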

**Question 3.1.4.** In the code below, replace `i` with your favorite color.

In [None]:
# Replace "i" with your favorite color
for i in np.arange(0,4,1):
    print(i**2)

Now let's expand this example a bit: instead of printing the values, let's save them in a new array.  This is going to require a couple of new functions.  First, calling `make_array()` with no arguments makes an empty array, but we still need to assign it a name.  (Below we call the empty array `outcome`.)  

Then we have to figure out how to add values to our array.  It turns out `np.append()` does the trick!  In `np.append()`, the first input value is the name of the array we want to append a new value to (that means, we want to add new values to the end of the array).  Then the second input is the value we want added.  In the code below, we have `outcome = np.append(outcome, out)` which tells Python to append the quantity assigned to the variable `out` to the array `outcome`.  Notice we also have to assign `np.append(outcome, out)` to the original array name, `outcome`.

Try running the cell below and see what happens!

In [None]:
#Run this

#First we make an empty array
outcome = make_array()

#Then we fill the array with the following values
for i in np.arange(0,4,1):
    out = i**2
    outcome = np.append(outcome, out)

#Print the filled array
outcome
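
One detail is worth seeing on its own: `np.append` does not change the array you pass in. It returns a new array, which is why the result has to be assigned to a name. A minimal sketch with made-up values:

```python
import numpy as np

original = np.array([1, 2, 3])

# np.append builds and returns a new array; "original" is left unchanged.
extended = np.append(original, 4)
print(original)   # [1 2 3]
print(extended)   # [1 2 3 4]
```

Writing `outcome = np.append(outcome, out)` inside a loop reuses the same name for each new, longer array.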

Great!  Now we want to do something similar with our `detection_methods` array.  Instead of looping through a numerical index (like `np.arange(0,4,1)`), we want to loop through the methods in `detection_methods`.

**Question 3.1.5.** Update the code below so that the for loop iterates through `detection_methods` and calculates the number of exoplanets discovered by each method.

In [None]:
#This is our empty array
detection_counts = make_array()

#Update the '[?]' to do what the question is asking. 
for [?] in detection_methods:
    out = exoplanets.where("pl_discmethod", [?]).num_rows
    detection_counts = np.append(detection_counts, [?])

detection_counts

Now let's create a table with `detection_methods` as one column and `detection_counts` as another column.  We'll call the table `detection`.

In [None]:
#Run this cell
detection = Table().with_columns("Detection method", detection_methods,
                     "Number of detections", detection_counts)
detection

**Question 3.1.6.**  Since this is a small table, it is easy to pick out which detection methods have been the most successful.  But let's put some skills we learned last week into practice.  Sort the `detection` table so that the first row has the detection method that has discovered the *most* confirmed exoplanets.

In [None]:
# Put answer here


## 3.2.  Mass and radius

The mass and radius of exoplanets are two important characteristics, in part because they help us to define the type of planet.

Let's look at a scatterplot of the mass versus the radius of the exoplanets in our table.  Note that not all of the exoplanets have both a mass and a radius so we are not going to be able to plot all of the confirmed exoplanets.

For a quick plot, we can use the `table_name.scatter(var1, var2)` method to produce a scatterplot with `var1` on the horizontal axis and `var2` on the vertical axis:

In [None]:
exoplanets.scatter("pl_radj","pl_bmassj")

This is a nice start, but let's make the plot look a bit better.  We are going to use the module imported earlier with `import matplotlib.pyplot as plots`. The code is a little different: we use the `plots.plot()` function.  
Note:  There are many different ways to produce visualizations and plots in Python; we're just touching upon a couple here.

We are also going to change the axis labels to be something more descriptive.  The `'.'` in the plotting function tells Python to use points rather than lines.  Try running the code below with and without the `'.'` to see what happens.

In [None]:
plots.plot(exoplanets["pl_radj"], exoplanets["pl_bmassj"],'.')
plots.title('Planet Mass vs. Radius')
plots.ylabel('Planet Mass ($M_J$)')
plots.xlabel('Planet Radius ($R_J$)')

Now let's try something a little more complicated.  We are going to produce the same plot, but color the points according to the method of detection. 

First we will create an array of colors - one color for each detection method:

In [None]:
detection_colors = make_array("brown","orange","yellow","teal","blue","purple","cyan","green","red","pink") 
Table().with_columns("Detection method", detection_methods,
                     "Color", detection_colors)

In order to ensure that the correct color goes with the assigned detection method, we are going to create a for loop and plot the mass-radius values for each method separately.

Try to read over the cell below to figure out what each line of code is doing.

In [None]:
plots.figure(figsize=(15,10)) # Make the figure size a bit larger

# The for-loop plots Radius-Mass by detection method using the color specified above
for i in np.arange(len(detection_methods)):
    which_exoplanets = exoplanets.where("pl_discmethod", detection_methods[i])
    plots.plot(which_exoplanets["pl_radj"], which_exoplanets["pl_bmassj"],'.', color = detection_colors[i])

# Add the title and axis labels
plots.title('Planet Mass vs. Radius')
plots.ylabel('Planet Mass ($M_J$)')
plots.xlabel('Planet Radius ($R_J$)')

# The following adds a legend so we can easily see which color goes with each method
plots.legend(detection_methods)

Let's make one last tweak to the plot: changing the scale of the axes.  Astronomers often plot on log-scale axes, so we will use `plots.yscale('log')` and `plots.xscale('log')` to make the change.

In [None]:
plots.figure(figsize=(15,10)) # Make the figure size a bit larger
plots.yscale('log')
plots.xscale('log')

# The for-loop plots Radius-Mass by detection method using the color specified above
for i in np.arange(len(detection_methods)):
    which_exoplanets = exoplanets.where("pl_discmethod", detection_methods[i])
    plots.plot(which_exoplanets["pl_radj"], which_exoplanets["pl_bmassj"],'.', color = detection_colors[i])

# Add the title and axis labels
plots.title('Planet Mass vs. Radius')
plots.ylabel('Planet Mass ($M_J$)')
plots.xlabel('Planet Radius ($R_J$)')

# The following adds a legend so we can easily see which color goes with each method
plots.legend(detection_methods)

**Question 3.2.1.**  What do you notice about the plot above?  Which detection method has the most points plotted?  Are there any general patterns between mass and radius on the log-scale?  How would you describe the patterns?

[Put your answer here]

## 3.3 Data summaries

In this section we are going to try out some other ways to summarize or visualize data from our exoplanets table.

Sometimes we want to know how many times something happens in our data.  For example, we may want to know how many exoplanets are in systems with other confirmed exoplanets.  To determine this, we can use the Table method `group` (see [Section 7.1](https://www.inferentialthinking.com/chapters/07/1/Visualizing_Categorical_Distributions)).  The `group` method is especially useful for categorical data.  Below is an example of its use on our `exoplanets` table, where we want the counts over `pl_pnum` (the number of planets discovered in the system):

In [None]:
exoplanets.group("pl_pnum")
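
Conceptually, grouping tallies how many times each distinct value appears in a column. The sketch below shows that idea with `collections.Counter` from Python's standard library on a short made-up list. It is an analogy for what the counts mean, not the `datascience` implementation, and the name `planets_in_system` is hypothetical.

```python
from collections import Counter

# Made-up values standing in for one column of a table.
planets_in_system = [1, 2, 1, 3, 2, 1]

counts = Counter(planets_in_system)
for value in sorted(counts):
    print(value, counts[value])
```

Each printed line pairs a distinct value with the number of times it appeared, much like one row of a grouped table.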

**Question 3.3.1.**  There are 7 confirmed exoplanets in a system with 7 exoplanets and 8 confirmed exoplanets in a system with 8 exoplanets.  How many different planetary systems are these accounting for?  How many distinct 6-planet systems are there?

[Put your answer here]

**Question 3.3.2.** Previously we created a for loop to count how many planets were discovered by each method.  Instead, use the `group` method on the `exoplanets` table to do this.

In [None]:
# Put your answer here


We can also use a bar chart to visualize the distribution of the number of exoplanets in each system.  First we use the `group` method on the `exoplanets` table, and then use the `barh()` method on the grouped data. See the example below.

In [None]:
exoplanets.group('pl_pnum').barh('pl_pnum')

**Question 3.3.3.**  Create a bar chart to visualize the distribution of counts per discovery method.

In [None]:
# Put your answer here


# 4. Exploration

**4.1.** With this question, I want you to spend some time exploring the exoplanets data.  
The goal is for you to learn something that you think is interesting about the data.  This may mean you create a visualization of one of the column variables, or perhaps look into a summary of one or more of the values in the columns.  All that we require is that you write some code (it can be a couple lines or many lines) and a short description about what you found.  

In [None]:
### Put your code here

Explain what is interesting here.











**Submission:** Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb.  Then submit the two files through Canvas.