# ExoStat Lab 02: Detecting Exoplanets using the Transit Method

**Administrative details:**

- This Lab will be turned in for credit.

- Some questions in this lab are the same as the Practice 03 questions found on the main [YData website](http://ydata123.org/sp19/).

- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [Canvas site](https://canvas.yale.edu).

**Deadline:**

This assignment is due Monday, February 4th at 11:59 P.M. Late work will not be accepted as per the course policies (see the Syllabus and Course policies on [Canvas](https://canvas.yale.edu)).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

#### Today's ExoStat Lab

In today's exercises, you'll:

1. Get more practice with [Tables](http://www.inferentialthinking.com/chapters/06/tables.html)
2. Explore the Transit Method for detecting exoplanets

**Submission:**

Submit your assignment both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

## 1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (estimated by the US Census Bureau), and the second contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
#Run this to get your environment setup
from datascience import *
import numpy as np
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> Which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.
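
For comparison, the manual lookup described earlier (find the first position where the population passes a threshold, then read the same position in the years array) can be sketched with plain NumPy. The numbers below are made-up toy values, not the Census data, and the names `toy_years` and `toy_population` are hypothetical.

```python
import numpy as np

# Toy, made-up values (NOT the real Census estimates), in billions
toy_years = np.arange(2000, 2006)
toy_population = np.array([4.8, 5.1, 5.6, 5.9, 6.2, 6.5])

# Boolean array marking the positions where the threshold is exceeded
crossed = toy_population > 6.0

# np.argmax returns the index of the first True value
first_index = np.argmax(crossed)

# The two arrays share indices, so the same position gives the year
first_year = toy_years[first_index]
print(first_year)
```

One caveat of this sketch: `np.argmax` returns 0 when no value exceeds the threshold, so it is only safe when a crossing is known to exist. Keeping both columns in one table avoids this kind of index bookkeeping.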

## 2. Creating Tables

**Question 2.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can load the data from a file using one of the `Table` functions.

`Table.read_table` takes one argument, a path to a data file (a string) and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.

**Question 2.2.** The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [None]:
imdb = ...
imdb

Notice the part about "... (240 rows omitted)."  This table is big enough that only a few of its rows are displayed, but the others are still there.  10 are shown, so there are 250 movies total.

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
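
To make the CSV format concrete, here is a minimal sketch that parses a tiny made-up CSV document with Python's standard `csv` module. The contents of `toy_csv` are invented for illustration and are not taken from `imdb.csv`.

```python
import csv
import io

# A tiny, made-up CSV document held in a string instead of a file
toy_csv = "Title,Year,Rating\nMovie A,1994,9.2\nMovie B,1972,9.1\n"

# csv.reader splits each line at the commas
reader = csv.reader(io.StringIO(toy_csv))
header = next(reader)   # the first line holds the column labels
rows = list(reader)     # every remaining line is one row of data

print(header)
print(rows)
```

Note that the standard `csv` module returns every value as a string; converting columns to numbers is a separate step.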

## 3. Using lists

A *list* is another Python sequence type, similar to an array. It's different from an array because the values that it contains can all have different types. A single list can contain `int` values, `float` values, and `str` (string) values. Elements in a list can even be other lists! A list is created by giving a name to the list of values enclosed in square brackets and separated by commas. For example, `values_with_different_types = ['data', 8, ['lab', 3]]`.

Lists can be useful when working with tables because they can describe the contents of one row in a table, which often  corresponds to a sequence of values with different types. A list of lists can be used to describe multiple rows.

Each column in a table is a collection of values with the same type (an array). If you create a table column from a list, it will automatically be converted to an array. A row, on the other hand, mixes types.
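
The difference between a list and an array can be seen directly in NumPy. This is a small sketch using `np.array` rather than the `datascience` helpers; the variable names are hypothetical.

```python
import numpy as np

# A list keeps each element's own type
mixed_list = [8, 'lotus']
print(type(mixed_list[0]), type(mixed_list[1]))

# Converting to an array forces a single common type:
# here the integer 8 becomes the string '8'
as_array = np.array(mixed_list)
print(as_array)
print(as_array.dtype.kind)   # 'U' means unicode string
```

This is why a row, which mixes a number with a name, is better described by a list than by an array.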

Here's a table from Chapter 5. (Run the cell below.)

In [None]:
# Run this cell to recreate the table
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)
flowers

**Question 3.1.** Assign `my_flower` to a list that describes a new fourth row of this table. The details can be whatever you want, but **`my_flower` must contain two values: the number of petals (an `int` value) and the name of the flower (a `string`).** For example, your flower could be "pondweed"! (A flower with zero petals)

In [None]:
my_flower = ...
my_flower

**Question 3.2.** `my_flower` fits right into the table from Chapter 5. Complete the cell below to create a table of seven flowers that includes your flower as the fourth row followed by `other_flowers` as the last three rows. You can use `with_row` to create a new table with one extra row by passing a list of values and `with_rows` to create a table with multiple extra rows by passing a list of lists of values.

In [None]:
# Use the method .with_row(...) to create a new table that includes my_flower 

four_flowers = ...

# Use the method .with_rows(...) to create a table that 
# includes four_flowers followed by other_flowers

other_flowers = [[10, 'lavender'], [3, 'birds of paradise'], [6, 'tulip']]

seven_flowers = ...
seven_flowers

## 4. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can get an array that contains the data in that column:

In [None]:
imdb.column("Rating")

The value of that expression is an array, exactly the same kind of thing you'd get if you typed in `make_array(8.4, 8.3, 8.3, etc...)`.

**Question 4.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Think back to the functions you've learned about for working with arrays of numbers.  Ask for help if you can't remember one that's useful for this.

In [None]:
highest_rating = ...
highest_rating

That's not very useful, though.  You'd probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the entire table by rating, which ensures that the ratings and titles will stay together. Note that calling sort creates a copy of the table and leaves the original table unsorted.

In [None]:
imdb.sort("Rating")

Well, that actually doesn't help much, either -- we sorted the movies from lowest -> highest ratings.  To look at the highest-rated movies, sort in reverse order:

In [None]:
imdb.sort("Rating", descending=True)

(The `descending=True` bit is called an *optional argument*. It has a default value of `False`, so when you explicitly tell the function `descending=True`, then the function will sort in descending order.)

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has strings in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort("Rating")` is a *copy of `imdb`*; the `imdb` table doesn't get modified. For example, if we called `imdb.sort("Rating")`, then running `imdb` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
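
The idea that rows stick together and that the original is left unmodified can be sketched with NumPy's `argsort`. This is only an illustration of the concept on made-up data, not a description of how the `datascience` library implements `sort`.

```python
import numpy as np

# Made-up data: two parallel "columns"
titles = np.array(['Movie A', 'Movie B', 'Movie C'])
ratings = np.array([8.5, 9.2, 8.9])

# argsort gives the row order that would sort the ratings ascending
order = np.argsort(ratings)

# Applying the SAME order to every column keeps rows together
sorted_titles = titles[order]
sorted_ratings = ratings[order]

# Reversing the order gives a descending sort
descending_titles = titles[order[::-1]]

print(sorted_titles, sorted_ratings)
print(descending_titles)
print(ratings)   # the original array is unchanged
```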

**Question 4.2.** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [None]:
imdb_by_year = ...
imdb_by_year

**Question 4.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `imdb_by_year`, extract the Title column to get an array, then use `item` to get its first item.

In [None]:
earliest_movie_title = ...
earliest_movie_title

## 5. Finding pieces of a dataset
Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we use the table method `where`.

In [None]:
forties = imdb.where('Decade', are.equal_to(1940))
forties

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`forties`** to a table whose rows are the rows in the **`imdb`** table **`where`** the **`'Decade'`**s **`are` `equal` `to` `1940`**.

**Question 5.1.** Compute the average rating of movies from the 1940s.

*Hint:* The function `np.average` computes the average of an array of numbers.

In [None]:
average_rating_in_forties = ...
average_rating_in_forties

Now let's dive into the details a bit more.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. Something that describes the criterion that the column needs to meet, called a predicate.

To create our predicate, we called the function `are.equal_to` with the value we wanted, 1940.  We'll see other predicates soon.

`where` returns a table that's a copy of the original table, but with only the rows that meet the given predicate.

**Question 5.2.** Create a table called `ninety_nine` containing the movies that came out in the year 1999.  Use `where`.

In [None]:
ninety_nine = ...
ninety_nine

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other predicates.  Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|

The textbook section on selecting rows has more examples.
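
One way to think about predicates is as functions that take a value and answer `True` or `False`. The sketch below mimics that idea in plain Python. The helpers `above`, `between`, and `where` are hypothetical stand-ins written for this example and are not the `datascience` implementations.

```python
def above(threshold):
    """Return a function that tests whether a value is strictly above threshold."""
    return lambda value: value > threshold

def between(low, high):
    """Return a function that tests low <= value < high (upper end excluded)."""
    return lambda value: low <= value < high

def where(rows, column, predicate):
    """Return a NEW list holding only the rows whose column value passes the predicate."""
    return [row for row in rows if predicate(row[column])]

# Made-up rows for illustration
toy_rows = [
    {'Title': 'Movie A', 'Rating': 8.5},
    {'Title': 'Movie B', 'Rating': 9.2},
    {'Title': 'Movie C', 'Rating': 8.9},
]

high = where(toy_rows, 'Rating', above(8.5))
middle = where(toy_rows, 'Rating', between(8.5, 9.2))
print([row['Title'] for row in high])
print([row['Title'] for row in middle])
```

Notice that `above(8.5)` excludes a rating of exactly 8.5, while `between(8.5, 9.2)` includes 8.5 but excludes 9.2, matching the descriptions in the table above.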


**Question 5.3.** Using `where` and one of the predicates from the table above, find all the movies with a rating higher than 8.5.  Assign this filtered table to the name `really_highly_rated`.

In [None]:
really_highly_rated = ...
really_highly_rated

**Question 5.4.** Find the average rating for movies released before the year 2000 and the average rating for movies released in the year 2000 or after for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [None]:
before_2000 = ...
after_or_in_2000 = ...
print("Average before 2000 rating:", before_2000)
print("Average after or in 2000 rating:", after_or_in_2000)

The property `num_rows` tells you how many rows are in a table.  (A "property" is just a method that doesn't need to be called by adding parentheses.)

In [None]:
num_movies_in_dataset = imdb.num_rows
num_movies_in_dataset

**Question 5.5.** Use `num_rows` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 1900's, and the *proportion* of movies in the dataset that were released in the 2000's.

*Hint:* The *proportion* of movies released in the 1900's is the *number* of movies released in the 1900's, divided by the *total number* of movies.

In [None]:
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

**Question 5.6.** Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `make_array(5, 6, 7) % 2` is `array([1, 0, 1])`.

*Hint 3:* Create a column called "Year Remainder" that's the remainder when each movie's release year is divided by 2.  Make a copy of `imdb` that includes that column.  Then use `where` to find rows where that new column is equal to 0.  Then use `num_rows` to count the number of such rows.
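
As a small illustration of the elementwise `%` operator from the hints, here is a sketch on a made-up array of years (not the `imdb` data). It uses plain NumPy rather than table methods.

```python
import numpy as np

# Made-up years for illustration
toy_years = np.array([1994, 1972, 2001, 2008, 1957])

# % is applied to every element: 0 for even years, 1 for odd years
remainders = toy_years % 2
print(remainders)

# Count how many remainders are equal to 0
num_even = np.count_nonzero(remainders == 0)
print(num_even)
```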

In [None]:
num_even_year_movies = ...
num_even_year_movies

**Question 5.7.** Check out the `population` table from the introduction to this lab.  Compute the year when the world population first went above 6 billion.

In [None]:
# Run this cell to display the population table.
population

In [None]:
year_population_crossed_6_billion = ...
year_population_crossed_6_billion

## 6. Miscellanea
There are a few more table methods you'll need to fill out your toolbox.  The first 3 have to do with manipulating the columns in a table.

The table `farmers_markets.csv` contains data on farmers' markets in the United States (data collected [by the USDA](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx)).  Each row represents one such market.

**Question 6.1.** Load the dataset into a table.  Call it `farmers_markets`.

In [None]:
farmers_markets = ...
farmers_markets

You'll notice that it has a large number of columns in it!

### `num_columns`

**Question 6.2.** The table property `num_columns` (example call: `tbl.num_columns`) produces the number of columns in a table.  Use it to find the number of columns in our farmers' markets dataset.

In [None]:
num_farmers_markets_columns = ...
print("The table has", num_farmers_markets_columns, "columns in it!")

### `select`

Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that stuff, it just makes the table difficult to read.  This comes up more than you might think.

In such situations, we can use the table method `select` to choose only the columns that we want in a particular table. It takes any number of arguments. Each should be the name or index of a column in the table. It returns a new table with only those columns in it.

For example, the value of `imdb.select("Year", "Decade")` is a table with only the years and decades of each movie in `imdb`.

**Question 6.3.** Use `select` to create a table with only the name, city, state, latitude ('y'), and longitude ('x') of each market.  Call that new table `farmers_markets_locations`.

In [None]:
farmers_markets_locations = ...
farmers_markets_locations

### `select` is not the same as `column`!

The method `select` is **definitely not** the same as the method `column`.

`farmers_markets.column('y')` is an **array** of the latitudes of all the markets.  `farmers_markets.select('y')` is a **table** that happens to contain only 1 column, the latitudes of all the markets.

**Question 6.4.** Below, we tried using the function `np.average` to find the average latitude ('y') and average longitude ('x') of the farmers' markets in the table, but we messed something up.  Run the cell to see the (somewhat inscrutable) error message that results from calling `np.average` on a table.  Then, fix our code.

In [None]:
average_latitude = np.average(farmers_markets.select('y'))
average_longitude = np.average(farmers_markets.select('x'))
print("The average of US farmers' markets' coordinates is located at (", average_latitude, ",", average_longitude, ")")

### `drop`

`drop` serves the same purpose as `select`, but it takes away the columns that you provide rather than the ones that you don't provide.

**Question 6.5.** Suppose you just didn't want the "FMID" or "updateTime" columns in `farmers_markets`.  Create a table that's a copy of `farmers_markets` but doesn't include those columns.  Call that table `farmers_markets_without_fmid`.

In [None]:
farmers_markets_without_fmid = ...
farmers_markets_without_fmid

### `take`
Let's find the 5 northernmost farmers' markets in the US.  You already know how to sort by latitude ('y'), but we haven't seen how to get the first 5 rows of a table.  That's what `take` is for.

The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a new **table** with only those rows.

Most often you'll want to use `take` in conjunction with `np.arange` to take the first few rows of a table.
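
The combination of an index array and `np.arange` can be sketched with plain NumPy indexing. The latitudes below are made-up values, and this only illustrates the idea of selecting rows by position.

```python
import numpy as np

# Made-up latitudes for illustration
latitudes = np.array([41.3, 64.8, 25.7, 47.6, 61.2, 44.9])

# np.arange(3) is the array of the first three positions: 0, 1, 2
first_three = np.arange(3)
print(first_three)

# Sort positions from largest to smallest latitude, then keep the first three
order = np.argsort(latitudes)[::-1]
top_three = latitudes[order[first_three]]
print(top_three)
```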

**Question 6.6.** Make a table of the 5 northernmost farmers' markets in `farmers_markets_locations`.  Call it `northern_markets`.  (It should include the same columns as `farmers_markets_locations`.)

In [None]:
northern_markets = ...
northern_markets

**Question 6.7.** Make a table of the farmers' markets in New Haven, Connecticut.  (It should include the same columns as `farmers_markets_locations`.)

In [None]:
newhaven_markets = ...
newhaven_markets

Recognize any of them?

## 7. Summary

For your reference, here's a table of all the functions and methods we saw in this lab so far.

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|

<br/>

## 8. Transit Method:  getting the data

In this section, we are going to learn about transit data.  There is a useful Python module (remember, a module is a collection of functions) for loading and analyzing exoplanet transit data called "lightkurve."  We are going to learn about this module by working through a [user guide](http://docs.lightkurve.org/tutorials/quickstart.html).  Some of the content of this section is taken directly from the guide, and the rest is information and questions designed especially for you!

Let's begin by importing the module.  This module has not been previously added to our computing cluster so we have to add an extra line of code `!pip install lightkurve` before importing the module.

In [None]:
!pip install lightkurve

Now that `lightkurve` is available to us, we can do the usual import.  Notice that this import statement brings in all the functions from `lightkurve`, so we can use them without writing the `lightkurve.` prefix first.

In [None]:
from lightkurve import *

### 8.1 Pixel-level data

Recall from the lecture that to collect data for detecting exoplanets using the transit method, the photometer repeatedly measures the brightness of the stars in its field of view.  For example, this image displays Kepler's field of view:

<img src="kepler_field.jpg">

The `lightkurve` module has a function that allows us to access the pixel-level data from Kepler.  The function is `KeplerTargetPixelFile` and the input is a link to the dataset.  The Kepler data is archived [here](https://archive.stsci.edu/pub/kepler/target_pixel_files/).  (We'll note later how you can also directly search the archived data using `lightkurve` functions.  For now let's just explore this selected dataset.)

In [None]:
tpf = KeplerTargetPixelFile("https://archive.stsci.edu/pub/kepler/target_pixel_files/0069/006922244/kplr006922244-2010078095331_lpd-targ.fits.gz")

The `tpf` has type `lightkurve.targetpixelfile.KeplerTargetPixelFile` and contains useful information.  Run the cells below and see the sorts of properties you can grab from `tpf`.

In [None]:
# type
type(tpf)

In [None]:
# Mission
tpf.mission

In [None]:
# Quarter
tpf.quarter

Let's look at the flux.  `tpf.flux` contains the pixel data, and we can check its shape by running the following cell.

In [None]:
# Shape of the flux data
tpf.flux.shape

We can understand this shape as a sequence of 4116 pixel images.  We can look at one of the images by running the following cell.  You can change the `frame_num` to look at the flux image at a different time point.

In [None]:
frame_num = 0
tpf.plot(frame=frame_num)

The flux values can be displayed as well:

In [None]:
frame_num = 0
tpf.flux[frame_num]

So what do the 4116 frames correspond to?  They are the different times at which the measurements were taken.  The times are given in the Kepler-specific Barycentric Kepler Julian Day (BKJD) format.

In [None]:
tpf.time

The time can also be converted into [AstroPy Time objects](http://docs.astropy.org/en/stable/time/), which make it easier to change into other time units.

In [None]:
tpf.astropy_time

This, in turn, gives you access to human-readable ISO timestamps through the `astropy_time.iso` property.  Notice that we first access `astropy_time` and then grab the `iso` property from it.

In [None]:
# These timestamps are in the Solar System Barycentric frame (TDB) 
# ...that is, they do not include corrections for light travel time or leap seconds
tpf.astropy_time.iso
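
For intuition about the BKJD values, BKJD is the Barycentric Julian Date minus 2454833.0. The sketch below converts a BKJD value to an approximate calendar date using only the standard library. The helper name `bkjd_to_datetime` and the example value are hypothetical, and the conversion ignores the small differences between time scales (such as TDB versus UTC), so treat the result as approximate; the `astropy_time` route above is the more careful one.

```python
from datetime import datetime, timedelta

BKJD_OFFSET = 2454833.0                  # BKJD = BJD - 2454833.0
J2000_JD = 2451545.0                     # Julian Date of 2000-01-01 12:00
J2000 = datetime(2000, 1, 1, 12, 0, 0)   # calendar date for J2000_JD

def bkjd_to_datetime(bkjd):
    """Approximate calendar date for a BKJD value (time-scale differences ignored)."""
    bjd = bkjd + BKJD_OFFSET
    return J2000 + timedelta(days=bjd - J2000_JD)

# Hypothetical example value, chosen for illustration
example = bkjd_to_datetime(352.376)
print(example.isoformat())
```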

Now that we have the basics of the pixel data down, let's get back to how to read in or find the data.  The above dataset used `KeplerTargetPixelFile` with a URL that linked to the data of interest.  There is another way to find data: the `search_targetpixelfile` function.

This function queries the [MAST data archive](https://archive.stsci.edu/kepler/), where the Kepler and K2 data are stored.  With this function, you can specify the Kepler ID along with the quarter.  The cell below searches for Kepler ID 6922244 for Quarter 4, as we used above.

In [None]:
tpf = search_targetpixelfile(6922244, quarter=4).download()

In [None]:
tpf.flux.shape

Though outdated, a list of some of the Kepler IDs can be found [here](https://archive.stsci.edu/kepler/planet_candidates.html).  Notice that Kepler ID 6922244 is for exoplanet [Kepler-8b](https://en.wikipedia.org/wiki/Kepler-8b), which was one of the first five planets confirmed by the Kepler mission.  It has a mass of about 0.603 M$_J$, a radius of about 1.419 R$_J$, and a semimajor axis of 0.0483 AU (a close-in orbit!).


Next we want to turn the pixel data into a light curve.  Recall that the light curve is going to have a brightness measure for the star across time.  Looking at one of the pixel images, we see that several of the pixels are collecting some light from the star, so we need to sum across the pixels to get the total brightness.  (The "point-spread function" or PSF of the telescope is what leads to this spreading out of the light from the star.)  However, when we sum across the pixels we want to be sure we only use the pixels that are thought to be related to the target star.  So we sum over all the pixels in an "aperture."  The aperture is a mask that tells us which are the good pixels.  The aperture mask is provided through the Kepler pipeline, and we can view the aperture mask by running the following:

In [None]:
# aperture mask
tpf.pipeline_mask

The pixels labeled `False` are ones we do not want to use and the pixels labeled `True` we do want to use.

**Question 8.1.1.**  Why do we not want to just sum across all the pixels in the image?  In what way could we get incorrect or misleading results?  *Hint: think about other things that might appear in neighboring pixels that are not associated with the target star.*

**[Put your answer here]**

We can plot the aperture mask as well:

In [None]:
tpf.plot(aperture_mask=tpf.pipeline_mask)

More information about the data in `tpf` can be revealed using `.header`:

In [None]:
tpf.header[:30]

Let's move on to calculating the light curves.  We could go through the pixel data at each time stamp and add up the pixel fluxes that are kept by the aperture mask.  This would simply be a matter of grabbing the good pixels by multiplying the mask by the pixel image:

In [None]:
frame_num = 0
tpf_mask = tpf.pipeline_mask*tpf.flux[frame_num]
tpf_mask

Notice that there is a `nan`, along with zeros where the aperture mask specifies bad pixels.  Now we just need to sum over the good pixels to get our total brightness.  We can locate the `nan` using the `np.isnan` function in the NumPy module.  Then, since we want the good pixels to be `True`, we can use `np.logical_not` to convert `False` to `True` and `True` to `False`.  Finally, we keep only the non-`nan` pixels.  Note that the "bad" pixels as specified by the aperture were set to 0.

In [None]:
# Find nan
tpf_nan = np.isnan(tpf_mask)
tpf_nan

In [None]:
# Change the True to False, and False to True
tpf_good_pixels = np.logical_not(tpf_nan)
tpf_good_pixels

In [None]:
# Remove the nan
tpf_mask = tpf_mask[tpf_good_pixels]
tpf_mask

In [None]:
#  We could do the previous steps together in one line of code:
tpf_mask = tpf_mask[np.logical_not(np.isnan(tpf_mask))]
tpf_mask

Now we want to find the total flux for this image.  All we need to do is sum over all the designated pixels now:

In [None]:
sum(tpf_mask)
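
The same steps can be tried on a small made-up frame where the answer is easy to check by hand. The arrays `toy_frame` and `toy_mask` below are invented for illustration and are not Kepler data. The sketch also shows an alternative route (boolean indexing followed by `np.nansum`) that gives the same total here.

```python
import numpy as np

# Made-up 3x3 "pixel image" with two missing (nan) pixels
toy_frame = np.array([[np.nan, 2.0, 1.0],
                      [3.0, 50.0, 4.0],
                      [1.0, 5.0, np.nan]])

# Made-up aperture mask: True marks pixels belonging to the target
toy_mask = np.array([[False, True, False],
                     [True, True, True],
                     [False, True, False]])

# Approach used above: multiply by the mask, then drop the nan values.
# Note that nan * False is still nan, which is why the nan survives.
product = toy_mask * toy_frame
total_a = np.sum(product[np.logical_not(np.isnan(product))])

# Alternative: keep only the masked pixels, then sum while ignoring nan
total_b = np.nansum(toy_frame[toy_mask])

print(total_a, total_b)
```

Both totals equal 2 + 3 + 50 + 4 + 5 = 64, the sum of the five pixels inside the aperture.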

Fortunately we don't have to do all those steps for each time stamp.  Instead there is a function in the `lightkurve` module that does it for us!  The cell below turns our `tpf` into a light curve and uses the specified aperture mask:

In [None]:
lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)

Let's compare `lc` to what we calculated above:

In [None]:
print(lc.flux[0])
print(sum(tpf_mask))

The `lc.time` and `lc.flux` are going to be the important variables for us.  We can make a plot of the light curve using the `plot` method for `lc`.

In [None]:
lc.plot()

Notice that the plot above is not flat, but has a decreasing trend. We would like to remove this.  There are different options in the `lightkurve` module for processing the data, but here we will use the `flatten()` method to remove this trend.  The `window_length` argument specifies the length of the filter window. Note that it must be a positive odd integer.

In [None]:
flat_lc = lc.flatten(window_length=401)
flat_lc.plot()
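
To see what detrending does, here is a simplified sketch on synthetic data. It estimates the slow trend with a centered running mean and divides it out. This is a stand-in chosen for simplicity: the `lightkurve` documentation describes `flatten` as using a Savitzky-Golay filter, which this sketch does not reproduce. All values and names below are made up.

```python
import numpy as np

# Synthetic light curve: a slow downward trend plus a 1% transit-like dip
time = np.linspace(0, 10, 501)
flux = 1.0 - 0.01 * time
flux[250:256] *= 0.99

# An odd window length gives a window centered on each point
window_length = 101
half = window_length // 2

# Pad the ends with the edge values so the output has the same length
padded = np.pad(flux, half, mode='edge')
kernel = np.ones(window_length) / window_length
trend = np.convolve(padded, kernel, mode='valid')

# Dividing by the trend leaves a curve that hovers around 1
flat_flux = flux / trend
print(np.median(flat_flux), flat_flux[252])
```

The window needs to be much longer than the transit; otherwise the trend estimate follows the dip itself and dividing by it partially erases the transit.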

This shows us the time series, but we will want to have a good picture of the transit.  To get a clear transit, we can "fold" the light curve at the exoplanet's orbital period so that the transits get stacked.  Note that this requires us to know the period in advance.

In [None]:
folded_lc = flat_lc.fold(period=3.5225)
folded_lc.plot()
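
Folding amounts to replacing each time stamp by its position within the orbit. Here is a minimal NumPy sketch of that arithmetic. The helper `fold_phase` and the reference time `t0` are hypothetical, and the exact conventions used by `lightkurve` (for example, whether the folded axis is reported as a phase or in days) may differ between versions.

```python
import numpy as np

def fold_phase(time, period, t0=0.0):
    """Map times to orbital phase in [-0.5, 0.5), with phase 0 at time t0."""
    return ((time - t0 + 0.5 * period) % period) / period - 0.5

period = 3.5225   # the period used in the cell above, in days
t0 = 1.0          # hypothetical mid-transit time, for illustration only

# Made-up time stamps covering several orbits
time = np.arange(0, 30, 0.5)
phase = fold_phase(time, period, t0)

# Times that are a whole number of periods apart land on the same phase
transit_times = t0 + period * np.arange(3)
print(fold_phase(transit_times, period, t0))
```

Because every transit maps to the same phase, plotting flux against phase stacks the transits on top of each other.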

The folded light curve above is a bit noisy, so we can bin our data to reduce some of the noise.

In [None]:
binned_lc = folded_lc.bin(binsize=10)
binned_lc.plot();

Arbitrary binning is not always a good idea.  Next we can plot several different binned folded light curves to compare the results.  Notice that binning over more data points can make the transit significantly shallower, which can affect the estimate of exoplanet parameters. So we want to be careful and only carry out the minimum binning needed.

In [None]:
binned_lc = folded_lc.bin(binsize=10)
binned_lc50 = folded_lc.bin(binsize=50)
binned_lc100 = folded_lc.bin(binsize=100)

plots.plot(binned_lc.time, binned_lc.flux)
plots.plot(binned_lc50.time, binned_lc50.flux)
plots.plot(binned_lc100.time, binned_lc100.flux)
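
The reason wide bins make the transit shallower can be checked on a noise-free synthetic example. The sketch below averages consecutive groups of points; `bin_mean` is a hypothetical helper written for this illustration and is not the `lightkurve` implementation.

```python
import numpy as np

# Synthetic folded light curve: flat at 1 with a 1% dip lasting 20 points
flux = np.ones(1000)
flux[490:510] = 0.99

def bin_mean(values, binsize):
    """Average consecutive groups of binsize points (leftover points are dropped)."""
    n_bins = len(values) // binsize
    return values[:n_bins * binsize].reshape(n_bins, binsize).mean(axis=1)

# Bins narrower than the transit preserve its depth
depth_10 = 1 - bin_mean(flux, 10).min()

# Bins wider than the transit mix in-transit and out-of-transit points
depth_100 = 1 - bin_mean(flux, 100).min()

print(depth_10, depth_100)
```

With bins of 10 points the measured depth stays at 0.01, while bins of 100 points dilute it to 0.001, because each wide bin contains only 10 in-transit points out of 100.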

**Question 8.1.2.**  We just went through a lot of background on the `lightkurve` module, so now let's put what we learned into practice!  For this question, pick out a Kepler exoplanet. You may need to do some online searching to find its Kepler ID; you will also want to search for the orbital period for the light curve folding.

In the cells below, include code for (i) importing the data, (ii) plotting the raw light curve, (iii) plotting the folded light curve, and (iv) plotting the binned folded light curve with at least 3 different bin sizes (all on the same plot).

In [None]:
# Import data here

In [None]:
# Plot the light curve here

In [None]:
# Plot the folded light curve here (use the orbital period you looked up for your planet)

In [None]:
# Plot the binned folded light curve here (try out at least 3 bin sizes and add them all to the plot)

**Question 8.1.3.**  Using your pixel data above, calculate the first brightness measurement of your light curve.  We did this previously before we learned that `tpf.to_lightcurve` could transform the target pixel file (tpf) into a light curve for us.  

Compare your calculation to the matching brightness value of your light curve.

In [None]:
# Put code here

## 9.  Transit method:  analyzing the transits

Next we are going to look at another set of data from NASA's [Kepler Mission](https://www.nasa.gov/mission_pages/kepler/main/index.html) and study the shape of the transit.  A nice reference for analyzing transits can be found in the [Transit Light Curve tutorial](https://www.cfa.harvard.edu/~avanderb/tutorial/tutorial.html) developed by Andrew Vanderburg.

Let's begin by loading in data we want to work with.  For this part of the lab, we will consider Kepler data from the planet [HAT-P-7b](https://en.wikipedia.org/wiki/HAT-P-7b) (also known as Kepler-2b).  The Kepler ID is 10666592.

**Question 9.1.** Read in the Target Pixel File for Kepler ID 10666592 for `quarter = 4`.

In [None]:
tpf2 = ...

**Question 9.2.** Plot `frame = 0` of the target pixel file.

In [None]:
...

**Question 9.3.** Convert the target pixel file into a light curve.  Be sure to include the pipeline mask!  Also plot the light curve.

In [None]:
lc2 = ...

**Question 9.4.** Detrend/flatten the light curve.  You can use a `window_length = 401`.

In [None]:
flat_lc2 = ...

**Question 9.5.** Next we want to create the folded light curve.  You can look up the orbital period [here](https://en.wikipedia.org/wiki/HAT-P-7b).

In [None]:
folded_lc2 = ...

**Question 9.6.**  Smooth out the folded light curve by binning.  You can choose the `binsize`.

In [None]:
binned_lc2 = ...

**Question 9.7.**  Estimate the radius of the planet.  You will have to look up the radius of the star and also figure out the depth of the transit by looking at the previous plot. (You may want to consider using `binned_lc2.flux` to determine the depth.) Convert the units to $R_J$ (Jupiter radius).
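
As a reminder of the arithmetic involved, the simplest transit model relates the fractional depth to the ratio of areas, depth = (R_planet / R_star)^2. The sketch below applies this to hypothetical numbers (a 1% depth and a star the size of the Sun), not to HAT-P-7b. The helper name is made up, the radii are the nominal solar and equatorial Jupiter values, and effects such as limb darkening are ignored.

```python
import math

R_SUN_KM = 695700.0   # nominal solar radius in km
R_JUP_KM = 71492.0    # nominal equatorial Jupiter radius in km

def planet_radius_rj(depth, r_star_rsun):
    """Planet radius in Jupiter radii from depth = (R_planet / R_star)**2."""
    r_planet_km = math.sqrt(depth) * r_star_rsun * R_SUN_KM
    return r_planet_km / R_JUP_KM

# Hypothetical inputs: 1% transit depth, star with 1 solar radius
example = planet_radius_rj(depth=0.01, r_star_rsun=1.0)
print(round(example, 3))
```

For your own estimate, the depth would come from your binned light curve and the stellar radius from a reference you look up.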

In [None]:
...

**Submission:** Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb.  Then submit the two files through Canvas.