# Using Deep Learning to Find Hot Jupiters

## 1 Find Training Set
### Uncleaned Dataset (Given by DSECOP Tutorials)

In [2]:
import pandas as pd
exoplanets = pd.read_csv('Data/NASAExoplanetsData.csv')
exoplanets.head()

Unnamed: 0.1,Unnamed: 0,loc_rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,label
0,1,2,10666592,K00002.01,Kepler-2 b,CONFIRMED,CANDIDATE,,0,1,...,4.021,0.011,-0.011,1.991,0.018,-0.018,292.24728,47.969521,10.463,1.0
1,3,4,3861595,K00004.01,Kepler-1658 b,CONFIRMED,CANDIDATE,,0,1,...,3.657,0.205,-0.107,2.992,0.469,-0.743,294.35654,38.94738,11.432,1.0
2,5,6,3248033,K00006.01,,FALSE POSITIVE,FALSE POSITIVE,,0,0,...,4.106,0.175,-0.152,1.58,0.415,-0.34,294.59955,38.366772,12.161,1.0
3,6,7,11853905,K00007.01,Kepler-4 b,CONFIRMED,CANDIDATE,,0,0,...,4.105,0.01,-0.01,1.533,0.04,-0.04,285.61533,50.13575,12.211,1.0
4,7,8,5903312,K00008.01,,FALSE POSITIVE,FALSE POSITIVE,,0,0,...,4.433,0.062,-0.156,0.985,0.187,-0.079,298.66101,41.13789,12.45,1.0


## 2 Find and Isolate our needed parameters

### Cleaned Dataset
I've isolated the values we will use as inputs for our training.

We do this because the given dataset includes parameters we don't need. To remedy this, I created a duplicate of my data file in Excel and kept only the parameters needed to determine whether an exoplanet is a Hot Jupiter or not. The needed variables are as follows:

- Orbital Period (koi_period)
    - Hot Jupiters typically have orbital periods of 2 to 10 Earth days. A short orbital period indicates that a planet is close to its host star, so this variable helps us find Hot Jupiters.<sup>1</sup>
    - In mathematical terms, the orbital period τ, as given by Kepler's Third Law, is $τ^2=\frac{4\pi^2μ}{k}a^3$ for a system with only one star and one planet. Here μ is the reduced mass of the star-planet pair (approximately the planet's mass when the star is much more massive), $k = Gm_1m_2$ is the force constant of the gravitational attraction between the two bodies, and a is the semi-major axis of the planet's orbit.<sup>8</sup>
    - Kepler's Third Law tells us that τ<sup>2</sup> is proportional to a<sup>3</sup>. Otherwise stated, the square of the planet's orbital period is proportional to the cube of the semi-major axis (half the longest diameter of the orbital ellipse). We also know, based on Kepler's Second Law, that an orbiting planet moves faster when it is near its host star than when it is far from it. Together, these help us interpret the planet's orbital period.
- Transit Duration (koi_duration)
    - The transit duration of an exoplanet can tell us the size of the exoplanet relative to the host star.<sup>2</sup><sup>12</sup> I believe this variable is likely used to either support or undermine the orbital period and planetary radius values.
- Planetary Radius (koi_prad)
    - Our own Jupiter is more than ten times the diameter of Earth.<sup>4</sup> We can assume this variable allows us to identify whether or not the exoplanet is a gas giant.
- Equilibrium Temperature (koi_teq)
    - Hot Jupiters have equilibrium temperatures of around 1500 K.<sup>5</sup> We can compare this variable to the average equilibrium temperature of the exoplanets that were marked as Hot Jupiters.
- Effective Stellar Temperature (koi_steff)
    - This variable is the temperature of a black body (an object that absorbs all electromagnetic radiation falling onto it<sup>6</sup>) that would radiate the same amount of electromagnetic energy as the star emits.<sup>7</sup> It could help us understand how much of the exoplanet's heat comes from its star. Jupiter is heated by both internal processes and by solar heat, so this variable may serve to check whether the star is hot enough, given the planet's distance from it, to heat the planet substantially in addition to its internal processes. It also goes hand-in-hand with Stellar Radius, since a star's radius and temperature together determine how much energy it radiates.
- Stellar Surface Gravity (koi_slogg)
    - The average surface gravity for a star is 2.992 log10(cm/s<sup>2</sup>).<sup>9</sup> The surface gravity of a host star can tell us about the star's mass and its diameter<sup>14</sup>, which can likely be used to support or undermine the other variables.
    - Mathematically, $W = mg$, where W is the weight force, m is the mass, and g is the gravitational field strength.<sup>13</sup> At the surface of a star of mass M and radius R, $g = GM/R^2$, which is why surface gravity links the star's mass and its size.
- Stellar Radius (koi_srad)
    - More massive stars tend to host larger planets, so we can assume that a larger host star makes it more likely that we have a Hot Jupiter on our hands.<sup>10</sup> Average stars have masses between 0.5 and 8 solar masses.<sup>11</sup>
- Type of Exoplanet (label)
    - This variable is straightforward, and tells us whether we have a Hot Jupiter (1) or not (0).
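As a rough illustration of the Kepler's Third Law point under Orbital Period, the sketch below converts an orbital period into a semi-major axis. It assumes the Newtonian form $a^3 = GMτ^2/(4\pi^2)$ with the planet's mass neglected next to the star's, and a host star of exactly one solar mass. The function name and the constants are illustrative and are not part of the dataset.

```python
import math

G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30         # solar mass, kg
AU = 1.496e11            # astronomical unit, m
SECONDS_PER_DAY = 86_400

def semi_major_axis_au(period_days, stellar_mass_kg=M_SUN):
    """Semi-major axis (AU) from Kepler's Third Law, neglecting the planet's mass."""
    period_s = period_days * SECONDS_PER_DAY
    a_cubed = G * stellar_mass_kg * period_s**2 / (4 * math.pi**2)
    return a_cubed ** (1 / 3) / AU

# Two Hot-Jupiter-like periods, plus one Earth year as a sanity check (about 1 AU)
for period in (2, 10, 365.25):
    print(f"{period:>7} days -> {semi_major_axis_au(period):.3f} AU")
```

Under these assumptions, periods of 2 to 10 days correspond to separations of less than 0.1 AU, far inside Mercury's orbit at roughly 0.39 AU, which is the sense in which a short period means "close to the host star".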

### Sources

1 Wang, J., Fischer, D. A., Horch, E. P., & Huang, X. (2014). On the Occurrence Rate of Hot Jupiters in Different Stellar Environments. ArXiv. https://doi.org/10.1088/0004-637X/799/2/229

2 (2019). What is a Transit? NASA Science. https://spaceplace.nasa.gov/transits/en/

4 Berry, Dana. (2022). Hot Jupiter. NASA. https://exoplanets.nasa.gov/resources/1040/hot-jupiter/ 

5 Baxter, Claire. Désert, Jean-Michel. Parmentier, Vivien. Line, Mike. Fortney, Jonathan. Arcangeli, Jacob. Bean, Jacob L. Todorov, Kamen O. Mansfield, Megan. (2020). A transition between the hot and ultra-hot Jupiter atmospheres. https://www.aanda.org/articles/aa/full_html/2020/07/aa37394-19/aa37394-19.html#:~:text=Hot%20Jupiters%20have%20equilibrium%20temperatures%20around%201500%20K.

6 Katsir, Dina. (2021). An In Depth Guide to Understanding Black Bodies. Acktar. https://acktar.com/an-in-depth-guide-to-understanding-black-bodies/#:~:text=A%20black%20body%2C%20also%20written,turn%2C%20no%20light%20is%20reflected.

7 Rouan, Daniel. (2011). Effective Temperature. Encyclopedia of Astrobiology. pp 479-480. https://link.springer.com/referenceworkentry/10.1007/978-3-642-11274-4_487#:~:text=Definition,as%20emitted%20by%20the%20star.

8 Thornton, Stephen T. Marion, Jerry B. (2004) Classical Dynamics of Particles and Systems. Thomson Learning.

9 Smalley, B. The Determination of *T*<sub>eff</sub> and log*g* for B to G stars. Keele University. https://www.astro.keele.ac.uk/bs/publs/review_text.html

10 Lozovsky, M. Helled, R. Pascucci, I. Dorn, C. Venturini, J. Feldmann, R. (2021). Why do more massive stars host larger planets? Astronomy & Astrophysics. https://www.aanda.org/articles/aa/full_html/2021/08/aa40563-21/aa40563-21.html

11 Nettles, Coralie. Chamberlain, Katie. (2021). Average Star Lifespan and Size. Study.com. https://study.com/academy/lesson/average-star-definition-life-cycle-quiz.html#:~:text=Stars%20with%20masses%20between%200.5%2D8%20solar%20masses%20are%20called,%2C%20red%20giant%2C%20white%20dwarf.

12 Vanderburg, Andrew. (2021). Transit Light Curve Tutorial. AVanderburg. https://avanderburg.github.io/tutorial/tutorial.html

13 Lesson Explainer: The Relationship between Mass and Weight. Nagwa. https://www.nagwa.com/en/explainers/919127989456/#:~:text=The%20weight%20force%20on%20the%20object%20is%20related%20to%20the,is%20the%20gravitational%20field%20strength

14 Croswell, Ken. (2013). Star's flicker reveals its surface gravity. PhysicsWorld. https://physicsworld.com/a/stars-flicker-reveals-its-surface-gravity/#:~:text=Surface%20gravity%20is%20important%20because,seen%20to%20be%20orbiting%20it.

In [4]:
exoplanets = pd.read_csv('Data/NASAExoplanetsDataCleaned-full.csv')
exoplanets.head()

Unnamed: 0.1,Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,1,2.204735,3.88216,16.39,2025,6350,4.021,1.991,1
1,3,3.849372,2.6605,13.1,2035,6244,3.657,2.992,1
2,5,1.334104,3.0142,50.73,2166,6178,4.106,1.58,1
3,6,3.213669,3.99355,4.14,1507,5781,4.105,1.533,1
4,7,1.160153,1.4127,2.0,1752,5842,4.433,0.985,1


### Split between a training set and a testing set
I split the data into two CSV files, one for training and one for testing, each with half Hot Jupiters and half non-Hot Jupiters.

We need to split the dataset so that the model is tested on data it has not seen during training. If we used the entire set for both training and testing, the result would tell us very little about the effectiveness of our program: the program could fit itself specifically to that set (overfitting) and then be unable to work accurately on a different dataset.
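The split here was made by hand into two files. As a sketch of how a similar split could be made in code, the function below shuffles each class separately and sends half of each to the test set, so both halves keep the same class balance. The function name, the toy rows, and the fixed seed are all hypothetical and are not how `LearnData.csv` and `TestData.csv` were produced.

```python
import random

def stratified_split(rows, label_index=-1, test_fraction=0.5, seed=0):
    """Split rows into (train, test), keeping each label's share the same in both."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]              # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy rows: (koi_period, koi_prad, label); the values are made up for illustration.
rows = [(2.2, 16.4, 1), (3.8, 13.1, 1), (1.3, 11.0, 1), (0.9, 12.5, 1),
        (45.0, 2.1, 0), (120.3, 1.4, 0), (9.7, 0.9, 0), (300.1, 3.3, 0)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which helps when comparing runs with different hyperparameters.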

In [6]:
exoplanets_train = pd.read_csv('Data/LearnData.csv')
exoplanets_train.head()

Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,2.204735,3.88216,16.39,2025,6350,4.021,1.991,1
1,3.849372,2.6605,13.1,2035,6244,3.657,2.992,1
2,1.334104,3.0142,50.73,2166,6178,4.106,1.58,1
3,3.213669,3.99355,4.14,1507,5781,4.105,1.533,1
4,1.160153,1.4127,2.0,1752,5842,4.433,0.985,1


In [7]:
exoplanets_test = pd.read_csv('Data/TestData.csv')
exoplanets_test.head()

Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,1.636689,1.353,11.55,2560,5234,3.436,3.739,1
1,0.616388,0.8228,2.9,3451,5667,3.625,3.049,1
2,0.895725,0.839,1.53,1786,5897,4.56,0.834,1
3,2.20922,2.73,0.88,1507,5991,4.375,1.092,1
4,0.519439,2.1631,40.61,2403,6177,4.462,1.011,1


## 3 Set Hyperparameters

| Hyperparameter | Description | Value |
| -------------- | ----------- | ----- |
| Learning Rate | Sets the size of each adjustment to our weights | $α = 0.03$ |
| Activation Function (from the input layer through hidden layer 2) | Decides whether a node should be fired or not | $g(z) = \tanh(z)$ (non-linear) |
| Activation Function (between final layer and output) | Decides whether a node should be fired or not | $g(z) = σ(z)$ (sigmoid, non-linear) |
| Hidden Layers | These are where our inner nodes are stored which take in the weighted inputs and produce an output | 3 |
| Nodes in Hidden Layers | The neurons within the hidden layers | [4, 3, 1] |
| Iterations | How many times the gradient descent program is run | 5000 |

**Note**: Fixing the number of iterations means we will not stop based on the value of our cost function; we will stop after the given number of iterations.

![simplified graphic of three layers in a neural network](Resources/Flowchart.png)
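To make the table concrete, here is a minimal NumPy sketch of a forward pass through a network with those layer sizes: `tanh` on the first two layers and the sigmoid on the last. It assumes 7 input features (the cleaned columns other than `label`), small random initial weights, and made-up input values. The function names are hypothetical, and this isn't the code used for the results later in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """Small random weights and zero biases for each consecutive pair of layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros((n_out, 1)))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params):
    """X has shape (n_features, m); returns probabilities of shape (1, m)."""
    a = X
    for W, b in params[:-1]:
        a = np.tanh(W @ a + b)        # first two layers use tanh
    W, b = params[-1]
    return sigmoid(W @ a + b)         # last layer uses the sigmoid

layer_sizes = [7, 4, 3, 1]            # 7 inputs, then the [4, 3, 1] nodes from the table
params = init_params(layer_sizes)
X = np.random.default_rng(1).normal(size=(7, 5))   # 5 made-up, standardized examples
probs = forward(X, params)
print(probs.shape)
```

Each column of `probs` is the network's estimated probability that the corresponding example is a Hot Jupiter. With untrained weights these values carry no information yet.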

## 4 Define the Loss and Cost Function

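Our Loss function is the Log-Likelihood Loss function, defined as the following:

$L(a^{(i)},y^{(i)})=-y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})$,

where $y^{(i)}$ is the label of the $i$-th training example and $a^{(i)}$ is the network's output (the predicted probability of a Hot Jupiter) for that example.

We define our Cost function as the average loss over all $m$ training examples:

$J(ω,b)=\frac{1}{m}\sum^m_{i=1}L(a^{(i)},y^{(i)})$.

Substituting the Log-Likelihood Loss function, our Cost function becomes:

$J(ω,b)=\frac{1}{m}\sum^m_{i=1}[-y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})]$

Gradient descent, applied in the next section, adjusts the weights $ω$ and biases $b$ to minimize this cost.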
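Before moving on, the Cost function defined in this section can be sketched in NumPy as follows. The function name and the clipping constant `eps` are assumptions added for illustration. The clipping only guards against taking the logarithm of exactly 0.

```python
import numpy as np

def log_loss_cost(a, y, eps=1e-12):
    """Mean log-likelihood loss J for predictions a and labels y (same shape)."""
    a = np.clip(a, eps, 1.0 - eps)    # avoid log(0)
    losses = -y * np.log(a) - (1.0 - y) * np.log(1.0 - a)
    return float(np.mean(losses))

y = np.array([1.0, 0.0, 1.0, 0.0])
print(log_loss_cost(np.array([0.9, 0.1, 0.8, 0.3]), y))   # confident and correct: low cost
print(log_loss_cost(np.array([0.5, 0.5, 0.5, 0.5]), y))   # uninformative: cost = ln 2
```

A model that always outputs 0.5 scores $\ln 2 \approx 0.693$, which is a useful baseline: a trained classifier should end up well below it.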


## 5 Generalize Gradient Descent method, Apply and Test

In [5]:
import numpy as np
from sklearn import linear_model

def read_in_dataset(file_loc):
    # Load a numeric CSV, skipping the single header row
    return np.genfromtxt(file_loc, dtype=float, delimiter=",", skip_header=1)

def split_features_labels(data):
    # The last column is "label"; every other column is an input feature
    return data[:, :-1], data[:, -1]

def main():
    n_iteration = 100_000
    # In SGDClassifier, alpha is the regularization strength, not the step size;
    # it is also used to compute the step size under the default "optimal" schedule.
    alpha = 0.0008

    num_planets, Y = split_features_labels(read_in_dataset("Data/LearnData.csv"))

    SGDClf = linear_model.SGDClassifier(loss="log_loss", alpha=alpha, max_iter=n_iteration)
    SGDClf.fit(num_planets, Y)

    test_data, testY = split_features_labels(read_in_dataset("Data/TestData.csv"))
    # let's predict and compare...how well do our predicted values perform?

    predictedY = SGDClf.predict(test_data)
    realY = testY

    # percentage of test rows whose predicted label matches the real label
    percentMatch = (predictedY == realY).sum()/float(predictedY.size) * 100
    print(percentMatch)

if __name__ == "__main__":
    main()

99.74704890387859


**Note**: As seen above, in the interest of time, we are not currently hard-coding the gradient descent program and are instead using a Scikit Learn method. The difference between what was planned in parts 3 and 4 and what is done here is that `SGDClassifier` with `loss="log_loss"` fits a linear model with no hidden layers, so we are likely not using the hyperbolic tangent as an activation function and may instead be using solely the Sigmoid function.
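For reference, a hard-coded gradient descent for the single-sigmoid case (no hidden layers) could look roughly like the sketch below. It borrows the learning rate of 0.03 and the 5000 iterations from the hyperparameter table, but it runs on synthetic two-feature data rather than the Kepler files. It also uses full-batch rather than stochastic updates, and all of its names are hypothetical, so it is neither the Scikit Learn implementation nor a reproduction of the result above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.03, n_iterations=5000):
    """Full-batch gradient descent on the mean log-likelihood loss."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iterations):
        a = sigmoid(X @ w + b)            # predicted probabilities
        grad_w = X.T @ (a - y) / m        # dJ/dw
        grad_b = np.mean(a - y)           # dJ/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic data: class 1 is centered at (+2, +2), class 0 at (-2, -2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])

w, b = fit_logistic(X, y)
accuracy = np.mean(predict(X, w, b) == y) * 100
print(f"training accuracy: {accuracy:.1f}%")
```

On the real features, which span very different scales (days, Kelvin, solar radii), the inputs would likely need to be standardized first for a fixed learning rate like this to behave well.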

## 6 Judgement of Classification without the use of Machine Learning - Example

Let's suppose we want to classify exoplanets as Hot Jupiters or not without the use of Machine Learning. This will help us understand the process by which we make this decision, and how machine learning accelerates this process.

Suppose we have the following exoplanet (from our set of exoplanets):

In [8]:
import csv
# utf-8-sig strips the byte-order mark (BOM) at the start of the file, if present
with open('Data/TestData.csv', newline='', encoding='utf-8-sig') as f:
    csv_reader = csv.reader(f)
    header = next(csv_reader)
    row = next(csv_reader)
    print(header)
    print(row)

['koi_period', 'koi_duration', 'koi_prad', 'koi_teq', 'koi_steff', 'koi_slogg', 'koi_srad', 'label']
['1.636689474', '1.353', '11.55', '2560', '5234', '3.436', '3.739', '1']


So we are given:

| Exoplanet variable | Value | Unit of Measure |
| ------------------ | ----- | ---------- |
| Orbital Period | 1.636689474 | Days |
| Transit Duration | 1.353 | Hours |
| Planetary Radius | 11.55 | Earth radii |
| Equilibrium Temperature | 2560 | K |
| Effective Stellar Temperature | 5234 | K |
| Stellar Surface Gravity | 3.436 | log10(cm/s<sup>2</sup>) |
| Stellar Radius | 3.739 | Solar radii |
| Classification | 1 | Binary classification |

Let's break this down, one variable at a time.
First, our Orbital Period of about 1.64 days is short, which is in line with what we know about Hot Jupiters. I would assume this is an important variable, as a much longer Orbital Period would tell me this exoplanet is not very close to its host star, and therefore that it isn't a **hot** Jupiter.

Next, we have the Transit Duration of the exoplanet. At 1.353 hours the transit is brief, which suggests that the exoplanet likely has a short orbital period and is likely quite close to its star. This supports what we know from above.

Next, we have the Planetary Radius. This exoplanet's radius is 11.55 times Earth's radius. The range of sizes for Hot Jupiters is large, but we can safely say this is a large exoplanet and therefore possibly a Gas Giant. This variable is also of great importance, as a relatively small exoplanet likely will not be a Hot Jupiter.

With an Equilibrium Temperature of 2560 K, well above the roughly 1500 K typical of Hot Jupiters, this exoplanet is definitely quite hot, which is a good sign for it being a **hot** Jupiter.

The Effective Stellar Temperature of this exoplanet's host star is 5234 K, somewhat cooler than our Sun (about 5772 K) but still hot, so we can theorize that the exoplanet is partially heated by its star. Our Jupiter is heated by both internal processes and by solar heat, so this variable may serve to check whether the star is hot enough, given the planet's distance from it, to heat the planet substantially in addition to its internal processes.

Next, we have Stellar Surface Gravity. This host star's surface gravity is large compared to the average for stellar surface gravity. This may support what we know about the star's radius, and perhaps clue us in to its mass. Given its large surface gravity and, as we will see shortly, its average radius, we can assume it has a decently sized mass.

Lastly, we consider the host star's stellar radius. This stellar radius is within the average range for stars. I expect, however, that this variable is not of the greatest importance, as a large planet does not always imply a large star: our own Jupiter is quite large, yet our own Sun is smaller than the listed host star.

Based on our analysis, we can safely hypothesize that this exoplanet is a Hot Jupiter, and we know this is correct as seen by its label of 1.
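As a rough cross-check of the reasoning above, the sketch below chains three relations for this one row: $g = GM/R^2$ to estimate the stellar mass, Kepler's Third Law to estimate the semi-major axis, and a standard equilibrium-temperature estimate $T_{eq} = T_{eff}\sqrt{R/(2a)}\,(1-A)^{1/4}$. The planet's mass is neglected, heat is assumed to be redistributed evenly over the planet, and the albedo $A = 0.3$ is an assumed value chosen for illustration. None of these function names or constants come from the dataset.

```python
import math

G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
R_SUN = 6.957e8          # solar radius, m
M_SUN = 1.989e30         # solar mass, kg

def stellar_mass_kg(log_g_cgs, radius_rsun):
    """Stellar mass from surface gravity, using g = G M / R^2."""
    g = 10 ** log_g_cgs / 100.0               # cm/s^2 -> m/s^2
    radius_m = radius_rsun * R_SUN
    return g * radius_m**2 / G

def semi_major_axis_m(period_days, mass_kg):
    """Kepler's Third Law, neglecting the planet's mass."""
    period_s = period_days * 86_400
    return (G * mass_kg * period_s**2 / (4 * math.pi**2)) ** (1 / 3)

def equilibrium_temperature(t_eff, radius_rsun, a_m, albedo=0.3):
    """Equilibrium temperature assuming uniform heat redistribution (albedo is assumed)."""
    return t_eff * math.sqrt(radius_rsun * R_SUN / (2 * a_m)) * (1 - albedo) ** 0.25

# Values taken from the example row in the table above
mass = stellar_mass_kg(3.436, 3.739)
a = semi_major_axis_m(1.636689474, mass)
t_eq = equilibrium_temperature(5234, 3.739, a)
print(f"M = {mass / M_SUN:.2f} M_sun, a = {a / R_SUN:.1f} R_sun, T_eq = {t_eq:.0f} K")
```

Under these assumptions the estimated equilibrium temperature lands within a few percent of the 2560 K listed for this exoplanet, so the period, stellar parameters, and temperature in the row are at least mutually consistent.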

## 7 Conclusion

We can use `SGDClf.predict()`, where `SGDClf` is the Stochastic Gradient Descent Classifier, to classify exoplanets in a dataset as Hot Jupiters or not. This can be used to automate and accelerate the classification of these exoplanets with a high rate of accuracy.

## 8 Goals for the Future

Future goals for this project include working on the hard-coded gradient descent program and finalizing it. I would also like to further study the subject of astronomy and coding for data analysis in astronomy, as this is my first deep-dive into these topics.