# Using Deep Learning to Find Hot-Jupiters

## 1 Find Training Set
### Uncleaned Dataset (Given by DSECOP Tutorials)

In [2]:
import pandas as pd
exoplanets = pd.read_csv('Data/NASAExoplanetsData.csv')
exoplanets.head()

Unnamed: 0.1,Unnamed: 0,loc_rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,label
0,1,2,10666592,K00002.01,Kepler-2 b,CONFIRMED,CANDIDATE,,0,1,...,4.021,0.011,-0.011,1.991,0.018,-0.018,292.24728,47.969521,10.463,1.0
1,3,4,3861595,K00004.01,Kepler-1658 b,CONFIRMED,CANDIDATE,,0,1,...,3.657,0.205,-0.107,2.992,0.469,-0.743,294.35654,38.94738,11.432,1.0
2,5,6,3248033,K00006.01,,FALSE POSITIVE,FALSE POSITIVE,,0,0,...,4.106,0.175,-0.152,1.58,0.415,-0.34,294.59955,38.366772,12.161,1.0
3,6,7,11853905,K00007.01,Kepler-4 b,CONFIRMED,CANDIDATE,,0,0,...,4.105,0.01,-0.01,1.533,0.04,-0.04,285.61533,50.13575,12.211,1.0
4,7,8,5903312,K00008.01,,FALSE POSITIVE,FALSE POSITIVE,,0,0,...,4.433,0.062,-0.156,0.985,0.187,-0.079,298.66101,41.13789,12.45,1.0


## 2 Find and Isolate our needed parameters

### Cleaned Dataset
I've isolated the values we will be utilizing as inputs for our training.

We do this because the given dataset includes parameters we do not need. To remedy this, I created a duplicate of the data file in Excel and kept only the parameters needed to determine whether an exoplanet is a Hot Jupiter. The needed variables are as follows:

- Orbital Period (koi_period)
    - Hot Jupiters typically have orbital periods of roughly 2 to 10 Earth days.<sup>1</sup> A short orbital period indicates that a planet is close to its host star, so this variable helps us separate Hot Jupiters from other exoplanets.
- Transit Duration (koi_duration)
    - An inconsistent transit duration can indicate that the exoplanet exists within a multi-planet system.<sup>2</sup> **Not sure what this tells us yet. Best guess is that the other planets could affect the other variables, skewing them so that a planet appears to be a Hot Jupiter when it really isn't. Will read more on this.** Another study states that the reduction in stellar intensity during a planetary transit scales with the size of the exoplanet relative to its star.<sup>3</sup> Thus, this variable may help us indicate the size, and therefore understand what sizes are typical of a Hot Jupiter.
- Planetary Radius (koi_prad)
    - Our own Jupiter is more than ten times the diameter of Earth.<sup>4</sup> This variable therefore lets us identify whether the exoplanet is a gas giant.
- Equilibrium Temperature (koi_teq)
    - Hot Jupiters have equilibrium temperatures of around 1500 K.<sup>5</sup> We can compare this variable to the average equilibrium temperature of the exoplanets that were marked as Hot Jupiters.
- Effective Stellar Temperature (koi_steff)
    - This variable is the temperature of a black body (an object that absorbs all electromagnetic radiation falling onto it<sup>6</sup>) that would radiate the same amount of electromagnetic energy as the star emits.<sup>7</sup> **Also not completely sure about this one.** This variable could help us understand how much of the exoplanet's heat comes from its host star.
- Stellar Surface Gravity (koi_slogg)
    - **Need to find out**
- Stellar Radius (koi_srad)
    - **Need to find out**
- Type of Exoplanet (label)
    - This variable is straightforward, and tells us whether we have a Hot Jupiter (1) or not (0).
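
One way the stellar columns relate to `koi_teq` can be sketched with standard textbook relations: surface gravity and radius give the star's mass term, Kepler's third law turns the period into an orbital distance, and the stellar effective temperature is then scaled down to the planet. This is only an illustrative sketch, not necessarily how the catalog values were produced. The function name `estimate_teq`, the constants, and the albedo of 0.3 are assumptions, as are the units (`koi_slogg` as log10 of cm/s², `koi_srad` in solar radii), a circular orbit, and a planet mass that is negligible next to the star.

```python
import math

R_SUN = 6.957e8            # nominal solar radius in metres
SECONDS_PER_DAY = 86400.0


def estimate_teq(period_days, steff_k, slogg_cgs, srad_rsun, albedo=0.3):
    """Rough planetary equilibrium temperature in kelvin.

    Assumes a circular orbit and heat spread evenly over the planet.
    The default albedo is an assumption made for this sketch.
    """
    g_star = 10.0 ** slogg_cgs / 100.0      # log10(cm/s^2) -> m/s^2
    r_star = srad_rsun * R_SUN              # stellar radius in metres
    gm_star = g_star * r_star ** 2          # G*M, from g = G*M / R^2
    period_s = period_days * SECONDS_PER_DAY
    # Kepler's third law: a^3 = G*M*P^2 / (4*pi^2)
    a = (gm_star * period_s ** 2 / (4.0 * math.pi ** 2)) ** (1.0 / 3.0)
    # Scale the stellar temperature by distance and reflected light
    return steff_k * math.sqrt(r_star / (2.0 * a)) * (1.0 - albedo) ** 0.25


# Values for Kepler-2 b, whose catalog koi_teq is 2025 K
teq = estimate_teq(period_days=2.204735, steff_k=6350,
                   slogg_cgs=4.021, srad_rsun=1.991)
print(round(teq))
```

Under these assumptions the estimate for Kepler-2 b lands within a few kelvin of the tabulated `koi_teq`, which suggests that the stellar columns and the orbital period together carry much of the same information as the equilibrium temperature.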

### Sources

1 Wang, J., Fischer, D. A., Horch, E. P., & Huang, X. (2014). On the Occurrence Rate of Hot Jupiters in Different Stellar Environments. ArXiv. https://doi.org/10.1088/0004-637X/799/2/229

2 Dunbar, Brian. (2017). About Transits. NASA. https://www.nasa.gov/kepler/overview/abouttransits 

3 (2016). Chapter 3 Transits of Planets Mean Densities. ETHZ. https://ethz.ch/content/dam/ethz/special-interest/phys/particle-physics/quanz-group-dam/documents-old-s-and-p/Courses/ExtrasolarPlanetsFS2016/exop2016_chapter3_part2.pdf 

4 Berry, Dana. (2022). Hot Jupiter. NASA. https://exoplanets.nasa.gov/resources/1040/hot-jupiter/

5 Baxter, C., Désert, J.-M., Parmentier, V., Line, M., Fortney, J., Arcangeli, J., Bean, J. L., Todorov, K. O., & Mansfield, M. (2020). A transition between the hot and ultra-hot Jupiter atmospheres. Astronomy & Astrophysics. https://www.aanda.org/articles/aa/full_html/2020/07/aa37394-19/aa37394-19.html#:~:text=Hot%20Jupiters%20have%20equilibrium%20temperatures%20around%201500%20K.

6 Katsir, Dina. (2021). An In Depth Guide to Understanding Black Bodies. Acktar. https://acktar.com/an-in-depth-guide-to-understanding-black-bodies/#:~:text=A%20black%20body%2C%20also%20written,turn%2C%20no%20light%20is%20reflected.

7 Rouan, Daniel. (2011). Effective Temperature. Encyclopedia of Astrobiology. pp 479-480. https://link.springer.com/referenceworkentry/10.1007/978-3-642-11274-4_487#:~:text=Definition,as%20emitted%20by%20the%20star.


In [4]:
exoplanets = pd.read_csv('Data/NASAExoplanetsDataCleaned-full.csv')
exoplanets.head()

Unnamed: 0.1,Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,1,2.204735,3.88216,16.39,2025,6350,4.021,1.991,1
1,3,3.849372,2.6605,13.1,2035,6244,3.657,2.992,1
2,5,1.334104,3.0142,50.73,2166,6178,4.106,1.58,1
3,6,3.213669,3.99355,4.14,1507,5781,4.105,1.533,1
4,7,1.160153,1.4127,2.0,1752,5842,4.433,0.985,1
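
The column selection that was done by hand in a spreadsheet could also be scripted. The following is a minimal pandas sketch, not the procedure actually used here: the helper name `select_training_columns`, the decision to drop rows with missing values, and the small demo frame are all assumptions made for illustration.

```python
import pandas as pd

# Column names taken from the cleaned table above
FEATURE_COLUMNS = [
    "koi_period", "koi_duration", "koi_prad", "koi_teq",
    "koi_steff", "koi_slogg", "koi_srad",
]
LABEL_COLUMN = "label"


def select_training_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the seven features and the label; drop rows with missing values."""
    cleaned = raw[FEATURE_COLUMNS + [LABEL_COLUMN]].dropna()
    cleaned = cleaned.astype({LABEL_COLUMN: int})   # 1.0 / 0.0 -> 1 / 0
    return cleaned.reset_index(drop=True)


# Made-up stand-in for the raw catalog (values are illustrative only)
demo_raw = pd.DataFrame({
    "kepid": [111, 222, 333],
    "koi_period": [2.2, 45.0, 3.1],
    "koi_duration": [3.9, 6.0, 2.7],
    "koi_prad": [16.4, 2.1, 13.0],
    "koi_teq": [2025.0, 600.0, None],   # missing value: this row is dropped
    "koi_steff": [6350.0, 5500.0, 6200.0],
    "koi_slogg": [4.02, 4.40, 3.70],
    "koi_srad": [1.99, 1.00, 2.90],
    "label": [1.0, 0.0, 1.0],
})

cleaned = select_training_columns(demo_raw)
print(cleaned)
```

When saving a frame like this, `cleaned.to_csv(path, index=False)` leaves out the row index, which is the usual source of `Unnamed: 0` columns when a file is read back in.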


### Split between a training set and a testing set
I split the data into two CSV files, one for training and one for testing, each containing half Hot Jupiters and half non-Hot Jupiters.

We split the dataset so that the model is tested on examples it did not see during training. If we used the entire set for both training and testing, the result would tell us very little about the effectiveness of our program: the model could fit that particular set very closely (overfitting) and still be unable to work accurately on a different dataset.
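
A split like this can also be made in code. The sketch below uses scikit-learn's `train_test_split` with `stratify`, which keeps the proportion of each class the same in both halves. It is an illustration on made-up data, not the procedure used to create `LearnData.csv` and `TestData.csv`; the demo frame and the `random_state` value are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned table: 10 Hot Jupiters and 10 others
demo = pd.DataFrame({
    "koi_period": [float(i + 1) for i in range(20)],
    "label": [1] * 10 + [0] * 10,
})

# stratify keeps the class ratio identical in both halves;
# random_state only makes the shuffle repeatable
train_df, test_df = train_test_split(
    demo, test_size=0.5, stratify=demo["label"], random_state=0
)

print(train_df["label"].value_counts().to_dict())
print(test_df["label"].value_counts().to_dict())
```

Because no row appears in both halves, accuracy measured on the test half reflects how the model handles planets it has not seen.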

In [6]:
exoplanets_train = pd.read_csv('Data/LearnData.csv')
exoplanets_train.head()

Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,2.204735,3.88216,16.39,2025,6350,4.021,1.991,1
1,3.849372,2.6605,13.1,2035,6244,3.657,2.992,1
2,1.334104,3.0142,50.73,2166,6178,4.106,1.58,1
3,3.213669,3.99355,4.14,1507,5781,4.105,1.533,1
4,1.160153,1.4127,2.0,1752,5842,4.433,0.985,1


In [7]:
exoplanets_test = pd.read_csv('Data/TestData.csv')
exoplanets_test.head()

Unnamed: 0,koi_period,koi_duration,koi_prad,koi_teq,koi_steff,koi_slogg,koi_srad,label
0,1.636689,1.353,11.55,2560,5234,3.436,3.739,1
1,0.616388,0.8228,2.9,3451,5667,3.625,3.049,1
2,0.895725,0.839,1.53,1786,5897,4.56,0.834,1
3,2.20922,2.73,0.88,1507,5991,4.375,1.092,1
4,0.519439,2.1631,40.61,2403,6177,4.462,1.011,1


## 3 Set Hyperparameters

| Hyperparameter | Description | Value |
| ----- | ----------- | ----- |
| Learning Rate | Step size used when adjusting the weights at each update | $α = 0.03$ |
| Activation Function (from the input layer through hidden layer 2) | Decides whether, and how strongly, a node fires | $g(z) = \tanh(z)$ (non-linear) |
| Activation Function (between final layer and output) | Decides whether, and how strongly, a node fires | $g(z) = σ(z)$ (sigmoid, non-linear) |
| Hidden Layers | These are where our inner nodes are stored, which take in the weighted inputs and produce an output | 3 |
| Nodes in Hidden Layers | The number of neurons in each hidden layer | [4, 3, 1] |
| Iterations | How many times the gradient descent update is run | 5000 |

**Note**: This means we will not stop based on the value of our cost function; instead, we will stop after the given number of iterations.

The following is a diagram depicting the information in the table.

![simplified graphic of my neural network](Resources/Flowchart.png)
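
As a rough sketch of how the table translates into code, the forward pass below sends 7 input features through layers of 4, 3, and 1 nodes, using tanh on the first two layers and the sigmoid on the last. The helper names, the random weight initialization, and the fake input data are assumptions for illustration; this is not the network that is trained later in this notebook.

```python
import numpy as np

LAYER_SIZES = [7, 4, 3, 1]   # 7 input features, then the [4, 3, 1] layers


def init_params(layer_sizes, seed=0):
    """Small random weights and zero biases per layer (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Apply tanh on every layer except the last, which uses the sigmoid."""
    a = X
    for W, b in params[:-1]:
        a = np.tanh(a @ W + b)
    W, b = params[-1]
    return sigmoid(a @ W + b)


# Five fake planets with seven already-scaled features each
X_demo = np.random.default_rng(1).normal(size=(5, 7))
probs = forward(X_demo, init_params(LAYER_SIZES))
print(probs.shape)
print(probs.ravel())
```

Each output is a number between 0 and 1 that can be read as the predicted probability of the Hot Jupiter class; with untrained weights the values sit close to 0.5.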

## 4 Define the Loss and Cost Function

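Our Loss function is the Log-Likelihood Loss function (the negative log-likelihood, also known as binary cross-entropy). For training example $i$ with label $y^{(i)}$ and predicted output $a^{(i)}$, it is defined as the following:

$L(a^{(i)},y^{(i)})=-y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})$.

We define our Cost function as the average loss over the $m$ training examples:

$J(ω,b)=\frac{1}{m}\sum_{i=1}^{m}L(a^{(i)},y^{(i)})$.

Substituting the Log-Likelihood Loss function, our Cost function becomes:

$J(ω,b)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})\right]$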
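
The cost function can be evaluated directly with NumPy. This is a minimal sketch: the function name `cost` and the clipping constant `eps`, which keeps the logarithm away from zero, are assumptions added for illustration.

```python
import numpy as np


def cost(a, y, eps=1e-12):
    """Mean log-likelihood loss for predictions a and 0/1 labels y."""
    a = np.clip(np.asarray(a, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    losses = -y * np.log(a) - (1.0 - y) * np.log(1.0 - a)
    return losses.mean()


print(cost([0.9, 0.1], [1, 0]))   # confident and correct: small cost
print(cost([0.5, 0.5], [1, 0]))   # uninformative guesses: cost equals ln 2
print(cost([0.1, 0.9], [1, 0]))   # confident and wrong: large cost
```

The cost grows as predictions move away from the true labels, which is what gradient descent exploits when it adjusts the weights.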

Gradient descent is then used to minimize this cost function.
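
As a minimal sketch of the idea, the code below runs batch gradient descent for a single sigmoid unit (logistic regression), using the learning rate and iteration count from the hyperparameter table. It simplifies the planned network, which would also need backpropagation through the tanh layers. The function names and the toy data are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent(X, y, alpha=0.03, n_iterations=5000):
    """Batch gradient descent on the log-likelihood cost for one sigmoid unit."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iterations):        # fixed iteration count, no cost-based stop
        a = sigmoid(X @ w + b)           # current predictions
        error = a - y                    # derivative of the loss w.r.t. z = X @ w + b
        w -= alpha * (X.T @ error) / m   # w := w - alpha * dJ/dw
        b -= alpha * error.mean()        # b := b - alpha * dJ/db
    return w, b


# Made-up, linearly separable data: the label is 1 when the feature is positive
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w, b = gradient_descent(X_toy, y_toy)
predictions = (sigmoid(X_toy @ w + b) >= 0.5).astype(int)
print(w, b)
print(predictions)
```

The loop runs for a fixed number of iterations and never checks the cost value, matching the stopping rule chosen in the hyperparameter section.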


## 5 Generalize Gradient Descent method, Apply and Test

In [23]:
import numpy as np
from sklearn import linear_model


def read_in_dataset(file_loc):
    """Load a CSV of numeric columns into a 2-D float array, skipping the header row."""
    return np.genfromtxt(file_loc, dtype=float, delimiter=",", skip_header=1)


def split_features_and_labels(data):
    """Return (features, labels), using the last column ("label") as the labels.

    Every other column is used as a feature, including any leading
    row-index column the CSV may contain.
    """
    return data[:, :-1], data[:, -1]


def main():
    n_iteration = 100_000
    # In SGDClassifier, `alpha` multiplies the regularization term. It is not the
    # step size, although the default "optimal" learning-rate schedule depends on it.
    alpha = 0.0008

    train_X, train_Y = split_features_and_labels(read_in_dataset("Data/LearnData.csv"))
    test_X, test_Y = split_features_and_labels(read_in_dataset("Data/TestData.csv"))

    # loss="log_loss" gives logistic regression trained by stochastic gradient descent
    SGDClf = linear_model.SGDClassifier(loss="log_loss", alpha=alpha, max_iter=n_iteration)
    SGDClf.fit(train_X, train_Y)

    # let's predict and compare...how well do our predicted values perform?
    predictedY = SGDClf.predict(test_X)

    # percentage of test planets whose predicted label matches the true label
    percentMatch = (predictedY == test_Y).mean() * 100
    print(percentMatch)


if __name__ == "__main__":
    main()

99.74704890387859


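**Note**: As seen above, in the interest of time, we are not hard-coding the gradient descent program at this time and are instead using a scikit-learn method. The main difference from what was planned in part 3 is that `SGDClassifier` with `loss="log_loss"` fits a logistic regression model: it has no hidden layers and does not use the hyperbolic tangent, so the sigmoid function is the only activation function involved. The number of iterations (100,000 rather than 5,000) also differs from the plan, and the value 0.0008 is passed as `alpha`, which scikit-learn uses as a regularization strength rather than as the learning rate.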

## 6 Conclusion

We can use `SGDClf.predict()`, where `SGDClf` is the Stochastic Gradient Descent Classifier, to predict whether each exoplanet in a dataset is a Hot Jupiter. On our test set, the predictions matched the labels about 99.7% of the time, so this approach can be used to automate and accelerate the classification of these exoplanets with a high rate of accuracy.