# Deep Learning and its usage in Categorizing Exoplanets

## Lecture 1: Introduction to Neural Networks

### Neural Networks Vocabulary
A **Neural Network** is a computing system used in deep learning. It maps inputs to outputs: the inputs are combined using weights and passed through functions to compute *hidden units*, and the output is computed from those hidden units. The concept is inspired by the human brain, where neurons fire in response to the signals they receive.

**Hidden Units** are the units found in each hidden layer of the Neural Network, that is, the layers between the input and the output. Every input of our Neural Network is connected to these hidden units, but each connection has its own strength, denoted by a *weight*.

**Weights** in a Neural Network denote how strongly each hidden unit depends on each input. For example, say one of our Hidden Units represents a host star's type. Inputs such as the star's temperature and radius will have higher weights in determining the host star's type. The orbital distance of an orbiting exoplanet, however, will likely have a much lower weight, since it tells us little about the host star's type.

![simplified graphic of three layers in a neural network](Resources/weights-image.webp)

Each **Layer** in a Neural Network consists of a few parts. First, the layer receives a single input or a vector of inputs, which are either the initial inputs or the outputs of the previous layer. Next, the inputs are weighted, as discussed above. The weighted data is then transformed by functions which will be discussed further later on. Lastly, the layer produces an output, which is either passed on to the next layer or, at the end of the Neural Network, sent to an *activation function* that produces the final output.

![simplified graphic of three layers in a neural network](Resources/neural_network_w_matrices.png)

An **Activation Function** is the final portion of the Neural Network, which decides whether the output "neuron" fires. For a binary classification, for example, it decides whether the output is a 1 or a 0, based on the values computed from the inputs and hidden units.
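
The vocabulary above can be tied together in a short sketch. This is a simplified illustration, not a full network: it leaves out the per-layer transformation functions that are discussed later, and every number in it (the scaled inputs, the weights, the biases, the threshold at zero) is made up for the example.

```python
def step(z):
    """Activation: fire (1) when the value is non-negative, otherwise 0."""
    return 1 if z >= 0 else 0

def weighted_sum(inputs, weights, bias):
    """Multiply each input by its weight, add the results, then add the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def layer_forward(inputs, weight_rows, biases):
    """Return one value per hidden unit; each unit sees every input."""
    return [weighted_sum(inputs, row, b) for row, b in zip(weight_rows, biases)]

# Hypothetical, already-scaled inputs:
# [star temperature, star radius, planet orbital distance]
inputs = [0.8, 0.6, 0.3]

# Two hidden units. The first leans on temperature and radius and almost
# ignores orbital distance; the second does the opposite.
weight_rows = [
    [2.0, 1.5, 0.05],
    [0.1, 0.1, 3.0],
]
biases = [-1.0, -1.5]

hidden = layer_forward(inputs, weight_rows, biases)

# The activation function at the end turns the hidden values into a 1 or a 0.
output = step(weighted_sum(hidden, [1.0, 1.0], -0.5))

print("hidden units:", hidden)
print("output:", output)
```

The first row of `weight_rows` mirrors the host star example: large weights on temperature and radius, a small weight on orbital distance.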

---
### Homework: Perceptron
In basic terms, a **Perceptron** is a single-layer Neural Network, meaning it contains all parts of a single layer listed above.

![simplified graphic of a perceptron](Resources/perceptron-image.webp)

Perceptrons are usually used for binary classification, that is, sorting the data into two classes.
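
As a minimal sketch of a perceptron doing binary classification, the example below separates the four points of a logical AND into two classes. The weights and bias are hand-picked for illustration (no training is shown), and the threshold at zero is an assumption of this sketch.

```python
def perceptron(inputs, weights, bias):
    """Single-layer network: weighted sum of the inputs, then a 0/1 decision."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

# Hand-picked values: the sum only reaches zero when both inputs are 1.
weights = [1.0, 1.0]
bias = -1.5

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [perceptron(p, weights, bias) for p in points]

print(predictions)  # [0, 0, 0, 1]
```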

---
### References
*Concepts*. ML Glossary. https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html

Sharma, Sagar. (2017, September 9). *What the Hell is Perceptron? The Fundamentals of Neural Networks.* Towards Data Science. https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53

## Lecture 2: Logistic Regression

### Logistic Regression Vocabulary and Understanding

**Logistic Regression** is used for binary classification: it builds a model that predicts a binary outcome from input variables, based on previously seen examples. To make accurate predictions, these models require *training sets*.

**Training Sets** are sets of data, often stored as matrices, with known outputs. By fitting our models to them, we can make progressively more accurate predictions of the output for new inputs that resemble the ones already seen.

A **Linear Regression Model** is a regression model used to predict trends in data given some input, under the assumption that the relationship between input and output is linear. In this relationship, the weight applied to the input is the slope and a constant *b* is the y-intercept. For binary classification, however, we want our y-values to lie between 0 and 1. To achieve this, we need a *Sigmoid function*.

A **Sigmoid Function** turns the output into a probability between 0 and 1 rather than the literal target variable, which lets us make binary classifications of our data. For example, rather than outputting the literal size of an exoplanet, the model outputs the probability that the planet is big enough to be a Hot Jupiter. Comparing that probability to a threshold (commonly 0.5) gives the answer: *Yes*, big enough to be a Hot Jupiter, or *No*, not big enough to be a Hot Jupiter. (Of course other factors go into making this decision, but let's keep things simple.)

To apply the Sigmoid Function, we simply make our linear function the input of the Sigmoid Function, and thus it becomes our *Activation function*.
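
A minimal sketch of this composition, assuming a single input (planet radius in Jupiter radii) and made-up values for the weight `w` and intercept `b`. The 0.5 cut-off used to turn the probability into a yes/no label is also an assumption of this example.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, w, b):
    """Feed the linear function z = w*x + b into the sigmoid."""
    z = w * x + b
    return sigmoid(z)

# Hypothetical parameters; a real model would learn these from a training set.
w, b = 4.0, -4.0

for radius in (0.3, 1.0, 1.5):
    p = predict_probability(radius, w, b)
    label = "Yes" if p >= 0.5 else "No"
    print(f"radius={radius}: probability={p:.3f} -> {label}")
```

With these parameters the linear function is zero at a radius of 1.0, so that is where the probability crosses 0.5.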

---
### Homework: 

Consider *n* inputs and *m* examples in our training set. Find the dimensions of *X*, *X<sup>i</sup>*, *w*, *b*, *z*, and *a*, where *z* is the input of the Sigmoid function and *a* is the output of the Activation function.

*X* = [(x<sup>1</sup>)<sub>1</sub>, (x<sup>2</sup>)<sub>2</sub>,...,(x<sup>n</sup>)<sub>n</sub>]

*X<sup>i</sup>* = [*X*]

*w* = weighted value

*b* = y-intercept

*z* = *wX<sup>i</sup>* + *b*

*a* = 1/(1 + e<sup>-(*wX<sup>i</sup>* + *b*)</sup>)
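
The notes above do not fix a layout for these quantities, so the sketch below assumes one common convention: each training example is a column of *X*, and *w* is a row of *n* weights. Under that assumption, the sizes can be checked directly. The values of `n`, `m`, `X`, `w`, and `b` are placeholders.

```python
import math

n, m = 3, 5  # hypothetical sizes: n = 3 inputs, m = 5 training examples

# X holds one training example per column: n rows, m columns (made-up values).
X = [[0.1 * (row + col) for col in range(m)] for row in range(n)]

def column(matrix, i):
    """Return example i, the i-th column of X, as a list of n values."""
    return [row[i] for row in matrix]

w = [0.5, -0.2, 0.1]  # one weight per input: n values (a 1 x n row)
b = 0.3               # a single scalar shared by every example

# z and a each hold one number per training example: m values.
z = [sum(wj * xj for wj, xj in zip(w, column(X, i))) + b for i in range(m)]
a = [1.0 / (1.0 + math.exp(-zi)) for zi in z]

print("X:", (len(X), len(X[0])))       # n x m
print("X^i:", (len(column(X, 0)), 1))  # n x 1
print("w:", (1, len(w)))               # 1 x n
print("z:", (1, len(z)))               # 1 x m
print("a:", (1, len(a)))               # 1 x m
```

For a single example, *z* and *a* are each one number; collecting them over the whole training set gives the 1 x *m* rows shown here.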


## Lecture 3: Loss and Cost Functions

### Loss and Cost Functions Vocabulary and Understanding

A **Loss/Error Function** is used to measure how well a model can make predictions. We say we want to "minimize" our Loss Function, meaning we want to find the parameters necessary for our model to make the best possible predictions.

There are multiple common Loss Functions we can utilize, but for now we will focus on the **Log-Likelihood Loss Function**, defined as
*L(a,y<sup>i</sup>) = -y<sup>i</sup>log(a) - (1 - y<sup>i</sup>)log(1 - a)*, where *a* is the predicted output and *y<sup>i</sup>* is the known output. Minimizing the loss function means making the predicted outputs agree with the known outputs as closely as possible.
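
To see how this loss behaves, here is a small sketch that evaluates it for a few predictions. It assumes `log` means the natural logarithm and that the prediction *a* lies strictly between 0 and 1, so neither logarithm is undefined.

```python
import math

def log_loss(a, y):
    """Log-likelihood loss for one example.

    a: predicted probability, strictly between 0 and 1
    y: known output, either 0 or 1
    """
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

# The known output is 1 in each case; only the prediction changes.
for a in (0.9, 0.5, 0.1):
    print(f"a={a}: loss={log_loss(a, 1):.4f}")
```

The loss is small when the prediction is close to the known output and grows quickly as a confident prediction turns out to be wrong.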

A **Cost Function** gives the average loss over the entire training set. We define the Cost Function as
*J(w,b) = (1/m) Σ<sub>i=1</sub><sup>m</sup> L(a, y<sup>i</sup>)*, where *w* is the weight, *b* is the bias, *m* is the number of input-output pairs in our training set, and *a* is the predicted output for the *i*-th pair.
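
The cost can be computed by averaging the loss over a training set, as in the sketch below. It assumes a single input per example, the sigmoid prediction from Lecture 2, and the natural logarithm. The training data and the two parameter settings are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(a, y):
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def cost(w, b, xs, ys):
    """Average the per-example loss over all m training pairs."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w * x + b)  # predicted output for this pair
        total += log_loss(a, y)
    return total / m

# Hypothetical training set: planet radii and whether each is "big enough".
xs = [0.3, 0.8, 1.2, 1.6]
ys = [0, 0, 1, 1]

print("uninformed model:", cost(0.0, 0.0, xs, ys))
print("better model:    ", cost(4.0, -4.0, xs, ys))
```

With `w = 0` and `b = 0` every prediction is 0.5, so the cost equals log(2); the second setting separates the two classes and gives a lower cost.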

### Homework:

For a twice-differentiable function *f(x)*, a necessary and sufficient condition to be convex on an interval is that the second derivative *d<sup>2</sup>f/dx<sup>2</sup>* >= 0 for all *x* in the interval. We can show that a local minimum of a convex function is also a global minimum. In a neural network, we try to minimize the cost function. Does the cost function have to be convex?

The cost function does not necessarily have to be convex. However, if it is convex, any local minimum we find is also the global minimum, so reaching a minimum gives us the best possible value of the cost function, which is the goal. A non-convex function can have several local minima with different values. It may still have a global minimum, but a minimization procedure can settle in a local minimum that is not the global one, so we have no guarantee of finding the best parameters.
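
The second-derivative condition and the difference between local and global minima can be checked numerically. The sketch below is a rough illustration only: it approximates the second derivative with finite differences and searches for minima on a grid, and the two test functions are chosen for the example rather than taken from any real cost function.

```python
def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of d2f/dx2 at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def is_convex_on(f, lo, hi, steps=200):
    """Check that the estimated second derivative is >= 0 across [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return all(second_derivative(f, x) >= -1e-6 for x in xs)

def local_minima(f, lo, hi, steps=2000):
    """Return grid points that are lower than both of their neighbours."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return [xs[k] for k in range(1, steps)
            if f(xs[k]) < f(xs[k - 1]) and f(xs[k]) < f(xs[k + 1])]

def bowl(x):
    return x * x  # convex

def bumpy(x):
    return x ** 4 - 3 * x ** 2 + x  # not convex

print("bowl convex: ", is_convex_on(bowl, -2, 2))
print("bumpy convex:", is_convex_on(bumpy, -2, 2))

for x in local_minima(bumpy, -2, 2):
    print(f"bumpy local minimum near x={x:.2f}, value={bumpy(x):.3f}")
```

The non-convex function has two local minima with different values. One of them is the global minimum, but a search that starts near the other could stop there instead.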

---
### References
Mack, Conor. (2017, November 27). *Machine Learning Fundamentals (I): Cost Functions and Gradient Descent*. Towards Data Science. https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220