# Deep Learning and its usage in Categorizing Exoplanets

## Lecture 1: Introduction to Neural Networks

### Neural Networks Vocabulary
A **Neural Network** is a computing system used in deep learning. It maps inputs to outputs: the inputs are combined through weights and functions to compute *hidden units*, and the output is computed from those hidden units. The concept is inspired by the human brain, where neurons fire in response to the signals they receive.

**Hidden Units** are the units found in each layer of the Neural Network. Every input of the Neural Network is connected to these hidden units, though with differing dependencies, which are denoted as *weights*.

**Weights** in a Neural Network denote how strongly each hidden unit depends on each input. For example, say one of our Hidden Units is a host star's type. Inputs such as the star's temperature and radius will have higher weights in determining the host star's type. The orbital distance of an orbiting exoplanet, however, will likely have a much lower weight, as it does not directly help us identify its host star's type.

![simplified graphic of three layers in a neural network](Resources/weights-image.webp)

Each **Layer** in a Neural Network consists of a few parts. First, the layer takes a single input or a vector of inputs, which are either the network's initial inputs or the outputs of the previous layer. Next, the inputs are weighted, as discussed above. The weighted data is then transformed by functions that will be discussed later on. Lastly, the layer produces an output, which either passes through an *activation function* at the end of the Neural Network or is itself the final output.

![simplified graphic of three layers in a neural network](Resources/neural_network_w_matrices.png)

An **Activation Function** is the final portion of the Neural Network, which decides whether the "neuron" fires. For a binary classification, for example, this portion decides whether the output is a 1 or a 0, based on the inputs and hidden units.
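
As a rough illustration of how inputs, weights, hidden units, and an activation function fit together, the sketch below pushes one input vector through a single layer. The input values, the weight matrix, and the names `step` and `layer_forward` are made up for this example; they are assumptions, not part of the lecture.

```python
import numpy as np


def step(z):
    # Binary activation: a unit fires (1) when its weighted sum is positive, otherwise 0.
    return (z > 0).astype(int)


def layer_forward(x, W, b):
    # Each row of W holds the weights connecting every input to one hidden unit.
    z = W @ x + b      # weighted sum for every hidden unit
    return step(z)     # the activation decides which units fire


# Hypothetical inputs, rescaled to 0..1: [star temperature, star radius, orbital distance]
x = np.array([0.9, 0.8, 0.1])

# Two hypothetical hidden units. The first depends strongly on temperature and
# radius and only weakly on orbital distance, as in the host-star example.
W = np.array([[0.7, 0.6, 0.05],
              [-0.2, 0.1, 0.9]])
b = np.array([-1.0, -0.5])

hidden = layer_forward(x, W, b)
print(hidden)  # [1 0]
```

Changing the small weight on orbital distance barely moves the first unit's weighted sum, which is what a low weight is meant to express.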

---
### Homework: Perceptron
In basic terms, a **Perceptron** is a single-layer Neural Network, meaning it contains all the parts of a single layer listed above.

![simplified graphic of a perceptron](Resources/perceptron-image.webp)

Perceptrons are usually used for binary classification, that is, classifying data into two classes.
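
A minimal sketch of a perceptron used as a binary classifier is shown below. The weights and bias are hand-picked (not learned) so that the perceptron reproduces logical AND; the name `perceptron_predict` and the parameter values are assumptions for illustration.

```python
import numpy as np


def perceptron_predict(x, w, b):
    # Weighted sum of the inputs followed by a step activation.
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0


# Hand-picked parameters that implement logical AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [perceptron_predict(np.array(x), w, b) for x in inputs]
print(predictions)  # [0, 0, 0, 1]
```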

---
### References
*Concepts*. ML Glossary. https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html

Sharma, Sagar. (2017, September 9). *What the Hell is Perceptron?: The Fundamentals of Neural Networks.* Towards Data Science. https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53

## Lecture 2: Logistic Regression

### Logistic Regression Vocabulary and Understanding

**Logistic Regression** is used for binary classification: it builds a model that predicts a binary outcome from input variables similar to those seen before. To make accurate predictions, these models require *training sets*.

**Training Sets** are sets of data, usually stored as matrices, with known outputs. By fitting our model to them, we can make progressively more accurate predictions for new inputs that resemble the ones in the training set.

A **Linear Regression Model** is a regression model used to predict trends in data given some input, under the assumption that the relationship between input and output is linear. In this relationship, the weight applied to the input is the slope and a constant *b* is the y-intercept. However, for a model that performs binary classification, we want our y-values to lie between 0 and 1. To achieve this, we require a *Sigmoid function*.

A **Sigmoid Function** turns the output into a probability between 0 and 1 instead of the literal target variable, which lets us make binary classifications of our data. For example, rather than outputting the literal size of an exoplanet, the model outputs the probability that the planet is big enough to be a Hot Jupiter. Comparing that probability to a threshold gives the answer *Yes*, big enough to be a Hot Jupiter, or *No*, not big enough to be a Hot Jupiter (of course other factors go into making this decision, but let's keep things simple).

To apply the Sigmoid Function, we simply make our linear function the input of the Sigmoid Function, and thus it becomes our *Activation function*.
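
The sketch below feeds a linear function into a sigmoid and thresholds the result. The weight, bias, planet radii, and the 0.5 threshold are hypothetical values chosen only to make the example concrete.

```python
import numpy as np


def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1 / (1 + np.exp(-z))


# Hypothetical parameters for a one-input model: z = w * x + b
w, b = 2.0, -3.0

# Hypothetical planet radii (in Jupiter radii)
radii = np.array([0.5, 1.5, 2.5])

z = w * radii + b                # linear part
a = sigmoid(z)                   # probability of the "yes" class
labels = (a >= 0.5).astype(int)  # threshold the probability for a yes/no answer

print(a)
print(labels)  # [0 1 1]
```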

---
### Homework: 

Consider *n* inputs and *m* examples in our training set. Find the dimensions of *X*, *X<sup>i</sup>*, *w*, *b*, *z*, and *a*, where *z* is the input of the Sigmoid function and *a* is the Activation function.

*X* = [*X<sup>1</sup>*, *X<sup>2</sup>*, ..., *X<sup>m</sup>*]

*X<sup>i</sup>* = [*x<sub>1</sub><sup>i</sup>*, *x<sub>2</sub><sup>i</sup>*, ..., *x<sub>n</sub><sup>i</sup>*]

*w* = weighted value

*b* = y-intercept

*z* = *wX<sup>i</sup>* + *b*

*a* = 1/(1 + *e*<sup>-(*wX<sup>i</sup>* + *b*)</sup>)
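
One way to sanity-check an answer to this exercise is to build arrays with small hypothetical sizes and print their shapes. The sketch assumes one particular storage convention, which is an assumption rather than the only valid choice: inputs are stored as rows and examples as columns, so *X* has shape (n, m), and *w* is a row vector of shape (1, n).

```python
import numpy as np

n, m = 3, 5                  # hypothetical sizes: 3 inputs, 5 training examples

rng = np.random.default_rng(0)
X = rng.random((n, m))       # all training inputs, one column per example
X_i = X[:, [0]]              # a single example X^i, kept as a column
w = rng.random((1, n))       # one weight per input
b = 0.1                      # scalar bias

z = w @ X + b                # one z per example
a = 1 / (1 + np.exp(-z))     # one activation per example

for name, value in [("X", X), ("X_i", X_i), ("w", w), ("z", z), ("a", a)]:
    print(name, value.shape)
```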


## Lecture 3: Loss and Cost Functions

### Loss and Cost Functions Vocabulary and Understanding

A **Loss/Error Function** is used to measure how well a model can make predictions. We say we want to "minimize" our Loss Function, meaning we want to find the parameters necessary for our model to make the best possible predictions.

There are multiple common Loss Functions we can utilize, but for now we will focus on the **Log-Likelihood Loss Function**, defined as

*L(a, y<sup>i</sup>) = -y<sup>i</sup>log(a) - (1 - y<sup>i</sup>)log(1 - a)*

where *a* is the predicted output and *y<sup>i</sup>* is the known output. Minimizing the loss function means making the difference between the predicted and actual outputs as small as possible.

A **Cost Function** gives the average loss from the Loss Function over the entire training set. We define the Cost Function as

*J(w, b)* = (1/*m*) Σ<sub>i=1</sub><sup>m</sup> *L(a<sup>i</sup>, y<sup>i</sup>)*

where *w* is the weight, *b* is the bias, *a<sup>i</sup>* is the predicted output for the *i*-th example, and *m* is the number of input-output pairs in our training set.
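
The two definitions can be sketched in a few lines. The predictions and known outputs below are hypothetical, the helper names `loss` and `cost` are chosen for this example, and log is assumed to be the natural logarithm.

```python
import numpy as np


def loss(a, y):
    # Log-likelihood loss for a prediction a and its known output y.
    return -y * np.log(a) - (1 - y) * np.log(1 - a)


def cost(A, Y):
    # Average loss over the m examples in the training set.
    return np.mean(loss(A, Y))


# Hypothetical predictions and known outputs for m = 4 examples
A = np.array([0.9, 0.2, 0.7, 0.4])
Y = np.array([1, 0, 1, 0])

print(loss(A, Y))
print(cost(A, Y))

# A confident wrong prediction costs far more than a confident right one.
print(loss(0.99, 1), loss(0.01, 1))
```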

### Homework:

A necessary and sufficient condition for a twice-differentiable function *f(x)* to be convex on an interval is that the second derivative *d<sup>2</sup>f/dx<sup>2</sup>* >= 0 for all x in the interval. We can show that a local minimum of a convex function is also a global minimum. In a neural network we try to minimize the cost function. Does the cost function have to be convex?

The cost function does not necessarily have to be convex, but convexity makes it much easier to minimize. In a convex function, every local minimum is also a global minimum, so any minimum we find is the best possible one. A non-convex function may have several local minima, and a minimization procedure can settle in a local minimum that is not the global one, so we have no guarantee of reaching the lowest possible cost.
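
The second-derivative condition can be checked numerically. The sketch below approximates the second derivative with finite differences on a grid of sample points; the two test functions and the helper names are chosen for illustration only, and a grid check like this is a sanity check rather than a proof of convexity.

```python
import numpy as np


def second_derivative(f, x, h=1e-3):
    # Central finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


def looks_convex(f, xs):
    # True when the approximate second derivative is non-negative at every sample.
    return bool(np.all(second_derivative(f, xs) >= -1e-6))


def convex_f(x):
    # A single minimum, which is also the global minimum.
    return x**2


def non_convex_f(x):
    # Has a local minimum near x = 1.13 that is higher than the one near x = -1.3.
    return x**4 - 3 * x**2 + x


xs = np.linspace(-3, 3, 601)
print(looks_convex(convex_f, xs))      # True
print(looks_convex(non_convex_f, xs))  # False
```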

---
### References
Mack, Conor. (2017, November 27). *Machine Learning Fundamentals (I): Cost Functions and Gradient Descent*. Towards Data Science. https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220

## Lecture 4: Gradient Descent
### Gradient Descent Vocabulary
**Gradient Descent** is a minimization algorithm often utilized in deep learning. It is used for optimization and for training neural networks. Using Gradient Descent, we are able to train the model to find the weight and bias values that minimize the cost function.

Gradient Descent requires a **Learning Scale** (more commonly called the learning rate), which is the size of the steps taken toward the minimum value. The Learning Scale should be reasonable, neither too large nor too small. If it is too large, we run the risk of overshooting the minimum. If it is too small, each step is more precise, but reaching the minimum takes many more steps. A small, but not excessively small, value is best for the Learning Scale.

The steps taken to determine the weight and bias using gradient descent are as follows:
1. The Weight and Bias are initialized, randomly or by some guess.
2. We find the slope of the cost function at the current Weight and Bias values.
3. We adjust the Weight and Bias by taking a step in the direction of steepest descent, that is, opposite to the slope, which gives the next set of Weight and Bias values.
4. We repeat steps 2 and 3 until the Weight and Bias values no longer change significantly, indicating we have reached a minimum of the cost function.

This process can be mathematically defined as:
*w* := *w* - *α(dJ(w, b)/dw)* and *b* := *b* - *α(dJ(w, b)/db)* 

where *α* is the learning scale.
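
To see the update rule and the effect of the learning scale in isolation, the sketch below minimizes a toy one-parameter cost *J(w)* = (*w* - 3)<sup>2</sup>. The toy cost, the starting point, and the three values of *α* are assumptions picked for illustration.

```python
def gradient_descent(grad, start, alpha, n_steps):
    # Repeatedly step against the slope: w := w - alpha * dJ/dw
    w = start
    for _ in range(n_steps):
        w = w - alpha * grad(w)
    return w


def grad(w):
    # Slope of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2 * (w - 3)


good = gradient_descent(grad, start=0.0, alpha=0.1, n_steps=100)
too_small = gradient_descent(grad, start=0.0, alpha=0.001, n_steps=100)
too_large = gradient_descent(grad, start=0.0, alpha=1.1, n_steps=100)

print(good)       # very close to 3
print(too_small)  # still far from 3 after 100 steps
print(too_large)  # overshoots on every step and diverges
```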

The **Backpropagation Method** is the way by which we find the derivative (slope of the tangent line) of the loss function with respect to the weights of our given inputs and the bias. Using the chain rule, it works backward from the output: first it finds the derivative with respect to *a*, the predicted output, then with respect to *z*, where *z* = *wX<sup>i</sup>* + *b*, and finally with respect to *w<sub>1</sub>*, *w<sub>2</sub>*, and *b*. With these final derivatives, we are able to modify the weights and bias as mathematically defined.

After modifying our values, we repeat this process as described in step 4 above.

### Homework:
In programming, the term *dL/dq* is also denoted by *dq*, where *q* is a parameter such as *z* or *w<sub>i</sub>*. Calculate *da*, *dz*, *dw<sub>1</sub>*, *dw<sub>2</sub>*, and *db* using the mathematical definition of the loss function and the sigmoid function.

Taking log to be the natural logarithm, with *z* = *w<sub>1</sub>x<sub>1</sub><sup>i</sup>* + *w<sub>2</sub>x<sub>2</sub><sup>i</sup>* + *b* and *a* = 1/(1 + *e<sup>-z</sup>*), the chain rule gives:

*da* = *dL/da* = -*y<sup>i</sup>*/*a* + (1 - *y<sup>i</sup>*)/(1 - *a*)

*dz* = *dL/dz* = (*dL/da*)(*da/dz*) = (*dL/da*) · *a*(1 - *a*) = *a* - *y<sup>i</sup>*

*dw<sub>1</sub>* = *dL/dw<sub>1</sub>* = (*dL/dz*)(*dz/dw<sub>1</sub>*) = (*a* - *y<sup>i</sup>*) · *x<sub>1</sub><sup>i</sup>*

*dw<sub>2</sub>* = *dL/dw<sub>2</sub>* = (*dL/dz*)(*dz/dw<sub>2</sub>*) = (*a* - *y<sup>i</sup>*) · *x<sub>2</sub><sup>i</sup>*

*db* = *dL/db* = (*dL/dz*)(*dz/db*) = *a* - *y<sup>i</sup>*
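
Derivatives like these can be checked numerically. The sketch below computes the chain-rule results for a sigmoid activation with the log-likelihood loss (natural log assumed), namely *dz* = *a* - *y*, *dw<sub>k</sub>* = *x<sub>k</sub>* · *dz*, and *db* = *dz*, and compares *dw<sub>1</sub>* against a finite-difference estimate. The training example, the parameter values, and the function names are hypothetical.

```python
import numpy as np


def forward(w1, w2, b, x1, x2):
    z = w1 * x1 + w2 * x2 + b
    return 1 / (1 + np.exp(-z))      # a = sigmoid(z)


def loss(w1, w2, b, x1, x2, y):
    a = forward(w1, w2, b, x1, x2)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)


def backprop(w1, w2, b, x1, x2, y):
    a = forward(w1, w2, b, x1, x2)
    da = -y / a + (1 - y) / (1 - a)  # dL/da
    dz = da * a * (1 - a)            # chain rule; simplifies to a - y
    return {"da": da, "dz": dz, "dw1": x1 * dz, "dw2": x2 * dz, "db": dz}


# Hypothetical single training example and parameter values
w1, w2, b, x1, x2, y = 0.5, -0.3, 0.1, 1.2, 0.7, 1
grads = backprop(w1, w2, b, x1, x2, y)

# Finite-difference estimate of dL/dw1 for comparison
eps = 1e-6
numeric_dw1 = (loss(w1 + eps, w2, b, x1, x2, y)
               - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(grads["dw1"], numeric_dw1)
```

The two printed numbers should agree to several decimal places, which is a quick way to catch a sign or chain-rule mistake.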

### References
*What is Gradient Descent?*. IBM. https://www.ibm.com/topics/gradient-descent

*Weights and Biases*. AI Wiki. https://machine-learning.paperspace.com/wiki/weights-and-biases

Gosavi, Bhushan. (2019, November 4). *Mathematics Behind the Artificial Neural Networks: Part 1*. Medium. https://medium.com/analytics-vidhya/mathematics-behind-artificial-neural-networks-part-1-2214dab225c2

## Lecture 5: Gradient Descent over All Elements in Training Set
### Notes
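As discussed in Lecture 4, Gradient Descent updates each weight using the derivative of the cost function with respect to that weight. Because the cost function is the average of the loss over all *m* elements in the training set, this derivative is the average of the derivatives of the individual losses:

∂/∂*w*<sub>1</sub> *J*(*w*,*b*) = (1/*m*)Σ<sub>i=1</sub><sup>m</sup> ∂/∂*w*<sub>1</sub> *L*(*a<sup>i</sup>*,*y<sup>i</sup>*)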

*Remember!* In Neural Networks we need to avoid explicit **for-loops** wherever possible. Code that replaces such loops with operations on whole arrays is called **vectorized code**. This matters for the speed of our program: Gradient Descent already takes a good deal of time to train a neural network, and nested for-loops over the training set only slow it down further.
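
As a small illustration of the difference, the sketch below computes *z* for every training example twice: once with nested for-loops and once with a single NumPy matrix product. The array sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 1000               # hypothetical sizes: 3 inputs, 1000 examples
X = rng.random((n, m))       # one column per training example
w = rng.random(n)
b = 0.5

# Loop version: one example at a time, one input at a time.
z_loop = np.zeros(m)
for i in range(m):
    for j in range(n):
        z_loop[i] += w[j] * X[j, i]
    z_loop[i] += b

# Vectorized version: a single matrix product handles every example.
z_vec = w @ X + b

print(np.allclose(z_loop, z_vec))  # True
```

Both versions give the same numbers; the vectorized one hands the inner loops to NumPy's compiled routines instead of running them in Python.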

### Practice

```python
import numpy as np


def sigmoid(x):
    # Works element-wise on scalars and NumPy arrays.
    return 1 / (1 + np.exp(-x))


def sigmoid_prim(x):
    # Derivative of the sigmoid: e^(-x) / (1 + e^(-x))^2
    return np.exp(-x) / (1 + np.exp(-x)) ** 2


def gradient_decent_one_node_one_layer(X_t, Y, n_iteration, learning_rate):

    alpha = learning_rate  # learning scale

    n = X_t.shape[0]  # number of inputs per example, X_t.shape == (n, m)
    m = X_t.shape[1]  # number of examples in the training set
    Y = np.asarray(Y).reshape(1, m)  # known outputs [y_1, ..., y_m] as a row

    # Initialize the parameters once, before the loop, so that every
    # iteration improves on the previous one instead of starting over.
    omega = np.random.rand(1, n) * np.sqrt(1 / n)  # initial [omega_1, ..., omega_n]
    b = np.random.rand()

    for k in range(n_iteration):

        # Forward pass for all m examples at once (replaces the loop over i).
        z = np.dot(omega, X_t) + b  # z = omega X + b, shape (1, m)
        a = sigmoid(z)              # shape (1, m)

        # Cost: average log-likelihood loss, useful for monitoring training.
        J = np.sum(-Y * np.log(a) - (1 - Y) * np.log(1 - a)) / m

        # Backward pass (replaces the loops over i and j).
        dz = a - Y                       # shape (1, m)
        d_omega = np.dot(dz, X_t.T) / m  # shape (1, n)
        db = np.sum(dz) / m

        # Gradient descent update.
        omega = omega - alpha * d_omega
        b = b - alpha * db

    return (omega, b)
```

### References
Bakhvalov, Denis. (2017, November 10). *Vectorization Part 7. Tips for Writing Vectorizable Code*. Easyperf. https://easyperf.net/blog/2017/11/10/Tips_for_writing_vectorizable_code

*Built-In Functions*. Python Documentation. https://docs.python.org/3/library/functions.html#map

Hsu, Jonathan. (2019, December 14). *How to Replace your Python For Loops with Map, Filter, and Reduce*. Better Programming. https://betterprogramming.pub/how-to-replace-your-python-for-loops-with-map-filter-and-reduce-c1b5fa96f43a

Pozo Ramos, Leodanis. *Python's map(): Processing Iterables Without a Loop*. Real Python. https://realpython.com/python-map-function/

Shah, Ahmar. (2022, July 22). *Understand Vectorization for Deep Learning*. Medium. https://towardsdatascience.com/understand-vectorization-for-deep-learning-d712d260ab0f

Tarleton, Shane. (2022, February 23). *Use map() instead of for() loops*. Medium. https://blog.bitsrc.io/please-use-map-instead-of-for-loops-5a2f54f088c8

## Lecture 6: Deep Neural Network
### Deep Neural Network Vocabulary
There are two parts involved in hidden layers:
1. We calculate *z* = *w<sup>T</sup>X* + *b*, the weighted sum of our input plus the bias.
2. We calculate our activation function *a* = *g(z)*.

![simplified graphic of a neural network including equations](Resources/neural%20network%20diagram.jpeg)

We know we can have multiple layers within our neural network. These layers can differ in multiple ways:
- They can each have differing numbers of inputs
- The number of **nodes** in each layer can vary. **Nodes** are the units that have one or more inputs, an activation function, and an output.

A **Deep Neural Network** is made by incorporating more nodes and layers. The two parts of computing each node in each layer have the general form:

1. *z<sub>j</sub><sup>[l]</sup>* = (*w<sup>T</sup>*)*<sub>j</sub><sup>[l]</sup>a<sup>[l-1]</sup>*+*b<sub>j</sub><sup>[l]</sup>*

2. *a<sub>j</sub><sup>[l]</sup>* = *g<sup>[l]</sup>*(*z<sub>j</sub><sup>[l]</sup>*)

where *a<sup>[0]</sup>* represents the input layer, *l* is the index of the layer, *j* is the node's index in the given layer, and *a<sup>[l]</sup>* is the output of layer *l*.

We can collect the values for all nodes of layer *l* as:

*Z<sup>[l]</sup>* = [ *z<sup>[l]</sup><sub>1</sub>*, *z<sup>[l]</sup><sub>2</sub>*, ... , *z<sup>[l]</sup><sub>n<sub>l</sub></sub>* ]

*A<sup>[l]</sup>* = [ *a<sup>[l]</sup><sub>1</sub>*, *a<sup>[l]</sup><sub>2</sub>*, ... , *a<sup>[l]</sup><sub>n<sub>l</sub></sub>* ]

where *n<sub>l</sub>* is the number of nodes in layer *l*.
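
A minimal sketch of this layer-by-layer computation is given below. It assumes the weights of layer *l* are stacked into a matrix whose row *j* holds the weights of node *j*, uses the sigmoid as the activation *g* in every layer, and picks hypothetical layer sizes and random parameters; none of these choices come from the lecture.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def forward(x, weights, biases, g=sigmoid):
    # a^[0] is the input. Each layer computes
    # Z^[l] = W^[l] a^[l-1] + b^[l] and A^[l] = g(Z^[l]).
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = g(z)
    return a


rng = np.random.default_rng(0)
layer_sizes = [4, 3, 2, 1]   # hypothetical: 4 inputs, then layers with 3, 2, and 1 nodes

# Row j of each weight matrix holds the weights of node j in that layer.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

x = rng.random((4, 1))       # a^[0], a single input example as a column
output = forward(x, weights, biases)
print(output.shape)          # (1, 1)
```

The loop here runs over layers, not over training examples or nodes, so each layer is still computed with one matrix product.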

### Homework
Consider a deep neural network with L layers and n inputs. Each layer has *n<sub>l</sub>* nodes. Find the rank of the matrices *w<sup>[1]</sup><sub>j</sub>*, *b<sup>[1]</sup><sub>j</sub>*, *z<sup>[1]</sup><sub>j</sub>*, *a<sup>[1]</sup><sub>j</sub>*, *w<sup>[1]</sup>*, *b<sup>[1]</sup>*, *Z<sup>[1]</sup>*, *A<sup>[1]</sup>*.

