# Artificial Neural Network
Book Sections: Chapter 10

## In this tutorial we will use the dataset with the following attributes as an example:
**2 Attributes:**

1. Words (specific vocabulary items that are feminine, masculine, or neutral)
2. Category

The dataset comes from: https://link.springer.com/article/10.3758/BF03195592

This tutorial uses an artificial neural network (ANN) to classify whether a word is gender-biased. Because a computer must represent vocabulary numerically, some background on Natural Language Processing is also discussed here for interested readers.

## Section 1 Before Algorithms: Background

In this section, we first introduce some basic concepts and background to help you understand what an artificial neural network (ANN) is.

**Neural Network**

![](./fig/ANN/fig1.png)

In biology, the nervous system of the brain contains billions of neurons. A neuron sends information as an electrical signal down its axon to the axon terminals, where the signal is converted into a chemical signal and passed on to the next neuron.

The basic idea of an artificial neural network (ANN) is modeled on the biological neural network. An ANN consists of several layers, each containing some number of neurons and each playing a different role. Like its biological counterpart, an ANN receives input information (the x variables), passes each piece of information to the next neuron weighted by its importance, and finally outputs a signal (the y variable).

## Section 2 Before Application: Basic Concepts

**ANN Terminology**

![](./fig/ANN/fig2.png)

*Layers*

The input layer feeds the signals (data) into the network; the number of neurons in it equals the number of attributes we are interested in. The output layer produces the output signal (the y variables). Between the input layer and the output layer there are one or more hidden layers, each containing some number of neurons. Just as a biological neuron weighs the signal arriving at each dendrite, each incoming signal in an ANN is weighted according to its importance.

*Activation Functions*

Each hidden layer applies an activation function to the information it receives, which introduces nonlinearity into the model. For example, if the activation function in hidden layer L1 is $f$, then:

<div align = 'center'><font size = '6'>${A_k}^{(1)}=f(w_{k0}+\sum_{j=1}^p w_{kj}X_j)$</font></div>

Here, $w_{kj}$ is the weight between neuron $j$ in the previous layer and the current neuron $k$. It reflects the importance of the information being sent and must be estimated by the model. $w_{k0}$ is the bias weight. As information propagates, each neuron sums everything it receives (each input $X_j$ multiplied by its importance $w_{kj}$) and then applies a nonlinear transformation through the activation function.
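The computation in the formula above can be sketched in a few lines of base R. This is only an illustrative sketch: the function name `hidden_activations`, the choice of the sigmoid as $f$, and all of the numbers are assumptions made for the example, not values from the dataset.

```r
# Logistic sigmoid, used here as the activation function f (an assumed choice).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Compute A_k = f(w_k0 + sum_j w_kj * X_j) for every hidden neuron k at once.
# W is a K x p matrix whose row k holds w_k1..w_kp; b holds the biases w_k0.
hidden_activations <- function(x, W, b, f = sigmoid) {
  z <- as.vector(W %*% x) + b   # weighted sum received by each neuron
  f(z)                          # nonlinear transformation
}

# Illustrative numbers only: p = 3 inputs, K = 2 hidden neurons.
x <- c(1, 2, 3)
W <- matrix(c(0.5, -1.0, 0.5,
              1.0,  1.0, 1.0),
            nrow = 2, byrow = TRUE)
b <- c(0, 0)

A <- hidden_activations(x, W, b)
print(A)
```

The first neuron receives a weighted sum of exactly 0, so its activation is 0.5. The second receives a sum of 6, so its activation is close to 1.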

The form of the activation function is flexible. Common choices include sigmoid, ReLU, and $\tanh(x)$. Unfortunately, as with SVM, there is no fixed rule for choosing the activation function, so we need to specify it ourselves when defining the model.
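As a rough illustration, the three activation functions named above can be written and compared in base R. The helper names `sigmoid` and `relu` are defined here for the example (`tanh` is built into R), and the input values are arbitrary.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))   # output lies in (0, 1)
relu    <- function(z) pmax(0, z)          # 0 for negative input, z otherwise
# tanh() is already available in base R; its output lies in (-1, 1)

# Evaluate each function on a few arbitrary inputs.
z <- c(-2, 0, 2)
activations <- data.frame(
  z       = z,
  sigmoid = sigmoid(z),
  relu    = relu(z),
  tanh    = tanh(z)
)
print(activations)
```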

Activation function reading and cheat sheet:

**https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6**



**Neural Network Topology**

The ability of a neural network to learn is rooted in its topology. A network's topology is characterized by the number of layers, whether information is allowed to travel backward, and the number of neurons within each layer.

As with the models we learned before, ANN involves a tradeoff: the more complex the neural network, the more accurately it fits the training data, but the greater the risk of overfitting.
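One rough way to see how topology drives complexity is to count the weights that must be estimated. The sketch below assumes a fully connected feed-forward network with one bias weight per neuron. The function name `count_weights` and the layer sizes are hypothetical.

```r
# Count the weights (including one bias per neuron) of a fully connected
# feed-forward network, given the number of neurons in each layer.
count_weights <- function(layer_sizes) {
  n_in  <- layer_sizes[-length(layer_sizes)]  # neurons sending signals
  n_out <- layer_sizes[-1]                    # neurons receiving them
  sum((n_in + 1) * n_out)                     # +1 for the bias weight w_k0
}

small  <- count_weights(c(4, 3, 1))        # 4 inputs, one hidden layer of 3
larger <- count_weights(c(4, 10, 10, 1))   # two hidden layers of 10

print(c(small = small, larger = larger))
```

Adding layers or neurons quickly increases the number of weights, which is one reason larger networks fit training data more closely and are more prone to overfitting.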

*Learning the Weights*

As in linear regression, the model minimizes a criterion function while it is being trained (like the least-squares function). In machine learning, this function is called the cost function. 
The cost function of the weights is:

<div align = 'center'><font size = '6'>$J(\vec{w})=\frac{1}{2n}\sum_{i=1}^n ((\sum_{j=1}^p w_jx_{ij})-y_i)^2$</font></div>

Its partial derivative with respect to a weight $w_k$, where $\widehat{y}_i=\sum_{j=1}^p w_jx_{ij}$ is the prediction for observation $i$, is:

<div align = 'center'><font size = '6'>$\frac{\partial J(\vec{w})}{\partial w_k}=\frac{1}{n}\sum_{i=1}^n (\widehat{y}_i-y_i)x_{ik}$</font></div>

To minimize the cost function, we adjust the weights so that its value decreases. Recall that the gradient points in the direction of steepest increase, so the negative gradient gives the direction in which the cost function decreases fastest. We move the weights a small step in that direction, with the step size determined by the learning rate, and repeat until the cost reaches a minimum. (Imagine a person at the top of a mountain who wants to go downhill, each time taking a small step along the steepest downward direction.) This process is called gradient descent.
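The cost function, its partial derivatives, and the update loop can be sketched in base R as follows. This is a minimal sketch under stated assumptions: the output is linear (no activation function), the data are made up so that the exact answer is $w = (1, 2)$, and the function names, learning rate, and number of steps are illustrative choices.

```r
# Made-up data: n = 4 observations, p = 2 attributes, with y = X %*% c(1, 2).
X <- matrix(c(1, 0,
              0, 1,
              1, 1,
              2, 1), ncol = 2, byrow = TRUE)
y <- c(1, 2, 3, 4)

# Cost J(w) = 1/(2n) * sum_i (y_hat_i - y_i)^2, with y_hat = X %*% w.
cost <- function(w, X, y) {
  residual <- as.vector(X %*% w) - y
  sum(residual^2) / (2 * length(y))
}

# Gradient: element k is 1/n * sum_i (y_hat_i - y_i) * x_ik.
cost_gradient <- function(w, X, y) {
  residual <- as.vector(X %*% w) - y
  as.vector(t(X) %*% residual) / length(y)
}

# Repeatedly step against the gradient, scaled by the learning rate.
gradient_descent <- function(X, y, learning_rate = 0.1, n_steps = 2000) {
  w <- rep(0, ncol(X))   # start from all-zero weights (an arbitrary choice)
  for (step in seq_len(n_steps)) {
    w <- w - learning_rate * cost_gradient(w, X, y)
  }
  w
}

w_start <- c(0, 0)
w_hat <- gradient_descent(X, y)

print(cost(w_start, X, y))   # cost before training
print(w_hat)                 # estimated weights
print(cost(w_hat, X, y))     # cost after training
```

Because this cost function is convex, the loop approaches the weights that generated the data. A real neural network's cost function is generally non-convex, so the same loop offers no such guarantee.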

***Problems in this process?***
- The size of each adjustment is determined by the learning rate. If the learning rate is too small, it takes a long time to reach the optimum; if it is too large, the weights change too drastically and oscillate between values, so the algorithm either takes too long to converge or never converges.
- When the cost function is non-convex, gradient descent may end at a local minimum rather than the global one. (One common remedy is stochastic gradient descent with random initialization of the weights.)
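The first problem can be illustrated with a small experiment in base R. The sketch below runs 50 gradient descent steps on a made-up linear least-squares problem with three learning rates. The data, the rates, the step count, and the helper name `cost_after` are all assumptions chosen for the illustration.

```r
# Made-up data: n = 4 observations, p = 2 attributes, with y = X %*% c(1, 2).
X <- matrix(c(1, 0,
              0, 1,
              1, 1,
              2, 1), ncol = 2, byrow = TRUE)
y <- c(1, 2, 3, 4)

# Run gradient descent from w = (0, 0) and return the final cost.
cost_after <- function(learning_rate, n_steps = 50) {
  w <- c(0, 0)
  n <- length(y)
  for (step in seq_len(n_steps)) {
    residual <- as.vector(X %*% w) - y
    w <- w - learning_rate * as.vector(t(X) %*% residual) / n
  }
  sum((as.vector(X %*% w) - y)^2) / (2 * n)
}

rates <- c(too_small = 0.001, moderate = 0.1, too_large = 1.1)
final_cost <- sapply(rates, cost_after)
print(final_cost)
```

The starting cost is 3.75. With the smallest rate the cost has barely moved after 50 steps, with the moderate rate it has dropped substantially, and with the largest rate each step overshoots and the cost grows instead of shrinking.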


## Section 3 Example Code

In this section we will use the R language. R is a powerful tool for data visualization and statistics, and it is also well suited to machine learning and deep learning.