# Introduction to Neural Networks

Neural Networks are at the forefront of today's Machine Learning research. In this workshop, I will go over how Neural Networks work and walk through a hands-on example of using a Neural Network for Reinforcement Learning.

## Reinforcement Learning

Today we will be covering Q-learning, a specific type of Reinforcement Learning that Google's DeepMind used in the 2010s to train agents to play Atari games.

### What is it used for?
* Playing games (AI)
* Time series analysis (financial strategies)
* Robots (manufacturing)
* Learning in partially observable environments: most problems that can be modeled as a POMDP (partially observable Markov decision process)



### How does it work?

Q-learning is a specific type of Reinforcement learning with the following setup.


---


Input: Observations from the environment

Output: Action to perform on the environment (optimal action-selection policy)

What it learns: The expected reward of performing an action in a given state. We call these expected values "Q-values". The policy is the rule that picks the action with the highest Q-value in each state.



---
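
To make "the expected reward of performing an action in a given state" concrete, here is a minimal sketch of the classic table-based Q-learning update. It assumes a made-up problem with two states and two actions. The table `q`, the learning rate `alpha`, and the helper `q_update` are illustrative names and are not part of the network we build in this workshop.

```python
# Hypothetical toy problem: 2 states, 2 actions, every estimate starts at 0.
q = [[0.0, 0.0],
     [0.0, 0.0]]

def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.99):
    """Move q[state][action] toward reward + gamma * max over q[next_state]."""
    # A finished episode has no future reward to look ahead to.
    best_next = 0.0 if done else max(q[next_state])
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

# Taking action 1 in state 0 gave a reward of 1 and led to state 1.
q_update(q, state=0, action=1, reward=1.0, next_state=1, done=False)
print(q[0])  # [0.0, 0.5]
```

A Neural Network plays the role of the table `q` when there are too many states to list, which is the situation in our example.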

This may seem confusing, so let's work through our example first.


In this case, at every time step we feed the four observation values from the environment to the Neural Network as input.

Then, the Neural Network produces one value per action: its estimate of the expected reward for taking that action given those observations. We pick the action with the largest value.

## Installing Dependencies

First off, let's import our libraries.

In [0]:
# Import libraries
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

### Installing OpenAI's gym
Today, we will be using OpenAI's gym environment. To install on your computer, run 'pip install gym' on your command line (highly recommended!). You should also ensure you have OpenGL installed on your computer.

In [0]:
# Run this if you do not have opengl
!apt-get install python-opengl
# Run this to install OpenAI gym
!pip install gym
import gym

## Exploring the Environment

Great, now that we have installed gym, let's try to explore what it does. 

First, let's load the environment called 'CartPole-v1' and call reset() to initialize our setup.

In [0]:
env = gym.make('CartPole-v1')
env.reset()
random.seed()

For every action we take, the environment gives us back a list of fields, called observations, that result from that action.

Let's take 5 actions and observe what we get back from the environment.

In [0]:
for step in range(5):
    action = env.action_space.sample()   # Generate a random action
    
    # Apply the random action to the environment.
    # Receive back an observation, reward, done flag, and extra information
    obs1,reward,done,_ = env.step(action)
    
    print("STEP {}: We took an action {} and got back:".format(step,action))
    print("Observation {}".format(obs1))
    print("Reward: {} \n".format(reward))

**Note**: *The action is a single integer, either 0 or 1*

What does this mean? Well, for every action we've taken, we got back the parameters for the new state of the environment. We call these "Observations". Along with the observations, we are given a "Reward".


---


### Question: 
Why is Observation an array of 4 elements? What does each element in this array represent?


---


We will use the Observations as inputs to our Neural Network and we will use the Rewards for the loss function!

We can use the following commands to find the set of valid Actions and the upper and lower bounds of the Observation values.

In [0]:
print(env.action_space)

print(env.observation_space.high)
print(env.observation_space.low)

Great, now we know the possible values of our action, {0, 1}, and the expected range of values for our observation!

## Creating our Neural Network

Let's start to create our Neural Network. First, let's get some variables from the environment.

We can use this to determine how many input nodes we need and how many output nodes we need.

In [0]:
print(env.observation_space.shape)
INPUTSIZE = 4

# The action space is discrete, so .n gives the number of possible actions
print(env.action_space.n)
OUTPUTSIZE = 2

# Number of nodes in each hidden layer
HIDDENSIZE=15

TensorFlow helps us build a graph of our Neural Network before we train the network. To start, let's reset the TensorFlow graph.

In [0]:
tf.reset_default_graph()

### Input Layer:

Next, let's use the function `tf.placeholder(dtype, shape)` to create our input layer. Notice that the shape of the input layer is: 1 x 4 (or 1 x `INPUTSIZE`).

In [0]:
inputs = tf.placeholder(tf.float32, shape=(1,INPUTSIZE))

### Hidden Layers:

Now let's create the weights and bias to connect the input and the first hidden layer. Let's first declare our variables.

In [0]:
W1 = tf.get_variable("W1", shape=[INPUTSIZE, HIDDENSIZE], initializer=tf.contrib.layers.xavier_initializer())   # A 4 x 15 layer
b1 = tf.Variable(tf.zeros(HIDDENSIZE))   # Initialize bias with 0

Let's combine `W1` and `b1` into a layer and call it `layer1`.

`tf.matmul` multiplies `inputs` and `W1` together, we add `b1`, and `tf.nn.relu` applies the Rectified Linear (ReLU) activation function to the result.

In [0]:
layer1 = tf.nn.relu(tf.matmul(inputs, W1)+b1)   # Use ReLU activation function
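
To see what this layer computes without TensorFlow, here is a small plain-Python sketch of the same formula, `relu(x * W + b)`, for a single input row. The helper `dense_relu` and all of the numbers are made up for illustration, and the layer is shrunk to 4 inputs and 2 hidden nodes so the arithmetic can be checked by hand.

```python
def dense_relu(x, weights, bias):
    """Compute relu(x @ weights + bias) for a single input row x."""
    out = []
    for j in range(len(bias)):
        total = bias[j]
        for i, x_i in enumerate(x):
            total += x_i * weights[i][j]
        out.append(max(0.0, total))  # ReLU: negative sums become 0
    return out

# Made-up numbers: 4 inputs (like INPUTSIZE) feeding 2 hidden nodes.
x = [1.0, 2.0, -1.0, 0.5]
weights = [[1.0, -1.0],
           [0.5,  0.0],
           [2.0,  1.0],
           [0.0, -2.0]]
bias = [0.5, 0.0]

print(dense_relu(x, weights, bias))  # [0.5, 0.0]
```

The second node sums to -3 before the activation, so ReLU clips it to 0.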

Now let's do the same thing with the second layer. This time, the size of the layer is 15 x 15. In other words, `HIDDENSIZE` x `HIDDENSIZE`.

In [0]:
W2 = tf.get_variable("W2", shape=[HIDDENSIZE, HIDDENSIZE], initializer=tf.contrib.layers.xavier_initializer())
b2 = tf.Variable(tf.zeros(HIDDENSIZE))
layer2 = tf.nn.relu(tf.matmul(layer1, W2) + b2)

### Output Layer:

Now it's time to create our output layer of size 15 x 2. In other words, `HIDDENSIZE` x `OUTPUTSIZE`.

In [0]:
W3 = tf.get_variable("W3", shape=[HIDDENSIZE, OUTPUTSIZE], initializer=tf.contrib.layers.xavier_initializer())
layer_out= tf.matmul(layer2, W3)

The last layer, `layer_out`, will produce a value for each action, but not necessarily an action itself. We apply argmax to get the index of the action with the largest expected reward.

In [0]:
# Output: returns a value either 0 or 1
predict = tf.argmax(layer_out,1)

This prediction will be what we feed into the environment to get the next Observation!
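
As a quick illustration of what argmax does to one row of outputs, here is a plain-Python sketch. The helper `greedy_action` and the example values are hypothetical and stand in for a single row of `layer_out`.

```python
def greedy_action(q_values):
    """Return the index of the largest entry in one row of action values."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(greedy_action([0.2, 1.3]))   # 1
print(greedy_action([0.7, -0.1]))  # 0
```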


---


After each action, we receive the reward from the environment. With the reward, we need to define a loss function to evaluate our Neural Network's performance.
Here we will use least squares for our loss function. There are other loss functions that we can use.

In [0]:
rewardQ = tf.placeholder(shape=[1,OUTPUTSIZE],dtype=tf.float32)
loss = tf.losses.mean_squared_error(rewardQ, layer_out)
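
To get a feel for the numbers, here is a plain-Python sketch of a mean squared error, assuming the loss is a simple average of the squared differences. The helper name and the values are made up. Note that the target equals the prediction everywhere except for the action we took, so only that entry contributes to the loss.

```python
def mean_squared_error(target, predicted):
    """Average of the squared element-wise differences."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

predicted_q = [0.5, 1.0]   # made-up network output for the two actions
target_q = [0.5, 2.0]      # same as the output, except for the action we took

print(mean_squared_error(target_q, predicted_q))  # 0.5
```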

Now, let's declare our optimizer. A good choice of optimizer and learning rate can reduce the time it takes to converge. Here we will use the Adam optimizer, which extends gradient descent. In practice, `AdamOptimizer` often performs better than `GradientDescentOptimizer`. We also declare `updateModel`, the operation that has TensorFlow minimize the loss function using the optimizer.

In [0]:
trainer = tf.train.AdamOptimizer(learning_rate=0.01)
updateModel = trainer.minimize(loss)
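
For the curious, here is a rough sketch of a single Adam update for one scalar weight, written in the textbook form. The helper `adam_step` is hypothetical, and TensorFlow's internal implementation may arrange the arithmetic differently. The idea is that Adam keeps running averages of the gradient and of the squared gradient, and scales each step by them.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(round(theta, 4))  # 0.99
```

On the very first step the weight moves by roughly the learning rate, regardless of how large the gradient is.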

Finally, let's declare an initializer. `init` will run the initializers of all the variables.

In [0]:
# Initialize the weights and biases
init = tf.global_variables_initializer()

## Training the Network

Now, we can train our Neural Network.

First, let's set a few parameters for our Neural Network.

In [0]:
# Set learning parameters
gamma = 0.99
epsilon = 0.05
num_episodes = 400
maxsteps=500

# For graphing rewards
rList=[]
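
The `epsilon` parameter controls exploration: with probability `epsilon` we take a random action instead of the one the network prefers. Here is a minimal standalone sketch of that rule. The helper `epsilon_greedy` and the example values are illustrative, and the training loop in this workshop implements the same idea inline.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [epsilon_greedy([0.1, 0.9], epsilon=0.05, rng=rng) for _ in range(1000)]

# Most picks are the greedy action 1; a few are random exploration.
print(picks.count(1) / len(picks))
```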

We create a new TensorFlow session. A session instance keeps track of things like the Variables specific to that session.

In [0]:
sess = tf.Session()

As an example, let's use sess to call `init` and initialize our weights.

In [0]:
sess.run(init)

Great! Now all the weights are randomly assigned. Now, we can train our network.

In [0]:
obs=env.reset()
rAll, steps = 0, 0
done = False
while steps < maxsteps:
    # Run the network with obs, and receive the values for predict and layer_out
    action, actionQs = sess.run([predict, layer_out], feed_dict={inputs:obs.reshape(1,INPUTSIZE)})

    # Epsilon greedy exploration
    if np.random.rand(1) < epsilon:
        action[0] = env.action_space.sample()

    # Apply the chosen action to the environment.
    # Receive back an observation, reward, done flag, and information
    next_obs,reward,done,_ = env.step(action[0])

    # Evaluate the neural network on next_obs: the observation after the action
    evalQ = sess.run(layer_out, feed_dict={inputs:next_obs.reshape(1,INPUTSIZE)})
    
    #Take the largest value of all the next Q(s',a') values
    max_evalQ = np.max(evalQ)
    # Grab the old Q function
    targetQ = actionQs
    
    # Update the (old) Q(s,a) value
    if done:
        targetQ[0,action[0]] = reward
    else:
        targetQ[0,action[0]] = reward + gamma*max_evalQ

    # This will train our network with the updated Q(s,a) value
    # and update the weights once every step.
    sess.run(updateModel, feed_dict={inputs:obs.reshape(1,INPUTSIZE),rewardQ:targetQ})
    # Change the state s=s'
    obs=next_obs

    # CartPole gives a reward of 1 per step, so summing rewards also counts the steps
    steps += reward
    if done:
        rAll += steps
        rList.append(rAll)
        break;
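
The heart of the loop above is how the training target is built: we keep the network's own predictions and overwrite only the entry for the action we took. Here is that step isolated as a plain-Python sketch. The helper `build_target` is a hypothetical name, and the numbers (including `gamma=0.5`) are chosen only to keep the arithmetic easy to follow.

```python
def build_target(predicted_q, action, reward, next_q, done, gamma=0.99):
    """Copy the predictions and overwrite only the entry for the action taken."""
    target = list(predicted_q)
    if done:
        # The episode ended, so there is no future reward to add.
        target[action] = reward
    else:
        target[action] = reward + gamma * max(next_q)
    return target

print(build_target([0.5, 1.0], action=0, reward=1.0,
                   next_q=[2.0, 3.0], done=False, gamma=0.5))  # [2.5, 1.0]
print(build_target([0.5, 1.0], action=0, reward=1.0,
                   next_q=[2.0, 3.0], done=True, gamma=0.5))   # [1.0, 1.0]
```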

Cool! Let's see how it did.

In [0]:
print("Episode: 1 Steps: "+str(steps))

More than likely, the neural network did not perform very well! Let's try to hit 500 steps by training it for 400 more episodes.

In [0]:
for i in range(num_episodes):
    obs=env.reset()
    rAll, steps = 0, 0
    done = False
    while steps < maxsteps:
        #if i % 50 == 0:
            #env.render()
            
        action, actionQs = sess.run([predict, layer_out], feed_dict={inputs:obs.reshape(1,INPUTSIZE)})

        if np.random.rand(1) < epsilon:
            action[0] = env.action_space.sample()
            
        next_obs,reward,done,_ = env.step(action[0])

        evalQ = sess.run(layer_out, feed_dict={inputs:next_obs.reshape(1,INPUTSIZE)})
        max_evalQ = np.max(evalQ)
        targetQ = actionQs

        if done:
            targetQ[0,action[0]] = reward
        else:
            targetQ[0,action[0]] = reward + gamma*max_evalQ

        sess.run(updateModel, feed_dict={inputs:obs.reshape(1,INPUTSIZE),rewardQ:targetQ})
        obs=next_obs

        steps += reward
        if done:
            rAll += steps
            rList.append(rAll)
            break;
    print("Episode: {}/{} Steps: {}".format(i,num_episodes,rAll))

Time to create a graph of how we did:

In [0]:
# rList also holds the first episode we ran by hand, so divide by its length
print("Average steps per episode: " + str(sum(rList)/len(rList)) + " steps")
plt.plot(rList)
plt.show()
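
Per-episode rewards are usually noisy, so a smoothed curve can make a trend easier to spot. Here is one possible sketch of a simple moving average. The helper `moving_average` and the window size are assumptions and not part of the original notebook.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

print(moving_average([10, 20, 30, 40], window=2))  # [15.0, 25.0, 35.0]
```

With the list from training, something like `plt.plot(moving_average(rList, 20))` would plot the smoothed curve.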

Did it learn? Or did the weights perhaps diverge? What happened?

Let's test it out on the environment.

In [0]:
obs=env.reset()
steps = 0
done = False
while steps < maxsteps:
    #env.render()

    actions = sess.run(predict, feed_dict={inputs:obs.reshape(1,INPUTSIZE)})
    obs,reward,done,_ = env.step(actions[0])
    # Count this step before checking whether the episode has ended
    steps += reward
    if done:
        print("Performed "+str(steps)+" steps")
        break

We are now done with the environment. Let's close the session.

In [0]:
sess.close()

Now you've created your (maybe first) neural network!

Where can we go from here?<br>
Keras for DQN: https://keon.io/deep-q-learning/ <br>
Convolutional Neural Networks: http://cs231n.github.io/convolutional-networks/ <br>
Long Short-Term Memory (LSTMs): http://colah.github.io/posts/2015-08-Understanding-LSTMs/ <br>
Deep Learning Specialization: https://www.coursera.org/specializations/deep-learning<br>
Loss functions in NNs: https://isaacchanghau.github.io/2017/06/07/Loss-Functions-in-Artificial-Neural-Networks/