# Simple Reinforcement Learning in PyTorch:
## The Multi-Armed Bandit, Basic Policy Optimization

In [1]:
import torch
import torch.nn.functional as F
import numpy as np

### The Bandit
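We are using a four-armed bandit. Each arm has a payoff probability, and the `pullBandit` function draws a random number between 0 and 1: if the number is below the arm's value, the pull returns a positive reward (+1); otherwise it returns a negative reward (-1). The higher the arm's value, the more likely a positive reward will be returned. We want our agent to learn to always choose the arm that is most likely to give that positive reward.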

In [2]:
# List our bandit arms; each value is the probability of a payoff when that arm is pulled.
# Currently arm 3 (index 2) is set to provide a positive reward most often.

bandit_arms = [0.1, 0.0, 0.3, 0.15]
num_arms = len(bandit_arms)

def pullBandit(bandit):
    # Draw a random number in [0, 1).
    result = np.random.rand()
    if result < bandit:
        # Return a positive reward.
        return 1
    else:
        # Return a negative reward.
        return -1
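
Because a pull pays +1 with probability p and -1 otherwise, the expected reward of an arm is 2p - 1. The sketch below works that out for the four arms and compares it with a simulation. It is a standalone illustration: `expected_reward` and `estimate_reward` are hypothetical helpers, and the seed and pull count are arbitrary choices, not part of the notebook.

```python
import random

# Payoff probabilities copied from the notebook's bandit_arms list.
bandit_arms = [0.1, 0.0, 0.3, 0.15]

def expected_reward(p):
    """Expected reward of an arm that pays +1 with probability p, else -1."""
    return p * 1 + (1 - p) * (-1)  # simplifies to 2p - 1

def estimate_reward(p, pulls, rng):
    """Average reward over `pulls` simulated pulls (hypothetical helper)."""
    total = 0
    for _ in range(pulls):
        total += 1 if rng.random() < p else -1
    return total / pulls

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
expected = [expected_reward(p) for p in bandit_arms]
estimated = [estimate_reward(p, 10_000, rng) for p in bandit_arms]
best_arm = max(range(len(bandit_arms)), key=lambda i: expected[i])

for i, (e, s) in enumerate(zip(expected, estimated)):
    print(f"arm index {i}: expected {e:+.2f}, simulated {s:+.2f}")
print("best arm index:", best_arm)
```

Every arm here has a negative expected reward (the best, index 2, averages -0.4 per pull), so the agent is effectively learning which arm loses least often.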

### The Model
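The code below establishes a simple neural model: one weight per bandit arm, passed through a softmax to give the probability of choosing each arm. We use a policy gradient method to update the agent: the loss is the negative of the chosen arm's probability times the received reward, so gradient descent raises the probability of an arm that paid off and lowers it for one that did not.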
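
To make the update concrete, here is a minimal sketch of a single gradient computation. It assumes a hypothetical pull in which arm index 2 returned +1, and it compares PyTorch's autograd result with the hand-derived gradient of the loss -r·π(a) with respect to each weight w_j, which is -r·π(a)·(1[j = a] - π(j)). The variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

weights = torch.ones(4, requires_grad=True)   # one weight per arm
action, reward = 2, 1.0                       # hypothetical pull: arm index 2 paid off

probs = F.softmax(weights, dim=0)             # uniform policy: 0.25 per arm
loss = -probs[action] * reward
loss.backward()                               # autograd fills weights.grad

# Hand-derived gradient of -r * pi[a] with respect to each weight w_j.
pi = probs.detach()
one_hot = torch.zeros(4)
one_hot[action] = 1.0
analytic = -reward * pi[action] * (one_hot - pi)

print("autograd:", weights.grad)
print("analytic:", analytic)
```

With uniform starting weights, the gradient is -0.1875 for the chosen arm and +0.0625 for each of the others, so a gradient descent step raises the chosen arm's weight and lowers the rest. With a reward of -1 the signs flip.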


In [3]:
class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Simple layer of weights, one for each bandit arm.
        self.weights = torch.ones([num_arms], requires_grad=True)

    def forward(self):
        # Use softmax to turn the weights into a probability for each arm.
        return F.softmax(self.weights, dim=0)

    def action(self, output):
        # Pick a random arm index with likelihood given by the current policy.
        # Sampling the index directly keeps arms with equal probability distinct.
        probs = output.detach().numpy()
        return np.random.choice(len(probs), p=probs)

    def reward(self, action):
        # Get our reward from pulling the selected bandit arm.
        return pullBandit(bandit_arms[action])

    def loss(self, output, action, reward):
        # Compute the loss on the arm we picked: it falls as a rewarded arm becomes more likely.
        return -output[action] * reward
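
The action step is meant to choose each arm in proportion to the policy's probabilities. A small sketch of that idea, sampling arm indices rather than probability values, so that arms with equal probabilities stay distinguishable (which matters at initialization, when all four weights are equal). The seed and sample count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, an arbitrary choice
probs = np.array([0.25, 0.25, 0.25, 0.25])     # uniform policy, as at initialization

# Sample arm indices (not probability values) in proportion to the policy.
samples = rng.choice(len(probs), size=10_000, p=probs)

# Fraction of draws that landed on each arm.
frequencies = np.bincount(samples, minlength=len(probs)) / samples.size
print(frequencies)
```

Each arm should be picked roughly a quarter of the time, which is the exploration the agent needs before any arm has been favored.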

### Training the Model

In [4]:

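net = Model()
optimizer = torch.optim.SGD([net.weights], lr=0.001, momentum=0.9)
trials = 100
for i in range(trials):
    output = net.forward()                     # current policy over the arms
    action = net.action(output)                # sample an arm from the policy
    reward = net.reward(action)                # pull it and observe +1 or -1
    loss = net.loss(output, action, reward)
    # Note: gradients are never zeroed, so each backward() adds to those of earlier trials.
    loss.backward()
    optimizer.step()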
    
print("The learned likelihood of getting the best reward from each of the 4 arms after trial", trials)
print(F.softmax(net.weights,dim=0).data)

The learned likelihood of getting the best reward from each of the 4 arms after trial 100
tensor([0.1469, 0.0568, 0.6046, 0.1917])
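
The loop above calls `loss.backward()` without clearing gradients, so each step also carries the gradients of all earlier trials. For comparison, here is a variant sketch that calls `optimizer.zero_grad()` every trial and uses the log-probability form of the loss, -log π(a)·r, as in the REINFORCE formulation of policy gradients. This is not the notebook's method: the learning rate, trial count, seed, and the `pull` helper are assumptions made for illustration, and its output will differ from the numbers printed above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                        # arbitrary seed for repeatability
bandit_arms = [0.1, 0.0, 0.3, 0.15]         # payoff probabilities from the notebook

def pull(p):
    """Return +1 with probability p, otherwise -1 (hypothetical helper)."""
    return 1.0 if torch.rand(1).item() < p else -1.0

weights = torch.ones(len(bandit_arms), requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)   # assumed learning rate
trials = 2000                                     # assumed trial count

for _ in range(trials):
    optimizer.zero_grad()                         # clear gradients from the previous trial
    log_probs = F.log_softmax(weights, dim=0)
    # Sample one arm index in proportion to the current policy.
    action = torch.multinomial(log_probs.detach().exp(), 1).item()
    reward = pull(bandit_arms[action])
    loss = -log_probs[action] * reward            # REINFORCE-style loss
    loss.backward()
    optimizer.step()

policy = F.softmax(weights, dim=0).detach()
print(policy)
```

Because each update now reflects only the current trial, the step size is governed by the learning rate alone, which makes the effect of changing `lr` or `trials` easier to reason about.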
