# Experiments on Reinforcement Learning

## The game

Let's imagine a game in which the player has to choose one of ten boxes. Every game one of the boxes lights up, and the player gets one point for choosing that box and zero points otherwise. The player makes 10 choices in each game.

In practice, the environment $\mathcal{E}$ is a random number generator that draws a number from 1 to 10 according to some probability distribution. The agent chooses a box 10 times, receiving the corresponding reward at each attempt.
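The game loop described above can be sketched in a few lines. This is a minimal stand-alone illustration, not the `rl_tools` implementation: the helper name `play_episode`, the uniform distribution over boxes, and the choice to redraw the lit box at every attempt are assumptions made for the example.

```python
import random

def play_episode(choose_action, n_attempts=10, n_boxes=10, rng=None):
    """Play one episode and return the total score.

    Hypothetical helper: assumes a uniformly random lit box, redrawn at each attempt.
    """
    rng = rng or random.Random()
    score = 0
    for _ in range(n_attempts):
        state = rng.randint(1, n_boxes)        # box that lights up (1..n_boxes, inclusive)
        action = choose_action(state)          # the agent's choice
        score += 1 if action == state else 0   # one point for the right box
    return score

# An agent that always picks the lit box gets the maximum score.
perfect_score = play_episode(lambda state: state)

# An agent that ignores the state and picks at random.
rng = random.Random(0)
random_score = play_episode(lambda state: rng.randint(1, 10), rng=rng)
```

The agent is passed in as a plain function of the state, which keeps the sketch independent of any particular agent class.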

In [1]:
import numpy as np
import sys
sys.path.insert(0, '../src')
from rl_tools import *
from pprint import pprint
from tqdm import tqdm_notebook as tqdm
import random
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

Using TensorFlow backend.


In [2]:
from importlib import reload
import rl_tools
reload(rl_tools)
from rl_tools import *

## Random agent

Initialize the environment, a random agent, and the game.

In [3]:
env = Environment()
random_agent = Agent(0.2,0)
game = Game()

Test: execute `n_actions` sequences of one extraction of a random state and one random action from the agent.

In [4]:
n_actions = 10

for i in range(n_actions):
    print(f'Extraction {i+1} of {n_actions}')
    state = np.random.randint(1,11)
    print(f'State: {state}')
    action = random_agent.choose_action(state, 0.0)
    print(f'Action: {action}')

Extraction 1 of 10
State: 10
Action: 7
Extraction 2 of 10
State: 6
Action: 4
Extraction 3 of 10
State: 6
Action: 5
Extraction 4 of 10
State: 7
Action: 7
Extraction 5 of 10
State: 1
Action: 1
Extraction 6 of 10
State: 5
Action: 6
Extraction 7 of 10
State: 9
Action: 8
Extraction 8 of 10
State: 7
Action: 3
Extraction 9 of 10
State: 5
Action: 9
Extraction 10 of 10
State: 4
Action: 4


Test: execute `n_actions` actions with the Game object.

In [5]:
for i in range(n_actions):
    print(game.play_one_action(random_agent, env, 0.0))

(6, 9, 0, 9)
(9, 9, 1, 7)
(7, 9, 0, 3)
(3, 1, 0, 7)
(7, 2, 0, 9)
(9, 1, 0, 4)
(4, 1, 0, 1)
(1, 8, 0, 1)
(1, 4, 0, 1)
(1, 3, 0, 4)


Test: execute `n_episodes` episodes with the Game object.

In [6]:
game.n_actions

10

In [7]:
game.action_count

10

In [8]:
n_episodes = 10

for i in range(n_episodes):
    game.play_one_episode(random_agent, env, 0.0)

Play 500 rounds with 10 attempts each and plot the results.

In [10]:
n_rounds = 500
scores = []

for _ in tqdm(range(n_rounds)):
    scores.append(game.play_one_episode(random_agent, env, 1.0))
    
scores = np.array(scores)




In [11]:
trace = go.Scatter(
    x = np.arange(1, len(scores)+1),
    y = scores,
    mode='markers'
)

layout = go.Layout(
    xaxis = dict(
        title='Episode number'
    ),
    yaxis = dict(
        title='Score'
    )
)

data = [trace]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

In [12]:
print('Average score:')
print(scores.mean())
print('Standard deviation:')
print(scores.std())

Average score:
1.006
Standard deviation:
0.9807976345811606
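For reference, the behaviour of a purely random player can be worked out directly. The sketch below assumes that each of the 10 attempts succeeds independently with probability 1/10, which holds only if the lit box is uniformly distributed; under that assumption the episode score is binomial. It also shows the standard error of the mean, which is the per-episode standard deviation divided by the square root of the number of episodes.

```python
import math

n_attempts = 10     # choices per episode
p_hit = 1 / 10      # chance that a random choice matches the lit box (assumed uniform)
n_episodes = 500    # number of episodes played

expected_score = n_attempts * p_hit                        # binomial mean
score_std = math.sqrt(n_attempts * p_hit * (1 - p_hit))    # binomial standard deviation
std_error = score_std / math.sqrt(n_episodes)              # standard error of the mean

print(f'Expected score: {expected_score:.3f}')
print(f'Standard deviation per episode: {score_std:.3f}')
print(f'Standard error of the mean: {std_error:.3f}')
```

Under these assumptions the expected score is 1 and the per-episode standard deviation is about 0.95, broadly in line with the measured values.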


## Neural network agent

In [13]:
from keras.models import Sequential
from keras.layers import Dense
from keras.losses import mean_squared_error

In [14]:
def custom_loss(Y_target, Y_pred):
    return mean_squared_error(Y_target, Y_pred)

In [15]:
model = Sequential()
model.add(Dense(32, input_shape=(1,), activation='relu'))
model.add(Dense(10, activation='relu'))

model.compile(
    optimizer='adam',
    loss=custom_loss
)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                64        
_________________________________________________________________
dense_2 (Dense)              (None, 10)                330       
Total params: 394
Trainable params: 394
Non-trainable params: 0
_________________________________________________________________
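The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is introduced here only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

hidden = dense_params(1, 32)    # one input (the state) feeding 32 hidden units
output = dense_params(32, 10)   # 32 hidden units feeding 10 outputs, one per box
total = hidden + output

print(hidden, output, total)
```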


In [16]:
env = Environment()
nn_agent = Agent(0.2, model, random_only=False)
game = Game()

In [17]:
n_rounds = 500
scores = []

for _ in tqdm(range(n_rounds)):
    scores.append(game.play_one_episode(nn_agent, env, 0.0))
    
scores = np.array(scores)




In [18]:
trace = go.Scatter(
    x = np.arange(1, len(scores)+1),
    y = scores,
    mode='markers'
)

layout = go.Layout(
    xaxis = dict(
        title='Episode number'
    ),
    yaxis = dict(
        title='Score'
    )
)

data = [trace]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

In [19]:
print('Average score:')
print(scores[350:].mean())  # mean over the last 150 episodes only
print('Standard deviation:')
print(scores.std())  # standard deviation over all 500 episodes

Average score:
0.96
Standard deviation:
0.9178235124467012


### Optimization possible on:
- Memory size
- Batch size
- Attempts/episode
- Number of episodes
- Choice of optimizer
- $\epsilon$-greedy strategy parameter ($\epsilon$)
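As a reminder of what the last parameter controls, here is a minimal sketch of $\epsilon$-greedy action selection. It is not taken from `rl_tools`; the function name `epsilon_greedy` and the 0-based action indices are assumptions, and `Agent.choose_action` may be implemented differently.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a 0-based action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

q_values = [0.1, 0.0, 0.7, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)    # never explores
explore_action = epsilon_greedy(q_values, epsilon=1.0)   # always explores
```

With `epsilon=0.0` the agent always takes the action with the largest Q-value; with `epsilon=1.0` it behaves like the random agent.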

# Building the training routine

Training: build a training batch and compute the target variables.

In [None]:
training_batch = np.array(
    random.sample(nn_agent.memory, int(len(nn_agent.memory)/10))
)
training_batch

In [None]:
training_batch.shape

In [None]:
training_batch[:,3].shape

In [None]:
training_batch[:,3]
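The batch construction can be illustrated without the agent. The sketch below assumes the memory is a bounded queue of `(state, action, reward, next_state)` tuples, matching the column order used in this section; the `deque`, its size, and the synthetic transitions are illustrative choices, not the `rl_tools` implementation.

```python
import random
from collections import deque

# Hypothetical replay memory: a bounded queue of (state, action, reward, next_state) tuples.
memory = deque(maxlen=1000)
rng = random.Random(0)
for _ in range(200):
    state = rng.randint(1, 10)
    action = rng.randint(1, 10)
    reward = 1 if action == state else 0
    next_state = rng.randint(1, 10)
    memory.append((state, action, reward, next_state))

batch_size = len(memory) // 10                  # one tenth of the stored transitions
batch = rng.sample(list(memory), batch_size)    # sampled without replacement
```

Sampling at random rather than taking the most recent transitions breaks the correlation between consecutive steps in the batch.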

Compute the Q-values (each of which is a 10-component array) for each final state in the transitions.

In [None]:
np.argmax(nn_agent.compute_q(training_batch[:,3]), axis=1)

Compute the target variable ($y_i$) for each of the transitions in the training batch.

In [None]:
np.amax(nn_agent.compute_q(training_batch[:,3]), axis=1)+1

In [None]:
Y_target = (training_batch[:,2]
    + nn_agent.gamma
    * np.amax(nn_agent.compute_q(training_batch[:,3]), axis=1))

In [None]:
Y_target.shape
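The target computed above is $y_i = r_i + \gamma \max_{a'} Q(s'_i, a')$. A small worked example with made-up numbers shows the shapes involved; the value of `gamma` and the Q-value matrix are illustrative stand-ins for `nn_agent.gamma` and the network output.

```python
import numpy as np

gamma = 0.9   # discount factor (illustrative value)
rewards = np.array([0.0, 1.0, 0.0])

# Stand-in for the network output on the final states: one row of Q-values per transition.
q_next = np.array([
    [0.1, 0.5, 0.2],
    [0.0, 0.3, 0.9],
    [0.4, 0.4, 0.1],
])

# y_i = r_i + gamma * max over actions of Q(s'_i, a')
y_target = rewards + gamma * q_next.max(axis=1)
```

The result has one target per transition, so its shape matches the number of rows in the batch.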

Compute the "prediction" from the model, based on the initial state and the action taken. This means computing the Q-value array associated with the initial state and then selecting, from each array, the component corresponding to the action the agent performed in that transition.

In [None]:
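# Pick one Q-value per transition (np.take without an axis indexes the flattened array); the -1 assumes actions are box labels 1-10
Y_pred = nn_agent.compute_q(training_batch[:,0])[np.arange(len(training_batch)), training_batch[:,1].astype(int) - 1]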

In [None]:
Y_pred.shape
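Selecting one component per row needs some care in NumPy: `np.take` without an `axis` argument indexes the flattened array, which is not the same as picking one entry from each row. The toy example below uses made-up Q-values and 0-based action indices to show the difference.

```python
import numpy as np

q = np.array([
    [0.1, 0.5, 0.2],
    [0.0, 0.3, 0.9],
    [0.4, 0.6, 0.1],
])
actions = np.array([1, 2, 0])   # 0-based column index of the action taken in each row

flat = np.take(q, actions)                       # no axis: indexes the flattened array
per_row = q[np.arange(len(actions)), actions]    # one entry from each row
```

Here `flat` reads three entries from the first row only, while `per_row` returns the Q-value of the chosen action for each transition.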

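Optimization is done on the MSE between `Y_pred` and `Y_target`. Keras's `fit` takes the network inputs first and the targets second, and the targets must have the same 10 components as the network output. The target for each transition is therefore the current Q-value array with the entry of the action taken replaced by $y_i$ (the `- 1` below assumes actions are box labels from 1 to 10).

In [None]:
Q_target = nn_agent.compute_q(training_batch[:,0])
Q_target[np.arange(len(Q_target)), training_batch[:,1].astype(int) - 1] = Y_target
nn_agent.model.fit(training_batch[:,0].reshape(-1, 1), Q_target, epochs=1)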