In this practical assignment you will learn how to work with reinforcement learning. You will be provided with an environment that generates rewards. Your goal is to implement a tabular Q-learning agent class that learns to accumulate reward over the course of training.

Your code will be evaluated on clarity, conciseness and efficiency, so document it. Convert the resulting notebook, including the required outputs, to PDF and upload it to Blackboard. Comments can be added in the notebook using markdown cells. Note that for code development you may want to use the PyCharm IDE, which facilitates debugging.

### Assignment 1

Below we provide a class Foo, which is an environment with which an agent can interact. Use a markdown cell to explain how the environment generates observations and responds to actions.

In [170]:
import numpy as np

class Foo(object):
    """
    Very simple environment for testing fully observed models. The actor gets a reward when it correctly guesses
    the ground truth. The ground truth (0 or 1) probabilistically determines how many 0s and 1s appear in the observation.
    """

    def __init__(self, n, p = 0.8):
        """

        Args:
            n: number of inputs
            p: probability of emitting the right sensation at the input
        """

        super(Foo, self).__init__()

        self.ninput = n
        self.p = p
        self.noutput = 2

        self.reset()

    def reset(self):

        self.state = np.random.randint(0, 2)

        p = np.array([1 - self.p, self.p])

        if self.state == 0:
            p = 1 - p

        obs = np.random.choice(2, [1, self.ninput], True, p)

        return obs.astype(np.float32)

    def step(self, action):

        # reward is +1 for a correct action and -1 otherwise
        reward = 2 * int(action == self.state) - 1

        # draw a new ground truth and the matching observation
        obs = self.reset()
        done = True

        return obs, reward, done

    def get_ground_truth(self):
        """
        Returns: ground truth state of the environment
        """

        return self.state

    def set_ground_truth(self, ground_truth):
        """
        Args:
            ground_truth: new ground truth state of the environment
        """

        self.state = ground_truth
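
As a starting point for the explanation asked for in Assignment 1, the sampling done in `reset` can be reproduced in isolation. The sketch below is not part of the provided code: `sample_observation` is a hypothetical helper that fixes the ground truth by hand and assumes, as in `Foo.reset`, that each input independently equals the ground truth with probability `p`.

```python
import numpy as np

def sample_observation(state, n, p=0.8, rng=None):
    """Hypothetical helper: draw one observation for a fixed ground truth `state`."""
    rng = np.random.default_rng() if rng is None else rng
    # Probability of emitting a 0 and a 1, respectively.
    probs = np.array([1 - p, p]) if state == 1 else np.array([p, 1 - p])
    return rng.choice(2, size=(1, n), replace=True, p=probs).astype(np.float32)

rng = np.random.default_rng(0)
# Stack many observations generated under ground truth 1.
samples = np.concatenate([sample_observation(1, 2, 0.8, rng) for _ in range(2000)])
print(samples.mean())  # fraction of inputs equal to 1; should be close to p
```

Note that in `Foo` itself `step` calls `reset`, so the ground truth is redrawn after every action and `done` is always `True`.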

### Assignment 2

Below we provide a default implementation of an agent which takes random actions and an experimental run which shows how an agent can interact with an environment. Run the random agent on the environment and plot the cumulative reward gained throughout the experiment.

In [176]:
###
# Random agent

class RandomAgent(object):
    """
    Agent which takes random actions
    """

    def __init__(self, ninput, noutput, **kwargs):

        # Number of input variables
        self.ninput = ninput

        # Number of actions
        self.noutput = noutput

    def act(self, obs):
        """
        Perform random action.
        """

        return np.random.randint(self.noutput)
    
    def learn(self, obs, obs2, action, reward, done):
        """
        Nothing to do: the random agent ignores the transition.
        """
        pass

In [179]:
# number of iterations
niter = 10**3

###########
# Environment specification

env = Foo(2,0.8)

###########
# Agent specification

agent = RandomAgent(env.ninput, env.noutput)

###########
# train phase

rewards = np.zeros([niter, 1])

obs = env.reset()
reward = done = None

for i in range(niter):

    # Choose an action
    action = agent.act(obs)

    # Perform action and receive new observations and reward
    obs2, reward, done = env.step(action)

    # Store reward
    rewards[i] = reward
    
    # Let the agent learn from the observed transition (a no-op for the random agent)
    agent.learn(obs, obs2, action, reward, done)
    
    obs = obs2

### Assignment 3

Now implement a TabularQAgent that updates its policy using tabular Q-learning. Reuse the experimental run above to show that the cumulative reward grows faster than it does for the random agent.
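
One possible shape for such an agent is sketched below. It is an illustration rather than a reference solution: the epsilon-greedy exploration, the dictionary-based table and the default values of `alpha`, `gamma` and `epsilon` are all assumptions, and only the `act`/`learn` interface is taken from `RandomAgent`.

```python
import numpy as np

class TabularQAgent(object):
    """Sketch of an epsilon-greedy tabular Q-learning agent (all defaults are assumptions)."""

    def __init__(self, ninput, noutput, alpha=0.1, gamma=0.9, epsilon=0.1, **kwargs):
        self.ninput = ninput
        self.noutput = noutput
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = {}             # maps an observation key to an array of action values

    def _key(self, obs):
        # Binary observations are turned into a hashable tuple of ints.
        return tuple(int(x) for x in np.ravel(obs))

    def _values(self, obs):
        # Unseen observations start with all-zero action values.
        return self.q.setdefault(self._key(obs), np.zeros(self.noutput))

    def act(self, obs):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.noutput)
        return int(np.argmax(self._values(obs)))

    def learn(self, obs, obs2, action, reward, done):
        # Q-learning target: bootstrap from the next observation unless the episode ended.
        target = reward
        if not done:
            target += self.gamma * np.max(self._values(obs2))
        values = self._values(obs)
        values[action] += self.alpha * (target - values[action])
```

Because the environment above always returns `done = True`, the bootstrap term is never used there and the update reduces to a running average of the immediate reward per observation and action.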

### Assignment 4

Show how varying the observation probability <code>p</code> affects convergence.
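
When interpreting the resulting curves, it may help to know the best average reward any agent could reach for a given <code>p</code>. The sketch below computes that ceiling under the assumptions that the ground truth is 0 or 1 with equal probability and that each input independently matches it with probability `p`, as in the environment above; `optimal_expected_reward` is a hypothetical helper, not part of the provided code.

```python
from math import comb

def optimal_expected_reward(n, p):
    """Expected reward per step of the best possible policy for n inputs."""
    accuracy = 0.0
    for k in range(n + 1):  # k = number of inputs showing a 1
        like1 = comb(n, k) * p**k * (1 - p)**(n - k)  # P(k ones | ground truth 1)
        like0 = comb(n, k) * (1 - p)**k * p**(n - k)  # P(k ones | ground truth 0)
        # The best policy picks the more likely ground truth; both have prior 0.5.
        accuracy += 0.5 * max(like1, like0)
    # Reward is +1 when correct and -1 otherwise.
    return 2 * accuracy - 1

for p in (0.5, 0.6, 0.8, 1.0):
    print(p, round(optimal_expected_reward(2, p), 3))
```

A learned agent's average reward per step can then be compared against this value for each setting of <code>p</code>.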