## Deep Reinforcement Learning homework 1
### Daniel Kuknyo
Consider a 2-armed bandit situation. You are in a hurry driving a car and want to go from Buda to Pest. You have to get there early since you have an important meeting. Unfortunately, there are only two bridges, and you have to take one to cross the river.

Sometimes there is heavy traffic on the bridges, and it is possible that you jam on them, and miss your meeting. If the traffic is regular, then the crossing time is shorter. For both bridges, there is a different p probability of the jam. The reward (here, it will be a negative number) of the jam and the regular pass is different. The reward means the negative value of the time you need to cross the bridge. Note that you could also talk about costs that should be minimized. It would be more natural to use costs here, but in RL, most works talk about rewards.

Optimize the exploration-exploitation strategy to minimize your time (maximize reward) during 1000 steps (episodes). Exploration: the number of random choices between the two bridges. Exploitation: the sudden or gradual switch to the bridge that you consider better.

You’ll be given the jamming probability of the bridges and also the rewards. You can trivially compute the solution. Compute it. This way, you know what the best you could achieve is. However, your agent knows nothing and learns from experiences... Run a sufficient number of random tests to show the improving (average) performance throughout the 1000 learning steps.

Create a graph about the rewards during the 1000 steps. In what circumstances (jamming probabilities) could this problem be hard or easy? Why?

You have to use Python during the task and have to upload the solution in Jupyter notebook (.ipynb) format. If you are not familiar with it, here is a quick and easy tutorial: https://www.dataquest.io/blog/jupyter-notebook-tutorial/ (Links to an external site.)

You'll find two sets of model values in the Homework_1_input_data.csv under your Neptun code. Solve the task for both set. If you have any questions feel free to ask.

In [13]:
# Import modules 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
%matplotlib inline

In [14]:
# Add hyperparameters
APjam = 0.18
ARewardJam = 49
ARewardNormal = 9

BPjam = 0.29
BRewardJam = 46
BRewardNormal = 12

In [1]:
class eps_bandit:
    '''
    k-bandit problem
    
    Inputs
    =====================================================
    k: number of arms (int)
    
    eps: probability of random action 0 < eps < 1 (float)
    
    iters: number of steps (int)
    
    mu: set the average rewards for each of the k-arms.
        Set to "random" for the rewards to be selected from
        a normal distribution with mean = 0. 
        Set to "sequence" for the means to be ordered from 
        0 to k-1.
        Pass a list or array of length = k for user-defined
        values.
    '''
    
    def __init__(self, k, eps, iters, mu='random'):
        # Number of arms
        self.k = k
        # Search probability
        self.eps = eps
        # Number of iterations
        self.iters = iters
        # Step count
        self.n = 0
        # Step count for each arm
        self.k_n = np.zeros(k)
        # Total mean reward
        self.mean_reward = 0
        self.reward = np.zeros(iters)
        # Mean reward for each arm
        self.k_reward = np.zeros(k)
        
        if type(mu) == list or type(mu).__module__ == np.__name__:
            # User-defined averages            
            self.mu = np.array(mu)
        elif mu == 'random':
            # Draw means from probability distribution
            self.mu = np.random.normal(0, 1, k)
        elif mu == 'sequence':
            # Increase the mean for each arm by one
            self.mu = np.linspace(0, k-1, k)
        
    def pull(self):
        # Generate random number
        p = np.random.rand()
        if self.eps == 0 and self.n == 0:
            a = np.random.choice(self.k)
        elif p < self.eps:
            # Randomly select an action
            a = np.random.choice(self.k)
        else:
            # Take greedy action
            a = np.argmax(self.k_reward)
            
        reward = np.random.normal(self.mu[a], 1)
        
        # Update counts
        self.n += 1
        self.k_n[a] += 1
        
        # Update total
        self.mean_reward = self.mean_reward + (reward - self.mean_reward) / self.n
        
        # Update results for a_k
        self.k_reward[a] = self.k_reward[a] + (reward - self.k_reward[a]) / self.k_n[a]
        
    def run(self):
        for i in range(self.iters):
            self.pull()
            self.reward[i] = self.mean_reward
            
    def reset(self):
        # Resets results while keeping settings
        self.n = 0
        self.k_n = np.zeros(k)
        self.mean_reward = 0
        self.reward = np.zeros(iters)
        self.k_reward = np.zeros(k)

In [20]:
APjam = 0.18
ARewardJam = 49
ARewardNormal = 9

k = 2
iters = 1000

eps_0_rewards = np.

episodes = 1000

# Run experiments
for i in range(episodes):
    # Initialize bandits
    eps_0 = eps_bandit(k, APjam, iters)
    
    # Run experiments
    eps_0.run()
    
    # Update long-term averages
    eps_0_rewards = eps_0_rewards + (eps_0.reward - eps_0_rewards) / (i + 1)
    
plt.figure(figsize=(12,8))
plt.plot(eps_0_rewards, label="$\epsilon=0$ (greedy)")
plt.legend(bbox_to_anchor=(1.3, 0.5))
plt.xlabel("Iterations")
plt.ylabel("Average Reward")
plt.title("Average $\epsilon-greedy$ Rewards after " + str(episodes) + " Episodes")
plt.show()

TypeError: 'float' object cannot be interpreted as an integer