### 0. Introduction

This project report has four sections. Section 1 implements a basic driving agent that chooses a random action at each time step, and then describes the agent's observed behavior. Section 2 discusses the choice of the agent's state, which is used by the Q-learning algorithm. Section 3 implements the Q-learning algorithm and explains the code in detail, including the data structures I use. The learning rate, discount factor, and random action selection method used in section 3 are the best-performing choices found in section 4. Section 4 first defines a performance evaluation metric for the agent. The tuning parameters (learning rate, discount factor, and random action selection method) are then adjusted systematically to find the best performance of the agent, and the best parameter values and the resulting performance are reported. The Q-table generated with those parameters is then explored in detail. The report ends with a conclusion.

### 1. Implement a basic driving agent

This section implements the basic driving agent. The work is simple: we draw a random action from the list **[None, "forward", "left", "right"]** and assign it to the variable **action**. This takes one added line in the **update()** method of the **LearningAgent** class. The line of code is:  

In [None]:
import random

# class LearningAgent(Agent):

def update(self, t):
    #some other code

    #pick a random action from (None, "left", "right", "forward");
    #this line must come before reward = self.env.act(self, action)
    action = random.choice([None, "left", "right", "forward"])

    #some other given code
    reward = self.env.act(self, action)

After implementing the random action, I set **enforce_deadline** (within the **run()** function) to **True**, ran the agent, and observed its performance. 

**Question:** *In your report, mention what you see in the agent’s behavior. Does it eventually make it to the target location?*  

**Answer:**  
At each step, the agent first chooses a random **action**. The traffic light at the agent's current intersection is then checked. If the light permits the action, the agent acts accordingly and its location is updated. If the light does not permit the action, meaning the action would violate the traffic rules, the agent takes no action (**None**).  

Since the **action** is chosen uniformly at random from four options at each time step, the probability that it equals **self.next_waypoint** is only 0.25. So, at most steps the agent does not move towards the target. In the run I observed, the Manhattan distance between the agent and the target was only 5 at the beginning, but the agent took 157 steps to finally reach the target.  

### 2. Identify and update state

The state I think is appropriate for modeling the driving agent is the combination of the current traffic light and **self.next_waypoint**. The current traffic light can be read from the dictionary **inputs** (**inputs["light"]**). The variable **self.next_waypoint** is already obtained from the route planner in the starter code. The state is stored in the **update()** method as a tuple of **inputs["light"]** and **self.next_waypoint**, that is, **self.state = (inputs["light"], self.next_waypoint)**. The **deadline** is not included in **self.state**; the reason is given in the Q&A part below.

**Question:** *Justify why you picked these set of states, and how they model the agent and its environment.*

**Answer:**  
The **self.state** I defined is a tuple that consists of the current traffic light and **self.next_waypoint**. The current traffic light is obtained from **inputs["light"]**. The **self.next_waypoint** is obtained by calling the **next_waypoint()** method of the **RoutePlanner** class.  

The **inputs** variable is a Python dictionary obtained by calling the **sense()** method of the **Environment** class. Its keys are **"light"**, **"oncoming"**, **"left"** and **"right"**. Only **inputs["light"]** is used as part of the state, for the following reason. The **reward** is obtained by calling the **act()** method of the **Environment** class. When I checked **act()**, I found that the other dummy agents have no effect on the value of the **reward**: the variable **move_okay** depends only on the traffic light, regardless of whether other dummy agents are at the same location. So **inputs["light"]** is an important part of the state that must be considered during learning, while **inputs["oncoming"]**, **inputs["left"]** and **inputs["right"]** can be ignored because they describe only the dummy agents.   

The other component I add to **self.state** is **self.next_waypoint**. It is obtained by calling the **next_waypoint()** method of the **RoutePlanner** class, which returns a proposed action based on the location of the agent relative to the target. The reason for choosing **self.next_waypoint** is as follows. The **act()** method of the **Environment** class shows that, before the agent arrives at the target, the highest value of **reward** (2) is assigned when the **action** equals **self.next_waypoint**. So **self.next_waypoint** is also an important part of the state for Q-learning.  

Another important point is that, in my opinion, **deadline** should not be included in **self.state**. The reason is that whatever the value of **deadline** is, we always want to choose the correct **action**, so the **deadline** carries no extra information for selecting it. Moreover, adding the redundant **deadline** to **self.state** increases the number of distinct states, so more trials would be needed to train a feasible policy. For these reasons, the **deadline** is not added to **self.state**.

The ideal **action** is as follows: if the traffic light allows the move, the ideal **action** is **self.next_waypoint**; otherwise, the ideal **action** is **None**. So the goal of this project is to build a table of Q-values with the Q-learning algorithm such that the **action** chosen from the Q-values is the ideal **action**. The implementation of Q-learning is described in the next section. 
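
The target policy described above can be written down directly, which is handy as a reference when inspecting a learned Q-table. The following is only a minimal sketch: it assumes the light takes the values `"green"` and `"red"` and simplifies "the light allows the move" to "the light is green". The helper name `ideal_action` is hypothetical and not part of the starter code.

```python
def ideal_action(light, next_waypoint):
    """Return the action the learned policy should converge to.

    Assumption: a move is treated as allowed only when the light is green.
    """
    if light == "green":
        return next_waypoint
    return None


# The state is the same (light, next_waypoint) tuple used for self.state.
for state in [("green", "forward"), ("red", "forward"), ("green", "left")]:
    print(state, "->", ideal_action(*state))
```

Comparing the highest-valued action in each row of the Q-table against this function is one simple way to check whether learning has reached the intended policy.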

### 3. Implement Q-Learning

The Q-learning update can be written as the following formula:
$$\hat{Q}(s, a) \xleftarrow{\alpha} reward + \gamma\max_{a'}\hat{Q}(s',a')$$
So, the Q-learning algorithm can be implemented by initializing a table of Q-values and updating it at each time step. 
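
For reference, the arrow notation with $\alpha$ is commonly read as "move the left-hand side towards the right-hand side with step size $\alpha$". Under that reading, which matches the update line used later in the code of step 5, the formula expands to:

$$\hat{Q}(s, a) \leftarrow (1-\alpha)\,\hat{Q}(s, a) + \alpha\left(reward + \gamma\max_{a'}\hat{Q}(s',a')\right)$$

With $\alpha = 1$ the old estimate is overwritten by the new target; as $\alpha$ shrinks, each new target changes the stored Q-value less.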

#### 3.1 Initialization  

The table of Q-values (**self.q_table**) is stored as a Python dictionary. The keys of **self.q_table** are the states of the agent. For a specific key **self.state**, **self.q_table[self.state]** is itself a dictionary whose keys are the actions **(None, "left", "right", "forward")**. For a specific **action**, **self.q_table[self.state][action]** is the Q-value of the pair **(self.state, action)**. 
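
As an illustration of this nested layout, here is a small standalone sketch. It uses a plain `q_table` variable instead of **self.q_table**, and the state and Q-value shown are made up for the example.

```python
ACTIONS = [None, "left", "right", "forward"]

q_table = {}

# A state is a (light, next_waypoint) tuple; tuples are hashable, so they
# can serve as dictionary keys.
state = ("green", "forward")
if state not in q_table:
    q_table[state] = {action: 0.0 for action in ACTIONS}

# Q-value of one (state, action) pair (illustrative value only).
q_table[state]["forward"] = 2.0

# The greedy action for a state is the key with the largest Q-value.
best = max(q_table[state], key=q_table[state].get)
print(best)
```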

Besides **self.q_table**, some parameters of the Q-learning algorithm, such as the learning rate $\alpha$ (**self.alpha**) and the discount factor $\gamma$ (**self.gamma**), also need to be initialized. The variable **self.alpha** is initialized to 1 and decays over time. The variable **self.gamma** is initialized to a float between 0 and 1 and remains fixed during the whole learning process. Also, a variable **self.t**, which denotes the time step, is initialized to 0; it is used to compute the decayed **self.alpha** at each time step.

To avoid getting stuck in a "local minimum", the $\epsilon$-greedy method is used. The variable $\epsilon$ (**self.epsilon**) is initialized to 1 and also decays over time steps.

The last variable that needs to be initialized is **self.n_success**. It counts the number of times the agent arrives at the destination within the deadline, and it is used for the performance evaluation in section 4.

Based on the idea described above, the initialization can be done in the **\__init__** method of **LearningAgent** class as follows:

In [None]:
#class LearningAgent(Agent):
    
def __init__(self, env):
    #some other code
    
    #my code for initializing some necessary variables
    self.q_table = dict()  # table of Q-values
    self.alpha = 1.0       # learning rate
    self.gamma = 0.4       # discount factor
    self.epsilon = 1.0     # epsilon for epsilon-greedy method
    self.t = 0             # time step
    self.n_success = 0     # number of trials that reached the
                           # destination within the deadline
    

#### 3.2 Select Action and Update Q-values  

In this subsection, the **update()** method of **LearningAgent** class will be implemented. The implementation will be illustrated in the following 5 steps. 

**Step 1:** Update **self.t**, **self.alpha** and **self.epsilon**  
The learning rate decays over time steps as $\alpha_{t} = \frac{1}{t}$, which is a classic choice for $\alpha$. For epsilon, I chose a decay at constant speed until it reaches 0: $\epsilon_{t} = \epsilon_{t-1} - 0.00046$. This can be done by the code:

In [None]:
# class LearningAgent(Agent):
    
def update(self, t):
    #some other code
        
    #update self.t, self.epsilon, self.alpha
    self.t += 1
    self.alpha = 1.0/self.t
    #decay epsilon linearly and stop at 0
    self.epsilon = max(0.0, self.epsilon - 0.00046)
        
    #some other code
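
To get a feel for how quickly these two schedules move, they can be evaluated in closed form. This is a sketch only: `alpha_at` and `epsilon_at` are hypothetical helper names, and the closed-form epsilon is an equivalent of the repeated subtraction up to floating-point rounding.

```python
import math

EPSILON_STEP = 0.00046   # decrement used in the report


def alpha_at(t):
    """Learning rate after t updates (t >= 1)."""
    return 1.0 / t


def epsilon_at(t):
    """Exploration rate after t updates, floored at 0."""
    return max(0.0, 1.0 - EPSILON_STEP * t)


# First step at which exploration has stopped completely.
steps_to_zero = math.ceil(1.0 / EPSILON_STEP)
print(steps_to_zero, epsilon_at(steps_to_zero))
```

With this decrement, random exploration ends after roughly 2,200 time steps in total, counted across all trials because **self.t** is never reset.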

**Step 2:** Update state (**self.state**)   
The update of **self.state** was already implemented in section 2. The remaining task is to check whether the current **self.state** is in **self.q_table**. If it is not, **self.q_table[self.state]** is initialized so that the Q-values of all actions in the current state are 0. This can be done by the following code:

In [None]:
#class LearningAgent(Agent):
    
def update(self, t):
    #some other code
        
    #update state (done in section 2)
    self.state = (inputs["light"], self.next_waypoint)
    
    if self.state not in self.q_table:
        #initialize q_table[self.state] with all Q-values at 0
        self.q_table[self.state] = {
            None: 0, "left": 0, "right": 0, "forward": 0}
            
    #some other code

**Step 3:** Select action according to my policy  
Given the current state **self.state**, we want to choose the action that has the highest Q-value (**max_q**). If several actions share the same highest Q-value, we randomly choose one of these candidate actions. My code generates a list called **actions** that stores the candidate actions with the highest Q-value. The length of **actions** may be 1 or more.   

If we select the action based only on the Q-values at every time step from the beginning, we may be trapped in a "local minimum", which in this project can lead us to choose the sub-optimal action "None" every time. To avoid this problem, the $\epsilon$-greedy method is applied. The basic idea of the $\epsilon$-greedy method is that at each time step we choose randomly among all possible actions with probability $\epsilon$, and we choose the action with the highest Q-value with probability $1-\epsilon$. In this project, $\epsilon$ (**self.epsilon**) is initialized to 1 and then decays over time. So, during the early stage of learning the agent is more likely to choose an action randomly, and the probability of a random choice becomes smaller and smaller as time goes on.   

The uniform distribution is used to implement the $\epsilon$-greedy method. First, we draw a random number **e** from the uniform distribution between 0 and 1. If **e** is less than **self.epsilon**, a randomly chosen action is used; otherwise the action is selected according to the highest Q-value.

The above illustrated work can be coded as follows:

In [None]:
#class LearningAgent(Agent):
    
def update(self, t):
    #some other code
        
    #collect the candidate actions that share the highest Q-value
    q_values = self.q_table[self.state]
    max_q = max(q_values.values())
    actions = [a for a, q in q_values.items() if q == max_q]
                
    #epsilon-greedy method
    e = random.uniform(0,1)
    if e < self.epsilon:
        #explore: pick any action at random
        action = random.choice([None, "left", "right", "forward"])
    else:
        #exploit: pick one of the best actions, ties broken randomly
        action = random.choice(actions)
            
    #some other code
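
The same selection rule can also be tried outside the agent class. The sketch below is a standalone version under stated assumptions: `choose_action` is a hypothetical helper, and the Q-values in the example are made up.

```python
import random

ACTIONS = [None, "left", "right", "forward"]


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a dict mapping action -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)            # explore
    max_q = max(q_values.values())
    best = [a for a, q in q_values.items() if q == max_q]
    return rng.choice(best)                   # exploit, ties broken randomly


# Illustrative Q-values for one state.
q_values = {None: 0.0, "left": -0.5, "right": -0.5, "forward": 2.0}
print(choose_action(q_values, epsilon=0.0))   # epsilon 0 means always greedy
```

Setting `epsilon=1.0` makes every choice random, which corresponds to the basic agent of section 1.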

**Step 4:** Update **self.n_success**  
After we choose our **action**, the **reward** is obtained from the given code. In this step, we check the value of the **reward**. The **act()** method in the **Environment** class shows that if the agent arrives at the destination within the deadline, the **reward** receives an extra bonus of 10. This means the **reward** must be greater than 2 if the agent successfully arrives at the destination before the deadline, since 2 is the highest reward for an ordinary step. So, our code checks whether the **reward** is greater than 2 and, if so, increments the number of successes (**self.n_success**) by 1. This variable is used for performance evaluation in section 4. The code is written as:

In [None]:
#class LearningAgent(Agent):
    
def update(self, t):
    #some other code
        
    #execute action and get reward
    reward = self.env.act(self, action) 
    #this line is given in the starter code
        
    #update self.n_success, my code
    if reward > 2:
        self.n_success += 1

**Step 5:** Update table of Q-values based on state, action and reward  
After we get the **reward** from step 4, we update the table of Q-values using this formula:
$$\hat{Q}(s, a) \xleftarrow{\alpha} reward + \gamma\max_{a'}\hat{Q}(s',a')$$
First, we compute **new_state**, which is $s'$ in the formula. Second, we compute $reward + \gamma\max_{a'}\hat{Q}(s',a')$ as **q_hat** in the code. Then we update **self.q_table** by blending **q_hat** with the old Q-value, weighted by **self.alpha**. The code is shown below:

In [None]:
#class LearningAgent(Agent):
    
def update(self, t):
    #some other code
        
    #Learn and update Q-value based on state, action and reward
    new_state = (self.env.sense(self)["light"],
                 self.planner.next_waypoint())
        
    if new_state not in self.q_table:
        #unseen state: its Q-values would all be initialized to 0
        q_hat = reward + self.gamma * 0
    else:
        q_hat = reward + \
                self.gamma * max(self.q_table[new_state].values())
            
    self.q_table[self.state][action] = \
        self.alpha*q_hat + (1-self.alpha)*self.q_table[self.state][action]
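
To check the update rule by hand, it can be isolated in a small function. This is a sketch with hypothetical names (`q_update`) and made-up Q-values, not the code used in the agent.

```python
def q_update(q_table, state, action, reward, new_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    # An unseen next state contributes 0, matching the zero initialization.
    next_values = q_table.get(new_state)
    max_next = max(next_values.values()) if next_values else 0.0
    target = reward + gamma * max_next
    old = q_table[state][action]
    q_table[state][action] = (1 - alpha) * old + alpha * target
    return q_table[state][action]


# Illustrative table with two states and two actions each.
q_table = {
    ("green", "forward"): {None: 0.0, "forward": 1.0},
    ("red", "forward"): {None: 3.0, "forward": -1.0},
}
new_q = q_update(q_table, ("green", "forward"), "forward",
                 reward=2.0, new_state=("red", "forward"),
                 alpha=0.5, gamma=0.4)
print(round(new_q, 6))
```

Here the target is 2 + 0.4 × 3 = 3.2, and blending it equally with the old value 1 gives 2.1.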

**Question:** *What changes do you notice in the agent’s behavior*?

**Answer:**  
The agent behaves differently: it uses far fewer steps to arrive at the target. In the run I observed, the Manhattan distance between the agent and the target was 8 at the beginning, and the agent took 53 steps to arrive at the destination. Compared to the case in section 1 (157 steps from a distance of 5), this agent used far fewer steps. The different behavior is due to the change in the action selection method. The agent's behavior improved as a result of Q-learning.  

### 4. Enhance the driving agent

In Q-learning, the performance of the agent depends on the choice of tuning parameters such as the learning rate (**self.alpha**), the discount factor (**self.gamma**) and the action selection method (**self.epsilon**). The goal is for the agent to learn a feasible policy within 100 trials. With this goal in mind, the performance evaluation metric is defined first; then we discuss different choices of these parameters and how they improve the agent's performance.

#### 4.1 Performance Evaluation Metrics

The assigned task is to apply Q-learning so that the agent is able to learn a feasible policy within 100 trials. So, the performance evaluation metric I use in this project is the number of trials, among the last 10 of the 100 trials, in which the agent arrives at the destination within the deadline. The first 90 trials are used only to learn and update the Q-table. After the 90th trial, successful arrivals within the deadline are counted until the number of trials reaches 100. For example, if the agent arrives at the destination before the deadline in 7 of the last 10 trials, then 7 is the value of the agent's performance.  
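
The metric itself is a simple count, which can be illustrated independently of the simulator. In this sketch, `last_n_successes` is a hypothetical helper and the list of outcomes is made up; it does not represent results from the actual runs.

```python
def last_n_successes(outcomes, n=10):
    """Count successful trials among the final n entries of outcomes."""
    return sum(1 for reached in outcomes[-n:] if reached)


# Hypothetical record of 100 trials: True means the destination was
# reached within the deadline. These values are made up for illustration.
outcomes = [False] * 60 + [True] * 30 + [True] * 7 + [False] * 3
print(last_n_successes(outcomes))
```

Only the final 10 entries affect the score, so failures during the 90 learning trials are not penalized.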

The variable **self.n_success** was already implemented for the performance evaluation in section 3. To count only the successes after the 90th trial, we also need to revise the **run()** method of the **Simulator** class in **simulator.py** by adding two lines of code that reset the agent's **n_success** to 0 right after the 90th trial. The added code is shown below:  

In [None]:
#class LearningAgent(Agent):
    
def update(self, t):
    #some other given code
    self.next_waypoint = self.planner.next_waypoint()
    inputs = self.env.sense(self)
    deadline = self.env.get_deadline(self)
    
    #update the state, my code
    self.state = (inputs["light"], self.next_waypoint)
    
    #some other code

After the implementation, I observed how the reported state changes through the run. 