<b>Code</b>

The code is all in agent.py and consists of modifications in the update function, as well as three new functions: get_actions, get_max and init_qt.

<b>Implement a Basic Driving Agent</b>

<b>QUESTION:</b> Observe what you see with the agent's behavior as it takes random actions. Does the smartcab eventually make it to the destination? Are there any other interesting observations to note?

When taking random actions, the agent wanders all over the grid and sometimes racks up significant penalties. It rarely makes it to the destination: it reached it in only 1 of 10 runs. Another interesting observation is that the performance of the smartcab does not improve over time, since a random agent never uses what it has experienced.
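As a minimal sketch of this baseline (not the project's own code), the random agent amounts to a uniform draw over the action set. The list `VALID_ACTIONS` and the helper `random_action` are illustrative names, and the four action values are an assumption about the simulator.

```python
import random

# Assumed action set: wait (None) or move forward / left / right.
VALID_ACTIONS = [None, "forward", "left", "right"]

def random_action(rng=random):
    """Pick an action uniformly at random, ignoring the state entirely."""
    return rng.choice(VALID_ACTIONS)

# Draw 1000 actions with a seeded generator so the run is repeatable.
rng = random.Random(0)
picks = [random_action(rng) for _ in range(1000)]
for action in VALID_ACTIONS:
    print(action, picks.count(action))
```

Because nothing in `random_action` depends on past rewards, the draw on trial 100 is distributed exactly like the draw on trial 1.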

<b>Inform the Driving Agent</b>

<b>QUESTION:</b> What states have you identified that are appropriate for modeling the smartcab and environment? Why do you believe each of these states to be appropriate for this problem?

The state consists of five elements: the waypoint / GPS direction, the traffic light, and the direction / existence of traffic coming from each of the three directions (oncoming, left, right). These elements are appropriate because they are all relevant to the reward at the next step -- either through going in the right direction (waypoint), or obeying traffic laws (lights and traffic).

EDIT: We exclude the deadline from the elements of the state because it is not relevant to finding the fastest route -- the cab should attempt to find the fastest route whether or not it is under an explicit deadline. Additionally, the deadline has many possible values, so including it would make the Q-table much larger and require many more iterations to learn the right moves.

Additionally, we exclude traffic from the right because this traffic is not relevant given the traffic rules -- it does not impact our ability to go straight, take a left turn, or take a right turn.
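A minimal sketch of how such a state could be assembled is shown here. The `State` tuple, the `build_state` helper and the layout of the `inputs` dictionary are assumptions made for illustration; they are not taken from agent.py.

```python
from collections import namedtuple

# Hypothetical state layout: the field names are illustrative only.
State = namedtuple("State", ["waypoint", "light", "oncoming", "left"])

def build_state(next_waypoint, inputs):
    """Keep only the sensed values that can affect the next reward."""
    return State(
        waypoint=next_waypoint,
        light=inputs["light"],
        oncoming=inputs["oncoming"],
        left=inputs["left"],
    )

# Example sensor reading (assumed format); 'right' and the deadline are dropped.
inputs = {"light": "red", "oncoming": None, "left": "forward", "right": "left"}
state = build_state("forward", inputs)
print(state)
```

A tuple is used because it is hashable, so the state can serve directly as part of a dictionary key in a Q-table.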

<b>OPTIONAL:</b> How many states in total exist for the smartcab in this environment? Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state? Why or why not?

The total number of states is 384: 3 (waypoints) x 2 (lights) x 4 x 4 x 4 (four possible traffic states in each of the three traffic directions). With four possible actions at each state, there are 384 x 4 = 1,536 Q-values to learn. This is quite a large number, and we would have to run a large number of iterations to get meaningful values for all of them.

EDIT: By excluding traffic from the right, we have 96 states: 3 (waypoints) x 2 (lights) x 4 x 4 (four possible traffic states in each of the two remaining traffic directions). With four actions, that gives a total of 384 state-action pairs, which is more manageable than when traffic from the right is included.
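The state counts can be checked by enumerating every combination. This is a small sketch; the concrete value lists below are assumptions about what the simulator reports, and only their lengths matter for the count.

```python
from itertools import product

# Assumed value sets for each state element and for the actions.
waypoints = ["forward", "left", "right"]
lights = ["red", "green"]
traffic = [None, "forward", "left", "right"]
actions = [None, "forward", "left", "right"]

# Traffic sensed in three directions (oncoming, left, right).
full_states = list(product(waypoints, lights, traffic, traffic, traffic))
# Traffic sensed in two directions (oncoming, left).
reduced_states = list(product(waypoints, lights, traffic, traffic))

print(len(full_states), len(full_states) * len(actions))        # 384 1536
print(len(reduced_states), len(reduced_states) * len(actions))  # 96 384
```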

<b>Implement a Q-Learning Driving Agent</b>

<b>QUESTION:</b> What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken? Why is this behavior occurring?

The agent tends to follow traffic laws and tends to follow the waypoint direction as long as the path is clear. The reason is that it is naively trying to maximize the next step's reward.
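For reference, a textbook tabular Q-learning step with epsilon-greedy action selection can be sketched as follows. This is not the code in agent.py: the function names `best_value`, `choose_action` and `q_update`, the dictionary layout, and the reward of 2.0 are all illustrative assumptions. The default parameters mirror the values reported later in this write-up.

```python
import random
from collections import defaultdict

ACTIONS = [None, "forward", "left", "right"]  # assumed action set

def best_value(qt, state):
    """Largest Q-value available from `state` (unseen pairs default to 0)."""
    return max(qt[(state, a)] for a in ACTIONS)

def choose_action(qt, state, epsilon=0.2, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    top = best_value(qt, state)
    return rng.choice([a for a in ACTIONS if qt[(state, a)] == top])

def q_update(qt, state, action, reward, next_state, alpha=0.1, gamma=0.8):
    """Blend the old estimate with the newly observed target."""
    target = reward + gamma * best_value(qt, next_state)
    qt[(state, action)] = (1 - alpha) * qt[(state, action)] + alpha * target
    return qt[(state, action)]

# Toy example with a made-up reward of 2.0 for following the waypoint.
qt = defaultdict(float)
s = ("forward", "green", None, None)
s_next = ("left", "red", None, None)
first = q_update(qt, s, "forward", 2.0, s_next)   # 0.1 * 2.0 = 0.2
second = q_update(qt, s, "forward", 2.0, s_next)  # 0.9 * 0.2 + 0.1 * 2.0 = 0.38
greedy = choose_action(qt, s, epsilon=0.0)        # "forward"
```

After two rewarded steps the greedy choice in state `s` is already the rewarded action, which is the kind of behavior change described above.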

<b>Improve the Q-Learning Driving Agent</b>

<b>QUESTION:</b> Report the different values for the parameters tuned in your basic implementation of Q-Learning. For which set of parameters does the agent perform best? How well does the final driving agent perform?

I have set the number of trials to 1000. When alpha = .1; gamma = 0.8; epsilon = .2, the last 10 trials always succeed. When alpha = .5, however, the last 10 trials all fail. When learning too fast, we lose too much of our previous information and don't succeed in coming up with a good policy.
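One way to see why a large alpha discards earlier information: assuming the usual update of the form new = (1 - alpha) * old + alpha * target, the weight left on an old estimate after n further updates of the same entry is (1 - alpha) ** n. The helper name below is illustrative.

```python
def weight_on_old_estimate(alpha, n_updates):
    """Fraction of an old Q-value that survives n further updates."""
    return (1 - alpha) ** n_updates

# Compare the two learning rates after five updates of the same entry.
retained = {alpha: weight_on_old_estimate(alpha, 5) for alpha in (0.1, 0.5)}
for alpha, weight in retained.items():
    print("alpha = %s keeps %.1f%% of the old estimate" % (alpha, 100 * weight))
```

With alpha = .1 roughly 59% of the old estimate remains after five updates, while with alpha = .5 only about 3% remains, so each entry is dominated by the last few rewards it saw.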

<b>QUESTION:</b> Does your agent get close to finding an optimal policy, i.e. reach the destination in the minimum possible time, and not incur any penalties? How would you describe an optimal policy for this problem?

With the right parameters, it does appear that we come up with a good policy. Most generally, the best policy first obeys the traffic rules (avoids any penalties) and then takes direction from the waypoint (minimizing time).
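The policy described here can be written down as a small rule-based function. This is a sketch only: the right-of-way rules encoded below (forward needs a green light, left needs a green light with no conflicting oncoming car, right is allowed on red unless traffic from the left is going forward) are my assumption about how the simulator assigns penalties and should be checked against environment.py.

```python
def rule_based_action(waypoint, light, oncoming, left):
    """Follow the waypoint when the assumed traffic rules allow it, else wait."""
    if waypoint == "forward":
        allowed = light == "green"
    elif waypoint == "left":
        # Assumed rule: yield to oncoming cars going forward or turning right.
        allowed = light == "green" and oncoming in (None, "left")
    elif waypoint == "right":
        # Assumed rule: right on red is fine unless left traffic goes forward.
        allowed = light == "green" or left != "forward"
    else:
        allowed = False
    return waypoint if allowed else None

print(rule_based_action("forward", "green", None, None))
print(rule_based_action("right", "red", None, "forward"))
```

A learned Q-table approximates the optimal policy to the extent that its greedy action in each state agrees with a function like this one.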

EDIT: After 100 iterations, I exported the Q-table to see what it could tell us. First, I checked the average reward recorded across all state-action pairs: it was 0.38. Next, I checked the reward recorded when the action and the waypoint were the same AND the light was green. This average was 0.87 -- as expected, much higher than the overall average. Finally, I checked the reward recorded when the light was red and the action was not None. This average was 0.34 -- slightly less than the overall average.

It seems like the agent is learning reasonably well, but there are certainly places where more iterations are needed.
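Checks of this kind can be reproduced with a few filtered averages over the exported table. In this sketch the key layout `((waypoint, light, oncoming, left), action)` is an assumption, and the four entries are made-up toy numbers, not values from the actual Q-table.

```python
def mean(values):
    """Arithmetic mean of an iterable, or NaN if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")

# Toy Q-table with an assumed key layout: ((waypoint, light, oncoming, left), action).
qt = {
    (("forward", "green", None, None), "forward"): 2.0,
    (("forward", "green", None, None), "left"): 0.5,
    (("forward", "red", None, None), "forward"): -1.0,
    (("forward", "red", None, None), None): 1.0,
}

# Average over every state-action pair.
overall = mean(qt.values())
# Action matches the waypoint and the light is green.
on_waypoint_green = mean(q for (s, a), q in qt.items() if a == s[0] and s[1] == "green")
# Any movement while the light is red.
moving_on_red = mean(q for (s, a), q in qt.items() if s[1] == "red" and a is not None)

print(overall, on_waypoint_green, moving_on_red)
```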

EDIT 2: To see whether the agent has learned the optimal policy, we need to check whether the driving agent's Q-table is making optimal decisions. I have added a tracker that records the rewards and the penalties assigned in each run. The results, averaged over blocks of 10 trials, are below. It seems that we are learning quite slowly, given that the rewards are only slightly increasing and the penalties are only slightly decreasing.



In [1]:
import agent
from environment import Agent, Environment
from planner import RoutePlanner
from simulator import Simulator

In [2]:
# Run the agent for a finite number of trials.

# Set up environment and agent
e = Environment()  # create environment (also adds some dummy traffic)
a = e.create_agent(agent.LearningAgent)  # create agent
e.set_primary_agent(a, enforce_deadline=True)  # specify agent to track
# NOTE: You can set enforce_deadline=False while debugging to allow longer trials

# Now simulate it
sim = Simulator(e, update_delay=0, display=False)  # create simulator (uses pygame when display=True, if available)
# NOTE: To speed up simulation, reduce update_delay and/or set display=False

sim.run(n_trials=100)  # run for a specified number of trials
# NOTE: To quit midway, press Esc or close pygame window, or hit Ctrl+C on the command-line
# Optional: dump the Q-table as comma-separated rows
# for i in a.qt.items(): print(str(i).replace('(','').replace(')','').replace('\'','').replace(' ',''))

Simulator.run(): Trial 0
Environment.reset(): Trial set up with start = (7, 6), destination = (2, 6), deadline = 25
RoutePlanner.route_to(): destination = (2, 6)
Environment.step(): Primary agent ran out of time! Trial aborted.
Simulator.run(): Trial 1
Environment.reset(): Trial set up with start = (8, 1), destination = (3, 2), deadline = 30
RoutePlanner.route_to(): destination = (3, 2)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 2
Environment.reset(): Trial set up with start = (1, 6), destination = (5, 3), deadline = 35
RoutePlanner.route_to(): destination = (5, 3)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 3
Environment.reset(): Trial set up with start = (2, 5), destination = (8, 4), deadline = 35
RoutePlanner.route_to(): destination = (8, 4)
Environment.act(): Primary agent has reached destination!
Simulator.run(): Trial 4
Environment.reset(): Trial set up with start = (7, 6), destination = (8, 1), deadline =

In [3]:
import numpy as np
for i in range(0,100,10):
    print ('reward ' + str(np.mean(a.reward_tracker[i:i+10])))
    print ('penalty ' + str(np.mean(a.penalty_tracker[i:i+10])))

reward 21.0
penalty 2.8
reward 22.95
penalty 1.0
reward 26.1
penalty 1.8
reward 21.2
penalty 1.7
reward 21.7
penalty 1.2
reward 22.35
penalty 3.0
reward 24.7
penalty 1.3
reward 24.95
penalty 2.0
reward 19.6
penalty 0.5
reward 22.15
penalty 1.4
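To put a number on the trend, the ten block averages printed above can be split into the first and last 50 trials. The sketch below just copies those printed values into lists; the helper name `half_means` is illustrative.

```python
# Block averages copied from the printed output (10 trials per block).
rewards = [21.0, 22.95, 26.1, 21.2, 21.7, 22.35, 24.7, 24.95, 19.6, 22.15]
penalties = [2.8, 1.0, 1.8, 1.7, 1.2, 3.0, 1.3, 2.0, 0.5, 1.4]

def half_means(values):
    """Return the mean of the first half and of the second half."""
    mid = len(values) // 2
    first, last = values[:mid], values[mid:]
    return sum(first) / len(first), sum(last) / len(last)

reward_first, reward_last = half_means(rewards)
penalty_first, penalty_last = half_means(penalties)
print("reward  first 50 trials: %.2f, last 50 trials: %.2f" % (reward_first, reward_last))
print("penalty first 50 trials: %.2f, last 50 trials: %.2f" % (penalty_first, penalty_last))
```

On these numbers the average reward moves from about 22.59 to about 22.75 and the average penalty from 1.70 to 1.64, which is consistent with the slow learning noted in EDIT 2.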
