# Machine Learning Engineer Nanodegree
## Reinforcement Learning
## Project: Train a Smartcab to Drive

Welcome to the fourth project of the Machine Learning Engineer Nanodegree! In this notebook, template code has already been provided for you to aid in your analysis of the *Smartcab* and your implemented learning algorithm. You will not need to modify the included code beyond what is requested. There will be questions that you must answer which relate to the project and the visualizations provided in the notebook. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide in `agent.py`.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

-----

## Getting Started
In this project, you will work towards constructing an optimized Q-Learning driving agent that will navigate a *Smartcab* through its environment towards a goal. Since the *Smartcab* is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: **Safety** and **Reliability**. A driving agent that gets the *Smartcab* to its destination while running red lights or narrowly avoiding accidents would be considered **unsafe**. Similarly, a driving agent that frequently fails to reach the destination in time would be considered **unreliable**. Maximizing the driving agent's **safety** and **reliability** would ensure that *Smartcabs* have a permanent place in the transportation industry.

**Safety** and **Reliability** are measured using a letter-grade system as follows:

| Grade 	| Safety 	| Reliability 	|
|:-----:	|:------:	|:-----------:	|
|   A+  	|  Agent commits no traffic violations,<br/>and always chooses the correct action. | Agent reaches the destination in time<br />for 100% of trips. |
|   A   	|  Agent commits few minor traffic violations,<br/>such as failing to move on a green light. | Agent reaches the destination on time<br />for at least 90% of trips. |
|   B   	| Agent commits frequent minor traffic violations,<br/>such as failing to move on a green light. | Agent reaches the destination on time<br />for at least 80% of trips. |
|   C   	|  Agent commits at least one major traffic violation,<br/> such as driving through a red light. | Agent reaches the destination on time<br />for at least 70% of trips. |
|   D   	| Agent causes at least one minor accident,<br/> such as turning left on green with oncoming traffic.       	| Agent reaches the destination on time<br />for at least 60% of trips. |
|   F   	|  Agent causes at least one major accident,<br />such as driving through a red light with cross-traffic.      	| Agent fails to reach the destination on time<br />for at least 60% of trips. |

To assist in evaluating these important metrics, you will need to load visualization code that will be used later on in the project. Run the code cell below to import this code, which is required for your analysis.

```python
# Import the visualization code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline
```

### Understand the World
Before starting to work on implementing your driving agent, it's necessary to first understand the world (environment) which the *Smartcab* and driving agent work in. One of the major components of building a self-learning agent is understanding the characteristics of the agent, including how it operates. To begin, simply run the `agent.py` agent code exactly as it is -- no need to make any additions whatsoever. Let the resulting simulation run for some time to see the various working components. Note that in the visual simulation (if enabled), the **white vehicle** is the *Smartcab*.

### Question 1
In a few sentences, describe what you observe during the simulation when running the default `agent.py` agent code. Some things you could consider:
- *Does the Smartcab move at all during the simulation?*
- *What kind of rewards is the driving agent receiving?*
- *How does the light changing color affect the rewards?*  

**Hint:** From the `/smartcab/` top-level directory (where this notebook is located), run the command
```bash
python smartcab/agent.py
```

**Answer:**
The Smartcab does not move at all during the simulation, so it does relatively well on safety but poorly on reliability. Safety is roughly defined as avoiding major traffic violations, and the car avoids major violations and accidents simply because it never moves. For the same reason, however, it commits a minor traffic violation every time the light is green.
Regarding rewards, the agent receives a small positive reward (about 0.5) when it sits at a red light, and a large penalty (roughly -4 to -5) each time it stays stopped at a green light. Learning is turned off right now, but these rewards suggest that once learning is enabled, the agent will start to go on green lights much more often.


### Understand the Code
In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the *"hidden"* devices that make everything work. In the `/smartcab/` top-level directory, there are two folders: `/logs/` (which will be used later) and `/smartcab/`. Open the `/smartcab/` folder and explore each Python file included, then answer the following question.

### Question 2
- *In the *`agent.py`* Python file, choose three flags that can be set and explain how they change the simulation.*
- *In the *`environment.py`* Python file, what Environment class function is called when an agent performs an action?*
- *In the *`simulator.py`* Python file, what is the difference between the *`'render_text()'`* function and the *`'render()'`* function?*
- *In the *`planner.py`* Python file, will the *`'next_waypoint()`* function consider the North-South or East-West direction first?*

**Answer:** Three important flags are `learning`, `enforce_deadline`, and `log_metrics`. The first, `learning`, makes the agent adapt its actions to the environment based on the rewards it receives, i.e., it learns (the default is `False`, meaning that the agent always does the same thing). The second, `enforce_deadline`, indicates that actions will be penalized based on the amount of time remaining in the trial. This penalty is taken into account when determining the reward the agent is assigned for performing a particular action. Because the penalty grows as the amount of time remaining shrinks, the agent is incentivized to reach the goal as soon as possible. Finally, the `log_metrics` flag simply determines whether the results of the experiment are logged to a file, which allows us to investigate the results after the experiment is over.

The Environment class function `act()` is called when an agent performs an action. It also checks whether the action is valid before performing it. The `render()` function is used for the GUI, whereas the `render_text()` function is not. The `next_waypoint()` function checks the East-West direction first.

-----
## Implement a Basic Driving Agent

The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of `None` (do nothing), `'left'` (turn left), `'right'` (turn right), or `'forward'` (go forward). For your first implementation, navigate to the `'choose_action()'` agent function and make the driving agent randomly choose one of these actions. Note that you have access to several class variables that will help you write this functionality, such as `'self.learning'` and `'self.valid_actions'`. Once implemented, run the agent file and simulation briefly to confirm that your driving agent is taking a random action each time step.
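As a minimal sketch of the random choice described above, the standalone function below picks uniformly from the four valid actions. The names `VALID_ACTIONS` and `choose_random_action` are hypothetical; in the project, the same idea would live inside the agent's `choose_action()` method and draw from `self.valid_actions`.

```python
import random

# The four valid actions named above; None means "do nothing".
VALID_ACTIONS = [None, 'left', 'right', 'forward']

def choose_random_action(valid_actions=VALID_ACTIONS):
    """Return one of the valid actions, chosen uniformly at random."""
    return random.choice(valid_actions)

# Draw a few actions to confirm that every draw is a valid action.
sample = [choose_random_action() for _ in range(5)]
print(sample)
```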

### Basic Agent Simulation Results
To obtain results from the initial simulation, you will need to adjust the following flags:
- `'enforce_deadline'` - Set this to `True` to force the driving agent to capture whether it reaches the destination in time.
- `'update_delay'` - Set this to a small value (such as `0.01`) to reduce the time between steps in each trial.
- `'log_metrics'` - Set this to `True` to log the simulation results as a `.csv` file in `/logs/`.
- `'n_test'` - Set this to `'10'` to perform 10 testing trials.

Optionally, you may disable the visual simulation (which can make the trials go faster) by setting the `'display'` flag to `False`. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!

Once you have successfully completed the initial simulation (there should have been 20 training trials and 10 testing trials), run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
After setting the flags, run the `agent.py` file from the `projects/smartcab` folder rather than from `projects/smartcab/smartcab`.


```python
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')
```

```
Not enough data collected to create a visualization.
At least 20 trials are required.
```


### Question 3
Using the visualization above that was produced from your initial simulation, provide an analysis and make several observations about the driving agent. Be sure that you are making at least one observation about each panel present in the visualization. Some things you could consider:
- *How frequently is the driving agent making bad decisions? How many of those bad decisions cause accidents?*
- *Given that the agent is driving randomly, does the rate of reliability make sense?*
- *What kind of rewards is the agent receiving for its actions? Do the rewards suggest it has been penalized heavily?*
- *As the number of trials increases, does the outcome of results change significantly?*
- *Would this Smartcab be considered safe and/or reliable for its passengers? Why or why not?*

**Answer:** The following all pertain to the "no_learning" experiment.

The agent makes bad decisions ~40-44% of the time. Accidents are caused 25% of the time (i.e., ~56% of bad decisions lead to an accident).

The driver received an "F" for reliability. This seems reasonable because the driver is choosing directions at random AND we are enforcing a time deadline. If the driver were allowed to continue indefinitely, we would expect it to reach the destination eventually, but capping the amount of time means that the driver is not likely to get there simply by choosing random actions.

The 10-trial rolling average reward rarely rises above -5 and always stays below -4. Therefore, it seems that the driving agent is being penalized heavily.

After running 10, 15, 20, 25 and 30 trials, respectively, the relative frequency of bad actions tends towards 35%. Minor violations and major violations trend downwards as well, but by less than 5%. Major accidents, however, seem to be relatively steady. The reliability moves up strongly, from near 0% at 10 trials to near 40% at 30 trials.

This Smartcab could under no circumstances be considered safe, given that its decision mechanism is to take a random action. In short, it seems likely that, no matter how many trials we perform, this car will get into far too many accidents, although the number of moving violations overall may improve with a greater number of trials. The car performs a bad action between 3 and 4 out of every 10 times. Considering how many "actions" a car takes on a given trip (number of turns, etc.), this would equate to tens or even hundreds of moving violations every time the car takes a trip! The number of violations does seem to go down as more trials are performed, so in principle we could perform thousands upon thousands of trials and perhaps reduce the number of bad actions to a reasonable level. However, a major accident occurs more than 5% of the time. Considering that most trips will take more than 20 actions, this means that for almost any trip, the car is expected to get into a MAJOR accident, which obviously should not be the case. Contrary to bad actions overall, the number of major accidents does not seem to move significantly as more trials are performed. Therefore, we need a better model for choosing actions rather than simply more iterations of the experiment, and this car cannot be considered safe.

Reliability, however, may get better as the number of trials increases. We saw an increase of roughly 40 percentage points simply by increasing the number of trials from 10 to 30. It may be feasible to increase the number of trials to a point where we achieve an acceptable rate of reliability. However, this is not likely to be the most effective means of arriving at a destination, because a random route is being chosen. So perhaps we could consider a car that chooses actions randomly "reliable", but only in a very weak sense.
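The claim about major accidents on a 20-action trip can be checked with a short calculation. This is a sketch under a simplifying assumption that is not part of the original analysis: each action independently causes a major accident with probability 5%. The function name `prob_at_least_one` is hypothetical.

```python
def prob_at_least_one(p_per_action, n_actions):
    """Probability that an event with per-action probability p occurs
    at least once in n independent actions."""
    return 1.0 - (1.0 - p_per_action) ** n_actions

p_major = 0.05   # per-action major-accident rate quoted in the answer
n_actions = 20   # trip length quoted in the answer

# Chance of at least one major accident during the trip (about 0.64).
print(prob_at_least_one(p_major, n_actions))

# Expected number of major accidents during the trip (about one).
print(p_major * n_actions)
```

Under this assumption, a 20-action trip averages about one major accident, and roughly two out of three trips include at least one.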

-----
## Inform the Driving Agent
The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually *learn* what action it should take when occupying a state. The condition of `'if state then action'` for each state is called a **policy**, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is optimal -- or even what environmental variables and conditions it cares about!

### Identify States
Inspecting the `'build_state()'` agent function shows that the driving agent is given the following data from the environment:
- `'waypoint'`, which is the direction the *Smartcab* should drive leading to the destination, relative to the *Smartcab*'s heading.
- `'inputs'`, which is the sensor data from the *Smartcab*. It includes 
  - `'light'`, the color of the light.
  - `'left'`, the intended direction of travel for a vehicle to the *Smartcab*'s left. Returns `None` if no vehicle is present.
  - `'right'`, the intended direction of travel for a vehicle to the *Smartcab*'s right. Returns `None` if no vehicle is present.
  - `'oncoming'`, the intended direction of travel for a vehicle across the intersection from the *Smartcab*. Returns `None` if no vehicle is present.
- `'deadline'`, which is the number of actions remaining for the *Smartcab* to reach the destination before running out of time.

### Question 4
*Which features available to the agent are most relevant for learning both **safety** and **efficiency**? Why are these features appropriate for modeling the *Smartcab* in the environment? If you did not choose some features, why are those features* not *appropriate? Please note that whatever features you eventually choose for your agent's state must be argued for here. That is, your code in `agent.py` should reflect the features chosen in this answer.*

NOTE: You are not allowed to engineer new features for the smartcab. 

**Answer:** All of the features may be relevant, but the reward scheme may mean that we can ignore `'deadline'`. Looking at the reward scheme, the reward is discounted depending on how much time remains until the deadline. This is the same function that the `'deadline'` feature would have to provide. It may turn out that the reward scheme does not do this as well as the deadline feature would, but if it satisfactorily replaces the deadline feature, we could see a drastic improvement in performance. The reason is that `'deadline'` takes a different integer value at every remaining time step, so including it would multiply the size of the state space by the number of possible deadline values. Considering that the other features each have fewer than 5 possible values, this is an "expensive" feature, and we would do well to remove it.

`'inputs'` are primarily relevant for safety because they determine the permissible moves for the Smartcab. The rules of the road are wholly determined by the light, the intended movements of other cars at the intersection, and the agent's desired direction of travel. The first two are provided by `'inputs'`. The agent's desired direction of travel, however, is provided by the `'waypoint'` feature, meaning that `'waypoint'` also bears some relevance to safety. Also note that `'waypoint'` is the optimal direction of travel for the agent, meaning that it is relevant to both safety and efficiency.

### Define a State Space
When defining a set of states that the agent can occupy, it is necessary to consider the *size* of the state space. That is to say, if you expect the driving agent to learn a **policy** for each state, you would need to have an optimal action for *every* state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the *Smartcab*:

`('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day')`.

How frequently would the agent occupy a state like `(False, True, True, True, False, False, '3AM')`? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!

### Question 5
*If a state is defined using the features you've selected from **Question 4**, what would be the size of the state space? Given what you know about the environment and how it is simulated, do you think the driving agent could learn a policy for each possible state within a reasonable number of training trials?*  
**Hint:** Consider the *combinations* of features to calculate the total number of states!

**Answer:** First we must consider how many values are possible for each feature. There are 3 possible directions for `'waypoint'` (left, right, straight). `'light'` has 2 possible values. `'left'` has 4 possible values (left, right, straight, or `None` if there is no car). `'right'` has 4 possible values. `'oncoming'` has 4 possible values. So there are 3 * 2 * 4 * 4 * 4 = 384 possible states. Q-learning itself is computationally cheap for a table of this size, but the agent must also visit each state often enough to try each of the 4 actions (384 * 4 = 1,536 state-action pairs). Learning a policy for every state is therefore likely feasible, but it may require a large number of training trials.
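The count above can be verified by enumerating every combination of feature values. This is a small sketch; the value lists are assumptions based on the feature descriptions in this notebook (with `'forward'` standing in for "straight"), and the `features` dictionary is purely illustrative.

```python
from itertools import product

# Possible values for each feature chosen in Question 4 (assumed from the text).
features = {
    'waypoint': ['left', 'right', 'forward'],
    'light':    ['red', 'green'],
    'left':     [None, 'left', 'right', 'forward'],
    'right':    [None, 'left', 'right', 'forward'],
    'oncoming': [None, 'left', 'right', 'forward'],
}

# Every state is one combination of feature values.
states = list(product(*features.values()))
print(len(states))  # 384

# Each state has four possible actions, giving the number of Q-table entries.
n_actions = 4
print(len(states) * n_actions)  # 1536
```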

### Update the Driving Agent State
For your second implementation, navigate to the `'build_state()'` agent function. With the justification you've provided in **Question 4**, you will now set the `'state'` variable to a tuple of all the features necessary for Q-Learning. Confirm your driving agent is updating its state by running the agent file and simulation briefly and note whether the state is displaying. If the visual simulation is used, confirm that the updated state corresponds with what is seen in the simulation.

**Note:** Remember to reset simulation flags to their default setting when making this observation!
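A minimal sketch of the state tuple described above, assuming the five features argued for in **Question 4**. The standalone function and its signature are hypothetical simplifications; in the project, the equivalent logic belongs in the agent's `'build_state()'` function, which obtains the waypoint and sensor inputs from the environment.

```python
def build_state(waypoint, inputs):
    """Combine the waypoint and the sensor inputs into a hashable state tuple.

    'deadline' is deliberately left out, following the answer to Question 4.
    """
    return (waypoint,
            inputs['light'],
            inputs['left'],
            inputs['right'],
            inputs['oncoming'])

# Hypothetical sensor reading used only for illustration.
example_inputs = {'light': 'green', 'left': None,
                  'right': 'forward', 'oncoming': 'left'}
state = build_state('forward', example_inputs)
print(state)
```

Using a tuple keeps the state hashable, so it can serve directly as a dictionary key in a Q-table.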

-----
## Implement a Q-Learning Driving Agent
The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: For every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such that we can have the agent choose the *best* action for each state based on the Q-values of each state-action pair possible. For this project, you will be implementing a *decaying,* $\epsilon$*-greedy* Q-learning algorithm with *no* discount factor. Follow the implementation instructions under each **TODO** in the agent functions.

Note that the agent attribute `self.Q` is a dictionary: This is how the Q-table will be formed. Each state will be a key of the `self.Q` dictionary, and each value will then be another dictionary that holds the *action* and *Q-value*. Here is an example:

```
{ 'state-1': { 
    'action-1' : Qvalue-1,
    'action-2' : Qvalue-2,
     ...
   },
  'state-2': {
    'action-1' : Qvalue-1,
     ...
   },
   ...
}
```
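The dictionary layout above, together with an update rule that uses no discount factor, can be sketched as follows. The function names `create_q` and `learn` and their signatures are hypothetical, and the update shown is one common form, $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,r$, assumed here because the text calls for *no* discount factor.

```python
VALID_ACTIONS = [None, 'left', 'right', 'forward']

def create_q(Q, state, valid_actions=VALID_ACTIONS):
    """Add a state to the Q-table with every action initialized to 0.0."""
    if state not in Q:
        Q[state] = {action: 0.0 for action in valid_actions}

def learn(Q, state, action, reward, alpha):
    """Blend the old Q-value with the new reward; no future-reward term."""
    Q[state][action] = (1.0 - alpha) * Q[state][action] + alpha * reward

Q = {}
state = ('forward', 'green', None, None, None)  # hypothetical state tuple
create_q(Q, state)
learn(Q, state, 'forward', reward=2.0, alpha=0.5)
print(Q[state]['forward'])  # 1.0
```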

Furthermore, note that you are expected to use a *decaying* $\epsilon$ *(exploration) factor*. Hence, as the number of trials increases, $\epsilon$ should decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior. Additionally, the agent will be tested on what it has learned after $\epsilon$ has passed a certain threshold (the default threshold is 0.05). For the initial Q-Learning implementation, you will be implementing a linear decaying function for $\epsilon$.
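To illustrate how the exploration factor is used, here is a minimal sketch of $\epsilon$-greedy action selection. The function below is a hypothetical standalone version that takes the Q-values of the current state as a dictionary; ties between equally good actions are broken at random, which is a design choice rather than a requirement.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a best-valued one.

    q_values maps each action to its Q-value for the current state.
    """
    if rng.random() < epsilon:
        # Explore: any action, regardless of its Q-value.
        return rng.choice(list(q_values))
    # Exploit: choose among the actions that share the highest Q-value.
    best = max(q_values.values())
    best_actions = [a for a, q in q_values.items() if q == best]
    return rng.choice(best_actions)

# Hypothetical Q-values for one state.
q_values = {None: 0.1, 'left': -1.0, 'right': 0.4, 'forward': 1.2}
print(choose_action(q_values, epsilon=0.0))  # forward
```

With `epsilon=0.0` the agent always exploits, and with `epsilon=1.0` it always explores, which matches the two extremes discussed in the text.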

### Q-Learning Simulation Results
To obtain results from the initial Q-Learning implementation, you will need to adjust the following flags and setup:
- `'enforce_deadline'` - Set this to `True` to force the driving agent to capture whether it reaches the destination in time.
- `'update_delay'` - Set this to a small value (such as `0.01`) to reduce the time between steps in each trial.
- `'log_metrics'` - Set this to `True` to log the simulation results as a `.csv` file and the Q-table as a `.txt` file in `/logs/`.
- `'n_test'` - Set this to `'10'` to perform 10 testing trials.
- `'learning'` - Set this to `'True'` to tell the driving agent to use your Q-Learning implementation.

In addition, use the following decay function for $\epsilon$:

$$ \epsilon_{t+1} = \epsilon_{t} - 0.05, \hspace{10px}\textrm{for trial number } t$$
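The linear decay above determines how many training trials run before testing. The sketch below counts them under two assumptions that are not stated in the text: $\epsilon$ starts at 1 and is decayed once per training trial, and testing begins as soon as $\epsilon$ is strictly below the tolerance. The helper name is hypothetical, and `Fraction` is used only to avoid floating-point rounding at the boundary value 0.05.

```python
from fractions import Fraction

def trials_until_below(epsilon, step, tolerance):
    """Count how many decay steps are needed before epsilon < tolerance."""
    trials = 0
    while epsilon >= tolerance:
        epsilon -= step   # epsilon_{t+1} = epsilon_t - step
        trials += 1
    return trials

# Start at 1, subtract 0.05 per trial, stop once below the default 0.05.
n_training = trials_until_below(Fraction(1), Fraction(1, 20), Fraction(1, 20))
print(n_training)  # 20
```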

If you have difficulty getting your implementation to work, try setting the `'verbose'` flag to `True` to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation! 

Once you have successfully completed the initial Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

```python
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')
```

```
IOError: File logs/sim_default-learning.csv does not exist
```

### Question 6
Using the visualization above that was produced from your default Q-Learning simulation, provide an analysis and make observations about the driving agent like in **Question 3**. Note that the simulation should have also produced the Q-table in a text file which can help you make observations about the agent's learning. Some additional things you could consider:  
- *Are there any observations that are similar between the basic driving agent and the default Q-Learning agent?*
- *Approximately how many training trials did the driving agent require before testing? Does that number make sense given the epsilon-tolerance?*
- *Is the decaying function you implemented for $\epsilon$ (the exploration factor) accurately represented in the parameters panel?*
- *As the number of training trials increased, did the number of bad actions decrease? Did the average reward increase?*
- *How does the safety and reliability rating compare to the initial driving agent?*

**Answer:** The first thing I noticed was that the number of bad actions and major violations began at a substantially lower level, and trended down more sharply, than in the default agent. This shows that we were improving with regard to the total number of bad actions taken and major violations. However, the number of minor accidents actually increased, while there was little difference in the number of major accidents. The increase in minor accidents makes some sense because a minor accident may occur when a driver turns left into oncoming traffic, and the default learner never turned left. But the lack of improvement in major accidents shows that we are still performing as badly as the default model in situations like running red lights into crossing traffic.

The agent required 20 trials before testing, which makes perfect sense given our starting epsilon of 1 and the decay rate of 0.05 per trial. Also, the average reward increased steadily but remained negative. This means that the agent was learning something (because the average reward increased), but the fact that the reward remained solidly negative (barely crossing -2) indicates that our model still has a lot of learning to do. This is consistent with the safety and reliability ratings of F and F, respectively.

The safety and reliability ratings are the same as for the agent that chose actions randomly, but I would still call this model an improvement because of the strong decrease in total bad actions and major violations and the increase in average reward. Moreover, the upward trend in reward suggests that we could improve the model with more trials, and by carefully managing the decay function to ensure that an appropriate share of the trials is used for exploration versus exploitation.

-----
## Improve the Q-Learning Driving Agent
The fourth step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both **safety** and **efficiency**. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes: In theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to *transition from experimenting with unlearned behavior to acting on learned behavior*. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it *learn*, but never let it *act*. When improving on your Q-Learning implementation, consider the implications it creates and whether it is logistically sensible to make a particular adjustment.

### Improved Q-Learning Simulation Results
To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:
- `'enforce_deadline'` - Set this to `True` to force the driving agent to capture whether it reaches the destination in time.
- `'update_delay'` - Set this to a small value (such as `0.01`) to reduce the time between steps in each trial.
- `'log_metrics'` - Set this to `True` to log the simulation results as a `.csv` file and the Q-table as a `.txt` file in `/logs/`.
- `'learning'` - Set this to `'True'` to tell the driving agent to use your Q-Learning implementation.
- `'optimized'` - Set this to `'True'` to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:
- `'n_test'` - Set this to some positive number (previously 10) to perform that many testing trials.
- `'alpha'` - Set this to a real number between 0 - 1 to adjust the learning rate of the Q-Learning algorithm.
- `'epsilon'` - Set this to a real number between 0 - 1 to adjust the starting exploration factor of the Q-Learning algorithm.
- `'tolerance'` - Set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

Furthermore, use a decaying function of your choice for $\epsilon$ (the exploration factor). Note that whichever function you use, it **must decay to **`'tolerance'`** at a reasonable rate**. The Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for $t$, the number of trials):

$$ \epsilon = a^t, \textrm{for } 0 < a < 1 \hspace{50px}\epsilon = \frac{1}{t^2}\hspace{50px}\epsilon = e^{-at}, \textrm{for } 0 < a < 1 \hspace{50px} \epsilon = \cos(at), \textrm{for } 0 < a < 1$$

You may also use a decaying function for $\alpha$ (the learning rate) if you so choose, although this is less common. If you do so, be sure that it adheres to the inequality $0 \leq \alpha \leq 1$.
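To compare how quickly the example decay functions reach the tolerance, the sketch below counts the first trial at which each one drops below it. The constants (`0.99` for the base, `0.01` for $a$) and the helper name `trials_to_tolerance` are hypothetical choices for illustration, not recommended settings.

```python
import math

def trials_to_tolerance(decay, tolerance, max_trials=100000):
    """Return the first trial t >= 1 with decay(t) < tolerance, or None."""
    for t in range(1, max_trials + 1):
        if decay(t) < tolerance:
            return t
    return None

tolerance = 0.05   # the default threshold mentioned earlier
a = 0.01           # hypothetical decay constant

# The four example decay functions, each as a function of the trial number t.
schedules = {
    'a^t with a = 0.99': lambda t: 0.99 ** t,
    '1 / t^2':           lambda t: 1.0 / t ** 2,
    'exp(-a t)':         lambda t: math.exp(-a * t),
    'cos(a t)':          lambda t: math.cos(a * t),
}

for name, decay in schedules.items():
    print(name, trials_to_tolerance(decay, tolerance))
```

Running a check like this before a long simulation shows roughly how many training trials a given function and constant will produce.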

If you have difficulty getting your implementation to work, try setting the `'verbose'` flag to `True` to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation! 

Once you have successfully completed the improved Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

```python
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')
```

```
NameError: name 'vs' is not defined
```

### Question 7
Using the visualization above that was produced from your improved Q-Learning simulation, provide a final analysis and make observations about the improved driving agent like in **Question 6**. Questions you should answer:  
- *What decaying function was used for epsilon (the exploration factor)?*
- *Approximately how many training trials were needed for your agent before beginning testing?*
- *What epsilon-tolerance and alpha (learning rate) did you use? Why did you use them?*
- *How much improvement was made with this Q-Learner when compared to the default Q-Learner from the previous section?*
- *Would you say that the Q-Learner results show that your driving agent successfully learned an appropriate policy?*
- *Are you satisfied with the safety and reliability ratings of the *Smartcab*?*

**Answer:** The function ultimately used for epsilon was `abs(cos(0.001t))`. The first thing I looked for in a function was an output near 1 when the number of trials is near 0. I made this decision because in early trials (i.e., close to 0 on the x-axis) the agent does not have much prior experience to draw upon (exploit), so I thought it would be beneficial for it to explore more. Epsilon expresses the probability of trying a new action, so I wanted a high epsilon value for low trial numbers. Cosine seemed like a natural fit because cos(0) = 1. However, cosine also takes negative values, so I took the absolute value. I tried the other suggested functions, but cosine was the only one that received a non-F score in both reliability and safety on the first try, so I decided to keep experimenting with it. I also noticed that smaller values of a led to an uptick in performance, so I kept decreasing a until I got a satisfactory score. However, smaller values of a lengthen the period of cos(at), meaning that more trials were necessary.

I tried many different alpha values and noticed that I only got grades above a 'D' for reliability when I used a high alpha value. This could be thought of as combating the low bias in our model for reliability by increasing the importance of each and every data point.

This Q-learner received the maximum possible marks for both reliability and safety, trouncing the default model (which simply chose a random action at all times). Because it received the maximum possible marks, I am satisfied with the Smartcab and believe that it has learned an appropriate policy. However, I will also note that training the model took a fair amount of time (40 minutes to 1 hour).

### Define an Optimal Policy

Sometimes, the answer to the important question *"what am I trying to get my agent to learn?"* only has a theoretical answer and cannot be concretely described. Here, however, you can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, you can further define, for each state the *Smartcab* is occupying, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an **optimal policy**. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly" not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward and know that this is not the correct behavior. This can be used to your advantage for verifying whether the **policy** your driving agent has learned is the correct one, or if it is a **suboptimal policy**.

### Question 8

1. Please summarize what the optimal policy is for the smartcab in the given environment. What would be the best set of instructions possible given what we know about the environment? 
   _You can explain with words or a table, but you should thoroughly discuss the optimal policy._

2. Next, investigate the `'sim_improved-learning.txt'` text file to see the results of your improved Q-Learning algorithm. _For each state that has been recorded from the simulation, is the **policy** (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?_ 

3. Provide a few examples from your recorded Q-table which demonstrate that your smartcab learned the optimal policy. Explain why these entries demonstrate the optimal policy.

4. Try to find at least one entry where the smartcab did _not_ learn the optimal policy.  Discuss why your cab may have not learned the correct policy for the given state.

Be sure to document your `state` dictionary below; it should be easy for the reader to understand what each state represents.

**Answer:** My 'state' is defined as the tuple (smartcab_desired_direction, light, left, right, oncoming), where the last three entries give the intended direction of the car to the left, to the right, and oncoming (or None if no car is present).

An optimal policy entails that whenever the smartcab has a green light and the right of way, it goes in the desired direction. Otherwise, it waits until it does have the right of way. Theoretically, the cab could benefit from taking a detour if, e.g., it wanted to turn left but multiple oncoming cars were traveling straight. However, because we are not considering future states (i.e., those along the detour), this is not relevant. Therefore, whenever the car has a green light and desires to go straight or right, it should go in that direction. Whenever the car has a green light and desires to turn left, it should turn left unless the oncoming driver is going straight. On a red light, the car should idle unless it desires to turn right AND no car is crossing the intersection from the left, perpendicular to the smartcab.

Even though the smartcab scored A+ for both safety and reliability, there are still some states for which it did NOT learn the optimal policy. For example, in ('left', 'green', 'left', None, 'forward') the cab should idle because the oncoming car wants to go forward and has the right of way, but 'left' has the highest Q-value (left : 1.57, right : 0.30, forward : 0.00, None : 0.00). Another example is ('left', 'red', None, 'right', 'left'), which has Q-values (forward : -10.43, None : 0.00, right : 1.64, left : -9.50). Right has the highest Q-value, but the car's desired direction of travel is left, so the learned action does not follow the waypoint. Note that this is a different type of failure from the first example, because the rules of the road are observed (there is no car on the left, so nothing blocks the right turn on red). By contrast, in the first example the smartcab drives across the intersection with a car oncoming, which violates the rules of the road. So the model does not fail only on the rules of the road OR only on the waypoints, but intermittently violates both.

It is unsurprising that the model struggles when the desired direction of travel is left, because this is when right-of-way issues arise. The fact that the smartcab scored well but did not learn the optimal policy in every situation tells me that the states with incorrect policies were not visited often. Take, for instance, ('left', 'green', 'left', None, 'forward'). This state could cause an accident because the car would be turning into oncoming traffic. If the agent got into many such accidents, it would not score well. But it did score well, so it must not have encountered this state very often on a relative basis. Perhaps this reflects the fact that Q-learning only approaches the proper Q-values as the number of trials grows arbitrarily large, meaning that there is no guarantee of a correct result for any "small" number of trials.

On the other hand, the cab obtained the best possible scores, so it must have determined the best actions for some states. One such example is ('left', 'green', 'forward', None, None). In this situation there is no oncoming car, so the agent is allowed to make the left turn. Given that the Q-values are (forward : -0.29, None : 0.00, right : 0.00, left : 9.39), left has the highest value and will be chosen.
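The comparison between the learned and the optimal policy can also be sketched in code. This is a minimal illustration, not the project's grading logic: it encodes only the rules stated in the answer above (for instance, it ignores any right-of-way owed to an oncoming car that is turning right), the function names are hypothetical, and the Q-values are the three entries quoted above.

```python
def optimal_action(state):
    """Optimal action under the rules described in the answer above."""
    waypoint, light, left, right, oncoming = state
    if light == 'green':
        # A left turn must yield to an oncoming car that is going straight.
        if waypoint == 'left' and oncoming == 'forward':
            return None
        return waypoint
    # Red light: only a right turn is allowed, and only if no car
    # from the left is driving straight through the intersection.
    if waypoint == 'right' and left != 'forward':
        return 'right'
    return None

def learned_action(q_values):
    """Action with the highest Q-value for one state."""
    return max(q_values, key=q_values.get)

def find_mismatches(q_table):
    """States whose learned action differs from the optimal one."""
    return [state for state, q_values in q_table.items()
            if learned_action(q_values) != optimal_action(state)]

# The three Q-table entries quoted in the answer above.
recorded = {
    ('left', 'green', 'left', None, 'forward'):
        {'left': 1.57, 'right': 0.30, 'forward': 0.00, None: 0.00},
    ('left', 'red', None, 'right', 'left'):
        {'forward': -10.43, None: 0.00, 'right': 1.64, 'left': -9.50},
    ('left', 'green', 'forward', None, None):
        {'forward': -0.29, None: 0.00, 'right': 0.00, 'left': 9.39},
}

for state in find_mismatches(recorded):
    print("suboptimal: %r learned %r, optimal %r"
          % (state, learned_action(recorded[state]), optimal_action(state)))
```

Running the same check over every state in `'sim_improved-learning.txt'` would require parsing that file, which is not attempted here.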

-----
### Optional: Future Rewards - Discount Factor, `'gamma'`
Curiously, as part of the Q-Learning algorithm, you were asked to **not** use the discount factor, `'gamma'` in the implementation. Including future rewards in the algorithm is used to aid in propagating positive rewards backwards from a future state to the current state. Essentially, if the driving agent is given the option to make several actions to arrive at different states, including future rewards will bias the agent towards states that could provide even more rewards. An example of this would be the driving agent moving towards a goal: With all actions and rewards equal, moving towards the goal would theoretically yield better rewards if there is an additional reward for reaching the goal. However, even though in this project, the driving agent is trying to reach a destination in the allotted time, including future rewards will not benefit the agent. In fact, if the agent were given many trials to learn, it could negatively affect Q-values!
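As a minimal sketch of what including future rewards would change, the standard tabular Q-learning update can be written with an optional discount factor. The function and parameter names here are hypothetical and not taken from `agent.py`; with `gamma=0` the update reduces to the form used in this project, which depends only on the immediate reward.

```python
def q_update(q, reward, alpha, gamma=0.0, max_next_q=0.0):
    """One tabular Q-learning update for a single state-action pair.

    q          -- current Q-value of the state-action pair
    reward     -- immediate reward received for the action
    alpha      -- learning rate
    gamma      -- discount factor (0 ignores future rewards)
    max_next_q -- highest Q-value available in the next state
    """
    return (1 - alpha) * q + alpha * (reward + gamma * max_next_q)

# Without future rewards, only the immediate reward matters.
print(q_update(q=0.0, reward=2.0, alpha=0.5))

# With gamma > 0, the value of the next state leaks into this one.
print(q_update(q=0.0, reward=2.0, alpha=0.5, gamma=0.5, max_next_q=10.0))
```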

### Optional Question 9
*There are two characteristics about the project that invalidate the use of future rewards in the Q-Learning algorithm. One characteristic has to do with the *Smartcab* itself, and the other has to do with the environment. Can you figure out what they are and why future rewards won't work for this project?*

**Answer:** The smartcab itself cannot sense anything other than the cars at its current intersection, meaning that it cannot accurately judge the rewards for intermediate steps. Essentially, the car has no way to predict which states will occur between the current state and the final state. For instance, suppose the car is in a state A and is "tempted" by a future state C. The car has no idea what will occur in the intermediate state B, and therefore could be penalized if it ends up in a poor state, canceling out the future gain!

The problem with the environment is that the state contains no information about which intersection the agent is currently at. Because of this, the agent cannot determine whether some future state is attainable, and cannot predict the intermediate rewards either. For instance, suppose the agent in state S0 is "tempted" by some state S1 such that S1's deadline feature is less than S0's (because it will take some number of moves to get from S0 --> S1). Moreover, suppose that S0:deadline - S1:deadline = r (S1 is r moves away from S0). There is more than one intersection that is r moves away from S0. For instance, the agent could go straight r times, or, if r is even, go backwards once for every time it goes forward (never turning) and end up at the same intersection! If the latter happened, the agent could be in S1 (depending on the state of other cars, which we are ignoring for the moment) and yet get a smaller reward than expected, because there are fewer moves remaining and the agent is no closer to its target! Adding this smaller reward would then lower the Q-value of S1, demonstrating how the Q-value could fall with more trials.

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.