# Training a self-driving car agent
## Reinforcement Learning Project
This documentation outlines the thought process behind training a smartcab to navigate on its own. We apply reinforcement learning techniques to a self-driving agent in a simplified world so that it reliably reaches its destinations in the allotted time.

First we'll investigate the environment in which the agent operates and instruct it to execute very basic driving commands. Afterwards we'll lay out the different states a smartcab can be in, analyze its limited world, and derive a Q-Learning algorithm that guides the agent towards its destination. Finally, we'll go through several iterations to find the best configuration for our algorithm and the environment it's operating in, to improve the results.

## Description
In the not-so-distant future, taxicab companies across the United States no longer employ human drivers to operate their fleet of vehicles. Instead, the taxicabs are operated by self-driving agents, known as smartcabs, to transport people from one location to another within the cities those companies operate in. In major metropolitan areas, such as Chicago, New York City, and San Francisco, an increasing number of people have come to rely on smartcabs to get to where they need to go as safely and efficiently as possible. Although smartcabs have become the transport of choice, concerns have arisen that a self-driving agent might not be as safe or efficient as human drivers, particularly when considering city traffic lights and other vehicles. To alleviate these concerns, your task as an employee for a national taxicab company is to use reinforcement learning techniques to construct a demonstration of a smartcab operating in real-time to prove that both safety and efficiency can be achieved.

## Software Requirements
This project uses the following software and Python libraries:

* [Python 2.7](https://www.python.org/download/releases/2.7/)
* [NumPy](http://www.numpy.org/)
* [PyGame](http://pygame.org/)
    * **Helpful links for installing PyGame:**
    * [Getting Started](https://www.pygame.org/wiki/GettingStarted)
    * [PyGame Information](http://www.pygame.org/wiki/info)
    * [Google Group](https://groups.google.com/forum/#!forum/pygame-mirror-on-google-groups)
    * [PyGame subreddit](https://www.reddit.com/r/pygame/)
    
## Definitions

### Environment
The smartcab operates in an ideal, simplified, grid-like city (see image below). Roads run along two major axes (North-South and East-West). There are other vehicles on the road, but the city is abstracted to the point where there are no other obstacles such as traffic jams or construction sites, and no other heterogeneous agents such as pedestrians. However, the city runs on certain rules that are similar to those of our street system. At each intersection there is a traffic light that allows traffic in either the North-South direction or the East-West direction.

![Smart Cab Environment](smartcab_screenshot.png)

**The following rules govern traffic:**
* On a green light, a left turn is permitted if there is no oncoming traffic making a right turn or coming straight through the intersection.
* On a red light, a right turn is permitted if no traffic is approaching from your left through the intersection.
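
The two rules above can be sketched as a small predicate. This is only an illustration: the function name `is_action_legal` and its parameters are hypothetical and not part of the project code. It also assumes that going straight or turning right on green is always allowed, and that "approaching from your left through the intersection" means the vehicle on the left intends to go straight.

```python
def is_action_legal(light, oncoming, left, action):
    """Return True if `action` obeys the traffic rules listed above.

    light    -- 'green' or 'red'
    oncoming -- intended move of the oncoming vehicle, or None if there is none
    left     -- intended move of the vehicle approaching from the left, or None
    action   -- None (idle), 'forward', 'left' or 'right'
    """
    if action is None:
        # Idling never violates a traffic rule.
        return True
    if light == 'green':
        if action == 'left':
            # A left turn must yield to oncoming traffic that goes
            # straight or turns right.
            return oncoming not in ('forward', 'right')
        # Assumption: going straight or turning right on green is allowed.
        return True
    # Red light: only a right turn is possible, and only if nothing is
    # coming straight through the intersection from the left (assumption).
    if action == 'right':
        return left != 'forward'
    return False
```

For example, `is_action_legal('green', 'forward', None, 'left')` is `False`, because the oncoming vehicle is coming straight through the intersection.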

### Inputs and Outputs
Much like modern hail-a-cab applications, the smartcab is assigned a route based on the passengers' starting location and destination. The route is split at each intersection into waypoints, and for simplicity the smartcab is assumed to be at some intersection at any instant.

Therefore, the next waypoint to the destination, assuming the destination has not already been reached, is one intersection away in one direction (North, South, East, or West). 

### States
The smartcab has only an egocentric view of the intersection it is at and can therefore use only the following information:

* The state of the traffic light for its direction of movement: `['green', 'red']`
* Whether there is a vehicle at the intersection in each of the other directions: `['left', 'right', 'oncoming']`

### Actions
At each time step, the smartcab chooses one of the following options:

* Idle at the intersection (`None`)
* Drive to an adjacent intersection by turning left, turning right, or going straight (`'left'`, `'right'`, `'forward'`)

### Deadline
The smartcab has to get to its final destination in a given time. With each action taken, this time decreases. If the allotted time becomes zero before reaching the destination, the trip has failed.

### Rewards, Penalties and Goal
Smartcabs receive a reward for each successfully completed trip and a smaller reward for each action they execute that obeys traffic rules. An incorrect action incurs a small penalty, while violating traffic rules or causing a traffic accident incurs a large penalty. Based on the rewards and penalties the smartcab receives, the self-driving agent implementation will learn an optimal policy for driving on the city roads while obeying traffic rules, avoiding accidents, and reaching passengers' destinations in the allotted time.

## Implementing a Basic Driving Agent
To get going, we'll implement a basic driving agent that picks one of the possible actions at random and drives through the city streets.
To do so, we set `enforce_deadline` to `False` on line `47` and add the following code to `agent.py` on line `30`:

```python
# Pick a random action from the set of possible actions.
import random
action = random.choice([None, 'forward', 'left', 'right'])
```

```python
# Quick demonstration of how actions are drawn.
import random

n = 5
actions = []  # list of the actions drawn so far
for _ in range(n):
    actions.append(random.choice([None, 'forward', 'left', 'right']))
print("Smartcab performs following {} actions: {}".format(n, actions))
```

```
Smartcab performs following 5 actions: ['right', 'forward', None, None, None]
```


### Observations
Running the smartcab with this source of randomness leads to a few interesting observations:
* When crashing, cars will restart at a different location and start to perform actions again.
* There are 3 other agents or cars in play.
* As the number of steps n $\rightarrow \infty$ (made possible by `enforce_deadline = False`), the agent eventually reaches the goal, provided it doesn't crash into another vehicle.
* There is a reward system in place that shows the immediate gratification of an action.

### Inform the Driving Agent
Now that the driving agent is capable of moving around in the environment, the next task is to identify a set of states that are appropriate for modeling the smartcab and environment. 

The main source of state variables is the set of current inputs at the intersection, but given our set of rules not all of them may require representation. The goal is to process the inputs and update the agent's current state at each waypoint using the `self.state` variable. To check our performance, we'll continue with the simulation deadline enforcement `enforce_deadline` set to `False`.

#### All information available
Let's take a look at all information we've got available in our environment and talk through the possibilities that come with each of these inputs.

##### Traffic Lights
As mentioned above, we have an indicator for the color of the traffic light at the agent's current position. It can take the values `green` and `red`. We find this information in the inputs dictionary under `inputs['light']`. Given our set of rules, this is a highly useful indicator for our smartcab to decide whether it should perform a certain action or not, so it will be included in our set of state variables.

* `inputs['light'] = ['red', 'green']`

##### Traffic
Our agent can detect whether there is a vehicle at the intersection in each of the other directions and what action that vehicle is about to perform. Given the rules that apply on US streets, two questions deserve particular emphasis:

* If the traffic light is green:
    * Is there oncoming traffic going straight or turning right? ($\rightarrow$ can't perform a left turn)
* If the traffic light is red:
    * Is there traffic coming from the left side? ($\rightarrow$ can't make a right turn)

However, since we want our agent to learn these rules through reinforcement and gratification, we are going to include all the traffic information in our set of states. This adds the following inputs to the list:

* `inputs['oncoming'] = [None, 'left', 'right', 'forward']`
* `inputs['left'] = [None, 'left', 'right', 'forward']`
* `inputs['right'] = [None, 'left', 'right', 'forward']`

##### Waypoints
The next waypoint tells the agent which direction leads towards the destination. Deciding which step to perform next is essential to learning whether an action is good or bad given the state, especially in reinforcement learning. To use a simple but painful early-life analogy, this is where our agent learns whether the cooktop is hot or not.

* `self.next_waypoint = ['forward', 'left', 'right']`

##### Deadline
One could argue that the time remaining for the smartcab to complete its trip describes the state it's in. We can think of classic situations where cab drivers are pushed to their limits and into breaking the law in order to get a passenger to the destination faster. Since we're building this system to produce optimal results without breaking the law, however, we should discard this input.
Another valid concern is the number of possible combinations. Given that the deadline decreases by `1` with each action and that we perform an average of 40 to 50 actions per run, this would add a multiplier of `len(deadline)` to the state combinations that we already have. With only 100 training runs of around 40 to 50 actions each, this input would likely distract the learner from what's essential for the task.
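
To put rough numbers on this concern, the following back-of-the-envelope sketch compares the size of the state space with and without the deadline against the number of training samples. It assumes 50 distinct deadline values (the upper end of the "40 to 50" quoted above) and the five inputs kept in the summary below; all variable names are illustrative.

```python
# Rough comparison of state-space size against available training samples.
# All figures are the approximate ones quoted in the text.
base_states = 2 * 4 * 4 * 4 * 3   # light, oncoming, left, right, next waypoint
deadline_values = 50              # assumption: upper end of "40 to 50" steps
training_runs = 100
actions_per_run = 50

states_with_deadline = base_states * deadline_values
training_samples = training_runs * actions_per_run

print("states without deadline: {}".format(base_states))           # 384
print("states with deadline:    {}".format(states_with_deadline))  # 19200
print("training samples:        {}".format(training_samples))      # 5000
```

Under these assumptions, including the deadline would leave far fewer training samples than states, so most states would never be visited at all.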

##### The final states for our Q-Learner
To summarize, we plan to include the following inputs in the list of possible state combinations:
* `inputs['light'] = ['red', 'green']` $\rightarrow 2$
* `inputs['oncoming'] = [None, 'left', 'right', 'forward']` $\rightarrow 4$
* `inputs['left'] = [None, 'left', 'right', 'forward']` $\rightarrow 4$
* `inputs['right'] = [None, 'left', 'right', 'forward']` $\rightarrow 4$
* `self.next_waypoint = ['forward', 'left', 'right']` $\rightarrow 3$

The total number of states is therefore the product of the number of values each variable can take:
$$ 2 \times 4 \times 4 \times 4 \times 3 = 384 $$
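
The count can be checked by enumerating every combination, one factor per variable in the list above. This is a minimal sketch: the helper `build_state` and the constant names are hypothetical, and it assumes the state is stored as a plain tuple.

```python
from itertools import product

# Possible values of each state variable, as listed above.
LIGHTS = ['red', 'green']
TRAFFIC = [None, 'left', 'right', 'forward']  # used for oncoming, left, right
WAYPOINTS = ['forward', 'left', 'right']

def build_state(inputs, next_waypoint):
    """Pack the sensed inputs and the next waypoint into a hashable tuple."""
    return (inputs['light'], inputs['oncoming'], inputs['left'],
            inputs['right'], next_waypoint)

# Enumerate every combination: light x oncoming x left x right x waypoint.
all_states = list(product(LIGHTS, TRAFFIC, TRAFFIC, TRAFFIC, WAYPOINTS))
print(len(all_states))  # 384

example = build_state(
    {'light': 'green', 'oncoming': None, 'left': 'forward', 'right': None},
    'left')
print(example in all_states)  # True
```

Because the tuple is hashable, it can be used directly as a dictionary key.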

**Does this number seem reasonable given that the goal of Q-Learning is to learn and make informed decisions about each state?**
Given that we are dealing with around `5000` time steps in our training runs (`100` runs * `~50` actions), `384` states seems a bit high. We could consider dropping `inputs['right']`, since neither of our rules depends on traffic approaching from the right: on a red light, a right turn (which US rules allow) only has to yield to traffic from the left, and every other move is incorrect regardless and should be learned as such. This would bring the number of states down to `96`, but we would lose some flexibility: if, for example, we changed the rules slightly and applied the learner to a European traffic setting, it would learn incorrect rules. Since we add each state to a dictionary only when it occurs and compute its value then, we might not even end up evaluating all `384` states. It should therefore be fine to keep all five inputs and store whichever of the `384` possible combinations actually appear.
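
The idea of adding states to a dictionary only when they occur could look roughly like this. It is a sketch under assumptions, not the project's implementation: the names `q_table`, `get_q` and `visit` are hypothetical, unseen entries are assumed to start at `0.0`, and the three example states are made up for the demonstration.

```python
import random

ACTIONS = [None, 'forward', 'left', 'right']

q_table = {}  # maps (state, action) -> Q-value; starts empty

def get_q(state, action):
    """Return the stored Q-value, or 0.0 for pairs never seen before."""
    return q_table.get((state, action), 0.0)

def visit(state):
    """Create entries for a state the first time it is encountered."""
    for action in ACTIONS:
        q_table.setdefault((state, action), 0.0)

# Simulate encounters drawn from a small pool of hypothetical states.
random.seed(0)
pool = [('red', None, None, None, 'forward'),
        ('green', None, None, None, 'forward'),
        ('green', 'forward', None, None, 'left')]
for _ in range(20):
    visit(random.choice(pool))

seen_states = set(state for state, _ in q_table)
print(len(seen_states))  # never more than 3, however many steps are simulated
```

The table only grows with the states the agent actually meets, so rare combinations cost nothing until they appear.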



## Implement a Q-Learning Driving Agent
With your driving agent capable of interpreting the input information and having a mapping of environmental states, your next task is to implement the Q-Learning algorithm so that the driving agent chooses the best action at each time step, based on the Q-values for the current state and action. Each action taken by the smartcab will produce a reward which depends on the state of the environment. The Q-Learning driving agent will need to consider these rewards when updating the Q-values. Once implemented, set the simulation deadline enforcement `enforce_deadline` to `True`. Run the simulation and observe how the smartcab moves about the environment in each trial.
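
Choosing the best action from the Q-values might be sketched as follows. The function `choose_action` is hypothetical; it assumes the Q-values live in a dictionary keyed by `(state, action)`, that unseen pairs count as `0.0`, and that ties are broken at random.

```python
import random

ACTIONS = [None, 'forward', 'left', 'right']

def choose_action(q_table, state):
    """Return the action with the highest Q-value for `state`.

    q_table maps (state, action) -> Q-value; missing entries count as 0.0.
    Ties between equally good actions are broken at random.
    """
    q_values = [q_table.get((state, a), 0.0) for a in ACTIONS]
    best = max(q_values)
    best_actions = [a for a, q in zip(ACTIONS, q_values) if q == best]
    return random.choice(best_actions)

state = ('green', None, None, None, 'forward')
q = {(state, 'forward'): 2.0, (state, 'right'): -0.5}
print(choose_action(q, state))  # forward
```

With an empty table every action ties at `0.0`, so the agent initially behaves like the random agent from the previous section.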

The formulas for updating Q-values can be found in this video.
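
For reference, the standard tabular Q-learning update is shown here; its notation may differ from the source referenced above. With learning rate $\alpha$, discount factor $\gamma$, reward $r$, and next state $s'$:

$$ Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) $$

A minimal sketch of this update in code, assuming a dictionary-based Q-table keyed by `(state, action)`. The function name `update_q` and the default values of `alpha` and `gamma` are illustrative choices, not values taken from the project.

```python
ACTIONS = [None, 'forward', 'left', 'right']

def update_q(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    alpha (learning rate) and gamma (discount factor) are illustrative
    defaults, not values taken from the project.
    """
    old_q = q_table.get((state, action), 0.0)
    # Best value achievable from the next state under current estimates.
    max_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    new_q = (1 - alpha) * old_q + alpha * (reward + gamma * max_next)
    q_table[(state, action)] = new_q
    return new_q

s = ('green', None, None, None, 'forward')
s_next = ('red', None, None, None, 'forward')
q = {(s_next, None): 1.0}
print(update_q(q, s, 'forward', 2.0, s_next))
```

In this example the old estimate is `0.0`, the best next-state value is `1.0`, and the reward is `2.0`, so the new estimate is roughly `0.5 * (2.0 + 0.9 * 1.0) = 1.45`.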

**What changes do you notice in the agent's behavior when compared to the basic driving agent when random actions were always taken?** 

**Why is this behavior occurring?**