# Introduction

In this notebook, we will learn how to create a reinforcement learning agent for self-driving Mario Kart.
Then, we will build a leaderboard to compare the algorithms we build.

## Goals

The main goals of implementing self driving Mario Kart are:
- Understand the basics of Reinforcement Learning
- Get familiar with OpenAI Gym and Universe
- Create an agent that is capable of learning by itself

## Steps
- In section 1, we will go through a short introduction to the fundamentals of Reinforcement Learning.
- In section 2, we will cover the capabilities of OpenAI Gym and how we can use it to solve a collection of test problems (environments).
- In section 3, we will define our problem and create our basic Q-Learning algorithm.
- In section 4, we will build our benchmarking tool and validate our approach.

# 1. Reinforcement Learning

## 1.1. Introduction
Those interested in machine learning are likely aware of the capabilities of AI based on reinforcement learning (RL). The past few years have seen many breakthroughs using RL. The company DeepMind combined deep learning with reinforcement learning to achieve above-human results on a multitude of Atari games and, in March 2016, its AlphaGo program defeated Go champion Lee Sedol four games to one. Although RL currently excels mostly in game environments, it is a general way to solve problems that require optimal decisions and efficiency, and will likely play a part in machine intelligence to come.

Reinforcement learning, explained simply, is a computational approach in which an agent interacts with an environment by taking actions, with the aim of maximizing an accumulated reward.
Here is a simple diagram (the agent-environment loop), from the book [Reinforcement Learning: An Introduction 2nd Edition](http://incompleteideas.net/sutton/book/the-book-2nd.html):

![RL](https://d3ansictanv2wj.cloudfront.net/image3-5f8cbb1fb6fb9132fef76b13b8687bfc.png)

An agent in the current state (St) takes an action (At), to which the environment responds by returning a new state (St+1) and a reward (Rt+1) to the agent. Given the updated state and reward, the agent chooses the next action, and the loop repeats until the environment is solved or terminated.
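
The loop described above can be sketched in a few lines of plain Python. The `CorridorEnv` class and the `run_loop` helper are hypothetical names invented for this illustration (they are not part of any library): the agent walks along a short corridor and receives a reward of 1 when it reaches the last cell.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of `length` cells.

    The agent starts in cell 0 and the episode ends when it reaches the
    last cell. Actions: 0 = step left, 1 = step right.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp so the agent cannot leave the corridor.
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_loop(env, choose_action, max_steps=1000):
    """Run one agent-environment loop and return the accumulated reward."""
    state = env.reset()                          # S_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)            # A_t
        state, reward, done = env.step(action)   # S_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
random_agent = lambda state: rng.choice([0, 1])  # ignores the state entirely
print(run_loop(CorridorEnv(), random_agent))
```

A learning agent would replace `random_agent` with a function that uses the state and past rewards to pick better actions over time.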

## 1.2. Elements of Reinforcement Learning

TODO: Define the elements of Reinforcement Learning

## 1.3. Q-Learning and Exploration
[Link](https://studywolf.wordpress.com/2012/11/25/reinforcement-learning-q-learning-and-exploration/)
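
As a rough illustration of the topic, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The corridor task, the hyperparameter values (`alpha`, `gamma`, `epsilon`) and all function names are assumptions chosen for this example; they are not taken from the linked article.

```python
import random

N_STATES = 5       # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)   # 0 = left, 1 = right


def step(state, action):
    """Hypothetical corridor dynamics: returns (next_state, reward, done)."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties randomly so an untrained agent does not favor one action.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should choose "right" (action 1) in every non-terminal cell, since that is the shortest path to the reward.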

## 1.4. DeepMind RL Algorithms

TODO: An overview of DeepMind algorithms

# 2. OpenAI

OpenAI was founded in late 2015 as a non-profit with a mission to “build safe artificial general intelligence (AGI) and ensure AGI's benefits are as widely and evenly distributed as possible.” In addition to exploring many issues regarding AGI, one major contribution that OpenAI made to the machine learning world was developing both the Gym and Universe software platforms.

## 2.1. OpenAI Gym
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano. You can use it from Python code, and soon from other languages.

OpenAI Gym consists of two parts:
- The gym open-source library: a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
- The OpenAI Gym service: a site and API allowing people to meaningfully compare performance of their trained agents.
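
To illustrate why a shared interface matters, the sketch below writes one evaluation function against `reset` and `step` only, then runs it on two unrelated environments. Both environment classes are hypothetical stand-ins written for this example, not real Gym environments; they only mimic the `reset()` / `step(action)` shape, with `step` returning four values.

```python
import random


class GuessParityEnv:
    """Hypothetical environment: guess whether a hidden number is even (0) or odd (1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hidden = 0

    def reset(self):
        self.hidden = self.rng.randint(0, 9)
        return 0  # the observation carries no information in this toy task

    def step(self, action):
        reward = 1.0 if action == self.hidden % 2 else 0.0
        return 0, reward, True, {}  # every episode lasts a single step


class CountdownEnv:
    """Hypothetical environment: the episode lasts `start` steps, each worth reward 1."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self):
        self.remaining = self.start
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, {}


def average_return(env, policy, episodes=10):
    """Works with any object exposing reset() and step(action)."""
    total = 0.0
    for _ in range(episodes):
        observation, done = env.reset(), False
        while not done:
            observation, reward, done, info = env.step(policy(observation))
            total += reward
    return total / episodes


always_zero = lambda observation: 0
print(average_return(CountdownEnv(start=3), always_zero))
print(average_return(GuessParityEnv(), always_zero))
```

Because `average_return` never looks inside the environment, the same function can score an agent on either task without modification.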

### Running our first agent
Here's a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:

In [2]:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

[2017-09-26 15:50:23,384] Making new env: CartPole-v0
[2017-09-26 15:50:23,511] You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


It should look something like this:

In [3]:
%%HTML
<video width="320" height="240" controls>
  <source src="./Videos/cartpole-no-reset.mp4" type="video/mp4">
</video>

Normally, we'll end the simulation before the cart-pole is allowed to go off-screen. More on that later.

If you'd like to see some other environments in action, try replacing CartPole-v0 above with something like MountainCar-v0, MsPacman-v0 (requires the Atari dependency), or Hopper-v1 (requires the MuJoCo dependencies). Environments all descend from the Env base class.

Note that if you're missing any dependencies, you should get a helpful error message telling you what you're missing. (Let us know if a dependency gives you trouble without a clear instruction to fix it.) Installing a missing dependency is generally pretty simple. You'll also need a MuJoCo license for Hopper-v1.


### Observations

If we ever want to do better than take random actions at each step, it'd probably be good to actually know what our actions are doing to the environment.

The environment's step function returns exactly what we need. In fact, step returns four values. These are:

- observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- done (boolean): whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic "agent-environment loop". Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

![agent-environment loop](https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg)

The process starts with a call to reset, which returns an initial observation. A more correct version of the previous code respects the done flag and resets the environment at the start of each episode:

In [4]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

[2017-09-26 15:50:58,282] Making new env: CartPole-v0


[ 0.04958342  0.02695095 -0.02865629 -0.01010996]
[ 0.05012244  0.2224719  -0.02885849 -0.31169477]
[ 0.05457188  0.41799285 -0.03509239 -0.6133373 ]
[ 0.06293174  0.22337841 -0.04735914 -0.33191042]
[ 0.06739931  0.02896144 -0.05399734 -0.05453032]
[ 0.06797854 -0.16534636 -0.05508795  0.22063906]
[ 0.06467161  0.03051798 -0.05067517 -0.08889943]
[ 0.06528197 -0.16384236 -0.05245316  0.1873747 ]
[ 0.06200512 -0.35817615 -0.04870566  0.46306031]
[ 0.0548416  -0.55257716 -0.03944446  0.74000221]
[ 0.04379005 -0.35693343 -0.02464441  0.43517123]
[ 0.03665139 -0.55169799 -0.01594099  0.71998462]
[ 0.02561743 -0.7465958  -0.0015413   1.00760774]
[ 0.01068551 -0.94169714  0.01861086  1.29980627]
[-0.00814843 -0.7468163   0.04460698  1.0130069 ]
[-0.02308476 -0.94250401  0.06486712  1.3193566 ]
[-0.04193484 -0.74825938  0.09125425  1.04766   ]
[-0.05690003 -0.55445908  0.11220745  0.78496061]
[-0.06798921 -0.75092922  0.12790667  1.11073234]
[-0.08300779 -0.55769822  0.15012131  0.86075567]


# 3. Let's play: Our problem

One of the problems we are trying to tackle is choosing the best self-driving algorithm.
To do so, we first define the key criteria for comparison:
- Time to solve the environment
- How well the agent performs in a multi-agent environment
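
One possible way to turn these two criteria into a ranking is sketched below. The `Result` fields, the agent names and every number are made-up placeholders for illustration, not measurements, and the tie-breaking rule is only one assumption among several reasonable choices.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One leaderboard entry (hypothetical schema)."""
    agent: str
    episodes_to_solve: int     # fewer is better
    multi_agent_score: float   # higher is better


def rank(results):
    """Sort by time to solve first, then by multi-agent score as a tie-breaker."""
    return sorted(results, key=lambda r: (r.episodes_to_solve, -r.multi_agent_score))


# Placeholder numbers, not measurements.
results = [
    Result("agent_a", episodes_to_solve=300, multi_agent_score=0.4),
    Result("agent_b", episodes_to_solve=120, multi_agent_score=0.7),
    Result("agent_c", episodes_to_solve=120, multi_agent_score=0.9),
]

for position, entry in enumerate(rank(results), start=1):
    print(position, entry.agent, entry.episodes_to_solve, entry.multi_agent_score)
```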

For simplicity (and fun), we will use the Mario Kart game for the N64 as the environment to "play" with.
