<a href="https://colab.research.google.com/github/drewnlia/Complete-Python-3-Bootcamp/blob/master/Module_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://babl.ai/wp-content/uploads/2022/02/logo_two.png" alt="Drawing" width="400"/>
</div>

# Introduction
This notebook is for constructing your solutions to quiz questions in Module 8 of "[Algorithms, AI, & Machine Learning: AI for AI Ethics](https://courses.babl.ai/p/algorithms-ai-machine-learning)". Make a copy in your own Drive, complete the instructions below, then share it with the instructor (sheabrown@bablai.com).


# Reinforcement Learning

Reinforcement learning involves an "agent" that learns to take actions in an environment based on the expected future reward. As we learned in the first week when we discussed agents, figuring out the task environment is an important first step in deciding how the agent should make decisions. Here are a few definitions:

$\mathcal{S}$ is the set of all possible states. \\
$\mathcal{A}$ is the set of all possible actions. \\
$P(s_j | s_i, a)$ are the probabilities of transitioning from state $s_i \in \mathcal{S}$ to state $s_j \in \mathcal{S}$ given that you took action $a\in \mathcal{A}$. \\
$R(s,a)$ is the reward function: the reward the agent receives for taking action $a$ in state $s$.

It's important to note that what counts as a state $s$ is a modeling choice, as most of the time the true state of the world (or game world) is not fully observable or known. 
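To make these definitions concrete, here is a minimal sketch of how $\mathcal{S}$, $\mathcal{A}$, $P$, and $R$ could be written down as plain Python data. The two-state "cool/hot" example, its numbers, and the helper `check_transitions` are all made up for illustration and are not part of the course material.

```python
# Hypothetical two-state, two-action environment, for illustration only.
states = ["cool", "hot"]      # the set S
actions = ["slow", "fast"]    # the set A

# P[(s_i, a)] maps each next state s_j to the probability P(s_j | s_i, a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}


def check_transitions(P):
    """Return True if every distribution P(. | s, a) sums to 1."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())


print(check_transitions(P))
```

Choosing only two states here is exactly the kind of modeling choice described above: a real system would have far more detail than "cool" and "hot".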

## The Agent
$\pi(a | s)$ is the policy of the agent, a function that returns the action (or, for a stochastic policy, a distribution over actions) given the current state. For a deterministic policy we simply write $\pi(s)$. \\
$V^{\pi}(s) = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ...]$ is the value function, where $r_t \sim R(s_t,a_t)$ is the reward at timestep $t$ and $\gamma \in [0,1]$ is a discount factor. The value function is the expected discounted sum of rewards you get for starting in state $s$ and taking actions based on the policy $\pi$ from then on. The optimal way to act in this environment is given by the optimal policy $\pi^{*}$:

\begin{equation} \pi^{*}(s) = \underset{\pi}{\operatorname{argmax}}(V^{\pi}(s)). \end{equation}

This is just a fancy way of saying "find the policy that maximizes the expected future rewards from that state". 
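As a small worked example of the discounted sum inside the expectation, the sketch below computes $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ...$ for one fixed list of rewards. The function name `discounted_return` and the reward values are illustrative assumptions; the value function itself is the average of this quantity over many trajectories generated by the policy.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory."""
    total = 0.0
    # Work backwards so each step is just: reward now + gamma * (return after it).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```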

## Policy Iteration

One method of model-based learning is called policy iteration. For this, we need to define the state-action value function of a policy, $Q^{\pi}(s,a)$. It differs from the value function in that it asks for the value (discounted sum of future rewards) of being in state $s$ if you first take action $a$, which may not be the action $\pi(s)$ that the policy would choose in that state, and follow the policy afterwards.

\begin{equation} Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s'\in \mathcal{S}} P(s'|s,a)V^{\pi_i}(s')
\end{equation}

We use the subscript $i$ because policy iteration produces a sequence of policies: given the current policy $\pi_i$, we can search each state for an action $a \ne \pi_i(s)$ that would give more reward, and use it to build the next policy.

\begin{equation}
 \pi_{i+1}(s) = \underset{a}{\operatorname{argmax}}(Q^{\pi_i}(s,a))
\end{equation}

By iteratively looping through all the states in your environment, re-evaluating $V^{\pi_i}$ for the current policy and then doing this policy improvement step, you eventually converge on the optimal policy.
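The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the two-state environment, its rewards, `GAMMA = 0.9`, and the helper names `q_value`, `evaluate`, and `policy_iteration` are all made up for the example, and policy evaluation is approximated by a fixed number of sweeps.

```python
GAMMA = 0.9
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# Hypothetical model: P[(s, a)] maps next states to probabilities, R[(s, a)] is the reward.
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}


def q_value(s, a, V):
    """Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def evaluate(policy, sweeps=500):
    """Approximate V for a fixed policy by repeatedly applying its Q-values."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in STATES}
    return V


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    policy = {s: ACTIONS[0] for s in STATES}
    while True:
        V = evaluate(policy)
        improved = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
        if improved == policy:
            return policy, V
        policy = improved


policy, V = policy_iteration()
print(policy)
print(V)
```

In this toy model the loop settles on driving fast when cool and slow when hot, because the large penalty for going fast while hot outweighs the extra immediate reward.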




# Exercise 3: Play around with OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) aims to provide an easy-to-setup general-intelligence benchmark with a wide variety of different environments. The goal is to standardize how environments are defined in AI research publications so that published research becomes more easily reproducible. The project claims to provide the user with a simple interface. 
Gym is installed with pip. There is one significant limitation to be aware of when working in Colab:

* OpenAI Gym cannot directly render animated games in Google Colab.

Because OpenAI Gym requires a graphics display, the only way to display Gym in Google Colab is as an embedded video. The presentation of OpenAI Gym game animations in Google Colab is discussed later in this module.


### Looking at Gym Environments

The centerpiece of Gym is the environment, which defines the "game" in which your reinforcement algorithm will compete.  An environment does not need to be a game; however, it describes the following game-like features:
* **action space**: The actions we can take on the environment, at each step, to alter the environment.
* **observation space**: The current state of the portion of the environment that we can observe. Usually, we can see the entire environment.

Before we begin to look at Gym, it is essential to understand some of the terminology used by this library.

* **Agent** - The machine learning program or model that controls the actions.
* **Step** - One round of issuing actions that affect the observation space.
* **Episode** - A collection of steps that terminates when the agent fails to meet the environment's objective, or the episode reaches the maximum number of allowed steps.
* **Render** - Gym can render one frame for display after each step.
* **Reward** - A positive reinforcement that can occur after each step, once the agent has acted.
* **Nondeterministic** - For some environments, randomness is a factor in deciding what effects actions have on reward and changes to the observation space.

It is important to note that many Gym environments report that they are not nondeterministic even though they use random numbers to process actions. Based on discussion in the Gym GitHub issue tracker, the nondeterministic property is generally taken to mean that an environment will still behave randomly even when given a consistent seed value. The seed method of an environment can be used by the program to seed the environment's random number generator.
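The effect of a consistent seed can be illustrated without Gym at all. The sketch below uses Python's standard `random.Random` as a stand-in for an environment's internal random number generator; the function `noisy_rewards` is hypothetical and only meant to show that the same seed reproduces the same "random" behavior.

```python
import random


def noisy_rewards(seed, n=5):
    """Draw n pseudo-random rewards from a private generator seeded with `seed`."""
    rng = random.Random(seed)  # stand-in for an environment's own RNG
    return [rng.random() for _ in range(n)]


# Same seed, same sequence: random numbers are used, yet the run is reproducible.
print(noisy_rewards(42) == noisy_rewards(42))
# A different seed gives a different sequence.
print(noisy_rewards(42) == noisy_rewards(43))
```

An environment that behaves like this would not count as nondeterministic in the sense above, since fixing the seed fixes its behavior.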


In [9]:
#@title

# Install everything needed to display the environments in Colab
# (refresh the package index before installing system packages)
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
# Download the Atari ROMs and import them once the Atari packages are installed
!wget http://www.atarimania.com/roms/Roms.rar
!unrar x -o+ /content/Roms.rar > /dev/null
!python -m atari_py.import_roms /content/ROMS > /dev/null



The Gym library allows us to query some of these attributes from environments.  Below is a function to query gym environments.

In [11]:
import gym

def query_environment(name):
  env = gym.make(name)
  spec = gym.spec(name)
  print(f"Action Space: {env.action_space}")
  print(f"Observation Space: {env.observation_space}")
  print(f"Max Episode Steps: {spec.max_episode_steps}")
  print(f"Nondeterministic: {spec.nondeterministic}")
  print(f"Reward Range: {env.reward_range}")
  print(f"Reward Threshold: {spec.reward_threshold}")

## Visualize the game

Below is a function that will allow you to visualize a game session in Colab as a video. 

In [15]:
# RecordVideo replaces the older Monitor wrapper
# (assumes a Gym release that provides gym.wrappers.RecordVideo)
from gym.wrappers import RecordVideo
import glob
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

# Start a virtual display so Gym can render without a physical screen
display = Display(visible=0, size=(1400, 900))
display.start()

# Utility functions to enable video recording of a gym environment
# and displaying it.
# To enable video, just do "env = wrap_env(env)"

def show_video():
  """Embed the first recorded .mp4 found in ./video in the notebook output."""
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    with open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")


def wrap_env(env):
  # Wrap the environment so that episodes are written as .mp4 files to ./video
  env = RecordVideo(env, './video')
  return env

# Watch a random agent play a game

The code below starts a game and plays it using random actions. For this exercise, I simply want you to play around with doing something different than taking a random action. Look [here](https://gym.openai.com/docs/) to get some idea of how the gym environment works. I want you to try to do three things:



1.   Take any non-random action. This could be as simple as always moving to the right, or as complex as using 'if-then' statements to decide on an action.
2.   Create a memory variable of some type, and store old observations in this memory variable. Be as simple or creative as you want. 
3.  Try at least one other game (see the OpenAI website for a list).

When you're done, share the document with sheabrown@bablai.com. 
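The first two tasks above can be sketched without Gym, using a tiny stand-in environment. Everything here is a hypothetical illustration: `ToyCorridor` is not a Gym environment, it only imitates the `reset()` and four-value `step()` interface used in this notebook, and `choose_action` is just one possible if-then rule that reads from a memory of past observations.

```python
from collections import deque


class ToyCorridor:
    """Hypothetical stand-in for a Gym environment (not part of Gym).

    The agent starts at position 0 and must reach `goal`.
    Actions: 0 = move left, 1 = move right.
    """

    def __init__(self, goal=5, max_steps=50):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)  # wall at position 0
        self.steps += 1
        reached = self.position == self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0
        return self.position, reward, done, {}


def choose_action(memory):
    """If-then rule: go left if the last two observations match, otherwise go right."""
    if len(memory) >= 2 and memory[-1] == memory[-2]:
        return 0
    return 1


def run_episode(env, memory_size=4):
    memory = deque(maxlen=memory_size)  # keeps only the most recent observations
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        memory.append(observation)
        action = choose_action(memory)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, list(memory)


total_reward, memory = run_episode(ToyCorridor())
print(total_reward, memory)
```

In the real exercise, the same pattern would apply with the Gym environment in place of `ToyCorridor`, and with a rule based on whatever the game's observations actually contain.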



In [14]:
# Edit this cell to make your changes
# ====================================
env = wrap_env(gym.make("MsPacman-v0"))
observation = env.reset()

done = False
while not done:
    env.render()

    # Your agent goes here
    action = env.action_space.sample()

    # Assumes the Gym API in which step() returns four values;
    # `done` becomes True when the episode ends
    observation, reward, done, info = env.step(action)

env.close()
show_video()
