## DXC AI Starter: Reinforcement Learning

<a target="_blank" href="https://github.com/dxc-technology/DXC-Industrialized-AI-Starter"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>


## Set up the development environment

This code installs all the packages you will need, so run it first. It should take about 30 seconds to complete. After it finishes, restart the runtime/session. If you get missing module errors later, the likely cause is that this code has not been run.

In [None]:
! pip install git+https://github.com/dxc-technology/DXC-Industrialized-AI-Starter.git -qq

In [None]:
from dxc import rl
from gym import spaces
import numpy as np
import gym

# Reinforcement Learning Basics

Reinforcement learning is machine learning using **rewards** given to an **agent** acting in an **environment**. Instead of learning from historical data, the agent learns how to maneuver through the environment by receiving positive or negative rewards depending on the actions that it takes. 
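
To make the agent, environment, and reward loop concrete, here is a minimal sketch using only the Python standard library. `CoinFlipEnv` and `run_episode` are hypothetical names made up for this illustration and are not part of `dxc` or Gym. The "agent" here just guesses at random and does not learn anything.

```python
import random


class CoinFlipEnv:
    """Toy environment (hypothetical): guess the result of a coin flip."""

    def reset(self):
        self.rounds_left = 5
        return 0  # nothing useful to observe in this toy example

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1 if action == coin else -1  # positive or negative reward
        self.rounds_left -= 1
        done = self.rounds_left == 0
        return 0, reward, done, {}


def run_episode(env, choose_action):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total_reward, done = 0, False
    while not done:
        action = choose_action(observation)
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward


random.seed(0)
total = run_episode(CoinFlipEnv(), lambda obs: random.randint(0, 1))
print("Total reward:", total)
```

A learning agent would replace the random guess with a policy that is updated from the rewards it receives.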


### Train an OpenAI Gym environment

Gym is a reinforcement learning toolkit created by OpenAI. It includes several premade environments to test your models on. [View OpenAI Gym environments](https://gym.openai.com/envs/)

### Models

There are several models that can be used to train an agent in an environment. Which model to use is generally determined by the type of actions the agent can take.

If the set of actions is discrete, **DQN** or **SARSA** can be used. Discrete actions are actions that you can count. Examples include 3 actions for Rock, Paper, Scissors or 2 actions in TicTacToe for X and O. If the action space ranged from 1 to 5, the actions would be 1, 2, 3, 4, and 5.


If the set of actions is continuous, **DDPG** can be used. Continuous actions are actions that you can measure. Examples include how fast a car should run (mph) or how likely a card will appear in a board game (%). If the action space ranged from 1 to 5, the actions would be all numbers between 1 and 5, including all decimal values.
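
The rule of thumb above can be sketched as a small helper. `suggest_models` is a hypothetical function written for this illustration, not part of `dxc`. It assumes the Gym convention that `spaces.Discrete` exposes an `n` attribute and `spaces.Box` exposes `low` and `high`. The `SimpleNamespace` objects are stand-ins so the example runs without Gym installed.

```python
from types import SimpleNamespace


def suggest_models(action_space):
    """Suggest model names based on whether the action space looks discrete or continuous."""
    if hasattr(action_space, "n"):
        # A countable number of actions, as in gym.spaces.Discrete
        return ["DQN", "SARSA"]
    if hasattr(action_space, "low") and hasattr(action_space, "high"):
        # A measurable range of actions, as in gym.spaces.Box
        return ["DDPG"]
    raise ValueError("unrecognized action space")


# Stand-ins (assumptions) for a 3-action discrete space and a 1-to-5 continuous space
discrete_like = SimpleNamespace(n=3)
continuous_like = SimpleNamespace(low=1.0, high=5.0)

print(suggest_models(discrete_like))
print(suggest_models(continuous_like))
```

With a real Gym environment, the same check could be applied to `env.action_space`.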

### Calling the Reinforcement Learning Helper

`rl_helper()` requires 2 parameters. The first parameter is the environment. The second parameter is the type of model used to train in that environment.

Optional parameters let you further refine your model. The following can be set for both discrete and continuous environments (default values are shown in parentheses after the parameter name):

- steps (50000): number of episodes that the model will run for 
- saved_model_name (model): name of the saved model files
- visualize (False): boolean; set to True to display the environment's visualization, if it has one

The remaining parameters are used only for discrete environments:

- test_steps (5): number of episodes the model will test
- critic_hidden_layers (3): number of critic hidden layers
- hidden_layers (3): number of hidden layers

To call a gym environment, simply use the `gym.make()` function and pass the name of the gym environment. You can get the name of the gym environment [here](https://gym.openai.com/envs/). 

### Continuous Example

In the following example code, the [Pendulum environment](https://gym.openai.com/envs/Pendulum-v0/) is used. This is an environment where the goal of the agent is to balance a pendulum upright. Unlike the Cart Pole example in the next section, the action set for this environment is continuous, so DDPG is used to train it.


In [None]:
env = gym.make("Pendulum-v0")
rl.rl_helper(env=env, 
             model_name="DDPG", 
             saved_model_name="pendulum_model", 
             steps=1000)


### Discrete Example

In the following example code, the [Cart Pole environment](https://gym.openai.com/envs/CartPole-v0/) is used. This is an environment where the goal of the agent (which is a cart) is to balance a pole. The model used is DQN because the actions are discrete: the cart can only move left or right. SARSA may also be used since it is a discrete environment.

In [None]:
import gym

gym_env = gym.make("CartPole-v0")
rl.rl_helper(env=gym_env, 
             model_name="DQN", #SARSA
             steps=25000, 
             test_steps=5,
             visualize=False,
             saved_model_name="cartpole_model",
             critic_hidden_layers=2,
             hidden_layers=2)

# Making your own environment

Now that you've learned how to use the `rl_helper()`, you can explore how to create your own environment to train your own agent.

Each environment has 3 main parts: 

`__init__()` is where the variables describing the environment are initialized. This includes variables that define what an agent can do and observe, as well as variables that define how the environment looks as the agent goes through it. It is important to define what the agent can observe, since the agent bases its choice of action on these observations. For example, in an environment where a car is running, if the road is slippery, the agent might want to choose a slower speed.

In `__init__()`, it is important to define the following variables in the environment:

*   `action_space` defines the actions that the agent can take
*   `observation_space` defines the variables that are observed by the agent
*   `total_reward` initializes and later stores the rewards that the agent has collected
*   `isDone` initializes and stores whether or not the agent is done going through the environment

The rest of the variables defined in `__init__()` simply describe what the environment looks like to the agent.

`reset()` is where the variables of the environment are reset when the environment is run again. Generally, it looks like `__init__()` since it sets the variables back to their original values. This is done because the agent trains in the environment multiple times. To make sure each run is independent of the others, the variables need to be reset to their original values.

It is important to note that `reset()` should return what the agent can observe.

`step()` is where the steps that the agent takes through the environment are defined.
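
Before training on a custom environment, it can help to confirm that `reset()` and `step()` behave as described. The following is a rough sketch: `check_env_contract` and `CountdownEnv` are hypothetical names used only for this illustration. It assumes that `step()` returns a 4-item tuple of observation, reward, done flag, and info dictionary, which is the shape used by the environments in this notebook.

```python
class CountdownEnv:
    """Hypothetical stand-in environment that ends after three steps."""

    def __init__(self):
        self.steps_left = 3
        self.isDone = False

    def reset(self):
        # Set the variables back to their original values
        self.steps_left = 3
        self.isDone = False
        return self.steps_left  # what the agent can observe

    def step(self, action):
        self.steps_left -= 1
        self.isDone = self.steps_left == 0
        reward = 1
        return self.steps_left, reward, self.isDone, {}


def check_env_contract(env, action=0, max_steps=1000):
    """Run one episode with a fixed action and return how many steps it took."""
    observation = env.reset()
    if observation is None:
        raise ValueError("reset() should return what the agent can observe")
    for step_count in range(1, max_steps + 1):
        result = env.step(action)
        if len(result) != 4:
            raise ValueError("step() should return (observation, reward, done, info)")
        observation, reward, done, info = result
        if done:
            return step_count
    raise RuntimeError("episode did not finish within max_steps")


episode_length = check_env_contract(CountdownEnv())
print(episode_length)
```

The same check can be pointed at any environment class that follows this `reset()` and `step()` pattern.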

## Making a **Rock, Paper, Scissors** environment

The goal of the game is to choose an item that beats the other item. 

*   Rock beats scissors. Rock loses to paper.
*   Paper beats rock. Paper loses to scissors.
*   Scissors beats paper. Scissors loses to rock.

In the following environment, each item is defined by an integer:

*   0 - Rock
*   1 - Paper
*   2 - Scissors

Since there are 3 discrete actions that the agent can take, `action_space` is set to `spaces.Discrete(3)`. The agent observes only what the opponent chooses, so `observation_space` is set to `spaces.Discrete(1)` to define 1 observable variable. The opponent picks one of the items at random, so `opponent_action` is defined as `np.random.randint(3)`.

In `step()`, we define what happens when the agent acts. The agent's action is passed in and compared with the opponent's action to determine the reward. This comparison is done in `determine_reward()`, a simple function that uses if/else statements. If both actions are the same, the agent gets a reward of 0. If the agent beats the opponent, the agent gets 1. If the agent loses to the opponent, the agent gets 0.

Once the reward for the current step is determined, it is added to the total reward of the whole run. We then determine whether the run is over by checking which round the agent is on. If `round_number` is not equal to `max_rounds`, `round_number` is incremented. Otherwise, `isDone` is set to True, which ends the run.

In [None]:
class basic_env(object):
  def __init__(self):
    self.action_space = spaces.Discrete(3)
    self.observation_space = spaces.Discrete(1)
    self.total_reward = 0
    self.isDone = False

    self.opponent_action = np.random.randint(3)
    self.round_number = 1
    self.max_rounds = 10
    
  def reset(self):
    self.action_space = spaces.Discrete(3)
    self.observation_space = spaces.Discrete(1)
    self.total_reward = 0
    self.isDone = False

    self.opponent_action = np.random.randint(3)
    self.round_number = 1
    self.max_rounds = 10

    return np.array(self.opponent_action)

  def step(self, action):
    reward = self.determine_reward(action, self.opponent_action)
    self.total_reward = self.total_reward + reward

    if self.round_number != self.max_rounds:
      self.round_number += 1
    else:
      self.isDone = True
      
    self.determine_opponent_action()
    return np.array(self.opponent_action), reward, self.isDone, {}

  ####################
  ##helper functions##
  ####################
  def determine_opponent_action(self):
    self.opponent_action = np.random.randint(3)

  def determine_reward(self, agent_choice, opponent_choice):
    reward = 0
    if agent_choice == opponent_choice: #the same choice
      reward = 0
    elif agent_choice == 0: #rock
      if opponent_choice == 1: #against paper
        reward = 0
      else: #against scissors
        reward = 1
    elif agent_choice == 1: #paper
      if opponent_choice == 2: #against scissors
        reward = 0
      else: #against rock
        reward = 1
    elif agent_choice == 2: #scissors
      if opponent_choice == 0: #against rock
        reward = 0
      else: #against paper
        reward = 1

    return reward
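
As an aside, the if/else chain in `determine_reward()` can also be written with modular arithmetic, because with the encoding 0 = rock, 1 = paper, 2 = scissors, each item beats the one numbered just before it (wrapping around). This is an alternative sketch, not the code used by the environment. `determine_reward_mod` is a hypothetical name.

```python
def determine_reward_mod(agent_choice, opponent_choice):
    """Return 1 if the agent wins, 0 on a tie or a loss (0=rock, 1=paper, 2=scissors)."""
    return 1 if (agent_choice - opponent_choice) % 3 == 1 else 0


# Reward for every pairing: rows are the agent's choice, columns the opponent's
reward_table = [[determine_reward_mod(a, o) for o in range(3)] for a in range(3)]
print(reward_table)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, which matches the rules: rock beats scissors, paper beats rock, and scissors beats paper.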

## Calling your custom environment

Simply pass your environment to the same `rl_helper()` function.

`rl_helper()` is set to train the model in 50,000 episodes. Once this training is done, the saved model will be tested 5 times. In the Rock, Paper, Scissors environment, the ideal total reward that the agent can get is 10. The agent can get a total reward of 10 if it wins each of the 10 rounds of Rock, Paper, Scissors.

In [None]:
env = basic_env()
rl.rl_helper(env, "DQN")
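
A total reward of 10 is reachable because, in `basic_env`, the observation returned by `reset()` and `step()` is the same `opponent_action` that the next call to `step()` scores against. The agent therefore sees the opponent's item before choosing. The following standalone sketch replays that logic without `dxc` or Gym to compare a "counter" policy with random guessing. `play_rounds`, `counter_policy`, and `random_policy` are hypothetical names for this illustration.

```python
import random


def play_rounds(policy, rounds=10):
    """Play rounds of Rock, Paper, Scissors and return the total reward.

    The agent sees the opponent's item before choosing, and earns
    1 for a win and 0 for a tie or a loss.
    """
    total_reward = 0
    for _ in range(rounds):
        opponent_action = random.randrange(3)  # 0=rock, 1=paper, 2=scissors
        agent_action = policy(opponent_action)
        if (agent_action - opponent_action) % 3 == 1:
            total_reward += 1
    return total_reward


def counter_policy(observed):
    """Pick the item that beats the observed item."""
    return (observed + 1) % 3


def random_policy(observed):
    """Ignore the observation and pick at random."""
    return random.randrange(3)


random.seed(1)
best_total = play_rounds(counter_policy)
random_total = play_rounds(random_policy)
print(best_total, random_total)
```

The counter policy always wins all 10 rounds, which gives a reference point for judging how close a trained agent gets to the ideal.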

## HVAC env

Several companies, such as Google, use reinforcement learning to curb their energy use. The following is an example of a simplified HVAC environment.

The goal of the agent in this example is to keep the environment temperature as close as possible to the desired room temperature for as long as possible.

To do this, we establish the desired temperature, the time intervals, the external temperature that affects the internal temperature, how fast the internal temperature assumes the external temperature, and how fast the HVAC temperature affects the internal temperature.



In [None]:
import random
class hvac_env(object):
	def __init__(self):
		self.action_space = spaces.Discrete(3)
		self.observation_space = spaces.Discrete(1)
		self.isDone = False

		self.current_hour = 0
		self.hour = 24

		self.target = 60
		self.target_condition_range = 5 # acceptable range deviation from target

		self.temp_increment = 5

		self.target_reward = 1
		self.out_of_target_reward = -2
		self.efficiency_reward = 1

		self.range = random.choice([(40, 50), (70, 80)])
		self.current_condition = np.random.randint(self.range[0],self.range[1])

	def reset(self):
		self.temp_increment = 5
		self.isDone = False
		self.range = random.choice([(40, 50), (70, 80)])
		self.current_condition = np.random.randint(self.range[0],self.range[1])
		self.current_hour = 0

		return np.array(self.current_condition)

	def step(self, action):
		episode_reward = 0

		if action == 0: #heat
			self.current_condition += self.temp_increment 
		elif action == 1: #cool
			self.current_condition -= self.temp_increment 
		elif action == 2: #do nothing
			self.current_condition = self.current_condition
			episode_reward += self.efficiency_reward

		if self.current_condition >= self.target - self.target_condition_range and self.current_condition <= self.target + self.target_condition_range:
			episode_reward += self.target_reward #reward for being within the target range
		else:
			episode_reward += self.out_of_target_reward #penalty for not being within the range

		server_heat = np.random.randint(0, 3)
		#print("Episode reward: {} | Total reward: {}".format(episode_reward, self.total_reward))
		#print("Action: {} | Current Condition: {} | Heated Condition: {} | Episode Reward: {}".format(action, self.current_condition, (self.current_condition+server_heat), episode_reward))
		self.current_condition += server_heat

		if self.current_hour != self.hour:
			self.current_hour += 1
		else:
			self.isDone = True

		return np.array(self.current_condition), episode_reward, self.isDone, {}
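
A simple rule-based thermostat gives a baseline to compare a trained agent against. The sketch below approximates the reward rules of `hvac_env` using only the standard library. `thermostat_policy` and `simulate_day` are hypothetical names, and the fixed 24-step loop is a simplification of the environment's own hour counter.

```python
import random

TARGET = 60
TOLERANCE = 5       # acceptable deviation from the target
TEMP_INCREMENT = 5  # degrees changed by one heat or cool action


def thermostat_policy(temperature):
    """Return 0 (heat), 1 (cool), or 2 (do nothing) for the observed temperature."""
    if temperature < TARGET - TOLERANCE:
        return 0
    if temperature > TARGET + TOLERANCE:
        return 1
    return 2


def simulate_day(policy, start_temperature, hours=24):
    """Apply the policy for a number of steps and return the total reward."""
    temperature, total_reward = start_temperature, 0
    for _ in range(hours):
        action = policy(temperature)
        if action == 0:
            temperature += TEMP_INCREMENT
        elif action == 1:
            temperature -= TEMP_INCREMENT
        else:
            total_reward += 1  # efficiency reward for doing nothing
        if TARGET - TOLERANCE <= temperature <= TARGET + TOLERANCE:
            total_reward += 1  # reward for being within the target range
        else:
            total_reward -= 2  # penalty for being outside the range
        temperature += random.randint(0, 2)  # server heat of 0, 1, or 2 degrees
    return total_reward


random.seed(0)
day_reward = simulate_day(thermostat_policy, start_temperature=45)
print(day_reward)
```

If a trained model scores well below this kind of baseline during testing, that may suggest it needs more training steps or a different network size.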


In [None]:
env = hvac_env()  # use a separate name so the class is not overwritten

rl.rl_helper(env=env,  # pass the HVAC environment, not the CartPole one
             model_name="DQN", 
             steps=50000, 
             test_steps=5,
             visualize=False,
             saved_model_name="hvac_model",
             critic_hidden_layers=2,
             hidden_layers=2)