<a href="https://colab.research.google.com/github/Xiaocong233/ReinforcementLearning_ML/blob/master/epsilon_greedy_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reinforcement Learning Tutorial in Python**
###### Created by **Xiaocong Yan** for [StartOnAI](https://startonai.com/)
---


## 1. Introduction to RL

![Reinforcement learning illustration](https://lilianweng.github.io/lil-log/assets/images/RL_illustration.png)

- What is Reinforcement Learning?
  - RL is a subfield of machine learning that focuses on training an AI agent to behave in a desired way by learning directly from its surrounding environment
  - Essentially, we train the agent to choose, for each state (s) of the environment, the action (a) that maximizes an engineered reward (r)

- RL Applications
  - game-playing AI
    - AlphaGo

    <img src="https://cdn.geekwire.com/wp-content/uploads/2016/03/160312-go-630x353.jpg" alt="AlphaGo" width="500" height="300">

    - AlphaStar

    <img src="https://www.version2.dk/sites/v2/files/topillustration/2019/01/alphastarscreenshot.png" alt="AlphaStar" width="600" height="337">
  
  - agent in simulation learning to walk
  
  <img src="https://nav74neet.github.io/media/blog/walking.png" alt="Simulated agent learning to walk"  width='740' height='300'>
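
The state, action, and reward loop described at the top of this section can be sketched in a few lines. This is only an illustration: `ToyEnvironment`, `run_episode`, and both policies are hypothetical names invented here, not part of any RL library or of the rest of this tutorial.

```python
import random

class ToyEnvironment:
  """Hypothetical two-state environment, invented for illustration only."""
  def __init__(self, seed=0):
    self.rng = random.Random(seed)
    self.state = 0

  def step(self, action):
    # reward is 1 when the action matches the current state, otherwise 0
    reward = 1 if action == self.state else 0
    # the environment then moves to a random next state (0 or 1)
    self.state = self.rng.randint(0, 1)
    return self.state, reward

def run_episode(policy, n_steps=100, seed=0):
  """Let a policy (a function state -> action) interact with the environment."""
  env = ToyEnvironment(seed)
  state = env.state
  total_reward = 0
  for _ in range(n_steps):
    action = policy(state)            # the agent picks an action (a) given the state (s)
    state, reward = env.step(action)  # the environment returns the next state and a reward (r)
    total_reward += reward
  return total_reward

def matching_policy(state):
  # the best policy for this toy environment: always match the state
  return state

def constant_policy(state):
  # ignores the state, so it is rewarded only when the state happens to be 0
  return 0

print(run_episode(matching_policy))  # prints 100: every step is rewarded
print(run_episode(constant_policy))
```

Here the best policy is written by hand. In RL the agent would have to discover it from the rewards alone.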


## 2. Explore-Exploit Dilemma and the Multi-Armed Bandit Problem

In [None]:
import matplotlib.pyplot as plt
import numpy as np

class Bandit:
  def __init__(self, p):
    self.p = p # the winning rate
    self.p_estimate = 0. # estimate of the winning rate, initialized to 0
    self.N = 0. # number of samples collected

  def pull(self):
    # draw a uniform random number in [0, 1); the pull wins if it falls below the true winning rate
    return np.random.random() < self.p

  def update(self, x):
    # increment the number of samples collected
    self.N += 1.
    # calculate the new p hat from the previous p hat and the newly obtained value
    self.p_estimate = ((self.N - 1) * self.p_estimate + x) / self.N

def run_experiment(bandits_probs_list, epsilon, N):
  # create one Bandit object for each true win-rate in the list
  bandits = [Bandit(p) for p in bandits_probs_list]
  
  # initialize variables
  rewards = np.zeros(N)
  num_times_explored = 0
  num_times_exploited = 0
  num_optimal = 0

  # print out the true optimal bandit index
  optimal_j = np.argmax([b.p for b in bandits])
  print('optimal j:', optimal_j)

  for i in range(N):
    # use epsilon_greedy to select the next bandit
    if np.random.random() < epsilon:
      num_times_explored += 1
      j = np.random.randint(len(bandits))
    else:
      num_times_exploited += 1
      j = np.argmax([b.p_estimate for b in bandits])
    
    if j == optimal_j:
      num_optimal += 1

    # pull the arm for the bandit selected
    x = bandits[j].pull()

    # update rewards log
    rewards[i] = x
    bandits[j].update(x)     
  
  # print the estimated and true win-rate of each bandit (0-indexed, matching 'optimal j')
  for i, b in enumerate(bandits):
    print(f'bandit {i} estimated win-rate: {round(b.p_estimate, 3)} | true win-rate: {b.p}')

  # print summary statistics
  print()
  print('total reward:', rewards.sum())
  print('overall win-rate:', rewards.sum() / N)
  print('explore count:', num_times_explored)
  print('exploit count:', num_times_exploited)
  print('optimal selection count:', num_optimal)

  # plot the results
  cumulative_rewards = np.cumsum(rewards)
  win_rates = cumulative_rewards / (np.arange(N) + 1)
  plt.plot(win_rates)
  # horizontal reference line at the best possible win-rate
  plt.plot(np.ones(N) * np.max(bandits_probs_list))
  plt.title('cumulative win-rate over time')
  plt.xlabel('number of trials')
  plt.ylabel('win-rate')
  plt.show()

if __name__ == '__main__':
  # simulate a multi-armed bandit problem with 5 machines with win-rates 0, 0.25, 0.5, 0.75 and 1
  # epsilon = 0.1: explore (pick a random machine) about 10% of the time
  # run 10000 trials
  run_experiment([0, 0.25, 0.5, 0.75, 1], 0.1, 10000)
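
The `update` method above keeps a running average without storing past rewards. As a small check, assuming only NumPy, the sketch below applies the same rule to simulated 0/1 rewards and compares the result with the ordinary sample mean. The helper name `incremental_mean` is introduced here for illustration.

```python
import numpy as np

def incremental_mean(samples):
  """Apply the same running-average rule as Bandit.update to a sequence of rewards."""
  p_estimate, n = 0.0, 0
  for x in samples:
    n += 1
    # new estimate = (sum of the previous n - 1 samples + new sample) / n
    p_estimate = ((n - 1) * p_estimate + x) / n
  return p_estimate

rng = np.random.default_rng(0)
# simulated rewards from a machine with a 0.75 win-rate (1.0 = win, 0.0 = loss)
rewards = (rng.random(1000) < 0.75).astype(float)

# the running average agrees with the batch mean up to floating-point error
print(np.isclose(incremental_mean(rewards), rewards.mean()))  # prints True
```

Because only `p_estimate` and `N` are stored, memory use stays constant no matter how many times a machine is pulled.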

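To see how the choice of epsilon affects the outcome, the experiment can be repeated for several values. The sketch below is a compact variant of the code above with no printing or plotting. The function name `epsilon_greedy_win_rate`, the `seed` parameter, and the particular epsilon values are assumptions made for this example.

```python
import numpy as np

def epsilon_greedy_win_rate(probs, epsilon, n_trials, seed=0):
  """Return the overall win-rate of one epsilon-greedy run."""
  rng = np.random.default_rng(seed)
  probs = np.asarray(probs, dtype=float)
  estimates = np.zeros(len(probs))  # estimated win-rate of each machine
  counts = np.zeros(len(probs))     # number of pulls of each machine
  total_reward = 0.0
  for _ in range(n_trials):
    if rng.random() < epsilon:
      j = int(rng.integers(len(probs)))  # explore: pick a random machine
    else:
      j = int(np.argmax(estimates))      # exploit: pick the best estimate so far
    x = float(rng.random() < probs[j])   # 1.0 on a win, 0.0 on a loss
    counts[j] += 1
    # same running average as Bandit.update, written in rearranged form
    estimates[j] += (x - estimates[j]) / counts[j]
    total_reward += x
  return total_reward / n_trials

for eps in (0.01, 0.1, 0.5):
  rate = epsilon_greedy_win_rate([0, 0.25, 0.5, 0.75, 1], eps, 10000)
  print(f'epsilon={eps}: overall win-rate {rate:.3f}')
```

One edge case is worth noting. With `epsilon=0` the agent never explores, and because all estimates start at 0 and `np.argmax` returns the first index on ties, it stays on the first machine forever. With the win-rates used above, that machine never pays out, so the overall win-rate is exactly 0.
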
## Sources: