class called OrderQuantity, which contains an instance variable Q indicating the order quantity to be used when that instance is “pulled.” A “pull” consists of a simulation of one period of the newsvendor problem, which just means generating a random demand and calculating the resulting cost. The class should contain (at least) 3 methods:

* an __init__() method that sets up the class and sets Q and any other necessary instance variables

* pull(), which performs a random “pull” and returns the reward (negative of the cost)

* update(), which updates the estimate of the mean reward


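$S^* = \mu + z_\alpha \sigma$

where $z_\alpha = \Phi^{-1}(\alpha)$ is the $\alpha$-quantile of the standard normal distribution and $\alpha = \dfrac{p}{p+h}$, with $p$ the stockout cost and $h$ the holding cost per unit.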
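
As a quick numerical check of the formula above, the following minimal sketch evaluates $S^*$ with the standard-library `statistics.NormalDist`. The parameter values (stockout cost 20, holding cost 1, demand mean 20, standard deviation 4) are taken from the experiment described later in this assignment; the function name `optimal_order_quantity` is only illustrative.

```python
from statistics import NormalDist

def optimal_order_quantity(mu, sigma, h, p):
    """Return S* = mu + z_alpha * sigma for normally distributed demand."""
    alpha = p / (p + h)                     # critical ratio
    z_alpha = NormalDist().inv_cdf(alpha)   # alpha-quantile of the standard normal
    return mu + z_alpha * sigma

# Instance from the experiment: h = 1, p = 20, demand ~ Normal(20, 4)
s_star = optimal_order_quantity(mu=20, sigma=4, h=1, p=20)
print(round(s_star, 2))
```

For this instance the continuous optimum comes out at roughly 26.7, which lies between the candidate quantities 25 and 30, so those two should be the strongest arms.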


Implement the following explore–exploit methods, and write code to test the ability of each one to find the best order quantity (from among a set of OrderQuantity instances) for a given instance of the newsvendor problem, i.e., for a given demand probability distribution and given values of the holding and stockout costs:

* epsilon-greedy with ε = 0.1
* decaying epsilon
* optimistic initial values
* UCB1
* Bayesian sampling
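
The arm-selection rules behind these methods can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the `1/t` decay schedule, and the Gaussian sampling with an assumed `scale` parameter are all assumptions made for the example. Each function takes the current mean-reward estimates (and pull counts where needed) and returns the index of the arm to pull next.

```python
import math
import random

def epsilon_greedy(means, eps=0.1):
    """With probability eps pick a random arm, otherwise the best estimate."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda j: means[j])

def decaying_epsilon(means, t):
    """Epsilon-greedy with eps = 1/t (an assumed schedule), t = 1, 2, ..."""
    return epsilon_greedy(means, eps=1.0 / t)

def ucb1(means, counts, t):
    """Pick the arm maximizing mean + sqrt(2 ln t / n_j); unpulled arms go first."""
    for j, n in enumerate(counts):
        if n == 0:
            return j
    # Note: this bonus is derived for rewards in [0, 1]; newsvendor rewards
    # are on a different scale, so rescaling the rewards may be necessary.
    return max(range(len(means)),
               key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))

def bayesian_sampling(means, counts, scale=1.0):
    """Thompson-style rule: draw one sample per arm from
    Normal(mean_j, scale / sqrt(n_j + 1)) and pick the largest sample.
    The Gaussian form and the `scale` value are simplifying assumptions."""
    samples = [random.gauss(m, scale / math.sqrt(n + 1))
               for m, n in zip(means, counts)]
    return max(range(len(samples)), key=lambda j: samples[j])
```

Optimistic initial values needs no selection rule of its own: it is the greedy rule (`epsilon_greedy` with `eps=0`) started from estimates that are at least as high as any achievable mean reward. Because newsvendor rewards are never positive, an initial estimate of 0 is already optimistic here.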

Experiment: Test each of your explore–exploit methods on a newsvendor instance in which the holding and stockout costs are 1 and 20, respectively, and the demand comes from a normal distribution with a mean of 20 and a standard deviation of 4. Use the following order quantities: 15, 20, 25, 30, 35. 

For each method, plot the running count of each order quantity. That is, plot the number of times that each order quantity is “pulled” versus the training period. Also plot a comparison of the running average reward versus the training period for all 5 methods (on the same graph).
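
One way to prepare the data for these plots is sketched below, assuming each run records the index of the arm chosen in every period (`choices`) and the reward received (`rewards`). Both names, the helper functions, and the small sample data are hypothetical and only illustrate the bookkeeping.

```python
def running_counts(choices, n_arms):
    """counts[j][t] = number of times arm j was chosen in periods 0..t."""
    counts = [[0] * len(choices) for _ in range(n_arms)]
    totals = [0] * n_arms
    for t, j in enumerate(choices):
        totals[j] += 1
        for k in range(n_arms):
            counts[k][t] = totals[k]
    return counts

def running_average(rewards):
    """averages[t] = mean of rewards[0..t]."""
    averages, total = [], 0.0
    for t, r in enumerate(rewards, start=1):
        total += r
        averages.append(total / t)
    return averages

# Tiny made-up run with 3 arms and 5 periods
choices = [0, 2, 2, 1, 2]
rewards = [-100.0, -8.0, -10.0, -30.0, -12.0]
counts = running_counts(choices, n_arms=3)
averages = running_average(rewards)
print(counts[2])   # running count of arm 2
print(averages)
```

Each `counts[j]` can then be drawn as one line (for example with `plt.plot(counts[j], label=...)`), and the `averages` series from the five methods can be drawn on a shared axis for the comparison plot.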

Note: For the newsvendor instance described above, here are the expected costs per period for the five order quantities listed above:

* Q = 15: 104.2493
* Q = 20: 33.5112
* Q = 25: 9.2493
* Q = 30: 10.1683   
* Q = 35: 15.0018

(These costs were calculated analytically, from the theory, not from experimentation.)
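
The listed values can be reproduced with the standard normal loss function $L(z) = \phi(z) - z\,(1 - \Phi(z))$, which gives an expected cost of $h\,(Q - \mu) + (p + h)\,\sigma\,L(z)$ with $z = (Q - \mu)/\sigma$. The sketch below is one way to carry out that calculation; the function name `expected_cost` is illustrative, and the default arguments are the values of this instance.

```python
from statistics import NormalDist

def expected_cost(Q, mu=20.0, sigma=4.0, h=1.0, p=20.0):
    """Expected one-period newsvendor cost under normal demand."""
    std = NormalDist()                      # standard normal distribution
    z = (Q - mu) / sigma
    # standard normal loss function L(z) = phi(z) - z * (1 - Phi(z))
    loss = std.pdf(z) - z * (1.0 - std.cdf(z))
    # expected shortage is sigma * L(z); expected leftover is (Q - mu) + shortage
    return h * (Q - mu) + (p + h) * sigma * loss

for Q in (15, 20, 25, 30, 35):
    print(Q, round(expected_cost(Q), 4))
```

Since the reward is the negative of the cost, these numbers (with the sign flipped) are the values that the running average reward of a well-behaved method should approach for whichever arm it settles on.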


## Newsvendor Problem with Reinforcement Learning

In [20]:
from __future__ import print_function, division #https://stackoverflow.com/questions/7075082/what-is-future-in-python-used-for-and-how-when-to-use-it-and-how-it-works
from builtins import range
import numpy as np
import matplotlib.pyplot as plt

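# Newsvendor instance from the experiment description
mu, sigma = 20, 4     # mean and standard deviation of the normal demand
holding_cost = 1      # cost per unit left over at the end of the period
stockout_cost = 20    # cost per unit of unmet demand

# Adapted from a multi-armed bandit: each OrderQuantity instance is one arm
class OrderQuantity:
  def __init__(self, Q):
    self.Q = Q      # order quantity used whenever this arm is pulled
    self.mean = 0   # running estimate of the mean reward
    self.N = 0      # number of pulls so far

  def pull(self):
    # simulate one newsvendor period: draw a random demand and compute the cost
    demand = np.random.normal(mu, sigma)
    if self.Q < demand:
      cost = stockout_cost * (demand - self.Q)
    else:
      cost = holding_cost * (self.Q - demand)
    return -cost    # the reward is the negative of the cost

  def update(self, x):
    # x is the latest reward received from this arm
    self.N += 1
    self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x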

def run_experiment(quantities, eps, N):
  # quantities: candidate order quantities (one arm each)
  # eps: exploration probability, N: number of periods to simulate
  arms = [OrderQuantity(Q) for Q in quantities]

  data = np.empty(N)

  for i in range(N):
    # epsilon greedy
    p = np.random.random()
    if p < eps:
      j = np.random.choice(len(arms))          # explore: random arm
    else:
      j = np.argmax([a.mean for a in arms])    # exploit: best estimated mean reward
    x = arms[j].pull()
    arms[j].update(x)

    # for the plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

  # plot the running average reward
  plt.plot(cumulative_average)
  plt.xscale('log')
  plt.show()

  # estimated mean reward of each order quantity
  for a in arms:
    print(a.Q, a.mean)

  return cumulative_average


if __name__ == '__main__':
  quantities = [15, 20, 25, 30, 35]
  c_1 = run_experiment(quantities, 0.1, 100000)
  c_05 = run_experiment(quantities, 0.05, 100000)
  c_01 = run_experiment(quantities, 0.01, 100000)

  # log scale plot
  plt.plot(c_1, label='eps = 0.1')
  plt.plot(c_05, label='eps = 0.05')
  plt.plot(c_01, label='eps = 0.01')
  plt.legend()
  plt.xscale('log')
  plt.show()

  # linear plot
  plt.plot(c_1, label='eps = 0.1')
  plt.plot(c_05, label='eps = 0.05')
  plt.plot(c_01, label='eps = 0.01')
  plt.legend()
  plt.show()

