<a href="https://colab.research.google.com/github/chandrusuresh/ReinforcementLearning/blob/master/Policy_Iteration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
from scipy.special import factorial

## Policy Iteration
Policy iteration is an algorithm for finding an optimal policy by alternately evaluating the current policy and improving it. Each iteration involves two steps: the first evaluates the value function of the current policy, and the second improves the policy by acting greedily with respect to that value function. The first step is the iterative policy evaluation algorithm and the second is the policy improvement algorithm. The iterations stop once the improvement step leaves the policy unchanged. Refer to the Grid World example for more details on these two algorithms.
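The two steps can be sketched on a small example. The code below is a minimal illustration, not the notebook's implementation for the rental problem: the two-state MDP, the `policy_iteration` function and the `P`/`R` layout are all hypothetical names chosen for this sketch.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Policy iteration for a small tabular MDP (illustrative sketch).

    P[s][a] is a list of (probability, next_state) pairs and
    R[s][a] is the expected immediate reward for action a in state s.
    """
    n_states = len(P)
    policy = [0] * n_states   # start with action 0 everywhere
    V = [0.0] * n_states      # value estimate for the current policy

    def q(s, a):
        # One-step lookahead value of taking action a in state s
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    while True:
        # Step 1: iterative policy evaluation (sweep until values settle)
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break

        # Step 2: policy improvement (act greedily with respect to V)
        stable = True
        for s in range(n_states):
            best = max(range(len(P[s])), key=lambda a: q(s, a))
            # Switch only on a strict improvement so ties cannot cycle
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical toy MDP: in each state, action 0 stays and action 1 switches.
P = [
    [[(1.0, 0)], [(1.0, 1)]],   # transitions from state 0
    [[(1.0, 1)], [(1.0, 0)]],   # transitions from state 1
]
R = [
    [0.0, 1.0],                 # rewards in state 0
    [2.0, 0.0],                 # rewards in state 1
]

policy, V = policy_iteration(P, R)
print(policy, V)
```

In this toy problem, staying in state 1 earns the largest reward, so the greedy step should settle on moving from state 0 to state 1 and then staying there.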

## Jack's Rental Car Problem

Jack manages two locations for a nationwide car
rental company. Each day, some number of customers arrive at each location to rent cars.

Reward per car rented = \$10

Cost per car moved = \$2 (a reward of -\$2)

Max cars at each location = 20

Max cars moved overnight = 5

The number of cars rented and the number of cars returned at each location each follow a Poisson distribution,
$$P(\text{rented/returned} = n) = \frac{\lambda^n}{n!} e^{-\lambda}$$ where $\lambda$ is the expected number.

$\lambda_{rent}^1 = 3$, $\lambda_{rent}^2 = 4$ and

$\lambda_{return}^1 = 3$, $\lambda_{return}^2 = 2$
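
As a worked instance of the formula, take the rental rate at the second location, $\lambda_{rent}^2 = 4$. The first few probabilities are
$$P(n = 0) = e^{-4} \approx 0.0183, \quad P(n = 1) = 4 e^{-4} \approx 0.0733, \quad P(n = 2) = \frac{4^2}{2!} e^{-4} \approx 0.1465$$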

Discount factor $\gamma = 0.9$

Time steps = days

State variable = Number of cars at each location at the end of the day

Action variable = Net number of cars moved between the two locations overnight
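
The state and action definitions above can be enumerated directly. The sketch below is one possible encoding, under two assumptions that are not stated in the notebook: a positive action moves cars from location 1 to location 2 (negative for the reverse), and an action is feasible only if the source location has that many cars. The names `MAX_CARS`, `MAX_MOVE`, `states` and `feasible_actions` are hypothetical and mirror the limits of 20 and 5 given earlier.

```python
MAX_CARS = 20   # maximum cars at each location
MAX_MOVE = 5    # maximum cars moved overnight

# A state is a pair (cars at location 1, cars at location 2)
states = [(i, j) for i in range(MAX_CARS + 1) for j in range(MAX_CARS + 1)]


def feasible_actions(cars_1, cars_2):
    """Net moves from location 1 to location 2 (assumed sign convention)."""
    lowest = -min(cars_2, MAX_MOVE)   # most cars that can go from 2 to 1
    highest = min(cars_1, MAX_MOVE)   # most cars that can go from 1 to 2
    return list(range(lowest, highest + 1))


print(len(states))
print(feasible_actions(7, 7))
print(feasible_actions(2, 0))
```

With 21 possible counts per location there are 441 states, and each state has at most 11 actions.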

In [0]:
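max_cars = 20      # maximum number of cars at a location
max_move = 5       # maximum number of cars moved overnight
rent = [3,4]       # expected daily rentals at locations 1 and 2
retn = [3,2]       # expected daily returns at locations 1 and 2
rent_cost = 10     # reward per car rented
move_cost = -2     # reward (negative) per car moved

def poisson(lamda,n):
  # Poisson probability of n events when the expected number is lamda
  return lamda**n/factorial(n)*np.exp(-lamda)

def prob_rent(init_cars,lambda_rent):
  # Distribution of the number of cars rented when init_cars are available.
  # At most init_cars can be rented, so the remaining (tail) probability
  # is assigned to renting all init_cars.
  sum_prob = 0
  p_rent = dict()
  for i in range(init_cars):
    p_rent[i] = poisson(lambda_rent,i)
    sum_prob = sum_prob + p_rent[i]
  p_rent[init_cars] = 1 - sum_prob
  return p_rent

def prob_retn(init,prob_rent,lambda_retn):
  # Distribution of the number of cars at the end of the day, given init cars
  # at the start and the rental distribution prob_rent. Returns that would
  # exceed max_cars are lumped into the max_cars outcome.
  prob_cars = dict()
  for k in prob_rent.keys():
    c = init-k  # cars left after k rentals
    sum_prob = 0
    for i in range(max_cars-c):
      prob = poisson(lambda_retn,i)
      sum_prob = sum_prob + prob
      prob_cars[c+i] = prob_cars.get(c+i,0) + prob_rent[k]*prob
    # Remaining probability: enough returns to reach max_cars
    prob_cars[max_cars] = prob_cars.get(max_cars,0) + prob_rent[k]*(1-sum_prob)
  return prob_cars

def transition(lambda_rent,lambda_retn):
  # p_trans[n][m] = probability of ending the day with m cars after starting with n
  p_trans = dict()
  for init_cars in range(max_cars+1):
    p_rent = prob_rent(init_cars,lambda_rent)
    p_trans[init_cars] = prob_retn(init_cars,p_rent,lambda_retn)
  return p_trans

# One transition table per location
p_trans_1 = transition(rent[0],retn[0])
p_trans_2 = transition(rent[1],retn[1])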
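
The per-location tables can be combined into a transition probability for the full two-location state. The sketch below is a dependency-free restatement of the same truncation logic, assuming that the two locations evolve independently during the day, so the joint probability is the product of the per-location probabilities. The names `poisson_pmf`, `end_of_day_distribution` and `joint` are hypothetical and not part of the notebook.

```python
import math

MAX_CARS = 20


def poisson_pmf(lam, n):
    """Poisson probability of n events with expected number lam."""
    return lam**n / math.factorial(n) * math.exp(-lam)


def end_of_day_distribution(init_cars, lam_rent, lam_retn):
    """Probability of each end-of-day car count at one location."""
    dist = [0.0] * (MAX_CARS + 1)
    for rented in range(init_cars + 1):
        if rented < init_cars:
            p_rented = poisson_pmf(lam_rent, rented)
        else:
            # Tail probability: every available car is rented
            p_rented = 1.0 - sum(poisson_pmf(lam_rent, i) for i in range(init_cars))
        left = init_cars - rented
        tail = 1.0
        for returned in range(MAX_CARS - left):
            p_ret = poisson_pmf(lam_retn, returned)
            dist[left + returned] += p_rented * p_ret
            tail -= p_ret
        # Enough returns to fill the lot: capped at MAX_CARS
        dist[MAX_CARS] += p_rented * tail
    return dist


# Both locations start the day with 7 cars (rates taken from the problem statement)
d1 = end_of_day_distribution(7, 3, 3)
d2 = end_of_day_distribution(7, 4, 2)

# Assumed independence: joint[i][j] = P(location 1 ends with i, location 2 with j)
joint = [[p1 * p2 for p2 in d2] for p1 in d1]
total = sum(sum(row) for row in joint)
print(total)
```

Keeping one table per location and multiplying on demand avoids storing a 441 by 441 joint matrix.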

## Test

In [48]:
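cars_init = [7,7]
# Recompute the distributions for location 2 starting from 7 cars
p_rent = prob_rent(cars_init[1],rent[1])
p_retn = prob_retn(cars_init[1],p_rent,retn[1])
print("Probability of Rent")
print(p_rent)
print(np.sum(list(p_rent.values())))  # should be 1
print("Probability of Return")

# Each difference should be 0: the direct computation must match the
# corresponding row of the precomputed transition table
total_prob = 0
for i in range(max_cars+1):
  print(i,p_retn[i]-p_trans_2[cars_init[1]][i])
  total_prob = total_prob + p_retn[i]
print(total_prob)  # both totals should be 1 up to rounding
print(np.sum(list(p_retn.values())))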

Probability of Rent
{0: 0.01831563888873418, 1: 0.07326255555493671, 2: 0.14652511110987343, 3: 0.19536681481316456, 4: 0.19536681481316456, 5: 0.15629345185053165, 6: 0.1041956345670211, 7: 0.11067397840257387}
1.0
Probability of Return
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
1.0000000000000002
0.9999999999999999
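
The rental distribution also determines the expected rental income for a day, which is the reward term needed when evaluating a policy. The sketch below is one way to compute it, assuming the same truncation as `prob_rent` (all remaining probability goes to renting every available car). The function name `expected_rentals` and the constant `RENT_REWARD` are hypothetical.

```python
import math

RENT_REWARD = 10  # reward per car rented, from the problem statement


def expected_rentals(init_cars, lam):
    """Expected number of cars rented when init_cars are available."""
    probs = [lam**k / math.factorial(k) * math.exp(-lam) for k in range(init_cars)]
    probs.append(1.0 - sum(probs))  # tail: all init_cars are rented
    return sum(k * p for k, p in enumerate(probs))


# Location 2 with 7 cars and an expected 4 rentals per day
expected_reward = RENT_REWARD * expected_rentals(7, 4)
print(expected_reward)
```

Because rentals are capped by the cars on hand, the expected number rented is always below the rate $\lambda$.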
