Ideas:

* Probability of making a request wait starts high, then decreases with each wait: it is reset when an assignment is chosen over waiting; a coefficient between 0 and 1 multiplies this probability, raised to the power of the number of assignments
* Splitting into epochs of constant period (= batching): the temporal reward is easy to express ==> problem: an epoch is triggered by a new request, so not necessarily right at the end of a period
* Hot/cold areas: if two cars conflict, assign the one from the cold area; if two requests conflict, assign the one that leads into a hot area
* Policy evaluation?
* Metrics: mean trip time to serve a request, fraction of requests served
* Collective greedy policy?
* Improve the heuristic
* GRASP metaheuristic ==> heuristic?
* Transformers to handle the interchangeability of cars in the first network (attention)
* Handle relocation conflicts: global information about the other vehicles
* Store request locations so they can be filtered over a given period close to the epoch, in order to cluster hot areas
* Relocation network taking as input the time for each car to reach each lot, a ranking of the most attractive lots and information about the other cars, and why not information about previous requests
* Input of the request-assignment network: car coords, travel time to the request for each car, car zone, request coords, request zone, time, number of available cars, number of pending requests, number of requests in the last 15 minutes
* If no request after one minute: reposition
* Cerebellar embedding ==> hierarchical hexagon grid system?
* Regularize gradient descent with a penalty term (Lipschitz constant of the network) at each iteration?
* Normalized speed inconsistent with the environment's draw
* Autoencoder to get around attention
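
The first idea in the list can be sketched as follows. This is only an illustration under stated assumptions: `WaitProbability`, `p0` and `coef` are hypothetical names and values, and the exponent is assumed to count the consecutive waits since the last assignment, which is one possible reading of the note.

```python
import random

class WaitProbability:
    """Probability of deferring a request, decayed after every wait.

    Hypothetical helper: p0 and coef are illustrative parameters, and the
    exponent is assumed to count consecutive waits since the last assignment.
    """

    def __init__(self, p0=0.9, coef=0.5):
        self.p0 = p0          # initial (high) probability of waiting
        self.coef = coef      # multiplicative coefficient between 0 and 1
        self.n_waits = 0      # consecutive waits since the last assignment

    def current(self):
        # p = p0 * coef ** n_waits
        return self.p0 * self.coef ** self.n_waits

    def decide(self, rng=random):
        """Return True to wait, False to assign, and update the counter."""
        wait = rng.random() < self.current()
        if wait:
            self.n_waits += 1
        else:
            self.n_waits = 0  # reset as soon as an assignment is chosen
        return wait
```

With `p0=0.8` and `coef=0.5`, the probability of waiting goes 0.8, 0.4, 0.2, ... until an assignment resets it to 0.8.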



In [43]:
# Import 

#!pip install pyhailing
#!pip install --upgrade Pillow # Restart runtime

import torch.nn as nn
import torch
import numpy
import numpy as np #To improve
import pyhailing
from pyhailing import RidehailEnv
import tqdm
import random
from collections import OrderedDict
import time as t
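import torch.nn.functional as F
import matplotlib.pyplot as plt # Needed to plot the scores after training

# Device: fall back to the CPU when CUDA is not available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')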

## Distance functions

In [2]:
env = RidehailEnv()
speeds_data = env.speeds_data.reset_index()

In [3]:
def dist_manhattan(list_coord_depart, list_coord_arrivee):
  """ Return the Manhattan distance between two points of a plane.
  Takes the coordinates of the departure and arrival points as input:
  ([x_depart,y_depart],[x_arrivee,y_arrivee]) """
  
  depart = numpy.array(list_coord_depart)
  arrivee = numpy.array(list_coord_arrivee)
  return numpy.linalg.norm((depart - arrivee), ord=1)

In [4]:
def vitesse_normalisee(vitesse_moyenne, sigma):
  """ Return one random draw from the normal law N(vitesse_moyenne, sigma**2),
  truncated below at the empirical 10% quantile of 1000 samples """

  #return numpy.random.randn(1)*sigma + vitesse_moyenne
  # Accept scalars as well as one-element pandas Series or arrays
  vitesse_moyenne = float(numpy.asarray(vitesse_moyenne).ravel()[0])
  sigma = float(numpy.asarray(sigma).ravel()[0])
  loi_normale = numpy.random.randn(1000)*sigma + vitesse_moyenne
  limite = numpy.quantile(loi_normale, .10)
  # Keep only the samples above the 10% quantile, then pick one of them
  normale_tronquee = loi_normale[loi_normale >= limite]
  return random.choice(normale_tronquee)

In [5]:
def duree_deplacement(list_coord_depart, list_coord_arrivee, time_):
  """ Return the duration of a trip in seconds.
  Takes the coordinates of the departure and arrival points and the time as input:
  ([x_depart,y_depart],[x_arrivee,y_arrivee],time) """

  distance = dist_manhattan(list_coord_depart, list_coord_arrivee)
  depart = numpy.array(list_coord_depart)
  arrivee = numpy.array(list_coord_arrivee)
  zone_depart = env.xy_to_zone(depart)
  zone_arrivee = env.xy_to_zone(arrivee)
  tranche_horaire = int(time_/15/60)*15 # Start minute of the 15-minute slot
  # Select once the speed statistics of this (pickup zone, dropoff zone, time slot)
  lignes = speeds_data[(speeds_data['puzone']==zone_depart) & (speeds_data['dozone']==zone_arrivee) & (speeds_data['min']==tranche_horaire)]
  vitesse_moyenne = lignes['speed_mean']
  sigma = lignes['speed_stddev']
  #vitesse_associee = vitesse_normalisee(vitesse_moyenne, sigma) # Takes way too much time to process
  temps = distance/vitesse_moyenne
  return numpy.array(temps) # Array of shape (1,) when exactly one row matches
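
The arithmetic behind the slot lookup and the duration can be checked in isolation. The sketch below assumes, as the code above suggests, that the time is given in seconds and that the `min` column holds the start minute of each 15-minute slot; the helper names and the numbers are hypothetical.

```python
def tranche_horaire(time_s):
    """Start minute of the 15-minute slot containing time_s (in seconds)."""
    return int(time_s // (15 * 60)) * 15

def duree(distance, vitesse):
    """Travel time = distance / speed, expressed in the time unit of the speed."""
    return distance / vitesse

# 16735.7 s falls in the slot that starts at minute 270 (04:30)
print(tranche_horaire(16735.7))
# Hypothetical values: 1200 distance units at 8 units per second
print(duree(1200.0, 8.0))
```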

In [6]:
def distance_to_request(car_coord, req_coord, car_job, time, first_job_coord, second_job_coord, third_job_coord):
  """
  Return the travel time to a request, taking into account the jobs already queued for the car.
  """
  car_job = str(car_job[0]) + str(car_job[1]) + str(car_job[2])
  if car_job in ['044','444','104']: #Mistake 104
    return duree_deplacement(car_coord,req_coord,time)
  if car_job in ['344']:
    return duree_deplacement(car_coord,first_job_coord[0],time) + duree_deplacement(first_job_coord[0],first_job_coord[1],time) + duree_deplacement(first_job_coord[1],req_coord,time)
  if car_job == '234':
    return duree_deplacement(car_coord,first_job_coord[0],time) + duree_deplacement(first_job_coord[0],first_job_coord[1],time) + duree_deplacement(first_job_coord[1],second_job_coord[0],time) + duree_deplacement(second_job_coord[0],second_job_coord[1],time) + duree_deplacement(second_job_coord[1],req_coord,time)
  # Three queued jobs: the last leg starts from the end of the third job
  return duree_deplacement(car_coord,first_job_coord[0],time) + duree_deplacement(first_job_coord[0],first_job_coord[1],time) + duree_deplacement(first_job_coord[1],second_job_coord[0],time) + duree_deplacement(second_job_coord[0],second_job_coord[1],time) + duree_deplacement(second_job_coord[1],third_job_coord[0],time) + duree_deplacement(third_job_coord[0],third_job_coord[1],time) + duree_deplacement(third_job_coord[1],req_coord,time)
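
The chained sums above all follow the same pattern: car, then start and end of each queued job, then request. A minimal standalone sketch of that pattern is given below; `route_duration`, `manhattan` and the constant speed are hypothetical and only stand in for `duree_deplacement`.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route_duration(start, jobs, request, leg_time):
    """Sum the leg durations: start -> (begin, end) of each job -> request."""
    total, here = 0.0, start
    for begin, end in jobs:
        total += leg_time(here, begin) + leg_time(begin, end)
        here = end  # the next leg starts where the job ends
    return total + leg_time(here, request)

# Hypothetical constant speed of 10 distance units per second
leg = lambda a, b: manhattan(a, b) / 10.0
print(route_duration((0, 0), [((10, 0), (10, 20))], (0, 20), leg))
```

Writing the route as a loop makes it harder to start the last leg from the wrong endpoint, whatever the number of queued jobs.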

## Heuristic functions

In [7]:
def triplets_jobs(state):
  """ Takes the state of the environment as input and returns a dict mapping each
  job triplet (e.g. '444') to the list of indices of the vehicles holding it
  """
  jobs = state['v_jobs']
  dic = {'044':[],'104':[],'234':[],'323':[],'344':[],'444':[]}
  for i in range(len(jobs)):
    triplet = str(jobs[i][0]) + str(jobs[i][1]) + str(jobs[i][2])
    dic[triplet] += [i]
  return dic
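
The same grouping can be written with a `defaultdict`, which does not fail on a triplet missing from the hard-coded keys. This is a sketch, not the notebook's implementation: `group_by_triplet` is a hypothetical name and the job tuples are made up.

```python
from collections import defaultdict

def group_by_triplet(v_jobs):
    """Map each job triplet (as a string) to the indices of its vehicles."""
    groups = defaultdict(list)
    for idx, jobs in enumerate(v_jobs):
        groups[''.join(str(j) for j in jobs)].append(idx)
    return groups

# Made-up job triplets for four vehicles
v_jobs = [(4, 4, 4), (0, 4, 4), (4, 4, 4), (2, 3, 4)]
groups = group_by_triplet(v_jobs)
print(groups['444'])  # -> [0, 2]
print(groups['323'])  # -> [] (missing keys default to an empty list)
```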

In [8]:
def plus_proche_lot(coords_voiture, time):
  """ Return the index of the closest lot. Takes the coordinates [x,y] of the vehicle
  of interest and the time as input
  """
  lots = numpy.array(env.lots)
  durees = []
  for i in range(len(lots)):
    durees.append(duree_deplacement(coords_voiture, lots[i], time)[0])
  return np.argmin(durees)

In [9]:
def heuristic(state):
  """
  Returns the reposition according to the heuristic used.
  """
  triplets = triplets_jobs(state)
  reposition = [env.num_lots]*env.num_vehicles 
  if len(triplets['444']) > 0:
    for i in range(len(triplets['444'])):
      lot = plus_proche_lot(state['v_locs'][triplets['444'][i]], state['time'])
      reposition[triplets['444'][i]] = lot
  return np.array(reposition) #To improve: create the array directly

## Q learning algorithm

In [54]:
# Memory

class ReplayBuffer():
    def __init__(self, max_size, device):
        self.max_size = max_size
        self.mem_cntr = 0

        self.state_x_memory = [] #It stores the 'x' input
        self.state_y_memory = [] #It stores the 'y' input
        self.new_state_x_memory = [] 
        self.new_state_y_memory = []
        self.action_memory = [] #Consider action going from 0 to nb_car // Size : self.mem_size*nb_actions_to_make_in_state(it varies)

        self.terminal_memory = [] #We could use arrays here
        self.reward_memory = [] #We could use arrays here
        self.device = device

    def push(self, state_x, state_y, action, reward, new_state_x, new_state_y, done):
        """
        Add a new sample and replace oldest one if full
        """
        self.state_x_memory += [state_x]
        self.state_y_memory += [state_y]
        self.new_state_x_memory += [new_state_x]
        self.new_state_y_memory += [new_state_y]
        self.action_memory += [action]
        self.reward_memory += [reward]
        self.terminal_memory += [done]
        self.mem_cntr += 1

        # Remove the oldest element once the buffer exceeds max_size.
        if self.mem_cntr>self.max_size:
          self.state_x_memory.pop(0)
          self.state_y_memory.pop(0)
          self.new_state_x_memory.pop(0)
          self.new_state_y_memory.pop(0)
          self.action_memory.pop(0)
          self.reward_memory.pop(0)
          self.terminal_memory.pop(0)

    def sample(self, batch_size):
        """
        Sample from the memory
        return : list of size 'batch_size' containing different observations.
        """
        max_mem = min(self.mem_cntr, self.max_size)
        batch = np.random.choice(max_mem, batch_size, replace=False) #It's probably not gonna work that way.

        states_x = []
        states_y = []
        actions = []
        rewards = []
        new_states_x = []
        new_states_y = []
        terminal = []
        for ele in batch:
          states_x += [self.state_x_memory[ele]]
          states_y += [self.state_y_memory[ele]]
          actions += [self.action_memory[ele]]
          rewards += [self.reward_memory[ele]]
          new_states_x += [self.new_state_x_memory[ele]]
          new_states_y += [self.new_state_y_memory[ele]]
          terminal += [self.terminal_memory[ele]]

        return states_x, states_y, actions, self.to_torch(rewards), new_states_x, new_states_y, self.to_torch(terminal)

    def to_torch(self, x):
        return torch.tensor(x).to(self.device)

    def to_numpy(self, x):
        return x.detach().cpu().numpy()

    def __len__(self):
        return min(self.mem_cntr, self.max_size) 
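
As an alternative to the parallel lists and `pop(0)`, a bounded `collections.deque` drops the oldest transition automatically. The sketch below is a simplified, hypothetical `DequeReplayBuffer` that stores whole transitions as tuples and leaves the tensor conversion out.

```python
import random
from collections import deque

class DequeReplayBuffer:
    """Fixed-size replay memory; the oldest transition is evicted first."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)  # eviction handled by the deque

    def push(self, *transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sample without replacement, then regroup field by field
        batch = random.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buf = DequeReplayBuffer(3)
for k in range(5):
    buf.push(k, k * 10)
print(len(buf), list(buf.buffer))
```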

In [11]:
# Network

class ReqN(nn.Module):
    def __init__(self, x_input_size, y_input_size, nb_car, hidden_size_1=10,hidden_size_2=500):
        super().__init__()
        self.x_input = x_input_size
        self.y_input = y_input_size
        self.nb_car = nb_car
        self.linear_1 = nn.Linear(x_input_size, hidden_size_1)
        self.linear_2 = nn.Linear(hidden_size_1, hidden_size_1)
        self.linear_3 = nn.Linear(hidden_size_1*nb_car+y_input_size, hidden_size_2)
        self.linear_4 = nn.Linear(hidden_size_2, hidden_size_2)
        self.linear_5 = nn.Linear(hidden_size_2, nb_car+1)
        self.relu = nn.ReLU()

    def forward(self, x, y): #x : batch_size*(Nb_car)*x_input  ; y : batch_size*(nb_caracteristics_global : y_input)
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.linear_2(x)
        x = self.relu(x)
        x = torch.flatten(x,start_dim=1)
        x = torch.cat((x,y),1)

        x = self.linear_3(x)
        x = self.relu(x)
        x = self.linear_4(x)
        x = self.relu(x)

        output = self.linear_5(x) # batch_size*(nb_car+1): one Q-value per action
        return output

In [99]:
# Agent

class Agent(object):

    def __init__(self, 
                 n_actions,
                 memory, 
                 eps, eps_decay,
                 discount_rate, 
                 update_delay, 
                 device
                 ):
        
        self.action_space = [i for i in range(n_actions)] # n_actions = nb of cars + 1
        self.memory = memory 
        self.eps, self.eps_decay = eps, eps_decay
        self.discount_rate = discount_rate
        self.update_delay = update_delay
        self.counter = 0
        self.device = device

    def init_nets(self, Q_net, target_net, optimizer, batch_size):
        """
        initialize online and targets
        """
        self.Q_net = Q_net
        self.target_net = target_net
        self.optimizer = optimizer
        self.batch_size = batch_size

        self.x_input = self.Q_net.x_input
        self.y_input = self.Q_net.y_input
        self.nb_car = self.Q_net.nb_car

        self.copy_weights()
        self.counter += 1
    
    def to_torch(self, x):
        return torch.tensor(x).to(self.device).float()

    def to_numpy(self, x):
        return x.detach().cpu().numpy()

    def create_mask(self,jobs,state,state_tensor_x,time_max=180): 
      """
      For every request of a state, create a mask of the cars that cannot be assigned to it.
      return: a tensor of 0 and 1 where 1 means that we mask the outcome, size: nb_request*(nb_car+1)
      """
      # Deal with cars, for which the distance is too high.
      nb_request = state_tensor_x.shape[0]
      time = state['time']
      mask = torch.zeros(nb_request,self.nb_car+1).to(self.device)
      for i in range(nb_request):
        request_time = state['request_times'][i]
        for j in range(self.nb_car):
          mask[i][j] = (time_max - state_tensor_x[i][j][2] + request_time - time) < 0 #if not accesible then 1.

      # Deal with cars which already have too many jobs
      for car in (jobs['323']+jobs['234']):
        for i in range(nb_request):
          mask[i,car] = 1 
      return mask

    def forward(self, states_x, states_y, targets, mask, list_association):
        """
        Train online net for 1 step
        """
        self.optimizer.zero_grad()
        # Forward pass
        self.Q_net.train()
        Q_values = self.Q_net(states_x,states_y).to(device)

        # Masking
        Q_values = (Q_values*mask).sum(-1)

        # Aggregate
        Q_values = self.aggregation(Q_values,list_association)

        # Computing loss
        loss = (targets.detach() - Q_values).pow(2).mean()
        loss.backward()

        # Apply gradients
        self.optimizer.step()

        self.Q_net.eval()

    def copy_weights(self):
        """
        Copy weights from online to target net
        """
        self.target_net.load_state_dict(self.Q_net.state_dict())

    def select_action(self, states_x, states_y, mask): # To optimize: take the "no assignment" action very frequently at the beginning.
        """
        Select an action with eps greedy as well as dealing with the overlapping issue
        """
        # Epsilon greedy

        list_action = []
        for i,(state_x,state_y) in enumerate(zip(states_x,states_y)):
          rand = np.random.random()
          if rand < self.eps: #If we choose randomly
            action = np.random.choice(self.action_space)
            while (action in list_action or mask[i][action]==1) and action != self.nb_car: #Continue until we find an action we can realize. #To improve
              action = np.random.choice(self.action_space)
            list_action += [action]
          else: #If we take the max Q value
            with torch.no_grad():
              tensor_action = self.Q_net(state_x.unsqueeze(0),state_y.unsqueeze(0)).squeeze(0)
            # Mask the actions we cannot take with -inf, which works whatever the sign of the Q-values
            tensor_action = tensor_action.masked_fill(mask[i].bool(), float('-inf'))
            # Deal with overlapping actions: first come, first served.
            # Walk the actions by decreasing Q-value; action nb_car (never masked) can always be taken.
            for candidate in torch.argsort(tensor_action, descending=True).tolist():
              if candidate == self.nb_car or candidate not in list_action:
                action = candidate
                break
            list_action += [action]
                                          
        return np.array(list_action) #To improve: we don't need to create a list

    def remember(self, *args):
        """
        Update memory
        args: state_x, state_y, action, reward, new_state_x, new_state_y, done
        """

        self.memory.push(*args)

    def regroup_tensor(self, states_x, states_y):
      """
      Regroup tensors into a unique tensor
      return : a unique tensor composed of every input tensors with a list which associates the input with the output
      """
      # Create an association list between the elements of this batch and compute the total amount of request in this batch.
      n_element = 0
      list_association = []

      for i,(state_x,state_y) in enumerate(zip(states_x,states_y)):
        n_element += len(state_x)
        for j in range(len(state_x)):
          list_association += [[i,j]]
      
      # Fill the tensors we're going to use for the batch
      states_x_ = torch.zeros(n_element,self.nb_car,self.x_input).to(device)
      states_y_ = torch.zeros(n_element,self.y_input).to(device)

      count = 0
      for i,(state_x,state_y) in enumerate(zip(states_x,states_y)):
        for j in range(len(state_x)):
          states_x_[count] = state_x[j]
          states_y_[count] = state_y[j]
          count += 1

      return states_x_,states_y_,list_association

    def dict_to_network(self,state):
      """
      Takes (in input) the state dict and transforms it into a tensor while selecting the right features.
      """
      # Build tensors x and y.
      n = len(state['request_times'])
      x_input = torch.zeros(n,self.nb_car,self.x_input).to(self.device)
      y_input = torch.zeros(n,self.y_input).to(self.device)
      for i in range(n):
        for car in range(self.nb_car):
          x_input[i][car][0] = state['v_locs'][car][0]
          x_input[i][car][1] = state['v_locs'][car][1]
          # Calculate the distance to the request.
          first_job_coord = state['v_job_locs'][car][0] #We retrieve it even if it's not necessary
          second_job_coord = state['v_job_locs'][car][1] 
          third_job_coord = state['v_job_locs'][car][2]
          x_input[i][car][2] = self.to_torch(distance_to_request(state['v_locs'][car], state['request_locs'][i][0], state['v_jobs'][car], state['time'], first_job_coord, second_job_coord, third_job_coord))
        y_input[i][0] = state['dow']
        y_input[i][1] = state['time']
        y_input[i][2] = state['request_times'][i]

      return x_input,y_input

    def action_to_tensor(self, actions):
      """
      Transforms the list of actions of a batch into a single tensor.
      """
      list_res = []
      for list_action in actions:
        for action in list_action:
          list_res += [action]
      return self.to_torch(np.array(list_res))
    
    def aggregation(self, Q_values, list_association):
      """
      Aggregate the Q-values of the requests sharing the same timestep using the mean.
      """
      # Index of the batch element each row of Q_values belongs to
      index = torch.tensor([pair[0] for pair in list_association], dtype=torch.long, device=Q_values.device)
      # Sum per batch element with tensor ops, so that gradients can flow back to Q_values
      sums = torch.zeros(self.batch_size, dtype=Q_values.dtype, device=Q_values.device).index_add(0, index, Q_values)
      # Number of requests per batch element (at least 1 to avoid dividing by zero)
      counts = torch.bincount(index, minlength=self.batch_size).clamp(min=1)

      return sums / counts


    def policy(self):
        """
        Apply deep Q-learning algorithm step
        """
        if len(self.memory) >= self.batch_size:

            # Sample from memory
            states_x, states_y, actions, reward, new_states_x, new_states_y, done = self.memory.sample(self.batch_size) #Gotta have a reward per second probably

            # Regroup the states and compute the association lists
            states_x, states_y, list_association = self.regroup_tensor(states_x, states_y)
            new_states_x, new_states_y, new_list_association = self.regroup_tensor(new_states_x, new_states_y)

            # Compute target Q value
            target_Q_value = self.target_net(new_states_x,new_states_y)

            # Get action with highest Q_value
            highest_Q_value = torch.max(target_Q_value, dim=-1)[0] #We should also here deal with the overlapping issue but it's not that important!!

            # Agglomerate the different Q_value
            Agglomerate_Q_value = self.aggregation(highest_Q_value,new_list_association)

            # Compute target
            target_Q_value = reward + self.discount_rate*Agglomerate_Q_value*done # "done" holds 1-int(done), so the bootstrap term vanishes on terminal transitions. The reward is a mean reward (t); the discount rate should depend on t; to verify.

            # Change actions into usable tensors
            actions = self.action_to_tensor(actions)

            # One-hot encoding of the actions taken, used to select their Q-values
            mask = F.one_hot(actions.long(), num_classes=self.nb_car+1)

            # Train network         
            self.forward(states_x, states_y, target_Q_value, mask, list_association)

            # Copy weights
            if self.counter % self.update_delay == 0:
                self.copy_weights()

            # Update epsilon
            self.eps *= self.eps_decay

            self.counter += 1
        else:
            return
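
The target built in `policy` is the reward plus the discounted mean, over the requests of the next state, of the best target-network Q-value, zeroed on terminal transitions. A dependency-free sketch of that computation is shown below; the function names and the numbers are hypothetical.

```python
def grouped_mean(values, group_ids, n_groups):
    """Mean of the values per group; empty groups yield 0.0."""
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for v, g in zip(values, group_ids):
        sums[g] += v
        counts[g] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def td_targets(rewards, next_q_rows, group_ids, not_done, gamma):
    """y = r + gamma * not_done * mean over requests of max_a Q_target."""
    best = [max(row) for row in next_q_rows]            # max over actions
    agg = grouped_mean(best, group_ids, len(rewards))   # mean over requests
    return [r + gamma * a * nd for r, a, nd in zip(rewards, agg, not_done)]

# Two transitions: the first next state has two requests, the second has one
targets = td_targets(
    rewards=[1.0, 2.0],
    next_q_rows=[[0.0, 2.0], [4.0, 1.0], [3.0, -1.0]],
    group_ids=[0, 0, 1],
    not_done=[1, 0],
    gamma=0.5,
)
print(targets)  # -> [2.5, 2.0]
```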
        
    

In [101]:
# Training Loop

env = RidehailEnv()

MEMORY_SIZE = 4000
X_INPUT_SIZE = 3
Y_INPUT_SIZE = 3
N_CAR = env.num_vehicles

LR = 0.001
BATCH_SIZE = 16

N_SIMULATION = 10
EPS = 0.3
EPS_DECAY = 0.9995
DISCOUNT_RATE = 0.99 # should depend of t
UPDATE_DELAY = 50 # delay between target_net parameters updates
DEVICE = "cuda" if torch.cuda.is_available() else "cpu" # "cuda" or "cpu"

# Model and target model 
Q_net = ReqN(X_INPUT_SIZE, Y_INPUT_SIZE, N_CAR).to(DEVICE)
target_net = ReqN(X_INPUT_SIZE, Y_INPUT_SIZE, N_CAR).to(DEVICE)

# Optimizer (only on Q_net)
optimizer = torch.optim.Adam(Q_net.parameters(), lr=LR)

# Memory
memory = ReplayBuffer(MEMORY_SIZE, DEVICE)

# Agent and initialization
agent = Agent(n_actions=N_CAR+1, 
              memory=memory, 
              eps=EPS, 
              eps_decay=EPS_DECAY,
              discount_rate=DISCOUNT_RATE,
              update_delay=UPDATE_DELAY, 
              device=DEVICE
              )

agent.init_nets(Q_net, target_net, optimizer, BATCH_SIZE)

all_scores = []
# Progress bar
with tqdm.tqdm(total=N_SIMULATION, position=0, leave=True) as pbar:
    for i in range(N_SIMULATION):
        done = False
        score = 0
        # Reset env
        state = env.reset()

        # Make sure that the first state is a state with request
        while len(state['request_times']) == 0:
          action_rep = heuristic(state)
          action = {'reposition': action_rep, 'req_assgts': np.array([]), 'req_rejections': np.array([])}
          state, reward, _, _ = env.step(action)
          score += reward

        #Store the states in a list for the analysis.
        list_state = []
        list_state += [state]
        state_tensor_x,state_tensor_y = agent.dict_to_network(state)
        while not done:
            print('Running...')
            # Retrieve lists of triplets from state
            jobs = triplets_jobs(state)

            # Create the mask for every request (taking into account the distance)
            mask = agent.create_mask(jobs,state,state_tensor_x)

            # Apply Heuristic
            action_rep = heuristic(state)

            # Select action
            action_req = agent.select_action(state_tensor_x,state_tensor_y, mask)

            # Construct action # A not so probable error to correct : if request assign at the same time that a reposition is requested : gotta change the status!!!
            action = OrderedDict({'reposition': action_rep, 'req_assgts': action_req, 'req_rejections': np.zeros_like(action_req)}) #Need to deal with the rejections probably #May need to be an ordered dict

            print('Action:', action)
            day = state['dow']
            time = state['time']
            print(f'day : {day}/ time : {time}')
            # Execute action
            new_state, reward, done, _ = env.step(action) #Should compute the mean reward.
              
            # While new_state has no request and the episode is not over: apply the heuristic and create a new action.
            while len(new_state['request_times']) == 0 and not done:
              action_rep = heuristic(new_state)
              action = {'reposition': action_rep, 'req_assgts': np.array([]), 'req_rejections': np.array([])}
              new_state, reward_add, done, _ = env.step(action)
              reward += reward_add
              
            score += reward

            # Transforms state into a usable tensor for the : request network
            new_state_tensor_x,new_state_tensor_y = agent.dict_to_network(new_state)

            # Update memory
            agent.remember(state_tensor_x,state_tensor_y, action_req, reward, new_state_tensor_x, new_state_tensor_y, 1-int(done))

            # Apply algorithm
            agent.policy() #We should maybe iterate to apply the policy more often.

            # Update state
            state = new_state
            list_state += [state]
            state_tensor_x,state_tensor_y = new_state_tensor_x,new_state_tensor_y
        
        all_scores.append(score)

        pbar.set_description('score=' + str(score))
        pbar.update()

plt.plot(all_scores)
plt.show()

  0%|          | 0/10 [00:00<?, ?it/s]

Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([15])), ('req_rejections', array([0]))])
day : 4/ time : 224.29111309871996
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([7])), ('req_rejections', array([0]))])
day : 4/ time : 484.9911660731291
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([6])), ('req_rejections', array([0]))])
day : 4/ time : 488.7343229001261
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([20])), ('req_rejections', array([0]

Int64Index([3], dtype='int64').


Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([3])), ('req_rejections', array([0]))])
day : 4/ time : 12414.237043307852
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302,  74, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([2])), ('req_rejections', array([0]))])
day : 4/ time : 12507.565554896144
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([3])), ('req_rejections', array([0]))])
day : 4/ time : 12761.008963609343
Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([20])), ('req_rejections', array([0

Int64Index([10], dtype='int64').


Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([10])), ('req_rejections', array([0]))])
day : 4/ time : 16575.673424412063
Running...
Action: OrderedDict([('reposition', array([302,  18, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([20])), ('req_rejections', array([0]))])
day : 4/ time : 16689.56087295142


  0%|          | 0/10 [01:08<?, ?it/s]

Running...
Action: OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302, 302, 302, 302, 302, 302, 302])), ('req_assgts', array([20, 17])), ('req_rejections', array([0, 0]))])
day : 4/ time : 16735.746143822773





KeyboardInterrupt: ignored
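
Since the exploration rate is multiplied by `EPS_DECAY` after each learning step, its value after n steps is `EPS * EPS_DECAY**n`. The sketch below computes how many learning steps are needed to reach a given rate; the helper names and the target of 0.05 are hypothetical.

```python
import math

def epsilon_after(eps0, decay, n_steps):
    """Exploration rate after n_steps multiplicative decays."""
    return eps0 * decay ** n_steps

def steps_to_reach(eps0, decay, target):
    """Smallest n such that eps0 * decay**n <= target (with 0 < decay < 1)."""
    return math.ceil(math.log(target / eps0) / math.log(decay))

# Values of the training cell: EPS = 0.3, EPS_DECAY = 0.9995
n = steps_to_reach(0.3, 0.9995, 0.05)
print(n, epsilon_after(0.3, 0.9995, n))
```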

### Test

In [None]:
score

2880.5767908354665

In [None]:
state = env.reset()
state['dow']

In [None]:
v = list_state[30]['v_locs'][1]
v1 = list_state[30]['request_locs'][0][0]
time = list_state[30]['time']
duree_deplacement(v,v1,time)

array([130.49006675])

In [None]:
list_state[29]['request_times']

In [None]:
list_state[29]['v_jobs']

In [None]:
list_state[6]['v_jobs']

In [None]:
v = new_state['v_locs'][0]
time = new_state['time']
v1 = np.array(env.lots)[i]

In [None]:
v1

In [None]:
heuristic(new_state)

In [None]:
action = env.get_random_action()
action 

In [None]:
OrderedDict([('reposition', array([302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,
       302])), ('req_assgts', array([20])), ('req_rejections', array([0]))])