# 43008: Reinforcement Learning

## Week 3 Part A: Markov Decision Processes (MDPs)
* MDPs

### What you will learn
1. Create and set up MDPs from different case studies.



**Scenario: Efficient Package Delivery with Drones in a City**

**Background:**  
In today's fast-paced urban environments, drones have emerged as an innovative solution for delivering packages. They offer the advantage of swift aerial transport, bypassing the busy city streets. The city is divided into distinct zones: Warehouse (W), Location A, Location B, and Location C. Each of these zones serves as a potential delivery or pickup point. The drone's mission is simple: pick up packages from the Warehouse and ensure timely deliveries to the designated locations.

The drone starts its journey from the Warehouse with a package ready to be delivered to any of the three locations. In practice, the most efficient route depends on factors such as air traffic, weather conditions, and distance to the destination. In this scenario, however, we focus only on choosing the best delivery route and ignore battery constraints.

* **States**: The drone's current location (W, A, B, C).
* **Actions**: Fly to W, A, B, or C (`Fly_W`, `Fly_A`, `Fly_B`, `Fly_C` in the code).
* **Transition Probabilities**: In reality the drone's path is not deterministic because of changing city dynamics, but in this model we assume it always reaches its intended destination, so each action leads to its target location with probability 1.
* **Rewards**: The drone receives a reward for each flight, and the reward depends on the origin and the destination. In the code below, a successful delivery flight earns +10, while other flights between locations incur penalties between -1 and -6 that stand in for factors such as distance, importance of the delivery, and delays.
* **Objective**: The drone's primary goal is to ensure that packages are delivered on time to the correct locations. It aims to maximize its total reward by choosing the most efficient delivery routes.
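
In standard MDP notation, the components listed above can be written compactly. This is a sketch only: the symbols $\mathcal{S}$, $\mathcal{A}$, $P$, $R$, $\gamma$, $\pi$, $G$ and the horizon $T$ are introduced here for illustration and do not appear in the original notes. It assumes deterministic transitions, as stated above, and a finite number of steps.

```latex
\begin{aligned}
\mathcal{S} &= \{W, A, B, C\}, \qquad
\mathcal{A} = \{\text{Fly\_W}, \text{Fly\_A}, \text{Fly\_B}, \text{Fly\_C}\} \\
P(s' \mid s, \text{Fly\_}x) &=
  \begin{cases} 1 & \text{if } s' = x \\ 0 & \text{otherwise} \end{cases} \\
G &= \sum_{t=0}^{T-1} \gamma^{t} \, R(s_t, a_t), \qquad a_t = \pi(s_t)
\end{aligned}
```

Here $R(s, a)$ is the reward for taking action $a$ in state $s$, $\gamma \in [0, 1]$ is the discount factor, and $\pi$ is a policy that maps each state to an action. The drone's objective is to choose $\pi$ so that the return $G$ is as large as possible.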

**Scenario Flow:**

1. The drone starts its journey from the Warehouse (W) with a package.
2. Based on its current location and the package's destination, it decides the next best location to fly to.
3. Upon reaching the next location, the drone completes the delivery and receives feedback in the form of rewards.
4. The drone continues to make decisions on the next best location to fly to, delivering packages and collecting rewards.
5. This process is repeated for each delivery until the drone completes all its deliveries or until a specified number of deliveries is reached.
6. The overarching goal is to make decisions that maximize the total rewards, ensuring that packages are delivered efficiently and on time.

Throughout this scenario, the optimal delivery strategy will guide the drone in making the best decisions to ensure all packages are delivered promptly, maximizing the total reward.


In [None]:
import numpy as np

class DroneDeliveryMDP:
    def __init__(self, states, actions, transition_matrix, reward_matrix, discount_factor=1.0):
        self.states = states
        self.actions = actions
        self.transition_matrix = transition_matrix
        self.reward_matrix = reward_matrix
        self.discount_factor = discount_factor

    def simulate(self, start_state, policy, max_steps):
        current_state = start_state
        total_reward = 0

        print(f"Starting at location {current_state}")

        for step in range(max_steps):
            action = policy[current_state]

            # Transitions are deterministic: "Fly_X" always lands on X,
            # so the next state is read from the action name rather than sampled.
            next_state = action.split("_")[1]

            reward = self.reward_matrix[self.states.index(current_state)][self.actions.index(action)]

            print(f"Step {step + 1}: Action: {action}, Next Location: {next_state}, Reward: {reward}")

            total_reward += reward
            current_state = next_state

        print(f"Total Reward: {total_reward}")

# Define states
states = ['W', 'A', 'B', 'C']

# Define the actions
actions = ["Fly_W", "Fly_A", "Fly_B", "Fly_C"]

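# Transition probabilities indexed as transition_matrix[state][action][next_state].
# Flying is deterministic here: action Fly_X leads to X with probability 1 from every state.
# Note: simulate() derives the next state from the action name and does not read this structure.
transition_matrix = [
    [[1 if next_state == destination else 0 for next_state in states] for destination in states]
    for _ in states
]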

# Reward matrix where rows represent current state and columns represent actions.
reward_matrix = [
    [0, 10, -4, -6],  # from 'W'
    [-2, 0, 10, -3],  # from 'A'
    [-3, -1, 0, 10],  # from 'B'
    [10, -3, -2, 0]   # from 'C'
]

# Create an MDP instance
mdp = DroneDeliveryMDP(states, actions, transition_matrix, reward_matrix)

# Define a simple policy: cycle through the locations W -> A -> B -> C -> W
policy = {"W": "Fly_A", "A": "Fly_B", "B": "Fly_C", "C": "Fly_W"}

# Simulate the MDP
max_steps = 10
mdp.simulate(start_state="W", policy=policy, max_steps=max_steps)


Starting at location W
Step 1: Action: Fly_A, Next Location: A, Reward: 10
Step 2: Action: Fly_B, Next Location: B, Reward: 10
Step 3: Action: Fly_C, Next Location: C, Reward: 10
Step 4: Action: Fly_W, Next Location: W, Reward: 10
Step 5: Action: Fly_A, Next Location: A, Reward: 10
Step 6: Action: Fly_B, Next Location: B, Reward: 10
Step 7: Action: Fly_C, Next Location: C, Reward: 10
Step 8: Action: Fly_W, Next Location: W, Reward: 10
Step 9: Action: Fly_A, Next Location: A, Reward: 10
Step 10: Action: Fly_B, Next Location: B, Reward: 10
Total Reward: 100
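
The `simulate` method above prints each step but does not return the total, and the class stores `discount_factor` without applying it. The sketch below is one possible way to make "choose the best policy" concrete under the same deterministic assumptions: it scores a policy without printing and then compares every deterministic policy by brute force. The helper name `rollout_return` and the list `all_policies` are hypothetical and not part of the original notebook; the reward values are copied from the cell above.

```python
from itertools import product

states = ["W", "A", "B", "C"]
actions = ["Fly_W", "Fly_A", "Fly_B", "Fly_C"]

# Same rewards as the cell above: rows are current states, columns are actions.
reward_matrix = [
    [0, 10, -4, -6],  # from 'W'
    [-2, 0, 10, -3],  # from 'A'
    [-3, -1, 0, 10],  # from 'B'
    [10, -3, -2, 0],  # from 'C'
]


def rollout_return(policy, start_state="W", max_steps=10, discount_factor=1.0):
    """Sum of (optionally discounted) rewards from following `policy`."""
    state, total = start_state, 0.0
    for step in range(max_steps):
        action = policy[state]
        reward = reward_matrix[states.index(state)][actions.index(action)]
        total += (discount_factor ** step) * reward
        state = action.split("_")[1]  # deterministic: Fly_X always lands on X
    return total


# Enumerate all 4**4 = 256 deterministic policies (one action per state).
all_policies = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]

best_policy = max(all_policies, key=rollout_return)
best_return = rollout_return(best_policy)
print(best_policy, best_return)
# {'W': 'Fly_A', 'A': 'Fly_B', 'B': 'Fly_C', 'C': 'Fly_W'} 100.0
```

Brute force is only practical because this MDP is tiny. With an assumed `discount_factor=0.9`, the same cyclic policy scores roughly 65.13 over 10 steps, since later rewards count for less.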


In [None]:
import numpy as np

class DroneDeliveryMDP:
    def __init__(self, states, actions, transition_probs, rewards, discount_factor=1.0):
        self.states = states
        self.actions = actions
        self.transition_probs = transition_probs
        self.rewards = rewards
        self.discount_factor = discount_factor

    def simulate(self, start_state, policy, max_steps):
        current_state = start_state
        total_reward = 0

        print(f"Starting at location {current_state}")

        for step in range(max_steps):
            action = policy[current_state]
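            # Sample the next state from P(s' | s, a); with one-hot probabilities the outcome is deterministic
            next_state_probs = [self.transition_probs[current_state][action][s_prime] for s_prime in self.states]
            next_state = str(np.random.choice(self.states, p=next_state_probs))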
            reward = self.rewards[current_state][action]

            print(f"Step {step + 1}: Action: {action}, Next Location: {next_state}, Reward: {reward}")

            total_reward += reward
            current_state = next_state

        print(f"Total Reward: {total_reward}")

# Define states
states = ['W', 'A', 'B', 'C']

# Define the actions available in each state: from any location, the drone can fly to each of the other three.
actions = {s: [f"Fly_{a}" for a in states if a != s] for s in states}

# Define the transition probabilities. Since it's deterministic, the transition is only possible to the destination state with probability 1.
transition_probs = {
    s: {f"Fly_{a}": {s_prime: 1 if a == s_prime else 0 for s_prime in states} for a in states if a != s}
    for s in states
}

# Define the rewards for each transition. If there's no defined reward for a transition, it's assumed to be 0.
rewards = {
    s: {f"Fly_{a}": 0 for a in states if a != s} for s in states
}

delivery_rewards = 10

rewards['W'].update({"Fly_A": delivery_rewards, "Fly_B": -4, "Fly_C": -6})
rewards['A'].update({"Fly_W": -2, "Fly_B": delivery_rewards, "Fly_C": -3})
rewards['B'].update({"Fly_W": -3, "Fly_A": -1, "Fly_C": delivery_rewards})
rewards['C'].update({"Fly_W": delivery_rewards, "Fly_A": -3, "Fly_B": -2})

# Create an MDP instance
mdp = DroneDeliveryMDP(states, actions, transition_probs, rewards)

# Define a simple policy: cycle through the locations W -> A -> B -> C -> W
policy = {"W": "Fly_A", "A": "Fly_B", "B": "Fly_C", "C": "Fly_W"}

# Simulate the MDP
max_steps = 10
mdp.simulate(start_state="W", policy=policy, max_steps=max_steps)


Starting at location W
Step 1: Action: Fly_A, Next Location: A, Reward: 10
Step 2: Action: Fly_B, Next Location: B, Reward: 10
Step 3: Action: Fly_C, Next Location: C, Reward: 10
Step 4: Action: Fly_W, Next Location: W, Reward: 10
Step 5: Action: Fly_A, Next Location: A, Reward: 10
Step 6: Action: Fly_B, Next Location: B, Reward: 10
Step 7: Action: Fly_C, Next Location: C, Reward: 10
Step 8: Action: Fly_W, Next Location: W, Reward: 10
Step 9: Action: Fly_A, Next Location: A, Reward: 10
Step 10: Action: Fly_B, Next Location: B, Reward: 10
Total Reward: 100
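
Because the transition probabilities above are all 0 or 1, the sampling step always returns the intended destination. The sketch below shows how the same dictionary layout could hold genuinely stochastic transitions. It rests on assumptions that are not in the original notebook: a hypothetical `slip_prob` parameter, a hypothetical helper `build_noisy_transitions`, and a diversion model in which the leftover probability is split evenly across the three unintended locations. It uses the standard library's `random.choices` in place of `np.random.choice` purely to stay dependency-free.

```python
import random

states = ["W", "A", "B", "C"]
slip_prob = 0.1  # hypothetical: chance the drone is diverted to an unintended zone


def build_noisy_transitions(states, slip_prob):
    """Build transition_probs[s][action][s_next], same layout as the cell above."""
    others = len(states) - 1
    return {
        s: {
            f"Fly_{dest}": {
                s_next: (1.0 - slip_prob) if s_next == dest else slip_prob / others
                for s_next in states
            }
            for dest in states if dest != s
        }
        for s in states
    }


transition_probs = build_noisy_transitions(states, slip_prob)

# Every (state, action) pair must define a valid probability distribution.
for s, by_action in transition_probs.items():
    for action, dist in by_action.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12, (s, action)

rng = random.Random(0)  # seeded so repeated runs follow the same path
policy = {"W": "Fly_A", "A": "Fly_B", "B": "Fly_C", "C": "Fly_W"}

state, path = "W", ["W"]
for _ in range(10):
    probs = [transition_probs[state][policy[state]][s_next] for s_next in states]
    state = rng.choices(states, weights=probs)[0]
    path.append(state)
print(path)
```

With `slip_prob = 0` this reduces to the deterministic model used above. The sum-to-one check is worth keeping whenever the probabilities are edited by hand, since a sampler will either reject or silently renormalise a row that does not sum to 1.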
