# 43008: Reinforcement Learning

## Week 3 Part A: Partially Observable MDPs: Scenario-1
* POMDPs

### What you will learn
1. Create and set up POMDPs from given case studies
2. Simulate a POMDP and estimate the rewards obtained under a given policy


## **Scenario: Robot Navigation in a Noisy Environment**

**Background:**  
Imagine a small robot tasked with navigating through a grid-like environment to reach a target location while avoiding obstacles. The robot's sensors provide noisy readings that can be either "Clear" or "Obstacle," but due to sensor inaccuracies, the readings are not always reliable.

The environment is represented as a grid with four locations: A, B, C, and D. The robot starts at location A and needs to reach location D to successfully complete its task. However, there is a possibility of encountering obstacles that might force the robot to change its path.

* **States**: A, B, C, D (Locations in the grid)
* **Actions**: Up, Down, Left, Right (Robot movement directions)
* **Observations**: Clear (No obstacle detected), Obstacle (Obstacle detected)
* **Transition Probabilities**: The robot's movements are not perfectly deterministic, and there's a chance of unintended movement.
* **Observation Probabilities**: The robot's sensors have a chance of providing incorrect readings due to noise.
* **Rewards**: The robot receives rewards for reaching the goal and avoiding obstacles, and penalties for collisions or wrong movements.
* **Discount Factor**: Determines the trade-off between immediate and future rewards.
* **Objective**: The robot's goal is to navigate from location A to location D while avoiding obstacles and taking efficient actions to maximize its cumulative reward over time.
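To make the role of the discount factor concrete, here is a minimal sketch of the discounted return $G = \sum_t \gamma^t r_t$. The function name `discounted_return` is hypothetical and not part of this notebook. The reward sequence 5, 8, 10 is what the reward table defined later in this notebook assigns along the path A → B → C → D, and 0.9 is the discount factor used there.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Rewards collected along A -> B -> C -> D (Right, Down, Left)
path_rewards = [5, 8, 10]

undiscounted = sum(path_rewards)                    # 5 + 8 + 10
discounted = discounted_return(path_rewards, 0.9)   # 5 + 0.9*8 + 0.81*10

print(f"Undiscounted: {undiscounted}, discounted: {discounted:.2f}")
```

With a discount factor below 1, rewards that arrive later count for less, so a policy that reaches the goal sooner is preferred over one that collects the same rewards more slowly.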


**Scenario Flow:**

1. The robot starts at location A.
2. It selects an action (movement direction) based on its current belief about its state, which it forms from the sensor observations.
3. The robot receives immediate rewards based on the chosen action and its resulting state.
4. The robot's sensors provide an observation (Clear or Obstacle) about the current state, which might be noisy.
5. The robot updates its belief about the current state based on the observation and transition probabilities.
6. Steps 2 to 5 are repeated for each time step until the robot reaches location D or a specified number of time steps is reached.
7. The robot aims to choose actions that lead to a high cumulative reward, considering the uncertainties introduced by noisy sensors and probabilistic transitions.
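The belief update in step 5 can be sketched as a standard discrete Bayes filter: predict the next-state distribution with the transition matrix, weight it by the likelihood of the observation, and normalize. This is an illustrative sketch only; the function `update_belief` and the array names `T_right` and `Z` are assumptions introduced here, and the numbers are the "Right" transition matrix and the observation probabilities used later in this notebook.

```python
import numpy as np

states = ["A", "B", "C", "D"]
observations = ["Clear", "Obstacle"]

# Transition matrix for the action "Right"
# (row = current state, column = next state).
T_right = np.array([[0.0, 1.0, 0.0, 0.0],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1],
                    [0.1, 0.3, 0.3, 0.3]])

# P(observation | state): one row per state, columns ordered as `observations`.
Z = np.array([[0.8, 0.2],
              [0.6, 0.4],
              [0.9, 0.1],
              [0.7, 0.3]])

def update_belief(belief, T_action, Z, obs_index):
    """Return the posterior belief after taking an action and seeing an observation."""
    predicted = belief @ T_action               # prediction: sum_s b(s) * T(s' | s, a)
    unnormalized = Z[:, obs_index] * predicted  # correction: weight by P(o | s')
    norm = unnormalized.sum()
    if norm == 0:
        raise ValueError("Observation has zero probability under this belief")
    return unnormalized / norm

# The robot knows it starts at A, moves Right, and then reads "Clear".
belief = np.array([1.0, 0.0, 0.0, 0.0])
belief = update_belief(belief, T_right, Z, observations.index("Clear"))
print(dict(zip(states, belief.round(3))))
```

Because "Right" from A leads to B with probability 1 in this model, the belief after the update puts all its mass on B regardless of the reading. Starting from a less certain belief, the observation shifts probability toward the states where that reading is more likely.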

Throughout the scenario, the robot's optimal policy will help it make decisions that balance exploration, observation, and action to successfully navigate through the environment and reach the target location with maximum cumulative reward.



**Scenario visualization**
```
A --- Right ---> B
|                |
|                |
Down             Down
|                |
|                |
V                V
D <--- Left ---  C
```



#### 1. POMDP definition

<img src='https://drive.google.com/uc?id=1e3agtMbgHflQe8YxGJATx9doaCU990To' height=350>

In [5]:
# Create a generic class for POMDP
import numpy as np

class PartialObservationPOMDP:
    def __init__(self, states, actions, observations, transition_probs, observation_probs, rewards, discount_factor):
        self.states = states                       # S, set of states
        self.actions = actions                     # A, set of actions
        self.observations = observations           # O, set of observations
        self.transition_probs = transition_probs   # P, state transition probabilities
        self.observation_probs = observation_probs # Z, observation probabilities
        self.rewards = rewards                     # R, rewards
        self.discount_factor = discount_factor     # gamma, discount factor

    # Simulate the POMDP: follow the policy's chosen action at each step, sample state transitions, and accumulate the rewards
    def simulate(self, start_state, policy, max_steps):
        current_state = start_state
        total_reward = 0

        print(f"Starting at location {current_state}")

        # Iterate for at most max_steps steps
        for step in range(max_steps):

            # Select action based on the given policy (note: the policy is indexed by the true state, not by the observation)
            action = policy[current_state]

            # Find the next state based on the action taken
            next_state = np.random.choice(self.states, p=self.transition_probs[action][self.states.index(current_state)])

            # Select an observation based on the next state
            observation = np.random.choice(self.observations, p=self.observation_probs[next_state])

            # Find the reward based on the current_state, action selected, and next_state
            reward = self.rewards[self.states.index(current_state)][self.actions.index(action)][self.states.index(next_state)]

            print(f"Step {step + 1}: Action: {action}, Next Location: {next_state}, Observation: {observation}, Reward: {reward}")

            # Update the total reward and the current state for the next iteration.
            total_reward += reward
            current_state = next_state

            # Stop when the terminal state is reached.
            if current_state == "D":
                print("Reached destination D!")
                break

        # Display the total (undiscounted) reward accumulated
        print(f"Total Reward: {total_reward}")
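The `simulate` method above prints each step but returns nothing, and it does not apply the discount factor. To estimate the reward of a policy numerically, one option is a rollout function that returns the (optionally discounted) total, averaged over many episodes. The sketch below is one possible way to do this; `rollout_return` and its parameters are hypothetical names, not part of the class above. The demo uses only the "Down" action, with the transition matrix and the single non-zero reward (+10 for A → D) taken from the scenario definition in this notebook.

```python
import numpy as np

def rollout_return(states, actions, transition_probs, rewards, policy,
                   start_state, max_steps, gamma=1.0, terminal="D", rng=None):
    """Run one episode and return the (discounted) sum of rewards."""
    if rng is None:
        rng = np.random.default_rng()
    s = states.index(start_state)
    total = 0.0
    for t in range(max_steps):
        action = policy[states[s]]
        # Sample the next state index from the row for the current state
        s_next = rng.choice(len(states), p=transition_probs[action][s])
        total += (gamma ** t) * rewards[s][actions.index(action)][s_next]
        s = s_next
        if states[s] == terminal:
            break
    return total

states = ["A", "B", "C", "D"]
actions = ["Up", "Down", "Left", "Right"]
transition_probs = {
    "Down": np.array([[0.0, 0.0, 0.0, 1.0],
                      [0.0, 0.0, 1.0, 0.0],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.1, 0.1, 0.1, 0.7]]),
}
rewards = np.zeros((4, 4, 4))   # indexed [state][action][next_state]
rewards[0, 1, 3] = 10           # A --Down--> D

policy = {s: "Down" for s in states}
rng = np.random.default_rng(0)  # fixed seed for reproducibility
returns = [rollout_return(states, actions, transition_probs, rewards, policy,
                          start_state="A", max_steps=100, rng=rng)
           for _ in range(100)]
mean_return = float(np.mean(returns))
print(f"Estimated return for the always-Down policy: {mean_return}")
```

For this policy every episode is identical, because "Down" from A reaches D with probability 1. For policies with stochastic transitions, averaging over more episodes reduces the variance of the estimate.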


#### 2. Define the states, actions, observations, transition probabilities, and rewards based on the given scenario

In [6]:
# Define states, actions, observations, transition probabilities, observation probabilities, rewards, and discount factor
states = ["A", "B", "C", "D"]
actions = ["Up", "Down", "Left", "Right"]
observations = ["Clear", "Obstacle"]

transition_probs = {
    "Up": np.array([[0.1, 0.5, 0.3, 0.1],     # Given Action='Up', Transition Probability from  State='A' to State ={"A", "B", "C", "D"}
                    [0, 0, 1, 0],             # Given Action='Up', Transition Probability from  State='B' to State ={"A", "B", "C", "D"}
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.5, 0.1, 0.3]]),
    "Down": np.array([[0, 0, 0, 1],   # Given Action='Down', Transition Probability from  State='A' to State ={"A", "B", "C", "D"}
                      [0, 0, 1, 0],
                      [0.1, 0.1, 0.7, 0.1],   # Given Action='Down', Transition Probability from  State='C' to State ={"A", "B", "C", "D"}
                      [0.1, 0.1, 0.1, 0.7]]),
    "Left": np.array([[0.7, 0.1, 0.1, 0.1],   # Given Action='Left', Transition Probability from  State='A' to State ={"A", "B", "C", "D"}
                      [0.7, 0.1, 0.1, 0.1],   # Given Action='Left', Transition Probability from  State='B' to State ={"A", "B", "C", "D"}
                      [0, 0, 0, 1],
                      [0.1, 0.1, 0.1, 0.7]]),
    "Right": np.array([[0, 1, 0, 0],  # Given Action='Right', Transition Probability from  State='A' to State ={"A", "B", "C", "D"}
                       [0.1, 0.7, 0.1, 0.1],
                       [0.1, 0.1, 0.7, 0.1],
                       [0.1, 0.3, 0.3, 0.3]]) # Given Action='Right', Transition Probability from  State='D' to State ={"A", "B", "C", "D"}
}
observation_probs = {
    "A": [0.8, 0.2], # Given State='A', Probability of observation=["Clear", "Obstacle"] | High probability no obstacle: 0.8
    "B": [0.6, 0.4], # Given State='B', Probability of observation=["Clear", "Obstacle"] | Probability obstacle is 0.4
    "C": [0.9, 0.1], # Given State='C', Probability of observation=["Clear", "Obstacle"] | High probability no obstacle: 0.9
    "D": [0.7, 0.3]  # Given State='D', Probability of observation=["Clear", "Obstacle"] | High probability no obstacle: 0.7
}
rewards = np.array([
    # Current_State: A, [Action X State]
    [[0, 0, 0, 0], # Rewards for Action = 'Up', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 10],   # Rewards for Action = 'Down', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0],   # Rewards for Action = 'Left', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 5, 0, 0]], # Rewards for Action = 'Right', and Next_State ={'A', 'B', 'C', 'D'}

    # Current_State: B, [Action X State]
    [[0, 0, 0, 0],   # Rewards for Action = 'Up', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 8, 0],   # Rewards for Action = 'Down', and Next_State ={'A', 'B', 'C', 'D'}
     [-5, 0, 0, 0],  # Rewards for Action = 'Left', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0]],  # Rewards for Action = 'Right', and Next_State ={'A', 'B', 'C', 'D'}

    # Current_State: C, [Action X State]
    [[0, -5, 0, 0], # Rewards for Action = 'Up', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0],   # Rewards for Action = 'Down', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 10],   # Rewards for Action = 'Left', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0]], # Rewards for Action = 'Right', and Next_State ={'A', 'B', 'C', 'D'}

    # Current_State: D, [Action X State]
    [[-10, 0, 0, 0], # Rewards for Action = 'Up', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0],   # Rewards for Action = 'Down', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, 0, 0],   # Rewards for Action = 'Left', and Next_State ={'A', 'B', 'C', 'D'}
     [0, 0, -10, 10]]  # Rewards for Action = 'Right', and Next_State ={'A', 'B', 'C', 'D'}
])

discount_factor = 0.9
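Hand-typed probability tables are easy to get wrong, so it can help to check that every transition row and every observation distribution sums to 1 and that the reward array has the shape `[state][action][next_state]`. The helper below is a hedged sketch; `check_pomdp_model` is a hypothetical name, and the two-state model in the demo is a toy example invented for illustration, not part of the scenario.

```python
import numpy as np

def check_pomdp_model(states, actions, observations,
                      transition_probs, observation_probs, rewards):
    """Return a list of human-readable problems; an empty list means none were found."""
    n_s, n_a, n_o = len(states), len(actions), len(observations)
    problems = []
    for a in actions:
        if a not in transition_probs:
            problems.append(f"No transition matrix for action {a!r}")
            continue
        T = np.asarray(transition_probs[a], dtype=float)
        if T.shape != (n_s, n_s):
            problems.append(f"Transition matrix for {a!r} has shape {T.shape}")
        elif not np.allclose(T.sum(axis=1), 1.0):
            problems.append(f"Some transition row for {a!r} does not sum to 1")
    for s in states:
        z = np.asarray(observation_probs.get(s, []), dtype=float)
        if z.shape != (n_o,) or not np.isclose(z.sum(), 1.0):
            problems.append(f"Observation distribution for state {s!r} is invalid")
    if np.asarray(rewards).shape != (n_s, n_a, n_s):
        problems.append(f"Rewards have shape {np.asarray(rewards).shape}")
    return problems

# Toy two-state model (hypothetical), first correct and then with a bad row.
toy_states, toy_actions, toy_obs = ["s0", "s1"], ["stay"], ["Clear", "Obstacle"]
toy_Z = {"s0": [0.8, 0.2], "s1": [0.6, 0.4]}
toy_R = np.zeros((2, 1, 2))

good_T = {"stay": [[0.9, 0.1], [0.2, 0.8]]}
bad_T = {"stay": [[0.9, 0.2], [0.2, 0.8]]}   # first row sums to 1.1

good_problems = check_pomdp_model(toy_states, toy_actions, toy_obs, good_T, toy_Z, toy_R)
bad_problems = check_pomdp_model(toy_states, toy_actions, toy_obs, bad_T, toy_Z, toy_R)
print(good_problems)
print(bad_problems)
```

The same helper could be called on the scenario model as `check_pomdp_model(states, actions, observations, transition_probs, observation_probs, rewards)` before running any simulation.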



#### 3. Test the POMDP by creating an object, defining policies, and running simulations

In [7]:
# Create a POMDP instance
roboNavPOMDP = PartialObservationPOMDP(states, actions, observations, transition_probs, observation_probs, rewards, discount_factor)

# Define a simple deterministic policy (always go "Right")
roboPolicy1 = {"A": "Right", "B": "Right", "C": "Right", "D": "Right"}

# Define a deterministic policy that follows the path A -> B -> C -> D
roboPolicy2 = {"A": "Right", "B": "Down", "C": "Left", "D": "Up"}  # "Up" must match the key in transition_probs

# Define a simple deterministic policy (e.g., always go "Down")
roboPolicy3 = {"A": "Down", "B": "Down", "C": "Down", "D": "Down"}

# Simulate the POMDP with partial observation
max_steps = 100  # You can adjust this number
print("Simulation -1")
roboNavPOMDP.simulate(start_state="A", policy=roboPolicy1, max_steps=max_steps)

print("\n Simulation -2")
roboNavPOMDP.simulate(start_state="A", policy=roboPolicy2, max_steps=max_steps)

print("\n Simulation -3")
roboNavPOMDP.simulate(start_state="A", policy=roboPolicy3, max_steps=max_steps)


Simulation -1
Starting at location A
Step 1: Action: Right, Next Location: B, Observation: Clear, Reward: 5
Step 2: Action: Right, Next Location: B, Observation: Clear, Reward: 0
Step 3: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 4: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 5: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 6: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 7: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 8: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 9: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 10: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 11: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 12: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 13: Action: Right, Next Location: C, Observation: Clear, Reward: 0
Step 14: Action: Right, Next Locatio