# 43008: Reinforcement Learning

## Week 2 Part C: Modelling Different Scenarios Using the Multi-Arm Bandit Formulation (Solution)
* Epsilon Greedy
* Upper Confidence Bound (UCB)

### What will you learn?
1. Create and set up Multi-Arm Bandit environments for different scenarios.
2. Try solving them with Epsilon Greedy and Upper Confidence Bound (UCB).

# **Case Study 1: RoboChef's Galactic Diner**

**Background:**  
In the year 3045, RoboChef, a state-of-the-art robotic chef, operates the famed "Galactic Diner" on a space station orbiting a neutron star. This diner attracts beings from various galaxies, each with distinct culinary preferences shaped by their evolutionary biology and native planet's ecosystem.

**Challenge:**  
RoboChef can prepare 5 different dishes, each with ingredients sourced from different parts of the universe. Given the diverse clientele, RoboChef needs to deduce which dish to serve to maximize customer delight. Happy customers leave a glowing orb as a token of appreciation, while others just depart without any feedback.

**Constraints:**

1. **Unknown Preferences:** The diner's patrons come from countless galaxies, making it impossible for RoboChef to have prior knowledge about each species' preferred dish.
2. **Ingredient Scarcity:** Some ingredients are rare, and RoboChef can't just keep preparing the same dish continuously.
3. **Rapid Service:** Being a robot, RoboChef serves dishes quickly. However, it needs to determine rapidly which dish to prepare as customers don't like waiting.

**Objectives:**

1. **Maximize Delight:** The ultimate goal is to maximize the number of patrons who leave a glowing orb.
2. **Balance Exploration and Exploitation:** RoboChef must explore the preferences of its diverse customers while also serving the dishes that are more likely to be appreciated.

**Operational Details:**

- RoboChef has a repertoire of 5 unique dishes.
- Each dish has an inherent delight factor (the probability that a patron who is served it leaves a glowing orb), unknown to RoboChef.
- Upon a patron's arrival, RoboChef quickly prepares and serves one of its dishes.
- Delighted customers leave behind a glowing orb. Otherwise, they just leave.

**Problem Formulation:**

This scenario can be likened to a gambler deciding which slot machine (or "bandit") to play:
- Each dish is akin to a slot machine.
- Preparing and serving a dish is similar to playing a slot machine.
- The glowing orb from a pleased customer represents the reward from the slot machine.

Hence, RoboChef is faced with the "Multi-Arm Bandit" problem. It must quickly decide which dish to prepare to maximize the cumulative delight of its patrons.
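
One way to write this down formally is sketched here. It assumes, as a simplification, that dish $a$ delights a patron independently with a fixed probability $p_a$. The symbols $A_t$ (dish served to patron $t$), $R_t$ (1 for a glowing orb, 0 otherwise), and $T$ (number of patrons) are introduced only for illustration:

$$
R_t \sim \mathrm{Bernoulli}(p_{A_t}), \qquad \text{choose } A_1, \dots, A_T \text{ to maximize } \mathbb{E}\left[\sum_{t=1}^{T} R_t\right].
$$

Since every $p_a$ is unknown, RoboChef can only estimate it from the orbs it has observed so far, which is where the exploration and exploitation trade-off comes from.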

```python
import numpy as np

class RoboChefBandit:
    def __init__(self, n_dishes=5):
        # Initialize the true delight factors for each dish
        self.true_delight_factors = np.random.rand(n_dishes)

    def prepare(self, dish):
        # Simulate the delight for the selected dish
        # Return a reward (1 for a glowing orb, 0 for no orb) based on the true delight factor of the dish
        return 1 if np.random.random() < self.true_delight_factors[dish] else 0
```
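
A minimal Epsilon Greedy sketch for this kind of environment is shown below. It is one possible approach, not a definitive solution. The function `epsilon_greedy`, its parameters, and the stand-in probabilities in `true_p` are assumptions made for illustration. The environment is passed in as a `pull(arm)` callable so that the block runs on its own.

```python
import numpy as np


def epsilon_greedy(pull, n_arms, n_steps=1000, epsilon=0.1, seed=None):
    """Run epsilon-greedy against a bandit exposed as pull(arm) -> reward.

    Returns the estimated value of each arm, the pull counts, and the total reward.
    """
    rng = np.random.default_rng(seed)
    estimates = np.zeros(n_arms)            # running mean reward per arm
    counts = np.zeros(n_arms, dtype=int)    # number of pulls per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))     # explore: uniformly random arm
        else:
            arm = int(np.argmax(estimates))     # exploit: best estimate so far
        reward = pull(arm)
        counts[arm] += 1
        # Incremental mean update: Q <- Q + (r - Q) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


# Stand-in environment with hypothetical delight probabilities (not from the case study)
true_p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
env_rng = np.random.default_rng(0)


def pull(dish):
    return int(env_rng.random() < true_p[dish])


estimates, counts, total_reward = epsilon_greedy(pull, n_arms=5, n_steps=5000, seed=1)
print("estimated delight factors:", np.round(estimates, 2))
print("times each dish was served:", counts)
```

To run it against the class above, pass `RoboChefBandit().prepare` as `pull`. A fixed `epsilon` keeps exploring forever; decaying it over time is a common variation.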


# **Case Study 2: AgriBot's Crop Yield Optimization**

**Background:**  
AgriBot is a state-of-the-art agricultural robot designed to assist farmers in optimizing crop yields. One of the most challenging aspects of farming is determining the right amount of water (irrigation) and the correct level of fertilizer for different crops.

**Challenge:**  
AgriBot is tasked with determining the optimal combination of irrigation and fertilizer for individual crop patches to maximize yield. If the right combination is achieved, the crops thrive and yield increases. If the combination is suboptimal, the yield can be lower, and in extreme cases, crops can be damaged.

**Constraints:**

1. **Soil Variability:** Different patches of land have unique soil characteristics affecting water retention and nutrient levels.
2. **Resource Limitations:** Over-irrigation can lead to water wastage, and excess fertilizer can be harmful to the environment.
3. **Limited Season:** Crops have a particular growing season, giving AgriBot a limited timeframe to experiment and optimize.

**Objectives:**

1. **Maximize Yield:** The primary goal is to ensure optimal growth conditions for the crops.
2. **Rapid Adaptation:** AgriBot should quickly adapt its strategy based on observed crop responses.
3. **Balance Exploration and Exploitation:** AgriBot needs to test various combinations but also ensure resource efficiency and crop health.

**Operational Details:**

- AgriBot can choose from 5 irrigation levels (from low to high) and 4 fertilizer levels (from low to high).
- Each combination has a different effect on crop yield, which is unknown to AgriBot.
- After applying a combination, AgriBot monitors the crops for growth rate and health.
- Thriving crops indicate a successful combination, while stunted growth or signs of stress indicate suboptimal conditions.

**Problem Formulation:**

This scenario can be equated to a gambler deciding which slot machine (or "bandit") to play, but with a two-dimensional twist:
- Each combination of irrigation and fertilizer level represents a unique slot machine.
- Applying a combination is akin to playing a slot machine.
- The feedback from the crops' growth and health symbolizes the reward from the slot machine.

AgriBot faces a two-dimensional "Multi-Arm Bandit" problem, where it must decide on a combination that maximizes crop yield while ensuring sustainable farming practices.



```python
import numpy as np

class AgriBotBandit:
    def __init__(self, n_irrigation_levels=5, n_fertilizer_levels=4):
        # Initialize the true yield effects for each combination of irrigation and fertilizer
        self.true_yield_effects = np.random.rand(n_irrigation_levels, n_fertilizer_levels)

    def apply_combination(self, irrigation_level, fertilizer_level):
        # Simulate the yield effect of the selected combination
        return 1 if np.random.random() < self.true_yield_effects[irrigation_level, fertilizer_level] else 0
```
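
Because every (irrigation, fertilizer) pair is its own arm, the 5 x 4 grid can be flattened into 20 ordinary arms and handed to a standard bandit algorithm. The sketch below does this with UCB, using the common UCB1 bonus $\sqrt{2 \ln t / n}$. The function `ucb1`, the parameter `c`, and the stand-in array `true_yield` are assumptions for illustration, not part of the case study.

```python
import numpy as np


def ucb1(pull, n_arms, n_steps=2000, c=2.0):
    """UCB: pick the arm maximizing estimate + sqrt(c * ln(t) / n_pulls).

    Assumes n_steps >= n_arms so that every arm is tried at least once.
    """
    estimates = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)

    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1                          # pull every arm once first
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(estimates + bonus))
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


shape = (5, 4)                                   # irrigation levels x fertilizer levels
rng = np.random.default_rng(0)
true_yield = rng.random(shape)                   # hypothetical stand-in for true_yield_effects


def pull(arm):
    # Map the flat arm index back to (irrigation_level, fertilizer_level)
    irrigation, fertilizer = np.unravel_index(arm, shape)
    return int(rng.random() < true_yield[irrigation, fertilizer])


estimates, counts = ucb1(pull, n_arms=shape[0] * shape[1], n_steps=5000)
best = np.unravel_index(int(np.argmax(counts)), shape)
print("most-applied combination (irrigation, fertilizer):", best)
```

With the class above, the same mapping would be `lambda arm: bot.apply_combination(*np.unravel_index(arm, (5, 4)))` for a `bot = AgriBotBandit()`. Flattening treats neighbouring levels as unrelated; an approach that exploits the ordering of the levels could learn faster, but that goes beyond a plain bandit.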


# **Case Study 3: AstroBot's Object Interaction in an Alien Artifact Museum**

**Background:**  
The "Artifact Museum of the Universe" is a vast space museum that houses artifacts from countless civilizations and planets. AstroBot, a robotic explorer, is deployed to study, handle, and catalog these artifacts. However, the materials and structures of these artifacts are diverse and mostly unknown.

**Challenge:**  
AstroBot is equipped with multiple interaction techniques, from gentle touches to strong grips. Given the myriad of artifacts, AstroBot needs to deduce the best interaction technique for each object to ensure safe handling without causing any damage. If AstroBot handles an artifact correctly, the museum's sensors give a green light. Otherwise, a red light is emitted, indicating potential harm to the artifact.

**Constraints:**

1. **Unknown Materials:** The artifacts come from various corners of the universe, making their materials and structural integrity largely unknown.
2. **Limited Interactions:** Some artifacts are rare, and AstroBot can't risk multiple harmful interactions.
3. **Variable Feedback Delay:** Depending on the artifact's material and the interaction technique, feedback (green or red light) might not be instantaneous.

**Objectives:**

1. **Safe Handling:** The primary goal is to ensure artifacts are handled without any damage.
2. **Rapid Learning:** AstroBot should quickly learn the most suitable interaction technique for each type of artifact.
3. **Balance Exploration and Exploitation:** AstroBot needs to test various interaction techniques but also avoid potential harm to the artifacts.

**Operational Details:**

- AstroBot has 7 interaction techniques, ranging from light touches to different grip strengths.
- Each interaction technique has a different success rate for each artifact, but this is unknown to AstroBot.
- On approaching an artifact, AstroBot decides on an interaction technique.
- The museum's sensors provide feedback (green or red light) after the interaction, but the delay can vary.

**Problem Formulation:**

This scenario is analogous to a gambler deciding which slot machine (or "bandit") to play, with some twists:
- Each interaction technique represents a slot machine.
- Choosing an interaction technique is akin to playing a slot machine.
- The feedback (green light) from the sensors symbolizes the reward from the slot machine. The variable delay introduces an additional challenge.

AstroBot faces an enhanced "Multi-Arm Bandit" problem, where it must decide on an interaction technique that ensures the artifact's safety while learning rapidly from feedback.





```python
import numpy as np
import time

class AstroBotBandit:
    def __init__(self, n_techniques=7):
        # Initialize the true success rates for each interaction technique
        self.true_success_rates = np.random.rand(n_techniques)
        # Random delay of 1 to 4 seconds per technique (randint's upper bound of 5 is exclusive)
        self.feedback_delays = np.random.randint(1, 5, size=n_techniques)

    def interact(self, technique):
        # Simulate the success of the selected interaction technique
        # Note: this blocks for the technique's delay before returning the feedback
        time.sleep(self.feedback_delays[technique])  # Simulate variable feedback delay
        return 1 if np.random.random() < self.true_success_rates[technique] else 0
```
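
Sleeping for real seconds makes long experiments slow, and it also hides the actual difficulty: the agent may have to choose its next technique before earlier feedback has arrived. The sketch below is one way to study that effect, under the assumption that delays are measured in decision steps rather than seconds. The function `epsilon_greedy_delayed` and the example `success_rates` and `delays` are hypothetical.

```python
import heapq
import numpy as np


def epsilon_greedy_delayed(success_rates, delays, n_steps=3000, epsilon=0.1, seed=0):
    """Epsilon-greedy where feedback for a pull at step t arrives at step t + delays[arm]."""
    rng = np.random.default_rng(seed)
    n_arms = len(success_rates)
    estimates = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)     # feedback signals received per arm
    pending = []                             # min-heap of (arrival_step, arm, reward)

    for t in range(n_steps):
        # 1. Absorb every feedback signal that has arrived by now
        while pending and pending[0][0] <= t:
            _, arm, reward = heapq.heappop(pending)
            counts[arm] += 1
            estimates[arm] += (reward - estimates[arm]) / counts[arm]

        # 2. Choose a technique using only the feedback seen so far
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(estimates))

        # 3. The outcome is decided now but revealed only after the delay
        reward = int(rng.random() < success_rates[arm])
        heapq.heappush(pending, (t + int(delays[arm]), arm, reward))

    return estimates, counts, len(pending)


success_rates = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.5, 0.7])   # hypothetical values
delays = np.array([1, 4, 2, 3, 1, 2, 4])                         # in steps, not seconds
estimates, counts, still_pending = epsilon_greedy_delayed(success_rates, delays)
print("feedback received per technique:", counts)
print("signals still in flight:", still_pending)
```

The estimates are updated only when a light is actually observed, so decisions made in the meantime rely on slightly stale information. With short delays like these the effect is small; it grows as delays become long relative to the number of interactions.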