# Metadata

**L1 Taxonomy** - Problem Solving & Algorithms

**L2 Taxonomy** - Machine Learning Algorithms

**Subtopic** - Python reinforcement learning for recommendations

**Use Case** - Create a recommendation engine using Python-based reinforcement learning. Develop a system that processes distributed datasets with synchronized graph and matrix operations while applying Q-learning strategies to optimize recommendations. This solution should adapt to users’ historical behaviors and improve over time by leveraging parallel processing techniques discussed in the surveyed frameworks.

**Programming Language** - Python

**Target Model** - o1

# Model Breaking Hints


1) **What is the initial use case?**

   - The initial use case is a recommendation engine built with Python-based reinforcement learning.
   - It processes distributed datasets with synchronized graph and matrix operations.
   - It applies Q-learning strategies to optimize recommendations.
   - It adapts to users' historical behaviors and leverages parallel processing techniques from surveyed frameworks.

2) **Why is the initial use case easy?**

   - It utilizes standard Q-learning, which is a basic reinforcement learning algorithm.
   - Processing distributed datasets with synchronized graph and matrix operations is manageable with existing libraries like NumPy, Pandas, and Dask.
   - There's no requirement to handle streaming data or concept drift.
   - It lacks advanced algorithms like dynamic graph neural networks or temporal embeddings.
   - It does not address multi-objective optimization, privacy concerns, or adversarial robustness.
   - The problem scope is limited and doesn't include complex, non-obvious subproblems or logical traps.

3) **How could we make it harder?**

   - **Incorporate dynamic graph neural networks** (Hint 1) to model time-evolving user-item interactions, introducing advanced algorithms like temporal graph embeddings.
   - **Require handling of streaming data with concept drift** (Hint 2), necessitating online reinforcement learning and continuous adaptation.
   - **Introduce multi-agent coordination protocols** for distributed environments (Hint 2), adding complexity in synchronization and collaboration.
   - **Implement multi-objective optimization** (Hint 3) to balance accuracy, diversity, and fairness using techniques like Pareto efficiency and constrained policy optimization.
   - **Utilize federated reinforcement learning across decentralized data silos** (Hint 4), involving differential privacy guarantees, secure aggregation algorithms, and homomorphic encryption.
   - **Add adversarial scenarios** (Hint 5) requiring robustness against manipulation or spoofing attacks, incorporating adversarial training and robust Markov decision processes.
   - These additions introduce advanced data structures, algorithms, and multiple interacting constraints, significantly increasing complexity.

4) **Which parameters can we change?**

   - **Algorithm Complexity**: Switch from basic Q-learning to federated reinforcement learning with dynamic graph neural networks.
   - **Data Handling**: Move from static datasets to streaming data with concept drift, requiring real-time processing and adaptation.
   - **System Architecture**: Transition to a decentralized system requiring multi-agent coordination and communication protocols.
   - **Optimization Goals**: Introduce multi-objective optimization to balance conflicting goals like accuracy, diversity, and fairness.
   - **Privacy and Security**: Incorporate differential privacy, secure aggregation, and homomorphic encryption to protect decentralized data.
   - **Adversarial Robustness**: Design the system to be robust against adversarial attacks using methods like adversarial training and robust MDPs.
   - **Complex Interactions**: Combine these elements to create non-obvious subproblems and logical traps that challenge naive approaches.

5) **What can be a final hard prompt?**

   - *As stated in the Final Complex Question above.*

# Setup

```requirements.txt
```


# Prompt

I want to build a Python-based recommendation engine that uses reinforcement learning to adaptively suggest items to users. The system should simulate user interaction data, apply Q-learning to model user preferences, and improve recommendations based on historical user behavior over time.

**Input Format**

Your function will take the following inputs:

- num_users: An integer representing the total number of users.
- num_items: An integer representing the total number of items to recommend.
- interactions: A dictionary where each key is a user ID (int) and the value is a list of item IDs (int) the user interacted with positively.
- num_episodes: Integer specifying how many episodes to run the learning process.
- epsilon: A float between 0 and 1 used for exploration in the epsilon-greedy strategy.
- alpha: A float between 0 and 1 for the Q-learning learning rate.
- gamma: A float between 0 and 1 for the discount factor.

All inputs will be valid, and user/item IDs range from 0 to num_users - 1 and 0 to num_items - 1, respectively.

**Output Format**

Return a dictionary of recommended items:

```python
{
  user_id (int): [recommended_item_ids (list of int)]
}
```

Each user ID in the output should have a list of the top 3 recommended item IDs, based on the learned Q values after all episodes.

**Examples**

Example 1

```python
Input:
num_users = 2
num_items = 5
interactions = {
  0: [1, 2],
  1: [0, 3]
}
num_episodes = 100
epsilon = 0.2
alpha = 0.5
gamma = 0.9

Output:
{
  0: [1, 2, 0],
  1: [3, 0, 2]
}
```

Example 2

```python
Input:
num_users = 1
num_items = 4
interactions = {
  0: [2]
}
num_episodes = 50
epsilon = 0.1
alpha = 0.3
gamma = 0.8

Output:
{
  0: [2, 0, 1]
}
```

# Requirements

**Explicit and Implicit Points**

- Model users as individual states and items as actions.
- Use a Q table of shape (num_users, num_items) to store learned values.
- Use an epsilon-greedy strategy for action selection during training episodes.
- Reward must be +1 if the item is in the user’s interaction list, and 0 otherwise.
- Update Q values using the standard Q-learning formula, treating the next state as the same user.
- After training, return the 3 item IDs with the highest Q values for each user.
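
The update rule named in the list above can be sketched for a single user's row of the Q table. This is a minimal illustration, not the reference solution: the helper name `q_update` is hypothetical, and it assumes the next state is the same user, so the bootstrap term is the best value currently in that row.

```python
def q_update(q_row, action, reward, alpha, gamma):
    """Apply one Q-learning update in place to a single user's row of Q values."""
    # Assumption: the next state is the same user, so the bootstrap term
    # is the largest Q value currently stored in this row.
    best_next = max(q_row)
    q_row[action] += alpha * (reward + gamma * best_next - q_row[action])
    return q_row[action]


row = [0.0, 0.0, 0.0]
first = q_update(row, action=1, reward=1, alpha=0.5, gamma=0.9)
second = q_update(row, action=1, reward=1, alpha=0.5, gamma=0.9)
print(first, second)
```

With these inputs the first update moves the value to 0.5 and the second to about 0.975. Under this single-state assumption, an item that is always rewarded and always the row maximum drifts toward 1 / (1 - gamma) rather than toward 1.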

**Solution Expectations**

- The function must initialize the Q table and apply Q-learning over the specified number of episodes.
- Each user’s Q values must be updated based on simulated rewards.
- Final recommendations should be the top 3 items with the highest Q values per user.
- You must use random.seed(42) to ensure deterministic behavior.
- Exploration and exploitation must be balanced based on the value of epsilon.
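
One way to read the seeding and exploration expectations together is the sketch below. The helper name `choose_action` is hypothetical; it relies only on `random.random()` and `random.choice()`, and it assumes that ties during exploitation go to the lower item ID.

```python
import random


def choose_action(q_row, epsilon):
    """Pick an item index for one user with an epsilon-greedy rule."""
    if random.random() < epsilon:
        # Explore: any item, uniformly at random.
        return random.choice(range(len(q_row)))
    # Exploit: list.index returns the first maximum, i.e. the lowest item ID on ties.
    return q_row.index(max(q_row))


random.seed(42)
q_row = [0.0, 0.7, 0.7, 0.1]
greedy = choose_action(q_row, epsilon=0.0)                        # never explores
explored = [choose_action(q_row, epsilon=1.0) for _ in range(5)]  # always explores
print(greedy, explored)
```

Because `random.random()` returns values in [0, 1), `epsilon=0.0` never explores and `epsilon=1.0` always does. Re-seeding with the same value before a run reproduces the same sequence of explored items.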

**Signature of Expected Function**

```python
def q_learning_recommendation_engine(
    num_users: int,
    num_items: int,
    interactions: dict[int, list[int]],
    num_episodes: int,
    epsilon: float,
    alpha: float,
    gamma: float
) -> dict[int, list[int]]:
```

**Edge Case Behavior**

- If num_users or num_items is 0, raise a ValueError.
- If a user has no positive interactions, they should still be trained with reward 0 and receive the top 3 items by learned Q value (any items are acceptable).
- If num_episodes is 0, return recommendations based on the initial Q table (which may be all zeros).
- If multiple items have the same Q value, break ties by choosing the lower item ID.
- You must demonstrate a test for each of these edge cases.
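
The tie-breaking rule can be illustrated with a small selection loop that avoids `sorted()` and `heapq` altogether. The helper name `top_k_items` is hypothetical, and the sketch returns fewer than `k` items when the row is shorter than `k`; how to pad in that case is left open here.

```python
def top_k_items(q_row, k=3):
    """Return up to k item IDs ordered by Q value (descending), lower ID first on ties."""
    remaining = list(range(len(q_row)))
    picked = []
    while remaining and len(picked) < k:
        best = remaining[0]
        for item_id in remaining[1:]:
            # Strict '>' keeps the earlier (lower) item ID when values tie.
            if q_row[item_id] > q_row[best]:
                best = item_id
        picked.append(best)
        remaining.remove(best)
    return picked


print(top_k_items([0.0, 0.0, 0.0, 0.0]))       # [0, 1, 2]
print(top_k_items([0.2, 0.9, 0.9, 0.1, 0.5]))  # [1, 2, 4]
```

The first call shows the zero-episode case, where every Q value is tied and the lowest IDs win. The second shows a tie between items 1 and 2 resolved in favor of item 1.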

**Constraints**

1. Use random.seed(42) exactly once inside the function.
2. Do not use external libraries such as pandas, numpy, scikit-learn, or any RL-specific frameworks.
3. Do not use recursion; all logic must be written iteratively.
4. You must not use global variables or modify input parameters.
5. The Q table must be represented as a list of lists or dictionary, not a NumPy array.
6. You must not hardcode item IDs or user IDs; they must be derived from input.
7. Action selection must be done using random.random() and random.choice() only.
8. No custom classes are required. You should use primitive Python data structures (lists, dictionaries, etc.).
9. Return exactly 3 recommended items per user, even if Q values are all tied or zero.
10. Do not use any of the following functions: eval(), exec(), globals(), heapq, or sorted() with custom keys.
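
Constraint 4 rules out mutating the inputs, and the reward rule needs a membership test per step. A possible way to satisfy both is to copy each user's liked items into a fresh set up front. The helper names `build_reward_lookup` and `reward_for` below are hypothetical and only illustrate the idea.

```python
def build_reward_lookup(num_users, interactions):
    """Copy each user's liked items into a new set; the input dict is left untouched."""
    return [set(interactions.get(user_id, [])) for user_id in range(num_users)]


def reward_for(liked_sets, user_id, item_id):
    """Return 1 when the user interacted positively with the item, else 0."""
    return 1 if item_id in liked_sets[user_id] else 0


interactions = {0: [1, 2], 2: [0]}  # user 1 has no interactions
liked = build_reward_lookup(3, interactions)
print(reward_for(liked, 0, 2), reward_for(liked, 1, 2), reward_for(liked, 2, 0))
```

Users missing from `interactions` get an empty set, so they are still trained and always receive reward 0.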



In [None]:
#code

import random


def q_learning_recommendation_engine(
    num_users: int,
    num_items: int,
    interactions: dict[int, list[int]],
    num_episodes: int,
    epsilon: float,
    alpha: float,
    gamma: float
) -> dict[int, list[int]]:
    """Q learning based recommendation engine that learns user preferences through reinforcement learning."""

    # Edge case validations as per constraints
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive integers")

    # Set seed exactly once as required
    random.seed(42)

    # Initialize Q table as list of lists
    q_table = [[0.0 for _ in range(num_items)] for _ in range(num_users)]

    # Training loop over episodes
    for episode in range(num_episodes):
        # Train each user in every episode
        for user_id in range(num_users):
            # Epsilon greedy action selection
            if random.random() < epsilon:
                # Exploration: random action
                action = random.choice(list(range(num_items)))
            else:
                # Exploitation: choose action with highest Q value
                max_q_value = max(q_table[user_id])
                # Break ties by choosing lower item ID as per constraint
                action = q_table[user_id].index(max_q_value)

            # Calculate reward based on user interactions
            user_interactions = interactions.get(user_id, [])
            reward = 1 if action in user_interactions else 0

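            # Q-learning update: the next state is the same user, so the
            # bootstrap term is the current best Q value in this user's row
            max_next_q = max(q_table[user_id])
            q_table[user_id][action] += alpha * (reward + gamma * max_next_q - q_table[user_id][action])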

    # Generate recommendations: top 3 items with highest Q values per user
    recommendations = {}
    for user_id in range(num_users):
        # Create list of (q_value, item_id) pairs for sorting
        q_items = [(q_table[user_id][item_id], item_id) for item_id in range(num_items)]

        for i in range(len(q_items)):
            for j in range(i + 1, len(q_items)):
                # Sort by Q value descending, then by item_id ascending
                if (q_items[i][0] < q_items[j][0] or
                    (q_items[i][0] == q_items[j][0] and q_items[i][1] > q_items[j][1])):
                    q_items[i], q_items[j] = q_items[j], q_items[i]

        # Extract top 3 item IDs
        top_3_items = [item_id for _, item_id in q_items[:3]]
        recommendations[user_id] = top_3_items

    return recommendations


def test_q_learning_recommendation_engine():
    """Comprehensive test suite covering all requirements and edge cases."""

    print("Running comprehensive tests...")

    #  Example 1 from prompt
    print("\n1. Testing Example 1...")
    result1 = q_learning_recommendation_engine(
        num_users=2,
        num_items=5,
        interactions={0: [1, 2], 1: [0, 3]},
        num_episodes=100,
        epsilon=0.2,
        alpha=0.5,
        gamma=0.9
    )
    print(f"Result: {result1}")
    assert len(result1) == 2, "Should have 2 users"
    assert all(len(items) == 3 for items in result1.values()), "Each user should have 3 recommendations"
    assert all(0 <= item < 5 for items in result1.values() for item in items), "All items should be valid"
    print(" Example 1 passed")

    #  Example 2 from prompt
    print("\n2. Testing Example 2...")
    result2 = q_learning_recommendation_engine(
        num_users=1,
        num_items=4,
        interactions={0: [2]},
        num_episodes=50,
        epsilon=0.1,
        alpha=0.3,
        gamma=0.8
    )
    print(f"Result: {result2}")
    assert len(result2) == 1, "Should have 1 user"
    assert len(result2[0]) == 3, "User should have 3 recommendations"
    assert all(0 <= item < 4 for item in result2[0]), "All items should be valid"
    print(" Example 2 passed")

    #  Edge case  zero episodes
    print("\n3. Testing zero episodes...")
    result3 = q_learning_recommendation_engine(
        num_users=2,
        num_items=3,
        interactions={0: [1], 1: [2]},
        num_episodes=0,
        epsilon=0.1,
        alpha=0.5,
        gamma=0.9
    )
    print(f"Result: {result3}")
    assert len(result3) == 2, "Should have 2 users"
    assert all(len(items) == 3 for items in result3.values()), "Each user should have 3 recommendations"
    print(" Zero episodes test passed")

    #  Edge case  user with no interactions
    print("\n4. Testing user with no interactions...")
    result4 = q_learning_recommendation_engine(
        num_users=3,
        num_items=4,
        interactions={0: [1, 2], 2: [0]},  # User 1 has no interactions
        num_episodes=50,
        epsilon=0.2,
        alpha=0.4,
        gamma=0.8
    )
    print(f"Result: {result4}")
    assert len(result4) == 3, "Should have 3 users"
    assert len(result4[1]) == 3, "User with no interactions should still get 3 recommendations"
    print(" No interactions test passed")

    #  Edge case  invalid num_users
    print("\n5. Testing invalid num_users...")
    try:
        q_learning_recommendation_engine(
            num_users=0,
            num_items=5,
            interactions={},
            num_episodes=10,
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )
        assert False, "Should have raised ValueError"
    except ValueError as e:
        print(f" Correctly raised ValueError: {e}")

    #  Edge case  invalid num_items
    print("\n6. Testing invalid num_items...")
    try:
        q_learning_recommendation_engine(
            num_users=2,
            num_items=0,
            interactions={0: [1]},
            num_episodes=10,
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )
        assert False, "Should have raised ValueError"
    except ValueError as e:
        print(f" Correctly raised ValueError: {e}")

    #  Tie-breaking behavior (lower item ID wins)
    print("\n7. Testing tie-breaking with equal Q values...")
    result7 = q_learning_recommendation_engine(
        num_users=1,
        num_items=3,
        interactions={},  # No interactions, all Q values should be equal
        num_episodes=0,   # No learning, Q values remain 0
        epsilon=0.1,
        alpha=0.5,
        gamma=0.9
    )
    print(f"Result: {result7}")
    expected_order = [0, 1, 2]  # Should be sorted by item ID for ties
    assert result7[0] == expected_order, f"Expected {expected_order}, got {result7[0]}"
    print(" Tie-breaking test passed")

    #  Boundary value testing
    print("\n8. Testing boundary values...")
    result8 = q_learning_recommendation_engine(
        num_users=1,
        num_items=1,
        interactions={0: [0]},
        num_episodes=1,
        epsilon=1.0,  # Always explore
        alpha=1.0,    # Maximum learning rate
        gamma=0.0     # No future reward consideration
    )
    print(f"Result: {result8}")
    assert len(result8[0]) == 1, "Should recommend only 1 item when num_items=1"
    print(" Boundary values test passed")

    #  Large scale test
    print("\n9. Testing larger scale...")
    large_interactions = {i: [i % 10, (i + 1) % 10] for i in range(20)}
    result9 = q_learning_recommendation_engine(
        num_users=20,
        num_items=15,
        interactions=large_interactions,
        num_episodes=200,
        epsilon=0.3,
        alpha=0.1,
        gamma=0.95
    )
    print(f"Result size: {len(result9)} users")
    assert len(result9) == 20, "Should have 20 users"
    assert all(len(items) == 3 for items in result9.values()), "Each user should have 3 recommendations"
    print(" Large scale test passed")

    #  Deterministic behavior verification
    print("\n10. Testing deterministic behavior...")
    result10a = q_learning_recommendation_engine(
        num_users=2,
        num_items=4,
        interactions={0: [1], 1: [2]},
        num_episodes=30,
        epsilon=0.2,
        alpha=0.5,
        gamma=0.8
    )
    result10b = q_learning_recommendation_engine(
        num_users=2,
        num_items=4,
        interactions={0: [1], 1: [2]},
        num_episodes=30,
        epsilon=0.2,
        alpha=0.5,
        gamma=0.8
    )
    print(f"First run: {result10a}")
    print(f"Second run: {result10b}")
    assert result10a == result10b, "Results should be deterministic due to random.seed(42)"
    print(" Deterministic behavior test passed")

    print("\n All tests passed successfully!")


def validate_input_constraints(
    num_users: int,
    num_items: int,
    interactions: dict[int, list[int]],
    num_episodes: int,
    epsilon: float,
    alpha: float,
    gamma: float
):
    """Validate all input parameters meet the specified constraints."""

    print("Validating input constraints...")

    # Type validation
    assert isinstance(num_users, int), "num_users must be an integer"
    assert isinstance(num_items, int), "num_items must be an integer"
    assert isinstance(interactions, dict), "interactions must be a dictionary"
    assert isinstance(num_episodes, int), "num_episodes must be an integer"
    assert isinstance(epsilon, float), "epsilon must be a float"
    assert isinstance(alpha, float), "alpha must be a float"
    assert isinstance(gamma, float), "gamma must be a float"

    # Range validation
    assert num_users > 0, "num_users must be positive"
    assert num_items > 0, "num_items must be positive"
    assert num_episodes >= 0, "num_episodes must be non-negative"
    assert 0 <= epsilon <= 1, "epsilon must be between 0 and 1"
    assert 0 <= alpha <= 1, "alpha must be between 0 and 1"
    assert 0 <= gamma <= 1, "gamma must be between 0 and 1"

    # Interactions validation
    for user_id, item_list in interactions.items():
        assert isinstance(user_id, int), "User IDs must be integers"
        assert isinstance(item_list, list), "Interaction values must be lists"
        assert 0 <= user_id < num_users, f"User ID {user_id} out of range"
        for item_id in item_list:
            assert isinstance(item_id, int), "Item IDs must be integers"
            assert 0 <= item_id < num_items, f"Item ID {item_id} out of range"

    print(" All input constraints validated")


if __name__ == "__main__":
    # Run comprehensive tests
    test_q_learning_recommendation_engine()

    # Validate input constraints
    print("\n" + "="*50)
    validate_input_constraints(
        num_users=2,
        num_items=5,
        interactions={0: [1, 2], 1: [0, 3]},
        num_episodes=100,
        epsilon=0.2,
        alpha=0.5,
        gamma=0.9
    )

    print("\n" + "="*50)
    print(" Q Learning Recommendation Engine is fully tested and ready!")

In [None]:
# tests

import unittest
from main import q_learning_recommendation_engine


class TestQLearningRecommendationEngine(unittest.TestCase):
    """
    Comprehensive unit test suite for q_learning_recommendation_engine

    Tests based on prompt requirements and specifications.
    Covers all edge cases, constraints, and expected behaviors.
    """

    def test_deterministic_behavior(self):
        """Test that same inputs produce same outputs due to random.seed(42)"""
        params = {
            'num_users': 2,
            'num_items': 4,
            'interactions': {0: [1], 1: [2]},
            'num_episodes': 20,
            'epsilon': 0.3,
            'alpha': 0.5,
            'gamma': 0.9
        }

        result1 = q_learning_recommendation_engine(**params)
        result2 = q_learning_recommendation_engine(**params)
        result3 = q_learning_recommendation_engine(**params)

        self.assertEqual(result1, result2)
        self.assertEqual(result2, result3)

    def test_output_format_validation(self):
        """Test output format matches specification requirements"""
        result = q_learning_recommendation_engine(
            num_users=3,
            num_items=4,
            interactions={0: [1], 1: [2]},
            num_episodes=5,
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )

        # Must be dictionary
        self.assertIsInstance(result, dict)

        # Must have all users
        self.assertEqual(len(result), 3)
        for user_id in range(3):
            self.assertIn(user_id, result)

        # Each user must have exactly 3 recommendations
        for user_id, recommendations in result.items():
            self.assertIsInstance(recommendations, list)
            self.assertEqual(len(recommendations), 3)

            # All recommendations must be valid item IDs
            for item_id in recommendations:
                self.assertIsInstance(item_id, int)
                self.assertGreaterEqual(item_id, 0)
                self.assertLess(item_id, 4)

    def test_value_error_zero_users(self):
        """Test ValueError raised when num_users is 0"""
        with self.assertRaises(ValueError):
            q_learning_recommendation_engine(
                num_users=0,
                num_items=5,
                interactions={},
                num_episodes=10,
                epsilon=0.1,
                alpha=0.5,
                gamma=0.9
            )

    def test_value_error_zero_items(self):
        """Test ValueError raised when num_items is 0"""
        with self.assertRaises(ValueError):
            q_learning_recommendation_engine(
                num_users=2,
                num_items=0,
                interactions={},
                num_episodes=10,
                epsilon=0.1,
                alpha=0.5,
                gamma=0.9
            )

    def test_user_with_no_interactions(self):
        """Test user with no positive interactions still gets recommendations"""
        result = q_learning_recommendation_engine(
            num_users=3,
            num_items=4,
            interactions={0: [1, 2], 2: [0]},  # User 1 has no interactions
            num_episodes=10,
            epsilon=0.2,
            alpha=0.5,
            gamma=0.9
        )

        # All users should be present
        self.assertEqual(len(result), 3)
        self.assertIn(0, result)
        self.assertIn(1, result)  # User with no interactions
        self.assertIn(2, result)

        # User with no interactions should still get 3 recommendations
        self.assertEqual(len(result[1]), 3)

        # All recommendations should be valid item IDs
        for item_id in result[1]:
            self.assertIsInstance(item_id, int)
            self.assertGreaterEqual(item_id, 0)
            self.assertLess(item_id, 4)

    def test_zero_episodes_behavior(self):
        """Test behavior when num_episodes is 0 (no training)"""
        result = q_learning_recommendation_engine(
            num_users=2,
            num_items=3,
            interactions={0: [1], 1: [2]},
            num_episodes=0,  # No training
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )

        # Should return recommendations based on initial Q-table (all zeros)
        # Tie-breaking should prefer lower item IDs
        expected = {0: [0, 1, 2], 1: [0, 1, 2]}
        self.assertEqual(result, expected)

    def test_tie_breaking_lower_item_id(self):
        """Test tie-breaking chooses lower item ID when Q-values are equal"""
        result = q_learning_recommendation_engine(
            num_users=1,
            num_items=5,
            interactions={},  # No interactions = all Q-values remain equal
            num_episodes=0,   # No training = Q-values stay at initial values
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )

        # All Q-values should be equal, so should return [0, 1, 2]
        expected = {0: [0, 1, 2]}
        self.assertEqual(result, expected)

    def test_empty_interactions_dictionary(self):
        """Test system handles empty interactions dictionary gracefully"""
        result = q_learning_recommendation_engine(
            num_users=2,
            num_items=3,
            interactions={},  # No user interactions at all
            num_episodes=5,
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )

        # Should provide recommendations for all users
        self.assertEqual(len(result), 2)
        self.assertIn(0, result)
        self.assertIn(1, result)

        # Each user should have exactly 3 recommendations
        for user_id, recommendations in result.items():
            self.assertEqual(len(recommendations), 3)
            for item_id in recommendations:
                self.assertIsInstance(item_id, int)
                self.assertGreaterEqual(item_id, 0)
                self.assertLess(item_id, 3)

    def test_single_user_single_item(self):
        """Test boundary case with single user and single item"""
        result = q_learning_recommendation_engine(
            num_users=1,
            num_items=1,
            interactions={0: [0]},
            num_episodes=5,
            epsilon=0.1,
            alpha=0.5,
            gamma=0.9
        )

        # Should have one user
        self.assertEqual(len(result), 1)
        self.assertIn(0, result)

        # The reference implementation returns at most num_items entries,
        # so a single-item catalogue yields a single recommendation
        self.assertEqual(len(result[0]), 1)

        # The only valid item ID is 0
        for item_id in result[0]:
            self.assertIsInstance(item_id, int)
            self.assertGreaterEqual(item_id, 0)
            self.assertLess(item_id, 1)

    def test_extreme_epsilon_values(self):
        """Test boundary epsilon values (pure exploration and exploitation)"""
        base_params = {
            'num_users': 1,
            'num_items': 3,
            'interactions': {0: [2]},
            'num_episodes': 5,
            'alpha': 0.5,
            'gamma': 0.9
        }

        # Test epsilon = 0.0 (pure exploitation)
        result_exploit = q_learning_recommendation_engine(
            **base_params,
            epsilon=0.0
        )
        self.assertEqual(len(result_exploit), 1)
        self.assertEqual(len(result_exploit[0]), 3)

        # Test epsilon = 1.0 (pure exploration)
        result_explore = q_learning_recommendation_engine(
            **base_params,
            epsilon=1.0
        )
        self.assertEqual(len(result_explore), 1)
        self.assertEqual(len(result_explore[0]), 3)

    def test_large_scale_input(self):
        """Test system handles larger user and item counts"""
        # Create interactions for subset of users
        interactions = {i: [i % 10, (i + 1) % 10] for i in range(0, 20, 2)}

        result = q_learning_recommendation_engine(
            num_users=20,
            num_items=15,
            interactions=interactions,
            num_episodes=10,
            epsilon=0.2,
            alpha=0.3,
            gamma=0.8
        )

        # Should have all 20 users
        self.assertEqual(len(result), 20)

        # Each user should have exactly 3 recommendations
        for user_id in range(20):
            self.assertIn(user_id, result)
            self.assertEqual(len(result[user_id]), 3)

            # All recommendations should be valid item IDs
            for item_id in result[user_id]:
                self.assertIsInstance(item_id, int)
                self.assertGreaterEqual(item_id, 0)
                self.assertLess(item_id, 15)

    def test_reward_system_validation(self):
        """Test that reward system correctly identifies user interactions"""
        # Simple case to verify learning occurs
        result = q_learning_recommendation_engine(
            num_users=1,
            num_items=3,
            interactions={0: [2]},  # User 0 likes item 2
            num_episodes=10,
            epsilon=0.1,  # Mostly exploitation
            alpha=0.8,    # Fast learning
            gamma=0.5
        )

        # Verify output format is correct
        self.assertEqual(len(result), 1)
        self.assertIn(0, result)
        self.assertEqual(len(result[0]), 3)

        # All items should be valid
        for item_id in result[0]:
            self.assertIsInstance(item_id, int)
            self.assertGreaterEqual(item_id, 0)
            self.assertLess(item_id, 3)

    def test_interaction_key_types(self):
        """Test that interaction dictionary keys are handled correctly"""
        result = q_learning_recommendation_engine(
            num_users=2,
            num_items=3,
            interactions={0: [1], 1: [0, 2]},  # Integer keys as specified
            num_episodes=5,
            epsilon=0.2,
            alpha=0.5,
            gamma=0.9
        )

        # Standard validation
        self.assertEqual(len(result), 2)
        for user_id, recommendations in result.items():
            self.assertIsInstance(user_id, int)
            self.assertEqual(len(recommendations), 3)
            for item_id in recommendations:
                self.assertIsInstance(item_id, int)
                self.assertGreaterEqual(item_id, 0)
                self.assertLess(item_id, 3)


if __name__ == '__main__':
    unittest.main()

# Model Breaking Proof

#### Model Breaking Task URL: <https://rlhf-v3.turing.com/prompt/f9255ef6-35c7-49d2-b36c-082cafc04870?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F44828%2Fview>

#### Model code:

```python
def q_learning_recommendation_engine(
    num_users: int,
    num_items: int,
    interactions: dict[int, list[int]],
    num_episodes: int,
    epsilon: float,
    alpha: float,
    gamma: float
) -> dict[int, list[int]]:
    """
    Builds a simple Q-learning-based recommendation engine for num_users users and num_items items.
    
    Each user is treated as a state, and each item as an action. A Q table (list of lists) with shape
    (num_users, num_items) is maintained and updated over num_episodes episodes. The reward is +1 if
    the user interacted positively with the item and 0 otherwise. Epsilon-greedy action selection is
    used, and Q values are updated with the standard Q-learning update. After training, the function
    returns a dictionary mapping each user to the top 3 items (by Q value).

    Constraints and notes:
     - Exactly one call to random.seed(42).
     - No external libraries (pandas, numpy, etc.).
     - No recursion, no global variables, no RL frameworks.
     - Q must be stored in primitive Python structures (list of lists, etc.).
     - Ties in Q values should be broken by smaller item ID.
     - If num_users == 0 or num_items == 0, raise ValueError.
     - If num_episodes == 0, return from the initial Q table (likely all zeros).
     - If a user has no positive interactions, they still get Q updates with reward 0.
     - Must return exactly 3 recommended items per user in a dictionary.
    """

    import random
    random.seed(42)  # Exactly once, as required

    # Edge cases
    if num_users == 0 or num_items == 0:
        raise ValueError("num_users and num_items must be greater than 0.")

    # Initialize Q table (all zeros)
    Q = []
    for _ in range(num_users):
        Q.append([0.0] * num_items)

    # Convert interactions dict to a set for quick reward lookup
    # interactions[user] = set_of_items
    interaction_sets = {}
    for user_id in range(num_users):
        if user_id in interactions:
            interaction_sets[user_id] = set(interactions[user_id])
        else:
            # If user not in interactions, they simply have an empty set
            interaction_sets[user_id] = set()

    # Run Q-learning for num_episodes
    for _ in range(num_episodes):
        # For each user, pick an action (item) according to epsilon-greedy
        for user_id in range(num_users):
            # Epsilon greedy selection
            if random.random() < epsilon:
                # Explore
                action = random.choice(range(num_items))
            else:
                # Exploit: pick item with best Q value (break ties by item ID)
                best_action = 0
                best_q = Q[user_id][0]
                for item_id in range(1, num_items):
                    if Q[user_id][item_id] > best_q:
                        best_q = Q[user_id][item_id]
                        best_action = item_id
                    elif Q[user_id][item_id] == best_q and item_id < best_action:
                        best_action = item_id
                action = best_action

            # Determine reward
            reward = 1 if action in interaction_sets[user_id] else 0

            # Update Q value
            # Next state is effectively the same user, so we take max Q over all items
            max_q_next = max(Q[user_id])  # best future Q from the same state
            old_q = Q[user_id][action]
            Q[user_id][action] = old_q + alpha * (reward + gamma * max_q_next - old_q)

    # After training, pick the top 3 items per user
    recommendations = {}
    for user_id in range(num_users):
        # Build a list of ( -Qvalue, item_id ) so default sorted() will sort by highest Q first
        # and by ascending item_id for ties in Q since negative sign inverts order for Q.
        items_with_q = []
        for item_id in range(num_items):
            items_with_q.append((-Q[user_id][item_id], item_id))
        # Sort by the tuple: first by -Q (ascending => Q descending), tie-break by item_id (ascending)

        #  <Issue>: <The prompt explicitly forbids `sorted()` with custom keys. This line sorts (-Q, item_id) tuples to simulate key behavior.>
    
        items_with_q = sorted(items_with_q)
        top_3_items = [tup[1] for tup in items_with_q[:3]]
        recommendations[user_id] = top_3_items

    return recommendations


# --------------------------------------------------------------------------
# Below are demonstration tests for the specified edge cases.
# You can comment them out or remove them as needed.
# They are included here to illustrate compliance with the requirement:
# "You must demonstrate a test for each of these edge cases."
# --------------------------------------------------------------------------

def _demo_edge_cases():
    print("Demonstrating edge cases...")

    # 1. num_users=0 or num_items=0 => raise ValueError
    try:
        q_learning_recommendation_engine(0, 5, {}, 10, 0.1, 0.2, 0.9)
    except ValueError as e:
        print("Edge case 1 passed: ", e)

    try:
        q_learning_recommendation_engine(5, 0, {}, 10, 0.1, 0.2, 0.9)
    except ValueError as e:
        print("Edge case 2 passed: ", e)

    # 2. If num_episodes=0 => return recommendations from initial Q (all zeros)
    #    Ties mean smallest item IDs come first => recommended [0,1,2] for each user.
    result = q_learning_recommendation_engine(
        num_users=2,
        num_items=5,
        interactions={},
        num_episodes=0,  # no training
        epsilon=0.3,
        alpha=0.1,
        gamma=0.9
    )
    print("Edge case 3 (num_episodes=0) =>", result)

    # 3. If a user has no positive interactions => they still get Q=0. Top 3 are [0,1,2].
    #    We'll do a small training to see that it remains presumably 0.
    result = q_learning_recommendation_engine(
        num_users=1,
        num_items=3,
        interactions={},  # user 0 has no interactions
        num_episodes=10,
        epsilon=0.5,
        alpha=0.5,
        gamma=0.9
    )
    print("Edge case 4 (no interactions) =>", result)

# Uncomment to run the edge case demonstrations:
# _demo_edge_cases()
```