## Recommender env: design
<!-- video shot="/eOFzSon7dK0" start="03:42" end="14:03" -->

In [1]:
# HIDDEN
import gym

#### Simulating user behavior

- Simulate user behavior when **repeatedly** responding to item recommendations.
- The key behavior to simulate:

> Recommending "junk food" items is good in the short term, but bad in the long term.

- This is completely our choice as the designer of the environment. 
- You may want to simulate/capture a different type of user behavior.
  - Or, learn from user behavior data -- more on this later!
- But this will be our running example for now.

#### Candy

- We will model every item as having a "sweetness" level. 
- We'll refer to high-sweetness items as "candy".

![](img/candy.jpg)

- This could be short, silly videos, or cheap trinkets on sale, etc.
- Users love candy in the short term, but too much candy leads to dissatisfaction in the long term.

#### Veggies

- On the other hand, we'll refer to low-sweetness items as "veggies".

![](img/veggies.jpg)

- These could be educational documentaries, or boring-but-useful items, etc.
- Users don't enjoy veggies much in the short term, but they boost satisfaction in the long term.

#### Sugar level

- Each **item** has a sweetness level, which is a fixed property of the item.
- We'll model our **users** as having a variable **sugar level** that measures how much candy they've eaten recently.
- The user's sugar level (or the notion of a sugar level) _will not be known to the agent_!
- But for now, we're designing the simulator, so we are all-knowing.

#### Sugar level dynamics

- We need to decide how the sugar level changes with item consumption.
- A simple approach is:

> When an item is consumed, the sugar level moves towards the sweetness of that item.

Examples:

- If your sugar level is 0.2 and you consume an item with sweetness 0.5, your sugar level goes up ⬆️
- If your sugar level is 0.2 and you consume an item with sweetness 0.1, your sugar level goes down ⬇️

#### Sugar level dynamics

- How do we represent this mathematically?
- We can try this:

> new sugar level = ⍺ (old sugar level) + (1 - ⍺) (item sweetness)

- Here, ⍺ is a number between 0 and 1 that controls how "stubborn" the sugar level is.

In [3]:
# HIDDEN 
# note the slide above and below are partially the same - just want to hide the bottom half at first

#### Sugar level dynamics

- How do we represent this mathematically?
- We can try this:

> new sugar level = ⍺ (old sugar level) + (1 - ⍺) (item sweetness)

- Here, ⍺ is a number between 0 and 1 that controls how "stubborn" the sugar level is.
- For example, if ⍺=1 then the above equation becomes

> new sugar level = old sugar level

and the sugar level never changes. If ⍺=0 then we have

> new sugar level = item sweetness

meaning a single item can completely change the user's sugar level.

- For ⍺ between 0 and 1, we have a combination of the old sugar level and the item sweetness.

#### Sugar level dynamics

We can implement the above using this function:

In [4]:
def update_sugar_level(sugar_level, item_sweetness, alpha=0.9):
    return alpha * sugar_level + (1 - alpha) * item_sweetness

Let's test it out to make sure the behavior makes sense (using the default value of ⍺=0.9):

In [5]:
sugar_level = 0.2
sugar_level = update_sugar_level(sugar_level, 0.8)
sugar_level

0.26

The item was sweet (0.8), so the sugar level went up quite a bit.

In [6]:
sugar_level = update_sugar_level(sugar_level, 0.3)
sugar_level

0.264

The item sweetness was slightly above the sugar level, so the sugar level went up slightly.

#### Sugar level dynamics

In [7]:
sugar_level = update_sugar_level(sugar_level, 0.01)
sugar_level

0.2386

The item was un-sweet, so the sugar level went down.

#### Effect of alpha

We can see that, with a smaller alpha, the sugar level changes much faster:

In [8]:
sugar_level = update_sugar_level(sugar_level, 0.0, alpha=0.5)
sugar_level

0.1193
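
To make the effect of ⍺ easier to see over several steps, here is a minimal sketch that feeds the same item repeatedly and records the sugar level after each step. The helper `simulate` and the chosen starting values are illustrative assumptions, not part of the environment itself.

```python
def update_sugar_level(sugar_level, item_sweetness, alpha=0.9):
    return alpha * sugar_level + (1 - alpha) * item_sweetness

def simulate(start_level, item_sweetness, alpha, num_steps):
    """Hypothetical helper: sugar level after each of num_steps identical items."""
    levels = []
    sugar_level = start_level
    for _ in range(num_steps):
        sugar_level = update_sugar_level(sugar_level, item_sweetness, alpha=alpha)
        levels.append(sugar_level)
    return levels

# Feed the same candy (sweetness 1.0) five times, starting from sugar level 0.2
for alpha in (0.9, 0.5):
    levels = simulate(0.2, 1.0, alpha, num_steps=5)
    print(f"alpha={alpha}:", [round(level, 3) for level in levels])
```

At every step, the remaining gap between the sugar level and the item sweetness shrinks by a factor of ⍺, so with ⍺=0.5 the gap halves each time, while with ⍺=0.9 it only shrinks by 10%.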

#### Reward

- Ok great, we have the sugar level dynamics all sorted out!
- The second major piece of the puzzle is the reward.
- What we want:

1. Higher item sweetness leads to higher reward (yum, candy!)
2. Higher sugar level leads to lower reward (ahh, too much candy!)

A simple way to combine these effects is to multiply them together:

> reward = item sweetness * (1 - sugar level)

#### Reward implementation

> reward = item sweetness * (1 - sugar level)

We can code this as:

In [9]:
def reward(sugar_level, item_sweetness):
    return item_sweetness * (1 - sugar_level)
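
As a quick sanity check of the two desired properties, the sketch below evaluates the reward for a sweet and an un-sweet item at a few sugar levels. The sweetness values 0.9 and 0.1 are made-up examples.

```python
def reward(sugar_level, item_sweetness):
    return item_sweetness * (1 - sugar_level)

# Hypothetical sweetness values, chosen only for illustration
candy, veggie = 0.9, 0.1

for sugar_level in (0.1, 0.5, 0.9):
    print(
        f"sugar level {sugar_level}: "
        f"candy reward {reward(sugar_level, candy):.2f}, "
        f"veggie reward {reward(sugar_level, veggie):.2f}"
    )
```

The candy reward is higher than the veggie reward at every sugar level shown, but both shrink as the sugar level rises.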

We will be using these pieces in the next section when we implement our environment!

#### Observation space

- Next, we will need to set up the observations. 
- Our observations will be the _features of candidate items_.
- For simplicity, we'll assume only 1 feature, the item sweetness.
- So, the agent will see a bunch of sweetness levels, and choose one of them.

#### Action space

- In this environment, the action is the chosen item to recommend, given the candidates.
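
To make the observation and action concrete, here is a rough sketch in plain Python. It assumes the observation is a list with one sweetness value per candidate and that the action is the index of the chosen candidate; the helper `sample_observation`, the uniform sampling, and the choice of 5 candidates are all assumptions made for illustration.

```python
import random

def sample_observation(num_candidates, rng):
    """Hypothetical helper: one sweetness value in [0, 1] per candidate item."""
    return [rng.random() for _ in range(num_candidates)]

rng = random.Random(0)  # seeded so repeated runs give the same candidates
observation = sample_observation(num_candidates=5, rng=rng)

# Assumption: an action is the index of the recommended candidate.
# As a stand-in for a trained agent, greedily pick the sweetest item.
action = max(range(len(observation)), key=lambda i: observation[i])
chosen_sweetness = observation[action]

print("observation:", [round(s, 2) for s in observation])
print("action:", action, "-> sweetness", round(chosen_sweetness, 2))
```

The greedy choice here maximizes the immediate reward only; a trained agent would ideally also account for the effect on the sugar level.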

#### Let's apply what we learned!

## Big-picture
<!-- multiple choice -->

Which of the following is **NOT** true about the simulated recommender RL environment we are creating?

- [ ] The environment contains a vastly oversimplified model of user behavior, but a trained agent may still be useful in making recommendations.
- [x] The environment accurately represents how real users behave.
- [ ] The environment is a good starting point, and we may wish to add complexity as our work progresses.
- [ ] The environment captures the notion that users will respond differently to different items, and this response may depend on their history.

## Recommender rewards
<!-- multiple choice -->

Recall that our reward function is

> reward = item sweetness * (1 - sugar level)

#### Short-term satisfaction

True or False: at any given moment, the _immediate_ reward is _always_ larger for candy than for veggies.

- [x] True | The immediate reward is directly proportional to item sweetness.
- [ ] False | Take a closer look at the formula above!

#### Long-term satisfaction

True or False: at any given moment, the _long-term total_ reward is _always_ larger for recommending veggies than for candy.

- [ ] True | It's complicated to determine what will be best in the long term - this is what our agent has to learn!
- [x] False

## Sugar crash
<!-- coding exercise -->

Let's assume your sugar level starts at 0.5, and at each step you only have two items to choose from, mega-veggie (sweetness = 0) and mega-candy (sweetness = 1). You will be making 3 recommendations in a row, using alpha = 0.7. Use the coding window below to play around with different options, and find the best sequence of recommendations in terms of _total_ reward. 

In [10]:
# EXERCISE

def update_sugar_level(sugar_level, item_sweetness, alpha=0.9):
    return alpha * sugar_level + (1 - alpha) * item_sweetness

def reward(sugar_level, item_sweetness):
    return item_sweetness * (1 - sugar_level)

# MODIFY THIS LIST
# But make sure it always contains 3 items, each 0 or 1
recommendations = [0, 0, 0]

# starting sugar level
sugar_level = 0.5

total_reward = 0

for item_sweetness in recommendations:
    
    # add reward
    immediate_reward = reward(sugar_level, item_sweetness)
    total_reward += immediate_reward
    
    # update sugar level
    sugar_level = update_sugar_level(sugar_level, item_sweetness, alpha=0.7)
    
    print(f"  Received reward {immediate_reward:.5f}, new sugar level {sugar_level:.5f}")
    
print("Total reward after 3 recommendations:", total_reward)

  Received reward 0.00000, new sugar level 0.35000
  Received reward 0.00000, new sugar level 0.24500
  Received reward 0.00000, new sugar level 0.17150
Total reward after 3 recommendations: 0.0


In [11]:
# SOLUTION

def update_sugar_level(sugar_level, item_sweetness, alpha=0.9):
    return alpha * sugar_level + (1 - alpha) * item_sweetness

def reward(sugar_level, item_sweetness):
    return item_sweetness * (1 - sugar_level)

# MODIFY THIS LIST
# But make sure it always contains 3 items, each 0 or 1
recommendations = [0,1,1]

# starting sugar level
sugar_level = 0.5

total_reward = 0

for item_sweetness in recommendations:
    
    # add reward
    immediate_reward = reward(sugar_level, item_sweetness)
    total_reward += immediate_reward
    
    # update sugar level
    sugar_level = update_sugar_level(sugar_level, item_sweetness, alpha=0.7)
    
    print(f"  Received reward {immediate_reward:.5f}, new sugar level {sugar_level:.5f}")
    
print("Total reward after 3 recommendations:", total_reward)

  Received reward 0.00000, new sugar level 0.35000
  Received reward 0.65000, new sugar level 0.54500
  Received reward 0.45500, new sugar level 0.68150
Total reward after 3 recommendations: 1.105


#### What was the best strategy in this example?

- [x] 1 veggie to lower sugar levels, then 2 candies for that sweet, sweet reward.
- [ ] Candy, then veggies for good health, then more candy.
- [ ] Veggies all the way!
- [ ] Candy all the way!
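
With only 3 steps and 2 items there are just 8 possible sequences, so the answer can be double-checked by brute force. The sketch below reuses the exercise's two functions; the helper `total_reward` and the use of `itertools.product` are additions for illustration.

```python
from itertools import product

def update_sugar_level(sugar_level, item_sweetness, alpha=0.9):
    return alpha * sugar_level + (1 - alpha) * item_sweetness

def reward(sugar_level, item_sweetness):
    return item_sweetness * (1 - sugar_level)

def total_reward(recommendations, sugar_level=0.5, alpha=0.7):
    """Hypothetical helper: total reward for a sequence of item sweetness values."""
    total = 0.0
    for item_sweetness in recommendations:
        total += reward(sugar_level, item_sweetness)
        sugar_level = update_sugar_level(sugar_level, item_sweetness, alpha=alpha)
    return total

# Score every sequence of 3 items, each mega-veggie (0) or mega-candy (1)
results = {seq: total_reward(seq) for seq in product([0, 1], repeat=3)}

# Print the sequences from best to worst
for seq, total in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(seq, f"{total:.3f}")

best = max(results, key=results.get)
print("Best sequence:", best)
```

This kind of enumeration grows exponentially with the number of steps and items, which is one reason to learn a policy instead of searching over all sequences.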