[Bug Report] Calculating rewards for Blackjack toy-text env #1011

peterhungh3 · 2024-04-11T20:48:17Z

Describe the bug

In blackjack.py: reward = cmp(score(self.player), score(self.dealer))

Current rewards seem wrong for 2 edge cases:

Player: Blackjack vs Dealer: 21. Currently this return a draw (reward = 0) while it should be 1 (b/c player wins).
Player and dealer both bust: Currently this return a draw (reward = 0) while it should be -1 (dealer wins).

Code example

In blackjack.py: 
Inside function step():
reward = cmp(score(self.player), score(self.dealer))

And definition of score() and cmp(): 
def score(hand):  # What is the score of this hand (0 if bust)
    return 0 if is_bust(hand) else sum_hand(hand)
def cmp(a, b):
    return float(a > b) - float(a < b)

System info

Gymnasium, main branch
Python 3.9

Additional context

No response

Checklist

I have checked that there is no similar issue in the repo

pseudo-rnd-thoughts · 2024-04-11T21:44:35Z

I'm surprised that no one has noticed this issue with the reward function though admittedly on edge cases

What is your suggested changes to the reward in terms of code?
Could you provide some testing code for this behaviour? (you can use known seeds and actions to testing particular outcomes)

peterhungh3 · 2024-04-12T01:51:00Z

@pseudo-rnd-thoughts

For case 1: player 21 vs Dealer BJ and vice versa. Currently, the returned reward is 0 (draw)

Ex code to test for player = BJ + dealer = 21

from gymnasium.envs.toy_text.blackjack import BlackjackEnv, sum_hand 

env = BlackjackEnv(natural=True)
while True: 
    obs, _ = env.reset()
    if obs[0] == 21: # Player BJ
        action = 0
        next_obs, reward, terminated, truncated, info = env.step(action)
        if sum_hand(env.dealer) == 21 and len(env.dealer) > 2: # 21
            print(f"Player: {env.player} & dealer = {env.dealer}, " 
                f"reward = {reward}")
            assert reward == 1.5, reward

Ex code to test for player = 21 & dealer = BJ

    env = BlackjackEnv(natural=True)
    while True: 
        obs, _ = env.reset()
        if sum_hand(env.player) == 21: # ignore player BJ, we want to find 21
            continue 

        action = 1
        next_obs, reward, terminated, truncated, info = env.step(action)
        if (sum_hand(env.player) == 21 and len(env.player) > 2 and # player 21
            sum_hand(env.dealer) == 21 and len(env.dealer) == 2 # dealer BJ
        ): 
            print(f"Player: {env.player} & dealer = {env.dealer}, " 
                f"reward = {reward}")
            assert reward == -1, reward

For case 2: when both busted: I've rechecked and actually the current codes could handle this.
This was my mistake as I was trying to extend the env to support double-down and that case happened to enter this code path:
reward = cmp(score(self.player), score(self.dealer))
which would produce a reward of 0, which is incorrect. But the current codes already handle the busted player case in another path.
Nevertheless, the above line of code still seems a bit "dangerous" as it made me think it would seem to be able to handle all cases.

pseudo-rnd-thoughts · 2024-04-12T11:48:18Z

Could you make a PR with the suggested changes and tests for the relative rules

frischzenger · 2024-05-16T02:35:18Z

def is_natural(hand):  # Is this hand a natural blackjack?
    return sorted(hand) == [1, 10]

and i am also confused about natural, why here should be sorted?

frischzenger · 2024-05-16T02:36:52Z

    def step(self, action):
        assert self.action_space.contains(action)
        if action:  # hit: add a card to players hand and return
            self.player.append(draw_card(self.np_random))
            if is_bust(self.player):
                terminated = True
                reward = -1.0
            else:
                terminated = False
                reward = 0.0
        else:  # stick: play out the dealers hand, and score
            terminated = True
            while sum_hand(self.dealer) < 17:
                self.dealer.append(draw_card(self.np_random))
            reward = cmp(score(self.player), score(self.dealer))
            if self.sab and is_natural(self.player) and not is_natural(self.dealer):
                # Player automatically wins. Rules consistent with S&B
                reward = 1.0
            elif (
                not self.sab
                and self.natural
                and is_natural(self.player)
                and reward == 1.0
            ):
                # Natural gives extra points, but doesn't autowin. Legacy implementation
                reward = 1.5

        if self.render_mode == "human":
            self.render()
        return self._get_obs(), reward, terminated, False, {}

on this line：
if action: # hit: add a card to players hand and return
the player after the reset should has two cards, but step into this line, the player has three cards, why ?

CloseChoice · 2024-06-19T21:30:32Z

Current rewards seem wrong for 2 edge cases:

Player: Blackjack vs Dealer: 21. Currently this return a draw (reward = 0) while it should be 1 (b/c player wins).

Player and dealer both bust: Currently this return a draw (reward = 0) while it should be -1 (dealer wins).

I think both cases are handled correctly according to standard rules.
1: (cited from Wikipedia)

A player total of 21 on the first two cards is a "natural" or "blackjack", and the player wins immediately unless the dealer also has one, in which case the hand ties. In the case of a tie ("push" or "standoff"), bets are returned without adjustment.

This should also be the case if both player and dealer hit 21 (without either of them having a blackjack) but haven't found this stated explicitly on Wikipedia.
2: (also Wikipedia)

Number cards count as their number, the jack, queen, and king ("face cards" or "pictures") count as 10, and aces count as either 1 or 11 according to the player's choice. If the total exceeds 21 points, it busts, and all bets on it immediately lose.

If the player busts, then the dealer does not need to play as indicated by "immediately lose". So the second case should never happen.

CloseChoice · 2024-06-19T21:42:09Z

def is_natural(hand):  # Is this hand a natural blackjack?
    return sorted(hand) == [1, 10]
and i am also confused about natural, why here should be sorted?

because a natural does not depend on the order, you can either have an Ace + something worth a 10 or have it the other way round. So by sorting first the check is independent of order.

As for your other question:

on this line：
if action: # hit: add a card to players hand and return
the player after the reset should has two cards, but step into this line, the player has three cards, why ?

This should not be possible when I look at the current implementation. Can you write code that reproduces this? I would suspect that you did not call .reset() here.

pseudo-rnd-thoughts · 2024-07-02T09:08:37Z

I believe that @CloseChoice is correct however if @peterhungh3 or @frischzenger disagree, please provide an example case where you believe an issue occurs

peterhungh3 · 2024-07-02T13:23:54Z

@pseudo-rnd-thoughts @CloseChoice

I think what @CloseChoice said is correct, and if that's what gets implemented.

Again, my recommendation is just to make the reward calculation function more explicit, because the reward depends on the state of the game, not just what's calculated in score() func I cited above.

Back to the case:
If we take a look at the step() func of this file:

Gymnasium/gymnasium/envs/toy_text/blackjack.py

Line 178 in 40f5c7b

terminated = True

If the user has a BJ, the action will be stand. But the step() function won't check if the user has a BJ or not, but will add cards to the dealer until the sum is above 17. As a result, the dealer may get to 21 with say 3 cards. Then the comp() func will return a reward of 0, instead of 1 or 1.5 for users (users always win when having BJ, unless dealer also has BJ).

Again, this is a small point. My recommendation is just to make the reward calculation func taking into consideration the state of the game, not just the score of the hand.

peterhungh3 added the bug Something isn't working label Apr 11, 2024

CloseChoice mentioned this issue Jun 19, 2024

add tests for blackjack edge cases #1088

Closed

11 tasks

pseudo-rnd-thoughts closed this as completed Jul 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

peterhungh3 commented Apr 11, 2024

pseudo-rnd-thoughts commented Apr 11, 2024 •

edited

Loading

peterhungh3 commented Apr 12, 2024 •

edited

Loading

pseudo-rnd-thoughts commented Apr 12, 2024

frischzenger commented May 16, 2024

frischzenger commented May 16, 2024

CloseChoice commented Jun 19, 2024 •

edited

Loading

CloseChoice commented Jun 19, 2024

pseudo-rnd-thoughts commented Jul 2, 2024

peterhungh3 commented Jul 2, 2024

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

Comments

peterhungh3 commented Apr 11, 2024

Describe the bug

Code example

System info

Additional context

Checklist

pseudo-rnd-thoughts commented Apr 11, 2024 • edited Loading

peterhungh3 commented Apr 12, 2024 • edited Loading

pseudo-rnd-thoughts commented Apr 12, 2024

frischzenger commented May 16, 2024

frischzenger commented May 16, 2024

CloseChoice commented Jun 19, 2024 • edited Loading

CloseChoice commented Jun 19, 2024

pseudo-rnd-thoughts commented Jul 2, 2024

peterhungh3 commented Jul 2, 2024

pseudo-rnd-thoughts commented Apr 11, 2024 •

edited

Loading

peterhungh3 commented Apr 12, 2024 •

edited

Loading

CloseChoice commented Jun 19, 2024 •

edited

Loading