Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

Closed
1 task done
peterhungh3 opened this issue Apr 11, 2024 · 9 comments
Closed
1 task done

[Bug Report] Calculating rewards for Blackjack toy-text env #1011

peterhungh3 opened this issue Apr 11, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@peterhungh3
Copy link

Describe the bug

In blackjack.py: reward = cmp(score(self.player), score(self.dealer))

Current rewards seem wrong for 2 edge cases:

  1. Player: Blackjack vs Dealer: 21. Currently this return a draw (reward = 0) while it should be 1 (b/c player wins).
  2. Player and dealer both bust: Currently this return a draw (reward = 0) while it should be -1 (dealer wins).

Code example

In blackjack.py: 
Inside function step():
reward = cmp(score(self.player), score(self.dealer))

And definition of score() and cmp(): 
def score(hand):  # What is the score of this hand (0 if bust)
    return 0 if is_bust(hand) else sum_hand(hand)
def cmp(a, b):
    return float(a > b) - float(a < b)

System info

  • Gymnasium, main branch
  • Python 3.9

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@peterhungh3 peterhungh3 added the bug Something isn't working label Apr 11, 2024
@pseudo-rnd-thoughts
Copy link
Member

pseudo-rnd-thoughts commented Apr 11, 2024

I'm surprised that no one has noticed this issue with the reward function though admittedly on edge cases

What is your suggested changes to the reward in terms of code?
Could you provide some testing code for this behaviour? (you can use known seeds and actions to testing particular outcomes)

@peterhungh3
Copy link
Author

peterhungh3 commented Apr 12, 2024

@pseudo-rnd-thoughts

For case 1: player 21 vs Dealer BJ and vice versa. Currently, the returned reward is 0 (draw)

Ex code to test for player = BJ + dealer = 21

from gymnasium.envs.toy_text.blackjack import BlackjackEnv, sum_hand 

env = BlackjackEnv(natural=True)
while True: 
    obs, _ = env.reset()
    if obs[0] == 21: # Player BJ
        action = 0
        next_obs, reward, terminated, truncated, info = env.step(action)
        if sum_hand(env.dealer) == 21 and len(env.dealer) > 2: # 21
            print(f"Player: {env.player} & dealer = {env.dealer}, " 
                f"reward = {reward}")
            assert reward == 1.5, reward

Ex code to test for player = 21 & dealer = BJ

    env = BlackjackEnv(natural=True)
    while True: 
        obs, _ = env.reset()
        if sum_hand(env.player) == 21: # ignore player BJ, we want to find 21
            continue 

        action = 1
        next_obs, reward, terminated, truncated, info = env.step(action)
        if (sum_hand(env.player) == 21 and len(env.player) > 2 and # player 21
            sum_hand(env.dealer) == 21 and len(env.dealer) == 2 # dealer BJ
        ): 
            print(f"Player: {env.player} & dealer = {env.dealer}, " 
                f"reward = {reward}")
            assert reward == -1, reward

For case 2: when both busted: I've rechecked and actually the current codes could handle this.
This was my mistake as I was trying to extend the env to support double-down and that case happened to enter this code path:
reward = cmp(score(self.player), score(self.dealer))
which would produce a reward of 0, which is incorrect. But the current codes already handle the busted player case in another path.
Nevertheless, the above line of code still seems a bit "dangerous" as it made me think it would seem to be able to handle all cases.

@pseudo-rnd-thoughts
Copy link
Member

Could you make a PR with the suggested changes and tests for the relative rules

@frischzenger
Copy link

def is_natural(hand):  # Is this hand a natural blackjack?
    return sorted(hand) == [1, 10]

and i am also confused about natural, why here should be sorted?

@frischzenger
Copy link

    def step(self, action):
        assert self.action_space.contains(action)
        if action:  # hit: add a card to players hand and return
            self.player.append(draw_card(self.np_random))
            if is_bust(self.player):
                terminated = True
                reward = -1.0
            else:
                terminated = False
                reward = 0.0
        else:  # stick: play out the dealers hand, and score
            terminated = True
            while sum_hand(self.dealer) < 17:
                self.dealer.append(draw_card(self.np_random))
            reward = cmp(score(self.player), score(self.dealer))
            if self.sab and is_natural(self.player) and not is_natural(self.dealer):
                # Player automatically wins. Rules consistent with S&B
                reward = 1.0
            elif (
                not self.sab
                and self.natural
                and is_natural(self.player)
                and reward == 1.0
            ):
                # Natural gives extra points, but doesn't autowin. Legacy implementation
                reward = 1.5

        if self.render_mode == "human":
            self.render()
        return self._get_obs(), reward, terminated, False, {}

on this line:
if action: # hit: add a card to players hand and return
the player after the reset should has two cards, but step into this line, the player has three cards, why ?

@CloseChoice
Copy link
Contributor

CloseChoice commented Jun 19, 2024

Current rewards seem wrong for 2 edge cases:

  1. Player: Blackjack vs Dealer: 21. Currently this return a draw (reward = 0) while it should be 1 (b/c player wins).
  2. Player and dealer both bust: Currently this return a draw (reward = 0) while it should be -1 (dealer wins).

I think both cases are handled correctly according to standard rules.
1: (cited from Wikipedia)

A player total of 21 on the first two cards is a "natural" or "blackjack", and the player wins immediately unless the dealer also has one, in which case the hand ties. In the case of a tie ("push" or "standoff"), bets are returned without adjustment.

This should also be the case if both player and dealer hit 21 (without either of them having a blackjack) but haven't found this stated explicitly on Wikipedia.
2: (also Wikipedia)

Number cards count as their number, the jack, queen, and king ("face cards" or "pictures") count as 10, and aces count as either 1 or 11 according to the player's choice. If the total exceeds 21 points, it busts, and all bets on it immediately lose.

If the player busts, then the dealer does not need to play as indicated by "immediately lose". So the second case should never happen.

@CloseChoice
Copy link
Contributor

def is_natural(hand):  # Is this hand a natural blackjack?
    return sorted(hand) == [1, 10]

and i am also confused about natural, why here should be sorted?

because a natural does not depend on the order, you can either have an Ace + something worth a 10 or have it the other way round. So by sorting first the check is independent of order.

As for your other question:

on this line:
if action: # hit: add a card to players hand and return
the player after the reset should has two cards, but step into this line, the player has three cards, why ?

This should not be possible when I look at the current implementation. Can you write code that reproduces this? I would suspect that you did not call .reset() here.

@pseudo-rnd-thoughts
Copy link
Member

I believe that @CloseChoice is correct however if @peterhungh3 or @frischzenger disagree, please provide an example case where you believe an issue occurs

@peterhungh3
Copy link
Author

@pseudo-rnd-thoughts @CloseChoice

I think what @CloseChoice said is correct, and if that's what gets implemented.

Again, my recommendation is just to make the reward calculation function more explicit, because the reward depends on the state of the game, not just what's calculated in score() func I cited above.

Back to the case:
If we take a look at the step() func of this file:

If the user has a BJ, the action will be stand. But the step() function won't check if the user has a BJ or not, but will add cards to the dealer until the sum is above 17. As a result, the dealer may get to 21 with say 3 cards. Then the comp() func will return a reward of 0, instead of 1 or 1.5 for users (users always win when having BJ, unless dealer also has BJ).

Again, this is a small point. My recommendation is just to make the reward calculation func taking into consideration the state of the game, not just the score of the hand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants