
No issue here, just wanted to ask a couple of questions #1

Open
aleksandarilic95 opened this issue Jun 29, 2023 · 6 comments

@aleksandarilic95

Heyy,

First of all, great project, I'm glad to see it succeeded! I'm working on a similar project of getting an A2C model to learn to play Yamb (a variation of Yahtzee played with 6 dice where you keep 5, and the categories are different), so I was wondering if you could find a couple of minutes to answer some questions about how you got the agent to learn, since we're dealing with a pretty similar problem.

Sorry for opening an issue; I just had no idea how to message you other than this :)

Regards,
Aleksandar

@dionhaefner
Owner

Hey,

I'll help if I can - I'm by no means an expert on RL. I assume you have seen this?

@aleksandarilic95
Author

Yeah, I've read it a couple of times, but some of the stuff I have questions about is only in the code, and I've never worked with Jax/Haiku before, so fully understanding the code is not the easiest task in the world (I'm working with PyTorch).
Basically, what was the minimum you needed to do to get the learning process going? I've dumbed down the game to only 6 categories (1 through 6), and I want to see at least some proof of learning before I start scaling it up.
My approach was to use a simple MLP, just as you did, and to zero out the probabilities of invalid actions (or rather, set their logits to -inf) before applying the softmax to the output of the last layer. My action space is basically 2 ** 6 re-roll actions plus 6 actions for the categories. But even after 500k episodes, the model still makes random moves.
How did you represent the state of the game? Did you do any scaling on the input variables? I cannot find it in your code, or rather I don't know where to look. My state representation is [roll_no, dice_1, dice_2, dice_3, ..., dice_6, category_1_available, category_1_points, ..., category_6_available, category_6_points], so basically a tensor with 19 elements.
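
For context, the masking looks roughly like this on my end (a simplified sketch; the shapes, names, and the example mask are just for illustration):

```python
import torch
import torch.nn.functional as F

# Invalid actions get a logit of -inf, so their probability is exactly 0 after the softmax.
def masked_policy(logits: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, n_actions); legal_mask: (batch, n_actions) bool, True = legal."""
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    return F.softmax(masked_logits, dim=-1)

# 2 ** 6 re-roll actions plus 6 category actions = 70 actions in total.
n_actions = 2 ** 6 + 6
logits = torch.randn(1, n_actions)
legal_mask = torch.ones(1, n_actions, dtype=torch.bool)
legal_mask[0, 2 ** 6:] = False  # e.g. a state where the category actions are illegal
probs = masked_policy(logits, legal_mask)
assert probs[0, 2 ** 6:].sum() == 0.0
```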

> I'll help if I can - I'm by no means an expert on RL.
You got it to work, which is all the expertise I need :)

@aleksandarilic95
Author

aleksandarilic95 commented Jun 29, 2023

Also, what was the reward system you used? I'm curious how you rewarded the first two rolls, where the bot is supposed to re-roll the dice.

@dionhaefner
Owner

> Basically, what was the minimum you needed to do to get the learning process going?

I didn't explore this very well so I can't say much. Pre-training helped a lot in my case. You could try (lots of) supervised pre-training iterations with a simple heuristic, then see what happens if you switch on the agent (for example, does it revert to random play, keep the pre-trained policy, or start learning on top of it).
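
Roughly, in PyTorch terms (a sketch, not my actual code; plug in your own `policy_net`, optimizer, and heuristic):

```python
import torch
import torch.nn.functional as F

# Sketch of one supervised pre-training (behaviour cloning) step:
# push the policy towards the actions a hand-written heuristic would pick.
def pretrain_step(policy_net, optimizer, states, legal_masks, heuristic_actions):
    """states: (batch, state_dim); legal_masks: (batch, n_actions) bool;
    heuristic_actions: (batch,) long, always a legal action."""
    logits = policy_net(states)
    logits = logits.masked_fill(~legal_masks, float("-inf"))
    loss = F.cross_entropy(logits, heuristic_actions)  # -log pi(heuristic action | state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```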

> How did you represent the state of the game? Did you do any scaling on the input variables? I cannot find it in your code, or rather I don't know where to look. My state representation is [roll_no, dice_1, dice_2, dice_3, ..., dice_6, category_1_available, category_1_points, ..., category_6_available, category_6_points], so basically a tensor with 19 elements.

That's almost the same as what I use; the only difference is that I pass counts of each die face. So it's [roll_number, number_of_ones, number_of_twos, ...].
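
Just the dice part of the state, as a hypothetical helper (the category features would follow as in your version):

```python
import torch

def encode_state(roll_number: int, dice: list[int]) -> torch.Tensor:
    """[roll_number, number_of_ones, number_of_twos, ..., number_of_sixes]."""
    counts = [dice.count(face) for face in range(1, 7)]
    return torch.tensor([roll_number] + counts, dtype=torch.float32)

# Second roll of [1, 1, 3, 4, 6, 6] -> tensor([2., 2., 0., 1., 1., 0., 2.])
print(encode_state(2, [1, 1, 3, 4, 6, 6]))
```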

> Also, what was the reward system you used? I'm curious how you rewarded the first two rolls, where the bot is supposed to re-roll the dice.

I use a reward of 0 for the first two rolls. Only the final roll receives a reward (the score).
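
So within a turn the returns work out like this (a sketch with made-up numbers; whether you discount within a turn is up to you):

```python
# Reward is 0 for the two re-roll steps; only the scoring step gets the category score.
rewards = [0.0, 0.0, 24.0]   # e.g. four sixes kept and scored in the sixes category
gamma = 1.0                  # no discounting within the turn

# Compute the return for every step of the turn (backwards accumulation).
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)

print(returns)  # [24.0, 24.0, 24.0]: with gamma = 1, every step sees the final score
```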

@aleksandarilic95
Author

> I didn't explore this very well so I can't say much. Pre-training helped a lot in my case. You could try (lots of) supervised pre-training iterations with a simple heuristic, then see what happens if you switch on the agent (for example, does it revert to random play, keep the pre-trained policy, or start learning on top of it).

That's what I've tried as well, but maybe I did not do an appropriate amount of pre-training.
[screenshot: training curve over episodes; the switch from pre-training to training is visible]

This is with 15,000 episodes each of pre-training and training; you can clearly see where the switch happens. What concerns me is that when it stops using the heuristic, the loss becomes really small, which suggests it thinks it's learning, but it's just making random moves.

I'll try something like 200k episodes of pre-training followed by 15k episodes of training with eps = 0 to see if it still behaves the same, but at this point I'm starting to worry that there might be a bug in my implementation, since it thinks it's learning but it really isn't.
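
One way I can think of to tell the two apart is to log the entropy of the masked policy: if it stays near the maximum (uniform over the legal actions), the moves really are random no matter how small the loss gets. Rough sketch:

```python
import torch

def masked_policy_entropy(probs: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of the policy; assumes more than one legal action per state."""
    p = probs.clamp_min(1e-12)                          # avoid log(0) for masked actions
    entropy = -(probs * p.log()).sum(dim=-1)
    max_entropy = legal_mask.sum(dim=-1).float().log()  # entropy of a uniform policy over legal actions
    return entropy / max_entropy                        # ~1.0 means (near-)random play
```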

@dionhaefner
Owner

Looks like you have a bug in your implementation :)
