
No issue here, just wanted to ask a couple of questions #1

Open
aleksandarilic95 opened this issue Jun 29, 2023 · 6 comments

@aleksandarilic95

Heyy,

First of all, great project, I'm glad to see it succeeded! I'm working on a similar project of getting an A2C model to learn to play Yamb (a variation of Yahtzee played with 6 dice where you keep 5, and the categories are different), so I was wondering if you could find a couple of minutes to answer some questions about how you got the agent to learn, since we're dealing with a pretty similar problem.

Sorry for opening an issue; I just had no idea how to message you other than this :)

Regards,
Aleksandar

@dionhaefner
Owner

Hey,

I'll help if I can - I'm by no means an expert on RL. I assume you have seen this?

@aleksandarilic95
Author

Yeah, I've read it a couple of times, but some of the stuff I have questions about is only in the code, and I've never worked with Jax/Haiku before, so fully understanding the code is not the easiest task in the world (I'm working with PyTorch).
Basically, what was the minimum you needed to do to get the learning process going? I've dumbed down the game to only 6 categories (1 through 6), and I want to see at least some proof of learning before I start scaling it up.
My approach was to use a simple MLP, just as you did, and to zero out the probabilities of invalid actions (or rather, set their logits to -inf) before applying the softmax to the output of the last layer. My action space is basically 2 ** 6 re-roll actions plus 6 actions for the categories. But even after 500k episodes, the model still makes random moves.
How did you represent the state of the game? Did you do any scaling on the input variables? I cannot find it in your code, or rather I don't know where to look. My state representation is [roll_no, dice_1, dice_2, dice_3, ..., dice_6, category_1_available, category_1_points, ..., category_6_available, category_6_points], so basically a tensor with 19 elements.
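
For context, the masking looks roughly like this on my end (a simplified sketch; the shapes, names, and the example mask are just for illustration):

```python
import torch
import torch.nn.functional as F

# Invalid actions get a logit of -inf, so their probability is exactly 0 after the softmax.
def masked_policy(logits: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, n_actions); legal_mask: (batch, n_actions) bool, True = legal."""
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    return F.softmax(masked_logits, dim=-1)

# 2 ** 6 re-roll actions plus 6 category actions = 70 actions in total.
n_actions = 2 ** 6 + 6
logits = torch.randn(1, n_actions)
legal_mask = torch.ones(1, n_actions, dtype=torch.bool)
legal_mask[0, 2 ** 6:] = False  # e.g. a state where the category actions are illegal
probs = masked_policy(logits, legal_mask)
assert probs[0, 2 ** 6:].sum() == 0.0
```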

> I'll help if I can - I'm by no means an expert on RL.
You got it to work, which is all the expertise I need :)

@aleksandarilic95
Author

aleksandarilic95 commented Jun 29, 2023

Also, what was the reward system you used? I'm curious how you rewarded the first two rolls, where the bot is supposed to re-roll the dice.

@dionhaefner
Owner

> Basically, what was the minimum you needed to do to get the learning process going?

I didn't explore this very well so I can't say much. Pre-training helped a lot in my case. You could try (lots of) supervised pre-training iterations with a simple heuristic, then see what happens if you switch on the agent (for example, does it revert to random play, keep the pre-trained policy, or start learning on top of it).
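
Roughly, in PyTorch terms (a sketch, not my actual code; plug in your own `policy_net`, optimizer, and heuristic):

```python
import torch
import torch.nn.functional as F

# Sketch of one supervised pre-training (behaviour cloning) step:
# push the policy towards the actions a hand-written heuristic would pick.
def pretrain_step(policy_net, optimizer, states, legal_masks, heuristic_actions):
    """states: (batch, state_dim); legal_masks: (batch, n_actions) bool;
    heuristic_actions: (batch,) long, always a legal action."""
    logits = policy_net(states)
    logits = logits.masked_fill(~legal_masks, float("-inf"))
    loss = F.cross_entropy(logits, heuristic_actions)  # -log pi(heuristic action | state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```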

> How did you represent the state of the game? Did you do any scaling on the input variables? I cannot find it in your code, or rather I don't know where to look. My state representation is [roll_no, dice_1, dice_2, dice_3, ..., dice_6, category_1_available, category_1_points, ..., category_6_available, category_6_points], so basically a tensor with 19 elements.

That's almost the same as what I use; the only difference is that I pass counts of each die face. So it's [roll_number, number_of_ones, number_of_twos, ...].
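
Just the dice part of the state, as a hypothetical helper (the category features would follow as in your version):

```python
import torch

def encode_state(roll_number: int, dice: list[int]) -> torch.Tensor:
    """[roll_number, number_of_ones, number_of_twos, ..., number_of_sixes]."""
    counts = [dice.count(face) for face in range(1, 7)]
    return torch.tensor([roll_number] + counts, dtype=torch.float32)

# Second roll of [1, 1, 3, 4, 6, 6] -> tensor([2., 2., 0., 1., 1., 0., 2.])
print(encode_state(2, [1, 1, 3, 4, 6, 6]))
```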

> Also, what was the reward system you used? I'm curious how you rewarded the first two rolls, where the bot is supposed to re-roll the dice.

I use a reward of 0 for the first two rolls. Only the final roll receives a reward (the score).
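
So within a turn the returns work out like this (a sketch with made-up numbers; whether you discount within a turn is up to you):

```python
# Reward is 0 for the two re-roll steps; only the scoring step gets the category score.
rewards = [0.0, 0.0, 24.0]   # e.g. four sixes kept and scored in the sixes category
gamma = 1.0                  # no discounting within the turn

# Compute the return for every step of the turn (backwards accumulation).
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)

print(returns)  # [24.0, 24.0, 24.0]: with gamma = 1, every step sees the final score
```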

@aleksandarilic95
Author

> I didn't explore this very well so I can't say much. Pre-training helped a lot in my case. You could try (lots of) supervised pre-training iterations with a simple heuristic, then see what happens if you switch on the agent (for example, does it revert to random play, keep the pre-trained policy, or start learning on top of it).

That's what I've tried as well, but maybe I did not do an appropriate amount of pre-training.
[screenshot: training curve over episodes; the switch from pre-training to training is visible]

This is with 15,000 episodes each of pre-training and training; you can clearly see where the switch happens. What concerns me is that when it stops using the heuristic, the loss becomes really small, which suggests it thinks it's learning, but it's just making random moves.

I'll try something like 200k episodes of pre-training followed by 15k episodes of training with eps = 0 to see if it still behaves the same, but at this point I'm starting to worry that there might be a bug in my implementation, since it thinks it's learning but it really isn't.
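
One way I can think of to tell the two apart is to log the entropy of the masked policy: if it stays near the maximum (uniform over the legal actions), the moves really are random no matter how small the loss gets. Rough sketch:

```python
import torch

def masked_policy_entropy(probs: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of the policy; assumes more than one legal action per state."""
    p = probs.clamp_min(1e-12)                          # avoid log(0) for masked actions
    entropy = -(probs * p.log()).sum(dim=-1)
    max_entropy = legal_mask.sum(dim=-1).float().log()  # entropy of a uniform policy over legal actions
    return entropy / max_entropy                        # ~1.0 means (near-)random play
```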

@dionhaefner
Owner

Looks like you have a bug in your implementation :)
