HIDE AND SEEK


NEED FOR IMPROVEMENT IN SINGLE AGENT RL

  • collecting demonstrations and specifying reward functions can be costly and time-consuming.
  • once an RL agent has found one way to perform a task, it has no further incentive to improve.
  • alternatives such as unsupervised exploration scale poorly to complex environments.

BEFORE ACTUAL TRAINING

  • the agents were first trained on a number of other tasks, such as object permanence, navigation and construction, which led to much better results in the actual game of hide and seek.

REWARD SYSTEM

  • it is a team-based reward system.
  • +1 for the hiders and -1 for the seekers if all hiders stay hidden.
  • -1 for the hiders and +1 for the seekers if any hider is spotted.
  • -10 for any agent that goes outside the arena.
  • each episode lasts 240 timesteps; the first 40% is preparation time, during which the seekers are held in place and no reward is given (see the sketch after this list).
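A minimal sketch of this reward scheme in Python. The predicates `any_hider_seen` and `outside_arena` are illustrative assumptions; the actual environment derives them from line-of-sight checks and arena geometry.

```python
EPISODE_LEN = 240                     # timesteps per episode
PREP_STEPS = int(0.4 * EPISODE_LEN)   # first 40% is preparation time

def team_rewards(t: int, any_hider_seen: bool) -> tuple[float, float]:
    """Return (hider_reward, seeker_reward) for one timestep."""
    if t < PREP_STEPS:
        return 0.0, 0.0    # no reward during the preparation phase
    if any_hider_seen:
        return -1.0, +1.0  # a hider was spotted
    return +1.0, -1.0      # all hiders remain hidden

def agent_reward(team_reward: float, outside_arena: bool) -> float:
    """Per-agent reward: team reward plus an out-of-bounds penalty."""
    return team_reward + (-10.0 if outside_arena else 0.0)
```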


POLICY OPTIMISATION

  • this uses two networks: a policy network that produces an action distribution, and a critic network that predicts discounted future returns.
  • policy optimisation is done using proximal policy optimisation (PPO).
  • it utilises centralised training with decentralised execution.
  • that is, at execution time each agent acts from its own observations, but all agents share the same policy weights, and the critic can see the full environment state during training (a minimal sketch follows this list).
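A minimal sketch (PyTorch, not the paper's actual code) of this actor-critic setup: one policy network whose weights are shared by all agents, a critic that reads the full centralised state, and the clipped PPO surrogate loss. All dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps each agent's own observation to an action distribution."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.body(obs))

class CriticNet(nn.Module):
    """Predicts the discounted future return from the full (centralised) state."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, 1))

    def forward(self, state):
        return self.body(state).squeeze(-1)

def ppo_loss(policy, critic, obs, states, actions,
             old_logp, returns, advantages, clip_eps: float = 0.2):
    """Clipped PPO surrogate plus a value-function regression loss."""
    dist = policy(obs)
    ratio = torch.exp(dist.log_prob(actions) - old_logp)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (critic(states) - returns).pow(2).mean()
    return policy_loss + 0.5 * value_loss
```

Weight sharing here just means every agent queries the same `PolicyNet` instance with its own observation, which is what makes execution decentralised while training stays centralised.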

OBSERVATIONS

  • we notice that there was no direct incentive to use ramps or to develop these strategies; they emerged solely from adapting to a changing environment, i.e. the opposing team.
  • hiders learn an efficient division of labour: for instance, while constructing a shelter, each agent brings its own separate box.
  • the agents were then evaluated on several other tasks requiring cognition/memory and construction skills; surprisingly, agents pretrained on hide and seek performed better than agents trained from scratch.

COMPARISON WITH INTRINSIC MOTIVATION

  • intrinsic motivation relies heavily on exploration, but exploration can be 'noisy', i.e. not useful.
  • with count-based intrinsic motivation, box usage is initially higher than with self-play, but as the number of boxes and agents grows, agent movement and in particular box movement decrease.
  • this shows that methods like count-based exploration scale poorly: as the environment grows, what counts as interesting or worth exploring has to be hand-specified (see the sketch below).
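A minimal sketch of a count-based exploration bonus (not OpenAI's implementation; `discretize` and its rounding are illustrative assumptions). The bonus 1/sqrt(N(s)) rewards rarely visited states, but `discretize` must be hand-designed, and whichever features it keeps are the only ones the agent is encouraged to vary — which is exactly why this approach scales poorly in larger environments.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)

def discretize(state) -> tuple:
    """Hand-specified state abstraction, e.g. rounded positions.
    Choosing these features is the hard part in large environments."""
    return tuple(round(x, 1) for x in state)

def exploration_bonus(state) -> float:
    """Intrinsic reward that decays as a state is revisited."""
    key = discretize(state)
    visit_counts[key] += 1
    return 1.0 / math.sqrt(visit_counts[key])
```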
