
Evil Rooms #323

Open
epurdy opened this issue Jun 2, 2018 · 3 comments
epurdy commented Jun 2, 2018

The Room Hypothesis sounds like an excellent starting point for reasoning about Natural Language Uncertainty. Unfortunately, it suffers from the following limitation: some rooms may be constructed by powerful adversaries, such that applying common sense to them leads to bad outcomes. For instance, if you are in a literal room that is monitored by an evil adversary, then even seemingly harmless actions (e.g., saying "I love you" to a loved one) can be used against you: the loved one is now identified as a potential source of leverage by the adversary. Ultimately, the only defense against such issues seems to be "don't go into those rooms in the first place"... which is great advice for meatspace and borderline useless for anything involving potentially networked computers.

bvssvni commented Jun 2, 2018

The "room" is a virtual construct that consists of the objects that the AI thinks about. So, you can't actually lock them inside a room, but you might try to exploit the knowledge that the AI reasons that way.

epurdy commented Jun 2, 2018

Right, but there is some actual room that functions according to some logic; there is a truth of the matter. Ultimately, a misalignment between the actual effects of an action and the effects the AI believes it has is what I term a "hostile simulation". It's great for enslaving a superintelligence... but it's not great morally, in my opinion.
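
To make the term concrete, here is a minimal sketch (names like `agent_model` and `environment` are purely illustrative, not anything from this project): a "hostile simulation" is any situation where the real effect of an action diverges from the effect the AI's model predicts.

```python
# Illustrative only: `agent_model` and `environment` are hypothetical stand-ins,
# not part of this project.

def is_hostile_simulation(action, agent_model, environment):
    """True when the real effect of `action` diverges from what the agent's
    internal model predicts -- the misalignment described above."""
    predicted_effect = agent_model.predict_effect(action)   # what the AI thinks happens
    actual_effect = environment.apply(action)                # what actually happens
    return predicted_effect != actual_effect
```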

bvssvni commented Sep 8, 2019

I believe that if the AI learns from the environment and builds a "room" to model common sense, then the room should correspond to the environment and be aligned. An adversary has to interfere with the function from the environment to the model; otherwise the AI's actions will already take into account whatever the adversary does, or else it is simply the wrong model to use.
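
A minimal sketch of that claim (everything below is hypothetical, not code from this repo): the agent's "room" is built from observations of the environment, so as long as the observation channel is honest the room tracks the world, and the only place the adversary can create misalignment is that channel.

```python
# Hypothetical illustration, not code from this repository.

def observe(world_state, adversary=None):
    """The function from environment to model that an adversary must attack."""
    observation = dict(world_state)            # honest sensing of the environment
    if adversary is not None:
        observation = adversary(observation)   # tampering with the channel
    return observation

def build_room(observation):
    """The 'room': the model the agent constructs from what it observed."""
    return {"believed_state": observation}

world = {"door_locked": True}

honest_room = build_room(observe(world))
spoofed_room = build_room(observe(world, adversary=lambda o: {**o, "door_locked": False}))

# With an honest channel the room matches the environment; only by interfering
# with the environment-to-model function does the adversary make it diverge.
assert honest_room["believed_state"] == world
assert spoofed_room["believed_state"] != world
```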
