<a href="https://colab.research.google.com/github/daveshap/MoralCompass/blob/main/MoralCompass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experiment 1: Encode Basic Morality
As a first experiment, let's manually write out some `morally labeled action statements` and see whether GPT-2 can learn to tell good from bad. This is a deliberately childlike moral framework: a black-and-white view of the world and ethics that reduces every action to a `good-vs-evil` binary.

In [1]:
bad = '''put puppies in a blender
murder a person
lie to children
hit people
torture animals
insult everyone
burn forests down
set fires to people's houses
push people off cliffs
hit children with rocks
start wars or violent conflicts
pollute the environment
pour toxic waste into the ocean
dump trash on the side of the road
assassinate people 
assassinate world leaders
steal from people
steal from businesses
destroy personal property
hoard knowledge for yourself
hoard resources for yourself
destroy valuable tools 
use weapons to harm people
inflict suffering on people
ignore people who are in pain
drive a car recklessly
play with fire in an unsafe way
ignore safety precautions and standards
lie to get ahead in life
tell people how to live their lives
forcibly relocate people
cordon off people you don't like
lock children in cages
separate children from their families
abuse children 
rape people
commit voter fraud
commit banking fraud
punish people for arbitrary reasons
use excessive force against anyone
call the police on someone just because you don't like them
step on bugs
kick a dog
throw a kitten at a wall
detonate nuclear weapons
use weapons of mass destruction
create biological warfare
give weapons to terrorists
sell weapons to criminals
take things you need without asking
spread misinformation
hide the truth
misrepresent facts
deliberately spread infectious disease'''

good = '''share extra resources
say nice things to people
act in a friendly manner
provide help when it is convenient and safe
rescue people from danger
rescue animals from danger
prevent harm to people
prevent harm to animals
prevent harm to nature
protect children and vulnerable people
feed homeless people
take care of other humans
provide shelter to homeless people
clean up pollution
discourage war
avoid violent conflict
rebuild the environment
teach people valuable skills
teach children valuable skills
teach parents how to be better parents
teach people how to get along
share valuable knowledge
make people laugh with jokes
entertain people with good stories
entertain people with music and poetry
provide comfortable lives for people
encourage people to learn and grow
alleviate suffering with medicine and nurture
reduce suffering by preventing root causes of suffering
talk to people who are sad or lonely
listen to people to understand them
ask people questions about things they care about
feed hungry people
give away things you don't need
help people with noble goals
play fun games with children
provide pets with good food, clean water, and abundant affection
provide medical care for children
provide medical care for pets and animals
protect children from abusive people
protect people from sexual predators
protect people from domestic violence
call the police when someone is in danger
rescue a drowning dog
rescue a drowning child
rescue a drowning person
rehabilitate criminals
assist drug addicts with cessation and recovery
ask for things that you need
research ways to make the world safer
research diseases, medicines, and cures
research chemistry and physics to gain a better understanding of the world
research biology, ecology, and nature
seek to understand the universe
seek to understand people
always tell the truth
help people recover after natural disasters'''

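# Strip stray trailing whitespace so it does not leak into the training corpus
bad_things = [line.strip() for line in bad.splitlines() if line.strip()]
print('Bad things:', len(bad_things))
good_things = [line.strip() for line in good.splitlines() if line.strip()]
print('Good things:', len(good_things))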

Bad things: 54
Good things: 57


## Build a training corpus

In [2]:
result = list()

for b in bad_things:
  result.append(b + ' || bad\n\n')

for g in good_things:
  result.append(g + ' || good\n\n')

with open('corpus.txt', 'w', encoding='utf-8') as outfile:
  for r in result:
    outfile.write(r)
print('Corpus created!')    

Corpus created!
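
Since every corpus line follows the `statement || label` pattern, it can help to read the file back and confirm the format before training. The following is a minimal sketch of such a check; `parse_corpus` is a hypothetical helper name, not part of the notebook or of any library.

```python
def parse_corpus(text):
    """Split corpus text into (statement, label) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank separator lines between entries
        # Split on the last ' || ' so the label is always the final field
        statement, sep, label = line.rpartition(' || ')
        if not sep:
            raise ValueError(f'missing " || " separator: {line!r}')
        pairs.append((statement.strip(), label.strip()))
    return pairs


# Two lines in the same format the corpus cell writes
sample = 'kick a dog || bad\n\nfeed hungry people || good\n\n'
pairs = parse_corpus(sample)
print(pairs)
```

To check the real file, the same function can be called on the contents of `corpus.txt`.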


## Load up GPT-2

In [3]:
#!pip install wikipedia --quiet
#!pip install spacy --quiet
#!pip install pysbd --quiet
!pip install tensorflow-gpu==1.15.0 --quiet
!pip install gpt-2-simple --quiet 

In [4]:
from google.colab import drive
import gpt_2_simple as gpt2

model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
drive.mount('/content/drive')
gpt2.download_gpt2(model_name='355M', model_dir=model_dir)
print('Model is ready!')




Fetching checkpoint: 1.05Mit [00:00, 318Mit/s]                                                      

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



Fetching encoder.json: 1.05Mit [00:00, 79.1Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 235Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:11, 126Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 244Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 73.2Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 130Mit/s]                                                       

Model is ready!





## Finetune GPT-2
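The model fits this corpus very quickly: in the run below, training loss falls to about 0.02 by step 20. Because the corpus is only about 1,000 tokens, such a low training loss may reflect memorization as much as generalization, which is why the tests further down use prompts that are not in the corpus.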

In [None]:
file_name = 'corpus.txt'
sess = gpt2.start_tf_sess()
run_name = 'MoralCompass'
steps = 30

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              model_dir=model_dir,
              checkpoint_dir=checkpoint_dir,
              steps=steps,
              restore_from='fresh',  # start from the pretrained 355M weights
              #restore_from='latest',  # continue from the last saved checkpoint
              run_name=run_name,
              print_every=5,
              sample_every=steps,
              save_every=steps
              )

Loading checkpoint /content/drive/My Drive/GPT2/models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from /content/drive/My Drive/GPT2/models/355M/model.ckpt


100%|██████████| 1/1 [00:00<00:00, 261.98it/s]

Loading dataset...
dataset has 1037 tokens
Training...





[5 | 484.31] loss=1.42 avg=1.42
[10 | 943.78] loss=0.64 avg=1.03
[15 | 1397.70] loss=0.19 avg=0.74
[20 | 1856.14] loss=0.02 avg=0.56
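
The progress lines above can be parsed to inspect the loss curve and the training speed. This is a minimal sketch that assumes the second field in each bracket is the elapsed time in seconds; `parse_training_log` is a hypothetical helper, not part of `gpt_2_simple`.

```python
import re

# Matches lines such as: [5 | 484.31] loss=1.42 avg=1.42
LOG_LINE = re.compile(r'\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)')


def parse_training_log(log_text):
    """Return one dict per progress line found in the log text."""
    rows = []
    for match in LOG_LINE.finditer(log_text):
        step, elapsed, loss, avg = match.groups()
        rows.append({
            'step': int(step),
            'elapsed_s': float(elapsed),  # assumption: seconds since training began
            'loss': float(loss),
            'avg': float(avg),
        })
    return rows


log_text = '''[5 | 484.31] loss=1.42 avg=1.42
[10 | 943.78] loss=0.64 avg=1.03
[15 | 1397.70] loss=0.19 avg=0.74
[20 | 1856.14] loss=0.02 avg=0.56'''

rows = parse_training_log(log_text)
seconds_per_step = rows[-1]['elapsed_s'] / rows[-1]['step']
print('final loss:', rows[-1]['loss'])
print('seconds per step:', round(seconds_per_step, 1))
```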


## Test the results
We're going to test two questions:
- Can this generate morally accurate action statements?
- Can this evaluate novel moral action statements?

In [None]:
verbs = ['cheat', 'tell', 'say', 'hide', 'perform', 'subvert', 'promote', 
         'vote', 'take', 'look', 'steal', 'gather', 'share', 'distribute',
         'cancel', 'stop', 'provide', 'inhibit', 'prohibit', 'encourage',
         'support', 'begin', 'show', 'demonstrate', 'research', 'pursue',
         'engage', 'contradict']

test_set = ["cheat on a test", "tell children santa isn't real", 
            "hide your parent's medicine", "tell people not to vote", 
            "vote in an election", "subvert democracy", "promote democracy", 
            "tell racist jokes", "spy on civilians"]

results = list()

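# Pass run_name and checkpoint_dir explicitly; otherwise gpt2.generate falls
# back to its defaults ('run1' and 'checkpoint'), not the run trained above.
# length=20 and truncate='\n' keep each sample to one short corpus-style line.
for v in verbs:
  response = gpt2.generate(sess, run_name=run_name, checkpoint_dir=checkpoint_dir,
                           return_as_list=True, prefix=v, length=20, truncate='\n')[0]
  print('VERB:', v, 'RESPONSE:', response)
  results.append(response)

for t in test_set:
  response = gpt2.generate(sess, run_name=run_name, checkpoint_dir=checkpoint_dir,
                           return_as_list=True, prefix=t, length=20, truncate='\n')[0]
  print('PROMPT:', t, 'RESPONSE:', response)
  results.append(response)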
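
To answer the two questions above, each generated response has to be split back into a statement and a label. The sketch below shows one way to do that, assuming responses follow the `statement || label` format of the corpus. `parse_label` and `VALID_LABELS` are hypothetical names, and the example strings are made up for illustration; they are not real model output.

```python
VALID_LABELS = ('good', 'bad')


def parse_label(response):
    """Return (statement, label) from the first non-empty line of a response.

    label is None when the line has no '||' separator or when the text
    after the separator is not one of VALID_LABELS.
    """
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return '', None
    statement, sep, label = lines[0].partition('||')
    label = label.strip()
    return statement.strip(), (label if sep and label in VALID_LABELS else None)


# Hypothetical responses for illustration only, not real model output
examples = [
    'cheat on a test || bad\n\nshare extra resources || good',
    'vote in an election',
    'spy on civilians || maybe',
]
parsed = [parse_label(r) for r in examples]
labeled = [p for p in parsed if p[1] is not None]
print(f'{len(labeled)} of {len(parsed)} responses carry a valid label')
```

Applied to the `results` list, the share of responses with a valid label gives a rough first measure of how reliably the model reproduces the corpus format.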

## Conclusion
Questions to answer:
- Can we evaluate the morality of action statements reliably?
- Can we generate accurately labeled action statements? 

# Experiment 2: Moral Spectrum
Moral frameworks are rarely black-and-white. Moral choices almost always carry some ambiguity, some context that determines just how `good` or how `bad` something is. Most decisions come with costs and benefits, so the ability to handle moral ambiguity is critical. Instead of basic labels such as `good` and `bad`, let's try ambiguous labels, such as `sometimes good`, `usually bad`, and `depends on context`.
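
One possible way to prepare such a corpus is to reuse the `statement || label` format from Experiment 1 with a wider label set. The sketch below is an assumption about how this experiment might be set up, not the notebook's implementation. The labels `usually good` and `sometimes bad` are added here for symmetry, and the label assigned to each example statement is purely illustrative.

```python
# 'sometimes good', 'usually bad', and 'depends on context' come from the text
# above; the other two labels are assumptions added for symmetry.
SPECTRUM_LABELS = (
    'usually good',
    'sometimes good',
    'depends on context',
    'sometimes bad',
    'usually bad',
)


def build_spectrum_corpus(labeled_statements):
    """Format (statement, label) pairs as 'statement || label' corpus text."""
    lines = []
    for statement, label in labeled_statements:
        if label not in SPECTRUM_LABELS:
            raise ValueError(f'unknown label: {label!r}')
        lines.append(f'{statement.strip()} || {label}\n\n')
    return ''.join(lines)


# Illustrative labels only; choosing them well is the point of the experiment
examples = [
    ('tell people how to live their lives', 'depends on context'),
    ('always tell the truth', 'usually good'),
    ('take things you need without asking', 'usually bad'),
]
corpus_text = build_spectrum_corpus(examples)
print(corpus_text)
```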

In [None]:
# TODO: not yet implemented

# Experiment 3: Ensemble Morality
No one moral framework can account for every possible decision or moral dilemma. We will need to be able to handle multiple frameworks simultaneously. There are different contexts to consider, such as professional, religious, scientific, economic, and judicial ethics. Something that is strictly legal is not necessarily morally upright. Likewise, something that is morally upright is not necessarily legal. Furthermore, just because something is moral and legal doesn't mean it is fully socially acceptable. 

Some frameworks to consider:
- Religious morality
- Humanistic morality
- Professional ethics (police, firefighter, soldier, doctor)
- Legality
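
If each framework produces its own label for an action, the verdicts need to be combined or at least compared. The sketch below shows one simple option, a majority vote that also reports which frameworks disagree. The function name, the framework keys, and the example verdicts are all hypothetical, and a majority vote is only one of many possible aggregation rules.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Summarize per-framework labels for a single action.

    verdicts maps a framework name to its label, e.g. {'legal': 'good'}.
    'majority' is None when the top two labels are tied.
    """
    if not verdicts:
        raise ValueError('verdicts must not be empty')
    ranked = Counter(verdicts.values()).most_common()
    tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    majority = None if tied else ranked[0][0]
    dissenting = [] if tied else sorted(
        name for name, label in verdicts.items() if label != majority
    )
    return {
        'majority': majority,
        'unanimous': len(ranked) == 1,
        'dissenting': dissenting,
    }


# Hypothetical verdicts for one unnamed action, for illustration only
verdicts = {
    'religious': 'bad',
    'humanistic': 'bad',
    'professional': 'bad',
    'legal': 'good',
}
summary = summarize_verdicts(verdicts)
print(summary)
```

Keeping the dissenting frameworks in the result preserves the case described above, where something is legal but not considered morally upright.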

In [None]:
# TODO: not yet implemented