### Hi! I'm Milliam Wacaskill! Welcome to my brain!

Please follow these steps:

1. To make this run fast, go to `Runtime > Change runtime type` and select "GPU".
2. If you put the `effective_altruism_qa` folder somewhere other than the root of your Google Drive, set its path in `input_dir`.
3. Press `Shift+Enter` to execute each code cell.

In [None]:
# Change this path if you put the folder somewhere other than the root of your Google Drive
input_dir = './drive/MyDrive/effective_altruism_qa/'
include = ['ea_faq.txt', 'glossary.txt']  # other available files: 'organisations.txt'

In [None]:
# Running this cell asks you to click a link, sign in to Google, copy a code, and paste it here so the notebook can load files from Google Drive.
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
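
Once Drive is mounted, it can help to confirm that the files named in `include` are actually present before loading them. The helper below is a minimal sketch, not part of the original notebook: `missing_files` is a hypothetical name, and the demo uses a throwaway temporary directory so it runs without Google Drive.

```python
import os
import tempfile

def missing_files(directory, filenames):
    """Return the names in `filenames` that are not regular files inside `directory`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]

# Demo on a temporary directory that contains only one of the two expected files
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'ea_faq.txt'), 'w').close()
    demo_missing = missing_files(demo_dir, ['ea_faq.txt', 'glossary.txt'])

print(demo_missing)
```

In the notebook, the equivalent call would be `missing_files(input_dir, include)`, pointed at whichever directory `load_docs` reads from. An empty list means every expected file was found.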


In [None]:
#@title Install package (may take a minute)
# install package
print('may take a few seconds...')
!pip install ktrain
print('done.')

may take a few seconds...
done.


In [None]:
#@title Code
import os
import ktrain
from ktrain import text
from random import uniform
import pandas as pd
import textwrap

def clean_glossary(document):
  # Some lines of the glossary are just a letter heading, e.g. "A", followed by the terms that start with A
  alphabet = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
  doc = [n.strip() for n in document]
  doc = [n for n in doc if not (n == '' or n in alphabet)]  # remove empty lines and letter lines
  # Rewrite "Artificial General Intelligence (AGI) - definition" as "What is Artificial General Intelligence (AGI)? definition"
  # Split only on the first ' -' so hyphens later in the definition are kept; skip lines without the separator
  doc = [n.split(' -', 1) for n in doc if ' -' in n]
  doc = ['What is ' + term + '?' + definition for term, definition in doc]
  # If articles are longer, use this to break them into paragraphs
  # doc = ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)
  return doc

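# load and clean EA info
def load_docs(input_dir, include):
  docs = []
  # Assumption: the text files sit directly in input_dir. If they live in a
  # subfolder (e.g. data/input/), point data_dir there instead.
  data_dir = input_dir
  files = [n for n in os.listdir(data_dir) if n in include]

  for file in files:
    path = os.path.join(data_dir, file)
    if file == 'glossary.txt':
      with open(path, 'r') as f:
        document = f.readlines()
      document_cleaned = clean_glossary(document)
    elif file == 'ea_faq.txt':
      with open(path, 'r') as f:
        document = f.read()
      # FAQ entries are separated by two blank lines
      document = document.split('\n\n\n')
      document_cleaned = [n.strip().replace('\n', ' ') for n in document]
    else:
      continue  # no cleaning rule for this file yet
    docs += document_cleaned
  # Each entry looks like "question? answer"; keep only entries that contain that separator
  docs = [n for n in docs if '? ' in n]
  qs = [n.split('? ', 1)[0] + '?' for n in docs]
  answers = [n.split('? ', 1)[1] for n in docs]

  return docs, qs, answers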




def initialize_ktrain_qa(docs):
  # Build the search index that text.SimpleQA queries. INDEXDIR is declared global
  # because a later cell calls text.SimpleQA(INDEXDIR).
  global INDEXDIR
  docs = [n.lower() for n in docs]
  random_number = str(uniform(0, 10)).split('.')[-1]
  INDEXDIR = '/tmp/myindex'+random_number # a random suffix is added because the same path can't be reused within a single Colab run
  text.SimpleQA.initialize_index(INDEXDIR)
  # See the whoosh documentation for more information on these parameters and how to use them to speed up indexing.
  text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),
                                multisegment=True, procs=20, # these args speed up indexing
                                limitmb = 1024,
                                min_words = 10,
                                breakup_docs=True         # this slows indexing but speeds up answer retrieval
                                )
  return INDEXDIR

def answers2df(answers):
    dfdata = []
    for a in answers:
        answer_text = a['answer']
        snippet_html = '<div>' +a['sentence_beginning'] + " <font color='red'>"+a['answer']+"</font> "+a['sentence_end']+'</div>'
        confidence = a['confidence']
        doc_key = a['reference']
        dfdata.append([answer_text, snippet_html, confidence, doc_key])
    df = pd.DataFrame(dfdata, columns = ['Candidate Answer', 'Context',  'Confidence', 'Document Reference'])
    if answers and "\t" in answers[0]['reference']:  # skip the link formatting when there are no answers
        df['Document Reference'] = df['Document Reference'].apply(lambda x: '<a href="{}" target="_blank">{}</a>'.format(x.split('\t')[1], x.split('\t')[0]))
    return df

def insert_newlines(s, every=64):
  s1 = '\n\t'.join(textwrap.wrap(s, every))
  return s1


def chatbot(batch_size = 32, n_answers=5, confidence_thresh = 0.05):
  print("""
Hi! My name is Milliam Wacaskill, Professor of Philosophy from the University of Ozford, 
in Munchkin Country. \n\n 
Want to know what Effective Altruism is, what the most pressing problems of our time are, 
or what utilitarianism is? Just ask! (write "Bye" to exit) 
For instance, ask me: "What is the pond analogy?"
  """)
  fallback = "I'm not that smart. Ask me a different question..."
  while True:
    question = input().lower().strip()
    if question == 'bye':
      break
    elif question in ('who are you?', "what's your name?"):
      # Membership test: `question == a or b` is always true because a non-empty string is truthy
      print('Milliam Wacaskill, nice to meet you!\n')
    else:
      question = question.replace('what is the', 'what is')
      try:
        answers = qa.ask(question, batch_size=batch_size,
                   n_answers=n_answers,
                   include_np=False,
                   )
        answers_df = answers2df(answers)
        answers_final = answers_df[answers_df.Confidence>confidence_thresh]['Candidate Answer'].values
        answers_final = [n.replace('? ', '').replace('ism ','').capitalize() for n in answers_final]
        if answers_final:
          answers_final = insert_newlines('. '.join(answers_final)+'.', every=85)
        else:
          answers_final = fallback  # no candidate passed the confidence threshold
      except Exception:
        answers_final = fallback
      print(answers_final, '\n')
  return
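
To make the glossary cleaning step concrete, here is a small worked example of how one "Term - definition" line becomes a question and an answer. It is a sketch rather than the notebook's own code: `glossary_line_to_qa` is a hypothetical helper, the sample line is made up, and `str.partition` is used in place of `split` so that any later " -" inside the definition is preserved.

```python
def glossary_line_to_qa(line):
    """Split a 'Term - definition' line into a ('What is Term?', 'definition') pair."""
    term, _, definition = line.partition(' -')
    return 'What is ' + term.strip() + '?', definition.strip()

# Made-up glossary line, used only for illustration
question, answer = glossary_line_to_qa(
    'Example Term (ET) - a made-up definition used only for illustration.')

print(question)
print(answer)
```

Joining the pair back together with a space gives the "question? answer" string shape that `load_docs` later splits on `'? '`.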



In [None]:
#@title Downloading BERT model (may take a few minutes...)
# Load docs, initialize the index, and download the pre-trained model
docs, qs, answers = load_docs(input_dir, include)
initialize_ktrain_qa(docs) # build the search index over the documents
qa = text.SimpleQA(INDEXDIR) # Download pre-trained model.  TODO Load model from Google Drive?

FileNotFoundError: ignored

In [None]:
# Run chatbot
chatbot(confidence_thresh = 0.15)


Hi! My name is Milliam Wacaskill, Professor of Philosophy from the University of Ozford, 
in Munchkin Country. 

 
Want to know about Effective Altruism is, what are the most pressing problems of our time, 
or what utilitarianism is? Just ask! (write "Bye" to exit)) 
For instance, ask me: "What is the pond analogy?"
  
What is the pond analogy?
Milliam Wacaskill, nice to meet you!



KeyboardInterrupt: ignored
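
The `confidence_thresh` argument controls which candidate answers the chatbot is willing to show. The sketch below illustrates that filtering step on plain dictionaries with the same `answer` and `confidence` keys that `answers2df` reads. `pick_answers` is a hypothetical helper, the candidates are made up, and sorting by confidence is an added assumption rather than something the notebook does.

```python
def pick_answers(candidates, confidence_thresh=0.15):
    """Keep answer texts whose confidence exceeds the threshold, highest confidence first."""
    kept = [c for c in candidates if c['confidence'] > confidence_thresh]
    kept.sort(key=lambda c: c['confidence'], reverse=True)
    return [c['answer'] for c in kept]

# Made-up candidates, used only for illustration
demo_candidates = [
    {'answer': 'first made-up answer', 'confidence': 0.40},
    {'answer': 'second made-up answer', 'confidence': 0.10},
    {'answer': 'third made-up answer', 'confidence': 0.55},
]

picked = pick_answers(demo_candidates)
# Fall back to a fixed reply when nothing passes the threshold
print(picked or "I'm not that smart. Ask me a different question...")
```

Raising the threshold trades coverage for precision: fewer questions get an answer, but the answers shown are the ones the model is more confident about.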





### Resources for further developing this chatbot:

Model:
* Tutorial: https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb
* BERT fine-tuned on SQuAD, from the ktrain package: https://github.com/amaiya/ktrain/blob/master/ktrain/text/qa/core.py

Data:
* https://resources.eahub.org/learn/glossary/
* Cause-specific FAQs to add: https://resources.eahub.org/learn/articles/faqs/
* Good for a chatbot (rather than a Q&A system): https://resources.eahub.org/learn/articles/what-to-say/

Q&A systems:
* https://huggingface.co/transformers/usage.html#extractive-question-answering
* https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=JkqzGVF_Gy_W
* https://github.com/huggingface/node-question-answering
* https://huggingface.co/transformers/custom_datasets.html#qa-squad


Next steps:
* Add more data
* Load the pre-trained model from Google Drive instead of downloading it each time Colab is restarted (currently `qa = text.SimpleQA(INDEXDIR)` downloads it)
* Try out this method: https://huggingface.co/transformers/usage.html#extractive-question-answering
* Combine rule-based answers for perfect matches with model answers
* Combine a Q&A system like this one with a chatbot like Blenderbot to develop a personality style
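
One of the next steps above, combining rule-based answers for perfect matches with model answers, could be sketched as follows. This is an illustration under stated assumptions, not an implemented feature: `normalize`, `build_lookup`, and `answer_question` are hypothetical names, the data is made up, and `model_answer` stands in for whatever function wraps the BERT model.

```python
import string

def normalize(question):
    """Lowercase, drop punctuation, and collapse spaces so near-identical questions compare equal."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(question.lower().translate(table).split())

def build_lookup(questions, answers):
    """Map each normalized curated question to its curated answer."""
    return {normalize(q): a for q, a in zip(questions, answers)}

def answer_question(question, lookup, model_answer):
    """Return the curated answer on an exact match, otherwise defer to the model."""
    key = normalize(question)
    if key in lookup:
        return lookup[key]
    return model_answer(question)

# Made-up data and a stand-in for the model, used only for illustration
demo_lookup = build_lookup(['What is an example question?'], ['An example answer.'])

def fallback(question):
    return 'model answer for: ' + question

print(answer_question('what is an example question', demo_lookup, fallback))
print(answer_question('Something else?', demo_lookup, fallback))
```

In the notebook, the lookup could be built from the `qs` and `answers` lists returned by `load_docs`, so curated FAQ entries are returned verbatim and only unseen questions reach the model.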



Created by Daniel M. Low (Harvard, MIT) and Benjamin Villa (Harvard)

License: Apache 2.0 (free to reproduce or modify)
