# Prompt Template Generator
---
This notebook generates a list of new prompt templates for each relation by paraphrasing the relation's original template with the Pegasus paraphrase model. Paraphrases that do not contain exactly one `[X]` and one `[Y]` are then filtered out, and the remaining templates are exported as a JSON file.  

---


Paraphraser: https://huggingface.co/tuner007/pegasus_paraphrase   
Dataset with original templates: https://huggingface.co/datasets/lama  



In [None]:
%%capture
!pip install transformers
!pip install datasets
!pip install sentencepiece

In [None]:
# Troubleshooting: https://stackoverflow.com/questions/65854722/huggingface-albert-tokenizer-nonetype-error-with-colab

import pandas as pd
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_paraphrase' 
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

# Paraphrase a single string.
# The input is wrapped in a one-element list for the tokenizer. The tokenizer also accepts a list of
# several strings, but the flat output would then need to be regrouped per input.
# Returns a list of num_return_sequences paraphrases of input_text.
def paraphrase_pegasus(input_text, num_return_sequences, num_beams=10, max_tokens=20):
  # max_tokens caps both the tokenized input (via truncation) and the generated output
  batch = tokenizer([input_text], truncation=True, padding='longest', max_length=max_tokens, return_tensors="pt").to(torch_device)
  # NOTE: temperature only takes effect when sampling (do_sample=True); with plain beam search it is most likely ignored
  translated = model.generate(**batch, max_length=max_tokens, num_beams=num_beams, num_return_sequences=num_return_sequences, temperature=1.5)
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
  return tgt_text

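`paraphrase_pegasus` wraps its input in a one-element list before tokenizing. If several strings were passed in one batch instead, `generate` would return `num_return_sequences` rows per input, one input after another, so the decoded list would have to be split back into one list per input. A minimal sketch of that regrouping step follows; `group_by_input` is a hypothetical helper, not part of the notebook, and the `decoded` list stands in for the output of `tokenizer.batch_decode`.

```python
from typing import List


def group_by_input(flat_outputs: List[str], num_inputs: int,
                   num_return_sequences: int) -> List[List[str]]:
    """Split a flat list of decoded sequences into one list per input string."""
    expected = num_inputs * num_return_sequences
    if len(flat_outputs) != expected:
        raise ValueError(f"expected {expected} sequences, got {len(flat_outputs)}")
    return [
        flat_outputs[i * num_return_sequences:(i + 1) * num_return_sequences]
        for i in range(num_inputs)
    ]


# Stand-in for decoded model output: 2 input strings, 3 paraphrases each.
decoded = ["a1", "a2", "a3", "b1", "b2", "b3"]
grouped = group_by_input(decoded, num_inputs=2, num_return_sequences=3)
print(grouped)
```

Batching this way would trade one model call per relation for fewer, larger calls; whether that fits in GPU memory with 40 beams per input is something to check rather than assume.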

In [None]:
from datasets import load_dataset

dataset = load_dataset("lama")
dataset

No config specified, defaulting to: lama/trex


Downloading and preparing dataset lama/trex (download: 71.19 MiB, generated: 626.48 MiB, post-processed: Unknown size, total: 697.68 MiB) to /root/.cache/huggingface/datasets/lama/trex/1.1.0/430016dd70224564ad385a96e0e4a3f88aeb5beaf4e34a8cf65b390fbc83aed7...


Dataset lama downloaded and prepared to /root/.cache/huggingface/datasets/lama/trex/1.1.0/430016dd70224564ad385a96e0e4a3f88aeb5beaf4e34a8cf65b390fbc83aed7. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['uuid', 'obj_uri', 'obj_label', 'sub_uri', 'sub_label', 'predicate_id', 'sub_surface', 'obj_surface', 'masked_sentence', 'template', 'template_negated', 'label', 'description', 'type'],
        num_rows: 1304391
    })
})

In [None]:
train_df = dataset["train"].to_pandas()
train_df.sample(5)

Unnamed: 0,uuid,obj_uri,obj_label,sub_uri,sub_label,predicate_id,sub_surface,obj_surface,masked_sentence,template,template_negated,label,description,type
437168,43da065d-6002-44ec-bb01-3f11bfc15748,Q419,Peru,Q298,Chile,P47,Chile,Peru,"As is the case with many gulls, it has traditi...",[X] shares border with [Y] .,[X] does not share border with [Y] .,shares border with,"countries or administrative subdivisions, of e...",N-M
5139,5f830228-5233-4785-96e8-c6a876b2a994,Q168052,Lancaster,Q985279,Morecambe,P131,Morecambe,Lancaster,It is an active Anglican parish church in the ...,[X] is located in [Y] .,[X] is not located in [Y] .,located in the administrative territorial entity,the item is located on the territory of the fo...,N-1
488091,fdd2bef6-de6d-4bf0-b070-c7645bcc5f3a,Q865,Taiwan,Q17,Japan,P47,Japan,Taiwan,"It is found in China, India, Russia, [MASK] an...",[X] shares border with [Y] .,[X] does not share border with [Y] .,shares border with,"countries or administrative subdivisions, of e...",N-M
951700,92797ef8-9120-40c5-a02b-871cec231b34,Q38,Italy,Q39,Switzerland,P530,Switzerland,Italy,The program of the party included the idea of ...,[X] maintains diplomatic relations with [Y] .,[X] does not maintain diplomatic relations wit...,diplomatic relation,diplomatic relations of the country,N-M
635283,9a32b64b-ab6e-475b-bc8c-06c48d100262,Q829,Utah,Q1522,New Mexico,P47,New Mexico,Utah,It was described from the Klamath Lakes area i...,[X] shares border with [Y] .,[X] does not share border with [Y] .,shares border with,"countries or administrative subdivisions, of e...",N-M


In [None]:
# GOAL: get a list of prompt templates for each relation.
# Loop over each relation id, pick a random row for that id (all rows of a relation share the same template),
# and generate new prompts from the original template.
import re

# Given a row, generate paraphrased prompt templates
def generate_prompts(row, num_paraphrases=40):
  template = row['template']
  # orig_prompt = template.replace('[X]', row['sub_label']).replace('[Y]', row['obj_label'])   # plug subject and object into template
  
  # Feed the template into the paraphraser as is (but remove the brackets first)
  orig_prompt = template.replace("[", "").replace("]", "") 
  new_sentences = paraphrase_pegasus(orig_prompt, num_paraphrases, num_beams=num_paraphrases)  # list of strings
  # Put the brackets back around standalone X and Y tokens.
  # Word boundaries avoid touching a capital X or Y inside an ordinary word (e.g. "You").
  new_sentences = [re.sub(r'\b([XY])\b', r'[\1]', sentence) for sentence in new_sentences]
  # Templates with missing or repeated placeholders are filtered out in a later cell
  print(new_sentences)
  return new_sentences

# Create a dict containing the generated prompt templates for each relation
subset_dict = {}
relations = train_df['predicate_id'].unique() 
for r in relations:
  subset_df = train_df[train_df['predicate_id'] == r].sample(1)
  subset_df['generated_prompts'] = subset_df.apply(generate_prompts, axis=1)
  subset_dict[r] = subset_df


['[Y] is where [X] is located.', '[Y] is the location of [X].', 'The location of [X] is in [Y].', '[X] is in [Y].', 'There is a place called [X] located in [Y].', 'The location of [X] is [Y].', 'There is a place called [X] in [Y].', '[Y] is located in [X].', 'You can find [X] in [Y].', '[X] is located in [Y].', 'In [Y] is where [X] is located.', '[Y] is the location for [X].', '[Y] is the place where [X] is located.', '[X] is in the same location as [Y].', 'In [Y], there is [X].', 'The place called [X] is located in [Y].', '[Y] is where [X] is.', '[X] is in the same area as [Y].', '[X] is in the same place as [Y].', 'There is [X] in [Y].', 'There is [X] located in [Y].', 'It is located in [Y].', 'In [Y] is the location of [X].', 'There is a street called [X] located in [Y].', '[Y] has [X] located in it.', 'There is a place named [X] located in [Y].', 'There is a street called [X] in [Y].', 'There is a person located in [Y].', '[X] is in the same city as [Y].', '[X] can be found in [Y].
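Restoring the brackets is the fragile step in this cell, because the paraphraser returns free text in which a capital X or Y can also start an ordinary word. The sketch below compares a chain of `str.replace` calls with a word-boundary regex on two inputs. Both function names and the second test sentence are made up for illustration; they are not model output.

```python
import re


def restore_chained(sentence: str) -> str:
    """Re-bracket X and Y with a chain of plain substring replacements."""
    return (sentence.replace(' X ', ' [X] ').replace(' Y ', ' [Y] ')
            .replace(' X.', ' [X].').replace(' Y.', ' [Y].')
            .replace('X ', '[X] ').replace('Y ', '[Y] ')
            .replace(' X', ' [X]').replace(' Y', ' [Y]'))


def restore_regex(sentence: str) -> str:
    """Re-bracket only standalone X and Y tokens, using word boundaries."""
    return re.sub(r'\b([XY])\b', r'[\1]', sentence)


simple = "Y is X's field."
tricky = "In Y, You can find X."  # hypothetical sentence with a word starting in Y

for s in (simple, tricky):
    print(restore_chained(s), "|", restore_regex(s))
```

On the simple sentence both approaches agree. On the second one the substring chain also rewrites the start of "You", while the regex leaves it alone.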

In [None]:
subset_dict['P1001']

Unnamed: 0,uuid,obj_uri,obj_label,sub_uri,sub_label,predicate_id,sub_surface,obj_surface,masked_sentence,template,template_negated,label,description,type,generated_prompts
707048,e97e884a-3ed6-4933-8665-29e8834ae811,Q17,Japan,Q274948,Prime Minister of Japan,P1001,Prime Minister,Japan,"The Twenty-One Demands (Japanese: 対華21ヶ条要求, Ta...",[X] is a legal term in [Y] .,[X] is not a legal term in [Y] .,applies to jurisdiction,"the item (an institution, law, public office ....",N-M,"[A legal term in [Y] is [X]., In [Y], [X] is a..."
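Sampling a single row per relation relies on every row of a relation carrying the same template. A small sketch of how that assumption could be checked is shown here on plain dictionaries; the helper names are hypothetical, and the rows are a hand-written stand-in for what `train_df.to_dict("records")` would yield.

```python
from collections import defaultdict


def templates_per_relation(rows):
    """Map each predicate_id to the set of distinct templates seen for it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["predicate_id"]].add(row["template"])
    return dict(seen)


def relations_with_multiple_templates(rows):
    """Return the relation ids that violate the one-template-per-relation assumption."""
    return sorted(r for r, t in templates_per_relation(rows).items() if len(t) > 1)


rows = [
    {"predicate_id": "P47", "template": "[X] shares border with [Y] ."},
    {"predicate_id": "P47", "template": "[X] shares border with [Y] ."},
    {"predicate_id": "P131", "template": "[X] is located in [Y] ."},
]
print(relations_with_multiple_templates(rows))
```

On the full DataFrame, `train_df.groupby('predicate_id')['template'].nunique()` should give the same information without building the list of records.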


In [None]:
# Export/Import dict of prompt templates to json file
import json
from google.colab import drive
drive.mount('/content/drive')

fileName = 'paraphrased_prompt_templates_unfiltered.json'

# Serialize subset_dict as a json file
def exportJson(dict_to_export):
  out_dict = {r: {'orig_template': sub_df['template'].iloc[0], 'generated_prompts': sub_df['generated_prompts'].iloc[0]} for r, sub_df in dict_to_export.items()}
  with open("/content/drive/MyDrive/" + fileName, "w") as outfile:
    json.dump(out_dict, outfile)
  print("Exported json.")

exportJson(subset_dict)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Exported json.


Unnamed: 0,uuid,obj_uri,obj_label,sub_uri,sub_label,predicate_id,sub_surface,obj_surface,masked_sentence,template,template_negated,label,description,type,generated_prompts
1072091,c5e5a6eb-b13d-4fb9-828f-b4ddfdbec71a,Q408,Australia,Q16,Canada,P530,Canada,Australia,It is a multinational project involving resear...,[X] maintains diplomatic relations with [Y] .,[X] does not maintain diplomatic relations wit...,diplomatic relation,diplomatic relations of the country,N-M,"[[X] and [Y] have diplomatic relations., [X] h..."


In [None]:
# Import pre-generated prompts from json file
def importJson():
  with open("/content/drive/MyDrive/"+fileName, "r") as f:
    data = json.load(f)
  return data
new_templates_dict = importJson()
new_templates_dict

{'P1001': {'generated_prompts': ['A legal term in [Y] is [X].',
   'In [Y], [X] is a legal term.',
   'There is a legal term called [X].',
   "It's a legal term in [Y].",
   '[X] is a legal term.',
   'The legal term is [X].',
   'A legal term is [X].',
   'There is a legal term in [Y] called [X].',
   'In [Y] there is a legal term called [X].',
   'The legal term in [Y] is [X].',
   '[Y] has a legal term called [X].',
   'In [Y], the legal term is [X].',
   'The legal term [X] is in [Y].',
   'In [Y], there is a legal term called [X].',
   'It is a legal term in [Y].',
   'There is a legal term in [Y].',
   'In [Y], the term [X] is a legal one.',
   'There is a legal term called [X] in [Y].',
   "There's a legal term in [Y] called [X].",
   '[Y] is the legal term for [X].',
   'In [Y], the term [X] is used as a legal term.',
   'There is a legal term for [X].',
   '[Y] is a legal term for [X].',
   "There's a legal term called [X] in [Y].",
   '[X] is a legal term in [Y].',
   'In [Y]
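The exported file maps each relation id to its original template and its list of generated prompts. The round trip can be tried without Google Drive as sketched below; the function names and the temporary path are assumptions for illustration, and the example entry reuses two of the P1001 prompts from the output above.

```python
import json
import os
import tempfile


def export_templates(templates: dict, path: str) -> None:
    """Write the relation -> templates mapping to a JSON file."""
    with open(path, "w") as f:
        json.dump(templates, f)


def import_templates(path: str) -> dict:
    """Read the relation -> templates mapping back from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)


example = {
    "P1001": {
        "orig_template": "[X] is a legal term in [Y] .",
        "generated_prompts": [
            "A legal term in [Y] is [X].",
            "In [Y], [X] is a legal term.",
        ],
    }
}

# Use a throwaway directory instead of a mounted Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "templates.json")
    export_templates(example, path)
    loaded = import_templates(path)

print(loaded == example)
```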

In [None]:
# Loop over the imported dict and filter out bad templates,
# e.g. multiple [X] or [Y], or a missing [X] or [Y].
import re

def template_has_exactly_one_x_and_y(template_string):
  """Return True if the template contains exactly one '[X]' and exactly one '[Y]'."""
  num_x = len(re.findall(r'\[X\]', template_string))
  num_y = len(re.findall(r'\[Y\]', template_string))
  return num_x == 1 and num_y == 1

final_dict = {}
for r, entry in new_templates_dict.items():
  subdict = {}
  subdict['orig_template'] = entry['orig_template']
  subdict['generated_prompts'] = [t for t in entry['generated_prompts'] if template_has_exactly_one_x_and_y(t)]
  final_dict[r] = subdict

final_dict['P101']

{'generated_prompts': ['The field of [Y] is being worked on by [X].',
  'The field of [Y] is worked by [X].',
  '[X] works in the field of [Y].',
  'The field of [Y] is worked on by [X].',
  'The field of [Y] is being worked in by [X].',
  '[X] is in the field of [Y].',
  'In the field of [Y], [X] works.',
  'The field of [Y] is worked in by [X].',
  '[X] is working in the field of [Y].',
  'The field of [Y] is where [X] works.',
  'The field of [Y] has [X] working in it.',
  '[X] does work in the field of [Y].',
  '[X] is involved in the field of [Y].',
  '[X] works in the field of [Y]',
  'The field of [Y] is occupied by [X].',
  "[Y] is [X]'s field.",
  '[Y] is the field that [X] works in.',
  'The field of [Y] is what [X] works in.',
  'There is a field of [Y] that [X] works in.',
  'The field of [Y] is being worked upon by [X].',
  '[Y] is a field that [X] works in.',
  '[X] is employed in the field of [Y].',
  '[Y] is the field [X] works in.',
  '[X] works in [Y].',
  "[Y] is [X]

In [None]:
# Print the number of valid prompt templates for each relation.
for k, val in final_dict.items():
  print(k, len(val['generated_prompts']))

P131 35
P17 35
P136 22
P276 35
P740 34
P140 33
P37 31
P413 28
P159 35
P19 25
P279 23
P27 12
P20 27
P103 31
P361 29
P30 35
P47 33
P190 36
P1303 29
P527 39
P495 34
P1001 23
P1376 28
P264 35
P530 34
P39 34
P101 33
P138 36
P463 27
P937 28
P176 31
P106 27
P364 30
P127 29
P449 29
P407 33
P36 28
P31 35
P178 36
P108 27
P1412 32
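The counts above are what survives the placeholder check. A compact sketch of that check together with the fraction kept is given below; it uses `str.count` instead of a regex, `filter_report` is a hypothetical helper, and the candidates are taken from the generated output shown earlier in the notebook.

```python
def has_exactly_one_x_and_y(template: str) -> bool:
    """True if the template contains exactly one '[X]' and exactly one '[Y]'."""
    return template.count("[X]") == 1 and template.count("[Y]") == 1


def filter_report(candidates):
    """Return the valid templates and the fraction of candidates that were kept."""
    kept = [t for t in candidates if has_exactly_one_x_and_y(t)]
    return kept, len(kept) / len(candidates)


candidates = [
    "[X] is located in [Y].",             # kept
    "It is located in [Y].",              # dropped: no [X]
    "There is a legal term called [X].",  # dropped: no [Y]
    "[Y] is located in [X].",             # kept, although the roles are reversed
]

kept, fraction = filter_report(candidates)
print(len(kept), fraction)
```

Assuming each relation started from the default `num_paraphrases=40` candidates, a count such as `P27 12` would mean that under a third of the paraphrases survived. The check is purely syntactic: a template that swaps the roles of `[X]` and `[Y]` still passes, so the surviving templates may need a manual review.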


In [None]:
# Finally, export the cleaned prompts dict to json
final_fileName = 'paraphrased_prompt_templates.json'
with open("/content/drive/MyDrive/" + final_fileName, "w") as outfile2:
  json.dump(final_dict, outfile2)
  print("Exported json.")

Exported json.
