In this notebook, adversarial examples are created from PubMed texts. The resulting adversarial texts are annotated and added to the existing PubMed corpus. Finally, SciBERT, RoBERTa and PubMedBERT models are trained on the extended data.


In [None]:
#mount google drive to use data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


STEP 1: Select PubMed texts and create adversarial examples from them.

In [None]:
#install TextAttack, which is used to generate the adversarial examples
!pip3 install textattack

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textattack
  Downloading textattack-0.3.8-py3-none-any.whl (418 kB)
Collecting datasets==2.4.0
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting lemminflect
  Downloading lemminflect-0.2.3-py3-none-any.whl (769 kB)
Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
Collecting OpenHowNet
  Downloading OpenHowNet-2.0-py3-none-any.whl (18 kB)
Collecting terminaltables
  Downloading terminaltables-3.1.10-py2.py3-none-any.whl (15 kB)
Collecting bert-score>=0.3.5
  Downloading bert_score-0.3.12-py3-none-any.whl (60 kB)
Collecting language-tool-python
  Downloading language_tool_python-2.7.1-py3-none-any.whl (34 kB)
C

In [None]:
#create adversarial examples with the help of textattack augmenters
from textattack.augmentation import Augmenter, CharSwapAugmenter, WordNetAugmenter
from textattack.transformations import WordSwapHowNet
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
import glob
import nltk
nltk.download('omw-1.4')

def get_text_paths(percentage:int, text_directory:str):
  """
  Count the text files that match the glob pattern text_directory
  and return the paths of the first `percentage` percent of them
  """
  #total number of files in the directory
  files = glob.glob(text_directory) 
  file_amount = len(files)

  #how many texts should be read
  amount_to_read = int(file_amount/100 * percentage)  
  print("amount to read: " + str(amount_to_read))
 
  #get names of N files to read
  files_to_read = files[:amount_to_read] 
  print(files_to_read)
  return files_to_read

def create_adv_texts(paths_to_texts):
  """
  Create one adversarial text per input file: every line first gets a
  synonym swap and then a character swap. The result is written to
  advers_path as adv<N>.txt
  """
  adv_counter = 1
  
  for path in paths_to_texts:
    #resume an interrupted run: skip the texts that were already processed
    if adv_counter < 14:
      adv_counter += 1
      continue
    print("process text " + str(adv_counter))
    adv_text = ""   
    with open(path) as f:
      all_lines = f.readlines()      
      for line in all_lines:
        #fallback for texts that could not be augmented with the
        #HowNet synonym augmenter: use the WordNet augmenter instead
        if adv_counter in [6, 10, 14]:
          syn_result = wnAugmenter.augment(line)
        else:
          syn_result = synAugmenter.augment(line)        
        changed_data = syn_result[0]
        result = csAugmenter.augment(changed_data)
        print(result)
        adv_text += result[0]       
    with open(advers_path +'adv'+ str(adv_counter) + ".txt", 'w') as writer:
      writer.write(adv_text)
    adv_counter +=1        
  return "done"

def create_synonym_swap_augmenter():
  transformation = WordSwapHowNet()
  constraints = [RepeatModification(), StopwordModification()]
  synSwapAugmenter = Augmenter(transformation=transformation, constraints=constraints, pct_words_to_swap=0.05, transformations_per_example=1)
  return synSwapAugmenter

#start point
csAugmenter = CharSwapAugmenter(pct_words_to_swap=0.05, transformations_per_example=1)
wnAugmenter = WordNetAugmenter(pct_words_to_swap=0.05, transformations_per_example=1)
synAugmenter = create_synonym_swap_augmenter()

path = "/content/drive/MyDrive/ExtractedPubMedArticles/done/*.txt"
advers_path = "/content/drive/MyDrive/ExtractedPubMedArticles/adv/"

texts_to_read = get_text_paths(10, path)
create_adv_texts(texts_to_read)


        


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


amount to read: 15
['/content/drive/MyDrive/ExtractedPubMedArticles/done/34535716.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/28530238.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/23826101.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/23939394.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/24411942.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/28145098.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/34193219.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/24579088.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/29160417.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/24647473.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/25432738.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/34199925.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/27391467.txt', '/content/drive/MyDrive/ExtractedPubMedArticles/done/27405968.txt', '/content/drive/MyDrive/Extr
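
To make the character-swap step of the cell above more concrete, here is a minimal pure-Python sketch that perturbs roughly 5% of the words in a line by swapping two neighbouring inner characters. It is a simplified stand-in for illustration only and not TextAttack's `CharSwapAugmenter` implementation; the function names `swap_adjacent_chars` and `perturb_line`, the fixed seed and the sample sentence are assumptions made up for this example.

```python
import random

def swap_adjacent_chars(word, rng):
    """Swap two neighbouring inner characters; short words are returned unchanged."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # keep the first and last character fixed
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb_line(line, pct_words_to_swap=0.05, seed=0):
    """Perturb about pct_words_to_swap of the words in a line (at least one word)."""
    rng = random.Random(seed)
    words = line.split()
    if not words:
        return line
    n_swaps = max(1, round(len(words) * pct_words_to_swap))
    # only words with at least four characters can be perturbed
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        words[i] = swap_adjacent_chars(words[i], rng)
    return " ".join(words)

# hypothetical sample sentence, not taken from the corpus
original = "The assay detects viral antigen in nasopharyngeal swab samples"
print(perturb_line(original))
```

The perturbed line keeps the same words in the same order, so entity annotations made on the adversarial text stay comparable to those on the source text.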

STEP 2: Convert Label Studio JSON files into .spacy data
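
The conversion relies on shifting each paragraph-level annotation offset by the length of the text that precedes the paragraph. Below is a minimal sketch of that arithmetic, assuming every Label Studio task carries `data.text` and `annotations[].result[].value.start/end`, as read by this notebook. The helper name `merge_paragraphs` and the sample tasks are made up for illustration.

```python
def merge_paragraphs(tasks, label="MedTech"):
    """Concatenate paragraph texts and shift entity offsets accordingly."""
    full_text = ""
    entities = []
    for task in tasks:
        offset = len(full_text)  # position where this paragraph starts
        full_text += task["data"]["text"].strip() + " "
        for annotation in task["annotations"]:
            for res in annotation["result"]:
                value = res["value"]
                entities.append((offset + value["start"], offset + value["end"], label))
    return full_text.strip(), entities

# hypothetical tasks: the second paragraph has no annotation at all
tasks = [
    {"data": {"text": "A rapid test was used."},
     "annotations": [{"result": [{"value": {"start": 2, "end": 12}}]}]},
    {"data": {"text": "No device is mentioned here."},
     "annotations": []},
    {"data": {"text": "Samples were run on a PCR analyzer."},
     "annotations": [{"result": [{"value": {"start": 22, "end": 34}}]}]},
]

text, entities = merge_paragraphs(tasks)
for start, end, label in entities:
    print(start, end, label, repr(text[start:end]))
```

The offset is taken before the paragraph is appended, so paragraphs without annotations still advance the position. The sketch assumes the paragraph text has no leading whitespace; otherwise `strip()` would shift the annotated offsets.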

In [None]:
import glob
import json
import os
import spacy
from spacy.tokens import DocBin
import itertools

#path to the directory with the annotated data; each subdirectory contains 10 annotated texts
path_to_labelled_texts = "/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts"

def get_file_paths():
  """
  collect the paths of all JSON files below the directory
  AnnotatedTexts and split them into dev files and train files
  """
  dev_files = []
  train_files = []
  start_dir = path_to_labelled_texts
  #get all subdirectories that contain annotated data for dev-set
  dev_subdirs = ["/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/81-90",
                 "/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/91-100",
                 "/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/101-110",
                 ]
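  #NOTE: to_exclude is defined but not applied in the filter below,
  #so 131-140 is still part of the train set
  to_exclude = ["/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/131-140"]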
  #get all subdirectories that contain annotated data for train-set
  train_subdirs = [x[0] for x in os.walk(start_dir) if x[0] != start_dir and not x[0].endswith("ipynb_checkpoints") and x[0] not in dev_subdirs] 
  print(train_subdirs)
  
  #collect texts for spacy dev-set
  for item in dev_subdirs:   
    text_files = [f for f in os.listdir(item) if f.endswith('.json')]
    abs_paths = [item + "/" + f for f in text_files]    
    dev_files.extend(abs_paths)
  print(dev_files)

  #collect texts for spacy train-set
  for item in train_subdirs:   
    text_files = [f for f in os.listdir(item) if f.endswith('.json')]
    abs_paths = [item + "/" + f for f in text_files]    
    train_files.extend(abs_paths)
  print(train_files)

  return dev_files, train_files

def read_data(path_to_file: str):
  """
  path_to_file: path to a LS Json file
  read JSON file and save it as a dictionary
  """
  with open(path_to_file) as f:
    data = f.read().strip()
    text_info = json.loads(data)    
  return text_info

def get_entity_positions(ls_json):
  """
  ls_json: content of a Label Studio JSON file
  determine the start and end position of each IVD concept in the
  concatenated text and collect them as (start_index, end_index, label)
  return: text + list of (start_index, end_index, label) for each concept in this text
  """
  label = "MedTech"
  total_text = ""  
  entities = []    
  for par in ls_json: 
    #text paragraph    
    data = par["data"]
    text = data["text"].strip() + " "
    #offset of this paragraph in the concatenated text; taken before the
    #paragraph is appended so that paragraphs without annotations are counted too
    total_size = len(total_text)
    total_text += text   
    annotations = par["annotations"]
    for annot in annotations:
      for res in annot["result"]:          
        value = res["value"]                   
        total_start = total_size + value["start"]
        total_end = total_size + value["end"]
        entities.append((total_start, total_end, label))
  return (total_text.strip(), entities)

def correct_entity_positions(doc, start, end, label):
  """
  align a character span with token boundaries; if the given positions
  do not match, try shifting the end or the start by one character
  return: the aligned span, or None if no alignment was found
  """
  span = doc.char_span(start, end, label=label)
  if span is not None:
    return span
  span = doc.char_span(start, end + 1, label=label)
  if span is not None:
    return span
  span = doc.char_span(start, end - 1, label=label)
  if span is not None:
    return span
  span = doc.char_span(start + 1, end, label=label)
  if span is not None:
    return span
  span = doc.char_span(start - 1, end, label=label)
  if span is not None:
    return span
  print(f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n{repr(doc.text)}")
  return None

def remove_doublicated_entries(annotations):
  """
  remove duplicates and overlapping entities in place;
  for nested entities the outer one is kept
  """
  to_remove = []
  for a, b in itertools.combinations(annotations, 2):
    #b lies inside a (or is identical to a)
    if a[0]<=b[0] and a[1]>=b[1]:
      to_remove.append(b)
      continue
    #a lies inside b
    if a[0]>=b[0] and a[1]<=b[1]:
      to_remove.append(a)
      continue
    #a starts before b and overlaps its beginning
    if a[0]<b[0] and a[1]>b[0]:
      to_remove.append(a)
      continue
    #a starts inside b and overlaps its end
    if a[0]>b[0] and a[0]<b[1]:
      to_remove.append(a)

  for item in to_remove:
    if item in annotations:
      annotations.remove(item)
  return annotations

#start point
train_data_path = "/content/drive/MyDrive/SpacyData/textkorpusAdv/train.spacy"
dev_data_path = "/content/drive/MyDrive/SpacyData/textkorpusAdv/dev.spacy"
dev_data, train_data = get_file_paths()
spacy_dev_items = []
spacy_train_items = []

nlp = spacy.blank("en")
db_train = DocBin()
db_dev = DocBin()

#create dev-set
for item in dev_data:  
  #read LS JSON
  info = read_data(item)  
  #create items for spacy data format
  text, entities = get_entity_positions(info)
  #create span-index item
  spacy_item = [text, {"entities" : entities}]  
  spacy_dev_items.append(spacy_item)

for text, annotations in spacy_dev_items:
  #process annotations  
  doc = nlp.make_doc(text) 
  ents = []
  entity_annot = annotations["entities"]
  #drop duplicates and overlapping entities in place
  remove_doublicated_entries(entity_annot)    
  for start, end, label in entity_annot:    
    span = correct_entity_positions(doc, start, end, label)     
    if span is not None:
      ents.append(span)   
  doc.ents = ents
  db_dev.add(doc)
#save data in SpaCy format
db_dev.to_disk(dev_data_path)


#create train-set
for item in train_data:  
  #read LS JSON
  info = read_data(item)  
  #create items for spacy data format
  text, entities = get_entity_positions(info)
  #create span-index item
  spacy_item = [text, {"entities" : entities}]  
  spacy_train_items.append(spacy_item)

for text, annotations in spacy_train_items:
  #process annotations  
  doc = nlp.make_doc(text) 
  ents = []
  entity_annot = annotations["entities"]
  #drop duplicates and overlapping entities in place
  remove_doublicated_entries(entity_annot)    
  for start, end, label in entity_annot:    
    span = correct_entity_positions(doc, start, end, label)     
    if span is not None:
      ents.append(span)   
  doc.ents = ents
  db_train.add(doc)
#save data in SpaCy format
db_train.to_disk(train_data_path)
print("done")

['/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/1-10', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/11-20', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/21-30', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/31-40', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/41-50', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/51-60', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/61-70', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/71-80', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/111-120', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/121-130', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/131-140', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/141-150', '/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/adv_examples']
['/content/drive/MyDrive/TextcorpusCreation/AnnotatedTexts/81-90/r25393025.json', '/content/drive/MyDrive/TextcorpusCreation/An
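
spaCy rejects overlapping spans when they are assigned to `doc.ents`, which is why duplicates and overlaps have to be dropped before the documents are built. As an alternative to comparing all pairs, the following is a minimal sketch of a greedy filter that keeps the longest span of each overlapping group. The helper name `drop_overlaps` and the sample offsets are assumptions and not part of the notebook. spaCy's own `spacy.util.filter_spans` applies a similar longest-first rule to `Span` objects.

```python
def drop_overlaps(entities):
    """Keep the longest span from each group of overlapping (start, end, label) tuples."""
    kept = []
    # longest span first; ties are broken by the earlier start position
    ordered = sorted(set(entities), key=lambda e: (e[0] - e[1], e[0]))
    for start, end, label in ordered:
        # accept the span only if it touches none of the spans kept so far
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# hypothetical offsets for illustration
entities = [
    (0, 10, "MedTech"),
    (0, 10, "MedTech"),   # exact duplicate
    (4, 8, "MedTech"),    # nested in the first span
    (8, 15, "MedTech"),   # partial overlap with the first span
    (20, 30, "MedTech"),  # no overlap
]
print(drop_overlaps(entities))
```

Unlike in-place removal, this variant returns a new list and leaves the input untouched, which makes it easier to log what was dropped.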

STEP 3: Install the libraries required for training


In [None]:
#switch the runtime to GPU, then check the CUDA version (currently 11.2)
#!nvidia-smi

# install PyTorch 1.10.0 for CUDA 11.1
!pip3 install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# install spaCy transformers tuned for CUDA 11.1
!pip3 install -U spacy[cuda111,transformers]==3.2.0
!pip3 install transformers[sentencepiece]

# install spacy transformer pipeline
!python -m spacy download en_core_web_trf

# CuPy: NumPy-compatible array library for the GPU (left disabled)
#!pip3 install cupy

!pip3 install numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.10.0+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl (2137.6 MB)
tcmalloc: large alloc 1147494400 bytes
tcmalloc: large alloc 1434370048 bytes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl (460.2 MB)
Installing collected packages: en-core-web-trf
Successfully installed en-core-web-trf-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
Looking in indexes: https://pypi.org/simple, https://us-pyth
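
Because the pinned versions (torch 1.10.0+cu111, spaCy 3.2.0) have to match the Colab runtime, it can help to confirm what actually got installed before training starts. The following is a small sketch of such a check; the helper names `installed_version` and `report_environment` are made up for this example, and the set of modules to inspect is an assumption.

```python
import importlib

def installed_version(module_name):
    """Return the module's __version__, "unknown" if it has none, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # any import failure, including a broken install, is reported as None
        return None
    return getattr(module, "__version__", "unknown")

def report_environment(modules=("torch", "spacy", "transformers")):
    """Collect the versions of the given modules and, if torch is present, the CUDA status."""
    report = {name: installed_version(name) for name in modules}
    if report.get("torch") is not None:
        import torch
        report["cuda_available"] = torch.cuda.is_available()
    return report

for name, value in report_environment().items():
    print(f"{name}: {value}")
```

If `cuda_available` is reported as `False`, the runtime type is probably still set to CPU and should be switched before the training cells are run.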

STEP 4: Train adversarial models

In [None]:
#adversarial SciBERT model
!python -m spacy train /content/drive/MyDrive/SpacyData/scibert_adv_config.cfg --output /content/drive/MyDrive/SpacyData/models/scibert_adv

In [None]:
#adversarial RoBERTa model
!python -m spacy train /content/drive/MyDrive/SpacyData/roberta_adv_config.cfg --output /content/drive/MyDrive/SpacyData/models/roberta_adv

In [None]:
#adversarial PubMedBERT model
!python -m spacy train /content/drive/MyDrive/SpacyData/pubmed_bert_adv_config.cfg --output /content/drive/MyDrive/SpacyData/models/pubmed_bert_adv
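
The three training cells differ only in the config file and the output directory, so the commands can also be generated in a loop. The sketch below builds the same argument lists; the helper name `build_train_command` is made up, and the `--gpu-id 0` option is an assumption about the runtime. The original cells do not pass it, and according to the spaCy CLI documentation `--gpu-id` defaults to -1, which means CPU.

```python
BASE = "/content/drive/MyDrive/SpacyData"
MODELS = ["scibert_adv", "roberta_adv", "pubmed_bert_adv"]

def build_train_command(model, gpu_id=0):
    """Return the argument list for one `spacy train` run."""
    return [
        "python", "-m", "spacy", "train",
        f"{BASE}/{model}_config.cfg",
        "--output", f"{BASE}/models/{model}",
        "--gpu-id", str(gpu_id),  # assumption: GPU 0 is available in the runtime
    ]

commands = [build_train_command(model) for model in MODELS]
for cmd in commands:
    print(" ".join(cmd))
    # to start training: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the model names in one list makes it harder for a config file and its output directory to drift apart when a further model is added.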