## Improving Lexical Simplification Using State of the Art Lexical Complexity Prediction Models
#### Demo notebook
This notebook demonstrates lexical simplification of sentences that contain multi-word expressions. It was tested on a system with the following specifications:
<ol>
    <li>OS: Linux x86-64</li>
    <li>CPU: 3.30 GHz x 8</li>
    <li>RAM: 40 GiB</li>
    <li>Hard Drive: 20 GiB</li>
    <li>GPU: NVIDIA Corporation GA104M (CUDA compute capability: 8.6)</li>
 </ol>

#### Run the following cells to import packages, configure settings, and load the models

In [1]:
# imports
import numpy as np
import torch
from transformers import BertTokenizer
from tqdm import tqdm
import re
import codecs
import nltk

from CWIs.complex_labeller import Complexity_labeller
from plainifier.plainify import *

import warnings
warnings.filterwarnings('ignore')

In [2]:
# settings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seed = 1234
np.random.seed(seed)
torch.manual_seed(seed)

two_gram_mwes_list = './CWIs/2_gram_mwe_50.txt'
three_gram_mwes_list = './CWIs/3_gram_mwe_25.txt'
four_gram_mwes_list = './CWIs/4_gram_mwe_8.txt'
pretrained_model_path = './CWIs/cwi_seq.model'
temp_path = './CWIs/temp_file.txt'

path = './plainifier/'
premodel = 'bert-large-uncased-whole-word-masking'
bert_dict = 'tersebert_pytorch_1_0.bin'
embedding = 'crawl-300d-2M-subword.vec'
unigram = 'unigrams-df.tsv'
tokenizer = BertTokenizer.from_pretrained(premodel)
Complexity_labeller_model = Complexity_labeller(pretrained_model_path, temp_path)

2022-04-06 14:16:24.249694: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-06 14:16:24.253520: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

In [3]:
# Load the BERT model, word embeddings, and unigram frequencies. This takes about 7 minutes.
model, similm, tokenfreq, embeddings, vocabulary2 = load_all(path, premodel, bert_dict, embedding, unigram, tokenizer)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading Embeddings


100%|██████████████████████████████| 2000000/2000000 [01:29<00:00, 22424.49it/s]


Loaded Embeddings
Loading Unigrams


100%|██████████████████████████████| 8394369/8394369 [05:52<00:00, 23831.34it/s]

Loaded Unigrams





#### Run the following cell to construct the sentence class

In [23]:
class ComplexSentence:
    # Sentence class
    def __init__(self, sentence, label_model, tokeniser, verbose=True, alpha=(1/9, 4/9, 4/9)):
        
        # pad every non-alphanumeric character with spaces so it becomes a separate token
        sentence = re.sub(r'(?<!(\s))([^a-zA-Z0-9\s])', r' \2', sentence)
        self.sentence = re.sub(r'([^a-zA-Z0-9\s])(?!(\s))', r'\1 ', sentence)
        
        self.tokenised_sentence = self.generate_tokenised_sentence()
        
        self.label_model = label_model
        self.verbose = verbose
        self.alpha = alpha

        if self.verbose:
            print(f'Untokenised sentence: {self.sentence}')
        
        self.label_complex_words()
    
    def generate_NER_filter(self, init=True):
        # Generate case sensitive tokens
        case_sensitive_tokens = nltk.tokenize.word_tokenize(self.sentence)
        pos_tags = nltk.pos_tag(case_sensitive_tokens)
        
        list_of_NERs = []
        for x in pos_tags:
            if x[1] == 'NNP':
                list_of_NERs.append(x[0].lower())
        
        # NER mask: np array, 0 if is_NER, 1 if not
        NER_mask = np.ones_like(self.tokenised_sentence, dtype=np.int64)
        for i in range(len(self.tokenised_sentence)):
            if self.tokenised_sentence[i] in list_of_NERs:
                NER_mask[i] = 0
        assert len(NER_mask) == len(self.tokenised_sentence)
        
        if self.verbose:
            if init and len(list_of_NERs) > 0:
                print('Found NEs:', list_of_NERs)
            elif init and len(list_of_NERs) == 0:
                print('No NE found.')
        return NER_mask
        
    def generate_tokenised_sentence(self):
        # tokenise once and reuse the result for both lookups
        processed = tokeniseUntokenise(self.sentence, tokenizer)
        tokens = processed['tokens']
        word_idx = processed['words']
        tokenised_sentence_list = []
        for idx_list in word_idx:
            if len(idx_list)==1:
                tokenised_sentence_list.append(np.array(tokens)[idx_list[0]])
            else:
                word_untokenised = ''
                for idx_list_untokenised in idx_list:
                    word_untokenised += np.array(tokens)[idx_list_untokenised].replace('##', '')
                tokenised_sentence_list.append(word_untokenised)
        return tokenised_sentence_list
    
    def known_complexity(self):
        # tokenise once and reuse the result for both lookups
        processed = tokeniseUntokenise(self.sentence, tokenizer)
        tokens = processed['tokens']
        word_idx = processed['words']
        known_index = []
        for idx_list in word_idx:
            if len(idx_list)==1 and not re.match(r'^[_\W]+$', tokens[idx_list[0]]):
                #If known label as True
                known_index.append(True)
            else:
                #If unknown label as False
                known_index.append(False)
        return known_index
    
    def label_complex_words(self, init=True):
        
        # applying complexity labeller to the sentence
        Complexity_labeller.convert_format_string(self.label_model, self.sentence)
        if init:
            self.bin_labels = Complexity_labeller.get_bin_labels(self.label_model)[0]
        self.probs = Complexity_labeller.get_prob_labels(self.label_model)
        
        # apply known complexity and NER mask
        self.bin_labels = np.multiply(self.bin_labels, self.known_complexity())
        self.bin_labels = np.multiply(self.bin_labels, self.generate_NER_filter(init=init))
        self.probs = np.multiply(self.probs, self.known_complexity())
        self.probs = np.multiply(self.probs, self.generate_NER_filter(init=False))
        
        self.is_complex = bool(np.sum(self.bin_labels) >= 1)

        self.complexity_ranking = np.argsort(np.array(self.bin_labels) * np.array(self.probs))[::-1]
        self.most_complex_word = self.tokenised_sentence[self.complexity_ranking[0]]

        if self.verbose and init:
            print(f'Complex probs: {self.probs}')
            print(f'Binary complexity labels: {self.bin_labels}')

        if self.is_complex:
            print(f'\t Most complex word: {self.most_complex_word} \n')
        else:
            print('\t Simplification complete or no complex expression found.\n')
    
    def find_MWEs_w_most_complex_word(self, n_gram, filepath):
        # finds the n-gram MWE that contains the most complex word in the sentence, if any
        # sets self.idx_to_plainify to the MWE positions, or to the complex word position if no MWE matches
        
        complex_word_pos = self.complexity_ranking[0]

        if complex_word_pos - n_gram + 1 > 0:
            sliding_start = complex_word_pos - n_gram + 1
        else:
            sliding_start = 0
        
        if complex_word_pos + n_gram - 1 < len(self.complexity_ranking):
            sliding_end = complex_word_pos
        else:
            sliding_end = len(self.complexity_ranking) - n_gram

        with open(filepath, 'r') as f:
            mwes = set(f.read().split('\n')) # make set
            avg_mwe_complexity = 0
            for pos in range(sliding_start, sliding_end + 1):
                possible_mwe = ' '.join(self.tokenised_sentence[pos: pos + n_gram])
                
                if possible_mwe in mwes:
                    
                    if np.mean(self.probs[pos:pos+n_gram]) > avg_mwe_complexity:
                        avg_mwe_complexity = np.mean(self.probs[pos:pos+n_gram])
                        valid_mwes_idx = np.arange(pos, pos+n_gram, 1)
                        mwe_found = possible_mwe
                    else:
                        continue
                        
        if avg_mwe_complexity > 0:
            self.idx_to_plainify = valid_mwes_idx
        else:
            self.idx_to_plainify = [complex_word_pos]
        
    
    def find_all_ngram_mwes(self):
        # sets self.idx_to_plainify to the indices of the longest MWE found,
        # or to the index of the most complex word if no MWE is found
        
        if not self.is_complex:
            raise ValueError('Sentence is not complex')
        
        # give priority to longer MWEs
        n_gram_files = {2: two_gram_mwes_list, 3: three_gram_mwes_list, 4:four_gram_mwes_list}
        
        for n in reversed(range(2,5)):
            self.find_MWEs_w_most_complex_word(n, n_gram_files[n])
            
            if len(self.idx_to_plainify) == n: # if such mwe is found
                break
    
    def one_step_plainify(self):
        idx_start = self.idx_to_plainify[0]
        idx_end = self.idx_to_plainify[-1]+1
        complex_word_name = " ".join(self.tokenised_sentence[idx_start:idx_end])
        print(f'Found complex word or expression: ### {complex_word_name} ###. Plainifying...')
        processed_sentence = tokeniseUntokenise(self.sentence, tokenizer)
        forward_result = getTokenReplacement(processed_sentence, idx_start, len(self.idx_to_plainify), 
                                  tokenizer, model, similm, tokenfreq, embeddings, vocabulary2,
                                  verbose=False, backwards=False, maxDepth=3, maxBreadth=16, alpha=self.alpha)
        backward_result = getTokenReplacement(processed_sentence, idx_start, len(self.idx_to_plainify),
                                  tokenizer, model, similm, tokenfreq, embeddings, vocabulary2, 
                                  verbose=False, backwards=True, maxDepth=3, maxBreadth=16, alpha=self.alpha)
        words, scores = aggregateResults((forward_result, backward_result))
        words = [w.replace('#', '') for w in words]
        print(f'Suggested top 5 subtitutions: {words[:5]}')
        
        return words[0].split(' ')
        
    
    def sub_in_sentence(self, substitution):
        # plugs a substitution in the sentence, then updates complexity scores
        substitution_len = len(substitution)
        
        idx_start = self.idx_to_plainify[0]
        idx_end = self.idx_to_plainify[-1]+1
        
        self.tokenised_sentence = self.tokenised_sentence[:idx_start] + substitution + self.tokenised_sentence[idx_end:]
        self.sentence = ' '.join(self.tokenised_sentence)
        self.bin_labels = list(self.bin_labels[:idx_start]) + [0] * substitution_len + list(self.bin_labels[idx_end:])
        self.label_complex_words(init=False)
        print(f'\t Sentence after substitution: {self.sentence}\n')
        
    def recursive_greedy_plainify(self, max_steps=float('inf'), test=False):
        n = 1
        sub_details_list = []
        while self.is_complex and n <= max_steps:
            self.find_all_ngram_mwes()
            sub = self.one_step_plainify()
            self.sub_in_sentence(sub)
            # append substitution details (indices refer to the span that was replaced)
            sub_details = {"iteration": n, "sub_word": sub[0],
                           "idx_start": self.idx_to_plainify[0], "idx_end": self.idx_to_plainify[-1] + 1}
            sub_details_list.append(sub_details)
            n += 1
        print('Simplification complete.')
        if test:
            return self.sentence, sub_details_list
        else:
            return self.sentence
    
    def recursive_beam_search_plainfy(self, beam_width):
        pass
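
The sliding-window search in `find_MWEs_w_most_complex_word` can be hard to follow inside the class, so here is a minimal standalone sketch of the same idea. It uses only the standard library; the function name `find_mwe_span`, the toy scores, and the toy MWE entries are all made up for illustration and are not part of the notebook's code or data files.

```python
def find_mwe_span(tokens, probs, complex_pos, n, mwes):
    """Return the indices of the listed n-gram MWE that covers complex_pos and
    has the highest mean complexity, or [complex_pos] if none is listed."""
    # Windows of length n that contain complex_pos start anywhere in
    # [complex_pos - n + 1, complex_pos], clipped to the sentence bounds.
    start = max(complex_pos - n + 1, 0)
    end = min(complex_pos, len(tokens) - n)
    best_span, best_score = [complex_pos], 0.0
    for pos in range(start, end + 1):
        candidate = ' '.join(tokens[pos:pos + n])
        score = sum(probs[pos:pos + n]) / n
        if candidate in mwes and score > best_score:
            best_span, best_score = list(range(pos, pos + n)), score
    return best_span


# Toy inputs (assumptions, not real labeller output or real MWE list entries).
tokens = ['i', 'took', 'a', 'sip', 'of', 'my', 'coffee']
probs = [0.01, 0.05, 0.01, 0.80, 0.02, 0.01, 0.10]
toy_mwes = {'took a sip of', 'a sip'}

span = [3]
for n in (4, 3, 2):  # longer MWEs take priority, as in find_all_ngram_mwes
    span = find_mwe_span(tokens, probs, complex_pos=3, n=n, mwes=toy_mwes)
    if len(span) == n:  # an MWE of this length was found
        break
print(span)
```

Trying the longest n-gram first means a 4-gram such as "took a sip of" wins over a shorter MWE that covers the same complex word.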

In [31]:
def simplify(sentence, verbose=False):
    s = ComplexSentence(sentence, label_model=Complexity_labeller_model, tokeniser=tokenizer, verbose=verbose)
    s.recursive_greedy_plainify()
    return s.sentence
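
The greedy loop that `simplify` runs through `recursive_greedy_plainify` can be illustrated without loading any models. The sketch below is a toy under stated assumptions: `greedy_simplify`, `TOY_SCORES`, and `TOY_SUBS` are hypothetical stand-ins for the complexity labeller and the BERT-based substitution generator, and it only handles single-token swaps.

```python
def greedy_simplify(tokens, score_fn, substitute_fn, threshold=0.5, max_steps=10):
    """Repeatedly replace the highest-scoring token until none reaches threshold."""
    tokens = list(tokens)
    done = set()  # positions already replaced are never revisited
    for _ in range(max_steps):
        scores = score_fn(tokens)
        candidates = [i for i, s in enumerate(scores)
                      if s >= threshold and i not in done]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: scores[i])
        tokens[worst] = substitute_fn(tokens[worst])
        done.add(worst)
    return tokens


# Made-up complexity scores and substitutions, for illustration only.
TOY_SCORES = {'awarded': 0.81, 'ba': 0.78}
TOY_SUBS = {'awarded': 'given', 'ba': 'degree'}

result = greedy_simplify(
    ['he', 'was', 'awarded', 'a', 'ba'],
    score_fn=lambda toks: [TOY_SCORES.get(t, 0.0) for t in toks],
    substitute_fn=lambda tok: TOY_SUBS.get(tok, tok),
)
print(' '.join(result))
```

Tracking replaced positions in `done` plays the role that zeroing the binary labels plays in `sub_in_sentence`: a substituted span is not simplified a second time.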

#### Example sentences
Run each cell below to see output after lexical simplification.

In [57]:
simplify("Graham went to Wheaton College from 1939 to 1943, when he was awarded a ba in history.", verbose=True)

Untokenised sentence: Graham went to Wheaton College from 1939 to 1943 , when he was awarded a ba in history . 
Found NEs: ['graham', 'wheaton', 'college']
Complex probs: [0.00000000e+00 2.34588934e-03 6.54421019e-05 0.00000000e+00
 0.00000000e+00 6.03280896e-05 1.30338268e-03 5.03345982e-05
 2.68596783e-03 0.00000000e+00 1.17641684e-04 2.13565771e-04
 1.34180140e-04 8.14972639e-01 1.67953229e-04 7.78222978e-01
 4.95651366e-05 5.51919676e-02 0.00000000e+00]
Binary complexity labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0]
	 Most complex word: awarded 

Found complex word or expression: ### was awarded ###. Plainifying...
Suggested top 5 subtitutions: ['received', 'took', 'obtained', 'completed', 'gained']
	 Most complex word: ba 

	 Sentence after substitution: graham went to wheaton college from 1939 to 1943 , when he received a ba in history .

Found complex word or expression: ### ba ###. Plainifying...
Suggested top 5 subtitutions: ['degree', 'major', 'b', 'b .', 'double major']
	 

'graham went to wheaton college from 1939 to 1943 , when he received a degree in history .'

In [32]:
simplify("I took a sip of my coffee and kept working.")

	 Most complex word: sip 

Found complex word or expression: ### took a sip of ###. Plainifying...
Suggested top 5 subtitutions: ['took', 'had', 'finished', 'made', 'put down']
	 Simplification complete or no complex expression found.

	 Sentence after substitution: i took my coffee and kept working .

Simplification complete.


'i took my coffee and kept working .'

In [58]:
simplify("In 1987 Wexler was inducted into the Rock and Roll Hall of Fame.")

	 Most complex word: inducted 

Found complex word or expression: ### was inducted into the ###. Plainifying...
Suggested top 5 subtitutions: ['entered the', 'joined the', 'entered into the', 'entered', 'in']
	 Simplification complete or no complex expression found.

	 Sentence after substitution: in 1987 wexler entered the rock and roll hall of fame .

Simplification complete.


'in 1987 wexler entered the rock and roll hall of fame .'

In [34]:
simplify("Ammonia, which is synonymous with anhydrous ammonia, is a colorless gas.")

	 Most complex word: synonymous 

Found complex word or expression: ### synonymous with ###. Plainifying...
Suggested top 5 subtitutions: ['also called', 'also known as', 'not', 'also', 'the']
	 Simplification complete or no complex expression found.

	 Sentence after substitution: ammonia , which is also called anhydrous ammonia , is a colorless gas .

Simplification complete.


'ammonia , which is also called anhydrous ammonia , is a colorless gas .'

In [48]:
simplify("Both men and women must adhere to these guidelines when attending a meeting.")

	 Most complex word: guidelines 

Found complex word or expression: ### guidelines ###. Plainifying...
Suggested top 5 subtitutions: ['rules', 'principles', 'laws', 'conditions', 'requirements']
	 Most complex word: attending 

	 Sentence after substitution: both men and women must adhere to these rules when attending a meeting .

Found complex word or expression: ### attending a ###. Plainifying...
Suggested top 5 subtitutions: ['in a', 'at a', 'they are', 'making a', 'holding a']
	 Most complex word: adhere 

	 Sentence after substitution: both men and women must adhere to these rules when in a meeting .

Found complex word or expression: ### adhere to ###. Plainifying...
Suggested top 5 subtitutions: ['follow', 'respect', 'observe', 'agree to', 'adhere to']
	 Simplification complete or no complex expression found.

	 Sentence after substitution: both men and women must follow these rules when in a meeting .

Simplification complete.


'both men and women must follow these rules when in a meeting .'

In [35]:
# simplify your own sentence
your_own_sentence = " " # fill this in
simplify(your_own_sentence)


### Tests
Run any tests for hyperparameter tuning below.
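
One hyperparameter that could be tuned here is the `alpha` triple passed to `ComplexSentence`, whose default is `(1/9, 4/9, 4/9)`. As a rough starting point, the sketch below enumerates candidate triples on a grid. It assumes the three weights should be non-negative and sum to 1, which holds for the default but is not stated anywhere in this notebook; `alpha_grid` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product


def alpha_grid(steps=9):
    """Yield every (a1, a2, a3) of non-negative multiples of 1/steps that sums to 1."""
    for i, j in product(range(steps + 1), repeat=2):
        k = steps - i - j
        if k >= 0:
            yield (Fraction(i, steps), Fraction(j, steps), Fraction(k, steps))


candidates = list(alpha_grid(steps=9))
print(len(candidates))

# Convert to plain floats before handing a candidate to the notebook's code.
alpha_as_floats = [tuple(float(a) for a in alpha) for alpha in candidates]
```

Each float triple could then be passed as `alpha=` when constructing `ComplexSentence` directly, since the `simplify` helper does not expose that argument.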