## Improving Lexical Simplification Using State of the Art Lexical Complexity Prediction Models
#### Demo notebook
This notebook demonstrates the simplification of sentences that contain multi-word expressions (MWEs). It was tested on a system with the following specifications:
<ol>
    <li>OS: Linux x86-64</li>
    <li>CPU: 3.30 GHz x 8</li>
    <li>RAM: 40 GiB</li>
    <li>Hard Drive: 20 GiB</li>
    <li>GPU: NVIDIA Corporation GA104M (CUDA compute capability: 8.6)</li>
</ol>

#### Run the following cells to import packages, configure settings, and load the models

In [1]:
# imports
import numpy as np
import torch
from transformers import BertTokenizer
from tqdm import tqdm
import re
import codecs

from CWIs.complex_labeller import Complexity_labeller
from plainifier.plainify import *

import warnings
warnings.filterwarnings('ignore')

In [2]:
# settings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seed = 1234
np.random.seed(seed)
torch.manual_seed(seed)

two_gram_mwes_list = './CWIs/2_gram_mwe_50.txt'
three_gram_mwes_list = './CWIs/3_gram_mwe_25.txt'
four_gram_mwes_list = './CWIs/4_gram_mwe_8.txt'
pretrained_model_path = './CWIs/cwi_seq.model'
temp_path = './CWIs/temp_file.txt'

path = './plainifier/'
premodel = 'bert-large-uncased-whole-word-masking'
bert_dict = 'tersebert_pytorch_1_0.bin'
embedding = 'crawl-300d-2M-subword.vec'
unigram = 'unigrams-df.tsv'
tokenizer = BertTokenizer.from_pretrained(premodel)
Complexity_labeller_model = Complexity_labeller(pretrained_model_path, temp_path)

2022-04-03 15:49:58.324608: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-03 15:49:58.325438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

In [3]:
# Load the BERT model, word embeddings, and unigrams. This takes about 7 minutes on the system described above.
model, similm, tokenfreq, embeddings, vocabulary2 = load_all(path, premodel, bert_dict, embedding, unigram, tokenizer)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading Embeddings


100%|██████████████████████████████| 2000000/2000000 [01:30<00:00, 22001.64it/s]


Loaded Embeddings
Loading Unigrams


100%|██████████████████████████████| 8394369/8394369 [05:53<00:00, 23747.04it/s]

Loaded Unigrams





#### Run the following cell to define the sentence class

In [4]:
class ComplexSentence:
    # Sentence class: labels complex words and replaces them step by step
    def __init__(self, sentence, label_model, tokeniser, verbose=True, beam_width=3):
        # NOTE: `tokeniser` is accepted here, but the methods below use the
        # module-level `tokenizer` defined in the settings cell.
        self.sentence = sentence
        self.tokenised_sentence = self.generate_tokenised_sentence()
        self.label_model = label_model
        self.verbose = verbose
        self.beam_width = beam_width

        if self.verbose:
            print(f'Untokenised sentence: {self.sentence}')
            print(f'Tokenised sentence: {self.tokenised_sentence}')

        self.label_complex_words()
    
    def generate_tokenised_sentence(self):
        # Rebuild whole words from wordpiece tokens ('##' marks a continuation piece)
        processed = tokeniseUntokenise(self.sentence, tokenizer)
        tokens, word_idx = processed['tokens'], processed['words']
        tokenised_sentence_list = []
        for idx_list in word_idx:
            if len(idx_list) == 1:
                tokenised_sentence_list.append(tokens[idx_list[0]])
            else:
                word_untokenised = ''.join(tokens[i].replace('##', '') for i in idx_list)
                tokenised_sentence_list.append(word_untokenised)
        return tokenised_sentence_list
    
    def known_complexity(self):
        # Boolean mask over words: True only for words that map to a single
        # wordpiece token and do not consist solely of punctuation or underscores.
        processed = tokeniseUntokenise(self.sentence, tokenizer)
        tokens, word_idx = processed['tokens'], processed['words']
        known_index = []
        for idx_list in word_idx:
            if len(idx_list) == 1 and not re.match(r'^[_\W]+$', tokens[idx_list[0]]):
                # complexity can be assessed: keep the labeller's output
                known_index.append(True)
            else:
                # complexity cannot be assessed: mask the word out
                known_index.append(False)
        return known_index
    
    def label_complex_words(self, init=True):
        # Apply the complexity labeller to the current sentence.
        # With init=False the binary labels from the previous step are kept
        # and only the probabilities are refreshed.

        Complexity_labeller.convert_format_string(self.label_model, self.sentence)
        if init:
            self.bin_labels = Complexity_labeller.get_bin_labels(self.label_model)[0]

        # mask out words whose complexity cannot be assessed (see known_complexity)
        known = self.known_complexity()
        self.bin_labels = np.multiply(self.bin_labels, known)

        self.is_complex = bool(np.sum(self.bin_labels) >= 1)
        self.probs = Complexity_labeller.get_prob_labels(self.label_model)
        self.probs = np.multiply(self.probs, known)

        # word positions sorted from most to least complex
        self.complexity_ranking = np.argsort(np.array(self.bin_labels) * np.array(self.probs))[::-1]
        self.most_complex_word = self.tokenised_sentence[self.complexity_ranking[0]]

        if self.verbose:
            print(f'Complex probs: {self.probs}')
            print(f'Binary complexity labels: {self.bin_labels}')

            if self.is_complex:
                print(f'\t Most complex word: {self.most_complex_word} \n')

        if not self.is_complex:
            print(f'\t Simplificaiton complete or no complex expression found.\n')
    
    def find_MWEs_w_most_complex_word(self, n_gram, filepath):
        # Looks for an n-gram MWE that contains the most complex word in the sentence.
        # Sets self.idx_to_plainify to the positions of the MWE with the highest mean
        # complexity, or to the position of the complex word alone if none is found.
        
        complex_word_pos = self.complexity_ranking[0]
        n_words = len(self.complexity_ranking)

        # first and last start positions of an n-gram window that covers the complex word
        sliding_start = max(complex_word_pos - n_gram + 1, 0)
        sliding_end = min(complex_word_pos, n_words - n_gram)

        with open(filepath, 'r') as f:
            mwes = set(f.read().split('\n'))  # one MWE per line

        avg_mwe_complexity = 0
        valid_mwes_idx = None
        for pos in range(sliding_start, sliding_end + 1):
            possible_mwe = ' '.join(self.tokenised_sentence[pos: pos + n_gram])

            if possible_mwe in mwes:
                mwe_complexity = np.mean(self.probs[pos:pos + n_gram])
                if mwe_complexity > avg_mwe_complexity:
                    avg_mwe_complexity = mwe_complexity
                    valid_mwes_idx = np.arange(pos, pos + n_gram, 1)
                        
        if avg_mwe_complexity > 0:
            self.idx_to_plainify = valid_mwes_idx
        else:
            self.idx_to_plainify = [complex_word_pos]
    
    def find_all_ngram_mwes(self):
        # sets self.idx_to_plainify to the indices of the longest MWE found,
        # or to the index of the most complex word if no MWE is found
        
        if not self.is_complex:
            raise ValueError('Sentence is not complex')
        
        # give priority to longer MWEs
        n_gram_files = {2: two_gram_mwes_list, 3: three_gram_mwes_list, 4: four_gram_mwes_list}
        
        for n in reversed(range(2, 5)):
            self.find_MWEs_w_most_complex_word(n, n_gram_files[n])
            
            if len(self.idx_to_plainify) == n:  # an MWE of length n was found
                break
    
    def one_step_plainify(self):
        idx_start = self.idx_to_plainify[0]
        idx_end = self.idx_to_plainify[-1]+1
        print(f'Found complex word or expression: ### {" ".join(self.tokenised_sentence[idx_start:idx_end])} ###. Plainifying...')
        processed_sentence = tokeniseUntokenise(self.sentence, tokenizer)
        forward_result = getTokenReplacement(processed_sentence, idx_start, len(self.idx_to_plainify), 
                                  tokenizer, model, similm, tokenfreq, embeddings, vocabulary2,
                                  verbose=False, backwards=False, maxDepth=3, maxBreadth=16, alpha=(1/9,6/9,2/9))
        backward_result = getTokenReplacement(processed_sentence, idx_start, len(self.idx_to_plainify),
                                  tokenizer, model, similm, tokenfreq, embeddings, vocabulary2, 
                                  verbose=False, backwards=True, maxDepth=3, maxBreadth=16, alpha=(1/9,6/9,2/9))
        words, scores = aggregateResults((forward_result, backward_result))
        print(f'Suggested top 5 subtitutions: {words[:5]}')
        return words[0].split(' ')
        
    
    def sub_in_sentence(self, substitution):
        # plugs a substitution in the sentence, then updates complexity scores
        substitution_len = len(substitution)
        
        idx_start = self.idx_to_plainify[0]
        idx_end = self.idx_to_plainify[-1]+1
        
        self.tokenised_sentence = self.tokenised_sentence[:idx_start] + substitution + self.tokenised_sentence[idx_end:]
        self.sentence = ' '.join(self.tokenised_sentence)
        self.bin_labels = list(self.bin_labels[:idx_start]) + [0] * substitution_len + list(self.bin_labels[idx_end:])
        self.label_complex_words(init=False)
        print(f'\n\t Sentence after substitution: {self.sentence}\n')
        
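    def recursive_greedy_plainify(self, max_steps=float('inf')):
        # Greedily replace the most complex word or MWE until nothing complex
        # remains or max_steps substitutions have been made.
        n = 0  # start at 0 so that exactly max_steps substitutions are allowed
        while self.is_complex and n < max_steps:
            self.find_all_ngram_mwes()
            sub = self.one_step_plainify()
            self.sub_in_sentence(sub)
            n += 1
        print('Simplification complete.')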
    
    def recursive_beam_search_plainfy(self, beam_width):
        pass
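
The MWE lookup in the class above can be illustrated in isolation. The following is a minimal, self-contained sketch rather than the notebook's implementation: `find_mwe_span` and `longest_mwe_span` are hypothetical names, the MWE sets and complexity scores are made up for illustration, and plain Python containers stand in for the NumPy arrays and the MWE list files.

```python
from statistics import mean


def find_mwe_span(words, probs, target, n, mwes):
    """Return (start, end) of the listed n-gram covering words[target] that has
    the highest mean complexity, or None if no listed n-gram covers it."""
    first = max(target - n + 1, 0)          # earliest window that still covers target
    last = min(target, len(words) - n)      # latest window that fits in the sentence
    best_span, best_score = None, 0.0
    for start in range(first, last + 1):
        candidate = ' '.join(words[start:start + n])
        if candidate in mwes:
            score = mean(probs[start:start + n])
            if score > best_score:
                best_span, best_score = (start, start + n), score
    return best_span


def longest_mwe_span(words, probs, target, mwes_by_n):
    """Try the longest n-grams first; fall back to the single word."""
    for n in sorted(mwes_by_n, reverse=True):
        span = find_mwe_span(words, probs, target, n, mwes_by_n[n])
        if span is not None:
            return span
    return (target, target + 1)


# Hypothetical scores and MWE lists, for illustration only
words = ['i', 'took', 'a', 'sip', 'of', 'coffee', 'and', 'kept', 'working', '.']
probs = [0.0, 0.1, 0.05, 0.6, 0.05, 0.2, 0.0, 0.1, 0.1, 0.0]
mwes_by_n = {2: {'sip of'}, 3: {'a sip of'}, 4: {'a sip of coffee'}}

target = max(range(len(words)), key=probs.__getitem__)  # most complex word
span = longest_mwe_span(words, probs, target, mwes_by_n)
print(' '.join(words[span[0]:span[1]]))  # prints "a sip of coffee"
```

Trying the longest n-grams first mirrors the priority given to longer MWEs in `find_all_ngram_mwes`; the single-word fallback corresponds to the case where no listed MWE contains the most complex word.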


#### Example sentences

In [6]:
input_sentence = "Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true."
sentence = ComplexSentence(input_sentence, label_model=Complexity_labeller_model, tokeniser=tokenizer, verbose=False)
sentence.recursive_greedy_plainify()

2022-04-03 15:59:00.884765: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


Found complex word or expression: ### descriptions of ###. Plainifying...
Suggested top 5 subtitutions: ['measures of', 'estimates of', 'determination of', 'expression of', 'values of']

	 Sentence after substitution: probability is the branch of mathematics concerning numerical measures of how likely an event is to occur , or how likely it is that a proposition is true .

Found complex word or expression: ### probability ###. Plainifying...
Suggested top 5 subtitutions: ['probability theory', 'probability', '. probability', 'or probability', 'theory']

	 Sentence after substitution: probability theory is the branch of mathematics concerning numerical measures of how likely an event is to occur , or how likely it is that a proposition is true .

Found complex word or expression: ### a proposition ###. Plainifying...
Suggested top 5 subtitutions: ['it', 'a statement', 'something', 'a claim', 'an event']

	 Sentence after substitution: probability theory is the branch of mathematics conc
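
The order in which expressions are picked in the output above follows from how the class ranks words: the binary labels and probabilities are multiplied by a mask that excludes punctuation and words split into several wordpieces, and the highest remaining score is handled first. The sketch below reproduces that ranking in plain Python under stated assumptions: `known_mask` and `rank_words` are hypothetical names, and the wordpiece split, labels, and probabilities are invented. Ties may be ordered differently than with `np.argsort(...)[::-1]`.

```python
import re

# Same pattern as in known_complexity: tokens made up only of punctuation or underscores
PUNCT_ONLY = re.compile(r'^[_\W]+$')


def known_mask(word_pieces):
    """word_pieces holds one list of wordpiece tokens per word."""
    return [len(pieces) == 1 and not PUNCT_ONLY.match(pieces[0])
            for pieces in word_pieces]


def rank_words(bin_labels, probs, mask):
    """Return word positions from most to least complex, plus the masked scores."""
    scores = [b * p * m for b, p, m in zip(bin_labels, probs, mask)]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order, scores


# Hypothetical wordpiece split, binary labels, and probabilities
word_pieces = [['the'], ['proposition'], ['is'],
               ['un', '##fal', '##si', '##fia', '##ble'], ['.']]
bin_labels = [0, 1, 0, 1, 1]
probs = [0.05, 0.7, 0.02, 0.9, 0.6]

mask = known_mask(word_pieces)
order, scores = rank_words(bin_labels, probs, mask)
print(mask)                   # [True, True, True, False, False]
print(word_pieces[order[0]])  # ['proposition']
```

In this toy input the multi-piece word has the highest raw probability, yet it is masked out of the ranking, so the single-piece word `proposition` is ranked first.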

In [7]:
input_sentence = "I took a sip of coffee and kept working."

sentence = ComplexSentence(input_sentence, label_model=Complexity_labeller_model, tokeniser=tokenizer, verbose=False)
sentence.recursive_greedy_plainify()

Found complex word or expression: ### a sip of coffee ###. Plainifying...
Suggested top 5 subtitutions: ['a break', 'it', 'off', 'over', 'a deep breath']
	 Simplificaiton complete or no complex expression found.


	 Sentence after substitution: i took a break and kept working .

Simplification complete.
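
The simplified sentence above is lowercased and has a space before the final full stop, because the class rebuilds the sentence by joining word tokens with spaces and the tokenizer belongs to an uncased BERT model. If a more readable string is wanted, a small post-processing step could be applied. The helper below is a hypothetical sketch and not part of the notebook or of the `plainifier` package; it handles only simple cases and cannot restore the original casing of proper nouns.

```python
import re


def tidy_sentence(text):
    """Hypothetical clean-up of a space-joined, lowercased sentence."""
    # remove whitespace before common punctuation marks
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)
    # capitalise the standalone pronoun "i"
    text = re.sub(r'\bi\b', 'I', text)
    # capitalise the first character of the sentence
    return text[:1].upper() + text[1:]


print(tidy_sentence('i took a break and kept working .'))
# prints "I took a break and kept working."
```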


In [8]:
input_sentence = "We will first introduce several fundamental concepts."

sentence = ComplexSentence(input_sentence, label_model=Complexity_labeller_model, tokeniser=tokenizer, verbose=False)
sentence.recursive_greedy_plainify()

Found complex word or expression: ### fundamental ###. Plainifying...
Suggested top 5 subtitutions: ['fundamental', 'new', 'basic', 'important', 'key']

	 Sentence after substitution: we will first introduce several fundamental concepts .

Found complex word or expression: ### introduce ###. Plainifying...
Suggested top 5 subtitutions: ['introduce', 'establish', 'define', 'discuss', 'present']

	 Sentence after substitution: we will first introduce several fundamental concepts .

Found complex word or expression: ### concepts ###. Plainifying...
Suggested top 5 subtitutions: ['concepts', 'principles', 'ideas', 'elements', 'questions']
	 Simplificaiton complete or no complex expression found.


	 Sentence after substitution: we will first introduce several fundamental concepts .

Simplification complete.
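
In the run above the top-ranked substitution is the original word each time, so the sentence never changes, yet the loop still terminates. This appears to follow from `sub_in_sentence`, which sets the binary labels of the substituted positions to 0 so that a position is not selected again. The toy loop below sketches that behaviour for single words only; `greedy_simplify` and `keep_original` are hypothetical names, the scores are invented, and unlike the class the sketch does not refresh the probabilities after each step.

```python
def greedy_simplify(words, probs, flagged, suggest, max_steps=float('inf')):
    """Replace flagged words from most to least complex; each position is visited once."""
    words, flagged = list(words), list(flagged)
    visited = []
    while any(flagged) and len(visited) < max_steps:
        # most complex word among those still flagged
        target = max((i for i in range(len(words)) if flagged[i]),
                     key=probs.__getitem__)
        words[target] = suggest(words, target)
        flagged[target] = 0      # never revisit this position
        visited.append(target)
    return words, visited


def keep_original(words, i):
    """Stand-in for a model whose top suggestion is the original word."""
    return words[i]


# Hypothetical labels and probabilities
words = ['we', 'will', 'first', 'introduce', 'several', 'fundamental', 'concepts', '.']
probs = [0.0, 0.0, 0.1, 0.5, 0.1, 0.7, 0.4, 0.0]
flagged = [0, 0, 0, 1, 0, 1, 1, 0]

result, visited = greedy_simplify(words, probs, flagged, keep_original)
print([words[i] for i in visited])  # ['fundamental', 'introduce', 'concepts']
print(result == words)              # True
```

Clearing the flag after each substitution guarantees termination, at the cost of never reconsidering a position whose best candidate was the original word.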


In [9]:
input_sentence = "A neural network is a series of rules that attempt to recognize patterns in a set of data through a process that is the way the human brain operates."

sentence = ComplexSentence(input_sentence, label_model=Complexity_labeller_model, tokeniser=tokenizer, verbose=False)
sentence.recursive_greedy_plainify()

Found complex word or expression: ### patterns in ###. Plainifying...
Suggested top 5 subtitutions: ['and process', 'and understand', 'patterns in', 'and interpret', 'or process']

	 Sentence after substitution: a neural network is a series of rules that attempt to recognize and process a set of data through a process that is the way the human brain operates .

Found complex word or expression: ### operates ###. Plainifying...
Suggested top 5 subtitutions: ['works', 'operates', 'processes information', 'processes', 'functions']

	 Sentence after substitution: a neural network is a series of rules that attempt to recognize and process a set of data through a process that is the way the human brain works .

Found complex word or expression: ### a process that ###. Plainifying...
Suggested top 5 subtitutions: ['this', 'it', 'that', '. this', 'which']

	 Sentence after substitution: a neural network is a series of rules that attempt to recognize and process a set of data through this is th