## Protocol Extraction v0.1

#### It is uncertain whether Natural Language Processing (NLP) techniques can automate the identification of risks from protocols as the foundation for the Adaptive Monitoring Assessment Process.

In [1]:
from docx import Document
from docx.shared import Inches

import re
import nltk
from nltk import tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy

import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

## Extract a Segment of the Protocol

In [2]:
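def getSegment(doc, heading):
    """Return the text under `heading`, up to the next paragraph with the same style."""
    document = Document(doc)
    paragraphs = document.paragraphs
    st = 0                 # index of the first paragraph after the heading
    en = len(paragraphs)   # default: the segment runs to the end of the document
    inc_sty = None         # style of the matched heading
    seg_text = ''
    for i, para in enumerate(paragraphs):
        if st == 0 and para.text == heading:
            st = i + 1
            inc_sty = para.style
        elif st > 0 and i > st and para.style == inc_sty:
            # the next paragraph styled like the heading closes the segment
            en = i
            break

    if st == 0:
        return seg_text    # heading not found

    for para in paragraphs[st:en]:
        seg_text += para.text

    return seg_text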
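
The slicing logic can be tried without a .docx file. The following is a minimal sketch, assuming paragraphs are represented as plain `(text, style_name)` pairs; `slice_section` is a hypothetical helper, not part of python-docx or of this notebook. It matches the heading case-insensitively and joins paragraphs with a newline, both of which differ from `getSegment`.

```python
from typing import List, Optional, Tuple


def slice_section(paragraphs: List[Tuple[str, str]], heading: str) -> Optional[str]:
    """Return the text between `heading` and the next paragraph of the same style.

    `paragraphs` holds (text, style_name) pairs, a stand-in for python-docx
    paragraphs (an assumption made so the sketch runs without a document).
    """
    start = None
    heading_style = None
    for i, (text, style) in enumerate(paragraphs):
        if start is None:
            # Compare case-insensitively and ignore surrounding whitespace.
            if text.strip().lower() == heading.strip().lower():
                start, heading_style = i + 1, style
        elif style == heading_style:
            # The next heading at the same level closes the section.
            return "\n".join(t for t, _ in paragraphs[start:i])
    if start is None:
        return None  # heading not found
    # The heading opened the last section: take everything to the end.
    return "\n".join(t for t, _ in paragraphs[start:])


paras = [
    ("Inclusion Criteria", "Heading 2"),
    ("Aged 18 or over.", "Normal"),
    ("HR positive.", "Normal"),
    ("Exclusion Criteria", "Heading 2"),
    ("Pregnant.", "Normal"),
]
print(slice_section(paras, "inclusion criteria"))
```

Joining with a newline keeps the last word of one paragraph from fusing with the first word of the next.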

In [3]:
def getAllText(doc):
    """Return the text of every paragraph in the document, concatenated."""
    document = Document(doc)
    seg_text = ''
    for para in document.paragraphs:
        seg_text += para.text

    return seg_text

In [4]:
text = getSegment('protocols/Immu09.docx','Inclusion Criteria')
#text = getAllText('protocols/Immu09.docx')


## Pre-process the extracted text

Handle colons and periods that do not mark the end of a sentence: decimal points between digits are masked, and colons are treated as sentence breaks.

Tokenize - Split the extracted text into coherent sentences.

In [5]:
def sentence_tokens(itext):
    #mask all dots between numbers
    pattern = re.compile(r'(?<=\d)[.](?=\d)')
    isatext = pattern.sub('_isadot_',itext)

    #prepare sentence for tokenization
    isatext = isatext.replace(':', '. ').replace('\t', ' ').replace('.', '. ')
    
    sent_text = nltk.sent_tokenize(isatext)

    sent_text1 = []
    for sen in sent_text:
        sent_text1.append(sen.replace('_isadot_', '.'))
        
    return sent_text1


In [6]:
sentence_tokens(text)

['Female or male subjects aged ≥18 years at the time of signing the informed consent form (ICF).',
 'Documented evidence of hormone receptor-positive HER2-negative (HR+/HER2-) MBC confirmed (local  laboratory)  with  the  most recently available or  newly  obtained tumorbiopsy(within last 12 months) from a locally recurrent or metastatic site(s) and defined per ASCO/CAP criteria as.',
 'HR positive (a tumor is considered HR positive if at least 1% of the cells examined have estrogen and/or progesterone receptors).',
 'Human epidermal growth factor receptor 2 (HER2) -negative (defined as immunohistochemistry [IHC] ≤2+ or fluorescence in situ hybridization negative).',
 'Availability of archival tumor tissue FFPE) block (within 12 months prior to randomization) or newly acquired biopsy (FFPE block) from a metastatic site.',
 'Refractory to or relapsed after at least 2, and no more than 4, prior systemic chemotherapy regimens for MBC.',
 'Adjuvant or neoadjuvant therapy for early stage di

In [9]:
print("There are {} sentences in all". format(len(sentence_tokens(text))))

There are 51 sentences in all
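
The effect of masking decimal points can be seen in isolation. This is a minimal sketch using only the standard library: the regular expression and the `_isadot_` placeholder come from `sentence_tokens`, while `split_sentences` and its naive punctuation-based splitter are assumptions standing in for `nltk.sent_tokenize`.

```python
import re

DECIMAL_DOT = re.compile(r'(?<=\d)[.](?=\d)')  # a dot with a digit on both sides
PLACEHOLDER = '_isadot_'


def split_sentences(text):
    """Split text into sentences while keeping decimals such as '1.1' intact."""
    masked = DECIMAL_DOT.sub(PLACEHOLDER, text)
    # Naive splitter: break after '.', '!' or '?' when whitespace follows.
    parts = re.split(r'(?<=[.!?])\s+', masked.strip())
    # Restore the masked decimal points.
    return [p.replace(PLACEHOLDER, '.') for p in parts if p]


sample = "Lesions are measured per RECIST 1.1. ECOG status must be 0 or 1."
for sentence in split_sentences(sample):
    print(sentence)
```

Without the mask, the `.replace('.', '. ')` step in `sentence_tokens` would turn `1.1` into `1. 1` before tokenization.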


## Extractive Summarization - TFIDF Approach (Term Frequency)

In [13]:
def _create_frequency_table(text_string) -> dict:
    """
    Build the word frequency table as a dictionary.
    Each token is reduced to its root form with the Porter stemmer, and
    stems that appear in the stop-word list are left out of the table.
    :rtype: dict
    """
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        # The token is stemmed before the stop-word check, so the check
        # compares the stem, not the original word, with the stop-word list.
        word = ps.stem(word)
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    return freqTable

def _score_sentences(sentences, freqTable) -> dict:
    """
    Score each sentence by the words it contains.
    Basic algorithm: add up the frequency of every frequency-table word found in
    the sentence, then divide by the number of table words that matched.
    Dividing keeps long sentences from outscoring short ones on length alone.
    :rtype: dict
    """

    sentenceValue = dict()

    for sentence in sentences:
        # Scores are keyed by the first 10 characters of the sentence to keep the
        # dictionary keys short. Caveat: sentences that share those 10 characters
        # (for example several starting with 'At least 1') share one entry.
        key = sentence[:10]
        matched_words = 0
        for wordValue in freqTable:
            # Substring test against the lower-cased sentence, not a token match.
            if wordValue in sentence.lower():
                matched_words += 1
                if key in sentenceValue:
                    sentenceValue[key] += freqTable[wordValue]
                else:
                    sentenceValue[key] = freqTable[wordValue]

        # Skip sentences with no matches to avoid dividing by zero.
        if matched_words:
            sentenceValue[key] = sentenceValue[key] / matched_words

    return sentenceValue


def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original text
    average = (sumValues / len(sentenceValue))

    return average


def _generate_summary(sentences, sentenceValue, threshold):
    """Join the sentences whose score is at least `threshold`."""
    summary = ''

    for sentence in sentences:
        if sentence[:10] in sentenceValue and sentenceValue[sentence[:10]] >= (threshold):
            summary += " " + sentence

    return summary


def run_summarization(text):
    # 1 Create the word frequency table
    freq_table = _create_frequency_table(text)

    # 2 Tokenize the sentences with the sentence_tokens() helper defined earlier,
    # which handles colons and decimal points before calling nltk.sent_tokenize()
    sentences = sentence_tokens(text)

    # 3 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(sentences, freq_table)

    # 4 Find the threshold (the average sentence score)
    threshold = _find_average_score(sentence_scores)

    # 5 Important Algorithm: Generate the summary from the sentences that score
    # at least 1.1 times the average
    summary = _generate_summary(sentences, sentence_scores, 1.1 * threshold)

    return summary

if __name__ == '__main__':
    result = run_summarization(text)
   # print(sentence_tokens(result))
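
The scoring pipeline keys scores by `sentence[:10]` and tests words by substring. A variant that keeps one score per sentence position and matches whole tokens might look like the sketch below. It is an illustration under stated assumptions, not the notebook's method: the tokenizer is a simple regular expression, the stop-word set is a tiny hand-written list, there is no stemming, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "or", "at", "is", "for"}


def tokens(text):
    """Lower-case the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequency_table(sentences):
    """Count every non-stop-word token across all sentences."""
    freq = {}
    for sentence in sentences:
        for word in tokens(sentence):
            if word not in STOP_WORDS:
                freq[word] = freq.get(word, 0) + 1
    return freq


def score_sentences(sentences, freq):
    """Return one score per sentence: mean frequency of its non-stop-word tokens."""
    scores = []
    for sentence in sentences:
        content = [w for w in tokens(sentence) if w not in STOP_WORDS]
        scores.append(sum(freq[w] for w in content) / len(content) if content else 0.0)
    return scores


def summarize(sentences, factor=1.1):
    """Keep the sentences scoring at least `factor` times the average score."""
    scores = score_sentences(sentences, frequency_table(sentences))
    threshold = factor * sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score >= threshold]


criteria = [
    "At least 1 prior hormonal treatment.",
    "At least 1 measurable target lesion.",
    "Taxanes in any setting.",
]
print(summarize(criteria))
```

Because scores live in a list aligned with the sentences, the two criteria that begin with "At least 1" each keep their own score.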

In [14]:
list((sentence_tokens(result)))

[' Female or male subjects aged ≥18 years at the time of signing the informed consent form (ICF).',
 'HR positive (a tumor is considered HR positive if at least 1% of the cells examined have estrogen and/or progesterone receptors).',
 'Availability of archival tumor tissue FFPE) block (within 12 months prior to randomization) or newly acquired biopsy (FFPE block) from a metastatic site.',
 'Refractory to or relapsed after at least 2, and no more than 4, prior systemic chemotherapy regimens for MBC.',
 'bone metastases treatments (eg, bisphosphonates, denosumab, etc) and hormonal therapy are not considered as prior systemic treatments for advanced diseaseSubjects should have been previously treated with.',
 'Taxanes in any setting.',
 'At least 1 prior anticancer hormonal treatment.',
 'At least 1 cyclin-dependent kinase inhibitor 4/6 in the metastatic setting.',
 'At least 1 measurable target lesion according to RECIST 1.1 (bony disease only is not allowed) meeting all of the following

In [16]:
print("The sentences have been summarised into {} ". format(len(sentence_tokens(result))))

The sentences have been summarised into 21 
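
The section heading mentions TF-IDF, while the functions in this section weight words by raw term frequency. A minimal sketch of adding the inverse-document-frequency factor is given here, assuming each sentence is treated as its own document; the function names and the regular-expression tokenizer are hypothetical, and no stemming or stop-word removal is applied.

```python
import math
import re


def tokens(text):
    """Lower-case the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(sentences):
    """Return one TF-IDF score per sentence, treating each sentence as a document."""
    docs = [tokens(s) for s in sentences]
    n_docs = len(docs)

    # Document frequency: in how many sentences does each word occur?
    doc_freq = {}
    for doc in docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for doc in docs:
        total = 0.0
        for word in set(doc):
            tf = doc.count(word) / len(doc)           # term frequency in this sentence
            idf = math.log(n_docs / doc_freq[word])   # 0 when the word is in every sentence
            total += tf * idf
        scores.append(total)
    return scores


criteria = ["HR positive tumor", "HER2 negative tumor", "tumor tissue block available"]
for sentence, score in zip(criteria, tfidf_scores(criteria)):
    print(round(score, 3), sentence)
```

With this weighting, a word that appears in every sentence (here "tumor") contributes nothing, so sentences are ranked by their more distinctive terms.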



PDFMiner - https://www.binpress.com/manipulate-pdf-python/

BeautifulSoup - https://www.dataquest.io/blog/web-scraping-tutorial-python/

PyTextRank - https://medium.com/@aneesha/beyond-bag-of-words-using-pytextrank-to-find-phrases-and-summarize-text-f736fa3773c5

Text Summarization with NLTK in Python - https://stackabuse.com/text-summarization-with-nltk-in-python/

Text summarization in 5 steps using NLTK - https://becominghuman.ai/text-summarization-in-5-steps-using-nltk-65b21e352b65

TFIDF - https://towardsdatascience.com/tfidf-for-piece-of-text-in-python-43feccaa74f8

NLP For Topic Modeling Summarization Of Financial Documents - https://blog.usejournal.com/nlp-for-topic-modeling-summarization-of-financial-documents-10-k-q-93070db96c1d

This is a nice subject to play with LDA on! It might also be cool to see how treating individual sentences as documents could affect topics. Computationally more expensive, but it might be feasible.

https://towardsdatascience.com/basic-nlp-on-the-texts-of-harry-potter-topic-modeling-with-latent-dirichlet-allocation-f3c00f77b0f5