# Extraction of Individual Eligibility Criteria from ClinicalTrials.gov in Python
## Author: Sam Kaskovich

### Description:
- This notebook extracts individual-level eligibility criteria for all non-actively enrolling pediatric acute leukemia trials on ClinicalTrials.gov. Clinical trial protocols can be downloaded as XML files, and eligibility criteria can be extracted from these files as a single block of free text.
- Splitting each block of free text into individual criteria is not trivial, given that there is no single standardized format in which the criteria are listed. Criteria may appear as a bulleted list, bulleted list with related subbullets, outline, etc.
- The following steps were undertaken to tackle this problem:
    - 216 XML files for all non-actively enrolling clinical trials for pediatric acute leukemia were downloaded from ClinicalTrials.gov, using the instructions listed at this link: https://clinicaltrials.gov/ct2/resources/download
    - The eligibility criteria free text block for each trial was manually inspected in order to ascertain patterns in formatting. The following four patterns emerged as dominant and each trial was manually labeled as such:<br><br>
        - *Major Header/Subheader/Subbullets*:
            - DISEASE CHARACTERISTICS:
                - Criterion
                - Criterion
                - Etc
            - PATIENT CHARACTERISTICS:
                - Age
                    - Criterion
                - Performance status
                    - Criterion
                        - SubCriterion
                - Etc
            - PRIOR CONCURRENT THERAPY:
                - Biologic therapy
                    - Criterion
                - Etc
            - Etc<br><br>
        - *Major Header/Subbullets*
            - DISEASE CHARACTERISTICS:
                - Criterion
                - Criterion
                    - SubCriterion
                - Etc
            - PATIENT CHARACTERISTICS:
                - Criterion
                - Criterion
                - Etc
            - PRIOR CONCURRENT THERAPY:
                - Criterion
                - Criterion
                - Etc
            - Etc<br><br>
        - *Inclusion and/or Exclusion Criteria/No Nested Subbullets*
            - Inclusion Criteria:
                - Criterion
                - Criterion
                - Etc
            - Exclusion Criteria:
                - Criterion
                - Criterion
                - Etc<br><br>
        - *Inclusion and Exclusion Criteria/Nested Subbullets*
            - Inclusion Criteria:
                - Criterion
                - Criterion
                    - SubCriterion
                    - SubCriterion
                    - Etc
                - Etc
            - Exclusion Criteria:
                - Criterion
                - Criterion
                    - SubCriterion
                    - SubCriterion
                    - Etc
                - Etc<br><br>
    - A function [ExtractCriteria()] was written to parse each text block according to its format and then extract as many intact criteria as possible.
    - NOTE: The goal of this process was not to extract each individual criterion perfectly, but to extract as many as feasible given the highly non-standardized nature of the free text blocks. Thus, a small number of criteria may be extracted as a large chunk, given that they have idiosyncratic structure or lack detectable common split character.

### Notebook Format
- Running the notebook requires the folder of XML files ('trial_protocols').
- The notebook then can simply be run cell-by-cell from start to finish in order to reproduce the results.

In [None]:
#import pandas, numpy, regex, Counter
import pandas as pd
import numpy as np
import re
from collections import Counter
import os

In [None]:
#define RepeatRegexFinder function
def RepeatRegexFinder(text, regex):
    """This function uses the Python regular expressions library to parse
    a text block and return a list of subcontents that span a repeated regular
    expression (i.e. returns the substring between each repeat of the given
    regular expression pattern). Arguments are the text block (string) and 
    the desired regular expression (raw string) that demarcates the subcontents."""
    
    #create empty lists in which to store starts and stops
    starts = []
    stops = []

    #iterate over text to find indices matching regex and its start and stops
    for match in re.finditer(regex, text):

        #note start and stop of matching sequence
        starts.append(match.span()[0])
        stops.append(match.span()[1])

    #create empty list in which to store subcontents
    subcontents = []

    #loop through starts and stops and add desired slices to list--leaving last subcontent off
    for x in range(len(starts) - 1):

        #store each mainbullet slice
        subcontent = text[stops[x]:starts[x + 1]]
        subcontents.append(subcontent)

    #add last subcontent as last slice from 'text'
    lastsubcontent = text[stops[-1]:]
    subcontents.append(lastsubcontent)
    
    #return list of subcontents
    return subcontents

In [None]:
#define MultiRegexFinder() function
def MultiRegexFinder(text, regex_list):
    """This function uses the Python regular expressions library to parse
    a text block and return a list of subcontents between a list of unique
    regular expression patterns (i.e. returns the substring between each 
    unique given regular expression pattern, if it exists in the text). 
    Arguments are the text block (string) and the desired regular expressions
    (list of raw strings) that demarcate the subcontents."""
  
    #create empty list in which to store indices for starts of regex's
    starts = []

    #iterate through regex's
    for regex in regex_list:

        #compile each regex
        re_compiled = re.compile(regex)

        #search text for regex
        re_search = re_compiled.search(text)

        #if search is not empty, add start to above empty list
        if re_search != None:
            start = re_search.span()[0]
            starts.append(start)

        #otherwise, do nothing
        else:
            pass

    #create empty list in which to store subcontents attached to each regex
    subcontents = []

    #loop through starts and add desired content slices to list--leaving last bullet off
    for x in range(len(starts) - 1):

        #add individual content slice of text to above list
        subcontent = text[starts[x]:starts[x + 1]]
        subcontents.append(subcontent)

    #add last header content slice to list
    lastsubcontent = text[starts[-1]:]
    subcontents.append(lastsubcontent)
    
    #return subcontents
    return subcontents

In [None]:
#define ExtractCriteria()
def ExtractCriteria(trial, text, output_list, format_dict):
    """This function will extract clinical trial criteria given the NCT ID (str),
    text format (str), text block (str), and output list (list). It REQUIRES the
    prior importation of the Python regular expressions library (re) and the prior
    definition of RepeatRegexFinder() and MultiRegexFinder()."""

    #create lists of headers for later use in format detection
    majorheaders = [r'DISEASE CHARACTERISTICS', r'PATIENT CHARACTERISTICS', r'PRIOR CONCURRENT THERAPY', r'DONOR CHARACTERISTICS']
    patientchars = [r'Age', r'Performance status', r'Life expectancy', r'Hematopoietic', r'Hepatic', r'Renal', r'Cardiovascular', r'Pulmonary', r'Other']
    priortherapy = [r'Biologic therapy', r'Chemotherapy', r'Endocrine therapy', r'Radiotherapy', r'Surgery', r'Other']
    
    #extraction for: Inclusion and Exclusion Criteria/No Nested Subbullets
    #if inclusion/exclusion criteria followed by any multiple charcters then ':', AND no subbullets
    if ((re.search(r"inclusion criteria(.*):", text.lower()) != None) or (re.search(r"exclusion criteria(.*):", text.lower()) != None) or ()) and (re.search(r'\r\n\r\n {15}\S', text) == None):

        #split text at at each bullet/number, almost always noted by double carriage return, and add to output list
        text = text.split("\r\n\r\n")

        #add criteria to output list
        for each in text:
            output_list.append(each)      

        #add to format dict
        format_dict['Inclusion and Exclusion Criteria/No Nested Subbullets'] += 1
    
    #extraction for: Inclusion and Exclusion Criteria/Nested Subbullets
    #if inclusion/exclusion criteria followed by any multiple charcters then ':', AND contains subbullets
    elif ((re.search(r"inclusion criteria(.*):", text.lower()) != None) or (re.search(r"exclusion criteria(.*):", text.lower()) != None) or ()) and (re.search(r'\r\n\r\n {15}\S', text) != None):

        #use RepeatRegexFinder() to return list of mainbullets and their subcontents
        #mainbullets are denoted by the pattern of double carriage return followed by exactly 10 spaces
        mainbullets = RepeatRegexFinder(text = text, regex = r'\r\n\r\n {10}\S')

        #add criteria to output list
        for each in mainbullets:
            output_list.append(each)

        #add to format dict
        format_dict['Inclusion and Exclusion Criteria/Nested Subbullets'] += 1
            
    #extraction for: Major Header/Subbullets
    #if contains major head and 0 or 1 capitalized subheaders
    elif (any(term in text for term in majorheaders) and (np.count_nonzero([term in text for term in patientchars]) < 4)):
        
        #use MultiRegexFinder to return header subcontents
        headercontents = MultiRegexFinder(text = text, regex_list = majorheaders)

        #deal with subbullet handling in the same way used above
        #iterate through each header content
        for subtext in headercontents:

            #if carriage returns exist
            if re.search(r'\r\n\r\n {10}\S', subtext) != None:

                #use RepeatRegexFinder() to return list of mainbullets and their subcontents
                #mainbullets are denoted by the pattern of double carriage return followed by exactly 10 spaces
                subtexts = RepeatRegexFinder(text = subtext, regex = r'\r\n\r\n {10}\S')

                #add criteria to output list
                for each in subtexts:
                    output_list.append(each)

            #otherwise has no bullets - split on capitalized words
            else:
                capital_words = re.findall( r"\b[A-Z][a-z]*\b", subtext)
                subtexts = MultiRegexFinder(text = subtext, regex_list = capital_words)
                for each in subtexts:
                    output_list.append(each)

        #add to format dict
        format_dict['Major Header/Subbullets'] += 1
            
    #extraction for last remaining Format ID: Major Header/Subheader/Subbullets
    #this process will be the most complex, as it has the most heterogeneity
    #if contains major head and >=2 capitalized subheaders
    elif (any(term in text for term in majorheaders)) and (np.count_nonzero([term in text for term in patientchars]) >= 4):

        #use MultiRegexFinder to return major header subcontents
        headercontents = MultiRegexFinder(text = text, regex_list = majorheaders)

        #deal with subbullet/subheader handling for each major header section
        #iterate through each header content        
        for subtext in headercontents:

            #if patient characteristics in subtext (regardless of capitalization pattern), handle as list with subheaders/bullets from patientchars
            if "patient characteristics" in subtext.lower():

                #use MultiRegexFinder() to return subcontents of patientchars headers
                subcontents = MultiRegexFinder(text = subtext, regex_list = patientchars)

                #add to raw criteria
                for each in subcontents:
                    output_list.append(each)

            #if prior concurrent therapy in subtext (regardless of capitalization pattern), handle as bulleted list with subheaders listed above
            #the "chemotherapy" specification was added to ensure this section occurs in subheader format
            #when "chemotherapy" is not present, the format does not have subheaders and will be handled otherwise
            elif ("prior concurrent therapy" in subtext.lower()) and ("chemotherapy" in subtext.lower()):

                #use MultiRegexFinder() to return subcontents of priortherapy headers
                subcontents = MultiRegexFinder(text = subtext, regex_list = priortherapy)

                #add to output_list
                for each in subcontents:
                    output_list.append(each)

            #otherwise, most can be treated as bulleted list with potential subbullets as with other format IDs
            elif re.search(r'\r\n\r\n {10}\S', subtext) != None:

                #use RepeatRegexFinder() to return list of mainbullets and their subcontents
                #mainbullets are denoted by the pattern of double carriage return followed by exactly 10 spaces
                mainbullets = RepeatRegexFinder(text = subtext, regex = r'\r\n\r\n {10}\S')

                #add criteria to output list
                for each in mainbullets:
                    output_list.append(each)

            #otherwise, subtext is a full paragraph without demarcation between individual criteria
            else:

                #add full subtext to output_list
                output_list.append(subtext)

        #add to format dict
        format_dict['Major Header/Subheader/Subbullets'] += 1
            
    #otherwise, treat as bulleted list
    else:

        #split text at at each bullet/number, almost always noted by double carriage return, and add to output list
        text = text.split("\r\n\r\n")

        #add criteria to output list
        for each in text:
            output_list.append(each)

        #add to format dict
        format_dict['Other'] += 1 

In [None]:
#import ElementTree from xml library to parse files   
import xml.etree.ElementTree as ET

#create list of desired variables to extract from xml
variables = ["nct_id", "start_date", "phase", "./eligibility/gender", "./eligibility/minimum_age", "./eligibility/maximum_age", "./eligibility/criteria/textblock"]

#create dict in which each key is one of the variables
variables_dict = {}.fromkeys(variables)

#assign each key (i.e. variable) in dict a value of empty list
for variable in variables_dict.keys():
    variables_dict[variable] = []

In [None]:
#iterate over names of xml files
for trial in os.listdir('/Users/Sam/Dropbox/Capstone/trials_protocols_extended'):

    #create a parser called "tree" using .parse() and pass trial name
    tree = ET.parse('/Users/Sam/Dropbox/Capstone/trials_protocols_extended/' + trial)
    
    #represent data as tree-like structure with .getroot()
    root = tree.getroot()

    #iterate through variables to extract
    for variable in variables_dict.keys():
        if variable == 'nct_id':
            variables_dict[variable].append(trial.strip('.xml'))
        else:
            variables_dict[variable].append(root.findtext(variable))

In [None]:
#count different formats detected
format_dict = {label:0 for label in format_df.Label}
format_dict['Other'] = 0

#iterate through list containing free text blocks and replace with formatted text
for x in range(len(os.listdir('/Users/Sam/Dropbox/Capstone/trials_protocols_extended'))):
    
    #extraction of raw criteria
    output_list = []
    trial = (os.listdir('/Users/Sam/Dropbox/Capstone/trials_protocols_extended')[x]).strip('.xml')
    fulltext = variables_dict["./eligibility/criteria/textblock"][x]
    ExtractCriteria(trial = trial, text = fulltext, output_list = output_list, format_dict = format_dict)
    variables_dict["./eligibility/criteria/textblock"][x] = output_list

In [None]:
#trials master df
trials_df = pd.DataFrame(variables_dict)
trials_df.head()

In [None]:
#return format frequencies
format_dict

In [None]:
#write to csv
trials_df.to_csv("trial_metadata.csv")

In [None]:
#store dates
dates = []

#original raw_criteria
raw_criteria = []

#empty list in which to store separated raw criteria
raw_criteria_long = []

#set regex as raw string for first subbullet
regex = r'\r\n\r\n {15}\S'

#compile each regex
re_compiled = re.compile(regex)

#complex subbullet counter
subbullet_counter = 0

#iterate through trial df
for x in range(len(trials_df)):
    
    #note date
    date = trials_df['start_date'][x]
    
    #note full text block
    fulltext = trials_df['./eligibility/criteria/textblock'][x]

    #for each criterion in list of separated criteria
    for text in fulltext:

        #append to original raw criteria
        raw_criteria.append(text)
        
        #search text for regex
        re_search = re_compiled.search(text)

        #if the first subbullet regex is found
        if re_search != None:

            #add to complex subbullet counter
            subbullet_counter += 1

            #create empty lists in which to store starts of subbullets
            starts = []

            #iterate over text to find indices matching regex and its start
            for match in re.finditer(regex, text):

                #note start and stop of matching sequence
                starts.append(match.span()[0])

            #define first "base statement" as substring with indices of first start and stop
            base_statement = text[:starts[0]]

            #for each subbullet detected
            for x in range(len(starts) - 1):

                #concatenate subbullet to base_statement and add to new list
                new_criterion = base_statement + text[starts[x]:starts[x + 1]]
                raw_criteria_long.append(new_criterion)
                dates.append(date)
        
        #otherwise, ignore
        else:
            raw_criteria_long.append(text)
            dates.append(date)
            
#break up 'Other' and "Biologic therapy" blocks
raw_criteria_longer = []
for each in raw_criteria_long:
    
    #strip leading/ending whitespace
    stripped = each.strip()
    
    #biologic therapy
    if stripped.startswith("Biologic therapy", 0, len("Biologic therapy")) == True:
        criteria_split = each.split("\r\n\r\n          -")
        for criterion in criteria_split:
            raw_criteria_longer.append(criterion)
            
    #other
    elif stripped.startswith("Other", 0, len("Other")) == True:
        criteria_split = each.split("\r\n\r\n          -")
        for criterion in criteria_split:
            raw_criteria_longer.append(criterion)
    
    #otherwise, add criterion to list
    else:
        raw_criteria_longer.append(each)

In [None]:
print(f"Raw criteria:              {len(raw_criteria)}")
print(f"Raw criteria long:         {len(raw_criteria_long)}")
print(f"Raw criteria longer:       {len(raw_criteria_longer)}")
print(f"Complex subbullets:        {subbullet_counter} ({(subbullet_counter/len(raw_criteria)*100):.2f}%)")

In [None]:
#compile regular expression to detect strings of all caps (i.e. abbreviations)
capitalized_words = []
regex = r"\b[A-Z]{2,}\b"

#iterate through criteria
for text in raw_criteria_longer:
    
    #iterate over each text to find indices matching regex and its start/stop
    for match in re.finditer(regex, text):

        #note start and stop of matching sequence
        capitalized_word = text[(match.span()[0]):(match.span()[1])]
        
        #append to list
        capitalized_words.append(capitalized_word)
        
#make frequency table
top_caps = Counter(capitalized_words)
top_caps.most_common()

In [None]:
#create empty list in which to store cleaned text with the following changes
raw_criteria_long_trimmed = []

#compile regular expression to detect strings of all caps (i.e. abbreviations)
regex = r"\b[A-Z]{2,}\b"
re_compiled = re.compile(regex)

#list of common all caps non-abbreviation words (appear > 1x)
non_abrv = ["DONOR", "DISEASE", "CHARACTERISTICS", "AND", "DONORS", "RELATED", "OR", "INCLUSION", "CRITERIA", "EXCLUSION", "PRIOR", "CONCURRENT", "THERAPY", "NOTE", "BEFORE", "PATIENTS", "MATCHED", "UNRELATED", "MUST", "REAL", "TRANSPLANT", "PATIENT", "ELIGIBILITY", "ALLOWED", "ADULT", "PEDIATRIC", "ORGAN", "DYSFUNCTION", "EXCEPT", "STRATUM", "STRATA", "GROUP", "AGED"]

#conduct pre-processing steps for each criterion in list
for each in raw_criteria_longer:

    #break criterion into single words
    word_list = []
    for word in each.split():
        
        #search word for abbreviations
        re_search = re_compiled.search(word)
        
        #if search is not empty and word isn't a commonly all-caps non abbreviation
        #keep abbreviation as is, otherwise lowercase
        if (re_search != None) & (not any(term in word for term in non_abrv)):
            word_list.append(word)
        else:
            word = word.lower()
            word_list.append(word)
        
    #reassign "each" to sentence that is lowercased except for abbreviations
    each = " ".join(word_list)

    #remove all special characters except numbers and ';' (often used in genetic mutations)
    each = re.sub(r'[^A-z0-9 ;]', "", each)

    #remove all single characters
    each = re.sub(r'\s+[a-zA-Z]\s+', "", each)
    
    #replace multiple whitespace with single whitespace
    each = re.sub(" +", " ", each)
    
    #strip leading and ending whitespace
    each = each.strip()

    #add to new empty list
    raw_criteria_long_trimmed.append(each)
    
#confirm same length
print(f"Raw criteria longer:       {len(raw_criteria_longer)}")
print(f"Raw criteria long trimmed: {len(raw_criteria_long_trimmed)}")

In [None]:
#print frequency table
criteria_freq = Counter(raw_criteria_long_trimmed)
for each in criteria_freq.most_common():
    print(f"{each[0]}\n{each[1]}\n\n\n\n")

In [None]:
#based on frequency table, assemble list of common phrases to be tossed out (single word phrases will later be removed)
meaningless_criteria = ["inclusion criteria", "exclusion criteria", "see disease characteristics", "not specified", "at least 8 weeks"]

#create empty list store indices of non-empty, non-single word criteria
to_keep = []

#iterate through criteria list 
for x in range(len(raw_criteria_long_trimmed)):
    
    #if empty string, pass
    if not raw_criteria_long_trimmed[x]:
        pass
    
    #if single word, pass
    elif len(raw_criteria_long_trimmed[x].split()) == 1:
        pass
    
    #if criterion in list of meaningless criteria, pass
    elif raw_criteria_long_trimmed[x] in meaningless_criteria:
        pass
    
    #else, add to raw_criteria_v2
    else:
        to_keep.append(x)

print(f"Raw criteria long trimmed: {len(raw_criteria_long_trimmed)}")
print(f"Final criteria trimmed:    {len(to_keep)}")

In [None]:
#create empty lists to which original and trimmed (i.e. formatted for NLP)
#criteria_original = []
criteria_trimmed_stops = []

#loop through indices of criteria to keep
for index in to_keep:
    
    #add original and trimmed criteria, respectively
    #criteria_original.append(raw_criteria_long[index])
    criteria_trimmed_stops.append(raw_criteria_long_trimmed[index])
    #final_dates.append(dates[index])
    
#check 
#print(final_dates[:50], criteria_original[:50], criteria_trimmed_stops[:50])
#print(f"Raw criteria long trimmed: {len(criteria_original)}")
print(f"Final criteria trimmed:    {len(criteria_trimmed_stops)}")

In [None]:
#create list of all words
all_words = []
for criterion in criteria_trimmed_stops:
    for word in criterion.split():
        all_words.append(word)
        
#create frequency table
word_freq = Counter(all_words)

#print 100 most common words
for x in range(100):
    print(word_freq.most_common()[x][0], end = "','")

In [None]:
#list of custom stop words based on top 100 terms, many removed for semantic significance
custom_stops = ['or','of','the','patients','to','for','with','no','and','at','not','must','be','have','in',
                'are','than','as', 'by','is','study','other','on', 'who','if', 'will','any', 'criteria','patient',
                'from','this','that','allowed','an','may','all','known']

In [None]:
#empty list for nostops criteria
criteria_trimmed_nostops = []

#iterate through criteria, split into words, remove stops, and join back together
for criterion in criteria_trimmed_stops:
    listofwords = []
    for word in criterion.split():
        if word in custom_stops:
            pass
        else:
            listofwords.append(word)
    newcriterion = " ".join(listofwords)
    criteria_trimmed_nostops.append(newcriterion)  
    
#check length
print(len(criteria_trimmed_nostops))

In [None]:
#create list with stops removed, no empty strings or single words
pre_lemmed = []

#iterate through criteria list 
for criterion in criteria_trimmed_nostops:
    
    #if empty string, pass
    if not criterion:
        pass
    
    #if single word, pass
    elif len(criterion.split()) == 1:
        pass
    
    #else, add to raw_criteria_v2
    else:
        pre_lemmed.append(criterion)
        
print(len(pre_lemmed))

In [27]:
#import necessary objects/functions for lemmatization with parts of speech tagging
import nltk 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet 
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Sam/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /Users/Sam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Sam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
#instantiate lemmatizer object
lemmatizer = WordNetLemmatizer() 
  
#define pos_tagger, which changes a given nltk tag to the wordnet abbrev
def pos_tagger(nltk_tag): 
    if nltk_tag.startswith('J'): 
        return wordnet.ADJ 
    
    elif nltk_tag.startswith('V'): 
        return wordnet.VERB 
    
    elif nltk_tag.startswith('N'): 
        return wordnet.NOUN 
    
    elif nltk_tag.startswith('R'): 
        return wordnet.ADV 
    
    else:           
        return None  

In [None]:
#create empty list in which to store final, lemmatized criteria
criteria_final = []

#list of additional suffixes to be removed
suffix_list = ["tion", "ical", "ious", "ance"]

#iterate through criteria
for sentence in pre_lemmed:
    
    # tokenize the sentence and find the POS tag for each token 
    pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   

    #use previously defined function to fix tags
    #reference: https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/#:~:text=Wordnet%20Lemmatizer%20(with%20POS%20tag)&text=This%20is%20because%20these%20words,%2C%20noun%2C%20adjective%20etc).
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged)) 

    #create empty list in which to store lemmatized sentence
    lemmatized_sentence = [] 
    
    #iterate through mapped list with wordnet tags
    for word, tag in wordnet_tagged: 
        #if there is no available tag, append the token as is 
        if tag is None: 
            lemmatized_sentence.append(word) 
        
        # else use the tag to lemmatize the token         
        else:         
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag)) 
    
    #remove selected suffixes from words that are poorly handled by automatic lemmatizer
    for index in range(len(lemmatized_sentence)):
        if lemmatized_sentence[index][-4:] in suffix_list:
            lemmatized_sentence[index] = lemmatized_sentence[index][:-4]
    
    #join previously created list into sentence (i.e. single string)
    lemmatized_sentence = " ".join(lemmatized_sentence) 

    #add lemmatized sentence to finalized criteria list
    criteria_final.append(lemmatized_sentence)
    
print(len(pre_lemmed))
print(len(criteria_final))

In [None]:
criteria_final_df = pd.DataFrame({'Criteria': criteria_final})
criteria_final_df.to_csv("criteria_final.csv")

In [21]:
criteria_final_df = pd.read_csv("criteria_final.csv")
criteria_final_df.head()

Unnamed: 0.1,Unnamed: 0,Criteria
0,0,diagnosis acute myeloid leukemia acute lymphob...
1,1,first subsequent relapse refractory disease af...
2,2,antecedent hematologic disorder except philade...
3,3,age 15 over
4,4,perform status 03


In [24]:
criteria_final_df = criteria_final_df.sort_values(by = ["Criteria"])
criteria_final_df.head()

Unnamed: 0.1,Unnamed: 0,Criteria
3050,3050,1 21 year age when originally diagnose acute l...
2222,2222,1 30 year old havebody weight 10 kg entry note...
5206,5206,1 31 year age
932,932,1 APML diagnosis base upon morpholog histochem...
212,212,1 age 21 year age when enrol onto t2005001 pro...


In [25]:
criteria_final_df['Criteria'] = criteria_final_df['Criteria'].str.replace('^[0-9]*', '', regex = True)
criteria_final_df['Criteria'] = criteria_final_df['Criteria'].str.strip()
criteria_final_df.head()

Unnamed: 0.1,Unnamed: 0,Criteria
3050,3050,21 year age when originally diagnose acute lym...
2222,2222,30 year old havebody weight 10 kg entry note m...
5206,5206,31 year age
932,932,APML diagnosis base upon morpholog histochem a...
212,212,age 21 year age when enrol onto t2005001 proto...


In [28]:
#tokenize sentences for FastText input
tokenized_final = [nltk.word_tokenize(criterion) for criterion in criteria_final_df.Criteria]

print(len(tokenized_final))
print(tokenized_final[:5])

5251
[['21', 'year', 'age', 'when', 'originally', 'diagnose', 'acute', 'lymphoblastic', 'leukemia', 'ALL'], ['30', 'year', 'old', 'havebody', 'weight', '10', 'kg', 'entry', 'note', 'more', '3', 'age', '21', '30', 'enrol'], ['31', 'year', 'age'], ['APML', 'diagnosis', 'base', 'upon', 'morpholog', 'histochem', 'andor', 'flow', 'cytometric', 'confirm', 'upon', 'review', 'bycentral', 'studydesignated', 'hematologic', 'pathologist', ';'], ['age', '21', 'year', 'age', 'when', 'enrol', 'onto', 't2005001', 'protocol', 'version', '6272007', '17']]


In [29]:
#import FastText
from gensim.models.fasttext import FastText

In [30]:
#set FastText hyperparameters
#defaults listed in this article: https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html
#embedding size of 256 chosen as order of magnitude for 2^n
data = tokenized_final
embedding_size = 256
window_size = 5
min_word = 5
down_sampling = 1e-4
alpha = 0.025
model = 0
epochs = 5

In [31]:
%%time

#instantiate FastText model and print time
ft_model = FastText(data,
                    size = embedding_size,
                    window = window_size,
                    min_count = min_word,
                    sample = down_sampling,
                    sg = model,
                    iter = epochs)

CPU times: user 28.1 s, sys: 20.3 s, total: 48.4 s
Wall time: 3min 10s


In [49]:
%%time
ft_model.save("ft_embedding_size256_window5.model")

CPU times: user 450 ms, sys: 26.5 s, total: 26.9 s
Wall time: 1min 10s


In [None]:
%%time
testmodel = FastText.load("ft_embedding_size256_window5.model")

In [50]:
#this function will turn individual word vectors into a sentence vector
#accepts the sentence item (sent) and the FastText model (model)
def sent_vectorizer(sent, model):
    
    #empty list in which to store sentence vectors
    sent_vec =[]
    
    #keeps track of total number of words in sentence
    numw = 0
    
    #for each word in a sentence
    for w in sent:
        
        #if this is the first word, sentence vector starts out with single embedding
        #if not the first word, add word embedding to previous embeddings as part of cumulative sentence vector
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            
            #add 1 to word counter for each iteration
            numw+=1
        
        #if there's an error, do nothing
        except:
            pass
    
    #when finished, return the overall sentence vector divided by the number of words 
    return np.asarray(sent_vec) / numw

#create empty list in which to store sentence embeddings
X = []

#for each criterion in overall data list, vectorize the sentence and append to X
for sentence in tokenized_final:
    X.append(sent_vectorizer(sentence, ft_model))

  sent_vec = model[w]
  sent_vec = np.add(sent_vec, model[w])


In [51]:
#add embeddings to df and save 
criteria_final_df['Embedding'] = X
criteria_final_df = criteria_final_df.drop(columns = ['Unnamed: 0'])
criteria_final_df.to_csv('criteria_final_embedded.csv')