In [None]:
# @author Alec Chapman
# @date 1/19/2018
# @author Olga Patterson
# @date 10/01/2018

In [None]:
import os, glob
from gensim.models import Word2Vec

import LexDiscover as ld
from nltk.tokenize import word_tokenize, sent_tokenize
from importlib import reload
reload(ld)

# Documentation

Code repository: https://github.com/department-of-veterans-affairs/vex

Documentation: https://github.com/department-of-veterans-affairs/vex/wiki


# Introduction
This notebook is meant to be an in-depth introduction to the `LexDiscover` module. A second notebook, generically titled `LexDiscover.ipynb`, will be provided as a plug-and-play notebook with limited functionality.

**Outline**
- Motivation
- Overview
- Example
    - Set Up
    - Example
- Exploring the Results
    - Getting the lexicon
    - Set operations
    - Getting contexts
- Next Steps
- References

# Motivation
Rule-based NLP systems require an extensive lexicon in order to discover concepts in clinical text. Depending on the task, this lexicon may be relatively simple to develop. However, the varying nature of clinical text creates challenges for lexicon development, including misspellings, abbreviations, and little-known synonyms. Discovering additional representations of these terms can increase an NLP system's coverage and speed up the development process.

This module is meant to help discover new lexical variants of any concept. It is not meant to be an automated part of a pipeline, but rather a toolkit to help the system developer come up with an extensive lexicon. This module is heavily inspired by the following paper by Velupillai et al.: ["Vocabulary Development To Support Information Extraction of Substance Abuse from Psychiatry Notes"](http://www.aclweb.org/anthology/W/W16/W16-2912.pdf) [1].

# Overview
As outlined in the paper by Velupillai et al., we implement vocabulary expansion using a number of different sources. Currently, three methods are implemented: an automated, neural-network-based method using the `word2vec` implementation in `GenSim`, called `word2vec`; a linguistic, rule-based implementation called `wnling`; and an ontology-based implementation called `ont`.

The user provides a corpus of text and a base lexicon to generate new vocabulary terms. Each word in the base lexicon is then used to expand the vocabulary by finding similar words in the corpus. The corpus should ideally be the same corpus that will be used for the NLP task; for example, if the study involves extracting information from radiology reports, the corpus should consist of radiology reports similar or identical to those that will be processed in the study. This ensures that similar words will be found and returned. Any discovered words that occur in the corpus at least a user-defined minimum number of times are kept.
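The minimum-count filter described above can be sketched in a few lines. This is only an illustration under simple assumptions (naive regex tokenization, a hypothetical helper named `filter_by_min_count`, and a made-up toy corpus); it is not the module's own implementation.

```python
import re
from collections import Counter


def filter_by_min_count(candidates, corpus, min_count=1):
    """Keep candidates occurring at least `min_count` times in `corpus`.

    Hypothetical helper for illustration; results are sorted from most
    to least frequent, with ties broken alphabetically.
    """
    # Naive tokenization: lower-case, then split on non-word characters
    counts = Counter(re.findall(r"\w+", corpus.lower()))
    kept = [term for term in set(candidates) if counts[term] >= min_count]
    return sorted(kept, key=lambda term: (-counts[term], term))


# Toy corpus (made up for this sketch)
toy_corpus = "Abd soft. Abd pain noted. Abdomen distended. No chest pain."
candidates = ["abd", "abdomen", "abdmn", "chest"]

print(filter_by_min_count(candidates, toy_corpus, min_count=1))
print(filter_by_min_count(candidates, toy_corpus, min_count=2))
```

Raising `min_count` trades recall for precision: rare misspellings drop out first, which may or may not be desirable depending on the task.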

The `word2vec` algorithm uses the context of words in a vocabulary to create dense, real-valued vectors representing words [2]. We use the GenSim implementation [3]. The resulting vectors can be used to measure similarity between words. In this implementation, each word in the corpus is queried against each word in the base lexicon. Any word that has a similarity score above a user-defined threshold is added to the lexicon. This does not necessarily return synonyms; it may return antonyms or other words that appear in similar contexts. As such, the results of the `word2vec` method may be especially noisy and should be used with caution, but may still discover useful terms.
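To make the thresholding step concrete, here is a minimal sketch that uses cosine similarity over hand-made toy vectors rather than a trained model. The vectors, the helper names `cosine` and `expand_by_similarity`, and the use of cosine similarity as the score are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_by_similarity(vectors, base_lex, sim_thresh=0.5):
    """Return words whose similarity to any base term exceeds `sim_thresh`."""
    new_terms = set()
    for word, vec in vectors.items():
        if word in base_lex:
            continue
        for base in base_lex:
            if base in vectors and cosine(vec, vectors[base]) > sim_thresh:
                new_terms.add(word)
                break
    return sorted(new_terms)


# Toy 3-dimensional vectors, invented for this sketch (not trained embeddings)
toy_vectors = {
    "abdomen": [0.9, 0.1, 0.0],
    "abd": [0.8, 0.2, 0.1],
    "contrast": [0.6, 0.6, 0.1],
    "yesterday": [0.0, 0.1, 0.9],
}

print(expand_by_similarity(toy_vectors, ["abdomen"], sim_thresh=0.5))
print(expand_by_similarity(toy_vectors, ["abdomen"], sim_thresh=0.9))
```

In this toy setup, the lower threshold also admits "contrast", a word that shares context with the base term without being a variant of it, which mirrors the kind of noise described above.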

`wnling` uses string manipulations and orthographic rules to generate new terms. Currently, new strings are generated using substitution, replacement, deletion, and abbreviation. For example, "abdomen" becomes "abdomin", "abdomi", "abd", etc. "Cardiac arrest" becomes "crdc arrst", "ca", and "car arr". Generated strings that do not occur in the corpus at least `min_count` times are discarded.
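A rough sketch of this kind of string generation is shown below. The specific rules (single-character deletion, vowel removal, prefix truncation, initials) and the name `generate_variants` are assumptions chosen to reproduce a few of the examples in the text; the module's actual rules may differ.

```python
import re


def generate_variants(term, min_prefix=3):
    """Generate candidate variants of `term` with a few simple rules."""
    words = term.split()
    variants = set()

    # Deletion: drop one non-space character at a time
    for i, char in enumerate(term):
        if char != " ":
            variants.add(term[:i] + term[i + 1:])

    # Vowel removal: keep each word's first letter, drop later vowels
    variants.add(" ".join(w[0] + re.sub(r"[aeiou]", "", w[1:]) for w in words))

    # Abbreviation: truncate every word to the same prefix length
    for n in range(min_prefix, max(len(w) for w in words)):
        variants.add(" ".join(w[:n] for w in words))

    # Initials for multi-word phrases
    if len(words) > 1:
        variants.add("".join(w[0] for w in words))

    variants.discard(term)
    return variants


# Keep only generated strings that actually occur in a (toy) corpus vocabulary
corpus_vocab = {"abd", "abdomen", "abdmn", "pelvis"}
found = sorted(generate_variants("abdomen") & corpus_vocab)
print(found)
print(sorted(generate_variants("cardiac arrest") & {"ca", "car arr", "crdc arrst"}))
```

Generating aggressively and then intersecting with the corpus vocabulary keeps the rules simple, because implausible strings are discarded when they never appear in real text.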

`ont` uses SNOMED-CT codes to find synonyms, parents, and children of concepts. The user provides base codes; all terms belonging to each code, and optionally to its parents and children, are searched for in the corpus and added.
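The expansion over parents and children can be illustrated with a tiny in-memory hierarchy. Everything here is a placeholder: the integer codes, the terms, and the helpers `expand_codes` and `terms_in_corpus` are invented for this sketch and do not reflect real SNOMED-CT content or the module's API. This version only reports terms that match in the corpus.

```python
import re

# Toy ontology: NOT real SNOMED-CT data; codes and terms are placeholders
TOY_ONTOLOGY = {
    0: {"terms": ["disorder of glucose metabolism"], "parents": [], "children": [1]},
    1: {"terms": ["diabetes mellitus", "dm"], "parents": [0], "children": [2]},
    2: {"terms": ["type 2 diabetes", "t2dm"], "parents": [1], "children": []},
}


def expand_codes(base_codes, ontology, parents=False, children=False):
    """Return base codes plus, optionally, their direct parents and children."""
    expanded = []
    for code in base_codes:
        related = [code]
        if parents:
            related += ontology[code]["parents"]
        if children:
            related += ontology[code]["children"]
        for c in related:
            if c not in expanded:  # de-duplicate while preserving order
                expanded.append(c)
    return expanded


def terms_in_corpus(codes, ontology, corpus):
    """Return terms of the given codes that occur as whole words in the corpus."""
    corpus = corpus.lower()
    return [
        term
        for code in codes
        for term in ontology[code]["terms"]
        if re.search(r"\b" + re.escape(term) + r"\b", corpus)
    ]


toy_corpus = "Patient with type 2 diabetes (T2DM). Family history of diabetes mellitus."
codes = expand_codes([1], TOY_ONTOLOGY, children=True)
print(codes)
print(terms_in_corpus(codes, TOY_ONTOLOGY, toy_corpus))
```

Whole-word matching matters for short synonyms: a plain substring search for "dm" would also match inside "t2dm".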

Future implementations should include additional linguistic generations (inflection, term reordering) and more expansive ontological-based expansion (UMLS, text searching).

# Example

## Setup
To instantiate our vocabulary discovery model, we'll need to pass in the corpus as a single string. Any preprocessing can be done beforehand; by default, the only preprocessing will be lower-casing and replacing multi-word phrases from the base lexicon with single words joined by underscores (all occurrences in the corpus of "cardiac arrest" become "cardiac_arrest").
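The default preprocessing can be approximated as follows. This is a sketch under the assumption that phrases are matched as whole words and across arbitrary whitespace; `join_phrases` is a hypothetical name, not a function of the module.

```python
import re


def join_phrases(text, base_lex):
    """Lower-case `text` and join multi-word base terms with underscores."""
    text = text.lower()
    # Longest phrases first, so longer phrases are not broken up by shorter ones
    phrases = sorted(
        (term.lower() for term in base_lex if len(term.split()) > 1),
        key=len,
        reverse=True,
    )
    for phrase in phrases:
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        joined = "_".join(words)
        # A function replacement avoids any backslash-escape surprises
        text = re.sub(pattern, lambda match, joined=joined: joined, text)
    return text


sample = "Cardiac arrest noted. No cardiac\narrest today."
print(join_phrases(sample, ["cardiac arrest", "abdomen"]))
```

Joining phrases before training lets each phrase receive its own token, so a downstream model can treat "cardiac_arrest" as a single vocabulary item.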

In this example, we'll pass in a directory containing the text files that you want to read in (or change the code below to point to a single file). We'll then read all the files in and train a word2vec model, which we'll pass in to our discovery objects. Future functionality should include methods to read from files/directories or database clients.
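If the `helpers` module is not available, a stand-in reader along these lines may be enough. The function `read_corpus` is a hypothetical substitute written for this sketch; it is not the notebook's `helpers.read_text_from_file`, and its behavior (joining files with newlines, UTF-8 decoding) is an assumption.

```python
import tempfile
from pathlib import Path


def read_corpus(path, pattern="*.txt", encoding="utf-8"):
    """Read a single file, or every file matching `pattern` in a directory,
    and return the contents as one newline-joined string."""
    path = Path(path)
    files = [path] if path.is_file() else sorted(path.glob(pattern))
    return "\n".join(f.read_text(encoding=encoding) for f in files)


# Demonstration with throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First report.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second report.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_text = read_corpus(tmp)
    single_text = read_corpus(Path(tmp, "b.txt"))

print(demo_text)
```

Sorting the file list keeps the resulting corpus string reproducible across runs and platforms.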

The texts in this example are radiology reports from MIMIC-III [4].

In [None]:
# Read in data here using helper functions
# Pass into discover as a single string.
import helpers

DATADIR = 'data' # directory containing *.txt files
text = helpers.read_text_from_file(DATADIR)


## Example
We first define our base lexicon: a list of pre-selected words that will be used to define our concept. In this example, we'll assume that we are aiming to extract anatomic locations from radiology reports. This is a broad task and is well-suited for this tool.

In [None]:
base_lex = ['diabetes', 'depression', 'hypertension']
base_codes = [73211009, 38341003] # Diabetes mellitus, Hypertensive disorder

Now that we've provided some initial examples, we can initialize our model. The main class to be used is `LexDiscover.AggregateLexDiscover`, which will utilize all of the implemented methods. We pass in the following keyword arguments:

- `text`: a corpus of clinical text as a single string
- `base_lex`: a list of initial terms
- `min_count`: the minimum number of times a word must occur to be added to the lexicon. Default is 1, so any word occurring in the corpus will be considered.
- `edit_dist`: the number of substitutions, replacements, and deletions that will occur in `wnling` (think Minimum Edit Distance). Default is 2.
- `sim_thresh`: the similarity threshold that `word2vec` will use to decide whether or not to add a new term to the lexicon. Default is 0.5.
- `base_codes`: a list of initial SNOMED-CT codes
- `children`: Boolean, whether or not to expand to a base code's children. Default False.
- `parents`: Boolean, whether or not to expand to a base code's parents. Default False.

Upon initializing, the model will create a new dictionary containing the current lexicon, lower-case the text, replace any multi-word phrases in the text with single strings, split the corpus up into sentences, and then initialize the sub-models. This will automatically train a `word2vec` model using the default parameters. Alternatively, the sub-models can be passed in directly (more on that in the future).

In [None]:
import LexDiscover as ld

In [None]:
# Initializing this model takes some time due to the text processing and model training.
discover = ld.AggregateLexDiscover(text=text, base_lex=base_lex, min_count=1, edit_dist=2, sim_thresh=0.5,
                                  base_codes=base_codes, children=True, parents=True)

In [None]:
w2v = Word2Vec.load("word2vec.model")
existing_models={}
existing_models['word2vec'] = w2v
print("Loaded word2vec model: ", w2v)

In [None]:
discover = ld.AggregateLexDiscover(text=text, base_lex=base_lex, min_count=1, edit_dist=2, 
                                   sim_thresh=0.5,
                                   models=existing_models,
                                   base_codes=base_codes, 
                                   children=True, 
                                   parents=True)

To discover new lexical terms, we call `discover.discover_lex()`. This will call the same method on the sub-models and discover new lexical terms using each method. This method returns a list of new words sorted by frequency (most frequent words at the top) and adds each new term to the lexicon.

In [None]:
new_terms = discover.discover_lex()
print(discover)
print()
print(new_terms[:40])

As you can see, a number of useful words were retrieved: "pelvis", "chest", and "bladder" are all anatomic locations similar to those found in the base lexicon. Additionally, "abd", an abbreviation for "abdomen", was retrieved. Some words are not necessarily useful in themselves, but are part of multi-word phrases: "anterior" and "lower" are both likely part of multi-word phrases that should be considered. More support for multi-word phrases should be added in the future.

Note that some words are not useful, such as "contrast" and "terminates". These are words that are commonly mentioned *in the same context* as the anatomic locations in radiology reports, so they were retrieved by `word2vec`. We'll next examine more closely the results of the different methods.

## Sub-models
The sub-models can be directly accessed through `discover`'s `models` attribute. Instead of using the class `AggregateLexDiscover`, each of these models can be instantiated separately to use only one of the algorithms.

In [None]:
print(discover.models)

## `word2vec` model

Let's look at which words were retrieved by `word2vec`. We can do this by calling `get_discovered_terms` and passing in the model name:

In [None]:
word2vec_terms = discover.get_discovered_terms('word2vec')
print(len(word2vec_terms))
print(word2vec_terms[:40]) # sorted by frequency

A large number of new terms were discovered by `word2vec`. As said before, this method is quite noisy, but depending on the task, lots of useful results can come from this.

## `wnling` model

The linguistically generated words are easier to understand and justify. Let's look at the words that were generated by `wnling`:

In [None]:
ling_terms = discover.get_discovered_terms('wnling')
print(len(ling_terms))
print(ling_terms)

As you can see, this list is much smaller. Especially useful terms include 'ruq', short for 'right upper quadrant', one of our base terms, 'luq', and many misspellings of 'abdomen'. Some are not useful, like 'lie', 'fever', and 'liters'. This list is sorted by frequency in descending order (most-frequent to least-frequent). Let's look at the frequency of these by calling `discover.get_count(term)`:

In [None]:
print([(term, discover.get_count(term)) for term in ling_terms])

Many of the least common terms only occurred once, so they would be excluded by a stricter count threshold. But 'abd', 'kidneys', 'ruq', 'luq', and 'abdomenal' all occurred a fair number of times, suggesting that they may be useful terms to keep. Of course, the count threshold should be adjusted depending on the size of the corpus and desired precision; Velupillai et al. use a count threshold of 15 with a corpus of 100 notes, but use a baseline lexicon of 91 terms.

## `ont` model
The ontological model uses SNOMED-CT to find synonyms for concepts defined by the codes in `base_codes`. By default, it will not look for parent or children terms, but setting these keyword arguments to `True` can be quite useful.

In [None]:
discover.get_discovered_terms('ont')

Looking at the list above, you can see the two original terms that were added by the codes: 'kidney structure (body structure)' and 'gastrointestinal tract structure (body structure)'. Neither of these phrases actually appears in the text because of the '(body structure)' string. Here's a quick look into `pymedtermino` [5]. This library takes some manual building in order to have access to the SNOMED-CT data, so if the installation is not complete it may cause some errors. If an `OperationalError` is raised, an empty list is returned. For details, check out [pymedtermino's documentation](http://pythonhosted.org/PyMedTermino/).

In [None]:
from pymedtermino.snomedct import SNOMEDCT
for code in [73211009, 44054006]:
    concept = SNOMEDCT[code]
    print("Original concept: {}".format(concept))
    print("Original synonyms: {}".format(concept.terms))
    print("1st Child synonyms:")
    for child in concept.children[:1]:
        print(child.terms)
    print("1st Parent synonyms:")
    for parent in concept.parents[:1]:
        print(parent.terms)
    print()
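Because trailing semantic tags such as '(body structure)' keep ontology terms from matching the corpus, one possible workaround is to strip the tag before searching. The sketch below is an assumption about how that could be done with a regular expression; `strip_semantic_tag` is a hypothetical helper and not part of `LexDiscover` or `pymedtermino`.

```python
import re

# Matches one trailing parenthesized tag, e.g. " (body structure)"
SEMANTIC_TAG = re.compile(r"\s*\([^()]*\)\s*$")


def strip_semantic_tag(term):
    """Remove a trailing parenthesized tag and lower-case the term."""
    return SEMANTIC_TAG.sub("", term).lower()


for term in [
    "Kidney structure (body structure)",
    "Gastrointestinal tract structure (body structure)",
    "Diabetes mellitus",
]:
    print(strip_semantic_tag(term))
```

Only a tag at the very end of the term is removed, so parentheses that appear mid-term are left untouched.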
     

# Exploring Results
For data permanence, the entire new lexicon can be written to a file using the `write_lex` method:

In [None]:
discover.write_lex('example_lex.txt', '\n')

The results should then be reviewed and irrelevant terms should be discarded. You could then potentially repeat the process with the expanded vocabulary. Here are some methods useful for exploring and evaluating the results:

## Getting the lexicon

In [None]:
# Get all lexical terms
lex = discover.get_lex()
print(len(lex))
print(lex[:10])

In [None]:
# Get only the original baseline lexicon
base = discover.get_base_lex()
print(len(base))
print(base)

In [None]:
# Get SNOMED-CT codes. Returns False if there is no ontological model
codes = discover.get_codes()
if codes:  # guard: len(False) would raise a TypeError
    print(len(codes))
    print(codes[:10])

In [None]:
# Get only the baseline SNOMED-CT codes
base_codes = discover.get_base_codes()
print(len(base_codes))
print(base_codes)

## Set Operations
`get_lex()` returns a list of *all* lexicon terms, including the baseline terms. There are several methods that allow you to see just the terms that were discovered by a particular model and not in the baseline vocabulary (results are always returned sorted by frequency):

In [None]:
# Get all new terms that were not in the baseline vocabulary.
t = discover.get_discovered_terms()
print(len(t))
print(t[:10])

In [None]:
# Get all terms that were discovered by a particular model
t = discover.get_discovered_terms("ont")
print(len(t))
print(t[:10])

In [None]:
# Get all new terms that were discovered by all of two or more models (set intersection)
i = discover.get_intersect('word2vec', 'wnling')
print(len(i))
print(i[:10])

In [None]:
# Get all new terms that were discovered by any of two or more models (set union)
u = discover.get_union('wnling', 'ont')
print(len(u))
print(u[:10])

In [None]:
# Get all new terms that were discovered by the first model, but not one or more others (set difference)
# i.e., only terms discovered by 'ont'
d = discover.get_difference('ont', 'word2vec', 'wnling')
print(len(d))
print(d[:10])
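Conceptually, these methods behave like Python set operations followed by a sort on corpus frequency. The sketch below shows that idea with made-up terms and counts; the dictionaries and the helper `by_frequency` are illustrative assumptions, not the module's internals.

```python
from collections import Counter

# Made-up corpus counts and per-model results, for illustration only
counts = Counter({"abd": 12, "ruq": 7, "pelvis": 30, "contrast": 45, "kidneys": 9})
discovered = {
    "word2vec": {"pelvis", "contrast", "abd"},
    "wnling": {"abd", "ruq", "kidneys"},
    "ont": {"kidneys"},
}


def by_frequency(terms):
    """Sort terms from most to least frequent, ties broken alphabetically."""
    return sorted(terms, key=lambda term: (-counts[term], term))


# Intersection: found by both word2vec and wnling
intersect = by_frequency(discovered["word2vec"] & discovered["wnling"])
# Union: found by wnling or ont
union = by_frequency(discovered["wnling"] | discovered["ont"])
# Difference: found by word2vec but by neither of the others
difference = by_frequency(discovered["word2vec"] - discovered["wnling"] - discovered["ont"])

print(intersect)
print(union)
print(difference)
```

Terms in an intersection were proposed by independent methods, so they may deserve review first; terms found by only one method are a natural place to look for noise.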

## Getting contexts
These methods, particularly `word2vec`, will return terms that are ambiguous in their usefulness, or only part of a larger, more useful phrase. To help recognize this, there are several methods that allow you to see the context of a term.

`get_context(word, window=(1, 0), remove_stopwords=True)` returns a list of n-grams for a given word and their probability as a decimal, with `n` defined by the tuple `window`. For example, with a context window of (1, 0) for the term 'tract', the tuple (('sinus', 'tract'), 0.11) would mean that ~11% of instances of 'tract' are in the context of 'sinus tract'.

By default, it will return context windows of one word before, 0 after (bigrams). The keyword argument `remove_stopwords` defines whether or not stopwords should be included in the windows. 'PHI' means before a sentence (**Φ**, not protected-health-information); 'OMEGA' means after a sentence.
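The windowing and probability calculation can be sketched as follows. This is an illustrative reimplementation under stated assumptions: sentences are already tokenized, sentence boundaries are padded with 'PHI' and 'OMEGA', and the probability is the share of the word's occurrences that fall in each window. The function `context_ngrams` and the toy data are hypothetical.

```python
from collections import Counter


def context_ngrams(sentences, word, window=(1, 0), stopwords=frozenset()):
    """Return (n-gram, probability) pairs for the contexts of `word`.

    `window` is (tokens before, tokens after); stopwords are removed
    before the windows are built.
    """
    before, after = window
    ngrams = Counter()
    for sentence in sentences:
        tokens = [t for t in sentence if t == word or t not in stopwords]
        # Pad so that windows at sentence boundaries are still full length
        padded = ["PHI"] * before + tokens + ["OMEGA"] * after
        for i, token in enumerate(padded):
            if token == word:
                ngrams[tuple(padded[i - before:i + after + 1])] += 1
    total = sum(ngrams.values())
    return [(ngram, n / total) for ngram, n in ngrams.most_common()]


# Toy tokenized sentences (made up for this sketch)
toy_sentences = [
    ["sinus", "tract", "noted"],
    ["the", "urinary", "tract", "is", "normal"],
    ["urinary", "tract", "infection"],
    ["tract", "visualized"],
]
toy_stopwords = frozenset({"the", "is"})

preceding = context_ngrams(toy_sentences, "tract", stopwords=toy_stopwords)
following = context_ngrams(toy_sentences, "tract", window=(0, 1), stopwords=toy_stopwords)
print(preceding)
print(following)
```

In this toy data, half of the occurrences of "tract" follow "urinary", which is the kind of signal that suggests adding the full phrase to the lexicon rather than the single word.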

In [None]:
# Get all bigrams with stopwords removed
c = discover.get_context('heart')
print(len(c))
print(c[:10])

In [None]:
# See all words that follow 'depression'
c = discover.get_context('depression', window=(0, 1))
print(len(c))
print(c[:10])

In [None]:
# Get a larger context window, with a multi-word phrase, keeping stopwords
c = discover.get_context('mellitus', window=(3, 6), remove_stopwords=False)
print(len(c))
print(c[:5])

In [None]:
# Get sentences with a word in it
# Returns a list of sentences
g = discover.search_in_sentences('parkinson', 5)
for i, s in enumerate(g):
    print(i, '.', s)

In [None]:
g = discover.search_in_sentences('acetaminophen' , 5)
for i, s in enumerate(g):
    print(i, '.', s)

# Next Steps (1/24/18)
- **Multi-word phrases:** Many of the retrieved words are likely just part of a multi-word phrase. Approaches to this could include simply expanding the words in the corpus to be both bigrams and unigrams; using GenSim's Phraser module to automatically combine multi-word phrases; and adding functionality to see the context of words in the corpus when reviewing results.
- **Evaluation:** There has been little evaluation of this methodology. Evaluating the various methods of vocabulary expansion could lead to better results.


# References
[1] Velupillai et al., ["Vocabulary Development To Support Information Extraction of Substance Abuse from Psychiatry Notes"](http://www.aclweb.org/anthology/W/W16/W16-2912.pdf)

[2] Mikolov et al., ["Distributed Representations of Words and Phrases and their Compositionality"](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

[3] [GenSim word2vec documentation](https://radimrehurek.com/gensim/models/word2vec.html)

[4] [MIMIC](https://mimic.physionet.org/)

[5] [PyMedTermino documentation](http://pythonhosted.org/PyMedTermino/)