# Develop a negation identification algorithm using word embeddings and an SVM

## Methods
Dataset: 50 notes out of the i2b2 2010 challenge dataset on relations. Type: Discharge summaries. 

~2,500 annotations, each a problem concept with its assertion status.

Task: Classify the assertion status for each concept

Features: sentence embedding + target word embedding + (Context_output) + Section_info (one-hot encoding)


Preprocessing:
- Tokenize using NLTK
- Keep only letters and the hyphen (`-`)
- Lowercase
- Split sentences using NLTK Punkt. Each row in the dataframe represents a sentence that contains at least one concept.
- Convert tokens to word vectors.
    - Tokens outside the vocabulary are skipped; a random vector is used only when the text is empty or none of its tokens is in the vocabulary.

Word vectors: 
- gensim Word2Vec, gensim Phraser
- https://github.com/ncbi-nlp/BioSentVec (default, dimension = 200) (TODO: use similarity to check the quality of the embedding, e.g. diagnose vs. diagnoses)
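
The similarity check in the TODO above could be done with cosine similarity between two word vectors. Below is a minimal sketch, not the notebook's code: `cosine_similarity` is a hypothetical helper, and the 3-dimensional `toy_vectors` are made-up values standing in for the real 200-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption); real vectors would come from the loaded model.
toy_vectors = {
    "diagnose": [0.9, 0.1, 0.3],
    "diagnoses": [0.8, 0.2, 0.35],
    "fracture": [-0.2, 0.9, 0.1],
}

sim_related = cosine_similarity(toy_vectors["diagnose"], toy_vectors["diagnoses"])
sim_unrelated = cosine_similarity(toy_vectors["diagnose"], toy_vectors["fracture"])
print(f"diagnose vs diagnoses: {sim_related:.3f}")
print(f"diagnose vs fracture:  {sim_unrelated:.3f}")
```

A good embedding should score inflected forms of the same word clearly higher than unrelated words. With gensim vectors loaded as `KeyedVectors`, `wv.similarity('diagnose', 'diagnoses')` computes the same quantity directly.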

[Sentence embedding](https://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-the-word2vec-of-tokens-in-sentence)
- Simple average of word vectors (default)
- Average of word vectors weighted by TF-IDF
- BioSentVec
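
The TF-IDF option listed above is not implemented in this notebook. The following is a rough sketch of one way it might work, under stated assumptions: `idf_weights` and `tfidf_weighted_embedding` are hypothetical helpers, the IDF uses the smoothed form `log((1 + N) / (1 + df)) + 1`, and the 2-dimensional toy vectors are made up.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_weighted_embedding(tokens, vectors, idf, dim):
    """Average the word vectors of `tokens`, weighting each by tf * idf.

    Tokens missing from `vectors` or `idf` are skipped. Returns None when
    nothing is left, so the caller decides on a fallback.
    """
    counts = Counter(t for t in tokens if t in vectors and t in idf)
    if not counts:
        return None
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in counts.items():
        weight = tf * idf[tok]
        weight_sum += weight
        for k, x in enumerate(vectors[tok]):
            total[k] += weight * x
    return [x / weight_sum for x in total]


# Made-up toy data (assumption): "fever" occurs in every sentence, "no" in one.
toy_vectors = {"no": [1.0, 0.0], "fever": [0.0, 1.0]}
toy_corpus = [["no", "fever"], ["fever"]]
idf = idf_weights(toy_corpus)
sentence_vec = tfidf_weighted_embedding(["no", "fever"], toy_vectors, idf, dim=2)
print(sentence_vec)
```

In the toy corpus the rarer token "no" receives the larger weight, which is the intended effect for negation cues that a plain average would dilute.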

Model 
- SVM Linear kernel (default)


Evaluation: 
- F1

## User input

In [53]:
DATA_DIR = r'/Users/chenkx/git/clinical-negation/data/extracted_concepts/explore_i2b2-2010-problem_concepts-with_sentences_partial.csv'
DATA_DIR_RAW = r'/Users/chenkx/Box Sync/NLP group/2010 i2b2 challenge - rel'
FILE_SORTER= r"/Users/chenkx/Box Sync/NLP group/2019 n2c2 Challenge/Track 3 (normalization)/Test/test_file_list.txt"
MAP_DIR = r"/Users/chenkx/git/clinical-negation/notebooks/2010Corpus/section_mapping_v4_all.csv"

In [1]:
DIR_EMBEDDING = r'/Users/chenkx/git/clinical-negation/data/BioWordVec_PubMed_MIMICIII_d200.vec.bin'

In [38]:
wordvec = 'Word2Vec' # or 'BioWordVec' # TODO: develop a word vector using the BioWordVec and BioSentVec

## Code

In [226]:
# conda environment: python3.8
import os
import pandas as pd
import numpy as np
import re
import json
import bisect
import csv
from nltk.tokenize import word_tokenize

from sklearn import svm
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
%matplotlib inline

In [120]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/chenkx/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Load word embedding

In [4]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(DIR_EMBEDDING, binary=True)

Load training data and later extract features (including converting text to embeddings).

In [54]:
df = pd.read_csv(DATA_DIR)
df

Unnamed: 0,id,b,e,t,a,c,s,sent
0,134300717::T1,62,109,problem,present,deep venous thrombosis of right lower extremity,Unknown/Unclassified,134300717\nFIH\n9099750\n22777/5p0t\n355178\n2...
1,134300717::T2,222,269,problem,present,deep venous thrombosis of right lower extremity,Diagnoses,Unsigned\nDIS\nReport Status :\nUnsigned\nADMI...
2,134300717::T3,295,314,problem,present,metastatic melanoma,Diagnoses,ASSOCIATED DIAGNOSIS :\nMetastatic melanoma .
3,134300717::T4,464,483,problem,present,metastatic melanoma,Present illness,HISTORY OF PRESENT ILLNESS :\nThis is the seco...
4,134300717::T5,486,498,problem,present,his melanoma,Present illness,His melanoma was originally diagnosed last fal...
...,...,...,...,...,...,...,...,...
2480,0194::T145,7027,7052,problem,present,torn transverse mesocolon,Procedures/Surgery,"7-2-93 , emergency room exploratory laparotomy..."
2481,0194::T147,7065,7097,problem,present,serosal tear of transverse colon,Procedures/Surgery,"7-2-93 , emergency room exploratory laparotomy..."
2482,0194::T150,7125,7149,problem,present,ruptured left renal vein,Procedures/Surgery,"7-2-93 , emergency room exploratory laparotomy..."
2483,0194::T164,7616,7647,problem,present,severe necrotizing pancreatitis,Complications,"7-12-93 , 7-14 , 7-20 , 7-22 , and 7-24 , were..."


In [238]:
# number of notes
len(df.id.transform(lambda x: x.split('::')[0]).unique())

50

Feature extraction

In [190]:
def embed_sent(tokens_li, wv=wv):
    '''
    Turn a sentence (list of tokens) into an embedding by averaging its word vectors.
    :param tokens_li: list of preprocessed tokens (may be empty)
    :return: np.ndarray of dimension 200.
        A random vector if the list is empty or no token is in the vocabulary;
        otherwise the average of the word vectors.
        Tokens not in the vocabulary are skipped.
    '''
    if not tokens_li:
        return np.random.rand(200)
    
    count = 0
    current = np.zeros(200)
    skipped = set()
    for i in tokens_li:
        if i not in wv.key_to_index:
            skipped.add(i)
            continue
        count += 1
        current = (current * (count - 1) + wv.get_vector(i))/count 
    
    if count == 0:
        return np.random.rand(200)
    
    if skipped:
        print(f'================\nNot found in the vacab: {skipped}')

    return current 

In [191]:
def text_transformer(text):
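    '''
    Tokenize, keep only letters and hyphens (-), lowercase, then embed the tokens.
    '''
    tokens = word_tokenize(text)
    # Raw string avoids the invalid-escape warning; a trailing '-' in a character class is literal.
    tokens = [re.sub(r'[^a-zA-Z-]', '', i).strip().lower() for i in tokens]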
    tokens = [i for i in tokens if i != '']
    return embed_sent(tokens)

In [206]:
# concept embedding 
concept_embed = df.apply(lambda x: text_transformer(x['c']), axis=1, result_type='expand')

Not found in the vacab: {'exacerpation'}
Not found in the vacab: {'exacerpation'}
Not found in the vacab: {'anapical'}
Not found in the vacab: {'extopyon'}


In [193]:
# Sentence embedding
sent_embed = df.apply(lambda x: text_transformer(x['sent']), axis=1, result_type='expand')

Not found in the vacab: {'ijordcompmac', 'taroby'}
Not found in the vacab: {'freiermlinkeneighcaablinfarst'}
Not found in the vacab: {'nienemileisele', 'preheprior'}
Not found in the vacab: {'breunlinke'}
Not found in the vacab: {'breunlinke'}
Not found in the vacab: {'breunlinke'}
Not found in the vacab: {'breunlinke'}
Not found in the vacab: {'breunlinke'}
Not found in the vacab: {'laymie'}
Not found in the vacab: {'laymie'}
Not found in the vacab: {'elmvh', 'shaigaydona', 'schoellsullkotefong', '---', 'maustinie', 'freiermfusc'}
Not found in the vacab: {'elmvh', 'shaigaydona', 'schoellsullkotefong', '---', 'maustinie', 'freiermfusc'}
Not found in the vacab: {'gento', 'yaneslaunt', 'maillliepslighsint'}
Not found in the vacab: {'gento', 'yaneslaunt', 'maillliepslighsint'}
Not found in the vacab: {'gento', 'yaneslaunt', 'maillliepslighsint'}
Not found in the vacab: {'gento', 'yaneslaunt', 'maillliepslighsint'}
Not found in the vacab: {'gento', 'yaneslaunt', 'maillliepslighsint'}
Not f

In [194]:
sent_embed

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,-0.064707,0.384395,-0.468533,-0.080553,-0.255096,0.072584,-0.311713,0.216473,0.641069,0.342274,...,-0.031248,-0.501538,0.373918,0.134099,0.068561,-0.414175,-0.339091,-0.001164,-0.326886,-0.169732
1,-0.046252,0.272477,-0.273617,-0.045511,-0.203801,0.209715,-0.279689,0.298105,0.573146,0.311637,...,-0.063372,-0.332264,0.089704,0.022142,-0.023571,-0.303106,-0.250754,0.119181,-0.167865,-0.224531
2,0.073445,0.358834,-0.069379,0.123792,-0.262080,0.125619,-0.410947,0.329054,0.507147,0.271214,...,0.058458,-0.316363,0.187362,0.186948,0.012272,-0.353090,0.020669,0.250579,-0.203437,-0.476817
3,0.074016,0.056316,-0.135955,0.103325,-0.216840,-0.033189,-0.183829,0.104087,0.543406,0.233325,...,0.079607,-0.239987,0.097954,0.066457,-0.206716,-0.265553,-0.148072,0.224581,-0.096775,-0.358192
4,-0.033655,0.276199,-0.262096,0.021220,-0.081770,0.003196,-0.325079,0.190374,0.618006,0.180608,...,0.077576,-0.421060,0.240369,0.328979,-0.156029,-0.374467,-0.170497,0.231744,-0.025903,-0.396418
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2480,0.034549,0.179899,-0.183499,0.189685,-0.196032,0.325938,-0.250845,0.072951,0.474206,0.126242,...,-0.012124,-0.413559,0.277520,0.094156,-0.089839,-0.239982,-0.215583,-0.044092,-0.215788,-0.483029
2481,0.034549,0.179899,-0.183499,0.189685,-0.196032,0.325938,-0.250845,0.072951,0.474206,0.126242,...,-0.012124,-0.413559,0.277520,0.094156,-0.089839,-0.239982,-0.215583,-0.044092,-0.215788,-0.483029
2482,0.034549,0.179899,-0.183499,0.189685,-0.196032,0.325938,-0.250845,0.072951,0.474206,0.126242,...,-0.012124,-0.413559,0.277520,0.094156,-0.089839,-0.239982,-0.215583,-0.044092,-0.215788,-0.483029
2483,-0.046467,0.140476,-0.200314,0.012333,-0.271358,0.105989,-0.279307,0.111648,0.519055,0.061328,...,0.012965,-0.229361,0.230629,0.206227,-0.253299,-0.272201,0.033349,0.203672,-0.118689,-0.255258


In [202]:
section_encode = pd.get_dummies(df.s, dtype=float)

In [212]:
features = pd.concat([concept_embed, sent_embed, section_encode], axis=1)
features.columns = features.columns.astype(str)
features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,Patient information/Demographics,Physical examination/Status,Present illness,Problems,Procedures/Surgery,Radiology,Reasons/Indications,Review of systems,Social history,Unknown/Unclassified
0,0.109126,0.421394,-0.554760,-0.073817,-0.355418,0.285520,-0.343200,0.273875,0.614616,0.254417,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.109126,0.421394,-0.554760,-0.073817,-0.355418,0.285520,-0.343200,0.273875,0.614616,0.254417,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.160263,0.318689,0.097067,0.110554,-0.443355,0.095429,-0.518645,0.045509,0.542215,0.390435,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.160263,0.318689,0.097067,0.110554,-0.443355,0.095429,-0.518645,0.045509,0.542215,0.390435,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.005735,0.485740,-0.329065,0.314020,-0.401773,0.169909,-0.482345,-0.118701,0.527160,0.439670,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2480,-0.206621,0.394687,-0.026277,0.573347,-0.305623,0.390420,0.149948,0.361260,0.463523,0.182643,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2481,0.080779,0.274269,-0.155487,0.231534,-0.320310,0.434496,-0.203749,0.212669,0.142134,0.305124,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2482,0.104982,0.401942,-0.466670,0.360233,-0.174763,0.435795,-0.184113,0.118282,0.732727,0.171617,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2483,-0.047389,0.153507,0.163099,0.415893,-0.211197,0.518803,-0.035749,0.387763,0.837130,0.076910,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [222]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set.
# The target is binary: True when the assertion status is 'present', False for any other status.
np.random.seed(31415)
X_train, X_test, y_train, y_test = train_test_split(features, df.a == 'present', test_size=0.3, random_state=109) # 70% training and 30% test

In [227]:
# With section information
clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.84      0.79      0.81       243
        True       0.90      0.92      0.91       503

    accuracy                           0.88       746
   macro avg       0.87      0.86      0.86       746
weighted avg       0.88      0.88      0.88       746



In [228]:
# Without section information (first 400 columns: concept + sentence embeddings)
clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(X_train.iloc[:, :400], y_train)
y_pred = clf.predict(X_test.iloc[:, :400])
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.81      0.77      0.79       243
        True       0.89      0.91      0.90       503

    accuracy                           0.87       746
   macro avg       0.85      0.84      0.85       746
weighted avg       0.87      0.87      0.87       746



In [229]:
# Concept only (first 200 columns: concept embedding)
clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(X_train.iloc[:, :200], y_train)
y_pred = clf.predict(X_test.iloc[:, :200])
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.73      0.62      0.67       243
        True       0.83      0.89      0.86       503

    accuracy                           0.80       746
   macro avg       0.78      0.75      0.76       746
weighted avg       0.80      0.80      0.80       746



### TODO 
- cross validation 
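
For the cross-validation item, stratification matters because the two classes are imbalanced (503 vs. 243 in the test split above). Here is a minimal sketch of how stratified folds could be built, using only the standard library; `stratified_kfold_indices` is a hypothetical helper and `toy_labels` is made-up data. In practice scikit-learn's `StratifiedKFold` or `cross_val_score` would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions roughly kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the shuffled indices of each class round-robin into the folds.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Made-up labels (assumption): 20 'present' (True) and 10 other (False).
toy_labels = [True] * 20 + [False] * 10
splits = list(stratified_kfold_indices(toy_labels, n_splits=5, seed=0))
for train_idx, test_idx in splits:
    n_true = sum(toy_labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, present={n_true}")
```

Because several concepts can share one sentence or note, a row-level split may place near-identical rows in both training and test folds; grouping folds by note ID would be a stricter variant worth considering.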

1. Generalizability test on the MIMIC-III data. (Priority)
2. Normalized section vs. SOAP section?

- Add a feature: whether the assertion (e.g. present) is dominant in the section that the concept falls in.

- Question: how to combine word embeddings with rule-based features? (The scales are different.)

- Thoughts: instead of one-hot encoding, use an embedding to represent section information.
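
On the question of differing scales: one common option is to standardize every feature column (zero mean, unit variance) using statistics from the training rows only, then apply the same transform to the test rows. The sketch below is one possible approach, not a tested answer to the question; `fit_standardizer` and `apply_standardizer` are hypothetical helpers and the rows are made-up values. scikit-learn's `StandardScaler` provides the same behaviour.

```python
import statistics


def fit_standardizer(rows):
    """Column-wise mean and population standard deviation, from training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with previously fitted means and standard deviations."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows (assumption): one embedding dimension, one binary rule flag, one count feature.
train_rows = [
    [0.10, 0.0, 3.0],
    [0.30, 1.0, 7.0],
    [0.20, 0.0, 5.0],
]
means, stds = fit_standardizer(train_rows)
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer([[0.25, 1.0, 4.0]], means, stds)
print(train_scaled)
print(test_scaled)
```

Fitting on the training rows alone keeps test-set statistics out of the model, and a linear-kernel SVM then sees embedding dimensions and rule-based features on a comparable scale.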