# Syantactic Processing - Identifying Entities in Healthcare Data Assignment

Project URL: https://github.com/Vasista-Eranki/SyntacticProcessingAssignment/

### Imports and Prepare the Model

In [1]:
#!pip install spacy
#!pip install sklearn-crfsuite

In [2]:
import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics

model = spacy.load("en_core_web_sm")

In [3]:
import pandas as pd
import numpy as np

## Read the input file contents

In [4]:
base_dir = './'
with open(base_dir+'train_sent', 'r') as train_sentence_file:
  input_train_sent = train_sentence_file.readlines()

with open(base_dir+'train_label', 'r') as train_label_file:
  input_train_labels = train_label_file.readlines()

with open(base_dir+'test_sent', 'r') as test_sentence_file:
  input_test_sent = test_sentence_file.readlines()

with open(base_dir+'test_label', 'r') as test_label_file:
  input_test_labels = test_label_file.readlines()


**The Lines in each file have \n as a trailing character. This needs to be trimmed down before creating the sentences**

In [5]:
def get_sentences(data):
    "Returns a list of sentences by joining the input data wherever an empty string value is encountered..."
    return_data = []
    current_line = ''
    for w in data:
        word = w.strip()
        if len(word) == 0 :
            return_data.append(current_line)
            current_line = ""
        else:
            if len(current_line) == 0:
                current_line = word
            else:
                current_line += ' '+ word
    return return_data

In [6]:
train_sentences = get_sentences(input_train_sent)
train_labels = get_sentences(input_train_labels)
test_sentences = get_sentences(input_test_sent)
test_labels = get_sentences(input_test_labels)

#### Number of Sentences in Train and Test dataset

In [7]:
print("Total Sentences in the training Corpus: ", len(train_sentences))
print("Total Sentences in the testing Corpus: ", len(test_sentences))

Total Sentences in the training Corpus:  2599
Total Sentences in the testing Corpus:  1056


#### Number of lines of Labels in the Train and Test dataset

In [8]:
print("Total Labels in the training Corpus: ", len(train_labels))
print("Total Labels in the testing Corpus: ", len(test_labels))

Total Labels in the training Corpus:  2599
Total Labels in the testing Corpus:  1056


### Task-01: Print 5 sentences

#### Training sentences

In [9]:
for i in range(0,5):
    print("Sentence: ", train_sentences[i])
    print("Labels: ", train_labels[i], end="\n\n")

Sentence:  All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
Labels:  O O O O O O O O O O O O O O O

Sentence:  The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  Arrest of dilation was the most common indication in both `` c

#### Testing sentences

In [10]:
for i in range(0,5):
    print("Sentence: ", test_sentences[i])
    print("Labels: ", test_labels[i], end="\n\n")

Sentence:  Furthermore , when all deliveries were analyzed , regardless of risk status but limited to gestational age > or = 36 weeks , the rates did not change ( 12.6 % , 280 of 2214 ; primary 9.2 % , 183 of 1994 )
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  As the ambient temperature increases , there is an increase in insensible fluid loss and the potential for dehydration
Labels:  O O O O O O O O O O O O O O O O O O O

Sentence:  The daily high temperature ranged from 71 to 104 degrees F and AFI values ranged from 1.7 to 24.7 cm during the study period
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  There was a significant correlation between the 2- , 3- , and 4-day mean temperature and AFI , with the 4-day mean being the most significant ( r = 0.31 , p & # 60 ; 0.001 )
Labels:  O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O

Sentence:  Fluctuations in ambient temperat

### Task-02: Extract the POS - Frequency of NOUN OR PROPN in train & test corpus

In [11]:
train_data = pd.DataFrame(columns=['sent', 'Text', 'LEMMA', 'POS', 'DEP'])
test_data = pd.DataFrame(columns=['sent', 'Text', 'LEMMA', 'POS', 'DEP'])

In [12]:
for index, each_sent in enumerate(train_sentences):
    sent = model(each_sent)
    for token in sent:
        train_data.loc[len(train_data)] = {'sent': index, 'Text': token.text, 'LEMMA': token.lemma_, 'POS' : token.pos_, 'DEP': token.dep_}

In [13]:
for index, each_sent in enumerate(test_sentences):
    sent = model(each_sent)
    for token in sent:
        test_data.loc[len(test_data)] = {'sent': index, 'Text': token.text, 'LEMMA': token.lemma_, 'POS' : token.pos_, 'DEP': token.dep_}

In [14]:
nouns_in_train_data = train_data[(train_data.POS == 'NOUN') | (train_data.POS == 'PROPN')].loc[:, ['Text', 'POS']]
nouns_in_test_data = test_data[(test_data.POS == 'NOUN') | (test_data.POS == 'PROPN')].loc[:, ['Text', 'POS']]

final_nouns_in_data= pd.concat([nouns_in_train_data, nouns_in_test_data])

In [15]:
grouped_final_nouns_in_data = final_nouns_in_data.groupby(by='Text').agg('count').sort_values(by='POS', ascending=False).reset_index()

**Top 25 Most frequent words in the corpus (training+testing)**

In [16]:
grouped_final_nouns_in_data.head(25)

Unnamed: 0,Text,POS
0,patients,492
1,treatment,281
2,%,247
3,cancer,200
4,therapy,175
5,study,154
6,disease,142
7,cell,140
8,lung,116
9,group,94


### Task-03: Define the CRF Features

#### Functions to Process the Sentences

In [17]:
def features_for_word(words, position, pos_tags):
    word = words[position]
    features = [
        'word.lower=' + word.lower(), # serves as word id
        'word[-3:]=' + word[-3:],     # last three characters
        'word[-2:]=' + word[-2:],     # last two characters
        'word.isupper=%s' % word.isupper(),  # is the word in all uppercase
        'word.isdigit=%s' % word.isdigit(),  # is the word a number
        'word.startsWithCapital=%s' % word[0].isupper(), # is the word starting with a capital letter
        'word.pos=' + pos_tags[position]
    ]
    if position > 0:
        prev_word = words[position-1]
        previous_word_features = [
        'prev_word.lower=' + prev_word.lower(),
        'prev_word.isupper=%s' % prev_word.isupper(),
        'prev_word.isdigit=%s' % prev_word.isdigit(),
        'prev_word.startsWithCapital=%s' % prev_word[0].isupper(),
        'prev_word.pos=' + pos_tags[position-1]]
        features.extend(previous_word_features)
    else:
        features.append('BEG')

    if (position == len(pos_tags)-1):
        features.append('END')

    return features

#### Functions to process the Labels

In [18]:
def labels_for_sentence(labels):
  return labels.split()

### Task-04: Compute the Features of a Sentence

In [19]:
def features_for_sentence(sentence):
    spacy_sentence = model(sentence)

    pos_tags = []
    #pos_tags = [word.pos_ for word in spacy_sentence]
    for token in spacy_sentence:
        pos_tags.append(token.pos_)

    words = sentence.split()

    return [features_for_word(words, i, pos_tags) for i in range(0, len(words))]

In [20]:
features_for_sentence(train_sentences[0])[-1]

['word.lower=)',
 'word[-3:]=)',
 'word[-2:]=)',
 'word.isupper=False',
 'word.isdigit=False',
 'word.startsWithCapital=False',
 'word.pos=PUNCT',
 'prev_word.lower=status',
 'prev_word.isupper=False',
 'prev_word.isdigit=False',
 'prev_word.startsWithCapital=False',
 'prev_word.pos=NOUN',
 'END']

### Task-05: Extract Features' Values for the Sentence

In [21]:
X_train = [features_for_sentence(sentence) for sentence in train_sentences]
Y_train = [labels_for_sentence(labels) for labels in train_labels]

In [22]:
X_test = [features_for_sentence(sentence) for sentence in test_sentences]
Y_test = [labels_for_sentence(labels) for labels in test_labels]

In [23]:
print(X_train[0][-1])

['word.lower=)', 'word[-3:]=)', 'word[-2:]=)', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=PUNCT', 'prev_word.lower=status', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=False', 'prev_word.pos=NOUN', 'END']


In [24]:
print(Y_train[35])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'D', 'D', 'D']


In [25]:
print(Y_test[29])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'D', 'D', 'O', 'O', 'O', 'O', 'D', 'D', 'O', 'O', 'O', 'O', 'O', 'D', 'D']


### Task-06: CRF model for a custom NER application

In [26]:
import sklearn_crfsuite
from sklearn_crfsuite import metrics

In [27]:
crf = sklearn_crfsuite.CRF(max_iterations=100)

try:
    crf.fit(X_train, Y_train)
except AttributeError as e:
    print('EXCEPTION:', e)
    pass


### Evaluation of the Model

In [28]:
print("Classes identified in the custom CRF model:", crf.classes_)

Classes identified in the custom CRF model: ['O', 'D', 'T']


**Print Transition Details**

In [29]:
crf.transition_features_

{('O', 'O'): 0.899472,
 ('O', 'D'): 0.588422,
 ('O', 'T'): -1.918435,
 ('D', 'O'): -0.8025,
 ('D', 'D'): 4.574284,
 ('D', 'T'): -1.857337,
 ('T', 'O'): 0.633899,
 ('T', 'T'): 3.719617}

In [30]:
crf.state_features_

{('word.lower=all', 'O'): 0.234134,
 ('word[-3:]=All', 'O'): 0.024385,
 ('word[-2:]=ll', 'O'): -0.019916,
 ('word[-2:]=ll', 'D'): 0.245115,
 ('word[-2:]=ll', 'T'): -0.225199,
 ('word.isupper=False', 'O'): 1.0545,
 ('word.isupper=False', 'D'): -0.490619,
 ('word.isupper=False', 'T'): -0.563881,
 ('word.isdigit=False', 'O'): 0.085916,
 ('word.isdigit=False', 'D'): 0.057947,
 ('word.isdigit=False', 'T'): -0.143863,
 ('word.startsWithCapital=True', 'O'): 1.009713,
 ('word.startsWithCapital=True', 'D'): -0.459867,
 ('word.startsWithCapital=True', 'T'): -0.549846,
 ('word.pos=DET', 'O'): 0.382093,
 ('word.pos=DET', 'D'): -0.145612,
 ('word.pos=DET', 'T'): -0.236481,
 ('BEG', 'O'): 0.733014,
 ('BEG', 'D'): 0.052865,
 ('BEG', 'T'): -0.785879,
 ('word.lower=live', 'O'): 0.005976,
 ('word[-3:]=ive', 'O'): -0.043213,
 ('word[-3:]=ive', 'D'): 0.107902,
 ('word[-3:]=ive', 'T'): -0.064689,
 ('word[-2:]=ve', 'O'): 0.052991,
 ('word[-2:]=ve', 'D'): -0.167837,
 ('word[-2:]=ve', 'T'): 0.114846,
 ('word.

In [31]:
X_test = [features_for_sentence(sentence) for sentence in test_sentences]
Y_test = [labels_for_sentence(labels) for labels in test_labels]

In [32]:
y_pred = crf.predict(X_test)

### Task: 07: Calculate the F1 score

In [34]:
metrics.flat_f1_score(Y_test, y_pred, average='weighted')

0.906724757630721

In [35]:
print(len(y_pred), len(test_sentences))

1056 1056


### Task-08: Get all predicted Treatment labels
*and corresponding to each Disease label D in the test Dataaset*

In [37]:
overall_response = {}

for sent_index in range(len(y_pred)):
    current_disease = ''
    current_treatment = ''
    for label_index in range(len(y_pred[sent_index])):
        label = y_pred[sent_index][label_index]
        if label=='D':
            current_disease += test_sentences[sent_index].split()[label_index] + ' '
            #print('>D:', current_disease)
        elif label=='T':
            current_treatment += test_sentences[sent_index].split()[label_index] + ' '
            #print('>T:', current_treatment)

    current_disease = current_disease.strip()
    current_treatment = current_treatment.strip()

    if (len(current_disease) == 0 or len(current_treatment) == 0) :
        continue
    #print('>>D:', current_disease)
    #print('>>T:', current_treatment)
    #print('Disease Exists:', (current_disease in overall_response))
    #print('Disease:', overall_response[current_disease] )
    if current_disease in overall_response:
        overall_response[current_disease].append(current_treatment)
        #print(overall_response)
    else:
        overall_response[current_disease] = [current_treatment]
        #print(overall_response)

In [38]:
overall_response

{'hereditary retinoblastoma': ['radiotherapy'],
 'myocardial infarction': ['warfarin with 80 mg aspirin , or 1 mg warfarin with 80 mg aspirin'],
 'unstable angina or non-Q-wave myocardial infarction': ['roxithromycin'],
 'primary pulmonary hypertension ( PPH )': ['fenfluramines'],
 'foot infection': ['G-CSF treatment'],
 "early Parkinson 's disease": ['Ropinirole monotherapy'],
 'female stress urinary incontinence': ['surgical treatment'],
 'stress urinary incontinence': ['therapy'],
 'preeclampsia ( proteinuric hypertension )': ['intrauterine insemination with donor sperm versus intrauterine insemination'],
 'intra-abdominal injury': ['senior surgery celiotomy'],
 'cancer': ['organ transplantation and chemotherapy',
  'oral drugs chemotherapy'],
 'major pulmonary embolism': ['Thrombolytic treatment right-side hemodynamics'],
 'malignant pleural mesothelioma': ['thoracotomy , radiotherapy , and chemotherapy'],
 'tumor markers pulmonary symptoms': ['chemotherapy'],
 'non-obstructive azo

### Predict the treatment for the disease name: 'hereditary retinoblastoma'

In [39]:
overall_response['hereditary retinoblastoma']

['radiotherapy']