# Identifying Entities in Healthcare Data

## <font color="blue"> Introduction:

A health tech company called BeHealthy has a web platform that allows doctors to list their services and manage patient interactions and provides services for patients such as booking interactions with doctors and ordering medicines online. Here, doctors can easily organise appointments, track past medical records and provide e-prescriptions, thus generating huge amount of data day by day and through which we can take proper decisions using machine learning.

## <font color="blue"> Problem statement:

In the generated text, there is lot of diseases and their respective treatment are explicitly mentioned. In order to map the disease with their treatment, we can build an algorithm using syntactic processing in NLP.

## <font color="blue"> Importing and installing usefull packages:

In [1]:
#!pip install pycrf
#!pip install sklearn-crfsuite

import spacy
import sklearn
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import pandas as pd
import numpy as np
import random

model = spacy.load("en_core_web_sm")

## <font color="blue"> Data preprocessing:

The dataset provided is in the form of one word per line. Let's understand the format of data below:
- Suppose there are *x* words in a sentence, then there will be *x* continuous lines with one word in each line. 
- Further, the two sentences are separated by empty lines. The labels for the data follow the same format.

We need to pre-process the data to recover the complete sentences and their labels.

**Constructing the proper sentences from individual words and printing the first 5 sentences.**

In [2]:
# Creating a function for converting words into sentences

def words_to_sentence(path):
  # opening the file 
  file = open(path, "r")

  # reading the lines and storing in a variable
  file_name = file.readlines()
  file.close()

  # Creating a output list to store the all the sentences
  output_list = []

  # creating a string to store each sentences to append separately in the output list
  sentence = ""

  # iterating through all the words and storing in the output list
  for word in file_name:
    word = word.strip("\n")
    if word != "":
      if sentence:
        sentence +=" "+word
      else:
        sentence = word

    else:
      output_list.append(sentence)
      sentence = ""

  return output_list

In [3]:
# Creating the sentences and labels for train and test dataset
train_sentences = words_to_sentence("train_sent")
train_labels = words_to_sentence("train_label")
test_sentences = words_to_sentence("test_sent")
test_labels = words_to_sentence("test_label")

In [4]:
# printing first five sentences and its labels
for i in range(5):
  print("\033[1m",f"Row {i+1}","\033[0m")
  print(f"Sentence is: ", train_sentences[i])
  print(f"Labels for sentence is:", train_labels[i])
  print("--"*50)

[1m Row 1 [0m
Sentence is:  All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
Labels for sentence is: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
----------------------------------------------------------------------------------------------------
[1m Row 2 [0m
Sentence is:  The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
Labels for sentence is: O O O O O O O O O O O O O O O O O O O O O O O O O
----------------------------------------------------------------------------------------------------
[1m Row 3 [0m
Sentence is:  Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
Labels for sentence is: O O O O O O O O O O O O O O O
---------------------------------------------------------

**Counting the number of sentences in the processed train and test dataset.**

In [5]:
print(f"Number of sentences in processed train dataset is: {len(train_sentences)}")
print(f"Number of sentences in processed test dataset is: {len(test_sentences)}")

Number of sentences in processed train dataset is: 2599
Number of sentences in processed test dataset is: 1056


**Counting the number of lines of labels in the processed train and test dataset.**

In [6]:
print(f"Number of lines of labels in processed train dataset is: {len(train_labels)}")
print(f"Number of lines of labels in processed test dataset is: {len(test_labels)}")

Number of lines of labels in processed train dataset is: 2599
Number of lines of labels in processed test dataset is: 1056


## <font color="blue"> Concept Identification:

We will first explore what are the various concepts present in the dataset. For this, we will use PoS Tagging. 

**Extracting those tokens which have NOUN or PROPN as their PoS tag and find their frequency.**

In [7]:
# Identifying the noun and pronouns in the train and test sentences

# Creating a variable to store the noun and propn in the corpus
noun_pronoun_list = []

# Creating the list of train and test sentences
corpus = [train_sentences, test_sentences]

# Extracting the noun and propn from the corpus and storing into noun_pronoun_list
for sentences in corpus:
  for sentence in sentences:
    nlp = model(sentence)
    for token in nlp:
      if token.pos_ == "NOUN" or token.pos_ == "PROPN":
        noun_pronoun_list.append(token.text)

In [8]:
# Creating a series from the created list
df_noun_pronoun = pd.Series(noun_pronoun_list)

**Printing the top 25 most common tokens with NOUN or PROPN PoS tags.**

In [9]:
# Checking the top most Frequent nouns or proper noun in the corpus
df_noun_pronoun.value_counts().head(25)

patients        492
treatment       281
%               247
cancer          200
therapy         175
study           152
disease         141
cell            140
lung            116
group            94
chemotherapy     88
gene             87
effects          85
results          78
women            77
use              74
risk             71
surgery          71
cases            71
analysis         70
rate             67
response         66
survival         65
children         64
effect           63
dtype: int64

## <font color="blue"> Defining the features for CRF:

> **Performing following steps:**
> - Defining the features with the pos tag as one of the features.
> - Also need to consider the preceding word details of the current word for better accurate and exhaustive model.
> - Marking the beginning and the end words of a sentence correctly in the form of features.

**Defining the features to get the feature value for one word.**


Defining the following features for CRF model building:

- f1 = input word is in lower case; 
- f2 = last 3 characters of word;
- f3 = last 2 characers of word;
- f4 = 1; if the word is in uppercase, 0 otherwise;
- f5 = 1; if word is a number; otherwise, 0 
- f6 = 1; if the word starts with a capital letter; otherwise, 0
- f7 = BEG, if the word is the beggining word in the sentence
- f8 = END, if the word is the end word in the sentence

In [10]:
# Extracting the features from the word
def getFeaturesForOneWord(sentence, position, pos_tag):
  word = sentence[position]

  # Defining the features using the word
  features = ["word.lower=" + word.lower(),
              "word[-3:]=" + word[-3:],
              "word[-2:]=" + word[-2:],
              "word.isupper=%s" %word.isupper(),
              "word.isdigit=%s" %word.isdigit(),
              "word.startWithCapital=%s" %word[0].isupper(),
              "word.pos=" + pos_tag[position]]

  # Defining the features using the preceding word
  if (position > 0):
    prev_word = sentence[position-1]
    features.extend(["prev_word.lower=" + prev_word.lower(),
                     "prev_word.isupper=%s" %word.isupper(),
                     "prev_word.isdigit=%s" %word.isdigit(),
                     "prev_word.startsWithCapital=%s" %word[0].isupper(),
                    "prev_word.pos=" + pos_tag[position-1]
                    ])
    
  else:
    # Feature to track whether the word is in the beginning of the sentence
    features.append("BEG")

  if(position == len(sentence) - 1):
    features.append("END")

  return features

## <font color="blue"> Getting the features and the labels of sentences::

**Writing a code/function to get the features for a sentence.**

In [11]:
# Defining a function to get features for a sentence
def getFeaturesForOneSentence(sentence):

  # Splitting the sentence into tokens
  sentence_list = sentence.split()

  # Extracting the list of pos_tag for the sentence
  pos_tag = []
  processed_sentence = model(sentence)
  for token in processed_sentence:
    pos_tag.append(token.pos_)

  return [getFeaturesForOneWord(sentence_list, pos, pos_tag) for pos in range(len(sentence_list))]

**Writing a code/function to get the labels of a sentence.**

In [12]:
# Defining the function to get labels of a sentence
def getLabelsInListForOneSentence(labels):
  return labels.split()

**Checking the features for sample sentence.**

In [13]:
# Checking the features for sample sentence
sample_sentence = train_sentences[2]
print("The sentence is **"+sample_sentence,"**")

features = getFeaturesForOneSentence(sample_sentence)
# Checking the features of a sample sentence
for index,word in enumerate(sample_sentence.split()):
  print("\033[1m","Word = ",word,"\033[0m" )
  print(features[index])

The sentence is **Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 ) **
[1m Word =  Abnormal [0m
['word.lower=abnormal', 'word[-3:]=mal', 'word[-2:]=al', 'word.isupper=False', 'word.isdigit=False', 'word.startWithCapital=True', 'word.pos=ADJ', 'BEG']
[1m Word =  presentation [0m
['word.lower=presentation', 'word[-3:]=ion', 'word[-2:]=on', 'word.isupper=False', 'word.isdigit=False', 'word.startWithCapital=False', 'word.pos=NOUN', 'prev_word.lower=abnormal', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=False', 'prev_word.pos=ADJ']
[1m Word =  was [0m
['word.lower=was', 'word[-3:]=was', 'word[-2:]=as', 'word.isupper=False', 'word.isdigit=False', 'word.startWithCapital=False', 'word.pos=AUX', 'prev_word.lower=presentation', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=False', 'prev_word.pos=NOUN']
[1m Word =  the [0m
['word.lower=the', 'word[-3:]=the', 'word[-2:]=he', 'word.isupper=

## <font color="blue"> Defining input and target variables:

Correctly computing X and Y sequence matrices for training and test data.
Check that both sentences and labels are processed

**Extracting the features values for each sentence as an input variable for the CRF model in the test and the train dataset.**

In [14]:
# Creating features for all the sentences in train and test
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

**Define the labels as the target variable for test and the train dataset.**

In [15]:
# Creating labels in list for one sentence
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]

## <font color="blue"> Build the CRF Model:

In [16]:
#!pip install scikit-learn==0.22.2 --user

In [17]:
# Building the model
crf = sklearn_crfsuite.CRF(max_iterations=300)
crf.fit(X_train,Y_train)



CRF(algorithm=None, all_possible_states=None, all_possible_transitions=None,
    averaging=None, c=None, c1=None, c2=None, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=300,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

## <font color="blue"> Evaluation:

**Predicting the labels of each of the tokens in each sentence of the test dataset that has been preprocessed earlier.**

In [18]:
# Predicting labels for the test sentences
Y_pred = crf.predict(X_test)

**Calculating the f1 score using the actual and the predicted labels of the test dataset.**

In [19]:
# Calculating the f1 score
f1_score = metrics.flat_f1_score(Y_test, Y_pred, average="weighted")
print(f"F1 score is = {round(f1_score,4)}")

F1 score is = 0.9112


## <font color="blue"> Identifying the diseases and treatment using a custom NER: 

We now use the CRF model's prediction to prepare a record of diseases identified in the corpus and treatments used for the diseases.

In [20]:
# Creating a dictionary
Disease_Treatment_Dict = dict()

# looping through all the sentences in the test sentences
for i in range(len(test_sentences)):
  labels_list = Y_pred[i]

  # Creating an empty string to store the diseases and treatments
  disease = ""
  treatments = ""

  # Storing the words that belongs to D and T in disease and treatments string respectively
  for j,label in enumerate(labels_list):
    if label == "D":
      disease += test_sentences[i].split()[j]+" "
    elif label == "T":
      treatments += test_sentences[i].split()[j]+" "
  
  # Cleaning the space at the end of the string
  disease = disease.lstrip().rstrip()
  treatments = treatments.lstrip().rstrip()

  # Storing the non empty disease and treatments into the disease treatment dictionary
  if disease != "" and treatments != "":
    if disease not in Disease_Treatment_Dict.keys():
      Disease_Treatment_Dict[disease] = treatments
    elif disease in Disease_Treatment_Dict.keys():
      treat_list = Disease_Treatment_Dict[disease].split(" ")
      treat_list.append(treatments)
      Disease_Treatment_Dict[disease] = treat_list

In [21]:
# Printing the dictionary
Disease_Treatment_Dict

{'B16 melanoma': 'adenosine triphosphate and treatment with buthionine sulfoximine',
 "Barrett 's esophagus": 'Acid suppression therapy',
 'CBD stones': 'one-time surgical exploration',
 "Eisenmenger 's syndrome": 'laparoscopic cholecystectomy',
 "Parkinson 's disease": 'Microelectrode-guided posteroventral pallidotomy',
 "abdominal tuberculosis Crohn 's disease": 'steroids',
 'acoustic neuroma': 'Stereotactic radiosurgery',
 'acute cerebral ischemia': 'Antiplatelet therapy',
 'acute myocardial infarction': 'Thrombolytic therapy',
 'acute nasopharyngitis ( ANP )': 'antibiotic treatment',
 'acute occlusion of the middle cerebral artery': 'thrombolytic therapy',
 'advanced esophageal cancer': 'adjuvant chemoradiotherapy with CDDP',
 'advanced non -- small-cell lung cancer': 'paclitaxel plus carboplatin ( pc ) vinorelbine plus cisplatin',
 'advanced nsclc': 'assessing combination chemotherapy of cisplatin , ifosfamide and irinotecan with rhg-csf support',
 'advanced rectal cancer': 'Nerve

**Predicting the treatment for the disease name: 'hereditary retinoblastoma'**

In [22]:
# Checking the treatment for hereditary retinoblastoma disease
Disease_Treatment_Dict['hereditary retinoblastoma']

'radiotherapy'