<a href="https://colab.research.google.com/github/VvRavi78/Identify_Entities_In_Healthcare_Data/blob/main/NER_Healthcare_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Identifying Entities(NER) in Healthcare Data

By: Venkata Ravi Kumar Vissamsetty

In [6]:
# Setting up Google Colab Environment. Please disable if running locally.
import pathlib
import os
base_dir = pathlib.Path('/content/NLP')
os.chdir(str(base_dir))

In [7]:
!ls

test_label  test_sent  train_label  train_sent


In [8]:
# Installing and importing required libraries
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import pandas as pd

model = spacy.load("en_core_web_sm")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... [?25l[?25hdone
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1897 sha256=7895ff12db1d05513c6c14c14ad349bbf8d0563957f42a6abcc6435052d9b1e2
  Stored in directory: /root/.cache/pip/wheels/0b/68/37/a457e156cfd6174ed28c9c8cb76f18eeb559b760d84c0a22eb
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)
[K     |████████████████████████████████| 965 kB 4.9 MB

**Data Preprocessing**
The dataset provided is in the form of one word per line. Let's understand the format of data below:

1. If there are X words in a sentence, then there will be X continuous lines with one word in each line.
2. Two sentences are separated by empty lines. 
3. The labels for the data follow the same format.

We need to pre-process the data to recover the complete sentences and their labels.

Construct the proper sentences from individual words and print the 5 sentences.



In [9]:
# Reading the train and test sentences and labels
with open('train_sent', 'r') as train_sent_file:
  train_words = train_sent_file.readlines()

with open('train_label', 'r') as train_labels_file:
  train_labels_by_word = train_labels_file.readlines()

with open('test_sent', 'r') as test_sent_file:
  test_words = test_sent_file.readlines()

with open('test_label', 'r') as test_labels_file:
  test_labels_by_word = test_labels_file.readlines()

In [12]:
# Check to see if the # of tokens and # of corresponding labels match.
print("Count of tokens in training set\n","No. of words: ",len(train_words),"\nNo. of labels: ",len(train_labels_by_word))
print("\n\nCount of tokens in test set\n","No. of words: ",len(test_words),"\nNo. of labels: ",len(test_labels_by_word))

Count of tokens in training set
 No. of words:  48501 
No. of labels:  48501


Count of tokens in test set
 No. of words:  19674 
No. of labels:  19674


In [13]:
# Function to combine tokens belonging to the same sentence. Sentences are separated by "\n" in the dataset.
def convert_to_sentences(dataset):
    sent_list = []
    sent = ""
    for entity in dataset:
        if entity != '\n':
            sent = sent + entity[:-1] + " "       # Adding word/label to current sentence/sequence of labels 
        else: 
            sent_list.append(sent[:-1])           # Removing the space added after the last entity.
            sent = ""
    return sent_list

In [14]:
# Converting tokens to sentences and individual labels to sequences of corresponding labels.
train_sentences = convert_to_sentences(train_words)
train_labels = convert_to_sentences(train_labels_by_word)
test_sentences = convert_to_sentences(test_words)
test_labels = convert_to_sentences(test_labels_by_word)

print("First Six training sentences and their labels:\n")
for i in range(6):
    print(train_sentences[i],"\n",train_labels[i],"\n")

First Six training sentences and their labels:

All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O 

Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 ) 
 O O O O O O O O O O O O O O O 

The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 an

In [15]:
print("First Six test sentences and their labels:\n")
for i in range(6):
    print(test_sentences[i],"\n",test_labels[i],"\n")

First Six test sentences and their labels:

Furthermore , when all deliveries were analyzed , regardless of risk status but limited to gestational age > or = 36 weeks , the rates did not change ( 12.6 % , 280 of 2214 ; primary 9.2 % , 183 of 1994 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

As the ambient temperature increases , there is an increase in insensible fluid loss and the potential for dehydration 
 O O O O O O O O O O O O O O O O O O O 

The daily high temperature ranged from 71 to 104 degrees F and AFI values ranged from 1.7 to 24.7 cm during the study period 
 O O O O O O O O O O O O O O O O O O O O O O O O 

There was a significant correlation between the 2- , 3- , and 4-day mean temperature and AFI , with the 4-day mean being the most significant ( r = 0.31 , p & # 60 ; 0.001 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

Fluctuations in ambient temperature are inversely correlated to cha

In [16]:
#Count the # of sentences in the processed train and test dataset.
print("Number of sentences in the train dataset: {}".format(len(train_sentences)))
print("Number of sentences in the test dataset: {}".format(len(test_sentences)))

Number of sentences in the train dataset: 2599
Number of sentences in the test dataset: 1056


In [17]:
# Count the # of labels in the processed train and test dataset.
print("Number of labels in the train dataset: {}".format(len(train_labels)))
print("Number of labels in the test dataset: {}".format(len(test_labels)))

Number of labels in the train dataset: 2599
Number of labels in the test dataset: 1056


**Concept Identification**

We will first explore what are the different concepts present in the dataset. For this, we will use PoS Tagging.

Extract those tokens which have NOUN or PROPN as their PoS tag and find their frequency

In [18]:
# Create a merged dataset from training and test sentences, since this is an Exploratory analysis.
combined = train_sentences + test_sentences
print("Number of sentences in combined dataset (training + test): {}".format(len(combined)))

Number of sentences in combined dataset (training + test): 3655


In [20]:
# Create a list of tokens which have PoS tag as 'NOUN' or 'PROPN'
noun_propn = []         # Initiating list for nouns and proper nouns
pos_tag = []            # Initiating list for corresponding PoS tags.
for sent in combined:
    for token in model(sent):
        if token.pos_ in ['NOUN', 'PROPN']:
           noun_propn.append(token.text)
           pos_tag.append(token.pos_)
print("No. of tokens in combined dataset with PoS tag as 'NOUN' or 'PROPN': {}".format(len(noun_propn)))

print(len(pos_tag))

No. of tokens in combined dataset with PoS tag as 'NOUN' or 'PROPN': 24340
24340


In [22]:
# Print the top 20 most common tokens with NOUN or PROPN PoS tags
noun_pos = pd.DataFrame({"NOUN_PROPN":noun_propn,"POS_tag":pos_tag})
print("Top 20 comon tokens with PoS tag as 'NOUN' or 'PROPN' \n")
print(noun_pos["NOUN_PROPN"].value_counts().head(20))

Top 20 comon tokens with PoS tag as 'NOUN' or 'PROPN' 

patients        492
treatment       281
%               247
cancer          200
therapy         175
study           154
disease         142
cell            140
lung            116
group            94
chemotherapy     88
gene             87
effects          85
women            77
results          77
use              75
surgery          71
risk             71
cases            71
analysis         70
Name: NOUN_PROPN, dtype: int64


In [23]:
# Defining features for CRF
# Analysis of PoS tags - Independent assignment for words vs Contextual assignment in a sentence.
sentence = train_sentences[1]   
sent_list = sentence.split()      # Splitting the sentence into its constituent words.
position = 2                      # Choosing position of word within sentence. Index starts at 0.

word = sent_list[position]        # Extracting word for PoS tag analysis.

print(sentence)

# Independent assignment of PoS tag (No contextual info)
print("\nPoS tag of word in isolation\nWord:",word,"--",model(word)[0].pos_,"\n")

# Contextual assignment of PoS tag based on other words in the sentence.
print("PoS tag of all words in sentence with context in tact.")
for token in model(sentence):
    print(token.text, "--", token.pos_)

# Modified workflow to obtain PoS tag of specific word in question while keeping sentence context in tact.
print("\nResult of modified workflow to obtain PoS tag of word at a specific position while keeping context within sentence in-tact.")
cnt = 0                           # Count of the word position within sentence.
for token in model(sentence):
      postag = token.pos_
      if (token.text == word) and (cnt == position):
          break
      cnt += 1
print("Word:", word,"POSTAG:",postag)

The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )

PoS tag of word in isolation
Word: cesarean -- PROPN 

PoS tag of all words in sentence with context in tact.
The -- DET
total -- ADJ
cesarean -- ADJ
rate -- NOUN
was -- AUX
14.4 -- NUM
% -- NOUN
( -- PUNCT
344 -- NUM
of -- ADP
2395 -- NUM
) -- PUNCT
, -- PUNCT
and -- CCONJ
the -- DET
primary -- ADJ
rate -- NOUN
was -- AUX
11.4 -- NUM
% -- NOUN
( -- PUNCT
244 -- NUM
of -- ADP
2144 -- NUM
) -- PUNCT

Result of modified workflow to obtain PoS tag of word at a specific position while keeping context within sentence in-tact.
Word: cesarean POSTAG: ADJ


In [25]:
# As per the above analysis, the PoS tag of the word "cesarean" is not captured correctly if the word is considered individually. 
# However, if the word is considered as part of the sentence, then it is captured correctly. Defining a function below to execute this.
# Function to obtain contextual PoS tagger.
def contextual_pos_tagger(sent_list,position):
    '''Obtaining PoS tag for individual word with sentence context in-tact. 
       If the PoS tag is obtained for a word individually, it may not capture the context of use in the sentence and may assign the incorrect PoS tag.'''

    sentence = " ".join(sent_list)          # Sentence needs to be in string format to process it with spacy model. List of words won't work.
    posit = 0                               # Initialising variable to record position of word in joined sentence to compare with the position of the word under considertion.
    for token in model(sentence):
        postag = token.pos_
        if (token.text == word) and (posit == position):
            break
        posit += 1
    return postag

In [26]:
# Define the features to get the feature values for one word.
def getFeaturesForOneWord(sent_list, position):
  word = sent_list[position]
    
  # Obtaining features for current word
  features = [
    'word.lower=' + word.lower(),                                   # serves as word id
    'word.postag=' + contextual_pos_tagger(sent_list, position),    # PoS tag of current word
    'word[-3:]=' + word[-3:],                                       # last three characters
    'word[-2:]=' + word[-2:],                                       # last two characters
    'word.isupper=%s' % word.isupper(),                             # is the word in all uppercase
    'word.isdigit=%s' % word.isdigit(),                             # is the word a number
    'words.startsWithCapital=%s' % word[0].isupper()                # is the word starting with a capital letter
  ]
 
  if(position > 0):
    prev_word = sent_list[position-1]
    features.extend([
    'prev_word.lower=' + prev_word.lower(),                               # previous word
    'prev_word.postag=' + contextual_pos_tagger(sent_list, position - 1), # PoS tag of previous word
    'prev_word.isupper=%s' % prev_word.isupper(),                         # is the previous word in all uppercase
    'prev_word.isdigit=%s' % prev_word.isdigit(),                         # is the previous word a number
    'prev_words.startsWithCapital=%s' % prev_word[0].isupper()            # is the previous word starting with a capital letter
  ])
  else:
    features.append('BEG')                                                # feature to track begin of sentence 
 
  if(position == len(sent_list)-1):
    features.append('END')                                                # feature to track end of sentence
 
  return features

In [27]:
# Getting the features
# Write a code to get features for a sentence.
def getFeaturesForOneSentence(sentence):
  sentence_list = sentence.split()
  return [getFeaturesForOneWord(sentence_list, position) for position in range(len(sentence_list))]

In [28]:
# Checking feature extraction
example_sentence = train_sentences[5]
print(example_sentence)

features = getFeaturesForOneSentence(example_sentence)
features[0]

Cesarean rates at tertiary care hospitals should be compared with rates at community hospitals only after correcting for dissimilar patient groups or gestational age


['word.lower=cesarean',
 'word.postag=NOUN',
 'word[-3:]=ean',
 'word[-2:]=an',
 'word.isupper=False',
 'word.isdigit=False',
 'words.startsWithCapital=True',
 'BEG']

In [30]:
features[4]

['word.lower=care',
 'word.postag=NOUN',
 'word[-3:]=are',
 'word[-2:]=re',
 'word.isupper=False',
 'word.isdigit=False',
 'words.startsWithCapital=False',
 'prev_word.lower=tertiary',
 'prev_word.postag=NOUN',
 'prev_word.isupper=False',
 'prev_word.isdigit=False',
 'prev_words.startsWithCapital=False']

In [31]:
# Write a code to get the labels for a sentence.
def getLabelsInListForOneSentence(labels):
  return labels.split()
  
# Checking label extraction
example_labels = getLabelsInListForOneSentence(train_labels[5])
print(example_labels)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [37]:
# Define source and target variables
# Correctly computing X and Y sequence matrices for training and test data. Check that both sentences and labels are processed

# Define the features values for each sentence as source variable for CRF model in test and the train dataset
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

# Define the labels as the target variable for test and the train dataset
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]

In [41]:
# Building the CRF model. Using max_iterations as 200.
%%time

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=200,
    all_possible_transitions=True
)
try:
    crf.fit(X_train, Y_train)
except AttributeError:
    pass

CPU times: user 5.55 s, sys: 21 ms, total: 5.58 s
Wall time: 5.58 s


**Evaluation**

Predict the labels of each of the tokens in each sentence from the test dataset that has been pre processed earlier.

In [42]:
Y_pred = crf.predict(X_test)

In [43]:
# Calculate the f1 score using the actual labels and the predicted labels of the test dataset.
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')

0.9176590528721192

An F1 Score of more than 91% is good. We shall proceed with this CRF model.

In [45]:
# Example test sentence and corresponding actual and predicted labels 
print("Sentence: ",test_sentences[15])
print("Actual labels:    ", Y_test[15])
print("Predicted labels: ", Y_pred[15])

Sentence:  The rate of severe preeclampsia was increased significantly in the triplet group 12 of 53 ( 22.6 % ) as compared with the twin group 3 of 53 ( 5.7 % ) ( OR = 4.9 , 95 % CI 1.2-23.5 , p = 0.02 )
Actual labels:     ['O', 'O', 'O', 'O', 'D', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Predicted labels:  ['O', 'O', 'O', 'D', 'D', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [46]:
# Feature list of sentence above
print(X_test[15])

[['word.lower=the', 'word.postag=PUNCT', 'word[-3:]=The', 'word[-2:]=he', 'word.isupper=False', 'word.isdigit=False', 'words.startsWithCapital=True', 'BEG'], ['word.lower=rate', 'word.postag=PUNCT', 'word[-3:]=ate', 'word[-2:]=te', 'word.isupper=False', 'word.isdigit=False', 'words.startsWithCapital=False', 'prev_word.lower=the', 'prev_word.postag=PUNCT', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_words.startsWithCapital=True'], ['word.lower=of', 'word.postag=PUNCT', 'word[-3:]=of', 'word[-2:]=of', 'word.isupper=False', 'word.isdigit=False', 'words.startsWithCapital=False', 'prev_word.lower=rate', 'prev_word.postag=PUNCT', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_words.startsWithCapital=False'], ['word.lower=severe', 'word.postag=PUNCT', 'word[-3:]=ere', 'word[-2:]=re', 'word.isupper=False', 'word.isdigit=False', 'words.startsWithCapital=False', 'prev_word.lower=of', 'prev_word.postag=PUNCT', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'pre

**Identifying Diseases and Treatments using Custom NER**

We now use the CRF models prediction to prepare a record of diseases identified in the corpus and treatments used for those diseases.

Create the logic to get all the predicted treatments (T) labels corresponding to each disease (D) label in the test dataset.

In [47]:
# Extracting a dictionary of all the predicted diseases from our test data and the corresponding treatments.
# Assumption: For each identified disease, one of the treatments is in the same sentence as the disease exists.
disease_treatment = {}            # Initializing an empty dictionary
for i in range(len(Y_pred)):
    cnt_disease = 0           # Count of number of diseases mentioned in the sentence
    cnt_treatment = 0         # Count of the number of treatments mentioned in the sentence
    diseases = [""]           # Initializing a blank list of diseases for current sentence.
    treatment = [""]          # Initializing a blank list of treatments for current sentence.
    length = len(Y_pred[i])   # Length of current sentence.
    for j in range(length):
        if (Y_pred[i][j] == 'D'):                                                     # Checking for label indicating disease for current word ('D')
            diseases[cnt_disease] += (X_test[i][j][0].split('=')[1] + " ")            # Adding word to diseases list.
            if j < length - 1:
                if (Y_pred[i][j+1] != 'D'):                                           # Check for name of disease extending over multiple words. 
                    # If next word does not have label 'D', then truncate the space added at the end of the last word.
                    diseases[cnt_disease] = diseases[cnt_disease][:-1]
                    cnt_disease += 1
                    diseases.append("")                                               # Adding a placeholder for the next disease in the current sentence.
            else:
                diseases[cnt_disease] = diseases[cnt_disease][:-1]
                cnt_disease += 1
                diseases.append("")
                            
        if (Y_pred[i][j] == 'T'):                                                     # Checking for label indicating treatment for current word ('T')
            treatment[cnt_treatment] += (X_test[i][j][0].split('=')[1] + " ") # Adding word to corresponding treatment list.
            if j < length - 1:
                if (Y_pred[i][j+1] != 'T'):                                           # Check for name of treatment extending over multiple words. 
                    # If next word does not have label 'T', then truncate the space added at the end of the last word.
                    treatment[cnt_treatment] = treatment[cnt_treatment][:-1]
                    cnt_treatment += 1
                    treatment.append("")                                              # Adding a placeholder for the next treatment in the current sentence.
            else:
                treatment[cnt_treatment] = treatment[cnt_treatment][:-1]
                cnt_treatment += 1
                treatment.append("")

    diseases.pop(-1)    # Getting rid of the last empty placeholder in diseases list
    treatment.pop(-1)   # Getting rid of the last empty placeholder in treatments list

    # To our dictionary, add or append treatments to the diseases identified from the current sentence, if any.
    if len(diseases) > 0:       # Checking if any diseases have been identified for the current sentence.
        for disease in diseases:
            if disease in disease_treatment.keys():
                # Extend treatment list if other treatments for the particular disease already exist
                disease_treatment[disease].extend(treatment)
            else:
                # Creating list of treatments for particular disease if it doesn not exist already.
                disease_treatment[disease] = treatment
                
# Displaying dictionary of extracted diseases and potential treatments.
disease_treatment

{'macrosomic infants in gestational diabetes cases': ['good glycemic control'],
 'nonimmune hydrops fetalis': ['trisomy'],
 'preeclampsia': ['insemination program'],
 'severe preeclampsia': [],
 'asymmetric double hemiplegia': [],
 'a subchorial placental hematoma': [],
 'reversible nonimmune hydrops fetalis': [],
 'cancer': ['radiotherapy',
  'organ transplantation and chemotherapy',
  'chemotherapy',
  'matrix metalloproteinase inhibitors'],
 'breast cancer': ['hormone replacement therapy',
  'oxaliplatin',
  'vaccination',
  'undergone subcutaneous mastectomy'],
 'ovarian cancer': ['hormone replacement therapy',
  'oxaliplatin',
  'vaccination',
  'undergone subcutaneous mastectomy'],
 'prostate cancer': ['radical prostatectomy and iodine 125 interstitial radiotherapy'],
 'prostate cancers': [],
 'hereditary prostate cancer': [],
 'multiple sclerosis ( ms )': [],
 'hereditary retinoblastoma': ['radiotherapy'],
 'pericardial effusions': [],
 'epilepsy': ['methylphenidate', 'methylphe

1. It can be observed that several diseases do not have any identified treatments from our text corpus.
2. Avoid including these diseases in our final dictionary of diseases and corresponding treatments.

In [48]:
# Obtaining a neat version of our "disease_treatment" dictionary
neat_dict = {}
for key in disease_treatment.keys():
    if disease_treatment[key] != []:
        neat_dict[key] = disease_treatment[key]
neat_dict


{'macrosomic infants in gestational diabetes cases': ['good glycemic control'],
 'nonimmune hydrops fetalis': ['trisomy'],
 'preeclampsia': ['insemination program'],
 'cancer': ['radiotherapy',
  'organ transplantation and chemotherapy',
  'chemotherapy',
  'matrix metalloproteinase inhibitors'],
 'breast cancer': ['hormone replacement therapy',
  'oxaliplatin',
  'vaccination',
  'undergone subcutaneous mastectomy'],
 'ovarian cancer': ['hormone replacement therapy',
  'oxaliplatin',
  'vaccination',
  'undergone subcutaneous mastectomy'],
 'prostate cancer': ['radical prostatectomy and iodine 125 interstitial radiotherapy'],
 'hereditary retinoblastoma': ['radiotherapy'],
 'epilepsy': ['methylphenidate', 'methylphenidate'],
 'adhd': ['methylphenidate', 'methylphenidate'],
 'unstable angina or non-q-wave myocardial infarction': ['roxithromycin'],
 'coronary-artery disease': ['antichlamydial antibiotics'],
 'cerebral palsy': ['hyperbaric oxygen therapy'],
 'primary pulmonary hypertensi

In [49]:
# Converting dictionary to a dataframe
neat_df = pd.DataFrame({"Disease":neat_dict.keys(),"Treatments":neat_dict.values()})
neat_df.head()

Unnamed: 0,Disease,Treatments
0,macrosomic infants in gestational diabetes cases,[good glycemic control]
1,nonimmune hydrops fetalis,[trisomy]
2,preeclampsia,[insemination program]
3,cancer,"[radiotherapy, organ transplantation and chemo..."
4,breast cancer,"[hormone replacement therapy, oxaliplatin, vac..."


In [50]:
# Predict the treatment for the disease name: 'preeclampsia'

search_item = 'preeclampsia'
treatments = neat_dict[search_item]
print("Treatments for '{0}' is/are ".format(search_item), end = "")
for i in range(len(treatments)-1):
    print("'{}'".format(treatments[i]),",", end="")
print("'{}'".format(treatments[-1]))

Treatments for 'preeclampsia' is/are 'insemination program'
