<b><u> Extraction of Selected Data Fields </u></b>

This Jupyter notebook extracts only the data fields that are useful for our research from the unstructured source data file. For this part of the research, I have extracted all clinical trial studies into one XML file containing only the required data fields. In this notebook, I have tried three different approaches to parsing.

1) Parsing using <b> Regular Expression </b>
2) Parsing using <b> DOM Minidom Parser </b>
3) Parsing using <b> ElementTree Parser </b>

These approaches are shown in the cells below:

In [4]:
# In this cell, I use regular expressions to parse the XML file and retrieve the required data fields.

# First import the regular expression module, then read the source data into a string.
import re
with open("/Users/Tarun/Documents/Courses/Fall_2016/Research/Data/Data_test/study_fields.xml") as source_file:
    text = source_file.read()

# Search for each opening and closing "title" tag and retrieve the text in between.
# The non-greedy (.*?) stops at the first closing tag, so two titles on one line are not merged.
found = re.findall(r"<title>(.*?)</title>", text)
for official_title in found:
    print(official_title)

# Search for each opening and closing "start_date" tag and retrieve the text in between.
f2 = re.findall(r"<start_date>(.*?)</start_date>", text)
for startDate in f2:
    print(startDate)


Multicenter, Randomized, Double-blind, Placebo-controlled Study to Evaluate the Effect of ITF2357 on Mucosal Healing in Patients With Moderate-to-severe Active Crohn's Disease
Contrast-Enhanced Ultrasound in Human Crohn's Disease
Lanreotide Autogel in the Treatment of Symptomatic Polycystic Liver Disease
Anakinra for Behcet s Disease
Phase 1 Study of Zoledronic Acid in Sickle Cell Disease
The COPD Patient Management European Trial (COMET)
Evaluation of [18F]MK-9470 as a Brain Tracer of Cannabinoid-1 Receptor in Parkinson's Disease and Healthy Subjects
Respiratory Kinematics of Cough in Healthy Older Adults and Parkinson's Disease
Bandage Lenses in Treating Patients With Ocular Graft-Versus-Host Disease
Safety and Efficacy Study of Hybrid Revascularization in Multivessel Coronary Artery Disease
Wireless Capsule Endoscopy in Small-Bowel Crohn's Disease
Safety and Efficacy of UC-MSC in Patients With Acute Severe Graft-versus-host Disease
STA-5326 in Crohn's Disease Patients
GM1 Gangliosid
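
The regular-expression approach above can be wrapped in a small reusable helper. The following is a minimal sketch, not the notebook's original code: the function name `extract_tag_text` and the inline sample XML are hypothetical and chosen only for illustration. It assumes that a field value may span several lines, so it combines a non-greedy pattern with `re.DOTALL`, whereas a plain `.` does not match newlines and would skip such values.

```python
import re

# Hypothetical sample standing in for study_fields.xml; the tag layout is an assumption.
SAMPLE_XML = """<study><title>Trial A</title><start_date>May 2010</start_date></study>
<study><title>Trial B,
continued</title><start_date>June 2011</start_date></study>"""


def extract_tag_text(tag, text):
    """Return the text between every <tag>...</tag> pair (hypothetical helper)."""
    # Non-greedy match; DOTALL lets "." cross line breaks inside a value.
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    # Collapse internal whitespace so multi-line values end up on one line.
    return [" ".join(match.split()) for match in pattern.findall(text)]


titles = extract_tag_text("title", SAMPLE_XML)
start_dates = extract_tag_text("start_date", SAMPLE_XML)
for title in titles:
    print(title)
```

Regular expressions like this only work for simple, non-nested tags without attributes; the parser-based cells are the more robust choice for anything else.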

In [14]:
# In this cell, we use the DOM minidom parser.

# Import the minidom parser from the standard library.
from xml.dom import minidom
import nltk  # not used in this cell

# Parse the source data file into a DOM document.
doc = minidom.parse("/Users/Tarun/Documents/Courses/Fall_2016/Research/Data/Data_test/study_fields.xml")

# Collect every "study" element, then loop over the studies to retrieve
# the same data fields from each one.
clinicalstudy = doc.getElementsByTagName("study")
for study in clinicalstudy:
    print("************************* STUDY ****************************")

    # Retrieve the required fields from the current study.
    Title = study.getElementsByTagName("title")[0]
    print("Title: %s" % Title.firstChild.data)

    # Print the first condition listed under each "conditions" element.
    Conditions = study.getElementsByTagName("conditions")
    for conditions in Conditions:
        condition1 = conditions.getElementsByTagName("condition")[0]
        print(condition1.childNodes[0].data)

    # getAttribute returns an empty string when the element has no such attribute,
    # and getElementsByTagName returns a NodeList of DOM elements rather than their text.
    Interventions = study.getElementsByTagName("interventions")
    for interventions in Interventions:
        Intervention1 = interventions.getAttribute("type")
        Intervention2 = interventions.getElementsByTagName("intervention")
        print("Intervention_type: %s" % Intervention1)
        print("Interventions:  %s" % Intervention2)
    
    

************************* STUDY ****************************
Title: Multicenter, Randomized, Double-blind, Placebo-controlled Study to Evaluate the Effect of ITF2357 on Mucosal Healing in Patients With Moderate-to-severe Active Crohn's Disease
Crohn's Disease
Intervention_type: 
Interventions:  [<DOM Element: intervention at 0x25c167470>, <DOM Element: intervention at 0x25c167508>]
************************* STUDY ****************************
Title: Contrast-Enhanced Ultrasound in Human Crohn's Disease
Crohn's Disease
Intervention_type: 
Interventions:  [<DOM Element: intervention at 0x25c172d58>, <DOM Element: intervention at 0x25c172df0>]
************************* STUDY ****************************
Title: Lanreotide Autogel in the Treatment of Symptomatic Polycystic Liver Disease
Polycystic Liver Disease
Intervention_type: 
Interventions:  [<DOM Element: intervention at 0x25c17b508>]
************************* STUDY ****************************
Title: Anakinra for Behcet s Disease
Auto
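
The output above shows an empty intervention type and raw `DOM Element` objects instead of text. The following is a minimal sketch of one way to collect readable values with minidom. It rests on assumptions that the source does not confirm: that the `type` attribute sits on each `<intervention>` element rather than on `<interventions>`, and that the file layout resembles the inline sample. The helper names `element_text` and `summarize_studies` are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample; the real study_fields.xml layout is assumed, not confirmed.
SAMPLE_XML = """<studies>
  <study>
    <title>Trial A</title>
    <conditions><condition>Crohn's Disease</condition><condition>Colitis</condition></conditions>
    <interventions>
      <intervention type="Drug">Drug X</intervention>
      <intervention type="Drug">Placebo</intervention>
    </interventions>
  </study>
  <study>
    <title>Trial B</title>
  </study>
</studies>"""


def element_text(element):
    """Concatenate the text nodes directly under a DOM element (hypothetical helper)."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def summarize_studies(doc):
    """Build one dictionary per study with its title, conditions and interventions."""
    summaries = []
    for study in doc.getElementsByTagName("study"):
        titles = study.getElementsByTagName("title")
        summaries.append({
            "title": element_text(titles[0]) if titles else None,
            "conditions": [element_text(c) for c in study.getElementsByTagName("condition")],
            # Assumption: the "type" attribute lives on each <intervention> element.
            "interventions": [
                (i.getAttribute("type"), element_text(i))
                for i in study.getElementsByTagName("intervention")
            ],
        })
    return summaries


studies = summarize_studies(minidom.parseString(SAMPLE_XML))
for entry in studies:
    print(entry)
```

Returning plain dictionaries keeps every condition and intervention for a study, and a study with no such elements simply yields empty lists instead of raising an error.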

In [None]:
# This cell retrieves every title with the node/childNodes API of xml.dom.minidom.
# Note: the imports below come from xml.dom.minidom, not from xml.etree.ElementTree.

import nltk  # not used in this cell

from xml.dom.minidom import parse, Node
from textblob import TextBlob  # not used in this cell

# Parse the source data file into a DOM document.
xmltree = parse("/Users/Tarun/Documents/Courses/Fall_2016/Research/Data/study_fields.xml")

# Loop over all the title tags in the XML file containing all clinical trials data.
# Each title element is a node; its text is stored in child nodes of type TEXT_NODE.
for node1 in xmltree.getElementsByTagName('title'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print("******")
            print(node2.data)
            


******
Multicenter, Randomized, Double-blind, Placebo-controlled Study to Evaluate the Effect of ITF2357 on Mucosal Healing in Patients With Moderate-to-severe Active Crohn's Disease
******
A Phase 1/2 Trial of Donor Regulatory T-cells for Steroid-Refractory Chronic Graft-versus-Host-Disease
******
Contrast-Enhanced Ultrasound in Human Crohn's Disease
******
Clinical Trial With MSC for Graft Versus Host Disease Treatment
******
Vitamin D Supplementation as Non-toxic Immunomodulation in Children With Crohn's Disease
******
Lanreotide Autogel in the Treatment of Symptomatic Polycystic Liver Disease
******
Anakinra for Behcet s Disease
******
Phase 1 Study of Zoledronic Acid in Sickle Cell Disease
******
Safety and Exploratory Efficacy Study of UCMSCs in Patients With Alzheimer's Disease
******
The Effect of a New Antioxidant Combination (ASTED) on Mild Thyroid Eye Disease (TED)
******
A Disease Management Study Targeted to Reduce Health Care Utilization for Patients With Congestive Heart
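
The title-listing cell above relies on `xml.dom.minidom`. For comparison, here is a minimal sketch of the same title extraction with `xml.etree.ElementTree` from the standard library. The inline sample XML and the helper name `iter_titles` are hypothetical; for the real file, `ET.parse(path).getroot()` would replace `ET.fromstring`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for study_fields.xml; the layout is an assumption.
SAMPLE_XML = """<studies>
  <study><title>Trial A</title><start_date>May 2010</start_date></study>
  <study><title>Trial B</title></study>
</studies>"""


def iter_titles(root):
    """Yield the stripped text of every <title> element under root (hypothetical helper)."""
    # Element.iter walks the whole subtree, so nesting depth does not matter.
    for title in root.iter("title"):
        if title.text:
            yield title.text.strip()


root = ET.fromstring(SAMPLE_XML)
titles = list(iter_titles(root))
for title in titles:
    print("******")
    print(title)
```

ElementTree exposes the text of an element directly through `.text`, so there is no need to loop over child nodes and test their node type.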