<h1>Extracting the Vitek Data</h1>

There is approximately 10 years of Vitek antimicrobial sensitvitiy data stored on archive CD-ROMS at the Bristol PHE Microbiology laboratory. Data files are stored in xml format, which can then be interpreted by the Vitek software, but it is the intention of this project to consolidate this data into a noSQL database that can be accessible through a web application, opening up the data to research interests.

To achieve this I will be building two objects, which I will later package into a module that can be run from the command line. The intention is that I can pass in the database name and CD-ROM drive pathway as strings, and the module will loop through XML files contained on the CD-ROW, build tree data structures from the XML files, and then store these objects in a mongo database.

Below you can find the two objects:
* BuildReportTree: using the XML file path, create a beautifulsoup object using the built-in xml parser, remove unnecessary data, and build a tree structure, with each branch corresponding to a different section of the report
* BuildDatabase: Loop through the XML files in the CD-ROM and, using the BuildReportTree object, create a tree structure object for each report, and then attempt to save that object in the designated mongo database

In [1]:
from sys import argv, exit
from bs4 import BeautifulSoup as Soup
from collections import defaultdict
import re
import PyPDF2 as pdfreader
import pymongo
import os

In [2]:
class BuildReportTree:
    """Generate a tree of hash tables to represent the reports extracted from XML file"""
    
    def __init__(self, path):
        """Instantiate BuildReportTree object using XML file path. Will generate report_array property, a list of string
        elements repesenting the list
        params:
        path -- binary string"""
        #There may be multiple reports in an xml file i.e. multiple isolates
        self.reports = []
        with open(path, "r") as f:
            handler = f.read()
            soup = Soup(handler, 'lxml')
            report_body = soup.findAll("report_body")
            for content in report_body:
                if str(content.contents).find("ReportData&gt") != -1:
                    report_strings = str(content.contents).replace("\n", "").split("&gt;&lt;")
                    report_array = list(map(lambda x: x.replace("/", ""), report_strings))
                    self.reports.append(report_array)

    def drop_not_body(self, tag):
        """Remove any elements that are not part of the report body
        params:
        tag -- element of report soup object"""
        
        return str(tag).find("ReportData&gt") != -1
    
    def build_tree(self):
        """Using current object property report_array, generate a tree structure to represent the report"""
        report_trees = []
        for report in self.reports:
            report_tree = dict()
            headings = {'ReportData': 0,
                'AstDetailedInfo': 0,
                'AstTestInfo':0}
            #Drop source_xmlstring
            def not_sourcexmlstring(x):
                return x.find("source_xmlstring") == -1
            report = list(filter(not_sourcexmlstring, report))
            #Find start index for each section
            for i, row in enumerate(report):
                if row in headings.keys():
                    headings[row] = i
            if not all(val == 0 for key, val in headings.items()):
                sections = dict()
                sections["ReportData"] = report[headings["ReportData"]:headings["AstDetailedInfo"]]
                sections["AstDetailedInfo"] = report[headings["AstDetailedInfo"]: headings['AstTestInfo']]
                sections["AstTestInfo"] = report[headings["AstTestInfo"]:len(report)]
                for header, section in sections.items():
                    report_tree = self.process_data(header, section, report_tree)
                report_trees.append(report_tree)
            else:
                return {"error":"Section index error, check report_array property for inconsistencies"}
        return report_trees
    
    def process_data(self, header, section, report_tree):
        """Create branch and leaves for passed section, add too tree and return structure
        params:
        header -- section header, as string, to be used as branch key
        section -- array of strings of section elements
        report_tree -- report tree structure"""
        
        section_data = dict()
        #remove any elements containing a single item
        section = list(filter(lambda x: len(x.split(" ")) > 1, section))
        #If this section is the AstTestInfo then pull out drug family names as keys, and assign value of phenotype
        #All other elements add as key, value pairs according to string value
        if header == "AstTestInfo":
            phenotype_data = []
            for row in section:
                if row.split(" ")[0] == "DrugFamily" or row.split(" ")[0] == "Phenotype":
                    phenotype_data.append(" ".join(row.split(" ")[1:]))
                else:
                    section_data[row.split(" ")[0]] = self.create_dict(" ".join(row.split(" ")[1:]))
            phenotype_data = list(self.split_list(phenotype_data, 2))
            phenotype_data = list(self.create_dict(" ".join(x)) for x in phenotype_data)
            section_data["phenotype_info"] = dict()
            for phenotype in phenotype_data:
                section_data["phenotype_info"].update({phenotype["familyName"]: phenotype["phenotypeName"]})
            report_tree[header] = section_data
            return report_tree
        #For all other sections, build dictionary using key value pairs according to string value
        for row in section:
            #If row is for Drug information, use drug name as key
            if row.split(" ")[0] == "AstDrugResultInfo":
                drug_key, values = self.get_drug_data(" ".join(row.split(" ")[1:]))
                section_data[drug_key] = values
            else:
                section_data[row.split(" ")[0]] = self.create_dict(" ".join(row.split(" ")[1:]))
            report_tree[header] = section_data
        return report_tree
    
    def get_drug_data(self, drug_info):
        """Take string of drug information, create dictionary with key as drug name, and value as dictionary of attributes"""
        
        drug_dict = self.create_dict(drug_info)
        drug_key = drug_dict["drugName"]
        values = {key: value for key, value in drug_dict.items() if key is not "drugName"}
        return drug_key, values

    def create_dict(self, string):
        """Take in string containing substrings of format *key*=*val*, seperate into key, value pairs and return as dictionary"""
        element_dict = dict()
        key_vals = list(map(lambda x: x.replace("\"", ""), string.split("\" ")))
        for key_val in key_vals:
            key, value = key_val[0:key_val.find("=")], key_val[key_val.find("=")+1:len(key_val)]
            if not self.confidential_data(key):
                element_dict[key] = self.format_val(value)
        return element_dict
    
    def split_list(self, l, n):
        """Split list into list of lists with length n. List length must equal n to yield
        params:
        l -- list to split
        n -- disired number of elements per list"""
        
        for i in range(0, len(l), n):
            if len(l[i:i+n]) == n:
                yield l[i:i+n]
                
    def format_val(self, string):
        """Check if value is interget or float"""
        
        if len(string) == 0:
            return string
        if all(x.isdigit() for x in list(string)):
            return int(string)
        else:
            try:
                return float(string)
            except:
                return string
        
    def confidential_data(self, string):
        """If key is a patient identifier return true"""
        
        if string.find('patient') != -1:
            return True
        else:
            return False

In [55]:
class BuildDatabase:
    """Using a supplied mongodb client, database name, and CD-ROM file pathway, this object attempts to populate the designated
    mongo database with report objects obtained from XML files on the target CD-ROM"""
    
    def __init__(self, mongoclient, dbname, dir_path):
        """Initislise object and set global variables"""
        
        self.db = mongoclient[dbname]
        self.file_path = os.fsencode(dir_path)
        
    def build(self):
        """Iterate over files in path specified, if they correspond to a report, add to database"""
        
        for file in os.listdir(self.file_path):
            filename = os.fsdecode(file)
            if 'reports_isolate' in filename:
                xml_obj = BuildReportTree(filename)
                report_trees = xml_obj.build_tree()
                for report_tree in report_trees:
                    if 'error' in report_tree.keys():
                        print('{}: {}'.format(filename, report_tree['error']))
                    else:
                       self.insert_report(report_tree, filename)
                            
    def insert_report(self, report_tree, filename):
        """Attempt to save report tree structure as new document in report collection
        params:
        report_tree -- nested hash tables representing the report
        filename -- string path of file currently being processed"""
        
        #try:
        insert_id = self.db.reports.insert_one(report_tree).inserted_id
        print('{} inserted with id {}'.format(filename, insert_id))
        self.insert_org(report_tree, insert_id)
        #except:
            #print('Failed to save {}'.format(filename))
    
    def insert_org(self, report_tree, report_id):
        """Check if organism exists in organism collection, if not add new organism, else add report ID to list
        of report id's for this organism
        params:
        report_tree -- nested hash tables representing the report
        report_id -- mongo id for report document"""
        
        org_name = report_tree['AstTestInfo']['SelectedOrg']['orgFullName']
        if self.db['orgs'].find_one({org_name: {'$exists': True}}):
            org_doc = self.db['orgs'].find_one({org_name: {'$exists': True}})
            org_doc[org_name].append(report_id)
            self.db.orgs.update_one({'_id': org_doc['_id']}, {'$set': org_doc}, upsert=False)
            print("{} summary updated".format(org_name))
        else:
            new_org = {org_name:[report_id]}
            insert_id = self.db.orgs.insert_one(new_org).inserted_id
            print('Create new summary entry for organism {}, with id {}'.format(org_name, insert_id))
            
            

<h2>Example of a tree data structure for a Vitek report</h2>

In [56]:
report_obj = BuildReportTree('reports_isolate-19611987.xml')

In [57]:
report_tree = report_obj.build_tree()

In [58]:
report_tree[0]

{'AstDetailedInfo': {'Benzylpenicillin': {'astCategoryCall': 'none',
   'disabledWithComment': 'false',
   'drugCode': 'P',
   'drugName': 'Benzylpenicillin',
   'drugScreen': 'DRUG_SCREEN_NONE',
   'drugStatusInfo': 'AST_DRUG_NONE',
   'hasMultiInfectionSite': 'false',
   'infectionSite': 'Other',
   'interpretation': 'R',
   'isChangedByUser': 'false',
   'isChangedByValidation': 'false',
   'isDeduced': 'false',
   'isMICCorrected': 'false',
   'isResistantInterpretation': 'true',
   'mic': 0.5,
   'missingRequiredTest': 'false',
   'prelimExpertizationStatus': 'OK',
   'relationshipValue': 'GreaterThan',
   'reportingDisabed': 'false',
   'sortCode': 1,
   'status': 'Final'},
  'Cefoxitin Screen': {'astCategoryCall': '-',
   'disabledWithComment': 'false',
   'drugCode': 'OXSF',
   'drugName': 'Cefoxitin Screen',
   'drugScreen': 'DRUG_SCREEN_NONE',
   'drugStatusInfo': 'AST_DRUG_NONE',
   'hasMultiInfectionSite': 'false',
   'infectionSite': 'Other',
   'interpretation': '-',
   '

In [59]:
report_tree[1]

{'AstDetailedInfo': {'Benzylpenicillin': {'astCategoryCall': 'none',
   'disabledWithComment': 'false',
   'drugCode': 'P',
   'drugName': 'Benzylpenicillin',
   'drugScreen': 'DRUG_SCREEN_NONE',
   'drugStatusInfo': 'AST_DRUG_NONE',
   'hasMultiInfectionSite': 'false',
   'infectionSite': 'Other',
   'interpretation': 'R',
   'isChangedByUser': 'false',
   'isChangedByValidation': 'false',
   'isDeduced': 'false',
   'isMICCorrected': 'false',
   'isResistantInterpretation': 'true',
   'mic': 0.5,
   'missingRequiredTest': 'false',
   'prelimExpertizationStatus': 'OK',
   'relationshipValue': 'GreaterThan',
   'reportingDisabed': 'false',
   'sortCode': 1,
   'status': 'Final'},
  'Cefoxitin Screen': {'astCategoryCall': '-',
   'disabledWithComment': 'false',
   'drugCode': 'OXSF',
   'drugName': 'Cefoxitin Screen',
   'drugScreen': 'DRUG_SCREEN_NONE',
   'drugStatusInfo': 'AST_DRUG_NONE',
   'hasMultiInfectionSite': 'false',
   'infectionSite': 'Other',
   'interpretation': '-',
   '

<h2>Example of inserting into Mongodb</h2>

In [60]:
client = pymongo.MongoClient()
BuildDatabase(client, 'vitekTest', "/home/rossco/Documents/Data Science Portfolio/vitek_project/").build()

reports_isolate-19611987.xml inserted with id 5a68a63d1285011149a0351e
Create new summary entry for organism Staphylococcus aureus, with id 5a68a63d1285011149a0351f
reports_isolate-19611987.xml inserted with id 5a68a63d1285011149a03520
Staphylococcus aureus summary updated


Perform a search and grab any objects corresponding to *Staphylococcus aureus* which should return both objects that I just inserted

In [10]:
db = client['vitekTest']
reports = db.reports

In [11]:
reports

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'vitekTest'), 'reports')

In [12]:
import pprint
for sta in reports.find({'AstTestInfo.SelectedOrg.orgFullName' : 'Staphylococcus aureus'}):
    pprint.pprint(sta)

{'AstDetailedInfo': {'Benzylpenicillin': {'astCategoryCall': 'none',
                                          'disabledWithComment': 'false',
                                          'drugCode': 'P',
                                          'drugName': 'Benzylpenicillin',
                                          'drugScreen': 'DRUG_SCREEN_NONE',
                                          'drugStatusInfo': 'AST_DRUG_NONE',
                                          'hasMultiInfectionSite': 'false',
                                          'infectionSite': 'Other',
                                          'interpretation': 'R',
                                          'isChangedByUser': 'false',
                                          'isChangedByValidation': 'false',
                                          'isDeduced': 'false',
                                          'isMICCorrected': 'false',
                                          'isResistantInterpretation': 'true',
    

                                     'reportingDisabed': 'false',
                                     'sortCode': 10,
                                     'status': 'Final'},
                     'Tetracycline': {'astCategoryCall': 'none',
                                      'disabledWithComment': 'false',
                                      'drugCode': 'TE',
                                      'drugName': 'Tetracycline',
                                      'drugScreen': 'DRUG_SCREEN_NONE',
                                      'drugStatusInfo': 'AST_DRUG_NONE',
                                      'hasMultiInfectionSite': 'false',
                                      'infectionSite': 'Other',
                                      'interpretation': 'S',
                                      'isChangedByUser': 'false',
                                      'isChangedByValidation': 'false',
                                      'isDeduced': 'false',
                           

In [17]:
orgs = db.org.find({})

In [18]:
for org in orgs:
    print(org)