# XML Parsing for PubMed Central

This notebook contains the representative code used to parse XML articles from PubMed Central, extracting the title, abstract, and full body content in order to search for co-occurrence of drug and mutation mentions related to HBV. Figure numbers in the sub-headings refer to the figures in the report.

In [None]:
from os import listdir
import os
import re
import xml.etree.ElementTree as ET

## Step 1 in Figure 5 (i.e. Download scientific literature as XML files)

1. Download all files from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/ and place them inside your working directory. 
2. Extract every tar.gz archive inside your working directory. 
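
The extraction step can also be scripted. The following is a minimal sketch, assuming the archives sit directly inside the working directory; the helper name `extract_archives` is hypothetical and not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_archives(working_dir):
    """Extract every *.tar.gz archive found directly inside working_dir.

    Returns the names of the archives that were extracted.
    """
    working_dir = Path(working_dir)
    extracted = []
    for archive in sorted(working_dir.glob("*.tar.gz")):
        # "r:gz" opens a gzip-compressed tar archive for reading
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=working_dir)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_archives(".")` from the working directory unpacks all archives in place. Only extract archives from a source you trust, since archive members are written to disk as named.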

## Step 2 in Figure 5 (i.e. Select articles that refer to HBV and/or Hepatitis B)

Using the Linux command line: 
1. Select articles that refer to Hepatitis B by typing "find . -name '*.nxml' -exec fgrep -i -l 'hepatitis b' {} \; > match_hepatitisB.txt" in the command line. This writes the paths of all XML files whose content contains 'hepatitis b' (case insensitive) to a text file called 'match_hepatitisB.txt' in the working directory.
2. Select articles that refer to HBV by typing "find . -name '*.nxml' -exec fgrep -i -l 'hbv' {} \; > match_hbv.txt" in the command line. This writes the paths of all XML files whose content contains 'hbv' (case insensitive) to a text file called 'match_hbv.txt' in the working directory.
3. Merge the two text files from Steps 1 and 2 by typing "cat match_hepatitisB.txt match_hbv.txt > match_hbv_hepatitisB.txt" in the command line of the working directory. Paths that appear in both files are listed twice; the duplicates are dropped when the paths are loaded into a set below.
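
For readers without access to `find` and `fgrep`, the same selection can be approximated in Python. This is a rough sketch rather than the method used in the report; the function name `find_articles_mentioning` is hypothetical, and it assumes the files can be read as UTF-8 text (undecodable bytes are ignored).

```python
from pathlib import Path


def find_articles_mentioning(root_dir, keywords):
    """Return sorted paths of .nxml files whose content contains any keyword.

    Matching is case-insensitive and uses plain substrings, like fgrep -i.
    """
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for path in Path(root_dir).rglob("*.nxml"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if any(keyword in text for keyword in lowered):
            hits.append(str(path))
    return sorted(hits)
```

Writing the result of `find_articles_mentioning(".", ["hepatitis b", "hbv"])` to 'match_hbv_hepatitisB.txt', one path per line, gives a merged and de-duplicated list in a single pass.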

## Steps 3-5 in Figure 5 (i.e. Extract title, abstract, and full body content; Search for co-occurrence of drug and mutation; Output Data (hit DOIs, mutations, sentences))

1. Create set of paths to access articles relevant to HBV and/or Hepatitis B from Step 2 in Figure 5

In [None]:
#Create set of paths from match_hbv_hepatitisB.txt in Step 2 in Figure 5
match_hbv_hepatitisB_location = set()

#Change the path accordingly 
with open("./match_hbv_hepatitisB.txt", "r") as XML_files_contain_hbv_hepatitisB:
    #Add each path to the set, removing the trailing \n and skipping blank lines
    for line in XML_files_contain_hbv_hepatitisB:
        path = line.strip()
        if path:
            match_hbv_hepatitisB_location.add(path)

2. Load the HBV drug vocabulary from DrugBank.ca (any other vocabulary can be substituted to search the articles for different terms)

In [None]:
#Create set of drug patterns for drugbank_regex
drugbank_regex = set()

#Change the path accordingly 
with open("./HBV_clinically_approved_DrugBank.txt", "r") as vocabulary_drugbank:
    #Add each vocabulary entry to the set, removing the trailing \n
    #Blank lines are skipped because an empty pattern would match every sentence
    for line in vocabulary_drugbank:
        entry = line.strip()
        if entry:
            drugbank_regex.add(entry)
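
Since every vocabulary entry is later passed to `re.search` once per sentence, the entries could instead be combined into a single compiled pattern. The sketch below is one possible approach, not the notebook's own; `compile_vocabulary` is a hypothetical helper, and the two drug names are placeholder entries rather than the contents of the DrugBank file.

```python
import re


def compile_vocabulary(patterns):
    """Combine vocabulary entries into one compiled regex, skipping blank entries.

    Each entry is wrapped in a non-capturing group so that entries which are
    themselves regular expressions keep their meaning inside the alternation.
    """
    cleaned = sorted({p.strip() for p in patterns if p.strip()})
    if not cleaned:
        raise ValueError("vocabulary is empty")
    return re.compile("|".join(f"(?:{p})" for p in cleaned))


# Placeholder entries for illustration only
example_vocabulary = ["lamivudine", "entecavir", ""]
drug_pattern = compile_vocabulary(example_vocabulary)
```

Like the per-entry `re.search` calls, this pattern is case-sensitive; passing `re.IGNORECASE` to `re.compile` would relax that if the vocabulary calls for it.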

3. Define the regular expressions for HBV mutations (combined version of categories 1-7) and for splitting text into sentences

In [None]:
mutation_patterns = r"(?i)(p\.|rt)?(?P<AA3wt>Ala|Arg|Asn|Asp|Asx|Cys|Glu|Gln|Glx|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)(?P<POS>\d+)(?P<AA3mut>Ala|Arg|Asn|Asp|Asx|Cys|Glu|Gln|Glx|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)|(c\.)?(?P<POS_C>\d+)(?P<AA1wt_C>[ARNDBCEQZGHILKMFPSTWYVO])>(?P<AA1mut_C>[ARNDBCEQZGHILKMFPSTWYVO])|(rt)?(?P<AA1wt>[ARNDBCEQZGHILKMFPSTWYVO])(?P<POS_1>\d+)(?P<AA1mut>[ARNDBCEQZGHILKMFPSTWYVO])"
regex_split = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
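
To see how the two patterns behave, the short example below applies them to an invented sentence pair. The patterns are copied from the cell above; the sample text is made up for illustration.

```python
import re

mutation_patterns = r"(?i)(p\.|rt)?(?P<AA3wt>Ala|Arg|Asn|Asp|Asx|Cys|Glu|Gln|Glx|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)(?P<POS>\d+)(?P<AA3mut>Ala|Arg|Asn|Asp|Asx|Cys|Glu|Gln|Glx|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)|(c\.)?(?P<POS_C>\d+)(?P<AA1wt_C>[ARNDBCEQZGHILKMFPSTWYVO])>(?P<AA1mut_C>[ARNDBCEQZGHILKMFPSTWYVO])|(rt)?(?P<AA1wt>[ARNDBCEQZGHILKMFPSTWYVO])(?P<POS_1>\d+)(?P<AA1mut>[ARNDBCEQZGHILKMFPSTWYVO])"
regex_split = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

# Invented sample text: the abbreviation "Dr." must not end a sentence
text = "Dr. Lee reported rtA181T under lamivudine. Tenofovir was used next."

sentences = re.split(regex_split, text)
first_mutation = re.search(mutation_patterns, sentences[0])

print(sentences)
print(first_mutation.group(0), first_mutation.group("POS_1"))
```

The split keeps "Dr. Lee" together because of the second negative lookbehind, and the mutation is matched by the third alternative of `mutation_patterns`, so its position is available through the `POS_1` group.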

4. Search for co-occurrence of drug and mutation in the title of each article from 1)

In [None]:
#Find content that matches the title, then find HBV mutation patterns for files that satisfied drug filter

cooccurrence_title_mutation = set() #set that collects the HBV mutations for title with cooccurrence of mutation and drug 
cooccurrence_title_mutation_drug_line = set() #set that collects the lines from title with cooccurrence of mutation and drug 
cooccurrence_title_mutation_drug_location = set() #set that collects the DOI for title with cooccurrence of mutation and drug 
title_match_drug = set()

for path_name in match_hbv_hepatitisB_location: #iterates over all articles related to HBV
    tree = ET.parse(path_name)
    root = tree.getroot()
    title = root.find('.//title-group/article-title') #keyword to search for title content in PubMed Central 
    article_meta = root.find('.//article-meta')
    if title is None or article_meta is None:
        continue
    doi = article_meta.find('article-id[@pub-id-type="doi"]') #keyword to search for DOIs
    if doi is None:
        continue
    title_match_drug_string = ET.tostring(title).decode()
    path_name_doi = ET.tostring(doi).decode()
    #checks whether the title mentions any drug from drugbank_regex
    if not any(re.search(drug, title_match_drug_string) for drug in drugbank_regex):
        continue
    title_match_drug.add(title_match_drug_string)
    #checks only the current title for an HBV-related mutation, so that each DOI is paired with its own title
    find_mutation_patterns = re.search(mutation_patterns, title_match_drug_string)
    if find_mutation_patterns is not None:
        cooccurrence_title_mutation_drug_line.add(title_match_drug_string)
        cooccurrence_title_mutation.add(find_mutation_patterns.group(0))
        cooccurrence_title_mutation_drug_location.add(path_name_doi)

5. Search for co-occurrence of drug and mutation in the abstract of each article from 1)

In [None]:
cooccurrence_abstract_mutation = set() #set that collects the HBV mutations for abstract with cooccurrence of mutation and drug 
cooccurrence_abstract_mutation_drug_line = set() #set that collects the lines from abstract with cooccurrence of mutation and drug 
cooccurrence_abstract_mutation_drug_location = set() #set that collects the DOI for abstract with cooccurrence of mutation and drug 
abstract_match_drug = set()

for path_name in match_hbv_hepatitisB_location: #iterates over all articles related to HBV
    tree = ET.parse(path_name)
    root = tree.getroot()
    abstract = root.find('.//abstract') #keyword to search for abstract content in PubMed Central 
    article_meta = root.find('.//article-meta')
    if abstract is None or article_meta is None:
        continue
    doi = article_meta.find('article-id[@pub-id-type="doi"]') #keyword to search for DOIs
    if doi is None:
        continue
    abstract_match_string = ET.tostring(abstract).decode()
    path_name_doi = ET.tostring(doi).decode()
    lines = re.split(regex_split, abstract_match_string) #splits texts into a sentence
    for line in lines:
        #checks whether the sentence mentions any drug from drugbank_regex
        if not any(re.search(drug, line) for drug in drugbank_regex):
            continue
        abstract_match_drug.add(line)
        #checks only the current sentence for an HBV-related mutation, so that each DOI is paired with its own sentences
        find_mutation_patterns = re.search(mutation_patterns, line)
        if find_mutation_patterns is not None:
            cooccurrence_abstract_mutation.add(find_mutation_patterns.group(0))
            cooccurrence_abstract_mutation_drug_location.add(path_name_doi)
            cooccurrence_abstract_mutation_drug_line.add(line)
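
The sentence-level search can be tried on a small in-memory document. The sketch below is a simplified variant, not the code used for the report: it reads plain text with `itertext()` instead of `ET.tostring`, so XML tags and attributes never reach the regular expressions, and it uses only the single-letter alternative of `mutation_patterns`. The sample XML, the DOI `10.0000/example`, and the function name `cooccurrences` are all invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample article; the DOI is a placeholder
SAMPLE_XML = """<article>
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example</article-id>
    <abstract><p>Lamivudine selects rtM204V. Other agents were not tested.</p></abstract>
  </article-meta></front>
</article>"""

SPLIT = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
# Simplified: only the single-letter amino acid form of the mutation pattern
MUTATION = r"(?i)(rt)?[ARNDBCEQZGHILKMFPSTWYVO]\d+[ARNDBCEQZGHILKMFPSTWYVO]"


def cooccurrences(xml_text, drug_patterns, mutation_pattern=MUTATION):
    """Return (doi, sentence, mutation) for abstract sentences naming a drug and a mutation."""
    root = ET.fromstring(xml_text)
    abstract = root.find(".//abstract")
    doi = root.find(".//article-meta/article-id[@pub-id-type='doi']")
    if abstract is None or doi is None:
        return []
    # Join all text nodes and collapse runs of whitespace into single spaces
    text = " ".join("".join(abstract.itertext()).split())
    hits = []
    for sentence in re.split(SPLIT, text):
        if not any(re.search(drug, sentence) for drug in drug_patterns):
            continue
        mutation = re.search(mutation_pattern, sentence)
        if mutation is not None:
            hits.append((doi.text, sentence, mutation.group(0)))
    return hits


print(cooccurrences(SAMPLE_XML, ["Lamivudine"]))
```

Returning tuples keeps each DOI, sentence, and mutation together, which is one way to avoid losing the pairing that separate sets cannot record.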

6. Search for co-occurrence of drug and mutation in the full body content of each article from 1)

In [None]:
cooccurrence_body_content_mutation = set() #set that collects the HBV mutations for body content with cooccurrence of mutation and drug 
cooccurrence_body_content_mutation_drug_line = set() #set that collects the lines from body content with cooccurrence of mutation and drug 
cooccurrence_body_content_mutation_drug_location = set() #set that collects the DOI for body content with cooccurrence of mutation and drug 
body_content_match_drug = set()

for path_name in match_hbv_hepatitisB_location: #iterates over all articles related to HBV
    tree = ET.parse(path_name)
    root = tree.getroot()
    body_content = root.find('.//body') #keyword to search for full body content in PubMed Central 
    article_meta = root.find('.//article-meta')
    if body_content is None or article_meta is None:
        continue
    doi = article_meta.find('article-id[@pub-id-type="doi"]') #keyword to search for DOIs
    if doi is None:
        continue
    body_content_match_string = ET.tostring(body_content).decode()
    path_name_doi = ET.tostring(doi).decode()
    lines = re.split(regex_split, body_content_match_string) #splits texts into a sentence
    for line in lines:
        #checks whether the sentence mentions any drug from drugbank_regex
        if not any(re.search(drug, line) for drug in drugbank_regex):
            continue
        body_content_match_drug.add(line)
        #checks only the current sentence for an HBV-related mutation, so that each DOI is paired with its own sentences
        find_mutation_patterns = re.search(mutation_patterns, line)
        if find_mutation_patterns is not None:
            cooccurrence_body_content_mutation.add(find_mutation_patterns.group(0)) #sets use add, not append
            cooccurrence_body_content_mutation_drug_location.add(path_name_doi)
            cooccurrence_body_content_mutation_drug_line.add(line)

In 4)-6), for ease of data processing, I used a separate regular expression for each category instead of the single combination of all seven categories. Hence, mutation_patterns in 4)-6) above was replaced in turn with each of the regular expressions listed in Table 1 of the report (e.g. regex_base_Pos_base = r"(?i)(?P<base1>[AGTC])(?P<POS_1>\d+)(?P<base2>[GACT])" for mutation type 3).
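
Running the categories one at a time could look like the sketch below. Only `regex_base_Pos_base` is quoted from the text above; the second entry is the single-letter alternative lifted out of `mutation_patterns`, and its key name `rt_AA1_Pos_AA1` as well as the function `mutations_by_category` are hypothetical stand-ins for the entries in Table 1 of the report.

```python
import re

category_patterns = {
    # Quoted from the text above (mutation type 3)
    "regex_base_Pos_base": r"(?i)(?P<base1>[AGTC])(?P<POS_1>\d+)(?P<base2>[GACT])",
    # Hypothetical name; pattern taken from the last alternative of mutation_patterns
    "rt_AA1_Pos_AA1": r"(?i)(rt)?(?P<AA1wt>[ARNDBCEQZGHILKMFPSTWYVO])(?P<POS_1>\d+)(?P<AA1mut>[ARNDBCEQZGHILKMFPSTWYVO])",
}


def mutations_by_category(sentence, patterns=category_patterns):
    """Return {category name: first matched mutation} for every pattern that matches."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, sentence)
        if match is not None:
            found[name] = match.group(0)
    return found


print(mutations_by_category("The A1762T change was observed."))
print(mutations_by_category("rtM204V confers resistance."))
```

In this sketch the first sentence is reported under both categories, because A and T are valid nucleotide and amino acid letters alike, while the second is reported only under the amino acid category. Keeping the categories separate makes such overlaps visible in the output.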