<center><h1>Reading scholarly.html files</h1></center>

## Introduction

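In this notebook I write a first parser for scholarly.html files. It loads one example paper, walks the HTML into a nested dictionary, and then extracts the body text, figures, acknowledgements and references.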

### Find an example file and load it

In [1]:
import os

print(os.listdir('./results/PMC2241601'))

html_root = './results/PMC2241601/scholarly.html'

#load the data in
with open(html_root, 'r') as f:
    raw_text = f.read()
    


['svg', 'word.frequencies.snippets.xml', 'fulltext.pdf', 'pdfimages', 'search.country.snippets.xml', 'scholarly.html', 'eupmc_result.json', 'search.country.count.xml', 'results', 'fulltext.xml', 'word.frequencies.count.xml']


In [2]:
from bs4 import BeautifulSoup

#parse the html into soup
soup = BeautifulSoup(raw_text, 'html.parser')

#extract the body
body = soup.find('body')



### Try to make the parser work using just the front matter, which holds the abstract

In [3]:
#extract the front element
front = body.find('div', {'class':'front'})

### Extract the paper metadata

In [4]:
from pprint import pprint

def iterdict(d):
    for k,v in d.items():        
        if isinstance(v, dict):
            iterdict(v)
        else:            
            print (k,":",v)
        
#NavigableString is the bs4 class for bare text nodes between tags
from bs4 import NavigableString

def iterhtml(el):
    
    store_json = {}
    
    if el.children:
    
        for item in el.children:
            
            #skip bare text nodes (e.g. whitespace); only tags are processed
            if isinstance(item, NavigableString):
                continue

            if item.name == 'div':
                
                #object_name = item.text
                
                #check if we have a title object
                if item.has_attr('tagx'):
                    
                    if item['tagx'] == 'title':
                        store_json['title'] = item.text.strip()

                        #a title div has nothing further to dive into,
                        #so let's move on
                        continue
                
                #do some recursion
                store_json[item['class'][0]] = iterhtml(item)

            #we have a span object which contains information we are interested in
            elif item.name == 'span':
                
                store_json[item['class'][0]] = item.text
                
            elif item.name == 'p':
                
                #append the paragraph to this section's running 'text' entry
                if 'text' not in store_json:
                    
                    store_json['text'] = ''
                    
                store_json['text'] += item.text
                
    return store_json


result = iterhtml(front)
    
print('++++++++')
    
pprint(result)


++++++++
{'article-meta': {'abstract': {'abstract-title': {},
                               'background': {'text': 'The key enzymes of '
                                                      'photosynthetic carbon '
                                                      'assimilation in C4 '
                                                      'plants have evolved '
                                                      'independently several '
                                                      'times from C3 isoforms '
                                                      'that were present in '
                                                      'the C3 ancestral '
                                                      'species. The C4 isoform '
                                                      'of phosphoenolpyruvate '
                                                      'carboxylase (PEPC), the '
                                                      'primary CO2-fixing '
 

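The recursion in `iterhtml` is easier to follow on a tiny input than on a full paper. The block below is a minimal sketch of the same idea, not the parser itself: it swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` so it runs without extra installs, and it assumes the input is well-formed XML. The `TOY` fragment and the `walk` name are hypothetical and hand-written for illustration; they are not taken from PMC2241601.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the div/span/p layout seen above.
TOY = """
<div class="front">
  <div class="article-meta">
    <div class="abstract">
      <div class="title" tagx="title">Abstract</div>
      <div class="background">
        <div class="title" tagx="title">Background</div>
        <p>First sentence. </p>
        <p>Second sentence.</p>
      </div>
    </div>
    <span class="volume">8</span>
  </div>
</div>
"""

def walk(el):
    """Mirror iterhtml: divs recurse, spans store text, p text is concatenated."""
    out = {}
    # ElementTree yields only elements here, so no text-node check is needed
    for child in el:
        # ElementTree returns the class attribute as one string (bs4 returns a list)
        key = child.get('class', child.tag)
        if child.tag == 'div':
            if child.get('tagx') == 'title':
                out['title'] = ''.join(child.itertext()).strip()
                continue
            out[key] = walk(child)
        elif child.tag == 'span':
            out[key] = ''.join(child.itertext())
        elif child.tag == 'p':
            out['text'] = out.get('text', '') + ''.join(child.itertext())
    return out

toy_result = walk(ET.fromstring(TOY))
print(toy_result)
```

Each `div` becomes a nested dictionary keyed by its class, which is why the real output above is nested as `article-meta`, then `abstract`, then `background`.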
In [5]:
#try to extract the body using this

main = body.find('div', {'class':'body'})

body_result = iterhtml(main)

pprint(body_result)

{"authors'contributions": {'text': 'SE carried out the histochemical and '
                                   'quantitative GUS assays, the cloning of '
                                   'construct ppcA-PRFtΔIntron-DR(+)Ft, the '
                                   'sequence alignments and wrote the '
                                   'manuscript. CZ produced construct '
                                   'ppcA-PRFp-DR(+)Ft. MK, US and MS performed '
                                   'the transformation of F. bidentis. PW '
                                   'coordinated the design of this study and '
                                   'participated in drafting the manuscript. '
                                   'All authors read and approved the final '
                                   'manuscript.',
                           'title': "Authors' contributions"},
 'background': {'text': 'About 90% of terrestrial plant species, including '
                        'major crops such 

                                                                                                             'ppcA '
                                                                                                             'promoter '
                                                                                                             'sequences '
                                                                                                             'from '
                                                                                                             'different '
                                                                                                             'Flaveria '
                                                                                                             'species'},
                          'experimentalstrategy': {'fig': {'figure': {},
                                                           'text': 'Schematic '
      

                                                                                                                                                                        'of '
                                                                                                                                                                        'protein '
                                                                                                                                                                        'per '
                                                                                                                                                                        'minute.'},
                                                                                                                                                        'text': 'Gowik '
                                                                                                                           

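The deeply nested body dictionary is hard to read once `pprint` starts indenting it. One option, sketched here under the assumption that the structure contains only dictionaries and strings, is to flatten it into dotted paths. The `flatten` helper and the `sample` dictionary are hypothetical stand-ins of my own; in the notebook `flatten(body_result)` would be the call to try.

```python
def flatten(d, prefix=''):
    """Flatten a nested dict into {'a.b.c': value} entries."""
    flat = {}
    for key, value in d.items():
        path = prefix + '.' + key if prefix else key
        if isinstance(value, dict):
            # recurse into sub-sections; empty dicts simply contribute nothing
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Toy input shaped like the iterhtml output (values are made up).
sample = {
    'methods': {'title': 'Methods',
                'assays': {'title': 'Assays', 'text': 'toy text'}},
    'conclusion': {'text': 'toy conclusion'},
    'figure': {},
}

flat = flatten(sample)
for path, value in flat.items():
    # show only the start of each value to keep the listing short
    print(path, ':', value[:60])
```

Note that class names containing a dot would make the paths ambiguous, so a different separator may be safer on real papers.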
### Define functions for reference extraction

In [6]:
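#process an element that contains the people listed in a reference
def process_people_group(people_group_els, this_ref_dict):
    
    #scroll through the person elements
    for pg_els in people_group_els:
        
        #skip bare text nodes (e.g. whitespace) between person elements
        if 'bs4.element.NavigableString' in str(pg_els.__class__):
            continue
        
        #define the person dictionary
        person_dict = {}
        
        #run through the span elements in this person element
        for span_el in pg_els:
            
            #skip bare text nodes here too
            if 'bs4.element.NavigableString' in str(span_el.__class__):
                continue
            
            #store the person's details, keyed by the span's class
            person_dict[span_el['class'][0]] = span_el.text

        #store this person
        this_ref_dict['people'].append(person_dict)
        
    return this_ref_dict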

#process a list element which is a reference
def process_reference_element(l):
    
    #start with an empty record so that a reference without a label element
    #does not leave ref_dict undefined
    ref_dict = {'people':[]}
    
    #run through the elements of this reference element
    for t in l.children:

        #skip bare text nodes (NavigableString); only tags are processed
        if 'bs4.element.NavigableString' in str(t.__class__):
            continue

        #the label element starts the record for this reference; its name is
        #one prefix character followed by the reference number
        if t.name == 'a':
            ref_dict = {'people':[],
                       'label':t['name'],
                       'ref_no':int(t['name'][1:])}
            continue

        #go through the children of the reference to grab the reference's elements
        for u in t.children:
            
            #check that the element we have is not a NavigableString
            if 'bs4.element.NavigableString' in str(u.__class__):
                #move on if it does
                continue

            #if we have a person group then we want to use our extraction tool
            if 'person-group' in u['class'][0]:

                #process it as a people element
                ref_dict = process_people_group(u, ref_dict)

            else: 
                
                #check whether there are links in here
                links = u.find_all('a')
                
                #if there are links we process it as a link object
                if links:

                    #store the text as well as the first link
                    ref_dict[u['class'][0]] = {'link':links[0]['href'],
                                               'text':u.text}

                else:
                    #we have a plain piece of information, let's just store it
                    ref_dict[u['class'][0]] = u.text
                    
    return ref_dict

#this is the function for extracting the references from the tail
def extract_references(tail):
    
    references = []

    #find the references block
    refs = tail.find('div', {'tag':'ref-list'})

    #find the references list
    r = refs.find('ul')

    #go through the list of references
    for ref_el in r.children:

        #check if the item is whatever a navigable string is
        if 'bs4.element.NavigableString' in str(ref_el.__class__):
            #move on if it does
            continue

        ref_dict = process_reference_element(ref_el)

        references.append(ref_dict)
        
    return references
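
To check what these functions return, it helps to turn each reference dictionary back into a readable line. The sketch below assumes the shape built above (`people`, `label`, `ref_no`, plus one entry per element class, where linked fields are stored as `{'link': ..., 'text': ...}`). The keys `surname`, `given-names` and `article-title` are assumptions about the span classes in the file, and `format_reference` and `sample_refs` are hypothetical names with made-up data.

```python
def format_reference(ref):
    """Build a one-line summary from a reference dict."""
    names = []
    for person in ref.get('people', []):
        # 'surname' and 'given-names' are assumed span classes
        parts = [person.get('surname', ''), person.get('given-names', '')]
        names.append(' '.join(p for p in parts if p))

    title = ref.get('article-title', '')
    # fields that contained a link are stored as {'link': ..., 'text': ...}
    if isinstance(title, dict):
        title = title.get('text', '')

    return '[{}] {}. {}'.format(ref.get('ref_no', '?'),
                                ', '.join(names) or 'Unknown',
                                title.strip())

# Made-up records in the shape produced by process_reference_element.
sample_refs = [
    {'people': [{'surname': 'Doe', 'given-names': 'J'}],
     'label': 'B2', 'ref_no': 2,
     'article-title': {'link': 'http://example.org/2',
                       'text': 'Second example title'}},
    {'people': [{'surname': 'Roe', 'given-names': 'A'},
                {'surname': 'Poe', 'given-names': 'B'}],
     'label': 'B1', 'ref_no': 1,
     'article-title': 'First example title'},
]

# sort by the number parsed from the label before printing
lines = [format_reference(r)
         for r in sorted(sample_refs, key=lambda r: r['ref_no'])]
for line in lines:
    print(line)
```

In the notebook the same helper could be applied to the list returned by `extract_references(tail)`, provided the assumed keys match the classes in the file.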

### Create a class which will read the objects in

In [35]:

#define a class to extract the acknowledgements
class acknowledgements_extractor(object):
    
    def __init__(self, tail):
        
        #get the acknowledgements element
        acks = tail.find('div', {'class':'ack'})

        #set up the dictionary that will hold the result
        self.ack_dict = {'text':''}
        
        #start iterating, unless the paper has no acknowledgements section
        if acks is not None:
            self.iteracks(acks)
        
    def iteracks(self, el):
                
        #iterate through the children of this element
        for child in el.children:
            
            #check if the item is whatever a navigable string is
            if 'bs4.element.NavigableString' in str(child.__class__):
                #move on if it does
                continue
            
            #check if we have a div, usually the acknowledgements heading
            if child.name == 'div':
                
                print('FOUND DIV')
                
                #check if this element is a title
                if child['class'][0] == 'title':

                    self.ack_dict['title'] = child.text.strip()
                    
                    #if it is we are done with this element
                    continue
                    
                #go one level down
                self.iteracks(child)
            
            #check if this is a text element
            elif child.name == 'p':
                
                print('FOUND TEXT')
                
                #store the text of this paragraph only; el.text would repeat
                #the whole section once per paragraph
                self.ack_dict['text'] += ' ' + child.text.strip()


#this is the object that will process the paper and in which the results will be stored
class paper_obj(object):
    
    def __init__(self, html_file, keep_fig_caps = True):
        
        
        self.abstract_text = ''
        self.body_text = ''
        self.figures = []
        
        self.abstract_el = False
        self.body_el = False
        
        #tell the tool to keep the figure captions in the fulltexts
        self.keep_fig_labels = keep_fig_caps
        
        #load the data in from the file passed to the constructor
        with open(html_file, 'r') as f:
            raw_text = f.read()
            
            
        #parse the html into soup
        soup = BeautifulSoup(raw_text, 'html.parser')

        #extract the metadata
        front = soup.find('div', {'class':'front'})
        
        #extract the body
        body = soup.find('div', {'class':'body'})
        
        #extract the tail
        tail = soup.find('div', {'class':'back'})
        
        #extract the metadata into a json
        self.meta = self.iterhtml(front)
        
        #we are now going to work with the body so set the flag
        self.body_el = True
        
        #extract the body of the document
        self.body = self.iterhtml(body)
        
        #we are done with the body so we will reset the body to False
        self.body_el = False
        
        
        #extract the end notes
        self.extract_tail(tail)

    def iterhtml(self, el):
        
        #set up the json we're going to be returning
        store_json = {}
        
        #check if this item has children
        if el.children:

            for item in el.children:
                
                #print('I HAVE AN ITEM')

                #check if the item is whatever a navigable string is
                if 'bs4.element.NavigableString' in str(item.__class__):
                    #move on if it does
                    continue

                if item.name == 'div':

                    #object_name = item.text

                    #check if we have a title object
                    if item.has_attr('tagx'):

                        if item['tagx'] == 'title':
                            store_json['title'] = item.text.strip()

                            #this is a heading, let's store the text from it
                            self.capture_text(item)
                            
                            #a title div has nothing further to dive into,
                            #so let's move on
                            continue
                            
                    if item['class'][0] == 'fig':
                        
                        self.process_figure(item)
                        
                        #we don't want to dive in here since that would count
                        #the figure text twice, so let's move on
                        continue

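                    #do some recursion; flag the abstract while we are inside it
                    #so that capture_text fills abstract_text
                    section = item['class'][0]
                    
                    if section == 'abstract':
                        self.abstract_el = True
                        
                    store_json[section] = self.iterhtml(item)
                    
                    if section == 'abstract':
                        self.abstract_el = False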
                    
                    #we dont want to store text from div elements so lets move on
                    #continue

                #we have a span object which contains information we are interested in
                elif item.name == 'span':

                    store_json[item['class'][0]] = item.text
                    
                    #grab the text from this element
                    self.capture_text(item)
      
                elif item.name == 'p':

                    #append the paragraph to this section's running 'text' entry
                    if 'text' not in store_json:

                        store_json['text'] = ''

                    store_json['text'] += item.text
                    
                    #grab the text from this element
                    self.capture_text(item)
                

        #print('************************')   
        return store_json
    
    def capture_text(self, el):
        
        #check if we are working with a body
        if self.body_el:
            self.body_text += el.text + '\n'

        #if we have an abstract element we want to store it
        elif self.abstract_el:
            self.abstract_text += el.text + '\n'
            
    def process_figure(self, fig_el):
        
        #start with empty strings so a figure without a title div still works
        fig_dict = {'title':'', 'caption':''}

        for child in fig_el.children:

            #check if the item is whatever a navigable string is
            if 'bs4.element.NavigableString' in str(child.__class__):
                #move on if it does
                continue

            if child.name == 'div':

                fig_dict['title'] = child.text
                
                if self.keep_fig_labels:
                    self.capture_text(child)

            elif child.name == 'p':
                
                if self.keep_fig_labels:
                    self.capture_text(child)

                fig_dict['caption'] += child.text + '\n'

        fig_dict['title'] = fig_dict['title'].strip()
        fig_dict['caption'] = fig_dict['caption'].strip()
        
        #store the details of this figure
        self.figures.append(fig_dict)


    def extract_tail(self, tail):
        
        ack_tool = acknowledgements_extractor(tail)
        
        self.acks = ack_tool.ack_dict
        
        self.references = extract_references(tail)
        
    
###############################
### End of class definition ###
###############################

#set the filename
filename = './results/PMC2241601/scholarly.html'

#create the paper object
result = paper_obj(filename)

print(result.body_text)

FOUND DIV
FOUND DIV
FOUND TEXT

Background
About 90% of terrestrial plant species, including major crops such as rice, soybean, barley and wheat, assimilate CO2 via the C3 pathway of photosynthesis. Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) acts as the primary CO2-fixing enzyme of C3 photosynthesis, but its ability to use O2 as a substrate instead of CO2 results in the energy-wasting process of photorespiration. The photosynthetic C4 cycle represents an addition to the C3 pathway which acts as a pump that accumulates CO2 at the site of Rubisco so that the oxygenase activity of the enzyme is inhibited and photorespiration is largely suppressed. C4 plants therefore achieve higher photosynthetic capacities and better water- and nitrogen-use efficiencies when compared with C3 species [1].
C4 photosynthesis is characterized by the coordinated division of labour between two morphologically distinct cell types, the mesophyll and the bundle-sheath cells. The correct functioning

In [36]:
result.figures

[{'caption': 'Schematic presentation of the promoter-GUS fusion constructs used for the transformation of Flaveria bidentis (C4).',
  'title': 'Figure 1'},
 {'caption': '(A) to (C): Histochemical localization of GUS activity in leaf sections of transgenic F. bidentis plants transformed with constructs ppcA-PRFt-DR(+)Ft(A), ppcA-PRFp-DR(+)Ft (B) or ppcA-PRFtΔIntron-DR(+)Ft (C). Incubation times were 6 h (A, C) and 20 h (B). (D): GUS activities in leaves of transgenic F. bidentis plants. The numbers of independent transgenic plants tested (N) are indicated at the top of each column. Median values (black lines) of GUS activities are expressed in nanomoles of the reaction product 4-methylumbelliferone (MU) generated per milligram of protein per minute.',
  'title': 'Figure 2'},
 {'caption': "Nucleotide sequence alignment of the proximal regions of ppcA promoters from F. trinervia (C4, ppcA-Ft), F. bidentis (C4, ppcA-Fb), F. vaginata (C4-like, ppcA-Fv), F. brownii (C4-like, ppcA-Fbr), F. pu