<center><h1>Reading scholarly.html files</h1></center>

## Introduction

In this notebook I attempt to write a parser that reads scholarly.html files.

### Find an example file and load it

In [1]:
import os

print(os.listdir('./results/PMC2241601'))

html_root = './results/PMC2241601/scholarly.html'

#load the data in
with open(html_root, 'r') as f:
    raw_text = f.read()

['svg', 'word.frequencies.snippets.xml', 'fulltext.pdf', 'pdfimages', 'search.country.snippets.xml', 'scholarly.html', 'eupmc_result.json', 'search.country.count.xml', 'results', 'fulltext.xml', 'word.frequencies.count.xml']
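
The listing above shows scholarly.html sitting next to the other per-paper outputs. A minimal sketch of collecting that file for every paper, assuming each paper has its own subdirectory under a results folder as it does here. The function name `find_scholarly_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def find_scholarly_files(results_dir):
    """Return the scholarly.html path of every paper directory that has one."""
    root = Path(results_dir)
    # only look one level down: <results_dir>/<paper id>/scholarly.html
    # sorted() keeps the order stable between runs
    return sorted(p for p in root.glob('*/scholarly.html') if p.is_file())
```

Calling `find_scholarly_files('./results')` would then give the list of files to feed to the parser, skipping any paper directory that lacks a scholarly.html.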


In [2]:
from bs4 import BeautifulSoup

#parse the html into soup
soup = BeautifulSoup(raw_text, 'html.parser')

#extract the body
body = soup.find('body')
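
Before drilling into individual sections it can help to see which top-level sections a document has. The sketch below runs on a hand-written stand-in for a scholarly.html file; the assumption that the `front`, `body` and `back` divs are direct children of `<body>` is mine and may not hold for every file.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for a real scholarly.html file (an assumption)
sample_html = """
<html><body>
  <div class="front" title="front"><div class="article-meta"></div></div>
  <div class="body" title="body"><p>Some text.</p></div>
  <div class="back" title="back"><div class="ack" title="ack"></div></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# collect the first class of every div that is a direct child of <body>
sections = [div['class'][0]
            for div in sample_soup.body.find_all('div', recursive=False)]

print(sections)  # ['front', 'body', 'back']
```

Passing `recursive=False` restricts the search to direct children, so nested divs such as `article-meta` are not listed.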



### Try to make the parser work using just the abstract

In [3]:
#extract the front element
front = body.find('div', {'class':'front'})

### Extract the paper metadata

In [56]:
from pprint import pprint
from bs4 import NavigableString

def iterdict(d):
    #recursively print the leaf key/value pairs of a nested dict
    for k, v in d.items():
        if isinstance(v, dict):
            iterdict(v)
        else:
            print(k, ":", v)

def iterhtml(el):
    #recursively convert an element into a nested dict keyed by class name
    store_json = {}

    for item in el.children:

        #skip the bare strings (mostly whitespace) between the tags
        if isinstance(item, NavigableString):
            continue

        if item.name == 'div':

            #a title div is stored as text; there is nothing to dive into
            if item.get('tagx') == 'title':
                store_json['title'] = item.text.strip()
                continue

            #do some recursion, keyed by the first class of the div
            store_json[item['class'][0]] = iterhtml(item)

        #we have a span object which contains information we are interested in
        elif item.name == 'span':
            store_json[item['class'][0]] = item.text

        #paragraphs are concatenated into a single 'text' entry
        elif item.name == 'p':
            store_json['text'] = store_json.get('text', '') + item.text

    return store_json


result = iterhtml(front)

print('++++++++')

pprint(result)


++++++++
{'article-meta': {'abstract': {'background': {'text': 'The key enzymes of '
                                                      'photosynthetic carbon '
                                                      'assimilation in C4 '
                                                      'plants have evolved '
                                                      'independently several '
                                                      'times from C3 isoforms '
                                                      'that were present in '
                                                      'the C3 ancestral '
                                                      'species. The C4 isoform '
                                                      'of phosphoenolpyruvate '
                                                      'carboxylase (PEPC), the '
                                                      'primary CO2-fixing '
                                                      

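
The nested dict printed above gets hard to read once sections are several levels deep. One option is to flatten it into dotted paths, in the same spirit as `iterdict` but keeping the full key path. This is a sketch only: `flatten` is a hypothetical helper and `sample` is a shortened stand-in for the real result.

```python
def flatten(d, prefix=''):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in d.items():
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            # descend, carrying the path built so far
            yield from flatten(value, path)
        else:
            yield path, value

# shortened stand-in for the dict returned by iterhtml(front)
sample = {'article-meta': {'abstract': {'background': {'text': 'The key enzymes'}}}}

for path, value in flatten(sample):
    print(path, ':', value)
# article-meta.abstract.background.text : The key enzymes
```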
In [57]:
#try to extract the body using this

main = body.find('div', {'class':'body'})

body_result = iterhtml(main)

pprint(body_result)

{"authors'contributions": {'text': 'SE carried out the histochemical and '
                                   'quantitative GUS assays, the cloning of '
                                   'construct ppcA-PRFtΔIntron-DR(+)Ft, the '
                                   'sequence alignments and wrote the '
                                   'manuscript. CZ produced construct '
                                   'ppcA-PRFp-DR(+)Ft. MK, US and MS performed '
                                   'the transformation of F. bidentis. PW '
                                   'coordinated the design of this study and '
                                   'participated in drafting the manuscript. '
                                   'All authors read and approved the final '
                                   'manuscript.',
                           'title': "Authors' contributions"},
 'background': {'text': 'About 90% of terrestrial plant species, including '
                        'major crops such 


### Create a class which will read the objects in

In [49]:
from bs4 import BeautifulSoup, NavigableString

class paper_obj(object):

    def __init__(self, html_file):

        self.abstract_text = ''
        self.body_text = ''

        #load the data in
        with open(html_file, 'r') as f:
            raw_text = f.read()

        #parse the html into soup
        soup = BeautifulSoup(raw_text, 'html.parser')

        #extract the metadata
        front = soup.find('div', {'class':'front'})

        #extract the body
        body = soup.find('div', {'class':'body'})

        #extract the tail
        tail = soup.find('div', {'class':'back'})

        #extract the metadata into a json
        self.meta = self.iterhtml(front)

        #extract the body of the document
        self.body = self.iterhtml(body)

        #keep the raw back element; a dedicated tail parser is not written yet
        self.tail = tail

    def iterhtml(self, el):

        #set up the json we're going to be returning
        store_json = {}

        for item in el.children:

            #skip the bare strings (mostly whitespace) between the tags
            if isinstance(item, NavigableString):
                continue

            if item.name == 'div':

                #a title div is stored as text; there is nothing to dive into
                if item.get('tagx') == 'title':
                    store_json['title'] = item.text.strip()
                    continue

                #do some recursion, keyed by the first class of the div
                store_json[item['class'][0]] = self.iterhtml(item)

            #we have a span object which contains information we are interested in
            elif item.name == 'span':
                store_json[item['class'][0]] = item.text

            #paragraphs are concatenated into a single 'text' entry
            elif item.name == 'p':
                store_json['text'] = store_json.get('text', '') + item.text

        return store_json

###############################
### End of class definition ###
###############################

#set the filename
filename = './results/PMC2241601/scholarly.html'

#create the paper object
result = paper_obj(filename)


In [14]:
pprint(meta_store)

{'abstract': 'Abstract  \n'
             'Background The key enzymes of photosynthetic carbon assimilation '
             'in C4 plants have evolved independently several times from C3 '
             'isoforms that were present in the C3 ancestral species. The C4 '
             'isoform of phosphoenolpyruvate carboxylase (PEPC), the primary '
             'CO2-fixing enzyme of the C4 cycle, is specifically expressed at '
             'high levels in mesophyll cells of the leaves of C4 species. We '
             'are interested in understanding the molecular changes that are '
             'responsible for the evolution of this C4-characteristic PEPC '
             'expression pattern, and we are using the genus Flaveria '
             '(Asteraceae) as a model system. It is known that cis-regulatory '
             'sequences for mesophyll-specific expression of the ppcA1 gene of '
             'F. trinervia (C4) are located within a distal promoter region '
             '(DR).   \n'
   

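
The extracted abstract above still carries stray whitespace such as `'Abstract  \n'`. A small sketch of cleaning that up with plain string methods; `normalise_whitespace` is a hypothetical helper, and `raw` is a shortened copy of the output above.

```python
def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no argument splits on any whitespace run
    # and drops leading and trailing whitespace
    return ' '.join(text.split())

raw = 'Abstract  \nBackground The key enzymes of photosynthetic carbon assimilation'

print(normalise_whitespace(raw))
# Abstract Background The key enzymes of photosynthetic carbon assimilation
```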
In [61]:
tail = soup.find('div', {'class':'back'})

print(tail)

<div class="back" title="back"> <div class="ack" title="ack"> <div class="acknowledgements" title="sec"> <div class="title" tagx="title" title="title">
Acknowledgements</div> <p>This work was supported by the Deutsche Forschungsgemeinschaft within the SFB 590 "Inhärente und adaptive Differenzierungsprozesse" at the Heinrich-Heine-Universität Düsseldorf.</p> </div> </div> <div class="references">
References</div> <div tag="ref-list"> <ul> <li tag="ref"><a name="B1"></a><div class="citation" tagx="citation" title="citation"><span class="person-group'"><span class="name" tagx="name" title="name"><span class="surname" tagx="surname" title="surname">Black</span><span class="given-names" tagx="given-names" title="given-names">CC</span><span class="suffix" tagx="suffix" title="suffix">Jr.</span></span></span><span class="article-title" tagx="article-title" title="article-title">Photosynthetic carbon fixation in relation to net CO2 uptake</span><span class="source" tagx="source" title="source"

In [91]:

from bs4 import NavigableString

#define a class to extract the acknowledgements
class acknowledgements_extractor(object):

    def __init__(self, tail):

        #get the acknowledgements element
        acks = tail.find('div', {'class':'ack'})

        #set up the dict holding the extracted title and text
        self.ack_dict = {'text':''}

        #start iterating
        self.iteracks(acks)

    def iteracks(self, el):

        #iterate through the children of this element
        for child in el.children:

            #skip the bare strings (mostly whitespace) between the tags
            if isinstance(child, NavigableString):
                continue

            #check if we have a div, usually the acknowledgements heading
            if child.name == 'div':

                print('FOUND DIV')

                #check if this element is a title
                if child['class'][0] == 'title':

                    self.ack_dict['title'] = child.text.strip()

                    #if it is we are done with this element
                    continue

                #go one level down
                self.iteracks(child)

            #check if this is a text element
            elif child.name == 'p':

                print('FOUND TEXT')

                #store the paragraph itself rather than the whole parent element,
                #so the heading is not repeated in the text
                self.ack_dict['text'] += ' ' + child.text.strip()

################################
### End of class definitions ###
################################


acks_obj = acknowledgements_extractor(tail)

print(acks_obj.ack_dict)

FOUND DIV
FOUND DIV
FOUND TEXT
{'text': ' This work was supported by the Deutsche Forschungsgemeinschaft within the SFB 590 "Inhärente und adaptive Differenzierungsprozesse" at the Heinrich-Heine-Universität Düsseldorf.', 'title': 'Acknowledgements'}


### Look into making a reference extractor

In [101]:
refs = tail.find('div', {'tag':'ref-list'})

for r in refs.children:
    
    #check if the item is whatever a navigable string is
    if 'bs4.element.NavigableString' in str(r.__class__):
        #move on if it does
        continue
        
        
    #check if this element is the table of references
    if r.name == 'ul':
        
        #go through the list
        for l in r.children:
            
            #check if the item is whatever a navigable string is
            if 'bs4.element.NavigableString' in str(l.__class__):
                #move on if it does
                continue
            
            print(l)
            print(type(l))
            
            break


<li tag="ref"><a name="B1"></a><div class="citation" tagx="citation" title="citation"><span class="person-group'"><span class="name" tagx="name" title="name"><span class="surname" tagx="surname" title="surname">Black</span><span class="given-names" tagx="given-names" title="given-names">CC</span><span class="suffix" tagx="suffix" title="suffix">Jr.</span></span></span><span class="article-title" tagx="article-title" title="article-title">Photosynthetic carbon fixation in relation to net CO2 uptake</span><span class="source" tagx="source" title="source">Ann Rev Plant Physiol</span><span class="year" tagx="year" title="year">1973</span><span class="volume" tagx="volume" title="volume">24</span><span class="fpage" tagx="fpage" title="fpage">253</span><span class="lpage" tagx="lpage" title="lpage">286</span><span class="pub-id"><a href="https://dx.doi.org/10.1146/annurev.pp.24.060173.001345">10.1146/annurev.pp.24.060173.001345</a></span></div> </li>
<class 'bs4.element.Tag'>


In [108]:

ref_dict = {}
for c in l:
    
    #check if the item is whatever a navigable string is
    if 'bs4.element.NavigableString' in str(c.__class__):
        #move on if it does
        continue
    
    print(c)
    print('---')
    
    if c.name == 'a':
        
        ref_dict['id'] = c['name']
        
        ref_dict['ref_no'] = int(c['name'][1:])
        
    elif c.name == 'div':
        
        print(c)

<a name="B1"></a>
---
<div class="citation" tagx="citation" title="citation"><span class="person-group'"><span class="name" tagx="name" title="name"><span class="surname" tagx="surname" title="surname">Black</span><span class="given-names" tagx="given-names" title="given-names">CC</span><span class="suffix" tagx="suffix" title="suffix">Jr.</span></span></span><span class="article-title" tagx="article-title" title="article-title">Photosynthetic carbon fixation in relation to net CO2 uptake</span><span class="source" tagx="source" title="source">Ann Rev Plant Physiol</span><span class="year" tagx="year" title="year">1973</span><span class="volume" tagx="volume" title="volume">24</span><span class="fpage" tagx="fpage" title="fpage">253</span><span class="lpage" tagx="lpage" title="lpage">286</span><span class="pub-id"><a href="https://dx.doi.org/10.1146/annurev.pp.24.060173.001345">10.1146/annurev.pp.24.060173.001345</a></span></div>
---
<div class="citation" tagx="citation" title="citati

In [106]:
ref_dict

{'id': 'B1', 'ref_no': 1}
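
So far only the anchor of a reference is parsed. As a possible next step, the spans inside the citation div could be read into the same dict. The sketch below works on one reference shortened from the output above; `parse_reference` is a hypothetical helper, and it assumes every reference has an `<a name="B…">` anchor and a `div` with class `citation`, which has only been checked for this one entry.

```python
from bs4 import BeautifulSoup

# one reference, shortened from the output above
ref_html = (
    '<li tag="ref"><a name="B1"></a>'
    '<div class="citation" tagx="citation" title="citation">'
    '<span class="name" tagx="name">'
    '<span class="surname" tagx="surname">Black</span>'
    '<span class="given-names" tagx="given-names">CC</span></span>'
    '<span class="article-title" tagx="article-title">'
    'Photosynthetic carbon fixation in relation to net CO2 uptake</span>'
    '<span class="source" tagx="source">Ann Rev Plant Physiol</span>'
    '<span class="year" tagx="year">1973</span>'
    '</div></li>'
)

def parse_reference(li):
    """Turn one <li tag="ref"> element into a flat dict."""
    anchor = li.a['name']
    ref = {'id': anchor, 'ref_no': int(anchor[1:]), 'authors': []}
    citation = li.find('div', {'class': 'citation'})

    # every author is a span.name wrapping surname / given-names spans
    for name in citation.find_all('span', {'class': 'name'}):
        parts = (name.find('span', {'class': 'surname'}),
                 name.find('span', {'class': 'given-names'}))
        ref['authors'].append(' '.join(p.text for p in parts if p is not None))

    # the remaining fields are simple spans identified by their tagx attribute
    for field in ('article-title', 'source', 'year'):
        span = citation.find('span', {'tagx': field})
        if span is not None:
            ref[field] = span.text

    return ref

sample_li = BeautifulSoup(ref_html, 'html.parser').li
parsed_ref = parse_reference(sample_li)
print(parsed_ref)
```

The fields are looked up by `tagx` rather than `class` because some class values in the real file are malformed (for example `person-group'` with a stray quote).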

### Extract the abstract

In [15]:
#find the abstract
abstract = front.find('div', {'class':'abstract'})

#get the abstract title
abstract_title = abstract.find('div', {'class':'abstract-title'}).text.strip()

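#remove the title section; decompose() deletes it from the parsed tree in place,
#so front no longer contains the abstract title after this cell has run
abstract.find('div', {'class':'abstract-title'}).decompose()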

#get the abstract text
abstract_text = abstract.text.strip()


#print('TITLE')
#print(abstract_title)
#print('ABSTRACT BODY')
#print(abstract_text)
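
Because `decompose()` changes the parsed tree, cells that run later see a `front` element without the abstract title. A sketch of getting the same two values without modifying the tree, run here on a hand-written stand-in for the abstract element; the structure of `sample_abstract_html` is an assumption, and real abstracts may nest their paragraphs inside further divs.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the abstract element (an assumption)
sample_abstract_html = (
    '<div class="abstract">'
    '<div class="abstract-title">Abstract</div>'
    '<p>The key enzymes of photosynthetic carbon assimilation.</p>'
    '</div>'
)

sample_abstract = BeautifulSoup(sample_abstract_html, 'html.parser').div

# get the abstract title
title_div = sample_abstract.find('div', {'class': 'abstract-title'})
sample_title = title_div.text.strip()

# join the text of every direct child except the title, leaving the tree intact
sample_text = ' '.join(
    child.get_text(' ', strip=True)
    for child in sample_abstract.find_all(True, recursive=False)
    if child is not title_div
)

print(sample_title)
print(sample_text)
```

After this runs, the title div is still present in `sample_abstract`, so the order in which cells are executed no longer matters.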

### Extract the references etc.

In [16]:
#Extract the references

#pull the back
back = body.find('div', {'class':'back'})

#get the Acknowledgements
ack = back.find('div', {'class':'ack'})

#get the references
refs = back.find('div', {'tag':'ref-list'})

print(refs)

<div tag="ref-list"> <ul> <li tag="ref"><a name="B1"></a><div class="citation" tagx="citation" title="citation"><span class="person-group'"><span class="name" tagx="name" title="name"><span class="surname" tagx="surname" title="surname">Black</span><span class="given-names" tagx="given-names" title="given-names">CC</span><span class="suffix" tagx="suffix" title="suffix">Jr.</span></span></span><span class="article-title" tagx="article-title" title="article-title">Photosynthetic carbon fixation in relation to net CO2 uptake</span><span class="source" tagx="source" title="source">Ann Rev Plant Physiol</span><span class="year" tagx="year" title="year">1973</span><span class="volume" tagx="volume" title="volume">24</span><span class="fpage" tagx="fpage" title="fpage">253</span><span class="lpage" tagx="lpage" title="lpage">286</span><span class="pub-id"><a href="https://dx.doi.org/10.1146/annurev.pp.24.060173.001345">10.1146/annurev.pp.24.060173.001345</a></span></div> </li> <li tag="ref">