# GROBID (GeneRation Of BIbliographic Data)

## 1. Introduction

[GROBID](https://github.com/kermitt2/grobid) is a machine learning library for extracting, parsing and restructuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.  

Functionalities include full text extraction and structuring from PDF articles including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, data availability statements, etc.).

We will use this to extract the full text from the arXiv PDFs for which we downloaded the metadata.

## 2. Install/import libraries

In [None]:
import pandas as pd
import pickle
import re
import requests
import urllib.request
import time
import concurrent.futures

from multiprocessing.pool import ThreadPool
from functools import lru_cache
from bs4 import BeautifulSoup, NavigableString, Tag
from itertools import chain
from collections import Counter

## 3. Import arXiv PDF metadata

Read in the metadata collected via the arXiv API, with PDF URLs from the export.arxiv.org subdomain.

In [None]:
article_results_arxiv_304 = pd.read_pickle('2023-01-06_arxiv_results_for_dl.pickle')
article_results_arxiv_304

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1
3,2212.00023v2,2022-11-30,2022-12-08,Random Copolymer inverse design system orienti...,,"Tianyu Wu, Yang Tang",,http://export.arxiv.org/pdf/2212.00023v2
4,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1
...,...,...,...,...,...,...,...,...
299,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1
300,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1
301,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1
302,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1


## 4. Download full text as XML

Create a function to download the parsed XML for each PDF using a Python wrapper for the public GROBID demo server, which is available for testing purposes. Please note that, according to the [GROBID repo](https://github.com/kermitt2/grobid/), quota and query limitations apply to the demo server, and for any serious work you will need to deploy and use your own GROBID server. If a document exceeds the maximum token limit you will get an exception, e.g. Exception('[TOO_MANY_TOKENS] The document has 1374455 tokens, but the limit is 1000000').

The code below used the previous demo server hosted at [https://cloud.science-miner.com/grobid](https://cloud.science-miner.com/grobid), which has since been updated and now redirects to a new demo server. The new server runs a combination of Deep Learning and CRF models and is hosted on Hugging Face Spaces at [https://kermitt2-grobid.hf.space/](https://kermitt2-grobid.hf.space/) or [https://huggingface.co/spaces/kermitt2/grobid](https://huggingface.co/spaces/kermitt2/grobid).

A faster demo with CRF only is available at https://kermitt2-grobid-crf.hf.space/ or https://huggingface.co/spaces/kermitt2/grobid-crf.

To run the code now, the GROBID URL in the function needs to be updated to [https://kermitt2-grobid.hf.space/api/processFulltextDocument](https://kermitt2-grobid.hf.space/api/processFulltextDocument).
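Since the endpoint has already moved once, it may help to keep the base URL in one place. The following is only a minimal sketch: `GROBID_BASE_URL` and `grobid_endpoint` are hypothetical names introduced here for illustration and are not part of GROBID or of the original code.

```python
# Hypothetical helper (not part of GROBID): build an API endpoint
# from a configurable base URL so the server can be swapped easily.
GROBID_BASE_URL = "https://kermitt2-grobid.hf.space"  # demo server mentioned above


def grobid_endpoint(base_url: str = GROBID_BASE_URL,
                    service: str = "processFulltextDocument") -> str:
    """Return the full URL of a GROBID API service."""
    return f"{base_url.rstrip('/')}/api/{service}"


print(grobid_endpoint())
```

With a self-hosted server, only the base URL would change, e.g. `grobid_endpoint("http://localhost:8070")`, where the address is just an example.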



In [None]:
# Adapted from https://github.com/titipata/scipdf_parser/blob/master/scipdf/pdf/parse_pdf.py

def parse_pdf(pdf_url: str):
    """
    Parse PDF to XML using GROBID tool

    :param pdf_url: str, URL to article PDF

    :return: XML of parsed article
    """
    # GROBID URL for the cloud service to parse full text of the article
    url = "https://cloud.science-miner.com/grobid/api/processFulltextDocument"

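    if isinstance(pdf_url, str):
        # Download the PDF bytes, then post them to GROBID as the "input" form field
        page = urllib.request.urlopen(pdf_url).read()
        resp = requests.post(url, files={"input": page})
        if resp.status_code != 200:
            raise Exception(resp.text)
        parsed_article = resp.text
        # Pause briefly before the next request to go easy on the shared demo server
        time.sleep(3)
    else:
        raise TypeError("Need to supply a url")

    return parsed_article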

Multithreading is used here since the code is I/O-bound rather than CPU-bound. Executing multiple threads concurrently speeds up the process as opposed to just iterating through the pdf_url list using a for loop calling the parse_pdf() function sequentially for each URL.

In [None]:
with concurrent.futures.ThreadPoolExecutor(4) as executor:
    futures = [executor.submit(parse_pdf, pdf_url) for pdf_url in article_results_arxiv_304.pdf_url]
    concurrent.futures.wait(futures)
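The benefit of threads for I/O-bound work can be illustrated without touching the network. This is a sketch under stated assumptions: `fake_download` is a made-up stand-in for parse_pdf() that sleeps instead of waiting on a server, and the identifiers in `demo_urls` are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> str:
    """Stand-in for parse_pdf(): sleep instead of waiting on the network."""
    time.sleep(0.1)
    return f"parsed:{url}"


demo_urls = [f"doc-{i}" for i in range(8)]  # invented identifiers

# Sequential loop: the waits add up
start = time.perf_counter()
sequential = [fake_download(u) for u in demo_urls]
sequential_seconds = time.perf_counter() - start

# Four worker threads: up to four waits overlap at any time
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(fake_download, demo_urls))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Both approaches return the same results; only the elapsed time differs, because the threads spend their waiting time in parallel.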

Create a dictionary that maps each PDF URL to its future. Futures are proxies for results that do not yet exist but will in the future.

In [None]:
futures_map = dict(zip(article_results_arxiv_304.pdf_url, futures))
futures_map

{'http://export.arxiv.org/pdf/2109.06377v4': <Future at 0x7f9fc2e3d490 state=finished returned str>,
 'http://export.arxiv.org/pdf/2212.09867v1': <Future at 0x7f9fc6be80d0 state=finished returned str>,
 'http://export.arxiv.org/pdf/2212.09610v1': <Future at 0x7f9fc2e3db50 state=finished returned str>,
 'http://export.arxiv.org/pdf/2212.00023v2': <Future at 0x7f9fc2e3de20 state=finished raised Exception>,
 'http://export.arxiv.org/pdf/2212.03911v1': <Future at 0x7f9fc6174670 state=finished returned str>,
 'http://export.arxiv.org/pdf/2212.01575v1': <Future at 0x7f9fc616dac0 state=finished returned str>,
 'http://export.arxiv.org/pdf/2103.02009v2': <Future at 0x7f9fc616d910 state=finished returned str>,
 'http://export.arxiv.org/pdf/2204.08697v2': <Future at 0x7f9fc616d8e0 state=finished returned str>,
 'http://export.arxiv.org/pdf/2207.09551v2': <Future at 0x7f9fc616da00 state=finished returned str>,
 'http://export.arxiv.org/pdf/2107.02905v2': <Future at 0x7f9fc616d730 state=finished r

In [None]:
len(futures_map)

304

Create a dictionary of exceptions with the PDF URL as key and the raised exception as value.

In [None]:
exceptions = {url: f.exception() for url, f in futures_map.items() if f.exception() is not None}
exceptions

{'http://export.arxiv.org/pdf/2212.00023v2': Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1'),
 'http://export.arxiv.org/pdf/2109.00100v4': Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1'),
 'http://export.arxiv.org/pdf/2109.00435v3': Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1'),
 'http://export.arxiv.org/pdf/2007.09186v3': Exception('[GENERAL] An exception occurred while running Grobid.')}

Four exceptions were found. The three with Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1') are caused by the PDFs being unavailable, so no XML could be extracted. The one with Exception('[GENERAL] An exception occurred while running Grobid.') is a legitimate PDF, but something went wrong during processing.
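The pattern of inspecting each future for an exception before asking for its result can be tried on a small, self-contained case. In this sketch `fake_parse` and the `demo_` names are hypothetical, and the failure is simulated rather than coming from GROBID.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fake_parse(url: str) -> str:
    """Simulated parser: fails for identifiers ending in 'bad'."""
    if url.endswith("bad"):
        raise Exception("[BAD_INPUT_DATA] simulated failure")
    return "<TEI/>"


demo_urls = ["doc-1", "doc-2-bad", "doc-3"]

with ThreadPoolExecutor(max_workers=2) as executor:
    demo_futures = [executor.submit(fake_parse, u) for u in demo_urls]
    wait(demo_futures)

demo_map = dict(zip(demo_urls, demo_futures))

# exception() returns None for a future that finished normally
demo_failed = {u: str(f.exception()) for u, f in demo_map.items()
               if f.exception() is not None}
# result() would re-raise the stored exception, so only call it on successes
demo_succeeded = {u: f.result() for u, f in demo_map.items()
                  if f.exception() is None}

print(demo_failed)
print(demo_succeeded)
```

Checking `exception()` first matters because calling `result()` on a failed future raises the stored exception again.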

In [None]:
results = {url: f.result()[:100] for url, f in futures_map.items() if f.exception() is None}
results

{'http://export.arxiv.org/pdf/2109.06377v4': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2212.09867v1': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2212.09610v1': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2212.03911v1': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2212.01575v1': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2103.02009v2': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"',
 'http://export.arxiv.org/pdf/2204.08697v2': '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space

Amend the parse_pdf function with error handling for specific exceptions and run again to download the full text.

We will use the @lru_cache decorator from Python's functools module, which memoizes the function's return values so that repeated calls with the same input are served from the cache instead of being executed again.

In [None]:
@lru_cache(maxsize=None)
def parse_pdf(pdf_url: str):
    """
    Parse PDF to XML using GROBID tool

    :param pdf_url: str, URL to article PDF

    :return: XML of parsed article
    """
    # GROBID URL for the cloud service to parse full text of the article
    url = "https://cloud.science-miner.com/grobid/api/processFulltextDocument"

    if isinstance(pdf_url, str):
        page = urllib.request.urlopen(pdf_url).read()
        resp = requests.post(url, files={"input": page})
        if resp.status_code != 200:
            # Failures are returned as short sentinel strings so that the
            # result list stays aligned with the list of input URLs
            if resp.status_code < 500:
                # Any non-200 status below 500 is flagged as "500"
                return "500"
            if resp.text in ['[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1']:
                # GROBID could not convert the PDF
                return "1"
            if resp.text in ['{\n  "message":"An invalid response was received from the upstream server"\n}',
                             '[GENERAL] An exception occurred while running Grobid.']:
                # Server-side failure while processing the document
                return "0"
            # Unrecognised server error: surface it to the caller
            raise Exception(resp.text)
        parsed_article = resp.text
        time.sleep(3)
    else:
        raise TypeError("Need to supply a url")

    return parsed_article
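How memoization with `lru_cache` behaves can be checked on a toy function. In this sketch `fetch` and `call_log` are hypothetical names. Note that returned values are cached but raised exceptions are not.

```python
from functools import lru_cache

call_log = []  # records every real execution of the function body


@lru_cache(maxsize=None)
def fetch(url: str) -> str:
    call_log.append(url)
    if url == "broken":
        raise ValueError("simulated failure")
    return f"xml-for-{url}"


fetch("a")
fetch("a")  # served from the cache, body not executed again

for _ in range(2):
    try:
        fetch("broken")  # exceptions are not cached, so the body runs twice
    except ValueError:
        pass

print(call_log)
print(fetch.cache_info())
```

One consequence for parse_pdf() is that a sentinel string such as "1" or "0" is a normal return value and is therefore cached too. Calling `parse_pdf.cache_clear()` empties the cache if a failed URL should be tried again in the same session.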

In [None]:
with ThreadPool(4) as pool:
  dl_pdf_xml = pool.map(parse_pdf, article_results_arxiv_304.pdf_url)

In [None]:
with open('2023-01-06_grobid_xml_str_v2.pickle', "wb") as f:
    pickle.dump(dl_pdf_xml, f)

In [None]:
len(dl_pdf_xml)

304

Check XML output for first article which was extracted successfully.

In [None]:
dl_pdf_xml[0]

'<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" \nxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \nxsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"\n xmlns:xlink="http://www.w3.org/1999/xlink">\n\t<teiHeader xml:lang="en">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level="a" type="main">ASGARD: A Single-cell Guided Pipeline to Aid Repurposing of Drugs</title>\n\t\t\t</titleStmt>\n\t\t\t<publicationStmt>\n\t\t\t\t<publisher/>\n\t\t\t\t<availability status="unknown"><licence/></availability>\n\t\t\t</publicationStmt>\n\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct>\n\t\t\t\t\t<analytic>\n\t\t\t\t\t\t<author>\n\t\t\t\t\t\t\t<persName><forename type="first">Bing</forename><surname>He</surname></persName>\n\t\t\t\t\t\t\t<affiliation key="aff0">\n\t\t\t\t\t\t\t\t<orgName type="department" key="dep1">Department of Computational Medicine and

Check output for the first exception in futures_map dictionary above with Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1').

In [None]:
dl_pdf_xml[3]

'1'
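Looking up `dl_pdf_xml[3]` for the fourth URL works because `pool.map` returns results in the order of its inputs, regardless of which thread finishes first. A small sketch with a made-up `slow_echo` function shows this:

```python
import time
from multiprocessing.pool import ThreadPool


def slow_echo(item):
    """Sleep for the given delay, then return the item's index."""
    index, delay = item
    time.sleep(delay)
    return index


# The first item takes longest, so it finishes last
demo_items = [(0, 0.3), (1, 0.1), (2, 0.2), (3, 0.0)]

with ThreadPool(4) as pool:
    ordered = pool.map(slow_echo, demo_items)

print(ordered)
```

The returned list follows the input order even though the tasks complete in a different order.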

## 5. Clean and return XML as Beautiful Soup object

Function to remove the XML namespace (xmlns) attributes and return the cleaned, parsed text as a [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) object. This object represents the document as a nested data structure so that we can navigate and extract text using XML tags.

The simplest way to navigate the parse tree is to find a tag by name so we will use the find() method to return the XML for the `<body>` tag content.



In [None]:
# remove xmlns references from XML and convert to soup object

def soupify(text):
    # Raw string avoids invalid-escape warnings for \s and \w
    cleaned_text = re.sub(r'\s*xmlns(:\w+)?="[^"]*"', '', text)
    return BeautifulSoup(cleaned_text, 'lxml-xml').find("body")
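The effect of the regular expression can be seen in isolation on a shortened TEI opening tag. This sketch uses only the standard library; `XMLNS_PATTERN` and `tei_snippet` are illustrative names, and the snippet is abbreviated from the output shown earlier.

```python
import re

# Same pattern as in soupify(): optional whitespace, "xmlns" with an optional
# ":prefix", then a double-quoted value
XMLNS_PATTERN = re.compile(r'\s*xmlns(:\w+)?="[^"]*"')

tei_snippet = (
    '<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:xlink="http://www.w3.org/1999/xlink"><text><body/></text></TEI>'
)

cleaned = XMLNS_PATTERN.sub("", tei_snippet)
print(cleaned)
```

The `xml:space` attribute is left alone because the pattern only matches attribute names that start with `xmlns`.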

View output for a 'soupified' article.

In [None]:
soupify_article_6 = soupify(dl_pdf_xml[6])
soupify_article_6

<body>
<div><head>INTRODUCTION</head><p>The global impact of the COVID-19 pandemic on both human health and socioeconomic activity has brought to light the importance of a nuanced understanding of the way that epidemics spread in cities. Urban spread of disease is inherently tied to human mobility: as we move through and between cities, we serve as vectors that allow disease to spread to new individuals and communities. The nature of the relationship between mobility and spread of disease has been extensively studied; it is widely understood that human travel is a driving force behind disease spread <ref target="#b0" type="bibr">[1]</ref><ref target="#b1" type="bibr">[2]</ref><ref target="#b2" type="bibr">[3]</ref><ref target="#b3" type="bibr">[4]</ref><ref target="#b4" type="bibr">[5]</ref>. For this reason, many public policy interventions implemented worldwide to contain the spread of COVID-19 focused on limiting mobility, restricting the radius that individuals could travel from th

In [None]:
soup_results_arxiv = list(map(soupify, dl_pdf_xml))

In [None]:
type(soup_results_arxiv)

list

In [None]:
len(soup_results_arxiv)

304

## 6. Navigating the tree

A `<Tag>` object corresponds to an XML tag in the original document, in this case the `<body>` tag.

In [None]:
type(soup_results_arxiv[6])

bs4.element.Tag

To find the first `<head>` tag we can just use the format `soup.head`.

In [None]:
soup_results_arxiv[6].head

<head>INTRODUCTION</head>

View the result when returning the type of object for the first exception in futures_map dictionary above with Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1').

In [None]:
type(soup_results_arxiv[3])

NoneType
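Because a failed download yields None instead of a `<Tag>`, any later step needs a guard before navigating the tree. A minimal sketch, assuming a hypothetical helper `first_heading` and using Python's built-in "html.parser" so that it runs without lxml (the function above uses 'lxml-xml'):

```python
from bs4 import BeautifulSoup

good_xml = (
    "<TEI><text><body>"
    "<div><head>INTRODUCTION</head><p>First.</p></div>"
    "<div><head>METHODS</head><p>Second.</p></div>"
    "</body></text></TEI>"
)
failed_download = "1"  # sentinel string returned for a failed conversion


def first_heading(xml_text):
    """Return the text of the first <head> tag, or None if there is none."""
    body = BeautifulSoup(xml_text, "html.parser").find("body")
    if body is None or body.head is None:
        return None
    return body.head.get_text()


print(first_heading(good_xml))
print(first_heading(failed_download))
```

The attribute shortcut `body.head` is equivalent to `body.find("head")` and returns only the first match.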

## 7. Find all tags including descendants

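Previously we used the find() method and the `.head` shortcut for each article, which give only the *first* tag by that name. This is sufficient for `<body>`, which occurs once per article, but an article can contain several `<head>` tags, of which only the first is returned.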

The function below uses the find_all() method to return *all* of the direct child and descendant tags of the `<body>` tag for each article.

In [None]:
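def find_all_tags(article):

    tags_list = []

    try:
        # find_all(True) matches every tag nested anywhere inside the article
        for tag in article.find_all(True):
            if tag.name is not None:
                tags_list.append(tag.name)
    except Exception as e:
        # Failed downloads are None and have no find_all() method
        print(e)

    return sorted(set(tags_list))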


In [None]:
all_tags = list(map(find_all_tags, soup_results_arxiv))

'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'


As expected, these messages come from the four articles that failed earlier. They show up as empty lists.

In [None]:
all_tags

[['div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 [],
 ['div', 'figDesc', 'figure', 'graphic', 'head', 'label', 'p', 'ref'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'p',
  'ref',
  'row',
  'table'],
 ['div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'p',
  'ref'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'grap

Use the built-in sum() function with an empty list as the start value to flatten the nested all_tags list into a single list.

In [None]:
sum_all_tags = sum(all_tags, [])
sum_all_tags

['div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'p',
 'ref',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'p',
 'ref',
 'row',
 'table',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'p',
 'ref',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure
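Flattening with `sum(nested, [])` builds a new list at every step, so its cost grows quadratically with the number of sublists. That is fine for a few hundred articles. As a hedged alternative, `itertools.chain.from_iterable` (already imported in this notebook) gives the same result in linear time. The `demo_nested` data below is made up.

```python
from itertools import chain

demo_nested = [["div", "head", "p"], [], ["div", "figure", "p"]]

flat_sum = sum(demo_nested, [])                      # approach used in this notebook
flat_chain = list(chain.from_iterable(demo_nested))  # linear-time alternative

print(flat_chain)
```

Empty sublists, such as those from failed downloads, simply contribute nothing in both versions.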

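Use the Counter class and its most_common() method, which returns the elements and their counts ordered from the most common to the least (all elements when *n* is omitted). Since each article contributes a tag name at most once, each count is the number of articles that contain the tag.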

In [None]:
num_all_unique_tags = Counter(sum_all_tags).most_common()

for tag, count in num_all_unique_tags:
    print(tag, count)

div 300
head 300
p 300
ref 299
label 298
figDesc 297
figure 297
graphic 249
table 240
cell 235
row 235
formula 179
note 177
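The counting step can be reproduced on a small, invented dataset. In this sketch `demo_per_article` is hypothetical; it mimics the per-article lists of unique tag names, including an empty list for a failed download.

```python
from collections import Counter

demo_per_article = [
    ["div", "p", "ref"],
    ["div", "p"],
    [],                      # failed download
    ["div", "formula", "p"],
]

# Counter accepts any iterable, so the lists can be flattened on the fly
tag_counts = Counter(tag for tags in demo_per_article for tag in tags)

top_two = tag_counts.most_common(2)   # limit the result to the two most common tags
everything = tag_counts.most_common() # no argument: all tags, most common first

print(top_two)
```

Tags with equal counts are listed in the order in which they were first encountered.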


## 8. Find direct child tags of `<body>` tag

Every tag has a name which can be accessed using the `.name` attribute.

We will use this to find the direct child tags of the `<body>` tag.


In [None]:
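def find_body_child_tags(article):

    body_child_list = []

    try:
        # Iterating over a tag yields only its direct children
        for tag in article:
            # Text nodes have no tag name, so skip them
            if tag.name is not None:
                body_child_list.append(tag.name)
    except Exception as e:
        # Failed downloads are None and cannot be iterated
        print(e)

    return sorted(set(body_child_list))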
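The difference between all descendants and direct children can be checked on a tiny document. This sketch assumes the built-in "html.parser" (so it runs without lxml) and an invented `demo_xml` string. `find_all(True, recursive=False)` is used as an alternative way to restrict the search to direct children.

```python
from bs4 import BeautifulSoup

demo_xml = (
    "<body>"
    "<div><head>Intro</head><p>Text <ref>[1]</ref></p></div>"
    "<figure><head>Fig 1</head></figure>"
    "</body>"
)
demo_body = BeautifulSoup(demo_xml, "html.parser").find("body")

# Every tag nested anywhere below <body>
descendant_names = sorted({tag.name for tag in demo_body.find_all(True)})

# Only tags that sit directly under <body>
child_names = sorted({tag.name for tag in demo_body.find_all(True, recursive=False)})

print(descendant_names)
print(child_names)
```

Here `<head>`, `<p>` and `<ref>` appear only among the descendants, because they are nested inside `<div>` or `<figure>`.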

In [None]:
body_child_tags = list(map(find_body_child_tags, soup_results_arxiv))

'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable


We can see below that the `<body>` tag has direct child tags for `<div>`, `<figure>` and `<note>`.

In [None]:
body_child_tags

[['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 [],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'not

Use sum() with an empty list as the start value to flatten the nested body_child_tags list into a single list.

In [None]:
sum_body_child_tags = sum(body_child_tags, [])
sum_body_child_tags

['div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 '

Again use the Counter class with the most_common() method to return the elements and their counts from the most common to the least. Each count is the number of articles in which the tag is a direct child of `<body>`.

In [None]:
num_unique_body_child_tags = Counter(sum_body_child_tags).most_common()

for tag, count in num_unique_body_child_tags:
    print(tag, count)

div 300
figure 297
note 105


## 9. Find direct child tags of `<div>` tags

Function to find all `<div>` tags and append all direct child tags to a list.



In [None]:
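def find_div_tags(article):

    div_tags = []

    try:
        divs = article.find_all("div")
        for div in divs:
            # Iterating over a <div> yields only its direct children
            for tag in div:
                if tag.name is not None:
                    div_tags.append(tag.name)
    except Exception as e:
        # Failed downloads are None and have no find_all() method
        print(e)

    return sorted(set(div_tags))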

In [None]:
div_tags = list(map(find_div_tags, soup_results_arxiv))

'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'


In [None]:
div_tags

[['formula', 'head', 'note', 'p'],
 ['head', 'p'],
 ['head', 'note', 'p'],
 [],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'note', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'note', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'note', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 [

Use sum() with an empty list as the start value to flatten the nested div_tags list into a single list.

In [None]:
sum_div_tags = sum(div_tags, [])
sum_div_tags

['formula',
 'head',
 'note',
 'p',
 'head',
 'p',
 'head',
 'note',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'note',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'note',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'note',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'he

Again use the Counter class with the most_common() method to return the elements and their counts from the most common to the least. Each count is the number of articles in which the tag is a direct child of a `<div>`.

In [None]:
num_unique_div_tags = Counter(sum_div_tags).most_common()

for tag, count in num_unique_div_tags:
    print(tag, count)

p 300
head 297
formula 179
note 15


## 10. Remove unwanted tags and content

Define functions to remove unwanted tags and their contents using the decompose() method, which removes a tag from the tree and then completely destroys it and its contents.

In [None]:
def remove_headings(article):

    for head in article("head"):
        head.decompose()

    return article


def remove_figures(article):

    for figure in article("figure"):
        figure.decompose()

    return article

def remove_tables(article):

    for table in article("table"):
        table.decompose()

    return article

def remove_formulas(article):

    for formula in article("formula"):
        formula.decompose()

    return article


def remove_labels(article):

    for label in article("label"):
        label.decompose()

    return article


def remove_refs(article):

    for ref in article("ref"):
        ref.decompose()

    return article


def remove_graphics(article):

    for graphic in article("graphic"):
        graphic.decompose()

    return article


def remove_notes(article):

    for note in article("note"):
        note.decompose()

    return article


Function that applies all of the functions above to an article and returns the result wrapped in a list (an empty list if the article is None).

In [None]:
def remove_tags(article):

    new_article_list = []

    if article is not None:
        try:
            # Each helper modifies the tree in place and returns the same object
            new_article = remove_headings(article)
            new_article = remove_figures(new_article)
            new_article = remove_tables(new_article)
            new_article = remove_formulas(new_article)
            new_article = remove_labels(new_article)
            new_article = remove_refs(new_article)
            new_article = remove_graphics(new_article)
            new_article = remove_notes(new_article)
            new_article_list.append(new_article)
        except Exception as e:
            print(e)

    return new_article_list
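Since the eight removal functions differ only in the tag name, the same effect could be sketched with a single loop over a tuple of names. This is an illustrative alternative, not the approach used above: `UNWANTED_TAGS` and `remove_unwanted` are hypothetical names, and the built-in "html.parser" is assumed so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Same tag names as the individual remove_* functions
UNWANTED_TAGS = ("head", "figure", "table", "formula",
                 "label", "ref", "graphic", "note")


def remove_unwanted(article, names=UNWANTED_TAGS):
    """Decompose every tag whose name is in `names`; modifies the tree in place."""
    for name in names:
        # Search again for each name so that tags already destroyed
        # together with a removed parent are never touched twice
        for tag in article.find_all(name):
            tag.decompose()
    return article


demo_xml = (
    "<body><div><head>Intro</head>"
    "<p>Spread of disease <ref>[1]</ref>.</p>"
    "<formula>x=1</formula></div>"
    "<figure><head>Fig</head><label>1</label></figure>"
    "<note>n</note></body>"
)
demo_body = BeautifulSoup(demo_xml, "html.parser").find("body")
cleaned_body = remove_unwanted(demo_body)

print(cleaned_body)
```

As in the output shown below for the real article, removing a `<ref>` leaves the surrounding text untouched, which is why a space can remain before the full stop.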

In [None]:
soup_results_arxiv_removed_tags = list(map(remove_tags, soup_results_arxiv))

In [None]:
len(soup_results_arxiv_removed_tags)

304

View article with unwanted tags and contents removed.

In [None]:
soup_results_arxiv_removed_tags[6]

[<body>
 <div><p>The global impact of the COVID-19 pandemic on both human health and socioeconomic activity has brought to light the importance of a nuanced understanding of the way that epidemics spread in cities. Urban spread of disease is inherently tied to human mobility: as we move through and between cities, we serve as vectors that allow disease to spread to new individuals and communities. The nature of the relationship between mobility and spread of disease has been extensively studied; it is widely understood that human travel is a driving force behind disease spread . For this reason, many public policy interventions implemented worldwide to contain the spread of COVID-19 focused on limiting mobility, restricting the radius that individuals could travel from their homes and limiting travel between cities and countries.</p><p>As we continue to learn more about the ways that humans move, it is important to apply new discoveries about human mobility to the study of disease spre

And one with Exception('[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1').

In [None]:
soup_results_arxiv_removed_tags[3]

[]

## 11. Strip markup and keep text

We only want to keep the human-readable text so we will use the get_text() method to return all the text in the articles as a single Unicode string without the `<body>`, `<div>` and `<p>` tags.

In [None]:
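def strip_markup(articles):

    # Each entry is a list holding at most one article, so return the text of
    # the first item; an empty list (failed download) yields None
    for article in articles:
        return article.get_text()

    return None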



In [None]:
stripped_markup_articles = list(map(strip_markup, soup_results_arxiv_removed_tags))

In [None]:
len(stripped_markup_articles)

304

View article with all tags removed.

In [None]:
stripped_markup_articles[6]

'\nThe global impact of the COVID-19 pandemic on both human health and socioeconomic activity has brought to light the importance of a nuanced understanding of the way that epidemics spread in cities. Urban spread of disease is inherently tied to human mobility: as we move through and between cities, we serve as vectors that allow disease to spread to new individuals and communities. The nature of the relationship between mobility and spread of disease has been extensively studied; it is widely understood that human travel is a driving force behind disease spread . For this reason, many public policy interventions implemented worldwide to contain the spread of COVID-19 focused on limiting mobility, restricting the radius that individuals could travel from their homes and limiting travel between cities and countries.As we continue to learn more about the ways that humans move, it is important to apply new discoveries about human mobility to the study of disease spread, deepening our und

In [None]:
print(stripped_markup_articles[3])

None
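In the stripped text, consecutive paragraphs run together without a space, because get_text() concatenates the text nodes directly. If that matters for later processing, the `separator` and `strip` arguments of get_text() offer one possible remedy. The sketch below uses an invented two-paragraph document and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

demo_xml = "<body><div><p>First paragraph.</p><p>Second paragraph.</p></div></body>"
demo_body = BeautifulSoup(demo_xml, "html.parser").find("body")

joined = demo_body.get_text()                             # default: no separator
spaced = demo_body.get_text(separator=" ", strip=True)    # one space between text nodes

print(repr(joined))
print(repr(spaced))
```

A separator is inserted between every pair of text nodes, so it would also add a space at points where an inline tag such as `<ref>` was removed.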


In [None]:
with open('2023-01-06_stripped_markup_articles_304.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles, f)

## 12. Add extracted full text to DataFrame

Read in the arXiv metadata.

In [None]:
article_results_arxiv_304 = pd.read_pickle('2023-01-06_arxiv_results_for_dl.pickle')
article_results_arxiv_304

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1
3,2212.00023v2,2022-11-30,2022-12-08,Random Copolymer inverse design system orienti...,,"Tianyu Wu, Yang Tang",,http://export.arxiv.org/pdf/2212.00023v2
4,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1
...,...,...,...,...,...,...,...,...
299,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1
300,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1
301,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1
302,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1


Add full text with stripped markup as 'text' column to DataFrame.

In [None]:
article_results_arxiv_304['text'] = stripped_markup_articles
article_results_arxiv_304

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4,"\nHeterogeneity, or more specifically, the div..."
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1,\nThe COVID-19 pandemic caused by the novel SA...
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1,\nVirus Emulating Particles (VEP) White Blood ...
3,2212.00023v2,2022-11-30,2022-12-08,Random Copolymer inverse design system orienti...,,"Tianyu Wu, Yang Tang",,http://export.arxiv.org/pdf/2212.00023v2,
4,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1,\nIn the initial stages of a viral outbreak su...
...,...,...,...,...,...,...,...,...,...
299,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
300,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
301,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
302,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...
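
Assigning a plain list as a column is positional, so the list needs exactly one entry per row, in the same order as the DataFrame. A minimal sketch of a guard that could run before the assignment is shown below. The names `df` and `texts` and the toy values are stand-ins for illustration, not the notebook's real data.

```python
import pandas as pd

# Toy stand-ins for the real objects; the values are illustrative only.
df = pd.DataFrame({"arxiv-id": ["a1", "a2", "a3"]})
texts = ["first text", None, "third text"]  # None marks a failed extraction

# A plain list is assigned by position, so its length must match the row count.
assert len(texts) == len(df), "one text entry is needed per article"

df["text"] = texts
missing = int(df["text"].isna().sum())  # number of rows without text
```

Keeping a placeholder such as `None` for each failed article preserves the alignment between the list and the rows.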


In [None]:
with open('2023-01-06_article_results_arxiv_304_full_text.pickle', 'wb') as f:
    pickle.dump(article_results_arxiv_304, f)
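
To confirm that a saved pickle can be read back unchanged, a round trip can be sketched as follows. This uses `DataFrame.to_pickle` and `pd.read_pickle` on a toy DataFrame; the file name and values are hypothetical, and the file is written to a temporary directory rather than the notebook's working directory.

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame standing in for the real results; values are illustrative only.
df = pd.DataFrame({"arxiv-id": ["a1", "a2"], "text": ["some text", None]})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_full_text.pickle")  # hypothetical file name
    df.to_pickle(path)               # pandas convenience method for pickling
    restored = pd.read_pickle(path)  # same reader used in section 3

# equals() treats missing values in the same location as equal.
round_trip_ok = restored.equals(df)
```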

## 13. Check for missing text

Print a concise summary of the DataFrame to see whether any articles have missing text.

In [None]:
article_results_arxiv_304.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   arxiv-id   304 non-null    object
 1   published  304 non-null    object
 2   revised    304 non-null    object
 3   title      304 non-null    object
 4   journal    39 non-null     object
 5   authors    304 non-null    object
 6   doi        63 non-null     object
 7   pdf_url    304 non-null    object
 8   text       300 non-null    object
dtypes: object(9)
memory usage: 21.5+ KB


As expected, four articles have no text: the ones that raised exceptions.

In [None]:
article_results_arxiv_304[article_results_arxiv_304['text'].isna()]

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
3,2212.00023v2,2022-11-30,2022-12-08,Random Copolymer inverse design system orienti...,,"Tianyu Wu, Yang Tang",,http://export.arxiv.org/pdf/2212.00023v2,
69,2109.00100v4,2021-08-31,2021-09-07,Proceedings of KDD 2021 Workshop on Data-drive...,,"Snehalkumar, S. Gaikwad, Shankar Iyer, Dalton ...",,http://export.arxiv.org/pdf/2109.00100v4,
70,2109.00435v3,2021-09-01,2021-09-07,Proceedings of KDD 2020 Workshop on Data-drive...,,"Snehalkumar, S. Gaikwad, Shankar Iyer, Dalton ...",,http://export.arxiv.org/pdf/2109.00435v3,
176,2007.09186v3,2020-07-17,2020-10-07,AWS CORD-19 Search: A Neural Search Engine for...,,"Parminder Bhatia, Lan Liu, Kristjan Arumae, Ni...",,http://export.arxiv.org/pdf/2007.09186v3,
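
Note that `isna()` only flags null values. An article whose extraction returned an empty or whitespace-only string would not be caught by it. A minimal sketch that reports both cases is given below, using a toy DataFrame with made-up identifiers.

```python
import pandas as pd

# Toy data: one real text, one null, one empty string, one whitespace-only string.
df = pd.DataFrame(
    {
        "arxiv-id": ["a1", "a2", "a3", "a4"],
        "text": ["\nSome body text", None, "", "  \n"],
    }
)

is_null = df["text"].isna()
# fillna("") lets the string methods run on every row; strip() removes whitespace.
is_blank = df["text"].fillna("").str.strip().eq("")

null_ids = df.loc[is_null, "arxiv-id"].tolist()
blank_ids = df.loc[is_blank & ~is_null, "arxiv-id"].tolist()
```

If `blank_ids` came back non-empty on the real data, those articles would need the same follow-up as the null ones.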


We will drop the three articles that have no PDF (rows 3, 69 and 70) and handle separately the one valid PDF that GROBID failed to process (row 176).

In [None]:
article_results_arxiv_301 = article_results_arxiv_304.copy()

In [None]:
article_results_arxiv_301.drop([3,69,70], axis=0, inplace=True)
article_results_arxiv_301

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
0,2109.06377v4,2021-09-14,2022-12-22,ASGARD: A Single-cell Guided pipeline to Aid R...,,"Bing He, Yao Xiao, Haodong Liang, Qianhui Huan...",,http://export.arxiv.org/pdf/2109.06377v4,"\nHeterogeneity, or more specifically, the div..."
1,2212.09867v1,2022-12-19,2022-12-19,Detecting Contradictory COVID-19 Drug Efficacy...,,"Daniel N. Sosa, Malavika Suresh, Christopher P...",,http://export.arxiv.org/pdf/2212.09867v1,\nThe COVID-19 pandemic caused by the novel SA...
2,2212.09610v1,2022-12-19,2022-12-19,Drying of Bio-colloidal Sessile Droplets: Adva...,,"Anusuya Pal, Amalesh Gope, Anupam Sengupta",,http://export.arxiv.org/pdf/2212.09610v1,\nVirus Emulating Particles (VEP) White Blood ...
4,2212.03911v1,2022-12-07,2022-12-07,Analysis of Drug repurposing Knowledge graphs ...,,Ajay Kumar Gogineni,,http://export.arxiv.org/pdf/2212.03911v1,\nIn the initial stages of a viral outbreak su...
5,2212.01575v1,2022-12-03,2022-12-03,Multi-view deep learning based molecule design...,,"Chao Pang, Yu Wang, Yi Jiang, Ruheng Wang, Ran...",,http://export.arxiv.org/pdf/2212.01575v1,\nDe novo drug design is a time-consuming and ...
...,...,...,...,...,...,...,...,...,...
299,2003.13665v1,2020-03-30,2020-03-30,Genomics-guided molecular maps of coronavirus ...,,Gennadi Glinsky,,http://export.arxiv.org/pdf/2003.13665v1,\nCoronavirus pandemic 2020 caused by the newl...
300,2003.14258v1,2020-03-30,2020-03-30,Nanomechanical sonification of the 2019-nCoV c...,,Markus J. Buehler,,http://export.arxiv.org/pdf/2003.14258v1,\nProteins are the building blocks of virtuall...
301,2003.12454v1,2020-03-26,2020-03-26,A Machine Learning alternative to placebo-cont...,,"Ezequiel Alvarez, Federico Lamagna, Manuel Szewc",,http://export.arxiv.org/pdf/2003.12454v1,\nCurrent and last decades research in drug di...
302,2003.04524v1,2020-03-10,2020-03-10,"Old Drugs for Newly Emerging Viral Disease, CO...",,Mohammad Reza Dayer,,http://export.arxiv.org/pdf/2003.04524v1,\nThe outbreak of coronavirus in 2019-2020 is...


In [None]:
len(article_results_arxiv_301)

301

In [None]:
article_results_arxiv_301.reset_index(drop=True, inplace=True)
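
Dropping by row label depends on the current index, so it would silently target different articles if the DataFrame were re-sorted or re-fetched. One possible alternative, sketched here on toy data, selects by `arxiv-id` instead and chains the filter and reindexing without mutating the original. The identifiers and the name `ids_without_pdf` are hypothetical.

```python
import pandas as pd

# Toy DataFrame standing in for the real results; values are illustrative only.
df = pd.DataFrame(
    {
        "arxiv-id": ["a1", "a2", "a3", "a4", "a5"],
        "text": ["t1", None, "t3", None, "t5"],
    }
)

# Hypothetical identifiers of the articles to discard (no PDF available).
ids_without_pdf = ["a2", "a4"]

kept = (
    df[~df["arxiv-id"].isin(ids_without_pdf)]  # boolean mask keeps all other rows
    .reset_index(drop=True)                    # renumber from 0, discard old labels
)
```

Because the filter returns a new object, `df` keeps all five rows and no explicit `copy()` is needed.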

Check that the articles have been dropped and the reindexing has worked by confirming that the articles previously in rows 69 and 70 no longer appear around that position.

In [None]:
article_results_arxiv_301.loc[68:71]

Unnamed: 0,arxiv-id,published,revised,title,journal,authors,doi,pdf_url,text
68,2010.16413v3,2020-10-09,2021-09-05,Artificial Intelligence (AI) in Action: Addres...,Annual Review of Biomedical Data Science 4 (2021),"Qingyu Chen, Robert Leaman, Alexis Allot, Ling...",10.1146/annurev-biodatasci-021821-061045,http://export.arxiv.org/pdf/2010.16413v3,\nSince the initial reports of an outbreak of ...
69,2103.02843v2,2021-03-04,2021-09-04,Pandemic Drugs at Pandemic Speed: Infrastructu...,Interface Focus. 2021. 11 (6): 20210018,"Agastya P. Bhati, Shunzhou Wan, Dario Alfè, Au...",10.1098/rsfs.2021.0018,http://export.arxiv.org/pdf/2103.02843v2,\ndiscovery lie at the interface between machi...
70,2108.13764v1,2021-08-31,2021-08-31,Virtual screening of Microalgal compounds as p...,,Ibrahim Mohammed,,http://export.arxiv.org/pdf/2108.13764v1,\nCorona virus disease-19 is caused by Severe...
71,2108.12150v1,2021-08-27,2021-08-27,A Nested Multi-Scale Model for COVID-19 Viral ...,,"Bishal Chhetri, D. K. K Vamsi, Carani Sanjeevi",,http://export.arxiv.org/pdf/2108.12150v1,\nCOVID -19 is a contagious respiratory and va...
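
The visual check above could also be expressed as explicit conditions, which scale better than inspecting a slice by eye. The following is a minimal sketch on toy data; `before`, `after` and `dropped_ids` are illustrative names, not objects from this notebook.

```python
import pandas as pd

before = pd.DataFrame({"arxiv-id": ["a1", "a2", "a3", "a4"]})
dropped_ids = {"a2", "a3"}  # hypothetical identifiers removed in the drop step

after = before[~before["arxiv-id"].isin(dropped_ids)].reset_index(drop=True)

# 1. None of the dropped identifiers remain.
none_remain = not after["arxiv-id"].isin(dropped_ids).any()
# 2. The row count fell by exactly the number of dropped articles.
count_ok = len(after) == len(before) - len(dropped_ids)
# 3. The index is a contiguous range starting at 0 again.
index_ok = after.index.equals(pd.RangeIndex(len(after)))
```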


In [None]:
with open('2023-01-06_article_results_arxiv_301_full_text.pickle', 'wb') as f:
    pickle.dump(article_results_arxiv_301, f)

### References

* GROBID https://github.com/kermitt2/grobid

* GROBID documentation https://grobid.readthedocs.io/


* Python PDF parser for scientific publications: content and figures https://github.com/titipata/scipdf_parser
