<h1>Mining scientific articles</h1>

<h2>with XML, Python, and allofplos</h2>
<br>
<br>
<center> 
<h4> Elizabeth Seiver<br>
<a href="http://plos.org">plos.org</a>, <a href="http://twitter.com/tweetotaler">@tweetotaler</a><br>
PLOS and The Hacker Within<br>
Wed November 29, 2017</h4>
</center>

# Why Mine Scientific Articles?
* Science articles represent scientific knowledge
* XML is the version of an article meant for machines, while PDF is meant for humans
* Tool for meta-research and meta-science
* Quickly identify sets of articles of interest
* Identify research literature trends over time (study findings, jargon usage, citation networks)

# PLOS corpus of articles
* 220,000+ scientific articles from a wide array of research fields, focusing on the medical and life sciences
* Since 2003
* Open Access: free to read, free to re-use
* Creative Commons license (CC-BY, CC0)
* By contrast, many scientific articles from other publishers are behind a paywall

# Tutorial plan
* Goal: to enable research questions about science articles
* Will use JupyterHub and a sample corpus of 10,000 randomly selected PLOS articles
* Won't be discussing research techniques (data analysis, natural language processing)
* Assumes basic Python knowledge (lists, dictionaries, loops, conditionals, datetime)

# Tutorial structure
1. How to parse XML using allofplos and [lxml](http://lxml.de/tutorial.html)
2. Basic structure of XML documents and [JATS standard](https://jats.nlm.nih.gov/)
3. Example projects with the PLOS test corpus
4. Hacking session: parse articles or contribute to the allofplos codebase  

Exercises based on tutorial: https://github.com/eseiver/xml_tutorial

# allofplos
* Python package for both downloading and parsing PLOS XML articles
* Turns PLOS XML articles into Python data structures
* Doesn't require knowledge of XML to use
* Focuses on article metadata (e.g., title, authors, date of publication)
* Work in progress, so aspects of it may change

# allofplos basics
* Initialize an article object with a DOI or an XML filename
    * DOI (Digital Object Identifier) is a unique identifier for an online document/article
    * All PLOS DOIs start with `"10.1371/journal."`, like `"10.1371/journal.pone.0185809"`
* allofplos XML files are named with the last part of the DOI (everything after `10.1371/`), e.g. `"journal.pone.0185809.xml"`
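The naming convention above can be sketched as a pair of small conversion functions. This is only an illustration of the convention: `doi_to_xml_name` and `xml_name_to_doi` are hypothetical helper names written for this example, not allofplos functions, and the default directory name is an assumption taken from the filenames shown in this tutorial.

```python
import os

PLOS_DOI_PREFIX = "10.1371/"  # registrant prefix shared by all PLOS DOIs


def doi_to_xml_name(doi, directory="allofplos_xml"):
    """Hypothetical helper: map a PLOS journal DOI to a local XML path."""
    if not doi.startswith(PLOS_DOI_PREFIX + "journal."):
        raise ValueError(f"Not a PLOS journal DOI: {doi!r}")
    # Drop the registrant prefix and add the .xml extension
    return os.path.join(directory, doi[len(PLOS_DOI_PREFIX):] + ".xml")


def xml_name_to_doi(filename):
    """Hypothetical helper: recover the DOI from an XML filename or path."""
    stem, extension = os.path.splitext(os.path.basename(filename))
    if extension != ".xml":
        raise ValueError(f"Not an XML filename: {filename!r}")
    return PLOS_DOI_PREFIX + stem


xml_path = doi_to_xml_name("10.1371/journal.pone.0185809")
print(xml_path)
print(xml_name_to_doi(xml_path))
```

In practice the `Article` class handles this mapping for you; the sketch is only meant to make the relationship between a DOI and its filename concrete.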

In [1]:
from allofplos import Article  # if you have run `pip install allofplos`
# from article_class import Article  # if inside cloned GitHub directory

# instantiate the Article class from a DOI
article = Article('10.1371/journal.pone.0178690')
article.title

'Physician assessments of drug seeking behavior: A mixed methods study'

In [2]:
# instantiate the Article class from an XML filename
article = Article.from_filename('allofplos_xml/journal.pone.0181748.xml')
article.title

'THPdb: Database of FDA-approved peptide and protein therapeutics'

In [3]:
# switch the existing object to a new article by reassigning its DOI
article.doi = '10.1371/journal.pone.0183591'
article.title

'A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review'

# Notable properties
Try printing or returning some of these values.

## Basic metadata

In [4]:
article.doi

'10.1371/journal.pone.0183591'

In [5]:
article.journal

'PLOS ONE'

In [6]:
article.pubdate

datetime.datetime(2017, 9, 13, 0, 0)

In [7]:
article.title

'A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review'

In [8]:
article.counts

{'fig-count': '3', 'page-count': '14', 'table-count': '2'}
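Note that the values in the `article.counts` output above are strings, not numbers. A minimal sketch of converting them before doing arithmetic, using a dictionary literal copied from that output rather than a live `Article` object:

```python
# Copied from the `article.counts` output shown above
counts = {'fig-count': '3', 'page-count': '14', 'table-count': '2'}

# Convert each count to an integer so it can be summed or compared
int_counts = {name: int(value) for name, value in counts.items()}

# Example: total number of figures and tables in the article
figures_and_tables = int_counts['fig-count'] + int_counts['table-count']

print(int_counts)
print(figures_and_tables)
```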

In [9]:
article.word_count

4954

In [10]:
print(article.abstract[:310])

Irreproducibility of preclinical biomedical research has gained recent attention. It is suggested that requiring authors to complete a checklist at the time of manuscript submission would improve the quality and transparency of scientific reporting, and ultimately enhance reproducibility. Whether a checklist 


## People

In [11]:
contributor = article.contributors[0]
contributor.keys()

dict_keys(['contrib_initials', 'given_names', 'surname', 'group_name', 'ids', 'rid_dict', 'contrib_type', 'author_type', 'editor_type', 'email', 'affiliations', 'author_roles', 'footnotes'])

In [12]:
article.authors[0]

{'affiliations': ['Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America'],
 'author_roles': {'CASRAI CREDiT taxonomy': ['Conceptualization',
   'Data curation',
   'Formal analysis',
   'Funding acquisition',
   'Investigation',
   'Writing – original draft',
   'Writing – review & editing']},
 'author_type': 'corresponding',
 'contrib_initials': 'SH',
 'contrib_type': 'author',
 'editor_type': None,
 'email': ['shan.workmd@gmail.com'],
 'footnotes': ['Current address: Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University, Chicago, Illinois, United States of America'],
 'given_names': 'SeungHye',
 'group_name': None,
 'ids': [{'authenticated': 'true',
   'id': 'http://orcid.org/0000-0001-5625-6337',
   'id_type': 'orcid'}],
 'rid_dict': {'aff': ['aff001'],
  'corresp': ['cor001'],
  'fn': ['currentaff001']},
 'surname': 'Han'}
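Because each contributor is a plain dictionary, the usual dictionary tools apply. The sketch below works on a trimmed copy of the output above; `display_name` and `orcid` are hypothetical helpers written for this example, and the fallback to `group_name` assumes that group authors leave the personal name fields empty.

```python
# Trimmed copy of the `article.authors[0]` output shown above
author = {
    'given_names': 'SeungHye',
    'surname': 'Han',
    'group_name': None,
    'author_type': 'corresponding',
    'ids': [{'authenticated': 'true',
             'id': 'http://orcid.org/0000-0001-5625-6337',
             'id_type': 'orcid'}],
}


def display_name(contributor):
    """Hypothetical helper: build a readable name from a contributor dict."""
    if contributor.get('group_name'):
        return contributor['group_name']
    parts = [contributor.get('given_names'), contributor.get('surname')]
    return ' '.join(part for part in parts if part)


def orcid(contributor):
    """Hypothetical helper: return the bare ORCID iD, or None if absent."""
    for identifier in contributor.get('ids') or []:
        if identifier.get('id_type') == 'orcid':
            # Keep only the part after the last slash of the ORCID URL
            return identifier['id'].rsplit('/', 1)[-1]
    return None


print(display_name(author))
print(orcid(author))
```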

In [13]:
article.corr_author

[{'affiliations': ['Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America'],
  'author_roles': {'CASRAI CREDiT taxonomy': ['Conceptualization',
    'Data curation',
    'Formal analysis',
    'Funding acquisition',
    'Investigation',
    'Writing – original draft',
    'Writing – review & editing']},
  'author_type': 'corresponding',
  'contrib_initials': 'SH',
  'contrib_type': 'author',
  'editor_type': None,
  'email': ['shan.workmd@gmail.com'],
  'footnotes': ['Current address: Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University, Chicago, Illinois, United States of America'],
  'given_names': 'SeungHye',
  'group_name': None,
  'ids': [{'authenticated': 'true',
    'id': 'http://orcid.org/0000-0001-5625-6337',
    'id_type': 'orcid'}],
  'rid_dict': {'aff': ['aff001'],
   'corresp': ['cor001'],
   'fn': ['currentaff001']},
  'surname': 'H

In [14]:
article.editor[0]

{'affiliations': ['Fraunhofer Research Institution of Marine Biotechnology, GERMANY'],
 'author_roles': {None: ['Editor']},
 'author_type': None,
 'contrib_initials': 'JB',
 'contrib_type': 'editor',
 'editor_type': None,
 'email': None,
 'footnotes': [],
 'given_names': 'Johannes',
 'group_name': None,
 'ids': [],
 'rid_dict': {'aff': ['edit1']},
 'surname': 'Boltze'}

## Article type

In [15]:
article.type_  # article type as defined by the JATS standard

'research-article'

In [16]:
article.plostype

'Research Article'

In [17]:
article.proof  # whether an uncorrected proof/early version or not

## Local article file (more on this later)

In [18]:
article.filename

'/Users/Elizabeth/PLOS_Corpus_Project/allofplos/allofplos/allofplos_xml/journal.pone.0183591.xml'

In [19]:
article.local

True

In [20]:
article.tree

<lxml.etree._ElementTree at 0x10efa9dc8>

In [21]:
article.root

<Element article at 0x1118cf348>
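The `tree` and `root` objects are what you work with when you parse the XML directly. The sketch below shows the general idea on a hand-written, heavily simplified JATS-style fragment (not the actual article file). To stay self-contained it uses the standard library's `xml.etree.ElementTree` instead of lxml; lxml's `etree` module offers largely the same calls (`fromstring`, `find`, `findtext`, `get`).

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified JATS-style fragment for illustration only
xml_text = """
<article article-type="research-article">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0183591</article-id>
      <title-group>
        <article-title>A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review</article-title>
      </title-group>
    </article-meta>
  </front>
</article>
"""

root = ET.fromstring(xml_text)

# The root element's tag and attributes
article_type = root.get('article-type')

# Find a descendant element by tag name and attribute value
doi = root.findtext(".//article-id[@pub-id-type='doi']")

# Find a descendant element by tag name alone
title = root.findtext('.//article-title')

print(root.tag, article_type)
print(doi)
print(title)
```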

In [None]:
article.xml

# Other notable methods

In [22]:
article.get_dates()

{'accepted': datetime.datetime(2017, 8, 7, 0, 0),
 'collection': datetime.datetime(2017, 1, 1, 0, 0),
 'epub': datetime.datetime(2017, 9, 13, 0, 0),
 'received': datetime.datetime(2017, 3, 19, 0, 0)}
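Since the values are `datetime` objects, subtracting them gives the time between publication milestones. A minimal sketch using a dictionary literal copied from the output above (the variable names are chosen for this example):

```python
import datetime

# Copied from the `article.get_dates()` output shown above
dates = {'accepted': datetime.datetime(2017, 8, 7, 0, 0),
         'collection': datetime.datetime(2017, 1, 1, 0, 0),
         'epub': datetime.datetime(2017, 9, 13, 0, 0),
         'received': datetime.datetime(2017, 3, 19, 0, 0)}

# Subtracting two datetimes yields a timedelta
received_to_accepted = dates['accepted'] - dates['received']
accepted_to_published = dates['epub'] - dates['accepted']

print(received_to_accepted.days)   # 141
print(accepted_to_published.days)  # 37
```

Repeating this over a whole corpus is one way to look at how time to acceptance varies by journal or year.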

In [23]:
article.check_if_doi_resolves()

'works'

In [24]:
article

DOI: 10.1371/journal.pone.0183591
Title: A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review

In [None]:
print(article)