# The ContentMine Toolkit

## Introduction

These notebooks aim to provide an introduction to some of the tools in the [ContentMine](http://contentmine.org/) toolkit.

<p style="text-align: center; background-color: lightyellow;">
    This <a href="https://jupyter.org/">Jupyter Notebook</a> provides an interactive environment where you can run Python scripts and shell commands and see the results in your browser.<br/>
    For shell commands, each line starts with a <tt>!</tt>, so <tt>!echo hello</tt> runs a shell command, but <tt>print('hello')</tt> runs as Python.<br/>
    For a more detailed introduction, try this <a href="https://programminghistorian.org/en/lessons/jupyter-notebooks">Introduction to Jupyter Notebooks</a>.
</p>
            
### The Big Picture

The vision that drives ContentMine is that text and data mining should be used to open up all research literature, allowing the factual content to be made available to all, and helping researchers to make better use of what has already been discovered.

### Overall Approach

The high-level workflow for meeting this vision works as follows.

Given a research question or area of interest:

- Collect all relevant publications:
    - This may be as PDF files or as XML or HTML.
- Normalise the publications into a better machine-readable form:
    - As well-structured HTML, referred to here as 'scholarly HTML' (but note that this does _not_ refer to [the proposed Scholarly HTML standard](https://w3c.github.io/scholarly-html/)). Any properly marked-up HTML should work fine, but ideally it should re-use the subset of [JATS](https://jats.nlm.nih.gov/) tags that make sense in an HTML context.
    - If HTML isn't an option, plain text can be used.
    - However, depending on the question at hand, other formats might be required.
- Extract facts:
    - Which papers mention which terms and concepts?
    - What research is cited in each paper?
    - Which chemical compounds appear in the diagrams of each paper?
- Share the results:
    - Make the normalised publications available (if licensing permits).
    - Make the extracted facts openly available (factual assertions and [non-consumptive datasets](https://www.hathitrust.org/htrc_ncup) can usually be made available without restriction).
    - Create visualisations and documentation to help understand the results.
- Make new knowledge:
    - Use the results as the foundation for your own research.

Here, however, we'll focus on a specific use case and use that to explore how the approach works.

## The openVirus Project

In response to the COVID-19 epidemic, the [openVirus project](https://github.com/petermr/openVirus#openvirus) aims to aggregate scholarly publications and extracted knowledge on viruses and epidemics.

Here, we'll look at the specific case of finding which open access electronic theses mention terms relating to viruses and epidemics. Information about UK e-theses has been gathered by the [EThOS service](https://ethos.bl.uk/), and a suitable source dataset has been made available [here](https://data.bl.uk/ethos/).

The specific workflow here is:

- Use the EThOS dataset to find open-access theses.
- Get the PDFs and assemble them into the file folder layout conventions of the ContentMine toolkit.
- Generate HTML or text versions of the PDFs.
- Search through the HTML for the relevant terms, and extract the terms along with snippets of text as context for each time the terms appear.

The openVirus project has already created some [dictionaries](https://github.com/petermr/openVirus/tree/master/dictionaries) that can be used to link the terms that appear in the text to the relevant WikiData entities. _TBA come back to dictionary creation later on_

But to get started, we need some data sources. We'll start with a single thesis, found by searching EThOS for relevant terms: [TraVerse : a method of natural respiratory virus transmission from symptomatic children to healthy young adults](https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301) (`id: uk.bl.ethos.755301`)

This command was used to create the initial folder structure:

    ami-makeproject -p ethos --rawfiletypes pdf

Resulting in a folder called `ethos` containing a JSON project file:

```
ethos/make_project.json
```

We then add the full text, as per the expected folder layout (called the [CProject](https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/cproject/README.md) naming conventions). Listing all files shows:

```
ethos/make_project.json
ethos/uk.bl.ethos.755301/
ethos/uk.bl.ethos.755301/fulltext.pdf
```
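When there are many theses to add, this copying step can be scripted. The following is a minimal sketch, assuming only the layout shown above (one folder per thesis ID, each holding a `fulltext.pdf`). The function name `add_thesis` and the download path in the usage note are hypothetical and are not part of the ContentMine toolkit.

```python
import shutil
from pathlib import Path


def add_thesis(project_dir, thesis_id, source_pdf):
    """Copy a PDF into <project_dir>/<thesis_id>/fulltext.pdf and return the new path."""
    tree_dir = Path(project_dir) / thesis_id
    # Create the per-thesis folder if it does not exist yet.
    tree_dir.mkdir(parents=True, exist_ok=True)
    target = tree_dir / "fulltext.pdf"
    shutil.copyfile(source_pdf, target)
    return target
```

For example, `add_thesis("ethos", "uk.bl.ethos.755301", "downloads/755301.pdf")` would place a previously downloaded file where the later steps expect to find it.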

If you run the following code, and <i>if your browser can display PDF files</i>, the PDF should be displayed in the page below:

In [None]:
from IPython.display import IFrame,display

display(IFrame(src="ethos/uk.bl.ethos.755301/fulltext.pdf", width="60%", height="300px"))

If that doesn't work, you can <a target="_blank" href="ethos/uk.bl.ethos.755301/fulltext.pdf">download it instead</a>.

So we have the source document, but to perform the dictionary analysis we need either a plain text version in a file called:

```
ethos/uk.bl.ethos.755301/fulltext.pdf.txt
```

Or an HTML version in a file called:

```
ethos/uk.bl.ethos.755301/scholarly.html
```

The ContentMine toolkit provides wrappers for some sophisticated tools for performing PDF-to-HTML conversion, but for our purposes we can start with a simple HTML version generated by [Apache Tika](https://tika.apache.org/).

The following sequence of shell commands can be used to generate a suitable HTML version:

In [None]:
!echo "Converting to HTML using Apache Tika..."
!java -jar /opt/tika.jar ethos/uk.bl.ethos.755301/fulltext.pdf > ethos/uk.bl.ethos.755301/scholarly.html
!echo "DONE!"
!ls -l ethos/uk.bl.ethos.755301/scholarly.html

Once you've run the commands above, the simple HTML they generate can be viewed here:

In [1]:
from IPython.display import IFrame,display

display(IFrame(src="ethos/uk.bl.ethos.755301/scholarly.html", width="60%", height="300px"))
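Once a project holds more than one thesis, the Tika conversion could be looped over every document folder rather than typed out per file. The sketch below assumes the same `/opt/tika.jar` location used in the shell cell and the `fulltext.pdf` / `scholarly.html` file names described earlier. The function `convert_project` is a hypothetical helper, not part of the toolkit, and it needs Python 3.8 or later.

```python
import subprocess
from pathlib import Path

TIKA_JAR = "/opt/tika.jar"  # assumed location, as in the shell cell; adjust if yours differs


def convert_project(project_dir, run=subprocess.run):
    """Write scholarly.html next to every fulltext.pdf that does not have one yet."""
    converted = []
    for pdf in sorted(Path(project_dir).glob("*/fulltext.pdf")):
        html = pdf.with_name("scholarly.html")
        if html.exists():
            continue  # already converted on an earlier run
        try:
            with open(html, "wb") as out:
                # Same effect as: java -jar tika.jar fulltext.pdf > scholarly.html
                run(["java", "-jar", TIKA_JAR, str(pdf)], stdout=out, check=True)
        except Exception:
            # Do not leave a partial file behind, or the next run would skip it.
            html.unlink(missing_ok=True)
            raise
        converted.append(html)
    return converted
```

Calling `convert_project("ethos")` returns the list of HTML files it created, so running it a second time should return an empty list.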

Now we have HTML, we can finally extract some facts! We use the `ami-search` tool from the ContentMine [AMI3](https://github.com/petermr/ami3) project, which takes dictionaries of keywords relating to specific concepts and records where they appear in a set of texts.

The arguments specify:

- `-p ethos` to specify which project to process,
- `--forcemake` to make sure the output is regenerated even if the output files already exist,
- followed by a list of dictionaries to use, in this case the `country`, `virus_topics` and `viral_systemic_diseases` dictionaries.
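If the list of dictionaries changes often, the command line can be assembled from a Python list instead of being edited by hand. This is only a sketch: it uses just the `-p`, `--forcemake` and `--dictionary=` options described here, the helper name `ami_search_command` is hypothetical, and `shlex.join` needs Python 3.8 or later.

```python
import shlex


def ami_search_command(project, dictionaries, forcemake=True):
    """Return the ami-search argument list for a project and a list of dictionaries."""
    args = ["ami-search", "-p", project]
    if forcemake:
        args.append("--forcemake")
    # One --dictionary=NAME option per dictionary, in the order given.
    args += [f"--dictionary={name}" for name in dictionaries]
    return args


command = ami_search_command(
    "ethos", ["country", "virus_topics", "viral_systemic_diseases"]
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` as an alternative to the `!` shell syntax.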


In [24]:
!ami-search -p ethos --forcemake --dictionary=country --dictionary=virus_topics --dictionary=viral_systemic_diseases


Generic values (AMISearchTool)
-v to see generic values
oldstyle            true

Specific values (AMISearchTool)
oldstyle             true
strip numbers        false
wordCountRange       (20,1000000)
wordLengthRange      (1,20)

dictionaryList       [country, virus_topics, viral_systemic_diseases]
dictionaryTop        null
dictionarySuffix     [xml]

0    [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool  - old style search command); change
0 [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool  - old style search command); change
cProject: ethos
legacy cmd> word(frequencies)xpath:@count>20~w.stopwords:pmcstop.txt_stopwords.txt
legacy cmd> search(country)
legacy cmd> search(virus_topics)
legacy cmd> search(viral_systemic_diseases)
uk.bl.ethos.755301 
large document (7031) for uk.bl.ethos.755301 truncated to 500 sections
...
large document (7031) for uk.bl.ethos.755301 truncated to 500 sections
...
large document (7031) for uk.bl.ethos.755301 truncated to 500 sectio

If we look in the `ethos` folder we can see all the new files created by the `ami-search` process:

In [43]:
!find ethos -type f

ethos/search.viral_systemic_diseases.count.xml
ethos/uk.bl.ethos.755301/search.viral_systemic_diseases.count.xml
ethos/uk.bl.ethos.755301/word.frequencies.count.xml
ethos/uk.bl.ethos.755301/tei/fulltext.tei.xml
ethos/uk.bl.ethos.755301/tei/fulltext.tei.html
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-5.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-12.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-2.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-3.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-10.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-16.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-6.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-9.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-14.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-11.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-7.png
ethos/uk.bl.ethos.755301/tei/fulltext_assets/image-1.png
ethos/uk.bl.ethos.755301/tei/fulltex

In [26]:
display(IFrame(src="ethos/full.dataTables.html", width="100%", height="300px"))

_TBA look at the results..._ 

In [27]:
display(IFrame(src="ethos/uk.bl.ethos.755301/results/search/virus_topics/results.xml", width="100%", height="300px"))
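The per-dictionary results file can also be read programmatically to pull out each matched term together with its surrounding snippet. The sketch below assumes that the file contains `result` elements carrying `pre`, `exact` and `post` attributes (the text before the match, the match itself, and the text after it). Check this against your own `results.xml` before relying on it. The function name `read_hits` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_hits(xml_source):
    """Yield (term, snippet) pairs from a results file path or file object."""
    root = ET.parse(xml_source).getroot()
    for result in root.iter("result"):
        # Assumed attribute names; missing attributes fall back to empty strings.
        pre = result.get("pre", "")
        exact = result.get("exact", "")
        post = result.get("post", "")
        yield exact, f"{pre}[{exact}]{post}"
```

For example, `for term, snippet in read_hits("ethos/uk.bl.ethos.755301/results/search/virus_topics/results.xml"): print(term, "|", snippet)` would list every hit with the matched term marked in square brackets.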

In [42]:
from IPython.display import SVG,display
#display(SVG(filename='ethos/__cooccurrence/viral_systemic_diseases-viral_systemic_diseases/cooccur.svg'))
#display(SVG(filename='ethos/__cooccurrence/allPlots.svg', height="100px"))
display(IFrame(src="ethos/__cooccurrence/country-virus_topics/cooccur.svg", width="50%", height="350px"))

We've only processed a single document, so the co-occurrence matrix for the terms that appear in the document is `1` everywhere! To get something more interesting, we'll need to analyse more documents.
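To see why a single document gives a flat matrix, here is a minimal sketch of document-level co-occurrence counting. It assumes each cell counts the documents in which both terms appear, which may not be exactly how `ami-search` builds its plot. The term sets and document names are made up for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence(docs, rows, cols):
    """Count, for each (row term, column term) pair, the documents containing both."""
    counts = Counter()
    for terms in docs.values():
        present_rows = [t for t in rows if t in terms]
        present_cols = [t for t in cols if t in terms]
        # Every pair of terms present in this document co-occurs once.
        counts.update(product(present_rows, present_cols))
    return counts


# Made-up term sets, purely for illustration.
one_doc = {"thesis-A": {"uk", "influenza", "transmission"}}
print(cooccurrence(one_doc, ["uk"], ["influenza", "transmission"]))
```

With one document every pair that occurs at all has a count of exactly 1. Counts only start to differ once several documents with different term sets are included.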

## Conclusion

Hopefully, this notebook has introduced the major concepts and illustrated the potential uses of the ContentMine approach and tools. 

_TBA next steps_

* Process many theses.
* [other notebooks](/) on GROBID, OCR, etc.