SecMapper: Section Mapping of Scientific Documents (PDFs) using CRF, with the PDFs converted to XML by pdftoxml.
- Download pdf2xml from here.
- Extract it and make it executable:
  chmod +x pdftoxmlexecutable
- Optional: move it to /usr/local/bin/:
  sudo mv pdftoxmlexecutable /usr/local/bin/pdftoxml
Download CRF++ from here. Follow the instructions in the INSTALL file or on the website. You might have to run ldconfig at the end to update your libraries. Run crf_learn to check that the installation is complete.
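A quick way to confirm that both tools are reachable is to look them up on your PATH. This is a minimal sketch, not part of SecMapper: the function name is made up for illustration, pdftoxml is only found under that name if you did the optional move above, and crf_test is included on the assumption that the CRF++ install placed it next to crf_learn.

```python
import shutil

# Tool names from the setup steps above. crf_test ships with CRF++;
# checking for it here is an assumption, not something SecMapper documents.
REQUIRED_TOOLS = ("pdftoxml", "crf_learn", "crf_test")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```

If crf_learn is reported missing after a successful build, running ldconfig as mentioned above will not help; check where the install step put the binaries instead.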
SecMapper takes as input the XML file generated by pdftoxml and produces the mapped sections in another XML file. Besides the final XML, it also generates a file containing the raw section features annotated by the CRF. This exposes the inner workings and helps in improving the model.
Instructions:
- Run pdftoxml on the PDF:
  pdftoxml mypdf.pdf rawxml.xml
  or
  /path/to/pdftoxmlexecutable mypdf.pdf rawxml.xml
  Make sure that pdftoxml is executable.
- Use the SecMapper.py script:
  python SecMapper.py rawxml.xml
  or
  python SecMapper.py /path/to/rawxml.xml
- You can also give the path explicitly:
  python SecMapper.py rawxml.xml /path/to/the/rawxml
  This is useful for batch processing.
- That's it! The output is generated in the current directory (where SecMapper.py is present), with the name rawxml_secmap.xml.
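For batch processing, the two steps above can be chained from a small driver script. The following is only a sketch under assumptions: the helper names are hypothetical, it uses the one-argument form of SecMapper.py, it writes each raw XML next to its PDF, and it expects pdftoxml on your PATH and SecMapper.py in the working directory.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, pdftoxml="pdftoxml", secmapper="SecMapper.py"):
    """Return the two command lines for one PDF, mirroring the steps above."""
    pdf_path = Path(pdf_path)
    raw_xml = pdf_path.with_suffix(".xml")  # assumed name for the raw XML
    return [
        [pdftoxml, str(pdf_path), str(raw_xml)],
        ["python", secmapper, str(raw_xml)],
    ]


def process_directory(pdf_dir):
    """Run both steps for every PDF in pdf_dir, stopping at the first failure."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        for command in build_commands(pdf_path):
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(command, check=True)
```

Calling process_directory("papers") would then convert and map every PDF in a folder named papers (a made-up folder name).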
I have also included files that can serve as guidelines for improving the model: templates for crf_learn and sample training data. More information on how to use CRF++ can be found at its homepage.
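If you want to build your own training data, CRF++ expects one item per line with whitespace-separated columns, the label in the last column, and a blank line between sequences. The sketch below writes data in that layout and builds the usual crf_learn command line (template, training file, model file). The feature values and the HEADING/TEXT labels are invented for illustration; the real columns are the ones in the sample training data shipped with this repository.

```python
def to_crfpp(sequences):
    """Serialise sequences of (features, label) rows into CRF++ column format.

    Feature values and labels must not contain whitespace, because CRF++
    uses whitespace to separate columns.
    """
    lines = []
    for sequence in sequences:
        for features, label in sequence:
            lines.append("\t".join(list(features) + [label]))
        lines.append("")  # a blank line ends one sequence
    return "\n".join(lines) + "\n"


def crf_learn_command(template_file, train_file, model_file):
    """Build the crf_learn command line: template, training data, output model."""
    return ["crf_learn", template_file, train_file, model_file]


# Hypothetical example rows: (feature columns, label).
example = [[(["1", "large", "bold"], "HEADING"),
            (["0", "normal", "plain"], "TEXT")]]
training_text = to_crfpp(example)
```

Every row must have the same number of columns, otherwise crf_learn rejects the file.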
Contribute to this tool by:
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request
SecMapper is a promising tool for extracting sections and headings from scientific documents. The PDF is first segmented into regions, or chunks, based on each chunk's distance from neighboring text and a measure of how different it is from the surrounding text in terms of font size and bold text (this makes more sense when you look into the code).
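To give a rough idea of the kind of cues described here, this sketch computes a chunk's gap to its neighbours and how its font differs from the rest of the page. It is not the code SecMapper uses: the Chunk fields, the feature names, the use of the median font size as "surrounding text", and the assumption that y coordinates grow downwards are all made up for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Chunk:
    text: str
    top: float        # y coordinate of the top edge (assumed to grow downwards)
    bottom: float     # y coordinate of the bottom edge
    font_size: float
    bold: bool


def chunk_features(chunks, index):
    """Compute illustrative layout features of chunks[index] against its neighbours."""
    chunk = chunks[index]
    # Treat the median font size as the size of ordinary body text.
    body_size = median(c.font_size for c in chunks)
    gap_above = chunk.top - chunks[index - 1].bottom if index > 0 else 0.0
    gap_below = (chunks[index + 1].top - chunk.bottom
                 if index + 1 < len(chunks) else 0.0)
    return {
        "gap_above": gap_above,
        "gap_below": gap_below,
        "font_ratio": chunk.font_size / body_size,
        "bold": chunk.bold,
    }
```

A heading typically shows up as a chunk with a larger gap above it, a font ratio above 1, or bold text, which is the kind of signal a CRF can then learn from.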
It runs on two separate models: one for PDFs whose headings have numerical indices, like most papers in ACM, IEEE, etc., and the other for PDFs without indices, like CHI.
Its key strength is that each model can be tuned on its own, without affecting the other, which maximizes accuracy.
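This README does not describe how a document is routed to one model or the other. One hypothetical way to make that choice is to count how many candidate headings start with a numerical index. The regular expression, the 0.5 threshold, and the "indexed"/"unindexed" names below are assumptions for illustration only.

```python
import re

# Matches headings such as "1 Introduction", "2.3 Method" or "4. Results".
NUMBERED_HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def choose_model(candidate_headings, threshold=0.5):
    """Return "indexed" if enough candidate headings carry a numerical index."""
    if not candidate_headings:
        return "unindexed"
    numbered = sum(
        1 for heading in candidate_headings
        if NUMBERED_HEADING.match(heading.strip())
    )
    share = numbered / len(candidate_headings)
    return "indexed" if share >= threshold else "unindexed"
```

Unnumbered headings such as "Abstract" or "References" appear in numbered papers too, which is why the sketch uses a share of headings and not a strict all-or-nothing test.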
Unfortunately, we have only managed to manually annotate and train on around 20 PDFs, but we tested it on more than 100 PDFs. The overall F-score was about 77%.
An increase in training data is therefore expected to give far better results.
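For reference, an F-score combines precision and recall into one number. The sketch below assumes the balanced F1 variant (the harmonic mean of the two), which this README does not explicitly confirm. The function name is hypothetical, and the counts in the usage note are invented and are not the evaluation data behind the figure above.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return the balanced F-score (F1) from raw counts, or 0.0 if undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For example, 8 correctly found headings with 2 spurious ones and 4 missed ones give a precision of 0.8, a recall of about 0.67, and an F-score of 8/11, roughly 0.73.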
Authors:
GNU GENERAL PUBLIC LICENSE v3.
See the LICENSE file for more details.