Artic: Metadata extractor from scientific documents

Artic is a tool that automatically identifies the metadata of a scientific paper: title, authors, emails, affiliations, conference name, conference date, conference location, conference year, ISBN and publisher. This first version depends strongly on OmniPage Version 18: the expected input is the paper as an XML file generated by OmniPage, and the output is a JSON file with the corresponding metadata.

Artic version 2.0 will target open-source tools to extract rich text information from PDFs (e.g. font size, font format, bold, italic).

License: MIT

Dependencies:

##How to execute##

After all the dependencies are installed and configured, execute the following steps to get the JSON file with metadata content.

1. Download the Artic jar file
2. Go to the folder where the OmniPage XML files are located
3. Run the following command:

    java -jar artic-1.0.jar

A .json file with the same name as the paper's XML file is created in the same location. It looks like this:

    {
        "title": "Artic: Metadata Extractor tool",
        "authors": [
            {
                "name": "Alan Souza",
                "affiliation": "UFRGS",
                "email": "alansouzati@gmail.com"
            }
        ],
        "venues": [
            {
                "name": "DocEng",
                "publisher": "ACM",
                "date": "September 16-19",
                "year": "2014",
                "isbn": "978-1-4503-1994-2/13/04"
            }
        ]
    }
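The naming rule described above (a .json file beside the XML file, with the same base name) can be sketched in Java as follows. `OutputNames` and `jsonPathFor` are hypothetical names chosen for illustration; they are not taken from Artic's source, and the real implementation may derive the name differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Artic's published code.
final class OutputNames {
    private OutputNames() {}

    /** Returns a sibling path with the same base name and a .json extension. */
    static Path jsonPathFor(Path xmlFile) {
        String name = xmlFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot).
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return xmlFile.resolveSibling(base + ".json");
    }
}
```

For instance, `OutputNames.jsonPathFor(Paths.get("papers", "paper01.xml"))` yields the path `papers/paper01.json`.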

##Configuration##

Artic provides configuration parameters that allow the generation process to be adjusted. The most important setting you may want to change is the boundary configuration for authors and affiliations. The configuration is provided through a properties file, artic.properties, which is defined below:

    # default page parser engine is the OmniPage Professional Version 18
    page.parser.instance=br.ufrgs.artic.parser.omnipage.OmniPageParser
    
    # boundaries configuration - changing these values can dramatically affect the result of the system
    author.page.boundary.horizontal=21
    author.page.boundary.vertical=40
    
    affiliation.page.boundary.horizontal=17
    affiliation.page.boundary.vertical=40
    
    authorAffiliation.page.boundary.horizontal=17
    authorAffiliation.page.boundary.vertical=70

For example, author.page.boundary.horizontal is the horizontal boundary applied to authors. A given word is considered to belong to the same author if it is within a 21% distance of the author currently being analyzed. Increasing this number may join distant words together; decreasing it may split close words apart. We set it to 21% experimentally, based on our dataset.
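As a rough illustration of how such a percentage boundary could work, the sketch below assumes the distance is measured between horizontal positions and expressed as a percentage of the page width. That interpretation, along with the names `BoundaryCheck` and `withinHorizontalBoundary`, is an assumption made for this example; Artic's actual geometry code may differ.

```java
// Illustrative sketch only: assumes the boundary is a percentage of the page width.
final class BoundaryCheck {
    private BoundaryCheck() {}

    /**
     * Returns true when the horizontal distance between the word and the
     * current author, as a percentage of the page width, does not exceed
     * boundaryPercent (e.g. 21 for author.page.boundary.horizontal).
     */
    static boolean withinHorizontalBoundary(double authorX, double wordX,
                                            double pageWidth, double boundaryPercent) {
        if (pageWidth <= 0) {
            throw new IllegalArgumentException("pageWidth must be positive");
        }
        double distancePercent = Math.abs(wordX - authorX) / pageWidth * 100.0;
        return distancePercent <= boundaryPercent;
    }
}
```

Under this reading, on a page 1000 units wide, a word 200 units away from the author (20%) is joined at a boundary of 21, while a word 250 units away (25%) is not; raising the boundary to 30 would join both.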

The page.parser.instance property defines the engine that parses the input (e.g. XML) into the Artic Page instance. In this first version, OmniPageParser is the only parser provided.

To override the configuration, specify the location of your properties file in one of the following ways:

1. Add an artic.properties file where your jar is located
2. Specify the location of your properties file when running the jar

    java -jar artic-1.0.jar path/to/xml/folder/or/xml/file config/artic.properties

The first argument is the path to your XML folder or XML file. The last argument is the path to your custom properties file.
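The override behaviour described above can be sketched with `java.util.Properties`. This is a minimal sketch, not Artic's actual loader: the class name `ArticConfigSketch`, the method names, and the precedence (an explicitly passed file first, then an artic.properties next to the jar, then the built-in defaults) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical loader for illustration; the precedence order is an assumption.
final class ArticConfigSketch {
    private ArticConfigSketch() {}

    /** Default values copied from the artic.properties listing in this README. */
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("page.parser.instance", "br.ufrgs.artic.parser.omnipage.OmniPageParser");
        p.setProperty("author.page.boundary.horizontal", "21");
        p.setProperty("author.page.boundary.vertical", "40");
        p.setProperty("affiliation.page.boundary.horizontal", "17");
        p.setProperty("affiliation.page.boundary.vertical", "40");
        p.setProperty("authorAffiliation.page.boundary.horizontal", "17");
        p.setProperty("authorAffiliation.page.boundary.vertical", "70");
        return p;
    }

    /**
     * Loads overrides from explicitFile when given, otherwise from an
     * artic.properties file in jarDirectory. Keys missing from the file,
     * or a missing file, fall back to the defaults.
     */
    static Properties load(Path explicitFile, Path jarDirectory) {
        Properties merged = new Properties(defaults());
        Path candidate = (explicitFile != null)
                ? explicitFile
                : jarDirectory.resolve("artic.properties");
        if (Files.isRegularFile(candidate)) {
            try (InputStream in = Files.newInputStream(candidate)) {
                merged.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }
}
```

With this sketch, a custom file that contains only `author.page.boundary.horizontal=25` changes that one value, and every other key keeps its default.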
