alansouzati/artic

Artic: Metadata extractor from scientific documents

Artic is a tool that automatically identifies the metadata of a scientific paper: title, authors, emails, affiliations, conference name, conference date, conference location, conference year, ISBN and publisher. This first version depends strongly on OmniPage Version 18. The expected input for Artic is therefore the paper as an XML file generated by OmniPage, and the expected output is a JSON file with the corresponding metadata content.

Artic version 2.0 will target open-source tools in order to extract rich text information from PDFs (e.g. font size, font format, bold, italic, among others).

License: MIT

Dependencies:

##How to execute##

After all the dependencies are installed and configured, execute the following steps to get the JSON file with metadata content.

  • Download the Artic jar file
  • Go to the folder where the OmniPage XML files are located
  • Run the following command:
    java -jar artic-1.0.jar
  • At the same location, a .json file will be created with the same name as the XML file of the given paper, which looks like this:
    {
        "title": "Artic: Metadata Extractor tool",
        "authors": [
            {
                "name": "Alan Souza",
                "affiliation": "UFRGS",
                "email": "alansouzati@gmail.com"
            }
        ],
        "venues": [
            {
                "name": "DocEng",
                "publisher": "ACM",
                "date": "September 16-19",
                "year": "2014",
                "isbn": "978-1-4503-1994-2/13/04"
            }
        ]
    }
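
The naming rule described above (the .json file is created next to the XML file and shares its name) can be sketched in a few lines of Java. This is only an illustration for scripts that need to locate the output afterwards; the class `ArticOutputPath` and the method `jsonPathFor` are hypothetical names and are not part of Artic.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArticOutputPath {

    // Hypothetical helper (not part of Artic): given the path of an OmniPage
    // XML file, return the path where the matching .json file is expected,
    // assuming "same folder, same base name" as described in the steps above.
    static Path jsonPathFor(Path xmlFile) {
        String name = xmlFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Strip the extension if there is one; otherwise keep the full name.
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return xmlFile.resolveSibling(base + ".json");
    }

    public static void main(String[] args) {
        // The file name below is an example, not a file shipped with Artic.
        Path xml = Paths.get("papers", "doceng2014.xml");
        System.out.println(jsonPathFor(xml));
    }
}
```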

##Configuration##

Artic provides configuration parameters that allow the generation process to be adjusted. The most important setting that you may eventually want to change is the boundary configuration for authors and affiliations. The configuration is provided through a properties file, artic.properties, which is defined below:

    # default page parser engine is the OmniPage Professional Version 18
    page.parser.instance=br.ufrgs.artic.parser.omnipage.OmniPageParser
    
    # boundaries configuration - changing these values can dramatically affect the result of the system
    author.page.boundary.horizontal=21
    author.page.boundary.vertical=40
    
    affiliation.page.boundary.horizontal=17
    affiliation.page.boundary.vertical=40
    
    authorAffiliation.page.boundary.horizontal=17
    authorAffiliation.page.boundary.vertical=70

For example, author.page.boundary.horizontal is the horizontal boundary applied to authors. With the default value, a word is considered to belong to the same author if it lies within a 21% distance of the author currently being analyzed. Increasing this number may join distant words together; decreasing it may split close words apart. We set it to 21% experimentally, based on our dataset.
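
To make the effect of the boundary more concrete, here is a minimal Java sketch of such a check. It is not Artic's actual implementation: it assumes that the distance is measured between horizontal positions and expressed as a percentage of the page width, which the description above does not specify. The class `BoundarySketch` and its method are hypothetical.

```java
public class BoundarySketch {

    // Assumption: the distance between a word and the current author is the
    // difference of their horizontal positions, as a percentage of the page
    // width. Artic's real measure may differ.
    static boolean withinHorizontalBoundary(double authorX, double wordX,
                                            double pageWidth, double boundaryPercent) {
        double distancePercent = Math.abs(wordX - authorX) / pageWidth * 100.0;
        return distancePercent <= boundaryPercent;
    }

    public static void main(String[] args) {
        double pageWidth = 1000.0; // illustrative units

        // A word 20% of the page width away from the author.
        // With author.page.boundary.horizontal=21 it is grouped with the author.
        System.out.println(withinHorizontalBoundary(100.0, 300.0, pageWidth, 21.0));

        // With a smaller boundary (17) the same word is split off.
        System.out.println(withinHorizontalBoundary(100.0, 300.0, pageWidth, 17.0));
    }
}
```

Under these assumptions, the same word is grouped with the author at a boundary of 21 and split off at 17, which is the join/split trade-off described above.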

The page.parser.instance property defines the engine that parses the input (e.g. XML) into the Artic Page instance. In this first version, OmniPageParser is the only parser provided.

To override the configuration, specify the location of your properties file in one of the following ways:

  • Add an artic.properties file in the folder where your jar is located
  • Specify the location of your properties file when running the jar
    java -jar artic-1.0.jar path/to/xml/folder/or/xml/file config/artic.properties

When running the jar, the first argument specifies the path to your XML folder or XML file. The last argument is the path to your custom properties file.
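
The override mechanism can be pictured with the standard java.util.Properties class. The sketch below assumes that a custom file only needs to list the keys it changes and that the remaining keys fall back to the defaults; whether Artic merges the files this way is an assumption, and the class `ArticPropertiesSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ArticPropertiesSketch {

    // Loads default settings, then layers custom settings on top of them.
    // Keys missing from the overrides fall back to the defaults.
    // Assumption: Artic's own loading logic may behave differently.
    static Properties withOverrides(String defaultsText, String overridesText)
            throws IOException {
        Properties defaults = new Properties();
        defaults.load(new StringReader(defaultsText));

        Properties merged = new Properties(defaults);
        merged.load(new StringReader(overridesText));
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Two of the default values listed in the configuration above.
        String defaults = "author.page.boundary.horizontal=21\n"
                + "author.page.boundary.vertical=40\n";
        // A custom file that changes only the horizontal boundary.
        String custom = "author.page.boundary.horizontal=25\n";

        Properties config = withOverrides(defaults, custom);
        System.out.println(config.getProperty("author.page.boundary.horizontal"));
        System.out.println(config.getProperty("author.page.boundary.vertical"));
    }
}
```

In a real setup the two strings would be replaced by file readers for the default and custom artic.properties files.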