How to generate a NIF dataset

RicardoUsbeck edited this page Nov 17, 2014 · 18 revisions

GERBIL uses datasets that are annotated with NIF, the Natural Language Processing Interchange Format. NIF is an ontology that describes strings. There are some interesting resources where you can find complete wikis and documentation:

Texts and corpora can be converted into NIF following this guide:

Let's start with a two-sentence example document:

The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".

It recommended that Fulton legislators act "to have these laws studied and revised to the end of modernizing and improving them".

NIF is a way to address arbitrary strings with URIs. Strings to be described are typically identified via offsets that start with 0 (before the first character) and count the gaps between characters. Every document is expressed as a nif:Context resource. So we can address our document like this:

    <http://example.org/document#char=0,260>
        a nif:String , nif:Context , nif:RFC5147String ;
        nif:isString """The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous". It recommended that Fulton legislators act "to have these laws studied and revised to the end of modernizing and improving them"."""^^xsd:string ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "260"^^xsd:nonNegativeInteger .

The document's text is contained in the mandatory nif:isString property. The nif:beginIndex and nif:endIndex properties are also required; they give the string's offsets.
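The offsets and the `#char=begin,end` fragment can be derived mechanically from the text. The following is a minimal Python sketch, not part of GERBIL or of any NIF library: the helper names `find_offsets` and `char_uri` are hypothetical, and for brevity it works on the first example sentence only.

```python
# Hypothetical helpers for illustration; not part of GERBIL or a NIF library.

# First sentence of the example document.
SENTENCE = (
    "The jury said it did find that many of Georgia's registration and "
    'election laws "are outmoded or inadequate and often ambiguous".'
)


def find_offsets(text, substring, start=0):
    """Return (begin, end) of the first occurrence of substring in text.

    Offsets count the gaps between characters, starting with 0 before the
    first character, so end is exclusive and end - begin == len(substring).
    """
    begin = text.find(substring, start)
    if begin < 0:
        raise ValueError(f"{substring!r} does not occur in the text")
    return begin, begin + len(substring)


def char_uri(document_uri, begin, end):
    """Build a URI with a char=begin,end fragment, as used in this guide."""
    return f"{document_uri}#char={begin},{end}"


begin, end = find_offsets(SENTENCE, "Georgia")
print(char_uri("http://example.org/document", begin, end))
print(char_uri("http://example.org/document", 0, len(SENTENCE)))
```

Run as a script, this prints `http://example.org/document#char=39,46` for the word Georgia and `http://example.org/document#char=0,129` for the whole sentence. Note that `find_offsets` returns only the first occurrence; a word that appears several times needs the `start` argument to select the intended one.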

Substrings, like the first sentence, can now be annotated by their string offsets in reference to the nif:Context string.

    <http://example.org/document#char=0,129>
        a nif:String , nif:Sentence , nif:RFC5147String ;
        nif:anchorOf """The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous"."""^^xsd:string ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "129"^^xsd:nonNegativeInteger ;
        nif:referenceContext <http://example.org/document#char=0,260> .

The sentence's string is contained in the mandatory nif:anchorOf property, and the nif:referenceContext property links the sentence to the document resource. This resource shows the smallest possible NIF annotation of a string, as it contains only mandatory properties. Smaller strings, like the word Georgia in the sentence, can be annotated the same way. Further properties can be added for annotation:

    <http://example.org/document#char=39,46>
        a nif:String , nif:Word , nif:RFC5147String ;
        nif:anchorOf """Georgia"""^^xsd:string ;
        nif:beginIndex "39"^^xsd:nonNegativeInteger ;
        nif:endIndex "46"^^xsd:nonNegativeInteger ;
        nif:referenceContext <http://example.org/document#char=0,260> ;
        nif:sentence <http://example.org/document#char=0,129> ;
        itsrdf:taIdentRef <http://dbpedia.org/resource/Georgia_(U.S._state)> ;
        itsrdf:taClassRef <http://nerd.eurecom.fr/ontology#Location> .

The nif:sentence property links the word to its sentence. itsrdf:taIdentRef links the word to an external entity resource, such as a DBpedia resource, and itsrdf:taClassRef links it to a named entity type.
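When a corpus contains many entity mentions, it helps to generate such resources from one span per mention, so that the URI fragment and the two index properties cannot drift apart. The sketch below is one possible way to do this with plain string formatting; `word_to_turtle`, `escape_literal` and `char_uri` are hypothetical helpers, not part of GERBIL. It also emits the prefix declarations that the listings in this guide leave out.

```python
# Hypothetical helpers for illustration; not part of GERBIL or a NIF library.

# Prefix declarations needed to parse the generated Turtle.
PREFIXES = """\
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def char_uri(document_uri, begin, end):
    """Build a URI with a char=begin,end fragment, as used in this guide."""
    return f"{document_uri}#char={begin},{end}"


def escape_literal(value):
    """Escape backslashes and double quotes for a Turtle long string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def word_to_turtle(document_uri, context_span, sentence_span, word_span,
                   anchor, ident_ref=None, class_ref=None):
    """Serialize one nif:Word resource; all spans are (begin, end) tuples."""
    begin, end = word_span
    properties = [
        "a nif:String , nif:Word , nif:RFC5147String",
        f'nif:anchorOf """{escape_literal(anchor)}"""^^xsd:string',
        f'nif:beginIndex "{begin}"^^xsd:nonNegativeInteger',
        f'nif:endIndex "{end}"^^xsd:nonNegativeInteger',
        f"nif:referenceContext <{char_uri(document_uri, *context_span)}>",
        f"nif:sentence <{char_uri(document_uri, *sentence_span)}>",
    ]
    # The entity link and the entity type are optional.
    if ident_ref is not None:
        properties.append(f"itsrdf:taIdentRef <{ident_ref}>")
    if class_ref is not None:
        properties.append(f"itsrdf:taClassRef <{class_ref}>")
    subject = f"<{char_uri(document_uri, begin, end)}>"
    body = " ;\n".join(f"    {prop}" for prop in properties)
    return f"{subject}\n{body} ."


turtle = word_to_turtle(
    "http://example.org/document",
    context_span=(0, 260),    # context span as given in this guide
    sentence_span=(0, 129),
    word_span=(39, 46),
    anchor="Georgia",
    # The /resource/ form of a DBpedia IRI identifies the entity itself.
    ident_ref="http://dbpedia.org/resource/Georgia_(U.S._state)",
    class_ref="http://nerd.eurecom.fr/ontology#Location",
)
print(PREFIXES)
print(turtle)
```

Plain string formatting keeps the sketch free of dependencies, but it does not validate the IRIs it is given. For a real dataset an RDF library is probably the safer choice.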

A complete overview of the NIF ontology can be found here. Our example corpus is the Brown corpus, which can be found at http://brown.nlp2rdf.org