
eCoNLL: Extended CoNLL Utilities for Shallow Parsing


Shallow Parsing

Sequence Labeling and Classification

Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Sequence Labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a subclass of structured (output) learning, since we are predicting a sequence object rather than the single discrete or real value predicted in classification problems.

Sequence Labeling and Shallow Parsing

Shallow Parsing is a kind of Sequence Labeling. The main difference from a Sequence Labeling task such as Part-of-Speech Tagging, where there is one output label (tag) per token, is that Shallow Parsing additionally performs chunking -- the segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of multi-word expressions.

In other words, we want to be able to capture the fact that an expression like "New York", which consists of 2 tokens, constitutes a single unit.

What this means in practice is that Shallow Parsing performs, jointly or as separate steps, 2 tasks:

  • Segmentation of the input into constituents (spans)
  • Classification (Categorization, Labeling) of these constituents into a predefined set of labels (types)
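
As an illustration (a hypothetical sketch, not econll's API), the two sub-tasks can be represented on a toy sentence as spans plus one label per span:

```python
# Hypothetical illustration of the two sub-tasks on a toy sentence.
tokens = ["Flights", "to", "New", "York", "on", "Monday"]

# Segmentation: constituent spans as (start, end) token offsets (end exclusive).
spans = [(2, 4), (5, 6)]

# Classification: one label per span, from a predefined set of types.
labels = ["LOC", "DATE"]

chunks = [(label, tokens[start:end]) for (start, end), label in zip(spans, labels)]
print(chunks)  # [('LOC', ['New', 'York']), ('DATE', ['Monday'])]
```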

CoNLL Corpus Format

A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded as a table (or "grid") of values, where each line corresponds to a single word and each column corresponds to an annotation type (such as various token-level features and labels).

The set of columns used by CoNLL-style files can vary from corpus to corpus.

Since a line in the data can correspond to any token (a word or not), it is referred to by the more general term token. Similarly, since the data can be composed of units larger or smaller than a sentence, a blank-line-separated unit is referred to as a block.
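
The format can be read with a few lines of plain Python (a minimal sketch, not econll's reader, assuming whitespace-separated columns; here each line holds a token and a tag):

```python
# A minimal sketch of reading CoNLL-formatted text (not econll's reader):
# blocks are separated by blank lines, one token per line,
# whitespace-separated columns (here: token and tag).
text = """\
New B-LOC
York I-LOC
is O
big O

Boston B-LOC
too O
"""

blocks = [
    [line.split() for line in block.splitlines()]
    for block in text.strip().split("\n\n")
]

print(len(blocks))   # 2 blocks (sentences)
print(blocks[0][1])  # ['York', 'I-LOC']
```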

Encoding Segmentation Information

A notation scheme is used to label multi-word spans in the token-per-line format, e.g. to encode that "New York" is a LOCATION concept that spans 2 tokens.

As such, a token-level tag consists of an affix that encodes segmentation information and a label that encodes type information. Consequently, the corpus tagset consists of all possible affix-label combinations.

A segment encoded with affixes and assigned a label is referred to as a chunk.

  • Both prefix and suffix notations are common:

    • prefix: B-LOC
    • suffix: LOC-B
  • Meaning of Affixes

    • I for Inside of span
    • O for Outside of span (no prefix or suffix, just O)
    • B for Beginning of span
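
A tag can be split into its affix and label parts along these lines (a hypothetical helper, not econll's API; the kind, glue, and otag names mirror the command-line options shown later):

```python
# Hypothetical helper (not econll's API): split a tag into (affix, label).
def parse_tag(tag: str, kind: str = "prefix", glue: str = "-", otag: str = "O"):
    if tag == otag:
        return otag, None  # the outside tag carries no label
    left, _, right = tag.partition(glue)
    return (left, right) if kind == "prefix" else (right, left)

print(parse_tag("B-LOC"))                 # ('B', 'LOC')
print(parse_tag("LOC-B", kind="suffix"))  # ('B', 'LOC')
print(parse_tag("O"))                     # ('O', None)
```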

Alternative Schemes

  • No affix (useful when there are no multi-word concepts)
  • IO: deficient, since without B two adjacent chunks with the same label cannot be told apart
  • IOB: see above
  • IOBE: E for End of span (L in BILOU for Last)
  • IOBES: S for Singleton (U in BILOU for Unit)
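
To make the difference between the schemes concrete, here is a sketch (an illustration, not econll's converter) of rewriting IOB tags into IOBES:

```python
def iob_to_iobes(tags: list[str]) -> list[str]:
    # Illustrative converter (not econll's API), assuming well-formed IOB:
    # the last token of a chunk gets 'E', a single-token chunk gets 'S'.
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        affix, label = tag.split("-", 1)
        ends = i + 1 == len(tags) or tags[i + 1] != f"I-{label}"
        if ends:
            out.append(("S-" if affix == "B" else "E-") + label)
        else:
            out.append(tag)
    return out

print(iob_to_iobes(["B-LOC", "I-LOC", "O", "B-LOC", "O"]))
# ['B-LOC', 'E-LOC', 'O', 'S-LOC', 'O']
```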

Evaluation

There are several methods to evaluate the performance of shallow parsing models: they can be evaluated at the token level and at the chunk level.

Token-Level Evaluation

The unit of evaluation in this case is the tag of a token, and what is evaluated is how accurately a model assigns tags to tokens. Consequently, token (or tag) accuracy measures the fraction of correctly predicted tags.

Since a tag consists of an affix-label pair, it is additionally possible to separately compute affix and label performances.
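
For example (a plain-Python sketch of the idea, not econll's implementation):

```python
# Token-level evaluation sketch: compare reference and hypothesis tags.
refs = ["B-LOC", "I-LOC", "O", "O", "B-ORG"]
hyps = ["I-LOC", "I-LOC", "O", "O", "B-ORG"]

# Tag accuracy: exact tag matches.
tag_acc = sum(r == h for r, h in zip(refs, hyps)) / len(refs)

# Label accuracy: ignore the affix part of each tag ('O' has no label part).
label = lambda t: t.split("-", 1)[-1]
label_acc = sum(label(r) == label(h) for r, h in zip(refs, hyps)) / len(refs)

# The first token has a wrong affix but the right label.
print(tag_acc, label_acc)  # 0.8 1.0
```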

Chunk-Level Evaluation

The unit of evaluation in this case is a chunk, and the evaluation is "joint", in the sense that it jointly evaluates segmentation and labeling. That is, a chunk counts as correct only if both its label and its span are correct.

Similar to token-level evaluation, it is possible to evaluate segmentation independently of labeling. This is achieved by ignoring the chunk labels, e.g. by converting all of them to a single label.
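
A sketch of the idea (illustrative code, not econll's implementation), assuming well-formed IOB tags:

```python
def chunks(tags: list[str]) -> list[tuple[str, int, int]]:
    # Extract (label, start, end) chunks from IOB tags,
    # assuming well-formed input (I- only continues a same-label chunk).
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and not tag.startswith("I-"):
            out.append((tags[start].split("-", 1)[1], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return out

refs = ["B-LOC", "I-LOC", "O", "B-ORG"]
hyps = ["B-LOC", "O", "O", "B-ORG"]

true, pred = set(chunks(refs)), set(chunks(hyps))
correct = len(true & pred)  # a chunk is true only if span AND label match
precision, recall = correct / len(pred), correct / len(true)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.5 0.5 0.5
```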

Why eCoNLL?

Token-level evaluation is readily available from a number of packages, and can be easily computed using scikit-learn's classification_report, for instance.

Chunk-level evaluation was originally provided by the conlleval perl script within the CoNLL Shared Tasks. However, one limitation of conlleval is that it does not support the IOBES or BILOU schemes.

The conlleval script has been ported to Python numerous times, and these ports vary in functionality. One notable port is seqeval, which is also included in Hugging Face's evaluate package.

Installation

To install econll run:

pip install econll

Usage

It is possible to run econll from the command line, as well as to import its methods.

Command-Line Usage

usage: PROG [-h] -d DATA [-r REFS] 
            [--separator SEPARATOR] [--boundary BOUNDARY] [--docstart DOCSTART] 
            [--kind {prefix,suffix}] [--glue GLUE] [--otag OTAG]
            [-f {conll,parse,mdown}] [-o OUTS]
            [{eval,conv}]

eCoNLL: Extended CoNLL Utilities

positional arguments:
  {eval,conv}        task to perform

options:
  -h, --help            show this help message and exit

I/O Arguments:
  -d DATA, --data DATA  path to data/hypothesis file
  -r REFS, --refs REFS  path to references file

Data Format Arguments:
  --separator SEPARATOR
                        field separator string
  --boundary BOUNDARY   block separator string
  --docstart DOCSTART   doc start string

Tag Format Arguments:
  --kind {prefix,suffix}
                        tag order
  --glue GLUE           tag separator
  --otag OTAG           outside tag

Data Conversion Arguments:
  -f {conll,parse,mdown}, --form {conll,parse,mdown}
                        output format (kind)
  -o OUTS, --outs OUTS  path to output file

Evaluation

python -m econll -d DATA
python -m econll eval -d DATA
python -m econll eval -d DATA -r REFS

Conversion

python -m econll conv -d DATA -f FORMAT -o PATH

Versioning

This project adheres to Semantic Versioning.