Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Sequence Labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a subclass of structured (output) learning, since we are predicting a sequence object rather than the discrete or real value predicted in classification problems.
Shallow Parsing is a kind of Sequence Labeling. The main difference from a Sequence Labeling task such as Part-of-Speech Tagging, where there is one output label (tag) per token, is that Shallow Parsing additionally performs chunking -- the segmentation of the input sequence into constituents. Chunking is required to identify the categories (or types) of multi-word expressions.
In other words, we want to be able to capture the fact that expressions like "New York", which consist of 2 tokens, constitute a single unit.
What this means in practice is that Shallow Parsing performs 2 tasks (jointly or not):
- Segmentation of input into constituents (spans)
- Classification (Categorization, Labeling) of these constituents into a predefined set of labels (types)
A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type (such as various token-level features & labels).
The set of columns used by CoNLL-style files can vary from corpus to corpus.
Since a line in the data can correspond to any token (word or not), it is referred to by the more general term `token`. Similarly, since the data can be composed of units larger or smaller than a sentence, a blank-line-separated unit is referred to as a `block`.
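For illustration, a hypothetical two-column block (token and tag; the tag notation is explained below) might look like:

```
John	B-PER
lives	O
in	O
New	B-LOC
York	I-LOC
.	O
```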
The notation scheme is used to label multi-word spans in token-per-line format, e.g. to encode that "New York" is a LOCATION concept that spans 2 tokens.
As such, a token-level `tag` consists of an `affix` that encodes segmentation information and a `label` that encodes type information. Consequently, the corpus tagset consists of all possible `affix` and `label` combinations. A segment encoded with `affix`es and assigned a `label` is referred to as a `chunk`.
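As a minimal sketch (not econll's actual API), splitting a tag into its affix and label for both notations could look like the following; the parameter names mirror the CLI's `--kind`/`--glue`/`--otag` options:

```python
def parse_tag(tag: str, kind: str = "prefix", glue: str = "-", otag: str = "O") -> tuple:
    """Split a token-level tag into an (affix, label) pair.

    The outside tag carries no label, so it maps to (otag, None).
    """
    if tag == otag:
        return otag, None
    if kind == "prefix":
        affix, label = tag.split(glue, 1)   # e.g. "B-LOC" -> ("B", "LOC")
    else:
        label, affix = tag.rsplit(glue, 1)  # e.g. "LOC-B" -> ("B", "LOC")
    return affix, label
```

The same function handles both notations by swapping which side of the separator is treated as the affix.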
Both prefix and suffix notations are common:
- prefix: `B-LOC`
- suffix: `LOC-B`
Meaning of Affixes:
- `I` for Inside of span
- `O` for Outside of span (no prefix or suffix, just `O`)
- `B` for Beginning of span
- No affix (useful when there are no multi-word concepts)
- `IO`: deficient without `B`
- `IOB`: see above
- `IOBE`: `E` for End of span (`L` in `BILOU` for Last)
- `IOBES`: `S` for Singleton (`U` in `BILOU` for Unit)
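To make the segmentation aspect concrete, here is a small sketch (independent of econll) that decodes a prefix-notation `IOB` tag sequence into `(label, start, end)` chunks:

```python
def iob_to_chunks(tags: list) -> list:
    """Decode prefix-notation IOB tags into (label, start, end) chunks.

    The end index is exclusive. An I- tag following O is tolerated and
    starts a new chunk (as in deficient IO-style input).
    """
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-"):
            if label is not None:           # close the currently open chunk
                chunks.append((label, start, i))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            start, label = i, tag[2:]
    if label is not None:                   # close a chunk that runs to the end
        chunks.append((label, start, len(tags)))
    return chunks
```

For example, `["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]` decodes to `[("PER", 0, 1), ("LOC", 3, 5)]`.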
There are several methods to evaluate the performance of shallow parsing models. They can be evaluated at `token`-level and at `chunk`-level.
For token-level evaluation, the unit of evaluation is the `tag` of a `token`, and what is evaluated is how accurately a model assigns tags to tokens. Consequently, `token` (or `tag`) accuracy measures the fraction of correctly predicted `tag`s.
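Token (tag) accuracy then reduces to a per-position comparison; a minimal sketch:

```python
def tag_accuracy(refs: list, hyps: list) -> float:
    """Fraction of positions where the predicted tag equals the reference tag."""
    assert len(refs) == len(hyps), "reference and hypothesis must align"
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```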
Since a `tag` consists of an `affix`-`label` pair, it is additionally possible to compute `affix` and `label` performances separately.
For chunk-level evaluation, the unit of evaluation is a `chunk`, and the evaluation is "joint" in the sense that it jointly evaluates segmentation and labeling. That is, a true unit is one that has both the correct `label` and the correct span.
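Chunk-level precision, recall, and F1 can be sketched as set comparison over `(label, start, end)` triples (again, an illustration rather than econll's implementation):

```python
def chunk_scores(refs, hyps):
    """Precision, recall, and F1 over exact (label, start, end) matches.

    refs and hyps are collections of (label, start, end) triples;
    a hypothesis chunk counts as correct only if both its label and
    its span match a reference chunk exactly.
    """
    ref_set, hyp_set = set(refs), set(hyps)
    tp = len(ref_set & hyp_set)
    p = tp / len(hyp_set) if hyp_set else 0.0
    r = tp / len(ref_set) if ref_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Note how a chunk with the right label but a wrong boundary (or vice versa) counts as an error, which is exactly the "joint" property described above.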
Similar to token-level evaluation, it is possible to evaluate segmentation independently of labeling. This is achieved by ignoring the `chunk` label, e.g. by converting all labels to a single one.
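One way to sketch this label masking (prefix notation assumed; the placeholder label `X` is arbitrary): rewrite every label to a single value before scoring, keeping the affixes intact.

```python
def mask_labels(tags: list, glue: str = "-", fill: str = "X") -> list:
    """Replace every chunk label with a single placeholder, keeping affixes."""
    return [tag if tag == "O" else tag.split(glue, 1)[0] + glue + fill
            for tag in tags]
```

Running chunk-level evaluation on the masked tags then scores segmentation alone.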
Token-level evaluation is readily available from a number of packages, and can be easily computed using scikit-learn's `classification_report`, for instance.
Chunk-level evaluation was originally provided by the `conlleval` perl script within the CoNLL Shared Tasks. However, one limitation of `conlleval` is that it does not support the `IOBES` or `BILOU` schemes.
The `conlleval` script has been ported to Python numerous times, and these ports provide various functionalities. One notable port is `seqeval`, which is also included in Hugging Face's `evaluate` package.
To install `econll` run:

```
pip install econll
```
It is possible to run `econll` from the command line, as well as to import its methods.
```
usage: PROG [-h] -d DATA [-r REFS]
            [--separator SEPARATOR] [--boundary BOUNDARY] [--docstart DOCSTART]
            [--kind {prefix,suffix}] [--glue GLUE] [--otag OTAG]
            [-f {conll,parse,mdown}] [-o OUTS]
            [{eval,conv}]

eCoNLL: Extended CoNLL Utilities

positional arguments:
  {eval,conv}           task to perform

options:
  -h, --help            show this help message and exit

I/O Arguments:
  -d DATA, --data DATA  path to data/hypothesis file
  -r REFS, --refs REFS  path to references file

Data Format Arguments:
  --separator SEPARATOR
                        field separator string
  --boundary BOUNDARY   block separator string
  --docstart DOCSTART   doc start string

Tag Format Arguments:
  --kind {prefix,suffix}
                        tag order
  --glue GLUE           tag separator
  --otag OTAG           outside tag

Data Conversion Arguments:
  -f {conll,parse,mdown}, --form {conll,parse,mdown}
                        output format (kind)
  -o OUTS, --outs OUTS  path to output file
```
```
python -m econll -d DATA
python -m econll eval -d DATA
python -m econll eval -d DATA -r REFS
python -m econll conv -d DATA -f FORMAT -o PATH
```
This project adheres to Semantic Versioning.