
STIFF - Sense Tagged Instances For Finnish


This repository contains code to automatically create a sense-tagged corpus from OpenSubtitles2018. It also contains a fair amount of corpus-wrangling code, most notably code to convert (the CC-NC licensed) EuroSense into a format usable by finn-wsd-eval.

Set up

You will need HFST and OMorFi installed globally before beginning, since neither is currently installable from PyPI. You will also need poetry. You can then run

$ ./install.sh
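If install.sh fails, a quick way to check that the global prerequisites are actually visible is something like the following (hfst-lookup is one of the HFST command-line tools; the omorfi Python module name is an assumption, so adjust to however OMorFi is installed on your system):

# Sanity checks for the global prerequisites (tool and module names are assumptions)
command -v hfst-lookup || echo "HFST command-line tools not found on PATH"
python3 -c "import omorfi" 2>/dev/null || echo "OMorFi Python module not importable"
command -v poetry || echo "poetry not found on PATH"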

(Only partially tested) Conversion pipelines and evaluation using the Makefile

There is a Makefile; reading its source is a recommended next step after this README. It has variables for most file paths, each with an overridable default. Overriding them is convenient when you supply intermediate steps or upstream corpora yourself, when you want outputs in a particular place, or when running with Docker, where you may want to use bind mounts to make the aforementioned files appear on the host.
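For example, assuming the Makefile exposes variables for the raw STIFF path and a work directory (the names STIFF_RAW_XML and WORK_DIR below are guesses; check the Makefile for the real variable names), an override on the command line looks like:

# Variable names are illustrative; read the Makefile for the actual ones
make wsd-eval STIFF_RAW_XML=/data/stiff.raw.xml.zst WORK_DIR=/data/stiff-work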

Make STIFF or EuroSense into data for finn-wsd-eval

You can make the data needed for finn-wsd-eval by running:

make wsd-eval

which will make the STIFF and EuroSense WSD evaluation corpora, including trying to fetch all dependencies. However,

  1. It will take a long time. The longest step is building STIFF from scratch, which can take around two weeks. To speed things up you can supply a premade stiff.raw.xml.zst downloaded from here (TODO); see the sketch after this list.
  2. It will not fetch one dependency that has restrictions on it: BABELWNMAP.
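As a sketch of supplying the premade file (item 1), assuming the Makefile looks for stiff.raw.xml.zst in the repository root (check the Makefile for the location it actually expects):

# Place the premade corpus where the build expects it (location is an assumption)
cp ~/Downloads/stiff.raw.xml.zst .
# Sanity-check that the archive decompresses and looks like XML
zstdcat stiff.raw.xml.zst | head -n 5
make wsd-eval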

Obtaining BABELWNMAP

You will next need to set the environment variable BABELWNMAP to the path of a TSV file mapping BabelNet synsets to WordNet synsets (a sketch of setting the variable follows the list below). You can either:

  1. Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please briefly state your position/affiliation and your non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions) if you prefer, I can just send you eurosense.unified.sample.xml and eurosense.unified.sample.key
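However you obtain the TSV, point the variable at it before running make. A minimal sketch (the file name is illustrative, and the exact line format is an assumption; babelnet-lookup's output is authoritative):

# Point BABELWNMAP at the BabelNet-to-WordNet TSV (file name is illustrative)
export BABELWNMAP=/path/to/bn2wn.tsv
# Each line should pair a BabelNet synset ID with a WordNet synset ID
# (exact column format is an assumption)
head -n 3 "$BABELWNMAP"
make wsd-eval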

Make STIFF and EuroSense P/R plot

Run:

make corpus-eval

(OLD) Example conversion pipelines and evaluation

Both of the following pipelines first create a corpus tagged in the unified format, which consists of an XML file and a key file, and then create a directory containing the files needed by finn-wsd-eval.

STIFF Pipeline

Fetch OpenSubtitles2018

poetry run python scripts/fetch_opensubtitles2018.py cmn-fin

Make raw STIFF

poetry run python scripts/pipeline.py mk-stiff cmn-fin stiff.raw.xml.zst

Make recommended STIFF variant + convert ➡️ Unified

poetry run python scripts/variants.py proc bilingual-precision-4 stiff.raw.xml.zst stiff.bp4.xml.zst
./stiff2unified.sh stiff.bp4.xml.zst stiff.unified.bp4.xml stiff.unified.bp4.key

EuroSense Pipeline

EuroSense ➡️ Unified

You will first need to obtain EuroSense. Since there are some language tagging issues with the original, I currently recommend you use a version I have attempted to fix.

You will next need to set the environment variable BABEL2WN_MAP to the path of a TSV mapping from BabelNet synsets to WordNet synsets. You can either:

  1. Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please briefly state your position/affiliation and your non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions) if you prefer, I can just send you eurosense.unified.sample.xml and eurosense.unified.sample.key

Then run:

poetry run python scripts/pipeline.py eurosense2unified \
  /path/to/eurosense.v1.0.high-precision.xml eurosense.unified.sample.xml \
  eurosense.unified.sample.key

Process finn-man-ann

First obtain finn-man-ann.

Then run:

poetry run python scripts/munge.py man-ann-select --source=europarl \
  ../finn-man-ann/ann.xml - \
  | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
  ../finn-man-ann/ann.xml man-ann-opensubs18.xml

Make STIFF or EuroSense into data for finn-wsd-eval

This makes a directory usable by finn-wsd-eval.

Old

poetry run python scripts/pipeline.py unified-to-eval \
  /path/to/stiff-or-eurosense.unified.xml /path/to/stiff-or-eurosense.unified.key \
  stiff-or-eurosense.eval/

New

TODO: STIFF

poetry run python scripts/filter.py tok-span-dom man-ann-europarl.xml \
  man-ann-europarl.filtered.xml
poetry run python scripts/pipeline.py stiff2unified --eurosense \
  man-ann-europarl.filtered.xml man-ann-europarl.uni.xml man-ann-europarl.uni.key
poetry run python scripts/pipeline.py stiff2unified man-ann-opensubs18.xml \
  man-ann-opensubs18.uni.xml man-ann-opensubs18.key.xml
poetry run python scripts/pipeline.py unified-auto-man-to-evals \
  eurosense.unified.sample.xml man-ann-europarl.uni.xml \
  eurosense.unified.sample.key man-ann-europarl.uni.key eurosense.eval

Make STIFF and EuroSense P/R plot

First process finn-man-ann.

Gather STIFF eval data

poetry run python scripts/variants.py eval /path/to/stiff.raw.zst stiff-eval-out
poetry run python scripts/eval.py pr-eval --score=tok \
  <(poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
    /path/to/finn-man-ann/ann.xml -) \
  stiff-eval-out stiff-eval.csv

Gather EuroSense eval data

poetry run python scripts/munge.py man-ann-select --source=europarl \
  /path/to/finn-man-ann/ann.xml - \
  | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
mkdir eurosense-pr
mv /path/to/eurosense/high-precision.xml eurosense-pr/EP.xml
mv /path/to/eurosense/high-coverage.xml eurosense-pr/EC.xml
poetry run python scripts/eval.py pr-eval --score=tok man-ann-europarl.xml eurosense-pr europarl.csv

Plot on common axis

Warning: the plot may be misleading...

poetry run python scripts/eval.py pr-plot stiff-eval.csv europarl.csv

Organisation & usage

For help using the tools, try running them with --help. The main entry points are in scripts/.
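For example, to see the available subcommands and their options:

poetry run python scripts/pipeline.py --help
poetry run python scripts/munge.py man-ann-select --help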

Innards

  • scripts/tag.py: Produce an unfiltered STIFF
  • scripts/filter.py: Filter STIFF according to various criteria
  • scripts/munge.py: Convert between different corpus/stream formats

Wrappers

  • scripts/stiff2unified.sh: Convert from STIFF format to the unified format
  • scripts/pipeline.py: Various pipelines composing multiple layers of filtering/conversion

Top level

The Makefile and Makefile.manann.