Skip to content
Experimental Finnish language model for SpaCy
Python Shell
Branch: master
Clone or download
Cannot retrieve the latest commit at this time.
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
fi
python_packaging
tests
tools
.gitignore
Changelog
LICENSE
LICENSE.MIT
README.md
requirements.txt
spacy_fi.py

README.md

Experimental Finnish language model for SpaCy

Finnish language model for SpaCy. The model contains POS tagger, dependency parser, word vectors, token frequencies and a lemmatizer (libvoikko). See below for notes about NER.

Install the Finnish language model

First, install the libvoikko native library with Finnish morphology data files.

Next, install the model by running:

pip install https://github.com/aajanki/spacy-fi/releases/download/v0.2.0/fi_experimental_web_md-0.2.0-py3-none-any.whl

Usage

import spacy

nlp = spacy.load('fi_experimental_web_md')

doc = nlp('Hän ajoi punaisella autolla.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

Updating the model

Setup a development environment

Install the libvoikko native library with Finnish morphology data files.

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

tools/download_data.sh

Train the model

tools/train.sh

Tests

tools/tests.sh

Build a Python package

Package just the POS tagger and dependency parser (this is the model published on GitHub):

tools/package_model.sh models/taggerparser/model-best

Alternatively, to build a model with combined tagger, parser and NER capabilities, run the following:

tools/package_model.sh models/merged

Notes about the NER model:

  • The model is trained on a very specific domain (technology news) and its out-of-domain generalization is quite poor.
  • Distributing the NER model might not be possible because the training data license (CC BY-ND-NC) is incompatible with the lemmatizer license (GPL).

License

All the content in this repository is available under the GNU General Public License, version 3 or any later version.

Source code and other files under fi and tools directories are additionally available under the MIT license.

License for the trained models (Python packages)

The trained models in https://github.com/aajanki/spacy-fi/releases are distributed under GPL v3.

License for the training data

The data sets downloaded by the tools/download_data.sh script are licensed as follows:

  • UD_Finnish-TDT: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
  • Word vectors and frequencies at Finnish Internet Parsebank: CC BY-SA
  • finer-data: The Digitoday material is licensed under CC Attribution-NoDerivs-NonCommercial (CC BY-ND-NC 1.0) and the Wikipedia material is licensed under Attribution-ShareAlike (CC BY-SA 3.0)
You can’t perform that action at this time.