A tokenizer for French

kea

kea is a simple rule-based tokenizer for French. The tokenization process is decomposed into two steps:

  1. A rule-based tokenization step splits the text using punctuation marks as indicators of token boundaries.

  2. A large-coverage lexicon is used to merge over-tokenized units (e.g. fixed contractions such as aujourd'hui are treated as a single token).
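The two steps above can be sketched roughly as follows. This is an illustrative toy, not kea's actual implementation: the regex rule and the tiny lexicon are stand-ins for kea's rule set and large-coverage lexicon.

```python
import re

# Toy stand-in for kea's large-coverage lexicon of fixed contractions.
LEXICON = {"aujourd'hui", "c'est-à-dire"}

def naive_tokenize(text):
    """Step 1: rule-based split, treating punctuation as token boundaries."""
    return re.findall(r"\w+|[^\w\s]", text)

def merge_overtokenized(tokens, lexicon=LEXICON):
    """Step 2: merge adjacent tokens that together form a lexicon entry."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first (up to 3 tokens in this sketch).
        for span in (3, 2, 1):
            candidate = "".join(tokens[i:i + span])
            if span > 1 and candidate.lower() in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = naive_tokenize("Aujourd'hui il pleut.")
# Step 1 over-tokenizes: ['Aujourd', "'", 'hui', 'il', 'pleut', '.']
print(merge_overtokenized(tokens))
# ['Aujourd'hui', 'il', 'pleut', '.']
```

Step 1 deliberately over-splits at every punctuation mark; step 2 then repairs the cases the lexicon knows about, such as the apostrophe inside aujourd'hui.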

A typical usage of this module is:

```python
import kea

sentence = "Le Kea est le seul perroquet alpin au monde."
keatokenizer = kea.tokenizer()
tokens = keatokenizer.tokenize(sentence)
print(tokens)
# ['Le', 'Kea', 'est', 'le', 'seul', 'perroquet', 'alpin', 'au', 'monde', '.']
```