Training sets and tokenizer for the Classical Greek language, for use with CLTK

CLTK Greek sentence tokenizer

About

This repository contains a training set and rule set for tokenizing sentences for Classical Greek, for use with the Classical Language Toolkit. Unless you want to create a new training set for Greek sentences, there is nothing you need from this repository.

To tokenize Greek sentences with the CLTK, first import it and use it according to the CLTK docs, then see the docs' instructions on tokenizing Greek sentences.

training_sentences.txt comprises the entirety of Xenophon's Anabasis and is 57,173 words long.
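
If you edit the training file, it can help to check its size before retraining. The following is a minimal sketch, not part of this repository: the helper name `corpus_stats` is hypothetical, it assumes one sentence per line, and it assumes a "word" is any whitespace-delimited token, which may not match how the 57,173 figure was originally computed.

```python
from pathlib import Path


def corpus_stats(text):
    """Return (sentence_count, word_count) for one-sentence-per-line text."""
    # Non-empty lines count as sentences; blank lines are ignored.
    sentences = [line for line in text.splitlines() if line.strip()]
    # Words are whitespace-delimited tokens (an assumption, not the repo's definition).
    words = sum(len(line.split()) for line in sentences)
    return len(sentences), words


if __name__ == "__main__":
    path = Path("training_sentences.txt")
    if path.exists():
        n_sent, n_words = corpus_stats(path.read_text(encoding="utf-8"))
        print(f"{n_sent} sentences, {n_words} words")
```

Run from the repository root, the script prints the sentence and word counts for training_sentences.txt; if the file is absent it does nothing.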

Use

To create a new training set, manually add tokenized sentences (with each sentence starting a new line) to training_sentences.txt and run train_sentence_tokenizer.py. The script outputs greek.pickle. To use this new file, copy it to your local CLTK data directory at ~/cltk_data/compiled/sentence_tokens_greek/.

$ python train_sentence_tokenizer.py 
  Abbreviation: [0.3233] ἐᾶν
  Abbreviation: [0.3233] ἔζη
  Abbreviation: [0.8787] ὄν
  Sent Starter: [97.8234] 'ἐπειδὴ'
  Sent Starter: [113.3762] 'οἱ'
  Sent Starter: [65.2843] 'εἰ'
  Sent Starter: [32.1611] 'τοιγαροῦν'
  Sent Starter: [36.0471] 'ἀλλὰ'
  Sent Starter: [186.0545] 'μετὰ'
  Sent Starter: [45.6612] 'ταύτην'
  Sent Starter: [335.0765] 'ἐνταῦθα'
  Sent Starter: [220.8901] 'καὶ'
  Sent Starter: [360.4958] ''
  Sent Starter: [646.4387] 'ἐπεὶ'
  Sent Starter: [58.9281] 'ἀκούσας'
  Sent Starter: [53.6916] 'οὐκοῦν'
  Sent Starter: [58.7917] 'ταῦτα'
  Sent Starter: [124.8905] 'ἐκ'
  Sent Starter: [102.6241] 'ἔνθα'
  Sent Starter: [32.1611] 'καίτοι'
  Sent Starter: [47.4084] 'ἀκούσαντες'
  Sent Starter: [429.5321] 'ἐντεῦθεν'
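
The copy step described above can also be scripted. This is a hedged sketch rather than anything shipped with the repository: `install_tokenizer` and its `data_root` parameter are hypothetical names, and the only detail taken from this README is the target directory ~/cltk_data/compiled/sentence_tokens_greek/.

```python
import shutil
from pathlib import Path


def install_tokenizer(pickle_path, data_root=None):
    """Copy a trained pickle into the CLTK Greek sentence-tokenizer directory."""
    # Default root is ~/cltk_data; data_root is a hypothetical override for testing.
    root = Path(data_root) if data_root is not None else Path.home() / "cltk_data"
    target_dir = root / "compiled" / "sentence_tokens_greek"
    # Create the directory tree if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 copies into the directory and returns the new file's path.
    return Path(shutil.copy2(pickle_path, target_dir))


# Example (overwrites any existing file of the same name):
# install_tokenizer("greek.pickle")
```

Because the copy overwrites an existing greek.pickle in the target directory, you may want to back up the current file first.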

If you think your training set and tokenizer are an improvement over the CLTK's current ones, please submit a pull request.

LICENSE

This software is, like the rest of the CLTK, licensed under the MIT license (see LICENSE). The texts for the training sentences come from Perseus and are licensed under the Creative Commons Attribution-ShareAlike 3.0 United States License.