CLTK Greek sentence tokenizer

About

This repository contains a training set and rule set for tokenizing sentences for Classical Greek, for use with the Classical Language Toolkit. Unless you want to create a new training set for Greek sentences, there is nothing you need from this repository.

To tokenize Greek sentences with the CLTK, first install and import it as described in the CLTK docs, then follow the docs' instructions on tokenizing Greek sentences.

training_sentences.txt comprises the entirety of Xenophon's Anabasis and is 57,173 words long.
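If you edit the training file, a quick size check can be useful before retraining. The following is a minimal sketch, assuming the file holds one sentence per line (as the Use section describes) and that words are separated by whitespace. `training_set_stats` is a hypothetical helper, not part of this repository, and its word count may differ slightly from the figure above depending on how words were originally counted.

```python
from pathlib import Path


def training_set_stats(path="training_sentences.txt"):
    """Return (sentence_count, word_count) for a one-sentence-per-line file.

    Hypothetical helper for illustration; not part of the CLTK.
    """
    sentences = 0
    words = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()  # whitespace-delimited words
            if tokens:  # ignore blank lines
                sentences += 1
                words += len(tokens)
    return sentences, words
```

Calling `training_set_stats()` from the repository root reads training_sentences.txt and returns the two counts as a tuple.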

Use

To create a new training set, manually add tokenized sentences (with each sentence starting a new line) to training_sentences.txt and run train_sentence_tokenizer.py. The script outputs greek.pickle. To use this new file, copy it to your local CLTK data directory at ~/cltk_data/compiled/sentence_tokens_greek/.

$ python train_sentence_tokenizer.py 
  Abbreviation: [0.3233] ἐᾶν
  Abbreviation: [0.3233] ἔζη
  Abbreviation: [0.8787] ὄν
  Sent Starter: [97.8234] 'ἐπειδὴ'
  Sent Starter: [113.3762] 'οἱ'
  Sent Starter: [65.2843] 'εἰ'
  Sent Starter: [32.1611] 'τοιγαροῦν'
  Sent Starter: [36.0471] 'ἀλλὰ'
  Sent Starter: [186.0545] 'μετὰ'
  Sent Starter: [45.6612] 'ταύτην'
  Sent Starter: [335.0765] 'ἐνταῦθα'
  Sent Starter: [220.8901] 'καὶ'
  Sent Starter: [360.4958] ''
  Sent Starter: [646.4387] 'ἐπεὶ'
  Sent Starter: [58.9281] 'ἀκούσας'
  Sent Starter: [53.6916] 'οὐκοῦν'
  Sent Starter: [58.7917] 'ταῦτα'
  Sent Starter: [124.8905] 'ἐκ'
  Sent Starter: [102.6241] 'ἔνθα'
  Sent Starter: [32.1611] 'καίτοι'
  Sent Starter: [47.4084] 'ἀκούσαντες'
  Sent Starter: [429.5321] 'ἐντεῦθεν'
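
Once training finishes, the copy step described above can be scripted. This is a minimal sketch, assuming greek.pickle was written to the current directory and that your CLTK data lives in the default location named above. `install_tokenizer` is a hypothetical helper, not part of this repository or the CLTK.

```python
import shutil
from pathlib import Path

# Destination named in this README; adjust if your CLTK data lives elsewhere.
DEFAULT_DEST = Path.home() / "cltk_data" / "compiled" / "sentence_tokens_greek"


def install_tokenizer(pickle_path="greek.pickle", dest_dir=DEFAULT_DEST):
    """Copy a trained tokenizer pickle into the CLTK data directory.

    Hypothetical helper for illustration. Returns the path of the copied file.
    """
    src = Path(pickle_path)
    if not src.is_file():
        raise FileNotFoundError(f"No trained tokenizer found at {src}")
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    # Note: an existing file with the same name at the destination is overwritten.
    return Path(shutil.copy2(src, dest_dir / src.name))
```

Calling `install_tokenizer()` with no arguments copies ./greek.pickle into the default directory. Because an existing file there is replaced, you may want to back it up first.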

If you think your training set and tokenizer improve on the CLTK's current ones, please submit a pull request.

LICENSE

This software is, like the rest of the CLTK, licensed under the MIT license (see LICENSE). The text of the training sentences comes from Perseus and is licensed under the Creative Commons Attribution-ShareAlike 3.0 United States License.
