cltk/latin_proper_names_cltk

About

The file proper_names.txt is a newline-delimited list of all the words in the PHI5 corpus that are likely proper names (persons, places, etc.). Because every entry is a noun, the list can serve as a default POS tagger for these unusual words.

build_proper_names_list.py shows how this file was made. proper_names.txt contains 40,683 unique, alphabetized words.
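
As a minimal sketch of the default-tagger idea, the list can be loaded into a set and used for simple lookups. The function names, the `"N"` tag, and the UTF-8 encoding are assumptions made for illustration; none of them is taken from build_proper_names_list.py.

```python
from pathlib import Path


def load_proper_names(path="proper_names.txt"):
    """Read the newline-delimited list into a set for fast membership tests.

    Assumes the file is UTF-8 encoded (an assumption, not stated in the README).
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def tag_tokens(tokens, proper_names, noun_tag="N"):
    """Tag tokens found in the list as nouns; leave all others untagged (None).

    The "N" label is a hypothetical tag; substitute whatever tagset you use.
    """
    return [(tok, noun_tag if tok in proper_names else None) for tok in tokens]
```

Tokens tagged `None` would then be passed on to a real tagger; the lookup only covers forms that appear in the list exactly as spelled.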

Important notes:

  • This list contains some words that are not proper nouns and is being hand-checked to remove them. The hand-check has currently reached EOF.
  • Some processing artifacts remain in the text, especially forms with a trailing underscore (_) character. These will be removed later via automatic processing.
  • Similarly, there are a number of doublets as a result of a lexeme + underscore + additional lexeme; e.g., 'Alexandro' vs 'Alexandro_erat'.
  • A certain number of forms with attached clitics (e.g., -que, -ve) are present in the corpus; the host lexemes of these clitics are often doublets of non-cliticized lexemes.
  • A number of apparent abbreviations have been left intact; e.g., 'Achil'.
  • There is a certain amount of orthographic doubling as the result of u/v or i/j spellings; e.g., 'Achivis' vs. 'Achiuis', or '-que' vs '-qve'. Similarly, in Greek words there are a number of doublets from variant y/u spellings; e.g., 'Amphitruone' vs. 'Amphitryone'.
  • Roman numeral notation has also been removed.
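
The artifacts described in these notes can be partly handled at load time. The following is a rough sketch under stated assumptions, not the project's planned cleanup: all function names are hypothetical, the underscore rule simply keeps the text before the first underscore, and a clitic is only detached when the remaining host form is itself in the vocabulary.

```python
from collections import defaultdict

# Only the clitic spellings named in the notes above.
CLITICS = ("que", "qve", "ve")


def strip_underscore_artifact(word):
    """Keep the text before the first underscore: 'Alexandro_erat' -> 'Alexandro'."""
    return word.split("_", 1)[0]


def spelling_key(word, fold_y=False):
    """Build a comparison key that ignores case and u/v, i/j variation.

    With fold_y=True, y is also folded to u (for Greek doublets); this is
    more aggressive and may merge forms that are genuinely distinct.
    """
    key = word.lower().replace("v", "u").replace("j", "i")
    if fold_y:
        key = key.replace("y", "u")
    return key


def clitic_host(word, vocabulary):
    """Return the host form if word = host + clitic and host is in vocabulary."""
    for clitic in CLITICS:
        host = word[: -len(clitic)]
        if word.endswith(clitic) and host in vocabulary:
            return host
    return None


def group_doublets(words, fold_y=False):
    """Group cleaned forms that share a spelling key; keep groups of 2 or more."""
    groups = defaultdict(set)
    for word in words:
        cleaned = strip_underscore_artifact(word)
        if cleaned:
            groups[spelling_key(cleaned, fold_y)].add(cleaned)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

For example, `group_doublets(["Achivis", "Achiuis"])` groups both forms under the key `"achiuis"`, while 'Amphitruone' and 'Amphitryone' are only grouped when `fold_y=True`. Abbreviations such as 'Achil' are left untouched, since they cannot be expanded reliably from the form alone.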

License

Copyright (c) 2014 Kyle P. Johnson, under the MIT License. See 'LICENSE' for details.
