
Middle English

Middle English is collectively the varieties of the English language spoken after the Norman Conquest (1066) until the late 15th century; scholarly opinion varies but the Oxford English Dictionary specifies the period of 1150 to 1500. (Source: Wikipedia)

Text Normalization

CLTK's normalizer attempts to clean the given text, converting it into a canonical form.

Lowercase Conversion

The to_lower parameter converts the string into lowercase.

In [1]: from cltk.corpus.middle_english.alphabet import normalize_middle_english

In [2]: normalize_middle_english("Whan Phebus in the Crabbe had nere hys cours ronne And toward the leon his journé gan take", to_lower=True)
Out [2]: 'whan phebus in the crabbe had nere hys cours ronne and toward the leon his journé gan take'

Punctuation Removal

The punct parameter removes punctuation. Note that the output below is also lowercased, so the other options appear to be enabled by default.

In [3]: normalize_middle_english("Thus he hath me dryven agen myn entent, And contrary to my course naturall.", punct=True)
Out [3]: 'thus he hath me dryven agen myn entent and contrary to my course naturall'
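For readers without CLTK installed, the effect of punctuation removal can be approximated in plain Python. This is a minimal sketch, not CLTK's implementation: the function name strip_punct_sketch is hypothetical, and it only removes the ASCII punctuation listed in string.punctuation.

```python
import string

def strip_punct_sketch(text: str) -> str:
    """Illustrative only: delete every ASCII punctuation character."""
    # str.maketrans with a third argument maps those characters to None.
    return text.translate(str.maketrans("", "", string.punctuation))

line = "Thus he hath me dryven agen myn entent, And contrary to my course naturall."
print(strip_punct_sketch(line).lower())
```

Lowercasing is applied separately here, which keeps the two steps independent of each other.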

Canonical Form

The alpha_conv parameter follows the spelling conventions established over the last century: þ and ð are both converted to th, while 3 (standing in for yogh, ȝ) is converted to y at the start of a word and to gh otherwise.

In [4]: normalize_middle_english("as 3e lykeþ best", alpha_conv=True)
Out [4]: 'as ye liketh best'
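The letter conversions described above can be sketched with two regular expressions. This is an illustration under stated assumptions, not CLTK's code: the function name alpha_conv_sketch is hypothetical, the input is assumed to be lowercase already, and treating every digit 3 as yogh would corrupt text that contains real numbers.

```python
import re

def alpha_conv_sketch(text: str) -> str:
    """Illustrative only: apply the thorn, eth, and yogh rules described above."""
    # Thorn and eth both become "th".
    text = text.replace("þ", "th").replace("ð", "th")
    # Yogh (written "ȝ" or, as an ASCII stand-in, "3") becomes "y" at the start of a word...
    text = re.sub(r"\b[ȝ3]", "y", text)
    # ...and "gh" in every remaining position.
    text = re.sub(r"[ȝ3]", "gh", text)
    return text

print(alpha_conv_sketch("ȝe knyȝt þat"))
```

The word-initial rule has to run first; otherwise the general replacement would turn every yogh into gh before the start-of-word case could be checked.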

Stemming

CLTK supports a rule-based affix stemmer for ME.

Keep in mind that, while Middle English is considered a weakly inflected language with a grammatical structure resembling that of Modern English, its lack of orthographic conventions makes it difficult to account for the various affixes.

In [1]: from cltk.stem.middle_english import affix_stemmer

In [2]: from cltk.corpus.middle_english.alphabet import normalize_middle_english

In [3]: text = normalize_middle_english('The speke the henmest kyng, in the hillis he beholdis.').split(" ")

In [4]: affix_stemmer(text)
Out [4]: 'the spek the henm kyng in the hill he behold'

The stemmer can also take an additional parameter of a hard-coded exception dictionary. An example follows utilizing the compiled stopwords list.

In[7]: from cltk.stop.middle_english.stops import STOPS_LIST

In[8]: exceptions = dict(zip(STOPS_LIST, STOPS_LIST))

In[9]: affix_stemmer('byfore him'.split(" "), exception_list = exceptions)
Out[9]: 'byfore him'
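To make the idea of rule-based suffix stripping with an exception dictionary concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the suffix list is tiny and hypothetical, the minimum stem length is arbitrary, and toy_affix_stemmer is not CLTK's affix_stemmer, so its output will differ from the library's on real text.

```python
# Hypothetical, deliberately small suffix list; a real stemmer needs many more rules.
SUFFIXES = ("est", "eth", "is", "es", "e")

def toy_affix_stemmer(words, exceptions=None, min_stem=3):
    """Strip the longest matching suffix from each word unless it is an exception."""
    exceptions = exceptions or {}
    stems = []
    for word in words:
        if word in exceptions:
            # Exceptions map a word directly to the stem that should be used.
            stems.append(exceptions[word])
            continue
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            # Only strip when enough of the word is left to serve as a stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                break
        stems.append(word)
    return " ".join(stems)

print(toy_affix_stemmer(["hillis", "beholdis", "speke", "the"]))
print(toy_affix_stemmer(["byfore", "him"]))
print(toy_affix_stemmer(["byfore", "him"], exceptions={"byfore": "byfore"}))
```

Mapping a word to itself in the exception dictionary, as the stopword example does, simply protects that word from being stemmed.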

Stopword Filtering

To demonstrate CLTK's built-in stopword list, we use an example from Chaucer's "The Summoner's Tale":

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.middle_english.stops import STOPS_LIST

In [3]: sentence = 'This frere bosteth that he knoweth helle.'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if w not in STOPS_LIST]
Out[6]:
['frere',
 'bosteth',
 'knoweth',
 'helle',
 '.']
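The same filtering pattern works with any collection of stopwords. The sketch below is self-contained and assumes a three-word sample set (SAMPLE_STOPS, hypothetical) rather than the real STOPS_LIST, and it splits on whitespace instead of using a tokenizer.

```python
# Hypothetical sample for illustration; the real STOPS_LIST is much larger.
SAMPLE_STOPS = frozenset({"this", "that", "he"})

def filter_stops(sentence, stops=SAMPLE_STOPS):
    """Lowercase the sentence, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in stops]

print(filter_stops("This frere bosteth that he knoweth helle"))
```

Storing the stopwords in a set makes each membership test constant time, which can matter when filtering long texts against a long list.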

Stresser

The historical events of 11th-century Britain were closely tied to the phonological development of English. The Norman Conquest of 1066 is mainly responsible for the influx of both Francien and Latin words and, by extension, for the highly variable spelling and phonology of ME.

While the Stresser provided by CLTK cannot determine which stress pattern a given word follows, it does accept the most common stress rules as a parameter: French ("FSR"), Germanic ("GSR"), and Latin ("LSR").

In [1]: from cltk.phonology.middle_english.transcription import Word

In [2]: ".".join(Word('beren').stresser(stress_rule = "FSR"))
Out[2]: "ber.'en"

In [3]: ".".join(Word('yisterday').stresser(stress_rule = "GSR"))
Out [3]: "yi.ster.'day"

In [4]: ".".join(Word('verbum').stresser(stress_rule = "LSR"))
Out [4]: "ver.'bum"
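As an illustration of what a stress rule does, the French rule is often stated as follows: stress the final syllable unless it ends in an unstressed -e, in which case stress the penult. The sketch below encodes that textbook statement only. It is an assumption that CLTK's "FSR" option applies exactly this rule, and the function french_stress_sketch is hypothetical. It expects a list of syllables that has already been split and marks stress with a leading apostrophe, as in the outputs above.

```python
def french_stress_sketch(syllables):
    """Illustrative only: mark the stressed syllable with a leading apostrophe."""
    if len(syllables) < 2:
        # Monosyllables are left unmarked.
        return list(syllables)
    # Assumption: a final syllable ending in -e holds a schwa and cannot carry stress.
    if syllables[-1].endswith("e"):
        target = len(syllables) - 2
    else:
        target = len(syllables) - 1
    return ["'" + s if i == target else s for i, s in enumerate(syllables)]

print(".".join(french_stress_sketch(["ber", "en"])))
print(".".join(french_stress_sketch(["pren", "dre"])))
```

Choosing the rule by hand, as with the stress_rule parameter, is what this kind of function requires: nothing in it can tell whether a word is of French, Germanic, or Latin origin.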

Syllabify

The Word class provides syllabification methods for ME words: syllabify() returns a list of syllables, and syllabified_str() returns them joined by periods.

In [1]: from cltk.phonology.middle_english.transcription import Word

In [2]: w = Word("hymsylf")

In [3]: w.syllabify()
Out [3]: ['hym', 'sylf']

In [4]: w.syllabified_str()
Out[4]: 'hym.sylf'