# The Macronizer Class

A new `Macronizer` object accepts several initialization parameters, all of them optional. Only the first two are intended to be changed by the user:

- `macronize_everything=True` determines whether to also mark vowels whose length is inferable from accent rules (should be `False` for a student audience).
- `unicode=False` determines whether the output uses human-friendly Unicode combining diacritics or machine-friendly non-combining carets and underscores. Evaluation methods are only available for the latter.
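
To make the difference between the two output formats concrete, here is a minimal sketch of converting between them. It is not the library's own code: the helper names `markers_to_combining` and `combining_to_markers` are hypothetical, and I assume, as the sample output on this page suggests, that `^` marks a short vowel and `_` a long one.

```python
# Hypothetical helpers, not part of class_macronizer.
# Assumption: "^" marks a short vowel and "_" a long vowel.
COMBINING_BREVE = "\u0306"   # U+0306 COMBINING BREVE
COMBINING_MACRON = "\u0304"  # U+0304 COMBINING MACRON


def markers_to_combining(text: str) -> str:
    """Replace caret/underscore markers with Unicode combining diacritics."""
    return text.replace("^", COMBINING_BREVE).replace("_", COMBINING_MACRON)


def combining_to_markers(text: str) -> str:
    """Inverse: turn combining breves/macrons back into carets/underscores."""
    return text.replace(COMBINING_BREVE, "^").replace(COMBINING_MACRON, "_")


marked = "νεα_νί^α_ς"
pretty = markers_to_combining(marked)
# The conversion is lossless, so it can be undone.
assert combining_to_markers(pretty) == marked
```

The sketch treats every `^` and `_` as a length marker, so it is only safe on text that contains no other carets or underscores, and it does not normalize the result to precomposed characters.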

When the aim is to create a training corpus, no defaults will ever have to be changed. Here's the simplest case possible:

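```python
from class_macronizer import Macronizer

# All initialization parameters are optional; the defaults are used here.
macronizer = Macronizer()

# Named `text` rather than `input` to avoid shadowing the Python builtin.
text = '''ἀάατος, ἀγαθὸς, καλὸς, ἀνήρ, νεανίας'''
output = macronizer.macronize(text)

print(f'Results: {output}')
```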

```
Macronizing tokens ☕️: 100%|██████████| 5/5 [00:00<00:00, 666.29it/s]

###### STATS ######
Dichrona in open syllables before: 	10
Unmacronized dichrona in open syllables left: 	0

10 dichrona macronized.

Macronization ratio: 100.00%
Results: ἀ^ά_α^τος, ἀ^γα^θὸς, κα^λὸς, ἀ^νήρ, νεα_νί^α_ς
```
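
As a sanity check on the stats above, the markers in the result string can be counted directly. The sketch below is my own illustration rather than a `Macronizer` method: `count_length_markers` is a hypothetical name, and it assumes that `^` and `_` occur in the string only as length markers.

```python
from collections import Counter


def count_length_markers(text: str) -> dict:
    """Count caret and underscore markers in caret/underscore output.

    Hypothetical helper; assumes "^" and "_" appear only as length markers.
    """
    counts = Counter(ch for ch in text if ch in "^_")
    return {
        "caret": counts["^"],
        "underscore": counts["_"],
        "total": counts["^"] + counts["_"],
    }


result = "ἀ^ά_α^τος, ἀ^γα^θὸς, κα^λὸς, ἀ^νήρ, νεα_νί^α_ς"
print(count_length_markers(result))
# → {'caret': 7, 'underscore': 3, 'total': 10}
```

The total of 10 agrees with the `10 dichrona macronized` line in the stats.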




Now let's try a longer input. Below I load all of Xenophon's *Anabasis* as a single Python string of 359 857 characters. I also use `colour_dichrona_in_open_syllables` from `grc_utils`, a useful function for colour-printing the output. However, since the Jupyter window would crash if forced to print and render the entire output, I only print the first ten sentences (the output split on `'. '`).

Note that this takes close to two minutes on my MacBook. The progress bar accounts for only about 16 seconds of that; most of the computation is dedicated to carefully reintegrating the list of macronized words into the original text.

```python
from class_macronizer import Macronizer
from tests.anabasis import anabasis
from grc_utils import colour_dichrona_in_open_syllables

macronizer = Macronizer()

# `anabasis` is the whole text as a single string.
output = macronizer.macronize(anabasis)

# Print only the first ten sentences, with dichrona in open syllables coloured.
for line in output.split('. ')[:10]:
    print(colour_dichrona_in_open_syllables(line))
```


```
Macronizing tokens ☕️: 100%|██████████| 21864/21864 [00:16<00:00, 1317.02it/s]

###### STATS ######
Dichrona in open syllables before: 	29729
Unmacronized dichrona in open syllables left: 	8894

20835 dichrona macronized.

Macronization ratio: 70.08%
Δα_ρείου καὶ Πα^ρυ^σά_τι^δος γίγνονται παῖδες δύ^ο, πρεσβύ^τερος μὲν Ἀρταξέρξης, νεώτερος δὲ Κῦρος· ἐπεὶ δὲ ἠσθένει Δα_ρεῖος καὶ ὑ^πώπτευε τελευτὴν τοῦ βί^ου, ἐβούλετο τὼ παῖδε ἀμφοτέρω πα^ρεῖναι
ὁ μὲν οὖν πρεσβύ^τερος πα^ρὼν ἐτύγχα^νε· Κῦρον δὲ μετα^πέμπεται ἀ^πὸ τῆς ἀρχῆς ἧς αὐτὸν σατράπην ἐποίησε, καὶ στρα^τηγὸν δὲ αὐτὸν ἀπέδειξε πάντων ὅσοι ἐς Καστωλοῦ πεδί^ον ἁθροίζονται
ἀ^να^βαίνει οὖν ὁ Κῦρος λα^βὼν Τισσα^φέρνην ὡς φί^λον, καὶ τῶν Ἑλλήνων ἔχων ὁπλί_τα_ς ἀ^νέβη τρι^α_κοσί^ους, ἄ^ρχοντα^ δὲ αὐτῶν Ξεν
```

## Sidenotes

- As you saw, the printed stats include a macronization ratio. Since proper names are especially prone to containing ambiguous dichrona, there is the option to exclude them from the statistic with `count_proper_names=False`. In prose texts, names can otherwise give an unnecessarily poor impression of the degree of macronization.
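
Judging by the numbers printed above, the ratio appears to be the share of dichrona in open syllables that received a mark. Here is a minimal sketch of that arithmetic; `macronization_ratio` is a hypothetical helper, not the library's stats method, and it does not model the `count_proper_names` option.

```python
def macronization_ratio(before: int, left: int) -> float:
    """Share of dichrona that were macronized, as a fraction between 0 and 1.

    Hypothetical helper: `before` is the number of dichrona in open syllables
    in the input, `left` the number still unmarked after macronization.
    """
    if before <= 0:
        raise ValueError("need at least one dichronon to compute a ratio")
    if not 0 <= left <= before:
        raise ValueError("`left` must lie between 0 and `before`")
    return (before - left) / before


# Figures from the Anabasis run: 29729 before, 8894 left.
print(f"{macronization_ratio(29729, 8894):.2%}")  # → 70.08%
```

Excluding proper names would shrink both counts, which is why the ratio can look better for prose once names are left out.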