# The Macronizer Class

A new macronizer object takes a range of initialization variables, all of them optional. Only the first two are intended to be changed by the user:


- `macronize_everything=True`: determines whether to also mark vowels whose length is inferable from accent rules (should be `False` for a student audience).
- `unicode=False`: determines whether the output uses human-friendly Unicode combining diacritics (`True`) or machine-friendly non-combining carets and underscores (`False`). Evaluation methods are only available for the latter.
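
To make the difference between the two output formats concrete, here is a minimal sketch that converts the caret/underscore notation into Unicode combining diacritics. The helper name `to_combining` is hypothetical and not part of the macronizer's API. It assumes that `_` marks a long vowel and `^` a short one, and it ignores edge cases such as how the combining marks are ordered relative to accents and breathings.

```python
# Hypothetical helper for illustration only; not the macronizer's own code.
COMBINING_MACRON = "\u0304"  # placed after a long vowel
COMBINING_BREVE = "\u0306"   # placed after a short vowel


def to_combining(text: str) -> str:
    """Replace `_` with a combining macron and `^` with a combining breve."""
    return text.replace("_", COMBINING_MACRON).replace("^", COMBINING_BREVE)


# The marker simply becomes a combining character attached to the vowel.
print(to_combining("Δα_ρείου"))
```

The caret/underscore form is easier to process programmatically because every length mark is a plain ASCII character that follows its vowel, which is presumably why evaluation works on that format.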

When the aim is to create a training corpus, no defaults will ever have to be changed. Here's the simplest case possible:

In [3]:
from grc_macronizer.class_macronizer import Macronizer

macronizer = Macronizer()

# Named `text` rather than `input` to avoid shadowing the built-in input()
text = '''ἀάατος, ἀγαθὸς, καλὸς, ἀνήρ, νεανίας, Αἰγύπτου'''
output = macronizer.macronize(text)

print(f'Results: {output}')

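Note: in the recorded run this cell failed with `ModuleNotFoundError: No module named 'grc_macronizer'`, which means the package was not importable from the active environment.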

Now let's try a longer input. Below I have loaded all of Xenophon's *Anabasis* as one Python string of 359 857 characters. I also show a useful function for colour-printing the output. However, since the Jupyter window would crash if forced to print and render the entire output, I only print the first ten sentences.

Note that this takes close to two minutes on my MacBook. Most of that time is spent carefully reintegrating the list of macronized words into the original text.

In [1]:
from class_macronizer import Macronizer
from tests.anabasis import anabasis
from grc_utils import colour_dichrona_in_open_syllables

macronizer = Macronizer()

text = anabasis
output = macronizer.macronize(text)

for line in output.split('. ')[:10]:
    print(colour_dichrona_in_open_syllables(line))


Extracting words to macronize from the odyCy docs: 100%|██████████| 2352/2352 [00:14<00:00, 166.37it/s]
Macronizing tokens ☕️: 100%|██████████| 21956/21956 [00:25<00:00, 847.74it/s] 
                                                                               

###### STATS ######
Dichrona in open syllables before:            29712
Unmacronized dichrona in open syllables left: 6504

[32m23208[0m dichrona macronized.

Macronization ratio: [32m78.11%[0m
Δ[32mα[0m_ρείου καὶ Π[32mα[0m^ρ[32mυ[0m^σ[32mά[0m_τ[32mι[0m^δος γίγνονται παῖδες δ[32mύ[0m^ο, πρεσβ[32mύ[0m^τερος μὲν Ἀρταξέρξης, νεώτερος δὲ Κῦρος· ἐπεὶ δὲ ἠσθένει Δ[32mα[0m_ρεῖος καὶ [32mὑ[0m^πώπτευε τελευτὴν τοῦ β[32mί[0m^ου, ἐβούλετο τὼ παῖδε ἀμφοτέρω π[32mα[0m^ρεῖναι
ὁ μὲν οὖν πρεσβ[32mύ[0m^τερος π[32mα[0m^ρὼν ἐτύγχ[32mα[0m^νε· Κῦρον δὲ μετ[32mα[0m^πέμπεται [32mἀ[0m^πὸ τῆς ἀρχῆς ἧς αὐτὸν σατρ[31mά[0mπην ἐποίησε, καὶ στρ[32mα[0m^τηγὸν δὲ αὐτὸν [31mἀ[0mπέδειξε πάντων ὅσοι ἐς Καστωλοῦ πεδ[32mί[0m^ον ἁθροίζονται
[32mἀ[0m^ν[32mα[0m^βαίνει οὖν ὁ Κῦρος λ[32mα[0m^βὼν Τισσ[32mα[0m^φέρνην ὡς φ[32mί[0m^λον, καὶ τῶν Ἑλλήνων ἔχων ὁπλ[32mί[0m_τ[32mα[0m_ς [32mἀ[0m^νέβη τρ[32mι[0m^[32mα[0m_κοσ[32mί[0m^ους, [32mἄ[0m^ρχοντ[32mα[0m^ δὲ αὐτῶ

Each run collects all word forms that were not fully macronized into a Python list, which is saved as `diagnostics/still_ambiguous_{first word of input text}`.

The next run uses Aeschylus' *Suppliants* (*Hiketides*) as input.

In [None]:
from class_macronizer import Macronizer
from tests.hiketides import hiketides
from grc_utils import colour_dichrona_in_open_syllables

macronizer = Macronizer()

text = hiketides
output = macronizer.macronize(text)

for line in output.split('.')[:10]:
    print(colour_dichrona_in_open_syllables(line))


Extracting words to macronize from the odyCy docs: 100%|██████████| 340/340 [00:00<00:00, 396.47it/s]
Macronizing tokens ☕️: 100%|██████████| 2088/2088 [00:02<00:00, 706.62it/s]
                                                                           

###### STATS ######
Dichrona in open syllables before:            2737
Unmacronized dichrona in open syllables left: 939

[32m1798[0m dichrona macronized.

Macronization ratio: [32m65.69%[0m
Ζεὺς μὲν [32mἀ[0m^φίκτωρ ἐπ[32mί[0m^δοι προφρόνως
στόλον ἡμέτερον νάιον ἀρθέντ'
[32mἀ[0m^πὸ προστομ[31mί[0mων λεπτοψ[31mα[0mμ[32mά[0m^θων
Νείλου
Δ[32mί[0m_[32mα[0m_ν δὲ λ[32mι[0m^ποῦσαι
χθόν[31mα[0m σύγχορτον Σ[31mυ[0mρ[31mί[0mᾳ φεύγομεν,
οὔτ[31mι[0mν' ἐφ' αἵμ[32mα[0m^τ[32mι[0m^ δημηλ[31mα[0mσ[31mί[0m[31mα[0mν
ψήφῳ πόλεως γνωσθεῖσαι,
ἀλλ' αὐτογενεῖ φυξ[31mα[0mνορ[31mί[0mᾳ,
γ[32mά[0m^μον [31mΑ[0m[31mἰ[0mγ[32mύ[0m^πτου παίδων [31mἀ[0mσεβῆ τ'
ὀνοταζόμεναι δ[32mι[0m^[31mά[0mνοι[32mα[0m^ν.
Δ[31mα[0mν[31mα[0mὸς δὲ π[32mα[0m^τὴρ καὶ βούλαρχος
καὶ στ[32mα[0m^σ[31mί[0mαρχος τ[32mά[0m^δε πεσσονομῶν
κ[31mύ[0mδιστ' [32mἀ[0m^χέων ἐπέκρ[31mι[0mνεν
φεύγειν [31mἀ[0mνέδην δ[32mι[0m^[32mὰ[0m^ κῦμ' [32mἅ[0m_λ[32mι[0m^ον,
κέλσαι δ'

## Sidenotes

- As you saw, the stats method returns a macronization ratio. Since proper names are especially prone to having ambiguous dichrona, there is the option to exclude them from the statistic by using `count_proper_names=False`. In prose, names can otherwise give an unnecessarily poor impression of the degree of macronization.
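
To illustrate why excluding proper names can change the picture, here is a small self-contained sketch. Everything in it is hypothetical: the `TokenStats` record, the function name, and the per-word counts are illustrative assumptions and do not reflect how the macronizer identifies proper names or stores its statistics internally.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Hypothetical per-word record, for illustration only."""
    word: str
    is_proper_name: bool
    dichrona: int    # dichrona in open syllables in this word
    macronized: int  # how many of those received a length mark


def ratio_with_name_filter(tokens, count_proper_names=True):
    """Macronization ratio, optionally leaving proper names out of the count."""
    counted = [t for t in tokens if count_proper_names or not t.is_proper_name]
    total = sum(t.dichrona for t in counted)
    if total == 0:
        return 0.0
    return sum(t.macronized for t in counted) / total


# Toy counts: one unmacronized proper name among two fully macronized words
tokens = [
    TokenStats("φί^λον", False, 1, 1),
    TokenStats("Δαναὸς", True, 2, 0),
    TokenStats("πα^τὴρ", False, 1, 1),
]

print(ratio_with_name_filter(tokens))                            # 0.5
print(ratio_with_name_filter(tokens, count_proper_names=False))  # 1.0
```

In this toy case a single name with two unresolved dichrona halves the ratio, which is the effect the option is meant to filter out.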