# Python Classics Cookbook
by Patrick J. Burns

### 1. Replace macrons

**Problem**: You want to remove all of the macrons from a string, like the following sentence from Caesar's *Bellum Gallicum*.

In [1]:
text_with_macrons = """Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur."""

Here are three methods for removing macrons: 1. with ```replace```; 2. with ```re.sub```; and 3. with ```translate```. Click [here](#remove_macrons_best) for the TLDR best solution.

#### string ```replace```

In [2]:
# simple replacement

word = 'dīvīsa'
word_without_macrons = word.replace('ī', 'i')
print(f"{word} > {word_without_macrons}")

word = 'Aquītānī'
word_without_macrons = word.replace('ā', 'a').replace('ī', 'i')
print(f"{word} > {word_without_macrons}")

dīvīsa > divisa
Aquītānī > Aquitani


It would be tedious to chain together enough ```replace``` calls to cover every macron. Instead, we can create a dictionary of replacement pairs and loop over it, replacing one character across the whole text with each pass.

In [3]:
# create dictionary of macrons

macron_map = {
    'ā': 'a', 
    'ē': 'e', 
    'ī': 'i', 
    'ō': 'o', 
    'ū': 'u',
    'ȳ': 'y',
    'Ā': 'A',
    'Ē': 'E', 
    'Ī': 'I', 
    'Ō': 'O', 
    'Ū': 'U',
    'Ȳ': 'Y'
}

# compact method with dictionary comprehension

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = {k: v for k, v in zip(vowels_with_macrons, vowels)}    

print(macron_map)

{'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u', 'ȳ': 'y', 'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U', 'Ȳ': 'Y'}


In [4]:
# replace by iterating over dictionary

text_without_macrons = text_with_macrons

for k, v in macron_map.items():
    text_without_macrons = text_without_macrons.replace(k, v)
    
print(text_without_macrons)

Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.


In [5]:
# function for replace by iterating over dictionary

def replace_macrons_1(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)    
    return text_without_macrons

In [9]:
%time
replace_macrons_1(text_with_macrons, macron_map)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.82 µs


'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

#### replacement with regular expressions

Another option would be to do the same thing with regular expressions instead of ```replace```...

In [7]:
# function for re.sub by iterating over dictionary

import re

def remove_macrons_2(text_with_macrons, replacement_dictionary):
    text_without_macrons = text_with_macrons
    for k, v in replacement_dictionary.items():
        text_without_macrons = re.sub(rf'{k}', v, text_without_macrons, flags=re.MULTILINE)
    return text_without_macrons

In [10]:
%time
remove_macrons_2(text_with_macrons, macron_map)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.3 µs


'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'

For a single sentence, this turns out to take about the same amount of time to run (not so with larger texts, as we see below).
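
A possible variation, not part of the recipe above: compile one character class that matches any macron vowel and let ```re.sub``` look up each match in the dictionary, so the text is scanned once rather than once per vowel. This is a minimal sketch; ```MACRON_PATTERN``` and ```remove_macrons_2b``` are hypothetical names introduced here, and whether it is actually faster would need to be measured.

```python
import re

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = dict(zip(vowels_with_macrons, vowels))

# Hypothetical name: a single character class matching any macron vowel
MACRON_PATTERN = re.compile(f'[{vowels_with_macrons}]')

def remove_macrons_2b(text_with_macrons):
    # The callback receives each match and returns its plain-vowel replacement
    return MACRON_PATTERN.sub(lambda m: macron_map[m.group(0)], text_with_macrons)

print(remove_macrons_2b('Aquītānī'))
```

Characters that are not in the class are left untouched, so the function is safe to run on text that has no macrons at all.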

#### replacement with ```translate```

Another option is the ```translate``` method. This allows us to make all changes using a translation table without having to loop repeatedly over the original string.

In [22]:
# compact method with dictionary comprehension
# note that translate uses ```ord```, i.e. the Unicode code point for each mapped character

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_table = {ord(k): v for k, v in zip(vowels_with_macrons, vowels)}    

print(macron_table)

{257: 'a', 275: 'e', 299: 'i', 333: 'o', 363: 'u', 563: 'y', 256: 'A', 274: 'E', 298: 'I', 332: 'O', 362: 'U', 562: 'Y'}


In [23]:
# function for replacing macrons with translate

def remove_macrons_3(text_with_macrons, macron_table):
    text_without_macrons = text_with_macrons.translate(macron_table)
    return text_without_macrons

In [24]:
%time
remove_macrons_3(text_with_macrons, macron_table)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.06 µs


'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
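
As a side note, the built-in ```str.maketrans``` can build an equivalent table directly from the two vowel strings, pairing characters by position. The sketch below is an alternative to the dictionary comprehension, not a replacement for it; the only difference in the resulting table is that the values are code points rather than one-character strings, which ```translate``` accepts equally well.

```python
vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'

# str.maketrans pairs characters by position; both strings must have the same length
macron_table = str.maketrans(vowels_with_macrons, vowels)

print('dīvīsa'.translate(macron_table))
```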

#### Testing recipes on larger texts

All three methods run at about the same speed on a single sentence. But minor differences can add up as the amount of text to be processed increases. How do these recipes perform on larger texts?

In [25]:
# Get sample text with macrons
# Here we'll use the Dickinson College Commentaries text of Caesar's *Bellum Gallicum* (which has macrons!) as found in conventus-lex's github repo for Maccer.

from requests_html import HTMLSession
session = HTMLSession()
url = 'https://raw.githubusercontent.com/conventus-lex/maccer/master/sources/DCC/Caesar%20-%20Selections%20from%20the%20Gallic%20War.txt'
r = session.get(url)
test = r.text
test = test[test.find('1.1'):] # remove 'metadata'
test = re.sub(r'\d\.\d+', '', test) # remove chapter headings, e.g. 1.1
print(test[2:147]) # print sample

Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.


In [28]:
print(f'This text has {len(test.split())} words.')

This text has 6399 words.


Here are the results of timeit on my iMac 2.7 GHz Intel Core i5...

In [13]:
%timeit -n 1000 replace_macrons_1(test, macron_map)

1000 loops, best of 3: 323 µs per loop


In [14]:
%timeit -n 1000 remove_macrons_2(test, macron_map)

1000 loops, best of 3: 2.13 ms per loop


In [29]:
%timeit -n 1000 remove_macrons_3(test, macron_table)

1000 loops, best of 3: 4.92 ms per loop


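The string method ```replace```, even with its multiple passes over the string, is much faster than the other two methods: in these runs, roughly 6.6 times faster than ```re.sub``` and about 15 times faster than ```translate```.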
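
The ```%timeit``` magic only works inside IPython or Jupyter. To repeat the comparison in a plain Python script, the standard-library ```timeit``` module can be used instead. The sketch below makes two assumptions: it uses a short repeated sentence as a stand-in for the Dickinson College Commentaries text, and the function names ```with_replace```, ```with_re_sub```, and ```with_translate``` are hypothetical. The numbers it prints will differ from those above and will vary by machine and Python version.

```python
import re
import timeit

vowels = 'aeiouyAEIOUY'
vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
macron_map = dict(zip(vowels_with_macrons, vowels))
macron_table = {ord(k): v for k, v in macron_map.items()}

# Stand-in sample (an assumption): a shortened opening sentence repeated 200 times
sample = 'Gallia est omnis dīvīsa in partēs trēs. ' * 200

def with_replace(text):
    for k, v in macron_map.items():
        text = text.replace(k, v)
    return text

def with_re_sub(text):
    for k, v in macron_map.items():
        text = re.sub(k, v, text)
    return text

def with_translate(text):
    return text.translate(macron_table)

for func in (with_replace, with_re_sub, with_translate):
    # Best of 3 repeats of 100 calls each, reported as time per call
    best = min(timeit.repeat(lambda: func(sample), number=100, repeat=3)) / 100
    print(f'{func.__name__}: {best * 1e6:.1f} µs per call')
```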

#### Warning about combining characters

Before wrapping up a discussion of string replacement and Unicode characters with diacritics, it is worth mentioning the difference between decomposed and precomposed Unicode characters. Note the following behavior.

In [31]:
word1 = 'dīvīsa'
print(len(word1))

6


In [32]:
word2 = 'dīvīsa'
print(len(word2))

8


In [34]:
print(word1 == word2)

False


These strings are not the same: word2 contains two decomposed lower-case i-with-macron characters, each stored as a plain ```i``` followed by the combining macron (U+0304).

In [38]:
print(word1.encode('unicode-escape'))
print(word2.encode('unicode-escape'))

b'd\\u012bv\\u012bsa'
b'di\\u0304vi\\u0304sa'


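It seems like a good idea to handle these differences before attempting to replace characters. We can use ```unicodedata.normalize``` to convert all strings for replacement to Normalization Form C (NFC) before processing. NFC recomposes each base letter plus combining macron into the single precomposed character that the replacement dictionary uses as its key.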

In [39]:
import unicodedata
word2 = unicodedata.normalize('NFC', word2)
print(len(word2))

6


In [40]:
# function with NFC preprocessing

def replace_macrons_1b(text_with_macrons, replacement_dictionary):
    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)
    for k, v in replacement_dictionary.items():
        text_without_macrons = text_without_macrons.replace(k, v)    
    return text_without_macrons

In [41]:
%time
replace_macrons_1b(text_with_macrons, macron_map)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 10 µs


'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'
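
Another way to handle both forms, offered as a sketch rather than as this notebook's method, is to go in the opposite direction: normalize to NFD so that every macron becomes a separate combining character (U+0304), delete that character, and recompose with NFC. The name ```strip_macrons_nfd``` is hypothetical. Note that this removes the macron from any base letter, not only the twelve vowels in the dictionary, which may or may not be what you want.

```python
import unicodedata

COMBINING_MACRON = '\u0304'

def strip_macrons_nfd(text):
    # Split precomposed characters into base letter + combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Drop only the combining macron; other diacritics are kept
    stripped = decomposed.replace(COMBINING_MACRON, '')
    # Recompose whatever diacritics remain
    return unicodedata.normalize('NFC', stripped)

precomposed = 'd\u012bv\u012bsa'   # ī as a single code point
decomposed = 'di\u0304vi\u0304sa'  # i followed by a combining macron
print(strip_macrons_nfd(precomposed), strip_macrons_nfd(decomposed))
```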

#### <a id='remove_macrons_best'>Replace Macrons: Best solution</a>

Putting it all together we have the following function that we can use for macron replacement.

In [43]:
import unicodedata

def replace_macrons(text_with_macrons):
    '''Replace macrons in Latin text'''
    vowels = 'aeiouyAEIOUY'
    vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'
    replacement_dictionary = {k: v for k, v in zip(vowels_with_macrons, vowels)}    
    
    temp = unicodedata.normalize('NFC', text_with_macrons)

    for k, v in replacement_dictionary.items():
        temp = temp.replace(k, v)

    text_without_macrons = temp
    return text_without_macrons

In [44]:
%timeit -n 1000 replace_macrons(test)

1000 loops, best of 3: 415 µs per loop


In [28]:
print(replace_macrons(test)[:147])

Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.


So the function is slightly slower with normalization (415 µs versus 323 µs per loop), but still faster than the other methods.

Please open an issue for any problems you see with the code. You can also use issues to suggest another Python solution for any of the recipes in this notebook.