# Read XML file with betacode text

Install and import the necessary modules. If you want to use the `betacode` module, you also have to `pip install pygtrie`.

In [1]:
from bs4 import BeautifulSoup
from collections import Counter
import betacode.conv

Read in the XML file and create a soup object.

In [2]:
with open('aristot.met_gk.xml','r') as fin:
    aristotle = fin.read()
soup = BeautifulSoup(aristotle, 'lxml')

Looking at the XML file, we need to find the `text` tags.

In [3]:
texts = soup.find_all('text')
len(texts)

1

Turns out there is only one `text` tag.

In [4]:
text = texts[0]

Find the chapters within the `text` tag by looking for `div1` tags whose `type` is `Book`.

In [5]:
chapters = text.find_all('div1', type='Book')

Grab all the text within each `div1` tag using its `text` attribute and append it to the list of chapters. Calling `text.text` on the parent tag would give the same text, but without chapter separation.

In [6]:
CH = []
for chapter in chapters:
    CH.append(chapter.text)
book = '\n'.join(CH)
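
For comparison, the same chapter extraction can be sketched with only the standard library. This is a minimal sketch, not the method used in this notebook: it swaps BeautifulSoup for `xml.etree.ElementTree`, and the `sample` string is a made-up miniature stand-in for the real file. The real file may contain a DTD or entity references that the standard-library parser rejects, so treat this as an illustration of the `div1` logic only.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; the real file has far more markup.
sample = """<TEI.2><text><body>
<div1 type="Book" n="1"><p>pa/ntes a)/nqrwpoi</p></div1>
<div1 type="Book" n="2"><p>tou= ei)de/nai</p></div1>
</body></text></TEI.2>"""

root = ET.fromstring(sample)

# Collect the full text of every <div1 type="Book">, one string per book.
chapters = [
    "".join(div.itertext())
    for div in root.iter("div1")
    if div.get("type") == "Book"
]

# Join the books with a newline, as in the cell above.
book = "\n".join(chapters)
print(book)
```

`itertext()` plays the role of BeautifulSoup's `.text` here: it concatenates every text node inside the element, including text nested in child tags.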

Convert from betacode to Unicode.

In [7]:
greek = betacode.conv.beta_to_uni(book)

Count characters in text

In [8]:
letters = Counter(greek)
letters

Counter({'\t': 896,
         '\n': 280,
         ' ': 77088,
         '"': 28,
         '&': 1076,
         '(': 1,
         ')': 1,
         ',': 5524,
         '.': 2087,
         '0': 5,
         '1': 9,
         '2': 3,
         '3': 3,
         '6': 4,
         '7': 1,
         '8': 2,
         '9': 2,
         ';': 1467,
         '<': 60,
         '>': 60,
         '·': 1915,
         'Α': 29,
         'Β': 21,
         'Γ': 3,
         'Δ': 15,
         'Ε': 8,
         'Ζ': 5,
         'Η': 1,
         'Θ': 4,
         'Κ': 45,
         'Λ': 5,
         'Μ': 7,
         'Ν': 4,
         'Ξ': 4,
         'Π': 64,
         'Ρ': 1,
         'Σ': 51,
         'Τ': 4,
         'Φ': 4,
         'Ψ': 1,
         'Ω': 1,
         'α': 24693,
         'β': 1358,
         'γ': 6311,
         'δ': 8631,
         'ε': 17577,
         'ζ': 526,
         'η': 4415,
         'θ': 4545,
         'ι': 18271,
         'κ': 11035,
         'λ': 9849,
         'μ': 9294,
         'ν': 31104,
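
The raw count above mixes letters with whitespace, punctuation, and digits, and it treats each accented form as a character of its own. If only the base letters matter, one possible refinement is sketched below. The helper `count_base_letters` and the `sample` string are illustrative names and data, not part of the notebook; the sketch assumes that folding case and diacritics is acceptable for the analysis.

```python
import unicodedata
from collections import Counter

def count_base_letters(text):
    """Count alphabetic characters after stripping diacritics and case."""
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks, punctuation, digits and whitespace are not alphabetic,
    # so isalpha() filters them out.
    return Counter(ch.lower() for ch in decomposed if ch.isalpha())

# Hypothetical sample text, standing in for the `greek` string.
sample = "καὶ τὸ ὄν, καὶ τὸ ἕν."
counts = count_base_letters(sample)
print(counts.most_common())
```

Note that final sigma (`ς`) and medial sigma (`σ`) remain separate keys under this scheme, since `lower()` does not merge them.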
     

Do the same for words: split the text on spaces, then strip linefeeds and tabs from each token.

In [9]:
greekwords = greek.split(' ')
greekwords = [word.strip('\n\t') for word in greekwords]

In [10]:
words = Counter(greekwords)
words.most_common(20)

[('καὶ', 4275),
 ('τὸ', 2805),
 ('δὲ', 1855),
 ('γὰρ', 1460),
 ('ἢ', 1195),
 ('τῶν', 1178),
 ('μὲν', 1134),
 ('τὰ', 1072),
 ('ἡ', 924),
 ('δ’', 878),
 ('εἶναι', 812),
 ('τοῦ', 776),
 ('μὴ', 649),
 ('ἐν', 601),
 ('εἰ', 601),
 ('ὅτι', 542),
 ('τὴν', 514),
 ('τῷ', 505),
 ('ὁ', 475),
 ('ὡς', 472)]
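
Splitting on `' '` leaves two words joined when only a newline or tab separates them, and punctuation stays attached to the tokens. A possible alternative is sketched below, assuming that punctuation should not count as part of a word. The `tokenize` helper, the `PUNCTUATION` set, and the `sample` string are illustrative, and the resulting counts would differ from the ones shown above.

```python
from collections import Counter

# Punctuation seen in the character count, plus both middle-dot code points
# (U+00B7 MIDDLE DOT and U+0387 GREEK ANO TELEIA).
PUNCTUATION = ',.;"()<>\u00b7\u0387'

def tokenize(text):
    """Split on any whitespace run and trim surrounding punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    # Drop tokens that were punctuation only.
    return [tok for tok in tokens if tok]

# Hypothetical sample with a newline, a tab, and a double space as separators.
sample = "καὶ τὸ ὄν,\nκαὶ\tτὸ ἕν\u00b7  καὶ"
sample_words = Counter(tokenize(sample))
print(sample_words.most_common(2))
```

Calling `split()` with no argument treats any run of spaces, tabs, and newlines as one separator, so it never produces empty tokens.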

Write the Unicode string to a file with **utf-8** encoding.

In [11]:
with open('greektext.txt','w',encoding='utf-8') as fout:
    fout.write(greek)
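
To confirm that the file round-trips cleanly, it can be read back with the same encoding and compared against the original string. The sketch below uses a temporary directory and a short made-up sample (`greek_sample`) rather than the real `greektext.txt`, so it can be run on its own.

```python
import os
import tempfile

# Hypothetical sample standing in for the converted `greek` string.
greek_sample = "πάντες ἄνθρωποι τοῦ εἰδέναι ὀρέγονται φύσει."

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "greektext_sample.txt")

    # Write with an explicit encoding, as in the cell above.
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(greek_sample)

    # Read back with the same encoding and check nothing was lost.
    with open(path, "r", encoding="utf-8") as fin:
        restored = fin.read()

    size_bytes = os.path.getsize(path)

print(restored == greek_sample, len(greek_sample), size_bytes)
```

The file is larger in bytes than the string is in characters because Greek letters take two or three bytes each in UTF-8.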