# Cleaning Te Ara

The first text source we will be handling is Te Ara, the Encyclopedia of New Zealand.

From Wikipedia:
> Te Ara: The Encyclopedia of New Zealand is an online encyclopedia created by the Ministry for Culture and Heritage of the New Zealand Government.

In [67]:
import os, re
import taumahi
import warnings
import itertools
import collections
from multiprocessing import Pool, cpu_count
from unicodedata import category
from nltk.tokenize import sent_tokenize

In [2]:
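def multicore_apply(iterable, func):
    '''
    Applies func to every item of iterable with pool.map, making sure the
    worker processes are closed properly afterwards
    '''
    # Leave one core free, but always start at least one worker
    pool = Pool(max(1, cpu_count() - 1))
    try:
        result = pool.map(func, iterable)
    finally:
        pool.close()
        pool.join()
    return result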

In [3]:
def remove_punctuation(kupu_tōkau):
    return ''.join(ch for ch in kupu_tōkau if category(ch)[0] != 'P')
    
def normalize_text(kupu_tōkau):
    kupu_tōkau = re.sub(r"\s{2,}", " ", kupu_tōkau)
    return remove_punctuation(kupu_tōkau.lower())
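
As a quick check of what these helpers do, the sketch below re-implements them on a few sample strings (the names `strip_punctuation` and `normalize` are illustrative stand-ins, not part of the notebook). Vowels with a tohutō are letters, so they survive; commas, full stops, hyphens and en dashes all fall in a `P*` Unicode category and are dropped. Two side effects are worth knowing about: hyphenated names are fused into one token, and because whitespace is collapsed before punctuation is removed, stripping a free-standing dash can leave a double space behind.

```python
import re
from unicodedata import category

def strip_punctuation(text):
    # Drop every character whose Unicode general category starts with "P"
    return "".join(ch for ch in text if not category(ch).startswith("P"))

def normalize(text):
    # Same order as the notebook: collapse whitespace, lowercase, then strip
    return strip_punctuation(re.sub(r"\s{2,}", " ", text).lower())

plain = normalize("Ko te kāinga te pokapū o ngā mahi kai a te Māori.")
dashed = normalize("ngā tau 1250 – 1300 AD")
hyphenated = normalize("Te Moananui-a-Kiwa")

print(repr(plain))
print(repr(dashed))
print(repr(hyphenated))
```

The later cleaning step collapses whitespace again, so the double space is harmless there, but it matters if `normalize_text` is used on its own.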

In [52]:
def digits_to_text(num):

    if abs(num) >= 10000:
        warnings.warn("Only numbers below 10,000 can be translated")
        return str(num)

    digits = [int(i) for i in str(num)]

    ones = ['kore', 'tahi', 'rua', 'toru', 'whā',
            'rima', 'ono', 'whitu', 'waru', 'iwa']
    places = ['mano', 'rau', 'tekau', '']

    ones_dict   = dict(zip([i for i in range(10)], ones))
    places_dict = dict(zip([3, 2, 1, 0], places))

    digit_words = []
    for place, digit in enumerate(digits[::-1]):
        ones_digit = ones_dict[digit]

        place_digit = places_dict[place]

        if place == 1:
            place_digit = place_digit + " mā"

        if place > 1 and ones_digit == 'tahi':
            ones_digit = "kotahi"
        
        place_words = str.strip(ones_digit + " " + place_digit)
        
        digit_words.append(place_words)
    
    digit_text = ' '.join(digit_words[::-1])

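    # Pad with a trailing space so the patterns that end in a space also
    # match at the end of the string (e.g. 10 -> "tekau", 100 -> "kotahi rau")
    digit_text = ((digit_text + " ")
        .replace(" mā kore", "")
        .replace(" kore rau", "")
        .replace("kore tekau ", "")
        .replace("tahi tekau ", "tekau ")
        .strip())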

    return digit_text

In [4]:
te_ara_path = "../sources/teara-mi-content.txt"
with open(te_ara_path, "r", encoding="utf-8") as f:
    te_ara = f.read()

In [5]:
"Te Ara has {} characters".format(len(te_ara))

'Te Ara has 8873533 characters'

We start off by reading the first 1000 characters of the text:

In [6]:
print(te_ara[:1000])

### New article
https://teara.govt.nz/mi/te-mahi-kai
Ko te kāinga te pokapū o ngā mahi kai a te Māori. Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi. Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.
### New article
https://teara.govt.nz/mi/te-mahi-kai/page-1
Ngā kaihōpara me te hunga tauhokohoko
        Nō te takiwā o ngā tau 1250 – 1300 AD ka tae ngā tīpuna o te Māori ki Aotearoa. Ko te iwi Māori te whakamutunga o ngā iwi hōpara i Te Moananui-a-Kiwa. Ka tauhokohoko ngā tīpuna o te Māori ki tēnā iwi ki tēnā iwi i ngā moutere o Te Moananui-a-Kiwa. Ko Aotearoa te whenua rahi rawa i nōhia e ngā tāngata o Te Moananui-a-Kiwa. Hāunga te pāmamao o te whenua hou, taea noatia ai e te waka haere moana.
        Ngā moutere tango kai
        Noho ai ngā iwi o Te Moananui-a-Kiwa ki ngā moutere tūtata, ka hūpeke i tēnā moutere, i tēnā moutere ki te mahi kai. Ko te whakapae, i pērā te noho a te Māori ki Aotearoa i te taenga

A few comments:
- Te Ara contains several kinds of text: URLs, te reo Māori and some English
- It is worthwhile to run through `te_ara` and remove the non-Māori text (article markers, URLs and English words)

Fortunately, the `taumahi` library has the tools we need to do this.

## Cleaning the text

First we split up `te_ara` into sentences, using `nltk.sent_tokenize`:

In [7]:
te_ara_sents = [s.strip() for t in te_ara.split("\n") for s in sent_tokenize(t)]
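
The split happens in two stages: first on newlines, so that markers, URLs and headings stay separate, then into sentences within each line. The sketch below shows that structure on the opening lines of the corpus. To stay dependency-free it swaps `sent_tokenize` for a naive regex splitter (`naive_sentences` is a hypothetical stand-in), which only breaks after `.`, `!` or `?` followed by whitespace and will not match NLTK's behaviour on abbreviations and similar cases.

```python
import re

def naive_sentences(line):
    # Stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s]

raw = (
    "### New article\n"
    "https://teara.govt.nz/mi/te-mahi-kai\n"
    "Ko te kāinga te pokapū o ngā mahi kai a te Māori. "
    "Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.\n"
)

# Outer loop over lines, inner loop over the sentences found in each line
sents = [s for line in raw.split("\n") for s in naive_sentences(line)]
for s in sents:
    print(s)
```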

In [8]:
# Print the number of sentences in te_ara
print("There are {} sentences in te_ara".format(len(te_ara_sents)))

There are 109617 sentences in te_ara


Here are the first 5 sentences:

In [9]:
te_ara_sents[:5]

['### New article',
 'https://teara.govt.nz/mi/te-mahi-kai',
 'Ko te kāinga te pokapū o ngā mahi kai a te Māori.',
 'Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.',
 'Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.']

Now we inspect the marker lines, i.e. the sentences that start with `#`:

In [10]:
collections.Counter(sent for sent in te_ara_sents if sent.startswith("#"))

Counter({'### New article': 5407})

The only line of this kind in `te_ara` is `'### New article'`, which occurs 5407 times in the corpus. Because it is a single fixed string, it is easy to remove.

In [11]:
te_ara_sents = [sent for sent in te_ara_sents if sent != "### New article"]

In [12]:
# Print the number of sentences in te_ara
print("There are {} sentences in te_ara".format(len(te_ara_sents)))

There are 104210 sentences in te_ara


Next, we remove the URLs from the text:

In [13]:
url_regex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

In [14]:
te_ara_urls = {url_regex.match(sent).group(0) for sent in te_ara_sents if url_regex.match(sent)}

In [15]:
# Print the number of distinct URLs found in te_ara
print("There are {} urls in te_ara".format(len(te_ara_urls)))

There are 5411 urls in te_ara


In [16]:
list(te_ara_urls)[:5]

['https://teara.govt.nz/mi/artwork/11454/ko-hinepukohurangi-raua-ko-te-maunga',
 'https://teara.govt.nz/mi/video/8884/te-hanga-toki',
 'https://teara.govt.nz/mi/artwork/14158/rakaihautu',
 'https://teara.govt.nz/mi/photograph/2190/nga-kupenga-a-nga-iwi-maori-o-whanganui',
 'https://teara.govt.nz/mi/photograph/10900/te-tohu-mokai-a-te-tai-tokerau']

The regex finds 5411 distinct URLs at the start of a sentence in `te_ara`, so we drop every sentence that contains one of them.

In [17]:
%%time
te_ara_sents = [sent for sent in te_ara_sents if not any(url in sent for url in te_ara_urls)]
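
The filter above compares every sentence with every collected URL. A possible alternative, sketched below on a handful of sample sentences, is to let the regex decide in a single pass with `search`. The two approaches are not strictly equivalent: the single pass also drops sentences containing a URL that never appeared at the start of a sentence, which is an assumption about what is wanted here.

```python
import re

url_regex = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

sents = [
    'https://teara.govt.nz/mi/te-mahi-kai',
    'Ko te kāinga te pokapū o ngā mahi kai a te Māori.',
    'https://teara.govt.nz/mi/te-mahi-kai/page-1',
    'Ngā kaihōpara me te hunga tauhokohoko',
]

# Notebook approach: collect the URLs, then test every sentence against every URL
urls = {m.group(0) for m in map(url_regex.match, sents) if m}
kept_nested = [s for s in sents if not any(url in s for url in urls)]

# Single pass: drop a sentence as soon as the regex finds a URL anywhere in it
kept_single = [s for s in sents if not url_regex.search(s)]

print(kept_single)
```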

In [18]:
te_ara_sents[:5]

['Ko te kāinga te pokapū o ngā mahi kai a te Māori.',
 'Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.',
 'Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.',
 'Ngā kaihōpara me te hunga tauhokohoko',
 'Nō te takiwā o ngā tau 1250 – 1300 AD ka tae ngā tīpuna o te Māori ki Aotearoa.']

The next step is to detect any kupu pākehā (English words) in `te_ara`. The `taumahi` library provides a `kupu_pākehā` function for this.

In [19]:
kupu_pākehā = collections.Counter(
    kupu for sent in te_ara_sents for kupu in taumahi.kupu_pākehā(normalize_text(sent), tohutō=False)
)

Inspecting the counter shows that the kupu pākehā in the corpus are sometimes names of people (e.g. 'James Belich'), sometimes names of organisations (e.g. 'Peoples of the Pacific'), and sometimes out-of-vocabulary terms such as 'AD'.

In [20]:
print("There are {} unique kupu pākehā, and {} kupu pākehā in total in te_ara".format(len(kupu_pākehā), sum(kupu_pākehā.values())))

There are 7348 unique kupu pākehā, and 38806 kupu pākehā in total in te_ara
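
One way to see which words dominate such a counter is `Counter.most_common`. The sketch below uses a small made-up list of tokens (`toy_counts` is hypothetical and its counts are not taken from the corpus); on the real data the same call would be made on `kupu_pākehā`.

```python
import collections

# Made-up tokens for illustration only; the real counter is kupu_pākehā
toy_counts = collections.Counter(
    ["ad", "james", "belich", "ad", "pacific", "ad", "james"])

# The n most frequent entries, as (word, count) pairs in descending order
top_two = toy_counts.most_common(2)
print(top_two)
```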


### Removing the kupu pākehā

Now we want to remove the `kupu_pākehā` from the `te_ara` text.

In [21]:
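# Escape each word so it is matched literally, with whitespace or a string boundary on either side
pākehā_regex = re.compile("|".join(sorted(r'\s{0}\s|^{0}\s|\s{0}$'.format(re.escape(s)) for s in kupu_pākehā.keys())))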

In [46]:
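def clean_pākehā(kupu_tōkau, repl = ' ', normalize = True):
    if normalize:
        kupu_tōkau = normalize_text(kupu_tōkau)
    
    # Each match consumes the whitespace around it, so the second of two
    # adjacent kupu pākehā is only caught by a second pass
    kupu_tōkau = pākehā_regex.sub(repl, kupu_tōkau)
    kupu_tōkau = pākehā_regex.sub(repl, kupu_tōkau)
    # Collapse the runs of whitespace left behind
    kupu_tōkau = re.sub(r"\s{2,}", repl, kupu_tōkau)
    
    return kupu_tōkau.strip()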
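
Because each alternative in the pattern consumes the whitespace on both sides of a word, the word directly after a match has no leading whitespace left to match against. Two passes handle pairs and triples of adjacent English words, but a run of four or more can still leave a word behind. The sketch below shows this on a toy vocabulary (`english_words` is a hypothetical stand-in for the keys of `kupu_pākehā`) and compares it with a lookaround-based pattern that checks the boundaries without consuming them. This is one possible alternative, not the notebook's method.

```python
import re

# Hypothetical toy vocabulary standing in for the keys of kupu_pākehā
english_words = sorted(["james", "belich", "smith", "king"])

# Same shape as the notebook's pattern: whitespace around the word is consumed
consuming = re.compile("|".join(
    r"\s{0}\s|^{0}\s|\s{0}$".format(re.escape(w)) for w in english_words))

# Alternative: assert the boundaries with lookarounds, consuming only the word
lookaround = re.compile(
    r"(?<!\S)(?:{})(?!\S)".format("|".join(map(re.escape, english_words))))

def squeeze(text):
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s{2,}", " ", text).strip()

text = "e ai ki a james belich smith king he pai"

one_pass = squeeze(consuming.sub(" ", text))
two_pass = squeeze(consuming.sub(" ", consuming.sub(" ", text)))
with_lookaround = squeeze(lookaround.sub(" ", text))

print(one_pass)
print(two_pass)
print(with_lookaround)
```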

In [49]:
%%time
te_ara_māori_sents = multicore_apply(te_ara_sents, clean_pākehā)

CPU times: user 2.73 s, sys: 1 s, total: 3.73 s
Wall time: 32min 26s
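
Much of that wall time is probably spent trying an alternation of roughly 22,000 patterns (three per unique word) at every position of every sentence. If the entries of `kupu_pākehā` are single whitespace-delimited tokens, which is an assumption about what `taumahi.kupu_pākehā` returns, a set lookup per token does the same job in time linear in the length of the sentence. A minimal sketch, with `drop_tokens` and the sample vocabulary as hypothetical names:

```python
def drop_tokens(text, vocabulary):
    # Keep only the whitespace-delimited tokens that are not in the vocabulary;
    # joining with a single space also normalises the whitespace
    return " ".join(tok for tok in text.split() if tok not in vocabulary)

# Hypothetical stand-in for set(kupu_pākehā)
english_vocab = {"james", "belich", "smith", "king"}

cleaned = drop_tokens("e ai ki a james belich smith king he pai", english_vocab)
print(cleaned)
```

Unlike the regex, this removes runs of any length in one pass, but it would miss multi-word entries such as a full name stored as one string.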


In [55]:
digits_to_text(1234)

'kotahi mano rua rau toru tekau mā whā'

In [57]:
kupu_tōkau = 'nō te takiwā o ngā tau 1250 1300 ka tae ngā tīpuna o te māori ki aotearoa'

In [77]:
def replace_nums(tauriterite):
    tau = tauriterite.group(0)
    if "$" in tau:
        tau = tau[1:]
        digits = digits_to_text(int(tau))
        digits += " tāra"
    else:
        digits = digits_to_text(int(tau))
    return digits

In [78]:
te_ara_māori_sents_no_nums = [re.sub(r"\$?\d+", replace_nums, sent) for sent in te_ara_māori_sents]



In [81]:
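# Note: this starts again from te_ara_māori_sents, so the spelled-out numbers
# from the previous cell are discarded and digits are dropped entirely
te_ara_māori_sents_no_nums = [re.sub(r"\d+", "", sent) for sent in te_ara_māori_sents]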

In [84]:
te_ara_māori_sents_no_nums = [re.sub(r"\s{2,}", " ", sent) for sent in te_ara_māori_sents_no_nums]

In [85]:
with open("../data/te_ara_māori_sents.txt", "w", encoding="utf-8") as f:
    for sent in te_ara_māori_sents_no_nums:
        if len(sent) > 0:
            f.write(sent + "\n")