# Hungarian tools and corpora

If you'd like to work on a different language, we'll help you on your way. Here's a notebook for Hungarian. It's separated in two sections: tools and corpora. The first part shows you how to get a Hungarian corpus, while the second part shows you how to tag sentences.

This is just a small sample of resources out there for Hungarian, but these suffice for now. I will extend this notebook as needed (send me an email).

## Corpora

There are several Hungarian corpora, but the [Webcorpus](http://mokk.bme.hu/en/resources/webcorpus/) seems the most accessible, and it's big enough to find interesting stuff. There are several versions of the webcorpus, but the only one made available online is the "4% version". This means that it contains only 4% of the entire corpus. Still, we're talking about over a million web pages, containing 589 million tokens and over 7 million types. That's a lot of words!

These are the files you can download. Start with the first file, and only download the others if you need more data:

```
web2-4p-0.tar.gz        09-Jun-2004 22:15  365M
web2-4p-1.tar.gz        09-Jun-2004 22:22  377M
web2-4p-2.tar.gz        09-Jun-2004 22:30  371M
web2-4p-3.tar.gz        09-Jun-2004 22:38  373M
web2-4p-4.tar.gz        09-Jun-2004 22:46  366M
web2-4p-5.tar.gz        09-Jun-2004 22:54  370M
web2-4p-6.tar.gz        09-Jun-2004 23:01  372M
web2-4p-7.tar.gz        09-Jun-2004 23:09  371M
web2-4p-8.tar.gz        09-Jun-2004 23:17  370M
web2-4p-9.tar.gz        09-Jun-2004 23:25  375M
```

I've downloaded `web2-4p-0.tar.gz` and unpacked it, ending up with a folder called `content` that has loads of files in it. Let's print one of the files and see what it looks like:

In [5]:
# Note the encoding for this file! There wasn't any note on the website, but it's not utf-8!
# I found the right encoding by looking for 'Hungarian encoding' on Google.
# Here's one of the pages I found: https://en.wikipedia.org/wiki/ISO/IEC_8859

with open('/Users/Emiel/Downloads/content/0000000042', encoding='ISO-8859-2') as f:
    # Print the first 500 characters.
    print(f.read()[:500])

<s>Megöntözöd onnan fentről a hegyeket, alkotásaid gyümölcsével jól tartod a földet.
<w>Megöntözöd
</w>
<w>onnan
</w>
<w>fentről
</w>
<w>a
</w>
<w>hegyeket
</w>
<c>,</c>
<w>alkotásaid
</w>
<w>gyümölcsével
</w>
<w>jól
</w>
<w>tartod
</w>
<w>a
</w>
<w>földet
</w>
<c>.</c>
</s>
<s>Füvet sarjasztasz az állatoknak, növényeket a földművelő embernek, hogy kenyeret termeljen a földből.
<w>Füvet
</w>
<w>sarjasztasz
</w>
<w>az
</w>
<w>állatoknak
</w>
<c>,</c>
<w>növényeket
</w>
<w>a
</w>
<w>földművelő
</w


This format is a kind of minimal, raw XML. Some observations:

* The file has no root covering all the sentences.
* Each sentence is within `<s>...</s>`-tags.
* The sentences are tokenized, with all words in `<w>...</w>`-tags.
* There's a separate tag for punctuation: `<c>...</c>`.

We cannot parse this kind of XML with a regular XML parser, because there is no root. But this does work:

In [19]:
from lxml import etree

with open('/Users/Emiel/Downloads/content/0000000042', encoding='ISO-8859-1') as f:
    corpus_data = f.read()

root = etree.HTML(corpus_data)

# Somehow etree.HTML introduces P-elements in the corpus. I don't know why.
sentences = root.xpath('.//s')
print(sentences)

[<Element s at 0x104d3a4c8>, <Element s at 0x104d54e48>, <Element s at 0x104d54cc8>, <Element s at 0x104d54588>, <Element s at 0x104d54448>, <Element s at 0x104d54fc8>, <Element s at 0x104d543c8>, <Element s at 0x104d54808>, <Element s at 0x104d50e08>, <Element s at 0x104d505c8>, <Element s at 0x104d50f08>, <Element s at 0x104d50f48>, <Element s at 0x104d50ec8>, <Element s at 0x104d3e3c8>, <Element s at 0x104d3e288>, <Element s at 0x104d3e2c8>, <Element s at 0x104d3e308>, <Element s at 0x104d3e748>, <Element s at 0x104d3e788>, <Element s at 0x104d3e7c8>, <Element s at 0x104d3e808>, <Element s at 0x104d3e848>, <Element s at 0x104d3e888>, <Element s at 0x104d3e8c8>, <Element s at 0x104d3e908>, <Element s at 0x104d3e948>, <Element s at 0x104d3e988>, <Element s at 0x104d3e9c8>, <Element s at 0x104d3ea08>]


Now we have a list of elements, but what can we do with those?

In [36]:
# Let's get the first sentence to experiment with.
first_sentence = sentences[0]

# Get the tag of the first sentence.
print(first_sentence.tag)

s


In [None]:
# Here's how to get the tokens for the first sentence.
tokens = []
for token in first_sentence:
    # All text is surrounded by newline characters. Strip() removes that.
    text = token.text.strip()
    tokens.append(text)
print(tokens)

Now you have a list of tokens. This is excellent for the second part of the notebook!

Some final hints for working with this corpus:

* Write a function to open a file, load the data, and return the sentences (where each sentence is a list of tokens).
* Use the `glob`-module to create a list of all corpus files.
* Advanced students may wish to write a generator function that yields sentences for the entire corpus.

## Tools

The second part of this notebook shows you how to use the `hunpos` part-of-speech tagger. This is a multiplatform tagger that you can download from [here](https://code.google.com/archive/p/hunpos/downloads). Please download both the tagger (`hunpos-1.0-...`) and the Hungarian model (`hu_szeged_kr.model.gz`). Unpack both files on your computer.

The tagger is pretty easy to use. Here's how.

In [37]:
from nltk.tag import hunpos

# Change these paths for your computer.
# On Windows, I suspect you have to use hunpos-tag.exe.
path_to_model = '/Users/Emiel/Downloads/hu_szeged_kr.model'
path_to_tagger = '/Users/Emiel/Downloads/hunpos-1.0-macosx/hunpos-tag'

tagger = hunpos.HunposTagger(path_to_model, path_to_tagger)

In [39]:
tagged_sentence = tagger.tag(tokens)
print(tagged_sentence)

[('Megöntözöd', b'NOUN'), ('onnan', b'ADV'), ('fentrõl', b'ADV'), ('a', b'ART'), ('hegyeket', b'NOUN<PLUR><CAS<ACC>>'), (',', b'PUNCT'), ('alkotásaid', b'NUM'), ('gyümölcsével', b'NOUN<CAS<INS>>'), ('jól', b'ADV'), ('tartod', b'VERB<PERS<2>><DEF>'), ('a', b'ART'), ('földet', b'NOUN<CAS<ACC>>'), ('.', b'PUNCT')]


The only downside right now is that the tags are in bytes. You should get rid of those as soon as possible, to prevent any errors in your program. Here's how to convert to unicode:

In [42]:
bytes_pos = b'NOUN'
uni_pos = bytes_pos.decode('utf-8')
print(uni_pos)

NOUN


It's probably a good idea to write a *wrapper function* if you plan to use the tagger often. Here's a template:

In [43]:
def tag(tokens, tagger=tagger):
    """
    Wrapper around an instance of HunposTagger.
    This function ensures that the tags are in unicode.
    """
    result = tagger.tag(tokens)
    # Create a new list to hold the unicode tokens.
    # Loop over the result and convert the tags to utf-8.
    # HINT: For your loop, remember multiple assignment!
    # Return the new result.