# HW1: Import external MULTEXT-East 1984 corpus

David J. Birnbaum, djbpitt@pitt.edu, 2019-01-09

## Metadata

The MULTEXT-EAST (MTE) 1984 corpus contains the text of George Orwell’s _1984_, with linguistic annotation, in 12 languages. It is available as item #104 at http://www.nltk.org/nltk_data/. Documentation is at http://nl.ijs.si/ME/V4/ and, in more detail, https://www.clarin.si/repository/xmlui/handle/11356/1043. The principal developer/editor is Tomaž Erjavec, and the license is CC BY-NC-SA 4.0.

## Self-assessment

* **What it does:** Load an external corpus with MTE markup and examine different types of linguistic annotation.
* **What I would like to do but don’t know how:** Understand why Bulgarian and Macedonian are broken, and what it would take to fix them.

## Explore the corpus

NLTK incorporates an MTECorpusReader class that can parse MTE (TEI P5) markup. API documentation is at https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.mte. We’ll use pandas and numpy, so let’s import them up front, along with NLTK.

In [1]:
import nltk
import numpy as np
import pandas as pd

The first argument is the path to the corpus directory; the second is a file mask, that is, a regular expression that selects files by name.

In [2]:
reader = nltk.corpus.reader.MTECorpusReader(r'data/mte_teip5', r'.*\.xml')
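
To see what the file mask selects, here is a minimal sketch that applies the same pattern to a few filenames with Python’s `re` module. The two `oana-` names come from the corpus; `README.txt` is a hypothetical entry added for contrast, and matching the whole name with `re.fullmatch` is an assumption about how the reader applies the mask.

```python
import re

mask = r'.*\.xml'  # same pattern as the second argument above

# 'README.txt' is hypothetical; it is here only to show a non-match
candidates = ['oana-en.xml', 'oana-sk.xml', 'README.txt']

# Keep only the names that the mask matches in full
selected = [name for name in candidates if re.fullmatch(mask, name)]
print(selected)
```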

What methods does the corpus reader expose?

In [3]:
dir(reader)

['_MTECorpusReader__fileids',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_encoding',
 '_fileids',
 '_get_root',
 '_para_block_reader',
 '_root',
 '_sent_tokenizer',
 '_sep',
 '_tagset',
 '_word_tokenizer',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'lemma_paras',
 'lemma_sents',
 'lemma_words',
 'license',
 'open',
 'paras',
 'raw',
 'readme',
 'root',
 'sents',
 'tagged_paras',
 'tagged_sents',
 'tagged_words',
 'unicode_repr',
 'words']
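
Most of that list is Python plumbing. A small sketch of one way to keep only the public names follows. `Demo` is a hypothetical stand-in for the reader so that the example runs without the corpus, and `public_names` is a helper invented for this illustration.

```python
class Demo:
    """Hypothetical stand-in for the corpus reader."""

    def __init__(self):
        self._root = 'data/mte_teip5'  # private by convention

    def fileids(self):
        return []

    def words(self):
        return []


def public_names(obj):
    """Return the attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


names = public_names(Demo())
print(names)
```

Calling `public_names(reader)` on the real reader should leave just the entries from `abspath` onward.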

What files are available in the corpus?

In [4]:
ids = reader.fileids()

How many files are there?

In [5]:
len(ids)

52

How about just the monolingual corpus files?

In [6]:
ids = filter(lambda x: x.startswith("oana"), ids)
# Bulgarian and Macedonian are broken (see below), so we’ll exclude them
ids = filter(lambda x: x not in ["oana-bg.xml", "oana-mk.xml"], ids)
ids = list(ids) # we have to listify this, because we’re going to reuse it, and a filter can’t be reused
ids

['oana-cs.xml',
 'oana-en.xml',
 'oana-et.xml',
 'oana-fa.xml',
 'oana-hu.xml',
 'oana-pl.xml',
 'oana-ro.xml',
 'oana-sk.xml',
 'oana-sl.xml',
 'oana-sr.xml']
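
The comment about listifying deserves a quick demonstration. In this sketch, `other.xml` is a hypothetical filename used only to show the filtering; the point is that a `filter` object is a one-shot iterator.

```python
# 'other.xml' is hypothetical; the oana- names are from the corpus
names = ['oana-en.xml', 'oana-sk.xml', 'other.xml']

lazy = filter(lambda x: x.startswith('oana'), names)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # nothing left to yield

# A list comprehension builds a reusable list directly
reusable = [x for x in names if x.startswith('oana')]

print(first_pass, second_pass, reusable)
```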

How many words are in each version? Let’s use a pandas dataframe instead of a numpy array, since the datatypes (fileid, word count) are heterogeneous.

In [7]:
wc = map(lambda x: len(reader.words(x)), ids)
df = pd.DataFrame(list(zip(ids, wc)), columns = ["Fileid", "Word count"])
df

| | Fileid | Word count |
|---|---|---|
| 0 | oana-cs.xml | 100366 |
| 1 | oana-en.xml | 118424 |
| 2 | oana-et.xml | 94898 |
| 3 | oana-fa.xml | 108427 |
| 4 | oana-hu.xml | 98426 |
| 5 | oana-pl.xml | 97413 |
| 6 | oana-ro.xml | 118325 |
| 7 | oana-sk.xml | 102074 |
| 8 | oana-sl.xml | 112278 |
| 9 | oana-sr.xml | 104290 |


Set the Fileid as the index and the word count as the value:

In [8]:
df.set_index("Fileid", inplace = True)
df

| Fileid | Word count |
|---|---|
| oana-cs.xml | 100366 |
| oana-en.xml | 118424 |
| oana-et.xml | 94898 |
| oana-fa.xml | 108427 |
| oana-hu.xml | 98426 |
| oana-pl.xml | 97413 |
| oana-ro.xml | 118325 |
| oana-sk.xml | 102074 |
| oana-sl.xml | 112278 |
| oana-sr.xml | 104290 |


Which is the longest in word count? The shortest? What’s the mean word count? Pandas can provide max, min, and mean, along with other descriptive statistics:

In [9]:
df.describe()

| | Word count |
|---|---|
| count | 10.0 |
| mean | 105492.1 |
| std | 8520.702703 |
| min | 94898.0 |
| 25% | 98911.0 |
| 50% | 103182.0 |
| 75% | 111315.25 |
| max | 118424.0 |
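
As a cross-check on what those numbers mean, here is a sketch that recomputes them from the ten word counts with the standard library alone. It assumes that the reported `std` is the sample standard deviation (dividing by n - 1) and that the quartiles use linear interpolation, which is what `method='inclusive'` does in `statistics.quantiles`.

```python
import statistics

# Word counts copied from the table above
word_counts = [100366, 118424, 94898, 108427, 98426,
               97413, 118325, 102074, 112278, 104290]

mean = statistics.mean(word_counts)
std = statistics.stdev(word_counts)  # sample standard deviation (n - 1)
q1, median, q3 = statistics.quantiles(word_counts, n=4, method='inclusive')

print(mean, round(std, 6), q1, median, q3)
```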


It would be more useful to report the fileid along with the value, though. Let’s try:

In [10]:
df.idxmax()

Word count    oana-en.xml
dtype: object

Oops! All we want is the value, so use a numerical index to get rid of the rest of the output:

In [11]:
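# .iloc[0] selects by position; plain [0] on a label-indexed series is deprecated in recent pandas
longest_id = df.idxmax().iloc[0]
longest = df.max().iloc[0]
shortest_id = df.idxmin().iloc[0]
shortest = df.min().iloc[0]
average = df.mean().iloc[0]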
print("The longest file is", longest_id, "of length", longest, "; the shortest is",
     shortest_id, "of length", shortest, "; and the average length is", average)

The longest file is oana-en.xml of length 118424 ; the shortest is oana-et.xml of length 94898 ; and the average length is 105492.1
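
Another way to avoid the extra wrapping is to select the column first, so that the methods are called on a series rather than on a dataframe and return plain values. This is a sketch on a three-row subset of the counts above; the name `col` is introduced here only for illustration.

```python
import pandas as pd

# Three of the word counts from the table above
counts = {'oana-cs.xml': 100366, 'oana-en.xml': 118424, 'oana-et.xml': 94898}
df = pd.DataFrame.from_dict(counts, orient='index', columns=['Word count'])

col = df['Word count']       # a series, not a one-column dataframe
longest_id = col.idxmax()    # the index label itself
longest = col.max()          # the value itself
shortest_id = col.idxmin()

print(longest_id, longest, shortest_id)
```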


Can we use `numpy` methods, just because? The `.values` attribute of a pandas dataframe (in this case, a one-column dataframe) is of type numpy.ndarray:

In [12]:
type(df.values)

numpy.ndarray

So we can use numpy methods:

In [13]:
s = df.values
longest = np.max(s)
shortest = np.min(s)
average = np.mean(s)
print("The longest text is of length", longest, ", the shortest is of length", shortest, 
      "and the average is of length", average)

The longest text is of length 118424 , the shortest is of length 94898 and the average is of length 105492.1
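
One detail worth knowing is the shape of the array. The sketch below, again on a three-row subset, suggests that `.values` on a dataframe gives a two-dimensional array even when there is only one column, while converting a single column gives a one-dimensional array. `np.max` returns the same number either way because it reduces over all axes by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Word count': [100366, 118424, 94898]},
                  index=['oana-cs.xml', 'oana-en.xml', 'oana-et.xml'])

table = df.values                      # 2-D: one row per file, one column
column = df['Word count'].to_numpy()   # 1-D: just the values

print(table.shape, column.shape)
print(np.max(table), np.max(column))
```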


## Looking closely at one version

Let’s look at the shape of the Slovak version ...

In [14]:
sents_count = len(reader.sents('oana-sk.xml'))
paras_count = len(reader.paras('oana-sk.xml'))
words_count = len(reader.words('oana-sk.xml'))
print('There are', sents_count, 'sentences,', paras_count, 
      'paragraphs, and', words_count,'words in the Slovak version.')

There are 6354 sentences, 1359 paragraphs, and 102074 words in the Slovak version.
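
Those three counts also give a rough sense of sentence and paragraph length. A small sketch, using the numbers printed above rather than the reader itself:

```python
# Counts reported above for the Slovak version
sents_count = 6354
paras_count = 1359
words_count = 102074

words_per_sent = words_count / sents_count
sents_per_para = sents_count / paras_count

print(f"{words_per_sent:.2f} words per sentence, "
      f"{sents_per_para:.2f} sentences per paragraph")
```

Note that the word count includes punctuation tokens, as the tokenized sentences below show, so the words-per-sentence figure is slightly inflated.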


We can get a list of word-tokenized sentences ...

In [15]:
sk_sents = reader.sents(['oana-sk.xml'])
sk_sents[:3]

[['Bol',
  'jasný',
  ',',
  'ale',
  'chladný',
  'aprílový',
  'deň',
  'a',
  'hodiny',
  'odbíjali',
  'trinástu',
  '.'],
 ['S',
  'bradou',
  'pritlačenou',
  'na',
  'prsia',
  ',',
  'aby',
  'sa',
  'chránil',
  'pred',
  'dotieravým',
  'vetrom',
  ',',
  'Winston',
  'Smith',
  'prekĺzol',
  'rýchlo',
  'cez',
  'sklené',
  'dvere',
  'na',
  'sídlisku',
  'Víťazstvo',
  ',',
  'nie',
  'však',
  'tak',
  'rýchlo',
  ',',
  'aby',
  'zabránil',
  'zvírenému',
  'piesku',
  'a',
  'prachu',
  'vniknúť',
  'dnu',
  's',
  'ním',
  '.'],
 ['V',
  'chodbe',
  'páchla',
  'varená',
  'kapusta',
  'a',
  'staré',
  'handrové',
  'rohožky',
  '.']]

Lemmatized, too ...

In [16]:
sk_lemma_sents = reader.lemma_sents(['oana-sk.xml'])
sk_lemma_sents[:3]

[[('Bol', 'byť'),
  ('jasný', 'jasný'),
  (',', ''),
  ('ale', 'ale'),
  ('chladný', 'chladný'),
  ('aprílový', 'aprílový'),
  ('deň', 'deň'),
  ('a', 'a'),
  ('hodiny', 'hodiny'),
  ('odbíjali', 'odbíjať'),
  ('trinástu', 'trinásty'),
  ('.', '')],
 [('S', 's'),
  ('bradou', 'brada'),
  ('pritlačenou', 'pritlačený'),
  ('na', 'na'),
  ('prsia', 'prsia'),
  (',', ''),
  ('aby', 'aby'),
  ('sa', 'sa'),
  ('chránil', 'chrániť'),
  ('pred', 'pred'),
  ('dotieravým', 'dotieravý'),
  ('vetrom', 'vietor'),
  (',', ''),
  ('Winston', 'winston'),
  ('Smith', 'smith'),
  ('prekĺzol', 'prekĺznuť'),
  ('rýchlo', 'rýchlo'),
  ('cez', 'cez'),
  ('sklené', 'sklený'),
  ('dvere', 'dvere'),
  ('na', 'na'),
  ('sídlisku', 'sídlisko'),
  ('Víťazstvo', 'víťazstvo'),
  (',', ''),
  ('nie', 'nie'),
  ('však', 'však'),
  ('tak', 'tak'),
  ('rýchlo', 'rýchlo'),
  (',', ''),
  ('aby', 'aby'),
  ('zabránil', 'zabrániť'),
  ('zvírenému', 'zvírený'),
  ('piesku', 'piesok'),
  ('a', 'a'),
  ('prachu', 'prach'),
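
Each item is a `(word, lemma)` pair, and in the output above punctuation comes with an empty lemma. Here is a sketch of how one might use that, working on the first sentence copied from the output. Treating an empty lemma as the marker for punctuation is an assumption based only on what is visible here.

```python
# First sentence, copied from the output above
first_sentence = [
    ('Bol', 'byť'), ('jasný', 'jasný'), (',', ''), ('ale', 'ale'),
    ('chladný', 'chladný'), ('aprílový', 'aprílový'), ('deň', 'deň'),
    ('a', 'a'), ('hodiny', 'hodiny'), ('odbíjali', 'odbíjať'),
    ('trinástu', 'trinásty'), ('.', ''),
]

# Drop tokens with an empty lemma (punctuation, by assumption)
lemmas = [lemma for _, lemma in first_sentence if lemma]

# Tokens whose surface form differs from the lemma, ignoring case
inflected = [(word, lemma) for word, lemma in first_sentence
             if lemma and word.lower() != lemma]

print(lemmas)
print(inflected)
```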
 

Sentences with morphological tagging are also available ...

In [17]:
sk_tagged_sents = reader.tagged_sents(['oana-sk.xml'])
sk_tagged_sents[:3]

[[('Bol', '#Vcps-sm-n-----p'),
  ('jasný', '#Afpmsn'),
  (',', ''),
  ('ale', '#Cs'),
  ('chladný', '#Afpmsn'),
  ('aprílový', '#Afpmsn'),
  ('deň', '#Ncmsn'),
  ('a', '#Cc'),
  ('hodiny', '#Ncfpn'),
  ('odbíjali', '#Vmps-pf-n-----p'),
  ('trinástu', '#Mofsal--f'),
  ('.', '')],
 [('S', '#Spsi'),
  ('bradou', '#Ncfsi'),
  ('pritlačenou', '#Afpfsi'),
  ('na', '#Spsa'),
  ('prsia', '#Ncnpa'),
  (',', ''),
  ('aby', '#Cs'),
  ('sa', '#Px---a--ypn'),
  ('chránil', '#Vmps-sm-n-----p'),
  ('pred', '#Spsi'),
  ('dotieravým', '#Afpmsi'),
  ('vetrom', '#Ncmsi'),
  (',', ''),
  ('Winston', '#Npmsn--y'),
  ('Smith', '#Npmsn--y'),
  ('prekĺzol', '#Vmps-sm-n-----e'),
  ('rýchlo', '#R-p'),
  ('cez', '#Spsa'),
  ('sklené', '#Afpfpa'),
  ('dvere', '#Ncfpa'),
  ('na', '#Spsl'),
  ('sídlisku', '#Ncnsl'),
  ('Víťazstvo', '#Ncnsn'),
  (',', ''),
  ('nie', '#Q'),
  ('však', '#Q'),
  ('tak', '#R-p'),
  ('rýchlo', '#R-p'),
  (',', ''),
  ('aby', '#Cs'),
  ('zabránil', '#Vmps-sm-n-----e'),
  ('zvírenému', '#A
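
The tags are MULTEXT-East morphosyntactic descriptions, in which the first character after the `#` names the part of speech. The sketch below tallies parts of speech for the first sentence, copied from the output above. The `CATEGORIES` table covers only the letters that appear in this output, the helper `category` is invented for this illustration, and treating an empty tag as punctuation is an assumption.

```python
from collections import Counter

# Partial lookup table: only the category letters seen in the output above
CATEGORIES = {
    'N': 'Noun', 'V': 'Verb', 'A': 'Adjective', 'P': 'Pronoun',
    'R': 'Adverb', 'S': 'Adposition', 'C': 'Conjunction',
    'M': 'Numeral', 'Q': 'Particle',
}


def category(tag):
    """Map a tag such as '#Ncmsn' to a part-of-speech name."""
    code = tag.lstrip('#')
    if not code:
        return 'Punctuation'  # assumption: empty tag means punctuation
    return CATEGORIES.get(code[0], 'Other')


# First sentence, copied from the output above
first_sentence = [
    ('Bol', '#Vcps-sm-n-----p'), ('jasný', '#Afpmsn'), (',', ''),
    ('ale', '#Cs'), ('chladný', '#Afpmsn'), ('aprílový', '#Afpmsn'),
    ('deň', '#Ncmsn'), ('a', '#Cc'), ('hodiny', '#Ncfpn'),
    ('odbíjali', '#Vmps-pf-n-----p'), ('trinástu', '#Mofsal--f'), ('.', ''),
]

counts = Counter(category(tag) for _, tag in first_sentence)
print(counts)
```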

## Discovery 

Bulgarian seems to be broken ... eek!

In [18]:
bg_tagged_sents = reader.tagged_sents(['oana-bg.xml'])
bg_tagged_sents[:3]

ValueError: concat() expects at least one object!

Apparently this is because Bulgarian and Macedonian are said not to be TEI P5 conformant (although they validate against the Relax NG schema that accompanies the distribution). The module source at http://www.nltk.org/_modules/nltk/corpus/reader/mte.html includes a hint:

```python
# filter multext-east sourcefiles that are not compatible to the teip5 specification
fileids = filter(lambda x: x not in ["oana-bg.xml", "oana-mk.xml"], fileids)
```

This seems to say that the Bulgarian and Macedonian filenames should not be returned by the `.fileids()` method, which is peculiar, since when we ran `reader.fileids()` above, they were. The XML markup is not uniform within the corpus (for example, only Bulgarian and English contain `@function` attributes), so perhaps the answer lies there, but that’s a topic best explored in an XML environment.
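
Rather than hard-coding the two excluded filenames, one could probe each file and record the ones that fail. The following is only a sketch: `find_unreadable` is a helper invented here, and `fake_load` is a hypothetical stand-in that imitates the error seen above so that the example runs without the corpus.

```python
def find_unreadable(load, fileids):
    """Call load(fileid) for each file and collect those that raise ValueError."""
    broken = []
    for fileid in fileids:
        try:
            load(fileid)
        except ValueError as err:
            broken.append((fileid, str(err)))
    return broken


def fake_load(fileid):
    """Hypothetical loader that fails the way the real reader did above."""
    if fileid in ('oana-bg.xml', 'oana-mk.xml'):
        raise ValueError('concat() expects at least one object!')
    return []


broken = find_unreadable(fake_load, ['oana-bg.xml', 'oana-en.xml', 'oana-mk.xml'])
print(broken)
```

With the real reader, something like `lambda f: reader.tagged_sents([f])[:1]` might serve as `load`, though that is untested here, and it assumes that the error surfaces as soon as the first sentence is requested.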