# Exercise Sheet 7


## Exercise 1 (Processing the BNC XML Corpus)

Write a Python program that

- reads a sample of the British National Corpus using the NLTK XML corpus reader (https://www.nltk.org/_modules/nltk/corpus/reader/bnc.html),

- prints the tagged words and sentences (with `tagged_words()` and `tagged_sents()`, respectively),

- computes bigrams,

- and extracts the metadata from the XML document content (`<teiHeader>`) with `xml.etree`.

You can either download the BNC sample via NLTK:
```python
import nltk
nltk.download('bnc')
```
or download it manually here: https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2553

### Reading the corpus and NLTK corpus methods:

In [1]:
import nltk
from nltk.corpus.reader.bnc import BNCCorpusReader

# nltk.download('bnc')  # uncomment to download the sample via NLTK

bnc_reader = BNCCorpusReader(root="BNC-sample/Texts", fileids=r'(\w*)/\w*\.xml')

In [2]:
%%bash
ls -R BNC-sample

Texts
bncHdr.xml

BNC-sample/Texts:
news

BNC-sample/Texts/news:
A1K.xml
A3E.xml


In [3]:
len(bnc_reader.words())

4202

In [4]:
print(*bnc_reader.words()[:100])  # * unpacks the word list into separate print() arguments

PUB ROCK / Pint-sized rock ‘ n ’ rollers : Jim White steps back in time to sample the fare at his local By JIM WHITE According to those on the inside track , Raggahouse is the latest fad in clubland . This hybrid mix of reggae and hip hop follows acid jazz , Belgian New Beat and acid swing — the wholly forgettable contribution of Jive Bunny — as the sound to set disco feet tapping . But no matter how often the archives are ransacked for old genres to bastardise , you do n't have to be at


In [5]:
print(bnc_reader.tagged_words()[1:20])

[('ROCK', 'SUBST'), ('/', 'PUN'), ('Pint-sized', 'ADJ'), ('rock', 'SUBST'), ('‘', 'PUQ'), ('n', 'SUBST'), ('’', 'PUQ'), ('rollers', 'SUBST'), (':', 'PUN'), ('Jim', 'SUBST'), ('White', 'SUBST'), ('steps', 'VERB'), ('back', 'ADV'), ('in', 'PREP'), ('time', 'SUBST'), ('to', 'PREP'), ('sample', 'VERB'), ('the', 'ART'), ('fare', 'SUBST')]
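Since `tagged_words()` yields plain `(word, tag)` tuples, they can be processed with ordinary Python tools. The following is a minimal sketch, assuming a short hand-copied excerpt of the output above instead of the corpus reader; the variable names `tagged`, `tag_counts`, and `nouns` are illustrative only.

```python
from collections import Counter

# A few (word, tag) pairs copied by hand from the output above
tagged = [('ROCK', 'SUBST'), ('/', 'PUN'), ('Pint-sized', 'ADJ'),
          ('rock', 'SUBST'), ('Jim', 'SUBST'), ('White', 'SUBST'),
          ('steps', 'VERB'), ('back', 'ADV'), ('in', 'PREP'),
          ('time', 'SUBST')]

# Count how often each tag occurs in the excerpt
tag_counts = Counter(tag for _, tag in tagged)

# Collect the words that carry a particular tag
nouns = [word for word, tag in tagged if tag == 'SUBST']

print(tag_counts)
print(nouns)
```

With the real corpus, `tagged` could presumably be replaced by `bnc_reader.tagged_words()`.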


In [15]:
print(bnc_reader.tagged_sents()[2])

[('According', 'VERB'), ('to', 'PREP'), ('those', 'ADJ'), ('on', 'PREP'), ('the', 'ART'), ('inside', 'ADJ'), ('track', 'SUBST'), (',', 'PUN'), ('Raggahouse', 'SUBST'), ('is', 'VERB'), ('the', 'ART'), ('latest', 'ADJ'), ('fad', 'SUBST'), ('in', 'PREP'), ('clubland', 'SUBST'), ('.', 'PUN')]


#### Computing bigrams:

In [7]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words())
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(scored[0:10])

[(('of', 'the'), 0.006663493574488339), (('.', 'The'), 0.004997620180866254), ((',', 'but'), 0.004283674440742504), (('for', 'the'), 0.003093764873869586), (('to', 'be'), 0.003093764873869586), ((',', 'the'), 0.0028557829604950024), (('in', 'the'), 0.0028557829604950024), (('pub', 'rock'), 0.0028557829604950024), (('to', 'the'), 0.0028557829604950024), ((')', '.'), 0.0019038553069966682)]
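The scores above are consistent with `raw_freq` being the bigram count divided by the total number of tokens: 0.006663... × 4202 ≈ 28, which suggests that `('of', 'the')` occurs 28 times. The same computation can be sketched with the standard library alone. The function `bigram_raw_freq` and the toy token list are assumptions for illustration, and the ordering of tied scores may differ from what NLTK returns.

```python
from collections import Counter

def bigram_raw_freq(tokens):
    """Return (bigram, count / number_of_tokens) pairs, highest score first."""
    # Pair each token with its successor to form the bigrams
    counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    # Sort by descending score; ties are broken alphabetically by bigram
    return sorted(((bigram, count / total) for bigram, count in counts.items()),
                  key=lambda item: (-item[1], item[0]))

# Toy data, not taken from the BNC
tokens = "the pub rock of the pub rock scene".split()
scored_toy = bigram_raw_freq(tokens)
print(scored_toy[:2])
```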


### Extracting XML metadata:

In [None]:
# For access to the complete XML data structure, use the ``xml()`` method.
# For access to simple word lists and tagged word lists, use ``words()``, ``sents()``, ``tagged_words()``, and ``tagged_sents()``.
# https://www.nltk.org/_modules/nltk/corpus/reader/bnc.html

In [9]:
import xml.etree.ElementTree as ET

file = bnc_reader.open('news/A1K.xml')
tree = ET.parse(file)

# create the etree root object
root = tree.getroot()
type(root)

xml.etree.ElementTree.Element

In [12]:
# alternatively, with the NLTK corpus reader's xml() method:
root = bnc_reader.xml('news/A1K.xml')

In [13]:
root.text  # no text directly under the <bncDoc> root node!

In [14]:
for el in list(root):
    print(el)

<Element 'teiHeader' at 0x15a518590>
<Element 'wtext' at 0x15a684680>
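Besides iterating over the children, individual elements can be addressed by tag path with `find()`. The sketch below uses a hypothetical, heavily shortened stand-in document rather than the real `A1K.xml`; the element names below `<teiHeader>` and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document (NOT the real A1K.xml)
sample = """<bncDoc>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example title</title></titleStmt>
      <publicationStmt><idno type="bnc">X00</idno></publicationStmt>
    </fileDesc>
  </teiHeader>
  <wtext type="NEWS"/>
</bncDoc>"""

demo_root = ET.fromstring(sample)

# Tags of the direct children of the root node
child_tags = [child.tag for child in demo_root]

# Address an element by its path relative to the root ...
title = demo_root.find('teiHeader/fileDesc/titleStmt/title')
# ... or search all descendants for the first matching tag
idno = demo_root.find('.//idno')

print(child_tags)
print(title.text, idno.text, idno.get('type'))
```

Addressing elements by tag is less brittle than positional access such as `root[0][0]`, which silently breaks if the order of the children changes.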


### itertext() iterates over an XML node and all of its descendants and extracts their text:

In [123]:
for text in list(root[0][0].itertext()):  # content of <fileDesc>
    print(text)

  Independent, electronic edition of 1989-10-02: Listings section. Sample containing about 1842 words from a periodical (domain: arts) 
 Data capture and transcription 
 Oxford University Press 
 
BNC XML Edition, December 2006
 1842 tokens; 1863 w-units; 94 s-units 
Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
 This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.
A1K
 IaList 
Independent, electronic edition of 1989-10-02: Listings section.
 
Newspaper Publishing plc
 
London
 
1989
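The output contains many whitespace-only lines because `itertext()` also yields the indentation between elements. A minimal sketch of filtering these out, assuming a small hypothetical `<fileDesc>` fragment in place of the real header:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with indentation between the elements
demo = ET.fromstring(
    "<fileDesc>\n"
    "  <titleStmt><title> Example title </title></titleStmt>\n"
    "  <publicationStmt><idno>X00</idno></publicationStmt>\n"
    "</fileDesc>"
)

# itertext() yields every text and tail string, including pure whitespace
raw_chunks = list(demo.itertext())

# Keep only chunks with visible content and trim surrounding whitespace
clean_chunks = [chunk.strip() for chunk in raw_chunks if chunk.strip()]

print(clean_chunks)
```

Applied to the real header, the same filter would presumably be written as `[t.strip() for t in root[0][0].itertext() if t.strip()]`.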
 
 


### Iterative extraction from several corpus files with the NLTK xml() method:

In [15]:
fileids = ['news/A1K.xml', 'news/A3E.xml']

for fileid in fileids:
    root = bnc_reader.xml(fileid)
    for text in list(root[0][0].itertext()):
        print(text)

  Independent, electronic edition of 1989-10-02: Listings section. Sample containing about 1842 words from a periodical (domain: arts) 
 Data capture and transcription 
 Oxford University Press 
 
BNC XML Edition, December 2006
 1842 tokens; 1863 w-units; 94 s-units 
Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
 This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.
A1K
 IaList 
Independent, electronic edition of 1989-10-02: Listings section.
 
Newspaper Publishing plc
 
London
 
1989
 
 
  Independent, electronic edition of 1989-10-07: Gardening pages. Sample containing about 1831 words from a periodical (domain: leisure) 
 Data capture and transcription 
 Oxford University Press 
 
BNC XML Edition, December 2006
 1831 tokens; 1846 w-units; 92 s-units 
Distributed under licen