***Lecture 'Syntax of Natural Languages'***

--- 
# Intro Lecture 3: Syntactic Categories


In [1]:
import nltk

---

## POS-Tagsets
- http://www.nltk.org/book/ch05.html#tab-universal-tagset
- https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/mitarbeiter-innen/hagen/STTS_Tagset_Tiger
- https://universaldependencies.org/u/pos/

In [2]:
#nltk: Penn Treebank POS Tagset:
nltk.help.upenn_tagset('VBG')

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


## Syntactic Tagsets

### Phrase Structure
- http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
- https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/mitarbeiter-innen/hagen/Tiger_Knotenlabels

### Dependency Relations
- https://universaldependencies.org/u/dep/

---

## Parts of Speech (Preterminals)

### Generating POS Patterns with Recursive Phrase Structure Rules

In [3]:
# http://www.nltk.org/howto/generate.html
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | N
    VP -> V NP | VP PP
    Det -> 'Det'
    N -> 'N'
    V -> 'V'
    P -> 'P'
""")

# Generation (`depth` bounds the depth of the derivation tree):
from nltk.parse.generate import generate
print('Number of POS patterns: ', 
    '\n\tat maximum depth 6:', len(list(generate(grammar, depth=6))), 
    '\n\tat maximum depth 7:', len(list(generate(grammar, depth=7))),
    '\n\tat maximum depth 8:', len(list(generate(grammar, depth=8))))

Number of POS patterns:  
	at maximum depth 6: 24 
	at maximum depth 7: 64 
	at maximum depth 8: 408
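
To see where these counts come from, here is a minimal pure-Python sketch of depth-bounded generation for the same grammar. It is not NLTK's implementation: the names `RULES` and `expand` are made up for illustration, the preterminals `Det`, `N`, `V`, `P` are treated as terminal symbols, and terminals do not consume depth, so the sketch's `depth` values are numbered differently from NLTK's `depth` argument. Under these assumptions, depths 4, 5 and 6 yield 24, 64 and 408 derivations, the same counts as in the output above. The growth comes from the recursive rules `NP -> Det N PP` and `VP -> VP PP`: each additional level admits one more layer of PP embedding.

```python
from itertools import product

# Same phrase structure rules as above; Det, N, V, P act as terminals here.
RULES = {
    "S": [["NP", "VP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Det", "N", "PP"], ["N"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
}

def expand(symbol, depth):
    """Yield every terminal sequence derivable from `symbol` using at most
    `depth` nested rule applications (one sequence per derivation)."""
    if symbol not in RULES:      # terminal symbol: nothing left to expand
        yield [symbol]
        return
    if depth == 0:               # depth budget exhausted: prune this branch
        return
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol with the reduced budget,
        # then combine the alternatives of all children.
        child_options = [list(expand(child, depth - 1)) for child in rhs]
        for combination in product(*child_options):
            yield [token for part in combination for token in part]

patterns_by_depth = {d: [" ".join(p) for p in expand("S", d)] for d in (4, 5, 6)}
for d, patterns in patterns_by_depth.items():
    # depth, number of derivations, number of distinct POS patterns
    print(d, len(patterns), len(set(patterns)))
```

Each derivation is listed separately, so a pattern such as `Det N V Det N P Det N` appears once per possible attachment of the PP (inside the object NP or at VP level). From depth 5 on, the number of distinct patterns is therefore smaller than the number of derivations.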


In [4]:
from nltk.parse.generate import generate
for sentence in generate(grammar, depth=7):
    print(' '.join(sentence))

Det N V Det N
Det N V Det N P Det N
Det N V Det N P N
Det N V N
Det N V Det N P Det N
Det N V Det N P N
Det N V N P Det N
Det N V N P N
Det N V Det N P Det N P Det N
Det N V Det N P Det N P N
Det N V Det N P N P Det N
Det N V Det N P N P N
Det N V N P Det N P Det N
Det N V N P Det N P N
Det N V N P N P Det N
Det N V N P N P N
Det N P Det N V Det N
Det N P Det N V Det N P Det N
Det N P Det N V Det N P N
Det N P Det N V N
Det N P Det N V Det N P Det N
Det N P Det N V Det N P N
Det N P Det N V N P Det N
Det N P Det N V N P N
Det N P Det N V Det N P Det N P Det N
Det N P Det N V Det N P Det N P N
Det N P Det N V Det N P N P Det N
Det N P Det N V Det N P N P N
Det N P Det N V N P Det N P Det N
Det N P Det N V N P Det N P N
Det N P Det N V N P N P Det N
Det N P Det N V N P N P N
Det N P N V Det N
Det N P N V Det N P Det N
Det N P N V Det N P N
Det N P N V N
Det N P N V Det N P Det N
Det N P N V Det N P N
Det N P N V N P Det N
Det N P N V N P N
Det N P N V Det N P Det N P Det N
Det N P N V De

### Finding Context-Equivalent Words in the Corpus with NLTK (Paradigmatic Dimension)

http://www.nltk.org/book/ch05.html#using-a-tagger:

> Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from superficial analysis [of] the distribution of words in text. Consider the following analysis involving *woman* (a noun), *bought* (a verb), *over* (a preposition), and *the* (a determiner). The `text.similar()` method takes a word *w*, finds all contexts *w1 w w2*, then finds all words *w'* that appear in the same context, i.e. *w1 w' w2*.
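
The procedure described in the quote can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy corpus: `similar_words` and `toy_tokens` are hypothetical names, and the ranking by number of shared contexts is an assumption that does not reproduce the scoring NLTK uses internally.

```python
from collections import Counter, defaultdict

def similar_words(tokens, word):
    """Rank words by how many contexts (w1, _, w2) they share with `word`."""
    contexts = defaultdict(Counter)          # word -> Counter of (left, right)
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[mid][(left, right)] += 1
    target = set(contexts[word])             # contexts in which `word` occurs
    scores = Counter()
    for other, ctx in contexts.items():
        shared = target & set(ctx)
        if other != word and shared:
            scores[other] = len(shared)
    # Most shared contexts first; ties broken alphabetically.
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

toy_tokens = ("the woman bought a car . the man bought a house . "
              "the girl sold a car .").split()
print(similar_words(toy_tokens, "woman"))
print(similar_words(toy_tokens, "car"))
```

In the toy corpus, *woman* and *man* both occur in the context *the _ bought*, so each is returned as similar to the other, while *girl* (context *the _ sold*) is not.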



In [5]:
#http://www.nltk.org/book/ch05.html#using-a-tagger

from nltk.corpus import brown
text = nltk.Text(word.lower() for word in brown.words())

> Observe that searching for *woman* finds nouns; searching for *bought* mostly finds verbs; searching for *over* generally finds prepositions; searching for *the* finds several determiners. A tagger can correctly identify the tags on these words in the context of a sentence, e.g. *The woman bought over $150,000 worth of clothes.* (http://www.nltk.org/book/ch05.html#using-a-tagger)



In [6]:
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


In [7]:
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt


In [8]:
text.similar('over')

in on to of and for with from at by that into as up out down through
is all about


In [9]:
text.similar('the')

a his this their its her an that our any all one these my in your no
some other and


--- 
### Finding Nominal Patterns in the Corpus (Syntagmatic Dimension)

http://www.nltk.org/book/ch05.html#nouns:

> Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as `(('The', 'DET'), ('Fulton', 'NP'))` and `(('Fulton', 'NP'), ('County', 'N'))`. Then we construct a `FreqDist` from the tag parts of the bigrams.


In [10]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
fdist.most_common()

[('NOUN', 7959),
 ('DET', 7373),
 ('ADJ', 4761),
 ('ADP', 3781),
 ('.', 2796),
 ('VERB', 1842),
 ('CONJ', 938),
 ('NUM', 894),
 ('ADV', 186),
 ('PRT', 94),
 ('PRON', 19),
 ('X', 11)]

> This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as `NUM`). (http://www.nltk.org/book/ch05.html#nouns)
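
To put the raw counts in proportion, the following sketch converts them into relative frequencies. The numbers are copied from the output above; `noun_preceder_counts` and `share` are names introduced here for illustration only.

```python
# Counts copied from the output above (tags preceding a NOUN, Brown 'news').
noun_preceder_counts = {
    'NOUN': 7959, 'DET': 7373, 'ADJ': 4761, 'ADP': 3781, '.': 2796,
    'VERB': 1842, 'CONJ': 938, 'NUM': 894, 'ADV': 186, 'PRT': 94,
    'PRON': 19, 'X': 11,
}
total = sum(noun_preceder_counts.values())

def share(*tags):
    """Relative frequency of the given tags among all noun preceders."""
    return sum(noun_preceder_counts[t] for t in tags) / total

print(total)
print(f"DET + ADJ + NUM: {share('DET', 'ADJ', 'NUM'):.1%}")
print(f"NOUN:            {share('NOUN'):.1%}")
```

Determiners, adjectives and numerals together account for roughly 42.5% of the 30,654 noun preceders. The single most frequent preceding tag, however, is `NOUN` itself (about 26%), which points to noun-noun sequences as a further nominal pattern.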


--- 

### Adjectives as a Class of Distributionally Equivalent Words

#### Finding Distributionally Equivalent Words (Occurrence in the Same Contexts)

In [11]:
from nltk.corpus import brown
text = nltk.Text(word.lower() for word in brown.words())
text.similar('big')

little new first good small large great the old other strong young
major white second short beautiful a best long


#### POS Contexts of Distributionally Equivalent Words (as Representatives of a Distribution Class)

In [12]:
# Left and right POS context for a set of distributionally equivalent words:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_trigrams = nltk.trigrams(brown_news_tagged)
target_words = ('big', 'little', 'new', 'first', 'good', 'small', 'large', 'great')
adj_contexts = [(a[1], c[1]) for (a, b, c) in word_tag_trigrams if b[0] in target_words]

fdist = nltk.FreqDist(adj_contexts)
fdist.most_common()

[(('DET', 'NOUN'), 193),
 (('ADP', 'NOUN'), 42),
 (('VERB', 'NOUN'), 36),
 (('DET', 'ADJ'), 30),
 (('DET', 'NUM'), 22),
 (('NOUN', 'NOUN'), 13),
 (('DET', 'ADP'), 13),
 (('ADJ', 'NOUN'), 12),
 (('CONJ', 'NOUN'), 10),
 (('NUM', 'NOUN'), 10),
 (('NOUN', 'ADJ'), 8),
 (('ADV', 'NOUN'), 8),
 (('VERB', '.'), 7),
 (('.', 'NOUN'), 6),
 (('DET', '.'), 5),
 (('ADV', 'ADP'), 5),
 (('ADV', '.'), 5),
 (('NOUN', '.'), 4),
 (('VERB', 'VERB'), 4),
 (('DET', 'VERB'), 4),
 (('NOUN', 'VERB'), 4),
 (('ADP', 'ADJ'), 3),
 (('NUM', 'ADJ'), 3),
 (('ADP', '.'), 3),
 (('PRON', 'VERB'), 3),
 (('PRT', 'NOUN'), 3),
 (('ADV', 'DET'), 3),
 (('ADV', 'VERB'), 3),
 (('ADJ', 'ADJ'), 2),
 (('VERB', 'ADJ'), 2),
 (('VERB', 'ADV'), 2),
 (('DET', 'PRT'), 2),
 (('ADP', 'VERB'), 2),
 (('ADP', 'CONJ'), 2),
 (('DET', 'ADV'), 2),
 (('PRT', 'VERB'), 1),
 (('VERB', 'DET'), 1),
 (('DET', 'X'), 1),
 (('DET', 'CONJ'), 1),
 (('NOUN', 'NUM'), 1),
 (('ADV', 'CONJ'), 1),
 (('ADP', 'ADP'), 1),
 (('ADP', 'NUM'), 1),
 (('VERB', 'CONJ'), 1),
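
Selecting the middle token by word form (as here) and selecting it by tag (as in the next cell) differ only in the filter condition. A small helper makes this explicit. It is a sketch on a made-up toy corpus: `context_counts` and `toy_tagged` are hypothetical names, and plain `zip` and `collections.Counter` stand in for `nltk.trigrams` and `nltk.FreqDist`.

```python
from collections import Counter

def context_counts(tagged_words, is_target):
    """Count (left tag, right tag) pairs around every middle token
    for which `is_target(word, tag)` is true."""
    trigrams = zip(tagged_words, tagged_words[1:], tagged_words[2:])
    return Counter(
        (left[1], right[1])
        for left, mid, right in trigrams
        if is_target(*mid)
    )

toy_tagged = [
    ('the', 'DET'), ('big', 'ADJ'), ('dog', 'NOUN'), ('saw', 'VERB'),
    ('a', 'DET'), ('small', 'ADJ'), ('cat', 'NOUN'), ('.', '.'),
    ('the', 'DET'), ('cat', 'NOUN'), ('is', 'VERB'), ('big', 'ADJ'), ('.', '.'),
]

# Filter by word form (a set of representative words) ...
by_word = context_counts(toy_tagged, lambda word, tag: word in {'big'})
# ... or by tag (the whole category).
by_tag = context_counts(toy_tagged, lambda word, tag: tag == 'ADJ')
print(by_word.most_common())
print(by_tag.most_common())
```

If the chosen words are good representatives of their distribution class, the two context profiles should look alike, with `('DET', 'NOUN')` on top in both.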


In [13]:
# Left and right POS context for all words tagged ADJ:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_trigrams = nltk.trigrams(brown_news_tagged)
adj_contexts = [(a[1], c[1]) for (a, b, c) in word_tag_trigrams if b[1] == 'ADJ']

fdist = nltk.FreqDist(adj_contexts)
fdist.most_common()

[(('DET', 'NOUN'), 2081),
 (('ADP', 'NOUN'), 745),
 (('NOUN', 'NOUN'), 368),
 (('ADJ', 'NOUN'), 363),
 (('VERB', 'NOUN'), 351),
 (('.', 'NOUN'), 332),
 (('DET', 'ADJ'), 218),
 (('CONJ', 'NOUN'), 214),
 (('ADV', 'NOUN'), 154),
 (('VERB', 'ADP'), 145),
 (('NUM', 'NOUN'), 129),
 (('DET', '.'), 104),
 (('ADV', 'ADP'), 100),
 (('ADV', '.'), 83),
 (('DET', 'NUM'), 83),
 (('VERB', '.'), 77),
 (('VERB', 'PRT'), 61),
 (('.', 'ADP'), 57),
 (('DET', 'CONJ'), 56),
 (('NOUN', 'ADP'), 56),
 (('ADP', 'ADJ'), 56),
 (('DET', 'ADP'), 56),
 (('.', '.'), 53),
 (('DET', 'VERB'), 44),
 (('ADP', '.'), 42),
 (('ADP', 'ADP'), 38),
 (('NOUN', '.'), 33),
 (('VERB', 'CONJ'), 32),
 (('ADP', 'CONJ'), 30),
 (('.', 'CONJ'), 27),
 (('.', 'ADJ'), 26),
 (('CONJ', '.'), 25),
 (('VERB', 'ADJ'), 25),
 (('NOUN', 'ADJ'), 24),
 (('ADJ', 'ADJ'), 21),
 (('PRT', 'NOUN'), 19),
 (('CONJ', 'ADP'), 18),
 (('CONJ', 'ADJ'), 17),
 (('ADP', 'NUM'), 16),
 (('ADV', 'PRT'), 13),
 (('NOUN', 'VERB'), 13),
 (('VERB', 'ADV'), 13),
 (('ADV', 'V

In [14]:
# Corpus examples of predicative ('is good') and adverbial ('left fast') use:
# which verbs directly precede an adjective?
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
adj_preceding_verbs = [a[0] for (a, b) in word_tag_pairs if b[1] == 'ADJ' and a[1] == 'VERB']
fdist = nltk.FreqDist(adj_preceding_verbs)
fdist.most_common(20)

[('is', 77),
 ('be', 67),
 ('was', 44),
 ('are', 37),
 ('were', 26),
 ('been', 13),
 ('had', 11),
 ('made', 10),
 ('get', 9),
 ('become', 8),
 ('provide', 7),
 ('make', 7),
 ('has', 7),
 ('have', 6),
 ('getting', 4),
 ('give', 4),
 ('said', 4),
 ("isn't", 4),
 ('brought', 3),
 ('left', 3)]
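
How strongly forms of *be* dominate this list can be checked with a short calculation. The counts are copied from the output above, `verb_counts` and `BE_FORMS` are names introduced here, and the share is relative to the 20 listed verbs only, not to all verb-adjective bigrams in the corpus.

```python
# Top-20 counts copied from the output above (verbs directly before an ADJ).
verb_counts = {
    'is': 77, 'be': 67, 'was': 44, 'are': 37, 'were': 26, 'been': 13,
    'had': 11, 'made': 10, 'get': 9, 'become': 8, 'provide': 7, 'make': 7,
    'has': 7, 'have': 6, 'getting': 4, 'give': 4, 'said': 4, "isn't": 4,
    'brought': 3, 'left': 3,
}
# Inflected forms of the copula 'be' that occur in the list.
BE_FORMS = {'is', 'be', 'was', 'are', 'were', 'been', "isn't"}

be_total = sum(n for verb, n in verb_counts.items() if verb in BE_FORMS)
top20_total = sum(verb_counts.values())
print(be_total, top20_total, f"{be_total / top20_total:.1%}")
```

Forms of *be* make up 268 of the 351 listed occurrences (about 76%), which fits the predicative pattern. A verb directly before an adjective is not necessarily predicative or adverbial, though: with verbs such as `provide` or `have`, the adjective presumably opens an object noun phrase instead.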