In [1]:
%matplotlib notebook

import nltk

# The NLP (Natural Language Processing) pipeline

![NLP pipeline](http://www.nltk.org/images/dialogue.png)

(Courtesy of the NLTK Book)

As you can see, engineers like to separate language perception and production into neat, self-contained tasks, because dealing with one well-defined task at a time is more manageable from an engineering point of view. This is obviously not how it works in humans -- the various stages are interconnected, and top-down and bottom-up neural processes interact.

Early work in generative grammar was in favor of such a modular concept, but more contemporary trends such as construction grammar advocate a more holistic approach towards representing linguistic knowledge, where constraints from these various linguistic levels coexist.

In practice though, it's very hard even for engineers to completely separate these tasks -- quite simply because it turns out that you just need information from multiple levels in order to do a state-of-the-art job at any of them. For instance, if you're doing speech recognition, you'd better have an idea of the lexicon, morphology and syntax of the language you're trying to recognize in order to constrain the guesses you're making about the acoustic signal you're trying to analyze.

There are also very exciting attempts at computer modeling of speech perception and production which mimic the complexity, interconnectedness and extensive feedback mechanisms between the various layers that humans seem to rely on. For an overview, check out the open-access book [The Talking Heads experiment: Origins of words and meanings](http://langsci-press.org/catalog/book/49) by Luc Steels.

# Stemming, lemmatization and tagging

In [2]:
from nltk.corpus import webtext

sent = webtext.words("grail.txt")[1818:1861]
sent

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'lying',
 'in',
 'ponds',
 'distributing',
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.',
 'ARTHUR',
 ':',
 'Be',
 'quiet',
 '!']

**Stemmers** are often used in lightly inflected languages for grouping related words within a text. They are mostly rule-based systems designed to strip endings from words (leaving just stems), which makes it possible to collapse the different inflected forms into one when doing e.g. frequency counts.

In [3]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [4]:
[porter.stem(t) for t in sent]

['DENNI',
 ':',
 'Listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.',
 'ARTHUR',
 ':',
 'Be',
 'quiet',
 '!']
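
To make the rule-based idea concrete, here's a minimal sketch of a toy suffix stripper. It is emphatically *not* the Porter algorithm, whose rules are far more careful; the function `toy_stem`, its suffix list and the `min_stem` threshold are all made up for this illustration. The point is only to show how collapsing inflected forms changes frequency counts.

```python
from collections import Counter

# Toy suffix rules, tried in this order (an invented list, NOT Porter's rules)
SUFFIXES = ("ing", "es", "s", "ed")

def toy_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix, provided
    that at least `min_stem` characters are left over."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

tokens = ["sword", "swords", "pond", "ponds",
          "distributing", "distributed", "distributes"]

# without stemming, every inflected form is counted separately
raw_counts = Counter(tokens)
# with stemming, related forms collapse into a single entry
stem_counts = Counter(toy_stem(t) for t in tokens)

print(raw_counts["swords"])
print(stem_counts["sword"])
print(stem_counts["distribut"])
```

Note that `distribut` is not a real word, but that doesn't matter for counting purposes as long as all forms of the word end up in the same bucket.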

**Lemmatizers** serve the same purpose as stemmers, but instead of using rules, they're generally backed by a database of words and their forms within a particular language, and possibly a system for disambiguating between rival interpretations based on context (either rule-based or stochastic, i.e. statistically trained). Whereas stemmers often yield truncated strings that are not really words in themselves (which may be useful nonetheless, if each inflection of a word yields the same truncated form and therefore performs the correct grouping), lemmatizers yield dictionary headwords (lemmas).

In [5]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(w) for w in sent]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.',
 'ARTHUR',
 ':',
 'Be',
 'quiet',
 '!']
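
Notice that `'lying'` and `'derives'` came out unchanged above: unless you pass a `pos` argument, `lemmatize()` treats every word as a noun. The sketch below illustrates the general idea of a lookup keyed on both the word form and its part of speech. It's a toy table invented for this example (the entries, the `n`/`v` labels and the function `toy_lemmatize` are all assumptions), not a description of how WordNet is implemented.

```python
# A toy "database" mapping (word form, part of speech) to a lemma.
# The entries are invented for illustration only.
LEMMA_TABLE = {
    ("women", "n"): "woman",
    ("ponds", "n"): "pond",
    ("lying", "v"): "lie",
    ("lying", "n"): "lying",
    ("derives", "v"): "derive",
    ("is", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); if the pair is not in the
    table, fall back to returning the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("women"))           # irregular plural, handled by lookup
print(toy_lemmatize("lying", pos="v"))  # same form, verb reading
print(toy_lemmatize("lying", pos="n"))  # same form, noun reading
print(toy_lemmatize("ARTHUR"))          # unknown word, returned as is
```

This is also why a disambiguation step matters: the same string can map to different lemmas depending on the part of speech it has in context.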

In [6]:
# NLTK can also do part-of-speech (POS) tagging; if you're interested in
# how this works and trying to build and evaluate your own POS tagger, I
# recommend checking out chapters 5 and 6 of the NLTK Book
nltk.pos_tag(sent)

[('DENNIS', 'NN'),
 (':', ':'),
 ('Listen', 'NNP'),
 (',', ','),
 ('strange', 'JJ'),
 ('women', 'NNS'),
 ('lying', 'VBG'),
 ('in', 'IN'),
 ('ponds', 'NNS'),
 ('distributing', 'VBG'),
 ('swords', 'NNS'),
 ('is', 'VBZ'),
 ('no', 'DT'),
 ('basis', 'NN'),
 ('for', 'IN'),
 ('a', 'DT'),
 ('system', 'NN'),
 ('of', 'IN'),
 ('government', 'NN'),
 ('.', '.'),
 ('Supreme', 'NNP'),
 ('executive', 'NN'),
 ('power', 'NN'),
 ('derives', 'VBZ'),
 ('from', 'IN'),
 ('a', 'DT'),
 ('mandate', 'NN'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('masses', 'NNS'),
 (',', ','),
 ('not', 'RB'),
 ('from', 'IN'),
 ('some', 'DT'),
 ('farcical', 'JJ'),
 ('aquatic', 'JJ'),
 ('ceremony', 'NN'),
 ('.', '.'),
 ('ARTHUR', 'NNP'),
 (':', ':'),
 ('Be', 'NNP'),
 ('quiet', 'JJ'),
 ('!', '.')]

In [7]:
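# careful: the tags returned by nltk.pos_tag() above (NNP, VBZ, DT...) are Penn
# Treebank tags, which are documented by nltk.help.upenn_tagset(); the Brown
# tagset listed here is a different inventory
nltk.help.brown_tagset()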

(: opening parenthesis
    (
): closing parenthesis
    )
*: negator
    not n't
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ? ; ! :
:: colon
    :
ABL: determiner/pronoun, pre-qualifier
    quite such rather
ABN: determiner/pronoun, pre-quantifier
    all half many nary
ABX: determiner/pronoun, double conjunction or pre-quantifier
    both
AP: determiner/pronoun, post-determiner
    many other next more last former little several enough most least only
    very few fewer past same Last latter less single plenty 'nough lesser
    certain various manye next-to-last particular final previous present
    nuf
AP$: determiner/pronoun, post-determiner, genitive
    other's
AP+AP: determiner/pronoun, post-determiner, hyphenated pair
    many-much
AT: article
    the an no a every th' ever' ye
BE: verb 'to be', infinitive or imperative
    be
BED: verb 'to be', past tense, 2nd person singular or all persons plural
    were
BED*: verb 'to be', past tense, 2nd person singular or 

For highly inflected languages like Czech, making a stemmer based on a set of rules with exceptions would be so much work that you might as well go directly for full-blown lemmatization and morphological tagging, considering the results will be much more useful. Unfortunately, there's no module in NLTK to help you with that -- but third-party open-source tools do exist! One of them is the MorphoDiTa tagger by [ÚFAL MFF UK](http://ufal.mff.cuni.cz/), which you can [try out online](http://lindat.mff.cuni.cz/services/morphodita/demo.php), [download](http://ufal.mff.cuni.cz/morphodita/install), or use via its [web API](http://lindat.mff.cuni.cz/services/morphodita/api-reference.php). We'll go with the third option, because it's both very easy to use (no installation required) and integrates very well with Python (or indeed any programming environment).

A web API (or Application Programming Interface) is a service that allows you to send requests to a server on the web, have it process your request and receive some structured data back. It's very similar to browsing the web using your browser, but instead of getting back HTML-formatted pages, which contain a lot of information focused on the visual presentation of the data you've requested, you get back only "the meat", i.e. the actual data, in an easily machine-parsable format (typically JSON or XML). We'll look at XML some other time, but JSON maps very closely to a data structure you could build by nesting Python dictionaries and lists inside each other, so it should feel very natural.
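
To see how closely JSON maps onto Python data structures, here's a small sketch using the `json` module from the standard library. The JSON string is hand-written for this example; it's only modeled on the kind of structure a tagging service might return, so don't read it as actual server output.

```python
import json

# a hand-written JSON string (an assumption for illustration, not real output)
raw = '{"result": [[{"token": "Děti", "lemma": "dítě", "tag": "NNFP1-----A----"}]]}'

# json.loads turns the string into nested Python dictionaries and lists
data = json.loads(raw)
print(type(data).__name__)            # JSON object -> Python dict
print(type(data["result"]).__name__)  # JSON array  -> Python list

# navigating the structure is just regular indexing
first = data["result"][0][0]
print(first["lemma"])

# and json.dumps goes the other way, from Python objects back to a string
round_trip = json.dumps(data, ensure_ascii=False)
```

Converting there and back again gives you the same data structure you started with.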

So let's dive in and create some functions to lemmatize Czech text using the MorphoDiTa web API. All we need to know about the API is contained in [MorphoDiTa's documentation](http://lindat.mff.cuni.cz/services/morphodita/api-reference.php). By the way, you could also simply use the API in your browser -- just visit [http://lindat.mff.cuni.cz/services/morphodita/api/tag?data=Děti pojedou k babičce. Už se těší.&output=json](http://lindat.mff.cuni.cz/services/morphodita/api/tag?data=Děti pojedou k babičce. Už se těší.&output=json&convert_tagset=strip_lemma_id) to see what it returns. Our task will be to make a function which visits this kind of URL in Python instead of in your browser, receives back the same output and interprets it.
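
When you paste a URL like the one above into your browser, it quietly escapes the spaces and accented characters for you. In Python, this step can be made explicit with `urllib.parse` from the standard library. The following is just a sketch of what such a URL looks like once encoded; the variable names are mine, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://lindat.mff.cuni.cz/services/morphodita/api/tag"
params = {"data": "Děti pojedou k babičce. Už se těší.", "output": "json"}

# urlencode escapes spaces and non-ASCII characters so that the text can
# safely travel as part of a URL
url = BASE_URL + "?" + urlencode(params)
print(url)

# the encoding is reversible: parsing the query string gives the text back
# (parse_qs wraps each value in a list, since a key may occur repeatedly)
decoded = parse_qs(urlsplit(url).query)
print(decoded["data"][0])
```

Libraries like `requests` perform this kind of encoding for you when you hand them a dictionary of parameters, which is why the function in the next cell never has to deal with it directly.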

In [8]:
import requests

def morphoditag(text):
    # initiate a POST HTTP request to the MorphoDiTa API
    resp = requests.post(
        # this is the URL under which the API resides
        "http://lindat.mff.cuni.cz/services/morphodita/api/tag",
        # this is the data sent inside the POST request
        data=dict(data=text, output="json"))
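    # if the server reports an error (e.g. a 4xx or 5xx status code), raise an
    # exception right away instead of trying to parse an error page as JSON
    resp.raise_for_status()
    # if everything goes well, we get back response data formatted in JavaScript
    # Object Notation (JSON); the response object conveniently has a .json() method
    # which helpfully converts the JSON-formatted string to a Python data structure
    # made of nested Python dictionaries and lists
    json = resp.json()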
    # the json contains some metadata in addition to the result; we can either
    # return all of it and let the caller (user) deal with this...
    # return json
    # ... or we can return just the tagged sentences...
    # return json["result"]
    # ... or we can even decide that we want to modify the output format a little
    # so that it's closer to what NLTK does when it tags text (i.e. each position
    # should be a tuple, consisting in our case of (word, lemma, tag))
    
    # json["result"] is a list of lists, where each nested list represents a
    # sentence
    for sent in json["result"]:
        # enumerate is a function that takes a collection and successively yields
        # all of its elements and their indices; in each iteration, we assign the
        # name "position" to the current element, and the name "i" to its
        # corresponding index
        for i, position in enumerate(sent):
            # position is a Python dictionary representing the current position,
            # as returned by the MorphoDiTa API (if you don't believe me, return
            # it at this point or print it out and see for yourself!); we turn it
            # into a tuple representation, and re-assign the name "position" to
            # this new representation 
            position = (position["token"], position["lemma"], position["tag"])
            # since lists are mutable, we can replace the old representation of
            # the position at this index in the current sentence with the new
            # one (i.e. the i'th element of sent was originally a dictionary as
            # returned by the MorphoDiTa API, but we replace it with our newly
            # created tuple)
            sent[i] = position
            # the two previous lines can also be condensed into one:
            # sent[i] = (position["token"], position["lemma"], position["tag"])
    # having hot-swapped (mutated in place; see sent[i] = position) the values
    # in the nested lists under json["result"], just return it
    return json["result"]

In [9]:
morphoditag("Děti pojedou k babičce. Už se těší.")

[[('Děti', 'dítě', 'NNFP1-----A----'),
  ('pojedou', 'jet-1_^(pohybovat_se,_ne_však_chůzí)', 'VB-P---3F-AA---'),
  ('k', 'k-1', 'RR--3----------'),
  ('babičce', 'babička', 'NNFS3-----A----'),
  ('.', '.', 'Z:-------------')],
 [('Už', 'už-1', 'Db-------------'),
  ('se', 'se_^(zvr._zájmeno/částice)', 'P7-X4----------'),
  ('těší', 'těšit_:T', 'VB-S---3P-AA---'),
  ('.', '.', 'Z:-------------')]]
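
As you can see, some of the lemmas carry extra technical information, e.g. `'jet-1_^(pohybovat_se,_ne_však_chůzí)'`. If all you want is the plain headword, you can strip that off yourself. The sketch below *assumes* that the plain lemma ends just before the first hyphen followed by a digit, underscore or backtick, which matches the examples above, but do check this against the MorphoDiTa documentation before relying on it. The function name `raw_lemma` is made up. (The `convert_tagset=strip_lemma_id` parameter in the browser link above appears to request a similar clean-up on the server side.)

```python
import re

# ASSUMPTION: the plain lemma is the shortest non-empty prefix which is
# followed by "-<digit>", "_", "`" or the end of the string
RAW_LEMMA = re.compile(r".+?(?=-\d|_|`|$)")

def raw_lemma(lemma):
    """Return the lemma with technical suffixes removed (hypothetical helper)."""
    match = RAW_LEMMA.match(lemma)
    return match.group(0) if match else lemma

# a few positions copied from the output above
tagged = [
    ("pojedou", "jet-1_^(pohybovat_se,_ne_však_chůzí)", "VB-P---3F-AA---"),
    ("k", "k-1", "RR--3----------"),
    ("babičce", "babička", "NNFS3-----A----"),
    ("těší", "těšit_:T", "VB-S---3P-AA---"),
]

cleaned = [(token, raw_lemma(lemma), tag) for token, lemma, tag in tagged]
for position in cleaned:
    print(position)
```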

Now it's your turn -- examine the [MorphoDiTa API documentation](http://lindat.mff.cuni.cz/services/morphodita/api-reference.php) and add parameters to the function that will extend its functionality, e.g. by allowing the user to switch between the tagging models for Czech and English.

# Syntactic parsing

NLTK can also be used for experimenting with syntactic parsing. It provides functions for defining context-free grammars (CFGs), probabilistic context-free grammars (PCFGs), feature-based grammars, and various algorithms that make use of these grammars to perform parsing. If you're interested in this topic, take a look at chapters 8 and 9 of the NLTK Book.

Here's a small taste of what's available:

In [10]:
# note that the names of the components are arbitrary: they're just symbols whose
# function is entirely defined by their relationships within the CFG, as you
# define them (so S could just as well be X or Y or FOO, but that would make
# your grammar hard to understand for a human linguist)
grammar = nltk.CFG.fromstring("""
S -> NP VP "."
NP -> Pron | Det NX
NX -> N | PreMod N | N PostMod | PreMod N PostMod
VP -> V | V NP | V NP PP
PP -> P NP
PreMod -> Adj
PostMod -> PP
N -> "man" | "ball"
Adj -> "strong" | "little"
P -> "with"
V -> "hit"
Det -> "the" | "a"
Pron -> "I"
""")
sent = "I hit the strong man with a little ball .".split()
# sent = "I hit the strong man .".split()
parser = nltk.ChartParser(grammar)

# the parser returns a sequence of parse trees, because sentences can be
# ambiguous with respect to the grammar, so there can be more than one result;
# in this particular case, what are the two interpretations that the CFG
# yields?
for tree in parser.parse(sent):
    tree.pretty_print()

                   S                                        
  _________________|_________                                
 |   |                       VP                             
 |   |     __________________|____________                   
 |   |    |        |                      PP                
 |   |    |        |              ________|____              
 |   |    |        NP            |             NP           
 |   |    |    ____|_____        |     ________|_____        
 |   |    |   |          NX      |    |              NX     
 |   |    |   |     _____|___    |    |         _____|___    
 |   NP   |   |  PreMod      |   |    |      PreMod      |  
 |   |    |   |    |         |   |    |        |         |   
 |  Pron  V  Det  Adj        N   P   Det      Adj        N  
 |   |    |   |    |         |   |    |        |         |   
 .   I   hit the strong     man with  a      little     ball

     S                                                      
  ___|________  
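
If you're curious how ambiguity can be detected mechanically, here's a rough pure-Python sketch which counts how many distinct parse trees a grammar licenses for a sentence. It's a naive memoized span counter written for this illustration, *not* the chart parsing algorithm NLTK uses, and it assumes a grammar without empty productions or cyclic unit rules. The names `RULES`, `GRAMMAR` and `count_parses` are made up, and the rule format is simplified (terminals are unquoted; anything that never appears on a left-hand side counts as a word).

```python
from functools import lru_cache

RULES = """
S -> NP VP .
NP -> Pron | Det NX
NX -> N | PreMod N | N PostMod | PreMod N PostMod
VP -> V | V NP | V NP PP
PP -> P NP
PreMod -> Adj
PostMod -> PP
N -> man | ball
Adj -> strong | little
P -> with
V -> hit
Det -> the | a
Pron -> I
"""

# map each left-hand side to a list of alternatives (tuples of symbols)
GRAMMAR = {}
for line in RULES.strip().splitlines():
    lhs, rhs = line.split("->")
    GRAMMAR[lhs.strip()] = [tuple(alt.split()) for alt in rhs.split("|")]

def count_parses(tokens, start="S"):
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # number of ways `symbol` can derive exactly tokens[i:j]
        if symbol not in GRAMMAR:  # a terminal, i.e. an actual word
            return int(j - i == 1 and tokens[i] == symbol)
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # number of ways the symbol sequence `rhs` can derive tokens[i:j]
        if not rhs:
            return int(i == j)
        first, rest = rhs[0], rhs[1:]
        # the first symbol covers tokens[i:k]; every symbol needs >= 1 token
        return sum(count_symbol(first, i, k) * count_seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return count_symbol(start, 0, len(tokens))

print(count_parses("I hit the strong man with a little ball .".split()))
print(count_parses("I hit the strong man .".split()))
```

A count higher than one means the sentence is ambiguous with respect to the grammar; a count of zero means the grammar doesn't cover it at all.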

# In(stead of a) conclusion

- thanks for attending the workshop, it was a pleasure to be able to spend a few hours in your company!
- **Google** and **StackOverflow** (and sister sites) are your friends
- drop me a line at <david.lukes@ff.cuni.cz> if you find yourself stuck while going over some of the material we've covered
- if you're Prague-based and a speaker of Czech, come say hello at the Python course I'll be teaching starting this autumn at the Faculty of Arts, Charles University, and tell your friends!