# FLIP(01):  Advanced Data Science
**(Module E: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.
- If you find any issues or bugs in this document, please submit an issue at [tulip lab/flip01](https://github.com/tulip-lab/flip01/issues)


Prepared by and for 
**Student Members** |
2006-2022 [TULIP Lab](http://www.tulip.org.au)

---


# Session 09 - Analyzing the Meaning of Sentences

### Natural Language Understanding

#### Querying a Database

In this section, we will show that answering natural language questions against a database is fairly straightforward in a restricted domain. But we will also see that to address the problem in a more general way, we have to open up a whole new box of ideas and techniques, involving the representation of meaning.

In [None]:
import nltk

In [None]:
nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')

In [None]:
from nltk import load_parser

In [None]:
cp = load_parser('grammars/book_grammars/sql0.fcfg')

In [None]:
query = 'What cities are located in China'

In [None]:
trees = list(cp.parse(query.split()))

In [None]:
answer = trees[0].label()['SEM']

In [None]:
answer = [s for s in answer if s]

In [None]:
q = ' '.join(answer)

In [None]:
print(q)

Finally, we execute the query over the database city.db and retrieve some results.



In [None]:
from nltk.sem import chat80

In [None]:
rows = chat80.sql_query('corpora/city_database/city.db', q)

In [None]:
for r in rows: print(r[0], end=" ")
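
The last two steps above (joining the SEM fragments into one string and running it as SQL) can be tried without the Chat-80 data. The following is a minimal sketch under stated assumptions: the fragment tuple `sem_fragments` and the three table rows are made up for illustration, and an in-memory `sqlite3` table stands in for city.db. It is not what `chat80.sql_query` does internally.

```python
import sqlite3

# Hypothetical SEM value: a tuple of SQL fragments, some of them empty,
# similar in shape to what trees[0].label()['SEM'] holds.
sem_fragments = ("SELECT", "City FROM city_table", "WHERE", "", "", "Country='china'")

# Drop the empty fragments and join the rest into one SQL string.
sql = " ".join(s for s in sem_fragments if s)

# Toy in-memory table standing in for city.db (rows are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_table (City TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO city_table VALUES (?, ?)",
    [("canton", "china"), ("shanghai", "china"), ("paris", "france")],
)
cities = sorted(row[0] for row in conn.execute(sql))
conn.close()

print(sql)
print(cities)
```

The point of the sketch is that the grammar does all the semantic work: once the fragments are joined, the result is an ordinary SQL query.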

# Propositional Logic

A logical language is designed to make reasoning formally explicit. As a result, it can capture aspects of natural language which determine whether a set of sentences is consistent. As part of this approach, we need to develop logical representations of a sentence φ that formally capture the truth-conditions of φ.

In [None]:
nltk.boolean_ops()

NLTK's Expression object can process logical expressions into various subclasses of Expression:



In [None]:
read_expr = nltk.sem.Expression.fromstring

In [None]:
read_expr('-(P & Q)')

In [None]:
read_expr('P & Q')

In [None]:
read_expr('P | (R -> Q)')

In [None]:
read_expr('P <-> -- P')

Arguments can be tested for "syntactic validity" by using a proof system. We will say a little bit more about this later. Logical proofs can be carried out with NLTK's inference module, for example via an interface to the third-party theorem prover Prover9. The inputs to the inference mechanism first have to be converted into logical expressions.

In [None]:
lp = nltk.sem.Expression.fromstring

In [None]:
SnF = read_expr('SnF')

In [None]:
NotFnS = read_expr('-FnS')

In [None]:
R = read_expr('SnF -> -FnS')

In [None]:
prover = nltk.Prover9()

In [None]:
prover.prove(NotFnS, [SnF, R])
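
For a propositional argument this small, the same conclusion can be checked semantically by enumerating every valuation. The sketch below is a brute-force truth-table check in plain Python, offered only as an illustration: it is not how Prover9 works, and the helper `entails` is a hypothetical name.

```python
from itertools import product

def entails(premises, goal, symbols):
    """Return True if every valuation satisfying all premises also satisfies goal."""
    for values in product([True, False], repeat=len(symbols)):
        v = dict(zip(symbols, values))
        if all(p(v) for p in premises) and not goal(v):
            return False  # found a counterexample valuation
    return True

# Formulas encoded as Python functions from a valuation dict to bool.
snf = lambda v: v["SnF"]                            # SnF
rule = lambda v: (not v["SnF"]) or (not v["FnS"])   # SnF -> -FnS
not_fns = lambda v: not v["FnS"]                    # -FnS
fns = lambda v: v["FnS"]                            # FnS

print(entails([snf, rule], not_fns, ["SnF", "FnS"]))
print(entails([snf, rule], fns, ["SnF", "FnS"]))
```

Truth tables grow exponentially with the number of symbols, which is one reason to prefer a proof system for larger problems.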

Recall that we interpret sentences of a logical language relative to a model, which is a very simplified version of the world. A model for propositional logic needs to assign the values True or False to every possible formula. We do this inductively: first, every propositional symbol is assigned a value, and then we compute the value of complex formulas by consulting the meanings of the boolean operators and applying them to the values of the formula's components. A Valuation is a mapping from basic expressions of the logic to their values. Here's an example:

In [None]:
val = nltk.Valuation([('P', True), ('Q', True), ('R', False)])

In [None]:
val['P']

In [None]:
dom = set()

In [None]:
g = nltk.Assignment(dom)

In [None]:
m = nltk.Model(dom, val)

In [None]:
print(m.evaluate('(P & Q)', g))

In [None]:
print(m.evaluate('-(P & Q)', g))

In [None]:
print(m.evaluate('(P & R)', g))

In [None]:
print(m.evaluate('(P | R)', g))
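
The inductive definition of truth described above can be sketched in a few lines of plain Python. This is an illustration only, not NLTK's implementation: formulas are written as nested tuples (an encoding assumed here for simplicity), and `evaluate` is a hypothetical helper.

```python
def evaluate(formula, valuation):
    """Evaluate a formula given as a symbol name or a nested tuple."""
    if isinstance(formula, str):          # base case: propositional symbol
        return valuation[formula]
    op, *args = formula
    if op == "not":
        return not evaluate(args[0], valuation)
    left, right = (evaluate(a, valuation) for a in args)
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    if op == "implies":
        return (not left) or right
    if op == "iff":
        return left == right
    raise ValueError(f"unknown operator: {op}")

toy_val = {"P": True, "Q": True, "R": False}
results = [
    evaluate(("and", "P", "Q"), toy_val),            # (P & Q)
    evaluate(("not", ("and", "P", "Q")), toy_val),   # -(P & Q)
    evaluate(("and", "P", "R"), toy_val),            # (P & R)
    evaluate(("or", "P", "R"), toy_val),             # (P | R)
]
print(results)
```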

# First-Order Logic

In the remainder of this chapter, we will represent the meaning of natural language expressions by translating them into first-order logic. Not all of natural language semantics can be expressed in first-order logic. It is nevertheless a good choice for computational semantics: it is expressive enough to represent many aspects of semantics, and excellent off-the-shelf systems are available for carrying out automated inference in first-order logic.

## Syntax

First-order logic keeps all the Boolean operators of propositional logic, but it adds some important new mechanisms.

It is often helpful to inspect the syntactic structure of expressions of first-order logic, and the usual way of doing this is to assign types to expressions. Following the tradition of Montague grammar, we will use two basic types: e is the type of entities, while t is the type of formulas, i.e., expressions which have truth values. Given these two basic types, we can form complex types for function expressions. That is, given any types σ and τ, 〈σ, τ〉 is a complex type corresponding to functions from 'σ things' to 'τ things'. For example, 〈e, t〉 is the type of expressions from entities to truth values, namely unary predicates. The logical expression can be processed with type checking.

In [None]:
read_expr = nltk.sem.Expression.fromstring

In [None]:
expr = read_expr('walk(angus)', type_check=True)

In [None]:
expr.argument

In [None]:
expr.argument.type

In [None]:
expr.function

In [None]:
expr.function.type

Why do we see <e,?> at the end of this example? Although the type-checker will try to infer as many types as possible, in this case it has not managed to fully specify the type of walk, since its result type is unknown. Although we intend walk to receive type <e, t>, as far as the type-checker knows, in this context it could be of some other type such as <e, e> or <e, <e, t>>. To help the type-checker, we need to specify a signature, implemented as a dictionary that explicitly associates types with non-logical constants:

In [None]:
sig = {'walk': '<e, t>'}

In [None]:
expr = read_expr('walk(angus)', signature=sig)

In [None]:
expr.function.type
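
To make the idea of basic and complex types concrete, here is a small sketch in plain Python. The classes `Basic` and `Complex` and the function `apply_type` are hypothetical names chosen for this illustration. They are not NLTK's type classes.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Basic:
    name: str                      # "e" for entities, "t" for truth values
    def __str__(self):
        return self.name

@dataclass(frozen=True)
class Complex:
    arg: "Type"                    # the sigma in <sigma, tau>
    result: "Type"                 # the tau in <sigma, tau>
    def __str__(self):
        return f"<{self.arg},{self.result}>"

Type = Union[Basic, Complex]
E, T = Basic("e"), Basic("t")

def apply_type(fn, arg):
    """Type of fn(arg); raises TypeError when the types do not fit."""
    if not isinstance(fn, Complex) or fn.arg != arg:
        raise TypeError(f"cannot apply {fn} to {arg}")
    return fn.result

unary = Complex(E, T)                # <e,t>, e.g. walk
binary = Complex(E, Complex(E, T))   # <e,<e,t>>, a curried two-place predicate
print(unary, binary)
print(apply_type(unary, E))          # walk(angus) is a formula
print(apply_type(apply_type(binary, E), E))
```

Applying a function type to an argument of the wrong type raises an error, which is the kind of mismatch a type-checker is meant to catch.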

In [None]:
read_expr = nltk.sem.Expression.fromstring

In [None]:
read_expr('dog(cyril)').free()

In [None]:
read_expr('dog(x)').free()

In [None]:
read_expr('own(angus, cyril)').free()

In [None]:
read_expr('exists x.dog(x)').free()

In [None]:
read_expr('((some x. walk(x)) -> sing(x))').free()

In [None]:
read_expr('exists x.own(y, x)').free()

## First-Order Theorem Proving

The general case in theorem proving is to determine whether a formula that we want to prove (a proof goal) can be derived by a finite sequence of inference steps from a list of assumed formulas.

In [None]:
NotFnS = read_expr('-north_of(f, s)')

In [None]:
SnF = read_expr('north_of(s, f)')

In [None]:
R = read_expr('all x. all y. (north_of(x, y) -> -north_of(y, x))')

In [None]:
prover = nltk.Prover9()

In [None]:
prover.prove(NotFnS, [SnF, R])

In [None]:
FnS = read_expr('north_of(f, s)')

In [None]:
prover.prove(FnS, [SnF, R])

## Truth in Model

Relations are represented semantically in NLTK in the standard set-theoretic way: as sets of tuples. For example, let’s suppose we have a domain of discourse consisting of the individuals Bertie, Olive, and Cyril, where Bertie is a boy, Olive is a girl, and Cyril is a dog.

In [None]:
dom = {'b', 'o', 'c'}

In [None]:
v = """
    bertie => b
    olive => o
    cyril => c
    boy => {b}
    girl => {o}
    dog => {c}
    walk => {o, c}
    see => {(b, o), (c, b), (o, c)}
    """

In [None]:
val = nltk.Valuation.fromstring(v)

In [None]:
print(val)

You may have noticed that our unary predicates (i.e., boy, girl, dog) also come out as sets of singleton tuples, rather than just sets of individuals. This is a convenience which allows us to have a uniform treatment of relations of any arity. A predication of the form P(τ1, ... τn), where P is of arity n, comes out true just in case the tuple of values corresponding to (τ1, ... τn) belongs to the set of tuples in the value of P.

In [None]:
('o', 'c') in val['see']

In [None]:
('b',) in val['boy']
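
The truth condition for predications can be written out directly with Python sets. The sketch below rebuilds the valuation by hand as a dictionary (an assumption made so the example runs without NLTK), and `holds` is a hypothetical helper name.

```python
# Toy valuation mirroring the one above: constants map to individuals,
# predicates map to sets of tuples (unary predicates use 1-tuples).
valuation = {
    "bertie": "b", "olive": "o", "cyril": "c",
    "boy": {("b",)}, "girl": {("o",)}, "dog": {("c",)},
    "walk": {("o",), ("c",)},
    "see": {("b", "o"), ("c", "b"), ("o", "c")},
}

def holds(predicate, *constants):
    """True iff the tuple of denotations of the constants is in the predicate's value."""
    args = tuple(valuation[c] for c in constants)
    return args in valuation[predicate]

print(holds("see", "olive", "cyril"))
print(holds("boy", "bertie"))
print(holds("walk", "bertie"))
```

Because unary predicates are stored as 1-tuples, one membership test covers predicates of every arity.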

## Individual Variables and Assignments

In our models, the counterpart of a context of use is a variable assignment. This is a mapping from individual variables to entities in the domain. Assignments are created using the Assignment constructor, which also takes the model’s domain of discourse as a parameter.

In [None]:
g = nltk.Assignment(dom, [('x', 'o'), ('y', 'c')])

In [None]:
print(g)

In [None]:
m = nltk.Model(dom, val)

In [None]:
m.evaluate('see(olive, y)', g)

In [None]:
g['y']

In [None]:
m.evaluate('see(y, x)', g)

In [None]:
g.purge()

In [None]:
m.evaluate('see(olive, y)', g)

In [None]:
m.evaluate('see(bertie, olive) & boy(bertie) & -walk(bertie)', g)
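
The division of labour between valuation and assignment can be sketched as follows. This is a simplified illustration in plain Python, not NLTK's code: `denote` and `eval_see` are hypothetical helpers, an ordinary dict plays the role of the assignment, and the string "Undefined" is used here to signal a term without a value.

```python
constants = {"bertie": "b", "olive": "o", "cyril": "c"}
see = {("b", "o"), ("c", "b"), ("o", "c")}

def denote(term, g):
    """Constants are interpreted by the valuation, variables by the assignment g."""
    if term in constants:
        return constants[term]
    return g.get(term)            # None when the variable is unassigned

def eval_see(t1, t2, g):
    """Truth value of see(t1, t2) under g, or 'Undefined' if a term has no value."""
    a, b = denote(t1, g), denote(t2, g)
    if a is None or b is None:
        return "Undefined"
    return (a, b) in see

g_toy = {"x": "o", "y": "c"}
print(eval_see("olive", "y", g_toy))   # y denotes c, and (o, c) is in see
print(eval_see("y", "x", g_toy))       # (c, o) is not in see
g_toy.clear()                          # analogous to purging all bindings
print(eval_see("olive", "y", g_toy))
```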

## Quantification

One of the crucial insights of modern logic is that the notion of variable satisfaction can be used to provide an interpretation for quantified formulas.

In [None]:
m.evaluate('exists x.(girl(x) & walk(x))', g)

In [None]:
m.evaluate('girl(x) & walk(x)', g.add('x', 'o'))

In [None]:
fmla1 = read_expr('girl(x) | boy(x)')

In [None]:
m.satisfiers(fmla1, 'x', g)

In [None]:
fmla2 = read_expr('girl(x) -> walk(x)')

In [None]:
m.satisfiers(fmla2, 'x', g)

In [None]:
fmla3 = read_expr('walk(x) -> girl(x)')

In [None]:
m.satisfiers(fmla3, 'x', g)

In [None]:
m.evaluate('all x.(girl(x) -> walk(x))', g)
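
The link between satisfaction and quantification can be made explicit in plain Python. In this sketch (an illustration, not NLTK's implementation) an open formula in x is a Python function of one individual, and `satisfiers`, `exists` and `forall` are hypothetical helpers over the same three-element domain.

```python
dom_toy = {"b", "o", "c"}
girl, boy, walk = {"o"}, {"b"}, {"o", "c"}

def satisfiers(open_formula):
    """Individuals d in the domain for which the formula is true when x denotes d."""
    return {d for d in dom_toy if open_formula(d)}

def exists(open_formula):
    return len(satisfiers(open_formula)) > 0        # some individual satisfies it

def forall(open_formula):
    return satisfiers(open_formula) == dom_toy      # every individual satisfies it

girl_or_boy = lambda x: x in girl or x in boy              # girl(x) | boy(x)
girl_implies_walk = lambda x: x not in girl or x in walk   # girl(x) -> walk(x)
walk_implies_girl = lambda x: x not in walk or x in girl   # walk(x) -> girl(x)

print(sorted(satisfiers(girl_or_boy)))
print(sorted(satisfiers(girl_implies_walk)))
print(sorted(satisfiers(walk_implies_girl)))
print(exists(lambda x: x in girl and x in walk), forall(girl_implies_walk))
```

Note that an implication is satisfied by every individual that fails its antecedent, which is why b satisfies girl(x) -> walk(x).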

## Quantifier Scope Ambiguity

In [None]:
v2 = """
    bruce => b
    cyril => c
    elspeth => e
    julia => j
    matthew => m
    person => {b, e, j, m}
    admire => {(j, b), (b, b), (m, e), (e, m), (c, a)}
    """

In [None]:
val2 = nltk.Valuation.fromstring(v2)

In [None]:
dom2 = val2.domain

In [None]:
m2 = nltk.Model(dom2, val2)

In [None]:
g2 = nltk.Assignment(dom2)

In [None]:
fmla4 = read_expr('(person(x) -> exists y.(person(y) & admire(x, y)))')

In [None]:
m2.satisfiers(fmla4, 'x', g2)

This shows that fmla4 holds of every individual in the domain. By contrast, consider the formula fmla5 below; this has no satisfiers for the variable y.

In [None]:
fmla5 = read_expr('(person(y) & all x.(person(x) -> admire(x, y)))')

In [None]:
m2.satisfiers(fmla5, 'y', g2)

In [None]:
fmla6 = read_expr('(person(y) & all x.((x = bruce | x = julia) -> admire(x, y)))')

In [None]:
m2.satisfiers(fmla6, 'y', g2)
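
The two scope readings of a sentence like "everybody admires someone" can be compared directly with nested `all` and `any`. The sketch below writes the domain and relations out by hand, assuming the domain contains every individual mentioned in the valuation. It is plain Python and does not use NLTK.

```python
domain = {"a", "b", "c", "e", "j", "m"}
person = {"b", "e", "j", "m"}
admire = {("j", "b"), ("b", "b"), ("m", "e"), ("e", "m"), ("c", "a")}

# Reading 1: all x.(person(x) -> exists y.(person(y) & admire(x, y)))
wide_universal = all(
    any(y in person and (x, y) in admire for y in domain)
    for x in domain if x in person
)

# Reading 2: exists y.(person(y) & all x.(person(x) -> admire(x, y)))
wide_existential = any(
    y in person and all((x, y) in admire for x in domain if x in person)
    for y in domain
)

# Persons admired by both bruce and julia (compare fmla6).
admired_by_b_and_j = {y for y in person if ("b", y) in admire and ("j", y) in admire}

print(wide_universal, wide_existential, admired_by_b_and_j)
```

In this model the first reading is true and the second is false, so the two scopings are not equivalent.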

## Model Building

We have been assuming that we already had a model, and wanted to check the truth of a sentence in the model. By contrast, model building tries to create a new model, given some set of sentences. If it succeeds, then we know that the set is consistent, since we have an existence proof of the model.

In [None]:
a3 = read_expr('exists x.(man(x) & walks(x))')

In [None]:
c1 = read_expr('mortal(socrates)')

In [None]:
c2 = read_expr('-mortal(socrates)')

In [None]:
mb = nltk.Mace(5)

In [None]:
print(mb.build_model(None, [a3, c1]))

In [None]:
print(mb.build_model(None, [a3, c2]))

In [None]:
print(mb.build_model(None, [c1, c2]))

In [None]:
a4 = read_expr('exists y. (woman(y) & all x. (man(x) -> love(x,y)))')

In [None]:
a5 = read_expr('man(adam)')

In [None]:
a6 = read_expr('woman(eve)')

In [None]:
g = read_expr('love(adam,eve)')

In [None]:
mc = nltk.MaceCommand(g, assumptions=[a4, a5, a6])

In [None]:
mc.build_model()

In [None]:
print(mc.valuation)

In [None]:
a7 = read_expr('all x. (man(x) -> -woman(x))')

In [None]:
g = read_expr('love(adam,eve)')

In [None]:
mc = nltk.MaceCommand(g, assumptions=[a4, a5, a6, a7])

In [None]:
mc.build_model()

In [None]:
print(mc.valuation)
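
The idea behind model building can be illustrated with a naive search over small finite interpretations. The sketch below is a brute-force enumeration in plain Python, written only for the three sentences a3, c1 and c2 above. It is not Mace4's algorithm, and the function names and the `max_size` bound are assumptions of this example.

```python
from itertools import chain, combinations, product

def subsets(domain):
    """All subsets of the domain, as frozensets."""
    items = list(domain)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def build_model(sentences, max_size=3):
    """Return the first interpretation satisfying all sentences, else None."""
    for n in range(1, max_size + 1):
        domain = range(n)
        for man, walks, mortal in product(subsets(domain), repeat=3):
            for socrates in domain:
                model = {"domain": set(domain), "man": man, "walks": walks,
                         "mortal": mortal, "socrates": socrates}
                if all(s(model) for s in sentences):
                    return model
    return None

# Sentences encoded as Python functions from a candidate model to bool.
s_a3 = lambda m: any(d in m["man"] and d in m["walks"] for d in m["domain"])
s_c1 = lambda m: m["socrates"] in m["mortal"]
s_c2 = lambda m: m["socrates"] not in m["mortal"]

print(build_model([s_a3, s_c1]) is not None)
print(build_model([s_a3, s_c2]) is not None)
print(build_model([s_c1, s_c2]) is not None)
```

A returned model shows that the sentences are consistent. A result of None only shows that no model exists up to the size bound, although for the contradictory pair c1 and c2 no model of any size exists.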

# The Semantics of English Sentences

## The λ-Calculus

Remember that \ is a special character in Python strings. We must either escape it (with another \), or else use “raw strings” as shown here:

In [None]:
read_expr = nltk.sem.Expression.fromstring

In [None]:
expr = read_expr(r'\x.(walk(x) & chew_gum(x))')

In [None]:
expr

In [None]:
print(read_expr(r'\x.(walk(x) & chew_gum(y))'))

In [None]:
expr = read_expr(r'\x.(walk(x) & chew_gum(x))(gerald)')

In [None]:
print(expr)

In [None]:
print(expr.simplify())

In [None]:
print(read_expr(r'\x.\y.(dog(x) & own(y, x))(cyril)').simplify())

In [None]:
print(read_expr(r'\x y.(dog(x) & own(y, x))(cyril, angus)').simplify())

In [None]:
expr1 = read_expr('exists x.P(x)')

In [None]:
print(expr1)

In [None]:
expr2 = expr1.alpha_convert(nltk.sem.Variable('z'))

In [None]:
print(expr2)

In [None]:
expr1 == expr2

In [None]:
expr3 = read_expr(r'\P.(exists x.P(x))(\y.see(y, x))')

In [None]:
print(expr3)

In [None]:
print(expr3.simplify())
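
Python's own `lambda` gives a rough feel for λ-abstraction and β-reduction. The sketch below is an analogy only: Python evaluates functions to values rather than rewriting expressions symbolically as `simplify()` does, and the sets used as predicate extensions are made up for illustration.

```python
# Toy extensions for the predicates (illustrative values only).
walkers, gum_chewers = {"gerald", "olive"}, {"gerald"}
dogs, owns = {"cyril"}, {("angus", "cyril")}

# \x.(walk(x) & chew_gum(x)) as a Python function.
walk_and_chew = lambda x: x in walkers and x in gum_chewers

# \x.\y.(dog(x) & own(y, x)), curried: one argument at a time.
dog_owned_by = lambda x: lambda y: x in dogs and (y, x) in owns

# Applying a function to an argument plays the role of beta-reduction.
print(walk_and_chew("gerald"))
owned_cyril = dog_owned_by("cyril")        # still a function of y
print(owned_cyril("angus"), owned_cyril("olive"))

# Raw strings keep the backslash literal, so both spellings are equal.
print(r"\x.walk(x)" == "\\x.walk(x)")
```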

## Quantified NPs

At the start of this section, we briefly described how to build a semantic representation for Cyril barks. You would be forgiven for thinking this was all too easy—surely there is a bit more to building compositional semantics.

In [None]:
# lp = nltk.LogicParser()

In [None]:
# tvp = lp.parse(r'\X x.X(\y.chase(x,y))')

In [None]:
# np = lp.parse(r'(\P.exists x.(dog(x) & P(x)))')

In [None]:
# vp = nltk.ApplicationExpression(tvp, np)

In [None]:
# print vp

In [None]:
# print vp.simplify()

In [None]:
# from nltk import load_parser

In [None]:
# parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0)

In [None]:
# sentence = 'Angus gives a bone to every dog'

In [None]:
# tokens = sentence.split()

In [None]:
# trees = parser.nbest_parse(tokens)

In [None]:
# for tree in trees:
    # print tree.node['SEM']

In [None]:
# v = """
#     bertie => b
#     olive => o
#     cyril => c
#     boy => {b}
#     girl => {o}
#     dog => {c}
#     walk => {o, c}
#     see => {(b, o), (c, b), (o, c)}
#     """

In [None]:
# val = nltk.parse_valuation(v)

In [None]:
# g = nltk.Assignment(val.domain)

In [None]:
# m = nltk.Model(val.domain, val)

In [None]:
# sent = 'Cyril sees every boy'

In [None]:
# grammar_file = 'grammars/book_grammars/simple-sem.fcfg'

In [None]:
# results = nltk.batch_evaluate([sent], grammar_file, m, g)[0]

In [None]:
# for (syntree, semrel, value) in results:
    # print semrep
    # print value

## Transitive Verbs

Our next challenge is to deal with sentences containing transitive verbs.

In [None]:
read_expr = nltk.sem.Expression.fromstring

In [None]:
tvp = read_expr(r'\X x.X(\y.chase(x,y))')

In [None]:
np = read_expr(r'(\P.exists x.(dog(x) & P(x)))')

In [None]:
vp = nltk.sem.ApplicationExpression(tvp, np)

In [None]:
print(vp)

In [None]:
print(vp.simplify())
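
The same composition can be mimicked with Python higher-order functions over a toy model, which may help in reading the λ-terms. This is a sketch under stated assumptions: the domain and the `dog` and `chase` relations are invented for illustration, and the names carry a `_fn` suffix to mark them as Python functions rather than NLTK expressions.

```python
toy_domain = {"a", "c", "d"}
dog = {"c", "d"}
chase = {("a", "c")}          # a chases the dog c

# \P.exists x.(dog(x) & P(x)): the NP "a dog" takes a property P.
np_fn = lambda P: any(x in dog and P(x) for x in toy_domain)

# \X x.X(\y.chase(x,y)): the verb takes the object NP X, then the subject x.
tvp_fn = lambda X: lambda x: X(lambda y: (x, y) in chase)

# Function application builds the VP "chases a dog".
vp_fn = tvp_fn(np_fn)

print(vp_fn("a"))   # a chases some dog
print(vp_fn("c"))   # c chases nothing
```

The resulting `vp_fn` is a property of individuals: it is true of x exactly when x chases some dog.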

The grammar simple-sem.fcfg contains a small set of rules for parsing and translating simple examples of the kind that we have been looking at. Here's a slightly more complicated example.

In [None]:
from nltk import load_parser

In [None]:
parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0)

In [None]:
sentence = 'Angus gives a bone to every dog'

In [None]:
tokens = sentence.split()

In [None]:
for tree in parser.parse(tokens):
    print(tree.label()['SEM'])

NLTK provides some utilities to make it easier to derive and inspect semantic interpretations. The function interpret_sents() is intended for interpretation of a list of input sentences. It builds a dictionary d where for each sentence sent in the input, d[sent] is a list of pairs (synrep, semrep) consisting of trees and semantic representations for sent. The value is a list since sent may be syntactically ambiguous; in the following example, however, there is only one parse tree per sentence in the list.

In [None]:
sents = ['Irene walks', 'Cyril bites an ankle']

In [None]:
grammar_file = 'grammars/book_grammars/simple-sem.fcfg'

In [None]:
for results in nltk.interpret_sents(sents, grammar_file):
    for (synrep, semrep) in results:
        print(synrep)

We have seen now how to convert English sentences into logical forms, and earlier we saw how logical forms could be checked as true or false in a model. Putting these two mappings together, we can check the truth value of English sentences in a given model. Let's take model m as defined above. The utility evaluate_sents() resembles interpret_sents() except that we need to pass a model and a variable assignment as parameters. The output is a triple (synrep, semrep, value) where synrep, semrep are as before, and value is a truth value. For simplicity, the following example only processes a single sentence.

In [None]:
v = """
    bertie => b
    olive => o
    cyril => c
    boy => {b}
    girl => {o}
    dog => {c}
    walk => {o, c}
    see => {(b, o), (c, b), (o, c)}
    """

In [None]:
val = nltk.Valuation.fromstring(v)

In [None]:
g = nltk.Assignment(val.domain)

In [None]:
m = nltk.Model(val.domain, val)

In [None]:
sent = 'Cyril sees every boy'

In [None]:
grammar_file = 'grammars/book_grammars/simple-sem.fcfg'

In [None]:
results = nltk.evaluate_sents([sent], grammar_file, m, g)[0]

In [None]:
for (syntree, semrep, value) in results:
    print(semrep)
    print(value)

## Quantifier Ambiguity Revisited

One important limitation of the methods described earlier is that they do not deal with scope ambiguity.

In [None]:
from nltk.sem import cooper_storage as cs

In [None]:
sentence = 'every girl chases a dog'

In [None]:
trees = cs.parse_with_bindops(sentence, grammar='grammars/book_grammars/storage.fcfg')

In [None]:
semrep = trees[0].label()['SEM']

In [None]:
cs_semrep = cs.CooperStore(semrep)

In [None]:
print(cs_semrep.core)

In [None]:
for bo in cs_semrep.store:
    print(bo)

In [None]:
cs_semrep.s_retrieve(trace=True)

In [None]:
for reading in cs_semrep.readings:
    print(reading)
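
The core idea of storage-based scoping is that quantifiers are set aside in a store and then retrieved in every possible order, each order giving one reading. The sketch below shows this with string templates and `itertools.permutations`. It is a deliberate simplification, not the `cooper_storage` implementation: the template encoding and the function `retrieve` are assumptions of this example.

```python
from itertools import permutations

core = "chase(z1,z2)"
# Each stored quantifier is a template with a hole for its scope.
store = [
    "all z1.(girl(z1) -> {scope})",       # every girl
    "exists z2.(dog(z2) & {scope})",      # a dog
]

def retrieve(core, store):
    """One reading per ordering of the store; later quantifiers take wider scope."""
    readings = []
    for ordering in permutations(store):
        formula = core
        for quantifier in ordering:
            formula = quantifier.format(scope=formula)
        readings.append(formula)
    return readings

scope_readings = retrieve(core, store)
for reading in scope_readings:
    print(reading)
```

With two quantifiers there are two orderings, hence the two readings of the example sentence.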

# Discourse Semantics

A discourse is a sequence of sentences. Very often, the interpretation of a sentence in a discourse depends on what preceded it. A clear example of this comes from anaphoric pronouns, such as he, she, and it. Given a discourse such as "Angus used to have a dog. But he recently disappeared.", you will probably interpret he as referring to Angus’s dog. However, in "Angus used to have a dog. He took him for walks in New Town.", you are more likely to interpret he as referring to Angus himself.

## Discourse Representation Theory

In [None]:
read_dexpr = nltk.sem.DrtExpression.fromstring

In [None]:
drs1 = read_dexpr('([x, y], [angus(x), dog(y), own(x, y)])')

In [None]:
drs1.draw()

In [None]:
print(drs1.fol())

In [None]:
drs2 = read_dexpr('([x], [walk(x)]) + ([y], [run(y)])')

In [None]:
print(drs2)

In [None]:
print(drs2.simplify())

In [None]:
drs3 = read_dexpr('([], [(([x], [dog(x)]) -> ([y],[ankle(y), bite(x, y)]))])')

In [None]:
print(drs3.fol())

In [None]:
drs4 = read_dexpr('([x, y], [angus(x), dog(y), own(x, y)])')

In [None]:
drs5 = read_dexpr('([u, z], [PRO(u), irene(z), bite(u, z)])')

In [None]:
drs6 = drs4 + drs5

In [None]:
print(drs6.simplify())

In [None]:
print(drs6.simplify().resolve_anaphora())

In [None]:
from nltk import load_parser
parser = load_parser('grammars/book_grammars/drt.fcfg', logic_parser=nltk.sem.drt.DrtParser())

In [None]:
trees = list(parser.parse('Angus owns a dog'.split()))

In [None]:
print(trees[0].label()['SEM'].simplify())

## Discourse Processing

When we interpret a sentence, we use a rich context for interpretation, determined in part by the preceding context and in part by our background assumptions.

In [None]:
dt = nltk.DiscourseTester(['A student dances', 'Every student is a person'])

In [None]:
dt.readings()

When a new sentence is added to the current discourse, setting the parameter consistchk=True causes consistency to be checked by invoking the model checker for each thread, i.e., sequence of admissible readings. In this case, the user has the option of retracting the sentence in question.

In [None]:
dt.add_sentence('No person dances', consistchk=True)

In [None]:
dt.retract_sentence('No person dances', verbose=True)

In [None]:
dt.add_sentence('A person dances', informchk=True)

In [None]:
from nltk.tag import RegexpTagger
tagger = RegexpTagger(
                    [('^(chases|runs)$', 'VB'),
                    ('^(a)$', 'ex_quant'),
                    ('^(every)$', 'univ_quant'),
                    ('^(dog|boy)$', 'NN'),
                    ('^(He)$', 'PRP')
                    ])

In [None]:
rc = nltk.DrtGlueReadingCommand(depparser=nltk.MaltParser(tagger=tagger))

In [None]:
dt = nltk.DiscourseTester(['Every dog chases a boy', 'He runs'], rc)

In [None]:
dt.readings()

In [None]:
dt.readings(show_thread_readings=True)

In [None]:
dt.readings(show_thread_readings=True, filter=True)