## Caveat

The discussions below are primarily based on English, as that is my first language.

I know that some assumptions may not apply to all languages. However, I am concentrating first on text generation in English.

## Why do we want to convert to a phrase-based grammar?

Because this allows us to model the different options available for different chunks of speech. For example, we can have the following rules:

* VP > VERB
* VP > VERB ADV
* VP > ADV VERB

Or:

* S > NP VP
* S > NP VP NP
* S > NP VP NP VP
* S > NP VP CC S
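As a rough illustration, the rules above can be held in a plain dictionary and expanded recursively. This is only a sketch: the `expand` function and the `max_depth` cut-off are my own assumptions, added so that the recursive `S > NP VP CC S` rule always terminates.

```python
import random

# Rules copied from the two lists above. Symbols without an entry
# (NP, CC, VERB, ADV) are treated as terminals in this sketch.
RULES = {
    "S": [
        ["NP", "VP"],
        ["NP", "VP", "NP"],
        ["NP", "VP", "NP", "VP"],
        ["NP", "VP", "CC", "S"],
    ],
    "VP": [["VERB"], ["VERB", "ADV"], ["ADV", "VERB"]],
}


def expand(symbol, rng, depth=0, max_depth=3):
    """Recursively expand a symbol until only symbols without rules remain."""
    if symbol not in RULES:
        return [symbol]
    options = RULES[symbol]
    # Past max_depth, drop the recursive S rule so expansion terminates.
    if depth >= max_depth:
        options = [o for o in options if "S" not in o]
    out = []
    for child in rng.choice(options):
        out.extend(expand(child, rng, depth + 1, max_depth))
    return out


rng = random.Random(0)
print(expand("S", rng))
```

Each call picks one option per symbol uniformly at random; a learnt model would replace that uniform choice with rule probabilities.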

The dependency graph has single chains of words. The difficulty is knowing how to partition and group these chains. For example, we know that NOUN > DET is regularly repeated, but how do we know it should be treated as a modular unit NP?
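One simple heuristic (an assumption on my part, not a settled method) is to count how often each head > child pair of tags occurs across the dependency arcs, and to treat pairs that dominate the arcs leaving a given head as candidates for a modular unit. The arcs below are hand-written for two toy sentences rather than produced by a parser.

```python
from collections import Counter

# Hypothetical (head_pos, child_pos) arcs, as might be read off dependency
# parses of "The cat sat on the mat" and "A dog barked loudly".
arcs = [
    ("VERB", "NOUN"), ("NOUN", "DET"), ("VERB", "ADP"),
    ("ADP", "NOUN"), ("NOUN", "DET"),
    ("VERB", "NOUN"), ("NOUN", "DET"), ("VERB", "ADV"),
]

pair_counts = Counter(arcs)
head_counts = Counter(head for head, _ in arcs)

# For each pair, report the share of arcs leaving that head tag.
for (head, child), n in pair_counts.most_common():
    share = n / head_counts[head]
    print(f"{head} > {child}: {n} ({share:.2f} of arcs from {head})")
```

In this toy data every arc leaving a NOUN goes to a DET, which is the kind of regularity that might justify grouping the two into an NP.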

## Sentences and Things

Thinking about language use, functional grammars seem useful to seed our sentences. For example, language is about "things" and "happenings", which at a simple level are mapped onto nouns and verbs. The reason the dependency graph takes a verb as its root is that a verb is seen to be a prerequisite for a sentence.

Looking at how language developed in my children, it is my belief that "things" have a slight primacy over "happenings". A child's first words tend to be nouns: "bus", "dog", "mama". It may be 6-12 months later that verbs come along: "get drink", "go there".

When looking at text generation, we need to *condition* the generation on *something*. A sentence needs to be about something. In a formal sentence, the subject appears to have primacy over the object, as a subject-verb pair makes grammatical sense on its own ("The cat sat") and the object appears to provide further information. Object-verb pairs tend to occur as commands or imperative phrases ("Sit here"). While common in speech, they are less common in formal writing. (We could test this by looking at distributions across our patent data.)

Indeed, it appears that in many cases, the *things* in our sentence determine the verbs that are used. For example, if we have "cat" and "mat" this limits the relations between them (i.e. "swam" seems an unusual choice). The sequence of generation thus seems to be something along the lines of:

* S > SUBJ
* SUBJ > OBJ
* SUBJ > VERB
* SUBJ, OBJ > VERB
* SUBJ, VERB > OBJ
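To make the ordering concrete, here is a minimal sketch that samples the subject first, then the object given the subject, then the verb given both. The probability tables are invented for illustration; in practice they would be estimated from a corpus.

```python
import random

# Hypothetical conditional tables (made-up numbers, not corpus estimates).
P_SUBJ = {"cat": 0.6, "fish": 0.4}
P_OBJ_GIVEN_SUBJ = {
    "cat": {"mat": 0.8, "pond": 0.2},
    "fish": {"pond": 1.0},
}
P_VERB_GIVEN_SUBJ_OBJ = {
    ("cat", "mat"): {"sat on": 0.9, "scratched": 0.1},
    ("cat", "pond"): {"watched": 1.0},
    ("fish", "pond"): {"swam in": 1.0},
}


def sample(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]


def sample_svo(rng):
    subj = sample(P_SUBJ, rng)                              # S > SUBJ
    obj = sample(P_OBJ_GIVEN_SUBJ[subj], rng)               # SUBJ > OBJ
    verb = sample(P_VERB_GIVEN_SUBJ_OBJ[(subj, obj)], rng)  # SUBJ, OBJ > VERB
    return subj, verb, obj


rng = random.Random(0)
print(sample_svo(rng))
```

Because the verb table is keyed on the (subject, object) pair, "swam" can never be drawn for "cat" and "mat", which is the limiting effect described above.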

Thinking of *things* fits nicely with ideas of *coverage* and *attention*. In a piece of writing, e.g. a paragraph or document, we expect to cover a set of sentences about certain things. Each sentence may be about a different set of things, which is where attention comes in. However, at a top level, the set of things is reasonably limited. What we do, though, is *zoom in* on things to add detail, i.e. we select one thing to be the centre of attention together with a context, and we generate another sentence. This is performed within the constraints of *coverage*: we know that repetition is bad and should not occur.

In phrase-based grammars, subjects and objects are represented as noun phrases. Again, this needs investigation: we need to see which phrase-based groups subjects and objects belong to.

## Context

If we know that our sentence is about *things*, where do we get our inspiration from? Or more precisely, how do we know what *things* to write about?

Let's have a look at some current applications of natural language generation:

* machine translation: the output sentence is about the same things as the input sentence;
* summarization: the output sentence is about the things in a larger input document or body of text; and
* image captioning: the output sentence is about the things in the image.

In each case, we have some form of input which provides the source of *things* that are present in the output sentence.

In each case, there is some form of encoding of the input into a continuous multidimensional representation (generally, a vector of 100-300 elements). 

When training, the model learns a mapping from the input data to the continuous multidimensional representation and then a mapping from the continuous multidimensional representation to an output sentence.

Even when attention is applied, this seems to modify the values in the continuous multidimensional representation but does not generally change the form of the representation.

Hence, when thinking about language generation, we can build a framework based on a general continuous multidimensional representation, a "context tensor".

## How will we use the result in generative models?

One framework for language generation appears to be the following process:

* Receive a context tensor;
* Sample a subject representation based on the context tensor;
* Given the context tensor and the subject representation, iteratively sample verb and object representations;
* Somehow map the SVO sequence to a top-level phrase-based sequence;
* Apply learnt rules hierarchically and modularly to provide language portions for each phrase.
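The first three steps might be sketched as follows. Everything here is an assumption for illustration: the 3-element embeddings are invented (real ones would be learnt and far larger), the context is a plain vector, and "conditioning" is done by simply adding the chosen word's embedding to the context. The mapping from SVO to phrases is left out because it is still an open question.

```python
import math
import random

# Hypothetical 3-dimensional embeddings, hand-written for this sketch.
EMB = {
    "cat": [1.0, 0.0, 0.2],
    "mat": [0.8, 0.1, 0.0],
    "stock": [0.0, 1.0, 0.5],
    "sat": [0.9, 0.3, 0.1],
    "rose": [0.1, 0.9, 0.4],
}
NOUNS = ["cat", "mat", "stock"]
VERBS = ["sat", "rose"]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sample_word(candidates, query, rng):
    """Sample a candidate with probability softmax(embedding . query)."""
    probs = softmax([dot(EMB[w], query) for w in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]


def generate_svo(context, rng):
    subj = sample_word(NOUNS, context, rng)  # subject from the context alone
    state = add(context, EMB[subj])          # condition on the subject
    verb = sample_word(VERBS, state, rng)
    state = add(state, EMB[verb])            # condition on the verb as well
    obj = sample_word([n for n in NOUNS if n != subj], state, rng)
    return subj, verb, obj


print(generate_svo([1.0, 0.0, 0.0], random.Random(0)))
```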

We also have the following ideas:

* At least one level of attention is constant for each sentence, but varies between sentences.
* Each sentence should say something different, i.e. we can add a constraint that forces the generation away from a sentence encoding that is close to a previous sentence encoding.
* We should be able to take a phrase in the sentence and sample at progressive levels of detail.
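The second idea above could be expressed as a penalty on similarity to earlier sentence encodings. This is a sketch under my own assumptions: encodings are plain lists of floats, similarity is cosine similarity, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def novelty_penalty(candidate, previous, weight=1.0):
    """Penalty that grows as the candidate approaches any earlier encoding."""
    if not previous:
        return 0.0
    closest = max(cosine(candidate, p) for p in previous)
    return weight * max(0.0, closest)


def pick_most_novel(candidates, previous):
    """Choose the candidate encoding with the smallest penalty."""
    return min(candidates, key=lambda c: novelty_penalty(c, previous))


previous = [[1.0, 0.0]]
print(pick_most_novel([[1.0, 0.1], [0.0, 1.0]], previous))
```

In a trained model the same penalty could be added to the loss or used to re-rank sampled sentences; which of these works better is untested.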

## How to learn rules in a deep learning framework?

Grammar rules take an input and produce an output. The input is fixed in length, but the output may vary in length (although it can be constrained to generate a binary tree, i.e. input > O1, O2, with exactly two outputs).

If a binary tree is assumed, the rules may be modelled with a feed-forward neural network, where each pair of outputs is a different output symbol.
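A minimal sketch of the binary case: the input symbol is one-hot encoded, passed through one hidden layer, and a softmax is taken over every ordered pair of symbols. The symbol set, layer sizes, and random untrained weights are all assumptions for illustration.

```python
import math
import random

SYMBOLS = ["S", "NP", "VP", "NOUN", "DET", "VERB", "ADV"]
# Every ordered pair of symbols is one output class, e.g. ("NP", "VP").
PAIRS = [(a, b) for a in SYMBOLS for b in SYMBOLS]
HIDDEN = 8


def one_hot(symbol):
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


def init_layer(n_in, n_out, rng):
    """Weight matrix with n_out rows and n_in columns, small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def forward(symbol, w_hidden, w_out):
    """Return a probability for each pair in PAIRS given the input symbol."""
    x = one_hot(symbol)
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
w_hidden = init_layer(len(SYMBOLS), HIDDEN, rng)
w_out = init_layer(HIDDEN, len(PAIRS), rng)
probs = forward("S", w_hidden, w_out)
best = PAIRS[max(range(len(PAIRS)), key=probs.__getitem__)]
print(best)  # untrained weights, so the predicted pair is arbitrary
```

The output layer grows with the square of the number of symbols, which is one practical cost of encoding each pair as its own class.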

If we assume varying length outputs, then the rules may be modelled with a recurrent neural network. In this case, we supply a single input and multiple outputs are produced until we get a stop token. There are different ways we can apply such a recurrent neural network:

* Supply the same input at each time step (this is a one-to-many implementation); or
* Supply the previous output as the input for the next time step (a many-to-many implementation).
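The two options differ only in what is fed in at each step, which a small decoding loop makes visible. This is a sketch with an untrained Elman-style cell; the vocabulary, hidden size, greedy decoding, and `max_len` guard are assumptions made for the example.

```python
import math
import random

VOCAB = ["<stop>", "S", "NP", "VP", "CC"]
HIDDEN = 4


def make_params(rng):
    def mat(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_xh": mat(HIDDEN, len(VOCAB)),
        "W_hh": mat(HIDDEN, HIDDEN),
        "W_hy": mat(len(VOCAB), HIDDEN),
    }


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


def step(params, x, h):
    """One recurrent step: returns output scores and the new hidden state."""
    pre = zip(matvec(params["W_xh"], x), matvec(params["W_hh"], h))
    h_new = [math.tanh(a + b) for a, b in pre]
    return matvec(params["W_hy"], h_new), h_new


def decode(params, start_symbol, feed_previous, max_len=6):
    start = VOCAB.index(start_symbol)
    x, h, outputs = one_hot(start), [0.0] * HIDDEN, []
    for _ in range(max_len):  # guard in case no stop token is produced
        scores, h = step(params, x, h)
        token = max(range(len(VOCAB)), key=scores.__getitem__)  # greedy choice
        if VOCAB[token] == "<stop>":
            break
        outputs.append(VOCAB[token])
        # Either feed the chosen output back in, or repeat the original input.
        x = one_hot(token) if feed_previous else one_hot(start)
    return outputs


params = make_params(random.Random(0))
print(decode(params, "S", feed_previous=False))
print(decode(params, "S", feed_previous=True))
```

With untrained weights the printed sequences are meaningless; the point is only that the same cell supports both input strategies.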

We can use the same network to model the rules at the different levels. 

## What are our tokens in our deep learning model?

If we start with "terminal tokens", we have parts of speech. 

spaCy has [the following parts of speech tags](https://spacy.io/api/annotation#pos-tagging):

| Tag | Description | Examples |
| --- | --- | --- |
| ADJ | adjective | big, old, green, incomprehensible, first |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where, there |
| AUX | auxiliary | is, has (done), will (do), should (do) |
| CONJ | conjunction | and, or, but |
| CCONJ | coordinating conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| PART | particle | 's, not |
| PRON | pronoun | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | ., (, ), ? |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | `$, %, §, ©, +, −, ×, ÷, =, :), 😝` |
| VERB | verb | run, runs, running, eat, ate, eating |
| X | other | sfpksdpsxmsa |
| SPACE | space | |

There are thus 19 in total.
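For a model, these tags need integer ids. A minimal sketch, assuming we simply index the tags in the order listed above (the names `POS_TAGS`, `TAG_TO_ID` and `encode` are my own):

```python
# Tags in the order they are listed above.
POS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "SPACE",
]
TAG_TO_ID = {tag: i for i, tag in enumerate(POS_TAGS)}


def encode(tags):
    """Map a sequence of tag strings to their integer ids."""
    return [TAG_TO_ID[t] for t in tags]


print(len(POS_TAGS))                    # 19
print(encode(["DET", "NOUN", "VERB"]))  # [6, 8, 16]
```

Non-terminal symbols such as NP or VP would need to be appended to the same vocabulary once we know how many of them there are.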

If we work with patent data, then we will generally have fewer proper nouns and interjections.

One of our first tasks is thus to work out the scale of our non-terminal rules.