# Resources for Morphology

### CELEX
This database comes from the **Ce**nter for **Lex**ical Information, developed at the Max Planck Institute for Psycholinguistics (Netherlands). It attempts to provide a mapping between words in three Germanic languages: English, Dutch, and German. It provides information on the orthography, phonology, morphology, syntax, and frequency of either (1) different lemmas or root forms or (2) raw word forms. It is *not* open source and you need a license to use it, so we will not use it here. However, you may see it referenced if you read further into papers on English morphology, especially older ones.

### English Lexicon Project (ELP)

In psycholinguistics, researchers have studied how morphologically complex words are processed in the brain. Because of this, they have created rich datasets that annotate for morphological information that psycholinguists might be interested in. The English Lexicon Project ([https://elexicon.wustl.edu/](https://elexicon.wustl.edu/)) is one such resource. It has a complex data schema that allows researchers to look for words with certain properties. Remember that English morphology can generally be broken down into the following categories:

* Derivational morphology (changes word categories; surprise --> surprisal)
* Inflectional morphology (retains word categories; cat --> cats)
* Prefixes (before the root)
* Suffixes (after the root)

That data schema looks like this:


```python
# downloading some of the resources from open source

!pip install citylex # morphology dataset interface
!citylex --elp # download the english lexicon project
!ls # show its location in drive

import pandas as pd

citylex = pd.read_csv("./citylex.tsv", sep="\t")
citylex.tail()
```

```
INFO: Requesting URL: https://raw.githubusercontent.com/kylebgorman/ELP-annotations/master/ELP.csv
INFO: Collected 68,623 ELP analyses
INFO: Writing out textproto...
INFO: Writing out TSV...
INFO: Success!
citylex.textproto  citylex.tsv  sample_data
```


| Unnamed: 0 | wordform | elp_morph_sp | elp_nmorph |
| --- | --- | --- | --- |
| 79667 | zur | | |
| 79668 | zurcher | | |
| 79669 | zurich | {zurich} | 1.0 |
| 79670 | zwei | | |
| 79671 | zworykin | | |


### Many languages with a "universal" grammatical schema: Unimorph

Unimorph attempts to address some of the challenges of using morphological resources like CELEX and the ELP above by _specifically labeling_ the grammatical properties of the morphemes.

Unimorph (Nicolai et al., 2020) has a huge repository of languages available with morphological annotations: https://github.com/orgs/unimorph/repositories



```python
!citylex --unimorph --elp
!ls

import pandas as pd

citylex = pd.read_csv("./citylex.tsv", sep="\t")
citylex.tail()
```

```
INFO: Requesting URL: https://raw.githubusercontent.com/kylebgorman/ELP-annotations/master/ELP.csv
INFO: Collected 68,623 ELP analyses
INFO: Requesting URL: https://raw.githubusercontent.com/unimorph/eng/master/eng
INFO: Collected 115,523 UniMorph analyses
INFO: Writing out textproto...
INFO: Writing out TSV...
INFO: Success!
citylex.textproto  citylex.tsv  sample_data
```


| Unnamed: 0 | wordform | elp_morph_sp | elp_nmorph | unimorph_morph |
| --- | --- | --- | --- | --- |
| 142852 | œconomizing | | | œconomize_V;V.PTCP;PRS |
| 142853 | œstruate | | | œstruate_V;NFIN |
| 142854 | œstruated | | | œstruate_V;PST^œstruate_V;V.PTCP;PST |
| 142855 | œstruates | | | œstruate_V;3;SG;PRS |
| 142856 | œstruating | | | œstruate_V;V.PTCP;PRS |


### The Unimorph Schema

Let's break down some of the data in the `unimorph_morph` column above.

> "Œconomize": Obsolete spelling of economize.

Unimorph uses specific notation to express complex linguistic concepts. For example, the form "œconomizing" comes from the verb "œconomize". Unimorph therefore marks the lemma "œconomize" as a verb with an underscore suffix (`_V`), followed by a list of tags (separated by `;`) that correspond to other grammatical properties of the word.

* `V.PTCP` is "Verb: Participle"
  *  Because "œconomizing" ends in `-ing`, in English this makes it a "participle", which is just another way of saying a verb form that can also act like an adjective (e.g., "a **dancing** bear")
  * For the curious: Other participles in English include "-ed" forms of verbs, such as "I have brush**ed** my teeth". This is different from the regular past-tense -ed (e.g., "I brush**ed** my teeth" would get a different tag in Unimorph).
* `PRS` is "Present"
  * "Present" in English is a way of encoding grammatical **tense**; i.e., of expressing at what time an event takes place. When something takes place in the present tense, it often means the event is currently taking place (compare "I eat sandwiches for lunch" and "I ate sandwiches for lunch").
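
To make the notation concrete, here is a minimal sketch that splits one `unimorph_morph` cell into its parts. The function name `parse_unimorph` is hypothetical, and treating `^` as a separator between alternative analyses is an assumption based on the "œstruated" row above, not a documented guarantee.

```python
def parse_unimorph(cell):
    """Split a unimorph_morph cell into (lemma, part_of_speech, tags) triples."""
    analyses = []
    # Assumption: "^" separates alternative analyses of the same word form.
    for analysis in cell.split("^"):
        # The first ";"-separated field is "lemma_POS"; the rest are tags.
        lemma_pos, *tags = analysis.split(";")
        # "œconomize_V" -> lemma "œconomize", part of speech "V"
        lemma, _, pos = lemma_pos.rpartition("_")
        analyses.append((lemma, pos, tags))
    return analyses


print(parse_unimorph("œconomize_V;V.PTCP;PRS"))
print(parse_unimorph("œstruate_V;PST^œstruate_V;V.PTCP;PST"))
```

The second call returns two analyses for one word form, which is how the past tense and the past participle of "œstruated" are kept apart.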

Many of the tags are linguistically informed and try to capture a broad range of linguistic **form-to-meaning mappings**.

If we know something about the grammatical purpose of a sequence of characters (e.g., `PRS` can map onto "-s" in English), we might then ask an NLP system (e.g., a finite state automaton/finite state transducer) to produce a form of a verb that it has never seen before, given the base form of the verb ("hiss") and the `PRS` feature.
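
As a rough illustration of this kind of generation, the sketch below hard-codes one English spelling rule instead of using a real finite state transducer. The function names are hypothetical, the rule set is deliberately incomplete (it ignores cases like "try" and "tries"), and it uses the full `3;SG;PRS` tag bundle that appears in the data rather than `PRS` alone.

```python
# Endings after which English spells the suffix as "-es" (simplified).
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def inflect_3sg_prs(base):
    """Attach the third-person singular present suffix to a base verb."""
    if base.endswith(SIBILANT_ENDINGS):
        return base + "es"  # hiss -> hisses
    return base + "s"       # dance -> dances


def generate(base, tags):
    """Produce an inflected form from a base form and a list of tags."""
    if set(tags) == {"3", "SG", "PRS"}:
        return inflect_3sg_prs(base)
    raise ValueError(f"no rule for tags {tags}")


print(generate("hiss", ["3", "SG", "PRS"]))
print(generate("dance", ["3", "SG", "PRS"]))
```

The point is that the tags name the meaning, while the rule decides which surface form realizes it.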

Recall our Turkish example from Monday?

* `-dE` can be realized in two forms: `-de` and `-da` depending on the sounds in the prior morphemes.
* Unimorph should treat both of these the same using some kind of consistent "tag" or label that defines what the meaning of "-dE" is.
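
A minimal sketch of the `-dE` alternation, assuming the choice depends only on whether the last vowel of the stem is a front or a back vowel. The function name is hypothetical, and the sketch covers only the two forms mentioned above; it ignores other factors in real Turkish, such as the consonant changing to "t" after voiceless sounds.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def realize_dE(stem):
    """Pick "-de" or "-da" based on the last vowel of a lowercase stem."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return stem + "de"
        if ch in BACK_VOWELS:
            return stem + "da"
    raise ValueError(f"no vowel found in {stem!r}")


print(realize_dE("ev"))    # last vowel "e" is front
print(realize_dE("okul"))  # last vowel "u" is back
```

Both outputs would receive the same tag in a resource like Unimorph, because the two surface forms carry the same meaning.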

And for ambiguous morphemes, like -s in English, there are two relations:
* -s for verbs
* -s for nouns

```python
#@title Different morphological representations

citylex[citylex['wordform'].isin({"dances", "cats", "mice", "bosses", "hours", "hour", "time"})]
```

| Unnamed: 0 | wordform | elp_morph_sp | elp_nmorph | unimorph_morph |
| --- | --- | --- | --- | --- |
| 14800 | bosses | {boss}>s> | 2.0 | boss_V;3;SG;PRS |
| 18874 | cats | {cat}>s> | 2.0 | cat_V;3;SG;PRS |
| 29191 | dances | {dance}>s> | 2.0 | dance_V;3;SG;PRS |
| 60063 | hour | {hour} | 1.0 | |
| 60070 | hours | {hour}>s> | 2.0 | |
| 75895 | mice | {mice} | 1.0 | |
| 127455 | time | {time} | 1.0 | time_V;NFIN |


As we can see, Unimorph is not complete. For example, many common words like "mice" do not have an annotation, and it treats many plural nouns as verbs. Even more problematically, much of the data was generated by rules, and rules that are overly general lead to errors (such as "cats" being a verb). Without better rules, Unimorph will continue to be insufficient for many applications. For this reason, the Unimorph resource is mostly used to generate possible valid pairs from known form-to-meaning mappings. Even then, the **limits of the size of the data** will pose a problem for any useful analysis of real-world data.

So, in summary: Many resources are incomplete for different reasons. The English Lexicon Project has decent segmentations for morphemes within English words, and Unimorph has clear tags for the role those morphemes play in the word, but no information about where they occur. In principle, both resources could complement each other.
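
One way the two resources could be combined is sketched below. The notation handled here (`{root}` and `>suffix>`) is inferred only from the `elp_morph_sp` values shown in the table above, so any other segment types in the ELP data are not covered. The function names `parse_elp` and `combine` are hypothetical.

```python
import re

# Matches either a "{root}" segment or a ">suffix>" segment.
SEGMENT = re.compile(r"\{([^}]+)\}|>([^>]+)>")


def parse_elp(morph_sp):
    """Turn an ELP segmentation like "{boss}>s>" into labeled segments."""
    segments = []
    for root, suffix in SEGMENT.findall(morph_sp):
        if root:
            segments.append(("ROOT", root))
        else:
            segments.append(("SUFFIX", suffix))
    return segments


def combine(morph_sp, unimorph):
    """Pair ELP segments (form) with Unimorph tags (function)."""
    # Drop the leading "lemma_POS" field and keep only the tags.
    tags = unimorph.split(";")[1:]
    return {"segments": parse_elp(morph_sp), "tags": tags}


print(combine("{dance}>s>", "dance_V;3;SG;PRS"))
```

Note that this only places the segments and the tags side by side; deciding which tag belongs to which segment is still left open.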

In general, making "complete" morphological resources that are flexible and error-free is really hard. Encoding both information about **form** and **function** is also a major challenge. This is why a lot of approaches to morphology have moved to **unsupervised methods** that focus primarily on form.

# Supervised vs. Unsupervised Learning

Supervised learning is when you use data that has already been labeled or measured. For example, if we have a dataset of documents, and we want to predict whether they are spam or not, we might have a pre-existing dataset of emails that have been labeled as either being spam or not ($S$). We can then try to predict the _label_ associated with each email from the words ($W$) in our vocabulary ($V$) it contains. This model is a **supervised model** because we *know* the labels ahead of time. 

For example, if we see that ALL (100%) of the emails that are labeled "spam" have the word "money" in them, and NONE (0%) of the emails that are not spam have that word, then we might want to say that:

```python
my_tokenized_email = ["Please", "send", "us", "all", "your", "money", "!", "!"]
if "money" in my_tokenized_email:
  email_is_spam = True
else:
  email_is_spam = False
```

When building natural language processing applications, we often do not want to use very strict, hard-coded rules like this one, because they would probably flag legitimate emails that happen to mention money as spam. A supervised learning model will instead learn **weights** on all of the words ($w$) in our vocabulary ($V$), but it is very similar in kind to the rule above. Importantly, we made our decision on the basis of _labels_ here.
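
To show what "weights" could look like, here is a small sketch that learns one weight per word from labeled emails. It uses smoothed log-odds in the style of naive Bayes, which is only one of many possible supervised models; the toy dataset and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy labeled data: (tokens, label), where True means spam.
labeled_emails = [
    (["send", "money", "now"], True),
    (["free", "money", "!"], True),
    (["lunch", "at", "noon", "?"], False),
    (["send", "the", "report", "now"], False),
]


def train(emails):
    """Learn a log-odds weight for every word in the vocabulary."""
    spam_counts, ham_counts = Counter(), Counter()
    for tokens, label in emails:
        (spam_counts if label else ham_counts).update(tokens)
    vocab = set(spam_counts) | set(ham_counts)
    # Add-one smoothing so unseen (word, label) pairs do not give log(0).
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    return {
        w: math.log((spam_counts[w] + 1) / spam_total)
        - math.log((ham_counts[w] + 1) / ham_total)
        for w in vocab
    }


def is_spam(tokens, weights):
    """Sum the weights of known words; a positive total means spam."""
    return sum(weights.get(t, 0.0) for t in tokens) > 0


weights = train(labeled_emails)
print(is_spam(["Please", "send", "us", "all", "your", "money", "!", "!"], weights))
```

Unlike the hard-coded rule, no single word decides the outcome: "money" pushes the score toward spam, but enough non-spam words can outweigh it.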

In the case of spam detection, it makes sense to require labels. But not all cases are well-suited to labeled data. In particular, researchers often want to get as much as possible out of datasets that would otherwise call for **linguistic expertise**, without hiring linguists or collecting annotations.


## Processing sequences to predict morphology

One way to think about how we can identify the **morphological structure** of a word is to think of it as a sequence labeling problem. For example, we might want to take a word like "dances" and produce a series of labels like "dance_V;3;SG;PRS". In this case, we could frame the problem as something like learning the mapping between word forms and morpheme categories, or the syntactic or semantic information they carry.

Character | Morpheme category
--- | ---
d | ROOT
a | ROOT
n | ROOT
c | ROOT
e | ROOT
s | 3;SG;PRS

Where we **predict** that the -s ending is "third person singular in the present tense." We could also predict something simpler -- such as whether it is third person or not, or whether it is singular or not, or whether it is the present tense or not. In English, this is easy because the -s mostly means all of these things at the same time.
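
The table above can be represented as a list of (character, label) pairs. The sketch below builds that list from a segmentation that is already known, which is the training target a sequence labeler would try to predict; the function name `label_characters` is hypothetical.

```python
def label_characters(root, suffix, suffix_label):
    """Label every character of root + suffix with its morpheme category."""
    labels = [(ch, "ROOT") for ch in root]
    labels += [(ch, suffix_label) for ch in suffix]
    return labels


for ch, label in label_characters("dance", "s", "3;SG;PRS"):
    print(ch, label)
```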

But, for many languages this is not so easy. In some languages, such as Finnish, we simply cannot observe all possible forms of a word (up to thousands of variants in some cases) and many morphemes are concatenated together, each with distinct meanings. This is analogous to the idea of not being able to process all sentences -- there will always be unseen data. So, another reason we consider unsupervised learning approaches is the need for tools to **combat sparsity** in our data. 

## Unsupervised learning to reduce data sparsity

We can consider how we can learn the best morphological "vocabulary" from data. For this, we can devise simple **unsupervised learning** algorithms to characterize regularities in our data that we can take advantage of. Without labels, we will try to find as many morphemes as possible so we can segment words into their component parts. (Figuring out what to call these morphemes, or identifying what grammatical role they play, is a much harder question.)

## Algorithms for morphology

There are many unsupervised learning algorithms for morphology. Nearly all of them make use of probability theory, such as **transition probabilities** and mutual information like we have computed before. One baseline algorithm for learning morphology-like sequences is the Morfessor algorithm, which we discuss below. We will also cover this in more detail on Friday, alongside the Byte-Pair Encoding (BPE) framework. Both of these are **bottom-up**, data-driven approaches to learning informative substrings that characterize word forms.
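
As a reminder of what transition probabilities look like at the character level, here is a minimal sketch that estimates P(next character | current character) from a word list. The start and end markers `^` and `$` and the tiny word list are assumptions made for illustration; this is not part of Morfessor or BPE.

```python
from collections import Counter, defaultdict


def transition_probabilities(words):
    """Estimate P(next char | current char) from a list of words."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word + "$"  # mark the start and end of each word
        for cur, nxt in zip(padded, padded[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for cur, nexts in counts.items()
    }


probs = transition_probabilities(["cat", "cats", "dance", "dances"])
print(probs["t"])  # "t" is followed by the word end or by "s"
print(probs["s"])  # "s" is always followed by the word end here
```

Points where the next character becomes hard to predict, like after "t" here, are one simple signal that a morpheme boundary may be nearby.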

Note: Some elements of the Morfessor algorithm will be challenging until later in the class. If you read the paper, be aware that some of the algorithms will be covered in class (EM, or Expectation Maximization), and others will not (Viterbi).

## Morfessor algorithm (Creutz & Lagus, 2002) -- Assigned reading

Two major components:
* Lexicon (words or word-like units -- "constructions" -- and their properties)
* Grammar (a system that controls how these units combine into "compounds")

Two general assumptions:
* "Compounds" are formed by one or more "constructions"
  * Maximally, a word can have as many constructions as "atoms" (e.g., characters or letters)
  * Corollary: We cannot have "invisible" constructions (e.g., **zero derivation**)
* Each construction within a compound can occur **independently** -- words are effectively constructed at random
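
The independence assumption can be sketched as follows: the probability of a compound under a given segmentation is the product of the probabilities of its constructions. The counts below are invented, and this covers only the independence assumption, not the full Morfessor objective.

```python
import math

# Hypothetical counts of how often each construction has been used.
construction_counts = {"dance": 3, "s": 5, "cat": 2, "d": 1, "ances": 1}


def log_prob(segmentation, counts):
    """Log probability of a segmentation, treating constructions as independent."""
    total = sum(counts.values())
    # Independence: multiply probabilities, i.e., add log probabilities.
    return sum(math.log(counts[c] / total) for c in segmentation)


print(log_prob(["dance", "s"], construction_counts))
print(log_prob(["d", "ances"], construction_counts))
```

With these counts, "dance" + "s" scores higher than "d" + "ances" because both of its pieces are frequent on their own.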

# Friday

1. We will discuss how Morfessor decides how to segment words into their component morphs. How does Morfessor iteratively learn -- without labels -- what morphs should be in its codebook?

2. We will discuss how Byte Pair Encoding (BPE) works.

3. We will compare BPE segmentations with morphologically "gold" segmentations to see how much they align with each other.

## HW2 is due Friday by midnight (11:59pm Eastern).
  ### Remember to upload it as a .ipynb file
  ### Please reach out if you have any questions!