## Clarification, HW2:

* Question 8 (mutual information question) says you need to pick an abstract from Question 5. Please pick an abstract from Question 6 (where you compare tokenizations).
* Compute bigram statistics over the entire corpus (otherwise the numbers will be unreliable) -- pointwise mutual information values can be computed on the fly as long as you know $p(AB)$, $p(A)$ and $p(B)$.
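
As a rough illustration of the second point, the sketch below counts unigrams and bigrams once over a whole (toy) corpus and then computes PMI on demand. The toy corpus, the variable names, and the simple relative-frequency estimates are assumptions made for illustration; this is not the required homework solution.

```python
import math
from collections import Counter

# Toy corpus standing in for the full set of abstracts (an assumption).
corpus = [
    ["new", "york", "is", "a", "big", "city"],
    ["new", "york", "has", "a", "new", "mayor"],
]

# Count over the ENTIRE corpus once, before computing any PMI values.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

n_unigrams = sum(unigram_counts.values())
n_bigrams = sum(bigram_counts.values())


def pmi(a, b):
    """Pointwise mutual information: log2( p(AB) / (p(A) * p(B)) )."""
    p_ab = bigram_counts[(a, b)] / n_bigrams
    p_a = unigram_counts[a] / n_unigrams
    p_b = unigram_counts[b] / n_unigrams
    # Note: an unseen bigram has p_ab == 0, and math.log2 raises ValueError.
    return math.log2(p_ab / (p_a * p_b))


print(pmi("new", "york"))
```

Because the counts are stored, any single PMI value is cheap to compute "on the fly" for whichever abstract you pick.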

<hr />

# What is a word?

We have been working with the `nltk.word_tokenize` function for the last two weeks. Here is what we know:

* It breaks up words like `"don't"` into two words: `["do", "n't"]`
* It does not break up complex words like `"reading"` into two pieces (i.e., it just says the word is `["reading"]`).
* With this tokenizer, we have no way to link similarly-spelled words to each other
* Not all languages are like English

## Morphology

**Morphemes** are normally defined as the smallest "meaningful" parts of language. When they are combined, they form more complex meanings, from single words to whole texts. The system of morphemes of a whole language is called its **morphology**.

There are many different kinds of morphemes and they can sometimes have different behavior. Every language uses morphology slightly differently. Here are just a few:

* English
  * **Inflectional morphology**: Keeps the base mostly the same
    * Noun morphology
      * Singular (_cat_) to plural (_cat**s**_)
    * Verb morphology
      * Progressive (dance to danc**ing**)
      * Past (dance to danc**ed**)
  * **Derivational morphology**: Creates a new word, often with a different meaning or part of speech
    * American
    * --> American**ize**
    * ---> Americaniz**ation**
    * ---> Americaniz**ability**
    * ---> Americanization**ing**
  * **Compounds**: Puts two morphemes together to create a new whole, usually with an _implicit_ grammatical relation
    * Dollhouse (a house for dolls)
    * Armchair (a chair with arms)
    * Boxcar (a train car that is box-shaped)

Many languages have much more complex systems than English. 

* For example, French has a rich system for conjugating verbs:
  * je danse (I dance)
  * je dans**ais** (I used to dance)
  * je dans**erais** (I would dance)
  * j'ai dans**é** (I danced)
* And Estonian has lots of ways of changing nouns depending on the meaning
  * kohvik (café)
  * kohvik**u** (café's)
  * kohviku**sse** (into the café)
  * kohviku**st** (from the café)

These are only a few of the possible forms. Many indigenous languages throughout the world have even richer morphological systems, and some languages have "words" that express something like a whole sentence. For example, in Inuit languages, we can express the idea "Even though it's not snowing a great deal" with a single word, ["Qanniluanngikkaluaqtuq"](https://en.wikipedia.org/wiki/Inuit_grammar#Verb_modifiers), which contains five morphemes (qanniq-, -luaq-, -nngit-, -galuaq-, -tuq).

In some languages, like Turkish, the sounds within a morpheme have to "agree" with the sounds of the preceding morpheme. So, for example, the morpheme "-dE" has two possible forms, "-de" and "-da", depending on what precedes it. The phrase "in Germany" is translated as "Almanya**da**" but "in Turkey" is translated as "Türkiye**de**", because the preceding vowel (/a/ and /e/, respectively) controls how the morpheme sounds. Each of the different ways that the "same" morpheme can be realized in pronunciation is known as an **allomorph**. Not all allomorphs are written differently. For example, compare these two words:

* cat --> cats
* dog --> dogs

Do you hear a difference in how the "-s" sounds at the end of the word?

<details>
<summary>The difference is...</summary>
that the -s in "cats" is said like the s in "hiss" ([s]) and the -s in "dogs" is said like the [z] sound at the end of the word "fries."
</details>
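
The Turkish "-de"/"-da" alternation described above can be sketched as a small function. This is a simplification under stated assumptions: it only looks at the last vowel of the noun, it ignores the "-te"/"-ta" variants that appear after voiceless consonants, and it omits the apostrophe that Turkish spelling places after proper nouns. The function name `locative` is hypothetical.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def locative(noun):
    """Attach -da or -de depending on the last vowel of the noun."""
    for ch in reversed(noun.lower()):
        if ch in BACK_VOWELS:
            return noun + "da"
        if ch in FRONT_VOWELS:
            return noun + "de"
    raise ValueError(f"no vowel found in {noun!r}")


print(locative("Almanya"))  # Almanyada
print(locative("Türkiye"))  # Türkiyede
```

The point is that a single abstract morpheme ("-dE") is stored once, and the surface allomorph is chosen by a rule that looks at the neighboring sounds.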


## Types of morphemes

Different morphemes will play different roles in a word. For example, the "sleep" in "sleeping" is clearly more important to the meaning than the "-ing" at the end. This is because, in a language like English, "sleep" can more or less stand on its own and be understood.

The morpheme "sleep" in "sleeping" is usually called the **root** of the word. The **root morpheme** of any word is usually the most important morpheme.

Other morphemes that are less central to the meaning include **prefixes**, which come *before* other morphemes (the **pre-** in "prefixes"); **suffixes**, which come *after* other morphemes (the **-es** in "suffixes"); and **infixes**, which are inserted *inside* other morphemes (rare in English, but consider Homer Simpson's saxo-**ma**-phone).

Morphemes that cannot occur on their own, without being attached to anything else, are known as **bound** morphemes. For example, we cannot normally say "pre-" without some additional context (e.g., "Did you pre-date or post-date this file?" "Pre-."). \[Probably we should use another metaphor for this soon.\]

Morphemes that can stand on their own are known as **free** morphemes because they can occur in many contexts and positions. Some good examples of free morphemes in English are words that occur in compounds -- armchair, bookcase, etc. All the words "arm", "chair", "book", and "case" can occur on their own and be added to.

## Traditional approaches to morphology: Finite State Machines

Traditionally, we handled morphology with **finite state machines** and generated novel wordforms using **finite state transducers**. Finite state methods are foundational to natural language processing, and they also describe many of the systems we see every day, such as a classic gumball machine:

<center><img src="http://www.openfst.org/twiki/pub/GRM/PyniniDocs/gumball.png" alt="Finite state transducer representing a gumball machine" /></center>

We expect that if we turn the knob and give the machine a coin, we will get a gumball out. Getting a gumball will reset the knob.
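
A minimal sketch of such a machine as a transducer in plain Python follows. The state names (`"waiting"`, `"paid"`), the input symbols, and the order of the two steps (coin first, then turn) are my own guesses for illustration and are not read off the figure.

```python
# (state, input) -> (next state, output); all names are illustrative guesses.
GUMBALL_ARCS = {
    ("waiting", "coin"): ("paid", None),
    ("paid", "turn"): ("waiting", "gumball"),
}


def run(inputs, state="waiting"):
    """Feed inputs to the machine; return the final state and the outputs."""
    outputs = []
    for symbol in inputs:
        if (state, symbol) not in GUMBALL_ARCS:
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state, output = GUMBALL_ARCS[(state, symbol)]
        if output is not None:
            outputs.append(output)
    return state, outputs


print(run(["coin", "turn", "coin", "turn"]))
```

Turning the knob without paying has no transition in this table, so the machine rejects that input sequence.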

**Weighted finite state transducers** are like regular finite state machines, except they assign a weight -- which can be converted to a probability -- to each transition. For the above, this could be a more complex gumball machine that 30% of the time gave you a gumball, and 70% of the time gave you one of those plastic slappy hands.

<center><img src="https://m.media-amazon.com/images/I/71agbbZgQ7L._AC_SL1452_.jpg" width=250 alt="Six multicolored sticky hand toys"/></center>
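
To make the "weight that can be converted to a probability" idea concrete, here is a small sketch that stores each weight as a negative log probability (one common convention, assumed here), so that adding weights along a path corresponds to multiplying probabilities. The state names and the helper `path_weight` are hypothetical.

```python
import math

# (state, input) -> list of (next state, output, weight).
# Weights are negative log probabilities (an assumed convention).
WEIGHTED_ARCS = {
    ("waiting", "coin"): [("paid", None, 0.0)],
    ("paid", "turn"): [
        ("waiting", "gumball", -math.log(0.3)),
        ("waiting", "sticky hand", -math.log(0.7)),
    ],
}


def path_weight(inputs, outputs, state="waiting"):
    """Sum the weights on the path that reads `inputs` and emits `outputs`."""
    total = 0.0
    remaining = list(outputs)
    for symbol in inputs:
        for next_state, output, weight in WEIGHTED_ARCS.get((state, symbol), []):
            if output is None:
                break  # arc with no output always matches
            if remaining and remaining[0] == output:
                remaining.pop(0)
                break  # arc emits the next expected output
        else:
            raise ValueError(f"no matching arc from {state!r} on {symbol!r}")
        state, total = next_state, total + weight
    if remaining:
        raise ValueError("some expected outputs were never emitted")
    return total


w = path_weight(["coin", "turn", "coin", "turn"], ["gumball", "sticky hand"])
print(math.exp(-w))  # probability of a gumball, then a sticky hand
```

Converting back with `exp(-w)` recovers the product of the two arc probabilities, 0.3 × 0.7.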

## How is this like morphology?

Morphology can also be thought of as **transitions** between sounds or morphemes. However, instead of knobs and coins and gumballs, we have sequences of morphemes, e.g., the Estonian \<kohvik-u-sse\> that we saw earlier.

In general, many NLP problems can be characterized, at a high level, as finite state problems: a sequence is processed one symbol at a time, and each symbol moves the system from one state to the next.
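
As a sketch of this analogy, the acceptor below walks through a sequence of morphemes instead of coins and knob turns. It covers only the four Estonian forms listed earlier; the state names are hypothetical labels, and this is in no way a real grammar of Estonian.

```python
# (state, morpheme) -> next state, for a tiny fragment of Estonian nouns.
NOUN_ARCS = {
    ("start", "kohvik"): "stem",
    ("stem", "u"): "genitive",
    ("genitive", "sse"): "case",
    ("genitive", "st"): "case",
}
# States in which a complete word may end.
ACCEPTING = {"stem", "genitive", "case"}


def accepts(morphemes):
    """Return True if the morpheme sequence is a path to an accepting state."""
    state = "start"
    for morpheme in morphemes:
        state = NOUN_ARCS.get((state, morpheme))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts(["kohvik", "u", "sse"]))  # True
print(accepts(["kohvik", "sse"]))       # False
```

The second call fails because, in this toy machine, "-sse" can only follow the "-u" morpheme, mirroring the forms kohvik, kohviku, kohvikusse.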



In [1]:
from google.colab import drive
import os
import spacy

drive.mount('/content/drive/', force_remount=True)

# open the file in a `with` block so the file handle is closed afterwards
with open("/content/drive/MyDrive/Teaching/Fall2021/"
          "Computational Linguistics/Lectures/"
          "supplementary_files/abstracts.tsv") as f:
  abstracts = f.readlines()

# load in spacy model so we can lemmatize
nlp = spacy.load("en_core_web_sm")

Mounted at /content/drive/


## Stemming

One cheap way to link words with related wordforms -- and probably meanings -- together is to use the [Porter Stemmer ("Suffix stripping") algorithm](https://tartarus.org/martin/PorterStemmer/python.txt). It is rarely used today, because it does not generally do a sophisticated enough job of breaking words into their constituent morphemes. But it is a fast and efficient algorithm that simplifies word forms. For example, it will take a word like "stopping" and produce "stop" because it strips off specific word endings ("-ing") and simplifies doubled consonants ("pp").

In [2]:
# using the porter stemmer
import nltk
from nltk.stem import porter
from pprint import pprint

stemmed_abstract = []
stemmer = porter.PorterStemmer()
nlped_abstract = nlp(abstracts[0])
for w in nlped_abstract:
  stemmed_word = stemmer.stem(w.text)
  stemmed_abstract.append(stemmed_word)

pprint(' '.join(stemmed_abstract))

('offens languag detect ( old ) ha receiv increas attent due to it societ '
 'impact . recent work show that bidirect transform base method obtain impress '
 'perform on old . howev , such method usual reli on larg - scale well - label '
 'old dataset for model train . To address the issu of data / label scarciti '
 'in old , in thi paper , we propos a simpl yet effect domain adapt approach '
 'to train bidirect transform . our approach introduc domain adapt ( DA ) '
 'train procedur to albert , such that it can effect exploit auxiliari data '
 'from sourc domain to improv the old perform in a target domain . experiment '
 'result on benchmark dataset show that our approach , albert ( DA ) , obtain '
 'the state - of - the - art perform in most case . particularli , our '
 'approach significantli benefit underrepres and under - perform class , with '
 'a signific improv over albert . \n')


## Some issues with stemming

One of the main reasons stemming is dispreferred these days is that it increases ambiguity. By throwing away (stripping) final information, you lose clues to:
* What kind of word it was (noun or verb)
* What grammatical role that word played -- e.g., past versus present tense

But there are other disadvantages, too:
* Generally not nice for people to read
* Not sensitive to context -- only uses word form itself
* Not robust to some types of string errors (e.g., "identif**ies**inconsistencies")

The system does have some advantages though:
* "Unlexicalized" -- Porter stemmer's behavior does not depend on what word is being stemmed
* Able to generalize to unseen words
* Much faster than statistically optimized methods
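
To see why an unlexicalized suffix stripper generalizes to unseen words, consider the toy function below. It is emphatically not the Porter algorithm (which has several ordered rule phases and extra conditions); it is a minimal sketch with three made-up rules, and `toy_stem` is a hypothetical name.

```python
VOWELS = set("aeiou")


def toy_stem(word):
    """Strip -ing/-ed (undoubling a final consonant) or a plain final -s."""
    for suffix in ("ing", "ed"):
        # Require at least three characters of stem to remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Simplify a doubled final consonant: "stopp" -> "stop".
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


for w in ["stopping", "shows", "blorping", "blorped"]:
    print(w, "->", toy_stem(w))
```

The nonsense words "blorping" and "blorped" are handled exactly like real ones, because the rules never consult a dictionary. The same property is what makes the output ambiguous: the rules cannot tell whether a stripped ending was actually a suffix.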


# Lemmatization

* Reverting a word to its base (root) form
  * sleeps --> sleep
  * bears --> bear
  * is --> be
  * it/they/he/she --> -PRON- (short for pronoun)

In [3]:
# spacy's lemmatization of the abstract
pprint(' '.join([x.lemma_ for x in nlp(abstracts[0])]))

('offensive language detection ( OLD ) have receive increase attention due to '
 '-PRON- societal impact . recent work show that bidirectional transformer '
 'base method obtain impressive performance on OLD . however , such method '
 'usually rely on large - scale well - label old dataset for model training . '
 'to address the issue of datum / label scarcity in OLD , in this paper , '
 '-PRON- propose a simple yet effective domain adaptation approach to train '
 'bidirectional transformer . -PRON- approach introduce domain adaptation ( DA '
 ') training procedure to ALBERT , such that -PRON- can effectively exploit '
 'auxiliary datum from source domain to improve the old performance in a '
 'target domain . experimental result on benchmark dataset show that -PRON- '
 'approach , ALBERT ( DA ) , obtain the state - of - the - art performance in '
 'most case . particularly , -PRON- approach significantly benefit '
 'underrepresented and under - perform class , with a significant impro

Problems associated with lemmatization:
  * Datasets and transformations must be curated (time-consuming)
  * Lossy -- removes morphological information
    * Not directly possible to retrieve it from context
  * The list of lemmas is not updated automatically
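
The curation problem is easiest to see in a lookup-table lemmatizer. The sketch below is a deliberately naive illustration and not how spaCy's lemmatizer works internally; the table contents and the name `toy_lemmatize` are assumptions.

```python
# A hand-curated table of wordform -> lemma (illustrative and incomplete).
LEMMA_TABLE = {
    "is": "be", "are": "be", "was": "be", "were": "be",
    "sleeps": "sleep", "bears": "bear",
}


def toy_lemmatize(token):
    """Look the lowercased token up; fall back to the token itself."""
    return LEMMA_TABLE.get(token.lower(), token)


print(toy_lemmatize("is"))      # be
print(toy_lemmatize("tweets"))  # tweets (not in the table, so unchanged)
```

Every entry has to be added by someone, and a new word like "tweets" silently passes through unlemmatized until the table is updated.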

In [4]:
list(zip([(x.text, x.lemma_) for x in nlped_abstract], stemmed_abstract))[0:20]

[(('Offensive', 'offensive'), 'offens'),
 (('language', 'language'), 'languag'),
 (('detection', 'detection'), 'detect'),
 (('(', '('), '('),
 (('OLD', 'OLD'), 'old'),
 ((')', ')'), ')'),
 (('has', 'have'), 'ha'),
 (('received', 'receive'), 'receiv'),
 (('increasing', 'increase'), 'increas'),
 (('attention', 'attention'), 'attent'),
 (('due', 'due'), 'due'),
 (('to', 'to'), 'to'),
 (('its', '-PRON-'), 'it'),
 (('societal', 'societal'), 'societ'),
 (('impact', 'impact'), 'impact'),
 (('.', '.'), '.'),
 (('Recent', 'recent'), 'recent'),
 (('work', 'work'), 'work'),
 (('shows', 'show'), 'show'),
 (('that', 'that'), 'that')]

In [5]:
# count how often each lemma occurs with each wordform (and vice versa);
# as with n-grams, normalizing these counts gives conditional probabilities
# such as p(lemma | wordform)

lemma_to_wordforms = {}
wordforms_to_lemmas = {}
# look at the first 1000 abstracts
for abstract in abstracts[0:1000]:
  tokenized_abstract = nlp(abstract)
  # one direction
  lemmas = [token.lemma_ for token in tokenized_abstract]
  raw_strings = [token.text for token in tokenized_abstract]
  for i, lemma in enumerate(lemmas):
    raw_string = raw_strings[i]
    if lemma not in lemma_to_wordforms:
      lemma_to_wordforms[lemma] = {raw_string: 1}
    else:
      if raw_string not in lemma_to_wordforms[lemma]:
        lemma_to_wordforms[lemma][raw_string] = 1
      else:
        lemma_to_wordforms[lemma][raw_string] += 1
  # the other direction
  for i, raw_string in enumerate(raw_strings):
    lemma = lemmas[i]
    if raw_string not in wordforms_to_lemmas:
      wordforms_to_lemmas[raw_string] = {lemma: 1}
    else:
      if lemma not in wordforms_to_lemmas[raw_string]:
        wordforms_to_lemmas[raw_string][lemma] = 1
      else:
        wordforms_to_lemmas[raw_string][lemma] += 1

In [6]:
lemma_to_wordforms['be']

{'Being': 3,
 'are': 644,
 'be': 398,
 'been': 187,
 'being': 31,
 'is': 1260,
 'was': 127,
 'were': 89}

In [7]:
lemma_to_wordforms['check']

{'check': 9, 'checked': 2, 'checking': 16}

In [8]:
lemma_to_wordforms['verify']

{'Verifying': 1, 'verified': 2, 'verify': 14, 'verifying': 4}

In [9]:
lemma_to_wordforms['interpret']

{'interpret': 9, 'interpreted': 5, 'interpreting': 10}

In [10]:
lemma_to_wordforms['interpretation']

{'interpretation': 11, 'interpretations': 4}

In [11]:
wordforms_to_lemmas['interpretations']

{'interpretation': 4}

In [12]:
wordforms_to_lemmas['interpretation']

{'interpretation': 11}
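
The dictionaries above hold raw counts. A minimal sketch of turning one of them into conditional probabilities, here $p(\text{wordform} \mid \text{lemma} = \text{be})$, is shown below. The counts are copied from the `lemma_to_wordforms['be']` output above; the helper name `normalize` is hypothetical.

```python
# Counts copied from the lemma_to_wordforms['be'] output above.
be_counts = {
    "Being": 3, "are": 644, "be": 398, "been": 187,
    "being": 31, "is": 1260, "was": 127, "were": 89,
}


def normalize(counts):
    """Divide each count by the total so the values sum to 1."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


p_wordform_given_be = normalize(be_counts)
print(round(p_wordform_given_be["is"], 3))  # 1260 / 2739, about 0.46
```

The same function applied to an entry of `wordforms_to_lemmas` would give $p(\text{lemma} \mid \text{wordform})$ instead.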

# A modern proposal: Byte Pair Encoding (BPE) subwords

We will cover the BPE algorithm on Friday. It is the current standard approach to _representing wordforms_, but it is not necessarily a way of _representing morphology_. It produces very different output from the lemmatization presented above, while retaining all aspects of the words' written forms.
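
As a small preview, the core step of BPE training can be sketched as: count adjacent symbol pairs (weighted by word frequency), then merge the most frequent pair everywhere. The toy word frequencies and function names below are assumptions for illustration; real implementations repeat this step many times and handle word boundaries and tie-breaking more carefully.

```python
from collections import Counter

# Toy vocabulary: each word is a tuple of symbols with a made-up frequency.
words = {
    tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
    tuple("bun"): 4, tuple("hugs"): 5,
}


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


best = most_frequent_pair(words)
words_after = merge_pair(words, best)
print(best)
print(words_after)
```

Nothing in this procedure knows about morphemes: it merges whatever character sequences are frequent, which is why the result represents wordforms rather than morphology.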

In [13]:
#@title Download transformers if necessary { vertical-output: true, display-mode: "form" }
!pip install transformers



# 🤗 Transformers by Huggingface

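The library provides implementations of many modern neural network models along with their tokenizers, which are subword-centered. Note that the BERT tokenizer used below relies on WordPiece, a close relative of BPE.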

In [14]:
from transformers import BertTokenizer

m = BertTokenizer.from_pretrained("bert-base-cased")
#m = BertTokenizer.from_pretrained("bert-base-uncased")

In [15]:
# compare BERT tokenization to the ones above
pprint(' '.join(m.tokenize(abstracts[0])))

('Offensive language detection ( O ##LD ) has received increasing attention '
 'due to its societal impact . Recent work shows that bid ##ire ##ctional '
 'transform ##er based methods obtain impressive performance on O ##LD . '
 'However , such methods usually rely on large - scale well - labeled O ##LD '
 'data ##sets for model training . To address the issue of data / label scar '
 '##city in O ##LD , in this paper , we propose a simple yet effective domain '
 'adaptation approach to train bid ##ire ##ctional transform ##ers . Our '
 'approach introduces domain adaptation ( D ##A ) training procedures to AL '
 '##BE ##RT , such that it can effectively exploit auxiliary data from source '
 'domains to improve the O ##LD performance in a target domain . Experimental '
 'results on bench ##mark data ##sets show that our approach , AL ##BE ##RT ( '
 'D ##A ) , obtain ##s the state - of - the - art performance in most cases . '
 'Part ##icular ##ly , our approach significantly benefits u
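
The `##` pieces in this output come from segmenting each word greedily, longest match first, against a fixed subword vocabulary, with `##` marking a piece that continues a word. The sketch below imitates that procedure with a tiny hypothetical vocabulary; the real `bert-base-cased` vocabulary is far larger, and `greedy_wordpiece` is an illustrative name, not a function from the `transformers` library.

```python
# Hypothetical toy vocabulary (the real one has tens of thousands of entries).
TOY_VOCAB = {"O", "##LD", "data", "##sets", "transform", "##er", "##ers"}


def greedy_wordpiece(word, vocab, unk="[UNK]"):
    """Split one word by repeatedly taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No piece matched at this position: give up on the whole word.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


for w in ["OLD", "datasets", "transformers"]:
    print(w, "->", greedy_wordpiece(w, TOY_VOCAB))
```

With this toy vocabulary, "OLD" becomes `O ##LD` and "datasets" becomes `data ##sets`, matching the pattern in the output above. Unlike stemming or lemmatization, the pieces can always be joined back into the original written form.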

# Unsupervised morphology induction

Algorithms like Morfessor (Creutz & Lagus, 2004) use probabilities to learn what the morphemes of a language are. The introduction of the paper is quite good -- read as much of it as you can follow. We will talk about unsupervised morphology induction (as a class of problem) on Wednesday, and we will also discuss some resources for morphological information (e.g., [Unimorph](https://unimorph.github.io/)).