# The goals of good language modeling

We want to be able to account for the statistical structure of language as much as possible. The better we can build a statistical model, the more easily we can handle:

1. Unseen, novel language
2. Complex linguistic dependencies
3. Extraction of the important parts of a text or conversation

# Challenges for n-gram language models and static embeddings

## Long-distance dependencies

We need models that can remember words from **far** outside the immediate context to make good predictions about what the next words will be. Consider, for example, the following sentences:

### Example 1: Coreference resolution
1. Last week, I took **my cat** to the vet and the doctor told me **he** was doing very well
2. Last week, I took **my cat** to the vet and the doctor told me **my boyfriend** was doing very well

A good language model will be able to tell you that (1) is generally a better sentence than (2). But any model that cannot see far enough back to know that the pronoun **he** is a better way to pick up **my cat** than the noun phrase **my boyfriend** will get this wrong.

This is especially clear in languages with **flexible word order**. For example, languages like Bulgarian, Turkish, classical Latin, Medieval French, etc., all have highly flexible word order. This means that estimating immediate sequences of n-grams is _even harder_ than for languages like English which have relatively **fixed word order**. Even more challenging, languages with flexible word order often have **complex morphology** which increases **sparsity** in co-occurrence data. Effectively:

### <center>Our counts become unreliable.</center>
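A toy illustration of this sparsity, assuming a made-up sentence and plain whitespace tokenization (both are illustrative choices, not course data). The helper names `ngrams` and `singleton_fraction` are hypothetical. As n grows, a larger share of the distinct n-grams has been seen only once, so their counts tell us very little.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(ngrams(tokens, n))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Hypothetical toy corpus (an assumption for illustration only)
tokens = "the cat sat on the mat and the cat ate the fish on the mat".split()

for n in (1, 2, 3):
    print(n, round(singleton_fraction(tokens, n), 3))
```

On this toy text the singleton share rises from unigrams to bigrams to trigrams. Real corpora are far larger, but the same trend is what makes long n-gram counts hard to trust.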

### Example 2: Verb-particle constructions
* Do you know which doctor I **sent** a letter **to**?
* I had a hard time **carrying** the suitcase full of heavy equipment **up** the stairs.

These problems are very difficult and finding a good solution has been challenging. For coreference resolution, knowing that **my cat** and **he** are the same referent requires remembering what other things I have mentioned. For verb-particle constructions, I need to be able to learn that **(send, to)** and **(carry, up)** are pairs of words that go together.

**Long-distance dependencies** make the probabilities of words very hard to estimate. That is, we usually cannot build very good language models from short n-grams alone. In order to reliably estimate upcoming words in a language like English, companies like Google would store 6-, 7-, and even 8- or 10-gram sequences of words for applications like **machine translation**.

This leads to these hard questions:

1. How far out do we need to look?
2. What length of n-gram do we need to store?
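For the coreference example, the first question can be made concrete with a small sketch. The function name `ngram_order_needed` is hypothetical, and the sentence is tokenized by simple whitespace splitting (with the comma separated by hand), which is an assumption rather than the tokenizer used elsewhere in these notes.

```python
def ngram_order_needed(tokens, antecedent, target):
    """Smallest n such that the n-gram ending at `target` also contains `antecedent`."""
    target_pos = tokens.index(target)
    antecedent_pos = tokens.index(antecedent)
    if antecedent_pos >= target_pos:
        raise ValueError("antecedent must precede target")
    # The n-gram spans every token from the antecedent up to the target.
    return target_pos - antecedent_pos + 1

sentence = ("Last week , I took my cat to the vet and "
            "the doctor told me he was doing very well")
tokens = sentence.split()

print(ngram_order_needed(tokens, "cat", "he"))
```

For this sentence the answer is 10: only a 10-gram ending at **he** reaches back far enough to include **cat**.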

## Ambiguity

Liz did a great job on Monday discussing how ambiguity plays a role in understanding the meanings conveyed by a sentence. Ambiguity can range from **polysemy** (e.g., the material versus the research meanings of _paper_) to **homonymy** (e.g., _bass_ the fish versus _bass_ the guitar).

## Phrases

We have previously created **document embeddings** using LSA, `word2vec`, and LDA. But, we do not have a good way to represent combinations of words. Consider, for example, the following types of data:

* strong coffee vs. powerful coffee
  - "powerful coffee" sounds strange to us -- but why? If you were developing a dialogue agent, you would want to make sure it says the more natural phrase in English.
* the white house (a place) vs. the White House (the US government entity)
  - A good system should know that _The White House_ only has a tiny bit of the same kind of "white" in it as _the white house_
* Green Bay Packers (football) vs. Green Bay (place)
* Xe **is not** in El Paso vs. Xe **is** in El Paso

Ultimately, we are faced with not just an issue of **ambiguity** but rather **compositionality**.
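The El Paso pair shows why counting words is not enough. In the sketch below (plain Python, with a hypothetical `cosine` helper and lowercased, whitespace-split tokens as assumptions), the two sentences mean opposite things but their bag-of-words vectors are almost identical.

```python
import math
from collections import Counter

def cosine(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

affirmative = Counter("xe is in el paso".split())
negated = Counter("xe is not in el paso".split())

similarity = cosine(affirmative, negated)
print(round(similarity, 3))
```

The similarity comes out above 0.9, even though **not** reverses the meaning of the whole sentence. The single extra word is just one more dimension in the count vector.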

## How do we compute the meanings of higher-order combinations?

For today and Friday, we can think about this in an n-gram or embedding framework. But, next week, we will spend more time thinking about meanings in terms of first-order logic and sets!

## A quick-and-easy way to learn phrase representations

N-grams are the easiest way to approximate the meaning of a phrase. We simply make the assumption that two words that co-occur are potentially meaningful, and phrases that co-occur a lot are more meaningful. Importantly, if two words in a phrase almost always occur in the same 

We can easily extend the bag-of-words model code from before to n-grams of arbitrary sizes, creating a **bag-of-n-grams** representation. We will start with bigrams to be simple, and do stop word removal.

In order to create "n-grams" in our vocabulary, we just have to treat phrases as unanalyzed strings! Let's try LSA with frequent n-grams and test out how similar phrases are.
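Before handing this off to a library, here is a minimal pure-Python sketch of what "treating phrases as unanalyzed strings" means. The function `bag_of_ngrams` is a hypothetical helper written for illustration, and it is not how any particular library implements this internally.

```python
from collections import Counter

def bag_of_ngrams(tokens, ngram_range=(1, 2)):
    """Count every n-gram in the range, storing each one as a single string key."""
    low, high = ngram_range
    bag = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            # Joining the tokens makes the phrase one opaque vocabulary item.
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

tokens = ["strong", "coffee", "tastes", "strong"]
bag = bag_of_ngrams(tokens)
print(bag)
```

Here `"strong coffee"` is just another key next to `"strong"` and `"coffee"`. Nothing in the representation records that the bigram is built from the two unigrams.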

In [21]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
# load in standard stuff
from google.colab import files, drive
from collections import Counter
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD # PCA but for sparse matrices
nltk_stops = stopwords.words("english")
missing_stops = ["'d", "'ll", "'re", "'s", "'ve", 'could',
                 'might', 'must', "n't", 'need', 'sha', 'wo', 'would']
stop_words = nltk_stops + missing_stops

drive.mount("/content/drive/")
# read one abstract per line, closing the file when done
with open(
    ("/content/drive/MyDrive/Fall 2021 Computational"
     " Linguistics Notebooks/files/abstracts.tsv"), 'r') as abstracts_file:
    abstracts = abstracts_file.readlines()

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [24]:
help(CountVectorizer)

Help on class CountVectorizer in module sklearn.feature_extraction.text:

class CountVectorizer(_VectorizerMixin, sklearn.base.BaseEstimator)
 |  CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
 |  
 |  Convert a collection of text documents to a matrix of token counts
 |  
 |  This implementation produces a sparse representation of the counts using
 |  scipy.sparse.csr_matrix.
 |  
 |  If you do not provide an a-priori dictionary and you do not use an analyzer
 |  that does some kind of feature selection then the number of features will
 |  be equal to the vocabulary size found by analyzing the data.
 |  
 |  Read more in the :ref:`User Guide <text_feature_extraction>`.
 |  
 |  Parameters
 |  ---------

In [5]:
# turn our documents into a bag-of-words plus a bag-of-bigrams
vectorizer = CountVectorizer( 
    tokenizer=word_tokenize,
    stop_words=stop_words,
    ngram_range=(1, 2), # learn unigrams and bigrams
    lowercase=True)
# basically just one line to get a giant matrix
bow_abstracts = vectorizer.fit_transform(abstracts)

In [25]:
bow_abstracts.shape

(27470, 1100487)

In [26]:
# do LSA via truncated SVD (like PCA, but it works on sparse matrices)
N_COMPONENTS = 100

pca = TruncatedSVD(n_components=N_COMPONENTS)
pca.fit(bow_abstracts)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
             random_state=None, tol=0.0)

In [29]:
pca.explained_variance_ratio_

array([0.3692086 , 0.05771472, 0.01183093, 0.01050857, 0.00931065,
       0.00878017, 0.0056165 , 0.00428921, 0.0039652 , 0.0038232 ,
       0.0033733 , 0.00294632, 0.00256014, 0.0023847 , 0.00229067,
       0.00212326, 0.00202841, 0.0017337 , 0.0016834 , 0.00161826,
       0.00156556, 0.00151432, 0.00145235, 0.00144049, 0.00135888,
       0.00131564, 0.00129326, 0.00125716, 0.00122282, 0.00119186,
       0.00117083, 0.00113469, 0.00111088, 0.00110497, 0.00108503,
       0.00106873, 0.00104472, 0.00102723, 0.00100087, 0.00098857,
       0.00097882, 0.00097021, 0.00095404, 0.00093225, 0.00092549,
       0.00091923, 0.0009044 , 0.00088168, 0.00087176, 0.00086272,
       0.00084663, 0.00084001, 0.00083729, 0.00082749, 0.00081828,
       0.00081738, 0.00080247, 0.00078626, 0.00077412, 0.00076023,
       0.00075079, 0.00074161, 0.00073603, 0.00072274, 0.00071703,
       0.0007134 , 0.00070361, 0.00069429, 0.00068583, 0.00067729,
       0.00067318, 0.00066818, 0.0006623 , 0.00065274, 0.00064

In [7]:
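# find the 15 nearest neighbors of "parsing algorithms"
# (scikit-learn 1.2 and later remove get_feature_names();
#  use get_feature_names_out() there instead)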
word_vectors = pca.components_.T
_index = vectorizer.vocabulary_['parsing algorithms']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
_to_similarities = dict(zip(vectorizer.get_feature_names(),
                            word_similarities[0].tolist()))
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{'. parsing': 0.9314807905080164,
 '1-endpoint-crossing': 0.9015153297051915,
 'algebras .': 0.8931686691990012,
 'chart parsing': 0.8928353448486199,
 'constituent parsing': 0.9099672574812746,
 'dependency parsing': 0.8927422486203865,
 'parsing': 0.9302649052178742,
 'parsing ,': 0.9139515080136837,
 'parsing .': 0.9238950963351666,
 'parsing algorithm': 0.9118931403500804,
 'parsing algorithms': 1.0,
 'parsing schema': 0.9341754124624457,
 'parsing schemata': 0.9218372055211215,
 'parsing strategies': 0.8932287250109212,
 'transition-based': 0.908928470006244}

In [30]:
_index = vectorizer.vocabulary_['machine translation']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
_to_similarities = dict(zip(vectorizer.get_feature_names(),
                            word_similarities[0].tolist()))
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{'( smt': 0.7053785221912773,
 'machine': 0.9027890466965998,
 'machine translation': 1.0,
 'phrase-based machine': 0.7788280894106764,
 'smt )': 0.7021311492810304,
 'statistical machine': 0.7882546492292113,
 'translation .': 0.7467885645523396,
 'translation neural': 0.7240753774811233,
 'vigor practical': 0.6986410150754385,
 'way recognition': 0.6986410150754385,
 'white house': 0.6986410150754385,
 'win/win': 0.6986410150754385,
 'win/win outcomes': 0.6986410150754385,
 'witnessed accelerated': 0.6986410150754385,
 'years truly': 0.6986410150754385}

In [9]:
# try again with "state of the art", removing the stop words
_index = vectorizer.vocabulary_['state art']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
_to_similarities = dict(zip(vectorizer.get_feature_names(),
                            word_similarities[0].tolist()))
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{', competitive': 0.5718417337818319,
 'art': 0.8907364912757962,
 'art .': 0.6134248243313557,
 'art results': 0.5159260987754503,
 'backpropagating': 0.5136848164839294,
 'competitive state-of-the-art': 0.5108603450616938,
 'current state': 0.6263172908546091,
 'new state': 0.7099132084691453,
 'outperforming': 0.5274225684992738,
 'previous state': 0.6363981684187728,
 'reduce size': 0.5280202647050907,
 'state': 0.6372971352265207,
 'state art': 1.0,
 'uses large': 0.5061153100760793,
 'wmt romanian-english': 0.5008015392156285}

## Assessing the "parsing" in "parsing algorithms"

We can use standard vector math -- addition, subtraction, cosine similarity -- to understand how different "parsing algorithms" is from "parsing." This allows us to know -- somewhat -- how much "parsing" matters to the phrase. To do this, we can simply subtract "algorithms" from "parsing algorithms" and compute the cosine similarity between _that new vector_ and the vector for "parsing."

In [31]:
# find out which word vector to get
parsing_index = vectorizer.vocabulary_['parsing']
algorithms_index = vectorizer.vocabulary_['algorithms']
parsing_algorithms_index = vectorizer.vocabulary_['parsing algorithms']

# store these for easy reuse
parsing_vector = word_vectors[parsing_index].reshape(1, -1)
algorithms_vector = word_vectors[algorithms_index].reshape(1, -1)
parsing_algorithms_vector = word_vectors[parsing_algorithms_index].reshape(1, -1)

In [32]:
parsing_only = parsing_algorithms_vector - algorithms_vector
print(cosine_similarity(parsing_only, parsing_vector))

[[-0.25381316]]


In [33]:
schema_vector = word_vectors[vectorizer.vocabulary_['schema']].reshape(1, -1)
parsing_schema = parsing_algorithms_vector - algorithms_vector + schema_vector


In [34]:
cosine_similarity(parsing_schema,
                  word_vectors[vectorizer.vocabulary_['parsing schema']].reshape(1, -1))

array([[-0.20535282]])

### Vector addition, subtraction not necessarily well-behaved

It is not clear that vectors should *add* to form the word vectors for more complex linguistic categories. Linguistic relationships are often _non-additive_ in a statistical sense. That is, in most things about language, a combination represents something distinct from the components. There is, effectively, an **interaction** between representations. Consider, for example, the following:

* the doctor 's **office**
* brain **doctor**

Most of us would intuitively say the word "doctor" means something _roughly_ equivalent across both examples.

But, the _core meaning_ of the combination depends on both component words, and often one slightly more than another (the bolded words). But, it is also not clear that the meaning should be a **weighted average** of static word vectors, either.
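One concrete limitation is that addition ignores word order. The sketch below uses random stand-in vectors (an assumption, since real learned vectors are not needed to make the point) and a hypothetical helper `compose_by_addition`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 5-dimensional static vectors: random stand-ins for learned ones
vectors = {word: rng.normal(size=5) for word in ("brain", "doctor")}

def compose_by_addition(words):
    """Additive composition: sum the static vectors of the words."""
    return np.sum([vectors[w] for w in words], axis=0)

brain_doctor = compose_by_addition(["brain", "doctor"])
doctor_brain = compose_by_addition(["doctor", "brain"])

print(np.allclose(brain_doctor, doctor_brain))
```

This prints `True`: under pure addition, "brain doctor" and "doctor brain" get exactly the same vector, even though only one of them names a kind of doctor. A fixed weighted average keyed to each word has the same problem.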

So, we probably want another solution -- for which we can use more sophisticated math. This includes **convolutional** methods, **neural network** representations, or **learned composition functions**.

# Recurrent Neural Networks

Elman (1990): Finding structure in time. https://doi.org/10.1207/s15516709cog1402_1

Recurrent Neural Networks, or RNNs, resemble `word2vec` and the n-gram language models we learned about before in that their job is to **predict what the next word is going to be** as well as possible. The particular insight in creating recurrent neural networks was that we can represent the **context** as something like a word vector.

The earliest instantiation of Elman's RNN was a **character language model**, or a model that learned to predict the next letter in a sequence. Also known as a **simple recurrent network**, his model had a vocabulary of about 40 characters (26 English letters + 10 digits). This type of model took weeks to run on the machines of the time, even though the same task would take 30 seconds on your own laptop.

But, as we gained computing power, it became feasible to extend his small RNN to complex vocabularies from whole corpora. Starting in the early 2010s, we were finally able to compute sophisticated RNN representations rather than work with toy examples.

## Four critical components in RNNs

<!-- Here is some python pseudocode where we have a hidden layer (just like in `word2vec`) of $k$ dimensions: -->

<!-- ```python
for i, word in enumerate(sentence):
  one_hot_word_vector = make_bag_of_words([word])
  one_hot_next_word_vector = make_bag_of_words([sentence[i + 1]])
  if i==0:
    hidden_units = randomly_initialize(dimensionality=k)
    recurrent_units = zero_initialize(dimensionality=k)
  else:
    recurrent_units = hidden_units # set to previous hidden state
    hidden_units = recompute_hidden_from(one_hot_word_vector)
  concatenated_hidden = concat(hidden_units, recurrent_units)
  prediction = predict(concatenated_hidden)
  error = compute_loss(prediction, one_hot_next_word_vector)
``` -->

<center><img src="https://www.oreilly.com/library/view/keras-2x-projects/9781789536645/assets/8bf6fccb-4bdf-4542-b095-1791a7e2ca88.png" width=550 /></center>


1. Input representations (a one-hot bag-of-words representation of the current word)
2. Hidden units (just like in `word2vec`)
3. Recurrent units (which hold a "copy" of (2) from the previous cycle)
4. Output representations (a one-hot bag-of-words representation of the next word)

On the surface, this is very different from n-gram language modeling. But, the outcome is similar.

The major contribution of the RNN is that it predicts the next output (just like an n-gram language model) using a **latent representation** of the context. That is, both the _current word_ and _all the prior words_ contribute to the prediction.
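The four components can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not Elman's original setup: the toy vocabulary, the hidden size, the random untrained weights, the `tanh` activation, and the names `W_xh`, `W_hh`, `W_hy`, and `srn_forward` are all choices made here for the example, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "down"]  # toy vocabulary (assumption)
V, k = len(vocab), 3                   # vocabulary size, hidden size

# Randomly initialized weights; a trained model would learn these.
W_xh = rng.normal(scale=0.1, size=(V, k))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(k, k))  # recurrent units -> hidden
W_hy = rng.normal(scale=0.1, size=(k, V))  # hidden -> output

def one_hot(word):
    """Input representation: a one-hot vector for the current word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

def softmax(z):
    """Turn output scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_forward(words):
    """Return one next-word distribution per input word, plus the final hidden state."""
    context = np.zeros(k)  # recurrent units start out empty
    predictions = []
    for word in words:
        # Hidden units see the current word AND the copied previous hidden state.
        hidden = np.tanh(one_hot(word) @ W_xh + context @ W_hh)
        predictions.append(softmax(hidden @ W_hy))
        context = hidden   # copy the hidden state for the next cycle
    return predictions, context

preds, final_state = srn_forward(["the", "cat", "sat"])
print(preds[-1])
```

Because `context` is carried from step to step, the last prediction depends on every earlier word, not just on a fixed window of the previous n-1 words.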

Cool historical page on SRNs from Jay McClelland's research group: https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html


## Some caveats about Simple Recurrent Networks (SRNs)

Elman's original models work well over simplified inputs and outputs, such as small corpora. But, **hidden states can slowly get corrupted**, so the model is not guaranteed to work very well for long sequences. That is, it might forget some of what it has read.
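One way to see this forgetting is to start the same simple recurrent update from two very different hidden states and feed both the same later inputs. The sketch below is an illustration under strong assumptions: the recurrent weights are fixed to `0.5 * I` (so the update is contractive), the inputs are random stand-ins for projected words, and the names `run` and `gap` are hypothetical.

```python
import numpy as np

k = 4
W_hh = 0.5 * np.eye(k)             # assumed contractive recurrent weights
rng = np.random.default_rng(1)
inputs = rng.normal(size=(20, k))  # stand-ins for projected word inputs

def run(first_state, steps):
    """Apply the recurrent update `steps` times, starting from `first_state`."""
    h = first_state
    for x in inputs[:steps]:
        h = np.tanh(x + h @ W_hh)
    return h

state_a = np.ones(k)    # hidden state after "history A"
state_b = -np.ones(k)   # hidden state after "history B"

def gap(steps):
    """Distance between the two hidden states after the same `steps` inputs."""
    return np.linalg.norm(run(state_a, steps) - run(state_b, steps))

for steps in (0, 1, 5, 10):
    print(steps, gap(steps))
```

Under these assumptions the gap at least halves with every step, so after ten shared inputs the two histories are nearly indistinguishable. Whatever distinguished them early on has effectively been overwritten.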

So, while SRNs are better than n-gram language models for prediction, they still need more tricks to better remember the past. For that, researchers implemented Long Short-Term Memory ([Hochreiter & Schmidhuber, 1997](https://doi.org/10.1162/neco.1997.9.8.1735)) and Attention ([Bahdanau, Cho, & Bengio, 2014](https://arxiv.org/abs/1409.0473)). These help models to better remember what they have just seen.

## Word representations to date

To date, we have covered the following algorithms that learn word vectors:

* Latent Semantic Analysis (LSA) via Principal Component Analysis (PCA)
* word2vec (the continuous bag-of-words version)
* Latent Dirichlet Allocation (LDA) for topic model representations of words

What are the features of each of the dimensions learned by these methods? What is the input and the output?

**LSA (Latent Semantic Analysis)**
<details>
<summary>Each dimension corresponds to </summary>
  Its distance along an axis defined by learning the best characterization of the subspace
</details>
<details>
<summary>The input to LSA is </summary>
  A set of document representations in bag-of-words format -- a document matrix
</details>
<details>
<summary>The result of LSA is </summary>
  A subspace that projects all words $w$ in a vocabulary $V$ into a lower-dimensional space ($|V| \times k$)
</details>

**word2vec (continuous bag-of-words)**
<details>
<summary>Each dimension corresponds to</summary>
A compressed representation learned by predicting "held-out" words from a context
</details>
<details>
<summary>The input to word2vec is</summary>
Document representations presented one at a time until the model converges on a solution. These document representations are like bag-of-words representations, except the model predicts a held-out word that is not included in the counts
</details>
<details>
<summary>The result of word2vec is </summary>
  Two matrices that (1) turn one-hot bag-of-words representations into word vectors and (2) turn a context vector [from the document representation] into predictions (probabilities) of held-out words
</details>

**LDA (Latent Dirichlet Allocation)**
Unlike `word2vec` and LSA, LDA does not return a word vector representation exactly. Like the vectors learned by PCA...
<details>
<summary>Each dimension of a "word vector" corresponds to</summary>
A "weight" in the form of a probability that a given word $w$ belongs to a given topic $t$
</details>

But, unlike the vectors we learn in LSA:

<details>
<summary>Each dimension corresponds to</summary>
The dimensions are arbitrary -- PCA gives us vectors ordered by their importance. LDA treats each topic separately.
</details>

And, because the values learned by LDA for a given word are **probabilities**, some similarity math (e.g., cosine similarity or dot products) is challenging. This motivates contextual methods that can learn word representations that still behave nicely in standard geometric spaces.

**RNN (Recurrent Neural Network)**
Using a more complex neural structure that holds onto prior states in memory, we can learn contextual word representations.
<details>
<summary>Each dimension of a "word vector" corresponds to</summary>
The hidden state (a float vector) prior to the output layer (a vector of probabilities of the next word).
</details>


## Preview of Friday!

## ELMo

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

https://aclanthology.org/N18-1202.pdf

In [10]:
# doing this instead of trying to use Allen Institute's package
!pip install simple-elmo

Collecting simple-elmo
  Downloading simple_elmo-0.8.0-py3-none-any.whl (45 kB)
Installing collected packages: simple-elmo
Successfully installed simple-elmo-0.8.0


## BERT

https://aclanthology.org/N19-1423.pdf

## RoBERTa

https://arxiv.org/pdf/1907.11692.pdf

## GPT-2



In [15]:
!pip install transformers
# using the HuggingFace implementation
from transformers import BertModel, RobertaModel, GPT2Model

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  At

# A comparison of contextual representations

https://aclanthology.org/D19-1006.pdf