# From documents back to words

This week we have covered:
  * `word2vec`
  * Document embeddings with Latent Semantic Analysis
  * Clustering algorithms

Goals of prior analyses:
  * Learn the representations of words
  * Combine words in different ways to characterize documents
  * Create groups and assign labels to them

The algorithms we have talked about are neat from a data perspective:
  * Have intuitive, geometric interpretations
  * Work well in practice for practical natural language processing
  * But do not always work with language data perfectly

## The skew of language data

<img src="https://upload.wikimedia.org/wikipedia/commons/3/3e/Dirichlet_distributions.png" width=450 />

e.g., word frequency distributions:

<img src="https://finnaarupnielsen.files.wordpress.com/2013/10/brownzipf.png" width=500 />

# Generative Models

Before we talk about Latent Dirichlet Allocation (LDA), let's talk a bit about what kinds of models we are learning today. LDA is what is called a **generative** model, which can be used to *create* data from what it learns. It makes the assumption of a **data generating process**. Understanding what this might look like will help us understand what we learn when we build topic models. The vast majority of models that we use in contemporary NLP are **discriminative models** (e.g., `word2vec`, `BERT`, logistic regression, etc.).

Intuitively, how do we end up with a document? What steps do we go through?

<details>
<summary>Step 1: Message formulation
</summary>
Someone has to have something they want to say (a message)
</details>
<details>
<summary>
Step 2: Articulation
</summary>
Someone has to use their knowledge of language to say it
</details>
<details>
<summary>
Step 3: Interpretation
</summary>
Someone else has to interpret what was said in (2) and try to reconstruct (1)
</details>

One way these three steps have beeen thought of is as a **noisy channel model** proposed by Claude Shannon. This model is essentially just the same as Bayes' theorem (aka Bayes rule).

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif" />

The Reverend Bayes

## Bayes Theorem

We discussed Bayes Theorem briefly a few weeks ago when we talked about conditional probabilities. A conditional probability is like looking at a **subset** of the data. For example:

$p(\text{Word} = w_i | \text{Context} = c_j)$ refers to the probability of a word $w_i$ given a context $c_j$. 

That is, among all the contexts where we have seen $c_j$, what proportion of events did we also see $w_i$?

If we think about *bigrams* -- two word sequences -- again, Bayes Theorem allows us to shift between different kinds of probabilities:

$p(\text{Word} = w_i | \text{Context} = c_j) \approx \large \frac{p(\text{Context} = c_j | w_i ) p(\text{Word} = w_i)}{p(\text{Context}=c_j)} $

## How does Bayes relate to topic models?

When estimating language models, we have spent a **lot** of time trying to find $p(w | c)$ by looking at **term frequencies** in TF-IDF, or by computing counts of words in bag-of-words representations. But, if we want to know what kinds of topics are present in a document, we actually need to **invert the question**.

That is, we need to use the information about the words to define documents, rather than defining words by the documents they occur in. 

In order to do this, we flip the formula -- we are trying to best approximate the *topics* of our documents using the words that we have in our vocabulary.




#### Set up for code examples

In [2]:
import nltk
nltk.download("punkt")
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
from google.colab import drive, files
import pandas as pd

drive.mount("/content/gdrive")
stopwords_ = set(stopwords.words())
with open("/content/gdrive/MyDrive/Fall 2021 Computational Linguistics Notebooks/files/abstracts.tsv") as file:
  abstracts = file.readlines()
rice = pd.read_excel("/content/gdrive/MyDrive/Fall 2021 Computational Linguistics Notebooks/files/Riceetal_SupplementaryMaterials_R1.xlsx")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Latent Dirichlet Allocation (LDA)

Using LDA requires making some assumptions:

* Documents are composed of multiple parts (e.g., words) --> Prior
* Words may be used in different ways in different contexts --> Prior
* Words can disambiguate each other
* Documents can be composed of multiple topics --> Prior
* Terms do not have to be used equally often across topics --> Prior

A topic model learns several types of probabilities:

* The probabilities of words (out of all tokens)
* The probabilities of a word $w_i$ occurring in a topic $k$ (all $k$ topics for $w$ must sum to 1)
* The probabilities of a topic $k$ occurring in a document $d$ (all topics $t$ must sum to 1)

These components in the `gensim` implementation are referred to as the:

* **Term Topic Matrix** --> Strongest terms in each topic (e.g., "radical")
* **Document Topic Matrix** --> Strongest topics for a given document (e.g., 30% chemistry, 50% math)
* **Topic Term Matrix** --> Strongest topics for a given term (e.g., 30% chemistry, 50% math)

In [3]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

tokenized_abstracts = [word_tokenize(x) for x in abstracts]
tokenized_abstracts = [[x for x in y if x not in stopwords_] for y in tokenized_abstracts]

bib_dictionary = Dictionary(tokenized_abstracts)
bib_corpus = [bib_dictionary.doc2bow(text) for text in tokenized_abstracts]

model = LdaModel(corpus=bib_corpus, num_topics=40)

### Let's explore the components of our trained LDA model

In [4]:
# document
model.get_term_topics(bib_dictionary.token2id['parsing'])

[(13, 0.058627035)]

In [6]:
model.get_topic_terms(13)

[(3933, 0.05852286),
 (3, 0.054765403),
 (4307, 0.035495874),
 (2742, 0.033214096),
 (2, 0.033086985),
 (634, 0.028297313),
 (1851, 0.014612291),
 (1219, 0.014602867),
 (2753, 0.014518901),
 (1, 0.014266459)]

In [7]:
for term, score in model.get_topic_terms(13):
  print(bib_dictionary[term])

parsing
.
parser
dependency
,
syntactic
semantic
tree
trees
)


### Exploring the topics that LDA learns over our abstracts

In [10]:
model.get_term_topics([bib_dictionary.token2id['dependency']])

[(13, array([0.0332691], dtype=float32))]

What is the strongest document on the same topic that is most associated with "parsing"? I.e., what document is in the same general area?

In [12]:
parsing_affinity = 0
top_doc = abstracts[0]
for i, doc in enumerate(bib_corpus):
  # get topics FOR a document
  topics_scores = model.get_document_topics(doc)
  for topic, score in topics_scores:
    if topic==13 and score > parsing_affinity:
      parsing_affinity = score
      top_doc = abstracts[i]

top_doc

'The de-facto standard decoding method for semantic parsing in recent years has been to autoregressively decode the abstract syntax tree of the target program using a top-down depth-first traversal. In this work, we propose an alternative approach: a Semi-autoregressive Bottom-up Parser (SmBoP) that constructs at decoding step t the top-K sub-trees of height {\\mbox{$\\leq$}} t. Our parser enjoys several benefits compared to top-down autoregressive parsing. From an efficiency perspective, bottom-up parsing allows to decode all sub-trees of a certain height in parallel, leading to logarithmic runtime complexity rather than linear. From a modeling perspective, a bottom-up parser learns representations for meaningful semantic sub-programs at each step, rather than for semantically-vacuous partial trees. We apply SmBoP on Spider, a challenging zero-shot semantic parsing benchmark, and show that SmBoP leads to a 2.2x speed-up in decoding time and a {\\textasciitilde}5x speed-up in training 

# Caveat emptor! A note about topic models as clustering algorithms

Unlike PCA, the topics that LDA learns do not have fixed labels. Like K-Means, LDA creates "latent" categories that correspond to the different topics. For that reason, on one jump of your notebook, what is in Category 3 will be very different from Category 3 the next time. Or, if you change dimensionality to add more topics, your numbers will change as well. So, always keep an eye out for what topic you are trying to refer to!

## Topic models allow us to handle ambiguous words

Previously, we have talked about how Latent Semantic Analysis (LSA), and word2vec all leave us with a representation of words that is *completely insensitive to context*.

How I interpret the word "particle" will change depending on what I am reading:

* Chemistry: A unit of matter
* Linguistics: A type of morpheme or small word
* Meteorology: Small bits of matter suspended in the air

Topic models allow us to see contextual predictions -- how much do we think an article is about chemistry? If we think it is about chemistry, does that change how we understand the meaning of the word "particle"?

Topic models allow us to look at exactly that! What do you think are the most ambiguous words in NLP?

We are going to look for the words that have the most even probability distributions. For this, we need to go back to our code that lets us look at the "term topics", or topics associated with a given word.

`model.get_term_topics([bib_dictionary.token2id['dependency']])`

In [13]:
n_nonzero = 0
for term in bib_dictionary.token2id:
  topics = model.get_term_topics(bib_dictionary.token2id[term])
  if len(topics) > 0:
    n_nonzero += 1

n_nonzero

164

In [54]:
for term in bib_dictionary.token2id:
  topics = model.get_term_topics(bib_dictionary.token2id[term])
  if len(topics) > 1:
    print(term, len(topics))

( 9
) 10
, 38
. 38
data 3
language 4
model 2
paper 4
We 10
The 7
models 2
' 2
machine 2
systems 2
{ 4
} 4
system 6
resources 2
using 2
method 2
task 2
translation 2
\ 2
semantic 2


The way we would allow words to be *more* ambiguous (assigned to more topics) is by setting different kinds of expectations (or priors) that help the model make decisions.

It is also clear that word frequency plays a role. We might need to be clever to handle rare words.

# What resources can we use to learn about word meanings?

Lots and lots of labeled databases -- but the most well-known one is WordNet. WordNet is a huge resource that tries to group every string into all of its distinct meanings. 

Like any other resource _Wordnet is **incomplete**_. But, it can be useful for distinguishing between **unrelated** senses of the same string. How it handles **related** senses is another matter.

Here is an example of a string with lots of **related** senses:

* "Paper" can refer to
  * _a journal article_ ("I loved their paper on this topic")
  * physical sheets of paper ("give me a piece of paper")
  * the material ("paper was strewn everywhere after the party last week")

(instance of polysemy)

Here is an example of a string with lots of **unrelated** senses:

* "Bow" can refer to
  * The side of a ship
  * One part of a "bow and arrow"
  * A ribbon used to wrap packages or hair

(instance of homonymy)

Understanding what senses get used when is a huge challenge in NLP.

In [45]:
from nltk.corpus import wordnet
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [53]:
for sense in wordnet.synsets("slick"):
  print(sense)
  print(sense.definition())

Synset('slickness.n.03')
a slippery smoothness
Synset('slick.n.02')
a magazine printed on good quality paper
Synset('slick.n.03')
a film of oil or garbage floating on top of water
Synset('slick.n.04')
a trowel used to make a surface slick
Synset('slick.v.01')
make slick or smooth
Synset('slick.v.02')
give a smooth and glossy appearance
Synset('slick.s.01')
made slick by e.g. ice or grease
Synset('glib.s.02')
having only superficial plausibility
Synset('satiny.s.01')
having a smooth, gleaming surface reflecting light
Synset('crafty.s.01')
marked by skill in deception


## Topic models learn how ambiguous words are

If there's enough time, we're going to do a feasibility exercise to look at what happens with ambiguous words. For this, we'll load in the Rice et al. (2018) dataset from HW4.

In [15]:
# preview ambiguous word data
# check out the top of this dataset, only two columns to avoid clutter
rice[['line', 'probe']].head()

Unnamed: 0,line,probe
0,"Come on , pack it up !",pack
1,We are gon na pack them to the doors .,pack
2,Everybody pack up whatever you can .,pack
3,You go home and you pack your bags now ! How '...,pack
4,I thought I brought an extra pack of gum .,pack


In [16]:
# figure out the columns of this dataset
rice.columns

Index(['probe', 'line', 'Rater_1', 'Rater_2', 'Meaning_R1', 'Meaning_R2',
       'Meaning_R1_BestGuess', 'Meaning_R2_BestGuess', 'Comment_R1',
       'Comment_R2', 'CertainAgreement_R1_R2',
       'CertainOrUncertainAgreement_R1_R2'],
      dtype='object')

In [17]:
# subset my data to just where raters agreed
agree = rice.loc[rice['CertainAgreement_R1_R2']==1]

In [18]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

rice_tokenized = [word_tokenize(x) for x in agree['line'].str.lower()]
rice_dictionary = Dictionary(rice_tokenized)
rice_corpus = [rice_dictionary.doc2bow(text) for text in rice_tokenized]

ambig_lda = LdaModel(corpus=rice_corpus, num_topics=10, minimum_probability=0)

In [19]:
# find all the unique probes in rice['probes']
for term in rice['probe'].unique():
  if term in rice_dictionary.token2id.keys():
    topics = ambig_lda.get_term_topics(rice_dictionary.token2id[term])
    print(term, topics)

pack [(0, 6.7012843e-07), (2, 6.8161394e-06), (4, 3.6192665e-05), (6, 0.00015470246), (7, 4.8771573e-05), (8, 1.2598084e-06)]
boxing [(7, 1.8465785e-08)]
spear [(4, 2.9770662e-08), (6, 1.5583635e-05), (7, 4.10562e-05)]
perk [(2, 2.7570266e-07), (6, 5.292729e-06)]
slam [(2, 1.0458667e-05), (6, 1.6346848e-06)]
cope [(6, 3.6947224e-08)]
mortar [(1, 4.850477e-07), (4, 3.311726e-08), (8, 1.4038118e-06)]
sheet [(0, 4.780449e-05), (2, 2.2386962e-06), (4, 6.3887717e-07), (5, 8.604184e-07), (6, 3.4664379e-06), (7, 1.0249078e-08)]
race [(0, 0.0002241557), (2, 1.1982501e-06), (3, 5.482213e-05), (4, 3.494444e-05), (5, 0.00029129133), (6, 1.1077971e-05), (7, 1.1939469e-05), (8, 3.7490088e-07), (9, 9.746949e-08)]
multiply [(0, 2.9043154e-05), (6, 5.119869e-08), (7, 3.4080514e-08)]
banker [(0, 3.6221564e-07), (2, 1.0115738e-05), (3, 5.7532486e-07), (6, 1.9558292e-06), (8, 1.7188546e-08)]
stalk [(1, 3.9752535e-08), (6, 2.6054582e-05)]
scale [(0, 1.3604407e-06), (1, 5.3925755e-06), (3, 4.711465e-08), (

In [29]:
# let's look at classifications of documents containing an ambiguous word
# "slick" has only two topics
# slick [(3, 4.517052e-05), (7, 9.48455e-06)]

slick_df = rice[rice['probe']=='slick']

print("Topic 3")
for term, score in model.get_topic_terms(3):
  print(rice_dictionary[term])

print()

print("Topic 7")
for term, score in model.get_topic_terms(7):
  print(rice_dictionary[term])

Topic 3
going
it
come
fire
racket
mess
beyond
don
time
fetch

Topic 7
it
come
time
representing
built-in
welcome
small
allah
pockets
1


In [55]:
# now we want to encode all our documents using LdaModel
threes = []
sevens = []
for i, row in slick_df.sample(n=10).iterrows():
  document_text = row['line']
  # get doc-topic representation
  doc_to_topic_rep = ambig_lda.get_document_topics(bow=rice_corpus[i])
  _, topic_3_score = doc_to_topic_rep[3]
  _, topic_7_score = doc_to_topic_rep[7]
  # compare the probabilities of our two "slick" topics
  print(topic_3_score, topic_7_score)
  # does the text align with our intuitions?
  if topic_3_score > topic_7_score:
    threes.append(document_text)
  elif topic_7_score > topic_3_score:
    sevens.append(document_text)

print()
print("These are are 'Topic 3' uses of 'slick':")
for doc in threes:
  print(doc)
print()
print("These are 'Topic 7' uses of 'slick:")
for doc in sevens:
  print(doc)

0.0066689197 0.006669977
0.61245483 0.008335385
0.011114817 0.23883118
0.0030309826 0.3115274
0.0043484955 0.1723827
0.18834357 0.66434586
0.01000216 0.5434803
0.005557318 0.0055559804
0.10803135 0.0055585233
0.0066695143 0.006667007

These are are 'Topic 3' uses of 'slick':
I 'm gon na hate to have to give you up , slick . ( KARR )
But I 've got this guy , he 's just come from America , he 's really slick .
Nothing . It might be a slick way to get to know her . Why ?
Its tyres are as slick as the law allows and you get racing suspension .

These are 'Topic 7' uses of 'slick:
Okay , we 're gon na get our show back on the air ... ... and were not gon na be intimidated by any slick executive types .
Still just as slick as a horsehair couch .
Oil slick to port .
Pretty slick .
My mind was slick , my temper was too quick
Or we can slick on it , oil slick


It looks like "slick" can be used in a more positive way, and a more negative way! But, we might need to do more digging into this ambiguous words dataset and wordnet to really say.

### As you can see, topic modelling algorithms like LDA do well with large amounts of data, but struggle with short documents just like Latent Semantic Analysis (LSA)

Every single model will struggle with short documents! And people do too. But people are better at it. Can you think of why?

# Next week:

## 1. Guest lecture by Liz on Monday! Please attend, she is awesome :)
## 2. Learning more about (neural) contextual language models