### Clarification of Question 6 on HW2
There is an error in Question 6 that states that you need to select *random* indices in order to compare three different tokenization systems. You do not need to randomize anything. However, you must select a _slice_ of length five, with each element corresponding to one abstract that you will have tokenized using different methods. 

### Clarification of Question 7 on HW2

You will need to tabulate all word counts across all abstracts. The easiest way to combine your abstracts into one object that `Counter` can process efficiently is by "adding" lists, which _concatenates_ them. You can then pass the concatenated list as an argument to `Counter`. For example:

```python
my_array = ['this', 'is', 'a', 'sentence']
another_array = ['here', 'is', 'another', 'one']

concatenated_arrays = my_array + another_array
concatenated_arrays
# expected output: ['this', 'is', 'a', 'sentence', 'here', 'is', 'another', 'one']
```
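
As a minimal sketch of the next step, the snippet below concatenates several token lists in a loop and passes the result to `Counter`. The variable `tokenized_abstracts` and its contents are made up for illustration; your own list of tokenized abstracts will be much longer.

```python
from collections import Counter

# Hypothetical stand-in for your list of tokenized abstracts
tokenized_abstracts = [
    ['this', 'is', 'a', 'sentence'],
    ['here', 'is', 'another', 'one'],
    ['one', 'more', 'sentence'],
]

# "Add" each token list onto one long list
all_tokens = []
for tokens in tokenized_abstracts:
    all_tokens = all_tokens + tokens

# Counter tallies how often each token occurs in the combined list
word_counts = Counter(all_tokens)
print(word_counts['is'])        # 2
print(word_counts['sentence'])  # 2
print(word_counts['unseen'])    # 0 (Counter returns 0 for missing keys)
```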

<hr />

# Better estimating probabilities in our data

Today covers two broad topics:

* N-gram smoothing
* TF-IDF (Term frequency, inverse document frequency)

We will cover each of these things somewhat separately from each other, but they will all become relevant again in a couple weeks when we start talking about distributional models of word meaning/semantics and vector representations of words and other kinds of tokens.

## The probability of a document

The probability of a document is the product of the probabilities of the tokens in that document. It is a measure of the _fit_ of a statistical model $M$ to a dataset $D$, such as the fit of a statistical language model to a corpus or to a novel dataset. Typically, we want to pick the model that _maximizes_ the likelihood of a dataset because we want to get as close as possible to the real process that generated our data.

The way to get the highest likelihood for a dataset is to take estimates from that exact dataset. If we estimate word frequencies by computing counts in a corpus $C$, then those relative frequencies give the maximum-likelihood **unigram model** of that corpus. But, if we apply those frequencies to another corpus $C'$, we are not guaranteed to get good estimates. Those corpora may be completely different. The counts that maximize the likelihood for $C'$ will probably differ.

## Estimating the probability of a document 

When estimating the probability of a sentence like

> "My friend Moss grew up on an island"

we have a lot of choices. Here are some:

* Unigram probabilities (word frequencies)
* Bigram probabilities
* Transition or conditional probabilities

If we are computing a sentence's **unigram probability**, we can break the problem down into a multiplication of $k$ unigram probabilities, one for each token (e.g., word) in the sentence:

<center>$\Large \prod_{i=1}^{k} \frac{count(w_i)}{\sum_{j} count(w_j)}$</center>
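
The product above can be sketched in a few lines of Python. The toy `corpus` and the helper name `unigram_probability` are assumptions made for illustration, not part of the course data.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
corpus = ['my', 'friend', 'grew', 'up', 'on', 'an', 'island',
          'my', 'friend', 'has', 'an', 'island']
counts = Counter(corpus)
total = sum(counts.values())  # 12 tokens in all

def unigram_probability(sentence_tokens):
    """Multiply the relative frequency of every token in the sentence."""
    probability = 1.0
    for token in sentence_tokens:
        probability *= counts[token] / total
    return probability

p = unigram_probability(['my', 'friend'])
print(p)  # (2/12) * (2/12), roughly 0.028
```

With longer sentences, products of many small probabilities get very close to zero, so it is common to sum log probabilities instead of multiplying raw probabilities.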

### What should we do with words we have never seen before?

* teh --> the
* hangry --> hungry
* A word written withoutspaces

## Why are unseen events a problem with maximum likelihood estimation?
<details>
<summary>
</summary>

  * As NLP practitioners, we often want to "score" language based on its probability, or some product of probabilities, i.e., compute the _likelihood_ of a document
  * If the counts anywhere in the model are 0, then the whole probability is 0...
</details>

## The same problems are posed by combinations of words 

*Except* combinations of words are almost always rarer than the words that make them up. This means that there are even more rare events to deal with (remember Zipf's law).

$p(w_1w_2) = \large \frac{count(w_1w_2)}{\sum_{w_i} \sum_{w_{i+1}} count(w_iw_{i+1})}$

For example, if we do not see the combination $w_1w_2$, then $count(w_1w_2)=0$. By substitution,

$p(w_1w_2) = \large \frac{0}{\sum_{w_i} \sum_{w_{i+1}} count(w_iw_{i+1})} = \normalsize 0$
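
To make the zero-count problem concrete, here is a minimal sketch that counts adjacent word pairs in a single made-up sentence. The token list and the helper `bigram_probability` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical tokenized text
tokens = ['my', 'friend', 'moss', 'grew', 'up', 'on', 'an', 'island']

# Adjacent pairs (w_i, w_{i+1})
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
total_bigrams = sum(bigram_counts.values())  # 7 pairs

def bigram_probability(w1, w2):
    """Relative frequency of the pair (w1, w2) among all observed pairs."""
    return bigram_counts[(w1, w2)] / total_bigrams

seen = bigram_probability('my', 'friend')    # 1/7
unseen = bigram_probability('friend', 'my')  # 0.0: both words occur, but never in this order
print(seen, unseen)
```

Both "friend" and "my" are in the data, yet the pair in this order has a count of 0, so any product that includes it is also 0.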


## We can **smooth** the counts:

* [Add-one smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
* [Good-Turing](https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation)
* [Katz's backoff](https://en.wikipedia.org/wiki/Katz%27s_back-off_model)
* [Kneser-Ney](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing)

We can use **pseudocounts** to change the frequencies of words in our vocabulary. Consider, for example, the **additive smoothing** algorithm. The maximum likelihood estimate of the probability of a word $w_i$ is

<center>$\large p_{w_i^\text{empirical}} = \frac{count(w_i)}{\sum_{j} count(w_j)}$</center>

but the posterior probability when additively smoothed is

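<center>$\large p_{w_i^{\alpha\text{-smoothed}}} = \frac{count(w_i) + \alpha}{\sum_{j} count(w_j) + \alpha V}$</center>

where $V$ is the number of distinct words in the vocabulary, so that the smoothed probabilities still sum to 1.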
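
Here is a minimal sketch of additive smoothing, assuming the standard form in which the pseudocount $\alpha$ is added once per vocabulary item, so the denominator grows by $\alpha$ times the vocabulary size. The toy corpus, the vocabulary (which includes one word we never observed), and the helper name are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus
corpus = ['the', 'cat', 'saw', 'the', 'dog']
counts = Counter(corpus)
total = sum(counts.values())  # 5 tokens

# Assumption: the vocabulary is the observed words plus one unseen word
vocabulary = set(counts) | {'hangry'}
V = len(vocabulary)  # 5 word types

def smoothed_probability(word, alpha=1.0):
    """Additively smoothed estimate: (count + alpha) / (total + alpha * V)."""
    return (counts[word] + alpha) / (total + alpha * V)

print(smoothed_probability('the'))     # (2 + 1) / (5 + 1 * 5) = 0.3
print(smoothed_probability('hangry'))  # (0 + 1) / (5 + 1 * 5) = 0.1
```

The unseen word now gets a small nonzero probability, and the probabilities over the whole vocabulary still add up to 1.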

These pseudocounting techniques can help us out of a rut when we run into unfamiliar language. But, there are many ways 

<hr />

## Joint probabilities and transition probabilities revisited

It is clear, however, that unigram probabilities are not sufficient. Words like "Moss" or "island" can depend on other words in the sentence. 

<details>
<summary>
The Chain Rule
</summary>

But, estimating those probabilities can be very challenging because we would first need to compute a probability for every possible pair, and pairs of pairs, and so on. There are so many parameters we would need to estimate, and then we would need to apply [the chain rule](https://en.wikipedia.org/wiki/Chain_rule_(probability)) (the Jurafsky and Martin chapter is very useful for illustrating this point). This is, for most purposes, too complex to compute and, in practice, impossible to estimate.
</details>

In practice, we build very simple models to estimate the likelihood of a dataset given a model instead. We can simplify our problem if we look only at two-word combinations. Specifically, we can look at the **transition probability**, which we can define as the proportion of times we see each word following the immediately preceding word.

Recall that a transition probability is defined as:

<center>$\large p(w_{i+1} | w_i)$</center>

If we are working with transition probabilities, we can break down the bigram problem into a product of $k - 1$ transition probabilities for the $k$ tokens (e.g., words) in a sentence:

<center>$\large \prod_{i=1}^{k-1} p(w_{i+1} | w_i)$</center>

which breaks down to:

<center>$\large p(w_{2} | w_1) \cdot p(w_{3} | w_{2}) \cdot \ldots \cdot p(w_{k} | w_{k - 1})$</center>

Or, in this example:

<center>$p(w_{2}==\text{"friend"} | w_1==\text{"My"}) \cdot p(w_{3}==\text{"Moss"} | w_{2}==\text{"friend"}) \cdot \ldots \cdot p(w_{k}==\text{"island"} | w_{k - 1}==\text{"an"})$</center>

This is far fewer probabilities to compute than if we considered all possible pairs of positions (on the order of $k$ versus $k^2$), which is what we would need in order to estimate [the full **joint probability**](https://en.wikipedia.org/wiki/Joint_probability_distribution).
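
A minimal sketch of estimating transition probabilities from counts and multiplying them along a sentence is shown below. The toy token list and the helper names `transition_probability` and `sentence_probability` are assumptions for illustration. The estimate used here is the count of the pair divided by the count of the preceding word.

```python
from collections import Counter

# Hypothetical tokenized text
tokens = ['my', 'friend', 'moss', 'and', 'my', 'friend', 'roy',
          'and', 'my', 'island']

# How often each adjacent pair occurs
bigram_counts = Counter(zip(tokens, tokens[1:]))
# How often each word occurs as the *first* word of a pair
first_word_counts = Counter(tokens[:-1])

def transition_probability(previous, current):
    """Estimate p(current | previous) as count(previous current) / count(previous)."""
    if first_word_counts[previous] == 0:
        return 0.0  # we never saw `previous` start a pair
    return bigram_counts[(previous, current)] / first_word_counts[previous]

def sentence_probability(sentence_tokens):
    """Multiply the k - 1 transition probabilities of a k-token sentence."""
    probability = 1.0
    for previous, current in zip(sentence_tokens, sentence_tokens[1:]):
        probability *= transition_probability(previous, current)
    return probability

print(transition_probability('my', 'friend'))          # 2 of the 3 pairs starting with "my"
print(sentence_probability(['my', 'friend', 'moss']))  # (2/3) * (1/2)
```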

There are more complicated ways to change the probabilities, many of which try to incorporate knowledge about the context in which tokens occur. We will return to these issues when we talk about semantics in two weeks.

# Term Frequency-Inverse Document Frequency (tf-idf)

This measure is one way of computing the "distinctiveness" of a word for a given document. Like a conditional probability, we will be counting up words and the contexts in which they occur, but we will not be computing a real probability. 

tf-idf is defined here as a ratio: the count of a word in a document, relative to the (log-transformed) number of documents in which that word occurs. For example, for "parsing" from previous examples, we can first compute how often "parsing" appears in each document, and then the number of documents in which it appears:

```python
# Let's load in the data again

import os
from pprint import pprint

# colab-specific
from google.colab import drive
from google.colab import files

# nlp software we used last time
import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')

drive.mount("/content/drive", force_remount=True)

## alternately: upload a file from your local machine:
## uncomment the two lines below if uploading abstracts.tsv is easier
# uploaded = files.upload()
# abstracts = uploaded['abstracts.tsv'].decode('utf-8').split("\n")

location_of_my_abstracts = ('/content/drive/MyDrive/Teaching/'
                            'Fall2021/Computational Linguistics/'
                            'Lectures/supplementary_files')
## your location will probably be:
# location_of_my_abstracts = ('/content/drive/Shared/'
#                             'Computational Linguistics/Lectures/'
#                             'supplementary_files')

# read the file and split it into one abstract per line
with open(os.path.join(location_of_my_abstracts,
                       'abstracts.tsv'), 'r') as abstracts_file:
    abstracts = abstracts_file.read().split("\n")
```

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Mounted at /content/drive


```python
import numpy as np

# tokenize every abstract into a list of words
parsed_abstracts = [word_tokenize(x) for x in abstracts]
# equivalent loop:
# parsed_abstracts = []
# for x in abstracts:
#     parsed_abstract = word_tokenize(x)
#     parsed_abstracts.append(parsed_abstract)

# start every abstract (keyed by its index) at a count of 0
parsing_term_frequencies = {i: 0 for i, _ in enumerate(abstracts)}

# computing the frequency of "parsing" in each document
# enumerate yields (index, item) pairs, so i is the abstract's position
for i, abstract in enumerate(parsed_abstracts):
    count_of_parsing_in_doc = [x == 'parsing' for x in abstract]  # list of bool
    parsing_term_frequencies[i] = sum(count_of_parsing_in_doc)  # True counts as 1

# computing the document frequency of "parsing"
# document frequencies must be log transformed
parsing_is_in_doc = [parsing_term_frequencies[i] > 0  # True or False
                     for i in parsing_term_frequencies]
parsing_doc_frequency = np.log(sum(parsing_is_in_doc))
```

TF-IDF is just the ratio between the term frequency of a word in a document $d$ and the (log-transformed) number of documents $d \in D$ in the collection $D$ that contain that word. For "parsing", that will look like this:

```python
abstract_text_to_parsing_tfidf = {}
for i, abstract in enumerate(parsed_abstracts):  # note the enumerate
    # get the abstract text by back-tracking to the original file
    abstract_text = abstracts[i]
    # get the term frequency for that abstract
    parsing_tf = parsing_term_frequencies[i]
    # divide term frequency by (log) document frequency
    tf_idf = parsing_tf / parsing_doc_frequency
    # map the doc and the score onto each other
    abstract_text_to_parsing_tfidf[abstract_text] = tf_idf
```
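
The cells above handle a single word. As a rough sketch of how the same ratio (term frequency divided by log document frequency, as defined in this lecture) could be computed for any word, the snippet below uses `Counter` on a tiny made-up collection. The documents, the helper name `lecture_tfidf`, and the use of `math.log` in place of `np.log` are assumptions for illustration, and the sketch is only exercised on a word that appears in two documents.

```python
import math
from collections import Counter

# Hypothetical toy collection of tokenized documents
docs = [
    ['parsing', 'is', 'hard', 'parsing', 'is', 'fun'],
    ['tagging', 'is', 'easy'],
    ['parsing', 'and', 'tagging'],
]

# Term frequencies: one Counter per document
term_frequencies = [Counter(doc) for doc in docs]

# Document frequencies: count each word once per document it appears in
document_frequencies = Counter()
for doc in docs:
    document_frequencies.update(set(doc))

def lecture_tfidf(word, doc_index):
    """Term frequency in one document divided by log document frequency."""
    log_df = math.log(document_frequencies[word])
    return term_frequencies[doc_index][word] / log_df

scores = [lecture_tfidf('parsing', i) for i in range(len(docs))]
print(scores)  # highest for the first document, 0.0 for the second
```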

```python
#@title We now sort the dictionary with tf-idf scores to find the most "parsing" abstract
pprint(sorted(abstract_text_to_parsing_tfidf.items(),
              key=lambda item: item[1],
              reverse=True)[0])
```

('Computational text-level discourse analysis mostly happens within Rhetorical '
 'Structure Theory (RST), whose structures have classically been presented as '
 'constituency trees, and relies on data from the RST Discourse Treebank '
 '(RST-DT); as a result, the RST discourse parsing community has largely '
 'borrowed from the syntactic constituency parsing community. The standard '
 'evaluation procedure for RST discourse parsers is thus a simplified variant '
 'of PARSEVAL, and most RST discourse parsers use techniques that originated '
 'in syntactic constituency parsing. In this article, we isolate a number of '
 'conceptual and computational problems with the constituency hypothesis. We '
 'then examine the consequences, for the implementation and evaluation of RST '
 'discourse parsers, of adopting a dependency perspective on RST structures, a '
 'view advocated so far only by a few approaches to discourse parsing. While '
 'doing that, we show the importance of the notion of h

```python
print(abstract_text_to_parsing_tfidf.values())
print(sorted(abstract_text_to_parsing_tfidf.values(), reverse=True))
```

dict_values([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.13848295882740902, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0

```python
#@title We can also look further down the ranking, e.g., at the 16th most "parsing" abstract
pprint(sorted(abstract_text_to_parsing_tfidf.items(),
              key=lambda item: item[1],
              reverse=True)[15])
```

('Syntactic parsing plays a crucial role in improving the quality of natural '
 'language processing tasks. Although there have been several research '
 'projects on syntactic parsing in Vietnamese, the parsing quality has been '
 'far inferior than those reported in major languages, such as English and '
 'Chinese. In this work, we evaluated representative constituency parsing '
 'models on a Vietnamese Treebank to look for the most suitable parsing method '
 'for Vietnamese. We then combined the advantages of automatic and manual '
 'analysis to investigate errors produced by the experimented parsers and find '
 'the reasons for them. Our analysis focused on three possible sources of '
 'parsing errors, namely limited training data, part-of-speech (POS) tagging '
 'errors, and ambiguous constructions. As a result, we found that the last two '
 'sources, which frequently appear in Vietnamese text, significantly '
 'attributed to the poor performance of Vietnamese parsing.',
 0.9693807

One last bit of food for thought: can you think of cases when tf-idf will not work right? (Hint: Look at the denominator.)