# Expanding on basic text processing

Basic text processing involves taking a corpus and producing useful representations of that text. Today, we will be talking about how we can look at corpora using basic tools. Now that we know how to load files into Colab, we can use a variety of built-in Python tools as well as natural language processing packages that are freely available as open-source software.

This lecture will cover the following general steps to understanding your data.

* "lumping" versus "splitting" in natural language processing and **lemmatization**
* Computing basic statistics
  * Counts of words
  * Counts of n-grams
  * Transition probabilities
  * Term frequency / inverse document frequency (tf-idf)
* Basic statistics in languages other than English
  * The need for linguistic specialization
  * Challenges posed by different writing systems (Hebrew, Chinese, Japanese)

In [1]:
## load in all of our imports from last time
# python built-ins
from collections import Counter
import os
from pprint import pprint
# colab-specific
from google.colab import drive
from google.colab import files
# nlp software we used last time
import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')

In [None]:
drive.mount("/content/drive", force_remount=True)

## alternately: upload a file from your local machine:
## uncomment the two lines below if uploading abstracts.tsv is easier
# from google.colab import files
# uploaded = files.upload()
# abstracts = uploaded['abstracts.tsv'].decode('utf-8')

location_of_my_abstracts = ('/content/drive/MyDrive/Teaching/'
                            'Fall2021/Computational Linguistics/'
                            'Lectures/supplementary_files')
## your location will probably be:
# location_of_my_abstracts = ('/content/drive/Shared/'
#                             'Computational Linguistics/Lectures/'
#                             'supplementary_files')
abstracts = open(os.path.join(location_of_my_abstracts,
                              'abstracts.tsv'), 'r').read().split("\n")
abstracts[0:5]

To build a dictionary ourselves, we loop through all of the words, add each word with a count of one if it is not already in the dictionary, and increment its count by one if it is. The code for that looks like this:

In [None]:
abstract_counts_dict = {}

for abstract in abstracts:
  abstract_tokenized_into_words = word_tokenize(abstract)
  for word in abstract_tokenized_into_words:
    if word not in abstract_counts_dict:
      abstract_counts_dict[word] = 1
    else:
      abstract_counts_dict[word] += 1

print(abstract_counts_dict)
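
The loop above is the hand-written version of what `collections.Counter` (imported earlier) does for us. Here is a minimal sketch, assuming a tiny made-up list called `toy_abstracts` and plain whitespace splitting in place of `word_tokenize`, so that it runs without any downloads:

```python
from collections import Counter

# hypothetical stand-in for the real abstracts
toy_abstracts = [
  "parsing algorithms are fast",
  "parsing is hard",
]

# build the dictionary by hand, as in the loop above
manual_counts = {}
for abstract in toy_abstracts:
  for word in abstract.split():
    if word not in manual_counts:
      manual_counts[word] = 1
    else:
      manual_counts[word] += 1

# Counter does the same bookkeeping for us
counter_counts = Counter()
for abstract in toy_abstracts:
  counter_counts.update(abstract.split())

print(manual_counts == dict(counter_counts))
print(counter_counts.most_common(1))
```

Writing the loop by hand is still worthwhile, because the same pattern comes back when we count bigrams later on.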

## Sentence formatting can matter

Sometimes a text is short and we may want to "lump" together different instances of the same word that appear in slightly different forms. For example:

> Doctor Marshall spent three weeks at a ski resort with their best doctor friends.

We might want to count both instances of "doctor" as the same. That is, the word "doctor" appears twice in this text. 

If we want to lump both instances of "doctor" together, then we can edit the string that our tokenization algorithm gets ahead of time. The most common way to do this is to get rid of the contribution of case, or whether a word is UPPER or lower case.

To change the case of a string, we can use the `.lower()` or `.upper()` methods. For example:

In [None]:
# Convert a string to lowercase with the .lower() method
print('Doctor Marshall spent three weeks at a ski resort with their best doctor friends.'.lower())

# Now try UPPERCASE with .upper()
print('Doctor Marshall spent three weeks at a ski resort with their best doctor friends.'.upper())

In [None]:
Counter(('Doctor Marshall spent three weeks at a ski '
         'resort with their best doctor friends.').lower().split(" ")).most_common()

In [None]:
Counter(('Doctor Marshall spent three weeks at a ski '
         'resort with their best doctor friends.').upper().split(" ")).most_common()
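
Splitting on `" "` leaves punctuation attached to words, so `"friends."` is counted with its period. As a rough sketch, lowercasing can be combined with a very simple regular-expression tokenizer. The `\w+` pattern is an assumption for illustration only and is much cruder than what `nltk` does:

```python
import re
from collections import Counter

sentence = ('Doctor Marshall spent three weeks at a ski '
            'resort with their best doctor friends.')

# lowercase first, then keep only runs of letters/digits as tokens
tokens = re.findall(r"\w+", sentence.lower())
counts = Counter(tokens)

# "Doctor" and "doctor" are now counted together
print(counts["doctor"])
# the trailing period is no longer part of the token
print("friends" in counts, "friends." in counts)
```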

### Can you think of a situation when we might want to preserve case information in a string?

<details>
<summary>
</summary>
  Case information is useful for telling what kind of word something is. For example, if we are trying to find all the corporations in a document (e.g., as part of a named entity recognition (NER) task), it will matter whether it is spelled "The Dow Chemical Company" versus "The Dow Chemical company." 
  Case information can also tell us about register -- if we see a lowercase i, it might be $i$ written in a mathematics paper, or it could be informal (e.g., a tweet).
</details>

## What other challenges are there for tokenization? Languages other than English

* Writing systems for Chinese languages
  * Tens of thousands of characters
  * Many, many more words -- "word" in Mandarin is often compared to English language compounds (e.g., "houseboat")
  * Words are not separated by spaces
  * As a result, boundaries between words are highly ambiguous
* Japanese
  * Three writing systems (kanji, hiragana, katakana)
  * No spaces are used
  * Three scripts are used for complex linguistic reasons

Tl;dr: Many written languages use scripts that do not easily translate into tokens in the English sense. We will delve more into this during the morphology section of the course 
👀

<!-- * Other scripts: E.g., Hebrew and Arabic
  * These writing systems do not mark vowels
    * Primarily available only for second language learners
  * Vowels can (usually) be inferred from context
  * Challenges for segmentation and tokenization because the vowels can completely change the kind of word (e.g., noun to verb; noun to another kind of noun, etc.) -->

In [None]:
# Google Translate of part of second paragraph of 
# https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97 accessed 9/9/2021: 
# Not only used by China, but for a long period of time,
# it has also served as the only internationally-used script 
# in East Asia. Before the 20th century, it was the written 
# standard script of the Korean Peninsula, Vietnam, Ryukyu, and Japan. 
# In addition to Chinese, the ancient East Asian countries all created Chinese 
# characters on their own to a certain extent. 

chinese_script_from_wikipedia = (
    "不單中國使用，在很長時期內還充當東亞地區唯一的國際通用文字，"
    "在20世紀前都是朝鮮半島、越南、琉球和日本等國家的書面規範文字。"
    "除了漢語之外，古代東亞諸國都有一定程度地自行創製漢字。 ")

pprint(chinese_script_from_wikipedia.split("，"))

It is clear from the above that splitting on `"，"` only breaks this sentence into clauses along these commas. This is a great case where we will want to use a regular expression to split the sentence on either `"，"` or `"。"`

For this, we need language-specific knowledge.

In [None]:
# split the sentence above on either the comma or the period characters
# here we import the regular expression built-in module from Python
import re

re.split('，|。', chinese_script_from_wikipedia)
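
Splitting on punctuation often leaves empty or whitespace-only pieces behind, for example after the final `"。"`. The sketch below cleans those up, using a shortened copy of the sentence and a character class (`[，。]`), which matches the same characters as `'，|。'`. Note that this still yields clauses, not words:

```python
import re

text = "不單中國使用，在很長時期內還充當東亞地區唯一的國際通用文字。 "

# split on either the full-width comma or the full-width period
pieces = re.split("[，。]", text)

# drop pieces that are empty or whitespace-only
clauses = [p.strip() for p in pieces if p.strip()]

print(pieces)
print(clauses)
```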

# Lemmatization and tokenization with spaCy

Sometimes we want a more sophisticated, linguistically-informed way to process our sentences. The open source Python package `spaCy` uses algorithms and pre-trained models from some of the most recent advances in NLP to make its tokenization and processing decisions. `spaCy` typically returns intuitive tokenizations. So, we will check out the output of `spaCy`'s model for English, which we need to download and load into the notebook.

In [None]:
# install required packages
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
# Begin analysis with spaCy -- looking at the structure of the output
import spacy

spacy_model = spacy.load('en_core_web_sm')

# a quick example on one of the abstracts from earlier
spacy_abstract = spacy_model(abstracts[17])
pprint(spacy_abstract)

Note that the output above does not look like a normal Python string or list of strings. Normally, we would expect to see quotation marks; their absence tells us that this is a special spaCy object rather than plain text.

In order to look inside, we have to actually inspect each of the objects within the `spacy_abstract` itself.

Calling the spaCy model on a string returns a `Doc`, which contains sentences (available through `.sents`) and tokens (`Token` objects). You can iterate through all the tokens with a simple `for` loop.

spaCy does a lot of things on the backend to give you reasonably high-quality results. You can inspect all of the attributes it gives you for a `Token` object by typing `.` after a token variable and using tab completion in the notebook.

```python
for t in spacy_abstract[0:10]:
  print(t.dep_, t.pos_, t.tag_)
```

will give you the output

```
amod ADJ JJ
compound NOUN NN
nsubj NOUN NN
punct PUNCT -LRB-
appos PROPN NNP
punct PUNCT -RRB-
aux AUX VBZ
ROOT VERB VBN
amod VERB VBG
dobj NOUN NN
```

which correspond to values for the token's syntactic dependency (Week 12), its broad part-of-speech category (Week 11), and its classic syntactic part-of-speech "tag" (Week 11).

## From tokenization to lemmatization

Our next goal with this is to extract **lemmas** from the spaCy tokens that it automatically processes.

## What is a lemma?

A lemma is something like the "base form" of a word. So for the words "cats" and "cat" or "sleeping" and "sleeps", we might want to collapse into just "cat" and "sleep" respectively. This is related to the UPPERCASE/lowercase problem earlier. 

The reason we often want to do this is _data sparsity_, which can make it hard to use language data for other applications. That is, sometimes we do not care what particular form of a word appeared in a text because that form might be so rare as to not be useful. So, we can combine it with another (related!) word for analysis.

## What is lemmatization?

**Lemmatization** turns more complex words into their base forms. This is useful even for languages like English, which has relatively simple wordforms (singular/plural, very few ways to change verb forms), and it is even more useful for languages like French or Spanish, where verbs take many more forms than they do in English.

<center><img src="https://leconjugueur.lefigaro.fr/images/ipadlcecr3.png" width = 700 alt="A screenshot of a conjugation table from Le Conjugueur. It illustrates the conjugation of the verb _avoir_."/></center>

So, in many cases, we want to turn potentially hundreds of forms into one single form. This will help us avoid the problem of data sparsity. We can turn many rare words into more common ones -- which effectively allows us to learn even from events made of rare words. Otherwise, without clever models, we would basically be throwing that data away.

## Getting lemmas from `spaCy`

The lemmas, or the lemmatized forms of words that are the output of natural language processing models, are one of the many linguistic features that `spaCy` will give you. Most of the time, these lemmas are obtained through a complex dictionary lookup. That is, a word (`key`) like "sleeps" will have a lemma `value` of "sleep." Other times, sophisticated algorithms are used to try to guess the lemma (a topic we will cover in the Morphology lectures in Week ).
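
The dictionary-lookup idea can be sketched in a few lines. This is only an illustration of the key/value picture, not how `spaCy` is implemented. The table `toy_lemma_table` and the function `toy_lemmatize` are made up for this example:

```python
# hypothetical, hand-made lookup table: surface form -> lemma
toy_lemma_table = {
  "sleeps": "sleep",
  "sleeping": "sleep",
  "slept": "sleep",
  "cats": "cat",
}

def toy_lemmatize(word):
  """Return the lemma if the word is in the table, else the lowercased word."""
  lowered = word.lower()
  return toy_lemma_table.get(lowered, lowered)

print([toy_lemmatize(w) for w in ["Cats", "sleeps", "dog"]])
```

A word that is missing from the table (here, "dog") simply falls through unchanged, which is one reason real systems back up their lookups with guessing algorithms.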

To get the lemma form of a word, simply ask for the `.lemma_` property of a `Token` object:

In [None]:
# get the lemmas from spacy
for t in spacy_abstract[0:10]:
  print(t.lemma_)

## Aggregate all lemmas to make lemma-nade (better count estimates for rare words)

We can combine our techniques above and create a dictionary of word frequencies using the _lemmatized strings_ rather than the raw strings, as in the below example:

In [None]:
# this will take a while, running locally is not recommended
lemmatized_abstract_counts_dict = {}

for abstract in abstracts:
  abstract_tokenized_into_words = spacy_model(abstract)
  for word in abstract_tokenized_into_words:
    # we have to change the previous examples
    # from *word* to *lemma*
    lemma = word.lemma_
    if lemma not in lemmatized_abstract_counts_dict:
      lemmatized_abstract_counts_dict[lemma] = 1
    else:
      lemmatized_abstract_counts_dict[lemma] += 1

print(lemmatized_abstract_counts_dict)

There are many fewer lemmas than individual words:

In [None]:
len(lemmatized_abstract_counts_dict), len(abstract_counts_dict)

### Lemmatization can have huge consequences for your data:

In [None]:
pprint(sorted(abstract_counts_dict.items(),
              key=lambda item: item[1], reverse=True)[0:5])
pprint(sorted(lemmatized_abstract_counts_dict.items(),
              key=lambda item: item[1], reverse=True)[0:5])

Or, we can check out familiar terms. Any idea what types of words will change most? When do you think lemmatization can hurt what we learn?

* 
* 
* 

In [None]:
# suggest something for us to search for!

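# .get(word, 0) prints 0 for a missing key instead of raising a KeyError
# (e.g., 'went' may be absent from the lemma counts if it was mapped to 'go')
print(abstract_counts_dict.get('priority', 0), lemmatized_abstract_counts_dict.get('priority', 0))
print(abstract_counts_dict.get('organization', 0), lemmatized_abstract_counts_dict.get('organization', 0))
print(abstract_counts_dict.get('get', 0), lemmatized_abstract_counts_dict.get('get', 0))
print(abstract_counts_dict.get('go', 0), lemmatized_abstract_counts_dict.get('go', 0))
print(lemmatized_abstract_counts_dict.get('went', 0))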

### Debugging a question that came up in class: Why would the counts of something like "the" be different across the two lists?

Some possibilities:

* There is something wrong with the code
* There are differences between `spaCy` and `nltk` and how they tokenize. For example, it looks like `spaCy` keeps words linked to their parentheses. This is a puzzle but it is worth looking into because this behavior is a bit unexpected.

In [None]:
[x for x in lemmatized_abstract_counts_dict if "the" in x]
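
One further possibility worth checking (an assumption, not a confirmed diagnosis) is that the raw dictionary stores case variants such as "The" and "the" under separate keys, while the lemma dictionary may collapse them. The sketch below uses made-up counts in `toy_raw_counts` to show how to find such variants. The same comprehension can be run on `abstract_counts_dict`:

```python
# hypothetical raw counts, standing in for abstract_counts_dict
toy_raw_counts = {"The": 3, "the": 10, "THE": 1, "theory": 2}

# which raw keys are case variants of "the"?
variants = {w: c for w, c in toy_raw_counts.items() if w.lower() == "the"}

print(variants)
# total count once the case variants are lumped together
print(sum(variants.values()))
```

Unlike the substring test `"the" in x`, comparing `w.lower() == "the"` does not pick up unrelated words like "theory".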

# Computing more than count statistics for single tokens

* Multi-word sequences (n-grams)
* Conditional probabilities (transition probabilities)

For Wednesday and Friday:

* Bag-of-words representations (vector representations)
* tf-idf (term frequency / inverse document frequency)
* Smoothing and overcoming sparsity
* Text normalization (e.g., spelling correction, edit distance)

# N-gram frequencies

**N-grams** are defined as multi-word combinations of length $n$. Most of the time, for a language like English, these are immediately adjacent tokens in a sentence. **Bigrams** are two-word sequences, **trigrams** are three-word sequences, and longer sequences are simply called **4-grams**, **5-grams**, and so on.
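
As a quick sketch of the definition, the helper below (named `ngrams` here for illustration) lists every run of $n$ adjacent tokens in a whitespace-split toy sentence:

```python
def ngrams(tokens, n):
  """Return all runs of n adjacent tokens as tuples."""
  # zip together the token list and its shifted copies
  return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the cat sat on the mat".split()

print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

A sentence with six tokens has five bigrams and four trigrams, because each n-gram needs $n - 1$ tokens after its first word.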

Computing bigram frequencies is very simple -- the easiest thing to do is to create a dictionary of dictionaries with the first word ($w_1$) as the key to another dictionary that contains the second words ($w_2$) as keys, which are then mapped onto the value (frequency) of that bigram.

For this code, as with the unigram frequency code, note that the `not in` check **must** come first. We can't increment a count if the entry doesn't already exist in the dictionary; trying to do so raises a `KeyError`.

In [None]:
abstract_bigrams_dict = {}

for abstract in abstracts:
  abstract_tokenized_into_words = word_tokenize(abstract)
  # note: enumerate allows us to get indices
  for i, word in enumerate(abstract_tokenized_into_words):
    word_string = word
    # if we are not yet at the end of the sentence
    if i < (len(abstract_tokenized_into_words) - 1):
      next_word_string = abstract_tokenized_into_words[i + 1]
    else:
      next_word_string = "<<EOS>>"
    # have we ever seen this word as the first one in a bigram?
    if word_string not in abstract_bigrams_dict:
      abstract_bigrams_dict[word_string] = {next_word_string: 1}
    else:
      # have we ever seen the second word with the first word?
      if next_word_string not in abstract_bigrams_dict[word_string]:
        abstract_bigrams_dict[word_string][next_word_string] = 1
      else:
        abstract_bigrams_dict[word_string][next_word_string] += 1
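
The nested `if`/`else` checks can also be avoided with `collections.defaultdict` and `Counter`, which create missing entries on demand. Here is a minimal sketch of the same dictionary-of-dictionaries structure, again assuming a made-up `toy_abstracts` list and whitespace splitting instead of `word_tokenize`:

```python
from collections import Counter, defaultdict

toy_abstracts = ["parsing algorithms are fast", "parsing is hard"]

# a missing first word gets an empty Counter; a missing second word counts as 0
bigram_counts = defaultdict(Counter)

for abstract in toy_abstracts:
  tokens = abstract.split()
  # pair each token with its successor; the last token pairs with <<EOS>>
  for w1, w2 in zip(tokens, tokens[1:] + ["<<EOS>>"]):
    bigram_counts[w1][w2] += 1

print(dict(bigram_counts["parsing"]))
```

The explicit version above is easier to debug line by line. This version is shorter but hides the "have we seen this before?" logic inside the data structure.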

In [None]:
sum(abstract_bigrams_dict['parsing'].values()), abstract_counts_dict['parsing']

In [None]:
# briefly inspect one example to make sure it is saving counts properly
pprint(sorted(abstract_bigrams_dict['parsing'].items(),
              key=lambda item: item[1], reverse=True)[0:20])

## Computing conditional probabilities

Now that you have a dictionary of frequencies that has the structure `{w1: {w2: count(w1, w2)}}`, you can compute conditional probabilities. Formally, for a pair of words that occur in sequence in a sentence $w_1$ and $w_2$, these are defined as:

<center>$\large p(w_2|w_1) = \frac{p(w_1 \cap w_2)}{p(w_1)}$</center>

We already know $p(w_1)$. To compute the probability of a _single_ word, all we have to do is find the frequency of $w_1$ and divide it by the sum of the counts of all of the words.

We also know $p(w_1 \cap w_2)$ because we computed the frequencies of the words occurring together. For example, the frequency of "parsing algorithms" is 21 (from the above). To turn that frequency into the _joint probability of the pair_, all we have to do is divide by the count of all of the pairs of words in our dataset (i.e., the total number of pairs we have seen).
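
Writing both probabilities as counts shows why the totals cancel. In this sketch, $c(\cdot)$ stands for a count and $N$ for the total number of bigrams; both symbols are introduced here for illustration. Because the bigram code pairs the last word with `<<EOS>>`, every token starts exactly one bigram, so $N$ is also the total number of tokens:

<center>$\large p(w_2|w_1) = \frac{p(w_1 \cap w_2)}{p(w_1)} = \frac{c(w_1, w_2) / N}{c(w_1) / N} = \frac{c(w_1, w_2)}{c(w_1)}$</center>

In other words, the conditional probability is just the bigram count divided by the count of the first word.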



In [None]:
total_pairs = 0
for w1 in abstract_bigrams_dict:
  total_pairs += sum(abstract_bigrams_dict[w1].values())

total_pairs

In [None]:
freq_parsing = abstract_counts_dict['parsing']
total_count_of_completions = sum(abstract_counts_dict.values())
p_parsing = freq_parsing / total_count_of_completions
p_parsing

In [None]:
# if "parsing algorithms" occurs 21 times, then it makes up:
p_parsing_and_algorithms = 21 / 3559578
p_parsing_and_algorithms

In [None]:
# therefore the probability of algorithms _given_ parsing is:
p_parsing_and_algorithms / p_parsing

In [None]:
# now we can compare another alternative -- dividing by known frequencies
abstract_bigrams_dict['parsing']['algorithms'] / sum(abstract_bigrams_dict['parsing'].values())
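
Dividing by known frequencies generalizes to every word that follows a given $w_1$. The helper below (named `transition_probabilities` here for illustration) takes a dictionary with the `{w1: {w2: count}}` structure and returns the full conditional distribution. The counts in `toy_bigrams` are made up, not taken from the abstracts:

```python
def transition_probabilities(bigram_counts, w1):
  """Return {w2: p(w2 | w1)} from a {w1: {w2: count}} dictionary."""
  total = sum(bigram_counts[w1].values())
  return {w2: count / total for w2, count in bigram_counts[w1].items()}

# hypothetical counts for illustration
toy_bigrams = {"parsing": {"algorithms": 3, "is": 1}}

probs = transition_probabilities(toy_bigrams, "parsing")
print(probs)
# a conditional distribution should sum to 1
print(sum(probs.values()))
```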

## Bayes' rule

Bayes' rule allows us to change between two different conditional probabilities. Bayes' rule comes up in many places when we have one known probability but we do not know others. It can take a lot of practice to become comfortable with.

Here is how it applies to the previous example:

$p(w_2=\text{algorithms} | w_1=\text{parsing})$ = $\large \frac{p(w_1=\text{parsing} | w_2=\text{algorithms}) \cdot p(w_2=\text{algorithms})}{p(w_1=\text{parsing})}$
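
To see that both sides of the equation agree, the sketch below estimates each term from a made-up eight-word token list and compares the direct estimate with the one obtained through Bayes' rule. For simplicity it does not pad with `<<EOS>>`, so the numbers are illustrative only:

```python
from collections import Counter

tokens = ("parsing algorithms beat parsing heuristics "
          "and sorting algorithms").split()
pairs = list(zip(tokens, tokens[1:]))
n = len(pairs)

pair_counts = Counter(pairs)
first_counts = Counter(w1 for w1, _ in pairs)    # word in first position
second_counts = Counter(w2 for _, w2 in pairs)   # word in second position

w1, w2 = "parsing", "algorithms"

# direct estimate of p(w2 | w1)
direct = pair_counts[(w1, w2)] / first_counts[w1]

# the same quantity via Bayes' rule
p_w1_given_w2 = pair_counts[(w1, w2)] / second_counts[w2]
p_w2 = second_counts[w2] / n
p_w1 = first_counts[w1] / n
via_bayes = p_w1_given_w2 * p_w2 / p_w1

print(direct, via_bayes)
```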



## For Wednesday

Take a look at the assignment to be released by 5pm on Monday September 12, 2021. Please come to class with questions or post on the Discord server.

Be sure to keep up with the readings. This week is:

* Jurafsky and Martin (Speech and Language Processing 3rd Ed.) Ch. 3 (pdf pages 37-54)
* Eisenstein (2019) section 2.1 – Bag of Words (pdf pages 31-34)