# An introduction to basic text processing

Basic text processing involves taking a corpus and producing useful representations of that text. Today, we will be talking about how we can look at corpora using basic tools. Now that we know how to load files into Colab, we can use a variety of built-in Python tools as well as natural language processing packages that are freely available as open-source software.

## Problems for basic text processing

This lecture is going to cover the following general steps to understanding your data.

* What is a corpus? How do we assemble **corpora**?
* How do we turn corpora into useful units for analysis, such as paragraphs, sentences, or words?
  * How do we define **tokens**?
* "lumping" versus "splitting" in natural language processing and **lemmatization**
* Computing basic statistics
  * Counts of words
  * Counts of n-grams
  * Transition probabilities
  * Term frequency / inverse document frequency (tf-idf)
* Basic statistics in languages other than English
  * The need for linguistic specialization
  * Challenges posed by different writing systems (Hebrew, Chinese, Japanese)

# What is a corpus?

* A collection of written, spoken, or signed language
* May vary in size
  * Extremely small: Hand-collected sentences or words (e.g., speech errors)
  * Dense -- recordings of speech or video contain many _physical_ observations, but may still be sparse in the words or speech sounds of interest
  * Extremely large -- Datasets like the Common Crawl, which are a snapshot of the internet
* May be curated
  * Experiment conducted in the lab
  * Entire set of works by an author over time
* May be highly variable
  * Tweets or Instagram posts vs. diary entries
  * Multiple authors: Wikipedia
  * Varied styles -- mix of text types
  * Across time (Supreme Court rulings corpus)
  * Scanned in from old documents -- optical character recognition (OCR) errors

# Tokenization
## How many words are in these sentences?
### Lumping versus splitting

1. Max took their shoes to their bedroom.
<details>
<summary>
</summary>
<p>
This sentence contains at least 7 words (6 distinct ones, since "their" appears twice). Some algorithms also count punctuation (like ".", but also ")", "[", "?" and so on) as a separate word. This is because punctuation can convey additional meaning in a similar way to words -- by signaling the end of the sentence, or a break in ideas.
</p>
</details>
2. The students in the class liked doing homework in the sun.
<details>
<summary>
</summary>
<p>
  This sentence contains at least 11 words. It has many of the same properties as the previous example.
</p>
</details>
3. He's so excited to attend my sister's wedding.
<details>
<summary>
</summary>
<p>
  This sentence contains two tricks in it. If we just look at all the spaces, we can separate out at least 8 words. 
  
  But, what should we do with "He's" and "sister's"? These are two different uses of "'s" (apostrophe s) in English. 
  One of them is a _contraction_ ("He's" is a shortening of "He is") while the other conveys a special relationship -- "my sister's" indicates something like, "the wedding of my sister." That is, the "'s" has a _grammatical_ role. 
  We might want to separate both of these out -- so there could be 10 words in this.

  Lots of other sequences are like this in English. For example, the "n't" in "didn't". In other languages, it is even more important to separate out words that are written together. For example, in French, "Le pompier qui m'a envoyé une lettre" (Translation: "The firefighter who sent me a letter") we want to separate out the "me" from the verb because they are _different_ types of words.

</p>
</details>

4. My friend had three cats visit xem this weekend.
<details>
<summary>
</summary>
<p>
We know that some systems will say that "sister's" can be broken down into "sister" and "'s". Do we want to make the same decision for "cats"?
</p>
</details>

Ultimately, there is no right or wrong answer. We sometimes make decisions that are based on **linguistic knowledge** ("cats" and "sister's" are not contractions, but "He's" is). On the other hand, in many real-world applications, we might make our decisions based on how well our models perform -- making decisions on the basis of **model performance** happens very often in practice. Which decision you make will depend a bit on properties of the data you are working with.


### The simplest tokenization algorithm for English: Splitting on whitespace

In [1]:
from pprint import pprint
sentences = ['Max took their shoes to their bedroom.',
             'The students in the class liked doing homework in the sun.',
             'He\'s so excited to attend my sister\'s wedding.',
             'My friend had three cats visit xem this weekend.']

for sentence in sentences:
  pprint(sentence.split())

['Max', 'took', 'their', 'shoes', 'to', 'their', 'bedroom.']
['The',
 'students',
 'in',
 'the',
 'class',
 'liked',
 'doing',
 'homework',
 'in',
 'the',
 'sun.']
["He's", 'so', 'excited', 'to', 'attend', 'my', "sister's", 'wedding.']
['My', 'friend', 'had', 'three', 'cats', 'visit', 'xem', 'this', 'weekend.']
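
Whitespace splitting leaves the final period glued to the last word ("bedroom.", "sun."). Below is a minimal sketch of one way to treat punctuation as its own token, using only the standard-library `re` module. The function name `split_words_and_punctuation` and the pattern are illustrative assumptions, not what any particular library does.

```python
import re

def split_words_and_punctuation(sentence):
    # \w+(?:'\w+)* matches a word, optionally with internal apostrophes ("He's").
    # [^\w\s] matches any single character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

print(split_words_and_punctuation("Max took their shoes to their bedroom."))
print(split_words_and_punctuation("He's so excited to attend my sister's wedding."))
```

With this pattern, the first sentence yields 8 tokens (7 words plus the period), while "He's" and "sister's" are still kept whole.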


### Linguistic knowledge and special vocabularies

In light of things like contractions or more complex words, we often want to build language-aware systems. For a language like English, we can mostly use complex rules to identify where words are.

The `word_tokenize` function within `nltk` first splits a text into sentences using an algorithm by [Kiss and Strunk (2006)](https://aclanthology.org/J06-4003/) that learns from statistical regularities where sentence boundaries occur in English. It then applies a set of rules to each sentence to decide what should be written as one token or more than one.

We can therefore often keep abbreviations like "Mr." as single words (less common ones like "Mx." are not always recognized, as we will see below), while splitting "He's" and "sister's" into two tokens each ("He" + "'s", "sister" + "'s"), etc. Under the hood, the sentence-splitting (Punkt) component uses a very specific regular expression to find candidate words (regular expressions are a topic we will not get into here, but they are very handy to know):

```python
    _word_tokenize_fmt = r"""(
        %(MultiChar)s
        |
        (?=%(WordStart)s)\S+?  # Accept word characters until end is found
        (?= # Sequences marking a word's end
            \s|                                 # White-space
            $|                                  # End-of-string
            %(NonWord)s|%(MultiChar)s|          # Punctuation
            ,(?=$|\s|%(NonWord)s|%(MultiChar)s) # Comma if at end of word
        )
        |
        \S
    )"""
```

In [2]:
!pip install nltk



In [4]:
import nltk
# we need to download a model that will do the hard stuff for us
# (note: newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

from nltk import word_tokenize

pprint(word_tokenize("He's so excited to attend my sister's wedding."))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['He',
 "'s",
 'so',
 'excited',
 'to',
 'attend',
 'my',
 'sister',
 "'s",
 'wedding',
 '.']
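
To make the idea of splitting off "'s" and "n't" concrete without a downloaded model, here is a rough sketch using a tiny hand-written rule set. The names `CLITIC_PATTERN`, `split_clitics`, and `tokenize_with_clitics` are hypothetical, the list of endings is deliberately incomplete, and this is not how `nltk` implements its tokenizer.

```python
import re

# Assumption: a small, incomplete list of English endings we want to split off.
CLITIC_PATTERN = re.compile(r"(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def split_clitics(token):
    # Split "He's" into "He" + "'s", but leave a bare "'s" alone.
    match = CLITIC_PATTERN.search(token)
    if match and match.start() > 0:
        return [token[:match.start()], match.group(1)]
    return [token]

def tokenize_with_clitics(sentence):
    tokens = []
    # First pass: words (with apostrophes) or single punctuation marks.
    for piece in re.findall(r"[\w']+|[^\w\s]", sentence):
        tokens.extend(split_clitics(piece))
    return tokens

print(tokenize_with_clitics("He's so excited to attend my sister's wedding."))
print(tokenize_with_clitics("I didn't go"))
```

Note that the rules cannot tell the contraction in "He's" from the possessive in "sister's"; both are split the same way, which is exactly the kind of lumping-versus-splitting decision discussed above. The sketch also only handles straight apostrophes.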


## How many sentences are in this paragraph?

> Over the years a number of machine translation metrics have been developed in order to evaluate the accuracy and quality of machine-generated translations. Metrics such as BLEU and TER have been used for decades. However, with the rapid progress of machine translation systems, the need for better metrics is growing. This paper proposes an extension of the edit distance, which achieves better human correlation, whilst remaining fast, flexible and easy to understand.

## What kinds of rules can we come up with for what ends a sentence?

* Punctuation (., ?, !, ...)
* But not :, ;, or --, which usually occur within a sentence
* Some punctuation signs followed by a capital letter
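
The rules above can be sketched as a single regular expression: split wherever sentence-final punctuation is followed by whitespace and then a capital letter. This is a naive illustration under that assumption (the function name `naive_sentence_split` is made up here), not a robust sentence splitter.

```python
import re

def naive_sentence_split(text):
    # Split at whitespace that follows ., ?, or ! and precedes a capital letter.
    # The lookbehind and lookahead keep the punctuation and the capital letter
    # in the output instead of consuming them.
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)

paragraph = ("Metrics such as BLEU and TER have been used for decades. "
             "However, with the rapid progress of machine translation systems, "
             "the need for better metrics is growing.")
for sentence in naive_sentence_split(paragraph):
    print(sentence)
```

A rule like this would wrongly split after a capitalized title such as "Dr. Jacobs", and it finds no boundaries at all in unpunctuated text.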

## What about these sentences?

1. I um really like listening to this one band but honestly I had to stop listening to them most of the time the vibe is off

2. Look, I just. Don't like it.

* These examples are challenging because they are highly colloquial.
* Speech (and transcripts) are often interrupted
* Informal writing often does not clearly segment the ends of sentences
* Sometimes, punctuation is used for reasons other than ending a sentence -- like emphasis


One tool that can be useful if you want to look at individual sentences is the NLTK tool `sent_tokenize`, which breaks texts into sentences. Let's see how it handles some of our different examples.




In [5]:
from nltk import sent_tokenize

sentences += [("I um really like listening to this one band but "
               "honestly I had to stop listening to them "
               "most of the time the vibe is off"),
               "Look, I just. Don't like it."]

for sentence in sentences:
  print(sent_tokenize(sentence))

['Max took their shoes to their bedroom.']
['The students in the class liked doing homework in the sun.']
["He's so excited to attend my sister's wedding."]
['My friend had three cats visit xem this weekend.']
['I um really like listening to this one band but honestly I had to stop listening to them most of the time the vibe is off']
['Look, I just.', "Don't like it."]


The output of the `sent_tokenize` function is a little strange!

It does well with the simple sentences, but it doesn't recognize where punctuation _would ordinarily have been_ and doesn't recognize when punctuation _does not mark a sentence boundary_.

In [6]:
sent_tokenize("I think dr. jacobs is very cool")

['I think dr. jacobs is very cool']

In [7]:
sent_tokenize("I think mx. jacobs is very cool")

['I think mx.', 'jacobs is very cool']
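
The difference between "dr." and "mx." comes down to whether the tokenizer knows the abbreviation. As a rough illustration of that idea (not of how `sent_tokenize` works internally), the sketch below splits on sentence-final punctuation unless the preceding word is in a small abbreviation list. `ABBREVIATIONS` and `split_sentences` are hypothetical names, and the list is an assumption for the example.

```python
import re

# Assumption: a tiny abbreviation list; a real system needs a much larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "mx"}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    sentences = []
    start = 0
    # Candidate boundaries: ., ?, or ! followed by whitespace.
    for match in re.finditer(r"[.?!]+\s+", text):
        words_so_far = text[start:match.start()].split()
        last_word = words_so_far[-1].lower() if words_so_far else ""
        if text[match.start()] == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

print(split_sentences("I think mx. jacobs is very cool"))
print(split_sentences("I think mx. jacobs is very cool", abbreviations=set()))
```

With "mx" in the list the sentence stays whole; with an empty list it is split in two, mirroring the behavior seen above.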

## Computing basic statistics from a tokenized document

Sometimes, we want to look at the properties of a document. Let's load in the `abstracts.tsv` file again and compute some basic statistics. 

Specifically, here, we will:
1. Split the entire document into word-sized tokens using the `nltk` function `word_tokenize` that we imported earlier.
2. Compute basic count statistics using two approaches:
    * A dictionary that we build ourselves
    * The `collections.Counter` class from the Python standard library

Both of these methods will give us a `dict` object (a dictionary; `Counter` is a subclass of `dict`) whose keys are strings (words) and whose values are numbers (counts of those words).

We want a **dictionary** because we might want to "look up" the values periodically. For example, if I want to know how common a given word is in the `abstracts.tsv` file, I can simply type the word (e.g., "Transformer") and get the associated frequency out. 
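
As a small illustration of that kind of lookup, here is a sketch on a toy sentence rather than the abstracts file (the variable names are made up for the example). It also shows one practical difference: a `Counter` returns 0 for a word it has never seen, while a plain `dict` needs `.get` with a default to avoid a `KeyError`.

```python
from collections import Counter

toy_tokens = "the cat saw the other cat".split()
toy_counts = Counter(toy_tokens)

print(toy_counts["the"])          # 2
print(toy_counts["Transformer"])  # 0, because Counter treats unseen keys as zero

plain_counts = dict(toy_counts)
# A plain dict raises KeyError on unseen keys, so supply a default instead.
print(plain_counts.get("Transformer", 0))
```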

In [8]:
from google.colab import drive
import os

drive.mount("/content/drive", force_remount=True)

## alternately: upload a file from your local machine:
## uncomment the three lines below if uploading abstracts.tsv is easier
# from google.colab import files
# uploaded = files.upload()
# abstracts = uploaded['abstracts.tsv'].decode('utf-8')

location_of_my_abstracts = ('/content/drive/MyDrive/Teaching/'
                            'Fall2021/Computational Linguistics/'
                            'Lectures/supplementary_files')
## your location will probably be:
# location_of_my_abstracts = ('/content/drive/Shared/'
#                             'Computational Linguistics/Lectures/'
#                             'supplementary_files')
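# open the file in a `with` block so that it is closed after reading
with open(os.path.join(location_of_my_abstracts, 'abstracts.tsv'),
          'r', encoding='utf-8') as abstracts_file:
  abstracts = abstracts_file.read()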
abstracts[0:400]

Mounted at /content/drive


'Offensive language detection (OLD) has received increasing attention due to its societal impact. Recent work shows that bidirectional transformer based methods obtain impressive performance on OLD. However, such methods usually rely on large-scale well-labeled OLD datasets for model training. To address the issue of data/label scarcity in OLD, in this paper, we propose a simple yet effective domai'

To build a dictionary ourselves, we basically want to loop through all of the words, adding each word to the dictionary with a count of one if it is not already there and incrementing its count by one if it is. The code for that looks like this:

In [9]:
abstract_counts_dict = {}

abstract_tokenized_into_words = word_tokenize(abstracts)
for word in abstract_tokenized_into_words:
  if word not in abstract_counts_dict:
    abstract_counts_dict[word] = 1
  else:
    abstract_counts_dict[word] += 1

print(abstract_counts_dict)



One of the nice things about the `Counter` object is that it has some additional methods that make inspecting the dictionary easier. If we take the same abstracts file, but run `Counter` over it, we can now _sort_ the words by frequency using the `.most_common()` method.

What words do you think will be the most frequent?

In [10]:
from collections import Counter

abstract_tokenized_into_words = word_tokenize(abstracts)
abstract_counter = Counter(abstract_tokenized_into_words)

pprint(abstract_counter.most_common(5))

[('the', 163172), (',', 160068), ('.', 158239), ('of', 115654), ('and', 101170)]


We can confirm that these two approaches lead to identical counts by sorting our hand-built dictionary by value:

(s/o to the Stack Overflow page: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value)

In [11]:
pprint(sorted(abstract_counts_dict.items(),
              key=lambda item: item[1], reverse=True)[0:5])

[('the', 163172), (',', 160068), ('.', 158239), ('of', 115654), ('and', 101170)]
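
Rather than comparing the printed lists by eye, the same check can be done programmatically. The sketch below uses a toy token list (an assumption for the example, so that it runs without the abstracts file) and a helper named `count_with_dict`, which is a compact variant of the loop above.

```python
from collections import Counter

def count_with_dict(tokens):
    counts = {}
    for token in tokens:
        # .get returns 0 for unseen tokens, so one line covers both branches
        counts[token] = counts.get(token, 0) + 1
    return counts

toy_tokens = "the cat saw the other cat near the door".split()
manual_counts = count_with_dict(toy_tokens)
counter_counts = Counter(toy_tokens)

# Same keys and same values in both structures
print(dict(counter_counts) == manual_counts)

# Same top-2 ranking whether we sort by hand or use most_common
top_manual = sorted(manual_counts.items(),
                    key=lambda item: item[1], reverse=True)[:2]
print(top_manual == counter_counts.most_common(2))
```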


## Sentence formatting can matter

Sometimes a text is short, and we may want to "lump" together different instances of the same word that appear in slightly different forms. For example:

> Doctor Marshall spent three weeks at a ski resort with their best doctor friends.

We might want to count both instances of "doctor" as the same. That is, the word "doctor" appears twice in this text. 

If we want to lump both instances of "doctor" together, then we can edit the string that our tokenization algorithm gets ahead of time. The most common way to do this is to remove the contribution of case, that is, whether a word is written in UPPER or lower case.

To change the case of a string, we can use the `.lower()` or `.upper()` methods. For example:

In [8]:
'Doctor Marshall spent three weeks at a ski resort with their best doctor friends.'.lower()

'doctor marshall spent three weeks at a ski resort with their best doctor friends.'

In [9]:
'Doctor Marshall spent three weeks at a ski resort with their best doctor friends.'.upper()

'DOCTOR MARSHALL SPENT THREE WEEKS AT A SKI RESORT WITH THEIR BEST DOCTOR FRIENDS.'

In [13]:
Counter(('Doctor Marshall spent three weeks at a ski '
         'resort with their best doctor friends.').upper().split(" ")).most_common()

[('DOCTOR', 2),
 ('MARSHALL', 1),
 ('SPENT', 1),
 ('THREE', 1),
 ('WEEKS', 1),
 ('AT', 1),
 ('A', 1),
 ('SKI', 1),
 ('RESORT', 1),
 ('WITH', 1),
 ('THEIR', 1),
 ('BEST', 1),
 ('FRIENDS.', 1)]

In [12]:
Counter(('Doctor Marshall spent three weeks at a ski '
         'resort with their best doctor friends.').lower().split(" ")).most_common()

[('doctor', 2),
 ('marshall', 1),
 ('spent', 1),
 ('three', 1),
 ('weeks', 1),
 ('at', 1),
 ('a', 1),
 ('ski', 1),
 ('resort', 1),
 ('with', 1),
 ('their', 1),
 ('best', 1),
 ('friends.', 1)]
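
To see the effect of lowercasing directly, the sketch below counts the same sentence with and without case folding. It uses a simple `\w+` pattern instead of `.split(" ")` so that the final period does not stick to "friends"; the helper name `count_words` is an assumption for this example.

```python
from collections import Counter
import re

text = ("Doctor Marshall spent three weeks at a ski "
        "resort with their best doctor friends.")

def count_words(text, lowercase):
    if lowercase:
        text = text.lower()
    # \w+ keeps runs of letters and digits and drops punctuation
    return Counter(re.findall(r"\w+", text))

print(count_words(text, lowercase=False)["doctor"])  # 1: "Doctor" is counted separately
print(count_words(text, lowercase=True)["doctor"])   # 2: both instances are lumped
print(count_words(text, lowercase=True)["friends"])  # 1: no trailing "."
```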

### Can you think of a case where we might want to preserve case information in a string?

<details>
<summary>
</summary>
  Case information is useful for telling what kind of word something is. For example, if we are trying to find all the corporations in a document (e.g., as part of a named entity recognition (NER) task), it will matter whether it is spelled "The Dow Chemical Company" versus "The Dow Chemical company." 
  Case information can also tell us about register -- if we see a lowercase i, it might be $i$ written in a mathematics paper, or it could be informal (e.g., a tweet).
</details>

## What other challenges are there for tokenization? Languages other than English

* Writing systems for Chinese languages
  * Tens of thousands of characters
  * Many, many more words -- "word" in Mandarin is often compared to English language compounds (e.g., "houseboat")
  * Words are not separated by spaces
  * As a result, boundaries between words are highly ambiguous
* Japanese
  * Three writing systems (kanji, hiragana, katakana)
  * No spaces are used
  * Three scripts are used for complex linguistic reasons

Tl;dr: Many written languages use scripts that do not easily translate into tokens in the English sense. We will delve more into this during the morphology section of the course 
👀

<!-- * Other scripts: E.g., Hebrew and Arabic
  * These writing systems do not mark vowels
    * Primarily available only for second language learners
  * Vowels can (usually) be inferred from context
  * Challenges for segmentation and tokenization because the vowels can completely change the kind of word (e.g., noun to verb; noun to another kind of noun, etc.) -->

In [14]:
# Google Translate of part of second paragraph of 
# https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97 accessed 9/9/2021: 
# Not only used by China, but for a long period of time,
# it has also served as the only internationally-used script 
# in East Asia. Before the 20th century, it was the written 
# standard script of the Korean Peninsula, Vietnam, Ryukyu, and Japan. 
# In addition to Chinese, the ancient East Asian countries all created Chinese 
# characters on their own to a certain extent. 

chinese_script_from_wikipedia = (
    "不單中國使用，在很長時期內還充當東亞地區唯一的國際通用文字，"
    "在20世紀前都是朝鮮半島、越南、琉球和日本等國家的書面規範文字。"
    "除了漢語之外，古代東亞諸國都有一定程度地自行創製漢字。 ")

pprint(chinese_script_from_wikipedia.split("，"))

['不單中國使用',
 '在很長時期內還充當東亞地區唯一的國際通用文字',
 '在20世紀前都是朝鮮半島、越南、琉球和日本等國家的書面規範文字。除了漢語之外',
 '古代東亞諸國都有一定程度地自行創製漢字。 ']
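
The output above still has a "。" buried inside one of the pieces because we only split on "，". The sketch below, on a shorter sample taken from the same text, shows three things: whitespace splitting does nothing useful here, a character class can split on several punctuation marks at once, and falling back to individual characters is a crude option. Treating each character as a token is an assumption made for illustration; it is not word segmentation.

```python
import re

sample = "除了漢語之外，古代東亞諸國都有一定程度地自行創製漢字。"

# No spaces in the text, so whitespace splitting returns one long "word"
print(len(sample.split()))

# Split on the full-width comma, the full stop, and the enumeration comma
clauses = [clause for clause in re.split(r"[，。、]", sample) if clause]
print(clauses)

# Crude fallback: one token per character (these are not words!)
characters = list(clauses[0])
print(characters)
```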


# Next week: Computing more than count statistics for single tokens

* Multi-word sequences (n-grams)
* Conditional probabilities (transition probabilities)
* Bag-of-words representations (vector representations)
* Smoothing and overcoming sparsity
* Text normalization (e.g., spelling correction, edit distance)
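
As a small preview of the first two items, here is a sketch of bigram counts and transition probabilities on a toy sentence. It estimates the probability of the second word given the first as count(first, second) / count(first as the start of a bigram), with no smoothing. The names used are made up for the example.

```python
from collections import Counter

tokens = "the cat saw the dog and the cat ran".split()

# Pair each token with the token that follows it
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Count how often each word appears as the first element of a bigram
first_word_counts = Counter(first for first, _ in bigrams)

def transition_probability(first, second):
    # Unsmoothed estimate; returns 0.0 if `first` never starts a bigram
    if first_word_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / first_word_counts[first]

print(bigram_counts.most_common(1))
print(transition_probability("the", "cat"))
print(transition_probability("the", "dog"))
```

Here "the" starts three bigrams, two of which continue with "cat", so the estimate for "cat" after "the" is 2/3.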

# Reminder:
## Reaction to the Stochastic Parrots paper due tonight (September 10, 2021) by 11:59pm Eastern on UBLearns