<center><b>DIGHUM101</b></center>
<center>2-1: Text Preprocessing</center>

---

# Fast review

1. What is a data frame? 
2. What methods and other syntax can be used to subset rows and columns?

# Learning objectives

1. Download a .txt file from Project Gutenberg and import it into Python
2. Quick walkthrough of Hathi Trust Research Center (HTRC) resources
3. Learn the basics of text preprocessing: 
    - Tokenization
    - Punctuation removal
    - Count words, unique words, and word frequencies
    - Stop word removal
    - Stemming/lemmatization
    - Part of speech tagging
    - Quick introduction to n-grams, skip-grams, and BERT

In [None]:
# Import the string of ASCII punctuation characters from the string module
from string import punctuation
print(punctuation)
print(len(punctuation))

In [None]:
# Module to count word frequencies
from collections import Counter

In [None]:
# Module to help us remove stopwords
import nltk
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
from nltk.corpus import stopwords

In [None]:
# Install spaCy and download a trained model (uncomment the lines below to run them).
# Install spaCy
# !pip install spacy

# Download a trained English model (small)
# !python -m spacy download en_core_web_sm 

# Download the large model as well
# !python -m spacy download en_core_web_lg

In [None]:
import spacy

# Project Gutenberg

[Project Gutenberg](https://www.gutenberg.org/) has more than 60,000 texts for you to download. Be sure to check out their [Terms of Use](https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use). You can find many .txt files here that are in the public domain. 

In [None]:
# Try it! Search for a book, download it, copy it to your working directory, and import it.

## YOUR CODE HERE
import os
os.getcwd()

In [None]:
os.chdir("../../Data/")
%ls

In [None]:
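## HOW TO IMPORT dracula.txt?
# Use a context manager so the file is closed automatically
# (assumes the file was saved as UTF-8 plain text)
with open("dracula.txt", encoding="utf-8") as f:
    dracula = f.read()
# Print everything after the first 501 characters
print(dracula[501:])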

# The Hathi Trust Research Center (HTRC)

Check out the [HTRC](https://www.hathitrust.org/) and learn about their many [collections tools](https://www.hathitrust.org/htrc_collections_tools) and the [Python library](https://github.com/htrc/htrc-feature-reader) to connect to the API. The [Analytics](https://analytics.hathitrust.org/) website gives you access to many canned features if you don't want to mess with the Python code. 

# Text Preprocessing: Strings in depth

Text preprocessing is an essential first step to coding and understanding machine learning algorithms. For machine learning portions of this course, we will focus on [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) models, namely [document-term](https://en.wikipedia.org/wiki/Document-term_matrix) and [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) matrices from the [sklearn library](https://scikit-learn.org/stable/).
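
To make the bag of words idea concrete, here is a minimal sketch that builds a tiny document-term matrix by hand using only the standard library. The two documents are made up for illustration; in practice sklearn's `CountVectorizer` handles this step, and this sketch does not claim to reproduce its exact behavior.

```python
from collections import Counter

# Two tiny, made-up documents (hypothetical examples, not course data)
docs = ["the cat sat on the mat", "the dog sat"]

# Bag of words: count each token per document, ignoring word order
counts = [Counter(doc.split()) for doc in docs]

# Shared vocabulary, sorted so the column order is reproducible
vocab = sorted(set(word for c in counts for word in c))

# Document-term matrix: one row per document, one column per vocabulary word
# (a Counter returns 0 for words it has not seen)
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column is a word, so word order is lost and only the counts remain.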

Text preprocessing/pattern matching can be further enhanced through use of [regular expressions](https://docs.python.org/2/library/re.html).
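
As a small illustration of what regular expressions can add, the sketch below pulls word tokens out of one line of the poem with `re.findall`. The pattern `\w+` is just one possible choice, and it treats the hyphenated "half-dark" as two tokens.

```python
import re

line = "I have this half-dark, and the toil of verse."

# \w+ matches runs of letters, digits, and underscores, so punctuation is
# dropped and the hyphenated "half-dark" is split into two tokens
words = re.findall(r"\w+", line.lower())
print(words)
```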

![borges](../../Img/borges_1921.png)

In [None]:
borges = '''In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,

and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse.'''


print(type(borges))
print()
print(borges)

In [None]:
# What do the triple quotes do in the assignment of borges above?

# Also, make a copy to preserve the original borges variable
poem = borges

# Tokenization

Tokenization is the process of splitting text into _something_, often words. Each occurrence of a word is called a "token", and a word such as "the" may show up as several different tokens (for example "The", "the", and "the,") depending on its capitalization, punctuation, etc.

The `.split()` method allows us to split the text on a separator. With no argument, it splits on any run of whitespace (spaces, tabs, and newlines).
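
The difference between calling `.split()` with and without a separator is easy to miss, so here is a short sketch on a made-up string that contains a double space and a newline.

```python
text = "a luminous  mist\nsurrounds me"  # note the double space and the newline

# No argument: split on any run of whitespace and drop empty strings
default_tokens = text.split()
print(default_tokens)

# Explicit separator: split on every single space, keeping empty strings
space_tokens = text.split(" ")
print(space_tokens)
```

With an explicit `" "` separator, the double space produces an empty string and the newline stays inside a token, which is usually not what we want for word counts.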

In [None]:
# Split the string into a list of strings (single words)
print(poem.split())

# Count words

Jump in! 

In [None]:
# How many characters in poem?
len(poem)

In [None]:
# How many words?
len(poem.split())

In [None]:
# How many lines?
len(poem.split("\n"))

In [None]:
# How many periods? 
# Should this be equal to the number of sentences in the cell below?
poem.count(".")

In [None]:
poem.split(".")

In [None]:
# ... but how many sentences? Why is this different from the number of periods?
len(poem.split("."))

In [None]:
# How many stanzas?
len(poem.split("\n\n"))

In [None]:
# At which index does the word "me" first appear?
# .find() is "forward search"
poem.find("me")

In [None]:
# .index works as well
poem.index("me")

In [None]:
# Note that .find does not throw an error when an element is not found (but .index does)
poem.find("kangaroo")

In [None]:
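# At which index does "me" last appear?
# .rfind() starts at the highest index and works in reverse
# Note: .find() and .rfind() match substrings, so the result can land inside a longer word (here, "volumes")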
poem.rfind("me")

# Count _unique_ words

In [None]:
# How many unique words?
# "Casting" our list into a set
len(set(poem.split()))

In [None]:
# Why are there two fewer unique words when we convert all the text to lowercase?
len(set(poem.lower().split()))

In [None]:
# Print the unique words
print(set(poem.lower().split()))

In [None]:
# What type of data structure is this? 
type(set(poem.lower().split()))

In [None]:
# Why is this count different from the lowercased count above?
len(set(poem.split()))

# Punctuation removal 

Remember how we imported that nice string of English punctuation in the first cell of this notebook? We could manually remove all of the punctuation using the `.replace()` method, but this would get old fast!

In [None]:
# How many characters
len(punctuation)

In [None]:
# Replace periods with nothing
del_periods = poem.replace(".", "")
del_periods

But what if you have tons of text and don't know exactly what punctuation is present? A quick for loop can help us remove all of the punctuation (i.e., `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``) from dirty text.

> NOTE: You will learn more about custom functions, for loops, list comprehensions, and lambda functions starting in week 3. The reason we are glossing over them now is so that you focus on **what is possible** for planning your individual projects instead of getting lost and frustrated in the nuances of the Python code. 
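
As an alternative to calling `.replace()` once per character, the sketch below removes every character in `punctuation` in a single pass using `str.maketrans` and `.translate`. It works on one sample line so that the `poem` variable is left untouched.

```python
from string import punctuation

line = "colorless, formless. Almost into a thought."

# str.maketrans("", "", chars) builds a table that deletes every character in `chars`
table = str.maketrans("", "", punctuation)

# Lowercase first, then delete the punctuation in one pass
clean = line.lower().translate(table)
print(clean)
```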

In [None]:
# For loop: lowercase once, then strip each punctuation character in turn
poem = poem.lower()
for char in punctuation:
    poem = poem.replace(char, "")

In [None]:
# Punctuation is gone! 
print(poem)

# Count word frequencies

In [None]:
# Tokenize poem into single words
tokens = poem.split()
print(tokens)

In [None]:
# Show the ten most common words (stopwords included)
freq = Counter(tokens)
freq.most_common(10)

# Stop word removal

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are the most common words in a language, and may or may not add information about the content of a text.
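
To see why stop words matter for word frequencies, here is a minimal sketch that filters one line of the poem against a tiny, hand-written stop word set. The set `tiny_stops` is hypothetical and only for illustration; NLTK's English list is much longer.

```python
from collections import Counter

# Hypothetical mini stop word list, for illustration only
tiny_stops = {"the", "of", "and", "a", "to", "in"}

sample = "the tiny soaring birds the moons of gold".split()

# Keep only the words that are not in the stop word set
kept = [word for word in sample if word not in tiny_stops]
print(kept)

# Before filtering, the most frequent token is a stop word
top_before = Counter(sample).most_common(1)
print(top_before)
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters on longer texts.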

In [None]:
stop = stopwords.words("english")
print(stop)

In [None]:
# List comprehension (we also saw one for converting to datetime in "2-1_pandas.ipynb")
# Reuse the `stop` list from above instead of reloading the stop words for every token
no_stops = [word for word in tokens if word not in stop]
print(no_stops)

In [None]:
# This is the same as the following:
no_stops = []
for word in tokens:
    if word not in stop:
        no_stops.append(word)
print(no_stops)

In [None]:
freq2 = Counter(no_stops)
freq2.most_common(10)

# Stemming and Lemmatization

One common problem when standardizing text is reducing the different forms of a word to a common root or stem. This is useful because it lets us analyze word meaning without having to pore over the separate inflectional forms of a word. It also speeds up processing, since there are fewer unique words.

[What's the difference between stemming and lemmatizing?](https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming)

"**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

If confronted with the token "saw", **stemming** might return just "s", whereas **lemmatization** would attempt to return either "see" or "saw" depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma." See [this post](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) for more information.
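
The contrast in the quote above can be sketched with two toy functions. Neither is a real algorithm: `crude_stem` is a made-up suffix chopper (not the Porter stemmer), and `toy_lemmas` is a hypothetical hand-written dictionary standing in for the vocabulary a real lemmatizer would use.

```python
# Toy suffix-stripping "stemmer": a crude heuristic, not the Porter algorithm
def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical, hand-written entries)
toy_lemmas = {"saw": "see", "mice": "mouse", "birds": "bird", "longed": "long"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary
    return toy_lemmas.get(word, word)

for word in ["birds", "longed", "teeming", "saw", "mice"]:
    print(word, crude_stem(word), toy_lemmatize(word))
```

The stemmer handles regular endings but leaves "saw" and "mice" alone, while the dictionary lookup handles those irregular forms but knows nothing about "teeming".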

So what do we do? Use pretrained models from [spaCy](https://spacy.io/)! It does all the tokenization and lemmatization for you.

[Read more about spaCy's pretrained models](https://spacy.io/models)

# Lemmatizing with spaCy

Let's start by loading the trained model. We don't need the Named Entity Recognition (NER) or the text classification capabilities of the model, so we don't load them to make everything faster. If we wanted to lemmatize a language other than English, we would just need to download a trained model for that language using spaCy and change the 'en_core_web_sm' to the name of that model. Everything else would be the same.

In [None]:
# Load the small pretrained model
nlp = spacy.load("en_core_web_sm", disable=["ner", "textcat"])
print(type(nlp))

In [None]:
# spaCy expects a string as input, so let's use .join to force our list into a string  
words = ' '.join(tokens)
words

In [None]:
doc = nlp(words)
doc

In [None]:
# We can now get the lemma of any word using the .lemma_ attribute
doc[5].lemma_

In [None]:
# ...Or even the part of speech
doc[5].pos_

In [None]:
# Note that spaCy also has its own stopwords list
nlp.Defaults.stop_words

Check out the [spaCy documentation](https://spacy.io/api/token#attributes) for more information about all the linguistic features that spaCy allows you to access as attributes.

Now, let's create a function that takes in a list of tokens and lemmatizes it using spaCy.

In [None]:
# Define our function
def lemmatize(tokens):
    """Return the lemmas for each word in `tokens`."""
    
    # spacy models operate on strings, not lists, so we turn the tokens back into
    # a string of words
    words = ' '.join(tokens)
    
    # this line does all sorts of processing, including the lemmatization.
    # `doc` will be like a list of tokens that we can iterate over
    doc = nlp(words)
    
    # each token in `doc` holds information about that token. The `lemma_`
    # attribute holds the lemma of that token represented as a string. For
    # performance reasons, the `lemma` attribute (without the trailing underscore)
    # holds an integer ID (hash) for that lemma, which we'll rarely ever need.
    return [token.lemma_ for token in doc]

In [None]:
lemmas = lemmatize(tokens)
print(lemmas)

# Notice that older versions of spaCy (v2) lemmatize pronouns (e.g. "you", "I", "your")
# in a funny way: they return the placeholder "-PRON-" rather than something like
# "your" -> "you". Newer versions may behave differently, so check your output.


# N-grams, skip-grams, and BERT?

Are you interested in tokenizing more than just single words for the purpose of increasing "context"? An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a "contiguous sequence of n items from a given sample of text or speech."

[Check out this clever solution for n-gramizing text](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
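
One common idiom for building n-grams from a list of tokens uses `zip` over staggered copies of the list. The helper name `ngrams` below is our own choice for this sketch, not a function from a library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the moons of gold".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

Because `zip` stops at the shortest input, asking for an n larger than the number of tokens simply returns an empty list.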

Do you want even more context? We will learn more about [skip-grams](https://en.wikipedia.org/wiki/Word2vec) and [BERT](https://www.searchenginejournal.com/google-bert-update/332161/#close) later in this course. 
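
As a preview, the sketch below builds the (target, context) word pairs that a skip-gram model is trained on. It only generates pairs and does not train anything. The function name `skipgram_pairs` and the `window` parameter are assumptions made for this illustration, not the interface of any particular library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context, distance) triples within `window` words."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the context window to the bounds of the token list
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j], abs(i - j)))
    return pairs

words = "the moons of gold".split()
pairs = skipgram_pairs(words, window=2)
for pair in pairs:
    print(pair)
```

The distance stored in each triple is where a model could favor nearby context words over more distant ones.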


"The skip-gram architecture weighs nearby context words more heavily than more distant context words."

"The BERT algorithm (Bidirectional Encoder Representations from Transformers) is a deep learning algorithm related to natural language processing. It helps a machine to understand what words in a sentence mean, but with all the nuances of context."