<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/05_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Prerequisites
---

In [None]:
# Get external files
!mkdir -p texts
!wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
!unzip -qq -n -d ./texts/ ./texts/AussieTop100private.zip

# Standard library imports
import glob, string
from pathlib import Path
from collections import Counter

# 3rd party imports
import nltk, nltk.sentiment
from nltk.corpus import wordnet as wn
import pandas as pd

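# Downloads nltk corpora for preprocessing tasks
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
# Newer NLTK releases (3.9+) use these names for the tokenizer and tagger resources
nltk.download("punkt_tab", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)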

# Creates path variables to texts
about_dir = Path.cwd() / "texts" / "About"
pr_dir = Path.cwd() / "texts" / "PR"
dirs_to_load = [about_dir, pr_dir]

#Module 5 - Text Preprocessing
---

There are many decisions to make with regard to text preprocessing. Which ones are most relevant depends on many factors, more than can reasonably be discussed here. Although I don't agree with *all* of its recommendations, [this article](https://journals.sagepub.com/doi/pdf/10.1177/1094428120971683) does a good job of outlining key techniques and considerations for text preprocessing.

In this module, we'll introduce a cornucopia of text preprocessing options and demonstrate how they can be done in Python. The goals for this module are:

* Demonstrate segmentation
* Demonstrate tokenization
* Demonstrate case conversion
* Demonstrate non-word character removal
* Demonstrate token replacement
* Demonstrate stop word removal
* Demonstrate stemming/lemmatization

**Note**: These steps are not presented in a prescribed order; they are ordered for pedagogical convenience.

##5.1. Segmentation
---

The *prerequisites* code for this module downloads a corpus of 'about us' webpages and press releases from Australian family businesses. However, sometimes our level of analysis may be lower than the 'text' level. In these cases we use 'segmentation' to transform the full text into a list of segments.

For this example we will use sentence segmentation. Handily, the NLTK package (Python's jack-of-all-trades-and-master-of-none for natural language processing) has a sentence segmenter built in.

Let's see what it can do:

In [None]:
# Segments a string of sentences into a list of sentences and displays the list.
sentences = "Mr. Davis and Mrs. Garvey were handed the treatment options for their son by Dr. Smith. "\
            "They were pleased to see that it would not be too expensive! "\
            "However, they were still hopeful that there would be outpatient options available."
nltk.tokenize.sent_tokenize(sentences)

Perfect. Note how the segmenter was not fooled by the periods after "Mr", "Mrs", and "Dr", and it knows that exclamation marks also delimit sentences. This is much easier than doing it manually with a `split()` method as we did previously.
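
To see why the manual approach is fragile, here is a minimal sketch that splits a short, made-up example on a period followed by a space. It uses only built-in string methods, and the example text is an assumption for illustration rather than part of the corpus.

```python
# Naive segmentation: split wherever a period is followed by a space.
text = "Mr. Davis met Dr. Smith. They talked!"
naive_segments = text.split(". ")

# The abbreviations "Mr." and "Dr." are wrongly treated as sentence ends,
# so two sentences come back as four fragments.
print(naive_segments)
```

A rule-based split like this would need a growing list of exceptions, which is the work the NLTK segmenter saves us.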

##5.2. Tokenization
---

Tokens in text analysis refer to the discrete units of meaning we want to analyze. In our case, that means individual words. Accordingly, tokenization is going to turn our text into a list of individual words. Let's take a look:

In [None]:
# Breaks a string into a list of tokens and displays the list.
sentences = "This is a test. This will break it into words."
word_tokenize = nltk.tokenize.word_tokenize
word_tokenize(sentences)

Note that the word tokenizer broke the text into words, but not into sentences. If your unit of analysis is the sentence, you'll generally want to segment before tokenizing.
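
For comparison, the sketch below contrasts plain whitespace splitting with a rough regular-expression tokenizer. The regular expression is only a stand-in chosen for illustration; it is not the algorithm NLTK uses, and real tokenizers handle many more cases (contractions, URLs, and so on).

```python
import re

text = "This is a test. This will break it into words."

# Whitespace splitting leaves punctuation glued to the neighboring word.
whitespace_tokens = text.split()

# Rough stand-in for a word tokenizer (NOT NLTK's algorithm):
# runs of word characters, or single non-space punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "test." and "test" would be counted as different tokens, which is usually not what we want.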

##5.3. Case Conversion
---


As we saw in a previous module, Python treats words with different casing as being different, even if they are the same word:

In [None]:
# Demonstrates the inequality of differently cased strings
"This" == "this"

For many text analyses, we do not care about the casing of the words. As a result it is common to make everything lowercase to facilitate comparisons/grouping.

Consider the following:

In [None]:
# Creates and displays Counter objects for lists of words with and without case conversion
list_of_words = ["Innovative", "innovative", "INNOVATIVE"]
word_counter = Counter(list_of_words)
print(f"Before case conversion, the word counts are: {word_counter}")

list_of_words = [word.lower() for word in list_of_words]
word_counter = Counter(list_of_words)
print(f"After case conversion, the word counts are: {word_counter}")

You can see that converting the words to lowercase enables us to count the number of times that the word 'innovative' appears in the text independent of its case.
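
Case conversion is not free of side effects, though. The sketch below uses a made-up token list (an assumption for illustration) to show that lowercasing also merges acronyms such as "IT" and "US" with the ordinary words "it" and "us", which may or may not matter for a given analysis.

```python
from collections import Counter

# Hypothetical tokens mixing acronyms with ordinary words
tokens = ["IT", "it", "It", "US", "us"]
print(f"Before case conversion: {Counter(tokens)}")

# After lowercasing, the acronyms can no longer be told apart from the pronouns
lowered = Counter(token.lower() for token in tokens)
print(f"After case conversion: {lowered}")
```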

##5.4. Non-Word Character Removal
---


Just as we're often not concerned with the casing of words, we're often not concerned with non-word characters such as numerals, punctuation, typesetting marks, and empty spaces. Depending on the task at hand, you may want to remove or replace these characters.

There are a number of different ways to do this. Here's one way:

In [None]:
# Displays the tokenized sentence with and without non-word character removal
list_of_words = ["this", "sentence", "has", "numerals", ",", "such", "as", "0123", ",", "and", "punctuation", "!"]
print(f"Before non-word character removal: {list_of_words}")

table = str.maketrans('', '', string.punctuation)
list_after_nwc_removal = [word.translate(table) for word in list_of_words if word.translate(table) and not word.isdigit()]
print(f"After non-word character removal: {list_after_nwc_removal}")

Sometimes flawed optical character recognition in PDFs or encoding errors may result in non-word characters beyond punctuation, numerals, etc. These are important to remove as well and can be handled similarly. The difference is that the `string.punctuation` variable (which covers only ASCII punctuation) will usually not contain these characters, so you'll have to extend the set of characters being removed to capture them as well.
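
As a minimal sketch of that tweak, the translation table can be built from `string.punctuation` plus whatever extra characters turn up in your texts. The `extra_chars` string and the example tokens below are assumptions for illustration; inspect your own corpus to decide what belongs there.

```python
import string

# Hypothetical extras: curly quotes, a bullet, and the Unicode replacement character
extra_chars = "\u201c\u201d\u2018\u2019\u2022\ufffd"
table = str.maketrans("", "", string.punctuation + extra_chars)

# Made-up tokens showing typical OCR/encoding debris
tokens = ["\u201cgrowth\u201d", "\u2022", "family\ufffdowned", "2021"]

# Same filtering pattern as before: drop tokens that become empty or are numerals
cleaned = [token.translate(table) for token in tokens
           if token.translate(table) and not token.isdigit()]
print(cleaned)
```

Note that deleting a stray character can fuse two words together (as in "familyowned" here), so replacing it with a space may sometimes be the better choice.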

##5.5. Token Replacement
---

For certain tasks you may determine that some tokens need to be replaced with others. For instance:

In [None]:
# Displays tokenization of contractions
nltk.tokenize.word_tokenize("Aaron's out today.")

We know that "'s" stands for "is" here, but we may want the text we analyze to contain "is" rather than the contraction. Similarly, in some cases you may want to:
* replace "not happy" with "happy_NEG" to indicate that the word was negated.
* replace abbreviations with their expanded form.
* replace emoticons with a textual representation of the emoticon

These are cases when you would use token replacement. Consider the following:

In [None]:
# Displays token replacement in different stages
sentence = "Aaron's at the IT lab today and isn't expected to return until tomorrow"
list_of_words = nltk.tokenize.word_tokenize(sentence)
print(f"Before processing: {list_of_words}")

translation_dict = {"'s": "is",
                    "n't": "not",
                    "IT": "Information Technology"}

list_of_words = [word if word not in translation_dict else translation_dict[word] for word in list_of_words]
print(f"\nAfter first-round processing (abbreviation/contraction expansion): {list_of_words}")

list_of_words = nltk.sentiment.util.mark_negation(list_of_words)
print(f"\nAfter second-round processing (negation marking): {list_of_words}")

However, we have to be careful with token replacement lest we introduce artifacts in our text:

In [None]:
# Displays token replacement for text that becomes invalid with the replacement operation.
sentences = "Let's meet at the lab tomorrow for brainstorming. IT IS IMPORTANT TO ARRIVE EARLY! "\
            "It's not a bad idea to keep the creative juices flowing"
list_of_words = nltk.tokenize.word_tokenize(sentences)
print(f"Before processing: {list_of_words}")

translation_dict = {"'s": "is",
                    "n't": "not",
                    "IT": "Information Technology"}

list_of_words = [word if word not in translation_dict else translation_dict[word] for word in list_of_words]
print(f"\nAfter first-round processing (abbreviation/contraction expansion): {list_of_words}")

list_of_words = nltk.sentiment.util.mark_negation(list_of_words)
print(f"\nAfter second-round processing (negation marking): {list_of_words}")

Here we can see that nuances in natural language led our rather brutish token replacement operation into creating problems.

*   "Let's" was replaced with "Let" "is"
*   "IT" (not referring to information technology) was replaced with "Information Technology"
*   Words like 'creative' were negated because they followed 'not'; however, the negation clause shouldn't have negated those words.

If you are going to use token replacement, do so with the utmost care and review the results to ensure that the desired output was obtained.
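
One way to make that review easier is to record which replacements actually fired. The sketch below is a minimal illustration under assumptions: the `replacements` table, the example tokens, and the "happy_face" label are all hypothetical. It also maps each key to a list of tokens so that a multi-word expansion stays properly tokenized instead of becoming a single token containing a space.

```python
from collections import Counter

# Hypothetical replacement table: each key maps to a LIST of tokens so that
# multi-word expansions remain separate tokens.
replacements = {
    "n't": ["not"],
    "IT": ["Information", "Technology"],
    ":)": ["happy_face"],
}

tokens = ["The", "IT", "team", "did", "n't", "mind", ":)"]

applied = Counter()  # how often each replacement fired, for manual review
expanded = []
for token in tokens:
    if token in replacements:
        applied[token] += 1
    expanded.extend(replacements.get(token, [token]))

print(expanded)
print(applied)
```

Unexpectedly high counts in `applied` (for example, for "IT" in a corpus with text in all capitals) are a cue to inspect the affected sentences by hand.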



##5.6. Stop Word Removal
---

Some words are used so frequently in natural language as to be uninformative for some text analyses. These are commonly called 'stop words'. Let's take a look at some common stop words in English:

In [None]:
# Displays the NLTK stop words list
stop_words = nltk.corpus.stopwords.words("english")
print(stop_words)

For many analyses, these words do not tell us much regarding what is being talked about. Accordingly, in text analysis these stop words are sometimes removed as part of preprocessing.

Let's take a look at an example:

In [None]:
# Displays a tokenized sentence before and after stop words are removed
sentence = "I tend to agree with you that the lack of a liquid secondary market is a critical lynchpin "\
           "missing from equity crowdfunding taking off."
sentence = nltk.word_tokenize(sentence.lower())
print(f"Before stop word removal: {sentence}")

sentence = [word for word in sentence if word not in stop_words]
print(f"After stop word removal: {sentence}")

We can still largely understand the meaning of the sentence without the stop words being present, which is a key idea behind stop word removal. However, [psychology research](https://www.secretlifeofpronouns.com/) also indicates that some of these words contain more meaning than we might imagine, suggesting that stop word removal should be done with care and only when doing so improves the validity/reliability of the analysis.
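
If some of these words matter for your research question, one option is to trim the stop word list before applying it. The sketch below is an assumption-laden illustration: it uses a small hand-typed subset of stop words rather than the full NLTK list, and the choice of which words to keep is hypothetical.

```python
# Small illustrative subset of stop words (NOT the full NLTK list)
stop_words = {"i", "to", "with", "you", "that", "the", "of", "a", "is", "from", "not", "no"}

# Hypothetical words we decide to keep: pronouns and negations
words_to_keep = {"i", "you", "not", "no"}
custom_stop_words = stop_words - words_to_keep

tokens = "i do not agree with you that the market is liquid".split()
filtered = [token for token in tokens if token not in custom_stop_words]
print(filtered)
```

Storing the stop words in a set also makes each membership test fast, which helps on larger corpora.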

##5.7. Stemming/Lemmatization
---

Stemming and lemmatization are two approaches to accomplishing the same thing: reducing the number of distinct word forms in our corpus by collapsing related words onto a common base form. That sounds complicated, so let's simplify with an example:

Do we lose much meaning by replacing "*The innovative innovator innovatively created innovations in the innovation labs*" with "*The innovate innovate innovate create innovate in the innovate lab*"?

Technically, the answer is 'yes'. However, if we're just trying to identify whether the sentence talks about innovation, both sentences convey about the same thing, and the latter is much simpler computationally because it contains only six distinct words (aka 'types') instead of ten.
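
The type counts can be checked with a few lines of Python. This is a sketch only: the `base_form` lookup is hand-written for this one sentence and stands in for what a stemmer or lemmatizer would do automatically.

```python
original = "The innovative innovator innovatively created innovations in the innovation labs".split()

# Hand-written lookup for this sentence only (an illustrative assumption)
base_form = {
    "innovative": "innovate", "innovator": "innovate", "innovatively": "innovate",
    "innovations": "innovate", "innovation": "innovate",
    "created": "create", "labs": "lab",
}
reduced = [base_form.get(word, word) for word in original]

# Tokens are all the words; types are the distinct words
print(f"Original: {len(original)} tokens, {len(set(original))} types")
print(f"Reduced:  {len(reduced)} tokens, {len(set(reduced))} types")
```

The counts are case-sensitive, so "The" and "the" are treated as separate types here.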

Let's look at an example of stemming:

In [None]:
# Displays a sentence before and after using the Porter stemmer
sentence = "I am enjoying my frequent vacations to beautiful San Diego"
sentence = nltk.word_tokenize(sentence.lower())
print(f"Before stemming: {sentence}")

stemmer = nltk.stem.porter.PorterStemmer()
sentence = [stemmer.stem(word) for word in sentence]
print(f"After stemming: {sentence}")

That works, but we're left with snippets of words rather than what we'd think of as the root of the word itself. That's characteristic of stemming.

Let's look at lemmatization:

In [None]:
# Displays a sentence before and after using the WordNet Lemmatizer
sentence = nltk.word_tokenize("I am enjoying my frequent vacations to beautiful San Diego".lower())
print(f"Before lemmatization: {sentence}")

pos_map = {'J': wn.ADJ, 'N': wn.NOUN, 'R': wn.ADV, 'V': wn.VERB}
lemmatizer = nltk.stem.WordNetLemmatizer()
sentence = [lemmatizer.lemmatize(a, pos_map.get(b[0], wn.NOUN)) for a, b in nltk.pos_tag(sentence)]

print(f"After lemmatization: {sentence}")

It's far from perfect. WordNet is built on a manually developed and maintained dataset, so not all words are in there and it is difficult to keep up-to-date with the constant shifts in natural language use. However, when it does work, you're left with a real word (e.g., "vacation") rather than a truncated word (e.g., "vacat").
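
The list comprehension in the lemmatization code is dense, so here is a sketch of just the part-of-speech mapping step. The tagged pairs are hand-written stand-ins for `nltk.pos_tag` output (an assumption, so the example runs without NLTK), and the single-letter codes mirror the values of `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB`.

```python
# Single-letter WordNet part-of-speech codes, keyed by the first letter
# of the Penn Treebank tag (J = adjective, N = noun, R = adverb, V = verb).
pos_map = {"J": "a", "N": "n", "R": "r", "V": "v"}

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output
tagged = [("enjoying", "VBG"), ("frequent", "JJ"), ("vacations", "NNS"), ("my", "PRP$")]

# Tags with no WordNet equivalent (such as the pronoun tag PRP$) fall back to noun
wordnet_pos = [(token, pos_map.get(tag[0], "n")) for token, tag in tagged]
print(wordnet_pos)
```

Passing the right part of speech matters because the lemmatizer treats a word as a noun unless told otherwise, in which case a verb form like "enjoying" would be left unchanged.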

##5.8. Preprocessing our Corpus
---

Now that we've seen a litany of preprocessing activities, let's put them to work on our corpus of About Us webpages and press releases:

In [None]:
# Load texts
texts = [] 
for directory in dirs_to_load:
  for file in glob.glob(f"{directory}/*.txt"):
    with open(file, 'r') as infile: 
      # Use pathlib rather than splitting on "/" so this also works with Windows-style paths
      text_type = Path(file).parent.name
      text_id = Path(file).name
      texts.append({'text_type': text_type, 'text_id': text_id, 'text': infile.read()})

# Text Preprocessing Pipeline
for id, article in enumerate(texts):
  if id == 0:
    print("---Original--")
    print(article['text'][0:367])

  # Segmentation
  article['text'] = nltk.tokenize.sent_tokenize(article['text'])
  if id == 0:
    print("\n---After segmentation--")
    print(article['text'][0:2])  

  # Tokenization
  article['text'] = [nltk.tokenize.word_tokenize(sentence) for sentence in article['text']]
  if id == 0:
    print("\n---After tokenization--")
    print(article['text'][0:2])  

  # Case conversion
  for sent_id, sentence in enumerate(article['text']):
    article['text'][sent_id] = [word.lower() for word in sentence]
  if id == 0:
    print("\n---After case conversion--")
    print(article['text'][0:2])  

  # Non-word character removal
  table = str.maketrans('', '', string.punctuation)
  for sent_id, sentence in enumerate(article['text']):
    article['text'][sent_id] = [word.translate(table) for word in sentence if word.translate(table) and not word.isdigit()]
  if id == 0:
    print("\n---After non-word character removal--")
    print(article['text'][0:2])

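  # Token replacement
  # Note: after the case conversion and punctuation removal above, tokens such as
  # "'s", "n't", and "IT" have become "s", "nt", and "it", so these keys only
  # match if this step is moved earlier in the pipeline.
  translation_dict = {"'s": "is",
                      "n't": "not",
                      "IT": "Information Technology"}
  for sent_id, sentence in enumerate(article['text']):
    replaced = [word if word not in translation_dict else translation_dict[word] for word in sentence]
    # Mark negation on the replaced tokens (not the original sentence) so the replacements are kept
    article['text'][sent_id] = nltk.sentiment.util.mark_negation(replaced)
  if id == 0:
    print("\n---After token replacement--")
    print(article['text'][0:2])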

  # Stop word removal
  stop_words = nltk.corpus.stopwords.words("english")
  for sent_id, sentence in enumerate(article['text']):
    article['text'][sent_id] = [word for word in sentence if word not in stop_words]
  if id == 0:
    print("\n---After stop word removal--")
    print(article['text'][0:2])

  # Lemmatization
  pos_map = {'J': wn.ADJ, 'N': wn.NOUN, 'R': wn.ADV, 'V': wn.VERB}
  lemmatizer = nltk.stem.WordNetLemmatizer()
  for sent_id, sentence in enumerate(article['text']):
    article['text'][sent_id] = [lemmatizer.lemmatize(a, pos_map.get(b[0], wn.NOUN)) for a, b in nltk.pos_tag(sentence)]
  if id == 0:
    print("\n---After lemmatization--")
    print(article['text'][0:2])

corpus_df = pd.DataFrame(texts)  

The text preprocessing is complete - let's see our dataframe with the final preprocessed texts:

In [None]:
# Display the first five rows of the dataframe
display(corpus_df.head(5))
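
Each row's `text` value is now a list of sentences, each of which is a list of tokens. As a sketch of how that nested structure might be used downstream, the example below flattens one document and counts its tokens. The `document` value is a made-up stand-in (an assumption), not a row from the actual corpus.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for one row's 'text' value: a list of sentences,
# each of which is a list of preprocessed tokens.
document = [["family", "business", "found"],
            ["business", "grow", "family", "business"]]

# Flatten the sentences into a single list of tokens, then count them
tokens = list(chain.from_iterable(document))
token_counts = Counter(tokens)
print(token_counts.most_common(2))
```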