<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/09_parts_of_speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Prerequisites
---


In [None]:
# Load 3rd-party packages
!pip install --quiet stanza

# Import Standard Library packages
import glob
from collections import Counter
from IPython.display import display
from pathlib import Path

# Import 3rd-party packages
import nltk, stanza
import numpy as np
import pandas as pd

In [None]:
# Data Prerequisites
!mkdir -p texts

# Spin up Stanza
stanza.download("en")
nlp = stanza.Pipeline('en', processors='tokenize,pos,ner', tokenize_no_ssplit=True)

# Get text files
!wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
!unzip -qq -n -d ./texts/ ./texts/AussieTop100private.zip

about_dir = Path.cwd() / "texts" / "About"
pr_dir = Path.cwd() / "texts" / "PR"
dirs_to_load = [about_dir, pr_dir]

# Text preprocessing prerequisites
texts = [] 
for directory in dirs_to_load:
  for file in glob.glob(f"{directory}/*.txt"):
    with open(file, 'r') as infile: 
      text_type = Path(file).parent.name  # Folder name ("About" or "PR")
      text_id = Path(file).name           # File name, independent of the OS path separator
      fulltext = infile.read().replace("\n", " ")
      texts.append({'text_type': text_type, 'text_id': text_id, 'text': fulltext})
text_df = pd.DataFrame(texts)

#Module 9 - Parts of Speech
---

Thus far in the series we haven't much cared about the linguistic characteristics of the words we've analyzed. Our analyses have been agnostic to whether we're looking at a noun, verb, adjective, or otherwise. For many purposes, that's OK; in some analyses, the part of speech simply isn't important.

However, in some cases we would like to know what actions are being taken by individuals or organizations, requiring an understanding of verb phrases. Or we might be interested in how organizations view their salient stakeholders, requiring an understanding of proper nouns. In this module, we will demonstrate tools to help us tag parts of speech in our texts.

The goals for this module are to:
* Tag the parts of speech for individual words
* Identify phrases within tagged text
* Identify named entities in text

##9.1. Part-of-Speech Tagging
---


If you have been taking a peek at the *prerequisites* text in the past few modules, you've probably seen that I have been doing a little bit of part-of-speech tagging behind the scenes without telling you. Specifically, when I have been preprocessing the texts for use in the module, I've used this line of code:

`tokens = [word.text.lower() for sentence in nlp(fulltext).sentences for word in sentence.words if word.upos not in ["PUNCT", "SYM", "NUM", 'X']]`

This line of code takes a text and does the following things:
* Segments it into sentences
* Tokenizes the sentences
* Removes punctuation, symbols, numerals, and other non-text characters
* Changes the casing to lowercase
* Saves the tokens to a list called 'tokens'

But how does it know which tokens aren't words? Part-of-speech tagging. Each token is assigned a tag indicating its part of speech, including tokens that aren't words. There are many tools for part-of-speech (POS) tagging; in this module, we're going to use Stanza, a Python NLP package associated with the Stanford NLP group.
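To make the filtering step concrete, here is a minimal sketch that runs without a tagger. The tokens and their tags are assigned by hand for illustration (they are not real Stanza output), and the name `NON_WORD_TAGS` is my own.

```python
# Hand-assigned (token, UPOS tag) pairs standing in for real tagger output
tagged_tokens = [
    ("Revenue", "NOUN"), ("grew", "VERB"), ("12", "NUM"), ("%", "SYM"),
    ("in", "ADP"), ("2020", "NUM"), (".", "PUNCT"),
]

# Tags treated as "not a word", matching the list in the line of code above
NON_WORD_TAGS = {"PUNCT", "SYM", "NUM", "X"}

# Keep the lowercased text of every token whose tag is not excluded
tokens = [text.lower() for text, upos in tagged_tokens if upos not in NON_WORD_TAGS]
print(tokens)  # ['revenue', 'grew', 'in']
```

The tagger does the hard part (deciding the tag); once the tags exist, the filter itself is a simple membership test.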

Let's start with an example:


In [None]:
# Tags a sentence for different parts of speech
sentence = "The greatest glory in living lies not in never falling, but in rising every time we fall."
print(f"Original sentence: {sentence}")

doc = nlp(sentence)
print(f"Tagged sentence:   {[(word.text, word.upos) for sentence in doc.sentences for word in sentence.words]}")

But is that all that impressive? Couldn't we just have one massive table that looks up the different parts of speech of different words?

The answer: kind of... but probably not. There are a couple of problems with that approach:

* Natural language is constantly evolving, and new words are being created every day. Whose job is it to identify those new words and add them to our look-up table... and is that even feasible?
* Some words change their part of speech depending on their usage.
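Both problems show up quickly in a toy version of the look-up approach. This is only a sketch: the table `POS_LOOKUP`, the function `lookup_tag`, and the made-up word "blorft" are all hypothetical.

```python
# A tiny, hypothetical look-up table: exactly one tag per word, no context
POS_LOOKUP = {
    "we": "PRON", "will": "AUX", "book": "NOUN", "the": "DET", "flight": "NOUN",
}

def lookup_tag(words):
    """Tag each word from the table; words not in the table get 'UNKNOWN'."""
    return [(word, POS_LOOKUP.get(word.lower(), "UNKNOWN")) for word in words]

# "book" is used as a verb here, but the table can only ever say NOUN
print(lookup_tag("We will book the flight".split()))

# A word nobody added to the table cannot be tagged at all
print(lookup_tag(["the", "blorft"]))
```

The table has no way to take the surrounding words into account, which is exactly what a context-aware tagger adds.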

Consider the following:

In [None]:
# Tags the POS in a sentence using one word in two different ways
sentence = "I wouldn't bank on the bank being open on Friday."
print(f"Original sentence: {sentence}")

doc = nlp(sentence)
print(f"Tagged sentence:   {[(word.text, word.upos) for sentence in doc.sentences for word in sentence.words]}")

Note how the word "bank" appears twice, once as a verb and once as a noun. Our POS tagger recognizes the different usages and is able to tag it correctly.

The reason this works is that Stanza uses a neural network to tag words rather than looking up the part of speech in a table. Let's see how it handles made-up words and misspellings by comparing then-president Donald Trump's famously misspelled tweet "Despite the constant negative press covfefe" to a corrected version.

In [None]:
# Compares the POS tagging of two parallel sentences, where one has a misspelling
incorrect_sentence = "Despite the constant negative press covfefe."
correct_sentence = "Despite the constant negative press coverage."
inc_doc = nlp(incorrect_sentence)
cor_doc = nlp(correct_sentence)
print(f"Incorrect sentence:   {[(word.text, word.upos) for sentence in inc_doc.sentences for word in sentence.words]}")
print(f"Correct sentence:     {[(word.text, word.upos) for sentence in cor_doc.sentences for word in sentence.words]}")

Feel free to change 'covfefe' to any other random letters; I suspect you'll find the neural POS tagger still identifies it as a noun. It doesn't know the meaning of 'covfefe', but it can tell what it should be tagged based on how it is used in a sentence.

##9.2. Chunking
---


Simply tagging the part of speech of each word can be helpful on its own, but we can learn a lot more about the text when we start looking at *chunks*, or groupings of tokens identified on the basis of their parts of speech.

For instance, let's start by looking at noun phrases.

In [None]:
# Defines a function to chunk noun phrases from the passed POS tagged document
def extract_noun_phrases(tagged_doc):
  chunk_grammar = r"NP: {(<DET>|<PRON>)?(<ADJ>)*(<CCONJ><ADJ>)?(<NOUN>|<PROPN>)+}"
  parser = nltk.RegexpParser(chunk_grammar)
  parsed_doc = parser.parse(tagged_doc)
  noun_phrases = []
  for subtree in parsed_doc.subtrees(lambda x: x.label() == "NP"):
    noun_phrases.append(" ".join([word[0] for word in subtree.leaves()]))
  return noun_phrases

# Extracts noun phrases from a sample text
sentence = "The brown cow called out to the happy yet tired farmer."
tagged_doc = [(word.text, word.upos) for sentence in nlp(sentence).sentences for word in sentence.words]
print(f"Noun Phrases: {extract_noun_phrases(tagged_doc)}")   

The `chunk_grammar` variable contains our 'rules' for chunking noun phrases. Here we're using [regular expressions](https://docs.python.org/3/howto/regex.html) which go far beyond the scope of this module. However, in this case the regular expression looks for:
* A determiner (e.g., "the") or pronoun (e.g., "his"), if there is one
* Zero or more adjectives (e.g., "green"), optionally followed by a conjunction and another adjective (e.g., "and yellow")
* At least one noun or proper noun

This clearly doesn't capture all the ways that noun phrases can be modified. For instance, noun phrases can use postmodifiers as in "the man with a green tattoo." However, this will work for our illustrative example.
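If the chunk grammar feels opaque, it may help to see the same idea with Python's built-in `re` module. The sketch below is not how NLTK implements chunking; it simply writes the tags out as one string and matches an equivalent pattern against it. The tags are assigned by hand, and `NP_PATTERN` and `chunk_by_tags` are names I made up for illustration.

```python
import re

# The noun-phrase rule from above, rewritten for Python's re module
NP_PATTERN = re.compile(
    r"(?:<DET>|<PRON>)?"      # optional determiner or pronoun
    r"(?:<ADJ>)*"             # zero or more adjectives
    r"(?:<CCONJ><ADJ>)?"      # optional conjunction plus adjective
    r"(?:<NOUN>|<PROPN>)+"    # one or more nouns or proper nouns
)

def chunk_by_tags(tagged_tokens, pattern):
    """Return the token groups whose tag sequence matches the pattern."""
    # Encode the tags as one string, e.g. "<DET><ADJ><NOUN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in pattern.finditer(tag_string):
        # Each "<" marks one token, so counting them maps characters back to tokens
        first = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        words = [text for text, _ in tagged_tokens[first:first + length]]
        chunks.append(" ".join(words))
    return chunks

# Hand-assigned tags for the sample sentence (not real tagger output)
tagged = [
    ("The", "DET"), ("brown", "ADJ"), ("cow", "NOUN"), ("called", "VERB"),
    ("out", "ADP"), ("to", "ADP"), ("the", "DET"), ("happy", "ADJ"),
    ("yet", "CCONJ"), ("tired", "ADJ"), ("farmer", "NOUN"), (".", "PUNCT"),
]
noun_phrases = chunk_by_tags(tagged, NP_PATTERN)
print(noun_phrases)  # ['The brown cow', 'the happy yet tired farmer']
```

Because the pattern works on tags rather than words, it would match any phrase with the same grammatical shape, whatever the vocabulary.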

The *prerequisites* code preloaded some 'about us' webpages and press release texts. Let's take a look at noun phrases from the first text.

In [None]:
# Extracts noun phrases from a text in the loaded corpus
text_to_extract_from = 0
tagged_doc = [(word.text, word.upos) for sentence in nlp(text_df.iloc[text_to_extract_from]["text"]).sentences 
                                     for word in sentence.words]
noun_phrases = extract_noun_phrases(tagged_doc)
print(f"Noun Phrases\n{'-'*30}")
for phrase in sorted(set(noun_phrases)):
  print(phrase)

Chunking can be used to capture a wide variety of patterns of speech beyond noun phrases. For instance, we can capture prepositional phrases and verb phrases as well. 

Verb phrases may be particularly valuable to understand what is happening in a text. Let's take a look at how we would change the chunk grammar to capture verb phrases:

In [None]:
# Defines a function to chunk verb phrases from the passed POS tagged document
def extract_verb_phrases(tagged_doc):
  chunk_grammar = r"VP: {<AUX>*<VERB>+}"
  parser = nltk.RegexpParser(chunk_grammar)
  parsed_doc = parser.parse(tagged_doc)
  verb_phrases = []
  for subtree in parsed_doc.subtrees(lambda x: x.label() == "VP"):
    verb_phrases.append(" ".join([word[0] for word in subtree.leaves()]))
  return verb_phrases

# Extracts verb phrases from a sample text
sentence = "I should have gone to the doctor because I became very sick"
tagged_doc = [(word.text, word.upos) for sentence in nlp(sentence).sentences for word in sentence.words]
print(f"Verb Phrases: {extract_verb_phrases(tagged_doc)}")   

Here we're looking for:
* Zero or more auxiliary verbs, followed by
* One or more main verbs

Here too, verb phrases can be more complex than this simple pattern; however, for our purposes this will suffice.
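To see what this pattern does and does not accept, here is a small sketch that tests whole tag sequences against an equivalent `re` pattern. The tag sequences are assigned by hand, and `VP_PATTERN` and `is_verb_phrase` are hypothetical names used only for this illustration.

```python
import re

# Equivalent of the chunk rule: zero or more AUX tags, then one or more VERB tags
VP_PATTERN = re.compile(r"(?:<AUX>)*(?:<VERB>)+")

def is_verb_phrase(tags):
    """Return True if the whole tag sequence fits the verb-phrase pattern."""
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return VP_PATTERN.fullmatch(tag_string) is not None

print(is_verb_phrase(["AUX", "AUX", "VERB"]))  # e.g., "should have gone" -> True
print(is_verb_phrase(["VERB"]))                # e.g., "became" -> True
print(is_verb_phrase(["AUX"]))                 # an auxiliary on its own -> False
print(is_verb_phrase(["AUX", "ADV", "VERB"]))  # e.g., "has quickly grown" -> False
```

The last case shows one limit of the simple pattern: an adverb between the auxiliary and the main verb breaks the match, so a chunker using this rule would capture only "grown" and leave "has" behind.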

Let's look at the verb patterns in a real text:

In [None]:
# Extracts verb phrases from a text in the loaded corpus
text_to_extract_from = 0
tagged_doc = [(word.text, word.upos) for sentence in nlp(text_df.iloc[text_to_extract_from]["text"]).sentences 
                                     for word in sentence.words]
verb_phrases = extract_verb_phrases(tagged_doc)
print(f"Verb Phrases\n{'-'*30}")
for phrase in sorted(set(verb_phrases)):
  print(phrase)

##9.3. Named Entity Recognition
---


Proper nouns may be of particular interest to our research. For instance, understanding the salient stakeholders of an organization based on the attention they receive in a corpus could be done by using chunking (as above) to select all proper nouns. However, this area is important enough in natural language processing research that it has evolved in parallel to part-of-speech tagging.

Named entity recognition seeks to extract and classify the names of people, places, and things in a text. For example, "William Henry Gates III, founder and former CEO of Microsoft Corporation was born in Seattle, Washington." has several named entities:
* William Henry Gates III - the person
* Microsoft Corporation - the company
* Seattle, Washington - the location

Let's have Python try to identify the named entities in this sentence:

In [None]:
# Extracts named entities from a sample text
sentence = "William Henry Gates III, founder and former CEO of Microsoft Corporation was born in Seattle, Washington."
entities = nlp(sentence).entities
for entity in entities:
  print(f"{entity.text:25} is a {entity.type}")

That worked. It identified the three named entities (though it split Seattle and Washington into two) and it correctly identified the type of entity they are (GPE stands for geopolitical entity).
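One way to see why "William Henry Gates III" comes out as a single entity while "Seattle" and "Washington" come out separately is to look at labels on individual tokens. The sketch below assumes token-level labels in the BIOES scheme (Begin, Inside, End, Single, Outside), which is one common encoding for named entities. The labels here are assigned by hand rather than produced by a model, and `decode_bioes` is a hypothetical helper, not part of Stanza.

```python
def decode_bioes(labeled_tokens):
    """Group (token, label) pairs into (entity text, entity type) pairs."""
    entities, current, current_type = [], [], None
    for token, label in labeled_tokens:
        if label == "O":                      # Outside any entity
            current, current_type = [], None
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "S":                     # Single-token entity
            entities.append((token, entity_type))
            current, current_type = [], None
        elif prefix == "B":                   # Beginning of a multi-token entity
            current, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current.append(token)
            if prefix == "E":                 # End of the entity: emit it
                entities.append((" ".join(current), entity_type))
                current, current_type = [], None
    return entities

# Hand-assigned labels for part of the sample sentence (not real model output)
labeled = [
    ("William", "B-PERSON"), ("Henry", "I-PERSON"), ("Gates", "I-PERSON"),
    ("III", "E-PERSON"), (",", "O"), ("founder", "O"), ("of", "O"),
    ("Microsoft", "B-ORG"), ("Corporation", "E-ORG"), ("was", "O"),
    ("born", "O"), ("in", "O"), ("Seattle", "S-GPE"), (",", "O"),
    ("Washington", "S-GPE"), (".", "O"),
]
decoded = decode_bioes(labeled)
for text, entity_type in decoded:
    print(f"{text:25} is a {entity_type}")
```

Under these assumed labels, the comma between "Seattle" and "Washington" sits outside any entity, so the two place names are emitted as separate single-token entities.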

Let's try it on a full text from our sample.

In [None]:
# Extracts named entities from a text in the loaded corpus
text_to_extract_from = 0
entities = nlp(text_df.iloc[text_to_extract_from]["text"]).entities
for entity in entities:
  print(f"{entity.text:60} - is a {entity.type}")

Ok, let's do something a bit more applied with this. Let's write some code that:
* Identifies all organizations in our corpus
* Counts the number of times each organization is mentioned
* Presents the top ten such organizations
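Before running this on the corpus, the three steps above can be sketched on a couple of tiny, hand-labeled documents. Everything in this sketch is made up for illustration: the entity names, the `docs` list, and the `count_orgs` helper.

```python
from collections import Counter

# Hypothetical (entity text, entity type) pairs for two short documents
docs = [
    [("Acme Ltd", "ORG"), ("Sydney", "GPE"), ("Acme Ltd", "ORG")],
    [("Acme Ltd", "ORG"), ("Globex", "ORG"), ("Jane Doe", "PERSON")],
]

def count_orgs(entities):
    """Count how often each organization is mentioned in one document."""
    return Counter(text for text, entity_type in entities if entity_type == "ORG")

# One Counter per document, then add them together for the whole corpus
per_doc = [count_orgs(doc) for doc in docs]
corpus_total = sum(per_doc, Counter())

print(corpus_total.most_common(10))  # [('Acme Ltd', 3), ('Globex', 1)]
```

Keeping one `Counter` per document, rather than a single running total, is what lets us regroup the counts later (for instance, by text type).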

In [None]:
# Have Stanza analyze our corpus of texts
text_df['nlp'] = text_df['text'].apply(nlp)

In [None]:
# Count all organizations mentioned in each text
def gen_org_counter(doc):
  org_counter = Counter()
  for entity in doc.entities:
    if entity.type == "ORG":
      org_counter[entity.text]+=1
  return org_counter

text_df['organizations'] = text_df['nlp'].apply(gen_org_counter)

In [None]:
# Sum the organization counts for all texts and print the ten most common
corpus_counter = text_df['organizations'].sum()
print(corpus_counter.most_common(10))

It did a pretty good job. We do see some artifacts (e.g., "Company" and "Group") indicating that the analysis isn't perfect; however, this is still a pretty informative list. We could now slice and dice this dataset several ways to understand whether these are companies talking about themselves or others, or we could compare who is being talked about in different types of texts.

In fact, let's do that by comparing the organizations being discussed in 'About Us' webpages vs. press releases:

In [None]:
# Sum the organization counts within each text type and print the ten most common for each
grouped_df = text_df.groupby(['text_type'])
corpus_counter = grouped_df['organizations'].sum()
for text_type, type_counter in corpus_counter.items():
  print(f"Text type: {text_type}\nOrganization mentions: {type_counter.most_common(10)}\n")

The opportunities for the application of these techniques in organizational research abound. 

For avid readers: [Coreference resolution](https://nlp.stanford.edu/projects/coref.shtml) is a complementary area of natural language processing that helps identify all references to an entity throughout a text, but it is beyond the scope of this series.