<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/06_text_visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Prerequisites

In [None]:
# Get external files and install 3rd party packages
!wget -q https://www.dropbox.com/s/vbcpntdzt1weid2/australia.png?dl=1 -O ./australia.png
!mkdir -p texts
!wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
!unzip -qq -n -d ./texts/ ./texts/AussieTop100private.zip
!pip -q install scattertext
!pip -q install stanza

# Standard library imports
import glob, re
from pathlib import Path
from IPython.display import display, HTML
from collections import Counter

# 3rd party imports
import nltk, stanza
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scattertext as st
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

sns.set_theme()

# Spin up stanza pipeline for tokenization
nltk.download("stopwords", quiet=True)
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos', tokenize_no_ssplit=True)

In [None]:
#Preprocess texts
about_dir = Path.cwd() / "texts" / "About"
pr_dir = Path.cwd() / "texts" / "PR"
dirs_to_load = [about_dir, pr_dir]

texts = [] 
stops = nltk.corpus.stopwords.words('english')+["'s", '&']
for directory in dirs_to_load:
  for file in glob.glob(f"{directory}/*.txt"):
    with open(file, 'r') as infile: 
      # Path handles the separators, so this also works where paths are not "/"-delimited
      text_type = Path(file).parent.name
      text_id = Path(file).name
      fulltext = infile.read()
      tokens = [word.text.lower() for sentence in nlp(fulltext).sentences for word in sentence.words if word.upos not in ["PUNCT", "SYM", "NUM", 'X']]
      texts.append({'text_type': text_type, 'text_id': text_id, 'text': fulltext, 'tokens': tokens})
text_df = pd.DataFrame(texts)
text_df['token_no_stops'] = text_df['tokens'].apply(lambda x: [word for word in x if word not in stops])
text_df['text_no_stops'] = text_df['token_no_stops'].apply(" ".join)

In [None]:
#Australia word cloud mask initialization
aus_img = Image.open("australia.png")
mask = np.array(aus_img)

# Replace 0-valued pixels with 255 and leave every other pixel value unchanged
tf_mask = np.where(mask == 0, 255, mask).astype(np.int32)

#Module 6 - Text Visualization
---

As researchers, we're often told to get to know our data before analyzing it. The same holds true in text analysis as well. We'd like to know 'what is in' our corpus of texts at a high level before conducting analyses.

In this module, we'll look at a number of different ways of visualizing our data - both graphically and statistically. The goals for this module are:

* View document length distributions
* Generate word frequency distributions and see Zipf's law in action
* Find n-grams and generate PMI statistics
* Use concordance analysis to see the context of tokens/phrases
* Create a word cloud

##6.1. Basic Visualizations
---

###6.1.1. Document Length Distribution
One of the first things we often look at when 'getting to know our texts' is the length of the texts themselves. We usually find a positively skewed distribution.

In [None]:
# Displays a histogram of corpus word counts
text_df['word_count'] = text_df['tokens'].apply(len)
text_df.hist(bins=20, grid=True, figsize=(20,5))

However, we've combined two different types of documents here... let's look at the distribution by document type.

In [None]:
# Displays a histogram of corpus word counts for each type of text
text_df.hist(bins=20, by="text_type", grid=True, figsize=(20,5))
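
To put a number on the positive skew mentioned above, one option is the skewness coefficient (the third standardized moment): values above zero indicate a long right tail, which usually goes along with a mean that sits above the median. The sketch below is only an illustration; the `skewness` helper and the `word_counts` values are hypothetical and are not taken from the corpus.

```python
import statistics

def skewness(values):
    """Return the Fisher-Pearson skewness coefficient (third standardized moment)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical document lengths (in words), made up for illustration only
word_counts = [120, 150, 160, 180, 200, 210, 250, 400, 900, 2500]

mean_length = statistics.fmean(word_counts)
median_length = statistics.median(word_counts)
skew = skewness(word_counts)
print(f"mean={mean_length:.0f}, median={median_length:.0f}, skewness={skew:.2f}")
```

In the notebook itself, `text_df['word_count'].skew()` gives a comparable figure; pandas applies a small-sample correction, so its value will differ slightly from this hand-rolled version.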

###6.1.2. Word Frequency Distributions

We may also want to understand what is being said in these texts in general before we start doing deeper analyses. Let's look at the most frequently used words in the corpus.

In [None]:
# Creates a frequency distribution plot for the words in the corpus
freq_dist = nltk.FreqDist(word for id, row in text_df.iterrows() for word in row['tokens'])
plt.figure(figsize=(20,5))
freq_dist.plot(50, title="Frequency distribution of the top 50 words")

This illustrates a couple of things:
1. The value of stopword removal: 'the', 'and', 'of', 'to', etc. are frequent, but they tell us little regarding the contents of the text
2. Zipf's law appears to hold even in our small corpus of texts.
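
Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so plotting log(frequency) against log(rank) should give something close to a straight line with a slope near -1. A minimal sketch of that check follows; the `zipf_slope` helper and the toy corpus are hypothetical and were constructed to follow the law, so they say nothing about our actual texts.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Fit log(frequency) = a + b * log(rank) by least squares and return b."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance

# Hypothetical toy corpus: the word ranked r appears about 120 / r times
toy_tokens = [f"word{rank}" for rank in range(1, 11) for _ in range(round(120 / rank))]

slope = zipf_slope(toy_tokens)
print(f"Estimated log-log slope: {slope:.2f}")
```

The same function could be applied to the corpus by passing it a flat list of tokens, e.g. `[word for tokens in text_df['tokens'] for word in tokens]`. A slope noticeably shallower than -1 would indicate frequencies that fall off more slowly than the law predicts.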

Let's remove stopwords and look at the distribution again:

In [None]:
# Creates a frequency distribution plot for the words in the corpus (minus stopwords)
freq_dist = nltk.FreqDist(word for id, row in text_df.iterrows() for word in row['token_no_stops'])
plt.figure(figsize=(20,5))
freq_dist.plot(50, title="Frequency distribution of the top 50 words")

This distribution is reflective of Zipf's law, but the frequencies don't seem to drop off as quickly as the law would predict. I would suggest that in a larger corpus, they would.

We also see that the 'content' of the texts seems a lot more apparent. These texts seem to talk about Australian businesses, with an emphasis on things that are new about these businesses. This is pretty accurate given the corpus of texts.

However, we're still looking at the overall frequency. Let's look at a conditional frequency distribution and see if any patterns emerge.

In [None]:
# Creates a conditional frequency distribution conditioned on the type of text
cfd = nltk.ConditionalFreqDist((row['text_type'], word) for id, row in text_df.iterrows() for word in row['token_no_stops'])
plt.figure(figsize=(20,5))
cfd.plot(conditions=["About", "PR"], samples=[word for word, count in freq_dist.most_common(50)], title="Conditional frequency distribution of the top 50 words")

This is interesting; however, there are significantly more words in our corpus for press releases than for About Us webpages:

In [None]:
# Calculates and displays the total number of words for each type of texts
about_words = text_df.groupby('text_type')['word_count'].sum()['About']
pr_words = text_df.groupby('text_type')['word_count'].sum()['PR']
print(f"There are a total of {pr_words} words in the press release texts, whereas there are only {about_words} words in the About Us pages.")

A better apples-to-apples comparison would look at the conditional **relative** frequency distribution of the words:

In [None]:
# Displays the conditional frequency distribution as proportions of each text type's total words rather than raw word counts
# Note: the counts are divided in place, so re-running this cell without rebuilding cfd divides them a second time
for word in freq_dist:
  cfd['About'][word] /= about_words
  cfd['PR'][word] /= pr_words

plt.figure(figsize=(20,5))
top_words = [word for word, count in freq_dist.most_common(50)]
cfd.plot(conditions=["About", "PR"], samples=top_words, title="Conditional relative frequency distribution of the top 50 words")
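
To see why relative frequencies give a fairer comparison, it can help to work through a tiny case by hand. The sketch below is illustrative only: the `relative_frequencies` helper and both token lists are hypothetical. It builds new dictionaries instead of modifying the counts, so it can be re-run safely.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Return each word's share of the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical token lists, made up for illustration only
about_tokens = ["family", "owned", "business", "family", "values"]
pr_tokens = ["new", "business", "announced", "new", "store",
             "new", "partnership", "business", "growth", "today"]

about_rel = relative_frequencies(about_tokens)
pr_rel = relative_frequencies(pr_tokens)

for word in ["business", "family", "new"]:
    print(f"{word:10} About={about_rel.get(word, 0):.2f}  PR={pr_rel.get(word, 0):.2f}")
```

Here "business" appears twice as often in the longer list in raw terms, yet it makes up the same share of each list, which is the kind of difference a raw count comparison would overstate.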

###6.1.3. Distinctive Words
---

The conditional frequency distributions above show us that among the most frequent words used in the corpus there are several words that are more frequently used in one type of text than in another.

However, using `scattertext` we can view such distinctive words in a clearer and more interactive way:

In [None]:
# Creates a scattertext diagram for the corpus with the percentile_dense transformation
st_corpus = st.CorpusFromPandas(text_df, category_col='text_type', text_col='text_no_stops').build().compact(st.AssociationCompactor(2000))
st_html = st.produce_scattertext_explorer(st_corpus,
                                          category='About',
                                          category_name='About Us',
                                          not_categories=['PR'],
                                          sort_by_dist=False,
                                          term_scorer=st.CredTFIDF(st_corpus),
                                          metadata=text_df['text_id'],
                                          transform=st.Scalers.percentile_dense)
HTML(st_html)

This view is similar to what we saw before in that we see a lot more variance around the most frequently used words. However, here we have much more context. 

We can change the scaling to see more variance in the less-frequent words:

In [None]:
# Creates a scattertext diagram for the corpus without the percentile_dense transformation
st_html = st.produce_scattertext_explorer(st_corpus,
                                          category='About',
                                          not_categories=['PR'],
                                          sort_by_dist=False,
                                          term_scorer=st.CredTFIDF(st_corpus),
                                          metadata=text_df['text_id'])
HTML(st_html)

##6.2. N-grams
---

Whereas previously we have largely focused on understanding the documents one word at a time, we may also be interested in groupings of words that appear frequently in the corpus (appropriately called n-grams).

Let's take a look at the most frequent word sequences (i.e., bigrams, trigrams):

In [None]:
# Calculates and displays a frequency distribution for the most frequent bigrams
freq_dist = nltk.FreqDist(" ".join(bigram) for id, row in text_df.iterrows() for bigram in nltk.ngrams(row['tokens'], 2))
plt.figure(figsize=(20,5))
freq_dist.plot(50)

In [None]:
# Calculates and displays a frequency distribution for the most frequent trigrams
freq_dist = nltk.FreqDist(" ".join(trigram) for id, row in text_df.iterrows() for trigram in nltk.ngrams(row['tokens'], 3))
plt.figure(figsize=(20,5))
freq_dist.plot(50)

The most common n-grams clearly tend to contain stop words (e.g., "of the," "as well as"). Sometimes this is desirable, because meaningful phrases such as "products and services" include a stop word. However, it is often valuable to complement this view with one in which stop words have been removed.

In [None]:
# Calculates and displays a frequency distribution for the most frequent bigrams (with stopwords removed)
freq_dist = nltk.FreqDist(" ".join(bigram) for id, row in text_df.iterrows() for bigram in nltk.ngrams(row['token_no_stops'], 2))
plt.figure(figsize=(20,5))
freq_dist.plot(50)

In [None]:
# Calculates and displays a frequency distribution for the most frequent trigrams (with stopwords removed)
freq_dist = nltk.FreqDist(" ".join(trigram) for id, row in text_df.iterrows() for trigram in nltk.ngrams(row['token_no_stops'], 3))
plt.figure(figsize=(20,5))
freq_dist.plot(50)

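Looking at the pointwise mutual information (PMI) can provide us with complementary insight. Rather than ranking n-grams by raw frequency, PMI highlights words that appear together far more often than their individual frequencies would predict (i.e., words that show up mostly as part of the n-gram rather than in other contexts).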

In [None]:
# Identifies and displays the 20 bigrams with the highest PMI scores (among those that appear at least ten times)
bigrams = nltk.collocations.BigramCollocationFinder.from_documents(text_df['tokens'])
bigrams.apply_freq_filter(10)
for bigram, pmi in bigrams.score_ngrams(nltk.collocations.BigramAssocMeasures.pmi)[:20]:
  print(f"Bigram = {' '.join(bigram):30}--- Frequency = {bigrams.ngram_fd[bigram]} --- PMI = {pmi}")

In [None]:
# Identifies and displays the 20 trigrams with the highest PMI scores (among those that appear at least ten times)
trigrams = nltk.collocations.TrigramCollocationFinder.from_documents(text_df['tokens'])
trigrams.apply_freq_filter(10)
for trigram, pmi in trigrams.score_ngrams(nltk.collocations.TrigramAssocMeasures.pmi)[:20]:
  print(f"Trigram = {' '.join(trigram):30}--- Frequency = {trigrams.ngram_fd[trigram]} --- PMI = {pmi}")
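
To make the statistic less of a black box, here is a minimal sketch of bigram PMI computed by hand using the textbook definition: the base-2 log of the bigram's observed probability divided by the product of its two words' individual probabilities. The `bigram_pmi` helper and the toy sentence are hypothetical, and NLTK's own counting conventions may differ in small details, so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: PMI} using PMI = log2(P(x, y) / (P(x) * P(y)))."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigram_counts.items():
        p_xy = count / n_bigrams
        p_x = word_counts[first] / n_words
        p_y = word_counts[second] / n_words
        scores[(first, second)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy text, made up for illustration only
toy_tokens = ("chief executive officer said the chief executive officer "
              "of the company said the results").split()

scores = bigram_pmi(toy_tokens)
for bigram, pmi in sorted(scores.items(), key=lambda item: item[1], reverse=True)[:5]:
    print(f"{' '.join(bigram):20} PMI = {pmi:.2f}")
```

Bigrams built from words that appear only once (such as "company said" here) score as high as "chief executive", because a rare word's few appearances all fall inside that bigram. This is why a minimum frequency filter, like the one applied in the cells above, is typically used before ranking by PMI.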

##6.3. Concordance/KWIC Analysis
---

Having looked at what individual words and phrases appear to be important in the corpus, it's often valuable to look at a concordance/keyword in context (KWIC) analysis.

NLTK offers this natively for individual words:

In [None]:
# Presents a concordance/KWIC analysis for the word 'australia' in the corpus
look_for = "australia"
nltk.Text([word for id, row in text_df.iterrows() for word in row['tokens']]).concordance(look_for)

The NLTK version cannot handle phrases. However, it is not difficult to create our own version that can:

In [None]:
# Creates a concordance analysis function that can handle phrases in addition to individual words
def phrase_concordance(text, phrase, window=40, lines=25):
  # re.escape makes the phrase match literally, even if it contains regex characters such as "." or "("
  locations = [match.start() for match in re.finditer(re.escape(phrase), text)]
  if lines >= len(locations):
    print(f"Displaying all {len(locations)} matches:")
  else:
    print(f"Displaying {lines} of {len(locations)} matches:")
  for location in locations[:lines]:
    # max() stops the left edge from going negative, which would otherwise produce a wrong or empty slice
    print(text[max(0, location-window):location+len(phrase)+window])

# Presents a concordance/KWIC analysis for the phrase 'chief executive officer' in the corpus
phrase = "chief executive officer"
text = " ".join([" ".join(row['tokens']) for id, row in text_df.iterrows()])
phrase_concordance(text, phrase)
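
One caveat with searching the joined string is that it matches characters rather than tokens, so a phrase can also match inside longer words. A token-level variant avoids this by comparing slices of the token list. The sketch below is one possible way to do it; the `token_concordance` helper and the toy sentence are hypothetical, and the window here counts tokens rather than characters.

```python
def token_concordance(tokens, phrase_tokens, window=5):
    """Return (left context, phrase, right context) for each token-level match."""
    n = len(phrase_tokens)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            # max() keeps the left slice from starting at a negative index
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + n:i + n + window])
            hits.append((left, " ".join(phrase_tokens), right))
    return hits

# Hypothetical toy text, made up for illustration only
toy_tokens = ("the chief executive officer said the new "
              "chief executive officer will start").split()

hits = token_concordance(toy_tokens, ["chief", "executive", "officer"], window=2)
for left, phrase_text, right in hits:
    print(f"{left:>10} [{phrase_text}] {right}")
```

Applied to the notebook's data, this would be run per document on `row['tokens']`, which also keeps context windows from spilling across document boundaries.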

##6.4. Word Clouds
---

A valuable way of conveying a lot of information about a corpus of texts (or a subset thereof) is to present readers with a word cloud. There are many ways to build one; the most basic form scales each word's size in proportion to its frequency:

In [None]:
# Creates a word cloud for the corpus
wordcloud = WordCloud(background_color="white", max_words = 150, stopwords=stops).generate(text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

You can also prepare a 'mask' image for the word cloud to fit. In this way, the generated word cloud will fit inside of the mask to represent something relevant to the corpus - in this case, the political boundaries of Australia:

*Note: If you're looking for it, the mask preprocessing was done in the 'prerequisites' section*

In [None]:
# Creates a word cloud for the corpus using an Australia mask
wordcloud = WordCloud(background_color="white", max_words = 150, stopwords=stops, mask=tf_mask).generate(text)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()