<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/08_dictionary_based_analyses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites
---


In [None]:
# Load 3rd-party packages
!pip -q install python-Levenshtein
!pip -q install "gensim==4.0.0"

# Import Standard Library packages
import glob
from collections import Counter
from itertools import compress, takewhile
from IPython.display import display
from pathlib import Path


# Import 3rd-party packages
import nltk
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from nltk import agreement
from nltk.corpus import wordnet
from scipy.stats import pearsonr

nltk.download('wordnet', quiet=True)

In [None]:
# Data Prerequisites
!mkdir -p texts

from_raw_texts = False # Stanza takes a while to download, loading from raw texts is slower than opening a pre-tokenized dataframe
if from_raw_texts:
  
  # Get text files
  !wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
  !unzip -qq -n -d ./texts/ ./texts/AussieTop100private.zip

  about_dir = Path.cwd() / "texts" / "About"
  pr_dir = Path.cwd() / "texts" / "PR"
  dirs_to_load = [about_dir, pr_dir]


  # Text preprocessing prerequisites
  !pip install --quiet stanza
  import stanza
  stanza.download("en")
  nlp = stanza.Pipeline('en', processors='tokenize,pos', tokenize_no_ssplit=True)
  texts = [] 
  for directory in dirs_to_load:
    for file in glob.glob(f"{directory}/*.txt"):
      with open(file, 'r') as infile: 
        text_type = Path(file).parent.name
        text_id = Path(file).name
        fulltext = infile.read().replace("\n", " ")
        tokens = [word.text.lower() for sentence in nlp(fulltext).sentences for word in sentence.words if word.upos not in ["PUNCT", "SYM", "NUM", 'X']]
        texts.append({'text_type': text_type, 'text_id': text_id, 'text': fulltext, 'tokens': tokens})
  text_df = pd.DataFrame(texts)

else:
  # Get pretokenized text dataframe
  !wget -q https://www.dropbox.com/s/xg4nuigde974k36/pretokenized_aussie_fbs.pkl?dl=1 -O ./texts/pretokenized_aussie_fbs.pkl
  text_df = pd.read_pickle('./texts/pretokenized_aussie_fbs.pkl')
  text_df = text_df.drop(['text_no_stops', 'tokens_no_stops'], axis=1)

# Get GloVe files and load into gensim
!mkdir -p GloVe
!wget -q https://www.dropbox.com/s/9k39nheab1rhezq/glove.6B.50d.zip?dl=1 -O ./GloVe/glove.6B.50d.zip
!unzip -qq -n -d ./GloVe ./GloVe/glove.6B.50d.zip

glove_model = KeyedVectors.load_word2vec_format("./GloVe/glove.6B.50d.txt", binary=False, no_header=True)

# Module 8 - Dictionary-Based Computer-Aided Text Analysis
---


One of the most basic text analyses is a dictionary-based analysis. With dictionary-based analyses, we provide the computer with a list of words/phrases/stems we believe are associated with a construct of interest (e.g., innovativeness). The computer then examines the frequency with which words associated with that construct are used in each text.

The goals for this module are to:
* Execute a dictionary-based analysis including words, phrases, and stems as entries
* Develop useful functions for dictionary creation and validation


## 8.1. Using Existing Dictionaries
---


In previous modules we created a basic dictionary-based analysis function that uses a Counter object to count the number of instances of each token in a text.

In [None]:
# Counts the number of dictionary words in the passed list of tokens
def count_dictionary_words(tokens, dictionary):
  word_counter = Counter(tokens)
  accumulator = 0
  for word in dictionary:
    accumulator = accumulator + word_counter.get(word, 0)
  return accumulator

We can use this function to identify how frequently the texts loaded in the *prerequisites* code use words that we think are associated with 'optimism':

In [None]:
# Creates a new column called 'optimism_words' and uses the count_dictionary_words function to populate that column
optimism_words = ['optimism', 'optimistic', 'optimists', 'optimist', 'improve', 'improved', 'improving']
text_df['optimism_words'] = text_df['tokens'].apply(lambda x: count_dictionary_words(x, optimism_words))
display(text_df)

This is a good start; however, dictionary-based analyses often include phrases in addition to individual words. Our text preprocessing makes it difficult to identify whether two consecutive tokens are part of a phrase (e.g., two consecutive tokens may be part of different sentences if punctuation has been removed). Fortunately, we can easily count instances of the phrase in the original text.

In [None]:
# Counts the number of times a set of phrases occur in the passed text
def count_dictionary_phrases(text, dictionary):
  accumulator = 0
  for phrase in dictionary:
    accumulator = accumulator + text.lower().count(phrase)
  return accumulator

# Creates a new column called 'optimism_phrases' and uses the count_dictionary_phrases function to populate that column
optimism_phrases = ['bank on', 'good omen', 'looking up', 'looks up']
text_df['optimism_phrases'] = text_df['text'].apply(lambda x: count_dictionary_phrases(x, optimism_phrases))
display(text_df)
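
One caveat with `str.count` is that it matches substrings, so 'bank on' would also be counted inside 'bank online'. The following is a minimal sketch of one way to restrict matches to whole words using regular-expression word boundaries. The helper name `count_phrases_whole_words` and the sample sentence are made up for illustration and are not part of the original module.

```python
import re

# Hypothetical helper: counts phrases only when they start and end on word boundaries
def count_phrases_whole_words(text, phrases):
    lowered = text.lower()
    total = 0
    for phrase in phrases:
        # re.escape guards against regex metacharacters inside a phrase;
        # \b requires a word boundary on both sides of the phrase
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total

# Made-up sample text for illustration
sample = "You can bank online, but do not bank on luck. Things are looking up."
sample_phrases = ['bank on', 'looking up']

substring_count = sum(sample.lower().count(p) for p in sample_phrases)
boundary_count = count_phrases_whole_words(sample, sample_phrases)
print(f"Substring matches: {substring_count}; whole-word matches: {boundary_count}")
```

Whether the stricter matching is desirable depends on the dictionary; for some phrases, partial matches may be exactly what the researcher wants.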

Some individuals put word stems in the dictionary to capture all words starting with a particular stem (e.g., hope\* catches hope, hopeful, hopefully). The dangers of this aside (e.g., hope\* also catches hopeless, hopelessness), we can extend our word counter to capture word stems.

Here, instead of summing the Counter entries that match a word in our list, we sum all Counter entries that *start* with a stem in our list.

In [None]:
# Counts the number of times word stems occur in the passed tokens
def count_dictionary_stems(tokens, dictionary):
  word_counter = Counter(tokens)
  accumulator = 0
  for stem in dictionary:
    stem = stem[:-1]
    matches = [match for match in word_counter.keys() if match.startswith(stem)]
    for match in matches:
      accumulator = accumulator + word_counter.get(match, 0)
  return accumulator

# Creates a new column called 'optimism_stems' and uses the count_dictionary_stems function to populate that column
optimism_stems = ['hope*', 'favor*', 'improv*']
text_df['optimism_stems'] = text_df['tokens'].apply(lambda x: count_dictionary_stems(x, optimism_stems))
display(text_df)
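
Because stems can capture unintended words, it may help to look at which tokens a stem actually matches before trusting the count. The sketch below assumes a hypothetical helper, `stem_matches`, and a made-up token list. It simply reports the matched tokens and their frequencies so they can be reviewed by hand.

```python
from collections import Counter

# Hypothetical helper: shows which tokens a stem entry (e.g., 'hope*') captures
def stem_matches(tokens, stem_entry):
    stem = stem_entry.rstrip('*')
    counts = Counter(tokens)
    return Counter({token: n for token, n in counts.items() if token.startswith(stem)})

# Made-up tokens for illustration
sample_tokens = ['hope', 'hopeful', 'hopeless', 'hopeless', 'hoping', 'favorable']
matched = stem_matches(sample_tokens, 'hope*')
print(matched)
```

In this toy example the stem picks up 'hopeless' but misses 'hoping', which shows both kinds of error a stem entry can introduce.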

With the three types of entries (i.e., words, phrases, and stems) counted, we can take the sum of the columns to calculate an overall optimism score.

In [None]:
# Calculate an overall optimism score
text_df['optimism_tot'] = text_df['optimism_words'] + text_df['optimism_phrases'] + text_df['optimism_stems']
text_df

All that being said, it's a bit of a hassle to have to manually separate the list into words, stems, and phrases. If we can identify rules for each category, we could let Python decide whether each entry is a word, stem, or phrase.

Here are our rules:
* Phrases contain at least one space in them
* Stems end with an asterisk
* If it doesn't contain a space or end in an asterisk, it is a word

With this insight, we can easily integrate the three functions above into one end-to-end dictionary-based analysis function:

In [None]:
# Conducts a dictionary analysis agnostic to whether the dictionary contains words/phrases/stems
def dictionary_analysis(text, tokens, dictionary):
  stems = []
  words = []
  phrases = []
  for entry in dictionary:
    if " " in entry:
      phrases.append(entry)
    elif entry.endswith('*'):
      stems.append(entry[:-1])
    else:
      words.append(entry)

  word_counter = Counter(tokens)
  accumulator = 0

  for phrase in phrases:
    accumulator = accumulator + text.lower().count(phrase)

  for stem in stems:
    matches = [match for match in word_counter.keys() if match.startswith(stem)]
    for match in matches:
      accumulator = accumulator + word_counter.get(match, 0)
  
  for word in words:
    accumulator = accumulator + word_counter.get(word, 0)
  
  return accumulator

optimism_dictionary = ['optimism', 'optimistic', 'optimists', 'optimist', 'improve', 'improved', 'improving', 'hope*', 'favor*', 'improv*', 'bank on', 'good omen', 'looking up', 'looks up']
text_df['new_optimism_tot'] = text_df.apply(lambda x: dictionary_analysis(x['text'], x['tokens'], optimism_dictionary), axis=1)
text_df
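
One thing to notice about the dictionary above is that 'improve', 'improved', and 'improving' are listed as words and are also captured by the stem 'improv*', so each occurrence of those words contributes twice to the total. Whether that is intended is a judgment call for the researcher. As a rough sketch, a hypothetical helper such as `find_overlapping_entries` (not part of the original module) could flag these overlaps for review:

```python
# Hypothetical helper: lists word entries that are also captured by a stem entry
def find_overlapping_entries(dictionary):
    stems = [entry[:-1] for entry in dictionary
             if entry.endswith('*') and ' ' not in entry]
    words = [entry for entry in dictionary
             if ' ' not in entry and not entry.endswith('*')]
    return [(word, stem + '*') for word in words for stem in stems
            if word.startswith(stem)]

# Shortened, made-up dictionary for illustration
sample_dictionary = ['optimism', 'improve', 'improved', 'improving',
                     'hope*', 'improv*', 'bank on']
overlaps = find_overlapping_entries(sample_dictionary)
print(overlaps)
```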

Looking at the columns for the summed `optimism_tot` column and our newly generated `new_optimism_tot` column, the numbers appear to be consistent. However, it's probably a good idea to check:

In [None]:
same_cols = np.allclose(text_df['optimism_tot'], text_df['new_optimism_tot'])
if same_cols:
  print("The two columns are exactly the same.")
else:
  print("The two columns are different.")

While it is possible to use the raw word counts in an analysis, we often use *document length normalization* to make documents of varying lengths more directly comparable. To do so, we typically divide each frequency by the total word count of that text.

In [None]:
# Document-length normalize the overall optimism score
# Divide each text's score by that text's own token count
text_df['dlnorm_optimism'] = text_df['optimism_tot'] / text_df['tokens'].apply(len)
text_df
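
To make the arithmetic concrete, here is a small sketch in plain Python using two made-up documents with the same raw count but different lengths. The helper name `length_normalize` is hypothetical. With a dataframe, the per-text lengths would come from something like `text_df['tokens'].apply(len)`.

```python
# Hypothetical helper: divides a raw dictionary count by the document's token count
def length_normalize(count, tokens):
    return count / len(tokens) if tokens else 0.0

# Two made-up documents, each containing two optimism words
short_doc = ['we', 'hope', 'to', 'improve']                    # 4 tokens
long_doc = ['we', 'hope', 'to', 'improve'] + ['filler'] * 16   # 20 tokens

short_score = length_normalize(2, short_doc)
long_score = length_normalize(2, long_doc)
print(f"Short document: {short_score:.2f}; long document: {long_score:.2f}")
```

Both documents have the same raw count, but the shorter one devotes a larger share of its words to the construct, which is the difference normalization is meant to surface.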

Before we proceed, let's eliminate some of those extra columns that we no longer need.

In [None]:
text_df = text_df.drop(['optimism_words', 'optimism_phrases', 'optimism_stems', 'new_optimism_tot'], axis=1)

## 8.2. Creating/Refining a Dictionary
---


There are many ways to create a dictionary for dictionary-based analysis. A comprehensive review of these approaches is beyond the scope of this module. In this section, we'll largely stick to procedures outlined in two papers: [Short et al. (2010)](https://journals.sagepub.com/doi/abs/10.1177/1094428109335949) and [McKenny et al. (2018)](https://journals.sagepub.com/doi/abs/10.1177/0149206316657594?casa_token=WkDQF900_uEAAAAA:wTc2J5bmpf0LwpE-WnR2iLyZEth261K_JULuzoG46QNi6PAi3fvn8zGXshML3kiafTB1X6WjTuyX). 

### 8.2.1. Inductive Word List Generation
---


In the Short et al. (2010) paper, the authors use DICTION to create an inductively generated list of all words that appear three or more times in at least one text. This can be easily recreated in Python.

In [None]:
# Identifies all words found frequently in at least one text
def inductive_words(tokenized_texts, threshold=3):
  word_set = set() # A set is basically a list where there cannot be duplicate entries (as with sets in mathematics)
  for text in tokenized_texts:
    word_counter = Counter(text).most_common()
    word_set = word_set.union({token[0] for token in takewhile(lambda x: x[1] >= threshold, word_counter)})
  return word_set

threshold = 3
inductive_list = inductive_words(text_df['tokens'], threshold)

print(f"Inductive list of {len(inductive_list)} words used {threshold}+ "\
      f"times in at least one text from the corpus: {sorted(inductive_list)}")

Here we actually expanded our capability: we can set the threshold to whatever we want. Three is the default, but we could change it to 5 if we wanted to.

An alternative to the above may be to look at words based on their frequency in the entire corpus rather than within an individual text. We can easily extend our function above to handle either case.

In [None]:
# Identifies all words found frequently either in at least one text or in the corpus overall
def inductive_words(tokenized_texts, threshold=3, within='text'):
  word_set = set() # A set is basically a list where there cannot be duplicate entries (as with sets in mathematics)
  if within=="text":
    for text in tokenized_texts:
      word_counter = Counter(text).most_common()
      word_set = word_set.union({token[0] for token in takewhile(lambda x: x[1] >= threshold, word_counter)})
  elif within=="corpus":
    corpus_tokens = [token for text in tokenized_texts for token in text] # Flatten the token lists into one list
    word_counter = Counter(corpus_tokens).most_common()
    word_set = {token[0] for token in takewhile(lambda x: x[1] >= threshold, word_counter)}
  return word_set

threshold = 3
within = "corpus"
inductive_list = inductive_words(text_df['tokens'], threshold, within)

print(f"Inductive list of {len(inductive_list)} words used {threshold}+ "\
      f"times (based on {within} frequency): {sorted(inductive_list)}")
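
The two settings can produce quite different lists. The following toy sketch, using three made-up token lists, illustrates the distinction: a word spread thinly across many texts can pass the corpus-level threshold without ever passing the within-text threshold.

```python
from collections import Counter

# Made-up token lists for illustration
toy_texts = [
    ['growth', 'growth', 'growth', 'team'],
    ['team', 'value'],
    ['team', 'value', 'value'],
]
toy_threshold = 3

# Within-text: a word qualifies if any single text uses it toy_threshold+ times
within_text = {word for text in toy_texts
               for word, n in Counter(text).items() if n >= toy_threshold}

# Within-corpus: a word qualifies if its total across all texts reaches toy_threshold
corpus_counts = Counter(token for text in toy_texts for token in text)
within_corpus = {word for word, n in corpus_counts.items() if n >= toy_threshold}

print(f"Within-text list:   {sorted(within_text)}")
print(f"Within-corpus list: {sorted(within_corpus)}")
```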

### 8.2.2. Deductive Word List Generation
---


Short et al. (2010) also create a deductive word list based on the focal construct. They use a thesaurus to generate such words, and thesaurus APIs such as [Merriam-Webster](https://dictionaryapi.com/products/api-collegiate-thesaurus) could be used to recreate their procedure directly. However, these APIs require you to request a 'key', which makes them less feasible for inclusion in this module.

WordNet, while significantly less comprehensive than a traditional thesaurus, does not require you to apply for a key to use it.

In [None]:
for sense in wordnet.synsets("hope"):
  print(f"{sense.name()} - {sense.definition()}")
  print(f"{'-'*30}")
  print(f"Synonyms: {[entry.name() for entry in sense.lemmas()]}")
  print(f"Hypernyms: {[entry.name() for entry in sense.hypernyms()]}")
  print(f"Hyponyms: {[entry.name() for entry in sense.hyponyms()]}")
  print(f"Other words with the same hypernym: {[hypo_entry.name() for hyper_entry in sense.hypernyms() for hypo_entry in hyper_entry.hyponyms() if hypo_entry.name() != sense.name()]}")
  print(f"\n{'-'*30}")

Perhaps more useful than that, we saw in the word representations module that there are also other ways of identifying words that are semantically similar to a focal word. In particular, we saw that we could use word embeddings, such as GloVe, to find similar words:

In [None]:
glove_model.most_similar(positive=['hope'])
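
As a reminder of what sits behind this ranking, `most_similar` orders candidate words by cosine similarity between their vectors. The sketch below works through that calculation with made-up two-dimensional vectors (real GloVe vectors here have 50 dimensions). The helper `cosine_similarity` and the vector values are illustrative assumptions only.

```python
import math

# Cosine similarity: dot product divided by the product of the vector lengths
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration; these are NOT real GloVe values
toy_vectors = {'hope': [1.0, 0.0], 'wish': [0.8, 0.6], 'table': [0.0, 1.0]}

# Rank every other word by its similarity to 'hope', most similar first
ranked = sorted(
    ((word, cosine_similarity(toy_vectors['hope'], vec))
     for word, vec in toy_vectors.items() if word != 'hope'),
    key=lambda pair: pair[1], reverse=True)
print(ranked)
```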

Neither of these approaches is likely to be a satisfactory replacement for a comprehensive thesaurus-based search. However, these tools offer valuable complements to such an approach.

### 8.2.3. Dictionary Judging and Interrater Agreement
---


After generating an inductive and deductive word list for the construct, researchers generally have raters evaluate the words for inclusion in the final dictionary. To justify inclusion, researchers generally provide evidence of interrater agreement.

The NLTK package offers several built-in functions to handle such calculations:

In [None]:
words = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
coder_1 = [1,0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,1]
coder_2 = [1,0,1,1,1,1,0,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,1,0,0,1]

# The format of the codes must be a list of (coder#, word#, their evaluation)
formatted_codes = [(1,word_num,coder_1[word_num]) for word_num in range(len(coder_1))] + \
                  [(2,word_num,coder_2[word_num]) for word_num in range(len(coder_2))]

ira_annotator = agreement.AnnotationTask(data=formatted_codes)
print(f"Percentage Agreement: {ira_annotator.avg_Ao():.2}")
print(f"Cohen's Kappa:        {ira_annotator.kappa():.2}")
print(f"Krippendorff's Alpha: {ira_annotator.alpha():.2}")
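
To see where two of these numbers come from, here is a sketch that works out percentage agreement and Cohen's kappa by hand for the same two sets of binary codes. The variable names (`rater_a`, `rater_b`) are chosen for this illustration. The results should be close to the NLTK output above, but this is a sketch rather than a reproduction of NLTK's implementation.

```python
# The same binary codes as above, under different (illustrative) names
rater_a = [1,0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,1]
rater_b = [1,0,1,1,1,1,0,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,1,0,0,1]
n_items = len(rater_a)

# Observed agreement: share of items on which the raters gave the same code
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n_items

# Expected chance agreement: both say 'include' or both say 'exclude',
# given each rater's own rate of saying 'include'
p_a_yes = sum(rater_a) / n_items
p_b_yes = sum(rater_b) / n_items
expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible
manual_kappa = (observed - expected) / (1 - expected)

print(f"Observed agreement: {observed:.2f}")
print(f"Expected agreement: {expected:.2f}")
print(f"Cohen's kappa:      {manual_kappa:.2f}")
```

The raters agree on 23 of the 26 items, but kappa is lower than the raw agreement because a good share of that agreement would be expected by chance alone.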

For simplicity, we had 'two raters' evaluate a small number of individual letters rather than a large number of words. Even so, the example illustrates how you can move from rater codes to compiling a final list.

When the authors of a paper aren't the raters, you may not be able to have the raters discuss the disagreed-upon words. In this situation, you have to decide whether to include words where the raters disagreed.

Python can handle either choice:

In [None]:
include_disagreements = [a | b for a,b in zip(coder_1, coder_2)]
print(f"The liberal (inclusive) word list is:      {list(compress(words, include_disagreements))}")

exclude_disagreements = [a & b for a,b in zip(coder_1, coder_2)]
print(f"The conservative (exclusive) word list is: {list(compress(words, exclude_disagreements))}")

On the other hand, if the authors of the paper are the raters or your raters are willing to discuss differences in coding, you can follow a similar approach to identify the words that the raters disagreed upon:

In [None]:
disagreements = [a ^ b for a,b in zip(coder_1, coder_2)]
print(f"The list of words to discuss among the coauthors is: {list(compress(words, disagreements))}")

### 8.2.4. Validity/Reliability Tests
---

Short et al. (2010) and McKenny et al. (2018) identify several analyses researchers can do to assess the reliability and validity of a dictionary. Many of these involve examining the correlation of the dictionary measurements with other data. Naturally, this can be done in external statistical tools such as Stata/SAS/SPSS; however, it can also be done within Python directly.

In [None]:
# Generate a column based on the optimism column for use in demonstrating correlations
# This is for demo purposes only - you would normally use data from another source
text_df['comparison_data'] = text_df['optimism_tot'] + np.random.randint(low=0, high=5, size=len(text_df))

# Calculate and display the correlations
correlation = pearsonr(text_df['optimism_tot'], text_df['comparison_data'])
print(f"The correlation between the optimism dictionary scores and the comparison data is: "
      f"r = {correlation[0]:.2}; p = {correlation[1]:.2}")