[Empath](https://github.com/Ejhfast/empath-client) is a set of dictionaries spanning 194 different topics (e.g., "car", "leisure", "tool", "real_estate"), originally described in Fast et al. (2016), "[Empath: Understanding Topic Signals in Large-Scale Text](https://hci.stanford.edu/publications/2016/ethan/empath-chi-2016.pdf)".  In this work, we'll explore using Empath to characterize texts and also use it as a jumping-off point to think about **validity**.

The Empath *category* "help", for example, is a dictionary that contains the following *dictionary terms*:

help = {help, chore, responsible, help, grateful, maid, housekeeping, helpful, stabilize, servant, benefit, financial, aide, supportive, assistance, favor, tend, favor, encourage, wheelchair, nurse, patient, honor, protection, oversee, guide, hospitality, duty, advisor, carry, trust, obligation, rely, support, escort, friend, treat, offer, serve, cooperate, encouragement, promote, volunteer, counsel, kindly, crutch, aid, nursing, helper, request, rescue, provide, protect, generously, housework, advise, temporary, assist, entrust, prepare
}

Given a text, we can count the *tokens* whose lemmas are *dictionary terms*; each such token is treated as evidence for the corresponding *category*.  In the following text, the tokens that have lemmas corresponding to "help" dictionary terms have been highlighted:

(1) "The doctor prescribed a ***wheelchair*** rather than ***crutches*** to help heal the broken leg of the ***patient***.  The hospital bill, however, was a significant ***financial*** burden to the ***patient***."
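To make the matching step concrete, here is a minimal sketch that counts "help" tokens for part of example (1). It assumes the lemmas have already been computed: the hand-written `token_lemmas` pairs are a stand-in for a lemmatizer's output, and `help_terms` is only a small subset of the dictionary above.

```python
from collections import Counter

# Assumed subset of the "help" dictionary terms listed above.
help_terms = {"wheelchair", "crutch", "patient", "financial"}

# Hand-written (token, lemma) pairs for part of example (1); in practice
# these would come from a lemmatizer rather than being typed in.
token_lemmas = [
    ("wheelchair", "wheelchair"),
    ("crutches", "crutch"),
    ("leg", "leg"),
    ("patient", "patient"),
    ("financial", "financial"),
    ("burden", "burden"),
    ("patient", "patient"),
]

# A token counts toward the category when its lemma is a dictionary term.
matches = Counter(lemma for _, lemma in token_lemmas if lemma in help_terms)
total_matches = sum(matches.values())
print(matches)
print(total_matches)
```

Note that "crutches" only matches through its lemma "crutch", which is why the lookup is done on lemmas rather than on surface forms.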

In [1]:
import spacy
from collections import Counter

In [2]:
# disable expects a list of separate pipe names, not one comma-joined string
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.remove_pipe('ner')
nlp.remove_pipe('parser')

('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fb51693e3a0>)

First, let's read in [the Empath dictionaries](https://github.com/Ejhfast/empath-client/blob/master/empath/data/categories.tsv) and create two mappings: one mapping categories to the dictionary terms within it, and one mapping dictionary terms to the categories they belong to (words can belong to multiple categories).

In [3]:
def read_dictionaries(filename):
    """Read tab-separated Empath categories; return (lemma_to_categories, category_to_lemmas)."""
    category_to_lemmas={}
    lemma_to_categories={}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            # The first column is the category name; set(cols) keeps it as a term too
            category=cols[0]
            category_to_lemmas[category]=set(cols)
            for lemma in cols:
                if lemma not in lemma_to_categories:
                    lemma_to_categories[lemma]={}
                lemma_to_categories[lemma][category]=1
    return lemma_to_categories, category_to_lemmas

In [5]:
lemma_to_categories, category_to_lemmas=read_dictionaries("../data/empath_categories.txt")
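The two mappings are inverses of each other. As a rough sketch of the same idea that runs without the data file, the snippet below builds both from a few in-memory rows. The rows are abbreviated for illustration and are not the real Empath file, and the names `sample_rows`, `category_to_terms`, and `term_to_categories` are hypothetical. Unlike `read_dictionaries`, this sketch uses `defaultdict(set)` and does not add the category name itself as a term.

```python
from collections import defaultdict

# Illustrative tab-separated rows: category first, then a few of its terms.
sample_rows = [
    "help\tassist\tadvisor\tnurse",
    "office\tassist\tadvisor\tbinder",
]

category_to_terms = defaultdict(set)
term_to_categories = defaultdict(set)
for row in sample_rows:
    category, *terms = row.split("\t")
    for term in terms:
        category_to_terms[category].add(term)
        term_to_categories[term].add(category)

# A term can belong to more than one category.
print(sorted(term_to_categories["assist"]))
print(sorted(term_to_categories["nurse"]))
```

Here "assist" maps to both categories while "nurse" maps only to "help", which is the situation the second mapping exists to handle.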

Now let's use these mappings to count up the Empath categories present in an input text.

In [6]:
def count_empath_categories(text, lemma_to_categories):
    category_counts=Counter()
    tokens=nlp(text.lower())
    for word in tokens:
        lemma=word.lemma_
        if lemma in lemma_to_categories:
            for cat in lemma_to_categories[lemma]:
                category_counts[cat]+=1
    
    for k,v in category_counts.most_common():
        print(v, k)

In [46]:
lemma_to_categories # maps each dictionary term (lemma) to the categories it appears in
category_to_lemmas # maps each category to the set of its dictionary terms

{'help': {'advise',
  'advisor',
  'aid',
  'aide',
  'assist',
  'assistance',
  'benefit',
  'carry',
  'chore',
  'cooperate',
  'counsel',
  'crutch',
  'duty',
  'encourage',
  'encouragement',
  'entrust',
  'escort',
  'favor',
  'financial',
  'friend',
  'generously',
  'grateful',
  'guide',
  'help',
  'helper',
  'helpful',
  'honor',
  'hospitality',
  'housekeeping',
  'housework',
  'kindly',
  'maid',
  'nurse',
  'nursing',
  'obligation',
  'offer',
  'oversee',
  'patient',
  'prepare',
  'promote',
  'protect',
  'protection',
  'provide',
  'rely',
  'request',
  'rescue',
  'responsible',
  'servant',
  'serve',
  'stabilize',
  'support',
  'supportive',
  'temporary',
  'tend',
  'treat',
  'trust',
  'volunteer',
  'wheelchair'},
 'office': {'accounting',
  'administration',
  'administrative',
  'administrator',
  'advisor',
  'agenda',
  'application',
  'archive',
  'assist',
  'assistant',
  'attendance',
  'backroom',
  'binder',
  'blueprint',
  'boardroo

We'll run it on the following text from [CNN](https://www.cnn.com/2021/08/31/middleeast/syria-cyprus-oil-spill-intl/index.html).

"An oil spill that originated from Syria's largest refinery is growing and spreading across the Mediterranean Sea, and could reach the island of Cyprus by Wednesday, Cypriot authorities have said.

Syrian officials said last week that a tank filled with 15,000 tons of fuel had been leaking since August 23 at a thermal power plant on the Syrian coastal city of Baniyas. They said they had been able to bring it under control.
Satellite imagery analysis by Orbital EOS now indicates that the oil spill was larger than originally thought, covering around 800 square kilometres (309 square miles) -- an area around the same size as New York City. The company told CNN Tuesday evening that the oil slick was around 7 kilometers (4 miles) from the Cypriot coast.
The Cypriot Department of Fisheries and Marine research said that, based on a simulation of the oil spill's movements and meteorological data, the slick could reach the Apostlos Andreas Cape "in the next 24 hours." The department posted the statement at around 11 a.m. local time (4 a.m. ET) on Tuesday.
It also said it would be willing to assist in tackling the spill."

In [7]:
text="""An oil spill that originated from Syria's largest refinery is growing and spreading across the Mediterranean Sea, and could reach the island of Cyprus by Wednesday, Cypriot authorities have said.

Syrian officials said last week that a tank filled with 15,000 tons of fuel had been leaking since August 23 at a thermal power plant on the Syrian coastal city of Baniyas. They said they had been able to bring it under control.
Satellite imagery analysis by Orbital EOS now indicates that the oil spill was larger than originally thought, covering around 800 square kilometres (309 square miles) -- an area around the same size as New York City. The company told CNN Tuesday evening that the oil slick was around 7 kilometers (4 miles) from the Cypriot coast.
The Cypriot Department of Fisheries and Marine research said that, based on a simulation of the oil spill's movements and meteorological data, the slick could reach the Apostlos Andreas Cape "in the next 24 hours." The department posted the statement at around 11 a.m. local time (4 a.m. ET) on Tuesday.
It also said it would be willing to assist in tackling the spill."""

In [None]:
count_empath_categories(text, lemma_to_categories)

Remember that dictionaries operate at the type level -- *every* instance of the word "financial", for instance, evokes the Empath "help" category, even though specific tokens of "financial" in context may not. Let's first identify what tokens in a text are evoking specific Empath categories, so we can examine them for their correctness.

**Q1:** Write a function that identifies the *tokens* corresponding to specific *dictionary terms* for an input *category* present in a given input text. This function should highlight those specific tokens in context by wrapping them in \*\*\*.  Taking the category "help" and the input text given in (1) above, your output should look like the following:

The doctor prescribed a \*\*\*wheelchair\*\*\* rather than \*\*\*crutches\*\*\* to help heal the broken leg of the \*\*\*patient\*\*\*.  The hospital bill, however, was a significant \*\*\*financial\*\*\* burden to the \*\*\*patient\*\*\*.


In [57]:
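# Assumption: test_case is the text of example (1) above
test_case = "The doctor prescribed a wheelchair rather than crutches to help heal the broken leg of the patient.  The hospital bill, however, was a significant financial burden to the patient."
nlp(test_case)
nlp(test_case.lower())[3]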

a

In [74]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [143]:
def print_empath_tokens_in_context(text, category_to_lemmas, category):
    """Return text with each token whose lemma is in `category` wrapped in ***."""
    # Original-case tokens, used for the returned string
    tokens_for_writing = nlp(text)

    # Lowercased tokens, used for lemmatization
    # (assumes lowercasing does not change the tokenization, so the two align)
    tokens = nlp(text.lower())

    storage = []

    for original, lowered in zip(tokens_for_writing, tokens):
        word = lowered.lemma_
        string_ver = original.text
        if word in category_to_lemmas[category]:
            storage.append("***" + string_ver + "***")
        elif word not in ["\n", "\n\n"]:
            # Skip newline tokens to keep the output readable
            storage.append(string_ver)

    # Joining on a single space leaves a space before punctuation
    return " ".join(storage)

In [145]:
print_empath_tokens_in_context(text, category_to_lemmas, "liquid")

'An ***oil*** ***spill*** that originated from Syria \'s largest refinery is growing and spreading across the Mediterranean Sea , and could reach the island of Cyprus by Wednesday , Cypriot authorities have said . Syrian officials said last week that a tank filled with 15,000 tons of fuel had been leaking since August 23 at a thermal power plant on the Syrian coastal city of Baniyas . They said they had been able to bring it under control . Satellite imagery analysis by Orbital EOS now indicates that the ***oil*** ***spill*** was larger than originally thought , covering around 800 square kilometres ( 309 square miles ) -- an area around the same size as New York City . The company told CNN Tuesday evening that the ***oil*** slick was around 7 kilometers ( 4 miles ) from the Cypriot coast . The Cypriot Department of Fisheries and Marine research said that , based on a simulation of the ***oil*** ***spill*** \'s movements and meteorological data , the slick could reach the Apostlos Andr
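The output above has spaces before punctuation because the tokens are joined with a single space, whereas the expected output in Q1 keeps the original spacing. One possible alternative, sketched below, assumes each token carries its own trailing whitespace (spaCy exposes this as `token.whitespace_` and `token.text_with_ws`). The `(text, whitespace, lemma)` triples are hand-written stand-ins for spaCy tokens, and `highlight` is a hypothetical helper name.

```python
def highlight(triples, terms):
    """Wrap tokens whose lemma is in `terms` with ***, keeping original spacing."""
    pieces = []
    for text, whitespace, lemma in triples:
        if lemma in terms:
            text = "***" + text + "***"
        # Re-attach whatever whitespace followed the token in the source text.
        pieces.append(text + whitespace)
    return "".join(pieces)

# Hand-written stand-ins for spaCy tokens: (text, trailing whitespace, lemma).
triples = [
    ("rather", " ", "rather"),
    ("than", " ", "than"),
    ("crutches", " ", "crutch"),
    ("for", " ", "for"),
    ("the", " ", "the"),
    ("patient", "", "patient"),
    (".", "", "."),
]

result = highlight(triples, {"crutch", "patient"})
print(result)
```

Because the whitespace is re-attached per token, the final period stays adjacent to the highlighted word instead of being separated by a space.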

**Q2.** Use the function you just wrote to find all tokens identified by the "liquid," "fire," "beach" and "ocean" categories and use them to fill out the table below.  Judge whether or not each token in context actually belongs to that category. Include a rationale if you think the decision would be contestable.

|Category|Token in Context|Label|Rationale (if needed)|
|---|---|---|---|
|liquid|An ***oil*** spill that originated|Correct|Oil is a liquid|
|liquid|An oil ***spill*** that originated|Correct|Spill describes what is happening with said liquid|
|liquid|indicates that the ***oil*** spill was larger|Correct|See above|
|liquid|indicates that the oil ***spill*** was larger|Correct|See above|
|liquid|evening that the ***oil*** slick was around|Correct|See above|
|liquid|a simulation of the ***oil*** spill's movements|Correct|See above|
|liquid|a simulation of the oil ***spill***'s movements|Correct|See above|
|liquid|assist in tackling the ***spill*** |Correct|See above|
|fire|An ***oil*** spill that |Incorrect|While oil can burn, the spill has not caught fire in this example|
|fire|the ***oil*** spill was larger |Incorrect|See above|
|fire|the ***oil*** slick was around 7 kilometers |Incorrect|See above|
|fire|a simulation of the ***oil*** spill's |Incorrect|See above|
|beach|across the Mediterranean ***Sea*** |Correct|Beaches are surrounded by seas|
|beach|reach the ***island*** of Cyprus |Correct|Islands usually have beaches!|
|beach|the Syrian ***coastal*** city of Baniyas |Correct|Coasts tend to be near beaches|
|beach|from the Cypriot ***coast*** |Correct|See above|
|ocean|spreading across the Mediterranean ***Sea*** |Correct|A sea is similar to (or the same as?) an ocean|
|ocean|reach the ***island*** of Cyprus |Correct|Possibly contestable (since land != sea), but islands are surrounded by oceans|
|ocean|the Syrian ***coastal*** city of Baniyas |Correct|Coasts are usually adjacent to the ocean/sea|
|ocean|from the Cypriot ***coast*** |Correct|See above|


You have a total of 20 rows (8 liquid, 4 fire, 4 beach, and 4 ocean, as identified above).

In [149]:
print_empath_tokens_in_context(text, category_to_lemmas, "ocean")

'An oil spill that originated from Syria \'s largest refinery is growing and spreading across the Mediterranean ***Sea*** , and could reach the ***island*** of Cyprus by Wednesday , Cypriot authorities have said . Syrian officials said last week that a tank filled with 15,000 tons of fuel had been leaking since August 23 at a thermal power plant on the Syrian ***coastal*** city of Baniyas . They said they had been able to bring it under control . Satellite imagery analysis by Orbital EOS now indicates that the oil spill was larger than originally thought , covering around 800 square kilometres ( 309 square miles ) -- an area around the same size as New York City . The company told CNN Tuesday evening that the oil slick was around 7 kilometers ( 4 miles ) from the Cypriot ***coast*** . The Cypriot Department of Fisheries and Marine research said that , based on a simulation of the oil spill \'s movements and meteorological data , the slick could reach the Apostlos Andreas Cape " in the 

**Q3.** Using that table, calculate the precision of the "liquid," "fire," "beach" and "ocean" categories for this passage using the following equation:

$$
\textrm{Precision(liquid)} = {{\textrm{# of "liquid" tokens identified by Empath that you marked as correct}} \over {\textrm{# of "liquid" tokens identified by Empath}}}
$$

You should report 4 numbers (one measure of precision for each of the 4 categories).

In [1]:
precision_liquid = 8/8
precision_fire = 0/4
precision_beach = 4/4
precision_ocean = 4/4

print("Precision(liquid):", precision_liquid)
print("Precision(fire):", precision_fire)
print("Precision(beach):", precision_beach)
print("Precision(ocean):", precision_ocean)

Precision(liquid): 1.0
Precision(fire): 0.0
Precision(beach): 1.0
Precision(ocean): 1.0
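The same numbers can be derived from the labels rather than typed in by hand. The sketch below is one way to do that, assuming the `(category, label)` pairs are transcribed from the Q2 table; the variable names are hypothetical.

```python
from collections import Counter

# (category, label) pairs transcribed from the Q2 table above.
labels = (
    [("liquid", "Correct")] * 8
    + [("fire", "Incorrect")] * 4
    + [("beach", "Correct")] * 4
    + [("ocean", "Correct")] * 4
)

# Denominator: every token Empath identified for the category.
identified = Counter(category for category, _ in labels)
# Numerator: the identified tokens that were judged correct.
correct = Counter(category for category, label in labels if label == "Correct")

precision = {
    category: correct[category] / identified[category]
    for category in identified
}
for category, value in precision.items():
    print(f"Precision({category}): {value}")
```

Keeping the labels as data means that changing a single judgment in the table only requires editing one entry, and the precision values update accordingly.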
