# Lab 8: Named entity recognition

- Jacob Eisenstein
- For Georgia Tech CS8803-CSS, Fall 2017

In this lab, you'll use Stanford's named entity recognizer (Stanford NER) to tag names of people, places, and organizations in the abolitionist newspaper The Liberator.

You can download the software here:

https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip

Next, unzip it:

In [1]:
! unzip stanford-ner-2017-06-09.zip

Archive:  stanford-ner-2017-06-09.zip
   creating: stanford-ner-2017-06-09/
  inflating: stanford-ner-2017-06-09/README.txt  
  inflating: stanford-ner-2017-06-09/stanford-ner-3.8.0.jar  
  inflating: stanford-ner-2017-06-09/ner-gui.bat  
  inflating: stanford-ner-2017-06-09/build.xml  
  inflating: stanford-ner-2017-06-09/stanford-ner-3.8.0-sources.jar  
  inflating: stanford-ner-2017-06-09/stanford-ner.jar  
  inflating: stanford-ner-2017-06-09/sample-conll-file.txt  
  inflating: stanford-ner-2017-06-09/sample.ner.txt  
   creating: stanford-ner-2017-06-09/lib/
  inflating: stanford-ner-2017-06-09/lib/joda-time.jar  
  inflating: stanford-ner-2017-06-09/lib/stanford-ner-resources.jar  
  inflating: stanford-ner-2017-06-09/lib/jollyday-0.4.9.jar  
  inflating: stanford-ner-2017-06-09/ner-gui.command  
  inflating: stanford-ner-2017-06-09/ner.sh  
  inflating: stanford-ner-2017-06-09/stanford-ner-3.8.0-javadoc.jar  
  inflating: stanford-ner-2017-06-09/NERDemo.java  
  inflating: stan

In [2]:
from nltk.tag.stanford import StanfordNERTagger
from glob import glob
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
import os
from collections import Counter

Let's build a tagger object. The first argument is the location of the model file; the second is the location of the jar file. Both should have been extracted from the zip file you downloaded.

In [3]:
tagger = StanfordNERTagger('stanford-ner-2017-06-09/classifiers/english.conll.4class.distsim.crf.ser.gz',
                           path_to_jar='stanford-ner-2017-06-09/stanford-ner.jar')

Let's run it. The input is a sequence of tokens. Here we'll just use string split for tokenization.

In [4]:
example = 'Colonel Mustard was in Druid Hills , with the President of the Coca Cola Corporation .'.split()

In [7]:
tagger.tag(example)

[('Colonel', 'O'),
 ('Mustard', 'PERSON'),
 ('was', 'O'),
 ('in', 'O'),
 ('Druid', 'LOCATION'),
 ('Hills', 'LOCATION'),
 (',', 'O'),
 ('with', 'O'),
 ('the', 'O'),
 ('President', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Coca', 'ORGANIZATION'),
 ('Cola', 'ORGANIZATION'),
 ('Corporation', 'ORGANIZATION'),
 ('.', 'O')]

The output is a labeling of each token. The tag 'O' means 'outside' of any entity name.

Here is a simple function that extracts names from this output.

In [8]:
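def get_entities(tagger_output):
    current_entity = []  # (token, tag) pairs of the entity being built
    entities = []
    for token,tag in tagger_output:
        if tag != 'O':
            # still inside an entity: keep collecting tokens
            current_entity.append((token,tag))
        else:
            # an 'O' tag closes whatever entity was open
            if current_entity != []:
                entities.append(current_entity)
                current_entity = []
    # note: an entity that runs up to the very last token is not closed by the loop above
    # format each entity as 'tokens_TAG', using the tag of its first token
    return ['%s_%s'%(' '.join([tok for tok,tag in entity]),entity[0][1]) for entity in entities]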

Let's run the function.

In [9]:
get_entities(tagger.tag(example))

['Mustard_PERSON',
 'Druid Hills_LOCATION',
 'Coca Cola Corporation_ORGANIZATION']
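
`get_entities` has two limitations worth knowing about: an entity that ends exactly at the last token is never appended, and adjacent tokens with different tags (say, a PERSON followed directly by a LOCATION) are merged into one entity labeled with the first tag. The variant below is a minimal sketch of one way to handle both. The name `get_entities_strict` and the toy input are made up for this illustration; the outputs elsewhere in this lab were produced with the original function.

```python
def get_entities_strict(tagger_output):
    """Group consecutive tokens that share a non-'O' tag into entities."""
    entities = []
    current_tokens = []   # tokens of the entity being built
    current_tag = None    # tag of the previous token
    for token, tag in tagger_output:
        # close the open entity whenever the tag changes
        if tag != current_tag and current_tokens:
            entities.append('%s_%s' % (' '.join(current_tokens), current_tag))
            current_tokens = []
        if tag != 'O':
            current_tokens.append(token)
        current_tag = tag
    # flush an entity that runs up to the final token
    if current_tokens:
        entities.append('%s_%s' % (' '.join(current_tokens), current_tag))
    return entities


# hand-written tagger output (not produced by the Stanford tagger)
toy_output = [('Jane', 'PERSON'), ('Doe', 'PERSON'), ('Atlanta', 'LOCATION'),
              ('is', 'O'), ('in', 'O'), ('Georgia', 'LOCATION')]
strict_entities = get_entities_strict(toy_output)
print(strict_entities)
```

Two different entities of the same type that sit next to each other are still merged, since the tagger's output carries no begin/inside markers.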

Let's try a harder one.

In [11]:
hard_example = 'I told Lucia Coca Cola was bad for her teeth .'.split()

In [12]:
tagger.tag(hard_example)

[('I', 'O'),
 ('told', 'O'),
 ('Lucia', 'ORGANIZATION'),
 ('Coca', 'ORGANIZATION'),
 ('Cola', 'ORGANIZATION'),
 ('was', 'O'),
 ('bad', 'O'),
 ('for', 'O'),
 ('her', 'O'),
 ('teeth', 'O'),
 ('.', 'O')]

In [37]:
get_entities(tagger.tag(hard_example))

['Lucia Coca Cola_ORGANIZATION']

In [42]:
with_complementizer = 'I told Taha that Coca Cola was bad for her teeth .'.split()

In [43]:
get_entities(tagger.tag(with_complementizer))

['Taha_PERSON', 'Coca Cola_ORGANIZATION']

And this is why you should use complementizers when you write.

# Tagging full documents

To get better tokenization, make sure you have downloaded the `punkt` tokenizer models from NLTK.

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shawnramirez/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Now let's try tagging a document from The Liberator.

Link or copy this directory in from Lab 7 if necessary.

In [15]:
! ln -s ../lab7/liberator-stories/

ln: ./liberator-stories: File exists


In [46]:
filename = 'liberator-stories/Issue of April 01, 1853/story006.txt'

In [47]:
tagged_lines = []
with open(filename) as fin:
    for line in fin:
        tagged_lines.append(tagger.tag(word_tokenize(line)))

In [49]:
print(tagged_lines[0][:100])

[('Southern', 'MISC'), ('slaveholders', 'O'), ('have', 'O'), ('a', 'O'), ('passion', 'O'), ('for', 'O'), ('mischiefframed', 'O'), ('into', 'O'), ('law', 'O'), (',', 'O'), ('which', 'O'), ('attracts', 'O'), ('the', 'O'), ('attentionof', 'O'), ('the', 'O'), ('civilized', 'O'), ('world', 'O'), ('.', 'O'), ('Yet', 'O'), (',', 'O'), ('with', 'O'), ('all', 'O'), ('their', 'O'), ('ardorfor', 'O'), ('slavery', 'O'), (',', 'O'), ('they', 'O'), ('do', 'O'), ('not', 'O'), ('knew', 'O'), ('how', 'O'), ('to', 'O'), ('be', 'O'), ('guiltyof', 'O'), ('such', 'O'), ('mean', 'O'), (',', 'O'), ('detestable', 'O'), (',', 'O'), ('low-minded', 'O'), (',', 'O'), ('base-heartedscoundrelism', 'O'), (',', 'O'), ('in', 'O'), ('matters', 'O'), ('touching', 'O'), ('the', 'O'), ('slaveryquestion', 'O'), (',', 'O'), ('as', 'O'), ('a', 'O'), ('certain', 'O'), ('class', 'O'), ('of', 'O'), ('pro-slavery', 'O'), ('politiciansin', 'O'), ('the', 'O'), ('North', 'LOCATION'), ('.', 'O'), ('We', 'O'), ('have', 'O'), ('this',

In [50]:
get_entities(tagged_lines[0])

['Southern_MISC',
 'North_LOCATION',
 'Illinois_LOCATION',
 'Virginia_LOCATION',
 'Old Dominion_MISC',
 'Illinoisan_LOCATION',
 'Virginia_LOCATION',
 'Illinois_LOCATION',
 'Northern_ORGANIZATION',
 'question-In_MISC',
 'Democracy_ORGANIZATION',
 'Commonwealth_ORGANIZATION']

**Your turn**: Get a story from today's news, copy it into the variable below, and extract the named entities. Skim the first few lines of the story yourself to check whether the output is correct.

In [23]:
my_story = 'UmmWaqqasTweets.txt'
#this was too long!!! I need to edit the code to get this to read in more quickly.
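
One likely reason for the slowness: in NLTK's Stanford wrapper, each call to `tagger.tag` launches a separate Java process, so tagging a long file line by line pays that startup cost once per line. The wrapper also provides `tag_sents`, which takes a list of token lists and tags them in one call. The sketch below assumes that interface. `tag_lines` and `FakeTagger` are hypothetical names, and the stand-in tagger exists only so the example runs without Java.

```python
def tag_lines(tagger, lines, tokenize=str.split):
    """Tag all non-empty lines with a single call to tagger.tag_sents."""
    token_lists = [tokenize(line) for line in lines if line.strip()]
    return tagger.tag_sents(token_lists)


class FakeTagger:
    """Stand-in with the same tag_sents interface; labels every token 'O'."""
    def tag_sents(self, sentences):
        return [[(token, 'O') for token in sentence] for sentence in sentences]


demo = tag_lines(FakeTagger(), ['Colonel Mustard was here .', '', 'So was Lucia .'])
print(len(demo), demo[1][:2])
```

With the real tagger, the call would look like `tag_lines(tagger, fin, tokenize=word_tokenize)` inside a `with open(my_story) as fin:` block.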

In [None]:
tagged_lines = []
with open(my_story) as fin:
    for line in fin:
        tagged_lines.append(tagger.tag(word_tokenize(line)))

In [None]:
print(tagged_lines[0][:10])

In [None]:
#let's try again

In [51]:
my_story2 = """_UmmWaqqas	The test in our struggles is our ability to stay patient in times of hardships. Allah says in the Quran "Allah Loves those who are PATIENT"Nothing is impossible for Allah, & if something is meant for you it will happen. Regardless of how it does it eventually willIn life we face so many hardships & think there's no way out of them yet forget that Allah is Wali Al-Hamid. He will make a way out for youAllah says "I will make a way for you from sources you could not imagine"When you think your back is against the wall, & you feel like there is no way out for you PLACE your trust in Allah & watch things go smooth3. Alhamdulilah she made it to Ash-Shaam.  She is a sister whom I love dearly & admire, now the lesson in all of this is...2. She placed her trust in Allah & tied her camel & set out to make Hijrah. Though the path was tough & though there were obstacles2. One BIG PROBELM she was under watch by the government & was being tracked by her every move.1. Bismillah.... 
"""

In [52]:
get_entities(tagger.tag(word_tokenize(my_story2)))
#Issues that this had: Sometimes Allah is used in different ways, which makes it complicated.
#For example, "PLACE your trust in Allah" makes it list Allah as a location. "Allah Loves" is listed
#as an organization. Other people are also sometimes listed as organizations.
#It could be that this tagger (the Stanford NER tagger) is not designed for tweets,
#so maybe I should find another one that pays less attention to capitalization.

['Allah_PERSON',
 'Quran_LOCATION',
 'Allah Loves_ORGANIZATION',
 'Allah_PERSON',
 'Allah_PERSON',
 'Wali Al-Hamid_PERSON',
 'Allah_LOCATION',
 'Alhamdulilah_ORGANIZATION',
 'Ash-Shaam_ORGANIZATION',
 'Allah_LOCATION',
 'Hijrah_PERSON',
 'Bismillah_PERSON']

In [53]:
your_story = """
A new video of what would appear to be one of Apple’s “Project Titan” self-driving cars was posted to Twitter last night, and it looks much different than it did the last time we saw it. The car appears to be outfitted with standard third-party sensors and hardware, including (count ‘em) six Velodyne-made LIDAR sensors, several radar units, and a number of cameras — all encased in Apple-esque white plastic.

The video was captured by someone who knows his stuff about autonomous vehicles: MacCallister Higgins, co-founder of self-driving startup Voyage (that just launched its own pilot ride-hailing project in a San Jose retirement community). Higgins jokingly referred to the 

Indeed, when you compare Apple’s car with the latest iteration of Waymo’s self-driving minivan, the differences are striking. While Waymo has minimized and streamlined its sensors so they conform nicely with the vehicle’s body, Apple’s are perched on the vehicle’s roof like an ugly cargo carrier.

When I asked Higgins if he caught a look at the compute stack, he replied that it was likely located on the roof with the sensors. That would be a departure from other self-driving car operators, who typically load their high-powered GPUs in the vehicles’ spacious trunks.

Earlier this year, Apple caused a stir when it applied for and received a permit to test autonomous vehicles on public roads in California. We do know, from various reports that Apple has ditched its ambitions to build an entirely new vehicle from scratch and has instead shifted focused to building autonomous software it could develop for existing carmakers. Last July, CEO Tim Cook confirmed in an interview that the iPhone maker is currently “focusing on autonomous systems” — rather than, say, a car stamped with the Apple logo — and that this could be used for many different purposes.



"""

In [54]:
# here's my output
get_entities(tagger.tag(word_tokenize(your_story)))

['Apple_ORGANIZATION',
 'Twitter_MISC',
 'LIDAR_ORGANIZATION',
 'Apple-esque_MISC',
 'MacCallister Higgins_PERSON',
 'San Jose_LOCATION',
 'Higgins_PERSON',
 'Apple_ORGANIZATION',
 'Waymo_LOCATION',
 'Waymo_PERSON',
 'Apple_ORGANIZATION',
 'Higgins_PERSON',
 'GPUs_PERSON',
 'Apple_ORGANIZATION',
 'California_LOCATION',
 'Apple_ORGANIZATION',
 'Tim Cook_PERSON',
 'Apple_ORGANIZATION']

## Counting entities

We can keep count of all these different entities, using a `Counter` object.

In [55]:
from collections import Counter

In [56]:
Counter(get_entities(tagged_lines[0]))

Counter({'Commonwealth_ORGANIZATION': 1,
         'Democracy_ORGANIZATION': 1,
         'Illinois_LOCATION': 2,
         'Illinoisan_LOCATION': 1,
         'North_LOCATION': 1,
         'Northern_ORGANIZATION': 1,
         'Old Dominion_MISC': 1,
         'Southern_MISC': 1,
         'Virginia_LOCATION': 2,
         'question-In_MISC': 1})

A cool thing about counters is that you can add them up, which makes it easy to keep a running count.

In [59]:
counter1 = Counter(get_entities(tagged_lines[0]))
counter1 += Counter(get_entities(tagged_lines[0]))

In [60]:
counter1

Counter({'Commonwealth_ORGANIZATION': 2,
         'Democracy_ORGANIZATION': 2,
         'Illinois_LOCATION': 4,
         'Illinoisan_LOCATION': 2,
         'North_LOCATION': 2,
         'Northern_ORGANIZATION': 2,
         'Old Dominion_MISC': 2,
         'Southern_MISC': 2,
         'Virginia_LOCATION': 4,
         'question-In_MISC': 2})

In [61]:
counter1.most_common(3)

[('Illinois_LOCATION', 4), ('Virginia_LOCATION', 4), ('Southern_MISC', 2)]

We'll use this to incrementally build a counter as we process all stories in a single edition.

Another useful trick is to build a counter from a list. 

In [62]:
the_list = ['a','b','a','a','c','b']

In [63]:
Counter(the_list)

Counter({'a': 3, 'b': 2, 'c': 1})

**Your turn** 

- Count the number of times a person is mentioned in your news story.
- Count the total number of different person names that are mentioned.

In [64]:
entities = get_entities(tagger.tag(word_tokenize(my_story2)))

In [66]:
# split on the last underscore only, in case a name itself contains one
entity_tuples = [entity.rsplit('_', 1) for entity in entities]

In [67]:
counter = Counter([name for 
                   name,ne_type 
                   in entity_tuples
                   if ne_type=='PERSON'])

In [68]:
counter = Counter()
for entity_tuple in entity_tuples:
    if entity_tuple[1] == 'PERSON':
        counter[entity_tuple[0]] +=1

In [69]:
counter

Counter({'Allah': 3, 'Bismillah': 1, 'Hijrah': 1, 'Wali Al-Hamid': 1})
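
The two quantities the exercise asks for can be read off this counter: the total number of person mentions is the sum of its values, and the number of distinct person names is its length. A minimal sketch, with the counter typed in by hand from the output above (`person_counts` is a name chosen for this illustration):

```python
from collections import Counter

person_counts = Counter({'Allah': 3, 'Bismillah': 1, 'Hijrah': 1, 'Wali Al-Hamid': 1})

total_mentions = sum(person_counts.values())  # every tagged PERSON mention
distinct_names = len(person_counts)           # number of different names
print(total_mentions, distinct_names)
```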

# Comparing entities across multiple texts

Now let's compare the named entities that are mentioned in two specific editions.

In [70]:
edition1 = 'liberator-stories/Issue of November 01, 1850'
edition2 = 'liberator-stories/Issue of November 11, 1859'

Here's a function to compute a running count of entities across the lines and stories of an issue.

In [71]:
def get_entity_counts(directory,show_progress=False):
    entity_counts = Counter()
    for filename in glob(os.path.join(directory,'story*txt')):
        with open(filename) as fin:
            if show_progress: print(filename)
            for line in fin:
                if len(line)>10:  # skip blank or very short lines
                    output = tagger.tag(word_tokenize(line))
                    entity_counts += Counter(get_entities(output))
    return entity_counts

In [72]:
counts1 = get_entity_counts(edition1,show_progress=True)

liberator-stories/Issue of November 01, 1850/story000.txt
liberator-stories/Issue of November 01, 1850/story001.txt
liberator-stories/Issue of November 01, 1850/story002.txt
liberator-stories/Issue of November 01, 1850/story003.txt
liberator-stories/Issue of November 01, 1850/story004.txt
liberator-stories/Issue of November 01, 1850/story005.txt
liberator-stories/Issue of November 01, 1850/story006.txt
liberator-stories/Issue of November 01, 1850/story007.txt
liberator-stories/Issue of November 01, 1850/story008.txt
liberator-stories/Issue of November 01, 1850/story009.txt


In [73]:
len(counts1)

262

In [74]:
counts1.most_common(5)

[('Massachusetts_LOCATION', 14),
 ('Constitution_ORGANIZATION', 13),
 ('Senate_ORGANIZATION', 9),
 ('United States_LOCATION', 9),
 ('God_PERSON', 7)]

In [75]:
counts2 = get_entity_counts(edition2,show_progress=True)

liberator-stories/Issue of November 11, 1859/story000.txt
liberator-stories/Issue of November 11, 1859/story001.txt
liberator-stories/Issue of November 11, 1859/story002.txt
liberator-stories/Issue of November 11, 1859/story003.txt
liberator-stories/Issue of November 11, 1859/story004.txt
liberator-stories/Issue of November 11, 1859/story005.txt
liberator-stories/Issue of November 11, 1859/story006.txt
liberator-stories/Issue of November 11, 1859/story007.txt
liberator-stories/Issue of November 11, 1859/story008.txt
liberator-stories/Issue of November 11, 1859/story009.txt


In [76]:
counts1.most_common(10)

[('Massachusetts_LOCATION', 14),
 ('Constitution_ORGANIZATION', 13),
 ('Senate_ORGANIZATION', 9),
 ('United States_LOCATION', 9),
 ('God_PERSON', 7),
 ('California_LOCATION', 6),
 ('BERRIEN_ORGANIZATION', 6),
 ('ERRIEN_ORGANIZATION', 6),
 ('Whittier_LOCATION', 6),
 ('EWING_PERSON', 5)]

In [77]:
counts2.most_common(10)

[('Harper_PERSON', 13),
 ('Brown_PERSON', 12),
 ('North_LOCATION', 11),
 ('South_LOCATION', 9),
 ('Phillips_PERSON', 9),
 ('Virginia_LOCATION', 8),
 ('Republican_MISC', 7),
 ('Seward_PERSON', 6),
 ('Giddings_ORGANIZATION', 4),
 ('Wilson_PERSON', 4)]

As expected, John Brown's raid on Harper's Ferry dominates the news in late 1859, even though the NER system mistakenly labels "Harper" as a person. Another key difference is that "North" and "South" play a bigger role in the 1859 data, which makes sense, as the war is only a year and a half away.

**Your turn** What are the ten most frequently mentioned organizations across **both** newspaper issues?

In [100]:
# your code here 
def get_entity_countsORG(directory,show_progress=False):
    entity_counts = Counter()
    for filename in glob(os.path.join(directory,'story*txt')):
        with open(filename) as fin:
            if show_progress: print(filename)
            for line in fin:
                if len(line)>10:
                    output = tagger.tag(word_tokenize(line))
                    entities = get_entities(output)
                    # split on the last underscore only, in case a name contains one
                    entity_tuples = [entity.rsplit('_', 1) for entity in entities]
                    for entity_tuple in entity_tuples:
                        if entity_tuple[1] == 'ORGANIZATION':
                            entity_counts[entity_tuple[0]] += 1
    return entity_counts

In [101]:
counts1 = get_entity_countsORG(edition1,show_progress=True)

liberator-stories/Issue of November 01, 1850/story000.txt
liberator-stories/Issue of November 01, 1850/story001.txt
liberator-stories/Issue of November 01, 1850/story002.txt
liberator-stories/Issue of November 01, 1850/story003.txt
liberator-stories/Issue of November 01, 1850/story004.txt
liberator-stories/Issue of November 01, 1850/story005.txt
liberator-stories/Issue of November 01, 1850/story006.txt
liberator-stories/Issue of November 01, 1850/story007.txt
liberator-stories/Issue of November 01, 1850/story008.txt
liberator-stories/Issue of November 01, 1850/story009.txt


In [102]:
counts2 = get_entity_countsORG(edition2,show_progress=True)

liberator-stories/Issue of November 11, 1859/story000.txt
liberator-stories/Issue of November 11, 1859/story001.txt
liberator-stories/Issue of November 11, 1859/story002.txt
liberator-stories/Issue of November 11, 1859/story003.txt
liberator-stories/Issue of November 11, 1859/story004.txt
liberator-stories/Issue of November 11, 1859/story005.txt
liberator-stories/Issue of November 11, 1859/story006.txt
liberator-stories/Issue of November 11, 1859/story007.txt
liberator-stories/Issue of November 11, 1859/story008.txt
liberator-stories/Issue of November 11, 1859/story009.txt


In [103]:
counttot = Counter(counts1)
counttot += Counter(counts2)

In [104]:
counttot.most_common(10)

[('Constitution', 14),
 ('Senate', 9),
 ('BERRIEN', 6),
 ('ERRIEN', 6),
 ('Congress', 6),
 ('Fugitive Slave Bill', 5),
 ('Union', 4),
 ('Supreme Court', 4),
 ('Giddings', 4),
 ('Legislature', 3)]

In [85]:
org_counts.most_common(10)

[('Constitution_ORGANIZATION', 14),
 ('Senate_ORGANIZATION', 9),
 ('ERRIEN_ORGANIZATION', 6),
 ('BERRIEN_ORGANIZATION', 6),
 ('Congress_ORGANIZATION', 6),
 ('Fugitive Slave Bill_ORGANIZATION', 5),
 ('Giddings_ORGANIZATION', 4),
 ('Supreme Court_ORGANIZATION', 4),
 ('Union_ORGANIZATION', 4),
 ('Legislature_ORGANIZATION', 3)]
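
The cell above refers to `org_counts`, which is not defined anywhere in this notebook. One way such a counter could be built, assuming the full counters returned by `get_entity_counts` for both editions are added together, is to keep only the keys that end in `_ORGANIZATION`. The helper name `filter_by_type` and the toy counts are made up for illustration.

```python
from collections import Counter

def filter_by_type(entity_counts, ne_type):
    """Keep only entries whose key ends in '_<ne_type>'."""
    suffix = '_' + ne_type
    return Counter({name: count for name, count in entity_counts.items()
                    if name.endswith(suffix)})

# toy counters standing in for the two per-edition counters
toy_counts1 = Counter({'Senate_ORGANIZATION': 9, 'Massachusetts_LOCATION': 14})
toy_counts2 = Counter({'Senate_ORGANIZATION': 1, 'Brown_PERSON': 12})

toy_org_counts = filter_by_type(toy_counts1 + toy_counts2, 'ORGANIZATION')
print(toy_org_counts.most_common(10))
```

Filtering after counting keeps the `_ORGANIZATION` suffix in the keys, which matches the format of the output in the cell above.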

# Large-scale comparison

Now let's do a large-scale comparison over the years of the dataset. Because the NER system is a little slow, I ran it overnight on our server. You can load in the output as shown:

In [105]:
! tar xzf liberator-nes.tgz

In [106]:
with open('TheLiberator/Issue of April 15, 1859/story001.ne') as fin:
    for line in fin:
        print(line.rstrip())

States_LOCATION
States_LOCATION
SWERED_ORGANIZATION
Unioncan_LOCATION
States_LOCATION
ONSTITUTION_ORGANIZATION


**Your turn** Read this same file into a counter.

In [107]:
# your code here
with open('TheLiberator/Issue of April 15, 1859/story001.ne') as fin:
    counts = Counter(line.rstrip() for line in fin)

In [108]:
# your code here
with open('TheLiberator/Issue of April 15, 1859/story001.ne') as fin:
    counts = Counter()
    for line in fin:
        counts[line.rstrip()] += 1

In [109]:
counts

Counter({'ONSTITUTION_ORGANIZATION': 1,
         'SWERED_ORGANIZATION': 1,
         'States_LOCATION': 3,
         'Unioncan_LOCATION': 1})

Now let's build counters for all stories in a year. The following function should help.

In [116]:
def get_files_for_year(year):
    return glob('TheLiberator/Issue*%d/story*.ne'%(year))

In [117]:
get_files_for_year(1859)[:5]

['TheLiberator/Issue of April 15, 1859/story000.ne',
 'TheLiberator/Issue of April 15, 1859/story001.ne',
 'TheLiberator/Issue of April 15, 1859/story002.ne',
 'TheLiberator/Issue of April 15, 1859/story003.ne',
 'TheLiberator/Issue of April 15, 1859/story004.ne']

**Your turn** Implement the following function, which should return a counter of entity names for all stories in a given year. Don't forget the two tricks about Counters that I showed you above.

In [118]:
def get_entity_counts_for_year(year):
    counts = Counter()
    for filename in get_files_for_year(year):
        with open(filename) as fin:
            counts += Counter(line.rstrip() for line in fin)
    return counts

In [119]:
get_entity_counts_for_year(1859)

Counter({'General Agent_ORGANIZATION': 7,
         'American_MISC': 58,
         'Massachusetts_LOCATION': 150,
         'Pennsylvania_LOCATION': 13,
         'Ohio_LOCATION': 34,
         'Michigan_LOCATION': 6,
         'Societiesare_PERSON': 6,
         'JACKSON_PERSON': 6,
         'EDMUND QUINCY_PERSON': 2,
         'SAMUEL PHILBRICK_PERSON': 7,
         'WENDELLPHILLIPS_ORGANIZATION': 6,
         'ACKSON_PERSON': 6,
         'States_LOCATION': 56,
         'SWERED_ORGANIZATION': 1,
         'Unioncan_LOCATION': 6,
         'ONSTITUTION_ORGANIZATION': 6,
         'Personal Liberty Bill_ORGANIZATION': 8,
         'Northern States_LOCATION': 5,
         'Northern Legislature_ORGANIZATION': 1,
         'Abolitionism_MISC': 4,
         'Legislature_ORGANIZATION': 33,
         'United States_LOCATION': 70,
         'Republicans_MISC': 18,
         'Committeeon_MISC': 1,
         'House_ORGANIZATION': 28,
         'Republican_MISC': 49,
         'Arnold_PERSON': 2,
         'Constitutio

Test your function:

In [120]:
get_entity_counts_for_year(1859).most_common(10)

[('Massachusetts_LOCATION', 150),
 ('God_PERSON', 136),
 ('Boston_LOCATION', 127),
 ('South_LOCATION', 77),
 ('Brown_PERSON', 76),
 ('United States_LOCATION', 70),
 ('Constitution_ORGANIZATION', 68),
 ('North_LOCATION', 66),
 ('New York_LOCATION', 62),
 ('American_MISC', 58)]

Now let's print the top entities for every year:

In [121]:
for year in range(1846,1866):
    print(year,end=': ')
    print(' '.join([name.split('_')[0] 
                    for name,count 
                    in get_entity_counts_for_year(year).most_common(5)]))

1846: American God Alliance Rev United States
1847: God American Mexico Boston United States
1848: God South North Senate Boston
1849: God South Boston American Congress
1850: God South Boston Webster North
1851: God American Boston America England
1852: God Boston American South Constitution
1853: God Boston Mann American Constitution
1854: God Boston Congress American South
1855: God Union South American Constitution
1856: Kansas God South Boston North
1857: God South Boston North Kansas
1858: God Massachusetts Boston S.A. Allen Constitution
1859: Massachusetts God Boston South Brown
1860: God Boston Rev Constitution New York
1861: North God South Union Boston
1862: God South North Union Boston
1863: God North South Boston England
1864: Lincoln North God Union South
1865: South God United States North States


# Pointwise mutual information 

The top names appear to be dominated by a few recurring items: God, American, South, etc.

Remember that we addressed this problem before by using pointwise mutual information (PMI). As a reminder, here is the formula, where $i$ is an entity name and $j$ is a year:

\begin{equation}
\text{PMI}(i,j) = \log \frac{P(i,j)}{P(i)P(j)} = \log \frac{P(i \mid j) P(j)}{P(i) P(j)} = \log P(i \mid j) - \log P(i)
\end{equation}
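
To make the formula concrete, here is a small worked example with made-up counts (they do not come from The Liberator). A name plays the role of $i$ and a single year plays the role of $j$; `toy_total`, `toy_year`, and `toy_pmi` are names invented for this sketch.

```python
import math
from collections import Counter

# made-up counts: mentions over all years, and mentions within one year
toy_total = Counter({'God': 60, 'Boston': 30, 'Kansas': 10})
toy_year = Counter({'God': 10, 'Boston': 5, 'Kansas': 5})

def toy_pmi(name):
    # log P(i | j): share of this year's mentions that go to the name
    log_p_i_given_j = math.log(toy_year[name] / sum(toy_year.values()))
    # log P(i): share of all mentions, over all years, that go to the name
    log_p_i = math.log(toy_total[name] / sum(toy_total.values()))
    return log_p_i_given_j - log_p_i

for name in toy_year:
    print(name, round(toy_pmi(name), 3))
```

"Kansas" gets a positive PMI because its share of mentions in the toy year (25%) is higher than its share overall (10%), while "God" gets a negative PMI despite being the most frequent name.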

We can compute this directly from the Counter objects. First let's compute the second term, $\log P(i)$, which is the log probability of each entity name over all years in the dataset.

In [122]:
total_counts = Counter()
for year in range(1846,1866):
    total_counts += get_entity_counts_for_year(year)

In [135]:
total_counts.most_common(10)

[('God_PERSON', 2716),
 ('South_LOCATION', 1710),
 ('Boston_LOCATION', 1668),
 ('North_LOCATION', 1428),
 ('American_MISC', 1314),
 ('United States_LOCATION', 1195),
 ('Union_ORGANIZATION', 1176),
 ('Massachusetts_LOCATION', 1131),
 ('Constitution_ORGANIZATION', 1024),
 ('Congress_ORGANIZATION', 990)]

In [143]:
total_counts.items()

dict_items([('Court House_LOCATION', 10), ('ENTLEMENS_ORGANIZATION', 1), ('James G. Hicks_PERSON', 1), ('N.D. Anderson_PERSON', 1), ('Alexander_PERSON', 12), ('Hicks_PERSON', 5), ('Beard_ORGANIZATION', 2), ('Benj_PERSON', 9), ('W. Shackfett_PERSON', 1), ('Joseph Atwell_PERSON', 2), ('Daniel Lane_PERSON', 2), ('Bradenburg_PERSON', 1), ('Constitution_ORGANIZATION', 1024), ('State of Ken-tucky_MISC', 1), ('Henry P. Byram_PERSON', 1), ('Byram_LOCATION', 1), ('Byram_PERSON', 3), ('Benjamin W. Shacklett_PERSON', 1), ('P.M. N.D. ANDERSON_ORGANIZATION', 1), ('Pres_PERSON', 41), ('Sec_LOCATION', 26), ('LEXANDER Sec_ORGANIZATION', 2), ('Presidentat_ORGANIZATION', 1), ('EARD_LOCATION', 1), ('SHACKLETT_PERSON', 1), ('HACKLETT_LOCATION', 1), ('JOSEPH ATWELL_PERSON', 1), ('OSEPH TWELL_LOCATION', 1), ('DANIEL LANE_PERSON', 1), ('ANIEL ANE_ORGANIZATION', 1), ('Committee_ORGANIZATION', 347), ('Brandenburg_LOCATION', 1), ('N.D. ANDERSON_LOCATION', 1), ('Boston_LOCATION', 1668), ('Sears_ORGANIZATION', 11

In [123]:
import numpy as np # we need this for log

In [124]:
log_pi = {name:np.log(count) for name,count in total_counts.items()}

In [125]:
[(name,log_pi[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', 7.9069154886785871),
 ('South_LOCATION', 7.4442486494967053),
 ('Boston_LOCATION', 7.4193805829186923),
 ('North_LOCATION', 7.2640301428995295),
 ('American_MISC', 7.1808311990445555)]

These are the log counts. Note that they are positive, which means that they can't be log probabilities! $(p(x) \leq 1 \Leftrightarrow \log p(x) \leq 0)$.

We'll fix this by subtracting $\log N$, where $N$ is the sum of all counts.

In [126]:
tot_log_N = np.log(sum(total_counts.values()))
log_pi = {name:np.log(count) - tot_log_N
          for name,count in total_counts.items()}

In [144]:
total_counts.values()

dict_values([10, 1, 1, 1, 12, 5, 2, 9, 1, 2, 2, 1, 1024, 1, 1, 1, 3, 1, 1, 41, 26, 2, 1, 1, 1, 1, 1, 1, 1, 1, 347, 1, 1, 1668, 11, 45, 1, 990, 1, 105, 146, 6, 3, 1, 122, 4, 118, 3, 1, 6, 79, 1314, 37, 686, 17, 60, 193, 38, 1, 82, 2, 2716, 307, 3, 18, 8, 300, 170, 1, 1, 5, 1, 224, 21, 13, 5, 18, 42, 663, 59, 6, 54, 1, 345, 121, 1, 1, 71, 42, 116, 3, 1, 1, 1, 37, 42, 2, 1, 8, 1, 1, 1, 1, 2, 1, 26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 14, 60, 23, 9, 1, 1, 1, 20, 1, 51, 1, 1, 9, 1, 750, 45, 5, 12, 1, 2, 1, 1, 1, 4, 189, 1, 46, 4, 8, 23, 1, 1, 36, 1, 1, 42, 1, 1, 1, 10, 62, 1, 101, 10, 1195, 1, 1, 3, 1, 83, 6, 132, 1, 3, 14, 11, 362, 1, 58, 143, 1, 2, 1, 6, 1, 1, 2, 84, 1, 1, 1, 1, 1, 7, 1, 4, 3, 43, 1, 39, 1, 15, 41, 23, 6, 16, 1, 1, 3, 1, 27, 845, 2, 1, 2, 8, 53, 1, 25, 1, 315, 8, 1, 1, 1, 20, 1, 64, 1, 7, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 6, 1, 314, 87, 3, 1, 3, 1, 116, 1, 1, 1, 142, 12, 3, 39, 1, 1, 12, 378, 1, 269, 1, 12, 357, 1, 3, 177, 1, 2, 244, 114, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4

In [145]:
tot_log_N

11.877103192505068

In [146]:
log_pi

{'Court House_LOCATION': -9.5745180995110211,
 'ENTLEMENS_ORGANIZATION': -11.877103192505068,
 'James G. Hicks_PERSON': -11.877103192505068,
 'N.D. Anderson_PERSON': -11.877103192505068,
 'Alexander_PERSON': -9.3921965427170679,
 'Hicks_PERSON': -10.267665280070968,
 'Beard_ORGANIZATION': -11.183956011945122,
 'Benj_PERSON': -9.6798786151688478,
 'W. Shackfett_PERSON': -11.877103192505068,
 'Joseph Atwell_PERSON': -11.183956011945122,
 'Daniel Lane_PERSON': -11.183956011945122,
 'Bradenburg_PERSON': -11.877103192505068,
 'Constitution_ORGANIZATION': -4.9456313869056148,
 'State of Ken-tucky_MISC': -11.877103192505068,
 'Henry P. Byram_PERSON': -11.877103192505068,
 'Byram_LOCATION': -11.877103192505068,
 'Byram_PERSON': -10.778490903836959,
 'Benjamin W. Shacklett_PERSON': -11.877103192505068,
 'P.M. N.D. ANDERSON_ORGANIZATION': -11.877103192505068,
 'Pres_PERSON': -8.163531125800759,
 'Sec_LOCATION': -8.6190066544835862,
 'LEXANDER Sec_ORGANIZATION': -11.183956011945122,
 'Presidentat

In [127]:
[(name,log_pi[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -3.9701877038264808),
 ('South_LOCATION', -4.4328545430083626),
 ('Boston_LOCATION', -4.4577226095863756),
 ('North_LOCATION', -4.6130730496055383),
 ('American_MISC', -4.6962719934605124)]

In [147]:
# exps of log probabilities should sum to one
sum(np.exp(val) for val in log_pi.values())

1.0000000000019544

Better! 

Now, to compute $\log P(i \mid j)$, you just need to do the same operation, but with the counter for each specific year.

**Your turn**: fill in the function below, which should return a dict of names and log probabilities.

In [148]:
# P(i | j) is the probability of entity name i given year j:
# (count of i in year j) / (count of all entity mentions in year j).
# This is the same computation as log_pi above, restricted to a single year.
def get_log_pij(year):
    counts = get_entity_counts_for_year(year)
    tot_log_counts = np.log(sum(counts.values()))
    log_pij = {name:np.log(count) - tot_log_counts for name,count in counts.items()}
    return log_pij

In [149]:
#test
[(name,get_log_pij(1859)[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -4.2038142706300592),
 ('South_LOCATION', -4.7726637345124274),
 ('Boston_LOCATION', -4.2722820699075204),
 ('North_LOCATION', -4.9268144143396864),
 ('American_MISC', -5.0560261458196925)]

In [227]:
# desired output
[(name,get_log_pij(1859)[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -4.2038142706300592),
 ('South_LOCATION', -4.7726637345124274),
 ('Boston_LOCATION', -4.2722820699075204),
 ('North_LOCATION', -4.9268144143396864),
 ('American_MISC', -5.0560261458196925)]

In [150]:
# check that the probabilities still sum to one
sum(np.exp(val) for val in get_log_pij(1859).values())

0.99999999999987066

Now we can compute the PMI. Note that we need only compute PMI for names that appear in the year; it's undefined ($\log 0$) for names that do not.

In [151]:
def get_PMI(year):
    log_pij = get_log_pij(year)
    pmi = {name:log_pij_name - log_pi[name]
           for name,log_pij_name
           in log_pij.items()}
    return pmi

In [152]:
[(name,get_PMI(1859)[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -0.23362656680357841),
 ('South_LOCATION', -0.33980919150406486),
 ('Boston_LOCATION', 0.18544053967885521),
 ('North_LOCATION', -0.31374136473414804),
 ('American_MISC', -0.35975415235918007)]

## Top names by PMI per year

Now we'll call your function to get the top names by PMI for each year. 

We'll focus on names among the 500 most common overall.

In [155]:
# a set makes the membership test in the loop below fast
top500 = set(name for name,count in total_counts.most_common(500))

In [156]:
for year in range(1846,1866):
    pmi_year = get_PMI(year)
    pmi_year_filtered = {name:pmi for name,pmi in pmi_year.items() if name in top500}
    top_names = sorted(pmi_year_filtered,key=pmi_year_filtered.get,reverse=True)[:5]
    print(year,'\t'.join(name.split('_')[0] for name in top_names))

1846 Alliance	Mexicans	Edmund Quincy	Glasgow	Briggs
1847 Campbell	Douglass	Stanton	Wright	Mexico
1848 Clarkson	Foote	Palfrey	Mexicans	Hale
1849 Mathew	Assembly	General Assembly	O'Connell	Bridgewater
1850 Pres	Webster	New Mexico	Cass	Bill
1851 Sims	Englishmen	Unitarians	Morris	Bristol
1852 Kossuth	SEC	Syracuse	Hungary	Hungarian
1853 MANN	Mann	Foss	Cabin	Tom
1854 Nebraska	Smith	Indiana	BOSTON	West Indies
1855 B.	American Union	Fugitive Slave Bill	Spanish	Committee of Arrangements
1856 Brooks	Lawrence	Bro	Kansas	Southerners
1857 Stephens	Tract Society	Melodeon	Executive Committee	Cheever
1858 S.A. Allen	Loring	Zylobalsamum	Hair Restorer	World 's Hair Restorer
1859 Oregon	John Brown	Charlestown	Brown	Harper
1860 Ford	Cheever	Sumner	United States Constitution	Newport
1861 Agent	Mayor	Tremont Temple	Morris	Seward
1862 Courier	McClellan	Richmond	Committee of Arrangements	SEC
1863 Hinna	D.D.	Sims	McClellan	Tremont Temple
1864 Conway	Lincoln	Administration	Lincoln	Butler
1865 Cox	Lee	S.C.	Johns

These are honestly not so great -- lots of typos and capitalization issues. 

There are also a lot of mentions of other newspapers, which might make sense, depending on when those newspapers ran and whether The Liberator frequently borrowed from them.

Future work on this data could play with smoothing, try TF-IDF instead of PMI, or work with the count of unique stories or issues in which each name appears, rather than the raw counts.
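
As a starting point for the last suggestion, the sketch below counts the number of stories in which each name appears, rather than the number of mentions. It assumes each story is available as a list of entity strings (for example, the lines of one `.ne` file). The function name `story_frequency` and the toy data are made up for illustration.

```python
from collections import Counter

def story_frequency(stories):
    """Count, for each name, the number of stories that mention it at least once."""
    freq = Counter()
    for entities in stories:
        freq.update(set(entities))  # set() so repeats within a story count once
    return freq

# toy stories, each a list of entity strings
toy_stories = [
    ['States_LOCATION', 'States_LOCATION', 'Brown_PERSON'],
    ['States_LOCATION'],
    ['Brown_PERSON', 'Virginia_LOCATION', 'Brown_PERSON'],
]
toy_freq = story_frequency(toy_stories)
print(toy_freq['States_LOCATION'], toy_freq['Brown_PERSON'], toy_freq['Virginia_LOCATION'])
```

A name that is repeated many times in one story (an advertisement, say) then counts no more than a name mentioned once.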