# Entity Extraction of Pentagon Papers

In this notebook we'll use [```spaCy```, a Python module that does natural language processing,](https://urldefense.com/v3/__https://spacy.io/usage/models__;!!DZ3fjg!qvPffHhHBJIodW8Tf1IP52IiCn_x6Tj9b547HUdN1URaf5JTSz-yueHuvNkYDRWVgsk$ ) including part-of-speech tagging and named entity recognition (NER).

In [2]:
!pip install spacy                          
# this gets the python module itself



In [3]:
!python -m spacy download en_core_web_sm    
# this gets a particular English-lang model
# if this doesn't work try saying !python3 (everything else the same)
# or run the cell below and check what version you're running, and then say
# !python3.7, or !python3.8 (with everything else the same) as appropriate.

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
from platform import python_version
print(python_version())

3.8.8


In [5]:
import spacy
from spacy import displacy
from pprint import pprint
from collections import Counter
import en_core_web_sm
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [6]:
engl_nlp = en_core_web_sm.load()

### SECTION 0: Questions to address about how to work with spaCy


1. Do I need to redo the text conversion so it keeps in the newline character for the document so I can have the entity extraction go line by line for a document and then total all the information up afterward? That way I can trace back exactly which sentence an individual entity came from.
    
    Answer: No, try it by dividing on the blank space and chunking per 300 characters

2. How do I extract either a structured/unstructured document or array with all the information of a single document so I can evaluate it? I would like to be able to have each document as an individual unit and then be able to compound everything into a single document to run analysis on if I want. 

    Answer: ????

### SECTION 1: Ingest and Chunk the File



In [7]:
#prepare the document for entity extraction by chunking it into every 300 words
transformed_doc = []
combined_word_lists = []

path = Path('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-Index.txt')
input_doc = open(path, encoding = 'utf-8').read()

words = input_doc.split()

for i in range(0, len(words), 300):
    chunk_words = words[i: i+300]
    combined_word_lists.append(chunk_words)

print(combined_word_lists)

[['Final', 'Report', 'OSD', 'Vietnam', 'Task', 'Force', '(u)', 'w1', 'Inclosure', 'SecDefControl#', '-029569', 'CONFIDENTIAL', 'UNCLASSIFIED', 'WHEN', 'SEPARATED', 'FROM', 'INCLOSURJ', 'INCLOSURE', '#', '1', 'Document', 'Subject', 'OSD', 'Vietnam', 'Task', 'Force', 'Outline', 'of', 'Studies', '(U)', 'dtd', '10', 'January', '1969', 'n', 'CONFIDENTIAL', 'REMAINS', 'CONFIDENTIAL', 'NOTHING', 'F0LL0WS', 'Document', 'was', 'forwarded', 'to', 'both', 'Mr.', 'Bundy', 'and', '1', 'If', 'you', 'have', 'any', 'questions', 'concerning', 'this', 'regj', 'SFC', 'william', 'C.', 'Holzer', 'US', 'Army', 'Chief', 'Clerk', 'J', 'The', 'Pentagon.', 'Phone', '0X-76131', 'lr.', 'Katzenback', 'ading', 'action', 'p', 'ecDefClassifi', 'Ln', '1969', 'Lease', 'contact', 'ad', 'Control', 'Sect', 'ion', 'Rm3A948', 'DEPARTMENT', 'JUL', '1', '5', 'OFFICE', 'OF', 'S', 'h', 'STATE', '1371', 'iCURITY', 'The', 'document(e)', 'listed', 'above', 'haehave', 'been', 'regraded', 'and', 'action', 'should', 'be', 'taken', 't

In [8]:
#make every 300 words into a single string that is an individual element within the list 'total_lists'
total_lists = []
for list in combined_word_lists: 
    list_into_string = " ".join([word for word in list])
    total_lists.append(list_into_string)
    
print(total_lists)

['Final Report OSD Vietnam Task Force (u) w1 Inclosure SecDefControl# -029569 CONFIDENTIAL UNCLASSIFIED WHEN SEPARATED FROM INCLOSURJ INCLOSURE # 1 Document Subject OSD Vietnam Task Force Outline of Studies (U) dtd 10 January 1969 n CONFIDENTIAL REMAINS CONFIDENTIAL NOTHING F0LL0WS Document was forwarded to both Mr. Bundy and 1 If you have any questions concerning this regj SFC william C. Holzer US Army Chief Clerk J The Pentagon. Phone 0X-76131 lr. Katzenback ading action p ecDefClassifi Ln 1969 Lease contact ad Control Sect ion Rm3A948 DEPARTMENT JUL 1 5 OFFICE OF S h STATE 1371 iCURITY The document(e) listed above haehave been regraded and action should be taken to mark copies quested that you notify all recipients to whom additional distribution was furnished. T. B. EDWARDS MAJ USA Control Officer Prlnfad or typad name ot official Signature e n F0RM 7 M 1 MAY 60 CJ ! kJ G4 4 7 6 3 19451967VIETNAM TASK FORCE OFFICE OF THE SECRETARY OF DEFENSE UNITED STATES VIETNAM RELATIONSSec Def f

In [9]:
#entity extraction on a single text file alone
#based off of Ted's lab

ent_types = dict()

for i in total_lists: #for each string in the list
    doc = engl_nlp(i)
        
    for entity in doc.ents:           # for each character, go through all the entities
        label = entity.label_         # get their labels
        if label not in ent_types:    # make sure there's a key for this label in the dictionary
            ent_types[label] = Counter()      # each label key points to a Counter for examples
        text = entity.text
        ent_types[label][text] += 1 

In [10]:
#what is the amount of the entities in that certain chunk? 
for etype, examples in ent_types.items():
    print(etype, len(examples))

PRODUCT 3
DATE 41
PERSON 18
CARDINAL 10
ORG 35
GPE 15
EVENT 2
FAC 1
LAW 3
ORDINAL 1
NORP 4
LOC 1
WORK_OF_ART 1


In [11]:
# print(ent_types["DATE"])
print(ent_types['ORG'])

Counter({'OSD': 3, 'the Task Force': 2, 'U.S. Forces': 2, 'US Army': 1, 'Control Sect': 1, 'Rm3A948 DEPARTMENT JUL 1 5': 1, 'CJ': 1, 'TASK FORCE OFFICE OF THE SECRETARY OF DEFENSE': 1, 'ASDISA': 1, 'CIA': 1, 'State Department': 1, 'House': 1, 'GVN': 1, 'Chronology': 1, 'GELB': 1, 'U.S. Involvement': 1, 'A. U.S. France': 1, 'A. U.S. Military Planning': 1, 'Obligations of State': 1, 'NATO': 1, 'SEATO A': 1, 'Vietnamese National Army': 1, 'Strategic Hamlet Program I96I-63': 1, 'C. Direct Action': 1, 'U.S. Programs': 1, 'Military Pressures Against NVN': 1, 'ROLLING THUNDER Program Begins': 1, 'U.S.-GVN Relations': 1, 'A. Public Statements': 1, 'The Roosevelt Administration 2': 1, 'The Truman Administration': 1, 'The Eisenhower Administration': 1, 'The Kennedy Administration': 1, 'The Public Record B. Negotiations': 1, 'Announced Position Statements C. Histories of Contacts': 1})


In [12]:
ent_types

{'PRODUCT': Counter({'Task Force': 3,
          'Task Force Outline': 1,
          'the Task Force': 1}),
 'DATE': Counter({'10 January 1969': 1,
          '1969': 1,
          '1371': 1,
          '15 January 1969': 1,
          'June 17 19&7': 1,
          'three months': 2,
          'A year and a half later': 1,
          'forty-three': 1,
          'a month': 1,
          'six months': 1,
          'an average of four months': 1,
          'the years 1945 to 1961': 1,
          '1961': 1,
          '11069': 1,
          '1940-1950': 1,
          '1940-50': 1,
          '1950-1954': 1,
          'Vol': 1,
          '1954-1960': 1,
          '1950-541 3': 1,
          '1954-56 4': 1,
          '1954-59 5': 1,
          '1961 2': 1,
          '1961-67 4': 1,
          '1962-64 5': 1,
          'May-Nov. 1963': 1,
          '1964-1968': 1,
          'November l963-April 1965': 1,
          '1964 3': 1,
          'March 1965 5': 1,
          '1965 6': 1,
          '19651967': 1,
      

In [13]:
# df = pd.DataFrame.from_dict(ent_types['ORG'], orient = 'index')
# df

Index_ORG_df = pd.DataFrame(ent_types['ORG'].items(), columns =["Organizations", "Organization_Count"])
Index_ORG_df.sort_values(by=["Organizations"], inplace = True)
Index_ORG_df.reset_index(drop=True)

# https://stackoverflow.com/questions/51424453/adding-list-with-different-length-as-a-new-column-to-a-dataframe

Unnamed: 0,Organizations,Organization_Count
0,A. Public Statements,1
1,A. U.S. France,1
2,A. U.S. Military Planning,1
3,ASDISA,1
4,Announced Position Statements C. Histories of ...,1
5,C. Direct Action,1
6,CIA,1
7,CJ,1
8,Chronology,1
9,Control Sect,1


In [14]:
for phrase in doc.ents:
    print(phrase.text, phrase.label_)

1950-1952 3 DATE
The Eisenhower Administration ORG
1953 DATE
Geneva GPE
Accords15 GPE
March 1956 DATE
French withdrawal1960 4 LAW
The Kennedy Administration ORG
Book I Book II VI Settlement of the Conflict (6 Vols WORK_OF_ART
1965-67 DATE
The Public Record B. Negotiations ORG
1965-67 DATE
Announced Position Statements C. Histories of Contacts ORG
1 CARDINAL
2 CARDINAL
3 CARDINAL
4 CARDINAL
1965-1966 DATE
Polish NORP
Moscow GPE
1967-1968 DATE
LESLIE H. CffiLB PERSON
OSD Task Force PERSON


<H2> Main Data Frame </H2>
    
Notes: The table does not have the text file "Pentagon-Papers-Part-IV-C-10.txt" because this entire file is all statistical reports and maps that do not translate to NLP. This leaves the entire text file corpus at 48 documents, each located in its entirity in the text column in the df. 
    
Overall_years refers to the span of dates that the content within is listed to pertain to. This information was found via the index (first row of df). Some documents did not have listed dates, so the information was found by searching through the body of the document, and often was found in the files chronology section.
    
Specific_dates is a more granular look at the time span a particular document refers to than overall_years. Most of the information is in month/year-month/year format, but for documents where it refered to a very specific start and end dates (such as Part-V-A which documents the U.S. governments official statements of position on the war to the public) then the date format is month/day/year-month/day/year. 

In [15]:
#use pwd to see where I am currently in the directory if I am having trouble identifying where I am
path = Path('full_txt_document.csv')
main_df = pd.read_csv(path, sep = ',')

In [16]:
main_df = main_df.drop(["Unnamed: 0"], axis=1)
main_df

Unnamed: 0,name,title,overall_years,specific_dates,text
0,Index,Index,1945-1967,1945-1967,Final Report OSD Vietnam Task Force (u) w1 In...
1,Part-I,Vietnam and the U.S.,1940-1950,1940-1950,EXECUTIVE SECRETARIAT TLE JJ RMM KDA ; r SS-...
2,Part-2,U.S. Involvement in the Franco-Viet Minh War,1950-1954,1950-1954,II U.S. Involvement in the Franco-Viet Minh W...
3,Part-3,The Geneva Accords,1954,1954,III The Geneva Accords 1954 A. U.S. Military ...
4,Part-IV-A-1,NATO and SEATO - A Comparison,1954-1960,1954-1960,", IwA Evolution of the War (26 Vols.) U.S. MAP..."
5,Part-IV-A-2,Aid for France in Indochina,1950-1954,1950-1954,A Evolution of the War (26 Vols.) U.S. MAP for...
6,Part-IV-A-3,U.S. and France's Withdrawl from Vietnam,1954-1956,1954-1956,A Evolution of the War (26 Vols.) U.S. MAP for...
7,Part-IV-A-4,U.S. Training of Vietnamese National Army,1954-1959,1954-1959,", Iw A Evolution of the War (26 Vols.) U.S. MA..."
8,Part-IV-A-5,Origins of the Insurgency,1954-1960,1954-1960,", IwA Evolution of the War (26 Vols.) U.S. MAP..."
9,Part-IV-B-1,The Kennedy Commitments and Programs,1961,1961,", IwB Evolution of the War (26 Vols.) Counteri..."


<H2> Functions to Chunk the Document, Extract the Entities, and Create a DataFrame from the Entity Dictionary</H2>

In [17]:
#function to prepare the document for extraction
#takes input of the file path, chunks into 300 words, and combines those 300 words into one string

def prepare_document(file_path):
    combined_word_lists = []
    total_lists = []

    path = Path(file_path)
    input_doc = open(path, encoding = 'utf-8').read()

    words = input_doc.split()

    for i in range(0, len(words), 300):
        chunk_words = words[i: i+300]
        combined_word_lists.append(chunk_words)
    
    #make every 300 words into a single string that is an individual element within the list 'total_lists'
    
    for list in combined_word_lists: 
        list_into_string = " ".join([word for word in list])
        total_lists.append(list_into_string)

    return total_lists

In [18]:
#function that does entity extraction on the prepared document using spaCy
#input: formatted document from prepare_document
#output: entity extraction dictionary

def entity_extraction(individual_document):
    ent_types = dict()

    for i in total_lists: #for each string in the list
        doc = engl_nlp(i)

        for entity in doc.ents:           # for each character, go through all the entities
            label = entity.label_         # get their labels
            if label not in ent_types:    # make sure there's a key for this label in the dictionary
                ent_types[label] = Counter()      # each label key points to a Counter for examples
            text = entity.text
            ent_types[label][text] += 1 
    
    return ent_types

In [19]:
def org_entity(document_dictionary):

    # for each dictionary, create a new df
    new_df_org = pd.DataFrame(ent_types['ORG'].items(), columns =["Organizations", "Count"])
    return new_df_org

In [20]:
def person_entity(document_dictionary):

    # for each dictionary, create a new df
    new_df_person = pd.DataFrame(ent_types['PERSON'].items(), columns =["Person", "Count"])
    
    return new_df_person

In [31]:
def trial_org_entity(document_dictionary):
    
    new_df_org = pd.DataFrame(ent_types['ORG'].items(), columns =["Organizations", "Count"])
    new_df_org.sort_values(by=["Organizations"], inplace = True)
    part_III_orgs_alpha = new_df_org
    new_df = part_III_orgs_alpha.reset_index(drop=True)
    
    return new_df

<H2> Formatting DF's for Organizations </H2>

In [22]:
part_II_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-II.txt')

In [23]:
part_II_dictionary_org = entity_extraction(part_II_formatted_org)

In [24]:
new_df_org = org_entity(part_II_dictionary_org)
new_df_org.sort_values(by=["Organizations"], inplace = True)
new_df_org

Unnamed: 0,Organizations,Count
28,A. Public Statements,1
15,A. U.S. France,1
16,A. U.S. Military Planning,1
5,ASDISA,1
34,Announced Position Statements C. Histories of ...,1
23,C. Direct Action,1
8,CIA,1
3,CJ,1
12,Chronology,1
1,Control Sect,1


In [25]:
part_II_df = new_df_org.reset_index(drop=True)
part_II_df

Unnamed: 0,Organizations,Count
0,A. Public Statements,1
1,A. U.S. France,1
2,A. U.S. Military Planning,1
3,ASDISA,1
4,Announced Position Statements C. Histories of ...,1
5,C. Direct Action,1
6,CIA,1
7,CJ,1
8,Chronology,1
9,Control Sect,1


In [26]:
part_III_formatted = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-III.txt')
part_III_dictionary = entity_extraction(part_III_formatted)
part_III_orgs = org_entity(part_III_dictionary)
part_III_orgs.sort_values(by=["Organizations"], inplace = True)
part_III_orgs_alpha = part_III_orgs
part_III_df = part_III_orgs_alpha.reset_index(drop=True)
print(len(part_III_df))

35


In [27]:
part_index_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-Index.txt')
part_index_dictionary_org = entity_extraction(part_index_formatted_org)
part_index_org = org_entity(part_index_dictionary_org)
part_index_org.sort_values(by=["Organizations"], inplace = True)
part_index_org_alpha = part_index_org
part_index_df_org = part_index_org_alpha.reset_index(drop=True)
print(len(part_index_df_org))

35


In [32]:
part_iv_a_1_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-IV-A-1.txt')
part_iv_a_1_dictionary_org = entity_extraction(part_iv_a_1_formatted_org)
trial_org_entity(part_iv_a_1_dictionary_org)

Unnamed: 0,Organizations,Count
0,A. Public Statements,1
1,A. U.S. France,1
2,A. U.S. Military Planning,1
3,ASDISA,1
4,Announced Position Statements C. Histories of ...,1
5,C. Direct Action,1
6,CIA,1
7,CJ,1
8,Chronology,1
9,Control Sect,1


In [22]:
# concat dictionary to existing main dictionary OR merge on index
#main_df= pd.concat([main_df, new_df], axis = 1)
   

# Trying to figure out how to deal with the different lengths: 
# https://www.geeksforgeeks.org/how-to-merge-dataframes-of-different-length-in-pandas/
# https://stackoverflow.com/questions/51424453/adding-list-with-different-length-as-a-new-column-to-a-dataframe

<H2>Entity Extraction Text from Ted's Jupyter Notebook <H2>

In [23]:
# This may take a few minutes.

ent_types = dict()      # initialize a dictionary

for line in chars['lines'][0:80]:     # just go through the first eighty characters
    doc = engl_nlp(line)

    for entity in doc.ents:           # for each character, go through all the entities
        label = entity.label_         # get their labels
        if label not in ent_types:    # make sure there's a key for this label in the dictionary
            ent_types[label] = Counter()      # each label key points to a Counter for examples
        text = entity.text
        ent_types[label][text] += 1   # count the number of times we see each example

NameError: name 'chars' is not defined

It can also use context to identify certain words and phrases as "named entities"--a loose category that includes proper nouns, quantities, and references to particular quantities of money or lengths of time.

In [None]:
pprint([(X.text, X.label_) for X in doc.ents])

There's a nice built-in mode of visualizing entities, too.

In [None]:
displacy.render(doc, style = 'ent', jupyter = True)

How many types of entities did we find?

In [None]:
for etype, examples in ent_types.items():
    print(etype, len(examples))

To get an explanation of a particular label we can ask ```spaCy``` to ```explain()```:

In [None]:
spacy.explain('NORP')

In [None]:
ent_types['EVENT']

So Katrina occurred several times in these characters' lines, and ```spaCy``` inferred that it was the hurricane, not a person called Katrina. Whether it inferred correctly is hard to know from this evidence. But we can check by tracking the occurrences down in the data. In fact, these occurrences of Katrina are coming from the movie *Juno,* where Katrina was a person, not a hurricane! Mistakes do happen.

If you've got any questions about the meanings of tags, use ```spaCy```'s explain() function to inquire about them. It may also help to explore the examples in ```ent_types.``` Or you can look at an example from a movie script, like the lines from a gangster movie below.

In the example below, NER works pretty well, although the model may occasionally get confused whether a mob boss is a person, a place, or the name of an organization. Admittedly, in the mob it gets a little blurry.

**We'll stop here for discussion.**

In [None]:
doc = engl_nlp(chars.loc[200, 'lines'])
displacy.render(doc, style = 'ent', jupyter = True)

### SECTION 2: What can you do with NER?

Actually running named entity recognition across a large corpus can be a bit slow; to produce the data we'll use here took me a couple of hours. It didn't make sense to ask you to duplicate all that processing. If you can get the cells above to run, you can also get NER to run across a larger corpus.

Instead, let's ask what it's good for!

To explore this, I've created a table of extracted entities, ```book_entities.tsv.``` You'll need to get this from the Moodle and place it in your is417/data folder; it's a little too large to share via GitHub.

In [None]:
bookents = pd.read_csv('../../data/book_entities.tsv', sep = '\t')
bookents.head()

We also have a metadata file, ```entity_metadata.tsv.``` This I've placed on GitHub, so it should appear along with this notebook.

In [None]:
meta = pd.read_csv('../../data/entity_metadata.tsv', sep = '\t')
meta.head()

Let's write some code that measures the frequency of a given type of entity in each book and then produces a scatterplot where the x axis is date of composition, the y axis is the frequency of that type of entity (measured as entities per thousand words), and each dot in the plot is a single book.

Your code should:

1. allow the user to enter an entity label, like ```TIME``` or ```MONEY.```

2. create a dataframe that has only entities of the desired type. 

3. Group entities by ```book_id``` and ```.sum()``` the counts of entities for each book.

4. Then join the ```composition_date,``` ```wordcount,``` and ```genre``` columns to your dataframe of summed counts,

5. Produce a ```frequency``` column by dividing entity counts by (wordcounts/1000), and finally

6. Graph entity frequency (composition_date will be the horizontal axis, and frequency will be the vertical axis).

Once you've got this working, you may want to limit the final graph by excluding biographies (```genre == 'bio'```).

**Collaborate in breakout groups to get the code working, and then discuss the observed patterns.**

Do some entities increase or decrease in frequency across time?

What hypotheses might explain the patterns you observe?

What kinds of *error* might be explaining these patterns? (Note that these are not optically transcribed texts; they're from the Gutenberg Project. So we don't have to worry about optical transcription error. But that's not the only kind of error in the world ...)

In [None]:
all_types = set(bookents['entity_type'])
meta = meta.set_index('book_id')

In [None]:
from scipy.stats import pearsonr

In [None]:
#desired_label = input('What label? ')

#if desired_label not in all_types:
    #print("That doesn't exist.")
    
#else:
for desired_label in all_types:
    print(desired_label)
    ents = bookents.loc[bookents['entity_type'] == desired_label, : ]
    
    groupedbybook = ents.groupby('book_id', as_index = False).sum()  # group the entities by book
    
    groupedbybook = groupedbybook.join(meta.loc[ : , ['title', 'compositionyear',
                                                      'wordcount', 'genre']], on = 'book_id')  # join metadata
    
    groupedbybook = groupedbybook.loc[groupedbybook['genre'] == 'fic', : ]     # exclude everything but fiction
    
    groupedbybook = groupedbybook.assign(frequency = groupedbybook['count'] / groupedbybook['wordcount'])
    
    sns.scatterplot(x = groupedbybook['compositionyear'], y = groupedbybook['frequency'])
    plt.show()
    
    # let's get pearson correlation for good measure
    rval, pval = pearsonr(groupedbybook['compositionyear'], groupedbybook['frequency'])
    print('Correlation with time: ', rval)
    print('p value of correlation: ', pval)

In [None]:
groupedbybook.head()

In [None]:
groupedbybook.head()

### SECTION 3: A specific question

In the process of exploring changes in different entities, you probably noticed that the frequencies of ```DATE``` and ```TIME``` entities are both going up across the timeline.

What could explain this?

Are people just getting more interested in time generally?

The British historian E. P. Thompson wrote a famous essay on ["Time, Work-Discipline, and Industrial Capitalism,"](https://urldefense.com/v3/__https://www.sv.uio.no/sai/english/research/projects/anthropos-and-the-material/Intranet/economic-practices/reading-group/texts/thompson-time-work-discipline-and-industrial-capitalism.pdf__;!!DZ3fjg!qvPffHhHBJIodW8Tf1IP52IiCn_x6Tj9b547HUdN1URaf5JTSz-yueHuvNkYrySM11A$ ) which argues that new forms of work associated with the industrial revolution tended to reorganize our conception of time around clocks and watches. Work was no longer organized by the task ("we'll work until the harvest's in") but by the clock ("eight to six, with a thirty-minute break for lunch").

If this explanation is right, we might expect the ```TIME``` entities to increase even more than the ```DATE``` entities do.

One way to think about this would be to calculate the ratio of TIME references to DATE references for all the works in our dataset, and graph the ratio on the vertical (y) axis, with date of composition on the horizontal (x) axis. We might need to use Laplacian smoothing to avoid division by zero.

Is there a relationship?

**If we don't get to this in class, it might become homework.**

**If we do get to this in class, we'll frame a homework question about the difference between fiction and biography.**

### SECTION 4: What else could we do with entity extraction?

Matt Wilkens' essay on the time lag shaping literary history illustrates another influential way of using named entity recognition: *geoparsing,* which translates the place names in a text into latitude and longitude coordinates to plot on a map. We won't have time to explore that method in this lab, but there are lots of tools available to help you; e.g. [the ```geopy``` and ```geopandas``` modules make things pretty easy.](https://urldefense.com/v3/__https://towardsdatascience.com/geoparsing-with-python-c8f4c9f78940)*5Cn__;JQ!!DZ3fjg!qvPffHhHBJIodW8Tf1IP52IiCn_x6Tj9b547HUdN1URaf5JTSz-yueHuvNkY4itsykU$ 
Something we can do without installing a lot of additional Python modules is, dig more deeply into descriptions of time. The dividing line between ```DATE``` and ```TIME``` within ```spaCy``` is arbitrary. The people who wrote the software decided to separate entities that are smaller than a day from those that describe a whole day or a longer unit of time.

We don't really have a principled reason to rely on that binary division.

Also, some date and time references describe lengths of time, but some are just references to a particular hour of the day "eight o'clock" or day of the week "Wednesday." The approach we took in Section 3 mixed all those things together.

But with a bit more work, we could come up with a more principled estimate of the average span of time mentioned in a work of fiction (or a biography for that matter).

All we would need to do is identify the unit of time in each phrase (a "minute," "hour," "month," or what have you) and then the multiplier (if any) applied to the unit. E.g., "several minutes." Then we could translate most of our entities into a number roughly quantifying (say) the number of minutes that would elapse in the described span of time.

The code below does that, just to illustrate how we could develop our casual exploration of dates and times into a systematic measurement of the granularity of time in fiction.

Note that you could do something very similar with references to money.

In [None]:
def text2int (textnum, numwords={}):
    '''
    By Adnan Umer on Stack Overflow:
    https://urldefense.com/v3/__https://stackoverflow.com/questions/493174/is-there-a-way-to-convert-number-words-to-integers*5Cn__;JQ!!DZ3fjg!qvPffHhHBJIodW8Tf1IP52IiCn_x6Tj9b547HUdN1URaf5JTSz-yueHuvNkYdQqQDog$     '''
    
    if not numwords:
        units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
        ]

        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):  numwords[word] = (1, idx)
        for idx, word in enumerate(tens):       numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {}    # I've disabled the ordinal part
    ordinal_endings = []  # it's not appropriate for time

    textnum = textnum.replace('-', ' ')
    textnum = textnum.replace('a hundred', '100')
    textnum = textnum.replace('a thousand', '1000')
    textnum = textnum.replace('a million', '1000000')

    current = result = 0
    curstring = ""
    onnumber = False
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                if onnumber:
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
            else:
                scale, increment = numwords[word]

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0
                onnumber = True

    if onnumber:
        curstring += repr(result + current)

    return curstring

def get_numeric_time(text):
    
    text = text.replace('\n', ' ').lower()
    
    if 'of age' in text:
        return float('nan')
    
    units = {'minute': 1, 'minutes': 1, 'hour': 60, 'hours': 60,
            'seconds': .017, 'moment': .017, 
            'day': 1440, 'days': 1440, 
            'night': 480, 'nights': 1440, 'week': 10080, 'weeks': 10080,
             'month': 40320, 'months': 40320, 'year': 525600, 'years': 525600}
    
    text = text2int(text)
    words = text.split()
    
    if 'old' in words:
        # this is saying how old someone is!
        return float('nan')
    
    multiplier = 1
    unit = 0
    for w in words:
        if w.isdigit():
            multiplier = int(w)
            if multiplier > 500:
                multiplier = 0
                # that's probably a date, not a count!
        elif w in units:
            unit = units[w]
        elif w == 'couple':
            multiplier = 2
        elif w == 'few':
            multiplier = 3
        elif w == 'several':
            multiplier = 4
        elif w == 'half':
            multiplier = multiplier * 0.5
        elif w == 'quarter':
            multiplier = multiplier * 0.25
    
    time = multiplier * unit
    
    if time == 0:
        time = float('nan')
    elif time > 20000000:
        time = 20000000
    
    return time
            
    

In [None]:
time_phrases = bookents.loc[(bookents['entity_type'] == 'TIME') | (bookents['entity_type'] == 'DATE'), : ]

In [None]:
times = time_phrases['entity_text'].map(get_numeric_time)
times[0:12]

In [None]:
time_phrases = time_phrases.assign(minutes = times)
time_phrases = time_phrases.assign(logminutes = np.log(times))
time_phrases.sample(25)

#### The central tendency of a log-normal distribution

Wow, time spans are distributed pretty strangely. If you inspect the distribution, it is in no way normal.

In [None]:
sns.kdeplot(time_phrases['minutes'])

On the other hand, the logarithm of time span is distributed closer to normally.

In [None]:
sns.kdeplot(time_phrases['logminutes'])

So it's going to make sense to take the mean of ```logminutes``` instead of the mean of ```minutes.``` The mean of ```minutes``` will be strongly dominated by a small number of extreme values at the high end. Differences at the low end will get erased.

In [None]:
timesbybook = time_phrases.groupby('book_id', as_index = False).mean()
timesbybook.head()

Let's join composition year and genre.

In [None]:
if 'book_id' in meta.columns:
    meta = meta.set_index('book_id')

timesbybook = timesbybook.join(meta.loc[: , ['compositionyear', 'genre']], on = 'book_id')
timesbybook.head()

In [None]:
timesbybook = timesbybook.loc[timesbybook['genre'] == 'fic', : ]
timesbybook = timesbybook.loc[~np.isnan(timesbybook['logminutes']), : ]
sns.scatterplot(x = timesbybook['compositionyear'], y = timesbybook['logminutes'])

In [None]:
from scipy.stats import pearsonr
pearsonr(timesbybook['compositionyear'], timesbybook['logminutes'])

**Conclusion:** In short, it looks like description of time gets considerably more granular as we move from the eighteenth century to the early twentieth. The spans of time mentioned get shorter on average.

Since there are significant sources of error in entity extraction, we might want to confirm this manually. In practice, the inference we're drawing does seem to check out.