# Entity Extraction of Pentagon Papers

In this notebook we'll use [```spaCy```, a Python module that does natural language processing,](https://urldefense.com/v3/__https://spacy.io/usage/models__;!!DZ3fjg!qvPffHhHBJIodW8Tf1IP52IiCn_x6Tj9b547HUdN1URaf5JTSz-yueHuvNkYDRWVgsk$ ) including part-of-speech tagging and named entity recognition (NER).

In [19]:
!pip install spacy                          
# this gets the python module itself



In [20]:
!python -m spacy download en_core_web_sm    
# this gets a particular English-lang model
# if this doesn't work try saying !python3 (everything else the same)
# or run the cell below and check what version you're running, and then say
# !python3.7, or !python3.8 (with everything else the same) as appropriate.

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [21]:
from platform import python_version
print(python_version())

3.8.8


In [22]:
import spacy
from spacy import displacy
from pprint import pprint
from collections import Counter
import en_core_web_sm
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [23]:
engl_nlp = en_core_web_sm.load()

### SECTION 1: Ingest and Chunk the File



<H2> Main Data Frame </H2>
    
Notes: The table does not have the text file "Pentagon-Papers-Part-IV-C-10.txt" because this entire file is all statistical reports and maps that do not translate to NLP. This leaves the entire text file corpus at 48 documents, each located in its entirity in the text column in the df. 
    
Overall_years refers to the span of dates that the content within is listed to pertain to. This information was found via the index (first row of df). Some documents did not have listed dates, so the information was found by searching through the body of the document, and often was found in the files chronology section.
    
Specific_dates is a more granular look at the time span a particular document refers to than overall_years. Most of the information is in month/year-month/year format, but for documents where it refered to a very specific start and end dates (such as Part-V-A which documents the U.S. governments official statements of position on the war to the public) then the date format is month/day/year-month/day/year. 

In [6]:
#use pwd to see where I am currently in the directory if I am having trouble identifying where I am
path = Path('full_txt_document.csv')
main_df = pd.read_csv(path, sep = ',')

In [7]:
main_df = main_df.drop(["Unnamed: 0"], axis=1)
main_df

Unnamed: 0,name,title,overall_years,specific_dates,text
0,Index,Index,1945-1967,1945-1967,Final Report OSD Vietnam Task Force (u) w1 In...
1,Part-I,Vietnam and the U.S.,1940-1950,1940-1950,EXECUTIVE SECRETARIAT TLE JJ RMM KDA ; r SS-...
2,Part-2,U.S. Involvement in the Franco-Viet Minh War,1950-1954,1950-1954,II U.S. Involvement in the Franco-Viet Minh W...
3,Part-3,The Geneva Accords,1954,1954,III The Geneva Accords 1954 A. U.S. Military ...
4,Part-IV-A-1,NATO and SEATO - A Comparison,1954-1960,1954-1960,", IwA Evolution of the War (26 Vols.) U.S. MAP..."
5,Part-IV-A-2,Aid for France in Indochina,1950-1954,1950-1954,A Evolution of the War (26 Vols.) U.S. MAP for...
6,Part-IV-A-3,U.S. and France's Withdrawl from Vietnam,1954-1956,1954-1956,A Evolution of the War (26 Vols.) U.S. MAP for...
7,Part-IV-A-4,U.S. Training of Vietnamese National Army,1954-1959,1954-1959,", Iw A Evolution of the War (26 Vols.) U.S. MA..."
8,Part-IV-A-5,Origins of the Insurgency,1954-1960,1954-1960,", IwA Evolution of the War (26 Vols.) U.S. MAP..."
9,Part-IV-B-1,The Kennedy Commitments and Programs,1961,1961,", IwB Evolution of the War (26 Vols.) Counteri..."


<H2> Functions to Chunk the Document, Extract the Entities, and Create a DataFrame from the Entity Dictionary</H2>

In [24]:
#function to prepare the document for extraction
#takes input of the file path, chunks into 300 words, and combines those 300 words into one string

def prepare_document(file_path):
    combined_word_lists = []
    total_lists = []

    path = Path(file_path)
    input_doc = open(path, encoding = 'utf-8').read()

    words = input_doc.split()

    for i in range(0, len(words), 300):
        chunk_words = words[i: i+300]
        combined_word_lists.append(chunk_words)
    
    #make every 300 words into a single string that is an individual element within the list 'total_lists'
    
    for list in combined_word_lists: 
        list_into_string = " ".join([word for word in list])
        total_lists.append(list_into_string)

    return total_lists

In [25]:
#function that does entity extraction on the prepared document using spaCy
#input: formatted document from prepare_document
#output: entity extraction dictionary

def entity_extraction(individual_document):
    ent_types = dict()

    for i in individual_document: #for each string in the list
        doc = engl_nlp(i)

        for entity in doc.ents:           # for each character, go through all the entities
            label = entity.label_         # get their labels
            if label not in ent_types:    # make sure there's a key for this label in the dictionary
                ent_types[label] = Counter()      # each label key points to a Counter for examples
            text = entity.text
            ent_types[label][text] += 1 
    
    return ent_types

In [26]:
def org_entity(document_dictionary):

    # for each dictionary, create a new df
    new_df_org = pd.DataFrame(ent_types['ORG'].items(), columns =["Organizations", "Count"])
    new_df_org.sort_values(by=["Organizations"], inplace = True)
    part_III_orgs_alpha = new_df_org
    new_df = part_III_orgs_alpha.reset_index(drop=True)
    
    return new_df

In [27]:
def person_entity(document_dictionary):

    # for each dictionary, create a new df
    new_df_person = pd.DataFrame(ent_types['PERSON'].items(), columns =["Person", "Count"])
    new_df_person.sort_values(by=["Organizations"], inplace = True)
    part_III_person_alpha = new_df_person
    new_df_person = part_III_person_alpha.reset_index(drop=True)
    
    return new_df_person

In [28]:
# concat dictionary to existing main dictionary

def concat_df(main_df, new_df):

    main_df= pd.concat([main_df, new_df], axis = 1)
    
    return main_df

# Trying to figure out how to deal with the different lengths: 
# https://www.geeksforgeeks.org/how-to-merge-dataframes-of-different-length-in-pandas/
# https://stackoverflow.com/questions/51424453/adding-list-with-different-length-as-a-new-column-to-a-dataframe

<H2> Formatting DF's for Organizations </H2>

In [31]:
part_index_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-Index.txt')
part_index_dictionary_org = entity_extraction(part_index_formatted_org)
part_index_org = org_entity(part_index_dictionary_org)
print(part_index_dictionary_org)

NameError: name 'ent_types' is not defined

In [None]:
displacy.render(doc, style = 'ent', jupyter = True)

In [32]:
part_I_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-I.txt')
part_I_dictionary_org = entity_extraction(part_I_formatted_org)
part_I_org = org_entity(part_I_dictionary_org)
print(part_I_org)
print(len(part_I_org))

NameError: name 'ent_types' is not defined

In [None]:
part_II_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-II.txt')
part_II_dictionary_org = entity_extraction(part_II_formatted_org)
part_II_org = org_entity(part_II_dictionary_org)
print(part_II_org)
print(len(part_II_org))

In [None]:
part_III_formatted = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-III.txt')
part_III_dictionary = entity_extraction(part_III_formatted)
part_III_orgs = org_entity(part_III_dictionary)
print(part_III_orgs)
print(len(part_III_orgs))

In [None]:
part_iv_a_1_formatted_org = prepare_document('../../Cleaned_Pentagon_Papers_text_files/Cleaned_Pentagon-Papers-IV-A-1.txt')
part_iv_a_1_dictionary_org = entity_extraction(part_iv_a_1_formatted_org)
part_iv_a_1_org = org_entity(part_iv_a_1_dictionary_org)
print(len(part_iv_a_1_org))