# Import the English Language Model

If you have not already done so, you will need to run this code to download the language model.

In [5]:
import sys
!{sys.executable} -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


# Defining variables

In [4]:
## define directory path and entity type
import os
cwd = os.getcwd()
data_loc = cwd + "/data"
output_loc = cwd + "/output/"
ent_type = "PERSON"

### entity type can be "PERSON", "NORP", "ORG", "GPE", etc.
### https://spacy.io/api/annotation#named-entities

# Imports and setup

In [5]:
import spacy
from spacy import displacy
import os
import string
import codecs
import subprocess
from collections import Counter

nlp = spacy.load('en_core_web_sm')

# Walk the directory tree and collect text files

In [6]:
allfiles = []

for root, dirs, files in os.walk(data_loc):
    for file in files:
        if file.endswith(".txt"):
            allfiles.append(os.path.join(root, file))
            
print('files: %d ' % len(allfiles))

files: 6 


In [7]:
# Read the first collected file as UTF-8; the with-block closes it automatically
with codecs.open(allfiles[0], 'r', encoding='utf-8') as myfile:
    pagetext = myfile.read()
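
The cell above reads only the first file in `allfiles`. If you later want to process every collected file, the following is a minimal sketch using only the standard library. The helper names `collect_text_files` and `read_text_files`, and the file names in the demonstration, are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def collect_text_files(data_dir):
    """Return the sorted paths of every .txt file under data_dir."""
    paths = []
    for root, dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(".txt"):
                paths.append(os.path.join(root, name))
    return sorted(paths)

def read_text_files(paths):
    """Map each path to its UTF-8 decoded contents."""
    texts = {}
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            texts[path] = handle.read()
    return texts

# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    samples = [
        ("a.txt", "line one\n"),
        (os.path.join("sub", "b.txt"), "line two\n"),
        ("skip.csv", "not a text file"),
    ]
    for rel_path, body in samples:
        with open(os.path.join(tmp, rel_path), "w", encoding="utf-8") as handle:
            handle.write(body)
    demo_paths = collect_text_files(tmp)
    demo_texts = read_text_files(demo_paths)

print('files: %d' % len(demo_paths))
```

In the notebook itself you would call `read_text_files(allfiles)` and pass each text to `nlp` in turn.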

# First pass: Parse the text and recognize entities

Here we apply the plain, "out of the box" spaCy English model to our text document.
We then display the first sentence as a dependency graph and the entire document
with highlighted entities.

In [8]:
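def parse():
    """Run the current pipeline on pagetext and display the results."""
    doc = nlp(pagetext)
    sentence_spans = list(doc.sents)
    # Dependency graph for the first sentence only
    displacy.render(sentence_spans[0:1], options={'compact': True}, style="dep")
    # Whole document with the recognized entities highlighted
    displacy.render(doc, options={'compact': True}, style="ent")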
    

In [9]:
parse()

# Student Exercise

Analyze the results obtained above. How accurate are the recognized entities? Can you point out any reasons why
the "out of the box" model made certain mistakes?
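
One way to approach this exercise is to tally the recognized entities by label and look for suspicious categories. The sketch below is only an illustration: the function `tally_entities` is a hypothetical helper, and `sample_entities` is made-up data standing in for the pairs you could build in the notebook with `[(ent.text, ent.label_) for ent in nlp(pagetext).ents]`.

```python
from collections import Counter

def tally_entities(entities):
    """Count labels and exact (text, label) pairs from a list of pairs."""
    label_counts = Counter(label for _, label in entities)
    pair_counts = Counter(entities)
    return label_counts, pair_counts

# Made-up pairs for illustration only; real output will differ
sample_entities = [
    ("Example Name", "PERSON"),
    ("Example Name", "PERSON"),
    ("Charlotte", "GPE"),
    ("White", "WORK_OF_ART"),
]

label_counts, pair_counts = tally_entities(sample_entities)

# Labels from most to least frequent
for label, count in label_counts.most_common():
    print(label, count)
```

A label that appears far more or less often than you expect for a city directory is a good place to start looking for mistakes.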

# Create Line-by-Line Sentence Boundaries

Our directory text files contain one group of related words per line, but they aren't exactly sentences.
Let's see if we can improve the NLP output by explicitly telling the pipeline that each line is a sentence of related
words. The code below creates a function 'set_newline_sentences', which is added to our NLP pipeline.

## Newline and Escape Characters
The newline character in text-encoded files is only indirectly visible. It causes the characters after it
to start on the next line when the file is printed or displayed in an editor or viewer. In programming languages you
often need to create a newline character within a string without typing a literal line break. Instead we use
an "escape code" to add the invisible character. Newline's escape code is '\n'. String escape codes in most
programming languages start with a '\'; for instance, a tab character is created by placing '\t' in a string.
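
As a small illustration of escape codes in Python (the sample string is made up for this example):

```python
# A two-line string built with escape codes instead of a literal line break
text = "first line\nsecond\tline"

print(text)        # the newline and tab take effect when printed
print(repr(text))  # repr() shows the escape codes themselves

# Each escape code is a single character, even though we type two
newline_length = len("\n")
tab_length = len("\t")

# Splitting on the newline character gives one entry per line
lines = text.split("\n")
print(lines)
```

Splitting on '\n' is the plain-Python counterpart of treating each line of the directory as its own unit.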

In [13]:
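def set_newline_sentences(doc):
    # Skip the last token so that doc[token.i + 1] always exists
    for token in doc[:-1]:
        if token.text == '\n':
            # The token after a newline starts a new sentence
            doc[token.i + 1].is_sent_start = True
        elif token.is_sent_start is None:
            # Other tokens that are still unmarked do not start a sentence
            token.is_sent_start = False
    return doc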

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(set_newline_sentences, before="parser")

In [14]:
parse()

## Adding the "Race" Labels from the Directory

The Charlotte directory at this time labeled each household as either "Black" or "White". spaCy often seems to think that "White" and "Black" are names or parts of "works of art", perhaps because they are colors. Since each line in the historical directory starts with a race label, let's inform spaCy about this before named entity recognition runs in the processing pipeline.

In [15]:
from spacy.pipeline import EntityRuler
race_entities = EntityRuler(nlp)
patterns = [{"label": "RACE", "pattern": [{"LOWER": "black"},]},
            {"label": "RACE", "pattern": [{"LOWER": "white"},]}]
race_entities.add_patterns(patterns)

nlp = spacy.load('en_core_web_sm')
nlp.entity.add_label('RACE')
nlp.add_pipe(set_newline_sentences, before="parser")
nlp.add_pipe(race_entities, before="ner")

In [16]:
parse()

## The Token After the Race Label Is a Name

In [19]:
from spacy.tokens import Span
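def lastname_follows_race_entities(doc):
    new_ents = []
    for ent in doc.ents:
        new_ents.append(ent)
        # doc[ent.end] is the token right after the label and .nbor() steps one
        # token further, so check that this token exists before using it
        if ent.label_ == "RACE" and ent.end + 1 < len(doc):
            next_token = doc[ent.end].nbor()
            new_ent = Span(doc, next_token.i, next_token.i + 1, label="PERSON")
            new_ents.append(new_ent)
    doc.ents = new_ents
    return doc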

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(set_newline_sentences, name="newline", before="parser")
nlp.entity.add_label('RACE')
nlp.add_pipe(race_entities, name="race", before="ner")
nlp.add_pipe(lastname_follows_race_entities, name="lastname", after='race')

In [20]:
parse()

# Analysis of NER Output

In the examples above we add information at each step, usually before spaCy's NER processing. However, we are finding that our "hints" are not improving the results. We could continue to manually add more information to the document, but this would still not improve the NER output that spaCy produces. We would be doing our own version of NER, based on what we know about the Charlotte directory's contents.

# Student Activity

Why doesn't spaCy detect more named entities in our text?

# Next steps

Instead of doing our own data parsing through the spaCy pipeline, it may be more efficient to create our own text parsing pipeline:

* [Using Regular Expressions to Parse the Directory](regex.ipynb)