# Extract story information with named entity recognition


In [1]:
from book_reading import Book

## Import a story

In [10]:
sherlock=Book("books/Sherlock.html")
story_paragraphs=list(sherlock.get_paragraphs(6))
story_full="\n\n".join(story_paragraphs)
story_begin="\n\n".join(story_paragraphs[0:5])

In [11]:
print("\n\n".join(story_paragraphs[0:3]))

Isa Whitney, brother of the late Elias Whitney, D.D., Principal of the Theological College of St. George’s, was much addicted to opium. The habit grew upon him, as I understand, from some foolish freak when he was at college; for having read De Quincey’s description of his dreams and sensations, he had drenched his tobacco with laudanum in an attempt to produce the same effects. He found, as so many more have done, that the practice is easier to attain than to get rid of, and for many years he continued to be a slave to the drug, an object of mingled horror and pity to his friends and relatives. I can see him now, with yellow, pasty face, drooping lids, and pin-point pupils, all huddled in a chair, the wreck and ruin of a noble man.

One night—it was in June, ’89—there came a ring to my bell, about the hour when a man gives his first yawn and glances at the clock. I sat up in my chair, and my wife laid her needle-work down in her lap and made a little face of disappointment.

“A patien

## Analyze with spaCy

In [12]:
import spacy
from spacy import displacy
nlp=spacy.load("en_core_web_sm")

In [13]:
doc=nlp(story_full)

In [14]:
displacy.render(nlp(story_begin), style="ent", jupyter=True)

Now we look at the characters. spaCy's pretrained _named entity recogniser_ detects names in the story, and counting how often each one occurs gives a rough measure of which characters are the most important.

In [16]:
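from collections import Counter
# Build one Counter per entity label (PERSON, GPE, ...), keyed by the entity's lemmatized text.
entities={}
for ent in doc.ents:
    entities.setdefault(ent.label_, Counter()).update([str(ent.lemma_)])
# Show the counts for everything spaCy labelled as a person.
entities["PERSON"]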

Counter({'Isa Whitney': 2,
         'Elias Whitney': 1,
         'De Quincey ’s': 1,
         'Kate Whitney': 1,
         'Kate': 3,
         'Upper Swandam Lane': 1,
         'Watson': 16,
         'Lascar': 12,
         'Holmes': 15,
         'St. Clair ’s': 3,
         'near Lee': 1,
         'John': 1,
         'pon': 1,
         'Lee': 5,
         'St. Clair': 9,
         'Neville St.': 1,
         'Swandam Lane': 5,
         'Neville St. Clair': 7,
         'Barton': 1,
         'Surrey': 2,
         'Sherlock Holmes': 3,
         'Neville': 1,
         'Bradstreet': 2})
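
To rank the characters rather than read the raw dictionary, `Counter.most_common` sorts the entries by count. The sketch below uses a hand-copied subset of the counts printed above (the variable name `person_counts` is only for illustration), so it runs without loading spaCy.

```python
from collections import Counter

# A few PERSON counts copied from the output above; the other entries are left out for brevity.
person_counts = Counter({
    "Isa Whitney": 2,
    "Watson": 16,
    "Lascar": 12,
    "Holmes": 15,
    "St. Clair": 9,
    "Neville St. Clair": 7,
})

# most_common(n) returns the n highest counts as (name, count) pairs, largest first.
top_three = person_counts.most_common(3)
for name, count in top_three:
    print(f"{name}: {count}")
```

In the notebook itself the same call would be `entities["PERSON"].most_common(3)`.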

Next, try to merge the different names that refer to the same person. As a first step, strip the possessive suffix (’s) from each name.

In [31]:
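import re
# Strip only a trailing possessive ('s or ’s); rstrip("'’s ") removes any of those characters and so also cuts the final "s" of "Holmes".
names=set(re.sub(r"\s*['’]s$", "", k) for k in entities["PERSON"].keys())

In [32]:
names

{'Barton',
 'Bradstreet',
 'De Quincey',
 'Elias Whitney',
 'Holmes',
 'Isa Whitney',
 'John',
 'Kate',
 'Kate Whitney',
 'Lascar',
 'Lee',
 'Neville',
 'Neville St.',
 'Neville St. Clair',
 'Sherlock Holmes',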
 'St. Clair',
 'Surrey',
 'Swandam Lane',
 'Upper Swandam Lane',
 'Watson',
 'near Lee',
 'pon'}
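
Removing possessives still leaves several spellings for one character, such as 'St. Clair' and 'Neville St. Clair'. One possible next step, sketched below, is to fold each short name into the longest name that contains it as whole words and add up the counts. The helper `merge_aliases` is hypothetical (it is not part of spaCy or of this notebook), and the sample counts are copied from the PERSON counter shown earlier.

```python
from collections import Counter

def merge_aliases(counts):
    """Fold each name into the longest name that contains it as whole words.

    Hypothetical helper for illustration only.
    """
    merged = Counter()
    for name, count in counts.items():
        # Pad with spaces so "Lee" matches "near Lee" but would not match "Leeds".
        candidates = [other for other in counts
                      if other != name and f" {name} " in f" {other} "]
        # Prefer the longest containing name; keep the name itself if nothing contains it.
        target = max(candidates, key=len) if candidates else name
        merged[target] += count
    return merged

# Sample counts taken from the PERSON counter above.
person_counts = Counter({
    "Watson": 16, "Holmes": 15, "Sherlock Holmes": 3,
    "St. Clair": 9, "Neville St. Clair": 7, "Neville": 1,
    "Kate": 3, "Kate Whitney": 1,
})
merged = merge_aliases(person_counts)
print(merged.most_common())
```

This heuristic is deliberately crude. It would also fold 'Lee' into 'near Lee', and a bare surname shared by several characters (the three Whitneys, for instance) would be assigned to whichever full name happens to be longest, so the result still needs a manual check.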