# Extract story information with named entity extraction


In [1]:
from book_reading import Book

## Import a story

In [8]:
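# Load the Peter Pan HTML file and collect its paragraphs
book = Book("books/PeterPan.html")
story_paragraphs = list(book.get_paragraphs(2))
# Full text for the analysis, plus the first five paragraphs for a quick preview
story_full = "\n\n".join(story_paragraphs)
story_begin = "\n\n".join(story_paragraphs[0:5])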

In [9]:
print("\n\n".join(story_paragraphs[0:3]))

All children, except one, grow up. They soon know that they will grow up, and the way Wendy knew was this. One day when she was two years old she was playing in a garden, and she plucked another flower and ran with it to her mother. I suppose she must have looked rather delightful, for Mrs. Darling put her hand to her heart and cried, “Oh, why can’t you remain like this for ever!” This was all that passed between them on the subject, but henceforth Wendy knew that she must grow up. You always know after you are two. Two is the beginning of the end.

Of course they lived at 14, and until Wendy came her mother was the chief one. She was a lovely lady, with a romantic mind and such a sweet mocking mouth. Her romantic mind was like the tiny boxes, one within the other, that come from the puzzling East, however many you discover there is always one more; and her sweet mocking mouth had one kiss on it that Wendy could never get, though there it was, perfectly conspicuous in the right-hand co

## Analyze with spaCy

In [10]:
import spacy
from spacy import displacy
# Load the small English pipeline (the model package must be installed separately)
nlp = spacy.load("en_core_web_sm")

In [11]:
doc=nlp(story_full)

In [12]:
displacy.render(nlp(story_begin), style="ent", jupyter=True)

We look at the characters. With spaCy we can use a pretrained _named entity recogniser_ to detect names in the story. Counting how often each name occurs gives a rough indication of which characters are the most important.

In [13]:
from collections import Counter

# Group entity lemmas by label (PERSON, DATE, ...) and count how often each occurs
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, Counter()).update([ent.lemma_])
entities

{'DATE': Counter({'two year old': 1,
          '14': 1,
          'a week or two': 1,
          'nine nine seven': 2,
          'a year': 1,
          'two fifteen six': 1,
          'the day': 1,
          'a week': 1}),
 'PERSON': Counter({'Darling': 29,
          'Wendy': 15,
          'Michael': 9,
          'George': 2,
          'John': 3,
          'Newfoundland dog': 1,
          'Fulsom': 1,
          'Liza': 1,
          'Peter Pan': 3,
          'John ’s': 1,
          'a Peter Pan': 1,
          'mark': 1,
          'Wendy one morning': 1}),
 'CARDINAL': Counter({'two': 4,
          'one': 4,
          'six': 3,
          'ten': 2,
          'two nine and six': 1,
          'eighteen': 1,
          'three': 3,
          'five': 1,
          'eight nine': 1,
          'nine nine seven': 1,
          'don’t': 2,
          'one five': 1,
          'half': 1,
          'fifteen': 1,
          'twelve': 1,
          'ninety - nine': 1,
          'between one': 1,
          'four
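
The raw PERSON counts above contain near-duplicates such as 'John ’s' and 'a Peter Pan', which split the tally for a single character. The following is a minimal sketch of one way to merge such variants and rank the characters. The counts are copied from the output above so the cell runs on its own; `normalise_name` is a hypothetical helper introduced here for illustration, not part of spaCy or of `book_reading`.

```python
import re
from collections import Counter

# PERSON counts copied from the output above
person_counts = Counter({
    "Darling": 29, "Wendy": 15, "Michael": 9, "George": 2, "John": 3,
    "Newfoundland dog": 1, "Fulsom": 1, "Liza": 1, "Peter Pan": 3,
    "John ’s": 1, "a Peter Pan": 1, "mark": 1, "Wendy one morning": 1,
})

def normalise_name(name):
    """Hypothetical helper: drop a leading article and a trailing possessive."""
    name = re.sub(r"^(?:a|an|the)\s+", "", name, flags=re.IGNORECASE)
    name = re.sub(r"\s*[’']s$", "", name)
    return name.strip()

# Re-count under the normalised names so variants add up
merged = Counter()
for name, count in person_counts.items():
    merged[normalise_name(name)] += count

# Five most frequent names after merging
top_characters = merged.most_common(5)
print(top_characters)
```

In the notebook itself, `entities["PERSON"]` could be used in place of the copied `person_counts`. This simple rule does not repair every mistake: entries such as 'Wendy one morning' or 'mark' are recogniser errors and would need a different treatment.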

In [37]:
dir(ent)

['_',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_fix_dep_copy',
 '_vector',
 '_vector_norm',
 'as_doc',
 'char_span',
 'conjuncts',
 'doc',
 'end',
 'end_char',
 'ent_id',
 'ent_id_',
 'ents',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'kb_id',
 'kb_id_',
 'label',
 'label_',
 'lefts',
 'lemma_',
 'n_lefts',
 'n_rights',
 'noun_chunks',
 'orth_',
 'remove_extension',
 'rights',
 'root',
 'sent',
 'sentiment',
 'set_extension',
 'similarity',
 'start',
 'start_char',
 'subtree',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'vector',
 'vector_norm',
 'vocab']
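
Most names in this listing are internals. For entity extraction, the attributes that usually matter are `text`, `label_`, `start` and `end` (token offsets), and `start_char` and `end_char` (character offsets). The sketch below prints them for a few spans. It assumes a blank English pipeline with hand-labelled entities, chosen only so that the example runs without a trained model; in the notebook the same attributes are available on every `ent` in `doc.ents`.

```python
import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline: an assumption for illustration, no trained model needed
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("Wendy and Michael met Peter Pan.")

# Hand-labelled entities, given as token offsets (start inclusive, end exclusive)
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="PERSON"),
    Span(demo_doc, 2, 3, label="PERSON"),
    Span(demo_doc, 4, 6, label="PERSON"),
]

# Collect the attributes most relevant to entity extraction
rows = [
    (e.text, e.label_, e.start, e.end, e.start_char, e.end_char)
    for e in demo_doc.ents
]
for row in rows:
    print(row)
```

The character offsets index into the original string, so `demo_doc.text[e.start_char:e.end_char]` gives back `e.text`, which is useful for locating an entity in the source paragraph.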