## Using spaCy for entity extraction

The idea is to use the Python library spaCy to analyse the text of books.

I start with Sir Arthur Conan Doyle's [_The Adventures of Sherlock Holmes_](http://www.gutenberg.org/ebooks/1661), which is freely available on Project Gutenberg.

First we load the library and the English models provided along with it. This should prove to be a good starting point. Later, the idea would be to train the models further and tune them for the kind of data I hope to analyze.

In [68]:
# Import the spacy library
import spacy

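# Load the spaCy English model as the variable 'nlp'.
# Note: the 'en' shortcut works in older spaCy releases; spaCy 3 removed shortcut
# names, so there you would load a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en')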

Once the model is loaded, we are ready to load the text of the book.

Here is a simple function that loads the text from a file.

In [69]:
# A function that loads text from a file. Import io to handle txt encoding issues
import io

def read_file(file_name):
    with io.open(file_name,'r',encoding='utf8') as f:
        return f.read()

Call our function to read the text, then pass it to _nlp_ to get the parsed book as the variable _doc_. We are now ready to analyze. Instead of Sherlock Holmes, I am switching here to a more peculiar book, _The Hitchhiker's Guide to the Galaxy_.

In [70]:
text = read_file('hitch_guide.txt')
doc = nlp(text)


We now start exploring the spaCy magic. We jump right into entity extraction using spaCy's built-in, extensively trained model: iterate over the _.ents_ attribute of the document to visit every entity it found. An additional filter picks up only the entities labeled __ORG__.

In [71]:
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.label_, ent.text)

ORG Hitchhiker
ORG Galaxy
ORG Galaxy
ORG Galaxy
ORG the Celestial Home Care Omnibus
ORG Fifty More Things
ORG Oolon

ORG the Outer Eastern Rim
ORG Galaxy
ORG the Hitchhiker’s Guide
ORG Encyclopedia
Galactica
ORG Yellow
ORG Yellow
ORG D. Point
ORG Leopard
ORG Guildford
ORG Ford Prefect
ORG Ford
ORG Betelgeuse
ORG Ford
ORG Betelgeuse
ORG Ford Prefect
ORG Ford
ORG Universe
ORG Ford
ORG Galaxy
ORG the Public Good
ORG House
ORG Ford
ORG Ford
ORG Ford
ORG Betelgeuse
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Orion Beta
ORG Indian Wrestling
ORG Janx Spirit
ORG Orion
ORG Ford Prefect
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG ’s
ORG Ford
ORG 

ORG Ford
ORG Ford
ORG 

ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG Ford
ORG the Encyclopedia Galactica
ORG The Hitchhiker’s Guide
ORG Galaxy
ORG Pan Galactic Gargle Blaster
ORG Pan Galactic Gargle Blaster
ORG Pan Galactic Gargle
ORG Ol’ Janx Spirit
ORG the Marshes of Fallia


If you have read _The Hitchhiker's Guide to the Galaxy_, you will know that it uses a lot of made-up names. Still, the library seems to have done a pretty decent job.

A few observations, though:

* There are a lot of repetitions, such as _Ford_, which is the name of a lead character.
* It picks up words like _Universe_ and _Galaxy_, which could be characterized as organizations, but within the context of the book I know that they are nouns for celestial objects.
* It picks up quite a few apostrophes and other special characters as organizations, which is strange.
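
The repetitions and the stray special characters suggest a small post-processing pass. The sketch below is one possible approach, not something from the notebook itself: `count_org_entities` is a hypothetical helper that takes `(label, text)` pairs, collapses whitespace inside each entity, strips punctuation from both ends, skips anything with fewer than two characters left, and counts the rest so that each name is listed once. The two-character threshold is an assumption chosen to drop fragments like a bare _’s_.

```python
import re
from collections import Counter

def count_org_entities(pairs, label="ORG", min_length=2):
    """Count cleaned entity texts for one label (hypothetical helper)."""
    counts = Counter()
    for ent_label, ent_text in pairs:
        if ent_label != label:
            continue
        # Collapse newlines and runs of spaces inside the entity text.
        text = " ".join(ent_text.split())
        # Strip non-word characters (apostrophes, quotes, etc.) from both ends.
        text = re.sub(r"^\W+|\W+$", "", text)
        # Skip fragments that are empty or too short to be a name.
        if len(text) < min_length:
            continue
        counts[text] += 1
    return counts

# A few sample pairs shaped like the output above.
sample = [
    ("ORG", "Ford"),
    ("ORG", "Ford"),
    ("ORG", "’s"),
    ("ORG", "\n"),
    ("ORG", "Encyclopedia\nGalactica"),
    ("PERSON", "Arthur"),
    ("ORG", "Galaxy"),
]
org_counts = count_org_entities(sample)
```

With the parsed book, the pairs could come from `((ent.label_, ent.text) for ent in doc.ents)`, and calling `.most_common(10)` on the result would list the most frequent names. This does nothing about _Universe_ and _Galaxy_; those would need a stop list or a retrained model.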

## Next steps

+ Get rid of the repetitions and the special characters.
+ Look for a zany visualization.
+ Run the built-in models against more formal texts, like a sample contract, and see how they behave.
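
For the comparison against more formal texts, one simple starting point might be to look at how many entities of each label the model finds in each document. The helper below is a hypothetical sketch: `label_summary` is not a spaCy function. It relies only on the `.ents` attribute of a parsed document and the `.label_` attribute of each entity, both of which are used earlier in this notebook.

```python
from collections import Counter

def label_summary(doc):
    """Return a Counter of entity labels found in a parsed document."""
    return Counter(ent.label_ for ent in doc.ents)

def label_shares(doc):
    """Return each label's share of all entities, as a dict of fractions."""
    counts = label_summary(doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

Calling `label_shares(doc)` on the novel and on a contract parsed with the same `nlp` object would give two dictionaries that can be compared label by label.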