# Text Processing with spaCy
Exploring basic text-processing operations with the spaCy Python package.

In [1]:
import spacy
import pandas as pd
from numpy import dot
from numpy.linalg import norm
from spacy.en import English
from collections import Counter

from IPython.core.display import display, HTML
pd.set_option('display.max_colwidth', -1)

In [2]:
nlp = spacy.load('en')

## Read Document
Calling ```nlp``` on raw text runs it through spaCy's English pipeline and returns a document object that we can perform operations on.

In [4]:
with open('../data/11-0.txt', 'r', encoding='utf-8') as f:
    alice_text = f.read()

doc = nlp(alice_text)  # Process the text; returns a spaCy document object.

## Tokenization
Tokens are the building blocks of a sentence (words, punctuation marks and so on), and splitting text into them is called tokenization. The document object created above stores the text as a sequence of tokens that can be indexed like a list.

In [5]:
# Get the 100th token of the document.
token = doc[99]
print(token)

Last
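
To build intuition for what a tokenizer does, here is a minimal sketch that uses only a regular expression from the standard library. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions); the `simple_tokenize` helper and the sample string are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; it is not spaCy's tokenizer.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Last Updated: October 6, 2016"
tokens = simple_tokenize(sample)
print(tokens)
print(tokens[0])  # indexing is zero-based, so tokens[0] is the first token
```

Note that punctuation marks become tokens of their own, which is also what the spaCy output later in this notebook shows.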


## Sentences
In some circumstances it's better to process a document sentence by sentence. spaCy exposes the sentences through the ```sents``` attribute of the document.

In [6]:
sentences = [sentence.orth_.replace('\n', ' ') for sentence in doc.sents]  # List comp, also replacing newline chars
print('Sentences found: ' + str(len(sentences)))
pd.DataFrame(sentences[10:20])

Sentences found: 1431


| | 0 |
|---|---|
| 0 | Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. |
| 1 | First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. |
| 2 | She took down a jar from one of the shelves as she passed; it was labelled ‘ |
| 3 | ORANGE MARMALADE’, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it. ‘ |
| 4 | Well!’ thought Alice to herself, ‘after such a fall as this, I shall think nothing of tumbling down stairs! |
| 5 | How brave they’ll all think me at home! |
| 6 | Why, I wouldn’t say anything about it, even if I fell off the top of the house!’ (Which was very likely true.) |
| 7 | Down, down, down. |
| 8 | Would the fall NEVER come to an end! ‘ |
| 9 | I wonder how many miles I’ve fallen by this time?’ she said aloud. ‘ |


## Noun Chunks 
Noun chunks are phrases built around a noun, recovered using the part-of-speech tags. We can access them through the ```noun_chunks``` attribute, which returns a generator over the noun phrases.

In [7]:
# Example sentence from alice (6)
subset_sentence = nlp("Why, I wouldn’t say anything about it, even if I fell off the top of the house!’ (Which was very likely true.)")
print([chunk for chunk in subset_sentence.noun_chunks])

[I, anything, it, I, the top, the house!’]


## Noun Phrases
Noun phrases can be helpful for determining the topic of a text. The same operation can be performed on the whole document; here each noun phrase is paired with the head word it attaches to.

In [8]:
noun_phrases = [[chunk.orth_, chunk.root.head.orth_] for chunk in doc.noun_chunks]
print('Total noun phrases: ' + str(len(noun_phrases)))
pd.DataFrame(noun_phrases[30:40])

Total noun phrases: 8025


| | 0 | 1 |
|---|---|---|
| 0 | she | peeped |
| 1 | the\nbook | reading |
| 2 | her sister | was |
| 3 | it | had |
| 4 | no pictures | had |
| 5 | conversations | pictures |
| 6 | it | in |
| 7 | what | is |
| 8 | the use | thought |
| 9 | a book,’ | thought |


## Keywords using noun chunks

In [9]:
keywords = Counter()
for chunk in doc.noun_chunks:
    if nlp.vocab[chunk.lemma_].prob < -8:  # Keep lemmas whose log probability is below the -8 threshold
        keywords[chunk.lemma_] += 1
keywords.most_common(10)

[('-PRON-', 2700),
 ('alice', 331),
 ('the queen', 55),
 ('the king', 52),
 ('the gryphon', 50),
 ('the hatter', 48),
 ('the mock turtle', 45),
 ('-PRON- head', 36),
 ('the duchess', 34),
 ('the dormouse', 27)]
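
The `prob` attribute used above is a log probability estimate, so more negative values mean rarer strings, and the `< -8` test keeps only lemmas that are uncommon in general English. The sketch below illustrates the same filter with a hand-made lookup table. The `LOG_PROB` values, the `DEFAULT_LOG_PROB` fallback and the `chunks` list are all invented for illustration and are not spaCy's actual numbers.

```python
from collections import Counter

# Hypothetical log probabilities (more negative = rarer); not spaCy's real values
LOG_PROB = {"the": -3.5, "it": -4.5, "alice": -12.0, "queen": -10.5, "hatter": -14.0}
DEFAULT_LOG_PROB = -20.0  # assumed fallback for strings missing from the table
THRESHOLD = -8.0

chunks = ["alice", "it", "queen", "alice", "the", "hatter", "alice", "queen"]

# Count only the chunks that are rarer than the threshold
toy_keywords = Counter(
    chunk for chunk in chunks
    if LOG_PROB.get(chunk, DEFAULT_LOG_PROB) < THRESHOLD
)
print(toy_keywords.most_common(2))
```

Common words such as "the" and "it" fall above the threshold and are dropped, which is the effect the real filter is after.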

## Entities
In information extraction it's useful to run a named entity recognition tool to identify persons, organizations, locations, times, quantities, values and other categories of tokens. spaCy exposes the recognized entities through the ```ents``` attribute. The example shows the five most frequent person entities in the text.

In [10]:
entities = list(doc.ents)  # Create a list of the document's entities
people = [entity.orth_ for entity in entities if entity.label_ == 'PERSON']
pd.DataFrame(people)[0].value_counts()[:5]  # Show the five most frequent person entities in the text.

Alice      368
Gryphon    42 
Queen      32 
Mock       25 
Mouse      22 
Name: 0, dtype: int64

## Parts of Speech
POS (part-of-speech) tagging assigns a grammatical tag to each word based on its context. This is very helpful when building NLP systems. The tag is available on each individual token through the ```pos_``` attribute.

In [11]:
for token in doc[99:110]:
    print('{} - {}'.format(token, token.pos_))

Last - ADJ
Updated - VERB
: - PUNCT
October - PROPN
6 - NUM
, - PUNCT
2016 - NUM


 - SPACE
Language - NOUN
: - PUNCT
English - PROPN
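
Once each token has a tag, filtering by tag is a simple comprehension. The sketch below uses the (token, tag) pairs printed above, typed in by hand so that it runs without spaCy; the exact text of the whitespace token is an assumption.

```python
# (token text, POS tag) pairs copied from the output above;
# the whitespace token's exact content is assumed to be two newlines
tagged = [
    ("Last", "ADJ"), ("Updated", "VERB"), (":", "PUNCT"),
    ("October", "PROPN"), ("6", "NUM"), (",", "PUNCT"),
    ("2016", "NUM"), ("\n\n", "SPACE"), ("Language", "NOUN"),
    (":", "PUNCT"), ("English", "PROPN"),
]

SKIP_TAGS = {"PUNCT", "SPACE"}  # tags that carry no word content

# Keep only the tokens whose tag is not in the skip set
words = [text for text, pos in tagged if pos not in SKIP_TAGS]
print(words)

# Proper nouns only
proper_nouns = [text for text, pos in tagged if pos == "PROPN"]
print(proper_nouns)
```

With a real document the same comprehension would iterate over the tokens and compare `token.pos_` instead of the hand-written pairs.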


## Word to Vector
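Word vectors represent the distributional similarity of words as points in a vector space, so words used in similar contexts get similar vectors. A popular training method is word2vec. The spaCy English model used here ships with vectors pre-trained with GloVe, an unsupervised algorithm, so the vectors are looked up from the model's vocabulary rather than learned from the text supplied.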

In [12]:
parser = English()

# Find word vector for cat
cat = parser.vocab['cat']

# Cosine similarity function
def cosine(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))

others = list({w for w in doc.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "cat"})

# Sort by similarity scores, most similar first
others.sort(key=lambda w: cosine(w.vector, cat.vector), reverse=True)

print('Top similar words to cat: ')
for i in others[:10]:
    print(i.orth_)

Top similar words to cat: 
cats
kitten
dog
kitty
pet
puppy
dogs
rabbit
pets
animal
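
The ranking above relies on cosine similarity: the dot product of two vectors divided by the product of their lengths. It depends only on the angle between the vectors, not on their magnitudes. The sketch below works through it in pure Python on tiny made-up vectors; the three-dimensional values are invented for illustration and have nothing to do with the real vectors in the model.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    length1 = math.sqrt(sum(a * a for a in v1))
    length2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (length1 * length2)

# Toy vectors, invented for illustration
toy_cat = [1.0, 2.0, 0.0]
toy_kitten = [2.0, 4.0, 0.0]   # same direction as toy_cat, twice as long
toy_car = [0.0, 0.0, 3.0]      # orthogonal to toy_cat

print(cosine_similarity(toy_cat, toy_kitten))  # close to 1: same direction
print(cosine_similarity(toy_cat, toy_car))     # 0.0: nothing in common
```

Because only direction matters, `toy_kitten` scores as (almost exactly) identical to `toy_cat` even though it is twice as long.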


## Word Similarity
spaCy has a built-in ```similarity``` method that estimates semantic similarity as the cosine similarity of word vectors; for multi-word objects it uses the average of their word vectors.

In [13]:
cats = others[0]
kitten = others[1]
print(cats.similarity(kitten))

0.734707011482
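
To make the "average of word vectors" idea concrete, the sketch below averages toy vectors for each phrase and then compares the averages with cosine similarity. The two-dimensional `VECTORS` table and the helper names are assumptions made for illustration; this is a stand-in for the idea, not spaCy's implementation.

```python
import math

# Toy 2-dimensional word vectors, invented for illustration
VECTORS = {
    "black": [1.0, 0.0],
    "dark": [1.0, 0.0],
    "car": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "kitten": [0.0, 1.0],
}

def mean_vector(words):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [VECTORS[word] for word in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    length1 = math.sqrt(sum(a * a for a in v1))
    length2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (length1 * length2)

black_cat = mean_vector(["black", "cat"])
dark_kitten = mean_vector(["dark", "kitten"])
black_car = mean_vector(["black", "car"])

print(cosine_similarity(black_cat, dark_kitten))  # close to 1
print(cosine_similarity(black_cat, black_car))    # noticeably lower
```

Averaging discards word order, so this kind of estimate treats a phrase as a bag of words.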


## Resources

* http://gutenberg.org

* https://spacy.io

* https://blog.sharepointexperience.com/2016/01/nlp-and-sharepoint-part-1/

* https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/

* https://github.com/cytora/pycon-nlp-in-10-lines/

* https://spacy.io/docs/api/doc

Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice’s Adventures in Wonderland

Author: Lewis Carroll

Posting Date: June 25, 2008 [EBook #11]
Release Date: March, 1994
Last Updated: October 6, 2016

Language: English

Character set encoding: UTF-8