Natural language processing: spaCy
----------------------------------------------------
Based on Tutorial by [Allison Parrish](http://www.decontextualize.com/)


**spaCy** is a free, open-source library for advanced Natural Language Processing (NLP) in Python. 


The [spaCy library](https://spacy.io/) is a good compromise: it is powerful and state-of-the-art, yet easy for newcomers to understand.

spaCy provides very fast and accurate syntactic analysis, and also offers named entity recognition and ready access to word vectors.




In [1]:
from __future__ import unicode_literals
from __future__ import print_function

## Installing spaCy

[Follow the instructions here](https://spacy.io/docs/usage/). To install on Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type

    conda install -c conda-forge spacy
    
This line installs the library. You'll also need to download a language model. For that, type:

    python -m spacy download en_core_web_md
    
(Replace `en` with the language code for your desired language, if there's a model available for it.) The language model contains the statistical information necessary to parse text into sentences and sentences into parts of speech. Note that this download is several hundred megabytes, so it might take a while!

If you're not using Anaconda, you can also install with `pip`. When using `pip`, make sure to upgrade to the newest version first, with `pip install --upgrade pip`. (This will ensure that at least *some* of the dependencies are installed as pre-built binaries.)

    pip install spacy
    
(If you're not using a virtual environment, try `sudo pip install spacy`.)


* Alternatively, download the model from inside a notebook cell (the leading `!` runs a shell command):

    !python -m spacy download en_core_web_md

Features of spaCy
------------------

* **Tokenisation** Segmenting text into words, punctuation marks etc.

* **Part-of-speech (POS) Tagging** Assigning word types to tokens, like verb or noun.

* **Dependency Parsing** Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

* **Lemmatisation** Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".

* **Sentence Boundary Detection** (SBD) Finding and segmenting individual sentences.

* **Named Entity Recognition** (NER) Labelling named "real-world" objects, like persons, companies or locations.

* **Similarity** Comparing words, text spans and documents and how similar they are to each other.

* **Text Classification** Assigning categories or labels to a whole document, or parts of a document.

* **Rule-based Matching** Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

* **Training** Updating and improving a statistical model's predictions.

* **Serialisation** Saving objects to files or byte strings.


# spaCy versus NLTK

spaCy offers tokenization, sentence boundary detection, POS tagging, NER, syntactic parsing (and chunking, as a subset of this), integrated word vectors, and alignments.

![](./data/whyspacy.png?raw=true)
##### source: https://spacy.io/

For Languages other than English
--------------------------------


* The main assumption that most NLP libraries and techniques make is that the text you want to process will be in English

* Historically, most NLP research has been on English specifically

* [spaCy has models for various languages](https://spacy.io/models/#available-models), including German, Spanish, Portuguese, French, Italian, and Dutch

* [Konlpy](https://github.com/konlpy/konlpy), natural language processing in
  Python for Korean
  
* [Jieba](https://github.com/fxsjy/jieba), text segmentation and POS tagging in
  Python for Chinese
  
* Facebook's [fasttext project](https://fasttext.cc/docs/en/pretrained-vectors.html) makes available word vectors for a large number of languages (~300)


## Basic usage

Import `spacy` like any other Python module

In [3]:
import spacy

# Loading language models

* Complete list of models [Language Models](https://spacy.io/usage/models)


![](./data/lm.png?raw=true)
##### source: https://spacy.io/

In [4]:
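# Load the model downloaded above. The shortcut spacy.load('en') only works
# in spaCy v2, and only if an 'en' shortcut link has been created.
nlp = spacy.load('en_core_web_md')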

# Create a document (some text)

In [5]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

## Sentences and parts of speech

* Sentences are themselves composed of individual words,  where the function of a word in a sentence is called its "part of speech"—i.e., a word functions as a noun, a verb, an adjective, etc.

* Example:

    I       really love entrees       from        the        new       cafeteria.
    pronoun adverb verb noun (plural) preposition determiner adjective noun




In [6]:
for item in doc.sents:
    print(item.text)

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


* The `.sents` attribute is a [generator](https://wiki.python.org/moin/Generators), not a list, so it cannot be indexed or measured with `len()` directly

* Convert it to a list by passing it to the `list()` function:

In [7]:
sentences_as_list = list(doc.sents)

In [8]:
len(sentences_as_list)

3

In [9]:
import random
random.choice(sentences_as_list)

All human beings are born free and equal in dignity and rights.
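
The difference between a generator and a list is easy to see with plain Python. The sketch below uses a hypothetical `fake_sents()` generator function as a stand-in for `doc.sents` (it is not part of spaCy): a generator has no length until it is materialised with `list()`.

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "All human beings are born free and equal in dignity and rights."
    yield "They are endowed with reason and conscience."
    yield "Everyone has the right to life, liberty and security of person."

gen = fake_sents()

# A generator has no len(); asking for one raises TypeError.
try:
    len(gen)
    generator_has_len = True
except TypeError:
    generator_has_len = False

# Converting to a list materialises every item, so len() and indexing work.
sentences = list(gen)
print(generator_has_len)
print(len(sentences))
print(sentences[1])
```

Once the items are in a list, functions such as `random.choice()` that need a sequence can be used on them.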

## Words

* Iterating over a document yields each word in the document

* Words are represented by spaCy [`Token`](https://spacy.io/docs/api/token) objects, which have several useful attributes.

* The `.text` attribute gives the underlying text of the word, and the `.lemma_` attribute gives the word's "lemma", i.e. its base form:

In [10]:
for word in doc:
    print(word.text, word.lemma_)

All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .
Everyone Everyone
has have
the the
right right
to to
life life
, ,
liberty liberty
and and
security security
of of
person person
. .


In [11]:
sentence = list(doc.sents)[1]
for word in sentence:
    print(word.text)

They
are
endowed
with
reason
and
conscience
and
should
act
towards
one
another
in
a
spirit
of
brotherhood
.


## Parts of speech

* The `pos_` attribute gives a general part of speech; the `tag_` attribute gives a more specific designation

[List of meanings here.](https://spacy.io/docs/api/annotation)

In [12]:
for item in doc:
    print(item.text, item.pos_, item.tag_)

All DET DT
human ADJ JJ
beings NOUN NNS
are VERB VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are VERB VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should VERB MD
act VERB VB
towards ADP IN
one NOUN NN
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood NOUN NN
. PUNCT .
Everyone NOUN NN
has VERB VBZ
the DET DT
right NOUN NN
to ADP IN
life NOUN NN
, PUNCT ,
liberty NOUN NN
and CCONJ CC
security NOUN NN
of ADP IN
person NOUN NN
. PUNCT .


### Extracting words by part of speech

* Knowing which part of speech each word belongs to makes it possible to extract words by part of speech and recombine them

In [13]:
nouns = [item.text for item in doc if item.pos_ == 'NOUN']
adjectives = [item.text for item in doc if item.pos_ == 'ADJ']

And below, some code to print out random pairings of an adjective from the text with a noun from the text:

In [14]:
for i in range(10):
    print(random.choice(adjectives) + " " + random.choice(nouns))

free right
free rights
free liberty
equal security
human one
free conscience
human person
equal brotherhood
human life
free dignity


In [15]:
verbs = [item.text for item in doc if item.pos_ == 'VERB']

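* The list of verbs may look a bit unintuitive at first, because it also includes auxiliaries and modals such as "are" and "should"

* The `.pos_` attribute gives only general information about the part of speech, while the `.tag_` attribute is more specific about the kind of verb (for example, `VBN` marks a past participle)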

In [16]:
verbs

['are', 'born', 'are', 'endowed', 'should', 'act', 'has']

In [18]:
only_past = [item.text for item in doc if item.tag_ == 'VBN']
only_past

['born', 'endowed']
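
To see how the coarse `pos_` values relate to the fine-grained `tag_` values, the sketch below groups a few `(text, pos, tag)` triples copied by hand from the tagging output above. It uses only the standard library; the `tagged` list is sample data, not a live spaCy call.

```python
from collections import defaultdict

# (text, pos_, tag_) triples copied from the tagging output shown earlier.
tagged = [
    ("are", "VERB", "VBP"),
    ("born", "VERB", "VBN"),
    ("endowed", "VERB", "VBN"),
    ("should", "VERB", "MD"),
    ("act", "VERB", "VB"),
    ("has", "VERB", "VBZ"),
    ("beings", "NOUN", "NNS"),
    ("dignity", "NOUN", "NN"),
]

# Map each coarse part of speech to the set of fine-grained tags seen under it.
tags_by_pos = defaultdict(set)
for text, pos, tag in tagged:
    tags_by_pos[pos].add(tag)

for pos in sorted(tags_by_pos):
    print(pos, sorted(tags_by_pos[pos]))

# Filtering on the fine-grained tag narrows the VERB group to past participles.
past_participles = [text for text, pos, tag in tagged if tag == "VBN"]
print(past_participles)
```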

## Phrases and larger syntactic structures

* There are several different ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a "dependency grammar."


## Larger syntactic units

* Larger chunks of a sentence are available through `.noun_chunks`, which is an attribute of a document or a sentence

* It yields [spans](https://spacy.io/docs/api/span) of noun phrases

In [19]:
noun_chunks = [item.text for item in doc.noun_chunks]
print(", ".join(noun_chunks))

All human beings, dignity, rights, They, reason, conscience, one, a spirit, brotherhood, Everyone, the right, life, liberty, security, person


# Understanding dependency grammars

* Dependency Parsing: spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a `Doc` object has been parsed with the `doc.is_parsed` attribute, which returns a boolean value. If this attribute is `False`, the default sentence iterator will raise an exception.

* The spaCy library parses the underlying sentences using a [dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar). 

* Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based [phrase structure grammars](https://en.wikipedia.org/wiki/Phrase_structure_grammar) commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a "dependent" of some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

* The question of how to know what constitutes a "head" and a "dependent" is complicated. 

* As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H."

Dependents are related to their heads by a *syntactic relation*. The name of the syntactic relation describes the relationship between the head and the dependent. Use the displaCy visualizer (linked below) to see how a particular sentence is parsed, and what the relations between the heads and dependents are.

Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents). The code further below prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, and the word's children (i.e., dependent words).

# Visualising dependencies and entities in your browser or in a notebook

* Visualising a dependency parse or named entities in a text is not only a fun NLP demo – it can also be incredibly helpful in speeding up development and debugging your code and training process. 

In [47]:
from spacy import displacy

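# Use a separate variable so that the `doc` created earlier is not overwritten;
# later cells still index into its sentences.
demo_doc = nlp(u"This is a sentence.")
displacy.render(demo_doc, style="dep")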

![displacy parse](http://static.decontextualize.com/syntax_example.png)

[See in "displacy", spaCy's syntax visualization tool.](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0)

In [20]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
    print()

Word: Everyone
Tag: NN
Head: has
Dependency relation: nsubj
Children: []

Word: has
Tag: VBZ
Head: has
Dependency relation: ROOT
Children: [Everyone, right, .]

Word: the
Tag: DT
Head: right
Dependency relation: det
Children: []

Word: right
Tag: NN
Head: has
Dependency relation: dobj
Children: [the, to]

Word: to
Tag: IN
Head: right
Dependency relation: prep
Children: [life]

Word: life
Tag: NN
Head: to
Dependency relation: pobj
Children: [,, liberty]

Word: ,
Tag: ,
Head: life
Dependency relation: punct
Children: []

Word: liberty
Tag: NN
Head: life
Dependency relation: conj
Children: [and, security]

Word: and
Tag: CC
Head: liberty
Dependency relation: cc
Children: []

Word: security
Tag: NN
Head: liberty
Dependency relation: conj
Children: [of]

Word: of
Tag: IN
Head: security
Dependency relation: prep
Children: [person]

Word: person
Tag: NN
Head: of
Dependency relation: pobj
Children: []

Word: .
Tag: .
Head: has
Dependency relation: punct
Children: []



Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition
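
The head/dependent structure can be modelled with nothing more than a table of head pointers. The sketch below copies the parse of "Everyone has the right to life, liberty and security of person." from the output above into plain Python lists. The helpers `children_of` and `subtree_of` are hypothetical names used for illustration; they are not spaCy functions.

```python
# Tokens and the index of each token's head, copied from the parse shown above.
# The root ("has") is its own head, as in the spaCy output.
words = ["Everyone", "has", "the", "right", "to", "life", ",",
         "liberty", "and", "security", "of", "person", "."]
heads = [1, 1, 3, 1, 3, 4, 5, 5, 7, 7, 9, 10, 1]

def children_of(i):
    """Indices of the tokens whose head is token i (the root is not its own child)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]

def subtree_of(i):
    """Indices of token i and all of its descendants, in sentence order."""
    found = [i]
    for child in children_of(i):
        found.extend(subtree_of(child))
    return sorted(found)

# The root is the only token that is its own head.
root = next(i for i, h in enumerate(heads) if h == i)
print(words[root])
print([words[j] for j in children_of(root)])
print(" ".join(words[j] for j in subtree_of(3)))
```

Following the head pointers upward from any token eventually reaches the root, which is why the root's subtree covers the whole sentence.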

### Using .subtree for extracting syntactic units

* The `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`

* `flatten_subtree` merges a subtree and returns a string with the text of the words

In [21]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

In [22]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Flattened subtree: ", flatten_subtree(word.subtree))
    print()

Word: Everyone
Flattened subtree:  Everyone

Word: has
Flattened subtree:  Everyone has the right to life, liberty and security of person.

Word: the
Flattened subtree:  the

Word: right
Flattened subtree:  the right to life, liberty and security of person

Word: to
Flattened subtree:  to life, liberty and security of person

Word: life
Flattened subtree:  life, liberty and security of person

Word: ,
Flattened subtree:  ,

Word: liberty
Flattened subtree:  liberty and security of person

Word: and
Flattened subtree:  and

Word: security
Flattened subtree:  security of person

Word: of
Flattened subtree:  of person

Word: person
Flattened subtree:  person

Word: .
Flattened subtree:  .



Using the subtree and knowledge of dependency relation types, larger syntactic units can be extracted based on their relationship to the rest of the sentence. For example, every subject in the text:

In [None]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [75]:
subjects

['All human beings', 'They', 'Everyone']

Or every prepositional phrase:

In [76]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [77]:
prep_phrases

['in dignity and rights',
 'with reason and conscience',
 'towards one another',
 'in a spirit of brotherhood',
 'of brotherhood',
 'to life, liberty and security of person',
 'of person']

## Entity extraction

* A common task in NLP is taking a text and extracting "named entities" from it—basically, proper nouns, or names of companies, products, locations, etc. You can easily access this information using the `.ents` property of a document.

In [None]:
doc2 = nlp("John McCain and I visited the Apple Store in Manhattan.")

In [79]:
for item in doc2.ents:
    print(item)

John McCain
Apple
Manhattan


Entity objects have a `.label_` attribute that tells you the type of the entity. ([Here's a full list of the built-in entity types.](https://spacy.io/docs/usage/entity-recognition#entity-types))

In [80]:
for item in doc2.ents:
    print(item.text, item.label_)

John McCain PERSON
Apple ORG
Manhattan GPE


## Loading data from a file

You can load data from a file easily with spaCy. [Here are the first few verses from the King James Version of the Bible](http://rwet.decontextualize.com/texts/genesis.txt), for example. (Download the linked file; the code below expects to find it at `./data/test1.txt`.)

In [27]:
# Read the whole file, closing it afterwards, and parse its contents.
with open("./data/test1.txt") as f:
    doc3 = nlp(f.read())

From here, we can see what entities were here with us from the very beginning:

In [28]:
for item in doc3.ents:
    print(item.text, item.label_)

God PERSON
God PERSON
God PERSON
the light Day DATE
Night PERSON
the evening TIME
the first day DATE
God PERSON
God PERSON
God PERSON
the evening TIME
the second day DATE
God PERSON
one CARDINAL
Earth LOC
God PERSON
the evening TIME
the third day DATE
God PERSON
the day DATE
the night TIME
two CARDINAL
the day DATE
the night TIME
the day DATE
the night TIME
the evening TIME
the fourth day DATE
God PERSON
God PERSON
the evening TIME
the fifth day DATE
God PERSON
God PERSON
God PERSON
God PERSON
Behold PERSON
the evening TIME
the sixth day DATE


In [29]:
[item.text for item in doc3.ents if item.label_ == 'TIME']

['the evening',
 'the evening',
 'the evening',
 'the night',
 'the night',
 'the night',
 'the evening',
 'the evening',
 'the evening']

## Approaches to keyword extraction

* "Keyword extraction" is the name for any kind of procedure that attempts to identify a subset of words in a text as being representative of that text's overall meaning

* Overview of different keyword extraction techniques (sometimes also called "automatic terminology recognition") from a number of different disciplines:

* Astrakhantsev, N. “ATR4S: Toolkit with State-of-the-Art Automatic Terms Recognition Methods in Scala.” ArXiv:1611.07804 [Cs], Nov. 2016. arXiv.org, http://arxiv.org/abs/1611.07804.
* [Chuang, Jason, et al. “‘Without the Clutter of Unimportant Words’: Descriptive Keyphrases for Text Visualization.” ACM Transactions on Computer-Human Interaction (TOCHI), vol. 19, no. 3, 2012, p. 19.](http://vis.stanford.edu/papers/keyphrases)
* [Understanding Keyness](http://www.thegrammarlab.com/?nor-portfolio=understanding-keyness) from the Grammar Lab



### Counting words

* One of the simplest ways to extract keywords from a text is to find the words that occur most frequently.

* Python's `Counter` object provides an easy way to count the number of times that particular items occur in a list

In [23]:
from collections import Counter

And then pass a list of strings to `Counter()`, assigning the result to a variable. I'll start by just counting raw word counts:

In [30]:
word_counts = Counter([item.text for item in doc3 if item.is_alpha])

In [31]:
len(word_counts)

161

(The `if item.is_alpha` clause in the list comprehension above limits the list to only tokens that consist of alphabetic characters, i.e., excluding punctuation and numbers.)

The `word_counts` variable contains a `Counter` object, which has a few interesting methods and properties. If you just evaluate it, you get a dictionary-like object that maps tokens to the number of times those tokens occur:

In [86]:
word_counts

Counter({'And': 33,
         'Be': 2,
         'Behold': 1,
         'Day': 1,
         'Earth': 1,
         'God': 32,
         'Heaven': 1,
         'I': 2,
         'In': 1,
         'Let': 8,
         'Night': 1,
         'Seas': 1,
         'So': 1,
         'Spirit': 1,
         'a': 2,
         'above': 2,
         'abundantly': 2,
         'after': 11,
         'air': 3,
         'all': 2,
         'also': 1,
         'and': 64,
         'appear': 1,
         'be': 7,
         'bearing': 1,
         'beast': 3,
         'beginning': 1,
         'behold': 1,
         'blessed': 2,
         'bring': 3,
         'brought': 2,
         'called': 5,
         'cattle': 3,
         'created': 5,
         'creature': 3,
         'creepeth': 3,
         'creeping': 2,
         'darkness': 4,
         'day': 9,
         'days': 1,
         'deep': 1,
         'divide': 3,
         'divided': 2,
         'dominion': 2,
         'dry': 2,
         'earth': 20,
         'evening': 6,
      

You can get the count for a particular token by using square bracket indexing with the `Counter` object:
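
For example, here is a minimal sketch on a small hand-made token list (not the Genesis text). It also shows that a token which never occurs has a count of 0 rather than raising an error:

```python
from collections import Counter

# A small stand-in token list; in the notebook this would come from doc3.
tokens = ["God", "said", "Let", "there", "be", "light",
          "and", "there", "was", "light"]
toy_counts = Counter(tokens)

print(toy_counts["light"])      # count for a token that occurs
print(toy_counts["firmament"])  # a token that never occurs counts as 0
```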

Or you can get the *n* most frequent items using the `.most_common()` method, which takes an integer parameter to limit the list to a certain number of items, sorted from most frequent to least:

In [36]:
word_counts.most_common(5)

[('the', 108), ('and', 64), ('And', 33), ('God', 32), ('earth', 20)]

This is a list of [tuples](https://docs.python.org/3.5/library/stdtypes.html#typesseq-tuple). (Tuples are just like lists, except you can't change them after you create them.) To get just the list of the ten most common words:

In [89]:
top_ten_words = [item[0] for item in word_counts.most_common(10)]
print(", ".join(top_ten_words))

the, and, And, God, earth, of, was, it, that, in


You can think of this as a kind of (very simple!) list of keywords–essentially, the words that occur in this document more than any other word.

The following expression evaluates to a list of every word in the text and the proportion of the text that it comprises, as a fraction between 0 and 1. (To keep things short, I'm just getting the first 25 items from the list using the list slice syntax `[:25]`.)

In [90]:
total_words = sum(word_counts.values())
[(word, count / total_words) for word, count in word_counts.items()][:25]

[('In', 0.0012547051442910915),
 ('the', 0.1355081555834379),
 ('beginning', 0.0012547051442910915),
 ('God', 0.04015056461731493),
 ('created', 0.006273525721455458),
 ('heaven', 0.0075282308657465494),
 ('and', 0.08030112923462986),
 ('earth', 0.025094102885821833),
 ('And', 0.04140526976160602),
 ('was', 0.02132998745294856),
 ('without', 0.0012547051442910915),
 ('form', 0.0012547051442910915),
 ('void', 0.0012547051442910915),
 ('darkness', 0.005018820577164366),
 ('upon', 0.012547051442910916),
 ('face', 0.0037641154328732747),
 ('of', 0.025094102885821833),
 ('deep', 0.0012547051442910915),
 ('Spirit', 0.0012547051442910915),
 ('moved', 0.0012547051442910915),
 ('waters', 0.013801756587202008),
 ('said', 0.012547051442910916),
 ('Let', 0.010037641154328732),
 ('there', 0.006273525721455458),
 ('be', 0.00878293601003764)]

### Word probabilities

* The probability that a given word will occur in any text written in English


* spaCy's model includes—for every word in its vocabulary—the word's [log probability](https://en.wikipedia.org/wiki/Log_probability) estimate, based on a large corpus of English texts. You can access a word's log probability estimate in English using the `.prob` attribute of the `Token` object (which is what you get when you iterate over a document or a sentence.)

In [38]:
[(item.text, item.prob) for item in doc3][:25]

[('In', -20.0),
 ('the', -20.0),
 ('beginning', -20.0),
 ('God', -20.0),
 ('created', -20.0),
 ('the', -20.0),
 ('heaven', -20.0),
 ('and', -20.0),
 ('the', -20.0),
 ('earth', -20.0),
 ('.', -20.0),
 ('\n', -20.0),
 ('And', -20.0),
 ('the', -20.0),
 ('earth', -20.0),
 ('was', -20.0),
 ('without', -20.0),
 ('form', -20.0),
 (',', -20.0),
 ('and', -20.0),
 ('void', -20.0),
 (';', -20.0),
 ('and', -20.0),
 ('darkness', -20.0),
 ('was', -20.0)]

Lower numbers (i.e., numbers that are more negative) indicate rarer words. You can also look up any word's probability through the `.vocab` attribute of the [`Language`](https://spacy.io/api/language) object, which is the `nlp` object we initially created by calling `spacy.load()`. Indexing `nlp.vocab` with a string returns a [`Lexeme`](https://spacy.io/api/lexeme) object:

In [42]:
water = nlp.vocab['water']

In [43]:
water.prob

-20.0

By the way: you can convert a log probability back to a probability by raising the constant $e$ to the power of the log probability. The constant $e$ is included as part of the `math` package, and the operator to raise a value to a power in Python is `**`:

In [44]:
from math import e
e**water.prob

2.06115362243856e-09
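
The same round trip can be checked without spaCy. This sketch uses `math.log` and `math.exp` on a made-up probability of 0.001, which is an illustrative value and not a spaCy estimate. It assumes natural logarithms, as the `e**` conversion in this tutorial does.

```python
import math

p = 0.001                    # made-up probability: one word in a thousand
log_p = math.log(p)          # natural logarithm of the probability
recovered = math.exp(log_p)  # equivalent to e ** log_p

print(log_p)                 # about -6.9
print(recovered)             # back to about 0.001
print(recovered * 100)       # the same value as a percentage, about 0.1
```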

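* This tells us that, according to spaCy, if you pick a word at random from any given English text, the chance of it being "water" is about 2.06e-09, i.e. roughly 0.0000002%

* Note that a value of exactly -20.0 for every word, as in the outputs above, usually means that no word-frequency table is loaded and spaCy is reporting a default value, so these particular numbers should not be read as real estimates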

* A first approximation, then, of our task to find the words that are uniquely probable in our text would be simply to get a list of the *least common words* in the text, as judged by spaCy's word probability estimate

In [45]:
unique_words = list(set([item.text for item in doc3 if item.is_alpha]))

Then, using Python's `sorted()` function, we can sort these according to their probability and keep only the fifteen rarest words in the text.

In [96]:
sorted(unique_words, key=lambda x: nlp.vocab[x].prob)[:15]

['moveth',
 'creepeth',
 'firmament',
 'Seas',
 'fowl',
 'yielding',
 'subdue',
 'abundantly',
 'Behold',
 'fruitful',
 'replenish',
 'likeness',
 'hath',
 'winged',
 'dominion']

# Word weirdness

* A measure of the *uniqueness* of the probability of a given word

* An easy and intuitive way to calculate this is simply to find the ratio of the word's probability in our document to spaCy's estimate of the word's probability in English

* This ratio has been called a word's "weirdness" ([Ahmad, Khurshid, et al. “University of Surrey Participation in TREC8: Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER).” TREC, 1999, pp. 1–8.](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.3364)); a similar measure called "log ratio" was proposed by [Andrew Hardie here](http://cass.lancs.ac.uk/?p=1133)

* In the code below, a word's "weirdness" score is obtained by squaring its frequency in our source document (Genesis) and dividing the result by its English probability estimate from spaCy, which is recovered from the log probability with `e**`

* [See this tutorial](quick-and-dirty-keywords.ipynb) for more statistical analysis of this concept
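
As a worked sketch of this ratio, the hypothetical helper below squares a word's relative frequency in the document and divides it by the word's probability in general English, recovered from a log probability. The squaring mirrors the `pow(..., 2)` used in this notebook; the counts and log probabilities are invented for illustration.

```python
import math

def weirdness(count, total, log_prob):
    """Squared document frequency divided by the general-language probability."""
    doc_freq = count / total
    return doc_freq ** 2 / math.exp(log_prob)

total = 1000  # invented document length in words

# Invented numbers: a rare word used fairly often vs. a common word used often.
rare_score = weirdness(count=5, total=total, log_prob=math.log(1e-7))
common_score = weirdness(count=50, total=total, log_prob=math.log(0.05))

print(rare_score)    # roughly 250
print(common_score)  # roughly 0.05
```

The rare word scores far higher even though it occurs ten times less often in the document, because its frequency here is so much greater than its expected frequency in English.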

In [97]:
square_weirdness = [(item, pow(word_counts[item]/total_words, 2) / e**nlp.vocab[item].prob) for item in unique_words]

In [98]:
square_weirdness

[('that', 0.02680706036866266),
 ('Let', 0.8735509428958823),
 ('one', 0.0006328201748930747),
 ('brought', 0.09958702824984658),
 ('whales', 0.5626471557219519),
 ('behold', 0.6741497008189071),
 ('two', 0.002663514115913552),
 ('face', 0.06997393346725132),
 ('said', 0.2234619775500119),
 ('fish', 0.16378760290969788),
 ('and', 0.3942243826208631),
 ('of', 0.04530349263154088),
 ('may', 0.0034026034243890475),
 ('after', 0.27245900703261716),
 ('sixth', 0.4794220601229158),
 ('he', 0.021358924306546304),
 ('over', 0.17746281298749195),
 ('our', 0.010520954094237858),
 ('creepeth', 1378.6369355482045),
 ('In', 0.003156013802475997),
 ('God', 8.966795747370472),
 ('moved', 0.024726696730063457),
 ('give', 0.012733081112292561),
 ('deep', 0.026994624370536725),
 ('fruitful', 7.9162567496963385),
 ('lights', 0.49984586472853804),
 ('his', 0.09135656008221582),
 ('bearing', 0.21605927589715465),
 ('fill', 0.039745426646784425),
 ('dominion', 4.919916289305311),
 ('you', 0.0004996394704867

The higher the score, the weirder the word (i.e., the more particular it is to our source text versus English in general). Sorting by the score gives us our new list of keywords:

In [99]:
[item[0] for item in sorted(square_weirdness, reverse=True, key=lambda x: x[1])][:15]

['firmament',
 'creepeth',
 'moveth',
 'fowl',
 'yielding',
 'waters',
 'earth',
 'herb',
 'God',
 'abundantly',
 'fruitful',
 'dominion',
 'cattle',
 'seed',
 'multiply']

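This list has many of the same words as the "just the least probable" list above, but it also surfaces words such as "earth", "waters" and "God", which are common in English yet unusually frequent in this text.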

### Counting parsed units

* Another simple way to pull out common words and phrases is to focus on only particular stretches of the document that have certain syntactic or semantic characteristics, as determined by spaCy's parser

In [100]:
noun_counts = Counter([item.text for item in doc3 if item.pos_ == 'NOUN'])

... and then getting just the ten most common nouns:

In [101]:
top_ten_nouns = [item[0] for item in noun_counts.most_common(10)]
print(", ".join(top_ten_nouns))

earth, waters, kind, day, firmament, light, evening, morning, seed, fowl


Here's the same thing with noun chunks:

In [102]:
chunk_counts = Counter([item.text for item in doc3.noun_chunks])
top_ten_chunks = [item[0] for item in chunk_counts.most_common(10)]
print(", ".join(top_ten_chunks))

God, the earth, it, the waters, his kind, them, the firmament, the evening, the morning, the heaven


Or with named entities:

In [103]:
entity_counts = Counter([item.text for item in doc3.ents])
top_ten_entities = [item[0] for item in entity_counts.most_common(10)]
print(", ".join(top_ten_entities))

earth, the day, God, the evening and the morning, the night, Night, the first day, the second day, one, Earth


Or with subjects of sentences:

In [104]:
subject_counts = Counter([item.text for item in doc3 if item.dep_ == 'nsubj'])
top_ten_subjects = [item[0] for item in subject_counts.most_common(10)]
print(", ".join(top_ten_subjects))

God, it, that, evening, earth, which, he, them, seed, waters


# Pipelines

When you call `nlp` on a text in **spaCy**:

1. spaCy first tokenizes the text to produce a `Doc` object.
2. The `Doc` is then processed in several different steps, which is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

![](./data/pipe.svg?raw=true)
##### source: https://spacy.io/

| Name | Component | Creates | Description |
| --- | --- | --- | --- |
| **tokenizer** | `Tokenizer` | `Doc` | Segment text into tokens. |
| **tagger** | `Tagger` | `Doc[i].tag` | Assign part-of-speech tags. |
| **parser** | `DependencyParser` | `Doc[i].head`, `Doc[i].dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. |
| **ner** | `EntityRecognizer` | `Doc.ents`, `Doc[i].ent_iob`, `Doc[i].ent_type` | Detect and label named entities. |
| **textcat** | `TextCategorizer` | `Doc.cats` | Assign document labels. |
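
Conceptually, the pipeline is a sequence of functions that each take the document, annotate it, and return it for the next step. The sketch below imitates that flow with a plain dictionary standing in for the `Doc`. The component functions are toy placeholders with made-up rules, not spaCy's implementations.

```python
def tokenizer(text):
    """Toy tokenizer: split on whitespace and wrap the result in a dict 'doc'."""
    return {"text": text, "tokens": text.split()}

def tagger(doc):
    """Toy tagger: mark capitalised tokens as PROPN, everything else as X."""
    doc["tags"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
    return doc

def entity_recognizer(doc):
    """Toy NER: treat every token tagged PROPN as an entity."""
    doc["ents"] = [t for t, tag in zip(doc["tokens"], doc["tags"])
                   if tag == "PROPN"]
    return doc

# Components run in order; each one relies on what earlier ones added.
pipeline = [tagger, entity_recognizer]

def toy_nlp(text):
    """Tokenize first, then pass the doc through each component in order."""
    doc = tokenizer(text)
    for component in pipeline:
        doc = component(doc)
    return doc

result = toy_nlp("we visited Manhattan with John")
print(result["ents"])
```

The order matters in this sketch: the toy entity recognizer reads the tags, so it has to run after the tagger.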




# Topic Modelling using spaCy and Scikit-learn

# What is topic-modelling?

* wiki: In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. 

* Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. 


* Various techniques of dimensionality reduction (mostly non-linear) and unsupervised learning, such as LDA, SVD and autoencoders, are used for topic modelling

Source: [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)
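
The 10% cats / 90% dogs intuition can be checked with a little arithmetic. In the sketch below, the two topic word distributions are invented for illustration; the word probabilities of the document are the mixture of the topics, weighted by their proportions. This shows only the mixture idea, not how a topic model is fitted.

```python
# Invented word distributions for two topics; each sums to 1.
topics = {
    "dogs": {"dog": 0.5, "bone": 0.3, "the": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "the": 0.2},
}
# A document that is 90% about dogs and 10% about cats.
mixture = {"dogs": 0.9, "cats": 0.1}

# Probability of each word in the document = sum over topics of
# (topic proportion) * (word probability within that topic).
word_probs = {}
for topic, weight in mixture.items():
    for word, p in topics[topic].items():
        word_probs[word] = word_probs.get(word, 0.0) + weight * p

dog_words = word_probs["dog"] + word_probs["bone"]
cat_words = word_probs["cat"] + word_probs["meow"]
print(round(dog_words / cat_words, 2))  # dog words are about 9 times as likely
```

The word "the" has the same probability under both topics, so its share of the document does not depend on the mixture.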

# Latent Dirichlet allocation (LDA)

* LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar



In [None]:
# Loading required libraries
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig

In [None]:
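# Creating a vectorizer
# (`wines` is assumed to be a pandas DataFrame, loaded elsewhere, with a
# "processed_description" column of preprocessed text.)
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(wines["processed_description"])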
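
To see which tokens the `token_pattern` above keeps, the same regular expression can be tried with the standard `re` module. This sketch only imitates the lowercasing and the pattern matching; it does not apply `stop_words`, `min_df` or `max_df`, and the sample sentence is made up.

```python
import re
from collections import Counter

# Same pattern as passed to CountVectorizer: letters or hyphens, 3+ characters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z\-][a-zA-Z\-]{2,}")

sample = "A dry, off-dry or semi-sweet wine in 2015: dry on the palate."
tokens = TOKEN_PATTERN.findall(sample.lower())
bag_of_words = Counter(tokens)

print(tokens)
print(bag_of_words["dry"])
```

Short words such as "a", "or" and "in" and the number "2015" are dropped, while hyphenated words survive as single tokens.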

In [None]:
NUM_TOPICS = 10

In [None]:
# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

In [None]:
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized) 

In [None]:
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)

In [None]:
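# Function for printing the top keywords for each topic
def selected_topics(model, vectorizer, top_n=10):
    # get_feature_names_out() replaces get_feature_names(), which was removed
    # in scikit-learn 1.2; on older versions use get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])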
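
The slice `topic.argsort()[:-top_n - 1:-1]` selects the indices of the `top_n` largest weights, largest first. The same selection can be written in plain Python, shown here as a sketch with invented feature names and weights:

```python
# Invented vocabulary and weights, standing in for one row of components_.
feature_names = ["apple", "berry", "cherry", "oak", "tannin"]
topic_weights = [0.2, 3.1, 0.7, 2.4, 1.5]
top_n = 3

# Sort the indices by weight, largest first, and keep the first top_n of them.
top_indices = sorted(range(len(topic_weights)),
                     key=lambda i: topic_weights[i],
                     reverse=True)[:top_n]
top_terms = [(feature_names[i], topic_weights[i]) for i in top_indices]
print(top_terms)
```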

In [None]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

In [None]:
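# Transforming an individual sentence
# (`spacy_tokenizer` is assumed to be a preprocessing function defined
# elsewhere; it is not part of spaCy or scikit-learn.)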
text = spacy_tokenizer("Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.")
x = lda.transform(vectorizer.transform([text]))[0]
print(x)

Further reading and resources
-------------------------

* [The tutorials on the official site](https://spacy.io/docs/usage/tutorials) 

* [Matthew Honnibal - Designing spaCy: Industrial-strength NLP](https://www.youtube.com/watch?v=gJJQs47aUQ0)