In [1]:
#meta 10/28/2020  myLearning: SpaCy
#src: https://spacy.io/usage/spacy-101

#install reqs in env
#conda install -c conda-forge spacy
#conda install -c conda-forge spacy-model-en_core_web_sm
#conda install -c conda-forge spacy-model-en_core_web_md
#conda install -c conda-forge spacy-lookups-data

#history
#  11/13/2020 LEMMATIZER
#      default works
# next: custom lemmatizer for plural nouns only

In [2]:
import spacy
from spacy import displacy  # visualize a sentence's entities and dependencies


# spaCy 101: Everything you need to know
https://spacy.io/usage/spacy-101

spaCy helps you build applications that process and “understand” large volumes of text.  It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.  When working with text, you eventually want to know:
- What’s it about? 
- What do the words mean in context?
- Who is doing what to whom? 
- What companies and products are mentioned? 
- Which texts are similar to each other?

## 0. Overview
### Features
- Tokenization - segmenting text into words, punctuation marks etc.
- Part-of-speech (POS) Tagging - assigning word types to tokens, like verb or noun.
- Dependency Parsing - assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- Lemmatization - assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “cats” is “cat”.
- Sentence Boundary Detection (SBD) - finding and segmenting individual sentences.
- Named Entity Recognition (NER) - labelling named “real-world” objects, like persons, companies or locations.
- Entity Linking (EL) - disambiguating textual entities to unique identifiers in a Knowledge Base.
- Similarity - comparing words, text spans and documents and how similar they are to each other.
- Text Classification - assigning categories or labels to a whole document, or parts of a document.
- Rule-based Matching - finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- Training - updating and improving a statistical model’s predictions.
- Serialization - saving objects to files or byte strings.

### Statistical models
While some of spaCy’s features work independently, others require statistical models to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy and the data they include. The model you choose always depends on your use case and the texts you’re working with. For a general-purpose use case, the small, default models are always a good start. They typically include the following components:

- Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
- Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- Data files like lemmatization rules and lookup tables.
- Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- Configuration options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.

## 1. Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.

### 1.0 Load Model
`spacy.load` returns a `Language` object containing all components and data needed to process text.

In [3]:
#nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_md")
nlp.__class__

spacy.lang.en.English

`nlp(text)` returns a processed `Doc`.

The `Doc` is split into individual words and annotated, yet it still holds all information of the original text, like whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you’ll never lose any information when processing text with spaCy.
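
The round trip described above can be sketched without spaCy. The helper `split_keep_whitespace` below is a hypothetical, whitespace-only splitter (it is not spaCy's tokenizer and assumes the text does not start with whitespace). It only illustrates the idea of keeping each token's trailing whitespace and character offset, which spaCy exposes as `Token.text_with_ws` and `Token.idx`.

```python
import re

def split_keep_whitespace(text):
    """Split on whitespace, keeping (token, trailing_ws, offset) triples.

    Illustrative stand-in only: spaCy's real tokenizer also splits punctuation.
    """
    return [(m.group(1), m.group(2), m.start(1))
            for m in re.finditer(r"(\S+)(\s*)", text)]

text = "Apple is looking at buying U.K. startup for $1 billion."
pieces = split_keep_whitespace(text)

# Joining every token with its trailing whitespace rebuilds the input exactly.
rebuilt = "".join(tok + ws for tok, ws, _ in pieces)
print(rebuilt == text)

# Each offset points back at the token's position in the original string.
for tok, _, offset in pieces:
    assert text[offset:offset + len(tok)] == tok
```

Because nothing is discarded, any annotation attached to a token can be mapped back to a span of the original string.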

In [4]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
my_doc = nlp("Each employee counts all employees matter good is best bad can be better.  Love my wages, great wages, high pay, great pay.  Employees want higher pay, higher wages. Companies: Expeditors and Google. Currencies: $2 mil and C$1 mil")

for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
. PUNCT punct


### 1.1 Tokenization
DOESN'T NEED MODEL  
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each `Doc` consists of individual tokens, and we can iterate over them.
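
As a rough illustration of such rules, here is a minimal pure-Python sketch. The names `EXCEPTIONS`, `PREFIXES`, `SUFFIXES` and `tokenize` are made up for this example, and the rule sets are far smaller than spaCy's real language data. It splits on whitespace, leaves known exceptions intact, and peels prefixes and suffixes off everything else.

```python
# Hypothetical, tiny rule sets for illustration only.
EXCEPTIONS = {"U.K."}            # strings that must stay one token
PREFIXES = ("$",)                # characters split off the front
SUFFIXES = (".", ",", "!", "?")  # characters split off the end

def tokenize(text):
    tokens = []
    for chunk in text.split():
        prefix_tokens, suffix_tokens = [], []
        # Peel leading characters such as "$" off the chunk.
        while len(chunk) > 1 and chunk not in EXCEPTIONS and chunk.startswith(PREFIXES):
            prefix_tokens.append(chunk[0])
            chunk = chunk[1:]
        # Peel trailing punctuation, but stop as soon as an exception remains.
        while len(chunk) > 1 and chunk not in EXCEPTIONS and chunk.endswith(SUFFIXES):
            suffix_tokens.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefix_tokens + [chunk] + suffix_tokens)
    return tokens

print(tokenize("Apple is looking at buying U.K. startup for $1 billion."))
```

The exception check is what keeps “U.K.” whole while the final period of the sentence is still split off.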

<a href="https://spacy.io/usage/spacy-101#statistical-models"><img src="images/spacy_tokenize.PNG" /></a>

In [5]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.


### 1.2 Tagging POS and Dependencies
NEEDS MODEL  
After tokenization, spaCy can parse and tag a given `Doc`. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as `Token` attributes https://spacy.io/api/token#attributes

- Text: The original word text.  
- Lemma: The base form of the word.  
- POS: The simple UPOS part-of-speech tag.  
- Tag: The detailed part-of-speech tag.  
- Dep: Syntactic dependency, i.e. the relation between tokens.  
- Shape: The word shape – capitalization, punctuation, digits.  
- is alpha: Does the token consist of alphabetic characters?  
- is stop: Is the token part of a stop list, i.e. the most common words of the language?  

To get the readable string representation of an attribute, we need to add an underscore `_` to its name

In [6]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
. . PUNCT . punct . False False


##### Understanding tags and labels
Most of the tags and labels look pretty abstract, and they vary between languages. `spacy.explain` will show you a short description of a tag or label.

In [7]:
spacy.explain("VBZ")

'verb, 3rd person singular present'

In [8]:
this_doc = nlp("Hello world")
for word in this_doc:
   print(word.text, word.tag_, spacy.explain(word.tag_))

Hello UH interjection
world NN noun, singular or mass


### 1.2a Visualize
https://spacy.io/usage/visualizers

Using spaCy’s built-in `displaCy` visualizer, here’s what our example sentence and its dependencies look like

In [9]:
displacy.render(doc, style="dep") #was displacy.serve

### 1.3 Named Entities
NEEDS MODEL  
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

- Text: The original entity text.
- Start: Character offset of the start of the entity in the Doc's text (`start_char`).
- End: Character offset of the end of the entity in the Doc's text (`end_char`).
- Label: Entity label, i.e. type.

Description
- `ORG` Companies, agencies, institutions
- `GPE` Geopolitical entity, i.e. countries, cities, states
- `MONEY` Monetary values, including unit

Named entities are available as the `ents` property of a `Doc`

In [10]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
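
The two numbers printed for each entity are character offsets into the original string, with an exclusive end. A quick check with plain slicing, using the offsets copied from the output above:

```python
text = "Apple is looking at buying U.K. startup for $1 billion."

# (start_char, end_char, label) triples copied from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

for start, end, label in entities:
    # end_char is exclusive, so a plain slice recovers the entity text.
    print(text[start:end], label)
```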


### 1.3a Visualize Named Entities

Using spaCy’s built-in `displaCy` visualizer, here’s what our example sentence and its named entities look like

In [11]:
displacy.render(doc, style="ent")

My custom examples

In [12]:
for my_ent in my_doc.ents:
    print(my_ent.text, my_ent.start_char, my_ent.end_char, my_ent.label_)

Google 192 198 ORG
$2 mil 212 218 MONEY
C$1 mil 223 230 MONEY


In [13]:
displacy.render(my_doc, style="ent")

### 1.4 Word vectors and similarity 
NEEDS MODEL  
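Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like `word2vec`. The cell below prints, for each token, whether it has a vector, the L2 norm of that vector, and whether the token is out-of-vocabulary.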

In [14]:
tokens = nlp("dog cat banana afskfsd") #$todo results don't make sense

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
afskfsd False 0.0 True


In [15]:
doc = nlp("Dog and cat are similar. Dog and banana aren't.")

dog = doc[0]
cat = doc[2]
banana = doc[8]

print("dog <-> cat", dog.similarity(cat))
print("dog <-> banana", dog.similarity(banana))
print(dog.has_vector, cat.has_vector, banana.has_vector)

dog <-> cat 0.80168545
dog <-> banana 0.24327648
True True True
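
By default, spaCy's `similarity()` is the cosine similarity of the two vectors, and `vector_norm` is the vector's L2 norm. The sketch below computes both in plain Python. The three-dimensional `toy_vectors` are invented for illustration and are not taken from any model, and returning `0.0` for a zero vector is a choice made in this sketch.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no usable vector (e.g. an out-of-vocabulary word)
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print("dog <-> cat", cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print("dog <-> banana", cosine_similarity(toy_vectors["dog"], toy_vectors["banana"]))
```

Vectors pointing in nearly the same direction score close to 1, which is why "dog" and "cat" rank above "dog" and "banana".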


My custom examples

In [16]:
for my_token in nlp("Each employee counts all employees matter. Companies: Expeditors and Google. Currencies: $2 mil and C$1 mil"):
    print(my_token.text, my_token.has_vector, my_token.vector_norm, my_token.is_oov)

Each True 5.700853 False
employee True 6.5591154 False
counts True 5.67418 False
all True 4.932093 False
employees True 6.81619 False
matter True 5.280463 False
. True 4.9316354 False
Companies True 6.7256937 False
: True 5.474056 False
Expeditors True 6.5938115 False
and True 4.6577983 False
Google True 6.3633595 False
. True 4.9316354 False
Currencies True 7.7191978 False
: True 5.474056 False
$ True 7.748268 False
2 True 5.163114 False
mil True 6.8282313 False
and True 4.6577983 False
C$ False 0.0 True
1 True 5.269974 False
mil True 6.8282313 False


##### Important note
To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger model. 

$note: difference in results between small and medium model and compare similarity scores

## 2. Pipelines
When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

<a href="https://spacy.io/usage/spacy-101#pipelines"><img src="images/spacy_pipeline.png" /></a>

The tokenizer is a “special” component and isn’t part of the regular pipeline. The reason is that there can only really be one tokenizer, and while all other pipeline components take a `Doc` and return it, the tokenizer takes a string of text and turns it into a `Doc`. You can still customize the `tokenizer`, though. `nlp.tokenizer` is writable, so you can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.
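
The contract described above (the tokenizer turns a string into a doc, every later component takes a doc and returns it) can be sketched in plain Python. Everything here is hypothetical: the "doc" is a plain dict, and `whitespace_tokenizer`, `toy_tagger`, `toy_counter` and `run_pipeline` are invented names. The only detail borrowed from spaCy is that the pipeline is an ordered list of `(name, component)` pairs, as in `nlp.pipeline`.

```python
def whitespace_tokenizer(text):
    """Tokenizer: takes a string and creates the 'doc' (here a plain dict)."""
    return {"text": text, "tokens": text.split(), "annotations": {}}

def toy_tagger(doc):
    """Component: receives the doc, adds an annotation, returns the doc."""
    doc["annotations"]["is_title"] = [tok.istitle() for tok in doc["tokens"]]
    return doc

def toy_counter(doc):
    """Component: a second step that sees what earlier steps produced."""
    doc["annotations"]["n_tokens"] = len(doc["tokens"])
    return doc

def run_pipeline(text, tokenizer, components):
    doc = tokenizer(text)              # string -> doc, happens exactly once
    for name, component in components:
        doc = component(doc)           # doc -> doc, in pipeline order
    return doc

pipeline = [("toy_tagger", toy_tagger), ("toy_counter", toy_counter)]
result = run_pipeline("Apple is looking at buying a startup", whitespace_tokenizer, pipeline)
print(result["annotations"])
```

Swapping in a different tokenizer only requires another callable with the same string-in, doc-out shape, which mirrors why `nlp.tokenizer` can be replaced.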
<a href="https://spacy.io/usage/processing-pipelines"><img src="images/spacy_pipeline_components.png" width="600" /></a>

## 3. Language data
Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. 

<a href="https://spacy.io/usage/spacy-101#language-data"><img src= "images/spacy_language_data.png"></a>

In [17]:
from spacy.lang.en import English

en_nlp = English()  # Includes English data

- Stop words (`stop_words.py`): List of most common words of a language that are often useful to filter out, for example “and” or “I”. Matching tokens will return `True` for `is_stop`.
- Tokenizer exceptions (`tokenizer_exceptions.py`): Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”.
- Punctuation rules (`punctuation.py`): Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.
- Lemmatizer (`spacy-lookups-data`): Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example “be” for “was”.
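
To make the stop-word entry concrete, here is a minimal sketch of stop-word filtering. The `STOP_WORDS` set below is a tiny hypothetical sample (spaCy's English list is much longer), and lowercasing before the lookup is a choice made in this sketch.

```python
# Hypothetical, tiny stop list for illustration only.
STOP_WORDS = {"is", "at", "for", "and", "i", "the"}

def content_words(tokens):
    """Keep only the tokens that are not on the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for"]
print(content_words(tokens))
```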

### 3.1 Lemmatizer
src https://spacy.io/api/lemmatizer  
Assign the base forms of words

src https://spacy.io/usage/adding-languages#lemmatizer  
As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token’s lemma, spaCy simply looks it up in the table. 
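
Conceptually, a lookup lemmatizer is a dictionary access with a fallback. The sketch below is not spaCy's code: `LEMMA_LOOKUP` is a hypothetical three-entry table and `lookup_lemma` is an invented helper that returns the input unchanged when the string is missing from the table.

```python
# Hypothetical lookup table mapping a surface form to its lemma.
LEMMA_LOOKUP = {"was": "be", "cats": "cat", "employees": "employee"}

def lookup_lemma(string, table=LEMMA_LOOKUP):
    """Return the lemma stored for `string`, or `string` itself if unknown."""
    return table.get(string, string)

print(lookup_lemma("was"))
print(lookup_lemma("employment"))     # not in the table: returned as-is
print(lookup_lemma("employees", {}))  # empty table: nothing can be resolved
```

The quality of this approach depends entirely on the table: with an empty or missing table every word maps to itself.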

In [18]:
#Initialize a Lemmatizer (spaCy v2.x API; the spacy.lemmatizer module no longer exists in v3).
#Typically, this happens under the hood within spaCy when a Language subclass and its Vocab are initialized.
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups) #class spacy.lemmatizer.Lemmatizer
lemmatizer.__class__

spacy.lemmatizer.Lemmatizer

Lemmatize a string. Returns a `list` of the available lemmas for the string.

In [19]:
#Lemmatize a string
lemmas = lemmatizer("ducks", "NOUN")
print(lemmas)

#if True nothing happens, else AssertionError is raised
assert lemmas == ["duck"]

['duck']
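
The `"lemma_rules"` table used above holds `[old_suffix, new_suffix]` pairs per part of speech. A simplified sketch of how such suffix rules produce candidates is shown below. `apply_suffix_rules` is a hypothetical helper, and spaCy's real rule-based lemmatizer additionally consults index and exception tables, which this sketch ignores.

```python
def apply_suffix_rules(string, rules):
    """Return candidate lemmas produced by [old_suffix, new_suffix] rules."""
    candidates = []
    for old, new in rules:
        if old and string.endswith(old):
            candidates.append(string[: len(string) - len(old)] + new)
    # With no matching rule, the string itself is the only candidate.
    return candidates or [string]

noun_rules = [["s", ""]]  # same rule as in the "lemma_rules" table above

print(apply_suffix_rules("ducks", noun_rules))
print(apply_suffix_rules("employees", noun_rules))
print(apply_suffix_rules("feet", noun_rules))
```

A bare suffix rule cannot handle irregular plurals such as "feet", which is where exception or lookup tables come in.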


In [20]:
my_lemmas = lemmatizer("employees", "NOUN")
my_lemmas

['employee']

In [21]:
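#$ac doesn't work, fix: lookup() reads the "lemma_lookup" table, which was never added to `lookups`, so the input comes back unchanged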
lemmatizer.lookup("employees")

'employees'

In [22]:
#$ac doesn't work, fix: no "lemma_lookup" table was added to `lookups`, so lookup() returns each word unchanged
#word_list = ['feet', 'foot', 'foots', 'footing']
word_list = ['employee', 'employees', 'employment', 'employed']
#word_list = ['fly', 'flies', 'flying']
#word_list = ['organize', 'organizes', 'organizing']

[lemmatizer.lookup(word) for word in word_list]

['employee', 'employees', 'employment', 'employed']

### 3.2 Another way to lemma
src https://stackoverflow.com/questions/60306461/how-to-convert-plural-nouns-to-singular-using-spacy

Look at the `tag_` field for each word/token and only lemmatize it if it's an `NNS` or `NNPS`. The full list of tags can be found in the alphabetical list of part-of-speech tags used in the Penn Treebank Project:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

- NN	Noun, singular or mass  
- NNS	Noun, plural  
- NNP	Proper noun, singular  
- NNPS	Proper noun, plural  


In [23]:
processed_text = nlp('employee employees employment employed feet foot foots footing organize organizes organizing')
lemma_tags = {"NNS", "NNPS"}  # plural common nouns and plural proper nouns
for token in processed_text:
    # print the lemma only for tokens tagged as plural nouns
    if token.tag_ in lemma_tags:
        print(token.lemma_)

employee
foot
foot
organize
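
The cell above prints lemmas only for the tokens tagged as plural nouns. A possible next step toward the "plural nouns only" lemmatizer is a function that returns every token, singularizing just the plurals. The sketch below is hypothetical: `FakeToken` is a stand-in for a spaCy `Token`, and its tags and lemmas are hand-assigned for illustration, not model output.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token; values are hand-assigned.
FakeToken = namedtuple("FakeToken", ["text", "tag_", "lemma_"])

PLURAL_NOUN_TAGS = {"NNS", "NNPS"}

def singularize_plurals(tokens):
    """Lemmatize plural nouns only; every other token keeps its text."""
    return [tok.lemma_ if tok.tag_ in PLURAL_NOUN_TAGS else tok.text
            for tok in tokens]

sample = [
    FakeToken("employees", "NNS", "employee"),
    FakeToken("employed", "VBN", "employ"),
    FakeToken("feet", "NNS", "foot"),
    FakeToken("footing", "NN", "footing"),
]
print(singularize_plurals(sample))
```

Since the function only reads `text`, `tag_` and `lemma_`, it should accept a real `Doc` as well. The result then depends on how the model tags each word (the output above shows "organizes" being treated as a plural noun when it appears without context).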


In [24]:
#$ac check tokenizer with vocab
from spacy.lang.en import English
#en_nlp = English()

from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(en_nlp.vocab)

text = "apples and oranges"
tokens = tokenizer(text)

for token in tokens:
    print(token)

apples
and
oranges


## Xtra

In [25]:
#$xtra python assert
#Test if a condition returns True
x = "hello"

#if condition returns True, then nothing happens:
assert x == "hello"

#if condition returns False, AssertionError is raised:
assert x == "goodbye"

AssertionError: 

In [None]:
#add
##$mychange $example src https://spacy.io/
##import spacy

### Load English tokenizer, tagger, parser, NER and word vectors
##nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

In [None]:
#$mychange
#doc = nlp('feet foot foots footing')
doc = nlp('employee employees employment employed')

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Nouns:", [token.lemma_ for token in doc if token.pos_ == "NOUN"])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])