### Install spaCy

* `pip install spacy` 
* `pip install gensim`
* `pip install matplotlib`
* `pip install pyLDAvis`

### Install Models

`python -m spacy download en_core_web_sm`

`python -m spacy download de_core_news_sm`

`python -m spacy download es_core_news_sm`

`python -m spacy download en_core_web_md`

`python -m spacy download de_core_news_md`

`python -m spacy download es_core_news_md`

# What’s spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

https://spacy.io/usage/spacy-101

## Features

| Name | Description |
| --- | --- |
| **Tokenization** | Segmenting text into words, punctuation marks etc. |
| **Part-of-speech** (POS) **Tagging** | Assigning word types to tokens, like verb or noun. |
| **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER) | Labelling named “real-world” objects, like persons, companies or locations. |
| **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| **Training** | Updating and improving a statistical model’s predictions. |
| **Serialization** | Saving objects to files or byte strings. |

## Models

While some of spaCy’s features work independently, others require statistical models to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun.
For a general-purpose use case, the small, default models are always a good start. They typically include the following components:

* **Binary weights** for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
* **Lexical entries** in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
* **Word vectors**, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
* **Configuration** options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.









Once you’ve downloaded and installed a model, you can load it via `spacy.load()`. This returns a `Language` object containing all the components and data needed to process text. We usually call it *nlp*. Calling the *nlp* object on a string of text returns a processed **Doc**.

## Language Processing Pipelines

- When you call `nlp` on a text, spaCy first **tokenizes** the text to produce a **Doc object**.
- The Doc is then processed in several different steps – this is also referred to as the **processing pipeline**. 
- The pipeline used by the default models consists of a **tagger**, a **parser** and an **entity recognizer**.

![pipeline](images/pipeline.svg)

More info [here](https://spacy.io/usage/processing-pipelines)
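
The pipeline idea can be sketched without spaCy: a tokenizer creates a document, then each component receives it, adds its annotations and passes it on. The snippet below is plain Python, not spaCy's implementation; `ToyDoc`, `toy_tagger` and `toy_entity_recognizer` are hypothetical stand-ins with deliberately trivial rules.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: tokens plus annotation layers."""
    words: list
    tags: list = field(default_factory=list)
    ents: list = field(default_factory=list)


def toy_tokenizer(text):
    # Step 1: the tokenizer turns raw text into a document object.
    return ToyDoc(words=text.split())


def toy_tagger(doc):
    # Trivial rule (an assumption for illustration): capitalised words
    # are tagged "PROPN", everything else "X".
    doc.tags = ["PROPN" if word[0].isupper() else "X" for word in doc.words]
    return doc


def toy_entity_recognizer(doc):
    # Relies on the tagger's output, so it has to run after the tagger.
    doc.ents = [w for w, tag in zip(doc.words, doc.tags) if tag == "PROPN"]
    return doc


def run_pipeline(text, components):
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)  # each component returns the annotated doc
    return doc


pipeline = [toy_tagger, toy_entity_recognizer]
doc = run_pipeline("we visited Paris with Jack", pipeline)
print(doc.tags)
print(doc.ents)
```

The order of the list matters here because the toy entity recognizer reads the tags written by the toy tagger.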

### Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them.

First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. **Does the substring match a tokenizer exception rule?** For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
2. **Can a prefix, suffix or infix be split off?** For example punctuation like commas, periods, hyphens or quotes.
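
The two checks above can be sketched in a few lines of plain Python. This is a simplified illustration, not spaCy's tokenizer: the `EXCEPTIONS` table is a hypothetical two-entry example, every ASCII punctuation character is treated as a possible prefix or suffix, and infixes are not handled.

```python
import string

# Hypothetical exception table: substring -> fixed list of tokens.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PUNCT = set(string.punctuation)


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():           # split on whitespace first
        suffixes = []
        while chunk:
            if chunk in EXCEPTIONS:      # check 1: exception rule wins
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PUNCT:      # check 2a: split off a prefix
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in PUNCT:     # check 2b: split off a suffix
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                        # nothing left to split
                tokens.append(chunk)
                chunk = ""
        tokens.extend(suffixes)          # suffixes follow the word itself
    return tokens


print(toy_tokenize("Hello, world. How are you?"))
print(toy_tokenize("I don't live in the (U.K.)"))
```

Checking the exception table before the affix rules is what keeps “U.K.” in one piece even though it ends in a period.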

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Hello, world. How are you?")
print([ t for t in doc])

nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de(u"Ich bin ein Berliner.")
print([t for t in doc_de])

nlp_es = spacy.load("es_core_news_sm")
doc_es = nlp_es(u"Hola. Buenos dias")
print([t for t in doc_es])

print(list(enumerate(doc)))
print("1st Token:", doc[0])          
print("Last Token:",doc[-1])
print("Slice from Token 3 to the end:",doc[2:].text)



[Hello, ,, world, ., How, are, you, ?]
[Ich, bin, ein, Berliner, .]
[Hola, ., Buenos, dias]
[(0, Hello), (1, ,), (2, world), (3, .), (4, How), (5, are), (6, you), (7, ?)]
1st Token: Hello
Last Token: ?
Slice from Token 3 to the end: world. How are you?


In [3]:
# A Catalan tokenizer exists, but there is no pretrained Catalan model yet!

nlp_ca = spacy.blank("ca")
doc_ca = nlp_ca(u"Hola món. Molt bon dia!")
for token in doc_ca:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Hola	0	Hola	False	False	Xxxx		
món	5	món	False	False	xxx		
.	8	.	True	False	.		
Molt	10	Molt	False	False	Xxxx		
bon	15	bo	False	False	xxx		
dia	19	dia	False	False	xxx		
!	22	!	True	False	!		


### Sentences

In [4]:
[s.text for s in doc.sents]    # Sentences

['Hello, world.', 'How are you?']

### Part-of-speech tags and flags 

What is a part-of-speech tag?
A part-of-speech (POS) tag is a context-sensitive label for the grammatical category of a word, such as noun, verb or adjective, as it is used in the sentence. More information about the kinds of tags used in NLP can be found here.

Examples:

* NUM, numeral - 1, 2, 3
* PROPN, proper noun - "Matic", "Andraz", "Cardiff"
* INTJ, interjection - "Uhhhhhhhhhhh"

In [274]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion.")
print("Text\tIndex\tLemma\tPunct?\tSpace?\tShape\tPOS\tTAG\tDEP\n")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_,
        token.dep_
    ))

Text	Index	Lemma	Punct?	Space?	Shape	POS	TAG	DEP

Apple	0	Apple	False	False	Xxxxx	PROPN	NNP	nsubj
is	6	be	False	False	xx	VERB	VBZ	aux
looking	9	look	False	False	xxxx	VERB	VBG	ROOT
at	17	at	False	False	xx	ADP	IN	prep
buying	20	buy	False	False	xxxx	VERB	VBG	pcomp
U.K.	27	U.K.	False	False	X.X.	PROPN	NNP	compound
startup	32	startup	False	False	xxxx	NOUN	NN	dobj
for	40	for	False	False	xxx	ADP	IN	prep
$	44	$	False	False	$	SYM	$	quantmod
1	45	1	False	False	d	NUM	CD	compound
billion	47	billion	False	False	xxxx	NUM	CD	pobj
.	54	.	True	False	.	PUNCT	.	punct


In [237]:
apple = doc[0]
print("Simple POS tag:", apple.pos_, spacy.explain(apple.pos_),apple.pos)
print("Detailed POS tag:", apple.tag_,spacy.explain(apple.tag_), apple.tag )
print("Word shape:", apple.shape_, apple.shape)
print("Alphanumeric characters?", apple.is_alpha)
print("Punctuation mark?", apple.is_punct)

billion = doc[10]
print("Digit?", billion.is_digit)
print("Like a number?", billion.like_num)
print("Like an email address?", billion.like_email)


Simple POS tag: PROPN proper noun 96
Detailed POS tag: NNP noun, proper singular 15794550382381185553
Word shape: Xxxxx 16072095006890171862
Alphanumeric characters? True
Punctuation mark? False
Digit? False
Like a number? True
Like an email address? False
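
To make the `shape_` and `like_num` columns less mysterious, here is a rough plain-Python approximation of both. It is a sketch under assumptions, not spaCy's code: `toy_shape` maps letters to `X`/`x` and digits to `d` and keeps at most four repeats of the same shape character, and `NUMBER_WORDS` is a tiny hypothetical word list.

```python
# Tiny hypothetical subset of number words, for illustration only.
NUMBER_WORDS = {"one", "two", "ten", "hundred", "thousand", "million", "billion"}


def toy_shape(text):
    shape = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch                      # punctuation and symbols stay as-is
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:                    # keep at most four repeats in a row
            shape.append(s)
    return "".join(shape)


def toy_like_num(text):
    # Digits (ignoring "," and ".") or a known number word count as numbers.
    cleaned = text.replace(",", "").replace(".", "")
    return cleaned.isdigit() or text.lower() in NUMBER_WORDS


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, toy_shape(word), word.isdigit(), toy_like_num(word))
```

This mirrors why “billion” is not a digit but still looks like a number in the output above.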


#### What are syntactic dependencies?

We have the part-of-speech tags and all of the tokens in a sentence, but how do we relate them to uncover the sentence's syntax? Syntactic dependencies describe how the words in a sentence relate to each other: each token is attached to a head token by a labelled relation such as subject or object. This matters in NLP because it lets us extract structure and grammar from plain text.
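
As a small illustration of what a dependency parse lets you do, the sketch below walks a parse that was annotated by hand (it is not model output) for the sentence "The boy saw the yellow dog". Each entry stores the index of its head token; following a common convention, the root points at itself. The helper names are hypothetical.

```python
# Hand-annotated parse: (text, dependency label, index of the head token).
# Written by hand for illustration; a real parser may label things differently.
PARSE = [
    ("The", "det", 1),
    ("boy", "nsubj", 2),
    ("saw", "ROOT", 2),     # the root is its own head
    ("the", "det", 5),
    ("yellow", "amod", 5),
    ("dog", "dobj", 2),
]


def children(index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, (_, _, head) in enumerate(PARSE)
            if head == index and i != index]


def subtree_text(index):
    """Text of the token at `index` plus everything attached below it."""
    indices, stack = [index], [index]
    while stack:
        for child in children(stack.pop()):
            indices.append(child)
            stack.append(child)
    return " ".join(PARSE[i][0] for i in sorted(indices))


root = next(i for i, (_, dep, _) in enumerate(PARSE) if dep == "ROOT")
for child in children(root):
    print(PARSE[child][1], "->", subtree_text(child))
```

Reading off the subject and object of the verb like this is the kind of structure extraction that dependency labels make possible.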


In [227]:
from spacy import displacy


displacy.render(doc, style="dep")

### Noun Chunks

Noun chunks are base noun phrases: a noun together with the words that describe it. spaCy recovers them from the tagged and parsed text.

Example:

The sentence "The boy saw the yellow dog" contains two nouns, "boy" and "dog".
Therefore the noun chunks are:

	1. "The boy"
	2. "the yellow dog"
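
A rough feel for chunking can be had from part-of-speech tags alone. The sketch below is a simplified stand-in, not how spaCy computes `doc.noun_chunks` (which needs the dependency parse): it groups runs of determiners, adjectives and nouns and keeps a run if it ends in a noun. The tagged sentence is written by hand.

```python
# Hand-tagged sentence: (word, coarse part-of-speech tag).
TAGGED = [
    ("The", "DET"), ("boy", "NOUN"), ("saw", "VERB"),
    ("the", "DET"), ("yellow", "ADJ"), ("dog", "NOUN"),
]

NOMINAL = {"NOUN", "PROPN"}
MODIFIER = {"DET", "ADJ"}


def toy_noun_chunks(tagged):
    chunks, current = [], []
    # The sentinel entry flushes the last run at the end of the sentence.
    for word, pos in tagged + [("", "END")]:
        if pos in NOMINAL or pos in MODIFIER:
            current.append((word, pos))
            continue
        # Any other tag closes the run; keep it only if it ends in a noun.
        if current and current[-1][1] in NOMINAL:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


print(toy_noun_chunks(TAGGED))
```

A tag-only heuristic like this merges adjacent noun phrases and misses pronouns, which is one reason a parser-based approach gives better chunks.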

In [241]:
doc = nlp(u"I went to Paris where I met my old friend Jack from uni.")

[ch.text for ch in doc.noun_chunks] #noun Chunks

['I', 'Paris', 'I', 'my old friend', 'Jack', 'uni']

### Named Entity Recognition

A named entity is any real world object such as a person, location, organisation or product with a proper name. 

Example:

	1. Barack Obama
	2. Edinburgh
	3. Ferrari Enzo
    
spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.




Doing NER with spaCy is easy, and the pretrained model performs reasonably well. It recognises the following entity types:
    

- **PERSON**: People, including fictional.
- **NORP**: Nationalities or religious or political groups.
- **FAC**: Buildings, airports, highways, bridges, etc.
- **ORG**: Companies, agencies, institutions, etc.
- **GPE**: Countries, cities, states.
- **LOC**: Non-GPE locations, mountain ranges, bodies of water.
- **PRODUCT**: Objects, vehicles, foods, etc. (Not services.)
- **EVENT**: Named hurricanes, battles, wars, sports events, etc.
- **WORK_OF_ART**: Titles of books, songs, etc.
- **LAW**: Named documents made into laws. 
- **LANGUAGE**: Any named language.
- **DATE**: Absolute or relative dates or periods.
- **TIME**: Times smaller than a day.
- **PERCENT**: Percentage, including "%".
- **MONEY**: Monetary values, including unit.
- **QUANTITY**: Measurements, as of weight or distance.
- **ORDINAL**: "first", "second", etc.
- **CARDINAL**: Numerals that do not fall under another type.


In [251]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"San Francisco considers banning sidewalk delivery robots")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp(u"Amazon is hiring a new VP of global policy")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
  

San Francisco 0 13 GPE
Amazon 0 6 ORG
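
The two numbers printed for each entity are character offsets into the original string, so slicing the raw text recovers the entity. The plain-Python sketch below checks this for the first sentence and shows how token boundaries map to such offsets; `whitespace_token_offsets` is a hypothetical helper that only handles whitespace-separated words.

```python
text = "San Francisco considers banning sidewalk delivery robots"

# Offsets and label copied from the printed output above.
start_char, end_char, label = 0, 13, "GPE"
entity_text = text[start_char:end_char]
print(entity_text, label)


def whitespace_token_offsets(text):
    """Return (start, end) character offsets for whitespace-separated words."""
    offsets, position = [], 0
    for word in text.split():
        start = text.index(word, position)
        offsets.append((start, start + len(word)))
        position = start + len(word)
    return offsets


offsets = whitespace_token_offsets(text)
# A span covering tokens 0 and 1 starts where token 0 starts
# and ends where token 1 ends.
span_start, span_end = offsets[0][0], offsets[1][1]
print(span_start, span_end)
```

Character offsets are handy when entities need to be highlighted in, or aligned with, the original text.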


If the model misses an entity, you can add it manually using a `Span`:

In [259]:
doc = nlp(u"Facebook is hiring a new VP of global policy")
if doc.ents:
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
else:
    print("No entities found")

No entities found


In [260]:
from spacy.tokens import Span

doc = nlp(u"Facebook is hiring a new VP of global policy")
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings[u"ORG"])]
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Facebook 0 8 ORG


The pretrained model knows many entity types. You can use `spacy.explain` to get a short description of each:

In [263]:
def explain_text_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        print(f'Entity: {ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')

In [264]:
explain_text_entities("Next week I'll be in London.")

Entity: Next week, Label: DATE, Absolute or relative dates or periods
Entity: London, Label: GPE, Countries, cities, states


In [265]:
explain_text_entities("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")

Entity: 2, Label: CARDINAL, Numerals that do not fall under another type
Entity: 9 a.m., Label: TIME, Times smaller than a day
Entity: 30%, Label: PERCENT, Percentage, including "%"
Entity: just 2 days, Label: DATE, Absolute or relative dates or periods
Entity: WSJ, Label: ORG, Companies, agencies, institutions, etc.


`displacy` comes in handy for a better visualization:

In [268]:
doc_ent = nlp(u"When Sebastian Thrun started working on self-driving cars at Google "
              u"in 2007, few people outside of the company took him seriously.")
displacy.render(doc_ent, style="ent")


### Word embedding vectors and similarity

A word embedding represents a word as a vector of numbers, and by extension maps the vocabulary of a whole corpus into a vector space. Words can then be treated numerically, with the similarity of two words measured by how close their vectors are in that space.

Example:

With word embeddings, vector arithmetic captures relations between words. Differences between related word pairs point in roughly the same direction, which gives approximate equalities such as:

	king - queen ≈ man - woman
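
The analogy can be checked numerically with cosine similarity, the usual measure of how closely two vectors point in the same direction. The vectors below are made up for illustration (two dimensions, loosely "royalty" and "gender"); real embeddings have hundreds of dimensions and only satisfy such relations approximately.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subtract(u, v):
    return tuple(a - b for a, b in zip(u, v))


# Made-up toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
}

royal_gap = subtract(VECTORS["king"], VECTORS["queen"])
plain_gap = subtract(VECTORS["man"], VECTORS["woman"])

# The two difference vectors point the same way, so their cosine is close to 1.
print(cosine(royal_gap, plain_gap))
print(cosine(VECTORS["king"], VECTORS["queen"]))
```

Because cosine similarity ignores vector length, it compares direction only, which is why it is a common default for comparing word vectors.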

In [5]:
import en_core_web_md
nlp = en_core_web_md.load()
doc = nlp(u"Apple and banana are similar. Pasta and hippo aren't.")

apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)

print("apple <-> banana", apple.similarity(banana))
print("pasta <-> hippo", pasta.similarity(hippo))

apples_sent, pasta_sent = doc.sents
fruit = doc.vocab[u'fruit']
print(apples_sent.similarity(fruit))
print(pasta_sent.similarity(fruit))


True True True True
apple <-> banana 0.5831845
pasta <-> hippo 0.12069741
0.657017
0.48992792


### Training a new model
We can train a Catalan model with the data found in https://github.com/UniversalDependencies/UD_Catalan-AnCora
following the steps in https://spacy.io/usage/training

Let's try it:

In [4]:
from spacy import displacy

nlp_ca = spacy.load("./model-ca-best")
doc_ca = nlp_ca(u'Ahir el noi de la mare va anar a comprar dos pastissos')
print("Text\tIndex\tLemma\tPunct?\tSpace?\tShape\tPOS\tTAG\n")
for token in doc_ca:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))
    
print( len(list(doc_ca.noun_chunks)))
print( len(list(doc_ca.sents)))
print( len(list(doc_ca.ents)))

displacy.render(doc_ca, style="dep")


Text	Index	Lemma	Punct?	Space?	Shape	POS	TAG

Ahir	0	Ahir	False	False	Xxxx	ADV	ADV
el	5	ell	False	False	xx	DET	DET
noi	8	noi	False	False	xxx	NOUN	NOUN
de	12	de	False	False	xx	ADP	ADP
la	15	ell	False	False	xx	DET	DET
mare	18	mare	False	False	xxxx	NOUN	NOUN
va	23	anar	False	False	xx	AUX	AUX
anar	26	anar	False	False	xxxx	VERB	VERB
a	31	a	False	False	x	ADP	ADP
comprar	33	comprar	False	False	xxxx	VERB	VERB
dos	41	dosar	False	False	xxx	NUM	NUM
pastissos	45	pastís	False	False	xxxx	NOUN	NOUN
0
1
0


## Exercises

[Pride and Prejudice](01_pride_and_predjudice.ipynb)


## Topic Modelling with gensim

[Topic modelling](topic_modelling.ipynb)

## Sentiment Analysis

[Introduction to Sentiment Analysis with spaCy, by Thomas Aglassinger at EuroPython 2018](https://ep2018.europython.eu/conference/talks/introduction-to-sentiment-analysis-with-spacy)
