In [None]:
https://realpython.com/natural-language-processing-spacy-python/

# **spaCy**

spaCy is an NLP library similar to Gensim, but with different implementations, including a particular focus on creating NLP pipelines to generate models and corpora. spaCy is open source and has several extra libraries and tools built by the same team, including displaCy, a visualization tool for viewing parse trees that uses Node.js to create interactive text.

In [8]:
import spacy

nlp = spacy.load('en_core_web_sm')
# Look up the named-entity recognizer component of the pipeline
ner = nlp.get_pipe('ner')
doc = nlp("""Berlin is the capital of Germany;
                 and the residence of Chancellor Angela Merkel.""")

print(doc.ents)
print(doc.ents[0], doc.ents[0].label_)

tokens = [token.text for token in doc]

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

(Berlin, Germany, Angela Merkel)
Berlin GPE
Berlin be the capital of Germany ; 
                  and the residence of Chancellor Angela Merkel .


# **Lemmatization** 

Lemmatization converts a word to its base form: reducing, reduces, reduced => reduce; n’t => not.

When you pass a string to nlp, spaCy lemmatizes it as part of the default pipeline. Generating lemmas is therefore identical to generating tokens, except that the list comprehension extracts token.lemma_ instead of token.text on each iteration.
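
To make the token/lemma distinction concrete without depending on a downloaded model, here is a minimal sketch of lookup-based lemmatization. The LEMMA_TABLE dictionary and the lemmatize helper are hypothetical names invented for this illustration; spaCy's own lemmatizer is considerably more elaborate than a fixed table.

```python
# Toy lookup lemmatizer: an illustration only, not spaCy's implementation.
# LEMMA_TABLE and lemmatize are hypothetical names made up for this sketch.
LEMMA_TABLE = {
    "is": "be",
    "reducing": "reduce",
    "reduces": "reduce",
    "reduced": "reduce",
    "n't": "not",
}


def lemmatize(token):
    """Return the base form of a token, falling back to the token itself."""
    return LEMMA_TABLE.get(token.lower(), token)


# Pre-split tokens stand in for the Token objects spaCy would produce.
tokens = ["Berlin", "is", "the", "capital", "of", "Germany"]

# Same list-comprehension shape as the notebook cell: one lemma per token.
lemmas = [lemmatize(token) for token in tokens]

print(" ".join(lemmas))
```

Words missing from the table pass through unchanged, which is why only "is" differs between the token list and the lemma list here.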

spaCy comes with informal language corpora, which makes it easier to find entities in documents such as tweets and chat messages. It is a quickly growing library.

In [20]:
article = """If we hope to one day leave Earth and explore the universe, our bodies are going to have to get a lot better at surviving the harsh conditions of space. 
Using synthetic biology since September, Lisa Nip hopes to harness special powers from microbes on Earth -- such as the ability to withstand radiation -- to make humans more fit for exploring space. 
"Lisa is approaching a time during which we'll have the capacity to decide our own genetic destiny," Nip says. 
"Augmenting the human body with new abilities is no longer a question of how, but of when. Carlos P. """

In [21]:
import spacy

# Disable the pipeline components that entity recognition does not need
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])
doc = nlp(article)
for ent in doc.ents:
    print(ent.label_, ent.text)

DATE one day
LOC Earth
DATE September
PERSON Lisa Nip
LOC Earth
PERSON Lisa
PERSON Carlos P.
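
Once the (label, text) pairs are available, a common next step is to group them by label. The sketch below hard-codes the pairs printed above instead of calling spaCy, so it runs without a model; group_entities is a hypothetical helper written for this example.

```python
from collections import defaultdict

# (label, text) pairs copied from the output above, standing in for
# (ent.label_, ent.text) taken from doc.ents.
pairs = [
    ("DATE", "one day"),
    ("LOC", "Earth"),
    ("DATE", "September"),
    ("PERSON", "Lisa Nip"),
    ("LOC", "Earth"),
    ("PERSON", "Lisa"),
    ("PERSON", "Carlos P."),
]


def group_entities(pairs):
    """Group entity texts by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


by_label = group_entities(pairs)
for label, texts in by_label.items():
    print(label, texts)
```

Grouping this way makes it easy to see that "Earth" was found twice and that "Lisa" and "Lisa Nip" were tagged as separate PERSON entities.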


# **Multilingual NER with polyglot**

Polyglot is yet another natural language processing library which uses word vectors to perform simple tasks such as entity recognition.

The main benefit of Polyglot, and its main difference from the other libraries, is the wide variety of languages it supports. Polyglot has word embeddings for more than 130 languages. For this reason, you can even use it for tasks like transliteration, for example rendering English text in Arabic script. Transliteration converts text from one writing system to another by swapping characters rather than translating the meaning.
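
As a rough illustration of the character-swapping idea, the sketch below maps a handful of Cyrillic letters to Latin ones with Python's built-in str.translate. The CYRILLIC_TO_LATIN table and the transliterate helper are hypothetical and deliberately tiny; this is not how Polyglot performs transliteration, only a picture of the concept.

```python
# Toy character-swapping transliteration: an illustration only, not Polyglot's method.
# CYRILLIC_TO_LATIN and transliterate are hypothetical names made up for this sketch.
CYRILLIC_TO_LATIN = str.maketrans({
    "М": "M",
    "о": "o",
    "с": "s",
    "к": "k",
    "в": "v",
    "а": "a",
})


def transliterate(text):
    """Swap each known Cyrillic character for its Latin counterpart."""
    # Characters that are not in the table are left unchanged.
    return text.translate(CYRILLIC_TO_LATIN)


print(transliterate("Москва"))
```

The word keeps its pronunciation but changes script, which is the difference between transliteration and translation (a translation would give "Moscow").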

In [26]:
from polyglot.text import Text

text = """El presidente de la Generalitat de Cataluña,
                 Carles Puigdemont, ha afirmado hoy a la alcaldesa
                 de Madrid, Manuela Carmena, que en su etapa de
                 alcalde de Girona (de julio de 2011 a enero de 2016)
                 hizo una gran promoción de Madrid."""
ptext = Text(text)
ptext.entities

entities = [(ent.tag, ' '.join(ent)) for ent in ptext.entities]

print(entities)

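ModuleNotFoundError: No module named 'icu'

The cell fails because PyICU, a dependency of polyglot, is not installed in this environment. Running it requires installing PyICU (along with polyglot's other dependencies, such as pycld2) and downloading the Spanish embeddings and NER models.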