### Importing required modules

In [74]:
# Import requests
import requests

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Import spacy
import spacy
from spacy import displacy

# Import pretrained models
import en_core_web_sm

### Obtaining text from a website

In [75]:
# send a request to the website
page = requests.get("https://en.wikipedia.org/wiki/Natural_Language_Toolkit")

# Parse the HTML with BeautifulSoup and the html5lib parser. It is slower
# than the alternatives but more lenient with malformed markup
page_content = BeautifulSoup(page.text, "html5lib")

# Collect the text of the first three paragraphs
paragraphs = page_content.find_all("p")[:3]
textContent = [paragraph.text for paragraph in paragraphs]

# Join the paragraphs together and replace each `\n` with an empty string
wiki_nltk = " ".join(textContent).replace("\n", "")
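
Wikipedia paragraphs carry footnote markers such as `[4]`, and these can end up glued to an entity when the text is passed to the model. A minimal sketch of one way to strip them beforehand, assuming the markers are always bracketed numbers; `strip_citation_markers` and the sample sentence are hypothetical and not part of the original notebook:

```python
import re


def strip_citation_markers(text):
    """Remove bracketed numeric footnote markers such as [4] from a string."""
    return re.sub(r"\[\d+\]", "", text)


# Hypothetical sample sentence in the style of the scraped paragraphs
sample = "at the University of Pennsylvania.[4] NLTK includes graphical demonstrations."
cleaned_sample = strip_citation_markers(sample)
print(cleaned_sample)
```

The same function could be applied to `wiki_nltk` before calling the model, if cleaner entity boundaries are wanted.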

### SpaCy in action

[spaCy](https://github.com/explosion/spaCy) is an open-source Python library for natural language processing, designed for use in applications. It provides an efficient statistical system for named entity recognition (NER) that labels groups of contiguous tokens. It can recognize a wide variety of named and numerical entities, including companies, locations, products, and organizations.

A major advantage of spaCy is that it ships pre-trained models for several languages: English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek. 
These models support tagging, parsing, and entity recognition. They have been designed and implemented from scratch specifically for spaCy.  
They can be imported as Python packages or loaded with `spacy.load()`.

In [76]:
# Load the pre-trained model (this call can take a moment)
nlp = spacy.load('en_core_web_sm')

spaCy provides a tokenizer, a POS tagger, and a named entity recognizer, so it is very easy to use. We just call the model on our text: `nlp(text)`. This tokenizes the text, tags it, and recognizes the entities.

The `.sents` attribute retrieves the sentences, `.tag_` the tag of each token, `.ents` the recognized entities, `.label_` the label of each entity, and `.text` the plain text of any of these objects.
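
To see these attributes side by side, here is a minimal sketch that runs without downloading a model. It assumes the spaCy v3 API and swaps the trained pipeline for a blank English pipeline with a rule-based sentencizer and entity ruler. The patterns, labels, sample text, and the name `nlp_demo` are illustrative assumptions, not what `en_core_web_sm` would necessarily produce:

```python
import spacy

# Assumption: spaCy v3. A blank pipeline has a tokenizer only.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")           # rule-based sentence splitting
ruler = nlp_demo.add_pipe("entity_ruler")  # rule-based entity matching
ruler.add_patterns([
    {"label": "ORG", "pattern": "NLTK"},
    {"label": "PERSON", "pattern": "Steven Bird"},
])

doc = nlp_demo("NLTK was developed by Steven Bird. It is written in Python.")

# .sents yields sentence spans; .text gives their plain text
sentence_texts = [sent.text for sent in doc.sents]

# .ents yields entity spans; .label_ gives each entity's label
entity_pairs = [(ent.text, ent.label_) for ent in doc.ents]

print(sentence_texts)
print(entity_pairs)
```

With a trained model such as the one loaded above, `token.tag_` would additionally hold the POS tag of each token. In this blank pipeline it stays empty because no tagger is present.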

In [77]:
def get_entities(text):
    """
    Apply the spaCy model to a text.
    The model tokenizes, POS-tags, and recognizes the named entities in the text.
    The entities are then collected in a list.
    Returns the list of named entities and the result of applying
    the model to the text.
    """
    # Apply the model
    tags = nlp(text)
    # Collect the text of every recognized entity
    entities = [ent.text for ent in tags.ents]
    # Return the list of entities and the result of the model.
    return entities, tags

Now, we apply this function to our original Wikipedia text.

In [78]:
spacy_tags, sentences = get_entities(wiki_nltk)

In [79]:
for tag in spacy_tags:
    print(tag)

The Natural Language Toolkit
NLTK
NLP
English
Steven Bird
Edward Loper
the Department of Computer and Information Science
the University of Pennsylvania.[4
NLTK
NLTK
NLP
32
US
25
NLTK
NLTK
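
The list above repeats several mentions. One way to summarize it is to count each distinct entity. A minimal sketch using only the standard library; `printed_entities` is a hypothetical name, and its contents are copied from the output above rather than recomputed:

```python
from collections import Counter

# Entities copied verbatim from the printed output above
printed_entities = [
    "The Natural Language Toolkit", "NLTK", "NLP", "English",
    "Steven Bird", "Edward Loper",
    "the Department of Computer and Information Science",
    "the University of Pennsylvania.[4",
    "NLTK", "NLTK", "NLP", "32", "US", "25", "NLTK", "NLTK",
]

entity_counts = Counter(printed_entities)

# The two most frequent entities and their counts
top_two = entity_counts.most_common(2)
print(top_two)  # → [('NLTK', 5), ('NLP', 2)]
```

In the notebook itself, the same call would work on `spacy_tags` directly.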


spaCy's NER architecture is elaborate, and the result is a very efficient algorithm.

According to Explosion AI, spaCy's named entity recognition system features a sophisticated word embedding strategy using subword features, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing.

Lastly, spaCy provides a visualizer, displaCy, that renders the sentences with their named entities highlighted.

In [80]:
# Join the sentences back into one string, separated by spaces so that
# words at sentence boundaries do not run together
sentences_full = " ".join(sent.text for sent in sentences.sents)

In [81]:
displacy.render(nlp(str(sentences_full)), jupyter=True, style='ent')
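
Outside a notebook, the same visualizer can return its markup as a string instead of displaying it. A minimal sketch, assuming displaCy's manual mode, which accepts a plain dictionary with character offsets instead of a processed document. The sample text, offsets, and labels are illustrative assumptions:

```python
from spacy import displacy

text = "NLTK was developed by Steven Bird."
person = "Steven Bird"
person_start = text.find(person)

# Manual mode input: the text plus entity spans given as character offsets
manual_doc = {
    "text": text,
    "ents": [
        {"start": 0, "end": 4, "label": "ORG"},
        {"start": person_start, "end": person_start + len(person), "label": "PERSON"},
    ],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.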

