# Named entity recognition (NER)

Named entity recognition is an NLP technique that extracts specific entities from a document.

You could, for example, extract all the addresses, all the prices, or both at the same time. This should be easy for a model because the format is almost always the same (with exceptions, of course):

* Addresses usually contain a street name, a number, a zip code, a city and a country.
* Prices are usually a number with a currency symbol before or after it.
* Names usually start with an uppercase letter.

If that is your situation, however cool it is to put AI in the title of your project, ask yourself whether a regular expression (RegEx) would be easier and better.
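
For a format as regular as a price, a regular expression may indeed be enough. Here is a minimal sketch using Python's built-in `re` module. The pattern, the `extract_prices` helper and the sample sentence are assumptions made for illustration; real documents will need a broader pattern.

```python
import re

# Assumed price format: a currency symbol before or after a number,
# e.g. "€12.50", "$1,299" or "30 €".
PRICE_PATTERN = re.compile(
    r"[€$£]\s?\d[\d,]*(?:\.\d+)?"    # symbol first: €12.50, $1,299
    r"|\d[\d,]*(?:\.\d+)?\s?[€$£]"   # symbol last: 30 €
)


def extract_prices(text: str) -> list[str]:
    """Return every substring of `text` that looks like a price."""
    return PRICE_PATTERN.findall(text)


sample = "The laptop costs $1,299 and the mouse 30 €, shipping is €12.50."
print(extract_prices(sample))
# ['$1,299', '30 €', '€12.50']
```

Note that the pattern returns every price it sees. It has no way of knowing which one includes taxes or what the price refers to.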

But you might want to extract only the prices **without** taxes, or only the receiver's address in a letter. In that case you want a specific entity, which is much harder to capture with RegEx. That's where NER is useful.

## The easy way: SpaCy

[SpaCy](https://spacy.io/) gives you a set of pre-trained models in several languages. However, they are very general models, so general that they will probably not be enough if your use case is even a little specific. **But** if you want to build a demo or a proof of concept, it is one of the best ways to go.

```python
# Load the needed dependencies from SpaCy
import spacy
from spacy import displacy
import en_core_web_sm

# Create an instance of the small pipeline and model from SpaCy
nlp = en_core_web_sm.load()

# The text we want to process
text = "BeCode’s mission : Enabling tomorrow’s digital talents to blossom.We believe that education makes anything possible.Since 2017, BeCode has been offering inclusive coding bootcamps for jobseekers to become developers in partnership with Simplon."

# Apply the complete NER pipeline from SpaCy (preprocessing + model prediction)
doc = nlp(text)

# Show the entities detected in a nice way
displacy.render(doc, jupyter=True, style='ent')
```

We can also get this information in a more "useful" way:

```python
# Loop over all the tokens preprocessed by SpaCy
for token in doc:
    # If the token is detected as an entity
    if token.ent_type_:
        print(f"{token.text}: {token.ent_type_}")
```

```
tomorrow: DATE
2017: DATE
BeCode: ORG
Simplon: PRODUCT
```
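
The loop above prints one line per token, so an entity made of several words would be spread over several lines. SpaCy exposes whole entities directly through `doc.ents` (each one has a `text` and a `label_`), but the regrouping can also be sketched by hand from the per-token `ent_iob_` and `ent_type_` attributes. The `merge_entity_tokens` helper and the sample triples below are hypothetical and written for illustration; they are not SpaCy output.

```python
def merge_entity_tokens(tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs.

    `iob` follows the convention of SpaCy's `token.ent_iob_`:
    "B" begins an entity, "I" continues it, "O" or "" is outside any entity.
    Words are joined with a single space, which only approximates the original text.
    """
    entities = []
    current_words, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one
        if current_words:
            entities.append((" ".join(current_words), current_label))

    for text, iob, label in tokens:
        if iob == "B" or (iob == "I" and label != current_label):
            # A new entity starts here
            flush()
            current_words, current_label = [text], label
        elif iob == "I":
            # Same entity as the previous token
            current_words.append(text)
        else:
            # Outside any entity
            flush()
            current_words, current_label = [], None
    flush()
    return entities


# Hypothetical token triples, shaped like (token.text, token.ent_iob_, token.ent_type_)
sample_tokens = [
    ("Since", "O", ""),
    ("2017", "B", "DATE"),
    (",", "O", ""),
    ("BeCode", "B", "ORG"),
    ("works", "O", ""),
    ("with", "O", ""),
    ("New", "B", "GPE"),
    ("York", "I", "GPE"),
]
print(merge_entity_tokens(sample_tokens))
# [('2017', 'DATE'), ('BeCode', 'ORG'), ('New York', 'GPE')]
```

With the `doc` created earlier, the input could be built as `[(t.text, t.ent_iob_, t.ent_type_) for t in doc]`.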


## SpaCy's magic

Simple, right? Well, that's the whole purpose of SpaCy: it provides state-of-the-art models that are simple to use.

Also, the documentation of SpaCy is amazing!

## Limitations

If you need to train a specific model because SpaCy's general ones are not enough for your use case, you will get stuck quite quickly.
Why? Because even though you can train a custom model with SpaCy, you will find that it is not that flexible: that is simply not SpaCy's main purpose.

## Alternatives

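A more flexible, lower-level alternative to SpaCy is [NLTK](https://www.nltk.org/).
NLTK is a toolkit of classical NLP building blocks (tokenizers, part-of-speech taggers, chunkers, classifiers) that you assemble into a pipeline yourself. It is not a deep learning framework.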

[BERT](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270) is one of the state-of-the-art model architectures for NLP, but it's a bit more complex.
BERT is a [transformer](http://jalammar.github.io/illustrated-transformer/) (yes, like the robots), but we will see that later.


## Practice time!
With the text:
```
BeCode’s mission : Enabling tomorrow's digital talents to blossom.
We believe that education makes anything possible.
Since 2017, BeCode has been offering inclusive coding bootcamps for job seekers to become developers in partnership with Simplon.
```

Try to extract entities with NLTK.

*You don't need to reach an amazing accuracy or to find all the entities; the purpose of this exercise is just to help you understand how the library works.*

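```python
# Import NLTK
import nltk

# Download the resources used below (only needed once per environment).
# These are the names used by classic NLTK releases; if a LookupError
# names another resource, download that one instead.
for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(resource, quiet=True)

# The text we want to process
text = "BeCode’s mission: Enabling tomorrow’s digital talents to blossom.We believe that education makes anything possible. Since 2017, BeCode has been offering inclusive coding bootcamps for jobseekers to become developers in partnership with Simplon."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Tag each token with its part of speech
tagged = nltk.pos_tag(tokens)

# Extract entities (returns a tree whose labelled subtrees are entities)
entities = nltk.ne_chunk(tagged)

# Print the whole tree
print(entities)

# Or print only the labelled subtrees, one entity per line
for tree in entities:
    if hasattr(tree, 'label'):
        print(tree.label(), ' '.join(c[0] for c in tree.leaves()))
```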


If a required resource has not been downloaded yet, NLTK stops with a `LookupError` like this one, which names the missing resource and the command to fetch it:

```
LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\becode/nltk_data'
    - 'c:\\Users\\becode\\Desktop\\DA\\LIE-Thomas3-DA\\.nlpvenv\\nltk_data'
    - 'c:\\Users\\becode\\Desktop\\DA\\LIE-Thomas3-DA\\.nlpvenv\\share\\nltk_data'
    - 'c:\\Users\\becode\\Desktop\\DA\\LIE-Thomas3-DA\\.nlpvenv\\lib\\nltk_data'
    - 'C:\\Users\\becode\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
```
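
One way to avoid this error is to check for each resource before the pipeline runs and download only what is missing. The sketch below is a hypothetical helper, not part of NLTK: `ensure_resources` and `REQUIRED_RESOURCES` are names made up here, and the resource paths are those of classic NLTK releases (newer releases may look for differently named resources). The lookup and download functions are passed in as parameters so the helper can be tried without NLTK installed.

```python
# Assumed mapping: download name -> path looked up inside the nltk_data folder
REQUIRED_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def ensure_resources(resources, find, download):
    """Download every resource that `find` cannot locate.

    `find(path)` must raise LookupError when the resource is missing,
    and `download(name)` must fetch it. Returns the names that were downloaded.
    """
    downloaded = []
    for name, path in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            downloaded.append(name)
    return downloaded
```

With NLTK installed, the call would be `ensure_resources(REQUIRED_RESOURCES, nltk.data.find, nltk.download)`, run once before `nltk.word_tokenize`.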


## Additional resources
* [NER with SpaCy and NLTK](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
* [What is named entity recognition (NER) and how can I use it?](https://medium.com/mysuperai/what-is-named-entity-recognition-ner-and-how-can-i-use-it-2b68cf6f545d)
* [How does Named Entity Recognition help on Information Extraction in NLP?](https://towardsdatascience.com/named-entity-recognition-3fad3f53c91e)

## Final word about SpaCy and NLTK
SpaCy and NLTK are fantastic tools that you **need to have** under your belt.

If you plan to work in the NLP field, all the time spent learning how to use those libraries is precious.

![cool machine](https://media.giphy.com/media/1flAwtHCYosL6LWnHr/giphy.gif)
