# NER
What is NER? 

>NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. 

Wikipedia.org

One example of such a task could be the following sentence:
>    Jim bought 300 shares of Acme Corp. in 2006.

This would be tagged as:

>    Jim_person bought 300_unit shares of [[Acme Corp.]]_org in 2006_year.
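
The tagged sentence above can also be written as character spans, which is the form most NER tools work with. The following is a minimal sketch under that assumption; the helper name `to_spans` is made up for illustration, and the labels are copied from the example.

```python
sentence = "Jim bought 300 shares of Acme Corp. in 2006."

# (surface text, label) pairs, copied from the tagged example
mentions = [("Jim", "person"), ("300", "unit"), ("Acme Corp.", "org"), ("2006", "year")]


def to_spans(text, mentions):
    """Turn (surface, label) pairs into (start, end, label) character spans.

    Mentions are assumed to be listed in the order they occur in the text.
    """
    spans = []
    cursor = 0
    for surface, label in mentions:
        start = text.index(surface, cursor)  # search only after the previous mention
        end = start + len(surface)
        spans.append((start, end, label))
        cursor = end
    return spans


spans = to_spans(sentence, mentions)
for start, end, label in spans:
    print(start, end, label, repr(sentence[start:end]))
```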

Systems easily reach near-human F-scores today, which is really awesome.  
Recall SparkNLP: it has a built-in NER tagger that works really well on English. It can be limiting, though, as it might not have the entity types you are looking for (e.g. a hospital might want to find all medicines as separate entities to improve the data retrieved from doctors' reports).
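
The F-score mentioned above is typically computed per entity rather than per token. Here is a rough sketch, assuming a prediction only counts as correct when its start, end, and label all match the gold annotation exactly. The function name `entity_f1` and the sample spans are illustrative, not taken from any particular library.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Illustrative spans: three of the four predictions match the gold annotation
gold = [(0, 3, "person"), (11, 14, "unit"), (25, 35, "org"), (39, 43, "year")]
predicted = [(0, 3, "person"), (11, 14, "unit"), (25, 29, "org"), (39, 43, "year")]

print(entity_f1(gold, predicted))
```

Note that the truncated `org` span counts as both a false positive and a false negative, which is why exact-match scoring is stricter than token-level accuracy.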

## Approaches
1. Build our own NER from the ground up
2. Use/Train spaCy, which includes a fast statistical NER tagger. It's possible to add an EntityMatcher on top in order to have more control over the decisions.
3. Use/Train a neural network (in our case I think we'll choose a Transformer, namely BERT, which is SOTA, and make use of transfer learning)
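
To give a feel for the simplest possible form of approach 1, here is a sketch of a dictionary (gazetteer) lookup tagger. It is only an illustration under stated assumptions: the `GAZETTEER` contents and the `tag` function are hypothetical, and a real system would need to handle ambiguity, overlapping matches, and unseen names.

```python
import re

# Hypothetical gazetteer: entity type -> known surface forms (illustrative only)
GAZETTEER = {
    "MEDICINE": ["ibuprofen", "paracetamol"],
    "ORG": ["Acme Corp."],
}


def tag(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, label) spans for every gazetteer term in text."""
    found = []
    for label, terms in gazetteer.items():
        for term in terms:
            # Lookarounds instead of \b so terms ending in punctuation still match
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.end(), label))
    return sorted(found)


text = "The patient was given Ibuprofen and paracetamol."
for start, end, label in tag(text):
    print(text[start:end], label)
```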

Extra:  
On the JVM a few "out-of-the-box" approaches actually exist: StanfordNLP, CoreNLP and SparkNLP. One could also pick up Deeplearning4j and build a neural network with it, but as Python is the de facto standard, it's easier to keep up to date with the latest SOTA there.

## Examples of use-cases
* Summarizing documents  
    * Summarizing documents could be assisted by understanding what entities exist in the document.
* Optimizing search engines
    * Do a one-time parse of each article and create keywords from the entities found.
* Powering recommendation systems
    * Same idea as for search: the entities found in each item can be used to match it against similar items.
* Simplifying customer support
    * By extracting entities we could improve the classification of where a request should be routed, and the extracted information could also be used to fill slots.
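
The search-engine use case above (parse each article once, keep the entities as keywords) could be sketched as a small inverted index. This assumes the entities per article have already been extracted by some NER step; the article ids, the sample entities, and the function name `build_entity_index` are all made up for illustration.

```python
from collections import defaultdict


def build_entity_index(entities_by_article):
    """Map each lower-cased entity text to the set of article ids mentioning it."""
    index = defaultdict(set)
    for article_id, entities in entities_by_article.items():
        for text, _label in entities:
            index[text.lower()].add(article_id)
    return index


# Hypothetical output of a one-time NER pass over two articles
entities_by_article = {
    "article-1": [("Acme Corp.", "org"), ("Jim", "person")],
    "article-2": [("acme corp.", "org")],
}

index = build_entity_index(entities_by_article)
print(sorted(index["acme corp."]))
```

Lower-casing the keys is a simplification; a real index would likely normalise entities more carefully.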



## Datasets
We have a few different datasets to work with.
### Swedish
- [Manually Annotated](https://github.com/klintan/swedish-ner-corpus/)
- [Stockholm Internet Corpus (SIC)](https://www.ling.su.se/english/nlp/corpora-and-resources/sic)
- [SUC 3.0](https://spraakbanken.gu.se/en/resources/suc3)

### English
- [Emerging Entities (WNUT 17)](https://github.com/leondz/emerging_entities_17)
- [A collection of many different datasets](https://github.com/juand-r/entity-recognition-datasets)
- [Kaggle notebook: BERT implementation on NER corpus](https://www.kaggle.com/akshay235/bert-implementation-on-ner-corpus). Dataset [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)

### Self-built
We can build our own dataset, either by manual tagging or by making use of wikidata.org, which could be helpful in finding certain types of entities.
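
Manually tagged spans usually have to be converted into per-token tags before they can be used for training. Below is a minimal sketch of one common scheme, BIO tagging, assuming whitespace tokenisation and character-offset annotations. The function name `to_bio` and the sample spans are illustrative; a real pipeline would use the tokenizer of the model being trained.

```python
import re


def to_bio(text, spans):
    """Convert (start, end, label) character spans into (token, BIO tag) pairs."""
    tagged = []
    for match in re.finditer(r"\S+", text):  # naive whitespace tokenisation
        bio_tag = "O"
        for start, end, label in spans:
            # A token belongs to a span only if it lies fully inside it
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                bio_tag = prefix + label
                break
        tagged.append((match.group(), bio_tag))
    return tagged


text = "Jim bought 300 shares of Acme Corp. in 2006."
spans = [(0, 3, "person"), (25, 35, "org")]  # hand-annotated, illustrative

for token, bio_tag in to_bio(text, spans):
    print(token, bio_tag)
```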


NOTES:

- https://spacy.io/usage/v2-1
- https://spacy.io/universe/project/neuralcoref
- https://spacy.io/universe/project/NeuroNER
- https://spacy.io/universe/project/spacy-transformers
- https://rasa.com/
- https://github.com/doccano/doccano
- https://www.wikidata.org/wiki/Q53747
- https://spacy.io/universe/project/sense2vec
- https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175


In [9]:
%%capture
import spacy
!python3 -m spacy download en_core_web_sm


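In [1]:
import spacy  # imported here as well, so this cell runs on its own

# Load the small English pipeline downloaded above
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# Run the pipeline; the recognised entities are exposed on doc.ents
doc = nlp(sentence)

# Print each entity's text, character offsets, and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)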