Named Entity Extraction
=======================

The named entity extraction task aims to extract phrases from plain text
that correspond to entities. Polyglot recognizes 3 categories of
entities:

-  Locations (Tag: ``I-LOC``): cities, countries, regions, continents,
   neighborhoods, administrative divisions ...
-  Organizations (Tag: ``I-ORG``): sports teams, newspapers, banks,
   universities, schools, non-profits, companies, ...
-  Persons (Tag: ``I-PER``): politicians, scientists, artists, athletes
   ...

Languages Coverage
------------------

The models were trained on datasets extracted automatically from
Wikipedia. Polyglot currently supports 40 major languages.

.. code:: python

    from polyglot.downloader import downloader
    print(downloader.supported_languages_table("ner2", 3))

.. parsed-literal::

      1. Polish                  2. Turkish                     3. Russian
      4. Indonesian              5. Czech                       6. Arabic
      7. Korean                  8. Catalan; Valencian          9. Italian
     10. Thai                   11. Romanian, Moldavian, ...   12. Tagalog
     13. Danish                 14. Finnish                    15. German
     16. Persian                17. Dutch                      18. Chinese
     19. French                 20. Portuguese                 21. Slovak
     22. Hebrew (modern)        23. Malay                      24. Slovene
     25. Bulgarian              26. Hindi                      27. Japanese
     28. Hungarian              29. Croatian                   30. Ukrainian
     31. Serbian                32. Lithuanian                 33. Norwegian
     34. Latvian                35. Swedish                    36. English
     37. Greek, Modern          38. Spanish; Castilian         39. Vietnamese
     40. Estonian

Download Necessary Models
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    %%bash
    polyglot download embeddings2.en ner2.en

.. parsed-literal::

    [polyglot_data] Downloading package embeddings2.en to
    [polyglot_data] /home/rmyeid/polyglot_data...
    [polyglot_data] Package embeddings2.en is already up-to-date!
    [polyglot_data] Downloading package ner2.en to
    [polyglot_data] /home/rmyeid/polyglot_data...
    [polyglot_data] Package ner2.en is already up-to-date!

Example
-------

Entities inside a text object or a sentence are represented as chunks.
Each chunk identifies the start and the end indices of the word
subsequence within the text.

.. code:: python

    from polyglot.text import Text

.. code:: python

    blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
    text = Text(blob)

    # We can also specify the language of the text explicitly:
    # text = Text(blob, hint_language_code='en')

We can query all entities mentioned in a text.

.. code:: python

    text.entities

.. parsed-literal::

    [I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]

Or, we can query entities per sentence:

.. code:: python

    for sent in text.sentences:
        print(sent, "\n")
        for entity in sent.entities:
            print(entity.tag, entity)

.. parsed-literal::

    The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world".

    I-ORG [u'Israeli']
    I-PER [u'Benjamin', u'Netanyahu']
    I-LOC [u'Iran']

By inspecting the second entity, ``Benjamin Netanyahu``, more closely,
we can locate its position within the sentence.

.. code:: python

    # ``sent`` is the sentence left over from the loop above
    benjamin = sent.entities[1]
    sent.words[benjamin.start: benjamin.end]

.. parsed-literal::

    WordList([u'Benjamin', u'Netanyahu'])

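
The start/end convention used by chunks can be illustrated without
Polyglot installed. The sketch below is only an approximation: ``Chunk``
and ``chunk_words`` are hypothetical names introduced for illustration,
not Polyglot's own classes, and the sketch assumes that ``end`` is
exclusive, as the slice in the example suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for an entity chunk (not Polyglot's class)."""
    tag: str    # entity tag, e.g. "I-PER"
    start: int  # index of the first word of the entity
    end: int    # index one past the last word (assumed exclusive)


def chunk_words(words: List[str], chunk: Chunk) -> List[str]:
    """Return the words covered by the chunk, using a plain slice."""
    return words[chunk.start:chunk.end]


# A hand-tokenized prefix of the example sentence.
words = ["The", "Israeli", "Prime", "Minister", "Benjamin",
         "Netanyahu", "has", "warned", "that", "Iran"]
benjamin = Chunk(tag="I-PER", start=4, end=6)

print(chunk_words(words, benjamin))  # ['Benjamin', 'Netanyahu']
```

Storing indices rather than copies of the words keeps each entity tied
to its position in the sentence, so neighboring words can be inspected
with ordinary slicing.
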
Command Line Interface
~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    !polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en ner | tail -n 20

.. parsed-literal::

    ,               O
    which           O
    was             O
    equalled        O
    five            O
    days            O
    ago             O
    by              O
    South           I-LOC
    Africa          I-LOC
    in              O
    their           O
    victory         O
    over            O
    West            I-ORG
    Indies          I-ORG
    in              O
    Sydney          I-LOC
    .               O

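
The per-token output above can be folded back into multi-word entities
by merging consecutive tokens that share the same tag. The following is
a minimal sketch, assuming each line holds one token and one tag
separated by whitespace, as in the listing; ``group_entities`` is a
hypothetical helper and is not part of Polyglot.

```python
from typing import List, Tuple


def group_entities(lines: List[str]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive tokens with the same non-O tag into entities."""
    entities: List[Tuple[str, List[str]]] = []
    current_tag = None
    current_words: List[str] = []
    for line in lines:
        parts = line.split()
        # Treat blank or malformed lines as "outside any entity".
        token, tag = parts if len(parts) == 2 else (None, "O")
        if tag != "O" and tag == current_tag:
            current_words.append(token)  # continue the open entity
            continue
        if current_tag is not None:
            entities.append((current_tag, current_words))  # close it
        if tag != "O":
            current_tag, current_words = tag, [token]  # open a new one
        else:
            current_tag, current_words = None, []
    if current_tag is not None:
        entities.append((current_tag, current_words))
    return entities


sample = ["by O", "South I-LOC", "Africa I-LOC", "in O", "their O",
          "victory O", "over O", "West I-ORG", "Indies I-ORG", "in O",
          "Sydney I-LOC", ". O"]
print(group_entities(sample))
# [('I-LOC', ['South', 'Africa']), ('I-ORG', ['West', 'Indies']), ('I-LOC', ['Sydney'])]
```

Because the tags carry no begin marker, two distinct entities of the
same type that appear back to back would be merged into one by this
approach.
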
Demo
----

.. raw:: html

    <embed>
        <iframe src="https://entityextractor.appspot.com/" width="100%" height="225" seamless></iframe>
    </embed>

Citation
~~~~~~~~

This work is a direct implementation of the research described in
the `Polyglot-NER: Multilingual Named Entity
Recognition <https://sites.google.com/site/rmyeid/papers/polyglot-ner.pdf?attredirects=0&d=1>`__
paper. The authors of this library strongly encourage you to cite the
following paper if you are using this software.

::

    @article{polyglotner,
      author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
      title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
      journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
      month = {April},
      year = {2015},
      publisher = {SIAM}
    }

References
----------

-  `Polyglot-NER project page. <https://bit.ly/polyglot-ner>`__
-  `Wikipedia on
   NER <http://en.wikipedia.org/wiki/Named-entity_recognition>`__.