adding PT-br NER dataset twitter
diegoesteves committed Nov 12, 2018
1 parent 09efd9d commit 0774718
Showing 6 changed files with 4,601 additions and 0 deletions.
908 changes: 908 additions & 0 deletions Portuguese/Peres2017/dataset.ptbr.twitter.test.ner

Large diffs are not rendered by default.

3,654 changes: 3,654 additions & 0 deletions Portuguese/Peres2017/dataset.ptbr.twitter.train.ner

Large diffs are not rendered by default.

Binary file added Portuguese/Peres2017/peres2017.pdf
Binary file not shown.
28 changes: 28 additions & 0 deletions Portuguese/Peres2017/readme.txt
@@ -0,0 +1,28 @@
To comply with Twitter's Terms of Service, only tweet IDs are provided. Download the corresponding full tweets using your own Twitter API credentials.
The first line of each file ("tweet_id start_pos end_pos named_entity_type") is a header that names the annotation fields; it is not part of the actual annotations.

If you use the dataset, please cite the following paper:

@inproceedings{Peres:2017:BLC:3148011.3154478,
author = {Peres, Rafael and Esteves, Diego and Maheshwari, Gaurav},
title = {Bidirectional LSTM with a Context Input Window for Named Entity Recognition in Tweets},
booktitle = {Proceedings of the Knowledge Capture Conference},
series = {K-CAP 2017},
year = {2017},
isbn = {978-1-4503-5553-7},
location = {Austin, TX, USA},
pages = {42:1--42:4},
articleno = {42},
numpages = {4},
url = {http://doi.acm.org/10.1145/3148011.3154478},
doi = {10.1145/3148011.3154478},
acmid = {3154478},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Deep learning, informal text, machine learning, named entity recognition, natural language processing, neural networks},
}

Paper: https://dl.acm.org/citation.cfm?doid=3148011.3154478
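
The annotation layout described in the readme (a header line followed by "tweet_id start_pos end_pos named_entity_type" records) could be read with a sketch like the one below. It is not part of the commit. It assumes the fields are separated by whitespace, which the readme does not state, and the names `Annotation` and `parse_annotations` as well as the sample rows and entity labels are hypothetical. Tweet IDs are kept as strings so they can be passed unchanged to whatever tool you use to download the tweets.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotation record (hypothetical helper type)."""
    tweet_id: str
    start_pos: int
    end_pos: int
    entity_type: str


def parse_annotations(lines, has_header=True):
    """Group annotation records by tweet ID.

    Assumes whitespace-separated fields; the readme does not
    specify the delimiter.
    """
    by_tweet = defaultdict(list)
    rows = iter(lines)
    if has_header:
        next(rows, None)  # the first line only names the fields
    for line in rows:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        tweet_id, start, end, entity_type = fields
        by_tweet[tweet_id].append(
            Annotation(tweet_id, int(start), int(end), entity_type)
        )
    return dict(by_tweet)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    "tweet_id start_pos end_pos named_entity_type",
    "1001 0 5 PERSON",
    "1001 10 16 LOCATION",
    "1002 3 9 ORGANIZATION",
]
annotations = parse_annotations(sample)
```

To read one of the dataset files, an open file handle can be passed in place of the list, for example `parse_annotations(open(path, encoding="utf-8"))`, since the function only iterates over lines.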



10 changes: 10 additions & 0 deletions Portuguese/README.MD
@@ -64,4 +64,14 @@ A dataset for named entity recognition in Brazilian legal documents is, unlike o
- **cite**: ["LeNER-Br: a Dataset for Named Entity Recognition in Brazilian Legal Text"](leNER-Br/luz_etal_propor2018.pdf)
- **web**: https://cic.unb.br/~teodecampos/LeNER-Br/

Peres 2017
========
A dataset for named entity recognition in Brazilian Portuguese (#noisydata #twitter)

- **data**:
  - [dataset.ptbr.twitter.train.ner](Peres2017/dataset.ptbr.twitter.train.ner)
  - [dataset.ptbr.twitter.test.ner](Peres2017/dataset.ptbr.twitter.test.ner)

- **cite**: ["Bidirectional LSTM with a Context Input Window for Named Entity Recognition in Tweets"](Peres2017/peres2017.pdf)


1 change: 1 addition & 0 deletions README.md
@@ -7,6 +7,7 @@ Datasets to train supervised classifiers for Named-Entity Recognition
* WikiNER
* Paramopama
* leNER-br
* Peres2017

<a name="de"></a>
### German