TwitterNER

Twitter named entity extraction for the WNUT 2016 shared task (http://noisy-text.github.io/2016/ner-shared-task.html), described in the corresponding workshop paper at WNUT, COLING 2016: Semi-supervised Named Entity Recognition in noisy-text, by Shubhanshu Mishra and Jana Diesner.

Model architecture

Installation

pip install future gensim scikit-learn regex matplotlib seaborn sklearn-crfsuite jupyter joblib
wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
unzip glove.twitter.27B.zip

Usage

>>> from run_ner import TwitterNER
>>> from twokenize import tokenizeRawTweetText
>>> ner = TwitterNER()
>>> tweet = "Beautiful day in Chicago! Nice to get away from the Florida heat."
>>> tokens = tokenizeRawTweetText(tweet)
>>> ner.get_entities(tokens)
[(3, 4, 'LOCATION'), (11, 12, 'LOCATION')]
>>> " ".join(tokens[3:4])
'Chicago'
>>> " ".join(tokens[11:12])
'Florida'
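The (start, end, type) tuples returned by get_entities are token-index spans, as the slicing above shows. A small helper can turn them into readable (text, type) pairs; the function below is an illustrative sketch, not part of the package itself.

```python
def entities_to_text(tokens, entities):
    """Convert (start, end, type) spans, as returned by
    TwitterNER.get_entities, into (entity_text, entity_type) pairs."""
    return [(" ".join(tokens[start:end]), etype)
            for start, end, etype in entities]

tokens = "Beautiful day in Chicago ! Nice to get away from the Florida heat .".split()
spans = [(3, 4, "LOCATION"), (11, 12, "LOCATION")]
print(entities_to_text(tokens, spans))
# [('Chicago', 'LOCATION'), ('Florida', 'LOCATION')]
```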

Data download

The dataset used in this repository can be downloaded from https://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16

Submitted Solution [ST]

See Word2Vec.ipynb for details on the original submitted solution for the task.

Improved model

See Run Experiments.ipynb for details on the improved system, and Run Experiment.ipynb for details on the improved system evaluated on the test data.

Using the API

The final system is packaged as an API in the NoisyNLP folder. More updates will be made to the API in the coming days. See Run Experiment.ipynb for API usage.

Downloading Gazetteers

See Updated Gazetteers.ipynb, Extra Gazetteers.ipynb, and Download Wikidata.ipynb
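Gazetteers are typically used as token-level features: each token is marked with whether it starts or continues a phrase found in the lookup list. The sketch below illustrates the general longest-match idea with a toy location list; the function name and tag scheme (B-GAZ/I-GAZ) are illustrative, not the repository's actual feature names.

```python
def gazetteer_features(tokens, gazetteer, max_len=3):
    """Tag each token B-GAZ/I-GAZ if it is part of a phrase found in
    the gazetteer (longest match first, up to max_len tokens), else O."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                labels[i] = "B-GAZ"
                for j in range(i + 1, i + n):
                    labels[j] = "I-GAZ"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels

locations = {"new york", "chicago"}
print(gazetteer_features("I love New York and Chicago".split(), locations))
# ['O', 'O', 'B-GAZ', 'I-GAZ', 'O', 'B-GAZ']
```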

Generating word clusters

See Gen new clusters.ipynb
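Brown-cluster tools such as Percy Liang's brown-cluster commonly emit lines of the form "bit-path, word, count", separated by tabs (the bundled 50mpaths2 file follows this layout). Bit-path prefixes of several lengths then serve as CRF features. The loader and feature function below are a hedged sketch of that idiom, with assumed names and prefix lengths, not the repository's exact implementation.

```python
def load_brown_clusters(lines):
    """Parse brown-cluster output lines of the form
    '<bit-path>\t<word>\t<count>' into a word -> bit-path dict."""
    clusters = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            clusters[parts[1]] = parts[0]
    return clusters

def cluster_prefix_features(word, clusters, prefixes=(4, 8, 12)):
    """Feature dict of bit-path prefixes for a word, as used by
    CRF taggers; returns {} for out-of-vocabulary words."""
    path = clusters.get(word)
    if path is None:
        return {}
    return {"cp%d" % p: path[:p] for p in prefixes}

clusters = load_brown_clusters(["0110\tchicago\t120", "0111\tflorida\t95"])
print(cluster_prefix_features("chicago", clusters))
# {'cp4': '0110', 'cp8': '0110', 'cp12': '0110'}
```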

Data Pre-processing

See Data preprocessing.ipynb
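Tweet pre-processing for NER usually normalizes Twitter-specific noise: URLs and @mentions become placeholder tokens, and hashtags keep their word but drop the '#'. The regex sketch below illustrates this common approach under assumed placeholder names; it is not the notebook's exact pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def normalize_tweet(text):
    """Replace URLs and @mentions with placeholder tokens and
    strip the '#' from hashtags, keeping the tag word."""
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<MENTION>", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return text

print(normalize_tweet("@user check http://t.co/abc #Chicago"))
# <MENTION> check <URL> Chicago
```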

Preliminary comparison with RNN models

See KerasCharRNN.ipynb and KerasWordRNN.ipynb

Please cite as:

@inproceedings{mishra2016_wnut_ner,
    author    = "Shubhanshu Mishra and Jana Diesner",
    title     = "Semi-supervised Named Entity Recognition in noisy-text",
    booktitle = "Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)",
    publisher = "The COLING 2016 Organizing Committee",
    pages     = "203--212",
    url       = "http://aclweb.org/anthology/W16-3927",
    year      = "2016",
    month     = "dec"
}