
NEDforNoisyText

Named Entity Disambiguation for Noisy Text

This repository contains code for reproducing the experiments in our "Named Entity Disambiguation for Noisy Text" submission to CoNLL 2017. Some manual steps are required to set up the data for the experiments:

  • Please set up a MySQL schema with the page and redirect tables from a Wikipedia dump (see the sketch after this list).
  • Please place the Wikipedia pages-articles XML file at data/enwiki/pages-articles.xml.
  • Processing the Wikilinks files from UMass requires an installation of Scala.
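A minimal sketch of the first two steps, assuming the standard file names of the latest English Wikipedia dump on dumps.wikimedia.org (the exact dump date and the <db_*> placeholders are not specified by this repository):

    # Create the schema and import the page and redirect tables.
    mysql -u <db_user> -p<db_pass> -e "CREATE DATABASE IF NOT EXISTS <db_schema>"
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz
    gunzip -c enwiki-latest-page.sql.gz | mysql -u <db_user> -p<db_pass> <db_schema>
    gunzip -c enwiki-latest-redirect.sql.gz | mysql -u <db_user> -p<db_pass> <db_schema>

    # Place the pages-articles dump where the scripts expect it.
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    bunzip2 enwiki-latest-pages-articles.xml.bz2
    mkdir -p data/enwiki
    mv enwiki-latest-pages-articles.xml data/enwiki/pages-articles.xml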

Python libraries required for the project: keras, tensorflow, numpy, nltk, json, unicodedata, unidecode, mysql.connector, urllib, matplotlib, zipfile, ujson, pandas, urlparse, sklearn.
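Several of the names above (json, unicodedata, urllib, zipfile, urlparse) ship with the Python standard library (urlparse in particular suggests Python 2), while mysql.connector and sklearn are provided by the mysql-connector-python and scikit-learn packages. A minimal install sketch, assuming pip and unpinned versions (the repository does not pin versions):

    pip install keras tensorflow numpy nltk unidecode mysql-connector-python matplotlib ujson pandas scikit-learn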

Setting up the data

  • Run ./setup_data.sh to set up the shared data.
  • To set up data for the WikilinksNED evaluation, run ./setup_wikilinks.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To set up data for the CoNLL test, run ./setup_conll.sh <db_user> <db_pass> <db_schema> <db_ip>
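For example, with hypothetical database credentials (user wiki, password secret, schema wikipedia, MySQL running locally):

    ./setup_data.sh
    ./setup_wikilinks.sh wiki secret wikipedia 127.0.0.1
    ./setup_conll.sh wiki secret wikipedia 127.0.0.1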

REMARK: As of now you are required to use the script at https://github.com/NoamGit/Wiki2JSON/ to convert Wikilinks from its original format (Thrift) to an easier-to-work-with JSON format. The resulting files should be stored at data/wikilinks/unprocessed. We are working to automate this step as soon as possible.

Running evaluations

  • To run the WikilinksNED evaluation, run ./evaluateWikilinksNED.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To run the CoNLL evaluation, run ./pretrainOnWikipedia.sh <db_user> <db_pass> <db_schema> <db_ip> and then ./evaluateCoNLL.sh <db_user> <db_pass> <db_schema> <db_ip>
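Continuing with the same hypothetical credentials as in the setup example:

    # WikilinksNED evaluation
    ./evaluateWikilinksNED.sh wiki secret wikipedia 127.0.0.1

    # CoNLL evaluation: pretrain on Wikipedia first, then evaluate
    ./pretrainOnWikipedia.sh wiki secret wikipedia 127.0.0.1
    ./evaluateCoNLL.sh wiki secret wikipedia 127.0.0.1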

Results are written to the evaluation.txt file in the main directory.

NOTES:

  • Setting up the data and running the experiments take a very long time.
  • After setting up the data for both experiments, the data folder can exceed 300 GB.

This code is provided as-is. Running this code or any part of it is at your own risk. We do not take any responsibility for running any of the code or for usage of any of the data. Much of this code was written while I was still learning Keras, TensorFlow, and Theano. It has undergone many changes and was used for extensive experimentation, so it likely contains design flaws and redundancies.
