
NEDforNoisyText

Named Entity Disambiguation for Noisy Text

This repository contains code for reproducing the experiments in our "Named Entity Disambiguation for Noisy Text" submission to CoNLL 2017. Some manual steps are required to set up the data for the experiments:

  • Please set up a MySQL schema with the page and redirect tables from a Wikipedia dump (see the sketch after this list).
  • Please place the Wikipedia pages-articles XML file at data/enwiki/pages-articles.xml.
  • Processing the Wikilinks files from UMass requires an installation of Scala.
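A minimal sketch of the first two steps, assuming the standard file names of the latest English Wikipedia dump on dumps.wikimedia.org (the exact dump date and the <db_*> placeholders are not specified by this repository):

    # Create the schema and import the page and redirect tables.
    mysql -u <db_user> -p<db_pass> -e "CREATE DATABASE IF NOT EXISTS <db_schema>"
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz
    gunzip -c enwiki-latest-page.sql.gz | mysql -u <db_user> -p<db_pass> <db_schema>
    gunzip -c enwiki-latest-redirect.sql.gz | mysql -u <db_user> -p<db_pass> <db_schema>

    # Place the pages-articles dump where the scripts expect it.
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    bunzip2 enwiki-latest-pages-articles.xml.bz2
    mkdir -p data/enwiki
    mv enwiki-latest-pages-articles.xml data/enwiki/pages-articles.xml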

Python libraries required for the project: keras, tensorflow, numpy, nltk, json, unicodedata, unidecode, mysql.connector, urllib, matplotlib, zipfile, ujson, pandas, urlparse, sklearn.
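Several of the names above (json, unicodedata, urllib, zipfile, urlparse) ship with the Python standard library (urlparse in particular suggests Python 2), while mysql.connector and sklearn are provided by the mysql-connector-python and scikit-learn packages. A minimal install sketch, assuming pip and unpinned versions (the repository does not pin versions):

    pip install keras tensorflow numpy nltk unidecode mysql-connector-python matplotlib ujson pandas scikit-learn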

Setting up the data

  • Run ./setup_data.sh to set up the shared data.
  • To set up data for the WikilinksNED evaluation, run ./setup_wikilinks.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To set up data for the CoNLL test, run ./setup_conll.sh <db_user> <db_pass> <db_schema> <db_ip>
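For example, with hypothetical database credentials (user wiki, password secret, schema wikipedia, MySQL running locally):

    ./setup_data.sh
    ./setup_wikilinks.sh wiki secret wikipedia 127.0.0.1
    ./setup_conll.sh wiki secret wikipedia 127.0.0.1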

REMARK: As of now you are required to use the script at https://github.com/NoamGit/Wiki2JSON/ to convert Wikilinks from its original format (Thrift) to an easier-to-work-with JSON format. The resulting files should be stored at data/wikilinks/unprocessed. We are working to automate this step as soon as possible.

Running evaluations

  • To run the WikilinksNED evaluation, run ./evaluateWikilinksNED.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To run the CoNLL evaluation, run ./pretrainOnWikipedia.sh <db_user> <db_pass> <db_schema> <db_ip> and then ./evaluateCoNLL.sh <db_user> <db_pass> <db_schema> <db_ip>
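Continuing with the same hypothetical credentials as in the setup example:

    # WikilinksNED evaluation
    ./evaluateWikilinksNED.sh wiki secret wikipedia 127.0.0.1

    # CoNLL evaluation: pretrain on Wikipedia first, then evaluate
    ./pretrainOnWikipedia.sh wiki secret wikipedia 127.0.0.1
    ./evaluateCoNLL.sh wiki secret wikipedia 127.0.0.1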

Results are written to the evaluation.txt file in the main directory.

NOTES:

  • Setting up the data and running the experiments take a very long time.
  • After setting up the data for both experiments, the data folder can exceed 300 GB.

This code is provided as-is. Running this code or any part of it is at your own risk. We do not take any responsibility for running any of the code or for usage of any of the data. Much of this code was written while I was still learning Keras, TensorFlow, and Theano. It has undergone many changes and was used for extensive experimentation, so it likely contains design flaws and redundancies.
