This repository has been archived by the owner on Oct 20, 2018. It is now read-only.

Adds log-linear weighting of features for disambiguation #390

Open · wants to merge 62 commits into base branch `development`

Commits
3f62f2f
migrating to scala 2.10
tgalery Jun 1, 2015
839e41e
add dependencies for new breeze version
Jun 12, 2015
99095bb
started vector model wrapper in contextsimilarity interface, added br…
Jun 13, 2015
c44546b
VectorContextSimilarity partly implemented with basic vector model fu…
Jun 14, 2015
d063627
fixed dependencies, but mvn package still fails.
Jun 15, 2015
b16b760
Seems that I was misusing breeze. core now compiles but is untested
Jun 15, 2015
4a7ac33
adding scala actors
tgalery Jun 17, 2015
b8a5030
add dependencies for new breeze version
Jun 12, 2015
0f9894a
started vector model wrapper in contextsimilarity interface, added br…
Jun 13, 2015
ce93716
VectorContextSimilarity partly implemented with basic vector model fu…
Jun 14, 2015
d537800
fixed dependencies, but mvn package still fails.
Jun 15, 2015
ad2491e
Seems that I was misusing breeze. core now compiles but is untested
Jun 15, 2015
58b99de
trying to fix breeze errors
Jun 17, 2015
a95776a
everything builds now
Jun 17, 2015
841584f
use vector context similarity
Jul 1, 2015
292ee27
disregard, just testing
Jul 1, 2015
e5ea9d4
fix paths
Jul 1, 2015
51c792f
Integrated VectorContextSimilarity to the point where it's basically …
Jul 1, 2015
ef353aa
work in progress: building memorystore for vector models, and refacto…
Jul 9, 2015
f89a80f
finishing up implementation of memorystore for vector models. implem…
Jul 11, 2015
3f2ad3b
edited .gitignore
Jul 12, 2015
0444238
implemented vector model store including model creation
Jul 13, 2015
9a1db6f
evaluation and vector store indexing implemented
Jul 15, 2015
cbd8250
working on training data generation for RankLib
Jul 22, 2015
5fbd072
undo some unneeded changes from before
Jul 22, 2015
2c194e4
added RankLib training data generation as a side effect of evaluation
Jul 22, 2015
6114561
basic log-linear model weighting implemented (gives a small performan…
Jul 24, 2015
00b6169
work in progress: LLM training integration into index
Aug 2, 2015
9cdb668
refactored LLM training to separate script
Aug 9, 2015
20ac1cc
updated ranklib training data generation
Aug 10, 2015
2ab209d
update ranklib model creation. Training data generation works, but sc…
Aug 10, 2015
2c0a2b0
RanklibTrainingDataWriter bug fixed, llm weights training implemented
Aug 11, 2015
4db1012
integrated LLM training as option into index_db.sh
Aug 11, 2015
4018540
made run_server executable
Aug 11, 2015
5cfcf7f
add dependencies for new breeze version
Jun 12, 2015
9e1c13a
add dependencies for new breeze version
Jun 12, 2015
f2b0718
trying to fix breeze errors
Jun 17, 2015
727c322
everything builds now
Jun 17, 2015
e22733d
use vector context similarity
Jul 1, 2015
8162a2d
Integrated VectorContextSimilarity to the point where it's basically …
Jul 1, 2015
3919e7c
work in progress: building memorystore for vector models, and refacto…
Jul 9, 2015
1a07e8d
finishing up implementation of memorystore for vector models. implem…
Jul 11, 2015
ddeb717
working on training data generation for RankLib
Jul 22, 2015
069495d
undo some unneeded changes from before
Jul 22, 2015
3941b8e
added RankLib training data generation as a side effect of evaluation
Jul 22, 2015
bef2a03
basic log-linear model weighting implemented (gives a small performan…
Jul 24, 2015
f7ef127
work in progress: LLM training integration into index
Aug 2, 2015
5430db4
refactored LLM training to separate script
Aug 9, 2015
d4de688
updated ranklib training data generation
Aug 10, 2015
2523b7a
update ranklib model creation. Training data generation works, but sc…
Aug 10, 2015
21d6171
RanklibTrainingDataWriter bug fixed, llm weights training implemented
Aug 11, 2015
fabf86d
integrated LLM training as option into index_db.sh
Aug 11, 2015
86a95fe
made llm mixture weights loading optional
Aug 12, 2015
0d09881
added new features to LLM model, but there are still some details lef…
Aug 15, 2015
6b672cd
saving all progress on new features and LLM training
Aug 17, 2015
2b9205b
only use vectorcontextsimilarity if we can actually load it here
Sep 2, 2015
44fdfb7
fixing errors that were introduced in the rebase
Jan 8, 2016
f1e5318
General cleanup in preparation for PR to @tgalery's new branch. This …
phdowling Mar 20, 2016
f649c4d
Merge pull request #1 from phdowling/feature/gsoc-llm-hopefully-final
tgalery Mar 20, 2016
beff993
improving readme and version bump
tgalery Jun 19, 2016
b80397b
removing unecessary .sh files
tgalery Jun 28, 2016
c807e63
removing another missing script
tgalery Jun 28, 2016
9 changes: 8 additions & 1 deletion .gitignore
@@ -3,9 +3,16 @@
.classpath
.project
.settings/
data_quickstart
target
*.log
*~
index/output
core/.cache
data_quickstart/
dist
eval/.cache
index/.cache
push.sh
rest-tomcat/.cache
rest/.cache
uima/.cache
49 changes: 44 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
# DBpedia Spotlight
# DBpedia Spotlight
#### Shedding Light on the Web of Documents

DBpedia Spotlight looks for ~3.5M things of unknown or ~320 known types in text and tries to link them to their global unique identifiers in [DBpedia](http://dbpedia.org).
@@ -28,12 +28,51 @@ or for JSON:

#### Run your own server

##### Download jar file and data

If you need service reliability and lower response times, you can run DBpedia Spotlight in your own [In-House Server](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation). Just download a model and Spotlight from [here](http://spotlight.sztaki.hu/downloads/) to get started.

wget http://spotlight.sztaki.hu/downloads/dbpedia-spotlight-latest.jar
wget http://spotlight.sztaki.hu/downloads/latest_models/en.tar.gz
tar xzf en.tar.gz
java -jar dbpedia-spotlight-latest.jar en http://localhost:2222/rest
1. wget http://spotlight.sztaki.hu/downloads/dbpedia-spotlight-latest.jar
2. wget http://spotlight.sztaki.hu/downloads/latest_models/en.tar.gz
3. tar xzf en.tar.gz
4. java -jar dbpedia-spotlight-latest.jar en http://localhost:2222/rest

Note that `en` above is the path to the English model downloaded in step 2, and
`http://localhost:2222/rest` is the mount point of the Spotlight server.
Although you can change the base address and port, you cannot change the `/rest` mount point.
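Once the server is up, you query it over plain HTTP. A minimal sketch of building an annotation request against the mount point above; the `/rest/annotate` endpoint and its `text`/`confidence` parameters follow Spotlight's public REST API, but adjust if your deployment differs:

```python
from urllib.parse import urlencode

def annotate_url(text, confidence=0.5, base="http://localhost:2222/rest"):
    # Build a GET request URL for Spotlight's /rest/annotate endpoint.
    # `text` and `confidence` are standard parameters of the public REST API;
    # the default base URL matches the `java -jar` invocation above.
    query = urlencode({"text": text, "confidence": confidence})
    return f"{base}/annotate?{query}"

url = annotate_url("Berlin is the capital of Germany.")
```

Fetching that URL with any HTTP client (or curl) returns the annotated entities.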

##### Build from source

If you want to run the latest version of Spotlight (to be packaged as v0.8), do the following:

1. Clone this repository
2. Check out the development branch (`git checkout -b development origin/development`, where `origin` is the name of the official Spotlight remote)
3. Build the package: `cd dbpedia-spotlight && mvn clean package` (requires Java 7 and Maven)
4. Download an entity model (as per step 2 in the subsection above)
5. Uncompress the language model tarball (as per step 3 in the subsection above)
6. Run the Spotlight server (as per step 4 in the subsection above; the jar file is built in the `dist/target` folder)

Note: the current development branch (v0.8) works with the vanilla datasets provided in the steps above,
but it also works with datasets containing (i) weights for a log-linear model for disambiguation,
and (ii) serialized dense vector representations (word2vec) that are loaded and used in the disambiguation step.
The weights are a simple `ranklib-model.txt` file that should be included in the language model's folder (if it's not there already), with content as follows:

```
## Coordinate Ascent
## Restart = 5
## MaxIteration = 25
## StepBase = 0.05
## StepScale = 2.0
## Tolerance = 0.001
## Regularized = false
## Slack = 0.001
1:0.37391416006434364 2:0.07140601847073497 3:0.2616870643056067 4:0.07643781575763943 5:0.21655494140167517
```
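The `##` lines record RankLib's training hyperparameters; the final line maps feature indices to weights. A minimal sketch of how a linear model like this scores a disambiguation candidate, i.e. as a weighted sum of feature values (what each of the five features measures is not recorded in the file, so the inputs here are purely illustrative):

```python
def load_weights(lines):
    # Return {feature_index: weight} from ranklib-model.txt content.
    # Lines starting with "##" are RankLib's hyperparameter comments.
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue
        return {int(idx): float(w)
                for idx, w in (pair.split(":") for pair in line.split())}
    return {}

def score(weights, features):
    # Log-linear ranking score: weighted sum of the candidate's feature values.
    return sum(w * features.get(i, 0.0) for i, w in weights.items())

weights = load_weights(["## Coordinate Ascent", "1:0.5 2:0.25 3:0.25"])
```

The candidate with the highest such score wins the disambiguation step.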

The serialized dense vectors should be placed under a `word2vec` folder inside the Spotlight language model's root.
Most of the work of generating these vectors is done by Idio's [wiki2vec](https://github.com/idio/wiki2vec), plus some tooling.
The use of these models is extremely experimental, so testing and bug reporting are very welcome.
A full wiki on how to generate these dense vector representations and obtain the LLM weights is in the works, but for a tentative guide see [this document](https://github.com/phdowling/gsoc-progress/wiki/Final-Summary).
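Under the hood, vector-based context similarity boils down to comparing an aggregated context vector against a candidate entity's vector, typically by cosine similarity. A pure-Python sketch of that core idea (the real implementation is the Scala `VectorContextSimilarity` class using breeze over the serialized word2vec vectors, so this is an illustration only):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def context_vector(word_vectors):
    # Aggregate the context by averaging its word vectors.
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(dim)]
```

A candidate whose entity vector has higher cosine similarity to the averaged context vector is considered a better fit for the mention.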

#### Models and data

39 changes: 35 additions & 4 deletions bin/index_db.sh
@@ -10,12 +10,15 @@
# $4 Analyzer+Stemmer language prefix e.g. Dutch
# $5 Model target folder


# TODO: call ranklib to train LLM and generate output

export MAVEN_OPTS="-Xmx26G"

usage ()
{
echo "index_db.sh"
echo "usage: ./index_db.sh -o /data/spotlight/nl/opennlp wdir nl_NL /data/spotlight/nl/stopwords.nl.list Dutch /data/spotlight/nl/final_model"
echo "usage: ./index_db.sh -v -o /data/spotlight/nl/opennlp wdir nl_NL /data/spotlight/nl/stopwords.nl.list Dutch /data/spotlight/nl/final_model"
echo "Create a database-backed model of DBpedia Spotlight for a specified language."
echo " "
}
@@ -24,12 +27,18 @@ usage ()
opennlp="None"
eval="false"
blacklist="false"
data_only="false"
local_mode="false"
train_llm="false"

while getopts "eo:b:" opt; do
while getopts "leo:dvb:" opt; do
case $opt in
o) opennlp="$OPTARG";;
e) eval="true";;
b) blacklist="$OPTARG";;
d) data_only="true";;
l) local_mode="true";;
v) train_llm="true";;
esac
done

@@ -206,8 +215,30 @@ cd $BASE_WDIR/dbpedia-spotlight

mvn -pl index exec:java -Dexec.mainClass=org.dbpedia.spotlight.db.CreateSpotlightModel -Dexec.args="$2 $WDIR $TARGET_DIR $opennlp $STOPWORDS $4Stemmer"

if [ "$eval" == "true" ]; then
mvn -pl eval exec:java -Dexec.mainClass=org.dbpedia.spotlight.evaluation.EvaluateSpotlightModel -Dexec.args="$TARGET_DIR $WDIR/heldout.txt" > $TARGET_DIR/evaluation.txt
if [ "$data_only" == "true" ]; then
echo "$CREATE_MODEL" >> create_models.job.sh
else
eval "$CREATE_MODEL"

if [ "$train_llm" == "true" ]; then
echo "Training LLM Weights"
[Review comment · Member Author] Should this be a separate script? Could it run in isolation?

[Reply] There is the `train_llm.sh` script, but I figured I would keep this in here so that a model can be built by just calling one script. Should I simply remove this?

[Reply · Member Author] I think it would be better to keep it as a separate script. Plus, @jodaiber is doing a lot of changes to index_db.sh, so we might have to talk about the best way of doing this.

echo "Downloading ranklib..."
mkdir -p $BASE_WDIR/ranklib/
cd $BASE_WDIR/ranklib/
curl -L -o RankLib-2.1-patched.jar "http://downloads.sourceforge.net/project/lemur/lemur/RankLib-2.1/RankLib-2.1-patched.jar?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Flemur%2Ffiles%2Flemur%2FRankLib-2.1%2F&ts=1439317425&use_mirror=skylink"
[Review comment · Member Author] I was wondering if we could add RankLib as a maven dependency? There is a suggestion on how to do it here: http://sourceforge.net/p/lemur/discussion/ranklib/thread/a45e2a7c/?limit=25.

[Reply · Member] Yeah, might be nice to do that! On the other hand it's only a training-time dependency, right?

[Reply · Member Author] True, but I find it a bit ugly downloading the jar at training time.

[Reply] We do the same type of thing for pignlproc and other training dependencies, I think I kind of imitated that.


cd $BASE_DIR
echo "Generating features and writing ranklib train data..."
MAVEN_OPTS='-Xmx15G' mvn -pl index exec:java -Dexec.mainClass=org.dbpedia.spotlight.db.CreateLLMTrainData -Dexec.args="$2 $WDIR $TARGET_DIR";

echo "Training LLM weights using ranklib..."
java -jar $BASE_WDIR/ranklib/RankLib-2.1-patched.jar -ranker 4 -train $TARGET_DIR/ranklib-training-data.txt -save $TARGET_DIR/ranklib-model.txt -metric2t ERR@1
fi

if [ "$eval" == "true" ]; then
mvn -pl eval exec:java -Dexec.mainClass=org.dbpedia.spotlight.evaluation.EvaluateSpotlightModel -Dexec.args="$TARGET_DIR $WDIR/heldout.txt" > $TARGET_DIR/evaluation.txt
fi

fi

curl https://raw.githubusercontent.com/dbpedia-spotlight/model-quickstarter/master/model_readme.txt > $TARGET_DIR/README.txt
136 changes: 136 additions & 0 deletions bin/train_llm.sh
@@ -0,0 +1,136 @@
#!/bin/bash
[Review comment · Member Author] It feels that this has a lot of similarities with the other scripts. It might be a better idea to split those into scripts that download the data and scripts that can actually be used to train the LLM.

[Reply] Let's discuss on slack what this should look like.

#+------------------------------------------------------------------------------------------------------------------------------+
#| DBpedia Spotlight - Create database-backed model |
#| @author Joachim Daiber |
#| @author Philipp Dowling |
#+------------------------------------------------------------------------------------------------------------------------------+

# $1 Working directory
# $2 Locale (en_US)
# $3 Stopwords file
# $4 Analyzer+Stemmer language prefix e.g. Dutch
# $5 Model target folder

# TODO test run, fix usage string, integrate into index_db.sh

export MAVEN_OPTS="-Xmx26G"

usage ()
{
echo "train_llm.sh"
echo "usage: ./train_llm.sh wdir en_US /data/spotlight/stopwords.list English /data/spotlight/output_model_folder"
echo "Train weights for the log-linear model used by Spotlight's vector-based context similarity."
echo " "
}


opennlp="None"
eval="false"
data_only="false"
local_mode="false"


while getopts "ledo:" opt; do
case $opt in
e) eval="true";;
d) data_only="true";;
l) local_mode="true";;
esac
done


shift $((OPTIND - 1))

if [ $# != 5 ]
then
usage
exit
fi

BASE_DIR=$(pwd)

if [[ "$1" = /* ]]
then
BASE_WDIR="$1"
else
BASE_WDIR="$BASE_DIR/$1"
fi

if [[ "$5" = /* ]]
then
TARGET_DIR="$5"
else
TARGET_DIR="$BASE_DIR/$5"
fi

if [[ "$3" = /* ]]
then
STOPWORDS="$3"
else
STOPWORDS="$BASE_DIR/$3"
fi

WDIR="$BASE_WDIR/$2"

if [[ "$opennlp" == "None" ]]; then
echo "";
elif [[ "$opennlp" != /* ]]; then
opennlp="$BASE_DIR/$opennlp";
fi


LANGUAGE=`echo $2 | sed "s/_.*//g"`

echo "Language: $LANGUAGE"
echo "Working directory: $WDIR"

mkdir -p $WDIR

# Stop processing if one step fails
set -e

cd $BASE_DIR
#Set up pig:
if [ -d $BASE_WDIR/pig ]; then
echo "Updating PigNLProc..."
cd $BASE_WDIR/pig/pignlproc
git reset --hard HEAD
git pull
else
echo "Setting up PigNLProc..."
mkdir -p $BASE_WDIR/pig/
cd $BASE_WDIR/pig/
git clone --depth 1 https://github.com/dbpedia-spotlight/pignlproc.git
cd pignlproc
echo "Building PigNLProc..."
fi


echo "Generating train data."
mkdir -p $BASE_WDIR/wikipedia/
cd $BASE_WDIR/wikipedia/
echo "Downloading wikipedia dump..."
# curl -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles.xml.bz2"

echo "Splitting off train set..."
# bzcat ${LANGUAGE}wiki-latest-pages-articles.xml.bz2 | python $BASE_WDIR/pig/pignlproc/utilities/split_train_test.py 12000 $WDIR/heldout.txt > /dev/null

echo "Downloading DBpedia redirects and disambiguations..."
cd $WDIR
if [ ! -f "redirects.nt" ]; then
curl -# http://downloads.dbpedia.org/current/$LANGUAGE/redirects_$LANGUAGE.nt.bz2 | bzcat > redirects.nt
curl -# http://downloads.dbpedia.org/current/$LANGUAGE/disambiguations_$LANGUAGE.nt.bz2 | bzcat > disambiguations.nt
fi

echo "Downloading ranklib..."
mkdir -p $BASE_WDIR/ranklib/
cd $BASE_WDIR/ranklib/
curl -L -o RankLib-2.1-patched.jar "http://downloads.sourceforge.net/project/lemur/lemur/RankLib-2.1/RankLib-2.1-patched.jar?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Flemur%2Ffiles%2Flemur%2FRankLib-2.1%2F&ts=1439317425&use_mirror=skylink"

cd $BASE_DIR
echo "Generating features and writing ranklib train data..."
MAVEN_OPTS='-Xmx15G' mvn -pl index exec:java -Dexec.mainClass=org.dbpedia.spotlight.db.CreateLLMTrainData -Dexec.args="$2 $WDIR $TARGET_DIR";

echo "Training model using ranklib..."
java -jar $BASE_WDIR/ranklib/RankLib-2.1-patched.jar -ranker 4 -train $TARGET_DIR/ranklib-training-data.txt -save $TARGET_DIR/ranklib-model.txt -metric2t ERR@1

3 changes: 2 additions & 1 deletion core/pom.xml
@@ -25,7 +25,7 @@
<parent>
<groupId>org.dbpedia.spotlight</groupId>
<artifactId>spotlight</artifactId>
<version>0.7</version>
<version>0.8</version>
<relativePath>../pom.xml</relativePath>
</parent>

@@ -242,6 +242,7 @@
<version>0.10</version>
</dependency>


<dependency>
<groupId>com.typesafe.akka</groupId>
<artifactId>akka-actor_2.10</artifactId>