Repository contents:

bin/
config/
demo_data/
wikipedia_data/
README
addSymTrainDev.py
evaluate.py
find-oovs.perl
sub-oovs.py
test_helper.sh
train_helper.sh
train_main.sh
translit_strings.sh

README

Transliteration training and decoding with sample Urdu data

1. If using a Mac, update the SRILM call in train_helper.sh (comment out line 72 and uncomment line 73).
If using the CLSP or COE clusters, no change is needed.

2. Set the JOSHUA, SRILM, and JAVA_HOME environment variables appropriately (or use the pointers at the top of train_helper.sh and test_helper.sh for COE machines)

3. To train a system, create a directory containing a "train" file and a "dev" file of transliteration pairs, then run (pointing at the training directory):
./train_main.sh demo_data/urdu
An example directory is demo_data/urdu (note that its train and dev files already have beginning-of-word and end-of-word symbols appended; a hypothetical illustration of the format follows below)
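For orientation only, here is a hypothetical illustration of the pair format. It assumes one pair per line, source and target separated by a tab, characters space-separated, and <w>/</w> as the appended boundary symbols; the actual symbols and segmentation are whatever the provided demo files and addSymTrainDev.py use, so check demo_data/urdu for the real format:

<w> ا ن و ر </w>	<w> a n w a r </w>
<w> ک ر ا چ ی </w>	<w> k a r a c h i </w>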

4. To decode a test set, run:
./translit_strings.sh demo_data/urdu/urdu_test demo_data/urdu
First argument: the test file (a list of single words to decode)
Second argument: the same directory used in the training step above

Output is written to demo_data/urdu/urdu_test.answer
For this example, references are given in demo_data/urdu/urdu_test.ref for comparison

5. Evaluate your output:
python evaluate.py demo_data/urdu/urdu_test.answer demo_data/urdu/urdu_test.ref
This simple evaluation script reports the number and percentage of perfect transliterations, the average edit distance, and the average normalized edit distance (normalized by the length of the reference)
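For reference, a minimal sketch of this kind of evaluation (an illustration, not the actual evaluate.py; it assumes the answer and reference files are line-aligned, one transliteration per line):

import sys

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def main(answer_file, ref_file):
    answers = [line.strip() for line in open(answer_file)]
    refs = [line.strip() for line in open(ref_file)]
    assert len(answers) == len(refs), "answer and reference files must be line-aligned"
    dists = [edit_distance(a, r) for a, r in zip(answers, refs)]
    perfect = sum(1 for a, r in zip(answers, refs) if a == r)
    norm = [d / float(len(r)) for d, r in zip(dists, refs) if r]
    n = len(refs)
    print("perfect: %d / %d (%.1f%%)" % (perfect, n, 100.0 * perfect / n))
    print("avg edit distance: %.3f" % (sum(dists) / float(n)))
    print("avg normalized edit distance: %.3f" % (sum(norm) / float(len(norm))))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])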

-------------------------------------

Using your own data

train/dev data:
To use your own data, just create a directory with a train file and a dev file. To append beginning-of-word and end-of-word symbols, use addSymTrainDev.py (run: python addSymTrainDev.py inputTrain); a sketch of this kind of preprocessing follows below.
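A minimal sketch of what this preprocessing step does (an illustration, not the actual addSymTrainDev.py; the <w> and </w> boundary symbols and the tab-separated, one-pair-per-line format are assumptions):

import sys

BOW, EOW = "<w>", "</w>"  # hypothetical boundary symbols; match whatever the demo files use

def add_symbols(line):
    # Wrap each tab-separated field (source and target side) in boundary symbols.
    fields = line.rstrip("\n").split("\t")
    return "\t".join("%s %s %s" % (BOW, f, EOW) for f in fields)

if __name__ == "__main__":
    for line in open(sys.argv[1]):
        print(add_symbols(line))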

language model data:
Two English language model files are included in demo_data/lm.
One is taken from Wikipedia person-page titles and the other from the NE-labeled Gigaword corpus; each includes relative frequency counts.

train your own LM:
You can use a different LM; just change the pointer in train_main.sh (and use the script in demo_data/lm to add beginning-of-word and end-of-word symbols).
To build from a monolingual corpus:
cd demo_data/lm
python make_lm.py my-input-text
Then change the LM pointer to my-input-text.wordModel
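As a rough sketch of what such a word model might contain (an illustration, not the actual make_lm.py; the word/tab/relative-frequency output format is an assumption, though the .wordModel extension follows the usage above):

import sys
from collections import Counter

def make_word_model(corpus_path):
    # Count word tokens in the corpus and write each word with its relative frequency.
    counts = Counter()
    for line in open(corpus_path):
        counts.update(line.split())
    total = float(sum(counts.values()))
    with open(corpus_path + ".wordModel", "w") as out:
        for word, n in counts.most_common():
            out.write("%s\t%g\n" % (word, n / total))

if __name__ == "__main__":
    make_word_model(sys.argv[1])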

-------------------------------------

Using Wikipedia data:
See wikipedia_data/README

-------------------------------------

Substitute transliterations into Joshua MT output

To find OOVs in Joshua MT output: find-oovs.perl

After generating transliterations for all OOV words, create a dictionary file (OOV word, tab, transliterated word) and substitute them into the Joshua output (a sketch of this step follows below):
python sub-oovs.py tab-separated-oov-dictionary translation-output-file
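A minimal sketch of this substitution step (an illustration, not the actual sub-oovs.py; it assumes whitespace tokenization and that OOV words appear verbatim in the Joshua output):

import sys

def load_dictionary(path):
    # One entry per line: OOV word, tab, transliterated word.
    table = {}
    for line in open(path):
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table

def substitute(dict_path, translation_path):
    # Replace every token found in the dictionary; pass all other tokens through.
    table = load_dictionary(dict_path)
    for line in open(translation_path):
        print(" ".join(table.get(tok, tok) for tok in line.split()))

if __name__ == "__main__":
    substitute(sys.argv[1], sys.argv[2])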

-------------------------------------

If you use this system, please cite the following:
http://www.clsp.jhu.edu/~anni/papers/irvine_amta_translit.pdf

-------------------------------------

Questions:
annirvine@gmail.com