aixeuy/DI-C-project

D-I C Project: Implementation of Yarowsky's Algorithm

data collection (do not do this):

Collect data for word sense disambiguation of 'bank': finance vs. river

run:
mkdir data
python collect_data.py
cat data/* > data/bank_final.txt
rm data/*tmp*

Each line is a sentence containing 'bank'.
financial bank: line ends with '+'.
river bank: line ends with '-'.
unknown: line ends with '0'.
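
The labelling convention above can be read back with a few lines of Python. This is a minimal sketch, not code from the repository: the function names, the sense names in SENSES and the sample sentences are made up for illustration, and it assumes the marker is the last non-whitespace character of each line.

```python
from collections import Counter

# Assumed mapping from the trailing marker to a sense name (names are illustrative).
SENSES = {"+": "financial", "-": "river", "0": "unknown"}


def parse_line(line):
    """Split one dataset line into (sentence, sense) using its last character."""
    text = line.rstrip()  # drop the trailing newline and spaces
    if not text or text[-1] not in SENSES:
        raise ValueError(f"line does not end with '+', '-' or '0': {line!r}")
    return text[:-1].rstrip(), SENSES[text[-1]]


def load_dataset(lines):
    """Parse every non-empty line of the dataset."""
    return [parse_line(line) for line in lines if line.strip()]


# Hypothetical sample lines in the format described above.
sample = [
    "I deposited the cheque at the bank +",
    "We had a picnic on the bank of the river -",
    "The bank was closed 0",
]
examples = load_dataset(sample)
counts = Counter(sense for _, sense in examples)
print(counts)
```

To read the real file, pass an open file handle instead of the sample list, e.g. load_dataset(open("data/bank_final.txt")).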

run scala program:

input arguments:
input: path of the training dataset
output: path of the classification of input sentences generated by this algorithm
model: path of the model
n_start, n_end, n_step: the start value, end value and step size of N in N-Gram, used in searching
m_start, m_end, m_step: the start value, end value and step size of number of selected N-Grams, used in searching
num_iter_start, num_iter_end, num_iter_step: ...number of maximum iterations in each training...
threshold_start, threshold_end, threshold_step: ...minimum absolute score to label an unclassified sentence...
alpha_start, alpha_end, alpha_step: ...learning rate...
delta_start, delta_end, delta_step: ...minimum model parameter changes to stop training...
mod: run mode, one of search, train or test
test_sent: sentence to test
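
To make the roles of N and threshold concrete, here is a minimal Python sketch. It is not the repository's implementation: the helper names ngrams and label_from_score are hypothetical, tokenisation is plain whitespace splitting, and it assumes that a positive score means the financial sense ('+'), a negative score the river sense ('-'), and that threshold is positive.

```python
def ngrams(sentence, n):
    """Return all contiguous word n-grams of a sentence as tuples."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def label_from_score(score, threshold):
    """Label a sentence only if its absolute score reaches the threshold.

    Assumption: score > 0 means financial ('+'), score < 0 means river ('-').
    Sentences whose score is too close to zero stay unknown ('0').
    """
    if score >= threshold:
        return "+"
    if score <= -threshold:
        return "-"
    return "0"


print(ngrams("I go to the bank", 2))
print(label_from_score(0.9, 0.5), label_from_score(-0.7, 0.5), label_from_score(0.2, 0.5))
```

A higher threshold leaves more sentences unlabelled in each iteration, trading coverage for confidence in the labels that are assigned.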

Missing arguments use their default values.
In train mode, specify each model parameter with its _start argument (e.g. --m_start 700).

examples:

search for the best m and threshold, using default values for the other hyperparameters:
spark-submit --class ca.uwaterloo.cs451.yarowski.Yarowsky \
target/assignments-1.0.jar --input data/bank_final.txt \
--mod search \
--m_start 100 --m_end 1000 --m_step 300 \
--threshold_start 0.5 --threshold_end 0.8 --threshold_step 0.1

search for the best m with threshold = 0.5, using default values for the other hyperparameters:
spark-submit --class ca.uwaterloo.cs451.yarowski.Yarowsky \
target/assignments-1.0.jar --input data/bank_final.txt \
--mod search \
--m_start 100 --m_end 1000 --m_step 300 \
--threshold_start 0.5 --threshold_end 0.5 --threshold_step 0.1
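
The _start/_end/_step triples in the two search commands above can be pictured as a grid of candidate values. The Python sketch below is an illustration only, not the Scala program's code: the grid function is hypothetical, and it assumes the end value is inclusive, which the second command (start equal to end) suggests but the README does not state.

```python
from itertools import product


def grid(start, end, step, eps=1e-9):
    """Values start, start + step, ... up to and including end (assumed inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    v = start
    while v <= end + eps:  # eps tolerates floating-point drift near the end value
        values.append(v)
        v += step
    return values


m_values = grid(100, 1000, 300)          # --m_start 100 --m_end 1000 --m_step 300
threshold_values = grid(0.5, 0.8, 0.1)   # --threshold_start 0.5 --threshold_end 0.8 --threshold_step 0.1
combos = list(product(m_values, threshold_values))

print(m_values)
print(threshold_values)
print(len(combos), "combinations to evaluate")
```

Because the values are built by repeated floating-point addition, a grid like this can contain 0.7999999999999999 rather than 0.8, which may explain the threshold value passed in the train example further down.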

train a model with m = 700 and threshold = 0.7999999999999999 (the value passed below), using default values for the other hyperparameters;
write the classification of the input to "output/" and the model to "model/":
spark-submit --class ca.uwaterloo.cs451.yarowski.Yarowsky \
target/assignments-1.0.jar --input data/bank_final.txt \
--output output --model model \
--mod train \
--m_start 700 --threshold_start 0.7999999999999999

find the classification of the sentence "I go to the bank and withdraw some money":
spark-submit --class ca.uwaterloo.cs451.yarowski.Yarowsky \
target/assignments-1.0.jar --input data/bank_final.txt \
--model model \
--mod test --test_sent "I go to the bank and withdraw some money"

run python program

execute code in python_version_project/yarowsky.ipynb
