Skip to content
/ miaseg Public

Code for "Meaning-Informed Low-Resource Segmentation of Agglutinative Morphology"

License

Notifications You must be signed in to change notification settings

cbelth/miaseg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Miaseg

Code for the SCiL 2024 paper Meaning-Informed Low-Resource Segmentation of Agglutinative Morphology

@inproceedings{belth2024meaning,
  title={Meaning-Informed Low-Resource Segmentation of Agglutinative Morphology},
  author={Belth, Caleb},
  booktitle={Proceedings of the Society for Computation in Linguistics 2024},
  year={2024}
}

Reproducing Results

The results files from the paper are in results/. If you want to reproduce these files or create results for new seeds or training sizes, you can use the script exp.py. The script takes four command-line arguments:

--exp_name / -e
    - One of hun|mon|fin|tur
--model / -m
    - One of miaseg|morfessor|transformer
--num_train / -num_train (Optional)
    - The number of words to train on
    - By default, the script runs the model once each on 500, 1000, and 10000
--num_seeds / -seeds (Optional)
    - The number of seeds to run on 
    - By default 10

For example, to reproduce Miaseg's results on Turkish, you can run:

python exp.py -e tur -m miaseg

Note that each model has a number of required packages to run:

Notes on the Transformer comparison model

Running exp.py with the model argument as -m transformer actually does not run the transformer model. It runs a mock model TrReader, which simply reads the predictions of the transformer model and converts them into the evaluation metrics (accuracy, F1, etc.). This is because the transformer takes a long time to run and it is thus desirable to distribute different runs of the model onto different GPUs, save the results of each, and then aggregate them later (as TrReader does).

TrReader assumes that the result of each transformer run (which is a preds.txt file) are in results/transformer_feats/{exp}/ts_{num_train}/seed_{seed}, where {exp} is one of the language names hun|mon|fin|tur, {num_train} is 100, 1000, and 10000, and {seed} is 0...10 (all corresponding straight-forwardly to the exp.py arguments).

If for some reason you really want to re-run the transformer comparison model, you will need to look at hyper_feat_transformer.py, which wraps feat_transformer to provide hyperparameter tuning. The arguments are similar but not identical to those of exp.py:

--exp_name / -e
    - One of hun|mon|fin|tur
--data_path / -data_path
    - A path to the data (e.g., `data/fin/nouns.txt`)
    - This is handy if you, like me, run this on a server where you need to put all the data in a scratch directory with a slurm script
--temp_dir / -temp_dir
    - A path to save the results (should be `results/transformer_feats/{exp}/ts_{num_train}/seed_{seed}`, as described above)
    - This is handy if you, like me, run this on a server where you need to save all the data in a scratch directory before moving it off with a slurm script
--num_train / -num_train (Optional)
    - The number of words to train on
    - By default 10000
--seed / -seed (Optional)
    - The seed to run on 
    - By default 0

Running on Your Own Data

Miaseg is implemented in the Python package algophon. I recommend that implementation for running the model on your own data. Just pip install algophon!