Train and evaluate linguistic distributional models
Repository hosted and maintained on GitHub: https://github.com/emcoglab/ldm-train-and-evaluate

This codebase provides tools for:
- Cleaning and tokenising text corpora.
- Computing summary information about text corpora.
- Training linguistic distributional models (LDMs) from text corpora.
- Querying LDMs using various distance measures (e.g. cosine distance; see the sketch after this list).
- Evaluating LDMs using several benchmarking test datasets.
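
For background, a distance measure over word vectors works like the following generic sketch. This is plain NumPy, not this repository's API, and is included only to illustrate what "querying with a distance measure" involves:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance between two word vectors: 1 minus cosine similarity."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "word vectors"; real LDM vectors are much higher-dimensional.
cat = np.array([0.2, 0.9, 0.1, 0.4])
dog = np.array([0.3, 0.8, 0.2, 0.5])
print(f"cosine distance(cat, dog) = {cosine_distance(cat, dog):.4f}")
```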
This project requires Python 3.7+.
Start by installing the requirements:
```
pip install -r requirements.txt
```
You may want to use a virtual environment.
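For example, one way to set that up with Python's built-in `venv` module (the environment name `venv` is arbitrary):

```bash
# Create and activate an isolated environment, then install dependencies.
python3 -m venv venv            # "venv" is an arbitrary directory name
source venv/bin/activate        # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```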
To run a script, move to the directory above this project directory and run the script as a module using Python's `-m` flag, so that the scripts' package-relative imports resolve correctly. For example:

```
python -m corpus_analysis.scripts_model_evaluation.1_synonym_tests
```
To set up configuration, copy the file `ldm/preferences/default_config.yaml` to somewhere else accessible and name it something like `config_override.yaml`. In `config_override.yaml`, set the paths to be relevant to your local setup. Only values set in `config_override.yaml` will override the corresponding values in `default_config.yaml`, so you don't need to set everything if it's not relevant. Then add the following as the first non-comment line in the script you are running:

```python
from ldm.preferences.config import Config; Config(use_config_overrides_from_file="/path/to/config_override.yaml")
```
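As an illustration only, an override file might look like the sketch below. The key names here are hypothetical; copy the actual keys from `default_config.yaml`:

```yaml
# config_override.yaml
# NOTE: the key names below are hypothetical illustrations;
# copy the real keys from ldm/preferences/default_config.yaml.
corpora:
  bnc:
    path: "/data/corpora/bnc"      # local corpus location (example)
models:
  save-dir: "/data/ldm/models"     # where trained models are stored (example)
```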
Scripts to run to reproduce the analysis are found in `scripts_…` directories; critical ones are numbered in sequence.
Non-numbered scripts are just for fun.
To reproduce the analysis from beginning to end, run the following scripts in order (and have a lot of time on your hands).