Skip to content

ccoreilly/spacy-catala-generator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Training script and dataset for spacy-catala

Note: this repository uses Git LFS for the files under train directory

$ ./train.sh
Usage: train.sh <model_name> <model_version> <vectors_path> <train_dir> <dev_dir> --nvectors [vector_size]
E.g.: train.sh ca_fasttext_md 1.2.0 cc.ca.300.vec.gz train dev --nvectors 50000
Train a spacy model.

vectors_path expects gzipped text vectors. These are not included, you can download them with:

$ curl -x -o cc.ca.300.vec.gz https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ca.300.vec.gz

In order to recreate the large model in spacy-catala run:

$ ./train.sh ca_fasttext_wiki_lg 1.0.0 cc.ca.300.vec.gz train dev

The medium sized model has been pruned to the most common 20000 vectors using:

$ ./train.sh ca_fasttext_wiki_md 1.0.0 cc.ca.300.vec.gz train dev --nvectors 20000