Demos for floret vectors

Note: As specified in requirements.txt, all demos currently require a custom version of spaCy from the feature/fasttext-bloom-vectors branch.

Demos currently use data from OSCAR to train the vectors, streaming the unshuffled deduplicated corpora with the datasets library.
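
For a quick test, the streaming setup looks roughly like this (a sketch: the dataset config, output path, and text limit are illustrative, and the demos additionally tokenize the texts before training):

```python
from itertools import islice

from datasets import load_dataset

# Stream the unshuffled deduplicated Korean OSCAR corpus without
# downloading the full dataset first.
dataset = load_dataset(
    "oscar", "unshuffled_deduplicated_ko", split="train", streaming=True
)

# fasttext/floret read their training data from a file (see the notes
# below), so write the streamed texts out, capped for a quick test run.
max_texts = 1000
with open("oscar_ko.txt", "w", encoding="utf8") as f:
    for record in islice(dataset, max_texts):
        f.write(record["text"].replace("\n", " ") + "\n")
```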

A demo for training and loading floret vectors:
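
Independent of the demo's project workflow, the core train-and-load steps look roughly like this (a sketch assuming the floret Python package's fasttext-style API; file names and hyperparameters are placeholders borrowed from the Korean demo below):

```python
import floret

# Train floret vectors: fasttext-style training in "floret" mode, which
# stores hashed subwords in a fixed number of rows instead of a full
# vocabulary table.
model = floret.train_unsupervised(
    "oscar_ko.txt",  # tokenized training texts, one text per line
    mode="floret",
    hash_count=2,    # number of hashes per entry
    bucket=50000,    # number of rows in the vector table
    minn=2,
    maxn=3,
    dim=300,
)
model.save_vectors("ko_vectors.floret")

# With floret support in spaCy (built into released spaCy as of v3.2,
# or the custom branch above), the exported table can be imported into
# a pipeline from the command line:
#   spacy init vectors ko ko_vectors.floret ./ko_vectors --mode floret
```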

A demo for training and comparing standard fasttext and floret vectors using QVEC:

Demos for training vectors and spaCy pipelines, with a focus on cases where floret vectors are expected to improve performance compared to standard fasttext vectors with a fixed vocabulary (a sketch after the list illustrates the underlying out-of-vocabulary difference):

  • floret_ko_ud_demo: agglutinative languages with Korean UD

    With 1M (3.3G) tokenized training texts and 50K 300-dim vectors, ~800K keys for the standard vectors:

    | Vectors                 | TAG  | POS  | DEP UAS | DEP LAS |
    | ----------------------- | ---- | ---- | ------- | ------- |
    | none                    | 72.6 | 85.0 | 73.3    | 64.6    |
    | standard (pruned)       | 78.2 | 89.7 | 78.9    | 73.1    |
    | floret (minn 2, maxn 3) | 83.2 | 94.2 | 83.4    | 80.7    |

    With 12G tokenized training texts and 50K 300-dim vectors (except for the unpruned standard vectors, which keep all ~1M keys):

    | Vectors                 | TAG  | POS  | DEP UAS | DEP LAS | SPEED |
    | ----------------------- | ---- | ---- | ------- | ------- | ----- |
    | none                    | 72.6 | 85.0 | 73.3    | 64.6    | 15272 |
    | standard (pruned)       | 78.9 | 89.9 | 79.0    | 73.5    | 14754 |
    | standard (unpruned)     | 81.6 | 91.8 | 80.8    | 76.1    | 14200 |
    | floret (minn 2, maxn 3) | 83.6 | 94.3 | 83.5    | 80.7    | 13530 |
  • floret_fi_ud_demo: agglutinative languages with Finnish: UD_Finnish-TDT (syntax) and turku-ner-corpus (NER)

    With 13G tokenized training texts and 50K 300-dim vectors (except for the unpruned standard vectors, which keep all ~1M keys):

    | Vectors                 | TAG  | POS  | MORPH | DEP UAS | DEP LAS | ENTS F | SPEED (syntax) |
    | ----------------------- | ---- | ---- | ----- | ------- | ------- | ------ | -------------- |
    | none                    | 93.5 | 92.5 | 86.2  | 79.4    | 72.7    | 62.0   | 12693          |
    | standard (pruned)       | 96.6 | 95.6 | 89.6  | 84.6    | 79.6    | 72.2   | 13407          |
    | standard (unpruned)     | 97.0 | 96.0 | 90.9  | 84.6    | 80.0    | 72.2   | 13269          |
    | floret (minn 4, maxn 5) | 97.1 | 96.0 | 91.6  | 84.6    | 80.2    | 73.6   | 12044          |
  • floret_hu_ner_demo: agglutinative languages with Hungarian NER

    With 500K (1.5G) tokenized training texts and 50K 300-dim vectors, with ~650K unique keys for the standard vectors:

    | Vectors                 | P    | R    | F    |
    | ----------------------- | ---- | ---- | ---- |
    | none                    | 93.7 | 93.5 | 93.6 |
    | standard (pruned)       | 94.9 | 95.0 | 95.0 |
    | floret (minn 5, maxn 6) | 97.1 | 95.9 | 96.5 |
  • floret_en_noisy_ner_demo: noisy data with English NER for emerging events on Twitter

    With 500K (2.5G) tokenized training texts and 20K 300-dim vectors, with ~200K unique keys for the standard vectors:

    | Vectors                 | P    | R    | F    |
    | ----------------------- | ---- | ---- | ---- |
    | none                    | 30.7 | 18.7 | 23.3 |
    | standard (pruned)       | 34.5 | 25.6 | 29.4 |
    | floret (minn 5, maxn 6) | 39.9 | 23.3 | 29.4 |
  • floret_en_so_ner_demo: out-of-domain data with English NER for StackOverflow vs. GitHub

    With 500K (2.5G) tokenized training texts and 20K 300-dim vectors:

    | Vectors                 | F (in-domain) | F (out-of-domain) |
    | ----------------------- | ------------- | ----------------- |
    | none                    | 55.6          | 37.6              |
    | standard (pruned)       | 55.3          | 35.1              |
    | floret (minn 5, maxn 6) | 55.5          | 35.9              |

    In this case, the OSCAR texts are not particularly suitable training data for the vectors. It would probably be better to train vectors on texts from StackOverflow or a similar source instead.
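
The difference these demos probe is out-of-vocabulary behavior: pruned standard vectors can only store a fixed set of keys, while floret composes a vector for any string from hashed character n-grams. A minimal sketch comparing two hypothetical pipelines trained by the Korean demo (the paths and the example word are placeholders):

```python
import spacy

# Hypothetical pipelines trained with each vector type.
nlp_standard = spacy.load("training/ko_standard/model-best")
nlp_floret = spacy.load("training/ko_floret/model-best")

# A rare inflected form that is unlikely to be among the stored keys.
word = "보았겠더라도"

# Standard vectors: tokens outside the stored keys fall back to an
# all-zero vector, so the model gets no lexical signal for them.
print(nlp_standard.vocab[word].vector.any())  # likely False

# floret vectors: every string is summed from its subword hash rows,
# so even unseen inflections get a nonzero, informative vector.
print(nlp_floret.vocab[word].vector.any())    # True
```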

Notes

To test a workflow quickly, set max_texts to a very small value like 100. A much larger amount of training data for the vectors is obviously needed for more meaningful comparisons in the test cases. The provided defaults are still quite small for typical vector training data, but should show some results and train in a not-too-unreasonable amount of time on a small number of threads.

For reference, 1M texts from the OSCAR datasets are about 5G for English, 4G for Hungarian, and 3G for Korean. fasttext does not support streamed input, so it is necessary to have the tokenized training data saved in a file. Outside of a demo, I'd often use tmpfs.
