Script to build FastText training file from OLMo sources #188

rauthur · 2023-05-30T12:08:11Z

This is intended for use with the existing ft_tagger.py. Example invocation:

python -m ai2_llm_filters.core_tools.ft_dataset \
  -t /Users/russell/Projects/ai2-llm/tmp-data \
  -s /Users/russell/Projects/ai2-llm/tmp-data-neg /Users/russell/Projects/ai2-llm/tmp-data-neg-2 \
  -m sentence \
  --newlines skip \
  -o ./test-output.txt \
  --n-segments 3 \
  --pos-label pos \
  --neg-label neg

This creates a data file with 3 positive and 3 negative sentence-level examples, taken from all files under the tmp-data folder (positive examples) and the tmp-data-neg and tmp-data-neg-2 folders (negative examples). Input and output locations can be on S3. Files are downloaded and processed in parallel. More than one path can be given for the negative examples; if useful, the same could easily be supported for the positive examples.

The above might output a training file like this:

__label__pos The Trofeo S.A.R. Princesa Sofia was born in 1968 with only one participating class, the Dragon, based at Real Club Náutico de Palma.
__label__pos It was then that the event became organised by four clubs: Palma, S'Arenal, San Antonio de la Playa and de Mar -instead of the present organiser Calanova-.
__label__pos Two years later the Olympic character of the event began to show with the temporary replacement of the Dragon by the Star.
__label__neg I like dogs
__label__neg I like cats
__label__neg I don't like alligators
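
A quick way to sanity-check a generated file is to count the examples per label. The sketch below is not part of this PR; `count_labels` is a hypothetical helper that assumes only the `__label__<name> <text>` line format shown above.

```python
from collections import Counter

LABEL_PREFIX = "__label__"


def count_labels(lines):
    """Count examples per label in fastText-style lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        first_token = line.split(maxsplit=1)[0]
        if not first_token.startswith(LABEL_PREFIX):
            raise ValueError(f"line does not start with a label: {line[:40]!r}")
        # Strip the "__label__" prefix to recover the bare label name.
        counts[first_token[len(LABEL_PREFIX):]] += 1
    return counts


sample = [
    "__label__pos Two years later the Olympic character of the event began to show.",
    "__label__neg I like dogs",
    "__label__neg I like cats",
    "__label__neg I don't like alligators",
]
print(dict(count_labels(sample)))
```

For a real file, the same function can be fed an open file handle, for example `count_labels(open("./test-output.txt", encoding="utf-8"))`.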

Note that examples are not currently shuffled between the positive and negative classes. Within each class, the order is semi-random across files, depending on how quickly each worker process finishes.
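
Since the output is not shuffled across classes, one option is to shuffle it as a separate step before training. This is a minimal sketch and not something the script does; `shuffle_lines` and `shuffle_file` are hypothetical helpers, and the approach assumes the file fits in memory.

```python
import random


def shuffle_lines(lines, seed=0):
    """Return a shuffled copy of the lines; a fixed seed keeps it reproducible."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def shuffle_file(src_path, dst_path, seed=0):
    """Read a local training file, shuffle its examples, and write a new file."""
    with open(src_path, encoding="utf-8") as src:
        # Drop blank lines and normalize line endings so every example
        # ends with exactly one newline, including the last one.
        lines = [line.rstrip("\n") + "\n" for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(shuffle_lines(lines, seed))
```

For example, `shuffle_file("./test-output.txt", "./test-output.shuffled.txt")` would write a shuffled copy next to the original; the output file name here is only illustrative.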

kyleclo

lgtm; can u add a README walking through use of this to create a filter using C4?

soldni

neat stuff! okay to merge

Script to build FastText compatible training data file from local or …

ef8626e

…remote OLMo sources

rauthur added the project/data label (Related to training and evaluation data) May 30, 2023

rauthur requested a review from kyleclo May 30, 2023 12:08

rauthur changed the title ~~Script to build FastText compatible training data file from OLMo sources~~ Script to build FastText training file from OLMo sources May 30, 2023

kyleclo approved these changes May 30, 2023

rauthur and others added 3 commits May 30, 2023 21:08

Add readme with c4 note

8ca6b7a

remove debugging

1ba9416

Merge branch 'main' into ft-tagger-dataset

94512a3

soldni approved these changes Jun 2, 2023

Merge branch 'main' into ft-tagger-dataset

a44ce9e

rauthur merged commit 0b55217 into main Jun 6, 2023

rauthur deleted the ft-tagger-dataset branch June 6, 2023 13:35
