Script to build FastText training file from OLMo sources #188

Merged: rauthur merged 5 commits into main from ft-tagger-dataset on Jun 6, 2023


Collaborator

@rauthur rauthur commented May 30, 2023

This is intended for use with the existing ft_tagger.py. Example invocation:

python -m ai2_llm_filters.core_tools.ft_dataset \
  -t /Users/russell/Projects/ai2-llm/tmp-data \
  -s /Users/russell/Projects/ai2-llm/tmp-data-neg /Users/russell/Projects/ai2-llm/tmp-data-neg-2 \
  -m sentence \
  --newlines skip \
  -o ./test-output.txt \
  --n-segments 3 \
  --pos-label pos \
  --neg-label neg

This creates a data file with 3 positive and 3 negative examples at the sentence level, drawn from all files under the tmp-data folder (positive examples) and the tmp-data-neg and tmp-data-neg-2 folders (negative examples). Input and output locations can be on S3. Files are downloaded and processed in parallel. More than one path can be specified for the negative examples; if this turns out to be useful, the same can easily be supported for positive examples.

The above might output a training file like this:

__label__pos The Trofeo S.A.R. Princesa Sofia was born in 1968 with only one participating class, the Dragon, based at Real Club Náutico de Palma.
__label__pos It was then that the event became organised by four clubs: Palma, S'Arenal, San Antonio de la Playa and de Mar -instead of the present organiser Calanova-.
__label__pos Two years later the Olympic character of the event began to show with the temporary replacement of the Dragon by the Star.
__label__neg I like dogs
__label__neg I like cats
__label__neg I don't like alligators
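
As a quick sanity check on a file in this format, the number of examples per label can be counted with a few lines of Python. This is only a sketch: `count_labels` is a hypothetical helper that is not part of this PR, and it assumes every non-empty line starts with a single `__label__<name>` token followed by a space and the text.

```python
from collections import Counter

LABEL_PREFIX = "__label__"  # prefix used in the sample output above


def count_labels(lines):
    """Count examples per label in fastText-style lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label = line.split(" ", 1)[0]  # first whitespace-delimited token
        if not label.startswith(LABEL_PREFIX):
            raise ValueError(f"line does not start with a label: {line[:40]!r}")
        counts[label[len(LABEL_PREFIX):]] += 1
    return counts


sample = [
    "__label__pos Two years later the Olympic character of the event began to show.",
    "__label__neg I like dogs",
    "__label__neg I like cats",
    "__label__neg I don't like alligators",
]
print(dict(count_labels(sample)))  # {'pos': 1, 'neg': 3}
```

For a real output file, the same function could be fed an open file handle, since iterating over a file yields its lines.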

Note that examples are not currently shuffled between the positive and negative classes. Within each class, the order is semi-random across files, depending on how quickly each worker process finishes.
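
Because the classes are written one after the other, it may be worth shuffling the file before training. The following is a minimal sketch of one way to do that, not something this PR provides: `shuffle_training_file` is a hypothetical helper, and it assumes the output is a local file small enough to fit in memory (an S3 output would need to be downloaded first).

```python
import random


def shuffle_training_file(src_path, dst_path, seed=0):
    """Write the non-empty lines of src_path to dst_path in a seeded random order.

    Hypothetical helper; loads the whole file into memory.
    Returns the number of lines written.
    """
    with open(src_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]

    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(lines)

    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(line + "\n" for line in lines)
    return len(lines)
```

With the example invocation above, this could be called as `shuffle_training_file("./test-output.txt", "./test-output.shuffled.txt")`. Fixing the seed keeps the result reproducible between runs.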

@rauthur rauthur added the project/data label (Related to training and evaluation data) May 30, 2023
@rauthur rauthur requested a review from kyleclo May 30, 2023 12:08
@rauthur rauthur changed the title from "Script to build FastText compatible training data file from OLMo sources" to "Script to build FastText training file from OLMo sources" May 30, 2023
Contributor

@kyleclo kyleclo left a comment

lgtm; can u add a README walking through use of this to create a filter using C4?

Member

@soldni soldni left a comment

neat stuff! okay to merge

@rauthur rauthur merged commit 0b55217 into main Jun 6, 2023
@rauthur rauthur deleted the ft-tagger-dataset branch June 6, 2023 13:35