
added models to run #953

Merged (15 commits, Jun 20, 2024)

Conversation

@KennethEnevoldsen (Contributor) commented Jun 18, 2024

Added models to run. It still lacks retrieval, clustering, and information retrieval, and it is also missing some models.

To test that things run, I have added "check_run.sh" and have also run it locally.

Additionally, I have made some bugfixes: ensured that e5-instruct can take the device argument, and ensured that prompt_name is not called task_name.

@Muennighoff I think the script in this PR can be run as is (or alternatively, you might want to split the gritLM runs up into two separate scripts).

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Muennighoff (Contributor)

Looks great! However, I think there is an issue where the script does not provide the revision:

    mteb run \
    -m $model \
    -t MindSmallReranking SemRel24STS AJGT SummEval NusaTranslationBitextMining \
    --output_folder $results_folder \
    --co2_tracker true

which then fails due to this check:

f"Model revision {revision} not found for model {model_name}"

Maybe we should just default to the revision we have if none is provided by the user?

bash scripts/mmteb/running_model/check_run.sh
Running model on a sample set of tasks
Running model: sentence-transformers/all-MiniLM-L6-v2
INFO:mteb.cli:Running with parameters: Namespace(model='sentence-transformers/all-MiniLM-L6-v2', task_types=None, categories=None, tasks=['MindSmallReranking', 'SemRel24STS', 'AJGT', 'SummEval', 'NusaTranslationBitextMining'], languages=None, device=None, output_folder='results', verbosity=2, co2_tracker=True, eval_splits=None, model_revision=None, func=<function run at 0x7fdd72b85a20>)
Traceback (most recent call last): 
  File "/env/lib/conda/gritkto/bin/mteb", line 8, in <module>
    sys.exit(main())
  File "/data/niklas/mteb/mteb/cli.py", line 362, in main
    args.func(args)
  File "/data/niklas/mteb/mteb/cli.py", line 108, in run
    model = mteb.get_model(args.model, args.model_revision, device=args.device)
  File "/data/niklas/mteb/mteb/models/__init__.py", line 37, in get_model
    meta = get_model_meta(model_name, revision)
  File "/data/niklas/mteb/mteb/models/__init__.py", line 61, in get_model_meta
    raise ValueError(
ValueError: Model revision None not found for model sentence-transformers/all-MiniLM-L6-v2
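
A minimal sketch of that fallback, assuming a registry keyed by model name (MODEL_REGISTRY and the single-revision-per-entry layout below are illustrative assumptions, not the actual mteb internals):

    # Illustrative sketch only: fall back to the registered revision when the
    # caller does not pass one, instead of raising immediately.
    def get_model_meta(model_name: str, revision: str | None = None):
        meta = MODEL_REGISTRY.get(model_name)  # hypothetical lookup table
        if meta is None:
            raise ValueError(f"Model {model_name} not found")
        if revision is None or revision == meta.revision:
            # default to the revision we have on record
            return meta
        raise ValueError(
            f"Model revision {revision} not found for model {model_name}"
        )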

@KennethEnevoldsen (Contributor, Author)

Hmm, odd. I feel like I fixed that bug a while ago; there might have been a merge conflict somewhere. Anyway, I have pushed a fix for it to this branch, along with one dataset run just to make sure it works.

@Muennighoff (Contributor)

Thanks! I think some of the model loaders are missing the func attribute? I can look into it, but I guess you might already know what the problem is?

Traceback (most recent call last):
  File "/env/lib/conda/gritkto/bin/mteb", line 8, in <module>
    sys.exit(main())
  File "/data/niklas/mteb/mteb/cli.py", line 362, in main
    args.func(args)
  File "/data/niklas/mteb/mteb/cli.py", line 118, in run
    eval.run(
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 301, in run
    self._save_model_metadata(meta, output_path)
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 457, in _save_model_metadata
    json.dump(model_meta.to_dict(), f)
  File "/data/niklas/mteb/mteb/model_meta.py", line 80, in to_dict
    dict_repr["loader"] = loader.func.__name__ if loader is not None else None
AttributeError: 'function' object has no attribute 'func'

@KennethEnevoldsen (Contributor, Author)

Fixed!

loader.func.__name__ should be used when the loader is a partial(loader, ...), while if the loader is just a plain function, we should use loader.__name__.
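
For reference, a minimal sketch of that branch (illustrative, not the exact diff in this PR):

    # Sketch: functools.partial exposes the wrapped callable via .func,
    # while a plain function carries its own __name__.
    from functools import partial

    def loader_name(loader):
        if loader is None:
            return None
        if isinstance(loader, partial):
            return loader.func.__name__
        return loader.__name__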

@Muennighoff (Contributor)

Running bash scripts/mmteb/running_model/run_baseline_models.sh right now, but the utilization is poor. Can we turn it into an sbatch script that submits an array, like this one: https://github.com/ContextualAI/gritlm/blob/main/scripts/eval_mteb.sh

[Screenshot: GPU utilization, 2024-06-19 1:36 PM]

@KennethEnevoldsen (Contributor, Author)

I sadly don't have much experience with slurm arrays, but I have added a Python script (create_slurm_jobs.py) for running the models as single jobs per task (you will probably need to change the slurm prefix, though).

Alternatively, building the array script should be possible with a simple Python script around the mteb.get_tasks call and the model_names list already defined; a rough sketch of that approach follows below.
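
A rough sketch, reusing the mteb run flags shown earlier (the SBATCH header, model list, and output paths are placeholders to adapt, not the contents of the actual create_slurm_jobs.py):

    # Sketch: write one sbatch script per (model, task) pair and submit it.
    import os
    import subprocess
    import mteb

    model_names = ["sentence-transformers/all-MiniLM-L6-v2"]  # placeholder list
    tasks = mteb.get_tasks(task_types=["STS"])  # narrowed selection for illustration

    SLURM_PREFIX = (
        "#!/bin/bash\n"
        "#SBATCH --job-name=mteb\n"
        "#SBATCH --gres=gpu:1\n"
        "#SBATCH --time=24:00:00\n"
    )  # adjust to your cluster

    os.makedirs("jobs", exist_ok=True)
    for model in model_names:
        for task in tasks:
            task_name = task.metadata.name
            job = SLURM_PREFIX + f"mteb run -m {model} -t {task_name} --output_folder results\n"
            path = f"jobs/{model.replace('/', '__')}__{task_name}.sh"
            with open(path, "w") as f:
                f.write(job)
            subprocess.run(["sbatch", path], check=True)

This submits one job per (model, task) pair rather than a true slurm array, but the same loop could just as well emit a single task list plus one #SBATCH --array script over it.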

@KennethEnevoldsen (Contributor, Author)

@Muennighoff, merging this as it contains some fixes for problems that seem to be causing issues elsewhere. We can continue the discussion either here or on Slack.

@KennethEnevoldsen enabled auto-merge (squash) June 20, 2024 08:57
@Muennighoff (Contributor)

Where should I push the results once done, @KennethEnevoldsen? Maybe to https://huggingface.co/datasets/mteb/results?
I think it is a bit confusing that we now have lots of results in this repo, too. I'm not sure we can put all results in this repo, as it would make cloning inconvenient due to the size, I think 🤔 https://huggingface.co/datasets/mteb/results is 34 MB, and I assume that what we run here could easily add another 60 MB+.

@KennethEnevoldsen (Contributor, Author)

@orionw what are your thoughts on this?

@orionw (Contributor) commented Jun 21, 2024

@Muennighoff raises a good point.

I only see three options: (1) we use this repo and allow it to grow as we get more results (~100 MB); (2) we keep our separate HF results dataset that people merge into; or (3) we create a separate GitHub repo and add it as an optional submodule to mteb.

This assumes we can't make those files noticeably smaller (I don't think they contain much redundant info) as a way to fit them all in this repo.

If we do (2), we can at least separate out the leaderboard calculation functionality, and I can put that on GitHub Actions in some GitHub repo. But users would still have to commit to the dataset.

Of these, I might prefer (3): we can then sync that repo to the HF dataset so it can be easily read with datasets, while people contribute via GitHub PRs and it is still available to check out in mteb.

@KennethEnevoldsen (Contributor, Author)

I think I am leaning towards (1) or (3). While it does lead to a notably larger GitHub repo, it does not influence the package size, and when working with the repo you are only uploading diffs anyway; it is only the first download that will take slightly longer. (1) is simpler, so I would probably recommend that. If we go for either (2) or (3), I think we should streamline the method for pushing results (both for our sake and others').

@Muennighoff (Contributor)

Pushed them here preliminarily: https://github.com/embeddings-benchmark/results
I'd agree with @orionw on (3): ideally we have only one results repo, but if HF datasets repos do not interface nicely with GitHub Actions and we need an HF datasets repo to make the results loadable, then I guess we need two.

Alpaca uses solution (1) (https://github.com/tatsu-lab/alpaca_eval/tree/main/results), but they have a few orders of magnitude less data. If you feel strongly about it, though, I guess that is fine too 🤔
