# Tutorial 2: Filter the pool of intermediate datasets / ESMs

The idea of hf-dataset-selector is that suitable intermediate datasets should be selected numerically, by their ESM-LogME score, rather than by heuristics. We therefore advise searching as large a dataset pool as possible. However, it is also possible to restrict the pool to specific datasets or their ESM representations.

For example, we might want to evaluate only ESMs with a specific architecture, or only ESMs that were trained on a large enough (sub-)set of an intermediate dataset.
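As a rough sketch of the architecture case, the snippet below filters a small made-up pool using stand-in config objects, so it runs without network access. `StubConfig` and `filter_repo_ids` are hypothetical names, the repo IDs are invented, and `"mlp"` is a made-up architecture label. The field names mirror keys that appear in an ESM config (`task_id`, `num_examples`, `esm_architecture`); it is an assumption that real config objects expose `esm_architecture` as an attribute in the same way as `num_examples`.

```python
from dataclasses import dataclass


@dataclass
class StubConfig:
    """Hypothetical stand-in for an ESM config (not part of hfselect)."""
    task_id: str
    num_examples: int
    esm_architecture: str


def filter_repo_ids(repo_ids, configs, architecture, min_examples=0):
    """Return the repo IDs whose config matches the architecture and size floor."""
    return [
        repo_id
        for repo_id, config in zip(repo_ids, configs)
        if config.esm_architecture == architecture
        and config.num_examples >= min_examples
    ]


# Made-up example pool; these are not real repositories.
example_repo_ids = ["user/ESM_a", "user/ESM_b", "user/ESM_c"]
example_configs = [
    StubConfig("org/a_reviews", 8000, "linear"),
    StubConfig("org/b_reviews", 300, "linear"),
    StubConfig("org/c_news", 9000, "mlp"),  # "mlp" is an invented label
]

selected = filter_repo_ids(
    example_repo_ids, example_configs, "linear", min_examples=500
)
print(selected)
```

Keeping the repo IDs and configs in two parallel lists matches what `find_esm_repo_ids` and `fetch_esm_configs` return in this tutorial, so the same `zip` pattern should carry over to the real objects.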

In [14]:
from hfselect import find_esm_repo_ids, fetch_esm_configs, Dataset

In [2]:
repo_ids = find_esm_repo_ids("bert-base-multilingual-uncased")
print(f"Found {len(repo_ids)} ESM repo IDs.")

Found 1509 ESM repo IDs.


In [3]:
esm_configs = fetch_esm_configs(repo_ids)

Fetching ESM Configs:   0%|          | 0/1509 [00:00<?, ?ESM Config/s]

## Filtering for specific datasets / ESMs

In this example, we only want to use ESMs that satisfy the following two conditions:
1. The ESM was trained on at least 500 examples of the intermediate dataset.
2. The task ID of the intermediate dataset (its Hugging Face dataset name, including the namespace) contains the substring "review".

In [4]:
filtered_repo_ids = []
for repo_id, esm_config in zip(repo_ids, esm_configs):
    if esm_config.num_examples >= 500 and "review" in esm_config.task_id:
        filtered_repo_ids.append(repo_id)
        
print(f"Found {len(filtered_repo_ids)} ESMs that satisfy the conditions.\n")
print("\n".join(filtered_repo_ids))

Found 20 ESMs that satisfy the conditions.

davidschulte/ESM_ar_res_reviews_default
davidschulte/ESM_yelp_review_full_yelp_review_full
davidschulte/ESM_wongnai_reviews_default
davidschulte/ESM_allegro_reviews_default
davidschulte/ESM_ohidaoui__darija-reviews_default
davidschulte/ESM_swedish_reviews_plain_text
davidschulte/ESM_imdb_urdu_reviews_default
davidschulte/ESM_scaredmeow__shopee-reviews-tl-stars_default
davidschulte/ESM_turkish_product_reviews_default
davidschulte/ESM_CATIE-AQ__french_book_reviews_fr_prompt_sentiment_analysis_default
davidschulte/ESM_jakartaresearch__google-play-review_default
davidschulte/ESM_Areeb123__drug_reviews_default
davidschulte/ESM_rotten_tomatoes_default
davidschulte/ESM_saattrupdan__womens-clothing-ecommerce-reviews_default
davidschulte/ESM_app_reviews_default
davidschulte/ESM_scaredmeow__shopee-reviews-tl-binary_default
davidschulte/ESM_CATIE-AQ__french_book_reviews_fr_prompt_stars_classification_default
davidschulte/ESM_CATIE-AQ__amazon_reviews_mul

In [5]:
from hfselect import fetch_esms

In [6]:
esms = fetch_esms(filtered_repo_ids)

Fetching ESMs:   0%|          | 0/20 [00:00<?, ?ESM/s]

In [15]:
esms[0].config

{'base_model_name': 'bert-base-multilingual-uncased',
 'developers': 'David Schulte',
 'esm_architecture': 'linear',
 'esm_batch_size': 32,
 'esm_learning_rate': 0.001,
 'esm_num_epochs': 10,
 'esm_optimizer': 'AdamW',
 'esm_weight_decay': 0.01,
 'label_column': 'polarity',
 'language': None,
 'lm_batch_size': 32,
 'lm_learning_rate': 2e-05,
 'lm_num_epochs': 3,
 'lm_optimizer': 'AdamW',
 'lm_weight_decay': 0.01,
 'num_examples': 8364,
 'seed': None,
 'task_id': 'hadyelsahar/ar_res_reviews',
 'task_split': 'train',
 'task_subset': 'default',
 'text_column': 'text',
 'transformers_version': '4.36.2'}

## Computing a task ranking from the filtered dataset pool

We can now use the filtered ESMs to rank their intermediate datasets for our target dataset, IMDB.

In [18]:
from hfselect import Dataset, compute_task_ranking

MODEL_NAME = "bert-base-multilingual-uncased"

dataset = Dataset.from_hugging_face(
    name="imdb",
    split="train",
    text_col="text",
    label_col="label",
    is_regression=False,
    num_examples=1000,
    seed=42
)

In [19]:
task_ranking = compute_task_ranking(dataset, model_name=MODEL_NAME, esm_repo_ids=filtered_repo_ids)

Fetching ESMs:   0%|          | 0/20 [00:00<?, ?ESM/s]

Computing embeddings:   0%|          | 0/8 [00:00<?, ?batch/s]

Computing LogME:   0%|          | 0/20 [00:00<?, ?Task/s]

In [20]:
task_ranking.to_pandas()

| Rank | Task ID | Task Subset | Text Column | Label Column | Task Split | Num Examples | ESM Architecture | Score |
|---|---|---|---|---|---|---|---|---|
| 1 | cornell-movie-review-data/rotten_tomatoes | default | text | label | train | 8530 | linear | -0.640987 |
| 2 | mirfan899/imdb_urdu_reviews | default | sentence | sentiment | train | 10000 | linear | -0.643653 |
| 3 | Yelp/yelp_review_full | yelp_review_full | text | label | train | 10000 | linear | -0.646063 |
| 4 | Sharathhebbar24/app_reviews_modded | default | review | star | train | 10000 | linear | -0.6486 |
| 5 | saattrupdan/womens-clothing-ecommerce-reviews | default | review_text | recommended_ind | train | 10000 | linear | -0.648978 |
| 6 | fthbrmnby/turkish_product_reviews | default | sentence | sentiment | train | 10000 | linear | -0.651304 |
| 7 | Areeb123/drug_reviews | default | review | rating | train | 10000 | linear | -0.651694 |
| 8 | scaredmeow/shopee-reviews-tl-stars | default | text | label | train | 10000 | linear | -0.653917 |
| 9 | timpal0l/swedish_reviews | plain_text | text | label | train | 10000 | linear | -0.654108 |
| 10 | scaredmeow/shopee-reviews-tl-binary | default | text | label | train | 10000 | linear | -0.654861 |
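
To hand the best-ranked tasks on to a later step, one option is to pull the top-k task IDs out of the ranking. The sketch below works on plain `(task_id, score)` pairs copied from the first rows of the ranking above, so it does not depend on how the ranking object or its DataFrame is structured. `top_k` is a hypothetical helper, and it assumes that a higher score means a better rank, which is what the ordering above shows.

```python
# (task_id, score) pairs copied from the first rows of the ranking above.
ranking_rows = [
    ("cornell-movie-review-data/rotten_tomatoes", -0.640987),
    ("mirfan899/imdb_urdu_reviews", -0.643653),
    ("Yelp/yelp_review_full", -0.646063),
    ("Sharathhebbar24/app_reviews_modded", -0.6486),
]


def top_k(rows, k):
    """Return the k task IDs with the highest score (hypothetical helper)."""
    # Sort descending by score: assumes higher scores rank first.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [task_id for task_id, _ in ordered[:k]]


best = top_k(ranking_rows, 2)
print(best)
```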
