# Tutorial 1: Rank intermediate datasets

With `hfselect`, we can find suitable datasets for intermediate task transfer learning in a few lines of code.

In [1]:
from hfselect import Dataset, compute_task_ranking

In this example, we use a multilingual BERT model as the base model. Our target dataset is the IMDB dataset.

In [2]:
MODEL_NAME = "bert-base-multilingual-uncased"

dataset = Dataset.from_hugging_face(
    name="imdb",
    split="train",
    text_col="text",
    label_col="label",
    is_regression=False,
    num_examples=1000,
    seed=42
)

We compute the task ranking as follows.

In [3]:
task_ranking = compute_task_ranking(dataset, model_name=MODEL_NAME)

Fetching ESMs:   0%|          | 0/1509 [00:00<?, ?ESM/s]

Computing embeddings:   0%|          | 0/8 [00:00<?, ?batch/s]

Computing LogME:   0%|          | 0/1509 [00:00<?, ?Task/s]

In [5]:
print(len(task_ranking))

1509


In [6]:
print(f"The task ranking consists of {len(task_ranking)} intermediate datasets.\n")
print(task_ranking)

The task ranking consists of 1509 intermediate datasets.

1.   davanstrien/test_imdb_embedd2                     Score: -0.618529
2.   davanstrien/test_imdb_embedd                      Score: -0.618644
3.   davanstrien/test1                                 Score: -0.619334
4.   stanfordnlp/imdb                                  Score: -0.619454
5.   stanfordnlp/sst                                   Score: -0.62995
6.   stanfordnlp/sst                                   Score: -0.63312
7.   kuroneko5943/snap21                               Score: -0.634365
8.   kuroneko5943/snap21                               Score: -0.638787
9.   kuroneko5943/snap21                               Score: -0.639068
10.  fancyzhx/amazon_polarity                          Score: -0.639718
...


The ranking can be converted to a pandas DataFrame for easier inspection.

In [7]:
df = task_ranking.to_pandas()
df.head(10)

|   Rank | Task ID                       | Task Subset     | Text Column   | Label Column   | Task Split   |   Num Examples | ESM Architecture   |     Score |
|-------:|:------------------------------|:----------------|:--------------|:---------------|:-------------|---------------:|:-------------------|----------:|
|      1 | davanstrien/test_imdb_embedd2 | default         | text          | label          | train        |          10000 | linear             | -0.618529 |
|      2 | davanstrien/test_imdb_embedd  | default         | text          | label          | train        |          10000 | linear             | -0.618644 |
|      3 | davanstrien/test1             | default         | text          | label          | train        |          10000 | linear             | -0.619334 |
|      4 | stanfordnlp/imdb              | plain_text      | text          | label          | train        |          10000 | linear             | -0.619454 |
|      5 | stanfordnlp/sst               | dictionary      | phrase        | label          | dictionary   |          10000 | linear             | -0.62995  |
|      6 | stanfordnlp/sst               | default         | sentence      | label          | train        |           8544 | linear             | -0.63312  |
|      7 | kuroneko5943/snap21           | CDs_and_Vinyl_5 | sentence      | label          | train        |           6974 | linear             | -0.634365 |
|      8 | kuroneko5943/snap21           | Video_Games_5   | sentence      | label          | train        |           6997 | linear             | -0.638787 |
|      9 | kuroneko5943/snap21           | Movies_and_TV_5 | sentence      | label          | train        |           6989 | linear             | -0.639068 |
|     10 | fancyzhx/amazon_polarity      | amazon_polarity | content       | label          | train        |          10000 | linear             | -0.639718 |
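
Several Task IDs appear more than once in the ranking because each subset is scored separately (for example, `stanfordnlp/sst` at ranks 5 and 6). As a minimal sketch, the table can be collapsed to the best-scoring row per Task ID with plain pandas. The snippet rebuilds a few of the rows shown above by hand instead of calling `hfselect`, and it assumes that a higher (less negative) score is better, which matches the order of the ranking.

```python
import pandas as pd

# Ranks 4-8 of the ranking shown above, copied by hand for illustration.
rows = [
    (4, "stanfordnlp/imdb", "plain_text", -0.619454),
    (5, "stanfordnlp/sst", "dictionary", -0.62995),
    (6, "stanfordnlp/sst", "default", -0.63312),
    (7, "kuroneko5943/snap21", "CDs_and_Vinyl_5", -0.634365),
    (8, "kuroneko5943/snap21", "Video_Games_5", -0.638787),
]
top = pd.DataFrame(
    rows, columns=["Rank", "Task ID", "Task Subset", "Score"]
).set_index("Rank")

# Sort by score (highest first), then keep only the first row per Task ID.
best_per_task = top.sort_values("Score", ascending=False).drop_duplicates(
    subset="Task ID", keep="first"
)
print(best_per_task)
```

The same two calls should work on the DataFrame returned by `task_ranking.to_pandas()`, provided its column names match the ones displayed above.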


Here we see the top 10 datasets recommended by ESM-LogME. Note that the top 4 datasets are all identical to the IMDB dataset (although their rows might be ordered differently).
This is reassuring: ESM-LogME was able to find the IMDB dataset just by running it through its corresponding ESM.
The top recommendation that is not identical to our target dataset is the SST dataset.
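
To pick out that first non-identical recommendation programmatically, one option is to filter the ranking against a list of known copies of the target dataset. The sketch below is only an illustration: the DataFrame is rebuilt by hand from the top five rows above, and `imdb_copies` is a hypothetical, manually curated exclusion list and not something `hfselect` provides.

```python
import pandas as pd

# Top five Task IDs and scores from the ranking above, copied by hand.
top = pd.DataFrame(
    {
        "Task ID": [
            "davanstrien/test_imdb_embedd2",
            "davanstrien/test_imdb_embedd",
            "davanstrien/test1",
            "stanfordnlp/imdb",
            "stanfordnlp/sst",
        ],
        "Score": [-0.618529, -0.618644, -0.619334, -0.619454, -0.62995],
    },
    index=pd.Index(range(1, 6), name="Rank"),
)

# Hypothetical exclusion list: the four entries identified above as IMDB copies.
imdb_copies = [
    "davanstrien/test_imdb_embedd2",
    "davanstrien/test_imdb_embedd",
    "davanstrien/test1",
    "stanfordnlp/imdb",
]

# Drop the copies; the first remaining row is the best genuinely new dataset.
candidates = top[~top["Task ID"].isin(imdb_copies)]
first_new = candidates.iloc[0]
print(first_new.name, first_new["Task ID"], first_new["Score"])
```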

In [10]:
print(task_ranking[:5])

1.   davanstrien/test_imdb_embedd2                     Score: -0.618529
2.   davanstrien/test_imdb_embedd                      Score: -0.618644
3.   davanstrien/test1                                 Score: -0.619334
4.   stanfordnlp/imdb                                  Score: -0.619454
5.   stanfordnlp/sst                                   Score: -0.62995


In [9]:
print(df.head(10).to_markdown())

|   Rank | Task ID                       | Task Subset     | Text Column   | Label Column   | Task Split   |   Num Examples | ESM Architecture   |     Score |
|-------:|:------------------------------|:----------------|:--------------|:---------------|:-------------|---------------:|:-------------------|----------:|
|      1 | davanstrien/test_imdb_embedd2 | default         | text          | label          | train        |          10000 | linear             | -0.618529 |
|      2 | davanstrien/test_imdb_embedd  | default         | text          | label          | train        |          10000 | linear             | -0.618644 |
|      3 | davanstrien/test1             | default         | text          | label          | train        |          10000 | linear             | -0.619334 |
|      4 | stanfordnlp/imdb              | plain_text      | text          | label          | train        |          10000 | linear             | -0.619454 |
|      5 | stanfordnlp/sst               | dic