<a href="https://colab.research.google.com/github/gaussalgo/L2L_MLPrague23/blob/main/notebooks/hands_on_improving_ICL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training generative models

Now that we have the theoretical background, we'll take a look at how the generative models that we covered are actually trained.

After the overview, we will finally put this knowledge to use by training our own in-context learner, tailored to a new language, or improved on a specific task of your interest!


In [None]:
# notebook's requirements
!pip install transformers datasets

## Generation - refresher

Recall **Causal Language Modeling (CLM)** from earlier, where the model predicts the **following token** from the previous context.

![CLM_1](https://github.com/gaussalgo/L2L_MLPrague23/blob/main/notebooks/images/CLM_1_new.png?raw=1)  
![CLM_2](https://github.com/gaussalgo/L2L_MLPrague23/blob/main/notebooks/images/CLM_2_new.png?raw=1)
![CLM_3](https://github.com/gaussalgo/L2L_MLPrague23/blob/main/notebooks/images/CLM_3_new.png?raw=1)

[[images source]](https://www.rohanawhad.com/improvements-of-spanbert-over-bert/)

Note the relation between *generation* and the CLM objective: in both, we predict the next token conditioned on the input. The only difference is that in "real" generation, the **input also contains the previous outputs** of the model.
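
To make the feedback loop concrete, here is a minimal sketch of greedy generation in plain Python. The lookup table `NEXT_TOKEN` is a hypothetical stand-in for a model: a real LM conditions on the whole context, not only on the last token.

```python
# Hypothetical toy "model": maps the last token to the most likely next token.
NEXT_TOKEN = {"<bos>": "the", "the": "movie", "movie": "was", "was": "great", "great": "<eos>"}


def generate(prompt, max_new_tokens=10):
    """Greedy generation: every predicted token is appended to the input."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(context[-1], "<eos>")
        if next_token == "<eos>":  # the model signals that it is done
            break
        context.append(next_token)  # the output becomes part of the next input
    return context


generated = generate(["<bos>", "the"])
print(generated)
```

In training, the context always comes from the data; in generation, it grows with the model's own predictions.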

## Construction of Training pipeline

Many libraries make it easy to train your neural network, at different levels of abstraction. Do not get confused by that.

Ordered from the most low-level to the most high-level, for PyTorch models, you can find at least these: **Pure PyTorch, PyTorch Lightning, Transformers Trainer, Adaptor (ours)**. We will take a look at training at the lowest level (Pure PyTorch) and at the highest level (Adaptor), but in your own time, you can also look at the [Sequence Classification tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification) from HuggingFace.


## ⛵ Low-level Training pipeline - Example

Here, we are going to take a look at the low level of updating a network.
The process can be summarized in the following steps:

1. **We pick our base model** to fine-tune. While all Transformer LMs can perform token classification, not all of them are equally good at it. We'll talk about it a bit more offline.

2. **We construct a training dataset** from our data. Here, we transform texts into valid model inputs/samples (as we've seen in the previous session) and assign true labels to each sample.

3. **We iterate over the samples** in so-called *epochs*. In this step, we get the model predictions for a *batch* of samples. The raw predictions take the form of unnormalized scores (*logits*), which are then turned into probabilities (usually log-probabilities, for numerical stability).

4. **We update the model**. Here, we first compare the predicted probabilities with the "true probabilities", where the true category gets a probability of 1 and all other categories get 0. The comparison is done by a so-called *loss* function, which is a special kind of distance measure. Then, we update the weights of the model so that the loss decreases.

5. **We continue as long as the model improves**, which we measure on a held-out dataset. This way, we avoid a model that merely memorizes our data and then performs badly in the real world.
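
As a minimal sketch of the comparison in step 4 (plain Python rather than PyTorch, with made-up scores): when the true distribution puts probability 1 on a single category, the cross-entropy loss reduces to the negative log-probability that the model assigns to that category.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to one."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtracting the maximum avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_index):
    """Cross-entropy against a one-hot target: -log of the probability of the true category."""
    return -math.log(softmax(logits)[true_index])


logits = [2.0, 0.5, -1.0]  # hypothetical model scores for three categories
loss_if_first_is_true = cross_entropy(logits, true_index=0)
loss_if_last_is_true = cross_entropy(logits, true_index=2)
print(loss_if_first_is_true, loss_if_last_is_true)
```

The loss is small when the model already favours the true category and large when it does not, which is the signal the update step relies on.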

### 1: Pick our base model

In [None]:
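import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# use a GPU if one is available; the code below also runs on CPU, only slower
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default, so we register one for the tokenizer
tokenizer.add_special_tokens({'pad_token': 'pad'})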

### 2: Constructing training dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")




### Dataset transformations

Before we jump into the training routine, we'll zoom in on the data processing that we need for training GPT-like causal language models.

First, we'll transform the sample into the model's input ids using the associated tokenizer. Then, we'll take a look at the construction of labels.

In [None]:
sample_encoding = tokenizer(dataset["unsupervised"]['text'][10], 
                            padding="longest",  # we'll set padding and truncation so that tokenizer allows us to directly obtain tensors
                            truncation=True,
                            return_tensors="pt"  # this returns the samples as PyTorch tensors, that we do not have to convert ourselves
                            )

#### Constructing training labels

In [None]:
# GPT's input ids, decoded:
"|".join(tokenizer.batch_decode(sample_encoding["input_ids"][0]))

"This| isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

In [None]:
# GPT's training labels: the label of each position is the token that follows it, so we shift the inputs by one position (dropping the first token)
labels = sample_encoding["input_ids"][..., 1:]

"|".join(tokenizer.batch_decode(labels[0]))

" isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

#### Creating next-token prediction inputs from each sample

For the next-token prediction, we actually create multiple samples from each text: There are **many** tokens that we can use as **targets**!

To keep things simple, we'll reuse the **same input ids** for every target and let the model **attend only to the tokens preceding the target**. We just need to be careful not to unmask the token that is being predicted.

We can implement this quite easily by constructing a **triangular attention mask** for each input from the batch.
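
Before doing this with tensors, a small plain-Python sketch may help to see what each row of the triangle means. The four-token text and the `<EoS>` string are made up for illustration; each row of the mask yields one training sample with its own target.

```python
tokens = ["This", " isn", "'t", " bad"]  # hypothetical tokenization of a short text
eos = "<EoS>"  # placeholder for the end-of-sequence token

labels = tokens[1:] + [eos]  # inputs shifted by one position, closed with <EoS>

samples = []
for i, label in enumerate(labels):
    mask = [1 if j <= i else 0 for j in range(len(tokens))]  # row i of the lower triangle
    visible = [token for token, m in zip(tokens, mask) if m]  # tokens the model may attend to
    samples.append((mask, visible, label))

for mask, visible, label in samples:
    print(mask, visible, "->", repr(label))
```

Note that the target of each row is never among its visible tokens, and that the last row, which sees the whole text, is trained to predict the end of the sequence.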

In [None]:
attended_input_length = sample_encoding["attention_mask"].sum(axis=1)
attended_input_length

tensor([113])

**Input ids**: duplicate inputs by the number of predicted tokens

In [None]:
# duplicate inputs by the number of predicted tokens
input_ids = sample_encoding["input_ids"].expand(attended_input_length, -1)
input_ids

tensor([[1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        ...,
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13]])

In [None]:
input_ids.shape

torch.Size([113, 113])

**Attention mask**: we create triangles that will mask all future tokens from prediction

In [None]:
import torch 

# this is how we construct tensor triangles
torch.tril(torch.ones(4, 6), diagonal=0)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.]])

In [None]:
attention_triangle = torch.tril(torch.ones(attended_input_length, input_ids.shape[1]), diagonal=0)
attention_triangle

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 1.,  ..., 1., 0., 0.],
        [1., 1., 1.,  ..., 1., 1., 0.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

In [None]:
attention_triangle.shape

torch.Size([113, 113])

**Labels**: Finally, we spread the pre-computed labels to assign exactly one label id to each new sample

In [None]:
labels = sample_encoding["input_ids"][..., 1:][:attended_input_length]
labels
"|".join(tokenizer.batch_decode(labels[0]))

" isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

In [None]:
# After the model is done, we want it to generate a special <EoS> token. This way, we know that the model is done with generation.
model.config.eos_token_id

50256

In [None]:
# Hence, we add the token as the last label
labels = torch.hstack([labels, torch.tensor([[model.config.eos_token_id]])])

In [None]:
labels.shape

torch.Size([1, 113])

In [None]:
sample_encoding["input_ids"].shape

torch.Size([1, 113])

#### Now, we wrap the whole encoding into a method

In [None]:
from typing import Dict
import torch

def construct_causalLM_sample(sample: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Expands one tokenized text into one next-token prediction sample per input token."""
    extended_batch = {}

    attended_input_length = sample["input_ids"].shape[-1]

    # the same input ids are repeated once for every predicted token
    extended_batch["input_ids"] = sample["input_ids"].expand(attended_input_length, -1)

    # triangular mask: the i-th sample attends only to tokens 0..i
    extended_batch["attention_mask"] = torch.tril(torch.ones(attended_input_length, attended_input_length), diagonal=0)

    # labels are the inputs shifted by one position, closed with the <EoS> token of the (global) model
    extended_batch["labels"] = torch.hstack([sample["input_ids"][0, 1:], torch.tensor([model.config.eos_token_id])])

    # position of the model output that is compared with the label of the i-th sample
    extended_batch["labels_position"] = torch.arange(attended_input_length)

    return extended_batch

### 3: Iterate over the samples and 4: Update the model

Finally, we plug the processing into the full training loop.

#### What is happening here?

As in training any neural network, we need to take care of several things that are not directly related to our objective.

* Configure **batch size** and **learning rate**
* Initialize **optimizer** that updates the model according to the gradients of the loss from real data
* Initialize **loss function** measuring how well the model fits the data

After that we **iterate over data**:
* Obtain **batches of CLM samples**
* Run them through the model to **obtain predictions** in the form of (log) probabilities over the model's vocabulary
* Compute the value of the loss and register the gradients of the model weights, used later to update the model
* Update the model and reset the gradients
* Finally, we stop if the training does not improve for a while
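
The steps listed above can be sketched without any library on a deliberately tiny "language model". This is an illustration under strong assumptions, not the training loop of this notebook: the only parameters are one score per vocabulary item, the data are made-up next-token ids, and the gradient of the cross-entropy is written out by hand.

```python
import math

vocab = ["good", "bad", "movie"]
targets = [0, 0, 2, 0, 1, 0, 2, 0]  # hypothetical next-token ids; "good" is the most frequent
logits = [0.0, 0.0, 0.0]  # the only trainable parameters: one score per vocabulary item
learning_rate = 0.5
batch_size = 4


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


epoch_losses = []
for epoch in range(50):
    epoch_loss = 0.0
    for offset in range(0, len(targets), batch_size):
        batch = targets[offset: offset + batch_size]
        grads = [0.0] * len(vocab)  # gradients start from zero for every batch
        probs = softmax(logits)  # forward pass
        for target in batch:
            epoch_loss += -math.log(probs[target])  # cross-entropy of one sample
            for k in range(len(vocab)):
                # gradient of the cross-entropy w.r.t. the logits: predicted minus true probability
                grads[k] += (probs[k] - (1.0 if k == target else 0.0)) / len(batch)
        # update step: move the parameters against the gradient
        logits = [w - learning_rate * g for w, g in zip(logits, grads)]
    epoch_losses.append(epoch_loss / len(targets))

probs = softmax(logits)
print(epoch_losses[0], epoch_losses[-1], probs)
```

In PyTorch, the hand-written gradient is replaced by `loss.backward()`, the update by `optimizer.step()`, and the reset of `grads` by `optimizer.zero_grad()`.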

In [None]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation
from torch.nn import CrossEntropyLoss

batch_size = 8
learning_rate = 2e-6

# the optimizer actually updates the model weights,
# so that they get better at prediction after every step
optimizer = AdamW(model.parameters(), lr=learning_rate)

loss_fn = CrossEntropyLoss()  # distance function comparing predictions to expected labels

running_loss = 0  # aggregation variable, to observe if we progress
last_running_loss = float("inf")  # initial "previous" loss that any real loss improves on
stop_training = False

while not stop_training:  # every pass over the dataset is one epoch
    for text in dataset["unsupervised"]['text']:  # per-sample iteration
        sample_encoding = tokenizer(text,
                                    padding="longest",  # padding and truncation allows us to directly obtain tensors,
                                    truncation=True,    # but otherwise are not needed
                                    return_tensors="pt")

        clm_samples = construct_causalLM_sample(sample_encoding)  # transformation to CLM samples
        for batch_offset in range(0, len(clm_samples["input_ids"]), batch_size):  # per-CLM-samples iteration, batched

            # Construction of the training batch that we've seen above
            dataset_batch = {k: clm_samples[k][batch_offset: batch_offset+batch_size].to(model.device)
                             for k in clm_samples.keys()}

            # Model prediction (also called forward pass): unnormalized scores (logits) over the vocabulary
            model_logits = model(input_ids=dataset_batch["input_ids"],
                                 attention_mask=dataset_batch["attention_mask"]).logits
            # HuggingFace implementation gives us predictions for all tokens,
            # but we'll update the model only based on the predictions with labels
            sample_index = torch.arange(model_logits.size(0), device=model_logits.device)
            logits_with_labels = model_logits[sample_index, dataset_batch["labels_position"]]

            # we first compare the predictions with "true probabilities";
            # CrossEntropyLoss normalizes the logits to log-probabilities internally
            loss_value = loss_fn(logits_with_labels, dataset_batch["labels"])

            # we note the errors (gradients) of each model parameter (also called backward pass)
            loss_value.backward()

            running_loss += loss_value.item()

            # 4. We update the model
            optimizer.step()
            optimizer.zero_grad()

            # 5: Evaluation: Check and stop the training if the model no longer improves
            if batch_offset != 0 and (batch_offset / batch_size) % 10 == 0:
                # print the loss accumulated since the previous report
                print("Current training loss: %s" % running_loss)
                # stop if the loss increased
                if last_running_loss < running_loss:
                    stop_training = True
                    break

                last_running_loss = running_loss  # remember the loss for the next comparison
                running_loss = 0  # restart the log

        if stop_training:
            break

# 🛥 High-level Training pipeline

Today, many libraries make it much easier to train your language model, requiring different levels of specialized knowledge. Under the hood, it always comes down to (roughly) what we see above, but the high-level interface allows you to iterate over experiments much faster.

For PyTorch language models, you may consider at least these libraries (ordered from the most low-level to the most high-level): **Pure PyTorch, PyTorch Lightning, Fairseq, Transformers Trainer, Adaptor (ours)**. We have seen the low-level side above, and now we'll peek into the most high-level one (Adaptor).

However, in your own time, we strongly recommend that you also take a look at how to use the ever-growing 🤗 HuggingFace Transformers library. You can find examples for training generative models in the [Translation training tutorial](https://huggingface.co/docs/transformers/tasks/translation), the [Summarization training tutorial](https://huggingface.co/docs/transformers/tasks/summarization) (it's almost the same) and the [Generation example script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py) from 🤗 HuggingFace.

### [Adaptor](https://github.com/gaussalgo/adaptor): Quick introduction

[Adaptor](https://github.com/gaussalgo/adaptor) is our in-house library that allows us to run large collections of similar experiments very quickly. If you take a look at the [example script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py) above, you'll see that it has over 400 lines, with most of the code not directly relevant to the goal. Below, you'll see a complete, similar example with Adaptor.

This reduction in complexity is enabled by an **objective-centric paradigm**, where the model is no longer the central part of the training; the central structure in Adaptor is the **training objective** that is applied to the model.

Design-wise, Adaptor is a relatively lightweight extension of 🤗 Transformers. Thanks to that, you can use almost all cutting-edge features as well as all the language models of 🤗 Transformers.


## Training generative models with Adaptor

We will take a look at what the training of a generative model looks like if we use Adaptor.

To give you an example for the final hands-on, we will demonstrate how to use the library on a fairly simple use-case, where we'll train an LM to **generate a rating of the review** that it gets in the input text. Again, we'll use the same `imdb` dataset for that.

In [None]:
!pip install -q datasets sentencepiece protobuf==3.20.0 adaptor==0.2.2  # required for generation
!pip uninstall -y -q tensorflow tensorboard


In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

First, we pick the base model for adaptation. 

In [None]:
from adaptor.lang_module import LangModule

language_module = LangModule("google/mt5-small")

Second, we choose the objective that we want to fine-tune the model for. The objective will take care of configuring the model correctly. We just give it our desired inputs and outputs.

In [None]:
from adaptor.objectives.seq2seq import Sequence2Sequence

# the trailing space separates the prompt from the review text
prompt = "Is this review positive, or negative? "

eval_samples = 100

# verbalize inputs and labels once, then hold out the last `eval_samples` for validation
review_inputs = [prompt + review for review in dataset['text']]
review_labels = ["positive" if y == 1 else "negative" for y in dataset["label"]]

training_objective = Sequence2Sequence(lang_module=language_module,
                                       texts_or_path=review_inputs[:-eval_samples],
                                       labels_or_path=review_labels[:-eval_samples],
                                       val_texts_or_path=review_inputs[-eval_samples:],
                                       val_labels_or_path=review_labels[-eval_samples:],
                                       batch_size=1)

One more thing before the training: we need to **persist the weights** of the trained model somewhere. In our case, we create checkpoints that can be directly loaded as any HuggingFace model.

In Google Colab, you can mount your Google Drive to persist the model checkpoints using the following commands. If you run this script elsewhere, you may skip the following steps.

In [None]:
# This will mount your google drive to persist the training model later on. 
# If you do not want to do it, you can skip this command.

from google.colab import drive
drive.mount('/content/drive/')

In [None]:
output_dir = "/content/drive/MyDrive/training_output_dir"  # TODO: this is a path to your Google Drive - make sure that it is ok to write

In [None]:
# Before starting the training, check that the folder where the model will be persisted actually exists

!ls $output_dir

In [None]:
# if it does not, create it with the command below (or manually in Colab's file browser)

!mkdir -p $output_dir

In [None]:
# Check that the folder for checkpoints exists; continue if this command passes without errors

!ls $output_dir

The training process is configured through a possibly large set of 🤗 Training Arguments. You can read through each of them in [TrainingArgs documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

We will use our chosen `output_dir` here.

In [None]:
from adaptor.utils import AdaptationArguments, StoppingStrategy

args = AdaptationArguments(output_dir=output_dir,
                           learning_rate=2e-5,
                           warmup_steps=1000,
                           stopping_strategy=StoppingStrategy.FIRST_OBJECTIVE_CONVERGED,
                           do_train=True,
                           do_eval=True,
                           logging_steps=100,
                           eval_steps=200,
                           evaluation_strategy="steps",
                           save_steps=200,
                           save_total_limit=6,
                           stopping_patience=5,
                           num_train_epochs=20,
                           max_steps=500,  # remove this to remove a constraint on a training length
                           gradient_accumulation_steps=30)
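
A quick back-of-the-envelope calculation shows what these arguments imply for the run. It is a sketch under one assumption: that, as in the 🤗 Transformers Trainer, a "step" in `max_steps` and `eval_steps` counts one optimizer update (not one batch).

```python
batch_size = 1                    # from the Sequence2Sequence objective above
gradient_accumulation_steps = 30  # from AdaptationArguments
max_steps = 500
eval_steps = 200

# Assumption: a "step" is one optimizer update, made after accumulating gradients over several batches.
effective_batch_size = batch_size * gradient_accumulation_steps
samples_seen = effective_batch_size * max_steps
evaluation_points = list(range(eval_steps, max_steps + 1, eval_steps))

print(effective_batch_size)  # 30
print(samples_seen)          # 15000
print(evaluation_points)     # [200, 400]
```

Under this assumption, the run evaluates only twice before `max_steps` is reached, so a `stopping_patience` of 5 can only take effect once the `max_steps` constraint is removed.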

The order in which our objectives are applied is determined by choosing a `Schedule`: Adaptor comes with `SequentialSchedule` and `ParallelSchedule`.
In single-objective cases (like ours), the choice of Schedule does not really matter, but in multi-task training, it can come in quite handy.

It is also fine to use more than one objective at once.
In such cases, the only extra thing to decide is whether the objectives' heads are shared or not. If they are, pass the argument `share_other_objective_head=other_training_objective` to the new objective(s).
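
The difference between the two kinds of schedule can be illustrated with a plain-Python sketch. This is not Adaptor's implementation: the objective names are hypothetical, and a fixed number of steps per objective stands in for the library's own criteria for moving on to the next objective.

```python
from itertools import chain, cycle, islice


def sequential_order(objectives, steps_per_objective):
    """Finish all steps of one objective before starting the next one."""
    return list(chain.from_iterable([name] * steps_per_objective for name in objectives))


def parallel_order(objectives, total_steps):
    """Alternate between the objectives, one step each."""
    return list(islice(cycle(objectives), total_steps))


objectives = ["seq2seq", "denoising"]  # hypothetical objective names
sequential = sequential_order(objectives, steps_per_objective=3)
parallel = parallel_order(objectives, total_steps=6)
print(sequential)
print(parallel)
```

With a single objective, both functions return the same order, which is why the choice does not matter in our case.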

In [None]:
from adaptor.schedules import SequentialSchedule, ParallelSchedule
from adaptor.adapter import Adapter

# choose a schedule of applying objectives - with one objective, the choice does not really matter
schedule = SequentialSchedule(objectives=[training_objective], args=args)

# instantiate Adapter - a structure analogous to HF Transformers' Trainer
adapter = Adapter(lang_module=language_module,
                  schedule=schedule,
                  args=args)

max_steps is given, it will override any value given in num_train_epochs


After all the configuration, we are ready to run the training and wait for the trained model.

Given `stopping_strategy=StoppingStrategy.FIRST_OBJECTIVE_CONVERGED` and `stopping_patience=5`, the training terminates once the `model_quality_evaluator` (or the evaluation loss, if no Evaluator is given) has not improved for five evaluations. Note that with `max_steps=500`, the training ends after 500 steps at the latest.
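
The idea of patience-based stopping can be sketched in a few lines of plain Python. This is one common formulation, given as an assumption rather than as Adaptor's exact convergence check, and the loss values are made up.

```python
def should_stop(eval_losses, patience):
    """True if none of the last `patience` evaluations improved on the best earlier loss."""
    if len(eval_losses) <= patience:
        return False  # not enough evaluations to judge yet
    best_before = min(eval_losses[:-patience])
    return min(eval_losses[-patience:]) >= best_before


history = [2.1, 1.7, 1.5, 1.6, 1.55, 1.52, 1.58, 1.51]  # hypothetical evaluation losses
stop_early = should_stop(history[:5], patience=5)
stop_late = should_stop(history, patience=5)
print(stop_early, stop_late)
```

A larger patience tolerates noisy evaluations at the cost of training longer after the model has stopped improving.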

In [None]:
adapter.train()

# Final Hands-on: Train your own In-context learner

Now your final task will be to improve the in-context learning ability for a specific use-case that you have at hand.

You can use any of the approaches of the existing models. Additionally, you can use datasets for **related tasks**. If you'd like to create an in-context learner for a **new language**, check whether your target language has a QA dataset available. If not, chances are that you can still transfer using a QA dataset in a similar language.


### Implementation template

Compared to the example of generation above, perhaps all you need to play with are the base model and the inputs and outputs. Think about the *relatedness* of the tasks and the relevance of the existing Promptsource templates that you could use.

When you are done with the design of your experiment, try executing your plan; optionally, fill in the template below.

In [None]:
base_model = "google/mt5-base"  # TODO: pick the base model

In [None]:
from adaptor.lang_module import LangModule

language_module = LangModule(base_model)

In [None]:
from datasets import load_dataset
main_dataset = load_dataset("squad")  # TODO: pick datasets: see https://huggingface.co/datasets

# other_dataset = load_dataset("imdb")  # maybe do the same thing with your target dataset/similar templates?


Input prompts & labels collection

In [None]:
input_texts = []
label_texts = []

Using Promptsource to verbalize squad's templates: See all templates on the project repo: https://github.com/bigscience-workshop/promptsource
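
To see what "verbalizing" means before using the library, here is a sketch with a hand-written template in plain Python. The template text and the example row are made up; real Promptsource templates are written in Jinja and applied with the library's own `apply` method.

```python
# A hypothetical row in the format of the squad dataset
row = {
    "context": "Prague is the capital of the Czech Republic.",
    "question": "What is the capital of the Czech Republic?",
    "answers": {"text": ["Prague"]},
}

# Hypothetical template, written as a Python format string for illustration
TEMPLATE = "Read the text and answer the question.\nText: {context}\nQuestion: {question}\nAnswer:"


def apply_template(row):
    """Return a (prompt, label) pair for one dataset row."""
    prompt = TEMPLATE.format(context=row["context"], question=row["question"])
    label = row["answers"]["text"][0]  # the first reference answer serves as the target
    return prompt, label


prompt, label = apply_template(row)
print(prompt)
print(label)
```

Applying several different templates to the same rows gives the model many phrasings of the same task, which is what the loop over templates below does.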

In [None]:
!pip install -q git+https://github.com/fewshot-goes-multilingual/promptsource.git



In [None]:
from promptsource.templates import DatasetTemplates

prompts = DatasetTemplates("squad")

for template_name in prompts.all_template_names:
    prompt_template = prompts[template_name]

    # each template turns a dataset row into a [prompt, label] pair
    prompt_label_pairs = main_dataset["validation"].map(lambda row: {"prompt": prompt_template.apply(row)})["prompt"]

    input_texts.extend(prompt for prompt, label in prompt_label_pairs)
    label_texts.extend(label for prompt, label in prompt_label_pairs)

In [None]:
# shuffle the inputs
import random

data_index = list(range(len(input_texts)))

random.shuffle(data_index)

input_texts = [input_texts[i] for i in data_index]
label_texts = [label_texts[i] for i in data_index]
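
An alternative sketch of the same idea shuffles inputs and labels as pairs, so that they cannot get out of sync, and fixes the random seed to make the split reproducible. The helper name, the seed and the demo data are hypothetical.

```python
import random


def shuffle_and_split(inputs, labels, val_size, seed=42):
    """Shuffle (input, label) pairs together and hold out the last `val_size` pairs."""
    pairs = list(zip(inputs, labels))
    random.Random(seed).shuffle(pairs)  # a local generator keeps the global random state untouched
    # val_size must be positive, otherwise the slices below do not form a split
    return pairs[:-val_size], pairs[-val_size:]


demo_inputs = [f"question {i}" for i in range(10)]
demo_labels = [f"answer {i}" for i in range(10)]
train_pairs, val_pairs = shuffle_and_split(demo_inputs, demo_labels, val_size=3)
print(len(train_pairs), len(val_pairs))
```

Shuffling before the split matters whenever the data is ordered, for example by template or by label, because the held-out tail would otherwise not be representative.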

In [None]:
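# dataset objective
from adaptor.objectives.seq2seq import Sequence2Sequence

val_samples = 100  # number of samples held out for validation

seq2seq_squad = Sequence2Sequence(lang_module=language_module,
                                  texts_or_path=input_texts[:-val_samples],
                                  labels_or_path=label_texts[:-val_samples],
                                  val_texts_or_path=input_texts[-val_samples:],
                                  val_labels_or_path=label_texts[-val_samples:],
                                  batch_size=1)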

Training as in the example above

In [None]:
from adaptor.utils import AdaptationArguments, StoppingStrategy

output_dir = "/content/drive/MyDrive/training_output_dir"

args = AdaptationArguments(output_dir=output_dir,
                           learning_rate=2e-5,
                           warmup_steps=1000,
                           stopping_strategy=StoppingStrategy.FIRST_OBJECTIVE_CONVERGED,
                           do_train=True,
                           do_eval=True,
                           logging_steps=100,
                           eval_steps=200,
                           evaluation_strategy="steps",
                           save_steps=200,
                           save_total_limit=6,
                           stopping_patience=5,
                           num_train_epochs=20,
                           gradient_accumulation_steps=20)



In [None]:
from adaptor.schedules import ParallelSchedule
from adaptor.adapter import Adapter

# choose a schedule of applying objectives - with one objective, the choice does not really matter
parallel_schedule = ParallelSchedule(objectives=[seq2seq_squad], args=args)

# instantiate Adapter - a structure analogous to HF Transformers' Trainer
adapter = Adapter(lang_module=language_module,
                  schedule=parallel_schedule,
                  args=args)



In [None]:
# Start the training!
adapter.train()


### Take-home exercise

Evaluate how your new ICL model performs on your task, using your evaluator from the previous Hands-on!

Copy-paste your implementation of evaluation from **Hands on** in [ICL_intro notebook](https://github.com/gaussalgo/L2L_MLPrague23/blob/main/notebooks/ICL_intro.ipynb).

In [None]:
# TODO: copy-paste and run your evaluation here

### Take-home exercise

Evaluate how your new ICL model performs on tasks that it has never seen before.

All you need for it is to execute the prepared evaluation script;
See **competition** in the [project repository](https://github.com/gaussalgo/L2L_MLPrague23) for details!