diff --git a/.coverage b/.coverage index 96f4053..7ac1d3f 100644 Binary files a/.coverage and b/.coverage differ diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 7901c0e..0a26c2a 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -34,7 +34,6 @@ jobs: - name: Run tests with coverage run: | poetry run python -m pytest --junitxml=pytest.xml --cov-report=term-missing:skip-covered --cov=. tests/ > pytest-coverage.txt - cat pytest-coverage.txt - name: Generate coverage report & comment on PR id: coverageComment diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml index 0a7d111..301fa4f 100644 --- a/.github/workflows/docs.yml +++ b/.github/workflows/docs.yml @@ -30,7 +30,7 @@ jobs: - name: Generate notebook examples run: | - poetry run jupyter nbconvert --to markdown --allow-errors --output-dir docs/examples notebooks/*.ipynb + poetry run jupyter nbconvert --to markdown --allow-errors --output-dir docs/examples tutorials/*.ipynb - name: Deploy docs run: | diff --git a/.gitignore b/.gitignore index 6034def..e95cd5b 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,4 @@ results/ poetry.lock CLAUDE.md **/CLAUDE.local.md +.mypy_cache/ diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index bc3417e..22c8d88 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -18,6 +18,16 @@ repos: rev: 5.12.0 hooks: - id: isort + - repo: https://github.com/pre-commit/mirrors-mypy + rev: v1.8.0 + hooks: + - id: mypy + files: ^promptolution/ + additional_dependencies: + - types-requests + - pandas-stubs + - numpy + args: [--explicit-package-bases, --config-file=pyproject.toml] - repo: https://github.com/pycqa/pydocstyle rev: 6.3.0 hooks: diff --git a/README.md b/README.md index f6dc531..9288857 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,11 @@ ![promptolution](https://github.com/user-attachments/assets/84c050bd-61a1-4f2e-bc4e-874d9b4a69af) -![Coverage](https://img.shields.io/badge/Coverage-89%25-green) +![Coverage](https://img.shields.io/badge/Coverage-91%25-brightgreen) [![CI](https://github.com/finitearth/promptolution/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/finitearth/promptolution/actions/workflows/ci.yml) [![Docs](https://github.com/finitearth/promptolution/actions/workflows/docs.yml/badge.svg?branch=main)](https://github.com/finitearth/promptolution/actions/workflows/docs.yml) ![Code Style](https://img.shields.io/badge/Code%20Style-black-black) ![Python Versions](https://img.shields.io/badge/Python%20Versions-≥3.10-blue) - - +[![Getting Started](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/finitearth/promptolution/blob/main/tutorials/getting_started.ipynb) Promptolution is a library that provides a modular and extensible framework for implementing prompt tuning for single tasks and larger experiments. It offers a user-friendly interface to assemble the core components for various prompt optimization tasks. @@ -36,7 +35,7 @@ to install the necessary dependencies. You might need to install [pipx](https:// ## Usage -To get started right away, take a look at our [getting started notebook](https://github.com/finitearth/promptolution/blob/main/notebooks/getting_started.ipynb). +To get started right away, take a look at our [getting started notebook](https://github.com/finitearth/promptolution/blob/main/tutorials/getting_started.ipynb) and our [other demos and tutorials](https://github.com/finitearth/promptolution/blob/main/tutorials). 
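For a quick impression of the workflow, here is a minimal, illustrative sketch based on the getting-started tutorial (the CSV path, the DeepInfra endpoint, and all parameter values are placeholder assumptions - adapt them to your own data and provider):

```python
import pandas as pd
from promptolution.utils import ExperimentConfig
from promptolution.helpers import run_experiment

# Your data needs an input column "x" and a target column "y"
df = pd.read_csv("your_dataset.csv")

config = ExperimentConfig(
    optimizer="capo",
    task_description="Classify each sentence as subjective or objective.",
    prompts=["Classify the sentence as subjective or objective."],
    n_steps=10,
    api_url="https://api.deepinfra.com/v1/openai",
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key="YOUR_API_KEY",
)

prompts = run_experiment(df, config)  # returns a DataFrame of prompts and their scores
```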
For more details, a comprehensive **documentation** with API reference is available at https://finitearth.github.io/promptolution/. ### Featured Optimizers diff --git a/docs/examples/getting_started.md index 020699d..81e1f57 100644 --- a/docs/examples/getting_started.md +++ b/docs/examples/getting_started.md @@ -1,18 +1,17 @@ -# Getting started +# Getting Started with Promptolution -## Before you start +## Welcome to Promptolution! -In this notebook we give you a short introduction into the workings of promptolution. +Discover a powerful tool for evolving and optimizing your LLM prompts. This notebook provides a friendly introduction to Promptolution's core functionality. -We will use the OpenAI-API to demonstrate the functionality of promptolution, however we also provide a local LLM, as well as a vLLM backend. You can also change the `base_url` in the config, in order to use any other api, that follows the OpenAI API standard. +We're excited to have you try Promptolution - let's get started! -Thanks for giving it a try! - -## Installs +## Installation +Install Promptolution with a single command: ```python -# ! pip install promptolution +! pip install promptolution[api] ``` ## Imports @@ -20,364 +19,212 @@ Thanks for giving it a try! ```python import pandas as pd -from promptolution import ExperimentConfig, run_experiment +from promptolution.utils import ExperimentConfig +from promptolution.helpers import run_experiment import nest_asyncio -nest_asyncio.apply() # we need this only because we are in a notebook + +nest_asyncio.apply() # Required for notebook environments ``` -## set up llms, predictor, tasks and optimizer +## Setting Up Your Experiment -Here we set up our dataset. We use the subjectivity dataset from hugging face, but of course here you may want to use your own dataset. +### Prepare the data -Just make sure, to name the input column "x" and the target column "y", as well as providing a short dataset description. +Below, we're using a subsample of the subjectivity dataset from Hugging Face as an example. When using your own dataset, simply ensure you name the input column "x" and the target column "y", and provide a brief description of your task, which will be passed to the meta-llm during optimization. ```python -df = pd.read_csv("hf://datasets/tasksource/subjectivity/train.csv") +df = pd.read_csv("hf://datasets/tasksource/subjectivity/train.csv").sample(500) df = df.rename(columns={"Sentence": "x", "Label": "y"}) df = df.replace({"OBJ": "objective", "SUBJ": "subjective"}) -task_description = "The dataset contains sentences labeled as either subjective or objective. "\ - "The task is to classify each sentence as either subjective or objective. " \ - "The class mentioned first in the response of the LLM will be the prediction." +task_description = ( + "The dataset contains sentences labeled as either subjective or objective. " + "The task is to classify each sentence as either subjective or objective. " + "The class mentioned in between the answer tags will be used as the prediction." +) ``` -We definied some initial prompts, however you may also take a look at `create_prompts_from_samples` in order to automatically generate them. +### Creating Initial Prompts + +We've defined some starter prompts below, but feel free to experiment! You might also want to explore `create_prompts_from_samples` to automatically generate initial prompts based on your data.
```python init_prompts = [ 'Classify the given text as either an objective or subjective statement based on the tone and language used: e.g. the tone and language used should indicate whether the statement is a neutral, factual summary (objective) or an expression of opinion or emotional tone (subjective). Include the output classes "objective" or "subjective" in the prompt.', - 'What kind of statement is the following text: [Insert text here]? Is it or ?', + "What kind of statement is the following text: [Insert text here]? Is it or ?", 'Identify whether a sentence is objective or subjective by analyzing the tone, language, and underlying perspective. Consider the emotion, opinion, and bias present in the sentence. Are the authors presenting objective facts or expressing a personal point of view? The output will be either "objective" (output class: objective) or "subjective" (output class: subjective).', - 'Classify the following sentences as either objective or subjective, indicating the name of the output classes: [input sentence]. Output classes: objective, subjective', + "Classify the following sentences as either objective or subjective, indicating the name of the output classes: [input sentence]. Output classes: objective, subjective", '_query a text about legal or corporate-related issues, and predict whether the tone is objective or subjective, outputting the corresponding class "objective" for non-subjective language or "subjective" for subjective language_', 'Classify a statement as either "subjective" or "objective" based on whether it reflects a personal opinion or a verifiable fact. The output classes to include are "objective" and "subjective".', - 'Classify the text as objective or subjective based on its tone and language.', - 'Classify the text as objective or subjective based on the presence of opinions or facts. Output classes: objective, subjective.', - 'Classify the given text as objective or subjective based on its tone, focusing on its intention, purpose, and level of personal opinion or emotional appeal, with outputs including classes such as objective or subjective.', + "Classify the text as objective or subjective based on its tone and language.", + "Classify the text as objective or subjective based on the presence of opinions or facts. Output classes: objective, subjective.", + "Classify the given text as objective or subjective based on its tone, focusing on its intention, purpose, and level of personal opinion or emotional appeal, with outputs including classes such as objective or subjective.", "Categorize the text as either objective or subjective, considering whether it presents neutral information or expresses a personal opinion/bias.\n\nObjective: The text has a neutral tone and presents factual information about the actions of Democrats in Congress and the union's negotiations.\n\nSubjective: The text has a evaluative tone and expresses a positive/negative opinion/evaluation about the past performance of the country.", 'Given a sentence, classify it as either "objective" or "subjective" based on its tone and language, considering the presence of third-person pronouns, neutral language, and opinions. Classify the output as "objective" if the tone is neutral and detached, focusing on facts and data, or as "subjective" if the tone is evaluative, emotive, or biased.', - 'Identify whether the given sentence is subjective or objective, then correspondingly output "objective" or "subjective" in the form of ", (e.g. "objective"), without quotes. 
Please note that the subjective orientation typically describes a sentence where the writer expresses their own opinion or attitude, whereas an objective sentence presents facts or information without personal involvement or bias. ' + 'Identify whether the given sentence is subjective or objective, then correspondingly output "objective" or "subjective" in the form of ", (e.g. "objective"), without quotes. Please note that the subjective orientation typically describes a sentence where the writer expresses their own opinion or attitude, whereas an objective sentence presents facts or information without personal involvement or bias. ', ] ``` -We will be now using the gpt +### Configure Your LLM + +Promptolution offers three flexible ways to access language models: + +1. Local LLMs (using the Transformers library) +1. vLLM backend (for efficient serving of large language models) +1. API-based LLMs (compatible with any provider following the OpenAI standard) + +For this demonstration, we'll use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the `api_url` and `model_id` in the configuration. ```python -token = open("../deepinfratoken.txt", "r").read() +api_key = "YOUR_API_KEY" # Replace with your DeepInfra API key ``` +Here's an explanation of each configuration parameter in the ExperimentConfig: +- `optimizer`: The algorithm used for prompt optimization. Currently we support "capo", "evopromptga", "evopromptde", and "opro". For this example, we use "capo" as it is capable of leveraging few-shot examples. +- `task_description`: A string describing the task you're optimizing prompts for. This is used to provide the meta-llm with context about your task. +- `prompts`: A list of initial prompt strings that will be used as the starting point for optimization. +- `n_steps`: The number of optimization steps to run. Higher values allow more exploration and refinement but require more API calls and computational resources. +- `api_url`: The API endpoint URL used to access the language model. This example uses DeepInfra's API which follows the OpenAI standard. +- `model_id`: The model to use for the experiment, as both downstream and meta-LLM. +- `api_key`: The API key required to authenticate with the language model service. + ```python config = ExperimentConfig( + optimizer="capo", task_description=task_description, prompts=init_prompts, - n_steps=3, - optimizer="evopromptga", - api_url="https://api.openai.com/v1", - llm="gpt-4o-mini-2024-07-18", - token=token, + n_steps=10, + api_url="https://api.deepinfra.com/v1/openai", + model_id="meta-llama/Meta-Llama-3-8B-Instruct", + api_key=api_key, + n_subsamples=30, ) ``` +## Run Your Experiment + +With everything configured, you're ready to optimize your prompts! The `run_experiment` function will run the optimization and evaluate on a holdout set. You can expect this cell to take a few minutes to run. + ```python prompts = run_experiment(df, config) ``` + 📌 CAPO requires block evaluation strategy. Setting it to 'sequential_block'. + ⚠️ The LLM does not have a tokenizer. Using simple token count. + 🔥 Starting optimization... + 📊 Starting evaluation... + - --------------------------------------------------------------------------- - - RateLimitError Traceback (most recent call last) +As you can see, most optimized prompts are semantically very similar; however, they often differ substantially in performance. This is exactly what we observed in our experiments across various LLMs and datasets.
Running prompt optimization is an easy way to gain significant performance improvements on your task for free! - Cell In[48], line 1 - ----> 1 prompts = run_experiment(df, config) +If you run into any issues while using Promptolution, please feel free to contact us. We're also happy to receive support through pull requests and other contributions to the project. - File ~\Documents\programming\promptolution\promptolution\helpers.py:32, in run_experiment(df, config) - 30 train_df = df.sample(frac=0.8, random_state=42) - 31 test_df = df.drop(train_df.index) - ---> 32 prompts = run_optimization(train_df, config) - 33 df_prompt_scores = run_evaluation(test_df, config, prompts) - 35 return df_prompt_scores +Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡 - File ~\Documents\programming\promptolution\promptolution\helpers.py:59, in run_optimization(df, config) - 51 task = get_task(df, config) - 52 optimizer = get_optimizer( - 53 predictor=predictor, - 54 meta_llm=llm, - 55 task=task, - 56 config=config, - 57 ) - ---> 59 prompts = optimizer.optimize(n_steps=config.n_steps) - 61 if config.prepend_exemplars: - 62 selector = get_exemplar_selector(config.exemplar_selector, task, predictor) - +```python +prompts +``` - File :15, in optimize(self, n_steps) - File ~\Documents\programming\promptolution\promptolution\optimizers\evoprompt_ga.py:69, in EvoPromptGA._pre_optimization_loop(self) - 67 logger.warning(f"Initial sequences: {seq}") - 68 else: - ---> 69 self.scores = self.task.evaluate( - 70 self.prompts, self.predictor, subsample=True, n_samples=self.n_eval_samples - 71 ).tolist() - 72 # sort prompts by score - 73 self.prompts = [prompt for _, prompt in sorted(zip(self.scores, self.prompts), reverse=True)] +
| | prompt | score |
| --- | --- | --- |
| 0 | Classify the text as objective or subjective based on the presence of opinions or facts. Output classes: objective, subjective.\n\nInput:\nThe proposed agreement includes the best wage increases for rail workers in over forty years.\n\nOutput:\nobjective\n\nInput:\nThe principal reason, from the point of view of government, is that a universal income tax would be a powerful restraint upon the expansion of government.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.76 |
| 1 | Task: Linguistic Analysis for Sentence Classification\n\nClassify each sentence as either objective or subjective by applying linguistic insights to identify its tone, emotion, and degree of neutrality. Examine the sentences' language features, sentiment, and presence of verifiable facts or personal opinions. Determine whether each sentence presents impartial data or conveys the author's emotions, beliefs, or biases. Treat each sentence as a distinct entity, analyzing its contours, nuances, and purpose. Consider the distinction between factual reports like news articles and opinion-based writings like blog posts. Make a nuanced classification by scrutinizing the sentence's impact, intention, and emotional resonance.\n\nYour response should be comprised of two parts: the classification and the rationale. Enclose the first-mentioned class within the markers <final_answer> and </final_answer>. For instance, if the classification is 'objective', the output should be <final_answer>objective</final_answer>. Focus on the sentence's language, tone, and emotional appeal to make an informed decision about its categorization, prioritizing the sentence's intention and purpose.\n\nInput:\nThe last may go very deep.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\n“This latest rule will open our borders even more, and the Court seems to relish making arbitrary decisions without thinking about consequences.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.72 |
| 2 | Classify each sentence as either objective or subjective by unpacking its linguistic nuances and emotional undertones. Analyze the sentence's language features, sentiment, and presence of verifiable facts or personal opinions to determine whether it presents impartial data or conveys the author's emotions, beliefs, or biases. Treat each sentence as a standalone entity, examining its contours, subtleties, and intended purpose. Consider the distinction between factual reporting, like news articles, and opinion-based writings, like blog posts. Make a refined classification by scrutinizing the sentence's impact, intention, and emotional resonance, prioritizing the sentence's intention and purpose. Your response should consist of two parts: the classification and the rationale. Enclose the primary classification within the markers <final_answer> and </final_answer>. Focus on the sentence's language, tone, and emotional appeal to make an informed decision about its categorization. Classify each sentence as either objective or subjective by examining its linguistic tone, underlying intent, and purpose. Determine whether the text presents a neutral, factual account or expresses a personal opinion or emotional bias. Evaluate whether the text provides a neutral, factual report or reveals an evaluative tone, offering a positive or negative appraisal. Outputs will include classifications like objective or subjective, with the initial response serving as the prediction.\n\nInput:\nOver several decades, Prime Central London – or PCL – had become a repository for cash from wealthy foreigners, whether they actually wanted to live there or not.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.71 |
| 3 | <promptгалтер/>\n\nClassify each sentence as either objective or subjective by examining its linguistic tone, underlying intent, and purpose. Consider whether the text presents a neutral, factual account or expresses a personal opinion or emotional bias. Evaluate whether the text is neutral and provides mere reportage, such as a factual report on congressional Democrats' actions and labor union negotiations, or if it reveals an evaluative tone, offering a positive or negative appraisal of a nation's past performance. Outputs will include classifications like objective or subjective. The class mentioned first in the response will serve as the prediction, with the class label extracted from the text between the markers <final_answer> and </final_answer>.\n\nInput:\nOver several decades, Prime Central London – or PCL – had become a repository for cash from wealthy foreigners, whether they actually wanted to live there or not.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nFaced with a tighter labor market, many districts are raising base salaries and offering signing and relocation bonuses — up to a whopping $25,000 in one New Mexico school district.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nThat when liquidation of commodities and securities has gone too far it becomes the business of government to stop it, using public credit by such means as it may think fit.\n\nOutput:\n<final_answer>subjective</final_answer>\n\nInput: | 0.67 |
| 4 | Classify a given sentence as either "objective" or "subjective" based on its linguistic characteristics. Determine whether the sentence presents neutral information or expresses a personal opinion/bias. If the text maintains a detached tone, focusing on verifiable facts and data, assign the label "objective". Conversely, if the tone is evaluative, emotive, or reveals a bias, categorize it as "subjective". Compare the tone of a factual text discussing political events to a text expressing a clear opinion about a historical event to grasp the distinction between the two genres. The predicted class will be the first class mentioned in the language model's response, enclosed within the marks <final_answer> and </final_answer>.\n\nInput:\n“This latest rule will open our borders even more, and the Court seems to relish making arbitrary decisions without thinking about consequences.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nTransportation Secretary Pete Buttigieg confirmed to The Associated Press on Thursday that $104.6 million in federal funds coming from last year’s bipartisan infrastructure bill will go toward a plan to dismantle Interstate 375, a highway built to bisect Detroit’s Black Bottom neighborhood and its epicenter of Black business, Paradise Valley.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nThe last may go very deep.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.67 |
| 5 | Given a sentence, classify it as either "objective" or "subjective" based on its tone and language, considering the presence of third-person pronouns, neutral language, and opinions. Classify the output as "objective" if the tone is neutral and detached, focusing on facts and data, or as "subjective" if the tone is evaluative, emotive, or biased.\n\nInput:\nTransportation Secretary Pete Buttigieg confirmed to The Associated Press on Thursday that $104.6 million in federal funds coming from last year’s bipartisan infrastructure bill will go toward a plan to dismantle Interstate 375, a highway built to bisect Detroit’s Black Bottom neighborhood and its epicenter of Black business, Paradise Valley.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\n“This latest rule will open our borders even more, and the Court seems to relish making arbitrary decisions without thinking about consequences.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nHe is fairly secure.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nIn a recent report on the “new poor,” made by the Welfare Council of New York City, there is a reference to “the mental infection of dependency.” This was upon the investigation of unemployment relief.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.67 |
| 6 | Classify each sentence as objective or subjective by recognizing its language characteristics. Identify whether each sentence presents neutral information or expresses a personal opinion. If the sentence provides factual information without taking a bias, classify it as objective. Conversely, if the sentence conveys the author's perspective, emotions, or beliefs, label it as subjective. As our language model expert, carefully analyze each sentence, extracting its tone, and determine whether it presents verifiable data or the author's biased thoughts. For instance, compare a factual news report on politics to a blog post about a historical event and highlight the differences between objective and subjective writing. Our output will be the predicted class enclosed within the markers <final_answer> and </final_answer>, with the first-mentioned class being the predicted label.\n\nInput:\n“This latest rule will open our borders even more, and the Court seems to relish making arbitrary decisions without thinking about consequences.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.67 |
| 7 | Categorize the text as either objective or subjective, considering whether it presents neutral information or expresses a personal opinion/bias.\n\nObjective: The text has a neutral tone and presents factual information about the actions of Democrats in Congress and the union's negotiations.\n\nSubjective: The text has a evaluative tone and expresses a positive/negative opinion/evaluation about the past performance of the country.\n\nInput:\nOver several decades, Prime Central London – or PCL – had become a repository for cash from wealthy foreigners, whether they actually wanted to live there or not.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nThe last may go very deep.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nThat when liquidation of commodities and securities has gone too far it becomes the business of government to stop it, using public credit by such means as it may think fit.\n\nOutput:\n<final_answer>subjective</final_answer>\n\nInput:\nThat is what it means to sell bonds.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.66 |
| 8 | Classify a statement as either "subjective" or "objective" based on whether it reflects a personal opinion or a verifiable fact. The output classes to include are "objective" and "subjective".\n\nInput:\nThe promotion of it for many is an avocation, for increasing numbers it is a profession, and for a very great number of more or less trained men and women it is employment and livelihood.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.65 |
| 9 | A labeling exercise necessitates scrutinizing provided text to classify them as either vastly personal ('subjective') or dispassionately factual ('objective') based on the presence of opinions, biases, or verifiable information. Your mission is to accurately determine whether the supplied sentence leans more towards subjective expression of personal thought or objective presentation of facts, then output the corresponding classification within the format "<final_answer><output class>, <output class></final_answer>" (e.g. "<final_answer>objective</final_answer>"). Recognize that subjective sentences usually embody the writer's own views or emotions, whereas objective sentences present data without personal investment or allegiance. The predicted outcome will be the one first mentioned in the response, and the extracted class label will be positioned between the markers <final_answer> and </final_answer>, which can only be one of the two categories: subjective or objective.\n\nInput:\nThe last may go very deep.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.65 |
| 10 | Classify a collection of labeled sentences as either based on fact or reflecting personal opinion, using linguistic features to distinguish between objective statements presenting verifiable information and subjective expressions of opinion or attitude, with the objective class being denoted by <final_answer>objective</final_answer> and the subjective class by <final_answer>subjective</final_answer>, where the first-mentioned class in the response will serve as the predicted outcome.\n\nInput:\nThe principal reason, from the point of view of government, is that a universal income tax would be a powerful restraint upon the expansion of government.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.64 |
| 11 | Given a dataset of sentences, use linguistic analysis to categorize each sentence as either 'objective' or 'subjective', reflecting its tone and language usage. Examine the presence of neutral third-person pronouns, factual data, and opinions to determine whether a sentence presents information in a detached and neutral manner ('objective') or conveys a personal perspective or emotional appeal ('subjective'). Your primary consideration should be the sentence's intention, purpose, and emotional resonance, with the predicted classification appearing first in your response. The predicted classification will be extracted from the text situated between the '<final_answer>' and '</final_answer>' markers.\n\nInput:\nCOVID is continually evolving to become more immune evasive, according to Ray, and Omicron is spawning exponentially.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nThe last may go very deep.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput:\nOver several decades, Prime Central London – or PCL – had become a repository for cash from wealthy foreigners, whether they actually wanted to live there or not.\n\nOutput:\n<final_answer>objective</final_answer>\n\nInput: | 0.59 |
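If you want to keep working with these results programmatically, the scores above come back as a plain DataFrame, so picking out the best prompt is a one-liner (a small illustrative snippet, assuming the `prompt` and `score` columns shown above):

```python
# `prompts` is the result DataFrame shown above
best_prompt = prompts.sort_values("score", ascending=False).iloc[0]["prompt"]
print(best_prompt)
```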
- File ~\Documents\programming\promptolution\promptolution\tasks\classification_tasks.py:101, in ClassificationTask.evaluate(self, prompts, predictor, system_prompts, n_samples, subsample, return_seq) - 98 ys_subsample = self.ys[indices] - 100 # Make predictions on the subsample - --> 101 preds = predictor.predict(prompts, xs_subsample, system_prompts=system_prompts, return_seq=return_seq) - 103 if return_seq: - 104 preds, seqs = preds - - - File ~\Documents\programming\promptolution\promptolution\predictors\base_predictor.py:57, in BasePredictor.predict(self, prompts, xs, system_prompts, return_seq) - 54 if isinstance(prompts, str): - 55 prompts = [prompts] - ---> 57 outputs = self.llm.get_response( - 58 [prompt + "\n" + x for prompt in prompts for x in xs], system_prompts=system_prompts - 59 ) - 60 preds = self._extract_preds(outputs) - 62 shape = (len(prompts), len(xs)) - - - File ~\Documents\programming\promptolution\promptolution\llms\base_llm.py:97, in BaseLLM.get_response(self, prompts, system_prompts) - 95 if isinstance(system_prompts, str): - 96 system_prompts = [system_prompts] * len(prompts) - ---> 97 responses = self._get_response(prompts, system_prompts) - 98 self.update_token_count(prompts + system_prompts, responses) - 100 return responses - - - File ~\Documents\programming\promptolution\promptolution\llms\api_llm.py:82, in APILLM._get_response(self, prompts, system_prompts) - 79 def _get_response(self, prompts: List[str], system_prompts: List[str]) -> List[str]: - 80 # Setup for async execution in sync context - 81 loop = asyncio.get_event_loop() - ---> 82 responses = loop.run_until_complete(self._get_response_async(prompts, system_prompts)) - 83 return responses - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\nest_asyncio.py:98, in _patch_loop..run_until_complete(self, future) - 95 if not f.done(): - 96 raise RuntimeError( - 97 'Event loop stopped before Future completed.') - ---> 98 return f.result() - - - File ~\AppData\Local\Programs\Python\Python312\Lib\asyncio\futures.py:203, in Future.result(self) - 201 self.__log_traceback = False - 202 if self._exception is not None: - --> 203 raise self._exception.with_traceback(self._exception_tb) - 204 return self._result - - - File ~\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py:316, in Task.__step_run_and_handle_result(***failed resolving arguments***) - 314 result = coro.send(None) - 315 else: - --> 316 result = coro.throw(exc) - 317 except StopIteration as exc: - 318 if self._must_cancel: - 319 # Task is cancelled right before coro stops. - - - File ~\Documents\programming\promptolution\promptolution\llms\api_llm.py:90, in APILLM._get_response_async(self, prompts, system_prompts) - 85 async def _get_response_async(self, prompts: List[str], system_prompts: List[str]) -> List[str]: - 86 tasks = [ - 87 _invoke_model(prompt, system_prompt, self.max_tokens, self.llm, self.client, self.semaphore) - 88 for prompt, system_prompt in zip(prompts, system_prompts) - 89 ] - ---> 90 responses = await asyncio.gather(*tasks) - 91 return [response.choices[0].message.content for response in responses] - - - File ~\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py:385, in Task.__wakeup(self, future) - 383 def __wakeup(self, future): - 384 try: - --> 385 future.result() - 386 except BaseException as exc: - 387 # This may also be a cancellation. 
- 388 self.__step(exc) - - - File ~\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py:314, in Task.__step_run_and_handle_result(***failed resolving arguments***) - 310 try: - 311 if exc is None: - 312 # We use the `send` method directly, because coroutines - 313 # don't have `__iter__` and `__next__` methods. - --> 314 result = coro.send(None) - 315 else: - 316 result = coro.throw(exc) - - - File ~\Documents\programming\promptolution\promptolution\llms\api_llm.py:25, in _invoke_model(prompt, system_prompt, max_tokens, model_id, client, semaphore) - 23 async with semaphore: - 24 messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}] - ---> 25 response = await client.chat.completions.create( - 26 model=model_id, - 27 messages=messages, - 28 max_tokens=max_tokens, - 29 ) - 30 return response - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\resources\chat\completions\completions.py:2032, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, web_search_options, extra_headers, extra_query, extra_body, timeout) - 1989 @required_args(["messages", "model"], ["messages", "model", "stream"]) - 1990 async def create( - 1991 self, - (...) 2029 timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN, - 2030 ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]: - 2031 validate_response_format(response_format) - -> 2032 return await self._post( - 2033 "/chat/completions", - 2034 body=await async_maybe_transform( - 2035 { - 2036 "messages": messages, - 2037 "model": model, - 2038 "audio": audio, - 2039 "frequency_penalty": frequency_penalty, - 2040 "function_call": function_call, - 2041 "functions": functions, - 2042 "logit_bias": logit_bias, - 2043 "logprobs": logprobs, - 2044 "max_completion_tokens": max_completion_tokens, - 2045 "max_tokens": max_tokens, - 2046 "metadata": metadata, - 2047 "modalities": modalities, - 2048 "n": n, - 2049 "parallel_tool_calls": parallel_tool_calls, - 2050 "prediction": prediction, - 2051 "presence_penalty": presence_penalty, - 2052 "reasoning_effort": reasoning_effort, - 2053 "response_format": response_format, - 2054 "seed": seed, - 2055 "service_tier": service_tier, - 2056 "stop": stop, - 2057 "store": store, - 2058 "stream": stream, - 2059 "stream_options": stream_options, - 2060 "temperature": temperature, - 2061 "tool_choice": tool_choice, - 2062 "tools": tools, - 2063 "top_logprobs": top_logprobs, - 2064 "top_p": top_p, - 2065 "user": user, - 2066 "web_search_options": web_search_options, - 2067 }, - 2068 completion_create_params.CompletionCreateParamsStreaming - 2069 if stream - 2070 else completion_create_params.CompletionCreateParamsNonStreaming, - 2071 ), - 2072 options=make_request_options( - 2073 extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout - 2074 ), - 2075 cast_to=ChatCompletion, - 2076 stream=stream or False, - 2077 stream_cls=AsyncStream[ChatCompletionChunk], - 2078 ) - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1805, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1791 async def 
post( - 1792 self, - 1793 path: str, - (...) 1800 stream_cls: type[_AsyncStreamT] | None = None, - 1801 ) -> ResponseT | _AsyncStreamT: - 1802 opts = FinalRequestOptions.construct( - 1803 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1804 ) - -> 1805 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1495, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls, remaining_retries) - 1492 else: - 1493 retries_taken = 0 - -> 1495 return await self._request( - 1496 cast_to=cast_to, - 1497 options=options, - 1498 stream=stream, - 1499 stream_cls=stream_cls, - 1500 retries_taken=retries_taken, - 1501 ) - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1585, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken) - 1583 if remaining_retries > 0 and self._should_retry(err.response): - 1584 await err.response.aclose() - -> 1585 return await self._retry_request( - 1586 input_options, - 1587 cast_to, - 1588 retries_taken=retries_taken, - 1589 response_headers=err.response.headers, - 1590 stream=stream, - 1591 stream_cls=stream_cls, - 1592 ) - 1594 # If the response is streamed then we need to explicitly read the response - 1595 # to completion before attempting to access the response text. - 1596 if not err.response.is_closed: - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1632, in AsyncAPIClient._retry_request(self, options, cast_to, retries_taken, response_headers, stream, stream_cls) - 1628 log.info("Retrying request to %s in %f seconds", options.url, timeout) - 1630 await anyio.sleep(timeout) - -> 1632 return await self._request( - 1633 options=options, - 1634 cast_to=cast_to, - 1635 retries_taken=retries_taken + 1, - 1636 stream=stream, - 1637 stream_cls=stream_cls, - 1638 ) - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1585, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken) - 1583 if remaining_retries > 0 and self._should_retry(err.response): - 1584 await err.response.aclose() - -> 1585 return await self._retry_request( - 1586 input_options, - 1587 cast_to, - 1588 retries_taken=retries_taken, - 1589 response_headers=err.response.headers, - 1590 stream=stream, - 1591 stream_cls=stream_cls, - 1592 ) - 1594 # If the response is streamed then we need to explicitly read the response - 1595 # to completion before attempting to access the response text. 
- 1596 if not err.response.is_closed: - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1632, in AsyncAPIClient._retry_request(self, options, cast_to, retries_taken, response_headers, stream, stream_cls) - 1628 log.info("Retrying request to %s in %f seconds", options.url, timeout) - 1630 await anyio.sleep(timeout) - -> 1632 return await self._request( - 1633 options=options, - 1634 cast_to=cast_to, - 1635 retries_taken=retries_taken + 1, - 1636 stream=stream, - 1637 stream_cls=stream_cls, - 1638 ) - - - File c:\Users\tzehl\Documents\programming\promptolution\.venv\Lib\site-packages\openai\_base_client.py:1600, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken) - 1597 await err.response.aread() - 1599 log.debug("Re-raising status error") - -> 1600 raise self._make_status_error_from_response(err.response) from None - 1602 return await self._process_response( - 1603 cast_to=cast_to, - 1604 options=options, - (...) 1608 retries_taken=retries_taken, - 1609 ) - - - RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-3DmWJfR4tphuKTSzcsMB3vHF on requests per min (RPM): Limit 500, Used 500, Requested 1. Please try again in 120ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}} ```python -prompts + ``` diff --git a/docs/examples/llm_as_judge_tutorial.md b/docs/examples/llm_as_judge_tutorial.md new file mode 100644 index 0000000..d883bc4 --- /dev/null +++ b/docs/examples/llm_as_judge_tutorial.md @@ -0,0 +1,246 @@ +# Getting Started: LLM as a Judge with Promptolution + +## Welcome to Promptolution! + +Discover a powerful tool for evolving and optimizing your LLM prompts. This notebook provides a friendly introduction to one of Promptolution's most advanced features: LLM as a Judge. + +While the standard getting_started notebook shows how to optimize for classification tasks, this guide will focus on something different. We'll optimize prompts for a creative task where there's no single "correct" answer: *Finding an optimal argument for a statement*! + +## Intro +In traditional machine learning and prompt optimization, we often rely on labeled data. For a classification task, you need an input (x) and a corresponding ground-truth label (y). The goal is to find a prompt that helps the model predict y correctly. +But what if your task is more subjective? How do you "label" things like: + +- The quality of a generated argument? +- The creativity of a story? +- The helpfulness of a summary? +- The persuasiveness of an essay? + +This is where LLM as a Judge comes in. Instead of relying on a pre-defined dataset of labels, we use another powerful Language Model (the "judge") to score the output of our prompts. The process looks like this: + +A candidate prompt is used to generate a response (e.g., an argument). +A "judge" LLM then evaluates this response based on the task provided and assigns a score. +Promptolution's optimizer uses these scores to identify which prompts are best and evolves them to generate even better responses. + +The beauty of this approach is its flexibility. While you can provide groundtruths (in case there is a correct answer) and let the LLM judge itself if both the prediction and the correct answer are equivalent - you don't need to. + +*New to Promptolution? 
If you haven't seen our classification tutorial yet, check out `getting_started.ipynb` first! It covers the basics of prompt optimization with simpler tasks like text classification. This notebook builds on those concepts but tackles more complex, subjective tasks.* + +## Installation +Install Promptolution with a single command + + +```python +! pip install promptolution[api] +``` + +## Imports + + +```python +import pandas as pd +from promptolution.utils import ExperimentConfig +from promptolution.helpers import run_experiment +import nest_asyncio + +nest_asyncio.apply() # Required for notebook environments +``` + +## Setting Up Your Experiment + +### Prepare the data + +For this tutorial, we're using IBM's Argument Quality Ranking dataset - a collection of crowd-sourced arguments on controversial topics like capital punishment, abortion rights, and climate change. + +Unlike classification tasks where you have clear input-output pairs, here we're working with debate topics that we want to generate compelling arguments for. + + +```python +df = pd.read_csv("hf://datasets/ibm-research/argument_quality_ranking_30k/dev.csv").sample(300) +``` + + +```python +print("\nSample topics:") +for topic in df["topic"].unique()[:3]: + print(f"- {topic}") +``` + + + Sample topics: + - We should adopt a zero-tolerance policy in schools + - Payday loans should be banned + - Intelligence tests bring more harm than good + + +Our task: **Given a controversial statement, generate the strongest possible argument supporting that position.** + +Let's look at what we're working with: + +### Creating Inital Prompts + +Here are some starter prompts for generating compelling arguments. Feel free to experiment with your own! + + +```python +init_prompts = [ + "Create a strong argument for this position with clear reasoning and examples:", + "Write a persuasive argument supporting this statement. Include evidence and address counterarguments:", + "Make a compelling case for this viewpoint using logical reasoning and real examples:", + "Argue convincingly for this position. Provide supporting points and evidence:", + "Build a strong argument for this statement with clear structure and solid reasoning:", + "Generate a persuasive argument supporting this position. Use facts and logical flow:", + "Create a well-reasoned argument for this viewpoint with supporting evidence:", + "Write a convincing argument for this position. Include examples and counter opposing views:", + "Develop a strong case supporting this statement using clear logic and evidence:", + "Construct a persuasive argument for this position with solid reasoning and examples:", +] +``` + +### Configure Your LLM + +For this demonstration, we will again use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the `api_url` and `model_id`. 
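For example, switching this experiment from DeepInfra to OpenAI would only change two values (illustrative; the endpoint and model name below are taken from the earlier OpenAI-based version of the getting-started notebook and may need updating):

```python
# Hypothetical provider switch - only the endpoint and the model id change
api_url = "https://api.openai.com/v1"
model_id = "gpt-4o-mini-2024-07-18"
```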
+ + +```python +api_key = "YOUR_API_KEY" # Replace with your DeepInfra API key +``` + +Here are the key parameters for LLM-as-a-Judge tasks: + + +```python +config = ExperimentConfig( + optimizer="evopromptga", + task_description="Given a statement, find the best argument supporting it.", + x_column="topic", + prompts=init_prompts, + n_steps=3, + n_subsamples=10, + subsample_strategy="random_subsample", + api_url="https://api.deepinfra.com/v1/openai", + model_id="meta-llama/Meta-Llama-3-8B-Instruct", + api_key=api_key, + task_type="judge", +) +``` + +- `task_type="judge"` - This tells Promptolution to use LLM evaluation instead of accuracy metrics +- `x_column="topic"` - We specify which column contains our input (debate topics) +- `optimizer="evopromptga"` - In the classification tutorial we showcased CAPO; here we are using EvoPrompt, a strong evolutionary prompt optimizer. +- No y column needed - the judge will evaluate quality without ground truth labels! + +## Run Your Experiment + +With everything configured, you're ready to optimize your prompts! The run_experiment function will: + +1. Evaluate your initial prompts by generating arguments and having the judge LLM score them +1. Use evolutionary operators (mutation, crossover) to create new prompt variations from the best-performing ones +1. Test these new prompt candidates and select the fittest ones for the next generation +1. Repeat this evolutionary process for the specified number of steps, gradually improving prompt quality + + +```python +prompts = run_experiment(df, config) +``` + + 🔥 Starting optimization... + + +You can expect this to take several minutes as the optimizer generates arguments, evaluates them with the judge, and evolves the prompts. + + +```python +prompts +```
| | prompt | score |
| --- | --- | --- |
| 0 | Construct a persuasive argument supporting the given statement, relying on logical coherence and evidence-based reasoning. | 0.931500 |
| 1 | Develop a strong case supporting this statement using clear logic and evidence: | 0.924167 |
| 2 | Construct a convincing case supporting the stated argument, providing evidence and responding to potential objections. | 0.915833 |
| 3 | Develop a well-reasoned argument in favor of the given statement, incorporating reliable examples and addressing potential counterpoints. | 0.913333 |
| 4 | Write a persuasive argument supporting this statement. Include evidence and address counterarguments: | 0.907500 |
| 5 | Present a convincing case for this assertion, incorporating logical premises and applicable examples. | 0.903333 |
| 6 | Fortify the provided statement with a robust and well-reasoned argument, underscoring logical relationships and leveraging empirical support to build a compelling case, while also anticipating and addressing potential counterpoints. | 0.902500 |
| 7 | Construct a strong claim in support of this statement, employing a logical framework and relevant examples to make a convincing case. | 0.891667 |
| 8 | Create a well-reasoned argument for this viewpoint with supporting evidence: | 0.888333 |
| 9 | Extract the most compelling supporting argument for this statement, grounding it in logical reasoning and bolstered by relevant evidence and examples. | 0.697500 |
+ + + +The best prompts aren't always the most obvious ones - let the optimizer surprise you with what works! + + +Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡 + + +```python + +``` diff --git a/docs/examples/reward_task_tutorial.md b/docs/examples/reward_task_tutorial.md new file mode 100644 index 0000000..82d0e97 --- /dev/null +++ b/docs/examples/reward_task_tutorial.md @@ -0,0 +1,269 @@ +# Getting Started: Reward Tasks with Promptolution + +Welcome to the world of **reward-based prompt optimization**! If you've explored our classification tutorial (`getting_started.ipynb`) or our LLM-as-a-Judge notebook (`llm_judge_getting_started.ipynb`), you've seen how to optimize prompts for predicting labels or generating content that gets rated by AI judges. + +But what if you want to optimize for something completely different? What if you want to optimize for: +* **Objective, measurable outcomes** rather than subjective quality? +* **System compatibility** - does the output actually work with your software? +* **Concrete business metrics** that you can define and measure automatically? + +This is where **Reward Tasks** shine. Instead of relying on pre-labeled data or AI judges, you define your own reward function - a simple Python function that takes the model's output and returns a score. The optimizer then evolves prompts that maximize this reward. + +**The beauty of reward tasks**: You can optimize for literally anything you can measure! Valid JSON parsing, code execution success, mathematical correctness, format compliance, API compatibility - if you can write a function to evaluate it, you can optimize for it. + +> **New to Promptolution?** If you haven't seen our other tutorials yet, check out `getting_started.ipynb` (classification) and `llm_judge_getting_started.ipynb` (LLM evaluation) first! This notebook builds on those concepts but tackles objective, measurable outcomes. + +## Installation +Install Promptolution with a single command + + +```python +! pip install promptolution[api] +``` + +## Imports + + +```python +import pandas as pd +from promptolution.utils import ExperimentConfig +from promptolution.helpers import run_experiment +import nest_asyncio + +nest_asyncio.apply() # Required for notebook environments +``` + + c:\Users\tzehl\anaconda3\envs\d\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html + from .autonotebook import tqdm as notebook_tqdm + + +## Setting Up Your Experiment + +### Prepare the data + +For this tutorial, we're tackling a real-world challenge: summarizing text and outputting valid JSON. This is a perfect showcase for reward-based optimization because we can evaluate the output with a function and reward briefness and correct JSON structure - without needing groundtruth labels. +We're using the CNN/DailyMail dataset, which contains news articles. + + +```python +df = pd.read_parquet("hf://datasets/abisee/cnn_dailymail/3.0.0").sample(300) +``` + +Key difference from other tasks: Notice we're not using labeled "correct" JSON outputs or asking an AI to judge quality. Instead, we'll define objective success criteria - does the output parse as valid JSON? Does it contain the required fields? Is the summary concise enough for our database? 
+ +Let's explore the task: + + +```python +print("Dataset columns:", df.columns.tolist()) +print(f"\nDataset size: {len(df)} examples") +print("\nSample Article:") +print(df["article"].iloc[0][:170] + "...") +``` + + Dataset columns: ['article', 'highlights', 'id'] + + Dataset size: 300 examples + + Sample Article: + Investors looking to make an easy buck out of the housing market could be running out of time. Australia's financial regulators are in talks to tighten the process for in... + + +### Creating Initial Prompts + +Here are some starter prompts for JSON extraction. Feel free to experiment with your own approaches! + + +```python +init_prompts = [ + """Analyze the provided news article and return a JSON response with the following three fields: +- "summary": A concise summary of the article's main points (maximum 200 characters) +- "category": The article's topic classification (options: "sports", "politics", "technology", or "other") +- "author": The article author's name (use "unknown" if not provided) +Format the response as valid JSON with these exact keys. +The final json needs to start with the <final_answer> tag. +""" +] +``` + +### Configure Your LLM + +Promptolution offers three flexible ways to access language models: + +1. Local LLMs (using the Transformers library) +1. vLLM backend (for efficient serving of large language models) +1. API-based LLMs (compatible with any provider following the OpenAI standard) + +For this demonstration, we'll use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the `api_url` and `model_id` in the configuration. + + +```python +api_key = "YOUR_API_KEY" # Replace with your DeepInfra API key +``` + +Here's an explanation of each configuration parameter in the ExperimentConfig: +- `optimizer`: The algorithm used for prompt optimization. Currently we support "capo", "evopromptga", "evopromptde", and "opro". For this example, we use "opro" as it requires only a single initial prompt. +- `task_description`: A string describing the task you're optimizing prompts for. This is used to provide the meta-llm with context about your task. +- `prompts`: A list of initial prompt strings that will be used as the starting point for optimization. +- `n_steps`: The number of optimization steps to run. Higher values allow more exploration and refinement but require more API calls and computational resources. +- `api_url`: The API endpoint URL used to access the language model. This example uses DeepInfra's API which follows the OpenAI standard. +- `model_id`: The model to use for the experiment, as both downstream and meta-LLM. +- `api_key`: The API key required to authenticate with the language model service. + +### Define Your Reward Function + +This is where the magic happens! Unlike classification (which needs labeled data) or judging (which uses AI evaluation), reward tasks let you define exactly what "success" means for your business requirements. + +We first give the LLM a reward of 0.3 for producing output that is parsable by `json.loads`. +There is an additional reward of 0.2 if the resulting dictionary contains the key "summary", and 0.1 each for containing "category" and "author". +If the summary is shorter than 200 characters, the prompt earns an additional reward of 0.2. +Finally, we give a reward of 0.1 if the category is one of the allowed values.
+ + +```python +import json + + +def reward_function(prediction: str) -> float: + reward = 0.0 + try: + information = json.loads(prediction) + reward += 0.3 # valid json + + if "summary" in information.keys(): + reward += 0.2 # contains summary + if "category" in information.keys(): + reward += 0.1 # contains category + if "author" in information.keys(): + reward += 0.1 # contains author + + if len(information.get("summary", "")) < 200: + reward += 0.2 # summary is < 200 characters + + if information.get("category") in ["sports", "politics", "technology", "other"]: + reward += 0.1 # category is valid + except Exception: + reward = 0.0 + + return reward +``` + +This reward function captures actual business requirements - the output must be valid JSON that our systems can process, contain all required fields, respect character limits to save time for the user, and use only allowed category values. + + +```python +task_description = ( + "The task is to summarize a news article into a json format, that contains 'summary', 'category', and 'author'. " + "The summary should be less than 200 characters, and the category should be one of 'sports', 'politics', 'technology', or 'other'. " + "The final json needs to start with the tag." +) +``` + + +```python +config = ExperimentConfig( + optimizer="opro", + task_description=task_description, + prompts=init_prompts, + x_column="article", + n_steps=8, + num_instructions_per_step=5, + api_url="https://api.deepinfra.com/v1/openai", + model_id="meta-llama/Meta-Llama-3-8B-Instruct", + api_key=api_key, + n_subsamples=15, + task_type="reward", + reward_function=reward_function, +) +``` + +**Difference compared to Classification and LLM-As-a-Judge**: +- `task_type="reward"` - Uses your custom reward function instead of accuracy or AI judgment +- `reward_function=reward_function` - Your objective success criteria +- `optimizer="opro"` - We already used EvoPrompt and CAPO in the other tutorials - here we will use OPRO. Its main benefit: it requires only a single initial prompt. +- No need for labeled "correct" outputs - the reward function defines success +- Completely customizable - change the reward function to optimize for anything! + +## Run Your Experiment + +With everything configured, you're ready to optimize your prompts! The `run_experiment` function will run the optimization and evaluate on a holdout set. You can expect this cell to take a few minutes to run. + + +```python +prompts = run_experiment(df, config) +``` + + 🔥 Starting optimization... + 📊 Starting evaluation... + + + +```python +prompts.iloc[[0, 1, 2, -2, -1]] +``` + + + + +
+
+|    | prompt | score |
+|----|--------|-------|
+| 0  | Summarize the news article into a JSON format with the following structure: {“summary”: <summary>, “category”: <category>, “author”: <author>}.\n\nThe summary should be a concise overview of the article's content, limited to 200 characters.\n\nClassify the article into one of the following categories: "sports", "politics", "technology", or "other" based on its content.\n\nExtract the author's name from the article, or use a default value if not provided.\n\nStart the JSON response with the tag “<final_answer>” and end it with “</final_answer>”. | 0.848333 |
+| 1  | Analyze the provided news article and return a JSON response with the following three fields:\n- "summary": A concise summary of the article's main points (maximum 200 characters)\n\n- "category": The article's topic classification (options: "sports", "politics", "technology", or "other")\n\n- "author": The article author's name (use "unknown" if not provided)\n\nFormat the response as valid JSON with these exact keys.\n\nThe final json needs to start with the <final_answer> tag.\n | 0.811667 |
+| 2  | Analyze the provided news article and generate a JSON response with the following three fields:\n\n* "summary": A concise and objective summary of the article's main points, limited to 150 characters, focusing on the most critical information and highlighting key points.\n* "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", "entertainment", or "other" based on its content.\n* "author": The article author's name, using "unknown" if not provided.\n\nFormat the response as valid JSON with these exact keys, ensuring that the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary and category fields should be accurately represented, and the JSON output should be easy to read and understand.\n\nNote: The article summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\n\nScore: 99 | 0.805000 |
+| 18 | Analyze the provided news article and generate a JSON response with the following three fields:\n- "summary": A concise summary of the article's main points, limited to 250 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\n- "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", "entertainment", "science", or "other" based on its content.\n- "author": The article author's name, using "unknown" if not provided.\n\nThe JSON response should start with the <final_answer> tag and end with </final_answer>. Ensure the summary and category fields are accurately represented, and the JSON output is easy to read and understand.\n\nNote: Apply a sentiment analysis to identify the emotional tone of the article and include it in the JSON response as an additional field, e.g., "sentiment": "positive", "neutral", or "negative". | 0.711667 |
+| 19 | Analyze the provided news article and generate a JSON response with the following three fields:\n\n* "summary": A concise summary of the article's main points, limited to 200 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\n* "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", or "entertainment", based on its content.\n* "author": The article author's name, using a default value if not provided.\n\nFormat the response as valid JSON with these exact keys. Ensure the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\n\nNote: The article summary should be generated using a combination of natural language processing and machine learning techniques to accurately identify the main topics and prioritize the most critical information. The category classification should be based on the article's primary topic, and the author's name should be extracted using named entity recognition. | 0.701667 |
+
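+
+As a quick sanity check, you can score raw model outputs with the same `reward_function` outside of `run_experiment`. The two responses below are made up for illustration; only the reward function defined above is assumed.
+
+
+```python
+# Two hypothetical model outputs: one fully compliant, one that is missing "author",
+# uses a disallowed category, and exceeds the 200-character summary limit.
+good = '{"summary": "Regulators may tighten investor lending rules.", "category": "politics", "author": "unknown"}'
+bad = '{"summary": "' + "x" * 250 + '", "category": "finance"}'
+
+for name, output in [("good", good), ("bad", bad)]:
+    # round to hide floating-point noise from summing the partial rewards
+    print(name, round(reward_function(output), 2))
+# good earns the maximum reward of 1.0; bad only collects 0.6
+# (parsable JSON 0.3 + "summary" key 0.2 + "category" key 0.1).
+```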
+
+
+You might think 'just ask for JSON' would work fine, but optimization reveals that specific instructions about field names, value constraints, and output formatting can improve validity rates from ~70% to over 84% - another reminder that systematic optimization beats manual prompt engineering!
+
+Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡
diff --git a/docs/release-notes/v2.1.0.md b/docs/release-notes/v2.1.0.md
new file mode 100644
index 0000000..20bc25f
--- /dev/null
+++ b/docs/release-notes/v2.1.0.md
@@ -0,0 +1,20 @@
+## Release v2.1.0
+### What's changed
+
+#### Added features:
+* We added Reward and LLM-as-a-Judge to our task family
+    * Reward allows you to write a custom function that scores the prediction, without requiring ground truth
+    * LLM-as-a-Judge allows you to delegate the task of scoring a prediction to a Judge-LLM, optionally accepting ground truth
+
+* Changes to CAPO to make it applicable to the new tasks:
+    * CAPO now accepts the input parameter "check_fs_accuracy" (default True) - for reward tasks the accuracy cannot be evaluated, so the prediction of the downstream_llm is used as the few-shot target.
+    * CAPO also accepts "create_fs_reasoning" (default True): if set to False, the input-output pairs from df_few_shots are used directly
+
+* Introduces a tag-extraction function to centralize repeated code for extractions like "<final_answer>5</final_answer>"
+
+#### Further changes:
+* We now utilize mypy for automated type checking
+* Core functionality of the classification task has been moved to the base task to prevent code duplication for other tasks
+* Test coverage is now boosted to >90%
+
+**Full Changelog**: [here](https://github.com/finitearth/promptolution/compare/2.0.1...v2.1.0)
diff --git a/mkdocs.yml b/mkdocs.yml
index 8df2d4c..57cde7a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -47,6 +47,7 @@ nav:
   - Home: index.md
   - Release Notes:
    - Overview: release-notes.md
+    - v2.1.0: release-notes/v2.1.0.md
    - v2.0.1: release-notes/v2.0.1.md
    - v2.0.0: release-notes/v2.0.0.md
    - v1.4.0: release-notes/v1.4.0.md
@@ -72,6 +73,8 @@ nav:
     - Exemplar Selectors: api/exemplar_selectors.md
   - Tutorials:
     - Getting Started: examples/getting_started.md
+    - LLM as Judge Tutorial: examples/llm_as_judge_tutorial.md
+    - Reward Task Tutorial: examples/reward_task_tutorial.md

 markdown_extensions:
   - pymdownx.highlight:
diff --git a/promptolution/exemplar_selectors/base_exemplar_selector.py b/promptolution/exemplar_selectors/base_exemplar_selector.py
index 1f52ccb..bb2ee21 100644
--- a/promptolution/exemplar_selectors/base_exemplar_selector.py
+++ b/promptolution/exemplar_selectors/base_exemplar_selector.py
@@ -3,9 +3,9 @@

 from abc import ABC, abstractmethod

-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional

-if TYPE_CHECKING:
+if TYPE_CHECKING:  # pragma: no cover
     from promptolution.predictors.base_predictor import BasePredictor
     from promptolution.tasks.base_task import BaseTask
     from promptolution.utils.config import ExperimentConfig
@@ -18,7 +18,7 @@ class BaseExemplarSelector(ABC):
     that all exemplar selectors should implement.
     """

-    def __init__(self, task: "BaseTask", predictor: "BasePredictor", config: "ExperimentConfig" = None):
+    def __init__(self, task: "BaseTask", predictor: "BasePredictor", config: Optional["ExperimentConfig"] = None):
         """Initialize the BaseExemplarSelector.

Args: diff --git a/promptolution/exemplar_selectors/random_search_selector.py b/promptolution/exemplar_selectors/random_search_selector.py index 005fef8..7a88b08 100644 --- a/promptolution/exemplar_selectors/random_search_selector.py +++ b/promptolution/exemplar_selectors/random_search_selector.py @@ -10,7 +10,7 @@ class RandomSearchSelector(BaseExemplarSelector): evaluates their performance, and selects the best performing set. """ - def select_exemplars(self, prompt, n_examples: int = 5, n_trials: int = 5): + def select_exemplars(self, prompt: str, n_trials: int = 5) -> str: """Select exemplars using a random search strategy. This method generates multiple sets of random examples, evaluates their performance @@ -18,20 +18,21 @@ def select_exemplars(self, prompt, n_examples: int = 5, n_trials: int = 5): Args: prompt (str): The input prompt to base the exemplar selection on. - n_examples (int, optional): The number of exemplars to select in each trial. Defaults to 5. n_trials (int, optional): The number of random trials to perform. Defaults to 5. Returns: str: The best performing prompt, which includes the original prompt and the selected exemplars. """ - best_score = 0 + best_score = 0.0 best_prompt = prompt for _ in range(n_trials): - _, seq = self.task.evaluate(prompt, self.predictor, n_samples=n_examples, subsample=True, return_seq=True) - prompt_with_examples = "\n\n".join([prompt] + seq) + "\n\n" + _, seq = self.task.evaluate( + prompt, self.predictor, eval_strategy="subsample", return_seq=True, return_agg_scores=False + ) + prompt_with_examples = "\n\n".join([prompt] + [seq[0][0]]) + "\n\n" # evaluate prompts as few shot prompt - score = self.task.evaluate(prompt_with_examples, self.predictor, subsample=True) + score = self.task.evaluate(prompt_with_examples, self.predictor, eval_strategy="subsample")[0] if score > best_score: best_score = score best_prompt = prompt_with_examples diff --git a/promptolution/exemplar_selectors/random_selector.py b/promptolution/exemplar_selectors/random_selector.py index 730e4d6..a6a4b72 100644 --- a/promptolution/exemplar_selectors/random_selector.py +++ b/promptolution/exemplar_selectors/random_selector.py @@ -1,10 +1,12 @@ """Random exemplar selector.""" -from typing import TYPE_CHECKING +import numpy as np + +from typing import TYPE_CHECKING, List, Optional from promptolution.exemplar_selectors.base_exemplar_selector import BaseExemplarSelector -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask from promptolution.utils.config import ExperimentConfig @@ -18,8 +20,12 @@ class RandomSelector(BaseExemplarSelector): """ def __init__( - self, task: "BaseTask", predictor: "BasePredictor", desired_score: int = 1, config: "ExperimentConfig" = None - ): + self, + task: "BaseTask", + predictor: "BasePredictor", + desired_score: int = 1, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the RandomSelector. Args: @@ -44,11 +50,13 @@ def select_exemplars(self, prompt: str, n_examples: int = 5) -> str: Returns: str: A new prompt that includes the original prompt and the selected exemplars. 
""" - examples = [] + examples: List[str] = [] while len(examples) < n_examples: - score, seq = self.task.evaluate(prompt, self.predictor, n_samples=1, return_seq=True) + scores, seqs = self.task.evaluate( + prompt, self.predictor, eval_strategy="subsample", return_seq=True, return_agg_scores=False + ) + score = np.mean(scores) + seq = seqs[0][0] if score == self.desired_score: - examples.append(seq[0]) - prompt = "\n\n".join([prompt] + examples) + "\n\n" - - return prompt + examples.append(seq) + return "\n\n".join([prompt] + examples) + "\n\n" diff --git a/promptolution/helpers.py b/promptolution/helpers.py index ec603a4..2594609 100644 --- a/promptolution/helpers.py +++ b/promptolution/helpers.py @@ -1,15 +1,21 @@ """Helper functions for the usage of the libary.""" -from typing import TYPE_CHECKING, List, Literal +from typing import TYPE_CHECKING, Callable, List, Literal, Optional -if TYPE_CHECKING: +from promptolution.tasks.judge_tasks import JudgeTask +from promptolution.tasks.reward_tasks import RewardTask + +if TYPE_CHECKING: # pragma: no cover from promptolution.exemplar_selectors.base_exemplar_selector import BaseExemplarSelector from promptolution.llms.base_llm import BaseLLM from promptolution.optimizers.base_optimizer import BaseOptimizer from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask from promptolution.utils.config import ExperimentConfig + from promptolution.tasks.base_task import TaskType + from promptolution.optimizers.base_optimizer import OptimizerType + from promptolution.predictors.base_predictor import PredictorType import pandas as pd @@ -39,7 +45,7 @@ logger = get_logger(__name__) -def run_experiment(df: pd.DataFrame, config: "ExperimentConfig"): +def run_experiment(df: pd.DataFrame, config: "ExperimentConfig") -> pd.DataFrame: """Run a full experiment based on the provided configuration. Args: @@ -61,6 +67,9 @@ def run_experiment(df: pd.DataFrame, config: "ExperimentConfig"): def run_optimization(df: pd.DataFrame, config: "ExperimentConfig") -> List[str]: """Run the optimization phase of the experiment. + Configures all LLMs (downstream, meta, and judge) to use + the same instance, that is defined in `config.llm`. + Args: config (Config): Configuration object for the experiment. @@ -70,12 +79,12 @@ def run_optimization(df: pd.DataFrame, config: "ExperimentConfig") -> List[str]: llm = get_llm(config=config) predictor = get_predictor(llm, config=config) - config.task_description = config.task_description + " " + predictor.extraction_description + config.task_description = (config.task_description or "") + " " + (predictor.extraction_description or "") if config.optimizer == "capo" and (config.eval_strategy is None or "block" not in config.eval_strategy): logger.warning("📌 CAPO requires block evaluation strategy. Setting it to 'sequential_block'.") config.eval_strategy = "sequential_block" - task = get_task(df, config) + task = get_task(df, config, judge_llm=llm) optimizer = get_optimizer( predictor=predictor, meta_llm=llm, @@ -95,6 +104,9 @@ def run_optimization(df: pd.DataFrame, config: "ExperimentConfig") -> List[str]: def run_evaluation(df: pd.DataFrame, config: "ExperimentConfig", prompts: List[str]) -> pd.DataFrame: """Run the evaluation phase of the experiment. + Configures all LLMs (downstream, meta, and judge) to use + the same instance, that is defined in `config.llm`. + Args: df (pd.DataFrame): Input DataFrame containing the data. config (Config): Configuration object for the experiment. 
@@ -103,8 +115,8 @@ def run_evaluation(df: pd.DataFrame, config: "ExperimentConfig", prompts: List[s Returns: pd.DataFrame: A DataFrame containing the prompts and their scores. """ - task = get_task(df, config) llm = get_llm(config=config) + task = get_task(df, config, judge_llm=llm) predictor = get_predictor(llm, config=config) logger.warning("📊 Starting evaluation...") scores = task.evaluate(prompts, predictor, eval_strategy="full") @@ -114,7 +126,7 @@ def run_evaluation(df: pd.DataFrame, config: "ExperimentConfig", prompts: List[s return df -def get_llm(model_id: str = None, config: "ExperimentConfig" = None) -> "BaseLLM": +def get_llm(model_id: Optional[str] = None, config: Optional["ExperimentConfig"] = None) -> "BaseLLM": """Factory function to create and return a language model instance based on the provided model_id. This function supports three types of language models: @@ -132,19 +144,27 @@ def get_llm(model_id: str = None, config: "ExperimentConfig" = None) -> "BaseLLM Returns: An instance of LocalLLM, or APILLM based on the model_id. """ - if model_id is None: - model_id = config.model_id - if "local" in model_id: - model_id = "-".join(model_id.split("-")[1:]) - return LocalLLM(model_id, config) - if "vllm" in model_id: - model_id = "-".join(model_id.split("-")[1:]) - return VLLM(model_id, config=config) - - return APILLM(model_id=model_id, config=config) - - -def get_task(df: pd.DataFrame, config: "ExperimentConfig") -> "BaseTask": + final_model_id = model_id or (config.model_id if config else None) + if not final_model_id: + raise ValueError("model_id must be provided either directly or through config.") + + if "local" in final_model_id: + model_name = "-".join(final_model_id.split("-")[1:]) + return LocalLLM(model_name, config=config) + if "vllm" in final_model_id: + model_name = "-".join(final_model_id.split("-")[1:]) + return VLLM(model_name, config=config) + + return APILLM(model_id=final_model_id, config=config) + + +def get_task( + df: pd.DataFrame, + config: "ExperimentConfig", + task_type: Optional["TaskType"] = None, + judge_llm: Optional["BaseLLM"] = None, + reward_function: Optional[Callable] = None, +) -> "BaseTask": """Get the task based on the provided DataFrame and configuration. So far only ClassificationTask is supported. @@ -156,6 +176,21 @@ def get_task(df: pd.DataFrame, config: "ExperimentConfig") -> "BaseTask": Returns: BaseTask: An instance of a task class based on the provided DataFrame and configuration. """ + final_task_type = task_type or (config.task_type if config else None) + + if final_task_type == "reward": + if reward_function is None: + reward_function = config.reward_function if config else None + assert reward_function is not None, "Reward function must be provided for reward tasks." + return RewardTask( + df=df, + reward_function=reward_function, + config=config, + ) + elif final_task_type == "judge": + assert judge_llm is not None, "Judge LLM must be provided for judge tasks." 
+ return JudgeTask(df, judge_llm=judge_llm, config=config) + return ClassificationTask(df, config=config) @@ -163,10 +198,9 @@ def get_optimizer( predictor: "BasePredictor", meta_llm: "BaseLLM", task: "BaseTask", - optimizer: Literal["evopromptde", "evopromptga", "opro"] = None, - meta_prompt: str = None, - task_description: str = None, - config: "ExperimentConfig" = None, + optimizer: Optional["OptimizerType"] = None, + task_description: Optional[str] = None, + config: Optional["ExperimentConfig"] = None, ) -> "BaseOptimizer": """Creates and returns an optimizer instance based on provided parameters. @@ -185,22 +219,18 @@ def get_optimizer( Raises: ValueError: If an unknown optimizer type is specified """ - if optimizer is None: - optimizer = config.optimizer - if task_description is None: - task_description = config.task_description - if meta_prompt is None and hasattr(config, "meta_prompt"): - meta_prompt = config.meta_prompt - - if config.optimizer == "capo": + final_optimizer = optimizer or (config.optimizer if config else None) + final_task_description = task_description or (config.task_description if config else None) + + if final_optimizer == "capo": crossover_template = ( - CAPO_CROSSOVER_TEMPLATE.replace("", task_description) - if task_description + CAPO_CROSSOVER_TEMPLATE.replace("", final_task_description) + if final_task_description else CAPO_CROSSOVER_TEMPLATE ) mutation_template = ( - CAPO_MUTATION_TEMPLATE.replace("", task_description) - if task_description + CAPO_MUTATION_TEMPLATE.replace("", final_task_description) + if final_task_description else CAPO_MUTATION_TEMPLATE ) @@ -213,27 +243,29 @@ def get_optimizer( config=config, ) - if config.optimizer == "evopromptde": + if final_optimizer == "evopromptde": template = ( - EVOPROMPT_DE_TEMPLATE_TD.replace("", task_description) - if task_description + EVOPROMPT_DE_TEMPLATE_TD.replace("", final_task_description) + if final_task_description else EVOPROMPT_DE_TEMPLATE ) return EvoPromptDE(predictor=predictor, meta_llm=meta_llm, task=task, prompt_template=template, config=config) - if config.optimizer == "evopromptga": + if final_optimizer == "evopromptga": template = ( - EVOPROMPT_GA_TEMPLATE_TD.replace("", task_description) - if task_description + EVOPROMPT_GA_TEMPLATE_TD.replace("", final_task_description) + if final_task_description else EVOPROMPT_GA_TEMPLATE ) return EvoPromptGA(predictor=predictor, meta_llm=meta_llm, task=task, prompt_template=template, config=config) - if config.optimizer == "opro": - template = OPRO_TEMPLATE_TD.replace("", task_description) if task_description else OPRO_TEMPLATE + if final_optimizer == "opro": + template = ( + OPRO_TEMPLATE_TD.replace("", final_task_description) if final_task_description else OPRO_TEMPLATE + ) return OPRO(predictor=predictor, meta_llm=meta_llm, task=task, prompt_template=template, config=config) - raise ValueError(f"Unknown optimizer: {config.optimizer}") + raise ValueError(f"Unknown optimizer: {final_optimizer}") def get_exemplar_selector( @@ -260,9 +292,7 @@ def get_exemplar_selector( raise ValueError(f"Unknown exemplar selector: {name}") -def get_predictor( - downstream_llm=None, type: Literal["first_occurrence", "marker"] = "marker", *args, **kwargs -) -> "BasePredictor": +def get_predictor(downstream_llm=None, type: "PredictorType" = "marker", *args, **kwargs) -> "BasePredictor": """Factory function to create and return a predictor instance. 
This function supports three types of predictors: diff --git a/promptolution/llms/api_llm.py b/promptolution/llms/api_llm.py index 330b3c8..093478e 100644 --- a/promptolution/llms/api_llm.py +++ b/promptolution/llms/api_llm.py @@ -1,21 +1,21 @@ """Module to interface with various language models through their respective APIs.""" - try: import asyncio from openai import AsyncOpenAI + from openai.types.chat import ChatCompletion, ChatCompletionMessageParam import_successful = True except ImportError: import_successful = False -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Dict, List, Optional from promptolution.llms.base_llm import BaseLLM -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.utils.config import ExperimentConfig from promptolution.utils.logging import get_logger @@ -23,9 +23,21 @@ logger = get_logger(__name__) -async def _invoke_model(prompt, system_prompt, max_tokens, model_id, client, semaphore, max_retries=20, retry_delay=5): +async def _invoke_model( + prompt: str, + system_prompt: str, + max_tokens: int, + model_id: str, + client: AsyncOpenAI, + semaphore: asyncio.Semaphore, + max_retries: int = 20, + retry_delay: float = 5, +) -> ChatCompletion: async with semaphore: - messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}] + messages: List[ChatCompletionMessageParam] = [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": prompt}, + ] for attempt in range(max_retries + 1): # +1 for the initial attempt try: @@ -46,7 +58,8 @@ async def _invoke_model(prompt, system_prompt, max_tokens, model_id, client, sem else: # Log the final failure and re-raise the exception logger.error(f"❌ API call failed after {max_retries + 1} attempts: {str(e)}") - raise + raise # Re-raise the exception after all retries fail + raise RuntimeError("Failed to get response after multiple retries.") class APILLM(BaseLLM): @@ -65,13 +78,13 @@ class APILLM(BaseLLM): def __init__( self, - api_url: str = None, - model_id: str = None, - api_key: str = None, - max_concurrent_calls=50, - max_tokens=512, - config: "ExperimentConfig" = None, - ): + api_url: Optional[str] = None, + model_id: Optional[str] = None, + api_key: Optional[str] = None, + max_concurrent_calls: int = 50, + max_tokens: int = 512, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the APILLM with a specific model and API configuration. 
Args: @@ -103,14 +116,26 @@ def __init__( def _get_response(self, prompts: List[str], system_prompts: List[str]) -> List[str]: # Setup for async execution in sync context - loop = asyncio.get_event_loop() + try: + loop = asyncio.get_running_loop() + except RuntimeError: # 'get_running_loop' raises a RuntimeError if there is no running loop + loop = asyncio.new_event_loop() + asyncio.set_event_loop(loop) + responses = loop.run_until_complete(self._get_response_async(prompts, system_prompts)) return responses async def _get_response_async(self, prompts: List[str], system_prompts: List[str]) -> List[str]: + assert self.model_id is not None, "model_id must be set" tasks = [ _invoke_model(prompt, system_prompt, self.max_tokens, self.model_id, self.client, self.semaphore) for prompt, system_prompt in zip(prompts, system_prompts) ] - responses = await asyncio.gather(*tasks) - return [response.choices[0].message.content for response in responses] + messages = await asyncio.gather(*tasks) + responses = [] + for message in messages: + response = message.choices[0].message.content + if response is None: + raise ValueError("Received None response from the API.") + responses.append(response) + return responses diff --git a/promptolution/llms/base_llm.py b/promptolution/llms/base_llm.py index 704942b..2fe43f9 100644 --- a/promptolution/llms/base_llm.py +++ b/promptolution/llms/base_llm.py @@ -3,10 +3,11 @@ from abc import ABC, abstractmethod -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Dict, List, Optional, Union -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.utils.config import ExperimentConfig + from transformers import PreTrainedTokenizer from promptolution.optimizers.templates import DEFAULT_SYS_PROMPT from promptolution.utils.logging import get_logger @@ -24,9 +25,10 @@ class BaseLLM(ABC): config (LLMModelConfig): Configuration for the language model. input_token_count (int): Count of input tokens processed. output_token_count (int): Count of output tokens generated. + tokenizer (Optional[PreTrainedTokenizer]): The tokenizer for the model. """ - def __init__(self, config: "ExperimentConfig" = None): + def __init__(self, config: Optional["ExperimentConfig"] = None): """Initialize the LLM with a configuration or direct parameters. This constructor supports both config-based and direct parameter initialization @@ -40,8 +42,9 @@ def __init__(self, config: "ExperimentConfig" = None): # Initialize token counters self.input_token_count = 0 self.output_token_count = 0 + self.tokenizer: Optional[PreTrainedTokenizer] = None - def get_token_count(self): + def get_token_count(self) -> Dict[str, int]: """Get the current count of input and output tokens. Returns: @@ -53,12 +56,12 @@ def get_token_count(self): "total_tokens": self.input_token_count + self.output_token_count, } - def reset_token_count(self): + def reset_token_count(self) -> None: """Reset the token counters to zero.""" self.input_token_count = 0 self.output_token_count = 0 - def update_token_count(self, inputs: List[str], outputs: List[str]): + def update_token_count(self, inputs: List[str], outputs: List[str]) -> None: """Update the token count based on the given inputs and outputs. It uses a simple tokenization method (splitting by whitespace) to count tokens in the base class. 
@@ -72,7 +75,9 @@ def update_token_count(self, inputs: List[str], outputs: List[str]): self.input_token_count += input_tokens self.output_token_count += output_tokens - def get_response(self, prompts: List[str], system_prompts: List[str] = None) -> List[str]: + def get_response( + self, prompts: Union[str, List[str]], system_prompts: Optional[Union[str, List[str]]] = None + ) -> List[str]: """Generate responses for the given prompts. This method calls the _get_response method to generate responses @@ -98,7 +103,7 @@ def get_response(self, prompts: List[str], system_prompts: List[str] = None) -> return responses - def set_generation_seed(self, seed: int): + def set_generation_seed(self, seed: int) -> None: """Set the random seed for reproducibility per request. Args: diff --git a/promptolution/llms/local_llm.py b/promptolution/llms/local_llm.py index 33f489e..82874d8 100644 --- a/promptolution/llms/local_llm.py +++ b/promptolution/llms/local_llm.py @@ -1,20 +1,20 @@ """Module for running LLMs locally using the Hugging Face Transformers library.""" + try: import torch - import transformers + from transformers import Pipeline, pipeline imports_successful = True except ImportError: imports_successful = False -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from promptolution.utils.config import ExperimentConfig - +from typing import TYPE_CHECKING, Dict, List, Optional from promptolution.llms.base_llm import BaseLLM +if TYPE_CHECKING: # pragma: no cover + from promptolution.utils.config import ExperimentConfig + class LocalLLM(BaseLLM): """A class for running language models locally using the Hugging Face Transformers library. @@ -29,7 +29,7 @@ class LocalLLM(BaseLLM): get_response: Generate responses for a list of prompts. """ - def __init__(self, model_id: str, batch_size: int = 8, config: "ExperimentConfig" = None): + def __init__(self, model_id: str, batch_size: int = 8, config: Optional["ExperimentConfig"] = None) -> None: """Initialize the LocalLLM with a specific model. Args: @@ -46,7 +46,7 @@ def __init__(self, model_id: str, batch_size: int = 8, config: "ExperimentConfig "Could not import at least one of the required libraries: torch, transformers. " "Please ensure they are installed in your environment." ) - self.pipeline = transformers.pipeline( + self.pipeline: Pipeline = pipeline( "text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, @@ -56,11 +56,14 @@ def __init__(self, model_id: str, batch_size: int = 8, config: "ExperimentConfig num_return_sequences=1, return_full_text=False, ) - self.pipeline.tokenizer.pad_token_id = self.pipeline.tokenizer.eos_token_id - self.pipeline.tokenizer.padding_side = "left" super().__init__(config) + self.tokenizer = self.pipeline.tokenizer + assert self.tokenizer is not None, "Tokenizer must be initialized." + self.eos_token_id = self.tokenizer.eos_token_id + self.tokenizer.pad_token_id = self.eos_token_id + self.tokenizer.padding_side = "left" - def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[str]: + def _get_response(self, prompts: List[str], system_prompts: List[str]) -> List[str]: """Generate responses for a list of prompts using the local language model. Args: @@ -74,12 +77,12 @@ def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[s This method uses torch.no_grad() for inference to reduce memory usage. It handles both single and batch inputs, ensuring consistent output format. 
""" - inputs = [] + inputs: List[List[Dict[str, str]]] = [] for prompt, sys_prompt in zip(prompts, system_prompts): inputs.append([{"role": "system", "prompt": sys_prompt}, {"role": "user", "prompt": prompt}]) with torch.no_grad(): - response = self.pipeline(inputs, pad_token_id=self.pipeline.tokenizer.eos_token_id) + response = self.pipeline(inputs, pad_token_id=self.eos_token_id) if len(response) != 1: response = [r[0] if isinstance(r, list) else r for r in response] @@ -87,7 +90,9 @@ def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[s response = [r["generated_text"] for r in response] return response - def __del__(self): + def __del__(self) -> None: """Cleanup method to delete the pipeline and free up GPU memory.""" - del self.pipeline - torch.cuda.empty_cache() + if hasattr(self, "pipeline"): + del self.pipeline + if "torch" in globals() and hasattr(torch, "cuda") and torch.cuda.is_available(): + torch.cuda.empty_cache() diff --git a/promptolution/llms/vllm.py b/promptolution/llms/vllm.py index 1806f6b..1df5121 100644 --- a/promptolution/llms/vllm.py +++ b/promptolution/llms/vllm.py @@ -1,9 +1,9 @@ """Module for running language models locally using the vLLM library.""" -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Any, Dict, List, Optional -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.utils.config import ExperimentConfig @@ -13,7 +13,8 @@ logger = get_logger(__name__) try: - from transformers import AutoTokenizer + from transformers import AutoTokenizer # type: ignore + from transformers import PreTrainedTokenizer from vllm import LLM, SamplingParams imports_successful = True @@ -29,7 +30,7 @@ class VLLM(BaseLLM): Attributes: llm (vllm.LLM): The vLLM inference engine. - tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model. + tokenizer (PreTrainedTokenizer): The tokenizer for the model. sampling_params (vllm.SamplingParams): Parameters for text generation. Methods: @@ -37,23 +38,25 @@ class VLLM(BaseLLM): update_token_count: Update the token count based on the given inputs and outputs. """ + tokenizer: PreTrainedTokenizer + def __init__( self, model_id: str, - batch_size: int | None = None, + batch_size: Optional[int] = None, max_generated_tokens: int = 256, temperature: float = 0.1, top_p: float = 0.9, - model_storage_path: str | None = None, + model_storage_path: Optional[str] = None, dtype: str = "auto", tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.95, max_model_len: int = 2048, trust_remote_code: bool = False, seed: int = 42, - llm_kwargs: dict = None, - config: "ExperimentConfig" = None, - ): + llm_kwargs: Optional[Dict[str, Any]] = None, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the VLLM with a specific model. 
Args: @@ -87,15 +90,16 @@ def __init__( self.max_model_len = max_model_len self.trust_remote_code = trust_remote_code + super().__init__(config) + # Configure sampling parameters self.sampling_params = SamplingParams( temperature=temperature, top_p=top_p, max_tokens=max_generated_tokens, seed=seed ) - if llm_kwargs is None: - llm_kwargs = {} + llm_kwargs = llm_kwargs or {} # Initialize the vLLM engine with both explicit parameters and any additional kwargs - llm_params = { + llm_params: Dict[str, Any] = { "model": model_id, "tokenizer": model_id, "dtype": self.dtype, @@ -110,19 +114,27 @@ def __init__( self.llm = LLM(**llm_params) + # Initialize tokenizer separately for potential pre-processing + self.tokenizer = AutoTokenizer.from_pretrained(model_id) + if batch_size is None: cache_config = self.llm.llm_engine.model_executor.cache_config - self.batch_size = int((cache_config.gpu_blocks * cache_config.block_size / self.max_model_len) * 0.95) - logger.info(f"🚀 Batch size set to {self.batch_size} based on GPU memory.") + if ( + cache_config.num_gpu_blocks is not None + and cache_config.block_size is not None + and self.max_model_len is not None + ): + self.batch_size = int( + (cache_config.num_gpu_blocks * cache_config.block_size / self.max_model_len) * 0.95 + ) + logger.info(f"🚀 Batch size set to {self.batch_size} based on GPU memory.") + else: + self.batch_size = 1 + logger.warning("⚠️ Could not determine batch size from GPU memory. Using batch size of 1.") else: self.batch_size = batch_size - # Initialize tokenizer separately for potential pre-processing - self.tokenizer = AutoTokenizer.from_pretrained(model_id) - - super().__init__(config) - - def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[str]: + def _get_response(self, prompts: List[str], system_prompts: List[str]) -> List[str]: """Generate responses for a list of prompts using the vLLM engine. Args: @@ -137,16 +149,18 @@ def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[s It also counts input and output tokens. """ prompts = [ - self.tokenizer.apply_chat_template( - [ - { - "role": "system", - "content": sys_prompt, - }, - {"role": "user", "content": prompt}, - ], - tokenize=False, - add_generation_prompt=True, + str( + self.tokenizer.apply_chat_template( + [ + { + "role": "system", + "content": sys_prompt, + }, + {"role": "user", "content": prompt}, + ], + tokenize=False, + add_generation_prompt=True, + ) ) for prompt, sys_prompt in zip(prompts, system_prompts) ] @@ -162,7 +176,7 @@ def _get_response(self, prompts: list[str], system_prompts: list[str]) -> list[s return all_responses - def update_token_count(self, inputs: List[str], outputs: List[str]): + def update_token_count(self, inputs: List[str], outputs: List[str]) -> None: """Update the token count based on the given inputs and outputs. Uses the tokenizer to count the tokens. @@ -177,7 +191,7 @@ def update_token_count(self, inputs: List[str], outputs: List[str]): for output in outputs: self.output_token_count += len(self.tokenizer.encode(output)) - def set_generation_seed(self, seed): + def set_generation_seed(self, seed: int) -> None: """Set the random seed for text generation. 
Args: diff --git a/promptolution/optimizers/base_optimizer.py b/promptolution/optimizers/base_optimizer.py index 710701a..ded87e5 100644 --- a/promptolution/optimizers/base_optimizer.py +++ b/promptolution/optimizers/base_optimizer.py @@ -3,16 +3,20 @@ from abc import ABC, abstractmethod -from typing import TYPE_CHECKING, Callable, List +from typing import TYPE_CHECKING, List, Literal, Optional -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.tasks.base_task import BaseTask + from promptolution.predictors.base_predictor import BasePredictor from promptolution.utils.config import ExperimentConfig + from promptolution.utils.callbacks import BaseCallback from promptolution.utils.logging import get_logger logger = get_logger(__name__) +OptimizerType = Literal["evopromptde", "evopromptga", "opro", "capo"] + class BaseOptimizer(ABC): """Abstract base class for prompt optimizers. @@ -22,33 +26,34 @@ class BaseOptimizer(ABC): Attributes: config (ExperimentConfig, optional): Configuration for the optimizer, overriding defaults. prompts (List[str]): List of current prompts being optimized. - task (BaseTask): The task object used for evaluating prompts. + task (BaseTask): The task object for evaluating prompts. callbacks (List[Callable]): List of callback functions to be called during optimization. predictor: The predictor used for prompt evaluation (if applicable). """ def __init__( self, - predictor, + predictor: "BasePredictor", task: "BaseTask", - initial_prompts: List[str], - callbacks: List[Callable] = None, - config: "ExperimentConfig" = None, - ): + initial_prompts: Optional[List[str]] = None, + callbacks: Optional[List["BaseCallback"]] = None, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the optimizer with a configuration and/or direct parameters. Args: - initial_prompts: Initial set of prompts to start optimization with. task: Task object for prompt evaluation. - callbacks: List of callback functions. predictor: Predictor for prompt evaluation. + initial_prompts: Initial set of prompts to start optimization with. + callbacks: List of callback functions. config (ExperimentConfig, optional): Configuration for the optimizer, overriding defaults. """ # Set up optimizer state - self.prompts = initial_prompts + self.prompts: List[str] = initial_prompts or [] self.task = task - self.callbacks = callbacks or [] + self.callbacks: List["BaseCallback"] = callbacks or [] self.predictor = predictor + self.scores: List[float] = [] if config is not None: config.apply_to(self) @@ -79,7 +84,7 @@ def optimize(self, n_steps: int) -> List[str]: # exit training loop and gracefully fail logger.error(f"⛔ Error during optimization step: {e}") logger.error("⚠️ Exiting optimization loop.") - continue_optimization = False + break # Callbacks at the end of each step continue_optimization = self._on_step_end() @@ -91,7 +96,7 @@ def optimize(self, n_steps: int) -> List[str]: return self.prompts @abstractmethod - def _pre_optimization_loop(self): + def _pre_optimization_loop(self) -> None: """Prepare for the optimization loop. 
This method should be implemented by concrete optimizer classes to define @@ -111,15 +116,16 @@ def _step(self) -> List[str]: """ pass - def _on_step_end(self): + def _on_step_end(self) -> bool: """Call all registered callbacks at the end of each optimization step.""" continue_optimization = True for callback in self.callbacks: - continue_optimization &= callback.on_step_end(self) # if any callback returns False, end the optimization + if not callback.on_step_end(self): + continue_optimization = False return continue_optimization - def _on_train_end(self): + def _on_train_end(self) -> None: """Call all registered callbacks at the end of the entire optimization process.""" for callback in self.callbacks: callback.on_train_end(self) diff --git a/promptolution/optimizers/capo.py b/promptolution/optimizers/capo.py index 60f8a12..bcfa275 100644 --- a/promptolution/optimizers/capo.py +++ b/promptolution/optimizers/capo.py @@ -6,9 +6,12 @@ import numpy as np import pandas as pd -from typing import TYPE_CHECKING, Callable, List, Tuple +from typing import TYPE_CHECKING, Any, List, Optional, Tuple -if TYPE_CHECKING: +from promptolution.utils.formatting import extract_from_tag + +if TYPE_CHECKING: # pragma: no cover + from promptolution.utils.callbacks import BaseCallback from promptolution.llms.base_llm import BaseLLM from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask @@ -32,7 +35,7 @@ class CAPOPrompt: """Represents a prompt consisting of an instruction and few-shot examples.""" - def __init__(self, instruction_text: str, few_shots: List[str]): + def __init__(self, instruction_text: str, few_shots: List[str]) -> None: """Initializes the Prompt with an instruction and associated examples. Args: @@ -57,7 +60,7 @@ def construct_prompt(self) -> str: ) return prompt - def __str__(self): + def __str__(self) -> str: """Returns the string representation of the prompt.""" return self.construct_prompt() @@ -76,19 +79,21 @@ def __init__( predictor: "BasePredictor", task: "BaseTask", meta_llm: "BaseLLM", - initial_prompts: List[str] = None, + initial_prompts: Optional[List[str]] = None, crossovers_per_iter: int = 4, upper_shots: int = 5, max_n_blocks_eval: int = 10, test_statistic: "TestStatistics" = "paired_t_test", alpha: float = 0.2, length_penalty: float = 0.05, - df_few_shots: pd.DataFrame = None, - crossover_template: str = None, - mutation_template: str = None, - callbacks: List[Callable] = [], - config: "ExperimentConfig" = None, - ): + check_fs_accuracy: bool = True, + create_fs_reasoning: bool = True, + df_few_shots: Optional[pd.DataFrame] = None, + crossover_template: Optional[str] = None, + mutation_template: Optional[str] = None, + callbacks: Optional[List["BaseCallback"]] = None, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initializes the CAPOptimizer with various parameters for prompt evolution. Args: @@ -103,6 +108,10 @@ def __init__( test_statistic (TestStatistics): Statistical test to compare prompt performance. Default is "paired_t_test". alpha (float): Significance level for the statistical test. length_penalty (float): Penalty factor for prompt length. + check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt. + In cases such as reward tasks, this can be set to False, as no ground truth is available. Default is True. 
+ create_fs_reasoning (bool): Whether to create reasoning for few-shot examples using the downstream model, + instead of simply using input-output pairs from the few shots DataFrame. Default is True. df_few_shots (pd.DataFrame): DataFrame containing few-shot examples. If None, will pop 10% of datapoints from task. crossover_template (str, optional): Template for crossover instructions. mutation_template (str, optional): Template for mutation instructions. @@ -124,7 +133,10 @@ def __init__( self.length_penalty = length_penalty self.token_counter = get_token_counter(self.downstream_llm) - self.scores = np.empty(0) + self.check_fs_accuracy = check_fs_accuracy + self.create_fs_reasoning = create_fs_reasoning + + self.scores: List[float] = [] super().__init__(predictor, task, initial_prompts, callbacks, config) self.df_few_shots = df_few_shots if df_few_shots is not None else task.pop_datapoints(frac=0.1) if self.max_n_blocks_eval > self.task.n_blocks: @@ -159,12 +171,12 @@ def _initialize_population(self, initial_prompts: List[str]) -> List[CAPOPrompt] return population - def _create_few_shot_examples(self, instruction: str, num_examples: int) -> List[Tuple[str, str]]: + def _create_few_shot_examples(self, instruction: str, num_examples: int) -> List[str]: if num_examples == 0: return [] few_shot_samples = self.df_few_shots.sample(num_examples, replace=False) - sample_inputs = few_shot_samples[self.task.x_column].values + sample_inputs = few_shot_samples[self.task.x_column].values.astype(str) sample_targets = few_shot_samples[self.task.y_column].values few_shots = [ CAPO_FEWSHOT_TEMPLATE.replace("", i).replace( @@ -172,19 +184,27 @@ def _create_few_shot_examples(self, instruction: str, num_examples: int) -> List ) for i, t in zip(sample_inputs, sample_targets) ] - # Select partition of the examples to generate reasoning from downstream model + + if not self.create_fs_reasoning: + # If we do not create reasoning, return the few-shot examples directly + return few_shots + preds, seqs = self.predictor.predict( [instruction] * num_examples, - sample_inputs, + list(sample_inputs), return_seq=True, ) + if isinstance(seqs, str): + seqs = [seqs] + if isinstance(preds, str): + preds = [preds] # Check which predictions are correct and get a single one per example for j in range(num_examples): # Process and clean up the generated sequences seqs[j] = seqs[j].replace(sample_inputs[j], "").strip() # Check if the prediction is correct and add reasoning if so - if preds[j] == sample_targets[j]: + if preds[j] == sample_targets[j] or not self.check_fs_accuracy: few_shots[j] = CAPO_FEWSHOT_TEMPLATE.replace("", sample_inputs[j]).replace("", seqs[j]) return few_shots @@ -211,14 +231,14 @@ def _crossover(self, parents: List[CAPOPrompt]) -> List[CAPOPrompt]: crossover_prompts.append(crossover_prompt) combined_few_shots = mother.few_shots + father.few_shots num_few_shots = (len(mother.few_shots) + len(father.few_shots)) // 2 - offspring_few_shot = random.sample(combined_few_shots, num_few_shots) + offspring_few_shot = random.sample(combined_few_shots, num_few_shots) if combined_few_shots else [] offspring_few_shots.append(offspring_few_shot) child_instructions = self.meta_llm.get_response(crossover_prompts) offsprings = [] for instruction, examples in zip(child_instructions, offspring_few_shots): - instruction = instruction.split("")[-1].split("")[0].strip() + instruction = extract_from_tag(instruction, "", "") offsprings.append(CAPOPrompt(instruction, examples)) return offsprings @@ -240,13 +260,14 @@ def 
_mutate(self, offsprings: List[CAPOPrompt]) -> List[CAPOPrompt]: mutated = [] for new_instruction, prompt in zip(new_instructions, offsprings): - new_instruction = new_instruction.split("")[-1].split("")[0].strip() + new_instruction = extract_from_tag(new_instruction, "", "") p = random.random() + new_few_shots: List[str] if p < 1 / 3 and len(prompt.few_shots) < self.upper_shots: # add a random few shot new_few_shot = self._create_few_shot_examples(new_instruction, 1) new_few_shots = prompt.few_shots + new_few_shot - if 1 / 3 <= p < 2 / 3 and len(prompt.few_shots) > 0: # remove a random few shot + elif 1 / 3 <= p < 2 / 3 and len(prompt.few_shots) > 0: # remove a random few shot new_few_shots = random.sample(prompt.few_shots, len(prompt.few_shots) - 1) else: # do not change few shots, but shuffle new_few_shots = prompt.few_shots @@ -267,11 +288,11 @@ def _do_racing(self, candidates: List[CAPOPrompt], k: int) -> List[CAPOPrompt]: List[Prompt]: List of surviving prompts after racing. """ self.task.reset_block_idx() - block_scores = [] + block_scores: List[List[float]] = [] i = 0 while len(candidates) > k and i < self.max_n_blocks_eval: # new_scores shape: (n_candidates, n_samples) - new_scores = self.task.evaluate( + new_scores: List[float] = self.task.evaluate( [c.construct_prompt() for c in candidates], self.predictor, return_agg_scores=False ) @@ -279,7 +300,10 @@ def _do_racing(self, candidates: List[CAPOPrompt], k: int) -> List[CAPOPrompt]: prompt_lengths = np.array([self.token_counter(c.construct_prompt()) for c in candidates]) rel_prompt_lengths = prompt_lengths / self.max_prompt_length - new_scores = new_scores - self.length_penalty * rel_prompt_lengths[:, None] + penalized_new_scores = np.array(new_scores) - self.length_penalty * rel_prompt_lengths[:, None] + + new_scores = penalized_new_scores.tolist() + block_scores.append(new_scores) scores = np.concatenate(block_scores, axis=1) @@ -292,8 +316,9 @@ def _do_racing(self, candidates: List[CAPOPrompt], k: int) -> List[CAPOPrompt]: n_better = np.sum(comparison_matrix, axis=1) # Create mask for survivors and filter candidates - candidates = list(compress(candidates, n_better < k)) - block_scores = [bs[n_better < k] for bs in block_scores] + survivor_mask = n_better < k + candidates = list(compress(candidates, survivor_mask)) + block_scores = list(compress(block_scores, survivor_mask)) i += 1 self.task.increment_block_idx() @@ -301,16 +326,16 @@ def _do_racing(self, candidates: List[CAPOPrompt], k: int) -> List[CAPOPrompt]: avg_scores = self.task.evaluate( [c.construct_prompt() for c in candidates], self.predictor, eval_strategy="evaluated" ) - order = np.argsort(-avg_scores)[:k] + order = np.argsort(-np.array(avg_scores))[:k] candidates = [candidates[i] for i in order] - self.scores = avg_scores[order] + self.scores = [avg_scores[i] for i in order] return candidates - def _pre_optimization_loop(self): + def _pre_optimization_loop(self) -> None: self.prompt_objects = self._initialize_population(self.prompts) self.prompts = [p.construct_prompt() for p in self.prompt_objects] - self.max_prompt_length = max(self.token_counter(p) for p in self.prompts) + self.max_prompt_length = max(self.token_counter(p) for p in self.prompts) if self.prompts else 1 self.task.reset_block_idx() def _step(self) -> List[str]: diff --git a/promptolution/optimizers/evoprompt_de.py b/promptolution/optimizers/evoprompt_de.py index 426e973..0412d5d 100644 --- a/promptolution/optimizers/evoprompt_de.py +++ b/promptolution/optimizers/evoprompt_de.py @@ -3,11 +3,12 @@ 
import numpy as np -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Any, List, Optional from promptolution.optimizers.base_optimizer import BaseOptimizer +from promptolution.utils.formatting import extract_from_tag -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.llms.base_llm import BaseLLM from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask @@ -43,11 +44,11 @@ def __init__( task: "BaseTask", prompt_template: str, meta_llm: "BaseLLM", - initial_prompts: List[str] = None, + initial_prompts: Optional[List[str]] = None, donor_random: bool = False, - callbacks: List["BaseCallback"] = None, - config: "ExperimentConfig" = None, - ): + callbacks: Optional[List["BaseCallback"]] = None, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the EvoPromptDE optimizer.""" self.prompt_template = prompt_template self.donor_random = donor_random @@ -56,7 +57,7 @@ def __init__( predictor=predictor, task=task, initial_prompts=initial_prompts, callbacks=callbacks, config=config ) - def _pre_optimization_loop(self): + def _pre_optimization_loop(self) -> None: self.scores = self.task.evaluate(self.prompts, self.predictor, return_agg_scores=True) self.prompts = [prompt for _, prompt in sorted(zip(self.scores, self.prompts), reverse=True)] self.scores = sorted(self.scores, reverse=True) @@ -94,7 +95,7 @@ def _step(self) -> List[str]: meta_prompts.append(meta_prompt) child_prompts = self.meta_llm.get_response(meta_prompts) - child_prompts = [prompt.split("")[-1].split("")[0].strip() for prompt in child_prompts] + child_prompts = extract_from_tag(child_prompts, "", "") child_scores = self.task.evaluate(child_prompts, self.predictor, return_agg_scores=True) diff --git a/promptolution/optimizers/evoprompt_ga.py b/promptolution/optimizers/evoprompt_ga.py index 6fc1215..91cc6a7 100644 --- a/promptolution/optimizers/evoprompt_ga.py +++ b/promptolution/optimizers/evoprompt_ga.py @@ -3,17 +3,18 @@ import numpy as np -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Any, List, Optional from promptolution.optimizers.base_optimizer import BaseOptimizer -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.llms.base_llm import BaseLLM from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask from promptolution.utils.callbacks import BaseCallback from promptolution.utils.config import ExperimentConfig +from promptolution.utils.formatting import extract_from_tag from promptolution.utils.logging import get_logger logger = get_logger(__name__) @@ -49,11 +50,11 @@ def __init__( task: "BaseTask", prompt_template: str, meta_llm: "BaseLLM", - initial_prompts: List[str] = None, + initial_prompts: Optional[List[str]] = None, selection_mode: str = "wheel", - callbacks: List["BaseCallback"] = None, - config: "ExperimentConfig" = None, - ): + callbacks: Optional[List["BaseCallback"]] = None, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the EvoPromptGA optimizer.""" self.prompt_template = prompt_template self.meta_llm = meta_llm @@ -63,8 +64,8 @@ def __init__( ) assert self.selection_mode in ["random", "wheel", "tour"], "Invalid selection mode." 
- def _pre_optimization_loop(self): - self.scores = self.task.evaluate(self.prompts, self.predictor, return_agg_scores=True).tolist() + def _pre_optimization_loop(self) -> None: + self.scores = self.task.evaluate(self.prompts, self.predictor, return_agg_scores=True) # sort prompts by score self.prompts = [prompt for _, prompt in sorted(zip(self.scores, self.prompts), reverse=True)] self.scores = sorted(self.scores, reverse=True) @@ -73,7 +74,7 @@ def _step(self) -> List[str]: new_prompts = self._crossover(self.prompts, self.scores) prompts = self.prompts + new_prompts - new_scores = self.task.evaluate(new_prompts, self.predictor, return_agg_scores=True).tolist() + new_scores = self.task.evaluate(new_prompts, self.predictor, return_agg_scores=True) scores = self.scores + new_scores @@ -83,7 +84,7 @@ def _step(self) -> List[str]: return self.prompts - def _crossover(self, prompts, scores) -> str: + def _crossover(self, prompts: List[str], scores: List[float]) -> List[str]: """Perform crossover operation to generate new child prompts. This method selects parent prompts based on the chosen selection mode, @@ -126,6 +127,6 @@ def _crossover(self, prompts, scores) -> str: meta_prompts.append(meta_prompt) child_prompts = self.meta_llm.get_response(meta_prompts) - child_prompts = [prompt.split("")[-1].split("")[0].strip() for prompt in child_prompts] + child_prompts = extract_from_tag(child_prompts, "", "") return child_prompts diff --git a/promptolution/optimizers/opro.py b/promptolution/optimizers/opro.py index 0e3f892..864da31 100644 --- a/promptolution/optimizers/opro.py +++ b/promptolution/optimizers/opro.py @@ -3,12 +3,13 @@ import numpy as np -from typing import TYPE_CHECKING, List, Optional +from typing import TYPE_CHECKING, Any, List, Optional from promptolution.optimizers.base_optimizer import BaseOptimizer from promptolution.optimizers.templates import OPRO_TEMPLATE +from promptolution.utils.formatting import extract_from_tag -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.llms.base_llm import BaseLLM from promptolution.predictors.base_predictor import BasePredictor from promptolution.tasks.base_task import BaseTask @@ -30,14 +31,14 @@ def __init__( self, predictor: "BasePredictor", task: "BaseTask", - prompt_template: Optional[str], meta_llm: "BaseLLM", - initial_prompts: List[str] = None, + initial_prompts: Optional[List[str]] = None, + prompt_template: Optional[str] = None, max_num_instructions: int = 20, num_instructions_per_step: int = 8, num_few_shots: int = 3, - callbacks: List["BaseCallback"] = None, - config: "ExperimentConfig" = None, + callbacks: Optional[List["BaseCallback"]] = None, + config: Optional["ExperimentConfig"] = None, ) -> None: """Initialize the OPRO optimizer. 
@@ -54,8 +55,7 @@ def __init__( config: "ExperimentConfig" overwriting default parameters """ self.meta_llm = meta_llm - - self.meta_prompt_template = prompt_template if prompt_template else OPRO_TEMPLATE + self.meta_prompt_template = prompt_template or OPRO_TEMPLATE self.max_num_instructions = max_num_instructions self.num_instructions_per_step = num_instructions_per_step self.num_few_shots = num_few_shots @@ -70,8 +70,8 @@ def _sample_examples(self) -> str: Formatted string of few-shot examples with inputs and expected outputs """ idx = np.random.choice(len(self.task.xs), self.num_few_shots) - sample_x = self.task.xs[idx] - sample_y = self.task.ys[idx] + sample_x = [self.task.xs[i] for i in idx] + sample_y = [self.task.ys[i] for i in idx] return "\n".join([f"Input: {x}\nOutput: {y}" for x, y in zip(sample_x, sample_y)]) @@ -119,7 +119,7 @@ def _step(self) -> List[str]: response = self.meta_llm.get_response([self.meta_prompt])[0] - prompt = response.split("")[-1].split("")[0].strip() + prompt = extract_from_tag(response, "", "") if prompt in self.prompts: duplicate_prompts += 1 diff --git a/promptolution/predictors/base_predictor.py b/promptolution/predictors/base_predictor.py index ffcfc15..a345872 100644 --- a/promptolution/predictors/base_predictor.py +++ b/promptolution/predictors/base_predictor.py @@ -3,28 +3,23 @@ from abc import ABC, abstractmethod -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, List, Literal, Optional, Tuple, Union from promptolution.llms.base_llm import BaseLLM -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover from promptolution.utils.config import ExperimentConfig -import numpy as np +PredictorType = Literal["first_occurrence", "marker"] class BasePredictor(ABC): """Abstract base class for predictors in the promptolution library. This class defines the interface that all concrete predictor implementations should follow. - - Attributes: - llm: The language model used for generating predictions. - classes (List[str]): The list of valid class labels. - config (ExperimentConfig): Experiment configuration overwriting defaults """ - def __init__(self, llm: "BaseLLM", config: "ExperimentConfig" = None): + def __init__(self, llm: "BaseLLM", config: Optional["ExperimentConfig"] = None) -> None: """Initialize the predictor with a language model and configuration. Args: @@ -32,17 +27,17 @@ def __init__(self, llm: "BaseLLM", config: "ExperimentConfig" = None): config: Configuration for the predictor. """ self.llm = llm - + self.extraction_description = "" if config is not None: config.apply_to(self) def predict( self, - prompts: List[str], - xs: np.ndarray, - system_prompts: List[str] = None, + prompts: Union[str, List[str]], + xs: List[str], + system_prompts: Optional[Union[str, List[str]]] = None, return_seq: bool = False, - ) -> np.ndarray: + ) -> Union[List[str], Tuple[List[str], List[str]]]: """Abstract method to make predictions based on prompts and input data. Args: @@ -63,18 +58,18 @@ def predict( if return_seq: seqs = [f"{x}\n{out}" for x, out in zip(xs, outputs)] - seqs = np.array(seqs) + return preds, seqs - return preds if not return_seq else (preds, seqs) + return preds @abstractmethod - def _extract_preds(self, preds: List[str]) -> np.ndarray: + def _extract_preds(self, preds: List[str]) -> List[str]: """Extract class labels from the predictions, based on the list of valid class labels. Args: preds: The raw predictions from the language model. Returns: - np.ndarray: Extracted predictions. 
+ List[str]: Extracted class labels from the predictions. """ raise NotImplementedError diff --git a/promptolution/predictors/classifier.py b/promptolution/predictors/classifier.py index 3fca57f..2a4fa00 100644 --- a/promptolution/predictors/classifier.py +++ b/promptolution/predictors/classifier.py @@ -3,11 +3,13 @@ import numpy as np -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Any, List, Optional from promptolution.predictors.base_predictor import BasePredictor +from promptolution.utils.formatting import extract_from_tag -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover + from promptolution.llms.base_llm import BaseLLM from promptolution.utils.config import ExperimentConfig @@ -29,7 +31,7 @@ class FirstOccurrenceClassifier(BasePredictor): BasePredictor: The base class for predictors in the promptolution library. """ - def __init__(self, llm, classes, config: "ExperimentConfig" = None): + def __init__(self, llm: "BaseLLM", classes: List[str], config: Optional["ExperimentConfig"] = None) -> None: """Initialize the FirstOccurrenceClassifier. Args: @@ -47,13 +49,13 @@ def __init__(self, llm, classes, config: "ExperimentConfig" = None): super().__init__(llm, config) - def _extract_preds(self, preds: List[str]) -> np.ndarray: + def _extract_preds(self, preds: List[str]) -> List[str]: """Extract class labels from the predictions, based on the list of valid class labels. Args: preds: The raw predictions from the language model. """ - response = [] + result = [] for pred in preds: predicted_class = self.classes[0] # use first class as default pred for word in pred.split(): @@ -62,10 +64,9 @@ def _extract_preds(self, preds: List[str]) -> np.ndarray: predicted_class = word break - response.append(predicted_class) + result.append(predicted_class) - response = np.array(response) - return response + return result class MarkerBasedClassifier(BasePredictor): @@ -85,12 +86,12 @@ class MarkerBasedClassifier(BasePredictor): def __init__( self, - llm, - classes=None, - begin_marker="", - end_marker="", - config: "ExperimentConfig" = None, - ): + llm: "BaseLLM", + classes: Optional[List[str]] = None, + begin_marker: str = "", + end_marker: str = "", + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the MarkerBasedClassifier. Args: @@ -108,7 +109,7 @@ def __init__( assert all([c.islower() for c in classes]), "Class labels should be lowercase." self.extraction_description = ( - f"The task is to classify the texts into one of those classes: {','.join(classes)}." + f"The task is to classify the texts into one of those classes: {', '.join(classes)}." f"The class label is extracted from the text that are between these markers: {begin_marker} and {end_marker}." ) else: @@ -116,19 +117,18 @@ def __init__( super().__init__(llm, config) - def _extract_preds(self, preds: List[str]) -> np.ndarray: + def _extract_preds(self, preds: List[str]) -> List[str]: """Extract class labels from the predictions, by extracting the text following the marker. Args: preds: The raw predictions from the language model. 
""" - response = [] + result = [] for pred in preds: - pred = pred.split(self.begin_marker)[-1].split(self.end_marker)[0].strip().lower() + pred = extract_from_tag(pred, self.begin_marker, self.end_marker).lower() if self.classes is not None and pred not in self.classes: pred = self.classes[0] - response.append(pred) + result.append(pred) - response = np.array(response) - return response + return result diff --git a/promptolution/tasks/base_task.py b/promptolution/tasks/base_task.py index 418ab87..2e7fa89 100644 --- a/promptolution/tasks/base_task.py +++ b/promptolution/tasks/base_task.py @@ -4,41 +4,350 @@ from abc import ABC, abstractmethod import numpy as np +import pandas as pd -from typing import TYPE_CHECKING, List +from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Union, overload -if TYPE_CHECKING: +if TYPE_CHECKING: # pragma: no cover + from promptolution.predictors.base_predictor import BasePredictor from promptolution.utils.config import ExperimentConfig +TaskType = Literal["classification", "reward", "judge"] +EvalStrategy = Literal["full", "subsample", "sequential_block", "random_block"] + + class BaseTask(ABC): - """Abstract base class for tasks in the promptolution library. + """Abstract base class for tasks in the promptolution library.""" - This class defines the interface that all concrete task implementations should follow. + def __init__( + self, + df: pd.DataFrame, + x_column: str, + y_column: Optional[str] = None, + task_description: Optional[str] = None, + n_subsamples: int = 30, + eval_strategy: "EvalStrategy" = "full", + seed: int = 42, + config: Optional["ExperimentConfig"] = None, + ) -> None: + """Initialize the BaseTask. - Methods: - evaluate: An abstract method that should be implemented by subclasses - to evaluate prompts using a given predictor. - """ + Args: + df (pd.DataFrame): The input DataFrame containing the data. + x_column (str): Name of the column containing input texts. + y_column (Optional[str]): Name of the column containing labels/ground truth (if applicable). + task_description (str): Description of the task. + n_subsamples (int): Number of subsamples to use for evaluation. + eval_strategy (Literal): Subsampling strategy ("full", "subsample", "sequential_block", "random_block", "evaluated"). + seed (int): Random seed for reproducibility. + config (ExperimentConfig, optional): Configuration for the task, overriding defaults. 
+ """ + self.df = df + self.x_column = x_column + self.y_column = y_column + self.task_description = task_description + self.n_subsamples = n_subsamples + self.eval_strategy = eval_strategy + self.seed = seed - def __init__(self, config: "ExperimentConfig" = None): - """Initialize the BaseTask.""" + super().__init__() if config is not None: config.apply_to(self) + self.xs: List[str] = df[self.x_column].values.astype(str).tolist() + self.has_y = y_column is not None + if self.has_y and y_column is not None: + self.ys: List[str] = df[y_column].values.astype(str).tolist() + else: + # If no y_column is provided, create a dummy y array + self.ys = [""] * len(self.xs) + + self.block_idx = 0 + self.n_blocks = len(self.xs) // self.n_subsamples if self.n_subsamples > 0 else 1 + self.rng = np.random.default_rng(seed) + + self.eval_cache: Dict[Tuple[str, str, str], float] = {} # (prompt, x, y): scores per datapoint + self.seq_cache: Dict[Tuple[str, str, str], str] = {} # (prompt, x, y): generating sequence per datapoint + + def subsample(self, eval_strategy: "EvalStrategy" = None) -> Tuple[List[str], List[str]]: + """Subsample the dataset based on the specified parameters. + + Args: + eval_strategy (EvalStrategy, optional): Subsampling strategy to use instead of self.eval_strategy. Defaults to None. + + Returns: + Tuple[List[str], List[str]]: Subsampled input data and labels. + """ + if eval_strategy is None: + eval_strategy = self.eval_strategy + + if eval_strategy in ["full", "evaluated"]: + return self.xs, self.ys + elif eval_strategy == "subsample": + indices = self.rng.choice(len(self.xs), min(self.n_subsamples, len(self.xs)), replace=False) + return [self.xs[i] for i in indices], [self.ys[i] for i in indices] + elif eval_strategy == "random_block": + block_id = self.rng.integers(0, self.n_blocks) + start_idx = block_id * self.n_subsamples + end_idx = min((block_id + 1) * self.n_subsamples, len(self.xs)) + indices = np.arange(start_idx, end_idx) + return [self.xs[i] for i in indices], [self.ys[i] for i in indices] + elif eval_strategy == "sequential_block": + start_idx = self.block_idx * self.n_subsamples + end_idx = min((self.block_idx + 1) * self.n_subsamples, len(self.xs)) + indices = np.arange(start_idx, end_idx) + return [self.xs[i] for i in indices], [self.ys[i] for i in indices] + else: + raise ValueError(f"Unknown subsampling strategy: '{eval_strategy}'") + + def _prepare_batch( + self, + prompts: List[str], + xs: List[str], + ys: List[str], + eval_strategy: Literal["full", "subsample", "sequential_block", "random_block", "evaluated"] = "full", + ) -> List[Tuple[str, str, str]]: + """Generates (prompt, x, y) keys that require prediction. + + Returns keys not found in eval_cache. 
+ """ + if eval_strategy == "evaluated": + return [] + keys_to_predict = [] + for prompt in prompts: + for x, y in zip(xs, ys): + cache_key = (prompt, x, str(y)) + if cache_key not in self.eval_cache: + keys_to_predict.append(cache_key) + return keys_to_predict + + def _collect_results_from_cache( + self, + prompts: List[str], + xs: List[str], + ys: List[str], + return_agg_scores: bool, + return_seq: bool, + ) -> Union[List[float], List[List[float]], Tuple[List[List[float]], List[List[str]]]]: + """Collects all results for the current batch from the cache and formats them.""" + assert not (return_agg_scores and return_seq), "Cannot return both aggregated scores and sequences" + + scores = [] + seqs = [] + + for prompt in prompts: + datapoint_scores = [] + datapoint_seqs = [] + for x, y in zip(xs, ys): + cache_key = (prompt, x, y) + datapoint_scores.append(self.eval_cache[cache_key]) + if return_seq: + datapoint_seqs.append(self.seq_cache.get(cache_key, "")) + scores.append(datapoint_scores) + if return_seq: + seqs.append(datapoint_seqs) + + if return_agg_scores: + agg_scores = [np.nanmean(s).item() for s in scores] + return agg_scores + + return scores if not return_seq else (scores, seqs) + @abstractmethod - def evaluate(self, prompts: List[str], predictor, system_prompts: List[str] = None) -> np.ndarray: - """Abstract method to evaluate prompts using a given predictor. + def _evaluate(self, xs: List[str], ys: List[str], preds: List[str]) -> List[float]: + """Abstract method to calculate the score for a predictions. + + This method should be implemented by subclasses based on their specific evaluation logic. + """ + raise NotImplementedError + + @overload + def evaluate( + self, + prompts: List[str], + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[True] = True, + return_seq: Literal[False] = False, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> List[float]: + ... + + @overload + def evaluate( + self, + prompts: List[str], + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[False] = False, + return_seq: Literal[False] = False, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> List[List[float]]: + ... + + @overload + def evaluate( + self, + prompts: List[str], + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[False] = False, + return_seq: Literal[True] = True, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> Tuple[List[List[float]], List[List[str]]]: + ... + + @overload + def evaluate( + self, + prompts: str, + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[True] = True, + return_seq: Literal[False] = False, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> List[float]: + ... + + @overload + def evaluate( + self, + prompts: str, + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[False] = False, + return_seq: Literal[False] = False, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> List[List[float]]: + ... 
+ + @overload + def evaluate( + self, + prompts: str, + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: Literal[False] = False, + return_seq: Literal[True] = True, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> Tuple[List[List[float]], List[List[str]]]: + ... + + def evaluate( + self, + prompts: Union[str, List[str]], + predictor: "BasePredictor", + system_prompts: Optional[Union[str, List[str]]] = None, + return_agg_scores: bool = True, + return_seq: bool = False, + eval_strategy: Optional["EvalStrategy"] = None, + ) -> Union[List[float], List[List[float]], Tuple[List[List[float]], List[List[str]]]]: + """Evaluate a set of prompts using a given predictor. + + This method orchestrates subsampling, prediction, caching, and result collection. + + Note: Cannot return both aggregated scores and sequences (assertion will fail). + """ + assert not (return_agg_scores and return_seq), "Cannot return both aggregated scores and sequences" + + seqs: List[str] = [] + + prompts = [prompts] if isinstance(prompts, str) else prompts + eval_strategy = eval_strategy or self.eval_strategy + xs, ys = self.subsample(eval_strategy=eval_strategy) + batches = self._prepare_batch(prompts, xs, ys, eval_strategy=eval_strategy) + (prompts_to_evaluate, xs_to_evaluate, ys_to_evaluate) = ([], [], []) if not batches else zip(*batches) + + if prompts_to_evaluate: + preds_seqs = predictor.predict( + prompts=list(prompts_to_evaluate), + xs=list(xs_to_evaluate), + system_prompts=system_prompts, + return_seq=return_seq, + ) + else: + preds_seqs = ([], []) if return_seq else [] + + if return_seq: + preds, seqs = preds_seqs if isinstance(preds_seqs, tuple) else (preds_seqs, []) + else: + preds = preds_seqs + + scores: List[float] = self._evaluate(list(xs_to_evaluate), list(ys_to_evaluate), preds) + for i, cache_key in enumerate(batches): + self.eval_cache[cache_key] = scores[i] + if return_seq: + self.seq_cache[cache_key] = seqs[i] + + return self._collect_results_from_cache( + prompts, + xs, + ys, + return_agg_scores, + return_seq, + ) + + def pop_datapoints(self, n: Optional[int] = None, frac: Optional[float] = None) -> pd.DataFrame: + """Pop a number of datapoints from the dataset. Args: - prompts (List[str]): List of prompts to evaluate. - predictor: The predictor to use for evaluation. - system_prompts (List[str]): List of system prompts to evaluate. + n (int, optional): Number of datapoints to pop. Defaults to None. + frac (float, optional): Fraction of datapoints to pop. Defaults to None. Returns: - np.ndarray: Array of evaluation scores for each prompt. + pd.DataFrame: DataFrame containing the popped datapoints. + """ + assert n is None or frac is None, "Only one of n or frac can be specified." 
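The evaluate() orchestration above subsamples, predicts only the uncached (prompt, x, y) triples, and reads everything back from `eval_cache`. A small sketch with a made-up `StubPredictor` and toy DataFrame, showing the aggregated and per-datapoint return shapes:

```python
import pandas as pd

from promptolution.tasks.classification_tasks import ClassificationTask


class StubPredictor:
    """Stand-in for a BasePredictor: always answers 'positive', one prediction per (prompt, x) pair."""

    def predict(self, prompts, xs, system_prompts=None, return_seq=False):
        preds = ["positive" for _ in xs]
        if return_seq:
            return preds, [f"{x}\n{p}" for x, p in zip(xs, preds)]
        return preds


df = pd.DataFrame(
    {
        "x": ["great movie", "terrible plot", "loved it", "boring"],
        "y": ["positive", "negative", "positive", "negative"],
    }
)
task = ClassificationTask(df, task_description="Toy sentiment task", eval_strategy="full")
prompts = ["Classify the review's sentiment.", "Is the review positive or negative?"]

task.evaluate(prompts, StubPredictor())                           # -> [0.5, 0.5] (one aggregate per prompt)
task.evaluate(prompts, StubPredictor(), return_agg_scores=False)  # -> per-datapoint scores, e.g. [[1.0, 0.0, 1.0, 0.0], ...]
# The second call is served from task.eval_cache: already-scored (prompt, x, y) triples are not re-predicted.
```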
+ if n is not None: + indices = self.rng.choice(len(self.xs), n, replace=False) + elif frac is not None: + indices = self.rng.choice(len(self.xs), int(len(self.xs) * frac), replace=False) + else: + raise ValueError("Either n or frac must be specified.") + + popped_xs = [self.xs[i] for i in indices] + popped_ys = [self.ys[i] for i in indices] + df_popped = pd.DataFrame({self.x_column: popped_xs, self.y_column: popped_ys}) + + self.xs = [x for i, x in enumerate(self.xs) if i not in indices] + self.ys = [y for i, y in enumerate(self.ys) if i not in indices] + + # Update n_blocks and block_idx based on the new dataset size + self.n_blocks = len(self.xs) // self.n_subsamples if self.n_subsamples > 0 else 1 + self.block_idx = min(self.block_idx, self.n_blocks - 1) if self.n_blocks > 0 else 0 + + # Clear cache for popped items (optional, but good practice if memory is a concern) + keys_to_remove = [] + for key in self.eval_cache: + if key[1] in popped_xs and key[2] in popped_ys: # Check if the x and y correspond to popped data + keys_to_remove.append(key) + for key in keys_to_remove: + self.eval_cache.pop(key, None) + self.seq_cache.pop(key, None) + + return df_popped + + def increment_block_idx(self) -> None: + """Increment the block index for subsampling. Raises: - NotImplementedError: If not implemented by a subclass. + ValueError: If the eval_strategy does not contain "block". """ - raise NotImplementedError + if "block" not in self.eval_strategy: + raise ValueError("Block increment is only valid for block subsampling.") + self.block_idx += 1 + if self.n_blocks > 0: # Ensure n_blocks is not zero to avoid division by zero + self.block_idx %= self.n_blocks + else: + self.block_idx = 0 # If no blocks, reset to 0 + + def reset_block_idx(self) -> None: + """Reset the block index for subsampling. + + Raises: + ValueError: If the eval_strategy does not contain "block". + """ + if "block" not in self.eval_strategy: + raise ValueError("Block reset is only valid for block subsampling.") + self.block_idx = 0 diff --git a/promptolution/tasks/classification_tasks.py b/promptolution/tasks/classification_tasks.py index 9ff156f..e34c24f 100644 --- a/promptolution/tasks/classification_tasks.py +++ b/promptolution/tasks/classification_tasks.py @@ -5,12 +5,11 @@ import pandas as pd from sklearn.metrics import accuracy_score -from typing import TYPE_CHECKING, Any, Callable, List, Literal, Tuple, Union +from typing import TYPE_CHECKING, Any, Callable, List, Literal, Optional from promptolution.tasks.base_task import BaseTask -if TYPE_CHECKING: - from promptolution.predictors.base_predictor import BasePredictor +if TYPE_CHECKING: # pragma: no cover from promptolution.utils.config import ExperimentConfig @@ -24,20 +23,20 @@ class ClassificationTask(BaseTask): def __init__( self, df: pd.DataFrame, - description: str = None, + task_description: Optional[str] = None, x_column: str = "x", y_column: str = "y", n_subsamples: int = 30, eval_strategy: Literal["full", "subsample", "sequential_block", "random_block"] = "full", seed: int = 42, - metric: Callable = accuracy_score, - config: "ExperimentConfig" = None, - ): + metric: Callable[[Any, Any], float] = accuracy_score, + config: Optional["ExperimentConfig"] = None, + ) -> None: """Initialize the ClassificationTask from a pandas DataFrame. Args: df (pd.DataFrame): Input DataFrame containing the data - description (str): Description of the task + task_description (str): Description of the task x_column (str, optional): Name of the column containing input texts. 
Defaults to "x". y_column (str, optional): Name of the column containing labels. Defaults to "y". n_subsamples (int, optional): Number of subsamples to use. No subsampling if None. Defaults to None. @@ -52,193 +51,25 @@ def __init__( metric (Callable, optional): Metric to use for evaluation. Defaults to accuracy_score. config (ExperimentConfig, optional): Configuration for the task, overriding defaults. """ - self.description = description self.metric = metric - - self.x_column = x_column - self.y_column = y_column - self.eval_strategy = eval_strategy - self.n_subsamples = n_subsamples - super().__init__(config) - - self.xs = df[self.x_column].values - self.ys = df[self.y_column].str.lower().values + super().__init__( + df=df, + x_column=x_column, + y_column=y_column, + task_description=task_description, + n_subsamples=n_subsamples, + eval_strategy=eval_strategy, + seed=seed, + config=config, + ) + self.ys: List[str] = ( + df[self.y_column].str.lower().values.tolist() + ) # Ensure y values are lowercase for consistent comparison self.classes = np.unique(self.ys) - self.block_idx = 0 - self.n_blocks = len(self.xs) // self.n_subsamples - self.rng = np.random.default_rng(seed) - - self.eval_cache = {} # (prompt, x, y): scores per datapoint - self.seq_cache = {} # (prompt, x, y): generating sequence per datapoint - - def subsample( - self, eval_strategy: Literal["full", "subsample", "sequential_block", "random_block"] = None - ) -> Tuple[np.ndarray, np.ndarray]: - """Subsample the dataset based on the specified parameters. - - Args: - strategy (str, optional): Subsampling strategy to use instead of self.subsample_strategy. Defaults to None. - - Returns: - Tuple[np.ndarray, np.ndarray]: Subsampled input data and labels. - """ - if eval_strategy is None: - eval_strategy = self.eval_strategy - - if eval_strategy in ["full", "evaluated"]: - return self.xs, self.ys - - elif eval_strategy == "subsample": - indices = self.rng.choice(len(self.xs), self.n_subsamples, replace=False) - return self.xs[indices], self.ys[indices] - - elif eval_strategy == "random_block": - block_id = self.rng.integers(0, len(self.xs) // self.n_subsamples) - indices = np.arange(block_id * self.n_subsamples, (block_id + 1) * self.n_subsamples) - return self.xs[indices], self.ys[indices] - - elif eval_strategy == "sequential_block": - indices = np.arange(self.block_idx * self.n_subsamples, (self.block_idx + 1) * self.n_subsamples) - return self.xs[indices], self.ys[indices] - - else: - raise ValueError(f"Unknown subsampling strategy: '{eval_strategy}") - - def _prepare_batch( - self, prompts: List[str], xs: np.ndarray, ys: np.ndarray, eval_strategy: str - ) -> List[Tuple[str, str, str]]: - """Generates (prompt, x, y) keys that require prediction. - - If strategy is "evaluated", returns an empty list. - Otherwise, returns keys not found in eval_cache. 
- """ - if eval_strategy == "evaluated": - return [] - - keys_to_predict = [] - for prompt in prompts: - for x, y in zip(xs, ys): - cache_key = (prompt, x, y) - if cache_key not in self.eval_cache: - keys_to_predict.append(cache_key) - return keys_to_predict - - def _collect_results_from_cache( - self, - prompts: List[str], - xs: np.ndarray, - ys: np.ndarray, - return_agg_scores: bool, - return_seq: bool, - ) -> Union[np.ndarray, Tuple[np.ndarray, Union[List[Any], np.ndarray]]]: - """Collects all results for the current batch from the cache and formats them.""" + def _evaluate(self, xs: List[str], ys: List[str], preds: List[str]) -> List[float]: + """Calculate the score for a single prediction.""" scores = [] - seqs = [] - - for prompt in prompts: - cache_keys = [(prompt, x, y) for x, y in zip(xs, ys)] - scores += [[self.eval_cache.get(key, np.nan) for key in cache_keys]] - seqs += [[self.seq_cache.get(key) for key in cache_keys]] - if return_agg_scores: - scores = [np.nanmean(s) for s in scores] - scores = np.array(scores) - seqs = np.array(seqs) - - return scores if not return_seq else (scores, seqs) - - def evaluate( - self, - prompts: Union[str, List[str]], - predictor: "BasePredictor", - system_prompts: List[str] = None, - return_agg_scores: bool = True, - return_seq: bool = False, - eval_strategy: str = None, - ) -> Union[np.ndarray, Tuple[np.ndarray, Union[List[Any], np.ndarray]]]: - """Evaluate a set of prompts using a given predictor. - - This method orchestrates subsampling, prediction, caching, and result collection. - """ - prompts = [prompts] if isinstance(prompts, str) else prompts - eval_strategy = eval_strategy or self.eval_strategy - - xs, ys = self.subsample(eval_strategy=eval_strategy) - batches = self._prepare_batch(prompts, xs, ys, eval_strategy) - prompts_to_evaluate, xs_to_evaluate, ys_to_evaluate = zip(*batches) if batches else ([], [], []) - - preds = predictor.predict( - prompts=prompts_to_evaluate, - xs=xs_to_evaluate, - system_prompts=system_prompts, - return_seq=return_seq, - ) - - if return_seq: - preds, seqs = preds - - for i, cache_key in enumerate(batches): - y_pred, y_true = preds[i], ys_to_evaluate[i] - if return_seq: - self.seq_cache[cache_key] = seqs[i] - self.eval_cache[cache_key] = self.metric([y_pred], [y_true]) - - return self._collect_results_from_cache( - prompts, - xs, - ys, - return_agg_scores, - return_seq, - ) - - def pop_datapoints(self, n: int = None, frac: float = None) -> pd.DataFrame: - """Pop a number of datapoints from the dataset. - - Args: - n (int, optional): Number of datapoints to pop. Defaults to None. - frac (float, optional): Fraction of datapoints to pop. Defaults to None. - - Returns: - pd.DataFrame: DataFrame containing the popped datapoints. - """ - assert n is None or frac is None, "Only one of n or frac can be specified." - if n is not None: - indices = self.rng.choice(len(self.xs), n, replace=False) - elif frac is not None: - indices = self.rng.choice(len(self.xs), int(len(self.xs) * frac), replace=False) - else: - raise ValueError("Either n or frac must be specified.") - - xs = self.xs[indices] - ys = self.ys[indices] - df = pd.DataFrame({self.x_column: xs, self.y_column: ys}) - - self.xs = np.delete(self.xs, indices) - self.ys = np.delete(self.ys, indices) - - self.n_blocks = len(self.xs) // self.n_subsamples - self.block_idx = min(self.block_idx, self.n_blocks - 1) - - return df - - def increment_block_idx(self) -> None: - """Increment the block index for subsampling. 
-
-        Raises:
-            ValueError: If the eval_strategy does not contain "block".
-        """
-        if "block" not in self.eval_strategy:
-            raise ValueError("Block increment is only valid for block subsampling.")
-        self.block_idx += 1
-        if self.block_idx >= self.n_blocks:
-            self.block_idx = 0
-
-    def reset_block_idx(self) -> None:
-        """Reset the block index for subsampling.
-
-        Raises:
-            ValueError: If the eval_strategy does not contain "block".
-        """
-        if "block" not in self.eval_strategy:
-            raise ValueError("Block reset is only valid for block subsampling.")
-        self.block_idx = 0
+        for pred, y in zip(preds, ys):
+            scores.append(self.metric([y], [pred]))
+        return scores
diff --git a/promptolution/tasks/judge_tasks.py b/promptolution/tasks/judge_tasks.py
new file mode 100644
index 0000000..c53742f
--- /dev/null
+++ b/promptolution/tasks/judge_tasks.py
@@ -0,0 +1,149 @@
+"""Module for judge tasks."""
+
+import numpy as np
+import pandas as pd
+
+from typing import TYPE_CHECKING, List, Literal, Optional, Union
+
+from promptolution.llms.base_llm import BaseLLM
+from promptolution.tasks.base_task import BaseTask
+from promptolution.utils.formatting import extract_from_tag
+from promptolution.utils.logging import get_logger
+
+if TYPE_CHECKING:  # pragma: no cover
+    from promptolution.tasks.base_task import EvalStrategy
+    from promptolution.utils.config import ExperimentConfig
+
+logger = get_logger(__name__)
+
+JUDGE_PROMPT_WITH_GROUND_TRUTH = """You are an expert evaluator. Judge how well the prediction matches the ground truth, for the given task.
+
+Task:
+{task}
+
+Input:
+{input}
+
+Ground Truth:
+{ground_truth}
+
+Prediction:
+{prediction}
+
+Evaluate how closely the prediction aligns with the ground truth. Consider correctness, completeness, and accuracy of the match.
+
+Provide a score from -5 to +5 where:
+- -5: Completely incorrect/opposite
+- 0: Partially correct
+- +5: Perfect match
+
+Return your answer encompassed by """
+
+JUDGE_PROMPT_WITHOUT_GROUND_TRUTH = """You are an expert evaluator. Judge the quality of the prediction, for the given task.
+
+Task:
+{task}
+
+Input:
+{input}
+
+Prediction:
+{prediction}
+
+Evaluate how well the prediction addresses the input for the given task. Consider correctness, quality, relevance, completeness, and excellence of execution.
+
+Provide a score from -5 to +5 where:
+- -5: Completely wrong/inappropriate
+- 0: Partially addresses the task with mixed quality
+- +5: Exceptional response that brilliantly solves the task with creativity, insight, or outstanding execution that goes beyond basic correctness
+
+Return your answer encompassed by """
+
+
+class JudgeTask(BaseTask):
+    """Task that evaluates a predictor using an LLM as a judge, optionally accepting a ground truth."""
+
+    def __init__(
+        self,
+        df: pd.DataFrame,
+        judge_llm: "BaseLLM",
+        x_column: str = "x",
+        y_column: Optional[str] = None,
+        task_description: Optional[str] = None,
+        n_subsamples: int = 30,
+        eval_strategy: "EvalStrategy" = "full",
+        seed: int = 42,
+        judge_prompt: Optional[str] = None,
+        min_score: float = -5.0,
+        max_score: float = 5.0,
+        config: Optional["ExperimentConfig"] = None,
+    ):
+        """Initialize the JudgeTask.
+
+        Args:
+            df (pd.DataFrame): The input DataFrame containing the data.
+            judge_llm (BaseLLM): The LLM judging the predictions.
+            x_column (str): Name of the column containing input texts.
+            y_column (Optional[str]): Name of the column containing labels/ground truth (if applicable).
+            task_description (Optional[str]): Description of the task, passed to the Judge-LLM and Meta-LLM.
+            n_subsamples (int): Number of subsamples to use for evaluation.
+            eval_strategy (EvalStrategy): Subsampling strategy to use for evaluation.
+            seed (int): Random seed for reproducibility.
+            judge_prompt (Optional[str]): Custom prompt for the judge. Note: The score of the Judge will be extracted inside tags.
+            min_score (float): Minimum score for evaluation.
+            max_score (float): Maximum score for evaluation.
+            config (ExperimentConfig, optional): Configuration for the task, overriding defaults.
+        """
+        if judge_prompt is None:
+            judge_prompt = JUDGE_PROMPT_WITH_GROUND_TRUTH if y_column else JUDGE_PROMPT_WITHOUT_GROUND_TRUTH
+        self.judge_prompt = judge_prompt
+        self.min_score = min_score
+        self.max_score = max_score
+
+        super().__init__(
+            df=df,
+            x_column=x_column,
+            y_column=y_column,
+            task_description=task_description,
+            n_subsamples=n_subsamples,
+            eval_strategy=eval_strategy,
+            seed=seed,
+            config=config,
+        )
+        self.judge_llm = judge_llm
+
+    def _construct_judge_prompt(self, x: str, pred: str, y: Optional[str] = None) -> str:
+        """Constructs the judge prompt based on whether ground truth is available."""
+        if y is not None:
+            prompt = self.judge_prompt.replace("{ground_truth}", str(y))
+        else:
+            prompt = self.judge_prompt
+
+        task_description = self.task_description or ""
+        prompt = prompt.replace("{task}", task_description).replace("{input}", x).replace("{prediction}", pred)
+        return prompt
+
+    def _evaluate(self, xs: List[str], ys: List[str], preds: List[str]) -> List[float]:
+        """Calculate scores for the predictions using the LLM judge."""
+        prompts: List[str] = []
+        for x, y, pred in zip(xs, ys, preds):
+            judge_prompt = self._construct_judge_prompt(x, pred, y)
+            prompts.append(judge_prompt)
+        judge_responses = self.judge_llm.get_response(prompts)
+        scores_str = extract_from_tag(judge_responses, "", "")
+        scores = []
+        for score_str, judge_response in zip(scores_str, judge_responses):
+            try:
+                # only numeric chars, - or . are allowed
+                score_str = "".join(filter(lambda c: c.isdigit() or c in "-.", score_str))
+                score = float(score_str)
+                # normalize from [min_score, max_score] to [0, 1]
+                score = (score - self.min_score) / (self.max_score - self.min_score)
+                score = max(0.0, min(1.0, score))
+            except ValueError:
+                logger.warning(f"Failed to parse score '{score_str}' as float. Defaulting to a score of 0.0.")
+                score = 0.0
+
+            scores.append(score)
+
+        return scores
diff --git a/promptolution/tasks/reward_tasks.py b/promptolution/tasks/reward_tasks.py
new file mode 100644
index 0000000..cf92ed0
--- /dev/null
+++ b/promptolution/tasks/reward_tasks.py
@@ -0,0 +1,59 @@
+"""Module for Reward tasks."""
+
+
+import pandas as pd
+
+from typing import TYPE_CHECKING, Callable, List, Optional
+
+from promptolution.tasks.base_task import BaseTask
+
+if TYPE_CHECKING:  # pragma: no cover
+    from promptolution.tasks.base_task import EvalStrategy
+    from promptolution.utils.config import ExperimentConfig
+
+
+class RewardTask(BaseTask):
+    """A task that evaluates a predictor using a reward function.
+
+    This task takes a DataFrame, a column name for input data, and a reward function.
+    The reward function takes in a prediction as input and returns a scalar reward.
+    """
+
+    def __init__(
+        self,
+        df: pd.DataFrame,
+        reward_function: Callable[[str], float],
+        x_column: str = "x",
+        task_description: Optional[str] = None,
+        n_subsamples: int = 30,
+        eval_strategy: "EvalStrategy" = "full",
+        seed: int = 42,
+        config: Optional["ExperimentConfig"] = None,
+    ) -> None:
+        """Initialize the RewardTask.
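For the JudgeTask above, the judge's raw score is rescaled from `[min_score, max_score]` to `[0, 1]` and clipped before it reaches the optimizer. A worked sketch of that normalization with the default bounds:

```python
# Default bounds of JudgeTask:
min_score, max_score = -5.0, 5.0

for raw in (-5.0, 0.0, 3.0, 7.0):                    # 7.0 is deliberately out of range
    norm = (raw - min_score) / (max_score - min_score)
    norm = max(0.0, min(1.0, norm))                   # clipped exactly as in the hunk
    print(raw, "->", norm)                            # -5.0 -> 0.0, 0.0 -> 0.5, 3.0 -> 0.8, 7.0 -> 1.0
```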
+ + Args: + df (pd.DataFrame): Input DataFrame containing the data. + reward_function (Callable): Function that takes a prediction and returns a reward score. Note: The optimizers aim to maximize. + x_column (str, optional): Name of the column containing input texts. Defaults to "x". + task_description (str, optional): Description of the task. + n_subsamples (int, optional): Number of subsamples to use. Defaults to 30. + eval_strategy (str, optional): Subsampling strategy to use. Defaults to "full". + seed (int, optional): Random seed for reproducibility. Defaults to 42. + config (ExperimentConfig, optional): Configuration for the task, overriding defaults. + """ + self.reward_function = reward_function + super().__init__( + df=df, + x_column=x_column, + task_description=task_description, + n_subsamples=n_subsamples, + eval_strategy=eval_strategy, + seed=seed, + config=config, + ) + + def _evaluate(self, xs: List[str], ys: List[str], preds: List[str]) -> List[float]: + """Calculate the score for a single reward prediction using the reward function.""" + rewards = [self.reward_function(pred) for pred in preds] + return rewards diff --git a/promptolution/utils/callbacks.py b/promptolution/utils/callbacks.py index f5e17f6..083f749 100644 --- a/promptolution/utils/callbacks.py +++ b/promptolution/utils/callbacks.py @@ -8,7 +8,12 @@ import pandas as pd from tqdm import tqdm -from typing import Literal +from typing import TYPE_CHECKING, Any, Literal, Optional, Tuple + +if TYPE_CHECKING: + from logging import Logger + + from promptolution.optimizers.base_optimizer import BaseOptimizer class BaseCallback(ABC): @@ -19,7 +24,7 @@ class BaseCallback(ABC): """ - def __init__(self, **kwargs): + def __init__(self, **kwargs: Any) -> None: """Initialize the callback with a configuration. Args: @@ -28,7 +33,7 @@ def __init__(self, **kwargs): """ pass - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Called at the end of each optimization step. Args: @@ -39,7 +44,7 @@ def on_step_end(self, optimizer): """ return True - def on_epoch_end(self, optimizer): + def on_epoch_end(self, optimizer: "BaseOptimizer") -> bool: """Called at the end of each optimization epoch. Args: @@ -50,7 +55,7 @@ def on_epoch_end(self, optimizer): """ return True - def on_train_end(self, optimizer): + def on_train_end(self, optimizer: "BaseOptimizer") -> bool: """Called at the end of the entire optimization process. Args: @@ -72,12 +77,12 @@ class LoggerCallback(BaseCallback): step (int): The current step number. """ - def __init__(self, logger): + def __init__(self, logger: "Logger") -> None: """Initialize the LoggerCallback.""" self.logger = logger self.step = 0 - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Log information about the current step.""" self.step += 1 time = datetime.now().strftime("%d-%m-%y %H:%M:%S:%f") @@ -88,7 +93,7 @@ def on_step_end(self, optimizer): return True - def on_train_end(self, optimizer, logs=None): + def on_train_end(self, optimizer: "BaseOptimizer", logs: Optional[Any] = None) -> bool: """Log information at the end of training. Args: @@ -115,7 +120,7 @@ class FileOutputCallback(BaseCallback): file_type (str): The type of file to save the output to. """ - def __init__(self, dir, file_type: Literal["parquet", "csv"] = "parquet"): + def __init__(self, dir: str, file_type: Literal["parquet", "csv"] = "parquet") -> None: """Initialize the FileOutputCallback. 
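A usage sketch for the RewardTask above. The reward function, the toy DataFrame, and `StubPredictor` are invented for the example; any callable mapping a prediction string to a float works, and the optimizers treat higher rewards as better:

```python
import pandas as pd

from promptolution.tasks.reward_tasks import RewardTask


def brevity_reward(prediction: str) -> float:
    # Toy reward: 1.0 for completions up to 20 characters, linearly less beyond that.
    return max(0.0, 1.0 - max(0, len(prediction) - 20) / 100)


class StubPredictor:
    """Stand-in predictor returning one completion per (prompt, x) pair."""

    def predict(self, prompts, xs, system_prompts=None, return_seq=False):
        return [f"Summary of: {x}" for x in xs]


df = pd.DataFrame({"x": ["a very long article about prompt optimization", "another document"]})
task = RewardTask(df=df, reward_function=brevity_reward, task_description="Summarize briefly.")

task.evaluate(["Summarize the text in one sentence."], StubPredictor())
# -> one aggregated reward per prompt; the optimizers maximize this value.
```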
Args: @@ -128,15 +133,15 @@ def __init__(self, dir, file_type: Literal["parquet", "csv"] = "parquet"): self.file_type = file_type if file_type == "parquet": - self.path = dir + "/step_results.parquet" + self.path = os.path.join(dir, "step_results.parquet") elif file_type == "csv": - self.path = dir + "/step_results.csv" + self.path = os.path.join(dir, "step_results.csv") else: raise ValueError(f"File type {file_type} not supported.") self.step = 0 - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Save prompts and scores to csv. Args: @@ -146,8 +151,8 @@ def on_step_end(self, optimizer): df = pd.DataFrame( { "step": [self.step] * len(optimizer.prompts), - "input_tokens": [optimizer.meta_llm.input_token_count] * len(optimizer.prompts), - "output_tokens": [optimizer.meta_llm.output_token_count] * len(optimizer.prompts), + "input_tokens": [optimizer.predictor.llm.input_token_count] * len(optimizer.prompts), + "output_tokens": [optimizer.predictor.llm.output_token_count] * len(optimizer.prompts), "time": [datetime.now().timestamp()] * len(optimizer.prompts), "score": optimizer.scores, "prompt": optimizer.prompts, @@ -178,12 +183,12 @@ class BestPromptCallback(BaseCallback): best_score (float): The highest score achieved so far. """ - def __init__(self): + def __init__(self) -> None: """Initialize the BestPromptCallback.""" self.best_prompt = "" - self.best_score = -99999 + self.best_score = -99999.0 - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Update the best prompt and score if a new high score is achieved. Args: @@ -195,7 +200,7 @@ def on_step_end(self, optimizer): return True - def get_best_prompt(self): + def get_best_prompt(self) -> Tuple[str, float]: """Get the best prompt and score achieved during optimization. Returns: @@ -213,7 +218,7 @@ class ProgressBarCallback(BaseCallback): pbar (tqdm): The tqdm progress bar object. """ - def __init__(self, total_steps): + def __init__(self, total_steps: int) -> None: """Initialize the ProgressBarCallback. Args: @@ -221,7 +226,7 @@ def __init__(self, total_steps): """ self.pbar = tqdm(total=total_steps) - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Update the progress bar at the end of each step. Args: @@ -231,7 +236,7 @@ def on_step_end(self, optimizer): return True - def on_train_end(self, optimizer): + def on_train_end(self, optimizer: "BaseOptimizer") -> bool: """Close the progress bar at the end of training. Args: @@ -249,7 +254,7 @@ def __init__( self, max_tokens_for_termination: int, token_type_for_termination: Literal["input_tokens", "output_tokens", "total_tokens"], - ): + ) -> None: """Initialize the TokenCountCallback. Args: @@ -259,7 +264,7 @@ def __init__( self.max_tokens_for_termination = max_tokens_for_termination self.token_type_for_termination = token_type_for_termination - def on_step_end(self, optimizer): + def on_step_end(self, optimizer: "BaseOptimizer") -> bool: """Check if the total token count exceeds the maximum allowed. 
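A dry-run sketch of the FileOutputCallback contract after the `os.path.join` change above, driving it with `SimpleNamespace` stand-ins instead of a real optimizer. The exact write behaviour (append vs. overwrite) sits outside this hunk, and the target directory is assumed to exist:

```python
from types import SimpleNamespace

from promptolution.utils.callbacks import FileOutputCallback

# SimpleNamespace objects stand in for a real optimizer and its LLM.
llm = SimpleNamespace(input_token_count=1200, output_token_count=340)
optimizer = SimpleNamespace(
    prompts=["Classify the sentence as subjective or objective."],
    scores=[0.82],
    predictor=SimpleNamespace(llm=llm),
)

cb = FileOutputCallback(dir="results", file_type="csv")  # path becomes results/step_results.csv
cb.on_step_end(optimizer)  # records step, token counts (now read from predictor.llm), timestamp, score and prompt
```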
If so, stop the optimization.""" token_counts = optimizer.predictor.llm.get_token_count() diff --git a/promptolution/utils/config.py b/promptolution/utils/config.py index ae27c2e..2ef6c00 100644 --- a/promptolution/utils/config.py +++ b/promptolution/utils/config.py @@ -1,6 +1,6 @@ """Configuration class for the promptolution library.""" -from typing import Set +from typing import Any, Set from promptolution.utils.logging import get_logger @@ -14,20 +14,20 @@ class ExperimentConfig: It provides validation and tracking of used fields. """ - def __init__(self, **kwargs): + def __init__(self, **kwargs: Any) -> None: """Initialize the configuration with the provided keyword arguments.""" self._used_attributes: Set[str] = set() for key, value in kwargs.items(): setattr(self, key, value) - def __setattr__(self, name, value): + def __setattr__(self, name: str, value: Any) -> None: """Override attribute setting to track used attributes.""" # Set the attribute using the standard mechanism object.__setattr__(self, name, value) if not name.startswith("_") and not callable(value): self._used_attributes.add(name) - def __getattribute__(self, name): + def __getattribute__(self, name: str) -> Any: """Override attribute access to track used attributes.""" # Get the attribute using the standard mechanism try: @@ -39,7 +39,7 @@ def __getattribute__(self, name): return value - def apply_to(self, obj): + def apply_to(self, obj: Any) -> Any: """Apply matching attributes from this config to an existing object. Examines each attribute of the target object and updates it if a matching @@ -62,7 +62,7 @@ def apply_to(self, obj): return obj - def validate(self): + def validate(self) -> None: """Check if any attributes were not used and run validation. Does not raise an error, but logs a warning if any attributes are unused or validation fails. diff --git a/promptolution/utils/formatting.py b/promptolution/utils/formatting.py new file mode 100644 index 0000000..3d2ad19 --- /dev/null +++ b/promptolution/utils/formatting.py @@ -0,0 +1,37 @@ +"""Utils for formatting prompts and outputs.""" +from typing import List, Union, overload + + +@overload +def extract_from_tag(text: str, start_tag: str, end_tag: str) -> str: + ... + + +@overload +def extract_from_tag(text: List[str], start_tag: str, end_tag: str) -> List[str]: + ... + + +def extract_from_tag(text: Union[str, List[str]], start_tag: str, end_tag: str) -> Union[List[str], str]: + """Extracts content from a string between specified start and end tags. + + Args: + text (str): The input text to extract from. + start_tag (str): The start tag to look for. + end_tag (str): The end tag to look for. + + Returns: + Union[List[str], str]: The extracted content, either as a list or a single string. 
+ """ + was_list = True + if isinstance(text, str): + text = [text] + was_list = False + + outs = [] + for t in text: + out = t.split(start_tag)[-1].split(end_tag)[0].strip() + outs.append(out) + if was_list: + return outs + return outs[0] diff --git a/promptolution/utils/prompt_creation.py b/promptolution/utils/prompt_creation.py index d4764df..77082db 100644 --- a/promptolution/utils/prompt_creation.py +++ b/promptolution/utils/prompt_creation.py @@ -3,9 +3,11 @@ import numpy as np -from typing import TYPE_CHECKING, List, Union +from typing import TYPE_CHECKING, List, Optional, Union -if TYPE_CHECKING: +from promptolution.utils.formatting import extract_from_tag + +if TYPE_CHECKING: # pragma: no cover from promptolution.llms.base_llm import BaseLLM from promptolution.tasks.base_task import BaseTask @@ -17,7 +19,9 @@ from promptolution.tasks.classification_tasks import ClassificationTask -def create_prompt_variation(prompt: Union[List[str], str], llm: "BaseLLM", meta_prompt: str = None) -> List[str]: +def create_prompt_variation( + prompt: Union[List[str], str], llm: "BaseLLM", meta_prompt: Optional[str] = None +) -> List[str]: """Generate a variation of the given prompt(s) while keeping the semantic meaning. Idea taken from the paper Zhou et al. (2021) https://arxiv.org/pdf/2211.01910 @@ -36,8 +40,7 @@ def create_prompt_variation(prompt: Union[List[str], str], llm: "BaseLLM", meta_ if isinstance(prompt, str): prompt = [prompt] varied_prompts = llm.get_response([meta_prompt.replace("", p) for p in prompt]) - - varied_prompts = [p.split("")[0].split("")[-1] for p in varied_prompts] + varied_prompts = extract_from_tag(varied_prompts, "", "") return varied_prompts @@ -45,9 +48,9 @@ def create_prompt_variation(prompt: Union[List[str], str], llm: "BaseLLM", meta_ def create_prompts_from_samples( task: "BaseTask", llm: "BaseLLM", - meta_prompt: str = None, + meta_prompt: Optional[str] = None, n_samples: int = 3, - task_description: str = None, + task_description: Optional[str] = None, n_prompts: int = 1, get_uniform_labels: bool = False, ) -> List[str]: @@ -72,14 +75,17 @@ def create_prompts_from_samples( Returns: List[str]: A list of generated prompts. 
""" - if meta_prompt is None and task_description is None: - meta_prompt_template = PROMPT_CREATION_TEMPLATE - elif meta_prompt is None and task_description is not None: - meta_prompt_template = PROMPT_CREATION_TEMPLATE_TD.replace("", task_description) - elif meta_prompt is not None and task_description is None: - meta_prompt_template = meta_prompt - elif meta_prompt is not None and task_description is not None: - meta_prompt_template = meta_prompt.replace("", task_description) + meta_prompt_template: str + if meta_prompt is None: + if task_description is None: + meta_prompt_template = PROMPT_CREATION_TEMPLATE + else: + meta_prompt_template = PROMPT_CREATION_TEMPLATE_TD.replace("", task_description) + else: + if task_description is None: + meta_prompt_template = meta_prompt + else: + meta_prompt_template = meta_prompt.replace("", task_description) meta_prompts = [] for _ in range(n_prompts): @@ -91,9 +97,9 @@ def create_prompts_from_samples( samples_per_class = np.maximum(samples_per_class, 1) # sample - xs = [] - ys = [] - for label, n_samples in zip(unique_labels, samples_per_class): + xs: List[str] = [] + ys: List[str] = [] + for label, num_samples in zip(unique_labels, samples_per_class): indices = np.where(task.ys == label)[0] indices = np.random.choice(indices, n_samples, replace=False) xs.extend(task.xs[indices]) @@ -102,14 +108,14 @@ def create_prompts_from_samples( else: # if not classification task, sample randomly indices = np.random.choice(len(task.xs), n_samples, replace=False) - xs = task.xs[indices].tolist() - ys = task.ys[indices].tolist() + xs = [task.xs[i] for i in indices] + ys = [task.ys[i] for i in indices] examples = "\n\n".join([f"Input: {x}\nOutput: {y}" for x, y in zip(xs, ys)]) meta_prompt = meta_prompt_template.replace("", examples) meta_prompts.append(meta_prompt) prompts = llm.get_response(meta_prompts) - prompts = [prompt.split("")[0].split("")[-1].strip() for prompt in prompts] + prompts = extract_from_tag(prompts, "", "") return prompts diff --git a/promptolution/utils/test_statistics.py b/promptolution/utils/test_statistics.py index a776a75..d0de2d3 100644 --- a/promptolution/utils/test_statistics.py +++ b/promptolution/utils/test_statistics.py @@ -6,12 +6,12 @@ import numpy as np from scipy.stats import ttest_rel -from typing import Literal +from typing import Any, Callable, List, Literal TestStatistics = Literal["paired_t_test"] -def get_test_statistic_func(name: TestStatistics) -> callable: +def get_test_statistic_func(name: TestStatistics) -> Callable[..., bool]: """ Get the test statistic function based on the name provided. @@ -28,8 +28,8 @@ def get_test_statistic_func(name: TestStatistics) -> callable: def paired_t_test( - scores_a: np.ndarray, - scores_b: np.ndarray, + scores_a: List[float], + scores_b: List[float], alpha: float = 0.05, ) -> bool: """ @@ -40,16 +40,18 @@ def paired_t_test( - The differences between the pairs are normally distributed (-> n > 30). Parameters: - scores_a (np.ndarray): Array of accuracy scores for candidate A. - scores_b (np.ndarray): Array of accuracy scores for candidate B. + scores_a (List[float]): Array of accuracy scores for candidate A. + scores_b (List[float]): Array of accuracy scores for candidate B. alpha (float): Significance level (default 0.05 for 95% confidence). Returns: bool: True if candidate A is significantly better than candidate B, False otherwise. 
""" + scores_a = np.array(scores_a) + scores_b = np.array(scores_b) _, p_value = ttest_rel(scores_a, scores_b, alternative="greater") result = p_value < alpha - return result + return bool(result) diff --git a/promptolution/utils/token_counter.py b/promptolution/utils/token_counter.py index 12507a8..c19c815 100644 --- a/promptolution/utils/token_counter.py +++ b/promptolution/utils/token_counter.py @@ -3,12 +3,17 @@ This module provides a function to count the number of tokens in a given text. """ -from promptolution.utils import get_logger +from typing import TYPE_CHECKING, Callable + +if TYPE_CHECKING: # pragma: no cover + from promptolution.llms.base_llm import BaseLLM + from transformers import PreTrainedTokenizer +from promptolution.utils.logging import get_logger logger = get_logger(__name__) -def get_token_counter(llm): +def get_token_counter(llm: "BaseLLM") -> Callable[[str], int]: """Get a token counter function for the given LLM. This function returns a callable that counts tokens based on the LLM's tokenizer @@ -21,10 +26,9 @@ def get_token_counter(llm): A callable that takes a text input and returns the token count. """ - if hasattr(llm, "tokenizer"): - token_counter = lambda x: len(llm.tokenizer(x)["input_ids"]) + if llm.tokenizer is not None: + tokenizer: PreTrainedTokenizer = llm.tokenizer + return lambda x: len(tokenizer.encode(x)) else: logger.warning("⚠️ The LLM does not have a tokenizer. Using simple token count.") - token_counter = lambda x: len(x.split()) - - return token_counter + return lambda x: len(x.split()) diff --git a/pyproject.toml b/pyproject.toml index 32045d4..7b63b20 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,12 +1,12 @@ [tool.poetry] name = "promptolution" -version = "2.0.1" +version = "2.1.0" description = "A framework for prompt optimization and a zoo of prompt optimization algorithms." 
authors = ["Tom Zehle, Moritz Schlager, Timo Heiß"] readme = "README.md" [tool.poetry.dependencies] -python = "^3.10" +python = ">=3.10,<3.13" numpy = ">=1.26.0, <3.0.0" pandas = "^2.2.2" tqdm = "^4.66.5" @@ -14,7 +14,7 @@ scikit-learn = "^1.5.2" fastparquet = "^2024.11.0" openai = {version = "^1.0.0", optional = true} requests = {version = "^2.31.0", optional = true} -vllm = {version = "^0.8.3", optional = true} +vllm = {version = "^0.10.1.1", optional = true} transformers = {version = "^4.48.0", optional = true} [tool.poetry.extras] @@ -31,7 +31,7 @@ requests = "^2.31.0" [tool.poetry.group.vllm] optional = true [tool.poetry.group.vllm.dependencies] -vllm = "^0.8.3" +vllm = "^0.10.1.1" [tool.poetry.group.transformers] optional = true @@ -44,13 +44,14 @@ flake8 = "^7.1.0" isort = "^5.13.2" pre-commit = "^3.7.1" ipykernel = "^6.29.5" +mypy = "^1.8.0" [tool.poetry.group.test.dependencies] pytest = "^8.3.5" pytest-cov = "^6.1.1" openai = "^1.0.0" requests = "^2.31.0" -vllm = "^0.8.2" +vllm = "^0.10.1.1" transformers = "^4.48.0" [tool.poetry.group.docs.dependencies] @@ -77,3 +78,11 @@ lines_between_sections = 1 [tool.pydocstyle] convention = "google" + +[tool.mypy] +ignore_missing_imports = true +disable_error_code = ["call-overload", "assignment", "overload-overlap"] + +[[tool.mypy.overrides]] +module = "transformers.*" +ignore_errors = true diff --git a/src/promptolution b/src/promptolution deleted file mode 160000 index fa8ae08..0000000 --- a/src/promptolution +++ /dev/null @@ -1 +0,0 @@ -Subproject commit fa8ae08c7517783a26ed397b438fa2a468d78764 diff --git a/tests/callbacks/test_callbacks.py b/tests/callbacks/test_callbacks.py index 37ab055..7c4641e 100644 --- a/tests/callbacks/test_callbacks.py +++ b/tests/callbacks/test_callbacks.py @@ -77,7 +77,7 @@ def test_file_output_callback_csv(mock_optimizer, tmpdir): # Test initialization assert callback.file_type == "csv" - assert callback.path == output_dir + "/step_results.csv" + assert callback.path == os.path.join(output_dir, "step_results.csv") assert callback.step == 0 # Test on_step_end - first step @@ -114,7 +114,7 @@ def test_file_output_callback_parquet(mock_optimizer, tmpdir): # Test initialization assert callback.file_type == "parquet" - assert callback.path == output_dir + "/step_results.parquet" + assert callback.path == os.path.join(output_dir, "step_results.parquet") # Test on_step_end - first step result = callback.on_step_end(mock_optimizer) diff --git a/tests/callbacks/test_callbacks_integration.py b/tests/callbacks/test_callbacks_integration.py index 6e17554..03de591 100644 --- a/tests/callbacks/test_callbacks_integration.py +++ b/tests/callbacks/test_callbacks_integration.py @@ -77,7 +77,7 @@ def test_file_output_callback_csv(mock_optimizer, tmpdir): # Test initialization assert callback.file_type == "csv" - assert callback.path == output_dir + "/step_results.csv" + assert callback.path == os.path.join(output_dir, "step_results.csv") assert callback.step == 0 # Test on_step_end - first step @@ -114,7 +114,7 @@ def test_file_output_callback_parquet(mock_optimizer, tmpdir): # Test initialization assert callback.file_type == "parquet" - assert callback.path == output_dir + "/step_results.parquet" + assert callback.path == os.path.join(output_dir, "step_results.parquet") # Test on_step_end - first step result = callback.on_step_end(mock_optimizer) diff --git a/tests/conftest.py b/tests/conftest.py index e2099d1..2ba60f8 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -7,6 +7,8 @@ from mocks.mock_task import MockTask 
from promptolution.tasks import ClassificationTask +from promptolution.tasks.judge_tasks import JudgeTask +from promptolution.tasks.reward_tasks import RewardTask from promptolution.utils import ExperimentConfig @@ -31,9 +33,9 @@ def experiment_config(): def mock_task(): """Fixture providing a MockTask with predetermined scoring behavior.""" - def score_function(prompt): + def score_function(preds): # Prefer longer prompts for testing purposes - return min(0.9, 0.5 + 0.01 * len(prompt)) + return [len(pred) for pred in preds] return MockTask(predetermined_scores=score_function) @@ -92,9 +94,104 @@ def mock_classification_task_with_subsampling(mock_df): """Fixture providing a ClassificationTask instance with subsampling.""" return ClassificationTask( df=mock_df, - description="Sentiment classification task", + task_description="Sentiment classification task", x_column="x", y_column="y", eval_strategy="subsample", n_subsamples=2, ) + + +@pytest.fixture +def simple_reward_function(): + """A simple reward function for testing RewardTask.""" + + def reward_func(prediction: str) -> float: + if "great" in prediction.lower() or "perfect" in prediction.lower(): + return 1.0 + elif "ok" in prediction.lower(): + return 0.5 + else: + return 0.0 + + return reward_func + + +@pytest.fixture +def mock_reward_task(mock_df, simple_reward_function): + """Fixture providing a RewardTask instance.""" + return RewardTask( + df=mock_df, + reward_function=simple_reward_function, + x_column="x", + task_description="Evaluate text quality", + n_subsamples=2, + eval_strategy="full", # Using "full" for initial clarity, can be changed in specific tests + seed=42, + ) + + +@pytest.fixture +def mock_reward_task_no_x_column(simple_reward_function): + """Fixture providing a RewardTask instance without a meaningful x_column.""" + # Create a DataFrame where 'x' is just a placeholder, not used for prompt construction directly + df_no_x_data = { + "id_col": list(range(5)), + "dummy_input": ["", "", "", "", ""], # Or just 0, 1, 2, 3, 4 + "some_attribute": ["A", "B", "C", "D", "E"], + } + df_no_x = pd.DataFrame(df_no_x_data) + return RewardTask( + df=df_no_x, + reward_function=simple_reward_function, + x_column="dummy_input", # The x_column is still technically provided but contains empty strings or Nones + task_description="Generate and evaluate jokes without explicit input text.", + n_subsamples=3, + eval_strategy="subsample", + seed=42, + ) + + +@pytest.fixture +def mock_judge_llm(): + """Fixture providing a MockLLM configured for judge responses.""" + # Responses containing the final_score tag + responses = [ + "5.0", # Perfect match + "-5.0", # Completely incorrect + "0.0", # Partially correct + "1.0", # Default/Other + "2.0", # Another specific score + "This response does not contain a score tag.", # For parsing error test + ] + return MockLLM(predetermined_responses=responses) + + +@pytest.fixture +def mock_judge_task_with_y(mock_df, mock_judge_llm): + """Fixture providing a JudgeTask instance with y_column.""" + return JudgeTask( + df=mock_df, + x_column="x", + y_column="y", + judge_llm=mock_judge_llm, + task_description="Evaluate sentiment prediction quality.", + n_subsamples=2, + eval_strategy="full", + seed=42, + ) + + +@pytest.fixture +def mock_judge_task_no_y(mock_df, mock_judge_llm): + """Fixture providing a JudgeTask instance without y_column.""" + # Use mock_df, but ensure y_column is explicitly None for this task instance + return JudgeTask( + df=mock_df, + x_column="x", + judge_llm=mock_judge_llm, + 
task_description="Evaluate joke quality (no ground truth).", + n_subsamples=2, + eval_strategy="subsample", # Test with subsampling here + seed=42, + ) diff --git a/tests/helpers/test_helpers.py b/tests/helpers/test_helpers.py index 909da42..03231a6 100644 --- a/tests/helpers/test_helpers.py +++ b/tests/helpers/test_helpers.py @@ -98,7 +98,7 @@ def test_run_optimization( # Verify mocks were called mock_get_llm.assert_called_once_with(config=experiment_config) mock_get_predictor.assert_called_once_with(mock_llm, config=experiment_config) - mock_get_task.assert_called_once_with(sample_df, experiment_config) + mock_get_task.assert_called_once_with(sample_df, experiment_config, judge_llm=mock_llm) mock_get_optimizer.assert_called_once_with( predictor=mock_predictor, meta_llm=mock_llm, task=mock_task, config=experiment_config ) @@ -158,7 +158,7 @@ def test_run_optimization_with_exemplars( # Verify mocks were called mock_get_llm.assert_called_once_with(config=experiment_config_with_exemplars) mock_get_predictor.assert_called_once_with(mock_llm, config=experiment_config_with_exemplars) - mock_get_task.assert_called_once_with(sample_df, experiment_config_with_exemplars) + mock_get_task.assert_called_once_with(sample_df, experiment_config_with_exemplars, judge_llm=mock_llm) mock_get_optimizer.assert_called_once_with( predictor=mock_predictor, meta_llm=mock_llm, task=mock_task, config=experiment_config_with_exemplars ) @@ -216,7 +216,7 @@ def test_run_evaluation(mock_get_task, mock_get_predictor, mock_get_llm, sample_ # Verify mocks were called mock_get_llm.assert_called_once_with(config=experiment_config) mock_get_predictor.assert_called_once_with(mock_llm, config=experiment_config) - mock_get_task.assert_called_once_with(sample_df, experiment_config) + mock_get_task.assert_called_once_with(sample_df, experiment_config, judge_llm=mock_llm) mock_task.evaluate.assert_called_once_with(prompts, mock_predictor, eval_strategy="full") diff --git a/tests/llms/test_local_llm.py b/tests/llms/test_local_llm.py index 1165d0a..254ba92 100644 --- a/tests/llms/test_local_llm.py +++ b/tests/llms/test_local_llm.py @@ -8,21 +8,26 @@ @pytest.fixture def mock_local_dependencies(): """Set up mocks for LocalLLM dependencies.""" - with patch("promptolution.llms.local_llm.transformers") as mock_transformers, patch( + with patch("promptolution.llms.local_llm.pipeline") as mock_pipeline_func, patch( "promptolution.llms.local_llm.torch" ) as mock_torch: - # Configure mock pipeline - mock_pipeline = MagicMock() - mock_pipeline.return_value = [{"generated_text": "Mock response 1"}, {"generated_text": "Mock response 2"}] - mock_transformers.pipeline.return_value = mock_pipeline + # Create a mock pipeline object (not a list!) 
+ mock_pipeline_obj = MagicMock() - # Configure mock tokenizer - mock_pipeline.tokenizer = MagicMock() - mock_pipeline.tokenizer.pad_token_id = None - mock_pipeline.tokenizer.eos_token_id = 50256 - mock_pipeline.tokenizer.padding_side = None + # Configure the pipeline function to return the pipeline object + mock_pipeline_func.return_value = mock_pipeline_obj - yield {"transformers": mock_transformers, "pipeline": mock_pipeline, "torch": mock_torch} + # Configure mock tokenizer on the pipeline object + mock_tokenizer = MagicMock() + mock_tokenizer.pad_token_id = None + mock_tokenizer.eos_token_id = 50256 + mock_tokenizer.padding_side = None + mock_pipeline_obj.tokenizer = mock_tokenizer + + # Configure the pipeline object's __call__ method to return responses + mock_pipeline_obj.return_value = [{"generated_text": "Mock response 1"}, {"generated_text": "Mock response 2"}] + + yield {"pipeline": mock_pipeline_func, "torch": mock_torch, "pipeline_obj": mock_pipeline_obj} def test_local_llm_initialization(mock_local_dependencies): @@ -31,7 +36,7 @@ def test_local_llm_initialization(mock_local_dependencies): local_llm = LocalLLM(model_id="gpt2", batch_size=4) # Verify pipeline was created correctly - mock_local_dependencies["transformers"].pipeline.assert_called_once_with( + mock_local_dependencies["pipeline"].assert_called_once_with( "text-generation", model="gpt2", model_kwargs={"torch_dtype": mock_local_dependencies["torch"].bfloat16}, @@ -49,26 +54,16 @@ def test_local_llm_initialization(mock_local_dependencies): def test_local_llm_get_response(mock_local_dependencies): """Test that LocalLLM._get_response works correctly.""" - # Create LocalLLM instance - local_llm = LocalLLM(model_id="gpt2") - - # Mock torch.no_grad context - with patch("promptolution.llms.local_llm.torch.no_grad") as mock_no_grad: - mock_no_grad.return_value.__enter__ = MagicMock() - mock_no_grad.return_value.__exit__ = MagicMock() - - # Call _get_response - prompts = ["Test prompt 1", "Test prompt 2"] - system_prompts = ["Be helpful", "Be concise"] - responses = local_llm._get_response(prompts, system_prompts) + local_llm = LocalLLM(model_id="gpt2", batch_size=4) - # Verify pipeline was called - local_llm.pipeline.assert_called_once() + # Mock prompts + prompts = ["Hello, world!", "How are you?"] + sys_prompts = ["System prompt 1", "System prompt 2"] - # Verify torch.no_grad was used - mock_no_grad.assert_called_once() + # Call _get_response + responses = local_llm._get_response(prompts, system_prompts=sys_prompts) - # Verify responses - assert len(responses) == 2 - assert responses[0] == "Mock response 1" - assert responses[1] == "Mock response 2" + # Verify the responses are as expected + assert len(responses) == 2 + assert responses[0] == "Mock response 1" + assert responses[1] == "Mock response 2" diff --git a/tests/llms/test_vllm.py b/tests/llms/test_vllm.py index d1f6fc8..6eef031 100644 --- a/tests/llms/test_vllm.py +++ b/tests/llms/test_vllm.py @@ -2,74 +2,77 @@ import pytest -from promptolution.llms import VLLM +from promptolution.llms.vllm import VLLM + +vllm = pytest.importorskip("vllm") +transformers = pytest.importorskip("transformers") @pytest.fixture def mock_vllm_dependencies(): """Set up comprehensive mocks for VLLM dependencies.""" - # Mock the key components with patch("promptolution.llms.vllm.LLM") as mock_llm_class, patch( "promptolution.llms.vllm.SamplingParams" - ) as mock_sampling_params, patch("promptolution.llms.vllm.AutoTokenizer") as mock_tokenizer_class: - # Create and configure mock LLM - 
mock_llm = MagicMock() - mock_llm_class.return_value = mock_llm - - # Configure LLM engine with cache config for batch size calculation - mock_cache_config = MagicMock() - mock_cache_config.num_gpu_blocks = 100 - mock_cache_config.block_size = 16 - - mock_executor = MagicMock() - mock_executor.cache_config = mock_cache_config - - mock_engine = MagicMock() - mock_engine.model_executor = mock_executor + ) as mock_sampling_params, patch("promptolution.llms.vllm.AutoTokenizer.from_pretrained") as mock_from_pretrained: + # --- LLM and Engine Mock Setup --- + mock_llm_instance = MagicMock() + mock_cache_config = MagicMock(num_gpu_blocks=100, block_size=16) + mock_executor = MagicMock(cache_config=mock_cache_config) + mock_engine = MagicMock(model_executor=mock_executor) + mock_llm_instance.llm_engine = mock_engine + mock_llm_class.return_value = mock_llm_instance - mock_llm.llm_engine = mock_engine - - # Set up the generate method to return appropriate number of responses def mock_generate_side_effect(prompts_list, *args, **kwargs): - """Return one output per input prompt""" return [ MagicMock(outputs=[MagicMock(text=f"Mocked response for prompt {i}")]) for i, _ in enumerate(prompts_list) ] - # Use side_effect instead of return_value for dynamic behavior - mock_llm.generate.side_effect = mock_generate_side_effect + mock_llm_instance.generate.side_effect = mock_generate_side_effect + + # --- Tokenizer Mock Setup (The Fix) --- + # 1. Create the mock object we want to be our tokenizer. + mock_tokenizer_instance = MagicMock() + + # 2. Configure its methods directly. + mock_tokenizer_instance.encode.return_value = [1, 2, 3, 4, 5] + mock_tokenizer_instance.apply_chat_template.return_value = "" - # Configure mock tokenizer - mock_tokenizer = MagicMock() - mock_tokenizer.encode.return_value = [1, 2, 3, 4, 5] - mock_tokenizer.apply_chat_template.return_value = "" - mock_tokenizer_class.from_pretrained.return_value = mock_tokenizer + # 3. Tell the patch for "from_pretrained" to return our configured instance. + # This is the most critical change. 
+ mock_from_pretrained.return_value = mock_tokenizer_instance + + # --- Sampling Params Mock Setup --- + mock_sampling_params_instance = MagicMock() + mock_sampling_params.return_value = mock_sampling_params_instance yield { "llm_class": mock_llm_class, - "llm": mock_llm, - "tokenizer_class": mock_tokenizer_class, - "tokenizer": mock_tokenizer, - "sampling_params": mock_sampling_params, + "llm_instance": mock_llm_instance, + "tokenizer": mock_tokenizer_instance, + "sampling_params_class": mock_sampling_params, + "sampling_params_instance": mock_sampling_params_instance, } def test_vllm_get_response(mock_vllm_dependencies): """Test that VLLM._get_response works correctly with explicit batch_size.""" # Create VLLM instance with explicit batch_size to avoid calculation - vllm = VLLM(model_id="mock-model", batch_size=4) # Set an explicit batch_size to avoid computation + vllm_instance = VLLM(model_id="mock-model", batch_size=4) + + # Verify the mocks were used + mock_vllm_dependencies["llm_class"].assert_called_once() # Call get_response prompts = ["Test prompt 1", "Test prompt 2"] system_prompts = ["Be helpful", "Be concise"] - responses = vllm._get_response(prompts, system_prompts) + responses = vllm_instance._get_response(prompts, system_prompts) # Verify tokenizer was used correctly assert mock_vllm_dependencies["tokenizer"].apply_chat_template.call_count == 2 # Verify LLM generate was called - mock_vllm_dependencies["llm"].generate.assert_called_once() + mock_vllm_dependencies["llm_instance"].generate.assert_called_once() # Verify responses assert len(responses) == 2 @@ -79,23 +82,65 @@ def test_vllm_get_response(mock_vllm_dependencies): def test_vllm_with_auto_batch_size(mock_vllm_dependencies): """Test VLLM with automatic batch size calculation.""" - # Create VLLM instance with batch_size=None to trigger auto calculation - vllm = VLLM(model_id="mock-model", batch_size=None, max_model_len=2048) - - # Force a non-zero batch size - mock_vllm_dependencies["llm"].llm_engine.model_executor.cache_config.num_gpu_blocks = 1000 + # Set up cache config for batch size calculation + mock_vllm_dependencies["llm_instance"].llm_engine.model_executor.cache_config.num_gpu_blocks = 1000 + mock_vllm_dependencies["llm_instance"].llm_engine.model_executor.cache_config.block_size = 16 - # Create a new instance to recalculate batch size - vllm = VLLM(model_id="mock-model", batch_size=None, max_model_len=2048) + # Create VLLM instance with batch_size=None to trigger auto calculation + vllm_instance = VLLM(model_id="mock-model", batch_size=None, max_model_len=2048) # Verify batch_size is greater than zero - assert vllm.batch_size > 0, "Batch size should be greater than zero" + assert vllm_instance.batch_size > 0, "Batch size should be greater than zero" + # With num_gpu_blocks=1000, block_size=16, max_model_len=2048 + # batch_size = int((1000 * 16 / 2048) * 0.95) = int(7.8125 * 0.95) = int(7.42) = 7 + assert vllm_instance.batch_size == 7, f"Expected batch_size=7, got {vllm_instance.batch_size}" # Test with a single prompt prompts = ["Test prompt"] system_prompts = ["Be helpful"] - responses = vllm._get_response(prompts, system_prompts) + responses = vllm_instance._get_response(prompts, system_prompts) # Verify we get exactly one response for one prompt assert len(responses) == 1 assert responses[0] == "Mocked response for prompt 0" + + +def test_vllm_initialization_parameters(mock_vllm_dependencies): + """Test that VLLM correctly passes parameters to underlying LLM.""" + # Create VLLM instance with custom 
parameters + vllm_instance = VLLM( + model_id="mock-model", + batch_size=8, + max_generated_tokens=512, + temperature=0.5, + top_p=0.95, + dtype="float16", + tensor_parallel_size=2, + gpu_memory_utilization=0.9, + max_model_len=4096, + trust_remote_code=True, + seed=123, + llm_kwargs={"custom_param": "value"}, + ) + + # Verify LLM was initialized with correct parameters + call_args = mock_vllm_dependencies["llm_class"].call_args + assert call_args[1]["model"] == "mock-model" + assert call_args[1]["tokenizer"] == "mock-model" + assert call_args[1]["dtype"] == "float16" + assert call_args[1]["tensor_parallel_size"] == 2 + assert call_args[1]["gpu_memory_utilization"] == 0.9 + assert call_args[1]["max_model_len"] == 4096 + assert call_args[1]["trust_remote_code"] is True + assert call_args[1]["seed"] == 123 + assert call_args[1]["custom_param"] == "value" + + # Verify SamplingParams was initialized with correct parameters + call_args = mock_vllm_dependencies["sampling_params_class"].call_args + assert call_args[1]["temperature"] == 0.5 + assert call_args[1]["top_p"] == 0.95 + assert call_args[1]["max_tokens"] == 512 + assert call_args[1]["seed"] == 123 + + # Verify batch_size was set correctly + assert vllm_instance.batch_size == 8 diff --git a/tests/mocks/mock_llm.py b/tests/mocks/mock_llm.py index cb8ef61..44f65a5 100644 --- a/tests/mocks/mock_llm.py +++ b/tests/mocks/mock_llm.py @@ -23,11 +23,7 @@ def __init__(self, predetermined_responses=None, add_prompt_tags=False, *args, * """ super().__init__(*args, **kwargs) - # Set up response list - if predetermined_responses is None: - self.responses = [] - else: - self.responses = list(predetermined_responses) # Ensure it's a list + self.responses = predetermined_responses or [] # Add prompt tags if requested if add_prompt_tags: @@ -56,18 +52,21 @@ def _get_response(self, prompts: List[str], system_prompts: Optional[List[str]] self.call_history.append({"prompts": prompts, "system_prompts": system_prompts}) results = [] + if self.response_index >= len(self.responses): + self.response_index = 0 for i, prompt in enumerate(prompts): # Return the next response from the list if available - if self.response_index < len(self.responses): + if self.response_index < len(self.responses) and isinstance(self.responses, list): results.append(self.responses[self.response_index]) self.response_index += 1 + elif prompt in self.responses and isinstance(self.responses, dict): + results.append(self.responses[prompt]) else: # Default response if we've exhausted the list if hasattr(self, "add_prompt_tags") and getattr(self, "add_prompt_tags"): results.append(f"Mock response for: {prompt}") else: results.append(f"Mock response for: {prompt}") - return results def set_generation_seed(self, seed: int) -> None: diff --git a/tests/mocks/mock_predictor.py b/tests/mocks/mock_predictor.py index 455deb0..445794a 100644 --- a/tests/mocks/mock_predictor.py +++ b/tests/mocks/mock_predictor.py @@ -36,18 +36,17 @@ def __init__( self.predetermined_predictions = predetermined_predictions or {} self.call_history = [] - def _extract_preds(self, preds: List[str], shape: Tuple[int, int] = None) -> np.ndarray: + def _extract_preds(self, preds: List[str]) -> List[str]: """Extract predictions based on predetermined mapping or default behavior. 
Args: preds: Raw text predictions - shape: Shape for reshaping results (optional) Returns: - np.ndarray: Extracted predictions + List[str]: Extracted predictions """ # Record call for test assertions - self.call_history.append({"preds": preds, "shape": shape}) + self.call_history.append({"preds": preds}) results = [] for pred in preds: @@ -59,8 +58,4 @@ def _extract_preds(self, preds: List[str], shape: Tuple[int, int] = None) -> np. results_array = np.array(results) - # Reshape if shape is provided - if shape is not None: - results_array = results_array.reshape(shape) - return results_array diff --git a/tests/mocks/mock_task.py b/tests/mocks/mock_task.py index 488c3d1..b5e1d14 100644 --- a/tests/mocks/mock_task.py +++ b/tests/mocks/mock_task.py @@ -3,6 +3,7 @@ from unittest.mock import MagicMock import numpy as np +import pandas as pd from typing import List @@ -24,7 +25,13 @@ def __init__(self, predetermined_scores=None): or a list of scores to return in sequence, or a function that generates scores based on prompts. """ - super().__init__() + super().__init__( + df=pd.DataFrame( + {"x": ["Sample text 1", "Sample text 2", "Sample text 3"], "y": ["positive", "negative", "neutral"]} + ), + x_column="x", + y_column="y", + ) self.predetermined_scores = predetermined_scores or {} self.call_history = [] self.score_index = 0 @@ -34,76 +41,29 @@ def __init__(self, predetermined_scores=None): # Default attributes similar to ClassificationTask self.description = "Mock classification task" self.classes = ["positive", "neutral", "negative"] - self.xs = np.array(["Sample text 1", "Sample text 2", "Sample text 3"]) - self.ys = np.array(["positive", "negative", "neutral"]) self.initial_prompts = ["Classify:", "Determine:"] self.n_blocks = 10 self.increment_block_idx = MagicMock() self.reset_block_idx = MagicMock() - def evaluate( - self, - prompts: List[str], - predictor, - eval_strategy: str = "subsample", - system_prompts: List[str] = None, - return_agg_scores: bool = False, - return_seq: bool = False, - ) -> np.ndarray: - """Evaluate prompts with predetermined scores. + def _evaluate(self, xs: List[str], ys: List[str], preds: List[str], **kwargs) -> List[float]: + """Calculate the score for a single prediction. Args: - prompts: List of prompts to evaluate - predictor: Predictor (ignored in mock) - system_prompts: System prompts (ignored in mock) - subsample: Whether to subsample (ignored in mock) - n_samples: Number of samples (ignored in mock) - return_seq: Whether to return sequences + xs: Input data (not used in mock) + ys: Ground truth labels (not used in mock) + preds: Predicted labels Returns: - np.ndarray of scores, and optionally sequences + Score based on predetermined scores or a default logic. 
""" - # Record the call - self.call_history.append( - { - "prompts": prompts, - "predictor": predictor, - "system_prompts": system_prompts, - "eval_strategy": eval_strategy, - "return_agg_scores": return_agg_scores, - "return_seq": return_seq, - } - ) - - scores = [] - for prompt in prompts: - # Handle different types of predetermined_scores - if callable(self.predetermined_scores): - # If it's a function, call it with the prompt - score = self.predetermined_scores(prompt) - elif isinstance(self.predetermined_scores, dict) and prompt in self.predetermined_scores: - # If it's a dict, look up the prompt - score = self.predetermined_scores[prompt] - elif isinstance(self.predetermined_scores, list): - # If it's a list, return items in sequence (cycling if needed) - if self.score_index < len(self.predetermined_scores): - score = self.predetermined_scores[self.score_index] - self.score_index = (self.score_index + 1) % len(self.predetermined_scores) - else: - score = 0.5 # Default score - else: - # Generate a somewhat predictable score based on prompt length - # (longer prompts get slightly higher scores) - score = 0.5 + 0.01 * (len(prompt) % 10) - - scores.append(score) - - scores_array = np.array(scores) - - if return_seq: - # Generate dummy sequences - seqs = [[f"Input: {x}\nOutput: {prompt}" for x in self.xs] for prompt in prompts] - return scores_array, seqs - - return scores_array + if isinstance(self.predetermined_scores, dict): + return [self.predetermined_scores.get(pred, 0) for pred in preds] + elif isinstance(self.predetermined_scores, list): + self.score_index += 1 + return self.predetermined_scores + elif callable(self.predetermined_scores): + return self.predetermined_scores(xs) + else: + return [len(pred) for pred in preds] diff --git a/tests/optimizers/test_capo.py b/tests/optimizers/test_capo.py index b0c9ae2..466c92b 100644 --- a/tests/optimizers/test_capo.py +++ b/tests/optimizers/test_capo.py @@ -2,6 +2,8 @@ import pandas as pd +from tests.mocks.mock_task import MockTask + from promptolution.optimizers.capo import CAPO, CAPOPrompt @@ -170,7 +172,6 @@ def test_crossover(mock_meta_llm, mock_predictor, initial_prompts, mock_task, mo offsprings = optimizer._crossover( [CAPOPrompt("Instruction 1", ["Example 1"]), CAPOPrompt("Instruction 2", ["Example 2"])] ) - print(offsprings) assert len(offsprings) == 5 @@ -189,7 +190,8 @@ def test_mutate(mock_meta_llm, mock_predictor, initial_prompts, mock_task, mock_ assert len(mutated) == 2 -def test_do_racing(mock_meta_llm, mock_predictor, initial_prompts, mock_task, mock_df): +def test_do_racing(mock_meta_llm, mock_predictor, initial_prompts, mock_df): + mock_task = MockTask(predetermined_scores=[0.89, 0.9] * 3) optimizer = CAPO( predictor=mock_predictor, task=mock_task, @@ -204,6 +206,5 @@ def test_do_racing(mock_meta_llm, mock_predictor, initial_prompts, mock_task, mo assert len(survivors) == 1 assert "better instruction" in survivors[0].instruction_text - # check that mocktask.reset_blocks was called assert mock_task.reset_block_idx.call_count == 2 - assert mock_task.increment_block_idx.call_count == 10 + assert mock_task.increment_block_idx.call_count == 3 diff --git a/tests/predictors/test_base_predictor.py b/tests/predictors/test_base_predictor.py index decf3de..5ba537b 100644 --- a/tests/predictors/test_base_predictor.py +++ b/tests/predictors/test_base_predictor.py @@ -37,5 +37,5 @@ def test_predictor_with_return_seq(mock_predictor): # Verify sequences assert len(sequences) == 1 - assert isinstance(sequences, np.ndarray) + assert 
isinstance(sequences, list) assert "This product is okay." in sequences[0] diff --git a/tests/predictors/test_classifiers.py b/tests/predictors/test_classifiers.py index f561995..54885b9 100644 --- a/tests/predictors/test_classifiers.py +++ b/tests/predictors/test_classifiers.py @@ -10,14 +10,14 @@ def test_first_occurrence_classifier(mock_downstream_llm, mock_df): classifier = FirstOccurrenceClassifier(llm=mock_downstream_llm, classes=mock_df["y"].values) # Test with multiple inputs - xs = np.array(["I love this product!", "I hate this product!", "This product is okay.", "ja ne"]) + xs = ["I love this product!", "I hate this product!", "This product is okay.", "ja ne"] prompts = ["Classify:"] * len(xs) # Make predictions predictions = classifier.predict(prompts, xs) # Verify shape and content - assert predictions.shape == (4,) + assert len(predictions) == 4 assert predictions[0] == "negative" assert predictions[1] == "positive" assert predictions[2] == "positive" @@ -42,7 +42,7 @@ def test_marker_based_classifier(mock_downstream_llm, mock_df): predictions = classifier.predict(prompts, xs) # Verify shape and content - assert predictions.shape == (3,) + assert len(predictions) == 3 assert predictions[0] == "positive" assert predictions[1] == "negative" assert predictions[2] == "neutral" @@ -73,7 +73,7 @@ def test_marker_based_without_classes(mock_downstream_llm): predictions = classifier.predict(prompts, xs) # Verify shape and content - should accept any value between markers - assert predictions.shape == (4,) + assert len(predictions) == 4 assert predictions[0] == "positive" assert predictions[1] == "negative" assert predictions[2] == "neutral" @@ -93,7 +93,7 @@ def test_multiple_prompts_with_classifiers(mock_downstream_llm, mock_df): predictions = classifier.predict(prompts, xs) # Verify shape and content - assert predictions.shape == (4,) + assert len(predictions) == 4 assert predictions[0] == "negative" assert predictions[1] == "positive" assert predictions[2] == "positive" @@ -113,7 +113,7 @@ def test_sequence_return_with_classifiers(mock_downstream_llm, mock_df): predictions, sequences = classifier.predict(prompts, xs, return_seq=True) # Verify predictions - assert predictions.shape == (1,) + assert len(predictions) == 1 assert predictions[0] == "positive" # Verify sequences diff --git a/tests/tasks/test_classifications_tasks.py b/tests/tasks/test_classifications_tasks.py index d72a7e6..9651e98 100644 --- a/tests/tasks/test_classifications_tasks.py +++ b/tests/tasks/test_classifications_tasks.py @@ -7,10 +7,9 @@ def test_classification_task_initialization(mock_df): """Test that ClassificationTask initializes correctly.""" - task = ClassificationTask(df=mock_df, description="Sentiment classification task", x_column="x", y_column="y") + task = ClassificationTask(df=mock_df, task_description="Sentiment classification task", x_column="x", y_column="y") - # Verify attributes - assert task.description == "Sentiment classification task" + assert task.task_description == "Sentiment classification task" assert len(task.classes) == 3 assert set(task.classes) == set(["positive", "neutral", "negative"]) assert len(task.xs) == 3 @@ -20,21 +19,17 @@ def test_classification_task_initialization(mock_df): def test_task_evaluate(mock_classification_task_with_subsampling, mock_predictor): """Test the evaluate method of ClassificationTask.""" - # Evaluate with a single prompt prompts = ["Classify sentiment:"] scores = mock_classification_task_with_subsampling.evaluate(prompts, mock_predictor) - # Verify 
scores - assert isinstance(scores, np.ndarray) - assert scores.shape == (1,) # One score per prompt - assert 0 <= scores[0] <= 1 # Score should be between 0 and 1 + assert isinstance(scores, list) + assert len(scores) == 1 + assert 0 <= scores[0] <= 1 - # Evaluate with multiple prompts prompts = ["Classify sentiment:", "Rate the text:"] scores = mock_classification_task_with_subsampling.evaluate(prompts, mock_predictor) - # Verify scores for multiple prompts - assert scores.shape == (2,) # Two scores, one per prompt + assert len(scores) == 2 assert all(0 <= score <= 1 for score in scores) @@ -42,18 +37,14 @@ def test_task_evaluate_with_subsampling(mock_classification_task_with_subsamplin """Test the evaluate method with subsampling.""" prompts = ["Classify sentiment:"] - # Evaluate with subsampling scores = mock_classification_task_with_subsampling.evaluate( prompts, mock_predictor, ) - # Verify scores - assert scores.shape == (1,) # One score per prompt + assert len(scores) == 1 - # Test with a different random seed to ensure different subsamples with pytest.raises(AssertionError, match=r".*Arrays are not equal.*"): - # Use a different random seed to force different subsampling np.random.seed(42) scores1 = mock_classification_task_with_subsampling.evaluate( prompts, @@ -66,7 +57,6 @@ def test_task_evaluate_with_subsampling(mock_classification_task_with_subsamplin mock_predictor, ) - # This should fail because the subsamples should be different np.testing.assert_array_equal(scores1, scores2) @@ -74,16 +64,14 @@ def test_task_evaluate_with_return_seq(mock_classification_task_with_subsampling """Test the evaluate method with return_seq=True.""" prompts = ["Classify sentiment:"] - # Evaluate with return_seq=True - scores, seqs = mock_classification_task_with_subsampling.evaluate(prompts, mock_predictor, return_seq=True) - - # Verify scores and sequences - assert scores.shape == (1,) # One score per prompt - assert len(seqs) == 1 # One list of sequences per prompt + scores, seqs = mock_classification_task_with_subsampling.evaluate( + prompts, mock_predictor, return_seq=True, return_agg_scores=False + ) - # Check that sequences contain input text - for seq in seqs[0]: - assert any(sample_text in seq for sample_text in mock_classification_task_with_subsampling.xs) + assert len(scores) == 1 + assert len(scores[0]) == mock_classification_task_with_subsampling.n_subsamples + assert len(seqs) == 1 + assert len(seqs[0]) == mock_classification_task_with_subsampling.n_subsamples def test_task_evaluate_with_system_prompts( @@ -94,22 +82,18 @@ def test_task_evaluate_with_system_prompts( prompts = ["Classify sentiment:"] system_prompts = ["Be concise"] - # Evaluate with system prompts scores = mock_classification_task_with_subsampling.evaluate( prompts, mock_predictor, system_prompts=system_prompts, return_agg_scores=True ) - # Verify scores - assert scores.shape == (1,) - - # Verify that system prompts were passed through to the LLM + assert len(scores) == 1 assert any(call["system_prompts"] == system_prompts for call in mock_downstream_llm.call_history) def test_pop_datapoints(mock_df): task = ClassificationTask( df=mock_df, - description="Sentiment classification task", + task_description="Sentiment classification task", eval_strategy="sequential_blocks", ) @@ -121,13 +105,76 @@ def test_pop_datapoints(mock_df): def test_blocks(mock_df): task = ClassificationTask( - df=mock_df, description="Sentiment classification task", eval_strategy="sequential_blocks", n_subsamples=1 + df=mock_df, 
task_description="Sentiment classification task", eval_strategy="sequential_blocks", n_subsamples=1 ) - # Increment blocks task.increment_block_idx() assert task.block_idx == 1 - # Reset blocks task.reset_block_idx() assert task.block_idx == 0 + + +def test_classification_task_evaluate_random_block(mock_df, mock_predictor): + """Test the evaluate method with 'random_block' subsampling for ClassificationTask.""" + task = ClassificationTask( + df=mock_df, + task_description="Sentiment classification", + x_column="x", + y_column="y", + n_subsamples=1, + eval_strategy="random_block", + seed=42, + ) + prompts = ["Classify sentiment:"] + + evaluated_x_sets = [] + for _ in range(5): + mock_predictor.call_history = [] + task.evaluate(prompts, mock_predictor) + if mock_predictor.call_history: + evaluated_x_sets.append(tuple(mock_predictor.call_history[0]["preds"])) + else: + evaluated_x_sets.append(tuple()) + + assert len(set(evaluated_x_sets)) > 1, "Should select different random blocks across evaluations" + + +def test_classification_task_evaluate_sequential_block(mock_df, mock_predictor): + """Test the evaluate method with 'sequential_block' subsampling for ClassificationTask.""" + task = ClassificationTask( + df=mock_df, + task_description="Sentiment classification", + x_column="x", + y_column="y", + n_subsamples=1, + eval_strategy="sequential_block", + seed=42, + ) + prompts = ["Classify sentiment:"] + + task.reset_block_idx() + assert task.block_idx == 0 + + expected_x_sequence = [ + "This review is not negative, so my answer is positive", + "This review is not positive, so my answer is negative", + "This review is neither positive nor negative, so my answer is neutral", + ] + + for i in range(task.n_blocks): + mock_predictor.call_history = [] + task.evaluate(prompts, mock_predictor) + + assert len(mock_predictor.call_history) == 1 + assert mock_predictor.call_history[0]["preds"][0] == expected_x_sequence[i] + + task.increment_block_idx() + if i < task.n_blocks - 1: + assert task.block_idx == i + 1 + + task_full_strategy = ClassificationTask(df=mock_df, x_column="x", y_column="y", eval_strategy="full") + with pytest.raises(ValueError, match="Block increment is only valid for block subsampling."): + task_full_strategy.increment_block_idx() + with pytest.raises(ValueError, match="Block reset is only valid for block subsampling."): + task_full_strategy.reset_block_idx() diff --git a/tests/tasks/test_judge_task.py b/tests/tasks/test_judge_task.py new file mode 100644 index 0000000..3698bb5 --- /dev/null +++ b/tests/tasks/test_judge_task.py @@ -0,0 +1,94 @@ +import numpy as np + + +def test_judge_task_initialization(mock_judge_task_with_y, mock_judge_llm): + """Test that JudgeTask initializes correctly with ground truth.""" + assert mock_judge_task_with_y.task_description == "Evaluate sentiment prediction quality." 
+ assert mock_judge_task_with_y.x_column == "x" + assert mock_judge_task_with_y.y_column == "y" + assert mock_judge_task_with_y.judge_llm == mock_judge_llm + assert mock_judge_task_with_y.has_y is True + assert len(mock_judge_task_with_y.xs) == len(mock_judge_task_with_y.df) + assert len(mock_judge_task_with_y.ys) == len(mock_judge_task_with_y.df) + + +def test_judge_task_initialization_no_y(mock_judge_task_no_y): + """Test JudgeTask initialization when no y_column is provided.""" + assert mock_judge_task_no_y.y_column is None + assert mock_judge_task_no_y.has_y is False + assert len(mock_judge_task_no_y.xs) == len(mock_judge_task_no_y.df) + assert all(y == "" for y in mock_judge_task_no_y.ys) # noqa: E711 + + +def test_judge_task_construct_judge_prompt_with_ground_truth(mock_judge_task_with_y): + """Test _construct_judge_prompt generates correct prompt when ground truth is available.""" + x_val = "This movie was great!" + pred_val = "positive" + y_val = "positive" + prompt = mock_judge_task_with_y._construct_judge_prompt(x_val, pred_val, y_val) + + assert mock_judge_task_with_y.task_description in prompt + assert f"Input:\n{x_val}" in prompt + assert f"Ground Truth:\n{y_val}" in prompt + assert f"Prediction:\n{pred_val}" in prompt + assert "Response:" not in prompt + assert "" in prompt + + +def test_judge_task_construct_judge_prompt_without_ground_truth(mock_judge_task_no_y): + """Test _construct_judge_prompt generates correct prompt when no ground truth.""" + x_val = "Tell me a joke." + pred_val = "Why did the scarecrow win an award? Because he was outstanding in his field!" + prompt = mock_judge_task_no_y._construct_judge_prompt(x_val, pred_val, None) + + assert mock_judge_task_no_y.task_description in prompt + assert f"Input:\n{x_val}" in prompt + assert pred_val in prompt + assert "" in prompt + + +def test_judge_task_evaluate_with_ground_truth(mock_judge_task_with_y, mock_predictor, mock_judge_llm): + """Test the evaluate method of JudgeTask with ground truth and full evaluation.""" + prompts = ["Rate the sentiment:", "What is the sentiment?", "How would you classify this?"] + + mock_predictor.call_history = [] + mock_judge_llm.call_history = [] + + scores_per_datapoint = mock_judge_task_with_y.evaluate(prompts, mock_predictor, return_agg_scores=False) + + assert len(scores_per_datapoint) == len(prompts) + expected_scores = [1.0, 0, 0.5] + np.testing.assert_allclose(scores_per_datapoint[0], expected_scores) + + mock_predictor.call_history = [] + mock_judge_llm.call_history = [] + + aggregated_scores = mock_judge_task_with_y.evaluate(prompts, mock_predictor, return_agg_scores=True) + assert len(aggregated_scores) == len(prompts) + expected_scores = [0.5, 0.4333333, 0.0] + np.testing.assert_allclose(aggregated_scores, expected_scores) + + +def test_judge_task_evaluate_no_ground_truth(mock_judge_task_no_y, mock_predictor, mock_judge_llm): + """Test the evaluate method of JudgeTask without a y_column (no ground truth).""" + prompts = ["Tell a funny joke:", "Make me laugh:", "What's a good joke?"] + + mock_predictor.call_history = [] + mock_judge_llm.call_history = [] + + aggregated_scores = mock_judge_task_no_y.evaluate(prompts, mock_predictor, return_agg_scores=True) + + assert len(aggregated_scores) == len(prompts) + expected_scores = [0.5, 0.55, 0.35] + np.testing.assert_allclose(aggregated_scores, expected_scores) + + +def test_judge_task_evaluate_with_return_seq(mock_judge_task_with_y, mock_predictor): + """Test the evaluate method with return_seq=True for JudgeTask.""" + prompts = 
["Evaluate this text:", "What is the sentiment?", "How would you classify this?"] + scores, seqs = mock_judge_task_with_y.evaluate(prompts, mock_predictor, return_seq=True, return_agg_scores=False) + + assert len(scores) == 3 + assert len(scores[0]) == len(mock_judge_task_with_y.xs) + assert len(seqs) == 3 + assert len(seqs[0]) == len(mock_judge_task_with_y.xs) diff --git a/tests/tasks/test_reward_tasks.py b/tests/tasks/test_reward_tasks.py new file mode 100644 index 0000000..c707da9 --- /dev/null +++ b/tests/tasks/test_reward_tasks.py @@ -0,0 +1,30 @@ +import numpy as np + + +def test_reward_task_initialization(mock_reward_task, simple_reward_function): + """Test that RewardTask initializes correctly.""" + assert mock_reward_task.task_description == "Evaluate text quality" + assert mock_reward_task.reward_function == simple_reward_function + assert mock_reward_task.x_column == "x" + assert not mock_reward_task.has_y + assert len(mock_reward_task.xs) == len(mock_reward_task.df) + assert all(y == "" for y in mock_reward_task.ys) # noqa: E711 + + +def test_reward_task_initialization_no_x_column(mock_reward_task_no_x_column, simple_reward_function): + """Test RewardTask initialization when a dummy x_column is provided (no semantic input).""" + assert mock_reward_task_no_x_column.x_column == "dummy_input" + assert not mock_reward_task_no_x_column.has_y + assert len(mock_reward_task_no_x_column.xs) == len(mock_reward_task_no_x_column.df) + assert all(x == "" for x in mock_reward_task_no_x_column.xs) + assert all([y == "" for y in mock_reward_task_no_x_column.ys]) # noqa: E711 + + +def test_reward_task_evaluate_with_return_seq(mock_reward_task, mock_predictor): + """Test the evaluate method with return_seq=True for RewardTask.""" + prompts = ["Generate a short text:"] + + scores, seqs = mock_reward_task.evaluate(prompts, mock_predictor, return_seq=True, return_agg_scores=False) + + assert len(scores) == 1 + assert len(seqs) == 1 diff --git a/tests/utils/test_prompt_creation.py b/tests/utils/test_prompt_creation.py new file mode 100644 index 0000000..1c8c950 --- /dev/null +++ b/tests/utils/test_prompt_creation.py @@ -0,0 +1,145 @@ +from promptolution.tasks.base_task import BaseTask +from promptolution.tasks.classification_tasks import ClassificationTask +from promptolution.utils.prompt_creation import create_prompt_variation, create_prompts_from_samples + + +def test_create_prompt_variation_single_prompt(mock_meta_llm): + """Test create_prompt_variation with a single string prompt and default meta-prompt.""" + original_prompt = "Analyze the sentiment of the following text." 
+ + mock_meta_llm.call_history = [] + + varied_prompts = create_prompt_variation(original_prompt, mock_meta_llm) + + assert isinstance(varied_prompts, list) + assert len(varied_prompts) == 1 + assert varied_prompts[0] == "Meta-generated prompt for input 0" + + assert len(mock_meta_llm.call_history) == 1 + + +def test_create_prompt_variation_list_of_prompts(mock_meta_llm): + """Test create_prompt_variation with a list of prompts and custom meta-prompt.""" + original_prompts = ["Prompt A.", "Prompt B."] + custom_meta_prompt = "Vary the following: " + + mock_meta_llm.call_history = [] + + varied_prompts = create_prompt_variation(original_prompts, mock_meta_llm, meta_prompt=custom_meta_prompt) + + assert isinstance(varied_prompts, list) + assert len(varied_prompts) == 2 + assert varied_prompts[0] == "Meta-generated prompt for input 0" + assert varied_prompts[1] == "Meta-generated prompt for input 1" + + assert len(mock_meta_llm.call_history) == 1 + + +def test_create_prompts_from_samples_default_meta_prompt(mock_df, mock_meta_llm): + """Test create_prompts_from_samples with default meta_prompt (no task_description).""" + task = ClassificationTask(df=mock_df, x_column="x", y_column="y", task_description="Dummy task") + n_samples = 2 + n_prompts = 1 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples(task, mock_meta_llm, n_samples=n_samples, n_prompts=n_prompts) + + assert isinstance(generated_prompts, list) + assert len(generated_prompts) == n_prompts + assert generated_prompts[0] == "Meta-generated prompt for input 0" + + assert len(mock_meta_llm.call_history) == n_prompts + + +def test_create_prompts_from_samples_with_task_description_only(mock_df, mock_meta_llm): + """Test create_prompts_from_samples with task_description and no meta_prompt.""" + task = ClassificationTask(df=mock_df, x_column="x", y_column="y") + test_task_description = "Classify customer reviews into positive, negative, or neutral." + n_samples = 2 + n_prompts = 1 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples( + task, mock_meta_llm, n_samples=n_samples, task_description=test_task_description, n_prompts=n_prompts + ) + + assert len(generated_prompts) == n_prompts + assert generated_prompts[0] == "Meta-generated prompt for input 0" + + +def test_create_prompts_from_samples_with_custom_meta_prompt_only(mock_df, mock_meta_llm): + """Test create_prompts_from_samples with custom meta_prompt and no task_description.""" + task = ClassificationTask(df=mock_df, x_column="x", y_column="y") + custom_meta_prompt = "Generate a prompt based on these examples: " + n_samples = 2 + n_prompts = 1 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples( + task, mock_meta_llm, meta_prompt=custom_meta_prompt, n_samples=n_samples, n_prompts=n_prompts + ) + + assert len(generated_prompts) == n_prompts + assert generated_prompts[0] == "Meta-generated prompt for input 0" + + +def test_create_prompts_from_samples_with_both_meta_prompt_and_task_description(mock_df, mock_meta_llm): + """Test create_prompts_from_samples with both custom meta_prompt and task_description.""" + task = ClassificationTask(df=mock_df, x_column="x", y_column="y") + custom_meta_prompt = "For , create a prompt using: " + test_task_description = "Identify categories." 
+ n_samples = 2 + n_prompts = 1 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples( + task, + mock_meta_llm, + meta_prompt=custom_meta_prompt, + n_samples=n_samples, + task_description=test_task_description, + n_prompts=n_prompts, + ) + + assert len(generated_prompts) == n_prompts + assert generated_prompts[0] == "Meta-generated prompt for input 0" + + +def test_create_prompts_from_samples_random_sampling(mock_df, mock_meta_llm): + """Test create_prompts_from_samples with random sampling (not ClassificationTask or get_uniform_labels=False).""" + + class DummyTask(BaseTask): + def _evaluate(self, x, y, pred): + return [1.0] * len(x) + + task = DummyTask(df=mock_df, x_column="x", y_column="y", task_description="Dummy task for random sampling") + n_samples = 2 + n_prompts = 1 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples( + task, mock_meta_llm, n_samples=n_samples, get_uniform_labels=False, n_prompts=n_prompts + ) + + assert len(generated_prompts) == n_prompts + + +def test_create_prompts_from_samples_multiple_prompts(mock_df, mock_meta_llm): + """Test create_prompts_from_samples generates multiple prompts.""" + task = ClassificationTask(df=mock_df, x_column="x", y_column="y") + n_samples = 2 + n_prompts = 3 + + mock_meta_llm.call_history = [] + + generated_prompts = create_prompts_from_samples(task, mock_meta_llm, n_samples=n_samples, n_prompts=n_prompts) + + assert isinstance(generated_prompts, list) + assert len(generated_prompts) == n_prompts + + assert len(mock_meta_llm.call_history) == 1 diff --git a/scripts/api_llm_demo.py b/tutorials/api_llm_demo.py similarity index 92% rename from scripts/api_llm_demo.py rename to tutorials/api_llm_demo.py index 13370ea..d369a1b 100644 --- a/scripts/api_llm_demo.py +++ b/tutorials/api_llm_demo.py @@ -33,7 +33,7 @@ task = ClassificationTask( df, - description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. The task is to classify each news article into one of the four categories.", + task_description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. The task is to classify each news article into one of the four categories.", x_column="input", y_column="target", ) diff --git a/scripts/capo_demo.py b/tutorials/capo_demo.py similarity index 92% rename from scripts/capo_demo.py rename to tutorials/capo_demo.py index a03ec78..a7cc53f 100644 --- a/scripts/capo_demo.py +++ b/tutorials/capo_demo.py @@ -35,7 +35,7 @@ task = ClassificationTask( df, - description="The dataset consists of elementary school math word problems that require multi-step reasoning to solve. The task is to solve each word problem and provide the final answer.", + task_description="The dataset consists of elementary school math word problems that require multi-step reasoning to solve. The task is to solve each word problem and provide the final answer.", x_column="input", y_column="target", eval_strategy="sequential_block", diff --git a/scripts/evoprompt_demo.py b/tutorials/evoprompt_demo.py similarity index 93% rename from scripts/evoprompt_demo.py rename to tutorials/evoprompt_demo.py index 4177eb2..6568230 100644 --- a/scripts/evoprompt_demo.py +++ b/tutorials/evoprompt_demo.py @@ -34,7 +34,7 @@ task = ClassificationTask( df, - description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. 
The task is to classify each news article into one of the four categories.", + task_description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. The task is to classify each news article into one of the four categories.", x_column="text", y_column="label_text", eval_strategy="subsample", diff --git a/notebooks/getting_started.ipynb b/tutorials/getting_started.ipynb similarity index 99% rename from notebooks/getting_started.ipynb rename to tutorials/getting_started.ipynb index 54bf3f2..2c140f6 100644 --- a/notebooks/getting_started.ipynb +++ b/tutorials/getting_started.ipynb @@ -380,7 +380,7 @@ ], "metadata": { "kernelspec": { - "display_name": ".venv", + "display_name": "promptolution-py3.12", "language": "python", "name": "python3" }, diff --git a/tutorials/llm_as_judge_tutorial.ipynb b/tutorials/llm_as_judge_tutorial.ipynb new file mode 100644 index 0000000..c764377 --- /dev/null +++ b/tutorials/llm_as_judge_tutorial.ipynb @@ -0,0 +1,448 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Getting Started: LLM as a Judge with Promptolution\n", + "\n", + "## Welcome to Promptolution! \n", + "\n", + "Discover a powerful tool for evolving and optimizing your LLM prompts. This notebook provides a friendly introduction to one of Promptolution's most advanced features: LLM as a Judge.\n", + "\n", + "While the standard getting_started notebook shows how to optimize for classification tasks, this guide will focus on something different. We'll optimize prompts for a creative task where there's no single \"correct\" answer: *Finding an optimal argument for a statement*!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Intro\n", + "In traditional machine learning and prompt optimization, we often rely on labeled data. For a classification task, you need an input (x) and a corresponding ground-truth label (y). The goal is to find a prompt that helps the model predict y correctly.\n", + "But what if your task is more subjective? How do you \"label\" things like:\n", + "\n", + "- The quality of a generated argument?\n", + "- The creativity of a story?\n", + "- The helpfulness of a summary?\n", + "- The persuasiveness of an essay?\n", + "\n", + "This is where LLM as a Judge comes in. Instead of relying on a pre-defined dataset of labels, we use another powerful Language Model (the \"judge\") to score the output of our prompts. The process looks like this:\n", + "\n", + "A candidate prompt is used to generate a response (e.g., an argument).\n", + "A \"judge\" LLM then evaluates this response based on the task provided and assigns a score.\n", + "Promptolution's optimizer uses these scores to identify which prompts are best and evolves them to generate even better responses.\n", + "\n", + "The beauty of this approach is its flexibility. While you can provide groundtruths (in case there is a correct answer) and let the LLM judge itself if both the prediction and the correct answer are equivalent - you don't need to.\n", + "\n", + "*New to Promptolution? If you haven't seen our classification tutorial yet, check out `getting_started.ipynb` first! It covers the basics of prompt optimization with simpler tasks like text classification. 
This notebook builds on those concepts but tackles more complex, subjective tasks.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installation\n", + "Install Promptolution with a single command" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! pip install promptolution[api]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from promptolution.utils import ExperimentConfig\n", + "from promptolution.helpers import run_experiment\n", + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply() # Required for notebook environments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Your Experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this tutorial, we're using IBM's Argument Quality Ranking dataset - a collection of crowd-sourced arguments on controversial topics like capital punishment, abortion rights, and climate change.\n", + "\n", + "Unlike classification tasks where you have clear input-output pairs, here we're working with debate topics that we want to generate compelling arguments for." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"hf://datasets/ibm-research/argument_quality_ranking_30k/dev.csv\").sample(300)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Sample topics:\n", + "- We should adopt a zero-tolerance policy in schools\n", + "- Payday loans should be banned\n", + "- Intelligence tests bring more harm than good\n" + ] + } + ], + "source": [ + "print(\"\\nSample topics:\")\n", + "for topic in df[\"topic\"].unique()[:3]:\n", + " print(f\"- {topic}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our task: **Given a controversial statement, generate the strongest possible argument supporting that position.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's look at what we're working with:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating Inital Prompts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are some starter prompts for generating compelling arguments. Feel free to experiment with your own!" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "init_prompts = [\n", + " \"Create a strong argument for this position with clear reasoning and examples:\",\n", + " \"Write a persuasive argument supporting this statement. Include evidence and address counterarguments:\",\n", + " \"Make a compelling case for this viewpoint using logical reasoning and real examples:\",\n", + " \"Argue convincingly for this position. Provide supporting points and evidence:\",\n", + " \"Build a strong argument for this statement with clear structure and solid reasoning:\",\n", + " \"Generate a persuasive argument supporting this position. 
Use facts and logical flow:\",\n", + " \"Create a well-reasoned argument for this viewpoint with supporting evidence:\",\n", + " \"Write a convincing argument for this position. Include examples and counter opposing views:\",\n", + " \"Develop a strong case supporting this statement using clear logic and evidence:\",\n", + " \"Construct a persuasive argument for this position with solid reasoning and examples:\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure Your LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this demonstration, we will again use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the `api_url` and `model_id`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "api_key = \"YOUR_API_KEY\" # Replace with your Promptolution API key" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are the key parameters for LLM-as-a-Judge tasks:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "config = ExperimentConfig(\n", + " optimizer=\"evopromptga\",\n", + " task_description=\"Given a statement, find the best argument supporting it.\",\n", + " x_column=\"topic\",\n", + " prompts=init_prompts,\n", + " n_steps=3,\n", + " n_subsamples=10,\n", + " subsample_strategy=\"random_subsample\",\n", + " api_url=\"https://api.deepinfra.com/v1/openai\",\n", + " model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n", + " api_key=api_key,\n", + " task_type=\"judge\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- `task_type=\"judge\"` - This tells Promptolution to use LLM evaluation instead of accuracy metrics\n", + "- `x_column=\"topic\"` - We specify which column contains our input (debate topics)\n", + "- `optimizer=\"evopromptga\"` - In the classification task we show cased CAPO, here we are using EvoPrompt, a strong evolutionary prompt optimizer.\n", + "- No y column needed - the judge will evaluate quality without ground truth labels!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run Your Experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With everything configured, you're ready to optimize your prompts! The run_experiment function will:\n", + "\n", + "1. Evaluate your initial prompts by generating arguments and having the judge LLM score them\n", + "1. Use evolutionary operators (mutation, crossover) to create new prompt variations from the 1. best-performing ones\n", + "1. Test these new prompt candidates and select the fittest ones for the next generation\n", + "1. Repeat this evolutionary process for the specified number of steps, gradually improving prompt 1. quality" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "🔥 Starting optimization...\n" + ] + } + ], + "source": [ + "prompts = run_experiment(df, config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can expect this to take several minutes as the optimizer generates arguments, evaluates them with the judge, and evolves the prompts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Optimized prompts and scores rendered as an HTML table (residue omitted here; the same data appears in the text/plain output below).
" + ], + "text/plain": [ + " prompt \\\n", + "0 Construct a persuasive argument supporting the given statement, relying on logical coherence and evidence-based reasoning. \n", + "1 Develop a strong case supporting this statement using clear logic and evidence: \n", + "2 Construct a convincing case supporting the stated argument, providing evidence and responding to potential objections. \n", + "3 Develop a well-reasoned argument in favor of the given statement, incorporating reliable examples and addressing potential counterpoints. \n", + "4 Write a persuasive argument supporting this statement. Include evidence and address counterarguments: \n", + "5 Present a convincing case for this assertion, incorporating logical premises and applicable examples. \n", + "6 Fortify the provided statement with a robust and well-reasoned argument, underscoring logical relationships and leveraging empirical support to build a compelling case, while also anticipating and addressing potential counterpoints. \n", + "7 Construct a strong claim in support of this statement, employing a logical framework and relevant examples to make a convincing case. \n", + "8 Create a well-reasoned argument for this viewpoint with supporting evidence: \n", + "9 Extract the most compelling supporting argument for this statement, grounding it in logical reasoning and bolstered by relevant evidence and examples. \n", + "\n", + " score \n", + "0 0.931500 \n", + "1 0.924167 \n", + "2 0.915833 \n", + "3 0.913333 \n", + "4 0.907500 \n", + "5 0.903333 \n", + "6 0.902500 \n", + "7 0.891667 \n", + "8 0.888333 \n", + "9 0.697500 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The best prompts aren't always the most obvious ones - let the optimizer surprise you with what works!\n", + "\n", + "\n", + "Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "promptolution-py3.10", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "-1.-1.-1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/scripts/opro_demo.py b/tutorials/opro_demo.py similarity index 93% rename from scripts/opro_demo.py rename to tutorials/opro_demo.py index eec818f..2b6ea93 100644 --- a/scripts/opro_demo.py +++ b/tutorials/opro_demo.py @@ -34,7 +34,7 @@ task = ClassificationTask( df, - description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. The task is to classify each news article into one of the four categories.", + task_description="The dataset contains news articles categorized into four classes: World, Sports, Business, and Tech. 
The task is to classify each news article into one of the four categories.", x_column="text", y_column="label_text", ) @@ -62,7 +62,7 @@ optimizer = OPRO( task=task, - prompt_template=OPRO_TEMPLATE_TD.replace(" **New to Promptolution?** If you haven't seen our other tutorials yet, check out `getting_started.ipynb` (classification) and `llm_judge_getting_started.ipynb` (LLM evaluation) first! This notebook builds on those concepts but tackles objective, measurable outcomes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installation\n", + "Install Promptolution with a single command" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! pip install promptolution[api]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "c:\\Users\\tzehl\\anaconda3\\envs\\d\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "from promptolution.utils import ExperimentConfig\n", + "from promptolution.helpers import run_experiment\n", + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply() # Required for notebook environments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Your Experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this tutorial, we're tackling a real-world challenge: summarizing text and outputting valid JSON. This is a perfect showcase for reward-based optimization because we can evaluate the output with a function and reward briefness and correct JSON structure - without needing groundtruth labels.\n", + "We're using the CNN/DailyMail dataset, which contains news articles." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_parquet(\"hf://datasets/abisee/cnn_dailymail/3.0.0\").sample(300)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Key difference from other tasks: Notice we're not using labeled \"correct\" JSON outputs or asking an AI to judge quality. Instead, we'll define objective success criteria - does the output parse as valid JSON? Does it contain the required fields? Is the summary concise enough for our database?\n", + "\n", + "Let's explore the task:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset columns: ['article', 'highlights', 'id']\n", + "\n", + "Dataset size: 300 examples\n", + "\n", + "Sample Article:\n", + "Investors looking to make an easy buck out of the housing market could be running out of time. 
Australia's financial regulators are in talks to tighten the process for in...\n" + ] + } + ], + "source": [ + "print(\"Dataset columns:\", df.columns.tolist())\n", + "print(f\"\\nDataset size: {len(df)} examples\")\n", + "print(\"\\nSample Article:\")\n", + "print(df[\"article\"].iloc[0][:170] + \"...\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating Inital Prompts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are some starter prompts for JSON extraction. Feel free to experiment with your own approaches!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "init_prompts = [\n", + " \"\"\"Analyze the provided news article and return a JSON response with the following three fields:\n", + "- \"summary\": A concise summary of the article's main points (maximum 200 characters)\n", + "- \"category\": The article's topic classification (options: \"sports\", \"politics\", \"technology\", or \"other\")\n", + "- \"author\": The article author's name (use \"unknown\" if not provided)\n", + "Format the response as valid JSON with these exact keys.\n", + "The final json needs to start with the tag.\n", + "\"\"\"\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure Your LLM" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Promptolution offers three flexible ways to access language models:\n", + "\n", + "1. Local LLMs (using the Transformers library)\n", + "1. vLLM backend (for efficient serving of large language models)\n", + "1. API-based LLMs (compatible with any provider following the OpenAI standard)\n", + "\n", + "For this demonstration, we'll use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the base_url and llm string in the configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "api_key = \"YOUR_API_KEY\" # Replace with your Promptolution API key" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's an explanation of each configuration parameter in the ExperimentConfig:\n", + "- `optimizer`: The algorithm used for prompt optimization. Currently we support \"capo\", \"evopromptga\", \"evopromptde\", and \"opro\". For this example, we use \"capo\" as it is capable of leveraging few-shot examples.\n", + "- `task_description`: A string describing the task you're optimizing prompts for. This is used to provide the meta-llm with context about your task.\n", + "- `prompts`: A list of initial prompt strings that will be used as the starting point for optimization.\n", + "- `n_steps`: The number of optimization steps to run. Higher values allow more exploration and refinement but require more API calls and computational resources.\n", + "- `api_url`: The API endpoint URL used to access the language model. This example uses DeepInfra's API which follows the OpenAI standard.\n", + "- `llm`: The LLM to use for the experiment, as both downstream and meta LLM.\n", + "- `token`: Your API authentication token required to access the language model service." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Your Reward Function\n", + "\n", + "This is where the magic happens! 
Unlike classification (which needs labeled data) or judging (which uses AI evaluation), reward tasks let you define exactly what \"success\" means for your business requirements.\n", + "\n", + "We will reward by 0.3 the LLM for first of all creating a json that is parsable by `json.loads`.\n", + "There is an additional reward of 0.2 if the dictionary contains the key \"summary\" and 0.1 each for containing \"category\" and \"author\".\n", + "If the summary contains less than 200 characters, that will give the prompt an additional reward of 0.2.\n", + "We give a reward of 0.1 if the categories are correctly assigned." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "\n", + "def reward_function(prediction: str) -> float:\n", + " reward = 0.0\n", + " try:\n", + " information = json.loads(prediction)\n", + " reward += 0.3 # valid json\n", + "\n", + " if \"summary\" in information.keys():\n", + " reward += 0.2 # contains summary\n", + " if \"category\" in information.keys():\n", + " reward += 0.1 # contains category\n", + " if \"author\" in information.keys():\n", + " reward += 0.1 # contains author\n", + "\n", + " if len(information.get(\"summary\", \"\")) < 200:\n", + " reward += 0.2 # summary is < 200 characters\n", + "\n", + " if information.get(\"category\") in [\"sports\", \"politics\", \"technology\", \"other\"]:\n", + " reward += 0.1 # category is valid\n", + " except Exception:\n", + " reward = 0.0\n", + "\n", + " return reward" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This reward function captures actual business requirements - the output must be valid JSON that our systems can process, contain all required fields, respect character limits to save time for the user, and use only allowed category values." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [], + "source": [ + "task_description = (\n", + " \"The task is to summarize a news article into a json format, that contains 'summary', 'category', and 'author'. \"\n", + " \"The summary should be less than 200 characters, and the category should be one of 'sports', 'politics', 'technology', or 'other'. \"\n", + " \"The final json needs to start with the tag.\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [], + "source": [ + "config = ExperimentConfig(\n", + " optimizer=\"opro\",\n", + " task_description=task_description,\n", + " prompts=init_prompts,\n", + " x_column=\"article\",\n", + " n_steps=8,\n", + " num_instructions_per_step=5,\n", + " api_url=\"https://api.deepinfra.com/v1/openai\",\n", + " model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n", + " api_key=api_key,\n", + " n_subsamples=15,\n", + " task_type=\"reward\",\n", + " reward_function=reward_function,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Difference compared to Classification and LLM-As-a-Judge**:\n", + "- `task_type=\"reward\"` - Uses your custom reward function instead of accuracy or AI judgment\n", + "- `reward_function=reward_function` - Your objective success criteria\n", + "- `optimizer=\"opro\"` - We already used EvoPrompt and CAPO in the other tutorials - here we will use OPRO. 
Its main benefit: it requires only a single initial prompt.\n", + "- No need for labeled \"correct\" outputs - the reward function defines success\n", + "- Completely customizable - change the reward function to optimize for anything!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run Your Experiment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With everything configured, you're ready to optimize your prompts! The `run_experiment` function will run the optimization and evaluate on a holdout set. You can expect this cell to take a few minutes to run." + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "🔥 Starting optimization...\n", + "📊 Starting evaluation...\n" + ] + } + ], + "source": [ + "prompts = run_experiment(df, config)" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
prompt | score
0 | Summarize the news article into a JSON format with the following structure: {“summary”: <summary>, “category”: <category>, “author”: <author>}.\\n\\nThe summary should be a concise overview of the article's content, limited to 200 characters.\\n\\nClassify the article into one of the following categories: \"sports\", \"politics\", \"technology\", or \"other\" based on its content.\\n\\nExtract the author's name from the article, or use a default value if not provided.\\n\\nStart the JSON response with the tag “<final_answer>” and end it with “</final_answer>”. | 0.848333
1 | Analyze the provided news article and return a JSON response with the following three fields:\\n- \"summary\": A concise summary of the article's main points (maximum 200 characters)\\n\\n- \"category\": The article's topic classification (options: \"sports\", \"politics\", \"technology\", or \"other\")\\n\\n- \"author\": The article author's name (use \"unknown\" if not provided)\\n\\nFormat the response as valid JSON with these exact keys.\\n\\nThe final json needs to start with the <final_answer> tag.\\n | 0.811667
2 | Analyze the provided news article and generate a JSON response with the following three fields:\\n\\n* \"summary\": A concise and objective summary of the article's main points, limited to 150 characters, focusing on the most critical information and highlighting key points.\\n* \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", \"entertainment\", or \"other\" based on its content.\\n* \"author\": The article author's name, using \"unknown\" if not provided.\\n\\nFormat the response as valid JSON with these exact keys, ensuring that the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary and category fields should be accurately represented, and the JSON output should be easy to read and understand.\\n\\nNote: The article summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\\n\\nScore: 99 | 0.805000
18 | Analyze the provided news article and generate a JSON response with the following three fields:\\n- \"summary\": A concise summary of the article's main points, limited to 250 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\\n- \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", \"entertainment\", \"science\", or \"other\" based on its content.\\n- \"author\": The article author's name, using \"unknown\" if not provided.\\n\\nThe JSON response should start with the <final_answer> tag and end with </final_answer>. Ensure the summary and category fields are accurately represented, and the JSON output is easy to read and understand.\\n\\nNote: Apply a sentiment analysis to identify the emotional tone of the article and include it in the JSON response as an additional field, e.g., \"sentiment\": \"positive\", \"neutral\", or \"negative\". | 0.711667
19 | Analyze the provided news article and generate a JSON response with the following three fields:\\n\\n* \"summary\": A concise summary of the article's main points, limited to 200 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\\n* \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", or \"entertainment\", based on its content.\\n* \"author\": The article author's name, using a default value if not provided.\\n\\nFormat the response as valid JSON with these exact keys. Ensure the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\\n\\nNote: The article summary should be generated using a combination of natural language processing and machine learning techniques to accurately identify the main topics and prioritize the most critical information. The category classification should be based on the article's primary topic, and the author's name should be extracted using named entity recognition. | 0.701667
" + ], + "text/plain": [ + " prompt \\\n", + "0 Summarize the news article into a JSON format with the following structure: {“summary”: , “category”: , “author”: }.\\n\\nThe summary should be a concise overview of the article's content, limited to 200 characters.\\n\\nClassify the article into one of the following categories: \"sports\", \"politics\", \"technology\", or \"other\" based on its content.\\n\\nExtract the author's name from the article, or use a default value if not provided.\\n\\nStart the JSON response with the tag “” and end it with “”. \n", + "1 Analyze the provided news article and return a JSON response with the following three fields:\\n- \"summary\": A concise summary of the article's main points (maximum 200 characters)\\n\\n- \"category\": The article's topic classification (options: \"sports\", \"politics\", \"technology\", or \"other\")\\n\\n- \"author\": The article author's name (use \"unknown\" if not provided)\\n\\nFormat the response as valid JSON with these exact keys.\\n\\nThe final json needs to start with the tag.\\n \n", + "2 Analyze the provided news article and generate a JSON response with the following three fields:\\n\\n* \"summary\": A concise and objective summary of the article's main points, limited to 150 characters, focusing on the most critical information and highlighting key points.\\n* \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", \"entertainment\", or \"other\" based on its content.\\n* \"author\": The article author's name, using \"unknown\" if not provided.\\n\\nFormat the response as valid JSON with these exact keys, ensuring that the JSON response starts with the tag and ends with . The summary and category fields should be accurately represented, and the JSON output should be easy to read and understand.\\n\\nNote: The article summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\\n\\nScore: 99 \n", + "18 Analyze the provided news article and generate a JSON response with the following three fields:\\n- \"summary\": A concise summary of the article's main points, limited to 250 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\\n- \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", \"entertainment\", \"science\", or \"other\" based on its content.\\n- \"author\": The article author's name, using \"unknown\" if not provided.\\n\\nThe JSON response should start with the tag and end with . Ensure the summary and category fields are accurately represented, and the JSON output is easy to read and understand.\\n\\nNote: Apply a sentiment analysis to identify the emotional tone of the article and include it in the JSON response as an additional field, e.g., \"sentiment\": \"positive\", \"neutral\", or \"negative\". 
\n", + "19 Analyze the provided news article and generate a JSON response with the following three fields:\\n\\n* \"summary\": A concise summary of the article's main points, limited to 200 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\\n* \"category\": The article's topic classification, selected from: \"sports\", \"politics\", \"technology\", \"business\", or \"entertainment\", based on its content.\\n* \"author\": The article author's name, using a default value if not provided.\\n\\nFormat the response as valid JSON with these exact keys. Ensure the JSON response starts with the tag and ends with . The summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\\n\\nNote: The article summary should be generated using a combination of natural language processing and machine learning techniques to accurately identify the main topics and prioritize the most critical information. The category classification should be based on the article's primary topic, and the author's name should be extracted using named entity recognition. \n", + "\n", + " score \n", + "0 0.848333 \n", + "1 0.811667 \n", + "2 0.805000 \n", + "18 0.711667 \n", + "19 0.701667 " + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prompts.iloc[[0, 1, 2, -2, -1]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You might think 'just ask for JSON' would work fine, but optimization reveals that specific instructions about field names, value constraints, and output formatting can improve validity rates from ~70% to over 84% - another reminder that systematic optimization beats manual prompt engineering!\n", + "\n", + "Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "d", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
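A quick way to validate the reward design in this tutorial before spending any API calls: the sketch below re-declares the tutorial's `reward_function` and scores a few hand-written outputs, so you can see how the partial rewards (valid JSON, required keys, length limit, allowed category) add up. The example strings `perfect`, `missing_author`, and `not_json` are hypothetical stand-ins for model outputs, not values produced by the optimizer, and the sketch assumes the string handed to the reward function is the JSON payload itself.

```python
# Sanity-check sketch for the tutorial's reward function.
# Assumption: the prediction string passed in is the JSON payload itself.
import json


def reward_function(prediction: str) -> float:
    reward = 0.0
    try:
        information = json.loads(prediction)
        reward += 0.3  # parses as valid JSON

        if "summary" in information.keys():
            reward += 0.2  # contains "summary"
        if "category" in information.keys():
            reward += 0.1  # contains "category"
        if "author" in information.keys():
            reward += 0.1  # contains "author"

        if len(information.get("summary", "")) < 200:
            reward += 0.2  # summary shorter than 200 characters

        if information.get("category") in ["sports", "politics", "technology", "other"]:
            reward += 0.1  # category uses an allowed value
    except Exception:
        reward = 0.0

    return reward


# Hypothetical outputs (not produced by the optimizer) to illustrate the scoring:
perfect = json.dumps(
    {"summary": "Regulators discuss tighter lending rules for property investors.",
     "category": "other", "author": "unknown"}
)
missing_author = json.dumps({"summary": "Short summary.", "category": "finance"})
not_json = "Here is my summary: regulators discuss tighter lending rules."

print(f"{reward_function(perfect):.2f}")         # 1.00 -> 0.3 + 0.2 + 0.1 + 0.1 + 0.2 + 0.1
print(f"{reward_function(missing_author):.2f}")  # 0.80 -> no "author" key, category not allowed
print(f"{reward_function(not_json):.2f}")        # 0.00 -> does not parse as JSON
```

Because the reward is a plain sum of independent checks, individual requirements can be re-weighted or extended (for example, giving extra credit for shorter summaries) without touching the rest of the function or the optimizer configuration.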