<a href="https://colab.research.google.com/drive/1PMZJ8vsnimRMVbD5nmDk5UMMv3eyqjbb?usp=sharing#scrollTo=s3liO0IK-dU-" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cookbook: News article summary with CoD  👨‍🍳👩‍🍳
Experience Yival's powerful features by trying to summarise articles with Chain of Density (CoD).

Just set your **OpenAI API key** and try Chain of Density **for free**! 😍

## **What is YiVal?**
> YiVal is a versatile platform that supports customizing test data, evaluation methods, and enhancement strategies, all in one place.
It empowers you to generate better results, reduce latency, and decrease inference cost.

**~~TL~~DR**: YiVal streamlines the **evaluation** and **enhancement** of GenAI apps, so you can enhance and evaluate **everything** with ease.

## **Why YiVal**


*   Native support for **multi-modal** apps: text📄 + audio🎙 + image🌃 + video🎥
*   **Multi-components**: which doesn't even have to be GenAI 😁
*   Native **RLHF** and **RLAIF** ⚙️
*   Most advanced open source **enhancement algorithms** 🪄

## **Introduction: News Summary with CoD**


In this notebook, we'll use Yival to conduct a "Chain of Density" experiment for summarizing news articles, based on the paper: [From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting](https://huggingface.co/papers/2309.04269). By integrating the CoD method into Yival, we aim to evaluate the enhancer's ability in text summarization.

### What is Chain of Density (CoD)?
- Article summaries are hard to get right because they must carry a reasonable amount of information. A good summary should be detailed and entity-centred without becoming so entity-dense that it is difficult to understand.

- CoD is a nifty technique originally designed to craft increasingly concise, yet information-rich summaries of a given text. Over a fixed number of iterations, it identifies unique, salient entities in the source text and fuses them into the previous summary without increasing its length. Here's how it works:

  CoD operates in a two-step cycle, repeated five times:
  - Think of it as a detective, first identifying 1-3 key pieces of information - or 'entities' - from the article that are absent from the existing summary.
  
  - Then, like a skilled writer, it crafts a new, denser summary. This new summary is no longer than the previous one, yet it cleverly incorporates all previous details plus the newly identified entities.

- CoD follows a set of guidelines to ensure the summaries are not just dense, but also stand-alone. It's like a mini masterclass in writing - promoting the use of fusion, compression, and the removal of uninformative phrases to make room for additional entities. And, it never forgets - entities from the previous summary are always retained.

![image](https://user-images.githubusercontent.com/71804564/277099109-bdfc384e-bd0f-4868-9ac8-2e8cd5a6af0a.png)
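
The two-step cycle described above can be sketched in plain Python. This is a minimal illustration, not YiVal's or the paper's implementation: `chain_of_density`, `find_missing_entities` and `rewrite_summary` are hypothetical names, and the two toy helpers stand in for the LLM calls so that the loop runs offline.

```python
from typing import Callable, List


def chain_of_density(
    article: str,
    initial_summary: str,
    find_missing_entities: Callable[[str, str], List[str]],
    rewrite_summary: Callable[[str, str, List[str]], str],
    rounds: int = 5,
) -> List[str]:
    """Return the summary after each round (index 0 is the initial summary)."""
    summaries = [initial_summary]
    budget = len(initial_summary.split())  # word budget fixed by the first summary
    for _ in range(rounds):
        previous = summaries[-1]
        # Step 1: pick 1-3 salient entities that the previous summary lacks.
        entities = find_missing_entities(article, previous)[:3]
        if not entities:
            break  # nothing left to add
        # Step 2: rewrite without growing, keeping old entities and adding new ones.
        candidate = rewrite_summary(article, previous, entities)
        if len(candidate.split()) > budget:
            raise ValueError("rewrite exceeded the word budget")
        summaries.append(candidate)
    return summaries


# --- Toy stand-ins for the LLM (assumptions for illustration only) ---
ENTITIES = ["Acme", "Berlin", "2023", "batteries"]
FILLERS = {"really", "very", "quite", "rather"}


def toy_find(article: str, summary: str) -> List[str]:
    # Entities present in the article but absent from the summary, two per round.
    return [e for e in ENTITIES if e in article and e not in summary][:2]


def toy_rewrite(article: str, summary: str, entities: List[str]) -> str:
    # "Compression": swap one uninformative filler word for each new entity.
    words = summary.split()
    for entity in entities:
        for i, word in enumerate(words):
            if word in FILLERS:
                words[i] = entity
                break
    return " ".join(words)


article = "Acme opened a Berlin plant in 2023 to build batteries."
initial = "This really very quite rather long article describes a company"
summaries = chain_of_density(article, initial, toy_find, toy_rewrite)
for step, text in enumerate(summaries):
    print(step, len(text.split()), text)
```

Every printed summary has the same word count while the number of entities grows, which is the property CoD aims for. In a real run, both helpers would be a single LLM call driven by the CoD prompt.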

# Install Dependencies & Necessary Configurations✈️

Let's start by installing all the dependencies and making the necessary configurations in Colab!🔛

## Install YiVal with git

In [None]:
# Clone the latest yival
import os
!python --version
!rm -rf YiVal
!git clone https://github.com/YiVal/YiVal.git

# Install and config poetry
import shutil
!pip install poetry
POETRY_PATH = shutil.which("poetry") or (os.getenv("HOME") + "/.local/bin/poetry")
os.environ["PATH"] += os.pathsep + os.path.dirname(POETRY_PATH)
!poetry --version
!poetry config virtualenvs.create true

In [None]:
os.chdir("/content/YiVal")
!poetry install --no-ansi

Now you have all necessary dependencies in Colab. One step left to complete!💪



## Configure your OpenAI API key

Acquire your OpenAI API key from the [OpenAI platform](https://platform.openai.com/) and paste it below.

In [None]:
os.environ['OPENAI_API_KEY']= ''

## **[Optional] Change gpt-4 to gpt-3.5-turbo in config**

If you don't have GPT-4 access, you can use GPT-3.5-turbo for the entire process. You just need to modify the **model_name** in the config file.

For example, you can find `model_name` below:

```yaml
description: Generate test data
dataset:
  data_generators:
    openai_prompt_data_generator:
      chunk_size: 100000
      diversify: true
      model_name: gpt-4 #Change the model_name to gpt-3.5-turbo here 🦄️
      input_function:
        description:
          Given a tech startup business, generate a corresponding landing
          page headline
        name: headline_generation_for_business
        parameters:
          tech_startup_business: str
      number_of_examples: 3
      output_csv_path: generated_examples.csv
  source_type: machine_generated
```

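If you want to use gpt-3.5-turbo, set `use_gpt_35_turbo` to `True` in the cell below and run it after you have saved your configurations under the `/demo/configs` folder.

It will automatically replace every `gpt-4` with `gpt-3.5-turbo` in all `.yml` files in that folder. Note that the custom function in `summarize.py` (defined later in this notebook) sets `model="gpt-4"` directly in Python, so it is not affected by this replacement.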

In [None]:
import os, glob, yaml
use_gpt_35_turbo = True  # set to False if you want to keep using gpt-4

def replace_gpt4_recursive(data):
    if isinstance(data, str):
        return data.replace('gpt-4', 'gpt-3.5-turbo')
    elif isinstance(data, list):
        return [replace_gpt4_recursive(item) for item in data]
    elif isinstance(data, dict):
        return {key: replace_gpt4_recursive(value) for key, value in data.items()}
    else:
        return data


def replace_in_yaml_files(directory):
    for filename in glob.glob(os.path.join(directory, '*.yml')):
        with open(filename, 'r') as file:
            data = yaml.safe_load(file)
        data = replace_gpt4_recursive(data)
        with open(filename, 'w') as file:
            yaml.safe_dump(data, file)

if use_gpt_35_turbo:
    replace_in_yaml_files("/content/YiVal/demo/configs")
    print("[INFO] Replaced all gpt-4 with gpt-3.5-turbo. gpt-3.5-turbo will be used from here on.")
else:
    print("[INFO] Using default gpt-4")
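
Before overwriting the config files, it can help to preview which fields would change. The sketch below walks a nested structure in the same way as `replace_gpt4_recursive` but only reports matches; `find_occurrences` and the sample dictionary are hypothetical and stand in for a config loaded with `yaml.safe_load`.

```python
from typing import Any, Iterator, Tuple


def find_occurrences(data: Any, needle: str = "gpt-4", path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) for every string value that contains `needle`."""
    if isinstance(data, str):
        if needle in data:
            yield (path or "<root>", data)
    elif isinstance(data, list):
        for index, item in enumerate(data):
            yield from find_occurrences(item, needle, f"{path}[{index}]")
    elif isinstance(data, dict):
        for key, value in data.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_occurrences(value, needle, child)


# Hypothetical stand-in for a parsed YAML config.
sample = {
    "variations": [{"generator_config": {"model_name": "gpt-4", "max_tokens": 7000}}],
    "improver": {"model_name": "gpt-4"},
}

hits = list(find_occurrences(sample))
for location, value in hits:
    print(f"{location}: {value}")
```

Keep in mind that round-tripping a file through `yaml.safe_load` and `yaml.safe_dump` discards YAML comments and, by default, sorts keys, so rewritten files may look different from the originals even where no model name changed.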



Now you are fully ready. Prepare to start our journey!🚗🚗

# News Article Summary Demo📕

Let's see how to write YiVal's configuration file for news article summary with CoD. The overall pipeline is shown in the following graph:

![graph](https://user-images.githubusercontent.com/71804564/277108899-4aff68cd-40c0-47aa-ab5f-55a8fa3444a1.png)


## Configs
The configuration file outlines the setup for an experiment involving the
summarization of news articles. It is structured with the following parts:

- Dataset Configs

- Custom Function

- Human Rating Configs

- Variation Configs

- Evaluator Configs

- Selection Strategy Configs

- Improver Configs

### Dataset Configs
The data for the experiment is sourced from a [HuggingFace dataset](https://huggingface.co/datasets/griffin/chain_of_density). The configuration is shown below:

```yaml
dataset:
  file_path: https://datasets-server.huggingface.co/rows?dataset=griffin%2Fchain_of_density&config=annotated&split=test
  reader: huggingface_dataset_reader
  source_type: dataset
  reader_config:
    example_limit: 3
    output_mapping:
      article: article
```
Fields in the configuration:
- **file_path**: The path of the dataset. Here we use the link of the dataset as its path.
- **reader**: The reader of the dataset. It is set to read HuggingFace datasets.
- **source_type**: The source type of the dataset. Here "dataset" means the data is from a dataset file rather than generated.
- **example_limit**: The maximum number of data examples used in the experiment.
- **output_mapping**: The mapping between fields in the dataset and the fields in the output. Here the output mapping is set to map the 'article' field in the dataset to 'article' in the output.
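
To make `example_limit` and `output_mapping` concrete, here is a rough sketch of how a reader might turn the JSON returned by the `rows` endpoint into experiment inputs. It is not YiVal's reader code: `map_rows` is a hypothetical helper, the payload is a hand-written miniature of the endpoint's response (each entry carries its fields under `row`), and the mapping direction (dataset field to output name) is an assumption.

```python
from typing import Any, Dict, List


def map_rows(
    payload: Dict[str, Any],
    output_mapping: Dict[str, str],
    example_limit: int,
) -> List[Dict[str, Any]]:
    """Keep at most `example_limit` rows and rename the mapped fields."""
    examples = []
    for entry in payload.get("rows", [])[:example_limit]:
        row = entry["row"]
        # Only fields named in the mapping are kept; others are dropped.
        examples.append({out: row[src] for src, out in output_mapping.items()})
    return examples


# Hand-written miniature of a datasets-server style response (assumption).
payload = {
    "rows": [
        {"row_idx": 0, "row": {"article": "First article text", "highlights": "h1"}},
        {"row_idx": 1, "row": {"article": "Second article text", "highlights": "h2"}},
        {"row_idx": 2, "row": {"article": "Third article text", "highlights": "h3"}},
        {"row_idx": 3, "row": {"article": "Fourth article text", "highlights": "h4"}},
    ]
}

examples = map_rows(payload, output_mapping={"article": "article"}, example_limit=3)
print(len(examples), examples[0])
```

Each resulting dictionary supplies the keyword arguments of the custom function, which is why the output name `article` matches the `article` parameter of `summarize`.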

### Custom Function
In this function, we take the content of an article as input, and output the article summary generated by GPT-4 for this article.

Here, we write the custom function into the `summarize.py` file for later use.

In [None]:
code = '''
import os

import openai

from yival.logger.token_logger import TokenLogger
from yival.schemas.experiment_config import MultimodalOutput
from yival.states.experiment_state import ExperimentState
from yival.wrappers.string_wrapper import StringWrapper


def summarize(article: str, state: ExperimentState) -> MultimodalOutput:
    logger = TokenLogger()
    logger.reset()
    # Ensure you have your OpenAI API key set up
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # Create a chat message sequence
    messages = [{
        "role":
        "system",
        "content":
        str(
            StringWrapper(
                "You are a robot summarizing article summaries. Please summarize the article based on the article given below. Your abstract should be informative, capture the important information in the article, and present it accurately and concisely. In addition, the summary should be easy to understand, coherent, well-structured, and well-organized. Finally, the summary of your summary should convey the main points of the article in a concise, logical, and coherent manner.",
                name="summarization",
                state=state
            )
        )
    }, {
        "role": "user",
        "content": article
    }]
    # Use the chat-based completion
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)

    answer = MultimodalOutput(
        text_output=response['choices'][0]['message']['content'],
    )
    token_usage = response['usage']['total_tokens']
    logger.log(token_usage)

    return answer

'''
with open('/content/YiVal/demo/summarize.py', 'w') as file:
    file.write(code)
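
The cell above calls `openai.ChatCompletion.create`, which belongs to the pre-1.0 interface of the `openai` Python package. If your environment has `openai` 1.0 or later installed (which version YiVal pins is not stated here), the equivalent call goes through a client object. The sketch below shows that call shape; `summarize_text` and the fake client are hypothetical and exist only so that the example runs without network access.

```python
from types import SimpleNamespace
from typing import Tuple


def summarize_text(client, article: str, system_prompt: str, model: str = "gpt-4") -> Tuple[str, int]:
    """Return (summary, total_tokens) using the client-style chat completions call."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": article},
        ],
    )
    # In openai>=1.0 the response is an object, so fields are attributes, not dict keys.
    return response.choices[0].message.content, response.usage.total_tokens


# --- Offline stand-in that mimics the response attributes used above ---
def _fake_create(model, messages):
    text = "summary of: " + messages[-1]["content"]
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content=text))],
        usage=SimpleNamespace(total_tokens=42),
    )


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

summary, tokens = summarize_text(fake_client, "Acme opened a plant.", "Summarize the article.")
print(summary, tokens)
```

With the real SDK you would pass `OpenAI()` (imported via `from openai import OpenAI`) instead of `fake_client`; by default it reads `OPENAI_API_KEY` from the environment.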

### Human Rating Configs
A human rating configuration is defined with the name 'quality'. The instructions are to rate the quality of the generated summary, on a scale from 0 to 4.
```yaml
human_rating_configs:
  - name: quality
    instructions: Rate the quality of the generated summary.
    scale: [0, 4]
```
Fields in the configuration:
- **name**: Specifies the criterion for rating.
- **instructions**: Provides guidelines to the rater on how to evaluate the content based on the defined criterion.
- **scale**: The rating scale, where `0` is the lowest and `4` is the highest.

### Variation Configs
Generating prompt variations is key for generating high quality summaries.

These variations are different prompts that guide GPT-4 to generate article summaries with CoD. In this configuration, we tell GPT-4 the basic requirements that a CoD prompt must meet. The prompt is designed to guide the summarization of news articles in a way that is engaging and informative, while also being concise and to the point.

```yaml
variations:
  - name: CoD_prompt_var_generation
    generator_name: openai_prompt_based_variation_generator
    generator_config:
      model_name: gpt-4
      diversify: false
      max_tokens: 7000
      number_of_variations: 5
      variables:
        - ARTICLE
      prompt:
        - content: |-

            Your mission is to build a clean, precise instruction prompt for GPT-4. This prompt will guide GPT-4 to step-by-step generate increasingly concise, entity-dense summaries of the `{ARTICLE}` as required.

            The key of your instruction is that it should prompt GPT-4 to repeat the following 2 steps 5 times:
            - Step 1: Identify 1-3 informative Entities (";" delimited) from the `{ARTICLE}` which are missing from the previously generated summary.
            - Step 2: Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

            A Missing Entity is:
            - Relevant: to the main story.
            - Specific: descriptive yet concise (5 words or fewer).
            - Novel: not in the previous summary.
            - Faithful: present in the `{ARTICLE}`
            - Anywhere: located anywhere in the `{ARTICLE}`

            Craft your prompt to make sure each summary uses the exact same number of words.
            The instruction should ensure that the GPT-4 will answer the final generated summary after repeating the steps 5 times.
            Keep your output crisp: only the prompt, devoid of any extraneous content.

          role: system
        - content: |-

            {ARTICLE} represents the text of the article.

          role: user
      output_path: generated_cod_prompts.pkl
```
Variations allow for dynamic content during experiments. They are identified by a globally unique name. For example, in your code, you might reference a variation by its name, like: `variation = StringWrapper("hello", 'test_experiment')`. In this config, you would define the variations associated with that name.

Fields in the configuration:

- **generator_name**: Represents the class name of the generator.
- **number_of_variations**: Specifies the number of variations to be generated.
- **prompt**: The prompts used to generate the variation.
- **output_path**: This is the temporary storage location for the generated data. If a file at this path exists, the data is read from it.
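
The caching behaviour described for `output_path` can be pictured as a load-or-generate helper. The sketch below is an assumption about the general pattern, not YiVal's code: `load_or_generate` and `fake_generate` are hypothetical names, and a temporary directory is used so that nothing is written next to your notebook.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_generate(output_path: str, generate: Callable[[], Any]) -> Any:
    """Reuse pickled data if the file exists; otherwise generate and store it."""
    if os.path.exists(output_path):
        with open(output_path, "rb") as handle:
            return pickle.load(handle)
    data = generate()
    with open(output_path, "wb") as handle:
        pickle.dump(data, handle)
    return data


calls = []


def fake_generate():
    # Stand-in for the expensive LLM call that produces prompt variations.
    calls.append("called")
    return ["prompt variation A", "prompt variation B"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_cod_prompts.pkl")
    first = load_or_generate(path, fake_generate)   # generates and writes the file
    second = load_or_generate(path, fake_generate)  # reads the cached file

print(first == second, len(calls))
```

A practical consequence: if you edit the generation prompt but leave an old `.pkl` file in place, the previously generated variations may be reused, so deleting the file forces regeneration.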

The generated variations can be printed by running the code in the cell below:


In [None]:
code = '''
from yival.schemas.varation_generator_configs import OpenAIPromptBasedVariationGeneratorConfig as VarConfig
from yival.variation_generators.openai_prompt_based_variation_generator import OpenAIPromptBasedVariationGenerator as VarGenerator
from pprint import pprint

config = VarConfig(
    model_name="gpt-4",
    number_of_variations=3,
    diversify=False,
    max_tokens=7000,
    variables=['ARTICLE'],
    prompt="""
            role: system

            Your mission is to build a clean, precise instruction prompt for GPT-4. This prompt will guide GPT-4 to step-by-step generate increasingly concise, entity-dense summaries of the `{ARTICLE}` as required.

            The key of your instruction is that it should prompt GPT-4 to repeat the following 2 steps 5 times:
            - Step 1: Identify 1-3 informative Entities (";" delimited) from the `{ARTICLE}` which are missing from the previously generated summary.
            - Step 2: Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

            A Missing Entity is:
            - Relevant: to the main story.
            - Specific: descriptive yet concise (5 words or fewer).
            - Novel: not in the previous summary.
            - Faithful: present in the `{ARTICLE}`
            - Anywhere: located anywhere in the `{ARTICLE}`

            Craft your prompt to make sure each summary uses the exact same number of words.
            The instruction should ensure that the GPT-4 will answer the final generated summary after repeating the steps 5 times.
            Keep your output crisp: only the prompt, devoid of any extraneous content.

            role: user

            {ARTICLE} represents the text of the article.
    """,
)

generator = VarGenerator(config)
results = generator.generate_variations()
for item in results:
    for var in item:
        pprint(var.asdict().get('value',None))
        print()
'''

with open('test_variation_generator.py', 'w') as file:
    file.write(code)

!poetry run python test_variation_generator.py

### Evaluator Configs
Evaluators are an important part of YiVal.

According to [recent studies](https://ar5iv.labs.arxiv.org/html/2305.01937), large language models can produce evaluations comparable to those of human raters.

For this reason, YiVal provides the `openai_prompt_based_evaluator`, which evaluates generated results with an LLM.

The configuration below shows the basic construction of the `openai_prompt_based_evaluator`. Each evaluator's prompt explains one criterion to the LLM in detail and guides it in scoring.

```yaml
evaluators:

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: informative
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - How informative is the summary? A good summary should be strongly relevant to the content of the article, and captures the important information in the article.
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the informative quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: coherent
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - How coherent is the summary? A good summary should be linguistically fluent and free of grammatical errors, well-structured and well-organized.
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the coherent quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: attributive
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - Is all the information in the summary fully attributable to the Article?
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the attributive quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4
```

Fields in the configuration:
- **evaluator_type**: Designates the type of evaluation.
  - `all`: The evaluator considers all experiment results across all variations. It uses the Elo algorithm and employs GPT-4 as the judge.
  - `individual`: The evaluator focuses solely on the current variation's results.
- **name**: Represents the evaluator's name or identifier.
- **prompt**: Provides the template and context for the automated evaluator to assess a given result.
- **display_name**: Specifies the displayed criterion name on the user interface.
- **choices**: Lists all possible rating options for the evaluator.
- **description**: Offers a brief description of the evaluation criterion.
- **scale_description**: Details the numeric scoring scale.
- **choice_scores**: Maps each choice to its respective numeric score.
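
The interplay of `choices`, `choice_scores` and the `AVERAGE` metric calculator can be sketched as follows. This is a simplified illustration rather than YiVal's evaluator: `score_reply` is a hypothetical helper, and the rule of taking the last standalone choice letter in the reply is an assumption about how a verdict might be extracted.

```python
import re
from typing import Dict, List

CHOICE_SCORES: Dict[str, int] = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}


def score_reply(reply: str, choice_scores: Dict[str, int]) -> int:
    """Map an evaluator reply to a numeric score via its final choice letter."""
    pattern = r"\b(" + "|".join(re.escape(choice) for choice in choice_scores) + r")\b"
    matches = re.findall(pattern, reply)
    if not matches:
        raise ValueError(f"no valid choice found in reply: {reply!r}")
    return choice_scores[matches[-1]]


def average(scores: List[int]) -> float:
    """The AVERAGE metric: arithmetic mean over all examples of one variation."""
    return sum(scores) / len(scores)


# Hypothetical LLM replies for three articles under one prompt variation.
replies = [
    "The summary covers most key facts. D",
    "C",
    "Reasoning: fluent but thin.\nB",
]
scores = [score_reply(reply, CHOICE_SCORES) for reply in replies]
print(scores, average(scores))
```

The averaged value per evaluator (here on the 0-4 scale) is the kind of number a selection strategy can later weigh against other criteria.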

### Selection Strategy Configs
You might have noticed that we support a wide variety of evaluators in Yival configurations 🌟!

With them, you can assess the results from multiple aspects, such as similarity and accuracy as well as latency and token usage, all of which are important factors to consider 🤔.

But of course, we need a selection strategy 🎯 to handle the outputs from various evaluators and pick the best one. In this case, we're using the AHP_strategy with different weights configured ⚖️. Here's a detailed config for your reference:

```yaml
selection_strategy:
  ahp_selection:
    criteria:
      - "openai_prompt_based_evaluator: informative"
      - "openai_prompt_based_evaluator: coherent"
      - "openai_prompt_based_evaluator: attributive"
      - average_token_usage
      - average_latency
    criteria_maximization:
      "openai_prompt_based_evaluator: informative": true
      "openai_prompt_based_evaluator: coherent": true
      "openai_prompt_based_evaluator: attributive": true
      average_latency: false
      average_token_usage: false
    criteria_weights:
      "openai_prompt_based_evaluator: informative": 0.33
      "openai_prompt_based_evaluator: coherent": 0.33
      "openai_prompt_based_evaluator: attributive": 0.33
      average_latency: 0.0
      average_token_usage: 0.0
```
Fields in the configuration:
- **selection_strategy**: Represents the overarching approach for making selections.
- **ahp_selection**: Specifies that the Analytic Hierarchy Process (AHP) algorithm is employed for the selection strategy.
- **criteria**: Lists the evaluators and metrics that are considered during the selection process.
- **criteria_maximization**: Indicates whether each criterion should be maximized. For instance, while a high score from the `openai_prompt_based_evaluator` is desirable (`true`), a lower `average_latency` or `average_token_usage` is preferred (`false`).
- **criteria_weights**: Assigns a weight to each criterion, determining its importance in the overall evaluation. The weights should sum to approximately 1. Here the three evaluator criteria each get 0.33 (0.99 in total), while latency and token usage get 0, so they do not influence the selection.
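
To build intuition for how `criteria_maximization` and `criteria_weights` interact, here is a deliberately simplified weighted-sum ranking. It is not the AHP algorithm that YiVal uses (AHP works with pairwise comparison matrices); it only illustrates min-max normalisation, flipping of minimised criteria, and weighting. The function name, the shortened criterion names and the numbers are all made up for the example.

```python
from typing import Dict


def weighted_scores(
    results: Dict[str, Dict[str, float]],
    maximize: Dict[str, bool],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Score each candidate as a weighted sum of min-max normalised criteria."""
    totals = {name: 0.0 for name in results}
    for criterion, weight in weights.items():
        values = [metrics[criterion] for metrics in results.values()]
        low, high = min(values), max(values)
        for name, metrics in results.items():
            # Normalise to [0, 1]; identical values carry no preference.
            norm = 0.5 if high == low else (metrics[criterion] - low) / (high - low)
            if not maximize[criterion]:
                norm = 1.0 - norm  # lower is better for this criterion
            totals[name] += weight * norm
    return totals


# Made-up averaged results for two prompt variations.
results = {
    "prompt_1": {"informative": 3.0, "coherent": 2.5, "attributive": 3.0, "average_token_usage": 2000.0},
    "prompt_2": {"informative": 3.5, "coherent": 3.0, "attributive": 2.5, "average_token_usage": 1500.0},
}
maximize = {"informative": True, "coherent": True, "attributive": True, "average_token_usage": False}
weights = {"informative": 0.33, "coherent": 0.33, "attributive": 0.33, "average_token_usage": 0.0}

totals = weighted_scores(results, maximize, weights)
best = max(totals, key=totals.get)
print(best, totals)
```

Because `average_token_usage` has weight 0 here, the cheaper prompt gets no credit for its lower token count; raising that weight is how you would trade quality against cost.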

### Improver Configs
YiVal means evaluate and enhance!

The enhancer (called the improver in the config) is the enhancement half of YiVal and a key part of the framework.

We have implemented several cutting-edge enhancement algorithms in YiVal.

In this demo, we will use the [opro_enhancer](https://github.com/YiVal/YiVal/blob/master/src/yival/combination_improvers/optimize_by_prompt_improver.py#L153), which is based on OPRO (Optimization by PROmpting), a method published by the DeepMind team.

<img width="579" alt="opro" src="https://github.com/crazycth/pictures/assets/55043304/b2589368-caca-4e8a-af5f-f2bcee70d89c">


In the OPRO algorithm, the LLM receives a meta-prompt as input and generates new solutions to the objective function. The new solutions and their scores are then added to the meta-prompt for the next step.

YiVal's architecture is perfectly suited for this iterative approach, requiring only a simple configuration file to achieve powerful enhancement.😁

You can find our enhancer config below:
```yaml
improver:
  name: "optimize_by_prompt_improver"
  model_name: "gpt-4"
  max_iterations: 2
  improve_var: ["CoD_prompt_var_generation"]
  head_meta_instruction: |-

    Your mission is to build a clean, precise instruction prompt for GPT-4. This prompt will guide GPT-4 to step-by-step generate increasingly concise, entity-dense summaries of the `{ARTICLE}` as required.

    I already have some prompts and their evaluation results:

  end_meta_instruction: |-

    Give me a new prompt that is different from all pairs above, and has an evaluation value higher than any of the above.
```

Fields in the configuration:
- **name**: Specifies the identifier or the class of the improver. In this instance, `optimize_by_prompt_improver` is utilized.
- **max_iterations**: Designates the upper limit for the number of improvement cycles. The process will not exceed 2 iterations, irrespective of other conditions.
- **model_name**: Indicates the model to be utilized for the improvement process, which here is `gpt-4`.
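
The roles of `head_meta_instruction`, `end_meta_instruction` and `max_iterations` can be sketched as a small optimisation loop. This is an illustrative outline under stated assumptions, not the code of `optimize_by_prompt_improver`: `build_meta_prompt`, `opro_loop`, `propose` and `evaluate` are hypothetical, the stubs replace the LLM and the evaluators, and listing earlier prompts in ascending score order follows the convention described for OPRO.

```python
from typing import Callable, List, Tuple

ScoredPrompt = Tuple[str, float]


def build_meta_prompt(head: str, scored: List[ScoredPrompt], end: str, keep: int = 5) -> str:
    """Assemble head instruction + (prompt, score) pairs + end instruction."""
    ranked = sorted(scored, key=lambda pair: pair[1])[-keep:]  # best prompts last
    pairs = "\n\n".join(f"Prompt: {text}\nScore: {score:.2f}" for text, score in ranked)
    return f"{head}\n\n{pairs}\n\n{end}"


def opro_loop(
    scored: List[ScoredPrompt],
    propose: Callable[[str], str],
    evaluate: Callable[[str], float],
    head: str,
    end: str,
    max_iterations: int = 2,
) -> ScoredPrompt:
    """Each iteration: build meta-prompt, ask for a new prompt, score it, record it."""
    history = list(scored)
    for _ in range(max_iterations):
        meta_prompt = build_meta_prompt(head, history, end)
        candidate = propose(meta_prompt)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda pair: pair[1])


# --- Stubs standing in for the LLM and the evaluators (assumptions) ---
def propose(meta_prompt: str) -> str:
    return f"candidate #{meta_prompt.count('Prompt:') + 1}"


def evaluate(prompt: str) -> float:
    return 2.0 + int(prompt.split("#")[1]) / 10 if "#" in prompt else 2.0


head = "I already have some prompts and their evaluation results:"
end = "Give me a new prompt that scores higher than any of the above."
best = opro_loop([("baseline prompt", 2.0)], propose, evaluate, head, end, max_iterations=2)
print(best)
```

With `max_iterations: 2`, the loop proposes and scores two new prompts; in the real pipeline the score comes from rerunning the experiment and evaluators on each candidate.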

## Full Configuration
Based on the above configurations, we will write the final overall configuration into the config_cod.yml file.

In [None]:
code = '''
custom_function: demo.summarize.summarize
description: CoD Experiment Config
dataset:
  file_path: https://datasets-server.huggingface.co/rows?dataset=griffin%2Fchain_of_density&config=annotated&split=test
  reader: huggingface_dataset_reader
  source_type: dataset
  reader_config:
    example_limit: 3
    output_mapping:
      article: article

human_rating_configs:
  - name: quality
    instructions: Rate the quality of the generated summary.
    scale: [0, 4]

variations:
  - name: CoD_prompt_var_generation
    generator_name: openai_prompt_based_variation_generator
    generator_config:
      model_name: gpt-4
      diversify: false
      max_tokens: 7000
      number_of_variations: 5
      variables:
        - ARTICLE
      prompt:
        - content: |-

            Your mission is to build a clean, precise instruction prompt for GPT-4. This prompt will guide GPT-4 to step-by-step generate increasingly concise, entity-dense summaries of the `{ARTICLE}` as required.

            The key of your instruction is that it should prompt GPT-4 to repeat the following 2 steps 5 times:
            - Step 1: Identify 1-3 informative Entities (";" delimited) from the `{ARTICLE}` which are missing from the previously generated summary.
            - Step 2: Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

            A Missing Entity is:
            - Relevant: to the main story.
            - Specific: descriptive yet concise (5 words or fewer).
            - Novel: not in the previous summary.
            - Faithful: present in the `{ARTICLE}`
            - Anywhere: located anywhere in the `{ARTICLE}`

            Craft your prompt to make sure each summary uses the exact same number of words.
            The instruction should ensure that the GPT-4 will answer the final generated summary after repeating the steps 5 times.
            Keep your output crisp: only the prompt, devoid of any extraneous content.

          role: system
        - content: |-

            {ARTICLE} represents the text of the article.

          role: user
      output_path: generated_cod_prompts.pkl

      # Variations allow for dynamic content during experiments.
      # They are identified by a globally unique name. For example, in your code,
      # you might reference a variation by its name, like:
      # variation = StringWrapper("hello", 'test_experiment')
      # In this config, you would define the variations associated with that name

evaluators:

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: informative
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - How informative is the summary? A good summary should be strongly relevant to the content of the article, and captures the important information in the article.
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the informative quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: coherent
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - How coherent is the summary? A good summary should be linguistically fluent and free of grammatical errors, well-structured and well-organized.
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the coherent quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4

  - evaluator_type: individual
    metric_calculators:
      - method: AVERAGE
    name: openai_prompt_based_evaluator
    display_name: attributive
    prompt: |-

      You are assessing a submitted answer on a given task based on a specific criterion. Here is the data:
      - Task: Given an article, generate a concise, entity-dense summary of identical length.
      - Please act like a reader who has read the article. Consider the following criterion to assess the quality of the summary generated by GPT-4:
        - Is all the information in the summary fully attributable to the Article?
      [Input]: {article}
      [Result]: {raw_output}
      Follow the criterion, evaluate the quality of the generated summary strictly:
      A It fails to meet the criterion at all.
      B It somewhat meets the criterion, but there is significant room for improvement.
      C It meets the criterion to a satisfactory degree.
      D It meets the criterion very well.
      E It meets the criterion exceptionally well, with little to no room for improvement.
    choices: ["A", "B", "C", "D", "E"]
    model_name: gpt-4
    description: "Evaluate the attributive quality of the generated summary."
    scale_description: "0-4"
    choice_scores:
      A: 0
      B: 1
      C: 2
      D: 3
      E: 4

selection_strategy:
  ahp_selection:
    criteria:
      - "openai_prompt_based_evaluator: informative"
      - "openai_prompt_based_evaluator: coherent"
      - "openai_prompt_based_evaluator: attributive"
      - average_token_usage
      - average_latency
    criteria_maximization:
      "openai_prompt_based_evaluator: informative": true
      "openai_prompt_based_evaluator: coherent": true
      "openai_prompt_based_evaluator: attributive": true
      average_latency: false
      average_token_usage: false
    criteria_weights:
      "openai_prompt_based_evaluator: informative": 0.33
      "openai_prompt_based_evaluator: coherent": 0.33
      "openai_prompt_based_evaluator: attributive": 0.33
      average_latency: 0.0
      average_token_usage: 0.0

improver:
  name: "optimize_by_prompt_improver"
  model_name: "gpt-4"
  max_iterations: 2
  improve_var: ["CoD_prompt_var_generation"]
  head_meta_instruction: |-

    Your mission is to build a clean, precise instruction prompt for GPT-4. This prompt will guide GPT-4 to generate increasingly concise, entity-dense summaries of the article as required.

    I already have some prompts and their evaluation results:

  end_meta_instruction: |-

    Give me a new prompt that is different from all pairs above, and has an evaluation value higher than any of the above.

'''

with open('/content/YiVal/demo/configs/config_cod.yml', 'w') as file:
    file.write(code)

# Configure your Ngrok token
Our current ngrok authtoken only supports one public session at a time. If it's being used by others or if you're using it to run multiple Colabs at once, you might bump into a Network error. To avoid this, we suggest getting your own ngrok authtoken for your Colab notebooks. It's easy and free to get your own authtoken from ngrok.

Here's how to do it:
- If you don't have a ngrok account yet, head over to https://dashboard.ngrok.com/login to sign up.
- Once you're logged in, you can grab your authtoken at https://dashboard.ngrok.com/get-started/your-authtoken.

Before starting a new demo, make sure that all other applications using ngrok within Colab have been terminated via `Connect -> Manage Sessions`. You can check and manage your sessions as shown in the picture below.

<img src="https://github.com/uni-zhuan/uni_CDN/blob/master/picture/Yival/iShot_2023-10-12_22.51.49.png?raw=true" width="80%" height="50%">

In [None]:
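!pip install pyngrok
from pyngrok import ngrok

# Register your own authtoken first so that the tunnel below uses it
!poetry run ngrok config add-authtoken {your ngrok authtoken}

# Tell YiVal to use ngrok, then open a tunnel to the web UI port
os.environ['ngrok']='true'
public_url = ngrok.connect(addr = 8501)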

# Yival!

Access YiVal from the NgrokTunnel public URL and check the results!

In [None]:
!poetry run yival run /content/YiVal/demo/configs/config_cod.yml

# Results Example
After launching YiVal on the web, you can easily check and analyze your experiment results, including the detailed results, as shown in the following pictures.

## Improvements by YiVal
With YiVal's ability to streamline the evaluation and enhancement of AIGC:
* Using **3** articles for evaluation, summarized by GPT-4, after two rounds of enhancement:
  - The coherent score increased by **20.03%**! 🚀
  - The attributive score increased by **25.18%**! 🚀
  - The average token usage dropped from **2054.6** to **1473.4** (**-28.3%**)!
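
The relative changes above follow the usual percent-change formula. As a quick check of the token-usage figure (the two score percentages cannot be re-derived here because their before and after values are not listed), using a hypothetical helper `percent_change`:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


token_change = percent_change(2054.6, 1473.4)
print(f"{token_change:.1f}%")  # -28.3%
```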

## Test Results
Here, we use 3 news articles to obtain and evaluate responses. Here are the test results from the improver:

![image](https://user-images.githubusercontent.com/71804564/277123610-a8414c08-9d29-497e-ae2b-58c987c1ff17.png)

## Interactive mode
Moreover, we offer an interactive mode that allows you to generate summaries and evaluation results based on user input parameters and combinations. This gives you the freedom to explore new ideas using prompts that have been evaluated and enhanced.

**Here are results that our generation bot created in YiVal's interactive mode!** 😆
![image](https://user-images.githubusercontent.com/71804564/277114832-391da628-9fb6-45c6-9376-b776cbbe943a.png)