# Colab tutorial notebook for LLMFlowOptimizer
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Yongtae723/LLMFlowOptimizer/blob/main/notebooks/LLMFlowOptimizer_tutorial_notebook.ipynb)

LLMFlowOptimizer treats the components of an LLM flow as parameters and optimizes them.


![concept_image](https://github.com/Yongtae723/LLMFlowOptimizer/blob/main/documents/image/concept.png?raw=true)

LLMFlowOptimizer is designed to be used at repository scale, but this notebook lets you try out its workflow and concepts.

In [None]:
# clone git repo
!git clone https://github.com/Yongtae723/LLMFlowOptimizer.git
%cd LLMFlowOptimizer

In [None]:
# install dependencies
!pip install poetry
!poetry config virtualenvs.in-project true
!poetry install --no-ansi

import sys

VENV_PATH = "/content/LLMFlowOptimizer/.venv/lib/python3.10/site-packages"
sys.path.insert(0, VENV_PATH)
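
The `VENV_PATH` above hardcodes `python3.10`, which only works when the virtualenv was built with Python 3.10. As a minimal sketch, assuming the Poetry virtualenv uses the same Python version as the interpreter running the notebook, the path can be derived from the running interpreter instead. The helper name `venv_site_packages` is hypothetical and not part of LLMFlowOptimizer.

```python
import sys
from pathlib import Path


def venv_site_packages(project_root: str) -> str:
    """Build the site-packages path of an in-project `.venv` (hypothetical helper).

    Assumes the virtualenv uses the same major.minor version as this interpreter.
    """
    version_dir = f"python{sys.version_info.major}.{sys.version_info.minor}"
    return str(Path(project_root) / ".venv" / "lib" / version_dir / "site-packages")


venv_path = venv_site_packages("/content/LLMFlowOptimizer")
print(venv_path)
# To use it in the notebook: sys.path.insert(0, venv_path)
```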

In [None]:
# replace dummy values with your own
import os

os.environ["OPENAI_API_KEY"] = "dummy"
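
Hardcoding a real key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you are willing to type the key interactively: read it with `getpass` only when the variable is not already set. `ensure_api_key` and its `prompt` parameter are hypothetical names for this illustration.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_api_key(name: str = "OPENAI_API_KEY", prompt: Callable[[str], str] = getpass) -> None:
    """Set the environment variable `name` from a hidden prompt if it is missing or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")


# In the notebook you would call: ensure_api_key()
# Here a stand-in prompt is used so the example runs without user input.
ensure_api_key("EXAMPLE_API_KEY", prompt=lambda message: "dummy")
print(os.environ["EXAMPLE_API_KEY"])
```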

# Step 1. Model definition
This section corresponds to [Step 1 : Define model architect and config.](https://github.com/Yongtae723/LLMFlowOptimizer#step-1--define-model-architect-and-config) in the README.

First, create a class that specifies your model structure.
The code below is an example model definition, but you can edit it as you like.

Note that the arguments of `__init__` can be treated as hyperparameters and optimized afterward.

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains.base import Chain
from langchain.document_loaders import TextLoader
from langchain.indexes.vectorstore import VectorstoreIndexCreator
from langchain.schema.embeddings import Embeddings
from langchain.schema.language_model import BaseLanguageModel
from langchain.text_splitter import TextSplitter


class SampleQA:
    """Define the flow of the model to be adjusted."""

    def __init__(
        self,
        data_path: str,
        embedding: Embeddings,
        text_splitter: TextSplitter,
        llm: BaseLanguageModel,
    ) -> None:
        """Take the elements the LLM flow needs.

        These arguments are treated as hyperparameters and optimized. Their values
        are defined in `configs/model/sample.yaml`.
        """
        self.embedding = embedding
        self.text_splitter = text_splitter
        self.text_loader = TextLoader(data_path)
        self.llm = llm
        self.index = VectorstoreIndexCreator(
            embedding=self.embedding, text_splitter=self.text_splitter
        ).from_loaders([self.text_loader])

        self.chain = RetrievalQA.from_chain_type(
            self.llm,
            retriever=self.index.vectorstore.as_retriever(),
            return_source_documents=True,
        )

    def __call__(self, question: str) -> dict:
        """Answer the question; returns a dict with `result` and `source_documents`."""
        return self.chain(question)

    def get_chain(self) -> Chain:
        """Get langchain chain."""
        return self.chain
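
Because the optimizer tunes the arguments of `__init__`, it can help to list them before writing any config. A minimal sketch, using a hypothetical `ToyFlow` class with plain-Python arguments in place of `SampleQA` so that it runs without LangChain; `tunable_arguments` is likewise a made-up helper.

```python
import inspect
from typing import List


class ToyFlow:
    """Hypothetical stand-in for SampleQA with plain-Python arguments."""

    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 200,
        model_name: str = "gpt-3.5-turbo",
    ) -> None:
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.model_name = model_name


def tunable_arguments(cls: type) -> List[str]:
    """Return the names of the `__init__` arguments, excluding `self`."""
    parameters = inspect.signature(cls.__init__).parameters
    return [name for name in parameters if name != "self"]


print(tunable_arguments(ToyFlow))  # ['chunk_size', 'chunk_overlap', 'model_name']
```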

### Define components in Python
To make this easier to follow, let's start with actual Python code and see what happens.

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

data_path = "data/reference/nyc_wikipedia.txt"
embedding = OpenAIEmbeddings()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

model_class = SampleQA(
    data_path=data_path,
    embedding=embedding,
    text_splitter=text_splitter,
    llm=llm,
)

print(model_class("What is the population of New York City?")["result"])

In [None]:
# check the model
print(f"Embedding: {model_class.embedding.__class__.__name__}")
print(f"LLM: {model_class.llm.__class__.__name__}")
print(f"Text Splitter: {model_class.text_splitter.__class__.__name__}")

### Define components with YAML
Did it work? Awesome!
Now let's do the same thing with a YAML file.

First, copy and paste your model into `/content/LLMFlowOptimizer/llmflowoptimizer/component/model/sample_qa.py`.

Hydra can then load the class based on the information in `/content/LLMFlowOptimizer/configs/model/default.yaml`.

The default YAML file looks like this:
```yaml
defaults:
  - _self_
  - embedding: OpenAI
  - text_splitter: RecursiveCharacter
  - llm: OpenAI

_target_: llmflowoptimizer.component.model.sample_qa.SampleQA

data_path: ${paths.reference_data_dir}/nyc_wikipedia.txt

```

This means that Hydra loads the `llmflowoptimizer.component.model.sample_qa.SampleQA` class, and that the components passed to `__init__` are defined in subfolders of the same folder.

For example, the `GPTTurbo` option for `llm` is defined in `/content/LLMFlowOptimizer/configs/model/llm/GPTTurbo.yaml`:

```yaml
_target_: langchain.chat_models.ChatOpenAI
model_name: gpt-3.5-turbo
temperature: 0
```

This means that the `langchain.chat_models.ChatOpenAI` class is loaded with the arguments `model_name` and `temperature`.
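
The idea behind `_target_` can be sketched in plain Python: import the dotted path, then call it with the remaining keys as keyword arguments, building nested configs first. This is a simplified illustration, not Hydra's implementation (`hydra.utils.instantiate` handles much more), and `instantiate_from_target` is a hypothetical name. Standard-library classes stand in for `SampleQA` and its components so the example runs without LangChain.

```python
import importlib
from typing import Any, Dict


def instantiate_from_target(config: Dict[str, Any]) -> Any:
    """Build the object named by `_target_`, passing the other keys as kwargs."""
    kwargs = dict(config)  # copy so the caller's config is not modified
    module_path, _, attribute = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attribute)
    # Nested dicts that carry their own `_target_` are built first.
    for key, value in kwargs.items():
        if isinstance(value, dict) and "_target_" in value:
            kwargs[key] = instantiate_from_target(value)
    return target(**kwargs)


# The outer object plays the role of the model, the nested ones its components.
config = {
    "_target_": "datetime.datetime",
    "year": 2023,
    "month": 10,
    "day": 1,
    "tzinfo": {
        "_target_": "datetime.timezone",
        "offset": {"_target_": "datetime.timedelta", "hours": 9},
    },
}
moment = instantiate_from_target(config)
print(moment.isoformat())  # 2023-10-01T00:00:00+09:00
```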

In [None]:
import hydra
import rootutils
from hydra import compose, initialize
from hydra.core.hydra_config import HydraConfig
from omegaconf import open_dict


def load_hydra_config():
    with initialize(version_base="1.3", config_path="configs"):
        cfg = compose(config_name="run.yaml", return_hydra_config=True, overrides=[])
        with open_dict(cfg):
            cfg.paths.root = str(rootutils.find_root(indicator=".project-root"))
    HydraConfig().set_config(cfg)
    return cfg


cfg = load_hydra_config()
for key, value in cfg.model.items():
    print(f"{key}: {value}")

In [None]:
model_class = hydra.utils.instantiate(cfg.model)
print(model_class("What is the population of New York City?")["result"])

In [None]:
# check the model
print(f"Embedding: {model_class.embedding.__class__.__name__}")
print(f"LLM: {model_class.llm.__class__.__name__}")
print(f"Text Splitter: {model_class.text_splitter.__class__.__name__}")

# Step 2. Define evaluation 
After defining your model, create an evaluation class.

The return value of `evaluate` is used as the score, and the components are optimized to maximize or minimize it.
The `evaluate` method takes the `model_class` you defined in the previous section.


The following cell is an example evaluation class, which is written in `/content/LLMFlowOptimizer/llmflowoptimizer/component/evaluation/sample.py`.


In [None]:
import json
from typing import Any

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy


class Evaluation:
    """Define the evaluation system.

    llmflowoptimizer optimizes the hyperparameters of the model.
    The return value of `evaluate` is used as the score, and the components are optimized to maximize or minimize it.
    """

    def __init__(
        self,
        eval_dataset_path: str,
    ):
        with open(eval_dataset_path) as f:
            self.eval_data = json.load(f)

    def evaluate(
        self,
        model: Any,  # this model should be defined in llmflowoptimizer/component/model/sample_qa.py
    ):
        # simple evaluation using ragas
        evaluation_dataset = {
            "question": [],
            "answer": [],
            "contexts": [],
            "ground_truths": [],
        }
        for data in self.eval_data:
            output = model(data["question"])
            evaluation_dataset["question"].append(data["question"])
            evaluation_dataset["answer"].append(output["result"])
            evaluation_dataset["contexts"].append(
                [document.page_content for document in output["source_documents"]]
            )
            evaluation_dataset["ground_truths"].append([data["ground_truth"]])
        evaluation_dataset = Dataset.from_dict(evaluation_dataset)

        result = evaluate(evaluation_dataset, metrics=[answer_relevancy])

        return result["answer_relevancy"]

In [None]:
eval_dataset_path = "data/evaluation/NY_eval_data.json"
evaluator = Evaluation(eval_dataset_path=eval_dataset_path)
res = evaluator.evaluate(model_class)

print(f"Ragas Score: {res}")
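
As described above, `evaluate` takes a model and returns a score. A minimal sketch of that contract, with a simple containment check swapped in for the ragas `answer_relevancy` metric and a stub in place of the real chain, so it runs without an API key. `ContainmentEvaluation`, `stub_model`, and the two sample records are all made up for this illustration.

```python
from typing import Any, Callable, Dict, List


class ContainmentEvaluation:
    """Score a model by how often its answer contains the ground truth (toy metric)."""

    def __init__(self, eval_data: List[Dict[str, str]]) -> None:
        self.eval_data = eval_data

    def evaluate(self, model: Callable[[str], Dict[str, Any]]) -> float:
        hits = 0
        for data in self.eval_data:
            output = model(data["question"])
            # Case-insensitive check that the expected answer appears in the output.
            if data["ground_truth"].lower() in output["result"].lower():
                hits += 1
        return hits / len(self.eval_data)


def stub_model(question: str) -> Dict[str, Any]:
    """Stand-in for SampleQA: returns the same keys the evaluation code reads."""
    canned = {"What is the capital of France?": "The capital of France is Paris."}
    return {"result": canned.get(question, "I don't know."), "source_documents": []}


eval_data = [
    {"question": "What is the capital of France?", "ground_truth": "Paris"},
    {"question": "What is the capital of Japan?", "ground_truth": "Tokyo"},
]
score = ContainmentEvaluation(eval_data).evaluate(stub_model)
print(f"Score: {score}")  # Score: 0.5
```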

You can also load this evaluation class from a YAML file.

Save the evaluator to `/content/LLMFlowOptimizer/llmflowoptimizer/component/evaluation/sample.py`, and you can then specify it in a YAML file.
The default YAML file looks like this:

```yaml
_target_: llmflowoptimizer.component.evaluation.sample.Evaluation
eval_dataset_path: ${paths.evaluation_data_dir}/NY_eval_data.json
```


In [None]:
cfg = load_hydra_config()
evaluator = hydra.utils.instantiate(cfg.evaluation)

In [None]:
res = evaluator.evaluate(model_class)

print(f"Score: {res}")

You can check the model build and evaluation flow with the following command.

In [None]:
!poetry run python llmflowoptimizer/run.py

# Step 3. Hyperparameter optimization


In each folder under `/content/LLMFlowOptimizer/configs/model/`, add a YAML file for every option you want to include in the hyperparameter search.

You can then specify the search range in a YAML file.
An example looks like this:

```yaml
model/text_splitter: choice(RecursiveCharacter, CharacterTextSplitter)
model.text_splitter.chunk_size: range(500, 1500, 100)
model/llm: choice(OpenAI, GPTTurbo, GPT4)
```

This example is part of `configs/hparams_search/optuna.yaml`. It tells the optimizer to choose `RecursiveCharacter` or `CharacterTextSplitter` for the `model.text_splitter` component, a `chunk_size` between 500 and 1500 in steps of 100, and `OpenAI`, `GPTTurbo`, or `GPT4` for the `model.llm` component.
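
To make the search space concrete, the sketch below samples one trial from it and renders the result as Hydra command-line overrides. It uses uniform random sampling from the standard library rather than Optuna's samplers, and it assumes the upper bound of `range(500, 1500, 100)` is included, which may differ from how the sweeper treats it. `SEARCH_SPACE`, `sample_trial`, and `to_overrides` are hypothetical names.

```python
import random
from typing import Any, Dict, List

# Candidate values per key, mirroring the YAML snippet above.
# Assumption: the upper bound 1500 is treated as a valid chunk size.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "model/text_splitter": ["RecursiveCharacter", "CharacterTextSplitter"],
    "model.text_splitter.chunk_size": list(range(500, 1501, 100)),
    "model/llm": ["OpenAI", "GPTTurbo", "GPT4"],
}


def sample_trial(rng: random.Random) -> Dict[str, Any]:
    """Pick one value per key uniformly at random (not Optuna's strategy)."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def to_overrides(trial: Dict[str, Any]) -> List[str]:
    """Render a trial as `key=value` strings in Hydra's override syntax."""
    return [f"{key}={value}" for key, value in trial.items()]


trial = sample_trial(random.Random(0))
print("poetry run python llmflowoptimizer/run.py " + " ".join(to_overrides(trial)))
```

A group choice such as `model/llm=GPT4` swaps in a whole YAML file, while a dotted key such as `model.text_splitter.chunk_size=700` changes a single value inside the selected config.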

More complicated search ranges can be defined in Python, as in `configs/hparams_search/custom-search-space-objective.py`.

You can start the hyperparameter search with the following command.

```bash
poetry run python llmflowoptimizer/run.py hparams_search=optuna
```

In [None]:
!poetry run python llmflowoptimizer/run.py hparams_search=optuna

You can then find the best parameters in `logs/initial_task/multiruns/{timestamp}/optimization_results.yaml`.

# If you like this project💖

If you like this project, you can use [this project](https://github.com/Yongtae723/LLMFlowOptimizer/blob/main/notebooks/tutorial_notebook.ipynb) as a template for your own project.
Click the `Use this template` button at the top of [this repo](https://github.com/Yongtae723/LLMFlowOptimizer/blob/main/notebooks/tutorial_notebook.ipynb) to create your own project based on it.