<a href="https://colab.research.google.com/github/auto-res/researchgraph/blob/feature%2F%2388-python-package-update/examples/ai_integrator_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Integrator Adjustment Notebook

In [1]:
%%capture
%pip install unsloth "xformers==0.0.28.post2"
%pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
%pip install researchgraph -U
%pip install transformers==4.46.2

In [2]:
%pip show researchgraph

Name: researchgraph
Version: 0.0.64
Summary: Add your description here
Home-page: https://www.autores.one/japanese
Author: 
Author-email: Toma Tanaka <ulti4929@gmail.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: aider-chat, arxiv, jinja2, langchain, langchain-community, langgraph, litellm, llmlinks, openai, pandas, pyalex, pydantic, pypdf, semanticscholar, setuptools, tomli, tomli-w
Required-by: 


In [3]:
from google.colab import drive
drive.mount('/content/drive')

import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

Mounted at /content/drive


In [4]:
import os
import logging
from IPython.display import Image
from typing_extensions import TypedDict
from langgraph.graph import StateGraph

from researchgraph.core.factory import NodeFactory
from researchgraph.graphs.ai_integrator.ai_integrator_v1 import ai_integratorv1_setting

node_names = [
    "structuredoutput_llmnode",
    "retrieve_arxiv_text_node",
    "retrieve_github_repository_node",
    "text2script_node",
    "llmsfttrain_node",
    "llminference_node",
    "llmevaluate_node",
]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Implementation

### StructuredLLMNode Prompt

In [9]:
extractor_prompt_template = """
You are a researcher working on machine learning.
The following <paper_text> tags enclose the full text data of the paper.
Please extract the explanation of the method introduced in the given paper.
<paper_text>
{{paper_text}}
</paper_text>
"""

codeextractor_prompt_template = """
<RULE>
You are a researcher working on machine learning.
- Tag Descriptions
    - The text enclosed within the <add_method_text> tag contains an explanation of a method extracted from a machine learning paper.
    - The text enclosed within the <folder_structure> tag shows the folder structure of the corresponding GitHub repository for the paper.
    - The text enclosed within the <github_file> tag contains the code from Python files in the corresponding GitHub repository.
- Instructions for Extracting Python Code
    - Extract the relevant sections of Python code from the content enclosed within the <github_file> tag based on the method described in the <add_method_text> tag.
    - Use the folder structure provided within the <folder_structure> tag as a reference when extracting the code.
    - Please extract any code that seems to be related.
    - If no corresponding code exists, output "No corresponding code exists."
</RULE>
<add_method_text>
{{add_method_text}}
</add_method_text>
<folder_structure>
{{folder_structure}}
</folder_structure>
<github_file>
{{github_file}}
</github_file>
<EOS></EOS>"""


creator_prompt_template = """
<RULE>
You are a researcher working on machine learning.
Please check the descriptions of the tags listed in Tag Descriptions and follow the instructions.
- Tag Descriptions
    - The text enclosed within the <objective> tag indicates the objective of the research being undertaken.
    - The text enclosed within the <add_method_text> tag contains an explanation of a method extracted from a machine learning paper.
    - The text enclosed within the <add_method_code> tag contains the code for that method, extracted from the paper's GitHub repository.
    - The text enclosed within the <base_method_text> tag provides a description of the base method.
    - The text enclosed within the <base_method_code> tag contains the code of the base method.
- Please follow the rules below to output the code and description of the new method.
    - Please apply the code enclosed in the <add_method_code> tag to the code enclosed in the <base_method_code> tag to generate a new method.
    - Please generate a method that is considered to be novel.
    - Please make sure that the new method stays consistent with the objective enclosed in the <objective> tag.
    - When creating a new method, please also consider the description of the method enclosed in the <add_method_text> tag and the description enclosed in the <base_method_text> tag.
    - Please output the new method you have created as new_method_text.
    - Please output the new code you have created as new_method_code.
    - The output of new_method_code must follow the template enclosed in the <method_template> tag.
</RULE>
<objective>
{{objective}}
</objective>
<add_method_text>
{{add_method_text}}
</add_method_text>
<add_method_code>
{{add_method_code}}
</add_method_code>
<base_method_text>
{{base_method_text}}
</base_method_text>
<base_method_code>
{{base_method_code}}
</base_method_code>
<method_template>
{{method_template}}
</method_template>
<EOS></EOS>"""
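
The three templates above use `{{name}}` placeholders that are filled from the graph state. The following is a minimal sketch of that substitution using only the standard library. The helper `render_prompt` is hypothetical and is not part of researchgraph; the package lists `jinja2` among its requirements, so the real node probably renders with Jinja2, but that is an assumption.

```python
import re

# Hypothetical stand-in for the template rendering done inside the LLM node.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, state: dict) -> str:
    """Replace each {{key}} placeholder with the matching value from state."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in state:
            # Fail loudly instead of sending an unfilled placeholder to the LLM.
            raise KeyError(f"state is missing the prompt variable '{key}'")
        return str(state[key])

    return PLACEHOLDER.sub(substitute, template)


example_template = """
<paper_text>
{{paper_text}}
</paper_text>
"""
rendered = render_prompt(example_template, {"paper_text": "We propose ..."})
print(rendered)
```

Raising on a missing key surfaces a mismatch between a template and its node's `input_key` list before any API call is made.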

### ResearchGraph

In [10]:
class State(TypedDict):
    objective: str
    method_template: str
    base_method_text: str
    base_method_code: str
    llm_script: str
    index: int
    arxiv_url: str
    github_url: str
    folder_structure: str
    github_file: str
    add_method_code: str
    paper_text: str
    add_method_text: str
    new_method_code: list
    new_method_text: list
    script_save_path: str
    model_save_path: str
    result_save_path: str
    accuracy: str


class AIIntegratorv1:
    def __init__(
        self,
        llm_name: str,
        save_dir: str,
        new_method_file_name: str,
        ft_model_name: str,
        dataset_name: str,
        model_save_dir_name: str,
        result_save_file_name: str,
        answer_data_path: str,
        num_train_data: int | None = None,
        num_inference_data: int | None = None,
    ):
        self.llm_name = llm_name
        self.save_dir = save_dir
        self.new_method_file_name = new_method_file_name
        self.ft_model_name = ft_model_name
        self.dataset_name = dataset_name
        self.model_save_dir_name = model_save_dir_name
        self.result_save_file_name = result_save_file_name
        self.answer_data_path = answer_data_path
        self.num_train_data = num_train_data
        self.num_inference_data = num_inference_data

        # Create the output directory if it does not exist yet.
        os.makedirs(self.save_dir, exist_ok=True)
        self.graph_builder = StateGraph(State)

        self.graph_builder.add_node(
            "githubretriever",
            NodeFactory.create_node(
                node_name="retrieve_github_repository_node",
                save_dir=self.save_dir,
                input_key=["github_url"],
                output_key=["folder_structure", "github_file"],
            )
        )
        self.graph_builder.add_node(
            "arxivretriever",
            NodeFactory.create_node(
                node_name = "retrieve_arxiv_text_node",
                save_dir=self.save_dir,
                input_key=["arxiv_url"],
                output_key=["paper_text"],
            )
        )
        self.graph_builder.add_node(
            "extractor",
            NodeFactory.create_node(
                node_name = "structuredoutput_llmnode",
                input_key=["paper_text"],
                output_key=["add_method_text"],
                llm_name=llm_name,
                prompt_template=extractor_prompt_template,
            )
        )
        self.graph_builder.add_node(
            "codeextractor",
            NodeFactory.create_node(
                node_name = "structuredoutput_llmnode",
                input_key=["add_method_text", "folder_structure", "github_file"],
                output_key=["add_method_code"],
                llm_name=llm_name,
                prompt_template=codeextractor_prompt_template,
            )
        )
        self.graph_builder.add_node(
            "creator",
            NodeFactory.create_node(
                node_name = "structuredoutput_llmnode",
                input_key=[
                    "objective",
                    "add_method_text",
                    "add_method_code",
                    "base_method_text",
                    "base_method_code",
                    "method_template",
                ],
                output_key=["new_method_text", "new_method_code"],
                llm_name=llm_name,
                prompt_template=creator_prompt_template,
            )
        )
        self.graph_builder.add_node(
            "text2script",
            NodeFactory.create_node(
                node_name = "text2script_node",
                input_key=["new_method_code"],
                output_key=["script_save_path"],
                save_file_path=os.path.join(self.save_dir, self.new_method_file_name),
            )
        )
        self.graph_builder.add_node(
            "llmsfttrainer",
            NodeFactory.create_node(
                node_name = "llmsfttrain_node",
                model_name=self.ft_model_name,
                dataset_name=self.dataset_name,
                num_train_data=self.num_train_data,
                model_save_path=os.path.join(self.save_dir, self.model_save_dir_name),
                lora=True,
                input_key=["script_save_path"],
                output_key=["model_save_path"],
            )
        )
        self.graph_builder.add_node(
            "llminferencer",
            NodeFactory.create_node(
                node_name = "llminference_node",
                input_key=["model_save_path"],
                output_key=["result_save_path"],
                dataset_name=self.dataset_name,
                num_inference_data=self.num_inference_data,
                result_save_path=os.path.join(
                    self.save_dir, self.result_save_file_name
                ),
            )
        )
        self.graph_builder.add_node(
            "llmevaluater",
            NodeFactory.create_node(
                node_name = "llmevaluate_node",
                input_key=["result_save_path"],
                output_key=["accuracy"],
                answer_data_path=self.answer_data_path,
            )
        )

        # make edges
        self.graph_builder.add_edge("arxivretriever", "githubretriever")
        self.graph_builder.add_edge("arxivretriever", "extractor")
        self.graph_builder.add_edge(["githubretriever", "extractor"], "codeextractor")
        self.graph_builder.add_edge("codeextractor", "creator")
        self.graph_builder.add_edge("creator", "text2script")
        self.graph_builder.add_edge("text2script", "llmsfttrainer")
        self.graph_builder.add_edge("llmsfttrainer", "llminferencer")
        self.graph_builder.add_edge("llminferencer", "llmevaluater")

        # set entry and finish points
        self.graph_builder.set_entry_point("arxivretriever")
        # self.graph_builder.set_finish_point("creator")
        self.graph_builder.set_finish_point("llmevaluater")

        self.graph = self.graph_builder.compile()

    # def __call__(self, state: State, debug: bool = True) -> dict:
    def __call__(self, state: State) -> dict:
        result = self.graph.invoke(state, debug = True)
        return result

    def write_result(self, response: State):
        index = response["index"]
        arxiv_url = response["arxiv_url"]
        add_method_text = response["add_method_text"][0]
        add_method_code = response["add_method_code"][0]
        new_method_text = response["new_method_text"][0]
        new_method_code = response["new_method_code"][0]
        content = (
            f"---Arxiv URL 1---:\n{arxiv_url}\n\n"
            f"---Add Method Text---:\n{add_method_text}\n\n"
            f"---Add Method Code---:\n{add_method_code}\n\n"
            f"---New Method Text---:\n{new_method_text}\n\n"
            f"---New Method Code---:\n{new_method_code}\n\n"
        )
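        # Use os.path.join so the file lands inside save_dir even without a trailing slash.
        output_path = os.path.join(self.save_dir, f"ai_integrator_{index}.txt")
        with open(output_path, "w") as f:
            f.write(content)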

    def make_image(self, path: str):
        image = Image(self.graph.get_graph().draw_mermaid_png())
        with open(os.path.join(path, "ai_integrator_graph.png"), "wb") as f:
            f.write(image.data)
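
The edges declared in `__init__` fan out from `arxivretriever` to two branches and join again at `codeextractor`. As a rough illustration of the resulting order, the sketch below feeds the same dependencies to the standard-library `graphlib` module. It does not use LangGraph and does not model its scheduler; it only shows one valid execution order under those edges.

```python
from graphlib import TopologicalSorter

# Map each node to the nodes it waits on, copied from the add_edge calls.
dependencies = {
    "arxivretriever": set(),
    "githubretriever": {"arxivretriever"},
    "extractor": {"arxivretriever"},
    "codeextractor": {"githubretriever", "extractor"},
    "creator": {"codeextractor"},
    "text2script": {"creator"},
    "llmsfttrainer": {"text2script"},
    "llminferencer": {"llmsfttrainer"},
    "llmevaluater": {"llminferencer"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The relative order of `githubretriever` and `extractor` is not fixed, since neither depends on the other.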

## Setting

In [11]:
llm_name = "gpt-4o-2024-08-06"
save_dir = "/content/drive/MyDrive/AutoRes/ai_integrator/exec-test"
ft_model_name = "unsloth/Meta-Llama-3.1-8B"
dataset_name = "openai/gsm8k"
new_method_file_name = "new_method.py"
model_save_dir_name = "train_model"
result_save_file_name = "pred_file"
answer_data_path = '/content/drive/MyDrive/AutoRes/ai_integrator/gsm8k_answer.csv'
num_train_data = 30
num_inference_data = 30

research_graph = AIIntegratorv1(
    llm_name = llm_name,
    save_dir = save_dir,
    new_method_file_name = new_method_file_name,
    ft_model_name = ft_model_name,
    dataset_name= dataset_name,
    model_save_dir_name = model_save_dir_name,
    result_save_file_name = result_save_file_name,
    answer_data_path = answer_data_path,
    num_train_data = num_train_data,
    num_inference_data = num_inference_data,
)

input: ['github_url']
output: ['folder_structure', 'github_file']
input: ['arxiv_url']
output: ['paper_text']
input: ['paper_text']
output: ['add_method_text']
input: ['add_method_text', 'folder_structure', 'github_file']
output: ['add_method_code']
input: ['objective', 'add_method_text', 'add_method_code', 'base_method_text', 'base_method_code', 'method_template']
output: ['new_method_text', 'new_method_code']
input: ['new_method_code']
output: ['script_save_path']
input: ['script_save_path']
output: ['model_save_path']
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
__reduc

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

input: ['model_save_path']
output: ['result_save_path']
input: ['result_save_path']
output: ['accuracy']
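
The `input`/`output` lists printed above describe how state keys flow between nodes. A small check like the following sketch can derive which keys no node produces and which therefore have to be present in the initial state. The function `required_initial_keys` and the `NODE_IO` table are hypothetical helpers written for this illustration; the key lists are copied from the log, and the node order follows the edges defined in the class.

```python
# (node name, input keys, output keys), listed in a valid execution order.
NODE_IO = [
    ("arxivretriever", ["arxiv_url"], ["paper_text"]),
    ("githubretriever", ["github_url"], ["folder_structure", "github_file"]),
    ("extractor", ["paper_text"], ["add_method_text"]),
    ("codeextractor",
     ["add_method_text", "folder_structure", "github_file"],
     ["add_method_code"]),
    ("creator",
     ["objective", "add_method_text", "add_method_code",
      "base_method_text", "base_method_code", "method_template"],
     ["new_method_text", "new_method_code"]),
    ("text2script", ["new_method_code"], ["script_save_path"]),
    ("llmsfttrainer", ["script_save_path"], ["model_save_path"]),
    ("llminferencer", ["model_save_path"], ["result_save_path"]),
    ("llmevaluater", ["result_save_path"], ["accuracy"]),
]


def required_initial_keys(node_io):
    """Return the input keys that no earlier node produces, in first-use order."""
    produced = set()
    required = []
    for _name, inputs, outputs in node_io:
        for key in inputs:
            if key not in produced and key not in required:
                required.append(key)
        produced.update(outputs)
    return required


print(required_initial_keys(NODE_IO))
```

Comparing this list against the keys of the state passed to the graph is a cheap way to catch a missing setting before the long-running training step.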


## Research Graph Image

In [12]:
image_path = "/content/"
research_graph.make_image(image_path)

## Execution

In [13]:
ai_integratorv1_setting['method_template'] = """
import torch
from typing import Iterable
from torch.optim import Optimizer  # Please do not change this code

class NewOptimizer(Optimizer): # Please do not change the name of the class "NewOptimizer".
    def __init__(self, params: Iterable,...):
        "parameter initialization"

    def step(self, closure: None = None) -> None:
        "processing details"
"""

from pprint import pprint

pprint(ai_integratorv1_setting)

{'arxiv_url': 'https://arxiv.org/abs/1804.00325v3',
 'base_method_code': '\n'
                     'from torch.optim import Optimizer\n'
                     '\n'
                     'class Adam(Optimizer):\n'
                     '    def __init__(self, params: Iterable, lr: float = '
                     '1e-3, beta1: float = 0.9, beta2: float = 0.999, epsilon: '
                     'float = 1e-8):\n'
                     '        defaults = dict(\n'
                     '            lr=lr,\n'
                     '            beta1=beta1,\n'
                     '            beta2=beta2,\n'
                     '            epsilon=epsilon,\n'
                     '            step=0\n'
                     '        )\n'
                     '        super(Adam, self).__init__(params, defaults)\n'
                     '\n'
                     '    def step(self, closure: None = None) -> None:\n'
                     '        for group in self.param_groups:\n'
                    

In [14]:
# ai_integratorv1_setting["arxiv_url"] = "https://arxiv.org/abs/2101.11075v3"
# ai_integratorv1_setting["github_url"] = "https://github.com/facebookresearch/madgrad"

In [17]:
result = research_graph(
    state=ai_integratorv1_setting,
)

Streaming output truncated to the last 5000 lines.
                    "            weight_decay = group['weight_decay']\n"
                    "            betas = group['betas']\n"
                    '            total_mom = float(len(betas))\n'
                    "            for p in group['params']:\n"
                    '                if p.grad is None:\n'
                    '                    continue\n'
                    '                d_p = p.grad.data\n'
                    '                if weight_decay != 0:\n'
                    '                    d_p.add_(weight_decay, p.data)\n'
                    '                param_state = self.state[p]\n'
                    "                if 'momentum_buffer' not in param_state:\n"
                    "                    param_state['momentum_buffer'] = {}\n"
                    '                    for beta in betas:\n'
                    '                        

SyntaxError: invalid syntax (new_method.py, line 60)
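
The run above stops with a `SyntaxError` in the generated `new_method.py`. One way to catch this earlier is to compile the generated source before it is handed to the training step. The sketch below shows such a check; `check_script_syntax` is a hypothetical helper, not a researchgraph function, and where it would be called in the graph (for example between `text2script` and `llmsfttrainer`) is only a suggestion.

```python
def check_script_syntax(source: str, filename: str = "new_method.py"):
    """Return None if the source compiles, otherwise a short error description."""
    try:
        # compile() only parses and compiles the code; it does not execute it.
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{exc.msg} ({filename}, line {exc.lineno})"
    return None


# A well-formed snippet passes; a malformed one yields a message with the line number.
print(check_script_syntax("x = 1\n"))
print(check_script_syntax("def f(:\n    pass\n"))
```

A non-`None` result could be used to ask the LLM for a corrected script instead of starting fine-tuning with code that cannot be imported.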