## ML Paper Tagger 

<a href="https://colab.research.google.com/github/dair-ai/pe-for-llms/blob/main/notebooks/session-4/demo-4.2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# update or install the necessary libraries
!pip install --upgrade openai
!pip install --upgrade langchain
!pip install --upgrade python-dotenv

In [2]:
import openai
import os
import IPython
from langchain.llms import OpenAI
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
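# set the API key; load_dotenv() above reads it from a local .env file
api_key = os.getenv("OPENAI_API_KEY")
assert api_key, "OPENAI_API_KEY is not set"
os.environ["OPENAI_API_KEY"] = api_key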

In [4]:
# create a new LLM (OpenAI was already imported above)
llm = OpenAI(model_name="text-davinci-003")

"\n\nOnce upon a time, there was a scientist named Dr. Smith who had created a time machine. After many years of hard work, he had finally perfected the machine and was ready to take the first trip.\n\nExcitedly, he jumped into the time machine, ready to explore the past and future. But what he didn't expect was that something had gone wrong and instead of traveling to the future, he was sent to a parallel universe.\n\nIn this strange universe, Dr. Smith encountered many strange creatures, and encountered a powerful force that threatened to destroy all of existence. With his knowledge of science and technology, Dr. Smith was able to save the universe, and he eventually returned to his own timeline, wiser and more experienced than before."

In [5]:
def print_markdown(text):
    """Prints text as markdown"""
    IPython.display.display(IPython.display.Markdown(text))

We want to build a system that takes in abstracts of scientific ML papers and extracts the model names they mention, when any are present.

The input: full paper abstract

The output: model name tags

We could use an NER tool to extract the same information, but LLMs can also do this with some careful prompt engineering.

## Demonstration Set

We will use arXiv to collect a few articles for our demonstrations.

Specifically, we will focus on the Computation and Language section and Computer Vision section: https://arxiv.org/list/cs.CL/recent

In [74]:
import json
import pandas as pd

# load the validation set and the few-shot demonstrations from JSON files (as lists of dicts)
with open('data/article-tags.json') as f:
    val_data = json.load(f)

with open('data/few_shot.json') as f:
    few_shot_data = json.load(f)


In [75]:
few_shot_data

[{'abstract': 'Neural Radiance Fields (NeRFs) are a very recent and very popular approach for the problems of novel view synthesis and 3D reconstruction. A popular scene representation used by NeRFs is to combine a uniform, voxel-based subdivision of the scene with an MLP. Based on the observation that a (sparse) point cloud of the scene is often available, this paper proposes to use an adaptive representation based on tetrahedra and a Delaunay representation instead of the uniform subdivision or point-based representations. We show that such a representation enables efficient training and leads to state-of-the-art results. Our approach elegantly combines concepts from 3D geometry processing, triangle-based rendering, and modern neural radiance fields. Compared to voxel-based representations, ours provides more detail around parts of the scene likely to be close to the surface. Compared to point-based representations, our approach achieves better performance.',
  'tags': ['NeRFs', 'MLP

We have the inputs we need for our prompts. Depending on your task, you might want to create several candidate prompt templates from this data. Let's do this with LangChain.

In [30]:
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate

In [98]:
# create a custom selector for the few shot data
from langchain.prompts.example_selector.base import BaseExampleSelector
from typing import Dict, List
import numpy as np

class CustomExampleSelector(BaseExampleSelector):
    
    def __init__(self, examples: List[Dict[str, str]]):
        self.examples = examples
    
    def add_example(self, example: Dict[str, str]) -> None:
        """Add new example to store for a key."""
        self.examples.append(example)

    def select_examples(self, size: int) -> List[dict]:
        """Randomly select `size` distinct examples (sampling without replacement)."""
        return list(np.random.choice(self.examples, size=size, replace=False))
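
The selector above draws a random subset of the demonstrations, so the prompt changes from run to run. When comparing prompts it can help to make that draw repeatable. The following is a minimal sketch using only the standard library; `sample_examples`, its `seed` parameter, and `toy_examples` are hypothetical names introduced for illustration and are not part of the notebook or of LangChain.

```python
import random
from typing import Dict, List, Optional


def sample_examples(
    examples: List[Dict[str, object]], size: int, seed: Optional[int] = None
) -> List[Dict[str, object]]:
    """Return `size` distinct examples, chosen uniformly at random."""
    if size > len(examples):
        raise ValueError(
            f"requested {size} examples but only {len(examples)} are available"
        )
    rng = random.Random(seed)  # private generator; leaves global random state alone
    return rng.sample(examples, size)


# Hypothetical toy data standing in for few_shot_data
toy_examples = [
    {"abstract": f"abstract {i}", "tags": [f"model-{i}"]} for i in range(5)
]
picked = sample_examples(toy_examples, 3, seed=0)
print([ex["abstract"] for ex in picked])
```

Passing the same `seed` yields the same demonstrations, which keeps differences between runs attributable to the prompt rather than to the sampled examples.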

In [99]:
example_selector = CustomExampleSelector(few_shot_data)

In [102]:
list(example_selector.select_examples(5))

[{'abstract': "Children's drawings have a wonderful inventiveness, creativity, and variety to them. We present a system that automatically animates children's drawings of the human figure, is robust to the variance inherent in these depictions, and is simple and straightforward enough for anyone to use. We demonstrate the value and broad appeal of our approach by building and releasing the Animated Drawings Demo, a freely available public website that has been used by millions of people around the world. We present a set of experiments exploring the amount of training data needed for fine-tuning, as well as a perceptual study demonstrating the appeal of a novel twisted perspective retargeting technique. Finally, we introduce the Amateur Drawings Dataset, a first-of-its-kind annotated dataset, collected via the public demo, containing over 178,000 amateur drawings and corresponding user-accepted character bounding boxes, segmentation masks, and joint location annotations.",
  'tags': ['

In [109]:
template = """
Abstract: {abstract}
Tags: {tags}
"""

prompt = PromptTemplate(
    input_variables=["abstract", "tags"],
    template=template
)

few_shot_prompt = FewShotPromptTemplate(
    examples = list(example_selector.select_examples(3)),
    example_prompt=prompt,
    prefix = "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]",
    suffix = "Abstract: {input}\nTags:",
    input_variables=["input"],
    example_separator="\n\n" 
)

In [110]:
print_markdown(few_shot_prompt.format(input="test"))

You are an information extractor capable of extracting model names from machine learning paper abstracts. Your response is an an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]


Abstract: Topological deep learning is a rapidly growing field that pertains to the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations. In this paper, we present a unifying deep learning framework built upon a richer data structure that includes widely adopted topological domains. Specifically, we first introduce combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. We characterize permutation and orientation equivariances of CCNNs, and discuss pooling and unpooling operations within CCNNs in detail. Third, we evaluate the performance of CCNNs on tasks related to mesh shape analysis and graph learning. Our experiments demonstrate that CCNNs have competitive performance as compared to state-of-the-art deep learning models specifically tailored to the same tasks. Our findings demonstrate the advantages of incorporating higher-order relations into deep learning models in different applications.
Tags: ['CCNNs']



Abstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca large models, emphasizing instruction fine-tuning. We expand the original LLaMA's Chinese vocabulary by adding 20K Chinese tokens, increasing encoding efficiency and enhancing basic semantic understanding. By incorporating secondary pre-training using Chinese data and fine-tuning with Chinese instruction data, we substantially improve the models' comprehension and execution of instructions. Our pilot study serves as a foundation for researchers adapting LLaMA and Alpaca models to other languages. Resources are made publicly available through GitHub, fostering open research in the Chinese NLP community and beyond. GitHub repository:
Tags: ['ChatGPT', 'GPT-4', 'LLaMA']



Abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at
Tags: ['GPT', 'OPT', 'GPTQ']


Abstract: test
Tags:
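
The rendered prompt above is, in essence, the instruction prefix, the formatted demonstrations, and an open-ended suffix joined by a separator. As a rough illustration of that structure, here is a plain-Python sketch that does not use LangChain. `build_few_shot_prompt` and the toy example are hypothetical, and the sketch omits the leading and trailing newlines that the notebook's example template adds around each demonstration.

```python
from typing import Dict, List


def build_few_shot_prompt(
    prefix: str,
    examples: List[Dict[str, object]],
    query: str,
    separator: str = "\n\n",
) -> str:
    """Join the prefix, the formatted examples, and the open-ended suffix."""
    example_blocks = [
        f"Abstract: {ex['abstract']}\nTags: {ex['tags']}" for ex in examples
    ]
    # The suffix ends with "Tags:" so the model completes the tag list
    suffix = f"Abstract: {query}\nTags:"
    return separator.join([prefix, *example_blocks, suffix])


# Hypothetical demonstration, not taken from few_shot_data
toy = [{"abstract": "We propose FooNet.", "tags": ["FooNet"]}]
prompt_text = build_few_shot_prompt("Extract model names.", toy, "test")
print(prompt_text)
```

Ending the prompt on `Tags:` is what nudges a completion model to answer in the same list format as the demonstrations.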

In [78]:
llm(few_shot_prompt.format(input="Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers."))

" ['MCR']"
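
The model returns the tags as a string (note the leading space), not as a Python list. To use the result downstream it has to be parsed. Below is a small sketch using `ast.literal_eval` from the standard library; `parse_tags` is a hypothetical helper, and falling back to `["NA"]` on unparseable output is an assumption made here, not something the notebook specifies.

```python
import ast
from typing import List


def parse_tags(raw: str) -> List[str]:
    """Parse a completion such as " ['MCR']" into a list of tag strings."""
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError):
        return ["NA"]  # assumption: treat unparseable output as "no tags"
    if not isinstance(value, list):
        return ["NA"]
    tags = [str(tag).strip() for tag in value if str(tag).strip()]
    return tags or ["NA"]


print(parse_tags(" ['MCR']"))
print(parse_tags('\n["FLAN-T5", "LoRA"]'))
print(parse_tags("I could not find any model names."))
```

`ast.literal_eval` only evaluates Python literals, so it accepts both single- and double-quoted lists without executing arbitrary code from the model output.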

In [79]:
# zero-shot prompt
zero_shot_template = """
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]

Abstract: {abstract}
Tags:
"""

zero_shot_prompt = PromptTemplate(
    input_variables=["abstract"],
    template=zero_shot_template
)

In [80]:
print_markdown(zero_shot_template.format(abstract="Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers."))


You are an information extractor capable of extracting model names from machine learning paper abstracts. Your response is an an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]

Abstract: Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers.
Tags:


We now have a few-shot prompt template and a zero-shot prompt template, so we can start testing. You can also build further candidate prompt templates, i.e., variants of the zero-shot and few-shot templates above. For simplicity, we will use just one template of each kind.

## Prompt Evaluations

- We will test zero-shot and few-shot with the `val_data` and measure accuracy
- We can do this evaluation for different models and different prompt candidates

Ref: https://python.langchain.com/en/latest/use_cases/evaluation/question_answering.html
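
Besides LLM-based grading, accuracy on a tagging task can also be measured directly by comparing predicted and reference tag sets. The sketch below computes set-based precision, recall, and F1; it is an alternative illustration, not the evaluation this notebook runs. `tag_scores` is a hypothetical helper, and the reference tags in the usage line are made up for the example.

```python
from typing import Iterable, Tuple


def tag_scores(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over case-insensitive tag sets."""
    pred_set = {t.strip().lower() for t in predicted}
    gold_set = {t.strip().lower() for t in gold}
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical reference tags; "ViT model" does not exactly match "ViT"
print(tag_scores(["ViT model", "OpenCLIP"], ["ViT", "OpenCLIP"]))
```

Exact set matching is strict: a prediction such as `"ViT model"` counts as a miss against `"ViT"`, which is one reason an LLM grader or some tag normalization may be preferable.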

In [111]:
from langchain.chains import LLMChain

zero_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)
few_shot_chain = LLMChain(llm=llm, prompt=few_shot_prompt)

In [147]:
# extract the abstracts from val_data
abstracts = [{"input": item["abstract"]} for item in val_data]

In [149]:
fwpredictions = few_shot_chain.apply(abstracts)

In [150]:
fwpredictions

[{'text': " ['LLaMA', 'ChatGPT', 'Evol-Instruct', 'WizardLM']"},
 {'text': " ['FLAN-T5', 'LoRA']"},
 {'text': " ['NA']"},
 {'text': " ['PAXQA']"},
 {'text': " ['ChatGPT']"},
 {'text': " ['ViT', 'OpenCLIP']"},
 {'text': " ['Segment-Anything Model', 'AIGC', 'Stable Diffusion', 'Inpaint Anything']"},
 {'text': " ['BLIP', 'Segment-Anything', 'Text-to-Image Diffusion']"},
 {'text': " ['ChatGPT', 'GPT-4']"},
 {'text': " ['NA']"}]

Zero-shot predictions

In [84]:
predictions = zero_chain.apply(val_data)

In [85]:
predictions

[{'text': '["Evol-Instruct", "WizardLM", "ChatGPT"]'},
 {'text': '\n["FLAN-T5", "LoRA"]'},
 {'text': '["NA"]'},
 {'text': '["PAXQA", "QG model", "annotation projection", "lexically-constrained machine translation", "extractive QA models"]'},
 {'text': '["ChatGPT"]'},
 {'text': '\n["ViT model", "OpenCLIP"]'},
 {'text': '\n["SAM", "AIGC", "Stable Diffusion", "Inpaint Anything (IA)"]'},
 {'text': '\n["BLIP model", "Segment-Anything model", "text-to-image diffusion model", "Anything-3D"]'},
 {'text': '\n["GPT-4", "ChatGPT", "ScienceQA", "TabMWP"]'},
 {'text': '["NA"]'}]

In [86]:
from langchain.evaluation.qa import QAEvalChain

In [88]:
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(val_data, predictions, question_key="abstract", prediction_key="text", answer_key="tags")


In [151]:
fw_graded_outputs = eval_chain.evaluate(val_data, fwpredictions, question_key="abstract", prediction_key="text", answer_key="tags")

In [152]:
for i, eg in enumerate(val_data):
    print(f"Example {i}:")
    print("Question: " + eg['abstract'])
    print("Real Answer: " + str(eg['tags']))
    print("Zero-shot Predicted Answer: " + predictions[i]['text'])
    print("Zero-shot Predicted Grade: " + graded_outputs[i]['text'])
    print("Few-shot Predicted Answer: " + fwpredictions[i]['text'])
    print("Few-shot Predicted Grade: " + fw_graded_outputs[i]['text'])
    print()
    print()

Example 0:
Question: Training large language models (LLM) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM model are preferred to outputs from OpenAI ChatGPT. Even though WizardLM sti
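
To turn the per-example grades into a single accuracy figure, the graded outputs can be tallied. The sketch below assumes each graded output is a dict whose `text` field reads `CORRECT` or `INCORRECT`, possibly with surrounding whitespace. That format is an assumption, since the grades are not visible in the output above. `accuracy_from_grades` and `toy_grades` are hypothetical.

```python
from typing import Dict, List


def accuracy_from_grades(graded: List[Dict[str, str]], key: str = "text") -> float:
    """Fraction of graded outputs whose text reads CORRECT."""
    if not graded:
        return 0.0
    # "INCORRECT" contains "CORRECT", so compare the normalized text exactly
    correct = sum(1 for g in graded if g[key].strip().upper() == "CORRECT")
    return correct / len(graded)


# Hypothetical grades, not results from this notebook
toy_grades = [
    {"text": " CORRECT"},
    {"text": " INCORRECT"},
    {"text": " CORRECT"},
    {"text": "CORRECT"},
]
print(accuracy_from_grades(toy_grades))
```

The same function could be applied to the zero-shot and few-shot graded outputs to compare the two prompts on one number.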

## Further Evaluation on Tokens Used or Requests Used