## ML Paper Tagger 

<a href="https://colab.research.google.com/github/dair-ai/maven-pe-for-llms-4/blob/main/notebooks/session-3/use-case-tagger.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import openai
import os
import IPython
from langchain.llms import OpenAI
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
# load_dotenv() above reads OPENAI_API_KEY from .env; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

In [3]:
# create a new LLM (OpenAI is already imported above)
llm = OpenAI(model_name="text-davinci-003")

In [4]:
def print_markdown(text):
    """Prints text as markdown"""
    IPython.display.display(IPython.display.Markdown(text))

## What we are Building:

We want to build a system that takes abstracts of scientific ML papers and extracts the model names they mention, when any are present.

The input: full paper abstract

The output: model name tags

## Demonstration Set

We will use arXiv to collect a few articles for our demonstrations.

Specifically, we will focus on the Computation and Language (cs.CL) and Computer Vision sections, for example: https://arxiv.org/list/cs.CL/recent

In [5]:
import json
import pandas as pd

# load the validation set (article-tags.json) and the few-shot demonstrations (few_shot.json)
with open('../data/article-tags.json') as f:
    val_data = json.load(f)

with open('../data/few_shot.json') as f:
    few_shot_data = json.load(f)

In [6]:
few_shot_data

[{'abstract': 'Neural Radiance Fields (NeRFs) are a very recent and very popular approach for the problems of novel view synthesis and 3D reconstruction. A popular scene representation used by NeRFs is to combine a uniform, voxel-based subdivision of the scene with an MLP. Based on the observation that a (sparse) point cloud of the scene is often available, this paper proposes to use an adaptive representation based on tetrahedra and a Delaunay representation instead of the uniform subdivision or point-based representations. We show that such a representation enables efficient training and leads to state-of-the-art results. Our approach elegantly combines concepts from 3D geometry processing, triangle-based rendering, and modern neural radiance fields. Compared to voxel-based representations, ours provides more detail around parts of the scene likely to be close to the surface. Compared to point-based representations, our approach achieves better performance.',
  'tags': ['NeRFs', 'MLP

We now have the inputs we need for our prompts. Depending on your task, you might want to create several prompt templates (candidates) from them. Let's do this with LangChain.

In [7]:
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate

In [8]:
# create a custom selector for the few shot data
from langchain.prompts.example_selector.base import BaseExampleSelector
from typing import Dict, List
import numpy as np

class CustomExampleSelector(BaseExampleSelector):
    
    def __init__(self, examples: List[Dict[str, str]]):
        self.examples = examples
    
    def add_example(self, example: Dict[str, str]) -> None:
        """Add new example to store for a key."""
        self.examples.append(example)

    def select_examples(self, size: int) -> List[dict]:
        """Randomly sample `size` examples without replacement."""
        return list(np.random.choice(self.examples, size=size, replace=False))
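
Because the selector samples at random, each run can produce a different few-shot prompt. The following is a minimal, dependency-free sketch of the same idea with a fixed seed, so that prompt comparisons are repeatable. The function `sample_examples` and its `seed` parameter are hypothetical names and are not part of LangChain; the toy records are made up.

```python
import random
from typing import Dict, List

def sample_examples(examples: List[Dict], size: int, seed: int = 0) -> List[Dict]:
    """Return `size` distinct examples, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG; does not touch the global random state
    return rng.sample(examples, size)

# Toy records in the same shape as few_shot_data (contents are made up).
toy = [{"abstract": f"abstract {i}", "tags": ["NA"]} for i in range(5)]
picked = sample_examples(toy, size=3, seed=42)
print([eg["abstract"] for eg in picked])
```

Calling `sample_examples` again with the same seed returns the same examples, which makes it easier to attribute a change in output to the prompt rather than to the sampled demonstrations.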

In [9]:
example_selector = CustomExampleSelector(few_shot_data)

In [10]:
list(example_selector.select_examples(2))

[{'abstract': 'Prevalent semantic segmentation solutions are, in essence, a dense discriminative classifier of p(class|pixel feature). Though straightforward, this de facto paradigm neglects the underlying data distribution p(pixel feature|class), and struggles to identify out-of-distribution data. Going beyond this, we propose GMMSeg, a new family of segmentation models that rely on a dense generative classifier for the joint distribution p(pixel feature,class). For each class, GMMSeg builds Gaussian Mixture Models (GMMs) via Expectation-Maximization (EM), so as to capture class-conditional densities. Meanwhile, the deep dense representation is end-to-end trained in a discriminative manner, i.e., maximizing p(class|pixel feature). This endows GMMSeg with the strengths of both generative and discriminative models. With a variety of segmentation architectures and backbones, GMMSeg outperforms the discriminative counterparts on three closed-set datasets. More impressively, without any mo

### Few-Shot Prompt Template

In [11]:
template = """
Abstract: {abstract}
Tags: {tags}
"""

prompt = PromptTemplate(
    input_variables=["abstract", "tags"],
    template=template
)

few_shot_prompt = FewShotPromptTemplate(
    examples = list(example_selector.select_examples(3)),
    example_prompt=prompt,
    prefix = "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]",
    suffix = "Abstract: {input}\nTags:",
    input_variables=["input"],
    example_separator="\n\n"
)

In [12]:
print_markdown(few_shot_prompt.format(input="test"))

Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]


Abstract: Generating talking head videos through a face image and a piece of speech audio still contains many challenges. ie, unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render, and synthesize the final video. We conducted extensive experiments to demonstrate the superiority of our method in terms of motion and video quality.
Tags: ['SadTalker', 'ExpNet', 'PoseVAE']



Abstract: We propose VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods which aggregate features for each voxel from input views without considering its visibility, we aim to improve the feature fusion by explicitly inferring its visibility from a similarity matrix, computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline including a volume sparsification process. Different from their works which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray to preserve at least one voxel per ray for more fine details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict TSDF in a coarse-to-fine manner by learning its residuals across scales leading to better TSDF predictions. Experimental results on benchmarks show that our method can achieve superior performance with more scene details. Code is available at:
Tags: ['NA']



Abstract: Topological deep learning is a rapidly growing field that pertains to the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations. In this paper, we present a unifying deep learning framework built upon a richer data structure that includes widely adopted topological domains. Specifically, we first introduce combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. We characterize permutation and orientation equivariances of CCNNs, and discuss pooling and unpooling operations within CCNNs in detail. Third, we evaluate the performance of CCNNs on tasks related to mesh shape analysis and graph learning. Our experiments demonstrate that CCNNs have competitive performance as compared to state-of-the-art deep learning models specifically tailored to the same tasks. Our findings demonstrate the advantages of incorporating higher-order relations into deep learning models in different applications.
Tags: ['CCNNs']


Abstract: test
Tags:

In [13]:
llm(few_shot_prompt.format(input="Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers."))

" ['MCR']"
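
The model returns its answer as a string (for example `" ['MCR']"`), not as a Python list. A minimal sketch of turning such a response into a list is shown here, assuming the response is a Python-style list literal. The helper name `parse_tags` and the choice to fall back to `["NA"]` on unparseable output are assumptions, not part of the notebook.

```python
import ast
from typing import List

def parse_tags(raw: str) -> List[str]:
    """Parse a model response such as " ['MCR']" into a list of strings."""
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return ["NA"]  # assumption: treat unparseable output as "no tags"
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return value
    return ["NA"]  # assumption: anything that is not a list of strings is rejected

print(parse_tags(" ['MCR']"))
print(parse_tags('\n["FLAN-T5", "LoRA"]'))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text produced by a model.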

### Zero-shot Template

In [14]:
# zero-shot prompt
zero_shot_template = """
Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]

Abstract: {abstract}
Tags:
"""

zero_shot_prompt = PromptTemplate(
    input_variables=["abstract"],
    template=zero_shot_template
)

In [15]:
print_markdown(zero_shot_template.format(abstract="Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers."))


Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don't find model names in the abstract or you are not sure, return ["NA"]

Abstract: Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers.
Tags:


We now have a few-shot prompt template and a zero-shot prompt template, so we can start testing. You can also build additional candidate prompt templates, i.e., variants of the zero-shot and few-shot templates above. For simplicity, we will use just one template of each kind.
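
As a rough illustration of what a set of prompt candidates could look like, the sketch below keeps several instruction variants in a dictionary and fills them with plain `str.format`, without LangChain. The names `CANDIDATES` and `build_prompt`, the wording of both variants, and the sample abstract are all hypothetical.

```python
# Hypothetical prompt candidates; the wording is illustrative only.
CANDIDATES = {
    "terse": (
        "Extract model names from the abstract as a list.\n\n"
        "Abstract: {abstract}\nTags:"
    ),
    "strict": (
        "Extract model names from the abstract. "
        'Return ["NA"] if none are found.\n\n'
        "Abstract: {abstract}\nTags:"
    ),
}

def build_prompt(name: str, abstract: str) -> str:
    """Fill the chosen candidate template with one abstract."""
    return CANDIDATES[name].format(abstract=abstract)

# The abstract text here is made up.
print(build_prompt("strict", "We introduce FooNet ..."))
```

Keeping candidates under stable names makes it straightforward to run the same validation abstracts through each one and compare results per candidate.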

## Prompt Evaluations

- We will test zero-shot and few-shot with the `val_data` and evaluate manually
- We can do this evaluation for different models and different prompt candidates

Ref: https://python.langchain.com/en/latest/use_cases/evaluation/question_answering.html
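
Alongside manual review, one simple automatic check is to compare predicted and reference tags as sets. The sketch below computes precision, recall, and F1 for a single example by exact string match. The function name `tag_scores` and the exact-match criterion are assumptions; note that exact matching treats strings such as "ViT model" and "ViT" as different tags.

```python
from typing import Dict, List

def tag_scores(predicted: List[str], reference: List[str]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 between two tag lists."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    # F1 is defined as 0 when there is no overlap, which also avoids dividing by zero.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up tags for illustration only.
scores = tag_scores(["ModelA", "ModelB", "ModelC"], ["ModelA", "ModelB"])
print(scores)
```

Averaging these scores over the validation set gives one number per prompt candidate, which can complement the manual inspection.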

In [16]:
from langchain.chains import LLMChain

zero_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)
few_shot_chain = LLMChain(llm=llm, prompt=few_shot_prompt)

In [17]:
# extract the abstracts from val_data
abstracts = [{"input": eg["abstract"]} for eg in val_data]

In [18]:
# run few-shot predictions
fwpredictions = few_shot_chain.apply(abstracts)

In [19]:
fwpredictions

[{'text': " ['WizardLM', 'ChatGPT', 'Evol-Instruct', 'LLaMA']"},
 {'text': "['FLAN-T5', 'LoRA']"},
 {'text': " ['NA']"},
 {'text': " ['PAXQA', 'QG']"},
 {'text': " ['ChatGPT']"},
 {'text': " ['ViT', 'OpenCLIP']"},
 {'text': " ['Inpaint Anything (IA)', 'Segment-Anything Model (SAM)', 'AIGC models', 'Stable Diffusion']"},
 {'text': " ['BLIP', 'Segment-Anything', 'text-to-image diffusion']"},
 {'text': " ['Chameleon', 'GPT-4', 'ChatGPT']"},
 {'text': " ['NA']"}]

Predictions

In [20]:
# zero-shot predictions
predictions = zero_chain.apply(val_data)

In [21]:
predictions

[{'text': '["WizardLM", "LLaMA", "OpenAI ChatGPT", "Evol-Instruct"]'},
 {'text': '\n["FLAN-T5", "AMR2.0", "AMR3.0", "BioAMR", "LoRA"]'},
 {'text': '\n["NA"]'},
 {'text': '\n["PAXQA", "QG model", "annotation projection", "lexically-constrained machine translation", "extractive QA models"]'},
 {'text': '\n["ChatGPT"]'},
 {'text': '\n["ViT model", "OpenCLIP"]'},
 {'text': '\n["Segment-Anything Model (SAM)", "Inpaint Anything (IA)", "AIGC models", "Stable Diffusion"]'},
 {'text': '\n["BLIP model", "Segment-Anything model", "text-to-image diffusion model", "Anything-3D"]'},
 {'text': '\n["GPT-4", "ChatGPT", "Chameleon"]'},
 {'text': '["NA"]'}]

In [22]:
from langchain.evaluation.qa import QAEvalChain

In [23]:
# evaluation chain from LangChain (using an LLM to evaluate)
# zero-shot
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(val_data, predictions, question_key="abstract", prediction_key="text", answer_key="tags")

In [24]:
eval_chain.dict()

{'memory': None,
 'verbose': False,
 'tags': None,
 'prompt': {'input_variables': ['query', 'result', 'answer'],
  'output_parser': None,
  'partial_variables': {},
  'template': "You are a teacher grading a quiz.\nYou are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\n\nExample Format:\nQUESTION: question here\nSTUDENT ANSWER: student's answer here\nTRUE ANSWER: true answer here\nGRADE: CORRECT or INCORRECT here\n\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \n\nQUESTION: {query}\nSTUDENT ANSWER: {result}\nTRUE ANSWER: {answer}\nGRADE:",
  'template_format': 'f-string',
  'validate_template': True,
  '_type': 'prompt'},
 'llm': {'model_name': 'te

In [25]:
# few-shot
fw_graded_outputs = eval_chain.evaluate(val_data, fwpredictions, question_key="abstract", prediction_key="text", answer_key="tags")

In [26]:
for i, eg in enumerate(val_data):
    print(f"Example {i}:")
    print("Question: " + eg['abstract'])
    print("Real Answer: " + str(eg['tags']))
    print("Zero-shot Predicted Answer: " + predictions[i]['text'])
    print("Zero-shot Predicted Grade: " + graded_outputs[i]['text'])
    print("Few-shot Predicted Answer: " + fwpredictions[i]['text'])
    print("Few-shot Predicted Grade: " + fw_graded_outputs[i]['text'])
    print()

Example 0:
Question: Training large language models (LLM) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM model are preferred to outputs from OpenAI ChatGPT. Even though WizardLM sti
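
Reading the grades one by one is useful, but a single summary number makes the zero-shot and few-shot runs easier to compare. The sketch below tallies grades into an accuracy, assuming each graded output is a dict whose `text` field holds `CORRECT` or `INCORRECT`, as the grader prompt requests. The function name `grade_accuracy` and the toy grades are hypothetical.

```python
from typing import Dict, List

def grade_accuracy(graded: List[Dict[str, str]]) -> float:
    """Fraction of graded outputs whose text is exactly CORRECT."""
    if not graded:
        return 0.0
    # Compare the whole string: "INCORRECT" contains "CORRECT" as a substring.
    correct = sum(1 for g in graded if g["text"].strip().upper() == "CORRECT")
    return correct / len(graded)

# Made-up grades in the same shape as the graded outputs.
toy_grades = [{"text": " CORRECT"}, {"text": " INCORRECT"}, {"text": " CORRECT"}]
print(grade_accuracy(toy_grades))
```

Since the grades come from an LLM, this number is best read as a rough signal next to the manual check, not as ground truth.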

Exercise: Evaluate the accuracy of different models and of other prompting techniques, such as chain-of-thought (CoT).