In [1]:
%set_env DSP_NOTEBOOK_CACHEDIR cache
%load_ext autoreload
%autoreload 2

env: DSP_NOTEBOOK_CACHEDIR=cache


In [2]:
import os
import dsp
from utils import get_retriever, get_lm

In [3]:
openai_key = os.getenv('OPENAI_API_KEY')  # or replace with your API key (optional)
colbert_server = 'http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search'

In [4]:
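# Confirm the key is available without echoing the secret into the notebook output
print("OPENAI_API_KEY is set" if openai_key else "OPENAI_API_KEY is not set")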

In [5]:
lm = get_lm()
rm = get_retriever("semantic-scholar")
dsp.settings.configure(lm=lm, rm=rm)

In [6]:
# dsp.retrieveEnsemble(["few-shot learners", "emergence", "in-context learning"], k=5, by_prob=False)

In [30]:
# rm(["scaling language models", "emergence in-context learning", "task-specific training"], ensemble=True, k=3)
rm("scaling language models", k=3)

[[{'title': 'Scaling Language Models: Methods, Analysis & Insights from Training Gopher',
   'long_text': "Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models

In [13]:
train = [('Who produced the album that included a re-recording of "Lithium"?', ['Butch Vig']),
         ('Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?', ['Kevin Greutert']),
         ('The heir to the Du Pont family fortune sponsored what wrestling team?', ['Foxcatcher', 'Team Foxcatcher', 'Foxcatcher Team']),
         ('In what year was the star of To Hell and Back born?', ['1925']),
         ('Which award did the first book of Gary Zukav receive?', ['U.S. National Book Award', 'National Book Award']),
         ('What city was the victim of Joseph Druces working in?', ['Boston, Massachusetts', 'Boston']),]

train = [dsp.Example(question=question, answer=answer) for question, answer in train]

In [16]:
question = dsp.Type(prefix="Question:", desc="${the question to be answered}")
answer = dsp.Type(prefix="Answer:", desc="${a short factoid answer, often between 1 and 5 words}", format=dsp.format_answers)

qa_template = dsp.Template(instructions="Answer questions with short factoid answers.", question=question(), answer=answer())

In [17]:
context = dsp.Type(
    prefix="Context:\n",
    desc="${sources that may contain relevant content}",
    format=dsp.passages2text
)

qa_template_with_passages = dsp.Template(
    instructions=qa_template.instructions,
    context=context(), question=question(), answer=answer()
)

### Program 1: RTR

In [18]:
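def retrieve_then_read_QA(question: str) -> str:
    # Sample few-shot demonstrations from the training examples
    demos = dsp.sample(train, k=5)
    # Retrieve the top-10 passages for the raw question
    passages = rm(question, k=10)

    # Combine the question, retrieved context, and demos into one example, then generate
    example = dsp.Example(question=question, context=passages, demos=demos)
    example, completions = dsp.generate(qa_template_with_passages)(example, stage='qa')

    return completions.answer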

In [None]:
retrieve_then_read_QA("Query encoder"), lm.inspect_history(n=1)

### Program 2: TOT

This program first runs a multi-hop search step that retrieves the most relevant documents from the retriever, then predicts an answer from the collected context.

In [19]:
rationale = dsp.Type(
    prefix="Rationale: Let's think step by step.",
    desc="${a step-by-step deduction that identifies the correct response, which will be provided below}"
)

qa_template_with_CoT = dsp.Template(
    instructions=qa_template.instructions,
    context=context(), question=question(), rationale=rationale(), answer=answer()
)

search_rationale = dsp.Type(
    prefix="Rationale: Let's think step by step. To answer this question, we first need to find out",
    desc="${the missing information}"
)

search_query = dsp.Type(
    prefix="Search Query:",
    desc="${a simple question for seeking the missing information}"
)

rewrite_template = dsp.Template(
    instructions="Write a search query that will help answer a complex question.",
    question=question(), rationale=search_rationale(), query=search_query()
)

condensed_rationale = dsp.Type(
    prefix="Rationale: Let's think step by step. Based on the context, we have learned the following.",
    desc="${information from the context that provides useful clues}"
)

hop_template = dsp.Template(
    instructions=rewrite_template.instructions,
    context=context(), question=question(), rationale=condensed_rationale(), query=search_query()
)

In [25]:
from dsp.utils import deduplicate

@dsp.transformation
def qa_predict(example: dsp.Example, sc=True):
    if sc:
        # Self-consistency: sample 20 chain-of-thought completions and keep the majority answer
        example, completions = dsp.generate(qa_template_with_CoT, n=20, temperature=0.7)(example, stage='qa')
        completions = dsp.majority(completions)
    else:
        # Single chain-of-thought completion with the default generation settings
        example, completions = dsp.generate(qa_template_with_CoT)(example, stage='qa')

    return example.copy(answer=completions.answer)

@dsp.transformation
def multihop_search(example: dsp.Example, max_hops=2, k=10) -> dsp.Example:
    example.context = []
    
    for hop in range(max_hops):
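        # Generate a search query: from the question alone on the first hop,
        # and from the question plus the retrieved context on later hops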
        template = rewrite_template if hop == 0 else hop_template
        example, completions = dsp.generate(template)(example, stage=f'h{hop}')

        # Retrieve k results based on the query generated
        passages = rm(completions.query, k=k)

        # Update the context by concatenating old and new passages
        example.context = deduplicate(example.context + passages)

    return example
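
The context update in `multihop_search` relies on merging passages from successive hops without repeats. The block below is a minimal sketch of order-preserving deduplication, assuming passages are plain strings. It is not the implementation of `dsp.utils.deduplicate`, and the name `deduplicate_preserving_order` is hypothetical. Retrievers that return dictionaries (as the `semantic-scholar` output above appears to) would need a hashable key, such as the passage text, instead.

```python
from typing import Iterable, List


def deduplicate_preserving_order(passages: Iterable[str]) -> List[str]:
    """Drop repeated passages while keeping first-seen order.

    Hypothetical helper for illustration; assumes passages are strings.
    """
    seen = set()
    unique = []
    for passage in passages:
        if passage not in seen:
            seen.add(passage)
            unique.append(passage)
    return unique


# Passages from two hops that overlap on one item
hop_0 = ["passage A", "passage B"]
hop_1 = ["passage B", "passage C"]
merged = deduplicate_preserving_order(hop_0 + hop_1)
```

Keeping first-seen order means passages retrieved on earlier hops stay ahead of later ones in the prompt context.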

In [26]:
def multihop_qa(question: str) -> str:
    demos = dsp.sample(train, k=7)
    x = dsp.Example(question=question, demos=demos)
    
    x = multihop_search(x)
    x = qa_predict(x, sc=True)

    return x.answer
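
With `sc=True`, `qa_predict` samples several completions and passes them to `dsp.majority`. The idea behind that step can be sketched as a simple vote over normalized answer strings. This is an illustration only: the normalization (strip and lowercase) and the name `majority_answer` are assumptions, not the behavior of `dsp.majority`.

```python
from collections import Counter
from typing import List


def majority_answer(answers: List[str]) -> str:
    """Return the most frequent answer after light normalization.

    Hypothetical helper; strip-and-lowercase normalization is an assumption.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        raise ValueError("no non-empty answers to vote on")
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Four sampled answers, three of which agree up to case and whitespace
sampled = ["Butch Vig", "butch vig ", "Steve Albini", "Butch Vig"]
voted = majority_answer(sampled)
```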

In [None]:
multihop_qa("Large Language Models are zero-shot reasoners"), lm.inspect_history(n=1)

### Router

In [47]:
from router import Router
from constants import DEFAULT_ROUTER_TEMPLATE

In [48]:
r = Router(lm, DEFAULT_ROUTER_TEMPLATE)

In [61]:
r("Find papers that ensemble generations of in-context learning across prompts, where each prompt has different demonstrating examples")

[' TOT']

In [59]:
r("Find a paper that demonstrates how scaling language models leads to emergence of a phenomena called in-context learning where the model learns to perform many tasks without task-specific training")

[' TOT']

In [60]:
r("BERT Language model")

[' ST']
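
The router returns a list whose label carries a leading space (`' TOT'`, `' ST'`), so a caller that dispatches on the label would probably need to normalize it first. The following is a minimal sketch under that assumption. `normalize_route`, the `KNOWN_ROUTES` set, and the fallback to `"ST"` are all hypothetical and are not part of the `Router` class.

```python
from typing import List

# Labels observed in the router outputs above; the full label set is an assumption
KNOWN_ROUTES = {"TOT", "ST"}


def normalize_route(raw_completions: List[str], default: str = "ST") -> str:
    """Strip whitespace from the first completion and validate the label.

    Falls back to `default` (an assumed choice) for empty or unknown output.
    """
    if not raw_completions:
        return default
    label = raw_completions[0].strip().upper()
    return label if label in KNOWN_ROUTES else default


route = normalize_route([" TOT"])
```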

### TOT Queries Analysis


In [14]:
query="find a paper that demonstrates how scaling language models leads to emergence of a phenomena called in-context learning where the model learns to perform many tasks without task-specific training"

In [15]:
title = "Language Models are Few-Shot Learners"
abstract = "Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general."

In [16]:
dsp_title = "Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP"
dsp_abstract = """Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple "retrieve-then-read" pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably. We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37-120%, 8-39%, and 80-290% relative gains against the vanilla LM (GPT-3.5), a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively."""

In [17]:
example = dsp.Example(question=query, context=dsp_title + "\n" + dsp_abstract, demos=[])
_, completions = dsp.generate(QUERY_PROCESSING_TEMPLATE)(example, stage='qa')

In [18]:
lm.inspect_history(n=1)





Does the following piece of text fullfil this question? The text should be relevant to the question below. Give a Yes or No answer.

---

Follow the following format.

Text:
${text that might contain information that fulfills the question}

Question: Do you think the paper above answers this question: ${a question from a user looking for a paper}

Answer: ${a Yes, or No answer}

---

Text:
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple "retrieve-then-read" pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that relies on passing natural language texts in sophisticated 

In [20]:
tot_pipeline("find a paper that demonstrates how scaling language models leads to emergence of a phenomena called in-context learning where the model learns to perform many tasks without task-specific training")

scaling language models, emergence, in-context learning, many tasks, task-specific training
['On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model', 'Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, the in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we introduce the following observations: (1) in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning, (2) in-context learning ability can emerge when a language model is trained on a combina

TypeError: can only concatenate str (not "NoneType") to str
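
The `TypeError` above indicates that a string was concatenated with `None`. One plausible cause, which is a guess since the source of `tot_pipeline` is not shown here, is a retrieved record with a missing title or abstract being joined the same way as `dsp_title + "\n" + dsp_abstract`. A defensive join could look like the sketch below; the helper name `title_and_abstract` is hypothetical.

```python
from typing import Optional


def title_and_abstract(title: Optional[str], abstract: Optional[str]) -> str:
    """Join title and abstract with a newline, skipping missing or empty parts."""
    parts = [part.strip() for part in (title, abstract) if part]
    return "\n".join(parts)


# A record whose abstract is missing no longer raises a TypeError
text = title_and_abstract("Language Models are Few-Shot Learners", None)
```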

In [None]:
rm = get_retriever()

In [45]:
# dsp.settings is not callable; use configure(), as in cell In [5]
dsp.settings.configure(lm=lm, rm=rm)

In [44]:
rm("few-shot learners")

[{'paper_id': '2108.01928',
  'long_text': ' Few-Shot Learning The term few-shot learning refers to the practice of only providing a few examples when training a model, compared to the typical approach of using large datasets (Wang et al., 2020). In the NLP domain, recent work by Brown et al. (2020) suggests to use these few examples only in the context, as opposed to actually training with it. Fittingly, they call this approach in-context learning. Here, they condition the model on a natural language description of the task together with a few demonstrations. Their experiments reveal that the larger the model, the better its in-context learning capabilities. Our approach is very similar to in-context learning, with the difference that we do not provide a description of the task and utilize natural language templates for the relations.'},
 {'paper_id': '2210.05572',
  'long_text': ' Few-Shot Learning. Few-shot learning aims at quickly generalizing the model to new tasks with a few labe

In [42]:
lm.inspect_history(n=6)





Does the following piece of text fullfil this question? The text should be relevant to the question below. Give a Yes or No answer.

---

Follow the following format.

Text:
${text that might contain information that fulfills the question}

Question:
${a question from a user looking for a paper}

Answer: ${a Yes, or No answer}

---

Text:
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE , a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural ﬁt for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced exp

In [30]:
output

<dsp.templates.template_v3.Type at 0x7fa389d27460>