# Comprehensive Rewriting Pipeline 

This is the multi-hop rewriting pipeline that...
- FIRST summarizes the abstract and title and 
- SECOND uses the summary to generate the self-contained rewrite. 
- THIRD makes the rewrite sound human-like

### Step 0: Setting Up

In [1]:
import openai
openai.api_key = "W05A4sw0uvDBLDle"
openai.api_base = "https://praglab-dsp-proxy.omarkhattab.com:5533/"

In [2]:
%load_ext autoreload
%autoreload 2

import sys
import os

try: # When on google Colab, let's clone the notebook so we download the cache.
    import google.colab
    repo_path = 'dspy'
    !git -C $repo_path pull origin v2 || git clone -b v2 https://github.com/stanfordnlp/dspy $repo_path
except:
    repo_path = '.'

if repo_path not in sys.path:
    sys.path.append(repo_path)

# Set up the cache for this notebook
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(repo_path, 'cache')

"""
## TODO: Consider this
if not "dspy-ai" in {pkg.key for pkg in pkg_resources.working_set}:
    !pip install -U pip
    !pip install -e $repo_path
"""

try:
    import dspy
except Exception:
    !pip install -U pip
    !pip install -e $repo_path
    
    import dspy

In [3]:
turbo = dspy.OpenAI(model='gpt-3.5-turbo', model_type='chat')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

In [4]:
# Loading QASPER json
import json
from datasets import load_dataset
with open('./getTemplate.json') as file:
    data = json.load(file)

In [5]:
# print_in_multiple_lines
def break_text_into_lines(text, max_words_per_line=35):
    words = text.split()
    lines = []
    current_line = []

    for word in words:
        if len(current_line) + len(word) <= max_words_per_line:
            current_line.append(word)
        else:
            lines.append(' '.join(current_line))
            current_line = [word]

    if current_line:
        lines.append(' '.join(current_line))

    return lines
def print_in_multiple_lines(text, words_per_line):
    lines = break_text_into_lines(text, words_per_line)
    for line in lines:
        print(line)

### Step 1: Vanilla Rewrite

In [10]:
question_number = 13

In [7]:
class BasicRewrite(dspy.Signature):
    """Rewrite the question so that it is self-contained using the abstract and title. The rewrite should not include the title."""

    title = dspy.InputField()
    abstract = dspy.InputField()
    question = dspy.InputField()
    rewrite = dspy.OutputField(desc="rewritten version of the question")

In [11]:
# QASPER JSON
generate_rewrite_predict = dspy.Predict(BasicRewrite, n=5)
generate_rewrite_CoT = dspy.ChainOfThought(BasicRewrite, n=5)

for idx, eachQuestion in enumerate(data):
    if idx != question_number: continue
    print(f'Title: {eachQuestion["title"]}')
    print_in_multiple_lines(f'Abstract: {eachQuestion["abstract"]}', 30)
    print(f'Question: {eachQuestion["question"]}\n')
    
    print('Predict:')
    pred = generate_rewrite_predict(title=eachQuestion["title"] , abstract=eachQuestion["abstract"], question=eachQuestion["question"])
    for idx, c in enumerate(pred.completions): 
        print(f"Rewrite {idx+1}: {c.rewrite}")

    print('Chain of Thought:')
    pred = generate_rewrite_CoT(title=eachQuestion["title"] , abstract=eachQuestion["abstract"], question=eachQuestion["question"])
    for idx, c in enumerate(pred.completions): 
        print(f"Rewrite {idx+1}: {c.rewrite}")

Title: Community Identity and User Engagement in a Multi-Community Landscape
Abstract: A community's identity defines and shapes its internal dynamics. Our current understanding of this interplay is mostly limited to glimpses gathered from
isolated studies of individual communities. In this work we provide a systematic exploration of the nature of this relation across a wide variety of online
communities. To this end we introduce a quantitative, language-based typology reflecting two key aspects of a community's identity: how
distinctive, and how temporally dynamic it is. By mapping almost 300 Reddit communities into the landscape induced by this typology, we reveal
regularities in how patterns of user engagement vary with the characteristics of a community. Our results suggest that the way new and existing users engage with a
community depends strongly and systematically on the nature of the collective identity it fosters, in ways that are highly
consequential to community maintainers

In [71]:
turbo.inspect_history(n=1)





Rewrite the question so that it is self-contained using the abstract and title. The rewrite should not include the title.

---

Follow the following format.

Title: ${title}

Abstract: ${abstract}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the rewrite}. We ...

Rewrite: rewritten version of the question

---

Title: PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry

Abstract: Most approaches to emotion analysis regarding social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions that have been shown to also include mixed emotional responses. We consider emotions as they are elicited in the reader, rather than what is expressed in the text or intended by the author. Thus, we conceptualize a set of aesthetic emotions t

In [47]:
class AbstractRewrite(dspy.Signature):
    """Rewrite the question so that it is self-contained using the information given in the abstract only."""

    # title = dspy.InputField()
    abstract = dspy.InputField()
    question = dspy.InputField()
    rewrite = dspy.OutputField(desc="rewritten version of the question")

In [65]:
# Using abstract only => not really that good
generate_rewrite_predict = dspy.Predict(AbstractRewrite, n=5)
generate_rewrite_CoT = dspy.ChainOfThought(AbstractRewrite, n=5)

for idx, eachQuestion in enumerate(data):
    if idx != question_number: continue
    print(f'Title: {eachQuestion["title"]}')
    print_in_multiple_lines(f'Abstract: {eachQuestion["abstract"]}', 30)
    print(f'Question: {eachQuestion["question"]}\n')
    
    print('Predict:')
    pred = generate_rewrite_predict(title=eachQuestion["title"] , abstract=eachQuestion["abstract"], question=eachQuestion["question"])
    for idx, c in enumerate(pred.completions): 
        print(f"Rewrite {idx+1}: {c.rewrite}")

    print('Chain of Thought:')
    pred = generate_rewrite_CoT(title=eachQuestion["title"] , abstract=eachQuestion["abstract"], question=eachQuestion["question"])
    for idx, c in enumerate(pred.completions): 
        print(f"Rewrite {idx+1}: {c.rewrite}")

Title: PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry
Abstract: Most approaches to emotion analysis regarding social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or
Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions that have been shown to also
include mixed emotional responses. We consider emotions as they are elicited in the reader, rather than what is expressed in the text or
intended by the author. Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the
annotation of multiple labels per line to capture mixed emotions within context. We evaluate this novel setting in an annotation experiment both with
carefully trained experts and via crowdsourcing. Our annotation with experts leads to an acceptable agreement of kappa=.70, resulti

### Step 1.5: Summarize Abstract -> Rewrite Pipeline 

In [72]:
class Summarize(dspy.Signature):
    """Summarize the title and abstract into two sentence."""

    title = dspy.InputField()
    abstract = dspy.InputField()
    question = dspy.InputField()
    summary = dspy.OutputField()

In [23]:
class SummaryRewrite(dspy.Signature):
    """Rewrite the question so that it is self-contained using the summary."""

    summary = dspy.InputField()
    question = dspy.InputField(desc="question to be rewritten")
    rewrite = dspy.OutputField(desc="rewritten version of the question")

##### Compile SummaryRewrite using Trainset

In [73]:
import random
Demo1 = dspy.Example(summary="The paper proposes a method for learning affective events using discourse relations, which is effective even without manually labeled data. The method only requires a small seed lexicon and a large raw corpus, and it improves supervised learning results when labeled data are limited.", 
                     question="What is the seed lexicon?", dspy_uuid=random.randint(0, 9999999), 
                     rewrite="What is the seed lexicon used in the paper on learning affective events using discourse relations, which is effective even without manually labeled data?")

In [26]:
examples = [Demo1]
trainset = [x.with_inputs('summary', 'question') for x in examples]
for x in trainset:
    print(x)

Example({'summary': 'The paper proposes a method for learning affective events using discourse relations, which is effective even without manually labeled data. The method only requires a small seed lexicon and a large raw corpus, and it improves supervised learning results when labeled data are limited.', 'question': 'What is the seed lexicon?', 'rewrite': 'What is the seed lexicon used in the paper on learning affective events using discourse relations, which is effective even without manually labeled data?'}) (input_keys={'question', 'summary'})


In [74]:
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_rewrite = dspy.ChainOfThought(SummaryRewrite)
    
    def forward(self, summary, question):
        prediction = self.generate_rewrite(summary=summary, question=question)
        return dspy.Prediction(summary=summary, question=question, rewrite=prediction.rewrite)

In [29]:
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# NO validation logic 

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot()

# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

100%|██████████| 1/1 [00:01<00:00,  1.38s/it]


In [67]:
generate_summary = dspy.ChainOfThought(Summarize)
generate_rewrite3 = dspy.ChainOfThought(SummaryRewrite, n=1)

for idx, eachQuestion in enumerate(data):
    if idx != question_number: continue
    print(f'Title: {eachQuestion["title"]}')
    print_in_multiple_lines(f'Abstract: {eachQuestion["abstract"]}', 30)
    print(f'Question: {eachQuestion["question"]}')

    # Generate summary
    pred = generate_summary(title=eachQuestion["title"], abstract=eachQuestion["abstract"], question=eachQuestion["question"])
    print_in_multiple_lines(f"Summary: {pred.summary}", 30)
    summary = pred.summary

    # Generate rewrite using the summary
    print('\nZero-shot version:')
    pred = generate_rewrite3(summary=summary, question=eachQuestion["question"])
    for idx, c in enumerate(pred.completions): 
        print(f"Rewrite {idx+1}: {c.rewrite}")

    print('Compiled version:')
    pred = compiled_rag(summary=summary, question=eachQuestion["question"])
    print(f"Rewrite {idx+1}: {pred.rewrite}")

Title: PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry
Abstract: Most approaches to emotion analysis regarding social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or
Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions that have been shown to also
include mixed emotional responses. We consider emotions as they are elicited in the reader, rather than what is expressed in the text or
intended by the author. Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the
annotation of multiple labels per line to capture mixed emotions within context. We evaluate this novel setting in an annotation experiment both with
carefully trained experts and via crowdsourcing. Our annotation with experts leads to an acceptable agreement of kappa=.70, resulti

In [31]:
turbo.inspect_history(n=1)





Rewrite the question so that it is self-contained using the summary.

---

Follow the following format.

Summary: ${summary}

Question: question to be rewritten

Reasoning: Let's think step by step in order to ${produce the rewrite}. We ...

Rewrite: rewritten version of the question

---

Summary: The paper proposes a method for learning affective events using discourse relations, which is effective even without manually labeled data. The method only requires a small seed lexicon and a large raw corpus, and it improves supervised learning results when labeled data are limited.

Question: What is the seed lexicon?

Reasoning: Let's think step by step in order to produce the rewrite. We need to understand what the seed lexicon is in order to rewrite the question.

Rewrite: What is the definition or purpose of the seed lexicon in the proposed method for learning affective events using discourse relations?

---

Summary: The study focuses on conceptualizing and annotating aesthetic em

### Step 2: Make Rewrite Human-like (yet to move)

### Step 3: Create Rubric (to select best rewrites from each pipeline)

In [92]:
# 15 Examples (1~15)
titles = ["How Language-Neutral is Multilingual BERT?", 
          "How Language-Neutral is Multilingual BERT?", 
          "CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset", 
          "Learning Supervised Topic Models for Classification and Regression from Crowds", 
          "Learning Supervised Topic Models for Classification and Regression from Crowds",
          "How Language-Neutral is Multilingual BERT?",
          "Stay On-Topic: Generating Context-specific Fake Restaurant Reviews",
          "RobBERT: a Dutch RoBERTa-based Language Model",
          "DENS: A Dataset for Multi-class Emotion Analysis",
          "Unsupervised Machine Commenting with Neural Variational Topic Model",
          "Stay On-Topic: Generating Context-specific Fake Restaurant Reviews",
          "Unsupervised Machine Commenting with Neural Variational Topic Model",
          "Learning Supervised Topic Models for Classification and Regression from Crowds",
          "DENS: A Dataset for Multi-class Emotion Analysis",
          "Automatic Target Recovery for Hindi-English Code Mixed Puns"]
# abstracts = []
questions = ["How do they show that mBERT representations can be split into a language-specific component and a language-neutral component?", 
             "How do they show that mBERT representations can be split into a language-specific component and a language-neutral component?", 
             "What are the benchmark models?", 
             "What datasets were used?", 
             "What datasets were used?",
             "What challenges does this work present that must be solved to build better language-neutral representations?", 
             "Which dataset do they use a starting point in generating fake reviews?", 
             "What data did they use?",
             "What is the size of this dataset?",
             "Which lexicon-based models did they compare with?",
             "Which dataset do they use a starting point in generating fake reviews?",
             "Which lexicon-based models did they compare with?",
             "What datasets were used?", 
             "What is the size of this dataset?",
             "What are Puns?"]
rewrites = ["What method do the authors use to demonstrate that mBERT representations can be divided into a language-specific component and a language-neutral component in the paper \"How Language-Neutral is Multilingual BERT?\"?", 
            "What evidence do they provide to demonstrate that mBERT representations can be split into a language-specific component and a language-neutral component in the paper on the language-neutrality of mBERT?",
            "What benchmark models are provided for pipelined task-oriented dialogue systems in the CrossWOZ dataset?", 
            "Which datasets were used in the study on learning supervised topic models for classification and regression from crowds?",
            "What datasets were used in the paper on learning supervised topic models for classification and regression from crowds?",
            "What challenges need to be addressed in order to improve the language-neutral representations in multilingual BERT?", 
            "What dataset do the authors use as a starting point in generating fake restaurant reviews in the paper?",
            "What type of data did the authors use to train the Dutch language model RobBERT in the paper?", 
            "What is the average length of the passages in the DENS dataset for multi-class emotion analysis?",
            "What lexicon-based models did the authors compare their proposed topic-based approach to in the paper on unsupervised machine commenting?",
            "What dataset do the authors of the paper \"Stay On-Topic: Generating Context-specific Fake Restaurant Reviews\" use as a starting point in generating fake reviews?",
            "What are the lexicon-based models that the authors compare their proposed topic-based approach to in the paper?",
            "What datasets were used to evaluate the proposed supervised topic models in the paper?",
            "What is the size of the DENS dataset introduced in the paper?",
            "What are puns and how are they classified in this paper?"]

In [94]:
# Metrics Answers
metric1 = ["Yes", "Yes", "Yes", "Yes", "Yes", "No" , "Yes", "Yes", "No" , "Yes", "Yes", "Yes", "Yes", "No", "Yes"]   # Metric 1
metric3 = ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No" , "Yes", "Yes", "Yes", "Yes", "No" , "No" , "Yes", "No"]   # Metric 2

#### Metric 1: Same Answer?

In [98]:
class SameAnswer(dspy.Signature):
    """Determine whether the rewrite is asking for the same thing as the original question at its core but with a specific focus on the study/paper."""

    title = dspy.InputField(desc="title of the paper")
    question = dspy.InputField(desc="the original question")
    rewrite = dspy.InputField(desc="the rewrite of the original question")
    answer = dspy.OutputField(desc="the answer in Yes or No")

In [99]:
# Define the predictor.
generate_answer = dspy.ChainOfThought(SameAnswer)

for idx in range(15):
  # if idx != 10: continue
  # Call the predictor on a particular input.
  pred = generate_answer(title=titles[idx], question=questions[idx], rewrite=rewrites[idx])

  # Print the input and the prediction.
  print(f'Example {idx+1}')
  print(f"Title: {titles[idx]}\nQuestion: {questions[idx]}\nRewrite: {rewrites[idx]}")
  print(f"Same Answer?: {pred.answer}")

Example 1
Title: How Language-Neutral is Multilingual BERT?
Question: How do they show that mBERT representations can be split into a language-specific component and a language-neutral component?
Rewrite: What method do the authors use to demonstrate that mBERT representations can be divided into a language-specific component and a language-neutral component in the paper "How Language-Neutral is Multilingual BERT?"?
Same Answer?: Yes
Example 2
Title: How Language-Neutral is Multilingual BERT?
Question: How do they show that mBERT representations can be split into a language-specific component and a language-neutral component?
Rewrite: What evidence do they provide to demonstrate that mBERT representations can be split into a language-specific component and a language-neutral component in the paper on the language-neutrality of mBERT?
Same Answer?: Yes
Example 3
Title: CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
Question: What are the benchmark models?
Re

In [100]:
turbo.inspect_history(n=1)





Determine whether the rewrite is asking for the same thing as the original question at its core but with a specific focus on the study/paper.

---

Follow the following format.

Title: title of the paper

Question: the original question

Rewrite: the rewrite of the original question

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: the answer in Yes or No

---

Title: Automatic Target Recovery for Hindi-English Code Mixed Puns

Question: What are Puns?

Rewrite: What are puns and how are they classified in this paper?

Reasoning: Let's think step by step in order to[32m produce the answer. We are asked to determine whether the rewrite is asking for the same thing as the original question at its core but with a specific focus on the study/paper. In the original question, the focus is on understanding what puns are. In the rewrite, the focus is on understanding what puns are and how they are classified specifically in the paper titled "Automatic

#### Metric 2: Includes Title?

In [97]:
# METRIC 5 Functions
import re
import string
import unicodedata

from collections import Counter
from dsp.utils.utils import print_message
from dsp.utils.metrics import *    # import everything from dsp.utils.metrics

def F1(prediction, answers_list):
    assert type(answers_list) == list
    return max(F1_score(prediction, ans) for ans in answers_list)
def F1_score(prediction, ground_truth):
    prediction_tokens = normalize_text(prediction).split()
    ground_truth_tokens = normalize_text(ground_truth).split()

    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())

    if len(prediction_tokens) == len(ground_truth_tokens) == 0:
        # Unlike most tasks, QReCC and SQuAD-2.0 assign 1.0 in this edge case. We don't for uniformity.
        print_message(
            "\n#> F1 Metric: Rare edge case of len(prediction_tokens) == len(ground_truth_tokens) == 0.\n")

    if num_same == 0:
        return 0

    precision = 1.0 * num_same / len(prediction_tokens) # overlap in title
    recall = 0 * num_same / len(ground_truth_tokens) # overlap in the rewrite (not important)
    f1 = (2 * precision * recall) / (precision + recall)
    f1 = precision

    return f1
def longest_common_substring(str1, str2):
    m = len(str1)
    n = len(str2)

    # Create a 2D table to store the length of common substrings
    # where dp[i][j] represents the length of common substring ending at str1[i-1] and str2[j-1]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Variables to keep track of the length of the longest common substring and its ending index
    max_len = 0
    end_idx = 0
    
    # Fill in the table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > max_len:
                    max_len = dp[i][j]
                    end_idx = i

    # Extract the longest common substring
    longest_substring = str1[end_idx - max_len:end_idx]

    return longest_substring
def contiguous_ratio(title, rewrite):
    title = normalize_text(title)
    string = normalize_text(rewrite)
    substring = longest_common_substring(title, string)
    contig_ratio = len(substring)/len(title)
    # print("Longest common substring:", substring)
    return contig_ratio
def getCombinedScore(title, rewrite):
    f1_score = F1(prediction=title, answers_list=[rewrite])
    contiguous_score = contiguous_ratio(title, rewrite)
    return f1_score * contiguous_score

In [90]:
def Metric5(example, rewrite):
    combined_score = getCombinedScore(example.title, rewrite)
    return "No" if (combined_score < 0.3) else "Yes"

In [91]:
# Metric 5 Tester
title = "Unsupervised Machine Commenting with Neural Variational Topic Model"
rewrite = "What lexicon-based models did the authors compare their proposed topic-based approach to in the paper on unsupervised machine commenting?"
combined_score = getCombinedScore(title, rewrite)
print(combined_score)

0.17350746268656717


#### Metric 3: Specific Enough? (MOST IMPORTANT!)

In [95]:
class SpecificEnough1(dspy.Signature):
    """Answer the question"""
    
    rewrite = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often 2 to 3 sentences")

In [96]:
# Define the predictor.
Question1 = "Given only the rewrite, how would one go about finding the correct paper using search?"
generate_answer = dspy.ChainOfThought(SpecificEnough1)

for idx in range(15):
  if idx != 11: continue

  # Call the predictor on a particular input.
  pred = generate_answer(rewrite=rewrites[idx])

  # Print the input and the prediction.
  print(f'Example {idx+1}')
  print(f"Title: {titles[idx]}\nRewrite: {rewrites[idx]}")
  print(f"Specific Enough?: {pred.answer}")

Example 12
Title: Unsupervised Machine Commenting with Neural Variational Topic Model
Rewrite: What are the lexicon-based models that the authors compare their proposed topic-based approach to in the paper?
Specific Enough?: produce the answer. We need to identify the lexicon-based models that the authors compare their proposed topic-based approach to in the paper. To do this, we should carefully read the paper and look for any mentions or discussions about these models.

Answer: In the paper, the authors compare their proposed topic-based approach to three lexicon-based models: SentiWordNet, V


### Step 4: Filter Best Rewrites using Rubric

Scoring Method v1: 
- Same Answer => +2 points
- DOESN'T Includes Title => +1 point
- Specific Enough => +4 points
Once we have this done, it will be used as the validation metric to train/optimize our rewrites AND used to filter the best rewrites..?