Goal: improve the performance of a small LLM at generating competitive debate speeches using inference-time techniques alone (prompt engineering and pipeline architecture), with no additional training.

Rationale: Large SOTA closed-source LLMs such as Gemini are good at generating emotionally charged arguments for debates, but their debate speeches as a whole could be improved, and the size of these models makes them unwieldy. We want to see whether smaller open-source LLMs can match their performance without expensive training.

Experiments in architecture modifications:

We add a dedicated layer for argument extraction and refutation, in the hope that it will improve how our responses are structured. The refutations are passed into the response generator as context.

After 12/3, we may try prompt engineering for each layer to find the best-working prompt

# Setup

Some of this might not be necessary. Copied from HW1

In [None]:
!pip install -U gdown
!gdown https://drive.google.com/uc?id=1wUpcOYU-eT1pe_je6qQLa8v4TBPUhPae
!unzip code.zip
!rm -r __MACOSX
!mv cs224v_hw1_code/* .
!rm -r cs224v_hw1_code
!rm requirements.txt
!rm notebook.ipynb

In [None]:
!pip install dspy-ai==3.0.3
!pip install crawl4ai==0.7.4
!pip install langchain_text_splitters==0.3.11
!pip install jupyter==1.0.0
!pip install nbconvert[webpdf]

!playwright install

In [None]:
import json
import os
from typing import List, Tuple

import dspy
import httpx
from tqdm import tqdm

from src.dataclass import (
    LiteratureSearchAgentRequest,
    LiteratureSearchAgentResponse,
    RagRequest,
    RagResponse,
    RetrievedDocument,
)
from src.encoder import Encoder
from src.literature_search import LiteratureSearchAgent
from src.lm import init_lm, LanguageModelProviderConfig, LanguageModelProvider, LiteLLMServerConfig
from src.retriever_agent.serper_rm import SerperRM
from src.rag import RagAgent

from google.colab import userdata

# Argument and Rebuttal Layer

In [None]:
# Currently this will be used only for the prime minister speech
class ArgumentFormulator(dspy.Signature):
    """
    You are in a debate tournament and tasked with generating arguments in support of or in opposition to a motion. You will be given the debate motion and your assigned position on that motion (whether you support or oppose it). Based on these inputs, generate up to 10 relevant arguments that address the motion from your assigned position. Each argument should be 1 to 2 sentences long, or at most 3 if absolutely necessary.
    """
    motion: str = dspy.InputField(
        desc="The debate motion being discussed"
    )
    position: str = dspy.InputField(
        desc="Our assigned position on the debate motion, which we will generate arguments from"
    )

    arguments: List[str] = dspy.OutputField(
        desc="List of generated arguments that address the debate motion from our assigned position"
    )


In [None]:
# Ideas not yet tried:

# We could try to get the extractor to return two separate lists
# - The most important arguments
# - The weakest arguments

# We could also try to annotate the argument with its type, if it's a counterargument
# to one of our side's arguments, for example, or if it's a standard argument
# trope, for which we could draw on a database of successful responses to that
# trope.

class ArgumentExtractor(dspy.Signature):
    """
    You are in a debate tournament and working on a response to the opposing side's speech. You will be given the debate motion, the opponent's assigned position, and the transcript of the opponent's speech. Based on these inputs, extract up to 6 of the most important or weakest arguments made by the opponent and return them in a list. Each argument should be condensed or summarized from claims made in the speech, and it should be 1 to 2 sentences long, or at most 3 if absolutely necessary.
    """
    opponent_speech: str = dspy.InputField(
        desc="The transcript of the opposing team's speech"
    )
    motion: str = dspy.InputField(
        desc="The debate motion being discussed"
    )
    position: str = dspy.InputField(
        desc="The position of the opposing team in the debate"
    )


    extracted_arguments: List[str] = dspy.OutputField(
        desc="List of arguments extracted from the opponent's speech"
    )
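
The untried ideas listed in the comments above (separate lists for the most important and the weakest arguments, plus a type annotation per argument) could be prototyped with a small data model before changing the signature. This is only a sketch: `ArgumentKind`, `AnnotatedArgument`, the 1-to-5 `importance` and `strength` scores, and `split_arguments` are all hypothetical names and scales, not part of the notebook or of DSPy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ArgumentKind(str, Enum):
    """Hypothetical argument types, following the ideas in the comments."""
    STANDARD = "standard"                # an ordinary argument for their side
    COUNTERARGUMENT = "counterargument"  # a reply to one of our own arguments
    TROPE = "trope"                      # a well-known argument trope


@dataclass
class AnnotatedArgument:
    text: str
    kind: ArgumentKind
    importance: int  # assumed scale: 1 (minor) to 5 (central to their case)
    strength: int    # assumed scale: 1 (weak) to 5 (strong)


def split_arguments(
    arguments: List[AnnotatedArgument], top_k: int = 3
) -> Tuple[List[AnnotatedArgument], List[AnnotatedArgument]]:
    """Return (most important, weakest) arguments, each capped at top_k."""
    most_important = sorted(arguments, key=lambda a: a.importance, reverse=True)[:top_k]
    weakest = sorted(arguments, key=lambda a: a.strength)[:top_k]
    return most_important, weakest


# Example with made-up placeholder arguments.
example_arguments = [
    AnnotatedArgument("Argument A", ArgumentKind.STANDARD, importance=5, strength=4),
    AnnotatedArgument("Argument B", ArgumentKind.TROPE, importance=2, strength=1),
    AnnotatedArgument("Argument C", ArgumentKind.COUNTERARGUMENT, importance=4, strength=3),
]
important, weakest = split_arguments(example_arguments, top_k=2)
```

An argument can appear in both lists, which may be desirable: an argument that is both central and weak is the most valuable one to refute.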

In [None]:
# Should we ask the model to sort the refutations in order of strongest
# to weakest?
# Should we explicitly tell the model that short, pithy refutations
# are okay, so it doesn't assume long-winded thoroughness is needed?
class ArgumentRefuter(dspy.Signature):
    """
    You are in a debate tournament and working on a response to the opposing side's speech. You are given the debate motion, the opponent's assigned position, and a list of arguments extracted from the opponent's latest speech. Based on these inputs, refute the opponent's arguments, returning a list of at most 6 refutations. Each refutation may be 1 to 8 sentences long and should identify the argument or arguments that you are refuting. You may also point out contradictions between two or more of the opponent's arguments.
    """
    extracted_arguments: List[str] = dspy.InputField(
        desc="List of arguments extracted from the opponent's speech"
    )
    motion: str = dspy.InputField(
        desc="The debate motion being discussed"
    )
    position: str = dspy.InputField(
        desc="The position of the opposing team in the debate"
    )

    refutations: List[str] = dspy.OutputField(
        desc="List of refutations for the opponent's arguments"
    )

# Speech generators for each stage

Should we have different prompts for the first speech, the turns, and the concluding speech?
TODO: improve the prompts here.

In [None]:
class OpeningSpeechGenerator(dspy.Signature):
  """
  You are in a debate tournament and tasked with generating the prime minister speech,
  which is the opening speech in the debate. Given the debate motion, your position on
  this motion, and a list of possible arguments you could use, generate a well-structured,
  expressive debate speech. You are not limited to the suggested arguments, and you do
  not need to draw from them; feel free to generate your own.
  """
  motion: str = dspy.InputField(
      desc="The debate motion being discussed"
  )
  position: str = dspy.InputField(
      desc="Our assigned position on the debate motion"
  )
  argument_suggestions: List[str] = dspy.InputField(
      desc="List of possible arguments that may be helpful for our speech"
  )

  speech: str = dspy.OutputField(
      desc="The generated opening speech"
  )


In [None]:
# Generate a speech in response to opponent's speech.
# Should we include the original transcript at all? Or is including the refutations enough?
# Should we include a list of all the arguments we've made, in case we want to add onto those arguments?
# If not, should the speech be merely a response to the previous speaker's?
# How do we refute the opponent's refutations?
class ResponseGenerator(dspy.Signature):
  """
  You are in a debate tournament and working on a response to the opposing side's speech.
  You are given the debate motion, the opponent's assigned position on that motion,
  and a list of possible refutations to the arguments made in the opponent's latest speech.
  Based on these inputs, generate a well-structured debate speech responding to the opponent,
  making sure to address the weaknesses in their argument.
  """
  motion: str = dspy.InputField(
      desc="The debate motion being discussed"
  )
  position: str = dspy.InputField(
      desc="The position of the opposing team in the debate"
  )
  refutations: List[str] = dspy.InputField(
      desc="List of refutations for the opponent's arguments"
  )

  response: str = dspy.OutputField(
      desc="The generated debate speech responding to the opponent"
  )

In [None]:
class ConcludingSpeechGenerator(dspy.Signature):
  """
  You are in a debate tournament and working on your final response to the opposing side's speech.
  You are given the debate motion, the opponent's assigned position on that motion,
  and a list of possible refutations to the arguments made in the opponent's latest speech.
  Based on these inputs, generate a well-structured debate speech responding to the opponent,
  poking holes in their arguments, and wrapping up your side of the debate.
  """
  motion: str = dspy.InputField(
      desc="The debate motion being discussed"
  )
  position: str = dspy.InputField(
      desc="The position of the opposing team in the debate"
  )
  refutations: List[str] = dspy.InputField(
      desc="List of refutations for the opponent's arguments"
  )

  response: str = dspy.OutputField(
      desc="The generated debate speech responding to the opponent"
  )

# Experiments

First, check whether the pipeline improves on SOTA model performance.

Then repeat the check using only smaller models, since an improvement is more realistic there.

In [None]:
speech_generation_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # Reasoning for using this model for speech generation?
    temperature=1.0,
    max_tokens=10000, # Fix, how long should a speech be?
    litellm_server_config=LiteLLMServerConfig(api_key=userdata.get("LITELLM_API_KEY"), api_base=userdata.get("LITELLM_API_BASE"))
)
speech_generation_lm = init_lm(speech_generation_lm_config)

In [None]:
argument_formulation_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # Reasoning for using this model for argument formulation?
    temperature=1.0,
    max_tokens=10000, # Fix, how long should each argument be? No more than 3 sentences, which is how many tokens?
    litellm_server_config=LiteLLMServerConfig(api_key=userdata.get("LITELLM_API_KEY"), api_base=userdata.get("LITELLM_API_BASE"))
)
argument_formulation_lm = init_lm(argument_formulation_lm_config)

In [None]:
rebuttal_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # Reasoning for using this model for rebuttal formulation?
    temperature=1.0,
    max_tokens=10000, # Fix, how long should each rebuttal be? No more than 8 sentences, which is how many tokens?
    litellm_server_config=LiteLLMServerConfig(api_key=userdata.get("LITELLM_API_KEY"), api_base=userdata.get("LITELLM_API_BASE"))
)
rebuttal_lm = init_lm(rebuttal_lm_config)
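
The `max_tokens` comments in the config cells ask how many tokens a 3-sentence argument or an 8-sentence rebuttal needs. A rough back-of-the-envelope estimate is sketched here. The figures of about 20 words per sentence, about 1.3 tokens per English word, and a 50% margin for list formatting are assumptions, not measurements, and `token_budget` is a hypothetical helper. The real ratio depends on the model's tokenizer.

```python
# All three constants are assumptions for a rough estimate, not measured values.
WORDS_PER_SENTENCE = 20        # assumed average sentence length
TOKENS_PER_100_WORDS = 130     # assumed ~1.3 tokens per English word
SAFETY_MARGIN_PERCENT = 50     # assumed headroom for list/field formatting


def token_budget(items: int, sentences_per_item: int) -> int:
    """Estimate an output token budget for `items` entries of a given length."""
    words = items * sentences_per_item * WORDS_PER_SENTENCE
    scaled = words * TOKENS_PER_100_WORDS * (100 + SAFETY_MARGIN_PERCENT)
    return -(-scaled // 10_000)  # integer ceiling division


# Limits taken from the signature docstrings:
# up to 10 arguments of at most 3 sentences, up to 6 refutations of at most 8 sentences.
argument_budget = token_budget(items=10, sentences_per_item=3)
rebuttal_budget = token_budget(items=6, sentences_per_item=8)
print(argument_budget, rebuttal_budget)  # prints: 1170 1872
```

Under these assumptions both budgets fall well below the current 10000, so the limit could probably be lowered. Counting tokens in a few real outputs with the model's own tokenizer would give a more reliable figure.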

In [None]:
generate_initial_arguments = dspy.Predict(ArgumentFormulator) # Generate arguments given a motion and position
extract_arguments = dspy.Predict(ArgumentExtractor) # Extract arguments from a transcript of the opponent's speech. Takes in opponent_speech, motion, position. Outputs extracted_arguments
refute_arguments = dspy.Predict(ArgumentRefuter) # Respond to arguments extracted from the opponent's speech. Input: extracted_arguments, motion, position. Output: refutations

generate_opening_speech = dspy.Predict(OpeningSpeechGenerator) # Generate the opening speech. Input: motion, position, argument_suggestions. Output: speech
generate_response = dspy.Predict(ResponseGenerator) # Generate a response to the opponent's speech. Input: motion, position, refutations. Output: response
generate_concluding_speech = dspy.Predict(ConcludingSpeechGenerator) # Generate a concluding speech. Same fields as in generate_response, just different instructions.
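
The extraction, refutation, and response stages described at the top of the notebook can be chained as in the sketch below. It is a minimal illustration, not the notebook's final implementation: `build_rebuttal_speech` is a hypothetical helper, and the three stages are passed in as callables so the wiring can be checked offline with stubs. In the notebook, the `dspy.Predict` objects `extract_arguments`, `refute_arguments`, and `generate_response` would be passed in instead, inside the appropriate `dspy.context(lm=...)` block.

```python
from types import SimpleNamespace
from typing import Any, Callable


def build_rebuttal_speech(
    opponent_speech: str,
    motion: str,
    opponent_position: str,
    extract: Callable[..., Any],
    refute: Callable[..., Any],
    respond: Callable[..., Any],
) -> str:
    """Chain the three stages; field names match the signatures defined earlier."""
    extracted = extract(
        opponent_speech=opponent_speech, motion=motion, position=opponent_position
    ).extracted_arguments
    refutations = refute(
        extracted_arguments=extracted, motion=motion, position=opponent_position
    ).refutations
    return respond(
        motion=motion, position=opponent_position, refutations=refutations
    ).response


# Stub stages standing in for the dspy.Predict modules (no LM call is made).
def stub_extract(opponent_speech, motion, position):
    claims = [s.strip() for s in opponent_speech.split(".") if s.strip()]
    return SimpleNamespace(extracted_arguments=claims)


def stub_refute(extracted_arguments, motion, position):
    return SimpleNamespace(refutations=[f"Not so: {a}" for a in extracted_arguments])


def stub_respond(motion, position, refutations):
    return SimpleNamespace(response=" ".join(refutations))


demo_speech = build_rebuttal_speech(
    opponent_speech="Claim one. Claim two.",
    motion="Placeholder motion",
    opponent_position="opposition",
    extract=stub_extract,
    refute=stub_refute,
    respond=stub_respond,
)
```

Passing the stages as parameters also makes it easy to swap in a different LM or prompt for one stage when comparing configurations.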

In [None]:
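# TODO: define `motion` and `position` (left unset in the original cell) before running.
with dspy.context(lm=argument_formulation_lm):
    initial_arguments = (await generate_initial_arguments.aforward(
        motion=motion,
        position=position,
    )).arguments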