# Research: Forecasting with LLMs
This notebook demonstrates ways of using `edsl` to reproduce experiments in the recent paper [Approaching Human-Level Forecasting with Language Models](https://arxiv.org/html/2402.18563v1) by Danny Halawi, Fred Zhang, Chen Yueh-Han and Jacob Steinhardt, exploring whether LLMs are as capable as human experts in answering forecasting questions.

## LLM prompts
The experiments involved prompting LLMs to do a variety of tasks related to forecasting questions (questions that the LLMs were asked to predict but could not know the answers to based on training data cutoff dates). These tasks included:
1. Generating search queries for (time-bound) news articles relating to a forecasting question
2. Reviewing retrieved articles and ranking their relevancy to the question
3. Summarizing the articles

The paper also presents a reasoning system in which an LLM is prompted to go through the following steps relating to a forecasting question in order to improve its answer:
1. Rephrase and expand the original question
2. Consider reasons why the answer may be no and rate them 
3. Consider reasons why the answer may be yes and rate them
4. Aggregate the considerations 
5. Provide an initial prediction based on all of the above
6. Evaluate the prediction 
7. Provide a final prediction

Fig.1 in the paper shows these news retrieval and reasoning systems:

<img src="forecasting_llms_fig1.png" width="700">

## Recreating the prompts
In a series of examples we show how to recreate the above-mentioned prompts in `edsl` by formatting them as survey questions that can be administered to LLMs. Edsl provides a variety of `Question` object types for this: our examples use free text, linear scale, list and numerical question types; other available types include multiple choice, checkbox, budget and yes/no.

Edsl also allows us to specify `Agent` personas for LLMs to reference in responding to questions (e.g., <i>You are... You have a background in...</i>). We provide examples of surveys administered to personified agents as well.

For more details on the edsl package and examples of methods in this notebook, please see:
<a href="https://deepnote.com/workspace/expected-parrot-c2fa2435-01e3-451d-ba12-9c36b3b87ad9/project/Expected-Parrot-examples-b457490b-fc5d-45e1-82a5-a66e1738a4b9/notebook/Tutorial%20-%20Starter%20Tutorial-e080f5883d764931960d3920782baf34">Starter Tutorial</a>
<a href="https://deepnote.com/workspace/expected-parrot-c2fa2435-01e3-451d-ba12-9c36b3b87ad9/project/Expected-Parrot-examples-b457490b-fc5d-45e1-82a5-a66e1738a4b9/notebook/Tutorial%20-%20Building%20Your%20Research-444f68e01bb24974a796058f55e670c7">Building Your Research</a>
<a href="https://deepnote.com/workspace/expected-parrot-c2fa2435-01e3-451d-ba12-9c36b3b87ad9/project/Expected-Parrot-examples-b457490b-fc5d-45e1-82a5-a66e1738a4b9/notebook/Tutorial%20-%20Exploring%20Your%20Results-7e72d5f898d948b4a22047c07e50561c">Exploring Your Results</a>

## Installing edsl
We use the latest version of the [Edsl](https://pypi.org/project/edsl/) library available at PyPI.

In [1]:
# EDSL should be automatically installed when you run this notebook. If not, run the following command:
# ! pip install edsl

**Note:** The first time we import from `edsl` we will be prompted to optionally enter API keys for the LLMs that we want to use in simulating results later on (press return to skip entering any key).

## Search query generation prompts
The prompts for the LLM to generate search queries (Fig.12):

<img src="forecasting_llms_fig12.png" width="600">

We use `QuestionList` to recreate these prompts, as a readily usable list of query strings is desired:

In [2]:
from edsl.questions import QuestionList

q_search_queries = QuestionList(
    question_name = "search_queries",
    question_text = """Consider the following forecasting question and background information for the question.
    Question: {{ question }}
    Background: {{ background }}
    Task: Generate brief search queries (up to {{ max_words }} words each) to gather information on Google that could 
    influence the forecast.
    You must generate this exact number of queries: {{ num_keywords }}
    """
)

q_search_queries_subquestions = QuestionList(
    question_name = "search_queries_subquestions",
    question_text = """Consider the following forecasting question and background information for the question.
    Then generate short search queries (up to {{ max_words }} words each) to be used to find articles on Google News
    to help answer the question.
    Question: {{ question }}
    Background: {{ background }}
    You must generate this exact number of queries: {{ num_keywords }}
    Start off by writing down sub-questions. Then use your sub-questions to help steer the search queries you produce.
    """
)
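
Both prompts ask for an exact number of queries and a word limit per query, but a model may not always comply. The following is a minimal plain-Python sketch of a check we could run on a returned list. `check_queries` is a hypothetical helper (not part of edsl), and the sample list is made up for illustration rather than taken from a model response.

```python
def check_queries(queries, max_words, num_keywords):
    """Return a list of human-readable problems; empty if the list complies."""
    problems = []
    if len(queries) != num_keywords:
        problems.append(f"expected {num_keywords} queries, got {len(queries)}")
    for query in queries:
        # Count whitespace-separated words in each query
        if len(query.split()) > max_words:
            problems.append(f"query exceeds {max_words} words: {query!r}")
    return problems


# Made-up example list, for illustration only (not a real model response)
sample_queries = [
    "Starship launch date",
    "SpaceX launch license April",
    "Starship frozen valve scrub retry schedule update",
]

problems = check_queries(sample_queries, max_words=5, num_keywords=5)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than a single boolean makes it easier to see which constraint was missed before deciding whether to rerun the question.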

## Parameterizing the prompts
We parameterize a question by creating `Scenario` objects for the parameter values ("scenarios" of a question with different inputs). For the questions above we can add a scenario with values for `{{ question }}`, `{{ background }}`, `{{ max_words }}` and `{{ num_keywords }}` using the example in Table 1 as follows:

<img src="forecasting_llms_table12.png" width="600">

In [3]:
from edsl import Scenario

scenario = Scenario({
    "question": "Will Starship achieve liftoff before Monday, May 1st, 2023?", 
    "background": "On April 14th, SpaceX received a launch license for its Starship spacecraft. A launch scheduled for April 17th was scrubbed due to a frozen valve. SpaceX CEO Elon Musk tweeted: “Learned a lot today, now offloading propellant, retrying in a few days...", 
    "max_words": 5, 
    "num_keywords": 5
})
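
To preview roughly what a parameterized prompt looks like once the scenario values are filled in, we can mimic the substitution in plain Python. This is only a stand-in for illustration: `render_prompt` is a hypothetical helper, the shortened template is our own, and this is not how edsl renders prompts internally.

```python
import re


def render_prompt(template, values):
    """Replace each {{ name }} placeholder with values[name].

    Placeholders whose name is not in `values` are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Shortened, illustrative template (not the full prompt from the paper)
template = "Generate {{ num_keywords }} queries of up to {{ max_words }} words for: {{ question }}"

values = {
    "question": "Will Starship achieve liftoff before Monday, May 1st, 2023?",
    "max_words": 5,
    "num_keywords": 5,
}

rendered = render_prompt(template, values)
print(rendered)
```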

## Administering a question to an LLM
We administer a question (or survey of multiple questions) by appending the `.run()` method. If no model is specified, the default model is used, which we can inspect by calling `Model()`:

In [4]:
from edsl import Model

Model()

No model name provided, using default model: gpt-4-1106-preview


LanguageModelOpenAIFour(model = 'gpt-4-1106-preview', parameters={'temperature': 0.5, 'max_tokens': 1000, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'logprobs': False, 'top_logprobs': 3})

In [5]:
results = q_search_queries.by(scenario).run()
results

Result 0
                                                      Result                                                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Attribute              ┃ Value                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ agent                  │                                    Agent Attributes                                    │
│                        │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
│                        │ ┃ Attribute               ┃ Value                                                    ┃ │
│                        │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │
│                        │ │ _name                   │ None    

This administers our parameterized question to the default LLM. We can also specify the LLMs to use by creating a `Model` for each of them with desired attributes (temperature, max_tokens, etc.):

In [6]:
from edsl import Model

Model.available()

['claude-3-haiku-20240307',
 'claude-3-opus-20240229',
 'claude-3-sonnet-20240229',
 'dbrx-instruct',
 'gemini_pro',
 'gpt-3.5-turbo',
 'gpt-4-1106-preview',
 'llama-2-13b-chat-hf',
 'llama-2-70b-chat-hf',
 'mixtral-8x7B-instruct-v0.1']

In [7]:
Model

Available models: ['claude-3-haiku-20240307', 'claude-3-opus-20240229', 'claude-3-sonnet-20240229', 'dbrx-instruct', 'gemini_pro', 'gpt-3.5-turbo', 'gpt-4-1106-preview', 'llama-2-13b-chat-hf', 'llama-2-70b-chat-hf', 'mixtral-8x7B-instruct-v0.1']

To create an instance, you can do: 
>>> m = Model('gpt-4-1106-preview', temperature=0.5, ...)

To get the default model, you can leave out the model name. 
To see the available models, you can do:
>>> Model.available()

Here we run the question using GPT-4 and inspect the results:

In [8]:
model = Model("gpt-4-1106-preview")

results = q_search_queries.by(scenario).by(model).run()

We can select the components of the `Results` object that we want to view:

In [9]:
results.columns

['agent.agent_name',
 'answer.search_queries',
 'answer.search_queries_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.search_queries_system_prompt',
 'prompt.search_queries_user_prompt',
 'raw_model_response.search_queries_raw_model_response',
 'scenario.background',
 'scenario.max_words',
 'scenario.num_keywords',
 'scenario.question']

In [10]:
(results
.select("model.model", "scenario.*", "search_queries")
.print(pretty_labels={
    "model.model": "LLM", 
    "scenario.question": "Question", 
    "scenario.background": "Background",
    "scenario.max_words": "Max Words",
    "scenario.num_keywords": "Num Keywords",
    "answer.search_queries": "Search Queries"
    })
)
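
The `"scenario.*"` argument above selects every column under the `scenario` prefix. As a rough illustration of that idea only, the sketch below expands a shell-style wildcard against a subset of the column names listed earlier, using the standard library's `fnmatchcase`. `expand_pattern` is a hypothetical helper, and edsl's own matching rules may differ.

```python
from fnmatch import fnmatchcase

# A subset of the column names shown by results.columns above
columns = [
    "agent.agent_name",
    "answer.search_queries",
    "model.model",
    "scenario.background",
    "scenario.max_words",
    "scenario.num_keywords",
    "scenario.question",
]


def expand_pattern(pattern, columns):
    """Return the columns matching a shell-style pattern, in their original order."""
    return [column for column in columns if fnmatchcase(column, pattern)]


selected = expand_pattern("scenario.*", columns)
print(selected)
```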

## Relevance ranking prompt

The prompt for the LLM to rate the relevance of an article to a forecasting question (Fig.14):

<img src="forecasting_llms_fig14.png" width="600">

We use `QuestionLinearScale` to recreate the prompt, as a numerical rating is desired. The "thoughts" will automatically be captured in the answer comment field:

In [11]:
from edsl.questions import QuestionLinearScale

q_rate_relevance = QuestionLinearScale(
    question_name = "rate_relevance",
    question_text = """Consider the following forecasting question and its background information.
    Then consider the following news article and rate its relevance with respect to the forecasting question.
    Question: {{ question }}
    Question Background: {{ background }}
    Resolution Criteria: {{ resolution_criteria }}
    Article: {{ article }}
    Please rate the relevance of the article to the forecasting question on a scale of 1 to 6
    (1 = irrelevant, 2 = slightly relevant, 3 = somewhat relevant, 4 = relevant, 5 = highly relevant, 6 = most relevant).
    You do not need to access any external sources. Just consider the information provided.
    Focus on the content of the article, not the title.
    If the text content is an error message about JavaScript, a paywall, cookies or other technical issues,
    output a score of 1. 
    """,
    question_options = [1,2,3,4,5,6]
)

We parameterize it as we did for the search query generation question above:

In [12]:
scenario = Scenario({
    "question": "Will Starship achieve liftoff before Monday, May 1st, 2023?", 
    "background": "On April 14th, SpaceX received a launch license for its Starship spacecraft. A launch scheduled for April 17th was scrubbed due to a frozen valve. SpaceX CEO Elon Musk tweeted: “Learned a lot today, now offloading propellant, retrying in a few days...", 
    "resolution_criteria": "This question resolves Yes if Starship leaves the launchpad intact and under its own power before 11:59pm ET on Sunday, April 30th.", 
    "article": """..."""
})


# Replace the placeholder article text in the scenario above before running
results = q_rate_relevance.by(scenario).by(model).run()
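
Once several articles have been rated, the ratings can be used to keep and order the most relevant ones. The sketch below assumes we have already collected one rating per article on the 1 to 6 scale from the prompt. The article identifiers and ratings are made up, `top_articles` is a hypothetical helper, and the cutoff of 4 is an illustrative choice rather than a value taken from the paper.

```python
# Hypothetical (article_id, rating) pairs on the 1-6 scale used in the prompt above
ratings = [
    ("article_a", 2),
    ("article_b", 6),
    ("article_c", 4),
    ("article_d", 1),
]


def top_articles(ratings, min_rating=4):
    """Keep articles rated at least min_rating, most relevant first."""
    kept = [(article, rating) for article, rating in ratings if rating >= min_rating]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = top_articles(ratings)
print(ranked)
```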

## Summarization prompt

The prompt for the LLM to condense the text of an article (Fig.13):

<img src="forecasting_llms_fig13.png" width="600">

We use `QuestionFreeText` to recreate the prompt, as the desired response is unstructured text:

In [13]:
from edsl.questions import QuestionFreeText

q_shorten_article = QuestionFreeText(
    question_name = "shorten_article",
    question_text = """Shorten the following article (condense it to no more than 100 words).
    Article: {{ article }}
    When doing this task, please do not remove any details that would be helpful for making considerations
    about the following forecasting question:
    Forecasting Question: {{ question }}
    Question Background: {{ background }}
    """
)

scenario = Scenario({
    "article": """...""",
    "question": "Will Starship achieve liftoff before Monday, May 1st, 2023?", 
    "background": "On April 14th, SpaceX received a launch license for its Starship spacecraft. A launch scheduled for April 17th was scrubbed due to a frozen valve. SpaceX CEO Elon Musk tweeted: “Learned a lot today, now offloading propellant, retrying in a few days...", 
})

# Replace the placeholder article text in the scenario above before running
results = q_shorten_article.by(scenario).by(model).run()

## Reasoning system
We can recreate the paper's reasoning system by creating a function that uses question responses as inputs to follow-on questions:

<img src="forecasting_llms_fig15.png" width="600">

In [14]:
from edsl.questions import QuestionFreeText, QuestionNumerical
from edsl import Agent, Model

def reasoning_system(question, persona=None, model="gpt-3.5-turbo"):

    # Create model for specified LLM
    model = Model(model)

    # Create an agent if a persona is provided
    if persona:
        agent = Agent(traits={"persona":persona})

    q1 = QuestionFreeText(
        question_name = "expand",
        question_text = """Review the following question: """ + question + """
        Rephrase and expand this question. Maintain all information in the original question."""
    )
    
    if persona:
        r1 = q1.by(agent).by(model).run()
    else:
        r1 = q1.by(model).run()

    expanded_question = r1.to_pandas()["answer.expand"][0]
    
    q2 = QuestionFreeText(
        question_name = "no",
        question_text = """Review the following question: """ + expanded_question + """
        Using your knowledge of the world and topic, as well as the information provided, 
        provide a few reasons why the answer might be no. Rate the strength of each reason."""
    )

    if persona:
        r2 = q2.by(agent).by(model).run()
    else:
        r2 = q2.by(model).run()

    no_considerations = r2.to_pandas()["answer.no"][0]

    q3 = QuestionFreeText(
        question_name = "yes",
        question_text = """Review the following question: """ + expanded_question + """
        Using your knowledge of the world and topic, as well as the information provided, 
        provide a few reasons why the answer might be yes. Rate the strength of each reason."""
    )

    if persona:
        r3 = q3.by(agent).by(model).run()
    else:
        r3 = q3.by(model).run()

    yes_considerations = r3.to_pandas()["answer.yes"][0]

    q4 = QuestionFreeText(
        question_name = "aggregate",
        question_text = """Review the following question and considerations: """ 
        + expanded_question + no_considerations + yes_considerations + """
        Now aggregate these considerations. Think like a superforecaster (e.g., Nate Silver)."""
    )

    if persona:
        r4 = q4.by(agent).by(model).run()
    else:
        r4 = q4.by(model).run()

    aggregate_considerations = r4.to_pandas()["answer.aggregate"][0]

    q5 = QuestionFreeText(
        question_name = "initial",
        question_text = """Review the following question and considerations. """ 
        + expanded_question + aggregate_considerations + 
        """ Now output an initial probability (prediction)."""
    )

    if persona:
        r5 = q5.by(agent).by(model).run()
    else:
        r5 = q5.by(model).run()

    initial_probability = r5.to_pandas()["answer.initial"][0]

    q6 = QuestionFreeText(
        question_name = "confidence",
        question_text = """Review the following question and considerations. """ 
        + expanded_question + aggregate_considerations + 
        """ When asked to provide an initial probability (prediction) you replied: """ + initial_probability + 
        """ Now evaluate whether your calculated probability is excessively confident or not confident enough. 
        Also, consider anything else that might affect the forecast that you did not before consider 
        (e.g., base rate of the event)."""
    )

    if persona:
        r6 = q6.by(agent).by(model).run()
    else:
        r6 = q6.by(model).run()

    confidence = r6.to_pandas()["answer.confidence"][0]
    
    q7 = QuestionNumerical(
        question_name = "final",
        question_text = """Review the following question and considerations. """ 
        + expanded_question + aggregate_considerations + 
        """ When asked to provide an initial probability (prediction) you replied: """ + initial_probability + 
        """ Then when asked to evaluate whether your calculated probability is excessively confident 
        or not confident enough you replied: """ + confidence +
        """ Now output your final prediction (a number between 0 and 1)."""
    )

    if persona:
        r7 = q7.by(agent).by(model).run()
    else:
        r7 = q7.by(model).run()

    final_prediction = r7.to_pandas()["answer.final"][0]

    answer = {
        "model": model,
        "original_question": question,
        "expanded_question": expanded_question,
        "no_considerations": no_considerations,
        "yes_considerations": yes_considerations,
        "aggregate_considerations": aggregate_considerations,
        "initial_probability": initial_probability,
        "confidence": confidence,
        "final_prediction": final_prediction
    }

    return answer
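
The function above repeats the same pattern at every step: build a prompt from earlier answers, ask the model, store the answer. A minimal sketch of that chaining pattern in isolation follows. It is not a replacement for the function above: `run_chain` and `fake_ask` are hypothetical names, the prompts are shortened, and `fake_ask` stands in for a real model call so that the sketch runs offline.

```python
def run_chain(steps, ask):
    """Run prompt-building steps in order, feeding earlier answers to later steps.

    steps: list of (name, build_prompt) pairs; build_prompt receives the dict of
           answers collected so far and returns the prompt text.
    ask:   callable that takes a prompt string and returns an answer string.
    """
    answers = {}
    for name, build_prompt in steps:
        answers[name] = ask(build_prompt(answers))
    return answers


question = "Will 'Barbie' gross 2x more than 'Oppenheimer' on opening weekend?"

# Shortened versions of the first three prompts in the reasoning system
steps = [
    ("expand", lambda a: f"Review the following question: {question}\nRephrase and expand this question."),
    ("no", lambda a: f"Review the following question: {a['expand']}\nProvide a few reasons why the answer might be no."),
    ("yes", lambda a: f"Review the following question: {a['expand']}\nProvide a few reasons why the answer might be yes."),
]


def fake_ask(prompt):
    """Offline stand-in for a model call: reports the prompt length."""
    return f"<answer to {len(prompt)}-character prompt>"


answers = run_chain(steps, fake_ask)
print(list(answers))
```

Passing the model call in as `ask` keeps the chaining logic independent of any particular model or agent, so the same steps could be reused with or without a persona.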

### Testing the reasoning system
Here we test the replicated reasoning system with a question from the paper (Fig.21):

In [15]:
reasoning_system("Will 'Barbie' gross 2x more than 'Oppenheimer' on opening weekend?")

{'model': LanguageModelOpenAIThreeFiveTurbo(model = 'gpt-3.5-turbo', parameters={'temperature': 0.5, 'max_tokens': 1000, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'logprobs': False, 'top_logprobs': 3}),
 'original_question': "Will 'Barbie' gross 2x more than 'Oppenheimer' on opening weekend?",
 'expanded_question': "The question is asking if 'Barbie' will earn twice as much as 'Oppenheimer' during their respective opening weekends at the box office. In other words, it is inquiring about the potential difference in gross earnings between the two movies during their initial weekend screenings.",
 'no_considerations': 'One reason why Barbie may not earn twice as much as Oppenheimer during their respective opening weekends could be due to differences in target audience. If Oppenheimer appeals to a wider demographic or has more buzz surrounding it, it could potentially outperform Barbie at the box office. Strength: Moderate.',
 'yes_considerations': "One reason why the answ

We can see how the response changes with a different LLM and an agent persona:

In [16]:
question = "Will 'Barbie' gross 2x more than 'Oppenheimer' on opening weekend?"
persona = "You are an expert box office analyst."
model = "gpt-4-1106-preview"

reasoning_system(question=question, persona=persona, model=model)

{'model': LanguageModelOpenAIFour(model = 'gpt-4-1106-preview', parameters={'temperature': 0.5, 'max_tokens': 1000, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'logprobs': False, 'top_logprobs': 3}),
 'original_question': "Will 'Barbie' gross 2x more than 'Oppenheimer' on opening weekend?",
 'expanded_question': "Is it anticipated that the opening weekend box office revenue for 'Barbie' will be double that of 'Oppenheimer'?",
 'no_considerations': "Predicting box office performance involves considering several factors such as genre appeal, target audience, marketing campaigns, release dates, and competition from other films. 'Barbie', likely targeting a younger and more family-oriented audience, may not necessarily double the opening weekend revenue of 'Oppenheimer', which may appeal to a different demographic interested in historical or biographical dramas. Moreover, 'Oppenheimer' could attract audiences due to its potential for critical acclaim, given the subject matte

## Reproducing checks for post-training data leakage
Here we reproduce some questions used for checking LLM data leakage mentioned in the paper and try some others, with mixed results:

<img src="testing_leakage.png" width="600">

In [17]:
from edsl import Survey
from edsl.questions import QuestionFreeText, QuestionYesNo

q_cutoff_date = QuestionFreeText(
    question_name = "cutoff_date",
    question_text = "What is your training data cutoff date?"
)

q_last_news = QuestionFreeText(
    question_name = "last_news",
    question_text = "What is the date of the most recent news you know about?"
)

q_dec2020 = QuestionYesNo(
    question_name = "dec20",
    question_text = "Do you know about events that occurred in December 2020?"
)

q_mar23 = QuestionYesNo(
    question_name = "mar23",
    question_text = "Do you know about events that occurred in March 2023?"
)

q_turkish_election = QuestionFreeText(
    question_name = "turkish_election",
    question_text = "Who won the 2023 Turkish presidential election?"
)

q_world_cup = QuestionFreeText(
    question_name = "world_cup",
    question_text = "Who won the 2022 World Cup?"
)

test_survey = Survey(questions=[q_cutoff_date, q_last_news, q_dec2020, q_mar23, q_turkish_election, q_world_cup])

test_results = test_survey.run()

In [18]:
test_results.select("model.model", "cutoff_date", "last_news", "dec20", "mar23", "turkish_election","world_cup").print()
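
Because a model's self-reported cutoff may be unreliable, it can also help to compare a question's resolution date directly against a known cutoff date. The sketch below shows that comparison for the Starship question used earlier. `resolves_after_cutoff` is a hypothetical helper, and both cutoff dates are placeholders chosen only to show the two possible outcomes, not the actual cutoffs of any model.

```python
from datetime import date


def resolves_after_cutoff(resolution_date, cutoff_date):
    """True if the question resolves strictly after the training data cutoff."""
    return resolution_date > cutoff_date


# The Starship question above resolves at the end of April 30th, 2023
starship_resolution = date(2023, 4, 30)

# Placeholder cutoff dates, for illustration only
early_cutoff = date(2022, 12, 31)
late_cutoff = date(2023, 12, 31)

print(resolves_after_cutoff(starship_resolution, early_cutoff))
print(resolves_after_cutoff(starship_resolution, late_cutoff))
```

A question that resolves on or before the cutoff is treated as potentially leaked, which is the conservative choice.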

---
<p style="font-size: 14px;">Copyright © 2024 Expected Parrot, Inc. All rights reserved.   <a href="https://www.expectedparrot.com" style="color:#130061">www.expectedparrot.com</a></p>