# Using PDFs in a survey
This notebook provides sample [EDSL](https://docs.expectedparrot.com/) code demonstrating the `from_pdf()` method, which imports a PDF and automatically creates a `Scenario` object for each page so that the page text can be used as a parameter of survey questions. This is helpful when using EDSL to extract qualitative information from a large text efficiently.

EDSL is an open-source library for simulating surveys and experiments with AI agents and large language models. Please see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started.

## How it works
EDSL comes with a [variety of question types](https://docs.expectedparrot.com/en/latest/questions.html) that we can select from based on the desired form of the response (multiple choice, free text, etc.). We can also parameterize questions with textual content in order to ask questions about it. We do this by creating a `{{ placeholder }}` in a question text, e.g., *What are the key themes of this text: {{ text }}*, and then creating `Scenario` objects for the content to be inserted in the placeholder when we run the survey. This allows us to administer multiple versions of a question with different inputs all at once. A common use case for this is performing [data labeling tasks](https://docs.expectedparrot.com/en/latest/notebooks/data_labeling_example.html) designed as questions about one or more pieces of textual data that can be inserted into the survey question texts. [Learn more about using scenarios](https://docs.expectedparrot.com/en/latest/scenarios.html).
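
The idea of parameterizing a question can be illustrated without EDSL itself. The following is a minimal sketch using only the Python standard library; the `render` helper and the sample passages are hypothetical and are not part of EDSL, which handles this substitution internally when a survey is run with scenarios.

```python
import re

# Matches a placeholder such as "{{ text }}" and captures the name inside it.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(question_text: str, scenario: dict) -> str:
    """Fill each {{ name }} placeholder with the matching scenario value."""
    return PLACEHOLDER.sub(lambda m: str(scenario[m.group(1)]), question_text)


question_text = "What are the key themes of this text: {{ text }}"

# Hypothetical scenario data: one dictionary per version of the question.
scenarios = [
    {"text": "First passage to analyze."},
    {"text": "Second passage to analyze."},
]

# One rendered prompt per scenario: the same question with different inputs.
prompts = [render(question_text, s) for s in scenarios]
for prompt in prompts:
    print(prompt)
```

Each scenario yields its own version of the question, which is what lets a single survey be administered over many inputs at once.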

## Example
For demonstration purposes we use a PDF copy of the recent paper [Automated Social Science: Language Models as Scientist and Subjects](https://arxiv.org/pdf/2404.11794) and conduct a survey consisting of several questions about the contents of its first page:

<img src="automated_social_science_paper.png" width="300px">

First we post the PDF to the Coop file store:

In [1]:
from edsl.scenarios.FileStore import PDFFileStore

In [2]:
ass_pdf = PDFFileStore("automated_social_scientist.pdf")
info = ass_pdf.push()
print(info)

{'description': 'File: automated_social_scientist.pdf', 'object_type': 'scenario', 'url': 'https://www.expectedparrot.com/content/21f84703-03ea-43be-9f93-28fee30edd87', 'uuid': '21f84703-03ea-43be-9f93-28fee30edd87', 'version': '0.1.39.dev1', 'visibility': 'unlisted'}


Now that we have stored it at the Coop we can re-import it:

In [3]:
ass_pdf = PDFFileStore.pull(info["uuid"])

Here we create a survey of questions that we will administer for each page of the PDF. Note that the `from_pdf()` method requires that the scenario placeholders be `{{ text }}` (for regular scenario objects, you can use any placeholder word that you like):

In [4]:
from edsl import QuestionFreeText, QuestionList, ScenarioList, Survey

In [5]:
q_summary = QuestionFreeText(
    question_name="summary",
    question_text="Briefly summarize the abstract of this paper: {{ text }}",
)

q_authors = QuestionList(
    question_name="authors",
    question_text="List the names of all the authors of the following paper: {{ text }}",
)

q_thanks = QuestionList(
    question_name="thanks",
    question_text="List the names of the people thanked in the following paper: {{ text }}",
)

survey = Survey([q_summary, q_authors, q_thanks])
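
Because `from_pdf()` expects the placeholder to be named `text`, it can be worth checking question texts before running a survey. The check below is a minimal standard-library sketch, not an EDSL feature; the `placeholder_names` helper is hypothetical, and the question texts are copied from the cell above.

```python
import re

# Captures the name inside each "{{ name }}" placeholder.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholder_names(question_text: str) -> set:
    """Return the set of placeholder names used in a question text."""
    return set(PLACEHOLDER.findall(question_text))


question_texts = {
    "summary": "Briefly summarize the abstract of this paper: {{ text }}",
    "authors": "List the names of all the authors of the following paper: {{ text }}",
    "thanks": "List the names of the people thanked in the following paper: {{ text }}",
}

# Names of questions whose placeholders are anything other than exactly {"text"}.
mismatched = [
    name
    for name, text in question_texts.items()
    if placeholder_names(text) != {"text"}
]
print(mismatched)
```

An empty list means every question uses only the `{{ text }}` placeholder.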

Next we create a `ScenarioList` for the PDF using the `from_pdf()` method, which automatically creates one `Scenario` object per page of the PDF, each holding the page text that will be inserted in our questions (in our example, we will use just the first page of the paper):

In [6]:
automated_social_scientist = ScenarioList.from_pdf(ass_pdf.to_tempfile())

Alternatively, we can import a local file directly:

In [7]:
# automated_social_scientist = ScenarioList.from_pdf("automated_social_scientist.pdf")
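
To make the shape of the resulting scenarios concrete, here is a minimal sketch of the one-record-per-page structure, inferred from the `filename`, `page` and `text` fields displayed in the next cell. The `pages_to_records` helper and the placeholder page strings are hypothetical; this is not how EDSL extracts text from a PDF.

```python
def pages_to_records(filename: str, page_texts: list) -> list:
    """Build one record per page, numbering pages from 1."""
    return [
        {"filename": filename, "page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


# Hypothetical stand-ins for the text extracted from each page.
page_texts = ["Title page and abstract", "1 Introduction"]

records = pages_to_records("automated_social_scientist.pdf", page_texts)
for record in records:
    print(record["page"], record["text"])
```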

We can inspect the scenarios:

In [8]:
automated_social_scientist[0:2]

filename,page,text
tmpop_bxq2r.pdf,1,"Automated Social Science: Language Models as Scientist and Subjects∗ Benjamin S. Manning† MIT Kehang Zhu† Harvard John J. Horton MIT & NBER April 26, 2024 Abstract We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent ad- vances in large language models (LLM), but the key feature of the approach is the use of structural causal models. Structural causal models provide a lan- guage to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a nego- tiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell. ∗Thanks to generous support from Drew Houston and his AI for Augmentation and Productivity seed grant. Thanks to Jordan Ellenberg, Benjamin Lira Luttges, David Holtz, Bruce Sacerdote, Paul R¨ottger, Mohammed Alsobay, Ray Duch, Matt Schwartz, David Autor, and Dean Eckles for their helpful feedback. 
Author’s contact information, code, and data are currently or will be available at http://www.benjaminmanning.io/. †Both authors contributed equally to this work. 1 arXiv:2404.11794v2 [econ.GN] 25 Apr 2024"
tmpop_bxq2r.pdf,2,"1 Introduction There is much work on efficiently estimating econometric models of human behavior but comparatively little work on efficiently generating and testing those models to estimate. Previously, developing such models and hypotheses to test was exclusively a human task. This is changing as researchers have begun to explore automated hypothesis generation through the use of machine learning.1 But even with novel machine-generated hypotheses, there is still the problem of testing. A potential solution is simulation. Researchers have shown that Large Language Models (LLM) can simulate humans as experimental subjects with surprising degrees of realism.2 To the extent that these simulation results carry over to human subjects in out-of- sample tasks, they provide another option for testing (Horton, 2023). In this paper, we combine these ideas—automated hypothesis generation and automated in silico hypothesis testing—by using LLMs for both purposes. We demonstrate that such automation is possible. We evaluate the approach by comparing results to a setting where the real-world predictions are well known and test to see if an LLM can be used to generate information that it cannot access through direct elicitation. The key innovation in our approach is the use of structural causal models to orga- nize the research process. Structural causal models are mathematical representations of cause and effect (Pearl, 2009b; Wright, 1934) and have long offered a language for expressing hypotheses.3 What is novel in our paper is the use of these models as a blueprint for the design of agents and experiments. 
In short, each explanatory variable describes something about a person or scenario that has to vary for the effect to be identified, so the system “knows” it needs to generate agents or scenarios that 1A few examples include generative adversarial networks to formulate new hypotheses (Ludwig and Mullainathan, 2023), algorithms to find anomalies in formal theories (Mullainathan and Ram- bachan, 2023), reinforcement learning to propose tax policies (Zheng et al., 2022), random forests to identify heterogenous treatment effects (Wager and Athey, 2018), and several others (Buyalskaya et al., 2023; Cai et al., 2023; Enke and Shubatt, 2023; Girotra et al., 2023; Peterson et al., 2021). 2(Aher et al., 2023; Argyle et al., 2023; Bakker et al., 2022; Binz and Schulz, 2023b; Brand et al., 2023; Bubeck et al., 2023; Fish et al., 2023; Mei et al., 2024; Park et al., 2023) 3In an unfortunate clash of naming conventions, some disciplines have alternative definitions for the term “structural” when discussing formal models. Here, structural does not refer to the definition traditionally used in economics. See Appendix B for a more detailed explanation. 2"


We can select pages to use if we do not want to use all of them -- e.g., here we filter just the first page to use with our survey:

In [9]:
automated_social_scientist = automated_social_scientist.filter("page == 1")
automated_social_scientist

filename,page,text
tmpop_bxq2r.pdf,1,"Automated Social Science: Language Models as Scientist and Subjects∗ Benjamin S. Manning† MIT Kehang Zhu† Harvard John J. Horton MIT & NBER April 26, 2024 Abstract We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent ad- vances in large language models (LLM), but the key feature of the approach is the use of structural causal models. Structural causal models provide a lan- guage to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a nego- tiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell. ∗Thanks to generous support from Drew Houston and his AI for Augmentation and Productivity seed grant. Thanks to Jordan Ellenberg, Benjamin Lira Luttges, David Holtz, Bruce Sacerdote, Paul R¨ottger, Mohammed Alsobay, Ray Duch, Matt Schwartz, David Autor, and Dean Eckles for their helpful feedback. 
Author’s contact information, code, and data are currently or will be available at http://www.benjaminmanning.io/. †Both authors contributed equally to this work. 1 arXiv:2404.11794v2 [econ.GN] 25 Apr 2024"
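
The `"page == 1"` filter keeps only the scenarios whose `page` field satisfies the condition. A rough plain-Python equivalent is sketched below, assuming each scenario behaves like a dictionary with a `page` key; the `filter_pages` helper and the sample records are hypothetical.

```python
# Hypothetical page records standing in for the scenarios shown above.
records = [
    {"filename": "paper.pdf", "page": 1, "text": "Title page and abstract"},
    {"filename": "paper.pdf", "page": 2, "text": "1 Introduction"},
    {"filename": "paper.pdf", "page": 3, "text": "2 Overview"},
]


def filter_pages(records: list, keep) -> list:
    """Keep the records whose page number satisfies the `keep` predicate."""
    return [record for record in records if keep(record["page"])]


first_page = filter_pages(records, lambda page: page == 1)
first_two = filter_pages(records, lambda page: page <= 2)

print([r["page"] for r in first_page])
print([r["page"] for r in first_two])
```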


Now we can add the list of scenarios to the survey and run it:

In [10]:
results = survey.by(automated_social_scientist).run()

We can see a list of all the components of results that are directly accessible:

In [11]:
results.columns

agent.agent_index
agent.agent_instruction
agent.agent_name
answer.authors
answer.summary
answer.thanks
comment.authors_comment
comment.summary_comment
comment.thanks_comment
generated_tokens.authors_generated_tokens


We can select components of the results to inspect and print:

In [12]:
results.select("summary", "authors", "thanks")

answer.summary,answer.authors,answer.thanks
"The paper introduces a method for automating the generation and testing of social science hypotheses using large language models (LLMs) and structural causal models. These models help articulate hypotheses, design experiments, and analyze data. The approach is demonstrated through scenarios like negotiations and auctions, where causal relationships are proposed and tested. The study shows that while LLMs can predict the direction of effects, they struggle with accurately predicting effect sizes unless conditioned on a fitted structural causal model. The findings suggest that LLMs possess implicit knowledge that can be better accessed through structured causal models.","['Benjamin S. Manning', 'Kehang Zhu', 'John J. Horton']","['Drew Houston', 'Jordan Ellenberg', 'Benjamin Lira Luttges', 'David Holtz', 'Bruce Sacerdote', 'Paul Röttger', 'Mohammed Alsobay', 'Ray Duch', 'Matt Schwartz', 'David Autor', 'Dean Eckles']"

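The list-type answers are displayed as Python-style list literals. If the selected results were saved as CSV text in the shape shown above, they could be read back into Python lists with the standard library alone. This is a sketch under that assumption: the `displayed` string is an abbreviated copy of the output (the summary is shortened and the list of people thanked is truncated), and no EDSL export method is used.

```python
import ast
import csv
import io

# Abbreviated copy of the displayed output; the summary text is shortened
# and the "thanks" list is truncated for brevity.
displayed = """\
answer.summary,answer.authors,answer.thanks
"A shortened summary.","['Benjamin S. Manning', 'Kehang Zhu', 'John J. Horton']","['Drew Houston', 'Jordan Ellenberg']"
"""

# Read the single data row as a dictionary keyed by column name.
row = next(csv.DictReader(io.StringIO(displayed)))

# literal_eval safely converts the list-like strings into real Python lists.
authors = ast.literal_eval(row["answer.authors"])
thanks = ast.literal_eval(row["answer.thanks"])

print(authors)
print(len(thanks))
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer for text produced by a model.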

## Posting to the Coop
The [Coop](https://www.expectedparrot.com/content/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we demonstrate how to post this notebook:

In [13]:
from edsl import Notebook

In [14]:
n = Notebook(path="scenario_from_pdf.ipynb")

In [15]:
n.push(description="Example code for generating scenarios from PDFs", visibility="public")

{'description': 'Example code for generating scenarios from PDFs',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/37f2034d-6c00-4f29-be23-3b31d3e93e47',
 'uuid': '37f2034d-6c00-4f29-be23-3b31d3e93e47',
 'version': '0.1.39.dev1',
 'visibility': 'public'}