# Importing PDFs
This notebook provides sample [EDSL](https://docs.expectedparrot.com/) code demonstrating the `from_pdf()` method, which imports a PDF and automatically creates a `Scenario` object for each page to use as a parameter of survey questions. This is helpful when using EDSL to efficiently extract qualitative information from a long document.

EDSL is an open-source library for simulating surveys and experiments with AI agents and large language models. Please see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started.

## How it works
EDSL comes with a [variety of question types](https://docs.expectedparrot.com/en/latest/questions.html) that we can select from based on the desired form of the response (multiple choice, free text, etc.). We can also parameterize questions with textual content in order to ask questions about it. We do this by creating a `{{ placeholder }}` in a question text, e.g., *What are the key themes of this text: {{ text }}*, and then creating `Scenario` objects for the content to be inserted in the placeholder when we run the survey. This allows us to administer multiple versions of a question with different inputs all at once. A common use case for this is performing [data labeling tasks](https://docs.expectedparrot.com/en/latest/notebooks/data_labeling_example.html) designed as questions about one or more pieces of textual data that can be inserted into the survey question texts. [Learn more about using scenarios](https://docs.expectedparrot.com/en/latest/scenarios.html).
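As a rough illustration of the placeholder idea (not EDSL's own rendering code), the sketch below fills a `{{ text }}` placeholder from plain dictionaries standing in for `Scenario` objects. The `render` helper and the sample passages are hypothetical and exist only for this example.

```python
import re


def render(question_text: str, scenario: dict) -> str:
    """Fill each {{ key }} placeholder with the matching scenario value."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched rather than raising an error.
        return str(scenario.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, question_text)


question_text = "What are the key themes of this text: {{ text }}"

# Plain dicts standing in for Scenario objects (hypothetical sample content).
scenarios = [
    {"text": "First passage."},
    {"text": "Second passage."},
]

# One rendered question per scenario.
prompts = [render(question_text, s) for s in scenarios]
for prompt in prompts:
    print(prompt)
```

Each scenario yields its own version of the question, which is what lets a single question definition cover many inputs.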

## Example
For demonstration, we use a PDF copy of the first page of the recent paper [Automated Social Science: Language Models as Scientist and Subjects](https://arxiv.org/pdf/2404.11794) and conduct a survey of several questions about its contents:

<img src="automated_social_science_paper.png" width="300px">

Importing the tools:

In [1]:
# pip install edsl

In [2]:
from edsl.questions import QuestionFreeText, QuestionList
from edsl import ScenarioList, Survey

Here we create a survey of questions that we will administer for each page of the PDF. Note that the `from_pdf()` method requires that the scenario placeholders be `{{ text }}` (for regular scenario objects, you can use any placeholder word that you like):

In [3]:
q_summary = QuestionFreeText(
    question_name = "summary",
    question_text = "Briefly summarize the abstract of this paper: {{ text }}"
)

q_authors = QuestionList(
    question_name = "authors",
    question_text = "List the names of all the authors of the following paper: {{ text }}"
)

q_thanks = QuestionList(
    question_name = "thanks",
    question_text = "List the names of the people thanked in the following paper: {{ text }}"
)

survey = Survey([q_summary, q_authors, q_thanks])

Next we create a `ScenarioList` for the PDF using the `from_pdf()` method. It automatically creates one `Scenario` object per page, and the text of each page is inserted into our questions (in our example, there is only one scenario because the PDF contains just the first page of the paper):

In [4]:
automated_social_scientist = ScenarioList.from_pdf("automated_social_scientist.pdf")
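A rough mental model of the result: each page becomes one record with a `text` field (matching the required `{{ text }}` placeholder) and a `page` field (selected later in this notebook). The sketch below uses plain dictionaries in place of `Scenario` objects. The `pages_to_scenarios` helper, the sample strings, and the choice to number pages from 1 are all assumptions made for illustration, and the actual PDF text extraction is not shown.

```python
def pages_to_scenarios(page_texts):
    """Turn a list of page strings into one dict per page."""
    # Numbering pages from 1 is an assumption of this sketch.
    return [
        {"page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


# Hypothetical page contents standing in for text extracted from a PDF.
page_texts = ["Title, authors and abstract ...", "Introduction ..."]

scenarios = pages_to_scenarios(page_texts)
print(len(scenarios))        # one scenario per page
print(scenarios[0]["text"])  # the text that would replace {{ text }}
```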

Now we can add the list of scenarios to the survey and run it:

In [5]:
results = survey.by(automated_social_scientist).run()
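To get a feel for how much work this call implies, it can help to count question/scenario pairs. The sketch below assumes each question is rendered once per scenario, which is how the placeholder mechanism described earlier reads. How EDSL batches or caches the underlying model calls is not covered in this notebook, and the page record is a hypothetical stand-in.

```python
from itertools import product

question_names = ["summary", "authors", "thanks"]

# Hypothetical stand-in for the single-page PDF used in this example.
pages = [{"page": 1, "text": "First page text ..."}]

# One rendered prompt per (page, question) pair.
pairs = [(page["page"], name) for page, name in product(pages, question_names)]

print(len(pairs))
for page_number, name in pairs:
    print(page_number, name)
```

With a single page and three questions this gives three pairs; the count grows linearly with the number of pages.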

We can see a list of all the components of results that are directly accessible:

In [6]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'answer.authors',
 'answer.summary',
 'answer.thanks',
 'comment.authors_comment',
 'comment.thanks_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.authors_system_prompt',
 'prompt.authors_user_prompt',
 'prompt.summary_system_prompt',
 'prompt.summary_user_prompt',
 'prompt.thanks_system_prompt',
 'prompt.thanks_user_prompt',
 'question_options.authors_question_options',
 'question_options.summary_question_options',
 'question_options.thanks_question_options',
 'question_text.authors_question_text',
 'question_text.summary_question_text',
 'question_text.thanks_question_text',
 'question_type.authors_question_type',
 'question_type.summary_question_type',
 'question_type.thanks_question_type',
 'raw_model_response.authors_raw_model_response',
 'raw_model_response.summary_

We can select components of the results to inspect and print:

In [7]:
results.select("summary", "authors", "thanks").print(format="rich")

## Another example
Let's try another example, this time using the complete paper [Owning, Using and Renting: Some Simple Economics of the "Sharing Economy"](https://john-joseph-horton.com/papers/sharing.pdf).

Here we import it and verify that all the pages have been turned into scenarios:

In [8]:
sharing_economy = ScenarioList.from_pdf("sharing_economy.pdf")

In [9]:
len(sharing_economy)

56

In [10]:
sharing_economy[0:2]

Let's see which pages are the most important. We start by generating a summary of the paper from its abstract, using only the first scenario (the first page of the paper). We also create an agent with a relevant persona for the model to adopt when answering the questions ([learn more about creating AI agents to answer survey questions](https://docs.expectedparrot.com/en/latest/agents.html)):

In [11]:
from edsl.questions import QuestionFreeText, QuestionList, QuestionLinearScale
from edsl import Agent, Survey

In [12]:
social_scientist_agent = Agent({"persona":"You are an experienced social scientist."},
                               instruction = "You are evaluating the contents of a research paper.")

In [13]:
q_summary = QuestionFreeText(
    question_name = "summary",
    question_text = "Draft a summary of the paper based on the abstract: {{ text }}"
)

q_authors = QuestionList(
    question_name = "authors",
    question_text = "List the authors of this paper: {{ text }}"
)

survey = Survey([q_summary, q_authors])

In [14]:
results = survey.by(sharing_economy[0]).by(social_scientist_agent).run()

results.select("summary", "authors").print(format="rich")

We'll use the summary as context for a new set of questions prompting the agent to identify the most important idea on each page, then select and summarize the most important ideas, and rate the relative importance of each page of the paper:

In [15]:
summary = results.select("summary").first()

In [16]:
q_idea = QuestionFreeText(
    question_name = "idea",
    question_text = "Paper summary: " + summary + 
    " Quote the most important sentence on this page: {{ text }}"
)
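The cell above builds the question text by string concatenation, which bakes the summary in at definition time while leaving `{{ text }}` as a literal placeholder for each page. The sketch below shows why that choice matters in plain Python: inside an f-string, doubled braces are an escape sequence, so `{{ text }}` would collapse to `{ text }` unless the braces are quadrupled. The `summary` value here is a hypothetical stand-in, not model output.

```python
summary = "A short stand-in summary."  # hypothetical value, not model output

# Concatenation: the summary is baked in and the placeholder survives.
concatenated = (
    "Paper summary: " + summary
    + " Quote the most important sentence on this page: {{ text }}"
)

# f-string: doubled braces are an escape, so the placeholder collapses.
collapsed = f"Paper summary: {summary} Quote: {{ text }}"

# f-string with quadrupled braces keeps a literal {{ text }}.
escaped = f"Paper summary: {summary} Quote: {{{{ text }}}}"

print(concatenated)
print(collapsed)
print(escaped)
```

An f-string is fine when the text contains no scenario placeholder; when it does, concatenation or quadrupled braces keeps the placeholder intact.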

In [17]:
ideas = q_idea.by(sharing_economy).by(social_scientist_agent).run().select("idea").to_list()
ideas[0:10]

['We find that “sharing economy” markets always expand consumption and increase surplus, but may increase or decrease ownership. Regardless, ownership is decoupled from individual preferences in the long-run, as the rental rates and the purchase prices of goods become equal.',
 'The goal of this paper is to provide answers to these questions.',
 'Our first major question is why P2P rental markets only became a force in the 21st century, despite the fact that the economic problem these markets are able to solve—under-utilization of durable goods—is hardly new.',
 'While ownership may increase or decrease in the long-run, the option of renting out an owned good makes ownership more valuable. As such, a P2P rental market can have a market-expanding effect, in the sense that it allows a previously infeasible product market to emerge.',
 'owners with lower valuations are the biggest beneﬁciaries, as they consume the good less of the time, and hence they have more excess capacity to rent. Si

In [18]:
q_important = QuestionFreeText(
    question_name = "important",
    question_text = f"""Paper summary: {summary}
    Consider the following ideas that are mentioned in the paper and 
    summarize the 5 most important of them: {ideas}."""
)

In [19]:
results = q_important.by(social_scientist_agent).run()
results.select("important").print(format="rich")

In [20]:
important_ideas = results.select("important").first()

In [21]:
q_relative = QuestionLinearScale(
    question_name = "relative",
    question_text = "Consider the following paper summary: " + important_ideas +
    " What is the relative importance of this page of the paper: {{ text }}",
    question_options = [0,1,2,3,4,5],
    option_labels = {0:"Unimportant", 3:"Important", 5:"Most important"}
)

In [22]:
results = q_relative.by(sharing_economy).by(social_scientist_agent).run()

We can filter and sort pages based on the responses, and inspect the agent's comments on its answers:

In [23]:
(results
 .sort_by("page")
 .filter("relative == '5'")
 .select("page", "relative", "relative_comment")
 .print(format="rich")
)

The text of the pages that received the top rating:

In [24]:
(results
 .sort_by("page")
 .filter("relative == '5'")
 .select("page", "text")
 .print(format="rich")
)
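The filter-then-sort pattern used in the two cells above can be pictured with plain Python. This sketch is only an analogy: the rows are hypothetical, and the ratings are stored as integers here, which is an assumption (the notebook's filter expression compares against the string `'5'`).

```python
# Hypothetical per-page ratings standing in for the results table.
rows = [
    {"page": 3, "relative": 5, "relative_comment": "States the main result."},
    {"page": 2, "relative": 3, "relative_comment": "Background material."},
    {"page": 1, "relative": 5, "relative_comment": "Abstract and key claims."},
]

# Keep only the top-rated pages, then order them by page number.
top_rated = sorted(
    (row for row in rows if row["relative"] == 5),
    key=lambda row: row["page"],
)

for row in top_rated:
    print(row["page"], row["relative"], row["relative_comment"])
```

Because the filter only drops rows, sorting before or after it yields the same ordering of the remaining pages.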

Please see our [documentation page](https://docs.expectedparrot.com/) for examples of other survey methods and use cases!