In [28]:
# A small proof of concept for parsing papers into a schema
# Crazy new tech. I see prompt engineering as a kind of #dataprep. 

from langchain.document_loaders import PyPDFLoader

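# read the paper and split it into roughly page-sized chunks
# (load_and_split() may split long pages further with its default text splitter)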
loader = PyPDFLoader("zoog.pdf")
pages = loader.load_and_split()

In [62]:
from langchain import PromptTemplate

qalmri_template = """

As a language model, your task is to classify sections of text within a research paper according to the QALMRI schema. 
The QALMRI schema is a framework designed to help readers make sense of research papers by breaking them down into their key components. 
The acronym QALMRI stands for Question, Approach, Logic, Method, Results, and Inference. Here's a brief overview of each component:

Question (Q): The research question or problem that the study aims to address. It is typically stated in the introduction section of the paper and may be a hypothesis or a set of related questions. It may need to be inferred from the text when it is not explicitly stated in the section.
Approach (A): The strategy or theoretical framework the researchers use to address the research question. This may include the perspective or lens through which they view the problem and the assumptions they make.
Logic (L): The rationale or reasoning behind the research design and the approach taken. This includes the researchers' explanation of why their approach is suitable for addressing the research question and how it relates to existing literature.
Method (M): The specific procedures and techniques used to collect and analyze data in the study. This may include the study design, sample selection, data collection methods, and data analysis techniques.
Results (R): The findings or outcomes of the study, often presented in the form of data, tables, or figures. This section should provide a clear and concise summary of the results, highlighting the most important findings.
Inference (I): The conclusions and interpretations drawn from the results. This section should discuss the implications of the findings, address any limitations of the study, and suggest areas for future research.

For example, after reading a section of a research paper, your classification might look like this, though you must not restate this information in your response:

Question (Q): "What is the impact of social media on adolescents' mental health?" 
Approach (A): Researchers used a longitudinal study to examine the relationship between social media use and mental health outcomes in adolescents. 
Logic (L): The study builds upon previous research that suggests a correlation between social media use and mental health issues, and aims to provide further evidence to support or refute this relationship. 
Method (M): The study included a sample of 500 adolescents, who were surveyed on their social media use and mental health symptoms at three time points over a two-year period. 
Results (R): Results indicate a significant positive correlation between social media use and symptoms of depression and anxiety in adolescents.
Inference (I): The findings suggest that increased social media use may contribute to poorer mental health outcomes in adolescents, highlighting the need for further research and intervention strategies. 

To complete this classification task, carefully read the section below and infer which component it belongs to. If it is unclear which component the section belongs to, you may choose to classify it as "Other" but not "None".
Make sure to provide a brief summary or quote for each component. Be thorough and precise in your classification, as this will help readers better understand and engage with the research paper.

{section}

\n
"""

qalmri_prompt = PromptTemplate(
    input_variables=["section"],
    template=qalmri_template
)
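
As a quick sanity check of what the model will actually receive, the substitution that the template performs can be approximated with plain `str.format`. This is only a sketch: `demo_template`, `demo_section`, and `fill` are hypothetical stand-ins rather than the notebook's own prompt, and it assumes the template contains no curly braces other than the `{section}` placeholder.

```python
# Hypothetical short stand-in for qalmri_template (assumption, for illustration only).
demo_template = "Classify the section below using the QALMRI schema.\n\n{section}\n"
# Hypothetical section text (assumption).
demo_section = "We surveyed 500 adolescents at three time points."


def fill(template: str, section: str) -> str:
    """Substitute the section text into the single {section} placeholder."""
    return template.format(section=section)


filled = fill(demo_template, demo_section)
print(filled)
```

Printing the filled prompt for one chunk before running the whole paper makes it easy to spot a stray placeholder or an oversized section.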

In [67]:
from langchain.llms import OpenAI
from langchain.chains import LLMChain
import asyncio
import time

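async def async_generate(chain, section):
    # Classify one chunk; print it as soon as it completes and return it.
    resp = await chain.arun(section=section)
    print(resp)
    return resp

async def generate_concurrently():
    llm = OpenAI()
    chain = LLMChain(llm=llm, prompt=qalmri_prompt)

    # One task per chunk; gather() returns results in input order,
    # even though the prints above appear in completion order.
    tasks = [async_generate(chain, page.page_content) for page in pages]
    return await asyncio.gather(*tasks)

s = time.perf_counter()
resps = await generate_concurrently()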
elapsed = time.perf_counter() - s
print('\033[1m' + f"Concurrent executed in {elapsed:0.2f} seconds." + '\033[0m')

Other: This section of the paper discusses previous research that has used satellite observations of ozone and related species to monitor and attribute background surface ozone.
Other: This section provides a list of references to previous research conducted on the topic of ozone and air quality, as well as to the models and instruments used in the current study.
Other: This section of the paper provides a list of citations for articles relevant to the study, including details on authors, titles, journal, and dates of publication. These citations are used to support the researchers' approach and provide evidence for their claims.
Other: This section provides an overview of the limitations of the study and the potential implications of the results. It also acknowledges the support of the NASA Earth Science Division and a NASA Earth and Space Science Fellowship for the research.
Other: This section provides citations for various studies related to the research topic, including details su
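
The printed responses follow a `Label: summary` pattern, so a tally of labels gives a quick view of how the chunks were classified. A minimal sketch, assuming each response starts with its label followed by a colon; `sample_resps` is made-up stand-in data rather than the output above, and `split_label` is a hypothetical helper.

```python
from collections import Counter

# Made-up stand-in responses (assumption); in the notebook these would come from `resps`.
sample_resps = [
    "Other: This section lists references.",
    "Method (M): The study describes its data collection procedure.",
    "Other: This section acknowledges funding.",
]


def split_label(resp: str) -> tuple[str, str]:
    """Split 'Label: summary' at the first colon; text without a colon is filed under 'Unparsed'."""
    label, sep, summary = resp.strip().partition(":")
    if not sep:
        return "Unparsed", resp.strip()
    return label.strip(), summary.strip()


# Count how often each label occurs across the responses.
label_counts = Counter(split_label(r)[0] for r in sample_resps)
print(label_counts.most_common())
```

A tally dominated by "Other", as in the run above, suggests that many chunks are reference lists or acknowledgements rather than body text.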