# Set up

## Imports

In [1]:
import pandas as pd
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
import time
from getpass import getpass


## Load Excel

Load the source Excel file, which contains each article's title, rank, publication date, abstract and link to the full text. See the dataframe preview below for the structure of the spreadsheet.

In [2]:
df = pd.read_excel('Articles_for_screening.xlsx')

## Add new columns

Add four new columns to the dataframe. The extracted data will be stored in these columns; for now, they are empty.

**AI_ML** - Whether the described system uses AI/ML or not

**NLP** - Whether the described system uses NLP or not

**AI_ML_source** - source/reference text for why the described system uses/doesn't use AI or ML

**NLP_source** - source/reference text for why the described system uses/doesn't use NLP.

**NOTE: If you want to add new parameters to screen for, create columns for them here.**



In [3]:
df['AI_ML'] = None
df['NLP'] = None
df['AI_ML_source'] = None
df['NLP_source'] = None
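
If more parameters are screened later, the column names can be generated from a list of parameter names rather than written out one by one. The following is a minimal sketch of that idea; the helper `screening_columns` and the extra parameter `RCT` are hypothetical and are not part of the original notebook.

```python
def screening_columns(parameters):
    """Return the answer and source column names for each screening parameter."""
    columns = []
    for name in parameters:
        columns.append(name)              # YES/NO answer column
        columns.append(f"{name}_source")  # supporting source text column
    return columns


# "RCT" is a hypothetical extra parameter used only for illustration.
new_columns = screening_columns(["AI_ML", "NLP", "RCT"])
print(new_columns)
```

Each generated name can then be initialised in a loop with `df[col] = None`, which has the same effect as the assignments in the cell above.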

In [4]:
df.head()

Unnamed: 0,Rank,FullTextLink,FullText,Title,Link,Abstract,FullAbstract,PublicationDate,novel,AI_ML,NLP,novel_source,AI_ML_source,NLP_source
0,202,https://www.ahajournals.org/doi/10.1161/CIRCIN...,Sheth_2016.pdf,Optical Coherence Tomography–Guided Percutaneo...,https://www.ahajournals.org/doi/abs/10.1161/CI...,Among 10 732 patients enrolled in the TOTAL tr...,Background—\nPatients undergoing primary percu...,2016-04-07,,,,,,
1,203,,,Improved accuracy of pedicle screw insertion w...,https://journals.lww.com/spinejournal/Fulltext...,The reasons for not using CAOS in 35 screws we...,STUDY DESIGN:\nA prospective clinical trial wa...,2019-07-27,,,,,,
2,204,https://www.tandfonline.com/doi/epdf/10.1080/1...,Davids_2021.pdf,Matching-adjusted indirect comparisons of safe...,https://www.tandfonline.com/doi/abs/10.1080/10...,patient-level data were extracted from the ELE...,"Acalabrutinib is a highly selective, potent, n...",2021-05-06,,,,,,
3,205,https://www.researchgate.net/publication/28307...,Shiomi_2015.pdf,Effects of a diverting stoma on symptomatic an...,https://www.sciencedirect.com/science/article/...,Although the tumor level of participating pati...,Background\nRoutine creation of a diverting st...,2015-02-01,,,,,,
4,206,,,For whom does a match matter most? Patient-lev...,https://psycnet.apa.org/journals/ccp/90/1/61/,"When randomized to the Match, trial patients w...","OBJECTIVE:\nA double-blind, randomized control...",2022-03-02,,,,,,


## Enter OpenAI key

In [5]:
oai_key = getpass("Enter the key")
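
For unattended runs, one option is to read the key from an environment variable and prompt only when it is missing. This is a sketch under that assumption; the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable in which the key is stored.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to the interactive prompt used in the cell above.
    return getpass("Enter the key")
```

With this helper, the cell above could become `oai_key = load_api_key()`.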



# Extraction Process

We use GPT to extract the parameters for screening. Specifically, we focus on **AI/ML** and **NLP**. Note that we also ask for source text that would indicate what GPT referenced in choosing its answer. We do not analyze the source texts in the manuscript, but we found them useful when engineering prompts and resolving conflicting answers between GPT and human reviewers.

**If you want to experiment more with the output, you can make changes to the description in each of the ResponseSchema below. If you want to screen for other parameters, add them to `response_schemas` in the same format as that for `AI_ML` and `NLP`. Please note that you'll need to make additional changes in `main_extraction` to save the responses for any new parameters in the dataframe.**

In [None]:
response_schemas = [
        [ResponseSchema(name="AI_ML", description="Identify whether the abstract explains machine learning or AI used to match patients to clinical trials or not. Answer it in YES or NO"),
        ResponseSchema(name="AI_ML_source", description="Source text that explains machine learning or AI that was used to match patients to clinical trials"),
        ],
        [ResponseSchema(name="NLP", description="Identify whether the abstract explains usage of natural language processing to match patients to clinical trials. Answer it in YES or NO"),
         ResponseSchema(name="NLP_source", description="Source text that explains usage of natural language processing to match patients to clinical trials") ],
        ]

We do the extraction via two functions.

First we have `extract_information`, which processes a single text. It uses the response schemas above to fill in the prompts for GPT, passes the prompts to GPT, parses the responses, aggregates them in a dictionary and returns said dictionary. The result is a collection that maps each parameter of interest to its values for a single abstract.

The second function, `main_extraction`, iterates through all the articles, calls `extract_information` on each of them and populates the original dataframe with the extracted parameters.

The whole process is initiated in the **Run the code** section below.
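
To make the shape of the returned dictionary concrete, the sketch below merges the per-question dictionaries into a single mapping from column name to value. It is an illustration only: `flatten_answers` is a hypothetical helper, the example values are made up, and it assumes that an entry which is not a dictionary is raw model text left over from a failed parse.

```python
def flatten_answers(ans):
    """Merge the per-question dictionaries into one {column: value} mapping.

    Entries that are not dictionaries (raw text kept after a failed parse)
    are skipped, and their question indices are returned separately.
    """
    row, unparsed = {}, []
    for index, answer in ans.items():
        if isinstance(answer, dict):
            row.update(answer)
        else:
            unparsed.append(index)
    return row, unparsed


# Made-up example: question 0 parsed correctly, question 1 did not.
example = {
    0: {"AI_ML": "YES", "AI_ML_source": "made-up source text"},
    1: "free text that could not be parsed",
}
row, unparsed = flatten_answers(example)
print(row, unparsed)
```

Writing the result with `for col, value in row.items(): df.at[ind, col] = value` would let newly added parameters be saved without further changes to `main_extraction`, provided the matching columns exist.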

In [6]:
def extract_information(entry, model_name, response_schemas):
    """ Given the text of a paper abstract and a collection of parameters to screen for (along with
    questions and formatting instructions), uses the `model_name` AI model to extract information about the queried parameters.

    Parameters:
    entry (str): the full text of the abstract of a scientific article
    model_name (str): the name of the AI model to be used for extracting information. ex: "gpt-3.5-turbo", "gpt-4"
    response_schemas (list[list[ResponseSchema]]): A collection of ResponseSchema objects that map each parameter of interest
                                                   to the corresponding question and formatting instructions that will be filled
                                                   into the prompt and asked to `model_name`. Each entry corresponds to a single
                                                   parameter and contains two response schemas, one for the value of the parameter
                                                   and another for the source text justifying the value of the parameter. See example
                                                   of response_schemas in code cell above for a better understanding of the structure.

    Returns:
    dict {int: dict {str: str}}: A collection mapping the index of each question asked to the extracted answers. The index
                                 is the order in which the question appears in `response_schemas`. If a response cannot
                                 be parsed, the raw response text (str) is stored in place of the inner dictionary.
    Example: {0: {"AI_ML" : "YES", "AI_ML_source" : "something" },
              1: {"NLP" : "YES", "NLP_source" : "something" }}
    """

    ans = {}

    for i,res_schema in enumerate(response_schemas):

        output_parser = StructuredOutputParser.from_response_schemas(res_schema)

        format_instructions = output_parser.get_format_instructions()

        chat_model = ChatOpenAI(temperature=0, openai_api_key = oai_key, model_name= model_name)


        prompt = ChatPromptTemplate(
            messages=[
                HumanMessagePromptTemplate.from_template("""
                You are a helpful medical data science assistant that answers question from the provided context which are abstracts of research papers. 
                Answer the users question as best as possible.\n{format_instructions}\n{question}
                """)
            ],
            input_variables=["question"],
            partial_variables={"format_instructions": format_instructions}
        )


        _input = prompt.format_prompt(question=entry)
        output = chat_model(_input.to_messages())

        try:
            llm_output = output_parser.parse(output.content)
        except Exception:
            # Parsing failed: keep the raw text so the response is not lost.
            llm_output = output.content

        ans[i] = llm_output

    return ans
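
When parsing fails, the raw text may still contain a usable JSON object, for example if the model omitted the fenced JSON block that the format instructions ask for. The sketch below shows one possible recovery step using only the standard library; `parse_json_fallback` is a hypothetical helper and is not part of the original notebook or of LangChain.

```python
import json
import re


def parse_json_fallback(raw_text):
    """Try to recover a JSON object from raw model text; return None on failure."""
    # Grab everything from the first "{" to the last "}" in the text.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Such a helper could be called inside the `except` branch of `extract_information`, keeping the raw text only when it also returns `None`.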


In [7]:
def main_extraction(full_abstracts, model_name):
    """ Function that executes `extract_information` for each of the entries in the FullAbstract column of the dataframe
    and saves the extracted parameters in the corresponding columns of the dataframe.

    Parameters:
    full_abstracts (list[str]): list of the texts of all the abstracts in the dataframe
    model_name (str): name of the AI model to be used for the data extraction. ex: "gpt-3.5-turbo"

    Outputs:
    Saves the extracted values in their respective columns. Modifies the original dataframe.
    """

    for ind, entry in enumerate(full_abstracts):
        ans = ''

        try:
            ans = extract_information(entry,model_name, response_schemas)
            ans0 = ans[0]  #AI/ML
            ans1 = ans[1]  #NLP

            df.at[ind, 'AI_ML'] = ans0['AI_ML']
            df.at[ind, 'NLP'] = ans1['NLP']

            df.at[ind, 'AI_ML_source'] = ans0['AI_ML_source']
            df.at[ind, 'NLP_source'] = ans1['NLP_source']

            print(f"Index: {ind} DONE processing.")

            #time.sleep(5)

        except Exception as e:
            print('Index :', ind, 'There was an error !', e)



## Run the Code

For more robust output, consider using **gpt-4** instead of **gpt-3.5-turbo**.

In [8]:
full_abstracts = df['FullAbstract'].to_list()
model_name = "gpt-4"
#model_name = "gpt-3.5-turbo"

In [9]:
main_extraction(full_abstracts, model_name)

Index: 0 DONE processing.
Index: 1 DONE processing.
Index: 2 DONE processing.
Index: 3 DONE processing.
Index: 4 DONE processing.
Index: 5 DONE processing.
Index: 6 DONE processing.
Index: 7 DONE processing.
Index: 8 DONE processing.
Index: 9 DONE processing.
Index: 10 DONE processing.
Index: 11 DONE processing.
Index: 12 DONE processing.
Index: 13 DONE processing.
Index: 14 DONE processing.
Index: 15 DONE processing.
Index: 16 DONE processing.
Index: 17 DONE processing.
Index: 18 DONE processing.
Index: 19 DONE processing.
Index: 20 DONE processing.
Index: 21 DONE processing.
Index: 22 DONE processing.
Index: 23 DONE processing.
Index: 24 DONE processing.
Index: 25 DONE processing.
Index: 26 DONE processing.
Index: 27 DONE processing.
Index: 28 DONE processing.
Index: 29 DONE processing.
Index: 30 DONE processing.
Index: 31 DONE processing.
Index: 32 DONE processing.
Index: 33 DONE processing.
Index: 34 DONE processing.
Index: 35 DONE processing.
Index: 36 DONE processing.
Index: 37 D

In [10]:
df

Unnamed: 0,Rank,FullTextLink,FullText,Title,Link,Abstract,FullAbstract,PublicationDate,novel,AI_ML,NLP,novel_source,AI_ML_source,NLP_source
0,101,https://www.jacc.org/doi/abs/10.1016/j.jcin.20...,Cardiol_2015.pdf,with matched hydration: the MYTHOS (induced di...,https://www.jacc.org/doi/abs/10.1016/j.jcin.20...,to a patient in an amount matched to the volum...,Objectives:\nThis study investigated the effec...,2012 Jan,YES,NO,NO,This study investigated the effect of furosemi...,,
1,102,,,Matching-adjusted indirect comparisons: a new ...,https://www.sciencedirect.com/science/article/...,patients in one treatment group (in this case ...,OBJECTIVE:\nIn the absence of head-to-head ran...,2022-03-30 00:00:00,NO,NO,NO,,,
2,103,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8...,Croft_2021.pdf,Copy number evolution and its relationship wit...,https://www.nature.com/articles/s41375-020-010...,"Importantly, few studies have been performed o...",Structural chromosomal changes including copy ...,2021-11-18 00:00:00,NO,NO,NO,,,
3,104,,,The effect of visual scanning exercises integr...,https://journals.sagepub.com/doi/abs/10.1177/1...,A matched-pair randomized control trial was co...,BACKGROUND:\nUnilateral spatial neglect (USN) ...,2022-04-09 00:00:00,NO,NO,NO,,,
4,105,,,Crestal bone stability around implants with ho...,https://onlinelibrary.wiley.com/doi/abs/10.111...,Individuals for this 1-year controlled clinica...,BACKGROUND:\nIt has been shown that thin mucos...,2017-02-22 00:00:00,NO,NO,NO,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,197,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Johnson_2020.pdf,Opportunities for patient matching algorithms ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,matching patients to patients or clinical mode...,The premise of precision medicine is rooted in...,2017 Feb 23,NO,NO,NO,,,
96,198,,Elizabeth_2021,Matching the patient to the intraocular lens: ...,https://www.sciencedirect.com/science/article/...,In a clinical trial to compare spectacle indep...,The intraocular lens (IOL) selection process f...,2021-11-01 00:00:00,NO,NO,NO,,,
97,199,,,Isavuconazole treatment for mucormycosis: a si...,https://www.thelancet.com/journals/laninf/arti...,FungiScope patient outcomes and to the case-ma...,BACKGROUND:\nMucormycosis is an uncommon invas...,2018-03-07 00:00:00,NO,NO,NO,,,
98,200,,,Practice patterns of ABOmatching for cryopreci...,https://onlinelibrary.wiley.com/doi/abs/10.111...,This study demonstrated that during the FIBRES...,BACKGROUND AND OBJECTIVES:\nThis sub-study of ...,2022-09-14 00:00:00,NO,NO,NO,,,
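
Before saving, it can be useful to check that every answer really is YES or NO, since the model is only instructed to answer that way. The sketch below normalises the answers and tallies them; `normalize_answer` is a hypothetical helper and the sample answers are made up.

```python
from collections import Counter


def normalize_answer(value):
    """Map a raw model answer to "YES", "NO", or None when it is neither."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().strip(".").upper()
    return cleaned if cleaned in {"YES", "NO"} else None


# Made-up answers for illustration; in the notebook they would come from df['AI_ML'].
raw_answers = ["YES", "no", " No. ", None, "Unclear"]
counts = Counter(normalize_answer(answer) for answer in raw_answers)
print(counts)
```

Rows that map to `None` are the ones worth inspecting by hand.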


# Saving Results

In [10]:
df.to_excel('results/Screening_reviewer_GPT4.xlsx')
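
The save step assumes that the `results` folder already exists; if it does not, the write will typically fail. A minimal sketch for creating the folder first is shown below; `prepare_output_path` is a hypothetical helper.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent folder of `path` if needed and return it as a Path."""
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return output_path
```

The save could then be written as `df.to_excel(prepare_output_path('results/Screening_reviewer_GPT4.xlsx'))`.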