## Overview  
  
The **AzureOpenAI-ResearchPaperAnalyzer** project uses Azure OpenAI to analyze research papers. Its goal is to automate the extraction and categorization of key insights from research papers stored as Markdown files. The notebook below can:

- Process Markdown files of research papers.
- Extract key insights from the text as bullet-point notes, one set per chunk.
- Classify each paper by ANZSRC Field of Research (FoR) code, funding source, and affiliation with a given university.
- Summarize each paper's title, focus, methodology, and findings.
- Format and save the results as a Markdown table (`results.md`) for analysis and reporting.

The tool aims to save researchers time by automating the reading and summarizing of lengthy research papers, so they can focus on the critical aspects of their work.



In [32]:
from openai import AzureOpenAI

import os
from dotenv import load_dotenv

from IPython.display import Markdown, display, Image

import glob


In [None]:
load_dotenv()


In [None]:
# Setting up the deployment name
deployment_name = "gpt-4o-mini"

# The API key for your Azure OpenAI resource.
api_key = os.environ["AZURE_OPENAI_API_KEY"]

# The base URL for your Azure OpenAI resource. e.g. "https://<your resource name>.openai.azure.com"
azure_endpoint = os.environ['AZURE_OPENAI_ENDPOINT']


api_version = "2024-02-15-preview"  # This seems to work

# Print the configuration, but never the API key itself
print("Azure OpenAI API Key set: ", bool(api_key))
print("Azure OpenAI Endpoint: ", azure_endpoint)
print("Azure OpenAI API Version: ", api_version)


client = AzureOpenAI(
  api_key=api_key,  
  azure_endpoint=azure_endpoint,
  api_version=api_version
)
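The cell above reads the key and endpoint from the environment. A small helper along the following lines could make a missing variable fail early and keep secrets out of the notebook output. This is only a sketch: `require_env` and `mask_secret` are hypothetical names introduced here, not part of the OpenAI SDK or python-dotenv.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the value with all but the first `visible` characters masked."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


def require_env(name: str) -> str:
    """Read a required environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With these in place, `print(mask_secret(require_env("AZURE_OPENAI_API_KEY")))` would confirm that a key is loaded without exposing it.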

In [None]:
completion = client.chat.completions.create(
  model=deployment_name,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
    {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
  ]
)
  
print("Assistant: " + completion.choices[0].message.content)

In [36]:
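# Load the article files (Markdown) from the current working directory,
# skipping the *_notes.md and results.md files that this notebook generates.
article_files = [
    f for f in sorted(glob.glob("*.md"))
    if not f.endswith("_notes.md") and f != "results.md"
]

# Limit to the first 5 files for now
article_files = article_files[:5]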



In [37]:
# Chunk the article into smaller parts so each request stays within the model's context window.
# The chunks produced here do not overlap, although carrying the last sentence of the
# previous chunk into the next one would help preserve context.

def split_text(text, limit):
    """
    Split the text on sentence boundaries into parts of at most `limit` characters.
    A single sentence longer than `limit` becomes its own (oversized) part.
    """
    text_parts = []
    current_part = ""
    for sentence in text.split("."):
        if not sentence.strip():
            continue  # skip empty fragments, e.g. after the final period
        if current_part and len(current_part) + len(sentence) + 1 > limit:
            text_parts.append(current_part)
            current_part = ""
        current_part += sentence + "."
    if current_part:
        text_parts.append(current_part)
    return text_parts
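`split_text` produces non-overlapping chunks. If carrying the last sentence of each chunk into the next one is wanted to preserve context, a variant could look like the sketch below. `split_text_with_overlap` is a hypothetical name, and the function keeps the same naive sentence split on `"."`, so decimals and abbreviations are still treated as sentence ends. Because of the carried-over sentence, a chunk can exceed `limit` slightly.

```python
def split_text_with_overlap(text, limit):
    """Split text into chunks of roughly `limit` characters; each chunk after
    the first starts with the last sentence of the previous chunk."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks = []
    current = []      # sentences in the chunk being built
    current_len = 0   # total characters across `current`
    for sentence in sentences:
        if current and current_len + len(sentence) > limit:
            chunks.append(" ".join(current))
            # Seed the next chunk with the previous chunk's last sentence.
            current = [current[-1]]
            current_len = len(current[0])
        current.append(sentence)
        current_len += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For instance, `split_text_with_overlap("A one. B two. C three. D four.", 14)` yields three chunks in which `"B two."` and `"C three."` each appear twice, once at the end of one chunk and once at the start of the next.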


#lets split the full transcript into smaller parts




In [None]:
university = "Sample University"

In [38]:
first_system_message = {"role": "system", "content": f"""You are an AI assistant specializing in summarizing research papers into concise and meaningful bullet points. You will be provided with chunks of a single research paper, and your goal is to extract and encapsulate the key points clearly and accurately, maintaining academic integrity while ensuring readability.  
  
### Key Aspects to Focus On:  
1. **Title & Journal:**  
   - Identify the title of the research paper (if mentioned in the chunk).  
   - Identify the journal where the research was published (if mentioned in the chunk).  
  
2. **Research Focus:**  
   - Determine the primary subject or discipline of the study.  
   - Example: child psychology, sports performance, public health interventions, etc.  
  
3. **Methodology and Tools:**  
   - Specify the scientific methods, techniques, or tools used in the research.  
   - Example: meta-analysis, randomized controlled trials, biomechanical analysis, etc.  
  
4. **Keywords and Findings:**  
   - Highlight the main themes, keywords, or notable findings of the study.  
   - Example: online parenting programs, health interventions, etc.  
  
5. **Disciplinary Context:**  
   - Determine the academic field most closely aligned with the study (e.g., psychology, health sciences, engineering).  
   - If applicable, use notes to suggest alignment with the ANZSRC classification.  
  
6. **Authors & Institution:**  
   - Identify the authors and their affiliated institutions (if provided in the chunk).  
  
7. **Affiliations to {university}:**  
   - Identify any affiliations or collaborations with {university} (if explicitly mentioned in the chunk).  
  
8. **Funding Source:**  
   - Mention any funding sources or grants that supported the research (if mentioned in the chunk).  
  
---  
  
### Instructions:  
- Carefully review each provided chunk of the research paper.  
- Extract key points based on the categories above.  
- **Include only the sections for which information is explicitly provided in the chunk. Omit sections that are not mentioned in the chunk.**  
- Write your output as concise, well-structured bullet points.  
- Avoid including extraneous details or repeating information.  
  
---  
  
### Example Outputs:  
  
#### Example 1: Full Information Available in the Chunk  
**Input:**  
"This study investigates the impact of online parenting programs on improving parenting practices and child outcomes. A meta-analysis was conducted on 25 studies that evaluated the effectiveness of web-based interventions for parents of children aged 3–12 years. Key findings include significant improvements in parenting confidence and reductions in child behavioral problems. The study was published in the Journal of Child Psychology and Psychiatry, authored by Dr. John Smith from the University of Example and Dr. Alice Johnson from the Institute of Parenting Research. It was funded by the National Institute of Child Health and Human Development and included a collaboration with La Trobe University."  
  
**Output:**  
- **Title**: Impact of Online Parenting Programs on Parenting Practices and Child Outcomes.  
- **Journal**: Journal of Child Psychology and Psychiatry.  
- **Research Focus**: Examines the effectiveness of online parenting programs on parenting practices and child outcomes.  
- **Methodology and Tools**: Meta-analysis of 25 studies on web-based interventions for parents of children aged 3–12 years.  
- **Keywords and Findings**:  
   - Parenting confidence.  
   - Reductions in child behavioral problems.  
- **Disciplinary Context**: Child psychology; aligns with research on digital health interventions in family contexts.  
- **Authors & Institution**:  
   - Dr. John Smith (University of Example).  
   - Dr. Alice Johnson (Institute of Parenting Research).  
- **Affiliations to {university}**: Collaboration with La Trobe University.  
- **Funding Source**: Supported by the National Institute of Child Health and Human Development.  
  
---  
  
#### Example 2: Partial Information Available in the Chunk  
**Input:**  
"This study investigates the impact of online parenting programs on improving parenting practices and child outcomes. A meta-analysis was conducted on 25 studies that evaluated the effectiveness of web-based interventions for parents of children aged 3–12 years. Key findings include significant improvements in parenting confidence and reductions in child behavioral problems. The research was conducted in collaboration with the University of Example."  
  
**Output:**  
- **Research Focus**: Examines the effectiveness of online parenting programs on parenting practices and child outcomes.  
- **Methodology and Tools**: Meta-analysis of 25 studies on web-based interventions for parents of children aged 3–12 years.  
- **Keywords and Findings**:  
   - Parenting confidence.  
   - Reductions in child behavioral problems.  
- **Affiliations to {university}**: Collaboration with the University of Example.  
  
---  
  
#### Example 3: Minimal Information in the Chunk  
**Input:**  
"A randomized controlled trial was used to assess the impact of a new physical activity intervention on cardiovascular health in adults aged 40–65. The intervention involved supervised group exercise sessions over 12 weeks."  
  
**Output:**  
- **Research Focus**: Evaluates the impact of a physical activity intervention on cardiovascular health in adults aged 40–65.  
- **Methodology and Tools**: Randomized controlled trial; supervised group exercise sessions over 12 weeks.  
  
---  
  
#### Example 4: Missing All Key Details Except Affiliations  
**Input:**  
"This research was conducted as part of an ongoing collaboration between the University of Example and La Trobe University."  
  
**Output:**  
- **Affiliations to {university}**: Collaboration between the University of Example and La Trobe University.  
  
---  
  
### Notes:  
1. **If information about the title, journal, authors, funding source, or other sections is not mentioned in the provided chunk, omit those sections in the output.**  
2. **Do not infer or fabricate information. Only summarize what is explicitly stated.**  
3. **If the chunk contains only general or vague information, your summary should reflect this without adding unnecessary speculation.**  
  
Your task is to follow the structure above to summarize each chunk provided. Ensure clarity, conciseness, and accuracy in your output."""}  

In [None]:
# For each article file, read the text, split it into chunks, and write per-chunk notes

for article_file in article_files:
    with open(article_file, 'r', encoding="utf-8") as file:
        article = file.read()
    chunk_size = 8000  # maximum characters (not tokens) per chunk
    article_length = len(article)
    text_parts = split_text(article, chunk_size)
    print(f"The article {article_file} is {article_length} characters long and was split into {len(text_parts)} parts.")
    # Write the notes to a file named <article file name>_notes.md
    notes_file = article_file.replace(".md", "_notes.md")
    with open(notes_file, 'w', encoding='utf-8') as notes:
        notes.write("# Notes\n\n")
        for i, text_part in enumerate(text_parts):
            print(f"Processing part {i+1}/{len(text_parts)}")
            completion = client.chat.completions.create(
                model=deployment_name,
                messages=[first_system_message, {"role": "user", "content": text_part}]
            )
            response = completion.choices[0].message.content
            notes.write(f"## Part {i+1}\n\n")
            notes.write(f"{response}\n\n")
            #print(response)
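The loop above makes one API call per chunk, so a single transient failure partway through an article stops the whole run. A generic retry wrapper is one way to soften that. The sketch below is an assumption rather than part of the original notebook: `call_with_retry` is a hypothetical helper, and it catches every exception for simplicity, whereas real code would likely narrow this to the SDK's rate-limit and connection errors.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0):
    """Call fn(); on an exception wait base_delay * 2**n seconds and retry.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Inside the loop, the call could then be wrapped as `call_with_retry(lambda: client.chat.completions.create(model=deployment_name, messages=[...]))`.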
        
  

In [None]:
# Run three system prompts (FoR code, funding sources, affiliations) across each of the _notes.md files to get the final output

# Load the generated notes files from the current working directory
article_notes = glob.glob("*_notes.md")

# Limit to the first 5 files for now
article_notes = article_notes[:5]

article_notes

In [2]:
field_of_research_code_system_prompt = {"role": "system", "content": """You are an expert in the **2020 Australian and New Zealand Standard Research Classification (ANZSRC)** system, specifically the **Fields of Research (FoR)** codes. Your task is to analyze the notes or description of a research paper and identify the most appropriate **4-digit Fields of Research (FoR) code** that corresponds to the subject matter of the research.    
  
The ANZSRC FoR classification system is hierarchical:    
1. **2-digit codes** represent broad research divisions (e.g., 42 Health Sciences).    
2. **4-digit codes** represent specific research groups within those divisions (e.g., 4206 Public Health).    
3. **6-digit codes** provide even finer granularity within each research group.    
  
## Your Task:  
1. Focus on the **4-digit FoR code** level and the **6-digit FoR code** level to classify the research into its most relevant research group.    
2. Analyze all notes provided about the study, including its focus, methodology, subject matter, keywords, and implications, to determine the **primary field of research**.    
3. Prioritize the **primary discipline** of the study rather than secondary or interdisciplinary areas, even if the study spans multiple fields.    
4. Where relevant, you may also suggest **secondary FoR codes** that reflect interdisciplinary aspects of the research.
  
---  
  
## Guidance for Identifying the Correct 4-Digit FoR Code:  
- **Research Focus**: What is the primary subject or discipline of the study? (e.g., child psychology, sports performance, public health interventions).    
- **Methodology and Tools**: What scientific methods, techniques, or tools were used? (e.g., meta-analysis, randomized controlled trials, biomechanical analysis).    
- **Keywords and Findings**: What are the main themes or keywords associated with the research? (e.g., online parenting programs, health interventions, micro-pacing strategies).    
- **Disciplinary Context**: Which academic field does the study's content align with most closely within the ANZSRC classification?    
  
If the research overlaps multiple disciplines, choose the **primary FoR code** that best represents the core focus of the study. Use the content, keywords, and context provided in the notes to guide your classification.  

---  
  
## Response Format:  

You must respond with a single JSON document and nothing else: no other text and no Markdown. Use the following structure:
  
{
    "for": {
        "for4": {
            "code": "4-digit Field of research Code", 
            "category": "Category name of the 4-digit field of research code", 
            "description": "Description of the 4-digit field of research category",
            "reasoning": "Short paragraph describing the reasoning for choosing this code"
        },
        "for6": {
            "code": "6-digit Field of research Code", 
            "category": "Category name of the 6-digit field of research code", 
            "description": "Description of the 6-digit field of research category",
            "reasoning": "Short paragraph describing the reasoning for choosing this code"
        }
    },
    "candidates": [
        {
            "for4": {
                "code": "4-digit Field of research Code", 
                "category": "Category name of the 4-digit field of research code", 
                "description": "Description of the 4-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            },
            "for6": {
                "code": "6-digit Field of research Code", 
                "category": "Category name of the 6-digit field of research code", 
                "description": "Description of the 6-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            }
        },
        {
            "for4": {
                "code": "4-digit Field of research Code", 
                "category": "Category name of the 4-digit field of research code", 
                "description": "Description of the 4-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            },
            "for6": {
                "code": "6-digit Field of research Code", 
                "category": "Category name of the 6-digit field of research code", 
                "description": "Description of the 6-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            }
        },
        {
            "for4": {
                "code": "4-digit Field of research Code", 
                "category": "Category name of the 4-digit field of research code", 
                "description": "Description of the 4-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            },
            "for6": {
                "code": "6-digit Field of research Code", 
                "category": "Category name of the 6-digit field of research code", 
                "description": "Description of the 6-digit field of research category",
                "reasoning": "Short paragraph describing the reasoning for choosing this code"
            }
        }
    ],
    "notes": [
        "Additional note providing any useful information related to the decision for the field of research code", 
        "Additional note providing any useful information related to the decision for the field of research code"
    ], 
    "quotes": [
        "Relevant quote from the paper that was used in making the decision", 
        "Relevant quote from the paper that was used in making the decision"
    ]
}


Always provide the best Field of Research code and 3 additional candidate codes.
"""}  # Update the prompt for FoR code detection here if needed
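The prompt above asks for a bare JSON document, but models sometimes wrap JSON in a Markdown code fence anyway. A defensive parser along these lines could be applied to the reply before it is used further. `parse_json_response` is a hypothetical helper, and returning `None` on failure is a design choice made for this sketch.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built here to avoid a literal fence


def parse_json_response(text):
    """Parse a model reply that should be a single JSON document.

    Strips an optional Markdown code fence and returns None if parsing fails.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence and anything after it.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

Given a parsed reply `result`, the primary code would then be available as `result["for"]["for4"]["code"]`, assuming the model followed the structure requested in the prompt.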


In [3]:
funding_sources_system_prompt = {  
    "role": "system",  
    "content": """You are an expert research assistant tasked with identifying funding sources for academic studies. Your job is to analyze detailed notes about a research paper, which may be divided into multiple parts but pertains to a single study. Your goal is to extract all mentions of funding sources or statements indicating the lack of funding from the notes.  
  
### Guidelines for Analysis:  
  
1. **Funding Sources**:  
    - Look for explicit mentions of funding agencies, grant names, organizations, or any other details related to financial support for the study.  
    - Include direct quotes from the text that support the identification of funding sources.  
  
2. **No Funding Reported**:  
    - If the notes explicitly state that no funding sources were used or reported, respond with a JSON document that reflects this absence.  
  
3. **Output Format**:  
    - You must respond with a single JSON document, using the following JSON structure:  
      {  
        "source": "funding source or 'None reported'",  
        "reasoning": "Short paragraph describing the reasoning behind choosing the funding source or absence of funding",  
        "quotes": [  
          "Quote from the paper that is useful in demonstrating that this is the funding source",  
          "Additional quote from the paper, if available, that supports the funding source"  
        ]  
      }  
    - If no funding sources are mentioned, set `"source"` to `"None reported"` and provide reasoning and quotes indicating the lack of funding.  
    - Do not include any additional text, markdown, or commentary outside of the JSON document.  
  
4. **Ambiguity**:  
    - If the funding information is ambiguous or incomplete, explain this in the `"reasoning"` field and use available quotes to support your interpretation.  
  
5. **Comprehensive Analysis**:  
    - Treat all provided notes as a cohesive input, combining information across multiple parts if necessary.  
    - Ensure no funding-related detail is overlooked.  
  
6. **Avoid Inference**:  
    - Do not infer funding sources based on context or assumptions.  
    - Only report explicitly stated funding information.  
  
7. **Clarity and Accuracy**:  
    - Ensure the output is clear, concise, and adheres to the specified JSON structure.  
  
Now, analyze the following notes and provide your response in the required JSON format:  
  
---  
  
### Examples:  
  
#### Example 1:  
**Input**:  
"The research was funded by the European Research Council (ERC Starting Grant 6789). Additional funding was provided by the Swedish Research Council."  
  
**Output**:  
{  
  "source": "European Research Council (ERC Starting Grant 6789), Swedish Research Council",  
  "reasoning": "The notes explicitly mention funding from the European Research Council and the Swedish Research Council. The grant number for the ERC grant was also provided.",  
  "quotes": [  
    "The research was funded by the European Research Council (ERC Starting Grant 6789).",  
    "Additional funding was provided by the Swedish Research Council."  
  ]  
}  
"""  
}  

In [None]:
content = f"""You are an expert research assistant tasked with identifying affiliations to a specific university in academic research. Your job is to analyze detailed notes about a research paper, which may be divided into multiple parts but pertains to a single study. The university name will be provided as a placeholder `{university}`, and you must check if any authors or affiliations mentioned in the notes are explicitly connected to this university.  
  
Carefully review all parts of the notes as a single cohesive input and provide one of the following outputs:  
1. If any author or affiliation is explicitly linked to `{university}`, list the relevant details (e.g., author name, department, or any specific mention of the university).  
2. If no affiliation to `{university}` is mentioned in the notes, explicitly state: "No affiliations to {university} were reported in the provided notes."  
  
### Guidelines:  
- Look for explicit mentions of `{university}` in any part of the notes, including author affiliations, acknowledgments, or other sections.  
- Treat the notes as a single cohesive input, even if they are divided into multiple parts.  
- Only report affiliations explicitly linked to `{university}`. Do not infer or assume affiliations based on other universities or organizations.  
  
### Examples:  
Input:  
"university: La Trobe University    
Notes:    
- Authors: Dr. John Smith (La Trobe University, Department of Sports Science), Dr. Jane Doe (University of Melbourne).    
- Acknowledgments: The authors thank La Trobe University for providing access to research facilities."  
Output:  
"Affiliations to La Trobe University: Dr. John Smith (Department of Sports Science), acknowledgment of research facilities."  
  
Input:  
"university: La Trobe University    
Notes:    
- Authors: Dr. Emily Brown (University of Sydney), Dr. Mark Wilson (Monash University).    
- No mention of La Trobe University in the acknowledgments or affiliations."  
Output:  
"No affiliations to La Trobe University were reported in the provided notes."  
  
Input:  
"university: University of Sydney    
Notes:    
- Authors: Dr. Alex Green (University of Sydney, Faculty of Medicine), Dr. Lisa White (University of Queensland).    
- The study was supported by the University of Sydney research grant."  
Output:  
"Affiliations to University of Sydney: Dr. Alex Green (Faculty of Medicine), research grant support."
"""



In [58]:
affiliation_system_prompt = {"role": "system", "content": content}


In [59]:
md_friendly_format_system_prompt = {  
    "role": "system",  
    "content": """You are an expert research assistant tasked with consolidating the results of multiple analyses into a structured Markdown table. Each iteration will provide the following information about a research paper:  
      
1. **Title of the Paper:** The title of the research paper being analyzed.    
2. **FoR Code Classification:** The primary Fields of Research (FoR) 4-digit code that best represents the study, along with its code name and the reasons for selecting this code. Secondary FoR codes and their names should also be included, if applicable.    
3. **Funding Sources Extraction:** A list of funding sources mentioned in the notes or confirmation that no funding was reported.    
4. **University Affiliations Extraction:** A list of affiliations to a specific university or confirmation that no affiliations were reported.  
  
### Task:  
Your task is to:  
1. Extract the most important pieces of information from the results.    
2. Combine this information into **a single Markdown table row**, excluding the header row and divider row, as they are already pre-generated.  
  
### Output Format:  
The output must be a **Markdown-formatted row** for the table, with data separated by vertical bars (`|`). Do not include the header or divider rows. Make sure the row has the following structure:  
  
| [Title] | [Primary FoR Code] | [Primary FoR Code Name] | [Reason for Primary FoR Code] | [Secondary FoR Codes and Names] | [Funding Sources] | [Affiliations] |  
  
### Additional Notes:  
1. **Primary FoR Code and Name**:  
    - Include the 4-digit primary FoR code along with its full name.  
    - Provide a clear and concise reason for selecting this code.  
  
2. **Secondary FoR Codes and Names**:  
    - Include any applicable secondary FoR codes and their names.  
    - If no secondary codes are relevant, write "None."  
  
3. **Funding Sources**:  
    - List all funding sources explicitly mentioned in the notes.  
    - If no funding sources are reported, write "None reported."  
  
4. **University Affiliations**:  
    - List all affiliations explicitly linked to the specified university.  
    - If no affiliations are reported, write "None reported."  
  
5. **Markdown Formatting**:  
    - Ensure the data in each column is concise and formatted appropriately.  
    - Avoid extra spaces before or after the vertical bars (`|`).  
  
### Example:  
  
#### Input:  
- Title: "Impact of Climate Change on Marine Ecosystems"  
- Primary FoR Code: 0502 (Environmental Science and Management)  
- Reason for Primary FoR Code: "The study focuses on managing marine ecosystems under climate change, which aligns with the Environmental Science and Management code."  
- Secondary FoR Codes: 0602 (Ecology)  
- Funding Sources: "Australian Research Council (ARC Discovery Grant 12345)"  
- University Affiliations: "Dr. John Smith (La Trobe University, Department of Environmental Sciences)"  
  
#### Output:  
| Impact of Climate Change on Marine Ecosystems | 0502 | Environmental Science and Management | The study focuses on managing marine ecosystems under climate change, which aligns with the Environmental Science and Management code. | 0602 (Ecology) | Australian Research Council (ARC Discovery Grant 12345) | Dr. John Smith (La Trobe University, Department of Environmental Sciences) |  
  
"""  
}  
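The formatting prompt above expects exactly one table row with seven columns, but nothing guarantees the model complies. A light check before appending to the results file could catch malformed rows. This is a sketch under stated assumptions: `normalize_table_row` and `EXPECTED_COLUMNS` are hypothetical names, and the check does not handle pipe characters that appear inside a cell.

```python
EXPECTED_COLUMNS = 7  # matches the seven columns named in the prompt


def normalize_table_row(row, expected=EXPECTED_COLUMNS):
    """Return a single-line Markdown table row, or None if the column count is wrong."""
    # Collapse any line breaks the model inserted inside the row.
    line = " ".join(row.strip().splitlines())
    if not line.startswith("|"):
        line = "| " + line
    if not line.endswith("|"):
        line = line + " |"
    cells = line.strip("|").split("|")
    return line if len(cells) == expected else None
```

A `None` result could be logged, or the request retried, instead of writing a broken row to the table.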

In [None]:
# Set up a Markdown results file whose columns match the row format requested in md_friendly_format_system_prompt.
# One row per article is appended to this file in the loop below.

# Write the table header and divider rows
with open("results.md", 'w', encoding='utf-8') as file:
    file.write("""\
| Title | Primary FoR Code | Primary FoR Code Name | Reason for Primary FoR Code | Secondary FoR Codes and Names | Funding Sources | Affiliations |
|-------|-------------------|-----------------------|-----------------------------|-------------------------------|-----------------|--------------|
""")
# Run the system prompts across each of the _notes.md files to get the final output


for article_note in article_notes:
    with open(article_note, 'r', encoding="utf-8") as file:
        article = file.read()
        completion = client.chat.completions.create(
            model="gpt-4o",  # On Azure this is a deployment name, so a "gpt-4o" deployment must exist in the resource
            messages=[field_of_research_code_system_prompt, {"role": "user", "content": article}]
        )
        fieldOfResearchResult = completion.choices[0].message.content
        print(fieldOfResearchResult)

        completion = client.chat.completions.create(
            model=deployment_name,
            messages=[funding_sources_system_prompt, {"role": "user", "content": article}]
        )
        fundingSourcesResult = completion.choices[0].message.content
        print(fundingSourcesResult)

        completion = client.chat.completions.create(
            model=deployment_name,
            messages=[affiliation_system_prompt, {"role": "user", "content": article}]
        )  
        affiliationsResult = completion.choices[0].message.content
        print(affiliationsResult)

        # Collate the three results into a single string
        collatedResults = f"{fieldOfResearchResult}, {fundingSourcesResult}, {affiliationsResult}"

        # Ask the model to format the collated results as a single Markdown table row
        completion = client.chat.completions.create(
            model=deployment_name,
            messages=[md_friendly_format_system_prompt, {"role": "user", "content": collatedResults}]
        )
        md_friendly_format = completion.choices[0].message.content
        # Append the row to results.md
        with open("results.md", 'a', encoding='utf-8') as results_file:
            results_file.write(f"{md_friendly_format}\n")
        print(md_friendly_format)
        print("\n\n")
        
        # #print the response
        # print(f"Article: {article_file}")
        # print(f"Field of Research Code: {fieldOfResearchResult}")
        # print(f"Funding Sources: {fundingSourcesResult}")
        # print(f"Affiliations: {affiliationsResult}")
        # #print a new line
        # print("\n\n")
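The results are written as a Markdown table. If a CSV file is preferred for analysis in a spreadsheet, the table could be converted with the standard library. The sketch below assumes a simple pipe-delimited table with no pipe characters inside cells; `markdown_table_to_csv` is a hypothetical helper.

```python
import csv
import io


def markdown_table_to_csv(markdown_text):
    """Convert a simple pipe-delimited Markdown table to CSV text."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # skip the divider row under the header
        writer.writerow(cells)
    return output.getvalue()
```

Reading `results.md`, passing its text through this function, and writing the return value to a `.csv` file would give one CSV row per table row, with cells containing commas quoted by the `csv` module.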

