# Utilizing Agents + RAG to Generate Research Ideas
### By: Bradley Sides

**Steps:**
1. Author inputs their 3(?) most recent papers.
2. Related works are located using RAG with sentence embeddings.
3. With special consideration placed on the author's own papers, augmented with background context from the related works in addition to the model's own knowledge, generate a "baseline" research idea for the author.
4. Utilize harsh reviewing agents to assess the idea based on multiple criteria.
    
    a. Novelty agent checks that the idea is sufficiently different from already researched topics. Provides detailed feedback as well as a score.
    
    b. Fundability/Impact agent checks that the idea is both competitive for grants and focused on an important topic rather than simply something obscure. Provides detailed feedback as well as a score.
5. If score minimums are not met, return to the idea-generating agent with feedback from both reviewing agents to improve the idea, and iterate until it passes.
6. Once score minimums are met, the idea is finalized.
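
Steps 3 through 6 amount to a generate, review, rewrite loop. The sketch below shows only that control flow in plain Python. `generate`, `review`, and `rewrite` are hypothetical stand-ins for the LLM agents, and the `min_score` and `max_rounds` defaults are assumptions for illustration; the notebook itself builds the loop as a LangGraph graph.

```python
from typing import Callable, Dict, List, Tuple

# A reviewer takes an idea and returns (feedback, score); hypothetical stand-in for an LLM agent.
Reviewer = Callable[[str], Tuple[str, float]]


def refine_idea(
    generate: Callable[[], str],
    reviewers: Dict[str, Reviewer],
    rewrite: Callable[[str, List[str]], str],
    min_score: float = 4.5,
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Iterate until every reviewer's score meets min_score or max_rounds is hit."""
    idea = generate()
    for round_number in range(1, max_rounds + 1):
        results = {name: reviewer(idea) for name, reviewer in reviewers.items()}
        if all(score >= min_score for _, score in results.values()):
            return idea, round_number
        # Pass feedback from all reviewers back to the rewriting step.
        feedback = [f"{name}: {text}" for name, (text, _) in results.items()]
        idea = rewrite(idea, feedback)
    return idea, max_rounds


if __name__ == "__main__":
    # Toy agents: each rewrite appends a marker, and reviewers score by marker count.
    def toy_review(idea: str) -> Tuple[str, float]:
        return "add more detail", 3.0 + idea.count("+")

    final, rounds = refine_idea(
        generate=lambda: "idea",
        reviewers={"novelty": toy_review, "fundability": toy_review},
        rewrite=lambda idea, feedback: idea + "+",
    )
    print(final, rounds)
```

The `max_rounds` cap is a design choice rather than part of the original steps: without it, an idea that never reaches the minimum score would loop indefinitely.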


**Models Used:**

   LLM: Llama 3 70B 8192

   Sentence Encoder: sentence-transformers all-MiniLM-L6-v2

   Tokenizer: GPT2

## Load packages, models, environment variables

In [1]:
import numpy as np
import pandas as pd
import pyarrow.dataset as ds
import glob
from transformers import GPT2Tokenizer
import os
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.output_parsers import JsonOutputParser
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

#pd.set_option('display.max_colwidth', None)

# Read API keys from the environment rather than hard-coding secrets in the notebook
GROQ_API_KEY = os.environ['GROQ_API_KEY']
TAVILY_API_KEY = os.environ.get('TAVILY_API_KEY')  # not used in the cells below

In [2]:
# Master DataFrame, pre-cleaned and organized
df = pd.read_parquet('compressed_fulldata.parquet')

# Sentence Encoder
sent_model = SentenceTransformer('all-MiniLM-L6-v2')

# Language Model: Llama 3
    # Num Parameters: 70B
    # Context Window: 8192
GROQ_LLM = ChatGroq(
            model="llama3-70b-8192",
            groq_api_key=GROQ_API_KEY
        )

# Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [3]:
# Keys and OS settings for Langchain
#os.environ['LANGCHAIN_TRACING_V2'] = 'true'
#os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
#os.environ['LANGCHAIN_API_KEY'] = '<your LangSmith API key>'
#!pip -q install langchain-groq duckduckgo-search
#!pip -q install -U langchain_community tiktoken langchainhub
#!pip -q install -U langchain langgraph tavily-python
#!pip show langgraph


In [4]:
df.head(1)

| Unnamed: 0 | title | abstract | publication_date | pmid | concepts | author | num_citations | related_works | cited_by_api_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Treatment of Alcohol Withdrawal Syndrome | Treatment of the alcohol withdrawal syndrome i... | 1994-01-01 | 7912939 | [Alcohol withdrawal syndrome, Medicine, Rehabi... | [Vural Özdemir] | 18 | [https://openalex.org/W4388336948, https://ope... | https://api.openalex.org/works?filter=cites:W2... |


## Implement a simple RAG system using sentence embeddings to find the N most similar papers to each paper the author submits

In [5]:
'''
TODO: Implement RAG to get 
    1. Related works (HyDE/Decomposition/RAG Fusion)
    2. Top (3?) citations from each paper
'''
# create sentence embeddings
def embed_sentences():
    embed = sent_model.encode(df['abstract'].tolist(), show_progress_bar = True)
    print(np.array(embed).shape)
    np.save('saved_embeddings.npy', embed) # Optional for saving the embeddings to disk
    return embed
    
# find similar papers based on cosine similarity between sentence embeddings
def simple_rag(abstract, emb_list, abstract_df, n):
    query_emb = sent_model.encode([abstract])[0]
    similarities = cosine_similarity([query_emb], emb_list)[0]
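    # argsort is ascending, so take the last n+1 indices and reverse them.
    # When the query abstract comes from the dataset, the first hit is the paper itself.
    # For an abstract from outside the dataset, use [-n:] instead to get exactly n papers:
    # top = np.argsort(similarities)[-n:][::-1]
    top = np.argsort(similarities)[-n-1:][::-1] 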
    top_papers = abstract_df.iloc[top]
    return [(row['title'], row['abstract']) for idx, row in top_papers.iterrows()]

# print out original and similar papers
def print_similar(sim_papers):
    
    print("ORIGINAL PAPER: ")
    print("________________________________________________")
    print("Title: " + sim_papers[0][0])
    print("")
    print("Abstract: " + sim_papers[0][1])
    print("=========================================================")
    for i in range(len(sim_papers)-1):
        print("++++++++++++++++++++++++++++++++++")
        print("Related title Number " + str((i+1)))
        print("++++++++++++++++++++++++++++++++++")
        print("Title: " + sim_papers[i+1][0])
        print("")
        print("Abstract: " + sim_papers[i+1][1])
        print("=========================================================")

embed = embed_sentences()
#abstract = my_abstracts[0]
#n = 2 # Number of papers to pull
#sim_papers = simple_rag(abstract, embed, df, n) # First entry is the same paper, drop it
#print_similar(sim_papers)


(185275, 384)
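
The retrieval step ranks every abstract by cosine similarity to the query embedding and keeps the top `n`. As a minimal illustration of that ranking, the sketch below uses plain Python and toy 2-dimensional vectors in place of the 384-dimensional sentence embeddings. `top_n_similar` and its `exclude` parameter are hypothetical names; `exclude` shows one way to drop the query paper itself when it comes from the dataset.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_similar(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    n: int,
    exclude: Optional[int] = None,
) -> List[int]:
    """Return indices of the n corpus vectors most similar to query.

    exclude is the row index of the query itself when it comes from the corpus.
    """
    scored = [
        (cosine(query, vec), idx)
        for idx, vec in enumerate(corpus)
        if idx != exclude
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:n]]


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
    print(top_n_similar(corpus[0], corpus, n=2, exclude=0))
```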


## Generate the initial idea focused on author's work and utilizing related works

In [145]:
'''
This will serve as the initial idea generator for the agents to work on
'''
prompt = PromptTemplate(
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a research assistant. You are a master at synthesizing information to formulate creative, novel, fundable, and feasible ideas that improve on the previous work that is presented to you.
    
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Conduct a comprehensive analysis of the abstracts and related work provided below and present a well-formulated idea for a new research paper that logically follows the direction of research in my field. 
    Here are my previous papers, from most to least recent:
    My recent papers: \n\n {og_papers} \n\n
    
    <|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables = ["og_papers"]
)

# og_papers is supplied at invocation time (my_data is built in the Sandbox section below)
base_idea_gen = prompt | GROQ_LLM | StrOutputParser()
#idea = base_idea_gen.invoke({"og_papers": my_data})
#print(idea)

## Introduce Reviewing Agents

### 1: Fundability and Impact Reviewer

In [146]:
'''
REVIEWING AGENT #1: FUNDABILITY AND IMPACT REVIEWER
    
** Note: Utilizing prompts from researchAgent paper heavily for this, will need to make new ones 
'''
fundability_agent_prompt = PromptTemplate(template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an AI assistant whose primary goal is to meticulously evaluate the fundability of research ideas based off of the NSF funding criteria in order to aid researchers in refining their approaches based on your evaluations and feedback, thereby amplifying the quality and impact of their scientific contributions.
    \n\n
    You are going to evaluate a research idea for its potential fundability by the NSF. Refer to the target papers to help understand the context of the problem for a more comprehensive assessment.
    
    The existing studies are: {og_papers}
    
    Now, proceed with your evaluation approach that should be systematic:
        - Start by thoroughly reading the experiment design and its rationale, keeping in mind the context provided by the research problem, scientific method, and existing studies mentioned above.
        - Next, generate a review and feedback that should be constructive, helpful, and concise, focusing on the fundability of the experiment.
        - Finally, provide a score on a 5-point Likert scale, with 1 being the lowest. Be discerning and critical, and avoid a tendency towards uniformly high ratings (4-5) unless fully justified.
        
    Criteria to consider:
        - Quality and potential to advance knowledge: Does the project propose high-quality activities that can transform the frontiers of knowledge?
        - Contribution to societal goals: How does the project contribute to broader societal goals?
        - Metrics for evaluation: Are the metrics for meaningful assessment and evaluation appropriate and well-defined?
        - Originality and Creativity: Are the ideas creative, original, or potentially transformative?
        - Plan and Rationale: Is the project plan well-reasoned and organized? Does it include mechanisms to assess success?
        - Qualifications: How well qualified are the individuals or teams proposing the project?
        - Resources: Are adequate resources available to carry out the activities proposed?
    
    I am going to provide you with the research idea here: {final_idea}
    
    After your evaluation of the above content, please provide your review, feedback, and rating, in the format of:
    Review: 
    Feedback:
    Rating (1-5):
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables = ["final_idea", "og_papers"]
)

fundability = fundability_agent_prompt | GROQ_LLM | StrOutputParser()

#print(fundability.invoke({"og_papers": my_data, "final_idea": idea}))
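
Because the prompt asks the reviewer to end with a `Rating (1-5):` line, the score can also be pulled out in code. The sketch below is one way to do that with a regular expression. `parse_rating` is a hypothetical helper, and it assumes the model actually follows the requested output format.

```python
import re
from typing import Optional

# Matches e.g. "Rating (1-5): 4" or "Rating: 3.5"; the label follows the prompt's output format.
_RATING = re.compile(
    r"Rating\s*(?:\(1-5\))?\s*:\s*\**\s*([1-5](?:\.\d+)?)(?!\d)",
    re.IGNORECASE,
)


def parse_rating(review_text: str) -> Optional[float]:
    """Return the numeric rating from a review, or None if no rating line is found."""
    match = _RATING.search(review_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "Review: solid plan\nFeedback: define metrics\nRating (1-5): 4"
    print(parse_rating(sample))
```

Returning `None` instead of raising lets the caller decide how to handle a reply that ignores the format, for example by asking for a new review.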

### 2: Novelty Reviewer

In [147]:
'''
REVIEWING AGENT #2: NOVELTY REVIEWER

Description: This should pull the top N most similar papers to the "original" idea and produce a "novelty" score, as well as provide feedback.
'''
novelty_agent_prompt = PromptTemplate(template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an AI assistant whose primary goal is to meticulously evaluate the novelty of research ideas based off given criteria in order to aid researchers in refining their approaches based on your evaluations and feedback, thereby amplifying the quality and impact of their scientific contributions.
    \n\n
    The process for evaluating an idea for its novelty is outlined as follows:
        - Begin by understanding the essence of the proposed idea, focusing particularly on its unique aspects and claims of novelty.
        - Compare the idea against a broad spectrum of existing studies, considering both direct and tangential relevance.
    
    The criteria for novelty assessment are:
        - Innovation: Evaluate if the idea introduces any new methodologies, tools, or conceptual frameworks.
        - Transformation Potential: Assess whether the idea has the potential to significantly shift current practices or theoretical understanding.
        - Differentiation: Examine how distinct the idea is from existing studies. Highlight specific elements that set it apart.
        - Feasibility of New Approaches: Consider the practical implementation of the idea, evaluating if the innovative aspects are achievable within the current technological and resource constraints.
        - Impact on Existing Knowledge: Determine the potential impact of the idea on stimulating further research or development in its field.
        - Interdisciplinary Merit: Consider if the idea brings together diverse fields or disciplines in a way that fosters new directions or insights.

        Given the research idea presented here: {final_idea}
        Evaluate it against both your own knowledge of related work, as well as the following papers deemed similar through sentence embedding: {most_similar_papers}
        Please proceed with your systematic evaluation, and provide a detailed review that includes:
            - Your assessment of how the idea meets each novelty criterion.
            - Constructive feedback on areas where the idea could be further differentiated or developed.
            - A rating on a 5-point scale regarding its overall novelty, where 1 indicates very little novelty and 5 indicates highly novel.

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables = ["final_idea", "most_similar_papers"]
)
novelty = novelty_agent_prompt | GROQ_LLM | StrOutputParser()
#most_similar_papers = simple_rag(idea, embed, df, n=10)
#print(novelty.invoke({"final_idea": idea, "most_similar_papers": most_similar_papers}))


In [148]:
'''
TODO: REWRITING AGENTS:
    1. Novelty Rewriter
    2. Fundability/Impact Rewriter
'''  
#Description: These should take feedback from reviewing agents, rewrite according to critiques, and then pass back to the reviewers


novelty_analysis_prompt = PromptTemplate(
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an expert at evaluating the feedback from reviewers on a research idea based on its novelty and deciding if the idea needs to be updated. \n
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    This is the research idea: {final_idea} \n
    
    This feedback was given on the idea: {novelty_feedback} \n
    
    If the score is lower than 4.5, return EXACTLY: rewrite
    If the score is 4.5 or higher, return EXACTLY: no_rewrite
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables= ["final_idea", "novelty_feedback"]
)

fundability_analysis_prompt = PromptTemplate(
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an expert at evaluating the feedback from reviewers on a research idea based on its fundability and potential and deciding if the idea needs to be updated. \n
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    This is the research idea: {final_idea} \n
    
    This feedback was given on the idea: {fundability_feedback} \n
    
    If the score is lower than 4.5, return EXACTLY: rewrite
    If the score is 4.5 or higher, return EXACTLY: no_rewrite
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables= ["final_idea", "fundability_feedback"]
)

nov2 = novelty_analysis_prompt | GROQ_LLM | StrOutputParser()
fund2 = fundability_analysis_prompt | GROQ_LLM | StrOutputParser()
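
The analysis prompts ask the LLM to compare a reviewer score against 4.5. The same rule is easy to state in code, which also makes one consequence visible: if reviewers return whole-number ratings on the 1-5 scale (an assumption, since the prompts do not forbid decimals), only a 5 passes. `decide` is a hypothetical helper, not part of the notebook.

```python
def decide(score: float, minimum: float = 4.5) -> str:
    """Mirror the analysis prompts: below the minimum means rewrite."""
    return "rewrite" if score < minimum else "no_rewrite"


if __name__ == "__main__":
    # With whole-number ratings on a 1-5 scale, only a 5 clears a 4.5 minimum.
    passing = [rating for rating in range(1, 6) if decide(rating) == "no_rewrite"]
    print(passing)
```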

In [149]:
rewrite_idea_prompt = PromptTemplate(
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are the final agent in charge of producing novel, fundable, impactful, and feasible research ideas. A research idea has been formulated and reviewed and you are tasked with incorporating the feedback into the idea and potentially augmenting it based on what is recommended.
    You must not sacrifice one criterion for the improvement of another, so be careful with your augmentation and utilize the feedback as best you can.
    You should produce either an updated version of the idea given to you based on the feedback, or the same idea given to you if augmentation is not necessary. As a baseline, both scores should be above 4.5. Provide justification for any change you make.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    The research idea is: {final_idea}
    
    
    The feedback from the fundability and impact reviewer is: {fundability_feedback}

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables = ["final_idea", "fundability_feedback"]
)

rewrite = rewrite_idea_prompt | GROQ_LLM | StrOutputParser()

## Build the state

In [150]:
from langchain.schema import Document
from langgraph.graph import END, StateGraph
from typing_extensions import TypedDict
from typing import List

In [151]:
class GraphState(TypedDict):
    """
    Represents the state of the graph
    
    Attributes: 
        og_papers: abstract and title of first X papers from main author
        sim_papers: abstract and title of Y similar papers to each of og_papers
        most_similar_papers: abstract and title of Z similar papers to the IDEA
        base_idea: initial LLM generated idea
        final_idea: latest (possibly rewritten) LLM generated idea
        novelty_feedback: LLM generated critiques on novelty (currently disabled)
        fundability_feedback: LLM generated critiques on fundability/impact
        fundability_analysis: "rewrite" / "no_rewrite" decision on the fundability feedback
        num_steps: number of steps taken
    """
    og_papers : str
    sim_papers : List[List[str]]
    most_similar_papers : List[List[str]]
    base_idea : str
    final_idea : str
    #novelty_feedback : str
    fundability_feedback : str
    #novelty_analysis: str
    fundability_analysis: str
    num_steps : int
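
Each node in the graph returns only the keys it changed, and those values are merged into the shared state. The sketch below imitates that merge with a plain dictionary to show what a partial update does. `apply_update` and `toy_init_idea` are hypothetical names for illustration and are not LangGraph API.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def apply_update(state: State, node: Callable[[State], State]) -> State:
    """Run a node and overwrite only the keys it returns (a plain-dict imitation)."""
    update = node(state)
    return {**state, **update}


def toy_init_idea(state: State) -> State:
    # Returns a partial update; og_papers is left untouched.
    idea = f"idea based on {state['og_papers']}"
    return {"base_idea": idea, "final_idea": idea, "num_steps": state["num_steps"] + 1}


if __name__ == "__main__":
    state = {"og_papers": "paper A", "num_steps": 0}
    state = apply_update(state, toy_init_idea)
    print(state)
```

A key that no node has written yet is simply absent from the dictionary, which is why a node should read only keys that an earlier node (or the initial input) has set.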


In [152]:
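def find_similar_docs(state):
    '''Get related papers'''
    print("---FINDING RELATED PAPERS----")
    og_papers = state['og_papers']
    # og_papers is a single string in GraphState; wrap it so we don't iterate over characters
    papers = og_papers if isinstance(og_papers, list) else [og_papers]
    sim_papers = state.get('sim_papers') or []
    num_steps = int(state['num_steps'])
    num_steps += 1
    for paper in papers:
        p = simple_rag(paper, embed, df, n=2)
        sim_papers.append(p)
        print_similar(p)  # expects one list of (title, abstract) tuples
    return{"sim_papers": sim_papers, "num_steps": num_steps}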

In [153]:
# TODO: Edit this function AND base_idea_gen to include sim_docs
def init_idea(state):
    '''Generate initial idea'''
    print("---GENERATING INITIAL IDEA---")
    og_papers = state['og_papers']
    num_steps = int(state['num_steps'])
    num_steps += 1
    base_idea = base_idea_gen.invoke({"og_papers": og_papers})
    print(base_idea)
    #write_markdown_file(base_idea, "base_idea")
    # final_idea starts as the base idea and is overwritten by rewrite_idea
    return {"base_idea": base_idea, "final_idea": base_idea, "num_steps": num_steps}

In [154]:
def assess_fundability(state):
    '''Assess fundability and impact of the idea'''
    print("---ASSESSING FUNDABILITY & IMPACT OF IDEA---")
    og_papers = state['og_papers']
    final_idea = state['final_idea']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    fundability_feedback = fundability.invoke({"final_idea": final_idea, "og_papers": og_papers})
    print(fundability_feedback)
    #write_markdown_file(fundability_feedback, "fundability_feedback")
    return {"fundability_feedback": fundability_feedback, "num_steps": num_steps}

In [155]:
def assess_novelty(state):
    '''Assess novelty of idea'''
    print("---ASSESSING NOVELTY OF IDEA---")
    final_idea = state['final_idea']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    # Retrieve the papers closest to the current idea and keep them in the state
    most_similar_papers = simple_rag(final_idea, embed, df, n=8)
    novelty_feedback = novelty.invoke({"final_idea": final_idea, "most_similar_papers": most_similar_papers})
    print(novelty_feedback)
    #write_markdown_file(novelty_feedback, "novelty_feedback")
    # NOTE: re-enable novelty_feedback in GraphState before adding this node to the graph
    return {"most_similar_papers": most_similar_papers, "novelty_feedback": novelty_feedback, "num_steps": num_steps}

In [156]:
def analyze_novelty(state):
    '''Decide on update based on novelty of idea'''
    print("---DECIDING ON UPDATE VIA: NOVELTY---")
    final_idea = state['final_idea']  # was undefined before, which raised a NameError
    novelty_feedback = state['novelty_feedback']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    novelty_analysis = nov2.invoke({"final_idea": final_idea, "novelty_feedback": novelty_feedback})
    print(novelty_analysis)
    #write_markdown_file(novelty_analysis, "novelty_analysis")
    return {"novelty_analysis": novelty_analysis, "num_steps": num_steps}

In [157]:
def analyze_fundability(state):
    '''Decide on update based on fundability of idea'''
    print("---DECIDING ON UPDATE VIA: FUNDABILITY & IMPACT---")
    final_idea = state['final_idea']
    fundability_feedback = state['fundability_feedback']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    fundability_analysis = fund2.invoke({"final_idea": final_idea, "fundability_feedback": fundability_feedback})
    print(fundability_analysis)
    #write_markdown_file(fundability_analysis, "fundability_analysis")
    # Return only the changed keys, like the other nodes
    return {"fundability_analysis": fundability_analysis, "num_steps": num_steps}

In [158]:
def rewrite_idea(state):
    '''Rewrite the idea based on feedback'''
    print("---REWRITING IDEA---")
    
    final_idea = state['final_idea']
    #novelty_feedback = state['novelty_feedback']
    fundability_feedback = state['fundability_feedback']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    final_idea = rewrite.invoke({"final_idea": final_idea, "fundability_feedback": fundability_feedback})
    print(final_idea)
    return {"final_idea": final_idea, "num_steps": num_steps}

In [167]:
def no_more(state):
    print("NO MORE UPDATES NEEDED")
    final_idea = state['final_idea']
    base_idea = state['base_idea']
    num_steps = int(state['num_steps'])
    num_steps += 1
    
    #write_markdown_file(base_idea, "base_idea")
    return {"final_idea": final_idea, "num_steps": num_steps}

In [168]:
def state_printer(state):
    """Print the state"""
    print("---STATE PRINTER---")
    print(f"Base Idea: {state['base_idea']}\n")
    print(f"Final Idea: {state['final_idea']}\n")
    # sim_papers / most_similar_papers are only set when the RAG and novelty nodes are enabled
    print(f"Similar (to my own) Papers: {state.get('sim_papers')}\n")
    print(f"Related (to idea) Papers: {state.get('most_similar_papers')}\n")
    print(f"Fundability Feedback: {state['fundability_feedback']}\n")
    #print(f"Novelty Feedback: {state['novelty_feedback']}\n")
    print(f"Num Steps: {state['num_steps']}\n")

### Conditional Edge(s)

In [169]:
# Conditional Edge
def route_to_rewrite(state):
    print("---DECIDING WHETHER TO REWRITE---")
    fundability_analysis = state['fundability_analysis']
    print(fundability_analysis)
    # Normalize the LLM output so stray whitespace or casing doesn't break the match
    if fundability_analysis.strip().lower() == "rewrite":
        print("---ROUTE TO REWRITE---")
        return "rewrite"
    else:
        print("---ROUTE TO FINAL---")
        return "no_rewrite"
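
The router sends the idea back for another rewrite whenever the verdict is `rewrite`, so an idea that never reaches the minimum score keeps cycling. One option is to let the router stop after a fixed step budget. The sketch below is an assumption about how that could look, not the notebook's behavior: `MAX_STEPS` and `route_with_cap` are hypothetical, and the function reads the `fundability_analysis` and `num_steps` keys that the nodes already maintain.

```python
from typing import Any, Dict

MAX_STEPS = 12  # assumed budget; tune to the number of review rounds you can afford


def route_with_cap(state: Dict[str, Any], max_steps: int = MAX_STEPS) -> str:
    """Return "rewrite" only while the verdict asks for it and the step budget allows."""
    verdict = str(state.get("fundability_analysis", "")).strip().lower()
    if verdict == "rewrite" and int(state.get("num_steps", 0)) < max_steps:
        return "rewrite"
    return "no_rewrite"


if __name__ == "__main__":
    print(route_with_cap({"fundability_analysis": "rewrite", "num_steps": 3}))
    print(route_with_cap({"fundability_analysis": "rewrite", "num_steps": 12}))
```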


## Build the Graph

In [170]:
workflow = StateGraph(GraphState)

# Nodes:
#workflow.add_node("find_similar_docs", find_similar_docs)
workflow.add_node("init_idea", init_idea)
workflow.add_node("assess_fundability", assess_fundability)
#workflow.add_node("assess_novelty", assess_novelty)
workflow.add_node("analyze_fundability", analyze_fundability)
#workflow.add_node("analyze_novelty", analyze_novelty)
workflow.add_node("rewrite_idea", rewrite_idea)
workflow.add_node("no_more", no_more)
workflow.add_node("state_printer", state_printer)

# Edges: 
workflow.set_entry_point("init_idea")
#workflow.add_edge("find_similar_docs", "init_idea")
workflow.add_edge("init_idea", "assess_fundability")
workflow.add_edge("assess_fundability", "analyze_fundability")
#workflow.add_edge("assess_novelty", "analyze_novelty")
#workflow.add_conditional_edges(
    #"analyze_novelty", 
     #route_to_rewrite,
    #{
         #"rewrite": "rewrite_idea",
         #"no_rewrite": "analyze_fundability"
     #}      
#)
workflow.add_conditional_edges(
    "analyze_fundability",
    route_to_rewrite,
    {
        "rewrite": "rewrite_idea",
        "no_rewrite": "no_more"
    },
)
workflow.add_edge("rewrite_idea", "assess_fundability")
workflow.add_edge("no_more", "state_printer")
workflow.add_edge("state_printer", END)

In [171]:
app = workflow.compile()

## Sandbox


In [172]:
"""
SANDBOX AREA!!!
Practice with authors
"""

# Get the 5 most recent papers from a specific author (only the first 4 are used below)
filter_df = df[df['author'] =='Wei Chen']
sorted_df = filter_df.sort_values(by='publication_date', ascending = False)
recent_papers = sorted_df.head(5)
recent_papers

# Format for passing into prompt template
my_abstracts = []
for i in range(4):
    concat = "TITLE: " + recent_papers.iloc[i]['title'] + "\nABSTRACT: " + recent_papers.iloc[i]['abstract']
    my_abstracts.append(concat)
my_data = '\n\n'.join(my_abstracts)    
token_count = len(tokenizer.tokenize(my_data))
print("Number of tokens in original papers: " + str(token_count))

Number of tokens in original papers: 1625
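
The token count matters because the prompt and the model's reply share the 8192-token context window. The sketch below checks a prompt size against that budget. `fits_context` and the `reserve_for_output` default are assumptions for illustration, and because GPT-2 and Llama 3 tokenize text differently, the GPT-2 count should be read as an estimate only.

```python
def fits_context(
    prompt_tokens: int,
    context_window: int = 8192,
    reserve_for_output: int = 1024,
) -> bool:
    """True if the prompt leaves at least reserve_for_output tokens for the reply."""
    if prompt_tokens < 0 or reserve_for_output < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + reserve_for_output <= context_window


if __name__ == "__main__":
    # 1625 is the count reported for the author's papers above.
    print(fits_context(1625))
```

The reviewer and rewriter prompts also embed the idea and the feedback, so the same check is worth repeating on those larger prompts.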


In [None]:

inputs = {"og_papers": my_data, "num_steps":0}
for output in app.stream(inputs):
    for key, value in output.items():
        print(f"Finished running: {key}:")

In [175]:
### Let's do it with Acuna papers:
abstract_1 = "### PAPER 1: Peer review is an important part of science, aimed at providing expert and objective assessment of a manuscript. Because of many factors, including time constraints, unique expertise needs, and deference, many journals ask authors to suggest peer reviewers for their own manuscript. Previous researchers have found differing effects about this practice that might be inconclusive due to sample sizes. In this article, we analyze the association between author-suggested reviewers and review invitation, review scores, acceptance rates, and subjective review quality using a large dataset of close to 8K manuscripts from 46K authors and 21K reviewers from the journal PLOS ONE’s Neuroscience section. We found that all-author-suggested review panels increase the chances of acceptance by 20 percent points vs all-editor-suggested panels while agreeing to review less often. While PLOS ONE has since ended the practice of asking for suggested reviewers, many others still use them and perhaps should consider the results presented here."
abstract_2 = "### PAPER 2: Figures are an essential part of scientific communication. Yet little is understood about how accessible (e.g., color-blind safe), readable (e.g., good contrast), and explainable (e.g., contain captions and legends) they are. We develop computational techniques to measure these features and analyze a large sample of them from open access publications. Our method combines computer and human vision research principles, achieving high accuracy in detecting problems. In our sample, we estimated that around 20.6% of publications contain either accessibility, readability, or explainability issues (around 2% of all figures contain accessibility issues, 3% of diagnostic figures contain readability issues, and 23% of line charts contain explainability issues). We release our analysis as a dataset and methods for further examination by the scientific community."
abstract_3 = "### PAPER 3: Research has shown that most resources shared in articles (e.g., URLs to code or data) are not kept up to date and mostly disappear from the web after some years (Zeng et al., 2019). Little is known about the factors that differentiate and predict the longevity of these resources. This article explores a range of explanatory features related to the publication venue, authors, references, and where the resource is shared. We analyze an extensive repository of publications and, through web archival services, reconstruct how they looked at different time points. We discover that the most important factors are related to where and how the resource is shared, and surprisingly little is explained by the author’s reputation or prestige of the journal. By examining the places where long-lasting resources are shared, we suggest that it is critical to disseminate and create standards with modern technologies. Finally, we discuss implications for reproducibility and recognizing scientific datasets as first-class citizens."
abstract_4 = "### PAPER 4: The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose serious threat to the credibility of various forms of media if these technologies are used for plagiarism, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI."
abstracts = [abstract_1, abstract_2, abstract_3, abstract_4]
concat = ' '.join(abstracts)

inputs = {"og_papers": concat, "num_steps":0}
for output in app.stream(inputs):
    for key, value in output.items():
        print(f"Finished running: {key}:")

---GENERATING INITIAL IDEA---
After conducting a comprehensive analysis of the abstracts and related work provided, I have formulated a well-formulated idea for a new research paper that logically follows the direction of research in your field. Here's a potential research idea:

**Title:** "Evaluating the Impact of AI-generated Content on the Quality and Integrity of Scientific Publishing: A Large-scale Analysis of Peer-reviewed Articles"

**Research Question:** How do AI-generated content and paraphrased text affect the quality, readability, and credibility of scientific publications, and what are the implications for the scientific community?

**Background:** The rapid advancement of AI technology has made text generation tools increasingly accessible, scalable, and effective. However, this poses a significant threat to the credibility of scientific literature, as AI-generated content can be used for plagiarism and manipulation of scientific results. The rise of AI-generated content