In [6]:
!pip install openai
!pip install requests



- This is the MVP / proof of concept
- Using gpt-3.5-turbo because it's the cheapest
- https://info.arxiv.org/help/api/user-manual.html#arxiv-api-users-manual

To Do:
- The current approach is basic. Some prompt engineering could be helpful. Use OAI Playground Compare to do this
    - Need to refine enhance_query so it's more helpful and elaborate. The current queries being constructed are trivial
    - suggest_research_direction() needs to be split up into a few functions, and should probably make its own arxiv calls based on user_feedback and research_interests. 
    - incorporate RAG 
- Port over to langchain to build a more conversational app
- Have a more stable way of inserting the user's API key
- A front end would be nice. Maybe GitHub pages for a basic UI? Try to find a template online? 
    #github pages apparently only serves static contact, which isnt great for us. Love the idea but sadly not viable and a more full stack is needed
- Connect to medrxiv and/or biorxiv? Building a routing system to direct API calls to the most appropriate pre-print server shouldn't be too hard.
    - Bug: It will generate answers to biological or medical questions without even trying to find any relevant literature to back up its answers. It doesn't know which questions are out of scope. Hallucinations could be an issue

In [4]:
from openai import OpenAI
import requests
import re

In [14]:
# Set your OpenAI API key here
OpenAI.api_key = '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'
#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
#client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

client = OpenAI(api_key=OpenAI.api_key)

In [15]:
def enhance_query(original_query):
    prompt = f"""
    Given a natural language query, convert it into a structured search query for the arXiv API. The arXiv API query format uses field prefixes like 'au' for author, 'ti' for title, 'cat' for category, and logical operators like 'AND', 'OR'. Below are examples of converting natural language queries into structured arXiv API queries:

    Here are a few examples:
    
    Natural Language Query: Papers by Albert Einstein about relativity
    Structured arXiv API Query: au:Albert Einstein AND all:relativity

    Natural Language Query: Quantum computing research after 2015
    Structured arXiv API Query: all:quantum computing AND submittedDate:[2015 TO *]

    Natural Language Query: Machine learning applications in finance
    Structured arXiv API Query: all:machine learning AND all:finance

    Now, convert the following natural language query into a structured arXiv API query:
    '{original_query}'
    Structured arXiv API Query:
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )
    # Extracting the structured query from the response
    arxiv_query = response.choices[0].message.content.strip()
    return arxiv_query

def search_arxiv(query):
    url = f'http://export.arxiv.org/api/query?search_query={query}&start=0&max_results=5'
    response = requests.get(url)
    return response.text

def extract_titles_and_summaries(xml_response):
    # Regex patterns to match titles and summaries
    title_pattern = re.compile(r'<title>(.*?)<\/title>')
    summary_pattern = re.compile(r'<summary>(.*?)<\/summary>', re.DOTALL)  # re.DOTALL to match across newlines

    titles = title_pattern.findall(xml_response)
    summaries = summary_pattern.findall(xml_response)

    # The first 'title' match is always "ArXiv Query: ..." so we skip it
    titles = titles[1:]

    # Pairing titles with summaries
    papers_info = [{"title": title, "summary": summary.strip()} for title, summary in zip(titles, summaries)]

    return papers_info

def summarize(initial_query, papers_info):
    prompt = f"The user asked: '{initial_query}'. Based on the following titles and summaries from academic papers, provide a detailed and accessible explanation of the topic:\n\n"
    
    for paper in papers_info:
        prompt += f"Title: {paper['title']}\nSummary: {paper['summary']}\n\n"
    
    prompt += "Please review the titles and summaries to provide a thoughtful response to the user's question."

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": prompt,
            }
        ],
    )
    
    thoughtful_response = response.choices[0].message.content.strip()
    return thoughtful_response

def suggest_research_directions(initial_query, thoughtful_response):
    """
    Generates novel research directions based on the user's feedback on a provided summary
    and their specific interests.

    Parameters:
    - initial_query: The original query posed by the user.
    - thoughtful_response: A comprehensive response to the initial query, summarizing relevant
      academic papers and insights.

    Returns:
    - A string containing suggestions for research trends, gaps, next steps, or future directions.
    """

    print("\n--- Research Direction Suggestion ---")
    user_feedback = input("What are your thoughts on the provided summary? Any specific areas of interest or questions that arise? ")

    research_interests = input("Could you specify any particular research interests or areas where you're seeking innovation? ")
    
    prompt = f"""
    Based on the initial inquiry about '{initial_query}' and the provided summary, the researcher shared their thoughts: '{user_feedback}'. They expressed a particular interest in '{research_interests}'.

    Considering the current state of research and potential future developments, identify emerging trends, and gaps in the literature, and suggest novel research directions or next steps that could significantly advance the field. Emphasize novelty and innovation in your suggestions.
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": prompt,
            }
        ],
    )

    research_suggestions = response.choices[0].message.content.strip()
    return research_suggestions

In [20]:
def main():
    user_query = input("Enter your search query: ")
    print("Converting your query into an arXiv-friendly format...")
    arxiv_query = enhance_query(user_query)
    print(f"Here is the arXiv Query I am using: {arxiv_query}\nFetching papers from arXiv...")
    arxiv_results = search_arxiv(arxiv_query)
    extracted_information = extract_titles_and_summaries(arxiv_results)
    # print(extracted_information)
    thoughtful_response = summarize(user_query, extracted_information)
    print("Here's what I found:\n", thoughtful_response)
    research_suggestions = suggest_research_directions(user_query, thoughtful_response)
    print("Here are a few research ideas to inspire your work:\n", research_suggestions)

if __name__ == "__main__":
    main()

Converting your query into an arXiv-friendly format...
Here is the arXiv Query I am using: au:John Lafferty
Fetching papers from arXiv...
Here's what I found:
 The papers written by John Lafferty focus on two main topics: "Denoising Flows on Trees" and "Nonparametric Reduced Rank Regression."

In the paper "Denoising Flows on Trees," Lafferty extends Pinsker's theorem in statistical theory to address estimation under storage or communication constraints. The focus is on characterizing the minimax rate in nonparametric estimation while placing limits on the bits used to encode an estimator. By analyzing the excess risk concerning these constraints, signal size, and noise level, the paper establishes a Pareto-optimal minimax tradeoff between storage and risk for the case of a Euclidean ball. This work aims to provide a comprehensive understanding of the tradeoff between storage limitations and estimation accuracy in a statistical setting.

In the paper "Nonparametric Reduced Rank Regress