## Generation

In [69]:
# Critic Chain: Takes a hypothesis and provides a critique.
critic_template = """
Given the following hypothesis:
{hypothesis}

Analyze each part of the hypothesis:
1. Clarity: Are the statements clear and easy to understand?
2. Coherence: Do the different parts of the hypothesis logically flow together? Does it seem like the proposed solution would work?
3. Scientific Validity: Are there any scientific inaccuracies or assumptions that seem unfounded?

After your analysis, provide specific feedback on how each field (Problem, Solution, Methodology, Evaluation, Results) can be improved.
"""
critic_prompt = PromptTemplate(input_variables=["hypothesis"], template=critic_template)
critic_chain = LLMChain(llm=llm, prompt=critic_prompt, output_key="critique")

# Reviser Chain: Takes original hypothesis and critique, then provides a revised hypothesis.
reviser_template = """
Given the original hypothesis:
{hypothesis}

And based on the critique provided:
{critique}

Revise the hypothesis by addressing each point of the critique. Ensure the new version is clearer, more coherent, and scientifically valid.
"""
reviser_prompt = PromptTemplate(input_variables=["hypothesis", "critique"], template=reviser_template)
reviser_chain = LLMChain(llm=llm, prompt=reviser_prompt, output_key="revised_hypothesis")

# Overall Sequential Chain: Critiques the hypothesis and then revises it.
overall_chain = SequentialChain(
    chains=[critic_chain, reviser_chain],
    input_variables=["hypothesis"],
    output_variables=["critique", "revised_hypothesis"],
    verbose=True)

# Test the chain with a provided hypothesis
hypothesis = {
    "Problem": "Issues in determining basic characteristics of black holes and their surrounding disks in X-ray binary, using models of the source's disk X-ray continuum. A key issue is the determination of the \"color correction factor\".",
    "Solution": "Using observational data to estimate the color correction factor by modeling the disk spectrum with saturated Compton scattering.",
    "Methodology": "The work is based on two observations made by XMM-Newton on GX 339-4. These observations offer high-quality data at low energies. The spectra were then fitted to these models.",
    "Evaluation": "The quality of fit of the spectra to the models was examined. Other models were also tested for fit.",
    "Results": "The spectra fits well with the model and provides reasonable values for the color correction factor. However, the high-soft-state continuum cannot be adequately fitted by the latest disk models."
}
result = overall_chain({"hypothesis": str(hypothesis)})




> Entering new SequentialChain chain...

> Finished chain.


In [70]:
print(result['critique'])

Analysis:

1. Clarity: The hypothesis is relatively clear but contains technical terms that may be difficult for non-experts to understand. For example, terms like "X-ray binary", "color correction factor", "saturated Compton scattering", and "high-soft-state continuum" may require more explanation or context.
   
2. Coherence: The different parts of the hypothesis logically flow together. The problem is clearly stated, a solution is proposed, the methodology is explained, an evaluation process is detailed, and results are provided. 

3. Scientific Validity: The hypothesis appears to be scientifically valid. It uses real-world observational data and incorporates established scientific models. However, the validity of the results would depend on the accuracy and reliability of the data and models used.

Feedback:

Problem: The problem statement could be improved by providing more background information to help non-experts understand the issue. For example, briefly explaining what a blac

In [71]:
print(result['revised_hypothesis'])

{'Problem': 'The study aims to determine basic attributes of black holes and their surrounding disks, specifically in X-ray binary systems. In simple terms, X-ray binaries are systems where a black hole and a star are in close proximity, and the intense gravitational pull of the black hole causes matter from the star to form a disk around the black hole. The main challenge is the determination of the "color correction factor", a numerical factor that helps us to understand the temperature and energy of the emitting source.', 

'Solution': 'The proposed solution is to estimate the color correction factor using observational data. This is achieved by modeling the spectrum of the disk around the black hole with a phenomenon known as saturated Compton scattering, where photons gain energy by interacting with high-energy particles.', 

'Methodology': 'The research utilized two observational datasets provided by the XMM-Newton space telescope on the X-ray binary system GX 339-4. The high-qua

In [72]:
import ast

hypothesis_dict = ast.literal_eval(result['revised_hypothesis'])
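
`ast.literal_eval` raises an exception when the model wraps the dict in extra prose or returns text that is not a valid Python literal. The following is a minimal sketch of a more defensive parser. The helper name `parse_hypothesis` and the required-field check are assumptions made for illustration; they are not part of the notebook.

```python
import ast

# Field names taken from the hypothesis dict used in this notebook.
REQUIRED_FIELDS = ("Problem", "Solution", "Methodology", "Evaluation", "Results")

def parse_hypothesis(text):
    """Return the dict embedded in `text`, or None if it cannot be parsed."""
    # Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    # Reject results that are missing any of the expected fields.
    if any(field not in parsed for field in REQUIRED_FIELDS):
        return None
    return parsed
```

Returning `None` instead of raising lets the caller decide whether to retry the chain or keep the previous version.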

In [1]:
import ast
import yaml
from langchain.chains import SequentialChain
from langchain.memory import SimpleMemory
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load the Azure OpenAI settings from the project config file.
with open("../config.yml") as config_file:
    config = yaml.safe_load(config_file)
API_KEY = config['api_key']
DEPLOYMENT_NAME = config['deployment_name']
BASE_URL = config['base_url']
API_VERSION = config['api_version']

llm = AzureChatOpenAI(
    openai_api_base=BASE_URL,
    openai_api_version=API_VERSION,
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_type="azure",
)

# Critic Chain: Takes a hypothesis and provides a critique.
critic_template = """
Given the following hypothesis:
{hypothesis}

Analyze each part of the hypothesis:
1. Clarity: Are the statements clear and easy to understand?
2. Coherence: Do the different parts of the hypothesis logically flow together? Does it seem like the proposed solution would work?
3. Scientific Validity: Are there any scientific inaccuracies or assumptions that seem unfounded?

After your analysis, provide specific feedback on how each field (Problem, Solution, Methodology, Evaluation, Results) can be improved.
Specifically, give feedback on the frailties of the idea as a whole, and suggest potential enhancements.
"""

# Reviser Chain: Takes original hypothesis and critique, then provides a revised hypothesis.
reviser_template = """
Given the original hypothesis:
{hypothesis}

And based on the critique provided:
{critique}

Revise the hypothesis by addressing each point of the critique. Ensure the new version is clearer, more coherent, and scientifically valid.
"""

def adversarial_update_hypothesis(hypothesis):
    critic_prompt = PromptTemplate(input_variables=["hypothesis"], template=critic_template)
    critic_chain = LLMChain(llm=llm, prompt=critic_prompt, output_key="critique")
    reviser_prompt = PromptTemplate(input_variables=["hypothesis", "critique"], template=reviser_template)
    reviser_chain = LLMChain(llm=llm, prompt=reviser_prompt, output_key="revised_hypothesis")
    overall_chain = SequentialChain(
        chains=[critic_chain, reviser_chain],
        input_variables=["hypothesis"],
        output_variables=["critique", "revised_hypothesis"],
        verbose=False)
    result = overall_chain({"hypothesis": str(hypothesis)})
    return result

In [2]:
import re
from preprocessing import Hypothesis
from tqdm.notebook import tqdm

def improve_hypothesis(hypothesis, n_iters: int = 3) -> str:
    """Run the critic/reviser chain n_iters times and return the final hypothesis string."""
    for _ in tqdm(range(n_iters)):
        revised_hypothesis = adversarial_update_hypothesis(hypothesis)
        hyp_str = revised_hypothesis['revised_hypothesis']
        hyp_str = hyp_str.replace('\n', ' ')
        # Remove everything before the first { and after the last }
        hyp_str = re.sub(r'^.*?\{', '{', hyp_str)
        hyp_str = re.sub(r'\}[^}]*$', '}', hyp_str)
        hypothesis = hyp_str
    return hypothesis

hypothesis_to_improve = {
    "Problem": "Issues in determining basic characteristics of black holes and their surrounding disks in X-ray binary, using models of the source's disk X-ray continuum. A key issue is the determination of the \"color correction factor\".",
    "Solution": "Using observational data to estimate the color correction factor by modeling the disk spectrum with saturated Compton scattering.",
    "Methodology": "The work is based on two observations made by XMM-Newton on GX 339-4. These observations offer high-quality data at low energies. The spectra were then fitted to these models.",
    "Evaluation": "The quality of fit of the spectra to the models was examined. Other models were also tested for fit.",
    "Results": "The spectra fits well with the model and provides reasonable values for the color correction factor. However, the high-soft-state continuum cannot be adequately fitted by the latest disk models."
}

improve_hypothesis(hypothesis=hypothesis_to_improve, n_iters=3)

  0%|          | 0/3 [00:00<?, ?it/s]

'{\'Problem\': \'We are tasked with the challenge of accurately determining the fundamental features of black holes and the disks that surround them in X-ray binary, a type of binary star system that emits X-rays. This process is pivotal to enhancing our understanding of how matter behaves in extreme gravitational fields and the nature of black holes. The importance of achieving this lies in its potential to contribute significantly to the broader field of astrophysics, enabling us to understand the universe better. The critical part of this process is the determination of the "color correction factor", a measure of how the color of an object appears to change due to relativistic effects when it is moving at speeds close to the speed of light.\',   \'Solution\': \'We propose a solution that utilizes observational data to derive estimates for the color correction factor. This involves modeling the disk spectrum through the innovative concept of "saturated Compton scattering". In this pr
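
`improve_hypothesis` returns only the final string, so the intermediate revisions cannot be compared with the original. A minimal sketch of the same critique-then-revise loop that keeps every version is shown here. The names `refine`, `toy_critic`, and `toy_reviser` are hypothetical, and the toy functions stand in for the LLM chains so the sketch runs offline.

```python
from typing import Callable, List

def refine(hypothesis: str,
           critic: Callable[[str], str],
           reviser: Callable[[str, str], str],
           n_iters: int = 3) -> List[str]:
    """Run critique -> revise n_iters times; return every version, oldest first."""
    history = [hypothesis]
    for _ in range(n_iters):
        critique = critic(history[-1])
        history.append(reviser(history[-1], critique))
    return history

# Stand-in functions (assumptions) so the loop can run without an LLM.
def toy_critic(hypothesis: str) -> str:
    return f"too short ({len(hypothesis)} chars)"

def toy_reviser(hypothesis: str, critique: str) -> str:
    return hypothesis + "!"

versions = refine("idea", toy_critic, toy_reviser, n_iters=3)
```

In the notebook's setting, `critic` and `reviser` would wrap the two `LLMChain` objects; `versions[0]` is always the untouched input.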

In [22]:
import pandas as pd

df = pd.read_csv("../data/tuning/cs.LG.csv")
print(f"There are {len(df)} hypotheses in the dataset.")
abstracts = pd.read_csv("../data/processed/arxiv-cs.LG.csv", low_memory=False)
print(f"There are {len(abstracts)} abstracts in the dataset.")

There are 12057 hypotheses in the dataset.
There are 147502 abstracts in the dataset.


In [36]:
# Randomly sample a row from the dataframe
# Generate a random index in [0, len(df) - 1]
import random

random_idx = random.randint(0, len(df)-1)
print('-'*75)
print(df.iloc[random_idx]['title'])
print('-'*75)
print("Bit: "+df.iloc[random_idx]['bit']+"\n")
print("Flip: "+df.iloc[random_idx]['flip'])

---------------------------------------------------------------------------
Online Learning in Decentralized Multiuser Resource Sharing Problems
---------------------------------------------------------------------------
Bit: In decentralized systems, resource sharing is typically handled without considering the time-varying and unknown qualities of the resources. The traditional approach assumes that users operate independently, with no communication between them, and that the use of a resource by multiple users leads to reduced quality due to congestion. This approach also overlooks the potential benefits of user-specific rewards and the cooperative goal of achieving the highest system utility.

Flip: This research proposes a shift towards considering the time-varying and unknown qualities of resources in a decentralized system. It suggests the implementation of distributed learning algorithms that allow for user-specific rewards and costly communication between users, with the aim o

In [38]:
# Get the original abstract using the index
title = df.iloc[random_idx]['title']
corresponding_idx = int(abstracts[abstracts.title == title].index[0])
print('-'*75)
print(abstracts.iloc[corresponding_idx]['title'])
print()
print(abstracts.iloc[corresponding_idx]['abstract'])

---------------------------------------------------------------------------
Online Learning in Decentralized Multiuser Resource Sharing Problems

  In this paper, we consider the general scenario of resource sharing in a
decentralized system when the resource rewards/qualities are time-varying and
unknown to the users, and using the same resource by multiple users leads to
reduced quality due to resource sharing. Firstly, we consider a
user-independent reward model with no communication between the users, where a
user gets feedback about the congestion level in the resource it uses.
Secondly, we consider user-specific rewards and allow costly communication
between the users. The users have a cooperative goal of achieving the highest
system utility. There are multiple obstacles in achieving this goal such as the
decentralized nature of the system, unknown resource qualities, communication,
computation and switching costs. We propose distributed learning algorithms
with logarithmic regre
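
The boolean-mask lookup above scans the whole abstracts table for each title and raises an error when the title is missing. A minimal sketch of a precomputed title index follows. It uses plain lists of dicts for illustration, and `build_title_index`, `lookup_abstract`, and the sample rows are hypothetical.

```python
def build_title_index(records):
    """Map each title to the position of its first occurrence in `records`."""
    index = {}
    for position, record in enumerate(records):
        # setdefault keeps the first position when a title is duplicated.
        index.setdefault(record["title"], position)
    return index

# Made-up rows standing in for the abstracts table.
abstract_rows = [
    {"title": "Paper A", "abstract": "First abstract."},
    {"title": "Paper B", "abstract": "Second abstract."},
    {"title": "Paper A", "abstract": "Duplicate title."},
]
title_index = build_title_index(abstract_rows)

def lookup_abstract(title):
    """Return the abstract for `title`, or None when the title is unknown."""
    position = title_index.get(title)
    return None if position is None else abstract_rows[position]["abstract"]
```

With a DataFrame, the records could be obtained with `abstracts.to_dict("records")`, at the cost of holding a second copy in memory.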

In [40]:
path = '../data/processed/arxiv-cs.LG.csv'

df = pd.read_csv(path)
df.head()

Unnamed: 0,authors,title,doi,categories,abstract,authors_parsed
0,Maxim Raginsky,Learning from compressed observations,10.1109/ITW.2007.4313111,cs.IT cs.LG math.IT,The problem of statistical learning is to co...,"[['Raginsky', 'Maxim', '']]"
1,Soummya Kar and Jose M. F. Moura,Sensor Networks with Random Links: Topology De...,10.1109/TSP.2008.920143,cs.IT cs.LG math.IT,"In a sensor network, in practice, the commun...","[['Kar', 'Soummya', ''], ['Moura', 'Jose M. F...."
2,"Andras Gyorgy, Tamas Linder, Gabor Lugosi, Gyo...",The on-line shortest path problem under partia...,,cs.LG cs.SC,The on-line shortest path problem is conside...,"[['Gyorgy', 'Andras', ''], ['Linder', 'Tamas',..."
3,Jianlin Cheng,A neural network approach to ordinal regression,,cs.LG cs.AI cs.NE,Ordinal regression is an important type of l...,"[['Cheng', 'Jianlin', '']]"
4,David H. Wolpert and Dev G. Rajnarayan,Parametric Learning and Monte Carlo Optimization,,cs.LG,This paper uncovers and explores the close r...,"[['Wolpert', 'David H.', ''], ['Rajnarayan', '..."


In [41]:
df.shape

(147502, 6)

# Vector database

In [138]:
import os
import yaml
from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

# Load configuration and set API key
config = yaml.safe_load(open("../config.yml"))
API_KEY = config['embedding_api_key']
DEPLOYMENT_NAME = "embedding"
BASE_URL = config['embedding_base_url']
API_VERSION = config['api_version']
os.environ['OPENAI_API_KEY'] = API_KEY

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    openai_api_base=BASE_URL,
    deployment=DEPLOYMENT_NAME,
    model="text-embedding-ada-002",
    openai_api_key=API_KEY,
    openai_api_type="azure",
    chunk_size=16,
)

# Initialize chat model
llm = AzureChatOpenAI(
    openai_api_base=config['base_url'],
    openai_api_version=API_VERSION,
    deployment_name=config['deployment_name'],
    openai_api_key=config['api_key'],
    openai_api_type="azure",
    temperature=0.0,
)

# Load Documents
loader = CSVLoader(file_path='../data/processed/arxiv-cs.LG.csv', source_column="authors")
documents = loader.load()
documents = documents[:50]

# Create Chroma Vectorstore
vectordb = Chroma.from_documents(
    documents=documents, embedding=embeddings,
)

prompt_template = """
You are a NeurIPS paper reviewer determining the novelty of an idea. Use the following pieces of context to determine whether the following idea is novel.

Please perform a concise analysis to evaluate the novelty of this "flip" by:

* Describing the originality of the new idea in relation to the original "bit".
* Comparing it to existing ideas or implementations that are similar.
* Assessing its potential impact or utility in its domain.
* Summing up the above findings.

Here are some other relevant ideas from paper abstracts that you can use for assessing novelty:

{context}

Here is the bit-flip: {question}

Based on this context and your own knowledge, determine the novelty of the flip component of the idea. Please provide your final output as a novelty score between 0 and 5, where 0 means "Not Novel" and 5 means "Highly Novel".
If the idea already exists in the provided abstracts, provide a low score. If the idea is incremental or not very different, also provide a low score.
Your final output should be of the form 'Novelty Score: x'.
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectordb.as_retriever(), chain_type_kwargs=chain_type_kwargs)

bit_flip = {
    'bit': """Probabilistic principal component analysis (PPCA) seeks a low dimensional representation of a data set by solving an eigenvalue problem on the sample covariance matrix, assuming independent spherical Gaussian noise.""",
    'flip': """Instead of solely relying on PPCA, the data variance can be further decomposed into its components through a generalised eigenvalue problem, called residual component analysis (RCA), which considers other factors like sparse conditional dependencies and temporal correlations that leave some residual variance.""",
}

query = f"Bit: {bit_flip['bit']}. Flip: {bit_flip['flip']}"
output = qa.run(query)
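
Conceptually, the `"stuff"` chain type places all retrieved documents into the `{context}` slot of the prompt and the query into `{question}`. The sketch below illustrates that assembly with plain string formatting. It is a simplified illustration rather than LangChain's implementation, and `stuff_prompt`, `TEMPLATE`, and the sample texts are hypothetical.

```python
# Shortened template with the same two placeholders as the notebook's prompt.
TEMPLATE = (
    "Relevant abstracts:\n\n{context}\n\n"
    "Here is the bit-flip: {question}\n\n"
    "Your final output should be of the form 'Novelty Score: x'."
)

def stuff_prompt(template, retrieved_texts, question, separator="\n\n"):
    """Join the retrieved texts and fill both placeholders of the template."""
    context = separator.join(retrieved_texts)
    return template.format(context=context, question=question)

prompt = stuff_prompt(
    TEMPLATE,
    ["Abstract one.", "Abstract two."],
    "Bit: old assumption. Flip: new idea.",
)
```

Because every retrieved document is included in full, the prompt length grows with the number and size of the documents returned by the retriever.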

In [139]:
import re

def parse_novelty_score(output):
    """Return the integer novelty score from the model output, or None if absent."""
    match = re.search(r'Novelty Score: (\d)', output)
    if match:
        return int(match.group(1))
    return None

parse_novelty_score(output)

4
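
The single-digit pattern above is case-sensitive and would read a reply such as `Novelty Score: 10` as 1. A minimal sketch of a more tolerant parser follows. The function name `parse_novelty_score_lenient` is hypothetical, and taking the last match and range-checking against 0 to 5 are design assumptions, not behavior from the notebook.

```python
import re
from typing import Optional

# Accept any letter case, ":" or "=", and an optional decimal part.
_SCORE_RE = re.compile(r"novelty\s+score\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_novelty_score_lenient(output: str,
                                low: float = 0,
                                high: float = 5) -> Optional[float]:
    """Return the last in-range novelty score in `output`, or None."""
    matches = _SCORE_RE.findall(output)
    if not matches:
        return None
    # The prompt asks for the score as the final output, so use the last match.
    score = float(matches[-1])
    return score if low <= score <= high else None
```

Returning `None` for out-of-range values keeps malformed replies out of any downstream averages.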

In [17]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import yaml
import os
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from tqdm import tqdm

def worker(text):
    """Worker function to compute embeddings for a text."""
    if text is None:
        return None
    embedded_text = embeddings.embed_query(text)
    return (text, embedded_text)

# Load configuration and set API key
config = yaml.safe_load(open("../config.yml"))
API_KEY = config['embedding_api_key']
DEPLOYMENT_NAME = "embedding"
BASE_URL = config['embedding_base_url']
os.environ['OPENAI_API_KEY'] = API_KEY

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    openai_api_base=BASE_URL,
    deployment=DEPLOYMENT_NAME,
    model="text-embedding-ada-002",
    openai_api_key=API_KEY,
    openai_api_type="azure",
)

# Load CSV file directly
df = pd.read_csv('../data/processed/arxiv-cs.LG.csv', low_memory=False)

# Create a list of strings by concatenating 'title' and 'abstract'
documents = [f"{row['title']} {row['abstract']}" for index, row in df.head(50).iterrows()]

# Initialize an empty list to hold (text, embedding) tuples
text_embeddings = []

# Use ThreadPoolExecutor with max_workers for threading
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(tqdm(executor.map(worker, documents), total=len(documents), desc="Computing Embeddings"))
    # worker returns None for missing text, so drop those results before using them
    text_embeddings.extend([pair for pair in results if pair is not None and pair[1]])

# Create a FAISS Vectorstore and add the embeddings
vectordb = FAISS.from_embeddings(text_embeddings, embedding=embeddings)

Computing Embeddings: 100%|██████████| 50/50 [00:05<00:00,  9.28it/s]
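
The threading pattern used above can be exercised without an API key. The sketch below filters empty inputs before submitting work and relies on `executor.map` returning results in input order. `fake_embed` is a hypothetical stand-in for the real embedding call, and `embed_all` is an illustrative helper name.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_embed(text):
    """Hypothetical stand-in for an embedding call: returns a tiny vector."""
    return [float(len(text)), float(text.count(" "))]

def embed_all(texts, embed_fn, max_workers=5):
    """Embed the non-empty texts in parallel and return (text, vector) pairs."""
    # Drop None and empty strings up front so no worker receives them.
    clean = [text for text in texts if text]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `clean`.
        vectors = list(executor.map(embed_fn, clean))
    return list(zip(clean, vectors))

pairs = embed_all(["a b", None, "", "hello"], fake_embed)
```

Threads suit this workload because each call mostly waits on the network; a rate-limited endpoint may still require a smaller `max_workers`.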


In [18]:
query = "Gaussian processes"
docs = vectordb.similarity_search(query)

In [19]:
docs[0]

Document(page_content="Model Selection Through Sparse Maximum Likelihood Estimation   We consider the problem of estimating the parameters of a Gaussian or binary\ndistribution in such a way that the resulting undirected graphical model is\nsparse. Our approach is to solve a maximum likelihood problem with an added\nl_1-norm penalty term. The problem as formulated is convex but the memory\nrequirements and complexity of existing interior point methods are prohibitive\nfor problems with more than tens of nodes. We present two new algorithms for\nsolving problems with at least a thousand nodes in the Gaussian case. Our first\nalgorithm uses block coordinate descent, and can be interpreted as recursive\nl_1-norm penalized regression. Our second algorithm, based on Nesterov's first\norder method, yields a complexity estimate with a better dependence on problem\nsize than existing interior point methods. Using a log determinant relaxation\nof the log partition function (Wainwright & Jordan 

In [20]:
new_db = FAISS.load_local("../data/vectorstore/arxiv-cs.LG", embeddings)
query = "Gaussian processes"
docs = new_db.similarity_search(query)
docs[0]

Document(page_content='Gaussian Processes for Nonlinear Signal Processing   Gaussian processes (GPs) are versatile tools that have been successfully\nemployed to solve nonlinear estimation problems in machine learning, but that\nare rarely used in signal processing. In this tutorial, we present GPs for\nregression as a natural nonlinear extension to optimal Wiener filtering. After\nestablishing their basic formulation, we discuss several important aspects and\nextensions, including recursive and adaptive algorithms for dealing with\nnon-stationarity, low-complexity solutions, non-Gaussian noise models and\nclassification scenarios. Furthermore, we provide a selection of relevant\napplications to wireless digital communications.\n', metadata={})
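
The in-memory index built from 50 documents and the saved index return different top hits for the same query, since a nearest-neighbor search can only return what the index contains. The sketch below shows the idea with a brute-force top-k search in pure Python. Cosine similarity is used here for illustration and may differ from the metric the FAISS index uses; `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, text_embeddings, k=4):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in text_embeddings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy (text, vector) pairs in the same shape as `text_embeddings` above.
toy_index = [("gp", [1.0, 0.0]), ("svm", [0.0, 1.0]), ("mix", [1.0, 1.0])]
nearest = top_k([1.0, 0.1], toy_index, k=2)
```

A brute-force scan like this is linear in the number of documents, which is the cost a vector index is meant to reduce.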