<a href="https://colab.research.google.com/github/cnmnzhang/ml-concepts/blob/main/RAG_for_summarizing_research_trends_and_generating_research_ideas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG for summarizing research trends and generating research ideas

RAG, vector stores, Mistral, ArXiv API

Use case: Build a RAG system that summarizes trends in a **field of interest** from existing papers in that field. The system will also suggest research ideas. We will further prompt the LLM to give each idea a catchy title to see how creative the LLM can be :D




## Using RAG to enrich LLM Capabilities and address hallucination

While large language models (LLMs) have capabilities that enable advanced use cases, they suffer from issues such as factual inconsistency and hallucination. Retrieval-augmented generation (RAG) is an approach to enrich LLM capabilities and improve reliability. RAG combines an LLM with external knowledge: relevant information is retrieved and added to the prompt context to help accomplish the task.
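The retrieve-then-generate idea can be sketched without any external service. The example below is a toy illustration only: `retrieve` and `build_prompt` are hypothetical helper names, and the word-overlap scoring is a stand-in for the embedding-based retrieval used later in this notebook.

```python
def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k documents, and only those that share at least one word
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, context_docs):
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


docs = [
    "Sparsified genomics speeds up sequence comparison",
    "Diffusion models generate images",
    "Long-read sequencing improves genome assembly",
]
top_docs = retrieve("genomics sequence analysis", docs)
rag_prompt = build_prompt("What are the trends in genomics?", top_docs)
print(rag_prompt)
```

Only the first document shares words with the query, so it is the only one that reaches the prompt. A real retriever would compare embeddings instead of raw words, which is what the ChromaDB steps below do.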

## Services
Fireworks for fast access to open-source models

Obtain your Fireworks API Key to use the Mistral 7B model: https://readme.fireworks.ai/docs

Other open-source models here: https://app.fireworks.ai/models

Read more about the Fireworks APIs here: https://readme.fireworks.ai/reference/createchatcompletion

In [148]:
search_query = "genomics" #@param {type:"string"}


## Libraries

In [149]:
%%capture
!pip install chromadb tqdm fireworks-ai python-dotenv pandas
!pip install sentence-transformers
!pip install xmltodict

In [150]:
import fireworks.client
import os
import dotenv
import chromadb
import json
from tqdm.auto import tqdm
import pandas as pd
import random
from google.colab import userdata

# you can set envs using Colab secrets
dotenv.load_dotenv()

fireworks.client.api_key = userdata.get("FIREWORKS_API_KEY")

## Getting Started with Completions from Fireworks

Let's define a function to get completions from the Fireworks inference platform.

In [4]:
def get_completion(prompt, model=None, max_tokens=50):

    fw_model_dir = "accounts/fireworks/models/"

    if model is None:
        model = fw_model_dir + "llama-v2-7b"
    else:
        model = fw_model_dir + model

    completion = fireworks.client.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0
    )

    return completion.choices[0].text

Let's first try the function with a simple prompt:

In [5]:
get_completion("Hello, my name is")

' Katie and I am a 20 year old student at the University of Leeds. I am currently studying a BA in English Literature and Creative Writing. I have been working as a tutor for over 3 years now and I'

Now let's test with Mistral-7B-Instruct:

In [6]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("Hello, my name is", model=mistral_llm)

' [Your Name]. I am a [Your Profession/Occupation]. I am writing to [Purpose of Writing].\n\nI am writing to [Purpose of Writing] because [Reason for Writing]. I believe that ['

The Mistral 7B Instruct model needs to be instructed using special instruction tokens `[INST] <instruction> [/INST]` to get the right behavior. You can find more instructions on how to prompt Mistral 7B Instruct here: https://docs.mistral.ai/llm/mistral-instruct-v0.1
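A small helper can apply the instruction tokens consistently. This is a minimal sketch: `format_mistral_instruction` is a hypothetical name that is not part of the Fireworks client, and it only wraps a single-turn instruction.

```python
def format_mistral_instruction(instruction):
    """Wrap a single-turn instruction in Mistral's [INST] ... [/INST] tokens."""
    return f"[INST] {instruction.strip()} [/INST]"


wrapped = format_mistral_instruction("  Tell me 2 jokes ")
print(wrapped)  # prints: [INST] Tell me 2 jokes [/INST]
```

The wrapped string can then be passed as the `prompt` argument of `get_completion`.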

In [7]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("Tell me 2 jokes", model=mistral_llm)

".\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Did you hear about the mathematician who’s afraid of negative numbers? He will stop at nothing to avoid them."

In [8]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("[INST]Tell me 2 jokes[/INST]", model=mistral_llm)

" Sure, here are two jokes for you:\n\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Why did the tomato turn red? Because it saw the salad dressing!"

Now let's try with a more complex prompt that involves instructions:

In [9]:
prompt = """[INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:

{
  "name": "John Doe",
  "relationship": "Bride's cousin",
  "hometown": "New York, NY",
  "fun_fact": "Climbed Mount Everest in 2020",
  "attending_with": "Sophia Smith",
  "bride_groom_name": "Tom and Mary"
}

Use only the data provided in the JSON object above.

The senders of the letter is the bride and groom, Tom and Mary.
[/INST]"""

get_completion(prompt, model=mistral_llm, max_tokens=150)

" Dear John Doe,\n\nWe, Tom and Mary, would like to extend our heartfelt gratitude for your attendance at our wedding. It was a pleasure to have you there, and we truly appreciate the effort you made to be a part of our special day.\n\nWe were thrilled to learn about your fun fact - climbing Mount Everest is an incredible accomplishment! We hope you had a safe and memorable journey.\n\nThank you again for joining us on this special occasion. We hope to stay in touch and catch up on all the amazing things you've been up to.\n\nWith love,\n\nTom and Mary"

## RAG Use Case: ArXiv





### Step 1: Load the Dataset

Let's first load the dataset. We will use the ArXiv API to fetch up to 100 papers that match `search_query`.

In [126]:
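import urllib.parse
import urllib.request

# URL-encode the query so multi-word topics still form a valid URL
encoded_query = urllib.parse.quote(search_query)
url = f'http://export.arxiv.org/api/query?search_query=all:{encoded_query}&start=0&max_results=100'
with urllib.request.urlopen(url) as response:
    data_decoded = response.read().decode('utf-8')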
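The query URL can also be assembled from its parameters, which makes `start` and `max_results` easier to change and handles spaces in the search term. This is a sketch under assumptions: `build_arxiv_url` is a hypothetical helper, and it only covers the three parameters used in this notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_arxiv_url(search_query, start=0, max_results=100):
    """Build an ArXiv API query URL with percent-encoded parameters."""
    params = {
        "search_query": f"all:{search_query}",
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


example_url = build_arxiv_url("single cell genomics", max_results=10)
print(example_url)

# Decoding the query string recovers the original search term
decoded = parse_qs(urlparse(example_url).query)
print(decoded["search_query"])
```

Fetching in pages is then a matter of calling the helper with increasing `start` values.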


In [135]:
import xmltodict

xml_dict = xmltodict.parse(data_decoded)
feed = xml_dict['feed']
ml_papers_dict = feed['entry']
len(ml_papers_dict)

100
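Two details of the parsed feed are worth guarding against: `xmltodict` returns a single dict rather than a list when the feed contains only one entry, and titles and summaries keep the line breaks from the XML. The sketch below normalizes both; `normalize_entries` is a hypothetical helper, and the sample feed uses a placeholder ID rather than a real paper.

```python
def normalize_entries(feed):
    """Return feed entries as a list of dicts with whitespace-collapsed text."""
    entries = feed.get("entry", [])
    # A feed with exactly one <entry> is parsed as a dict, not a list
    if isinstance(entries, dict):
        entries = [entries]
    cleaned = []
    for entry in entries:
        cleaned.append({
            "id": entry["id"],
            # split() + join() collapses newlines and repeated spaces
            "title": " ".join(entry["title"].split()),
            "summary": " ".join(entry["summary"].split()),
        })
    return cleaned


# Placeholder data shaped like a parsed feed with a single entry
sample_feed = {
    "entry": {
        "id": "http://arxiv.org/abs/0000.00000v1",
        "title": "A Two-Line\n  Title",
        "summary": "First line\nsecond line.",
    }
}
papers = normalize_entries(sample_feed)
print(papers)
```

An empty result set, where the feed has no `entry` key at all, yields an empty list instead of a `KeyError`.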

In [136]:
# papers = pd.DataFrame(ml_papers_dict)[['id', 'published', 'title', 'summary']]
# papers = papers.dropna(subset=["title", "summary"])
# papers.head(2)

Alternatively, there is a [dataset](https://github.com/dair-ai/ML-Papers-of-the-Week/tree/main/research) that contains a list of weekly top trending ML papers.

In [158]:
# # load dataset from data/ folder to pandas dataframe
# # url = 'https://raw.githubusercontent.com/dair-ai/ML-Papers-of-the-Week/main/research/ml-potw-10232023.csv'
# ml_papers = pd.read_csv(url, index_col=0, header=0).reset_index()
# ml_papers = ml_papers.dropna(subset=["Title", "Description"])

# # convert dataframe to list of dicts with Title and Description columns only
# ml_papers_dict = ml_papers.to_dict(orient="records")
# len(ml_papers_dict)

In [139]:
ml_papers_dict[0]

{'id': 'http://arxiv.org/abs/2211.08157v2',
 'updated': '2023-01-18T17:32:16Z',
 'published': '2022-11-15T14:09:39Z',
 'title': 'Genome-on-Diet: Taming Large-Scale Genomic Analyses via Sparsified\n  Genomics',
 'summary': 'Searching for similar genomic sequences is an essential and fundamental step\nin biomedical research and an overwhelming majority of genomic analyses.\nState-of-the-art computational methods performing such comparisons fail to cope\nwith the exponential growth of genomic sequencing data. We introduce the\nconcept of sparsified genomics where we systematically exclude a large number\nof bases from genomic sequences and enable much faster and more\nmemory-efficient processing of the sparsified, shorter genomic sequences, while\nproviding similar or even higher accuracy compared to processing non-sparsified\nsequences. Sparsified genomics provides significant benefits to many genomic\nanalyses and has broad applicability. We show that sparsifying genomic\nsequences grea

### Step 2: Generate Embeddings with SentenceTransformer and Store in ChromaDB

In [140]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        batch_embeddings = embedding_model.encode(input)
        return batch_embeddings.tolist()

embed_fn = MyEmbeddingFunction()
# Initialize the chromadb directory, and client.
client = chromadb.PersistentClient(path="./chromadb")
# client.delete_collection(name=f"arxiv_{search_query}")

# create collection; passing embed_fn means query_texts are embedded with the same model
collection = client.get_or_create_collection(
    name=f"arxiv_{search_query}",
    embedding_function=embed_fn
)

In [142]:
# Generate embeddings, and index summaries in batches
batch_size = 50

# loop through batches, then generate and store embeddings
for i in tqdm(range(0, len(ml_papers_dict), batch_size)):

    i_end = min(i + batch_size, len(ml_papers_dict))
    batch = ml_papers_dict[i:i_end]

    # Replace summary with "No summary" if empty string
    batch_summaries = [str(paper["summary"]) if str(paper["summary"]) != "" else "No summary" for paper in batch]
    # Use the arXiv URL as a stable, unique ID so re-running this cell updates records instead of duplicating them
    batch_ids = [str(paper["id"]) for paper in batch]
    batch_metadata = [dict(url=paper["id"],
                           title=paper['title'])
                           for paper in batch]

    # generate embeddings
    batch_embeddings = embedding_model.encode(batch_summaries)

    # upsert to chromadb
    collection.upsert(
        ids=batch_ids,
        metadatas=batch_metadata,
        documents=batch_summaries,
        embeddings=batch_embeddings.tolist(),
    )

  0%|          | 0/2 [00:00<?, ?it/s]
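An upsert only overwrites an existing record when the ID is the same from one run to the next, so IDs need to be deterministic. If short fixed-length IDs are preferred over raw URLs, one option is to hash the arXiv URL. This is a sketch that assumes the URL uniquely identifies a paper version; `stable_id` is a hypothetical helper.

```python
import hashlib


def stable_id(arxiv_url):
    """Derive a deterministic 16-character ID from an arXiv URL."""
    return hashlib.sha1(arxiv_url.encode("utf-8")).hexdigest()[:16]


first = stable_id("http://arxiv.org/abs/2211.08157v2")
second = stable_id("http://arxiv.org/abs/2211.08157v2")
other = stable_id("http://arxiv.org/abs/2211.08157v1")

# Same URL gives the same ID; a different version gives a different ID
print(first == second, first == other)
```

Because the ID depends only on the URL, indexing the same paper twice leaves a single record in the collection.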

### Test Retriever

In [151]:
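# NOTE: this queries a collection named "ml-papers-nov-2023", not the
# arxiv_{search_query} collection built above; change the name to retrieve the ArXiv summaries
collection = client.get_or_create_collection(
    name="ml-papers-nov-2023",
    embedding_function=embed_fn
)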

retriever_results = collection.query(
    query_texts=["models"],
    # n_results=2,
)

print(retriever_results["documents"])

[['Consistency Models', 'Eight Things to Know about Large Language Models', 'An Overview on Language Models: Recent Developments and Outlook', 'Mastering Diverse Domains through World Models', 'Language Models can Solve Computer Tasks', 'Language Modeling is Compression', 'Large Language and Speech Model', 'Model Compression for LLMs', 'A Survey of Large Language Models', 'Toolformer: Language Models Can Teach Themselves to Use Tools']]
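The query result is a dict of nested lists, with one inner list per query text, which is why the printed output starts with `[[`. The sketch below turns such a result into a numbered context string. `format_context` is a hypothetical helper, and `mock_results` is placeholder data shaped like a query result rather than real retriever output.

```python
def format_context(results):
    """Format the hits for the first query as numbered 'title: document' lines."""
    documents = results["documents"][0]
    # Fall back to empty metadata when the result carries none
    metadatas = results.get("metadatas") or [[{}] * len(documents)]
    lines = []
    for rank, (doc, meta) in enumerate(zip(documents, metadatas[0]), start=1):
        title = (meta or {}).get("title", "untitled")
        lines.append(f"{rank}. {title}: {doc}")
    return "\n".join(lines)


# Placeholder data, not real retriever output
mock_results = {
    "documents": [["Abstract about sparsified genomics.", "Abstract about protein design."]],
    "metadatas": [[{"title": "Paper A"}, {"title": "Paper B"}]],
}
context = format_context(mock_results)
print(context)
```

Including the title next to each document gives the LLM a handle for referring to individual papers in its answer.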


### Prompting

In [154]:
# user query
# user_query = "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models"
user_query = "genomics"
# query for user query
results = collection.query(
    query_texts=[user_query],
    n_results=10,
)

# concatenate titles into a single string
short_titles = '\n'.join(results['documents'][0])

prompt_template = f'''[INST]

Your task one is to list machine learning trends in {search_query} and {user_query} based on the recent innovative papers in the field listed in SHORT_TITLES.

Your task two is for each trend, come up with a novel research ideas. For each idea, include a catchy title and methods summary
PLEASE DO NOT include current SHORT_TITLES in the SUGGESTED_RESEARCH.

SHORT_TITLES: {short_titles}

SUGGESTED_RESEARCH:

[/INST]
'''


print("\n\n\nPrompt Template:")
print(prompt_template)




Prompt Template:
[INST]

Your task one is to list machine learning trends in genomics and genomics based on the recent innovative papers in the field listed in SHORT_TITLES. 

Your task two is for each trend, come up with a novel research ideas. For each idea, include a catchy title and methods summary
PLEASE DO NOT include current SHORT_TITLES in the SUGGESTED_RESEARCH.

SHORT_TITLES: BiomedGPT
Large language models generate functional protein sequences across diverse families
Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl
SequenceMatch
Single–amino acid changes in proteins sometimes have little effect but can often lead to problems in protein folding, activity, or stability. Only a small fraction of variants have been experimentally investigated, but there are vast amounts of biological sequence data that are suitable for use as training data for machine learning approaches. Cheng et al. developed AlphaMissense, a deep learning model that builds on 

In [155]:
# get_completion returns a single string, so no joining is needed
suggested_titles = get_completion(prompt_template, model=mistral_llm, max_tokens=2000)

# Print the suggestions.
print("Model Suggestions:")
print(suggested_titles)

Model Suggestions:

Task One: Machine Learning Trends in Genomics

1. Deep Learning for Protein Structure Prediction
Title: AlphaFold2: A Deep Learning Model for Protein Structure Prediction
Methods Summary: The AlphaFold2 model uses deep learning to predict protein structures based on amino acid sequences. The model is trained on population frequency data and uses sequence and predicted structural context to make accurate predictions. The model can be used to predict the structures of all single-amino acid substitutions in the human proteome.
2. Interpretable Machine Learning for Science
Title: PySR and SymbolicRegression.jl: Interpretable Machine Learning for Science
Methods Summary: PySR and SymbolicRegression.jl are machine learning tools that are designed to be interpretable. These tools can be used to analyze scientific data and make predictions. The tools are designed to be easy to use and understand, making them a valuable resource for scientists.
3. Single-Cell Multi-omics Usi