## TODO 
- maybe some better model for embedding extraction, maybe we could fine tune?
- make the quantization work (apparently you have to have linux (for BitsAndBytes) or wsl  at least so maybe we could dockerize or something), or choose a simpler model
- better prompt for the chatbot 
- somehow test the implementation

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer 
import arxiv

  from .autonotebook import tqdm as notebook_tqdm


## Get the papers based on the user query

In [2]:
# Load the sentence transformer model
# TODO maybe some better model for embedding extraction, maybe we could fine tune?
model = SentenceTransformer('all-MiniLM-L6-v2') 

# Define your query
user_query = "This paper presents some preliminary investigations of a new co-attention, the dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer,"

# Get the embedding for the query
query_embedding = model.encode([user_query])

search = arxiv.Search(
    query=user_query,
    max_results=50,
    sort_by=arxiv.SortCriterion.Relevance,
    sort_order=arxiv.SortOrder.Descending
)

client = arxiv.Client()
results = list(client.results(search))

# Extract summaries and titles
papers = []
summaries = []
for result in results:
    title = result.title
    authors = ', '.join([author.name for author in result.authors])
    summary = result.summary
    url = f"https://arxiv.org/abs/{result.entry_id.split('/')[-1]}"
    papers.append({
        "title": title,
        "authors": authors,
        "summary": summary,
        "url": url
    })
    summaries.append(summary)

# Encode all summaries
summary_embeddings = model.encode(summaries)

# Compute cosine similarities
similarities = cosine_similarity(query_embedding, summary_embeddings)[0]

# Attach similarity scores to papers and sort
for i, paper in enumerate(papers):
    paper["similarity"] = similarities[i]

top_papers = sorted(papers, key=lambda x: x["similarity"], reverse=True)[:5] # top 5

# Print top 5 similar papers
for i, paper in enumerate(top_papers, 1):
    print(f"Rank #{i}")
    print(f"Title: {paper['title']}")
    print(f"Authors: {paper['authors']}")
    print(f"Summary: {paper['summary']}")
    print(f"Similarity: {paper['similarity']:.4f}")
    print(f"URL: {paper['url']}")
    print("-" * 80)


Rank #1
Title: Two-Headed Monster And Crossed Co-Attention Networks
Authors: Yaoyiran Li, Jing Jiang
Summary: This paper presents some preliminary investigations of a new co-attention
mechanism in neural transduction models. We propose a paradigm, termed
Two-Headed Monster (THM), which consists of two symmetric encoder modules and
one decoder module connected with co-attention. As a specific and concrete
implementation of THM, Crossed Co-Attention Networks (CCNs) are designed based
on the Transformer model. We demonstrate CCNs on WMT 2014 EN-DE and WMT 2016
EN-FI translation tasks and our model outperforms the strong Transformer
baseline by 0.51 (big) and 0.74 (base) BLEU points on EN-DE and by 0.17 (big)
and 0.47 (base) BLEU points on EN-FI.
Similarity: 0.7980
URL: https://arxiv.org/abs/1911.03897v1
--------------------------------------------------------------------------------
Rank #2
Title: Understanding How Encoder-Decoder Architectures Attend
Authors: Kyle Aitken, Vinay V Ramases

## Generate a response by incoroprating the retrieved papers with a chatbot

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
import torch

# Load a chat-capable model
# TODO make the quantization work (apparently you have to have linux (for BitsAndBytes) or wsl  at least
# so maybe we could dockerize or something), or choose a simpler model
model_id = "HuggingFaceH4/zephyr-7b-beta"  # You can replace with another chat model if you want
tokenizer = AutoTokenizer.from_pretrained(model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
# Setup chat pipeline
rag = pipeline("chat", model=model, tokenizer=tokenizer)

# Combine summaries into a context string
context = "\n\n".join(
    f"Title: {paper['title']}\nSummary: {paper['summary']}" for paper in top_papers
)

# Prepare system + user messages
# TODO prompt engineering: make up an even better prompt, maybe specific for the 
messages = [
    {"role": "system", "content": "You are a helpful AI that answers user questions based on provided research papers."},
    {"role": "user", "content": f"""Here are some research papers:

{context}

Use these summaries to answer the following research question (also cite the papers):

Question: {user_query}
Answer:"""}
]

# Generate the answer
output = rag(messages, max_new_tokens=300)

# Print results
print("Research Papers and Generated Answer:")
print(f"Research Papers:\n{context}")  # Full context
print(f"Generated Answer:\n{output[0]['generated_text']}")


CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

rag = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Combine summaries into a context string, but make sure it's within the token limit
context = "\n\n".join(
    f"Title: {paper['title']}\nSummary: {paper['summary']}" for paper in top_papers
)

# Encode the context and check its length
input_ids = tokenizer.encode(context, return_tensors="pt")
max_length = 1700  # Adjust this based on your model's max token length

# Truncate if necessary to fit within the max token limit
if input_ids.shape[1] > max_length:
    input_ids = input_ids[:, :max_length]


# Prepare the prompt, ensuring it stays within the token limit
prompt = f"""Here are some research papers:

{context[:max_length]}  # Only include a truncated context if necessary

Use the above research paper summaries to answer the following question:

Question: {user_query}
Answer:"""

# Generate the answer using the same prompt
output = rag(prompt, max_new_tokens=300)

# Provide the generated answer along with the papers
print("Research Papers and Generated Answer:")
print(f"Research Papers:\n{context[:max_length]}")  # Display truncated context
print(f"Generated Answer:\n{output[0]['generated_text']}")


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1280 > 512). Running this sequence through the model will result in indexing errors


Research Papers and Generated Answer:
Research Papers:
Title: Two-Headed Monster And Crossed Co-Attention Networks
Summary: This paper presents some preliminary investigations of a new co-attention
mechanism in neural transduction models. We propose a paradigm, termed
Two-Headed Monster (THM), which consists of two symmetric encoder modules and
one decoder module connected with co-attention. As a specific and concrete
implementation of THM, Crossed Co-Attention Networks (CCNs) are designed based
on the Transformer model. We demonstrate CCNs on WMT 2014 EN-DE and WMT 2016
EN-FI translation tasks and our model outperforms the strong Transformer
baseline by 0.51 (big) and 0.74 (base) BLEU points on EN-DE and by 0.17 (big)
and 0.47 (base) BLEU points on EN-FI.

Title: Understanding How Encoder-Decoder Architectures Attend
Summary: Encoder-decoder networks with attention have proven to be a powerful way to
solve many sequence-to-sequence tasks. In these networks, attention aligns
encoder an