<a href="https://colab.research.google.com/github/chapSKor/basicRAGs/blob/main/RAG_basic_t5_finance_phrasebank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
pip install datasets transformers sentence-transformers scikit-learn torch



In [3]:
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from transformers import T5ForConditionalGeneration, T5Tokenizer
import numpy as np

In [35]:
# Step 1: Load a text dataset: Financial PhraseBank, 'sentences_allagree' configuration
dataset = load_dataset('financial_phrasebank', 'sentences_allagree', split='train')
corpus = dataset['sentence']  # The "sentence" field holds the text
# Alternative corpus (AG News); take a subset for a quicker demonstration:
# dataset = load_dataset('ag_news', split='train')
# corpus = dataset['text'][:1000]

FinancialPhraseBank-v1.0.zip:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

In [36]:
# Step 2: Load the Sentence Transformer model for retrieval
retriever_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [37]:
# Encode the entire corpus using the Sentence Transformer model
corpus_embeddings = retriever_model.encode(corpus, convert_to_tensor=True)

In [38]:
# Step 3: Load the T5 model and tokenizer for text generation
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

In [39]:
# Function to retrieve relevant documents from the corpus
def retrieve_passages(query, top_k=3):
    # Get the embedding of the query
    query_embedding = retriever_model.encode(query, convert_to_tensor=True)

    # Calculate cosine similarity between the query embedding and the corpus embeddings
    similarities = util.cos_sim(query_embedding, corpus_embeddings)

    # Get the top-k indices with highest similarity scores
    top_k_indices = torch.topk(similarities, k=top_k).indices.flatten()  # Flatten the tensor

    # Convert the flattened tensor indices to Python integers and retrieve the passages
    retrieved_passages = [corpus[idx.item()] for idx in top_k_indices]

    return retrieved_passages
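
The retrieval step above ranks every corpus embedding by cosine similarity to the query and keeps the best `top_k`. The sketch below reproduces that logic on toy 2-D vectors in plain Python, so it runs without downloading a model. The names `cosine`, `top_k_indices`, `toy_corpus`, and `toy_query` are hypothetical and not part of the notebook. It also clamps `k` to the corpus size, a guard the original function does not have; `torch.topk` raises an error when `k` exceeds the number of scores.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_indices(query_vec, corpus_vecs, top_k=3):
    """Indices of the corpus vectors most similar to query_vec, best first."""
    k = min(top_k, len(corpus_vecs))  # clamp so k never exceeds the corpus size
    scores = [cosine(query_vec, vec) for vec in corpus_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy stand-ins for sentence embeddings (assumed values, for illustration only)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
print(top_k_indices(toy_query, toy_corpus, top_k=2))
```

In the notebook itself, the same guard could be written as `k=min(top_k, len(corpus))` inside the `torch.topk` call.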

In [40]:
# Function to generate a response using T5
def generate_response(query, retrieved_passages):
    # Join the retrieved passages into one context string, separated by spaces.
    context = " ".join(retrieved_passages)

    # Format the input with the "question: ... context: ..." pattern that T5 uses
    # for question answering, which cues the model to answer the question.
    input_text = f"question: {query} context: {context}"

    # Tokenize the input into token IDs, the form the model expects.
    # max_length=512 with truncation=True cuts off any input beyond 512 tokens,
    # so passages at the end of a long context may be dropped.
    input_ids = t5_tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)

    # Generate a response with beam search.
    # max_length=100: cap the generated response at 100 tokens.
    # num_beams=5: keep the 5 most probable partial outputs at each step; more beams
    #   usually improve quality but increase computation time.
    # early_stopping=True: stop the beam search once num_beams finished candidates exist,
    #   rather than continuing to look for better ones.
    output_ids = t5_model.generate(input_ids, max_length=100, num_beams=5, early_stopping=True)

    # Decode the generated tokens into human-readable text.
    response = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response
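
Because the tokenizer truncates at 512 tokens, a long context is cut at the end without warning. One way to make that explicit is to cap the context before tokenizing. The sketch below keeps whole passages until a word budget is reached. `build_prompt` and `max_context_words` are hypothetical names, and whitespace-separated words are only a rough proxy for T5 tokens, so treat the budget as an assumption to tune rather than an exact limit.

```python
def build_prompt(query, passages, max_context_words=300):
    """Build a 'question: ... context: ...' prompt from whole passages only,
    stopping before the word budget would be exceeded."""
    kept, used = [], 0
    for passage in passages:
        n_words = len(passage.split())
        if used + n_words > max_context_words:
            break  # the next passage would overflow the budget
        kept.append(passage)
        used += n_words
    context = " ".join(kept)
    return f"question: {query} context: {context}"

# Passages are assumed to arrive best-first, as retrieve_passages returns them
print(build_prompt("q?", ["a b c", "d e", "f g h"], max_context_words=5))
```

Since the passages are ordered by similarity, this drops the least relevant ones first, and no passage is cut mid-sentence.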

In [41]:
def generate_response_2(query, retrieved_passages):
    # Combine the query and retrieved passages
    context = " ".join(retrieved_passages)
    input_text = f"question: {query} context: {context}"

    # Encode the input for the T5 model
    input_ids = t5_tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)

    # Generate a response using T5 with adjusted parameters for longer, more varied output
    output_ids = t5_model.generate(
        input_ids,
        max_length=300,          # Increased max length for longer responses
        min_length=50,           # Ensure responses are at least 50 tokens long
        num_beams=10,            # More beams for better exploration
        early_stopping=True,
        do_sample=True,          # Enable sampling; temperature and top_p are ignored without it
        temperature=0.7,         # Values below 1 sharpen the token distribution (less randomness)
        top_p=0.9,               # Nucleus sampling to control diversity
        repetition_penalty=1.2   # Discourage repetitive responses
    )

    # Decode the generated tokens into text
    response = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response
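
In the transformers `generate` API, `temperature` and `top_p` only take effect when sampling is enabled with `do_sample=True`. To show what the two settings do to a next-token distribution, here is a minimal plain-Python sketch on made-up numbers. It illustrates the idea and is not the library's implementation; `softmax_with_temperature` and `top_p_filter` are hypothetical names.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Made-up distribution over four candidate tokens
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(nucleus))  # token indices that survive the filter
```

Sampling then draws the next token from the renormalized probabilities, so the least likely tokens can never be chosen.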


In [42]:
# Example Usage
#query = "How are oil stocks going to be affected by climate change?"
query = "What are the current trends in financial markets?"
retrieved_passages = retrieve_passages(query, top_k=10)
#You can adjust the top_k parameter in retrieve_passages() to retrieve more or fewer passages.
response = generate_response_2(query, retrieved_passages)
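
The retrieved passages for this query include two nearly identical sentences ("... is provided for these markets ." and "... is provided for this market ."), which spend context on repeated information. A possible filter, sketched here with the standard library's `difflib`, drops a passage when it is very similar to one already kept. `drop_near_duplicates` and the `0.9` threshold are assumptions for illustration, not part of the notebook.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(passages, threshold=0.9):
    """Keep passages in order, skipping any whose character-level similarity
    to an already kept passage is at or above the threshold."""
    kept = []
    for passage in passages:
        is_duplicate = any(
            SequenceMatcher(None, passage, seen).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(passage)
    return kept

sample = [
    "Also , a six-year historic analysis is provided for these markets .",
    "Also , a six-year historic analysis is provided for this market .",
    "However , the growth margin slowed down due to the financial crisis .",
]
for passage in drop_near_duplicates(sample):
    print("-", passage)
```

Applied to `retrieved_passages` before calling the generation function, this would keep the first copy of each near-duplicate pair and preserve the similarity ordering.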

In [43]:
print("\nQuery:", query)
print("\nRetrieved Passages:")
for passage in retrieved_passages:
    print("-", passage)
print("\nGenerated Response:", response)


Query: What are the current trends in financial markets?

Retrieved Passages:
- Also , a six-year historic analysis is provided for these markets .
- `` The trend in the sports and leisure markets was favorable in the first months of the year .
- Also , a six-year historic analysis is provided for this market .
- As of July 2 , 2007 , the market cap segments will be updated according to the average price in May 2007 .
- However , the growth margin slowed down due to the financial crisis .
- The company 's market share is continued to increase further .
- The company said that paper demand increased in all of its main markets , including of publication papers , and that it increased average paper prices by 4 percent compared with last year .
- Sales are expected to increase in the end of the year 2006 , however .
- According to the company , in addition to normal seasonal fluctuation the market situation has weakened during autumn 2008 .
- Net sales will , however , increase from 2005 