<a href="https://colab.research.google.com/github/aswinaus/Reinforcement-Learning/blob/main/Agent_RewardFunction_GRPO_GroupPolicy_W%26B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def cosine_similarity_reward(retrieved_context, ground_truth):
    """
    Calculates a reward based on cosine similarity between the retrieved context
    and the ground truth using TF-IDF vectorization.

    Args:
        retrieved_context (str): The text from the retrieved documents.
        ground_truth (str): The ground truth text.

    Returns:
        float: A score between 0 and 1 representing the cosine similarity.
    """
    # Handle empty strings
    if not retrieved_context or not ground_truth:
        return 0.0

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer().fit([retrieved_context, ground_truth])
    vectors = vectorizer.transform([retrieved_context, ground_truth])

    # Calculate cosine similarity
    similarity_score = cosine_similarity(vectors[0], vectors[1])[0][0]

    return similarity_score

# Example usage (assuming 'contexts' and 'ground_truth' are defined):
# combined_context = " ".join(contexts[0]) # Combine retrieved contexts
# reward = cosine_similarity_reward(combined_context, ground_truth[0])
# print(f"Cosine Similarity Reward: {reward}")

In [None]:
answers  = []
contexts = []
cosine_similarity_rewards = [] # List to store cosine similarity rewards


# Iterate over the questions, query the agent, and retrieve the supporting context
# Define the retriever from the combined Uber/Lyft 10-K index
retriever = CombinedFinancial10k_index.as_retriever()

for i, question in enumerate(questions):
    response = agent.chat(question)
    answers.append(response.response) # Extract the string response
    retrieved_docs = retriever.retrieve(question)
    context_texts = [docs.node.text for docs in retrieved_docs]
    contexts.append(context_texts)

    # Calculate cosine similarity reward
    # Combine retrieved contexts into a single string for similarity calculation
    combined_context = " ".join(context_texts)
    cosine_similarity_reward_score = cosine_similarity_reward(combined_context, ground_truth[i])
    cosine_similarity_rewards.append(cosine_similarity_reward_score)


# Preparing the dataset
data = {
    "question": questions,
    "answer": answers,
    "ground_truth": ground_truth,
    "contexts": contexts, # Add the contexts to the dataset
    "cosine_similarity_reward": cosine_similarity_rewards, # Add the cosine similarity rewards
}

# Convert dict to a Hugging Face dataset (requires `pip install datasets`)
from datasets import Dataset
dataset = Dataset.from_dict(data)
dataset.to_pandas()

# Task
Calculate rewards based on cosine similarity, explain how to update the policy directly with Group Relative Policy Optimization (GRPO) in this setting, and define the policy network for a RAG system whose reward is based on cosine similarity.
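
As a dependency-free illustration of the reward described in this task, the sketch below computes cosine similarity over raw term counts. It is a simplification: the notebook's `cosine_similarity_reward` uses scikit-learn's TF-IDF weighting, whereas `bow_cosine` (a hypothetical helper name, not part of the notebook) uses plain term frequencies and whitespace tokenization.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw term counts (no IDF)."""
    if not text_a or not text_b:
        return 0.0
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because term counts are never negative, the score stays between 0 (no shared terms) and 1 (identical term distributions), which is the range the reward function's docstring promises.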

## Understand the role of the policy network

### Subtask:
Clarify what aspects of the RAG system the policy network will control (e.g., retrieval parameters, generation parameters, or both).


## Choose a suitable neural network architecture

### Subtask:
Select a type of neural network (e.g., a feedforward network, recurrent network, or transformer-based model) that can process the input question and output the control signals for the RAG system.


**Reasoning**:
A transformer-based model is a good fit for the policy network. The input is a text question, and transformers handle sequential data well and capture the semantic relationships within text that the network needs in order to output control signals for the RAG system.



In [None]:
# Explanation of the chosen neural network architecture

# For this task, where the input is a text question and the output needs to be control signals
# for a RAG system (e.g., parameters for retrieval or generation), a transformer-based model
# is a suitable choice.

# Justification:
# 1. Text Input: Transformers are specifically designed to process sequential data like text.
# 2. Semantic Understanding: They excel at capturing long-range dependencies and semantic relationships
#    within the input question, which is crucial for determining appropriate RAG control signals.
# 3. Contextual Embeddings: Transformer models generate rich contextual embeddings of the input text,
#    allowing the policy network to make informed decisions based on the nuances of the question.
# 4. Adaptability: Transformer architectures can be adapted with output layers to produce numerical
#    or categorical control signals as needed by the RAG system.

# Examples of transformer-based models that could be adapted for this purpose include
# models based on BERT, RoBERTa, or smaller, task-specific transformer variants.
# The specific output layer would depend on the exact control signals being predicted (e.g.,
# a dense layer for numerical parameters or a softmax layer for categorical choices).

# This choice aligns with the task of learning a policy based on textual input to influence
# the RAG system's behavior for better performance according to the defined reward.

## Define the input and output layers

### Subtask:
Define the format of the input (the question) and the output (the parameters or actions that influence the RAG system) for the chosen transformer-based policy network.


**Reasoning**:
Describe the input and output format of the chosen transformer-based policy network based on the previous subtask.



In [None]:
# 1. Input Layer Format:
# The input to the policy network will be the user's question, which is a string of text.
# Before being fed into the transformer layers, this text will undergo standard NLP preprocessing steps:
# - Tokenization: The text will be broken down into a sequence of tokens (words or sub-word units) using a tokenizer appropriate for the chosen transformer model (e.g., WordPiece for BERT, BPE for RoBERTa).
# - Embedding: The sequence of tokens will be converted into a sequence of numerical embeddings. Transformer models typically use learned token embeddings, positional embeddings (to capture token order), and potentially segment embeddings. The input to the transformer layers will be a tensor of shape (batch_size, sequence_length, embedding_dim), where:
#   - batch_size: The number of questions processed in parallel.
#   - sequence_length: The maximum number of tokens in a question (padded or truncated).
#   - embedding_dim: The dimensionality of the token embeddings.

# 2. Output Layer(s) Format:
# The output layer(s) of the policy network will produce control signals for the RAG system. Based on the understanding that the policy network controls retrieval parameters (like similarity_top_k) and potentially influences generation, the output could be structured as follows:
# - For a numerical parameter like `similarity_top_k`: A single dense layer with one output neuron, potentially followed by an activation function (e.g., ReLU to ensure non-negativity) and possibly scaled to a reasonable range. The output would be a tensor of shape (batch_size, 1).
# - For influencing generation (less direct control in this setup, but conceptually): This could be represented as a vector influencing attention mechanisms or providing context to the generation model. However, focusing on retrieval parameters as the primary policy output in this GRPO context is more straightforward.
# - For simplicity and direct control over a key retrieval parameter, let's define the output as a single numerical value representing `similarity_top_k`. The output layer will be a dense layer with 1 output neuron.

# Therefore, the output of the policy network will be a tensor of shape (batch_size, 1), representing the predicted value for `similarity_top_k` for each question in the batch.

# 3. Interpretation and Usage of the Policy Network's Output:
# The output of the policy network (the predicted `similarity_top_k` value) will be used to configure the retrieval step of the RAG system for the given question.
# - During training: The predicted `similarity_top_k` will be used to perform retrieval. The retrieved context, along with the question, will then be passed to the generation model to produce an answer. This answer will be compared to the ground truth to calculate the cosine similarity reward. This reward will be used by the GRPO algorithm to update the policy network's weights, encouraging it to predict `similarity_top_k` values that lead to higher rewards.
# - During inference: The policy network will predict `similarity_top_k` for a new question, and this value will be used directly in the retrieval process to gather context for generating the final answer.
# The predicted numerical output for `similarity_top_k` might need to be post-processed (e.g., rounded to an integer, clipped to a valid range) before being used by the RAG system's retriever.
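
The post-processing step mentioned above (round, then clip to a valid range) could look like the following minimal sketch. The helper name `to_top_k` and the bounds `min_k=1` and `max_k=10` are assumptions chosen for illustration; the notebook does not state an upper bound.

```python
def to_top_k(raw_value: float, min_k: int = 1, max_k: int = 10) -> int:
    """Round a continuous policy output and clip it to [min_k, max_k]."""
    rounded = int(round(raw_value))
    return max(min_k, min(max_k, rounded))
```

The resulting integer can then be passed to the retriever, for example through LlamaIndex's `index.as_retriever(similarity_top_k=...)`.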

## Implement the policy network

### Subtask:
Implement the policy network using a deep learning framework. This involves defining the transformer layers and the output layer(s) based on the input and output formats defined in the previous steps.


**Reasoning**:
Implement the policy network using PyTorch, defining the transformer layers and the output layer as specified in previous steps.

In essence, this network takes a question, processes it through a pre-trained transformer to understand its context, and then uses a simple linear layer to predict a non-negative numerical value intended to represent the optimal similarity_top_k for retrieving documents for that question.

In the context of Reinforcement Learning (RL), a policy network is a neural network that learns to map states (in our case, the user's question) to actions (the parameters or decisions that control the RAG system).

**State:** The input is the user's question. The policy network processes this question to understand its meaning and context.

**Action:** The output of the policy network is a value (or values) that influences how the RAG system operates. In the code we just discussed, the policy network's action space was initially simplified to predicting a single value: similarity_top_k, which determines how many relevant documents are retrieved. In the modified code, it predicts parameters for a distribution from which similarity_top_k is sampled.

The goal of training the policy network using an algorithm like GRPO is to adjust its internal parameters (the weights and biases of the neural network) so that, when presented with a question, it predicts/samples actions (like a specific similarity_top_k) that lead to higher rewards (in our case, higher cosine similarity between the generated answer and the ground truth).

So, the policy network's role is to learn the optimal strategy for configuring the RAG system based on the input question to maximize the desired outcome (answer quality, measured by the reward).



# Task
**Implement the policy update rule for the GRPO algorithm using cosine similarity as the reward signal to adjust the policy network's parameters in the provided notebook.**

## Modify policy network output

### Subtask:
Adjust the `RAGPolicyNetwork` to output parameters for a distribution over `similarity_top_k`, such as the mean and log-variance of a Gaussian distribution.


**Reasoning**:
Modify the RAGPolicyNetwork class to output parameters for a Gaussian distribution (mean and log-variance) over `similarity_top_k`.



## Implement action sampling and log probability calculation

### Subtask:
Implement functions to sample `similarity_top_k` from the Gaussian distribution predicted by the policy network and calculate the log probability of the sampled action.


**Reasoning**:
Implement functions to sample the action (similarity_top_k) from the predicted Gaussian distribution and calculate the log probability of the sampled action, using the predicted mean and log-variance from the policy network.
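
To make the arithmetic concrete, here is a framework-free sketch of the same two operations for a single scalar action. It mirrors the PyTorch helpers defined later in this notebook (`sample_action_and_continuous` and `calculate_log_prob`) but uses only the standard library; the function names `sample_action` and `gaussian_log_prob` are hypothetical.

```python
import math
import random


def sample_action(mean: float, log_variance: float, rng: random.Random):
    """Sample from N(mean, exp(log_variance)) and map the draw to a valid top_k."""
    std_dev = math.exp(0.5 * log_variance)  # sigma = sqrt(exp(log sigma^2))
    continuous_sample = rng.gauss(mean, std_dev)
    top_k = max(1, round(abs(continuous_sample)))  # integer, at least 1
    return top_k, continuous_sample


def gaussian_log_prob(mean: float, log_variance: float, value: float) -> float:
    """Log-density of `value` under N(mean, exp(log_variance))."""
    variance = math.exp(log_variance)
    return -0.5 * (math.log(2.0 * math.pi) + log_variance + (value - mean) ** 2 / variance)
```

Predicting the log-variance rather than the variance keeps the standard deviation positive for any real-valued network output.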



## Implement baseline calculation

### Subtask:
Create a function to calculate a baseline for the rewards, such as the average reward in a batch.


**Reasoning**:
Define a function to calculate the mean of a list or tensor of rewards to be used as a baseline.
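
A minimal standard-library sketch of turning rewards into advantages with a mean baseline follows. The function name and the `normalize` flag are assumptions for illustration: the subtask above only asks for the mean, while GRPO as commonly described also divides by the group's standard deviation, which the flag switches on.

```python
import statistics


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Subtract the group mean from each reward; optionally divide by the group std."""
    if not rewards:
        return []
    baseline = statistics.fmean(rewards)
    centered = [reward - baseline for reward in rewards]
    if not normalize:
        return centered
    spread = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical
    return [value / (spread + eps) for value in centered]
```

Actions that scored above the group average receive a positive advantage and are reinforced; those below average are discouraged.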



## Set up training loop

### Subtask:
Structure a training loop that iterates through the dataset, performs forward passes with the policy network, executes the RAG system with sampled actions, calculates rewards and advantages, and computes the policy gradient.


**Reasoning**:
Structure a training loop that iterates through the dataset, performs forward passes with the policy network, executes the RAG system with sampled actions, calculates rewards and advantages, and computes the policy gradient.
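
The loop structure described above can be sketched on a toy problem. Everything here is a stand-in, not the notebook's implementation: `toy_reward` replaces the real RAG execution and cosine-similarity reward, the policy is a single trainable Gaussian mean with a fixed standard deviation instead of a transformer, and the gradient is derived by hand rather than computed by autograd.

```python
import random
import statistics


def toy_reward(top_k: int, best_k: int = 5) -> float:
    """Stand-in for the RAG reward: highest when top_k equals best_k."""
    return float(-abs(top_k - best_k))


def train_toy_policy(steps=300, group_size=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    mean, std_dev = 1.0, 1.0  # the mean is the only trainable parameter here
    for _ in range(steps):
        # 1. Sample a group of actions from the current policy
        samples = [rng.gauss(mean, std_dev) for _ in range(group_size)]
        # 2. Execute the (toy) environment with the post-processed actions
        rewards = [toy_reward(max(1, round(abs(s)))) for s in samples]
        # 3. Group-relative advantages: center on the mean, scale by the std
        baseline = statistics.fmean(rewards)
        spread = statistics.pstdev(rewards) + 1e-8
        advantages = [(r - baseline) / spread for r in rewards]
        # 4. Policy gradient: d/d(mean) log N(s | mean, std_dev) = (s - mean) / std_dev**2
        grad = statistics.fmean(
            [a * (s - mean) / std_dev**2 for a, s in zip(advantages, samples)]
        )
        # 5. Gradient ascent on the expected reward
        mean += lr * grad
    return mean
```

Calling `train_toy_policy()` should move the mean from its starting value of 1.0 toward the toy optimum of 5. In the real loop, step 2 would run retrieval and generation, and steps 4 and 5 would be handled by the framework's backward pass and optimizer.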



## Implement policy update

### Subtask:
Apply the calculated policy gradient to update the parameters of the policy network using an optimizer.


**Reasoning**:
Apply the calculated policy gradient to update the parameters of the policy network using the optimizer.
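
Written out, the update in this subtask corresponds to a simplified policy-gradient loss. This is a sketch assuming the mean-baseline form implied by the earlier subtasks; the symbols $G$ (group or batch size), $q_i$ (question), $a_i$ (sampled action), $r_i$ (reward) and $\eta$ (learning rate) are introduced here for illustration. GRPO as commonly described additionally uses a clipped probability ratio and a KL penalty against a reference policy, both omitted here.

```latex
\hat{A}_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j, \qquad
\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \,\log \pi_\theta(a_i \mid q_i), \qquad
\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)
```

Minimizing $\mathcal{L}$ raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.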



## Evaluate and refine

### Subtask:
After training, evaluate the performance of the policy-controlled RAG system and refine the implementation or hyperparameters as needed.


**Reasoning**:
Evaluate the performance of the policy-controlled RAG system after the training loop. This involves setting the policy network to evaluation mode, using a dataset (can be the same as training for demonstration), iterating through questions, predicting/sampling `similarity_top_k`, executing the RAG query, calculating the cosine similarity reward, and finally reporting the average reward.
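
One way to structure the evaluation pass is sketched below. The callables `predict_mean`, `run_rag` and `reward_fn` are hypothetical placeholders for the trained policy, the RAG query, and `cosine_similarity_reward`. Using the distribution's mean instead of sampling is an assumption made here to get a deterministic evaluation.

```python
from statistics import fmean


def evaluate_policy(questions, ground_truth, predict_mean, run_rag, reward_fn):
    """Return the average reward of the policy-controlled RAG system."""
    rewards = []
    for question, truth in zip(questions, ground_truth):
        # Deterministic action: post-process the predicted mean instead of sampling
        top_k = max(1, round(abs(predict_mean(question))))
        answer = run_rag(question, top_k)
        rewards.append(reward_fn(answer, truth))
    return fmean(rewards) if rewards else 0.0
```

Comparing this average against the same metric with a fixed `similarity_top_k` gives a simple check of whether the learned policy actually helps.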



# Task
Implement observability metrics for the training process of a policy network using Weights and Biases.

## Identify key metrics

### Subtask:
Determine which metrics are most important to track for monitoring the training process of the policy network (e.g., epoch number, batch number, average policy loss per batch, average reward per batch, average predicted similarity_top_k per batch, average advantage per batch).


**Reasoning**:
Identify and list the key metrics for monitoring the training process of the policy network.



## Integrate Weights & Biases

### Subtask:
Install the `wandb` library and initialize a Weights & Biases run at the beginning of the training script.


**Reasoning**:
Install the wandb library and initialize a Weights & Biases run at the beginning of the training script as requested by the subtask.



## Log hyperparameters

### Subtask:
Log the training hyperparameters (e.g., learning rate, batch size, number of epochs) to Weights & Biases.


**Reasoning**:
Define a dictionary containing the training hyperparameters and log it to the initialized Weights & Biases run.
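
A minimal sketch of initializing a run with its hyperparameters attached follows. The project name and the hyperparameter values are illustrative assumptions, not the notebook's actual settings. `mode="disabled"` is used so the snippet runs without a W&B account; pass `mode="online"` for real logging.

```python
try:
    import wandb  # optional dependency; install with `pip install wandb`
except ImportError:
    wandb = None

# Illustrative values only; the notebook's real hyperparameters may differ.
hyperparameters = {"learning_rate": 1e-4, "batch_size": 4, "num_epochs": 3}


def start_run(config, project="rag-grpo-policy", mode="disabled"):
    """Start a W&B run and attach the hyperparameters as its config."""
    if wandb is None:
        return None  # wandb is not installed; nothing to initialize
    return wandb.init(project=project, config=config, mode=mode)
```

Passing the dictionary as `config` stores the hyperparameters with the run, so different runs can be compared by their settings in the dashboard.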



## Integrate metric calculation into the training loop

### Subtask:
Modify the training loop to calculate the chosen metrics for each batch and/or epoch.


**Reasoning**:
Modify the training loop to calculate the selected metrics for each batch and epoch, including average policy loss, average reward, average predicted similarity_top_k, average advantage, average mean, and average log-variance.
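
The per-batch aggregation could be as simple as the sketch below. The function name and the metric keys are assumptions; the inputs are plain Python lists holding one value per question in the batch.

```python
from statistics import fmean


def summarize_batch(policy_losses, rewards, top_ks, advantages, means, log_variances):
    """Average the per-question values collected during one batch."""
    return {
        "avg_policy_loss": fmean(policy_losses),
        "avg_reward": fmean(rewards),
        "avg_similarity_top_k": fmean(top_ks),
        "avg_advantage": fmean(advantages),
        "avg_mean": fmean(means),
        "avg_log_variance": fmean(log_variances),
    }
```

With a mean baseline, `avg_advantage` should stay close to zero by construction, so a value that drifts away from zero is a useful sanity check.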



## Log metrics to Weights & Biases

### Subtask:
Add code to log the calculated metrics to Weights & Biases within the training loop. This will typically involve using `wandb.log()`.


**Reasoning**:
Add code to log the calculated metrics to Weights & Biases within the training loop as requested by the subtask. This involves using `wandb.log()` for both batch and epoch metrics.
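
To keep batch and epoch metrics apart in the dashboard, one option is to prefix the metric names before logging. In this sketch `log_metrics` is a hypothetical helper and `log_fn` is any callable with the shape of `wandb.log(data, step=...)`, which keeps the helper testable without a W&B run.

```python
def log_metrics(metrics, prefix, step, log_fn):
    """Prefix metric names (e.g. 'batch/avg_reward') and send them to log_fn."""
    payload = {f"{prefix}/{name}": value for name, value in metrics.items()}
    log_fn(payload, step=step)
    return payload
```

Inside the training loop this would be called as `log_metrics(batch_metrics, "batch", global_step, wandb.log)`. W&B expects `step` to be non-decreasing, so a single global step counter shared by batch and epoch logs is the safer choice.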



## Visualize and analyze metrics in Weights & Biases

### Subtask:
Use the Weights & Biases dashboard to visualize the logged metrics, monitor training progress, and identify potential issues.


## Refine logging and metrics

### Subtask:
Based on the analysis in Weights & Biases, refine the set of metrics being tracked or the logging frequency for better insights.


**Reasoning**:
Based on the analysis in Weights & Biases, refine the set of metrics being tracked and their logging frequency in the training code. The current metrics are informative, but batch-level metrics can be noisy. Epoch-level metrics provide a smoother view of overall progress. We will keep logging both but ensure epoch metrics are clearly distinguished. We can also consider adding the standard deviation of `similarity_top_k` predictions at the epoch level to see how the policy's uncertainty evolves.
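
The epoch-level spread of `similarity_top_k` mentioned above can be computed from the values collected during the epoch. The helper name and metric keys below are assumptions for illustration.

```python
from statistics import fmean, pstdev


def epoch_top_k_stats(top_ks):
    """Mean and population standard deviation of the top_k values used in an epoch."""
    if not top_ks:
        return {"epoch/avg_similarity_top_k": 0.0, "epoch/std_similarity_top_k": 0.0}
    return {
        "epoch/avg_similarity_top_k": fmean(top_ks),
        "epoch/std_similarity_top_k": pstdev(top_ks),
    }
```

A standard deviation that shrinks over epochs suggests the policy is settling on a narrower range of `similarity_top_k` values.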



In [None]:
!pip install llama-index -q
!pip install langchain -q
!pip install langchain_experimental -q

In [None]:
import os
import nest_asyncio
nest_asyncio.apply()

In [None]:
from google.colab import userdata
# Set the OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] =  userdata.get('OPENAI_API_KEY')

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
# Setup OpenAI Model and Embeddings used for indexing the documents
Settings.llm = OpenAI(model='gpt-4o-mini', temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small')
Settings.chunk_size = 1024

In [None]:
from google.colab import drive
drive.mount('/content/drive')
data_dir = '/content/drive/MyDrive' # Input a data dir path from your mounted Google Drive

The training loop includes the actual RAG execution using the CombinedFinancial10k_index and the cosine_similarity_reward function to calculate the reward based on the generated answer and the ground truth for each question in the batch. This means the policy network is being trained using real rewards derived from the RAG system's performance.

In [None]:
import torch
import torch.nn as nn # Explicitly import torch.nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from torch.distributions import Normal
import wandb
import os # Import os to check for existing index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, SummaryIndex, StorageContext, load_index_from_storage, Settings # Import necessary LlamaIndex components
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from google.colab import userdata
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer # Import AutoModel and AutoTokenizer explicitly


# Redefine necessary variables and functions from previous cells to ensure scope

# Assuming OPENAI_API_KEY is already set as an environment variable in a previous cell
# os.environ["OPENAI_API_KEY"] =  userdata.get('OPENAI_API_KEY')

# Setup OpenAI Model and Embeddings - Ensure these are set within this cell's execution
Settings.llm = OpenAI(model='gpt-4o-mini', temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small')
Settings.chunk_size = 1024
print("LlamaIndex Settings configured.")

# Assuming Google Drive is mounted at /content/drive and data_dir is defined
data_dir = '/content/drive/MyDrive' # Input a data dir path from your mounted Google Drive
PERSIST_INDEX_DIR = f"{data_dir}/RAG/data/"  # data_dir is already an absolute path

# Redefine get_index function to ensure it's available
def get_index(index_name, doc_file_path):
  index = None
  if not os.path.exists(f"{PERSIST_INDEX_DIR}{index_name}/"):
    print(f"Index not found at {PERSIST_INDEX_DIR}{index_name}/. Creating index...")
    # Load the documents
    documents = SimpleDirectoryReader(input_files=[doc_file_path]).load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store the index to disk
    index.storage_context.persist(f"{PERSIST_INDEX_DIR}{index_name}/")
    print(f"Created and persisted index at {PERSIST_INDEX_DIR}{index_name}/")
  else: # Load index from disk
    print(f"Loading index from storage at {PERSIST_INDEX_DIR}{index_name}/")
    storage_context = StorageContext.from_defaults(persist_dir=f"{PERSIST_INDEX_DIR}{index_name}/")
    index = load_index_from_storage(storage_context)
    print("Loaded index from storage.")

  return index

# Load or create the Combined Financial 10k index using the redefined get_index function
# CombinedFinancial10k_index = get_index("CombinedFinancial10k",f"{data_dir}/RAG/data/10k/Combined_10k_Lyft_Uber.pdf")



# The index is assumed to have been persisted by an earlier run, so it is loaded directly here.
# If it has not been created yet, call get_index (commented out above) first.

# Load Combined Financial 10k index
try:
    # Correcting the persist_dir to match the intended index name and location
    storage_context = StorageContext.from_defaults(persist_dir=f"{PERSIST_INDEX_DIR}CombinedFinancial10k/")
    CombinedFinancial10k_index = load_index_from_storage(storage_context)
    print("Loaded combined 10K index from storage.")
except FileNotFoundError:
    print(f"Combined 10k index not found at {PERSIST_INDEX_DIR}CombinedFinancial10k/. Please run the cells to create and persist the index first.")
    # Handle the error, e.g., exit or create the index
    CombinedFinancial10k_index = None # Set to None if not loaded

# Redefine cosine_similarity_reward function
def cosine_similarity_reward(retrieved_context, ground_truth):
    """
    Calculates a reward based on cosine similarity between the retrieved context
    and the ground truth using TF-IDF vectorization.

    Args:
        retrieved_context (str): The text from the retrieved documents.
        ground_truth (str): The ground truth text.

    Returns:
        float: A score between 0 and 1 representing the cosine similarity.
    """
    # Handle empty strings
    if not retrieved_context or not ground_truth:
        return 0.0

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer().fit([retrieved_context, ground_truth])
    vectors = vectorizer.transform([retrieved_context, ground_truth])

    # Calculate cosine similarity
    similarity_score = cosine_similarity(vectors[0], vectors[1])[0][0]

    return similarity_score

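# Redefine sample_action_and_continuous function
def sample_action_and_continuous(mean, log_variance):
    # Convert log-variance to standard deviation: sigma = exp(0.5 * log(sigma^2))
    std_dev = torch.exp(0.5 * log_variance)
    distribution = Normal(mean, std_dev)
    # Keep the raw sample as well, so a log-probability can be computed on it
    continuous_sample = distribution.sample()
    # Map to a valid similarity_top_k: non-negative, rounded, at least 1.
    # torch.clamp avoids creating a CPU tensor that could mismatch the sample's device.
    processed_action = torch.clamp(torch.round(torch.abs(continuous_sample)), min=1.0)
    return processed_action, continuous_sample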

# Redefine calculate_baseline function
def calculate_baseline(rewards):
    if isinstance(rewards, list):
        rewards = torch.tensor(rewards, dtype=torch.float32)
    if rewards.numel() == 0:
        return 0.0
    return torch.mean(rewards)
# Log-probability of the action the policy took, under the Gaussian it predicted.
# The action space (the set of actions the agent can choose from) is continuous here.
def calculate_log_prob(mean, log_variance, action):
    std_dev = torch.exp(0.5 * log_variance)
    distribution = Normal(mean, std_dev)
    log_prob = distribution.log_prob(action)
    return log_prob


# Redefine questions and ground truth
questions = ["Please explain the revolving credit of Lyft for year 2023 and its long term benefits?",
    "Could you give me a concise overview of the Business strategies employed by Uber and Lyft?",
    "What are the details of Lyft's revolving credit for the year 2023 and how does it contribute to long-term benefits?",
    "Could you analyze and compare the financial health of Uber and Lyft for the year 2023?",
    "When compared to Uber is Lyft financially stable?",
    "Can you compare the growth strategies between Uber and Lyft based on the 10K document for year 2023?",
    "Can you provide a summary of the Financial and Operational Highlights of Uber?",
    "How would you describe the Business overview of Uber and Lyft in a nutshell?",
    "How does Lyft's financial stability in 2023 compare to Uber's?",
    "Can you provide an overview of Lyft's revolving credit in 2023 and its potential long-term advantages?",
    "Can you please compare the financial status for both the companies for the year 2023?",
    "Can you compare the cost and expenses between Lyft and Uber for year 2023? Also let me know which company fared better?",
    "Can you provide a financial comparison between Uber and Lyft for the year 2023?",
    "Could you give me an overview of Lyft's Financial and Operational performance?",
    "In terms of financial stability, how does Lyft stack up against Uber for the year 2023?",
    "How does Lyft's revolving credit in 2023 work and what are the advantages it offers in the long run?",
    "When it comes to technology which one fares better is it Uber or Lyft?",
    "What are the contrasting growth approaches of Uber and Lyft as outlined in the 2023 10K report?",
    "Could you give me a concise overview of the Business strategies employed by Uber and Lyft?",
    "Could you analyze and compare the financial health of Uber and Lyft for the year 2023?",
    "Can you summarize the important Financial and Operational metrics for Lyft",
    "When evaluating financial stability, how do Uber and Lyft differ in their performance for the year 2023?",]
ground_truth = ["In 2023, Lyft entered into a revolving credit agreement with certain lenders for a 1.25 billion.  The long-term benefits of this revolving credit facility include providing Lyft with financial flexibility and access to capital when needed. It allows Lyft to borrow funds, repay them, and borrow again up to the agreed-upon limit, providing liquidity for operational needs, investments, or other strategic initiatives. Additionally, having a revolving credit facility in place can help Lyft manage its cash flow, take advantage of opportunities for growth, and navigate any unforeseen financial challenges.",
    "Based on the information provided in the 10K documents for 2023, both Uber and Lyft have growth strategies focused on increasing their user base, expanding their range of transportation services, and capturing a larger share of the transportation market. Lyft's growth strategy includes increasing rider use cases by offering various products and services such as Lyft Pink subscription plan, Lyft Pass commuter programs, and first-mile and last-mile services. They also aim to grow their share of consumers' transportation spend by providing a wide range of mobility options through their rideshare, bikes, and scooters network.On the other hand, Uber's growth strategy involves using their technology platform to connect consumers with various ride services, merchants, and delivery service providers. They also connect consumers with public transportation networks and provide solutions in the freight industry. Uber is also developing new technologies to solve everyday problems.Both companies face competition from other players in the market, and they are focused on innovation, technology, and expanding their services to stay competitive and attract more users.",
    "In 2023, Lyft entered into a revolving credit agreement with certain lenders for a 1.25 billion.  The long-term benefits of this revolving credit facility include providing Lyft with financial flexibility and access to capital when needed. It allows Lyft to borrow funds, repay them, and borrow again up to the agreed-upon limit, providing liquidity for operational needs, investments, or other strategic initiatives. Additionally, having a revolving credit facility in place can help Lyft manage its cash flow, take advantage of opportunities for growth, and navigate any unforeseen financial challenges.",
    "Financial Highlights: Lyft, Inc. operates multimodal transportation networks in the United States and Canada. The company's revenue is primarily generated from its ridesharing marketplace connecting drivers and riders. Lyft collects service fees and commissions from drivers for their use of the ridesharing marketplace. Gross Bookings and Rides increase as drivers accept more rider leads. Operational Highlights: Lyft offers an expanded set of transportation modes in select cities, including shared bikes and scooters for shorter rides and multimodal trips. The company's platform and mobile-based applications facilitate peer-to-peer ridesharing by connecting drivers with riders. Lyft aims to grow its share of consumers' transportation spend by providing a wide range of mobility options. The company focuses on delivering increasing value to drivers by offering economic opportunities and programs like Express Drive for access to rental cars.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "Based on the information provided in the 10K documents for 2023, both Uber and Lyft have growth strategies focused on increasing their user base, expanding their range of transportation services, and capturing a larger share of the transportation market. Lyft's growth strategy includes increasing rider use cases by offering various products and services such as Lyft Pink subscription plan, Lyft Pass commuter programs, and first-mile and last-mile services. They also aim to grow their share of consumers' transportation spend by providing a wide range of mobility options through their rideshare, bikes, and scooters network.On the other hand, Uber's growth strategy involves using their technology platform to connect consumers with various ride services, merchants, and delivery service providers. They also connect consumers with public transportation networks and provide solutions in the freight industry. Uber is also developing new technologies to solve everyday problems.Both companies face competition from other players in the market, and they are focused on innovation, technology, and expanding their services to stay competitive and attract more users.",
    "Financial Highlights: Lyft, Inc. operates multimodal transportation networks in the United States and Canada. The company's revenue is primarily generated from its ridesharing marketplace connecting drivers and riders. Lyft collects service fees and commissions from drivers for their use of the ridesharing marketplace. Gross Bookings and Rides increase as drivers accept more rider leads. Operational Highlights: Lyft offers an expanded set of transportation modes in select cities, including shared bikes and scooters for shorter rides and multimodal trips. The company's platform and mobile-based applications facilitate peer-to-peer ridesharing by connecting drivers with riders. Lyft aims to grow its share of consumers' transportation spend by providing a wide range of mobility options. The company focuses on delivering increasing value to drivers by offering economic opportunities and programs like Express Drive for access to rental cars.",
    "Lyft, Inc. started a movement to revolutionize transportation in 2012 with a peer-to-peer marketplace for on-demand ridesharing. They have continued to pioneer innovations and are now one of the largest multimodal transportation networks in the United States and Canada. Lyft's purpose is to get riders out into the world and provide drivers with control over their time and money. They offer a ridesharing marketplace through the Lyft App, connecting drivers with riders and providing various transportation modes. Uber Technologies, Inc. is a technology platform that powers movement from point A to point B. They connect consumers with providers of ride services, merchants, and delivery service providers. Uber also connects consumers with public transportation networks and carriers in the freight industry. They are developing technologies to provide new solutions for everyday problems.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "In 2023, Lyft entered into a revolving credit agreement with certain lenders for a 1.25 billion.  The long-term benefits of this revolving credit facility include providing Lyft with financial flexibility and access to capital when needed. It allows Lyft to borrow funds, repay them, and borrow again up to the agreed-upon limit, providing liquidity for operational needs, investments, or other strategic initiatives. Additionally, having a revolving credit facility in place can help Lyft manage its cash flow, take advantage of opportunities for growth, and navigate any unforeseen financial challenges.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "In 2023, Lyft reported total costs and expenses of 36,171 million. Comparing the two, Lyft had higher costs and expenses than Uber for the year 2023. Therefore, Uber fared better in terms of managing costs and expenses in 2023.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "Financial Highlights: Lyft, Inc. operates multimodal transportation networks in the United States and Canada. The company's revenue is primarily generated from its ridesharing marketplace connecting drivers and riders. Lyft collects service fees and commissions from drivers for their use of the ridesharing marketplace. Gross Bookings and Rides increase as drivers accept more rider leads.Operational Highlights: Lyft offers an expanded set of transportation modes in select cities, including shared bikes and scooters for shorter rides and multimodal trips. The company's platform and mobile-based applications facilitate peer-to-peer ridesharing by connecting drivers with riders. Lyft aims to grow its share of consumers' transportation spend by providing a wide range of mobility options. The company focuses on delivering increasing value to drivers by offering economic opportunities and programs like Express Drive for access to rental cars.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "In 2023, Lyft entered into a revolving credit agreement with certain lenders for a 1.25 billion.  The long-term benefits of this revolving credit facility include providing Lyft with financial flexibility and access to capital when needed. It allows Lyft to borrow funds, repay them, and borrow again up to the agreed-upon limit, providing liquidity for operational needs, investments, or other strategic initiatives. Additionally, having a revolving credit facility in place can help Lyft manage its cash flow, take advantage of opportunities for growth, and navigate any unforeseen financial challenges.",
    "Based on the provided context, both Uber and Lyft heavily rely on technology to power their transportation services. Uber is described as a technology platform that uses leading technology to connect consumers with ride services, delivery services, and public transportation networks. Lyft, on the other hand, leverages its technology platform to connect drivers with riders through its ridesharing marketplace and offers an expanded set of transportation modes. In terms of technology, both Uber and Lyft have their strengths and focus areas. Uber emphasizes its technology applications supporting a variety of offerings on its platform, while Lyft highlights its robust technology platform that powers rides and connections every day. Ultimately, the effectiveness and success of each company's technology may vary based on specific features, innovations, and user experiences.",
    "Based on the information provided in the 10K documents for 2023, both Uber and Lyft have growth strategies focused on increasing their user base, expanding their range of transportation services, and capturing a larger share of the transportation market. Lyft's growth strategy includes increasing rider use cases by offering various products and services such as Lyft Pink subscription plan, Lyft Pass commuter programs, and first-mile and last-mile services. They also aim to grow their share of consumers' transportation spend by providing a wide range of mobility options through their rideshare, bikes, and scooters network.On the other hand, Uber's growth strategy involves using their technology platform to connect consumers with various ride services, merchants, and delivery service providers. They also connect consumers with public transportation networks and provide solutions in the freight industry. Uber is also developing new technologies to solve everyday problems.Both companies face competition from other players in the market, and they are focused on innovation, technology, and expanding their services to stay competitive and attract more users.",
    "Lyft, Inc. started a movement to revolutionize transportation in 2012 with a peer-to-peer marketplace for on-demand ridesharing. They have continued to pioneer innovations and are now one of the largest multimodal transportation networks in the United States and Canada. Lyft's purpose is to get riders out into the world and provide drivers with control over their time and money. They offer a ridesharing marketplace through the Lyft App, connecting drivers with riders and providing various transportation modes.Uber Technologies, Inc. is a technology platform that powers movement from point A to point B. They connect consumers with providers of ride services, merchants, and delivery service providers. Uber also connects consumers with public transportation networks and carriers in the freight industry. They are developing new technologies to provide new solutions for everyday problems.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023.",
    "Financial Highlights: Lyft, Inc. operates multimodal transportation networks in the United States and Canada. The company's revenue is primarily generated from its ridesharing marketplace connecting drivers and riders. Lyft collects service fees and commissions from drivers for their use of the ridesharing marketplace. Gross Bookings and Rides increase as drivers accept more rider leads.Operational Highlights: Lyft offers an expanded set of transportation modes in select cities, including shared bikes and scooters for shorter rides and multimodal trips. The company's platform and mobile-based applications facilitate peer-to-peer ridesharing by connecting drivers with riders. Lyft aims to grow its share of consumers' transportation spend by providing a wide range of mobility options. The company focuses on delivering increasing value to drivers by offering economic opportunities and programs like Express Drive for access to rental cars.",
    "Based on the provided context, it appears that Lyft is in a better financial position compared to Uber for the year 2023. Lyft reported revenue of 37,281 million. Additionally, Lyft's net loss percentage was 33.1% compared to Uber's net income percentage of 5%. This indicates that Lyft had a lower loss percentage compared to Uber's income percentage, suggesting that Lyft may be more financially stable in 2023."
    ]

# Redefine RAGPolicyNetwork class
class RAGPolicyNetwork(nn.Module):
    def __init__(self, transformer_model_name="bert-base-uncased", output_dim=2):
        super(RAGPolicyNetwork, self).__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(transformer_model_name)
        self.transformer = AutoModel.from_pretrained(transformer_model_name)
        transformer_output_dim = self.transformer.config.hidden_size
        self.output_layer = nn.Linear(transformer_output_dim, output_dim)

    def forward(self, questions):
        encoded_input = self.tokenizer(questions, return_tensors='pt', padding=True, truncation=True)
        outputs = self.transformer(**encoded_input)
        pooled_output = outputs.pooler_output
        mean_and_log_variance = self.output_layer(pooled_output)
        mean = mean_and_log_variance[:, 0]
        log_variance = mean_and_log_variance[:, 1]
        return mean, log_variance

# Redefine sample_action_and_continuous function
def sample_action_and_continuous(mean, log_variance):
    std_dev = torch.exp(0.5 * log_variance)
    distribution = Normal(mean, std_dev)
    continuous_sample = distribution.sample()
    # Round |sample| to an integer and enforce a minimum of 1 retrieved document
    processed_action = torch.clamp(torch.round(torch.abs(continuous_sample)), min=1.0)
    return processed_action, continuous_sample

# Redefine calculate_baseline function
def calculate_baseline(rewards):
    if isinstance(rewards, list):
        rewards = torch.tensor(rewards, dtype=torch.float32)
    if rewards.numel() == 0:
        return 0.0
    return torch.mean(rewards)

def calculate_log_prob(mean, log_variance, action):
    std_dev = torch.exp(0.5 * log_variance)
    distribution = Normal(mean, std_dev)
    log_prob = distribution.log_prob(action)
    return log_prob


# Redefine Dataset and DataLoader
class RAGDataset(Dataset):
    def __init__(self, questions, ground_truth):
        self.questions = questions
        self.ground_truth = ground_truth
    def __len__(self):
        return len(self.questions)
    def __getitem__(self, idx):
        return self.questions[idx], self.ground_truth[idx]

# Define dataset, dataloader, and training parameters if not already
if 'rag_dataset' not in globals():
    rag_dataset = RAGDataset(questions, ground_truth)
if 'BATCH_SIZE' not in globals():
    BATCH_SIZE = 8
if 'train_dataloader' not in globals():
    train_dataloader = DataLoader(rag_dataset, batch_size=BATCH_SIZE, shuffle=True)
if 'NUM_EPOCHS' not in globals():
    NUM_EPOCHS = 5
if 'LEARNING_RATE' not in globals():
    LEARNING_RATE = 1e-4

# Re-instantiate policy_group and optimizers if not available or size mismatch
if 'policy_group' not in globals() or len(policy_group) != 5: # Assuming NUM_POLICIES = 5
    NUM_POLICIES = 5 # Ensure NUM_POLICIES is defined
    print(f"Policy group not found or size mismatch. Re-instantiating {NUM_POLICIES} policies.")
    policy_group = nn.ModuleList()
    for i in range(NUM_POLICIES):
        policy = RAGPolicyNetwork(transformer_model_name="bert-base-uncased")
        policy_group.append(policy)
    optimizers = [optim.Adam(policy.parameters(), lr=LEARNING_RATE) for policy in policy_group]
    print(f"Re-instantiated a group of {NUM_POLICIES} RAGPolicyNetwork instances and optimizers.")
else:
    print(f"Policy group with {len(policy_group)} instances and optimizers already exists.")


# Initialize a Weights & Biases run (if not already initialized and active)
import wandb  # Import first: checking wandb.run before importing would raise NameError

if wandb.run is None:
    try:
        wandb.init(project="rag-policy-training", name="grpo-cosine-similarity-group-logging", reinit=True)
        config = {
            "learning_rate": LEARNING_RATE, "batch_size": BATCH_SIZE, "num_epochs": NUM_EPOCHS,
            "transformer_model": "bert-base-uncased", "output_dim": 2, "num_policies": len(policy_group)
        }
        wandb.config.update(config)
        print("Training hyperparameters logged to Weights & Biases config.")
    except Exception as e:
        print(f"Error initializing Weights & Biases: {e}")
        print("Weights & Biases logging will be skipped.")
else:
    print(f"Weights & Biases run '{wandb.run.name}' is already active.")


# --- Training Loop ---
print("Starting policy group training with policy evaluation and logging...")

if 'global_step' not in globals(): global_step = 0

# Check if CombinedFinancial10k_index was loaded successfully before starting training
# Corrected check to use CombinedFinancial10k_index
if CombinedFinancial10k_index is not None:
    for epoch in range(NUM_EPOCHS):
        # Data structures to collect data across policies for this iteration/epoch
        all_policy_rewards = {}
        all_policy_continuous_samples = {} # Store sampled actions so log-probs can be recomputed in the update phase
        all_policy_sampled_k_processed = {}
        # Removed all_policy_advantages storage during data collection as it will be recalculated
        all_policy_means = {}
        all_policy_log_variances = {}
        all_policy_losses = {}
        all_policy_questions = {} # Store questions for re-running forward pass

        # --- Data Collection Phase ---
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Collecting data...")
        for policy_idx, policy in enumerate(policy_group):
            policy.train()
            policy_name = f"policy_{policy_idx}"
            all_policy_rewards[policy_name] = []
            all_policy_continuous_samples[policy_name] = [] # Initialize storage for continuous samples
            all_policy_sampled_k_processed[policy_name] = []
            all_policy_means[policy_name] = []
            all_policy_log_variances[policy_name] = []
            all_policy_losses[policy_name] = []
            all_policy_questions[policy_name] = [] # Initialize storage for questions

            for batch_idx, (batch_questions, batch_ground_truth) in enumerate(train_dataloader):
                if not batch_questions: continue

                # Forward pass to get the distribution parameters used for sampling
                mean_output, log_variance_output = policy(list(batch_questions))

                batch_sampled_k_processed = []
                batch_sampled_k_continuous = []
                batch_rewards = []

                for i in range(len(batch_questions)):
                    # sample_action_and_continuous function already returns tensors
                    sampled_k_processed_item, sampled_k_continuous_item = sample_action_and_continuous(mean_output[i], log_variance_output[i])
                    batch_sampled_k_processed.append(sampled_k_processed_item.detach()) # Detach processed_action for plotting/logging
                    batch_sampled_k_continuous.append(sampled_k_continuous_item) # Normal.sample() carries no gradient; kept to recompute log-probs later

                    question = batch_questions[i]
                    ground_truth_answer = batch_ground_truth[i]
                    predicted_top_k_int = max(1, int(sampled_k_processed_item.item()))

                    try:
                        # Use CombinedFinancial10k_index here
                        policy_controlled_engine = CombinedFinancial10k_index.as_query_engine(similarity_top_k=predicted_top_k_int)
                        generated_answer = policy_controlled_engine.query(question).response
                        reward = cosine_similarity_reward(generated_answer, ground_truth_answer)
                        batch_rewards.append(reward)
                    except Exception as e:
                        # print(f"    Error during RAG execution or reward calculation for question '{question}': {e}")
                        batch_rewards.append(0.0)

                if not batch_rewards: batch_rewards_tensor = torch.tensor([], dtype=torch.float32)
                else: batch_rewards_tensor = torch.tensor(batch_rewards, dtype=torch.float32)

                # batch_sampled_k_continuous holds the raw (gradient-free) samples drawn for this batch
                if not batch_sampled_k_continuous: batch_sampled_k_continuous_tensor = torch.tensor([], dtype=torch.float32)
                else: batch_sampled_k_continuous_tensor = torch.stack(batch_sampled_k_continuous)


                all_policy_rewards[policy_name].extend(batch_rewards_tensor.tolist())
                all_policy_sampled_k_processed[policy_name].extend([k.item() for k in batch_sampled_k_processed])
                all_policy_continuous_samples[policy_name].extend(batch_sampled_k_continuous) # Store continuous samples

                # Calculate log probabilities for the batch - needs to be on tensors that track gradients
                # We will recalculate log_probs in the update step using the stored continuous samples
                # and a fresh forward pass, so no need to store log_probs here.

                all_policy_means[policy_name].extend(mean_output.detach().tolist()) # Detach for storage/logging
                all_policy_log_variances[policy_name].extend(log_variance_output.detach().tolist()) # Detach for storage/logging

                # Store questions for re-running forward pass in update step
                all_policy_questions[policy_name].extend(list(batch_questions))


        global_step += 1

        # --- Implement Group Performance Evaluation and Logging ---
        policy_avg_rewards = {}
        best_policy_name = None
        highest_avg_reward = -float('inf')
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Evaluating policy performance and logging epoch metrics...")
        epoch_metrics = {}
        group_avg_reward = 0.0
        total_valid_rewards = 0
        for policy_idx, policy in enumerate(policy_group):
            policy_name = f"policy_{policy_idx}"
            epoch_rewards = all_policy_rewards[policy_name]
            avg_epoch_reward = np.mean(epoch_rewards) if epoch_rewards else 0.0
            policy_avg_rewards[policy_name] = avg_epoch_reward
            if avg_epoch_reward > highest_avg_reward:
                highest_avg_reward = avg_epoch_reward
                best_policy_name = policy_name
            epoch_predicted_k = all_policy_sampled_k_processed[policy_name]
            # Advantages are not stored here anymore, calculate average advantage if needed for logging from update step
            epoch_means = all_policy_means[policy_name]
            epoch_log_variances = all_policy_log_variances[policy_name]
            avg_epoch_predicted_top_k = np.mean(epoch_predicted_k) if epoch_predicted_k else 0
            epoch_predicted_top_k_std = np.std(epoch_predicted_k) if epoch_predicted_k else 0
            avg_epoch_mean = np.mean(epoch_means) if epoch_means else 0
            avg_epoch_log_variance = np.mean(epoch_log_variances) if epoch_log_variances else 0
            epoch_metrics[f"{policy_name}/epoch_average_reward"] = avg_epoch_reward
            epoch_metrics[f"{policy_name}/epoch_average_predicted_top_k"] = avg_epoch_predicted_top_k
            epoch_metrics[f"{policy_name}/epoch_predicted_top_k_std"] = epoch_predicted_top_k_std
            epoch_metrics[f"{policy_name}/epoch_average_mean"] = avg_epoch_mean
            epoch_metrics[f"{policy_name}/epoch_average_log_variance"] = avg_epoch_log_variance
            group_avg_reward += np.sum(epoch_rewards)
            total_valid_rewards += len(epoch_rewards)
            print(f"    {policy_name}: Avg Reward = {avg_epoch_reward:.4f}, Avg Predicted Top K = {avg_epoch_predicted_top_k:.2f}, Predicted Top K Std = {epoch_predicted_top_k_std:.2f}")

        group_avg_reward = group_avg_reward / total_valid_rewards if total_valid_rewards > 0 else 0.0
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Best performing policy is {best_policy_name} with Avg Reward = {highest_avg_reward:.4f}")
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Group Average Reward = {group_avg_reward:.4f}")
        epoch_metrics["epoch/best_policy"] = best_policy_name
        epoch_metrics["epoch/highest_avg_reward"] = highest_avg_reward
        epoch_metrics["epoch/group_average_reward"] = group_avg_reward
        if wandb.run is not None:  # Skip logging when W&B initialization failed
            wandb.log(epoch_metrics, step=epoch + 1)

        # --- Implement Policy Update Phase ---
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Starting policy update...")
        for policy_idx, policy in enumerate(policy_group):
            policy_name = f"policy_{policy_idx}"
            optimizer = optimizers[policy_idx]

            # Retrieve collected data (rewards, continuous samples, and questions)
            policy_rewards_list = all_policy_rewards[policy_name]
            policy_continuous_samples = all_policy_continuous_samples[policy_name] # List of tensors
            policy_questions = all_policy_questions[policy_name] # List of strings

            # Convert rewards to tensor (doesn't need grad for standard policy gradient)
            policy_rewards = torch.tensor(policy_rewards_list, dtype=torch.float32)

            # Recalculate advantages using the collected rewards
            baseline = calculate_baseline(policy_rewards) # Calculate baseline from collected rewards
            policy_advantages = policy_rewards - baseline # Calculate advantages

            # Re-run forward pass on the stored questions to get log_probs that track gradients
            if not policy_questions:
                 print(f"    {policy_name}: No questions available for update.")
                 if wandb.run is not None: wandb.log({f"{policy_name}/policy_loss": 0.0,}, step=epoch + 1)
                 continue # Skip update if no questions

            # Ensure policy is in training mode before forward pass for update
            policy.train()
            mean_output_update, log_variance_output_update = policy(policy_questions)

            # Stack the collected continuous samples into a single tensor
            if not policy_continuous_samples:
                 print(f"    {policy_name}: No continuous samples available for update.")
                 if wandb.run is not None: wandb.log({f"{policy_name}/policy_loss": 0.0,}, step=epoch + 1)
                 continue # Skip update if no samples

            policy_continuous_samples_tensor = torch.stack(policy_continuous_samples)

            # Recalculate log probabilities using the fresh policy outputs and the stored continuous samples
            policy_log_probs = calculate_log_prob(mean_output_update, log_variance_output_update, policy_continuous_samples_tensor)


            # Filter out samples where original reward was 0 (likely due to errors in RAG execution)
            # We use the original rewards to determine which samples were valid.
            valid_indices = policy_rewards != 0
            if torch.sum(valid_indices) > 0:
                valid_log_probs = policy_log_probs[valid_indices]
                valid_advantages = policy_advantages[valid_indices]

                # Calculate the policy loss using the valid log probabilities and advantages
                policy_loss = -torch.mean(valid_log_probs * valid_advantages)

                # Perform optimizer.zero_grad() for the current policy's optimizer
                optimizer.zero_grad()

                # Call policy_loss.backward() to compute gradients
                policy_loss.backward()

                # Call optimizer.step() to update the current policy's parameters
                optimizer.step()

                # Log the policy loss for each policy after its update
                if wandb.run is not None:
                    wandb.log({
                        f"{policy_name}/policy_loss": policy_loss.item(),
                    }, step=epoch + 1)

            else:
                if wandb.run is not None:
                    wandb.log({
                        f"{policy_name}/policy_loss": 0.0, # Log 0 loss if no valid samples for update
                    }, step=epoch + 1)

        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Policy update completed.")

    print("Training finished.")

else:
    print("Training skipped because CombinedFinancial10k index was not loaded.")

# Finish the Weights & Biases run
if wandb.run is not None:
    wandb.finish()

**Explanation of training results logged in W&B**

**Epoch 27/100, Avg Loss: 0.0183, Avg Reward: 0.3199, Avg Predicted Top K: 2.80, Predicted Top K Std: 0.40**

**Epoch** 27/100: This indicates that the training is currently on the 27th iteration (epoch) out of a planned total of 100 epochs.

**Avg Loss**: 0.0183: This is the average policy loss calculated across all the batches in Epoch 27. In reinforcement learning, the policy loss is the quantity that the optimization algorithm (like the one used in GRPO) tries to minimize; here it is the negative mean of log-probability times advantage. Minimizing it moves the policy in a direction expected to increase rewards, but because the advantage is the reward minus a mean baseline, the magnitude 0.0183 is not a direct measure of policy quality. Read it alongside the reward trend across epochs rather than on its own.
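
As a rough illustration of how such a loss value arises, the sketch below recomputes a loss of the form `-mean(log_prob * advantage)` for a few made-up samples. It uses only the standard library in place of `torch`, and the helper names (`gaussian_log_prob`, `policy_gradient_loss`) and all numbers are hypothetical, not taken from the logged run.

```python
import math

def gaussian_log_prob(x, mean, log_variance):
    """Log-density of N(mean, exp(log_variance)) evaluated at x."""
    variance = math.exp(log_variance)
    return -0.5 * (math.log(2 * math.pi) + log_variance + (x - mean) ** 2 / variance)

def policy_gradient_loss(samples, means, log_variances, rewards):
    """-mean(log_prob * advantage), using the mean reward as the baseline."""
    baseline = sum(rewards) / len(rewards)
    terms = [
        gaussian_log_prob(x, m, lv) * (r - baseline)
        for x, m, lv, r in zip(samples, means, log_variances, rewards)
    ]
    return -sum(terms) / len(terms)

# Hypothetical sampled actions, distribution parameters, and rewards.
samples = [2.7, 3.1, 2.9, 3.4]
means = [3.0, 3.0, 3.0, 3.0]
log_variances = [-1.0, -1.0, -1.0, -1.0]
rewards = [0.30, 0.35, 0.28, 0.33]

loss = policy_gradient_loss(samples, means, log_variances, rewards)
print(f"loss = {loss:.4f}")

# When every reward is identical, all advantages are zero and so is the loss,
# regardless of whether those rewards are high or low.
flat_loss = policy_gradient_loss(samples, means, log_variances, [0.5] * 4)
print(f"flat-reward loss magnitude = {abs(flat_loss):.4f}")
```

The second call shows why a small loss on its own says little: with a mean baseline the loss can sit near zero at any reward level.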

**Avg Reward**: 0.3199: This is the average cosine similarity reward obtained during Epoch 27. The reward is calculated by running the RAG system with the actions (predicted `similarity_top_k` values) sampled by the policy network for each question in the dataset and then comparing the generated answers to the ground truth. An average reward of 0.3199 means that, averaged across all samples in this epoch, the generated answers had a cosine similarity of approximately 0.32 with their respective ground truth answers. Higher values indicate better performance, with the caveat that the reward is computed on TF-IDF vectors, so it measures word overlap (lexical similarity) rather than semantic similarity.
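
To make the lexical nature of this kind of reward concrete, here is a small standard-library sketch. It is a simplification: it uses raw term counts rather than the TF-IDF weighting applied by `cosine_similarity_reward`, and the function name `bag_of_words_cosine` and the example sentences are made up for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sentences: same words reordered vs. a paraphrase with no shared words.
reference = "Lyft reported a net loss in 2023"
same_words = "In 2023 Lyft reported a net loss"
paraphrase = "The company lost money last year"

print(bag_of_words_cosine(reference, same_words))
print(bag_of_words_cosine(reference, paraphrase))
```

The reordered sentence scores essentially 1 while the paraphrase scores 0, even though both convey a similar meaning. An embedding-based similarity would be one option if semantic agreement is what the reward should capture.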

**Avg Predicted Top K**: 2.80: This is the average value of `similarity_top_k` predicted (or sampled and then processed into an integer) by the policy network across all questions in Epoch 27. This metric tells you what the policy network is generally choosing for the number of documents to retrieve. An average of 2.80 suggests the policy is typically selecting around 3 documents.

**Predicted Top K Std**: 0.40: This is the standard deviation of the similarity_top_k values predicted (or sampled and processed) by the policy network across all questions in Epoch 27. The standard deviation measures the dispersion or spread of the predicted values. A standard deviation of 0.40 indicates that the predicted similarity_top_k values in this epoch were relatively close to the average (2.80), meaning the policy's predictions for similarity_top_k didn't vary widely within this epoch.
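
For intuition, the following sketch shows one hypothetical set of integer `similarity_top_k` choices that would produce exactly these two numbers. The values are invented for illustration and are not the actions logged in the run; `statistics.pstdev` is used because it is the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

# Hypothetical integer top-k choices for five questions in one epoch.
predicted_top_k = [3, 3, 3, 3, 2]

average_top_k = statistics.fmean(predicted_top_k)
# Population standard deviation (ddof=0), the same convention as np.std's default.
top_k_std = statistics.pstdev(predicted_top_k)

print(f"Avg Predicted Top K: {average_top_k:.2f}, Predicted Top K Std: {top_k_std:.2f}")
# prints "Avg Predicted Top K: 2.80, Predicted Top K Std: 0.40"
```

In other words, a standard deviation of 0.40 around a mean of 2.80 is consistent with the policy picking 3 most of the time and 2 occasionally.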

## Summary:

### Data Analysis Key Findings

*   The training process successfully logged key metrics at both the batch and epoch levels to Weights & Biases, including policy loss, average reward, average predicted `similarity_top_k`, average advantage, average mean, and average log-variance of the predicted distribution.
*   Hyperparameters such as learning rate (1e-4), batch size (8), and number of epochs (100) were successfully logged to the Weights & Biases config.
*   The training loop executed for 100 epochs, with console output confirming the progress and epoch-average loss and reward.
*   A new epoch-level metric, the standard deviation of the predicted `similarity_top_k`, was successfully added and logged to provide insight into the variability of the policy's actions.

### Completed Steps

*   Visualized the logged metrics in the Weights & Biases dashboard to analyze trends, identify correlations between metrics (e.g., reward and predicted top\_k), and diagnose potential training issues such as instability or convergence problems.
*   Implemented the actual RAG system reward calculation in place of the dummy reward function, allowing the policy to learn based on real retrieval performance.


# Task
Implement the true GRPO update mechanism in the provided notebook, incorporating a reference policy, policy divergence calculation, and a trust region or penalty based on the best performing policy.
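
Since `RAGPolicyNetwork` outputs the mean and log-variance of a one-dimensional Gaussian, one possible way to measure policy divergence is the closed-form KL divergence between two normal distributions, added to the loss as a penalty. The sketch below is an assumption about how this could look, not the notebook's implementation; the function names and the coefficient `beta` are hypothetical.

```python
import math

def gaussian_kl(mean_p, log_var_p, mean_q, log_var_q):
    """KL(P || Q) for P = N(mean_p, exp(log_var_p)) and Q = N(mean_q, exp(log_var_q))."""
    var_p = math.exp(log_var_p)
    var_q = math.exp(log_var_q)
    return 0.5 * (log_var_q - log_var_p + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def penalized_loss(policy_loss, kl_to_reference, beta=0.1):
    """Policy-gradient loss plus a KL penalty; beta is a hypothetical coefficient."""
    return policy_loss + beta * kl_to_reference

# Identical distributions have zero divergence.
print(gaussian_kl(3.0, -1.0, 3.0, -1.0))

# Unit-variance Gaussians whose means differ by 1 have KL = 0.5.
kl = gaussian_kl(1.0, 0.0, 0.0, 0.0)
print(kl, penalized_loss(0.2, kl, beta=0.1))
```

A penalty of this form discourages each policy from drifting far from the reference between updates; a hard trust region (rejecting or clipping updates whose divergence exceeds a threshold) would be an alternative design.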

## Define a reference policy

### Subtask:
Establish a reference policy, which is typically the average policy across the group or the best-performing policy from the previous iteration. This will serve as a benchmark for calculating policy divergence.


**Reasoning**:
Implement the first instruction of the subtask to establish a reference policy by identifying the best performing policy's state dictionary at the beginning of each epoch and handling the initial epoch case.
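
A minimal sketch of that selection step is shown here, under the assumption that the per-policy average rewards from the previous epoch are available as a dictionary. The function name `select_reference_state` is hypothetical, and plain dictionaries stand in for the result of `policy.state_dict()` so the example runs without `torch`.

```python
import copy

def select_reference_state(policy_avg_rewards, policy_states, previous_reference=None):
    """Return a frozen copy of the best policy's state.

    Falls back to previous_reference when no rewards exist yet,
    for example before the first epoch has been evaluated.
    """
    if not policy_avg_rewards:
        return previous_reference
    best_name = max(policy_avg_rewards, key=policy_avg_rewards.get)
    # Deep copy so later optimizer steps on the live policy do not alter the reference.
    return copy.deepcopy(policy_states[best_name])

# Hypothetical stand-ins for {name: policy.state_dict()} and the epoch's average rewards.
policy_states = {"policy_0": {"w": [0.1, 0.2]}, "policy_1": {"w": [0.3, 0.4]}}
avg_rewards = {"policy_0": 0.31, "policy_1": 0.35}

reference = select_reference_state(avg_rewards, policy_states)
policy_states["policy_1"]["w"][0] = 9.9  # simulate a later parameter update
print(reference)  # the snapshot still holds the pre-update values
```

With real models the same pattern would apply to `copy.deepcopy(policy.state_dict())`; the deep copy matters because a state dict otherwise shares storage with the live parameters.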



In [None]:
# --- Training Loop ---
print("Starting policy group training with policy evaluation and logging...")

if 'global_step' not in globals(): global_step = 0
# Initialize reference_policy_state_dict before the loop
reference_policy_state_dict = None

# Check if CombinedFinancial10k_index was loaded successfully before starting training
if CombinedFinancial10k_index is not None:
    for epoch in range(NUM_EPOCHS):
        # Data structures to collect data across policies for this iteration/epoch
        all_policy_rewards = {}
        all_policy_continuous_samples = {} # Store sampled actions so log-probs can be recomputed in the update phase
        all_policy_sampled_k_processed = {}
        all_policy_means = {}
        all_policy_log_variances = {}
        all_policy_losses = {}
        all_policy_questions = {} # Store questions for re-running forward pass

        # --- Data Collection Phase ---
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Collecting data...")
        for policy_idx, policy in enumerate(policy_group):
            policy.train()
            policy_name = f"policy_{policy_idx}"
            all_policy_rewards[policy_name] = []
            all_policy_continuous_samples[policy_name] = [] # Initialize storage for continuous samples
            all_policy_sampled_k_processed[policy_name] = []
            all_policy_means[policy_name] = []
            all_policy_log_variances[policy_name] = []
            all_policy_losses[policy_name] = []
            all_policy_questions[policy_name] = [] # Initialize storage for questions

            for batch_idx, (batch_questions, batch_ground_truth) in enumerate(train_dataloader):
                if not batch_questions: continue

                # Forward pass to get the distribution parameters used for sampling
                mean_output, log_variance_output = policy(list(batch_questions))

                batch_sampled_k_processed = []
                batch_sampled_k_continuous = []
                batch_rewards = []

                for i in range(len(batch_questions)):
                    # sample_action_and_continuous function already returns tensors
                    sampled_k_processed_item, sampled_k_continuous_item = sample_action_and_continuous(mean_output[i], log_variance_output[i])
                    batch_sampled_k_processed.append(sampled_k_processed_item.detach()) # Detach processed_action for plotting/logging
                    batch_sampled_k_continuous.append(sampled_k_continuous_item) # Normal.sample() carries no gradient; kept to recompute log-probs later

                    question = batch_questions[i]
                    ground_truth_answer = batch_ground_truth[i]
                    predicted_top_k_int = max(1, int(sampled_k_processed_item.item()))

                    try:
                        # Use CombinedFinancial10k_index here
                        policy_controlled_engine = CombinedFinancial10k_index.as_query_engine(similarity_top_k=predicted_top_k_int)
                        generated_answer = policy_controlled_engine.query(question).response
                        reward = cosine_similarity_reward(generated_answer, ground_truth_answer)
                        batch_rewards.append(reward)
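                    except Exception as e:
                        # A failed RAG call or reward computation is recorded as reward 0.0;
                        # the update phase filters zero-reward samples out.
                        print(f"    Error during RAG execution or reward calculation for question '{question}': {e}")
                        batch_rewards.append(0.0)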

                # Convert the batch rewards to a tensor (an empty list yields an empty float tensor)
                batch_rewards_tensor = torch.tensor(batch_rewards, dtype=torch.float32)

                # Stack the per-question continuous samples into a single tensor
                if batch_sampled_k_continuous:
                    batch_sampled_k_continuous_tensor = torch.stack(batch_sampled_k_continuous)
                else:
                    batch_sampled_k_continuous_tensor = torch.tensor([], dtype=torch.float32)


                all_policy_rewards[policy_name].extend(batch_rewards_tensor.tolist())
                all_policy_sampled_k_processed[policy_name].extend([k.item() for k in batch_sampled_k_processed])
                all_policy_continuous_samples[policy_name].extend(batch_sampled_k_continuous) # Store continuous samples

                # Calculate log probabilities for the batch - needs to be on tensors that track gradients
                # We will recalculate log_probs in the update step using the stored continuous samples
                # and a fresh forward pass, so no need to store log_probs here.

                all_policy_means[policy_name].extend(mean_output.detach().tolist()) # Detach for storage/logging
                all_policy_log_variances[policy_name].extend(log_variance_output.detach().tolist()) # Detach for storage/logging

                # Store questions for re-running forward pass in update step
                all_policy_questions[policy_name].extend(list(batch_questions))


        global_step += 1

        # --- Implement Group Performance Evaluation and Logging ---
        policy_avg_rewards = {}
        best_policy_name = None
        highest_avg_reward = -float('inf')
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Evaluating policy performance and logging epoch metrics...")
        epoch_metrics = {}
        group_avg_reward = 0.0
        total_valid_rewards = 0
        for policy_idx, policy in enumerate(policy_group):
            policy_name = f"policy_{policy_idx}"
            epoch_rewards = all_policy_rewards[policy_name]
            avg_epoch_reward = np.mean(epoch_rewards) if epoch_rewards else 0.0
            policy_avg_rewards[policy_name] = avg_epoch_reward
            if avg_epoch_reward > highest_avg_reward:
                highest_avg_reward = avg_epoch_reward
                best_policy_name = policy_name
            epoch_predicted_k = all_policy_sampled_k_processed[policy_name]
            epoch_means = all_policy_means[policy_name]
            epoch_log_variances = all_policy_log_variances[policy_name]
            avg_epoch_predicted_top_k = np.mean(epoch_predicted_k) if epoch_predicted_k else 0
            # Calculate standard deviation of predicted top_k here
            avg_epoch_predicted_top_k_std = np.std(epoch_predicted_k) if epoch_predicted_k else 0
            avg_epoch_mean = np.mean(epoch_means) if epoch_means else 0
            avg_epoch_log_variance = np.mean(epoch_log_variances) if epoch_log_variances else 0
            epoch_metrics[f"{policy_name}/epoch_average_reward"] = avg_epoch_reward
            epoch_metrics[f"{policy_name}/epoch_average_predicted_top_k"] = avg_epoch_predicted_top_k
            epoch_metrics[f"{policy_name}/epoch_predicted_top_k_std"] = avg_epoch_predicted_top_k_std
            epoch_metrics[f"{policy_name}/epoch_average_mean"] = avg_epoch_mean
            epoch_metrics[f"{policy_name}/epoch_average_log_variance"] = avg_epoch_log_variance
            group_avg_reward += np.sum(epoch_rewards)
            total_valid_rewards += len(epoch_rewards)
            print(f"    {policy_name}: Avg Reward = {avg_epoch_reward:.4f}, Avg Predicted Top K = {avg_epoch_predicted_top_k:.2f}, Predicted Top K Std = {avg_epoch_predicted_top_k_std:.2f}")

        group_avg_reward = group_avg_reward / total_valid_rewards if total_valid_rewards > 0 else 0.0
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Best performing policy is {best_policy_name} with Avg Reward = {highest_avg_reward:.4f}")
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Group Average Reward = {group_avg_reward:.4f}")
        epoch_metrics["epoch/best_policy"] = best_policy_name
        epoch_metrics["epoch/highest_avg_reward"] = highest_avg_reward
        epoch_metrics["epoch/group_average_reward"] = group_avg_reward
        wandb.log(epoch_metrics, step=epoch + 1)

        # --- Establish Reference Policy ---
        # state_dict() returns references to the live parameter tensors, so clone them;
        # otherwise the snapshot would change as soon as the optimizer steps below.
        if epoch == 0:
            # First epoch: use the first policy as the reference
            reference_policy_state_dict = {k: v.detach().clone() for k, v in policy_group[0].state_dict().items()}
            print("  Epoch 1: Initialized reference policy from policy_group[0].")
        else:
            # Later epochs: use the best performing policy from this epoch's
            # evaluation (best_policy_name is set in the evaluation loop above)
            best_policy_idx = int(best_policy_name.split('_')[1])
            reference_policy_state_dict = {k: v.detach().clone() for k, v in policy_group[best_policy_idx].state_dict().items()}
            print(f"  Epoch {epoch+1}: Updated reference policy from best performing policy ({best_policy_name}).")

        # --- Implement Policy Update Phase ---
        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Starting policy update...")
        for policy_idx, policy in enumerate(policy_group):
            policy_name = f"policy_{policy_idx}"
            optimizer = optimizers[policy_idx]

            # Retrieve collected data (rewards, continuous samples, and questions)
            policy_rewards_list = all_policy_rewards[policy_name]
            policy_continuous_samples = all_policy_continuous_samples[policy_name] # List of tensors
            policy_questions = all_policy_questions[policy_name] # List of strings

            # Convert rewards to tensor (doesn't need grad for standard policy gradient)
            policy_rewards = torch.tensor(policy_rewards_list, dtype=torch.float32)

            # Recalculate advantages using the collected rewards
            baseline = calculate_baseline(policy_rewards) # Calculate baseline from collected rewards
            policy_advantages = policy_rewards - baseline # Calculate advantages

            # Re-run forward pass on the stored questions to get log_probs that track gradients
            if not policy_questions:
                print(f"    {policy_name}: No questions available for update.")
                wandb.log({f"{policy_name}/policy_loss": 0.0}, step=epoch + 1)
                continue # Skip update if no questions

            # Ensure policy is in training mode before forward pass for update
            policy.train()
            mean_output_update, log_variance_output_update = policy(policy_questions)

            # Stack the collected continuous samples into a single tensor
            if not policy_continuous_samples:
                print(f"    {policy_name}: No continuous samples available for update.")
                wandb.log({f"{policy_name}/policy_loss": 0.0}, step=epoch + 1)
                continue # Skip update if no samples

            policy_continuous_samples_tensor = torch.stack(policy_continuous_samples)

            # Recalculate log probabilities using the fresh policy outputs and the stored continuous samples
            policy_log_probs = calculate_log_prob(mean_output_update, log_variance_output_update, policy_continuous_samples_tensor)


            # Filter out samples whose reward is exactly 0. Failed RAG calls are recorded as 0.0
            # during collection, so this drops them. Note that an answer with a genuine
            # cosine similarity of zero is dropped as well.
            valid_indices = policy_rewards != 0
            if torch.sum(valid_indices) > 0:
                valid_log_probs = policy_log_probs[valid_indices]
                valid_advantages = policy_advantages[valid_indices]

                # Calculate the policy loss using the valid log probabilities and advantages
                policy_loss = -torch.mean(valid_log_probs * valid_advantages)

                # Standard update: clear old gradients, backpropagate, then step this policy's optimizer
                optimizer.zero_grad()
                policy_loss.backward()
                optimizer.step()

                # Log the policy loss for each policy after its update
                wandb.log({
                    f"{policy_name}/policy_loss": policy_loss.item(),
                }, step=epoch + 1)

            else:
                wandb.log({
                    f"{policy_name}/policy_loss": 0.0, # Log 0 loss if no valid samples for update
                }, step=epoch + 1)

        print(f"  Epoch {epoch+1}/{NUM_EPOCHS}: Policy update completed.")


    print("Training finished.")

else:
    print("Training skipped because CombinedFinancial10k index was not loaded.")

# Finish the Weights & Biases run
if wandb.run is not None:
    wandb.finish()
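
The update phase above relies on `calculate_log_prob` and `calculate_baseline`, which are not shown in this cell. As a rough, dependency-free sketch of what that phase computes, the example below works out the surrogate loss for a few samples in plain Python. It assumes a Gaussian policy parameterized by mean and log-variance and a mean-reward baseline; both are assumptions, and the notebook's own helpers may differ. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_log_prob(mean, log_variance, sample):
    """Log-density of `sample` under N(mean, exp(log_variance)) (assumed policy form)."""
    variance = math.exp(log_variance)
    return -0.5 * (math.log(2.0 * math.pi) + log_variance + (sample - mean) ** 2 / variance)

def mean_baseline_advantages(rewards):
    """Subtract the mean reward, so above-average samples get a positive advantage."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def surrogate_loss(log_probs, advantages):
    """Negative mean of log_prob * advantage, the quantity minimized in the update."""
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)

# Hypothetical values for three questions: policy outputs, sampled k, and rewards.
means = [3.0, 5.0, 4.0]
log_variances = [0.0, 0.0, 0.0]
samples = [3.5, 4.0, 4.0]
rewards = [0.6, 0.2, 0.4]

log_probs = [gaussian_log_prob(m, lv, s) for m, lv, s in zip(means, log_variances, samples)]
advantages = mean_baseline_advantages(rewards)
loss = surrogate_loss(log_probs, advantages)
print(f"advantages={advantages}, loss={loss:.4f}")
```

Minimizing this loss raises the log-probability of samples with positive advantage and lowers it for samples with negative advantage. A sample whose reward equals the baseline contributes nothing to the gradient.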