Importing the libraries 

In [9]:
!pip3 install --upgrade huggingface_hub
!pip3 install --upgrade sentence-transformers



Collecting sentence-transformers
  Using cached sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Using cached sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 2.2.2
    Uninstalling sentence-transformers-2.2.2:
      Successfully uninstalled sentence-transformers-2.2.2
Successfully installed sentence-transformers-3.2.1


In [None]:
#for macos installation of libraries 
!pip3 install requests numpy faiss-cpu scikit-learn rouge-score nltk

#for windows
%pip install requests numpy faiss-cpu scikit-learn rouge-score nltk


In [11]:
import json
import numpy as np
import os
import faiss
from pdfminer.high_level import extract_text
from sentence_transformers import SentenceTransformer
# from huggingface_hub import hf_hub_download



In [12]:
# Load a pre-trained embedding model
# This line initializes the SentenceTransformer model 'all-MiniLM-L6-v2', a compact, 
# efficient model for creating sentence embeddings that capture semantic meaning.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embedding and Text Extraction Utilities

This script provides utilities for generating text embeddings, extracting text from PDF files, and loading metadata from JSONL files. It leverages the `SentenceTransformer` model to create embeddings that represent the input text, making it suitable for similarity search tasks. Additionally, it uses `extract_text` for reading content from PDFs and handles any file-related errors gracefully. The `load_metadata` function reads JSONL metadata for further document processing.

## Code Overview

1. **generate_embeddings(text)**: Creates embeddings for input text using a pre-trained `SentenceTransformer` model.
2. **extract_pdf_text(pdf_path)**: Extracts text from the specified PDF file; handles errors if extraction fails.
3. **load_metadata(jsonl_file)**: Reads metadata from a JSONL file and loads it as a list of JSON objects for use in downstream tasks.


In [48]:
def generate_embeddings(text):
    """Generate embeddings for the given text."""
    
    return model.encode(text, convert_to_tensor=True)

def extract_pdf_text(pdf_path):
    """Extract text from a PDF file."""
    try:
        return extract_text(pdf_path)
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return ""

def load_metadata(jsonl_file):
    """Load metadata from a JSONL file."""
    try:
        with open(jsonl_file, 'r') as f:
            return [json.loads(line) for line in f]
    except Exception as e:
        print(f"Error loading metadata from {jsonl_file}: {e}")
        return []


In [49]:
# Define your paths
jsonl_file = "/Users/innovapathinc/Desktop/Response_eval/data/financebench_document_information.jsonl"
pdf_files = [
    "/Users/innovapathinc/Desktop/Response_eval/data/3M_2015_10K.pdf",
    "/Users/innovapathinc/Desktop/Response_eval/data/3M_2016_10K.pdf"
]

# Load metadata from JSONL file
metadata = load_metadata(jsonl_file)
print("Loaded metadata:", metadata)


Loaded metadata: [{'doc_name': '3M_2015_10K', 'company': '3M', 'gics_sector': 'Industrials', 'doc_type': '10k', 'doc_period': 2015, 'doc_link': 'https://investors.3m.com/financials/sec-filings/content/0001558370-16-003162/0001558370-16-003162.pdf'}, {'doc_name': '3M_2016_10K', 'company': '3M', 'gics_sector': 'Industrials', 'doc_type': '10k', 'doc_period': 2016, 'doc_link': 'https://investors.3m.com/financials/sec-filings/content/0001558370-17-000479/0001558370-17-000479.pdf'}, {'doc_name': '3M_2017_10K', 'company': '3M', 'gics_sector': 'Industrials', 'doc_type': '10k', 'doc_period': 2017, 'doc_link': 'https://investors.3m.com/financials/sec-filings/content/0001558370-18-000535/0001558370-18-000535.pdf'}, {'doc_name': '3M_2018_10K', 'company': '3M', 'gics_sector': 'Industrials', 'doc_type': '10k', 'doc_period': 2018, 'doc_link': 'https://investors.3m.com/financials/sec-filings/content/0001558370-19-000470/0001558370-19-000470.pdf'}, {'doc_name': '3M_2022_10K', 'company': '3M', 'gics_sec

In [56]:
# Initialize FAISS index
index = faiss.IndexFlatL2(384)  # Dimension should match your embedding size

# Extract text from all PDF files and store embeddings
for pdf_file in pdf_files:
    pdf_name = os.path.basename(pdf_file)
    pdf_text = extract_pdf_text(pdf_file)
    embedding = generate_embeddings(pdf_text).cpu().numpy()  # Convert to numpy array
    index.add(np.array([embedding]))  # Add to FAISS index
    print(f"Stored embedding for {pdf_name} in the index.")


Stored embedding for 3M_2015_10K.pdf in the index.
Stored embedding for 3M_2016_10K.pdf in the index.


In [60]:
# Function to retrieve the top K most similar documents for a given query
def retrieve_documents(query, index, k=3):
    query_embedding = generate_embeddings(query).cpu().numpy()
    distances, indices = index.search(np.array([query_embedding]), k)
    return indices[0], distances[0]  # Return indices and distances


In [61]:
# Example usage
query = "What are the key financials of 3M in 2020?"
top_docs, top_distances = retrieve_documents(query, index)

In [62]:
# Display the top retrieved document indices and distances
print(f"Top retrieved documents indices: {top_docs}")
print(f"Distances: {top_distances}")

# Print the names of the top retrieved documents using the metadata
for doc_index in top_docs:
    if doc_index < len(metadata):  # Ensure index is within bounds
        doc_name = metadata[doc_index]['doc_name']
        print(f"Document: {doc_name}.pdf")

Top retrieved documents indices: [ 0  1 -1]
Distances: [1.4042511e+00 1.4111483e+00 3.4028235e+38]
Document: 3M_2015_10K.pdf
Document: 3M_2016_10K.pdf
Document: 3M_2022_10K.pdf


In [65]:
import requests
import numpy as np

# Set your Mistral API URL and API key
mistral_api_url = "https://api.mistral.ai/v1/chat/completions"
# Update with the actual endpoint
mistral_api_key = "KoxLAEgGVZeEMPv3AJ3KSA4aJV6ZxPmo"

def generate_response(context, query):
    """Generate a response using Mistral AI."""
    prompt = f"Based on the following content: {context}\nAnswer the query: {query}"

    headers = {
        "Authorization": f"Bearer {mistral_api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "mistral-small-latest",  # Replace with the actual model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 300
    }

    response = requests.post(mistral_api_url, headers=headers, json=payload)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

def retrieve_documents(query, index, metadata, k=3):
    """Retrieve the top K most similar documents for a given query."""
    query_embedding = generate_embeddings(query).cpu().numpy()  # Ensure the embedding is on CPU
    distances, indices = index.search(np.array([query_embedding]), k)

    # Gather document contexts based on the retrieved indices
    contexts = []
    for idx in indices[0]:
        if idx < len(metadata):  # Check to avoid index out of range
            contexts.append(f"{metadata[idx]['doc_name']} (Link: {metadata[idx]['doc_link']})")

    return contexts, distances[0]  # Return contexts and distances

# Example usage
query = "Who are the executive officers in 3m comapany in 2015 ?"
top_docs, top_distances = retrieve_documents(query, index, metadata)

# Prepare the context for the response
context = "\n".join(top_docs)
response = generate_response(context, query)

print("Response from Mistral AI:")
print(response)


Response from Mistral AI:
To find the executive officers of 3M Company in 2015, you would need to refer to the 3M_2015_10K document. Here is a general guide on how to locate this information:

1. **Open the 3M_2015_10K document** linked in the provided content.
2. **Navigate to the section on executive officers**. This section is typically found in the beginning of the document, often in the "Item 1" or "Item 4" section, which details information about the company's board of directors and executive officers.
3. **Look for a heading such as "Executive Officers" or "Management"**.

The specific names and positions of the executive officers should be listed there. Here is a general format you might expect:

- **Chief Executive Officer (CEO)**
- **Chief Financial Officer (CFO)**
- **Chief Operating Officer (COO)**
- **Other executive positions (e.g., Chief Legal Officer, Chief Technical Officer, etc.)**

For a precise list, you would need to consult the document directly.


In [64]:
import requests
import numpy as np
import time
from sklearn.metrics import precision_score
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Set your Mistral API URL and API key
mistral_api_url = "https://api.mistral.ai/v1/chat/completions"
mistral_api_key = "KoxLAEgGVZeEMPv3AJ3KSA4aJV6ZxPmo"

def generate_response(context, query):
    """Generate a response using Mistral AI."""
    prompt = f"Based on the following content: {context}\nAnswer the query: {query}"

    headers = {
        "Authorization": f"Bearer {mistral_api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "mistral-small-latest",  # Replace with the actual model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 300
    }

    response = requests.post(mistral_api_url, headers=headers, json=payload)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

def retrieve_documents(query, index, metadata, k=3):
    """Retrieve the top K most similar documents for a given query."""
    query_embedding = generate_embeddings(query).cpu().numpy()  # Ensure the embedding is on CPU
    distances, indices = index.search(np.array([query_embedding]), k)

    # Gather document contexts based on the retrieved indices
    contexts = []
    for idx in indices[0]:
        if idx < len(metadata):  # Check to avoid index out of range
            contexts.append(f"{metadata[idx]['doc_name']} (Link: {metadata[idx]['doc_link']})")

    return contexts, distances[0]  # Return contexts and distances

def evaluate_precision_at_k(retrieved_docs, relevant_docs, k=3):
    """Calculate Precision@K."""
    retrieved_set = set(retrieved_docs[:k])
    relevant_set = set(relevant_docs)
    true_positives = len(retrieved_set.intersection(relevant_set))

    precision = true_positives / k
    return precision

def evaluate_rouge(reference, generated):
    """Evaluate ROUGE score."""
    rouge = Rouge()
    scores = rouge.get_scores(generated, reference, avg=True)
    return scores

def evaluate_bleu(reference, generated):
    """Evaluate BLEU score."""
    reference = reference.split()  # Tokenize the reference
    generated = generated.split()  # Tokenize the generated response
    score = sentence_bleu([reference], generated)
    return score

def evaluate_factual_accuracy(response, ground_truth):
    """Check if the response contains valid information."""
    # This is a simplistic check; you can implement a more sophisticated one.
    return any(term in response.lower() for term in ground_truth)

# Example usage
query = "What are the key financials of 3M in 2015?"
top_docs, top_distances = retrieve_documents(query, index, metadata)

# Prepare the context for the response
context = "\n".join(top_docs)

# Measure latency and generate response
start_time = time.time()
response = generate_response(context, query)
latency = time.time() - start_time

print("Response from Mistral AI:")
print(response)

# Define the ground truth and relevant documents for evaluation
ground_truth = "3M's key financials for 2015 include revenue, net income, and earnings per share."  # Example ground truth
def extract_pdf_names(pdf_files):
    return [os.path.splitext(os.path.basename(pdf))[0] for pdf in pdf_files]

# Extract actual document names from the PDF file list
relevant_docs = extract_pdf_names(pdf_files)

print("Relevant Documents from actual PDFs:")
print(relevant_docs)
# Evaluate Precision@K
precision_at_k = evaluate_precision_at_k(top_docs, relevant_docs)
print(f"Precision@K: {precision_at_k:.2f}")

# Evaluate ROUGE
rouge_scores = evaluate_rouge(ground_truth, response)
print(f"ROUGE scores: {rouge_scores}")

# Evaluate BLEU
bleu_score = evaluate_bleu(ground_truth, response)
print(f"BLEU score: {bleu_score:.2f}")

# Evaluate Factual Accuracy
factual_accuracy = evaluate_factual_accuracy(response, ground_truth)
print(f"Factual Accuracy: {'Valid' if factual_accuracy else 'Invalid'}")

# Print latency
print(f"Response Latency: {latency:.2f} seconds")


Response from Mistral AI:
To provide you with the key financials of 3M for the year 2015, I would need to access the content of the 3M_2015_10K document. However, as I don't have direct access to external links or files, you can find the key financial information in the "Financial Statements and Supplementary Data" section of the 3M_2015_10K document available at the provided link.

Generally, the key financials you would look for include:

1. **Revenue**: Total sales or net revenues.
2. **Net Income**: Profit after all expenses and taxes.
3. **Earnings per Share (EPS)**: Net income divided by the number of outstanding shares.
4. **Total Assets**: Overall value of what the company owns.
5. **Total Liabilities**: Overall value of what the company owes.
6. **Shareholders' Equity**: The difference between total assets and total liabilities.
7. **Cash and Cash Equivalents**: Liquid assets that can be readily converted to cash.
8. **Debt**: Long-term and short-term liabilities.

To get the 

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
