In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/report-data/report.pdf
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00002-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer_config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model.bin.index.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00001-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/special_tokens_map.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/.gitattributes
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.model
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/generation_config.json


# Financial Report QA System

This project answers questions about a company's annual financial report. It combines a **Large Language Model (LLM)**, **Mistral 7B**, with **Retrieval-Augmented Generation (RAG)**: passages relevant to a user query are retrieved from the report and passed to the model, which returns a structured answer built on the metrics and trends found in the document.

---

## Key Technologies Used

1. **Mistral 7B Model:**  
   A 7-billion-parameter, instruction-tuned language model (the `7b-instruct-v0.1-hf` checkpoint) that generates the natural-language answers. It is loaded with 4-bit quantization to reduce GPU memory use.

2. **FAISS (Facebook AI Similarity Search):**  
   A similarity-search library used here as the vector index. It stores the embedded document chunks and retrieves those closest to the embedded user query.

3. **LangChain Framework:**  
   Connects the embedding model, the FAISS index, and the Hugging Face text-generation pipeline through common retriever and LLM interfaces.

4. **PDF Text Processing:**  
   Text is extracted from the PDF with PyPDF2, split into chunks of 500 characters with a 50-character overlap, and indexed so that only the passages relevant to a query are passed to the model.

5. **Retrieval-Augmented Generation (RAG):**  
   Combines the power of retrieval (to find relevant document sections) and generation (to synthesize answers), ensuring both relevance and clarity.
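
The chunking step mentioned in item 4 can be illustrated with a minimal sketch. It is a simplified stand-in, not the notebook's splitter: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` helper below simply slides a fixed-size character window. The sizes match the notebook's `chunk_size=500` and `chunk_overlap=50`.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Placeholder text (1,400 characters), not taken from the report.
sample_text = "Revenue rose. " * 100
sample_chunks = chunk_text(sample_text)
print(len(sample_chunks), [len(chunk) for chunk in sample_chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.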

---

## How It Works

1. **Document Processing:**  
   The system extracts and preprocesses text from the PDF report, creating manageable chunks for analysis.

2. **Vector Indexing:**  
   Each chunk is embedded with `sentence-transformers/all-MiniLM-L6-v2` and stored in a FAISS index as a dense vector, allowing similarity search against the embedded user query.

3. **Query Handling:**  
   A retriever selects the five chunks most similar to the query, which are then passed as context to the Mistral 7B model for answer generation.

4. **Structured Output:**  
   The model generates concise, bullet-point answers based on the retrieved content, tailored to the specific user query.
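
The retrieval step can be sketched in plain Python to show the idea. This is a toy under stated assumptions: the notebook uses MiniLM sentence embeddings and a FAISS index, while the hypothetical `embed`, `cosine_similarity`, and `retrieve` helpers below use word counts and a brute-force sort. The example chunks are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a dense sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, not taken from the report.
example_chunks = [
    "Revenue for the year increased compared with the prior year.",
    "The board of directors met four times during the year.",
    "Trade receivables decreased at year end.",
]
top_chunks = retrieve("How did revenue change?", example_chunks, k=1)
context = "\n".join(top_chunks)  # this string is what gets placed in the prompt
print(context)
```

A dense embedding model can also match paraphrases that share no words with the query, which word counts cannot do.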

---

## Example Query and Response

**Query:**  
What are the key financial highlights mentioned in the report?

**Response:**  
1. **Revenue growth:** The company's revenue increased by 12% from $1,422,280 in 2021 to $1,545,598 in 2022. This indicates strong customer demand and successful expansion efforts.  
2. **Net income improvement:** Despite an increase in operating expenses, the company's net income improved by 17% from ($230,000) in 2021 to ($154,559) in 2022. This suggests effective cost management and efficient operations.  
3. **Trade receivable reduction:** The value of trade receivables decreased by 13%, highlighting improved collection practices.  
4. **Other current liability decrease:** Current liabilities reduced by 21%, reflecting improved debt management.
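
Percentages in generated answers are worth recomputing from the absolute figures quoted next to them. The sketch below applies the standard relative-change formula to the amounts in the response above; `percent_change` is a hypothetical helper, and the inputs are only the numbers as quoted, not values verified against the report.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage of the magnitude of old."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / abs(old) * 100


# Figures as quoted in the example response (parentheses denote negative amounts).
revenue_change = percent_change(1_422_280, 1_545_598)
net_result_change = percent_change(-230_000, -154_559)

print(f"Revenue change: {revenue_change:.2f}%")        # about 8.67%
print(f"Net result change: {net_result_change:.2f}%")  # about 32.80%
```

With these inputs the computed changes differ from the 12% and 17% stated in the generated text, which suggests treating model-reported percentages as unverified until they are checked against the source figures.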

---

## Brief Development Overview

This system exemplifies the integration of cutting-edge AI technologies with data retrieval techniques. By combining the Mistral 7B model, FAISS for vector-based retrieval, and the LangChain framework, it showcases a scalable and efficient approach to automating the analysis of complex financial documents.

---

## Prompt Design

The system employs a **task-specific prompt structure** optimized for financial analysis. This structure guides the language model (Mistral 7B) to generate concise, structured, and relevant answers based on retrieved document content.

### Characteristics of the Prompt:
1. **Instruction-Based Design:**  
   The prompt provides step-by-step instructions to the model, specifying the focus on key metrics (e.g., revenue, net income) and limiting responses to 3-4 significant points.

2. **Contextual Input:**  
   Relevant sections of the document are passed as context, ensuring that the model generates answers grounded in the provided data.

3. **Structured Output:**  
   The prompt explicitly requests bullet-point answers to improve clarity and readability.

### Purpose:
This design ensures that the model prioritizes critical financial information, avoids generalizations, and delivers outputs tailored to user queries.
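
The prompt structure can be sketched as a small builder function. This is an illustrative variant, not the notebook's exact code: `build_prompt` and `INSTRUCTIONS` are hypothetical names, the chunk strings are placeholders, and `textwrap.dedent` is used so the prompt lines carry no leading indentation from the source file.

```python
from textwrap import dedent

# Instruction list as used in the notebook's prompt.
INSTRUCTIONS = dedent("""\
    1. Focus on the most critical financial metrics (e.g., revenue, net income, total assets, total liabilities).
    2. Limit the response to 3-4 significant changes with concise explanations.
    3. Avoid including minor details unless directly relevant to the query.
    4. Provide the answer in a structured, bullet-point format.
    5. Use exact numerical values (percentages or absolute changes) to emphasize the significance of each metric and avoid generalizations or overly abstract statements.""")


def build_prompt(context_chunks: list[str], query: str) -> str:
    """Assemble context, question, and instructions into one prompt string."""
    context = "\n".join(context_chunks)
    return (
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        f"Instructions:\n{INSTRUCTIONS}\n\n"
        "Answer:\n"
    )


# Placeholder chunks stand in for the retrieved report passages.
example_prompt = build_prompt(
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],
    "What are the key financial highlights mentioned in the report?",
)
print(example_prompt)
```

Keeping the instructions in one constant makes it easy to adjust the requested number of points or output format without touching the retrieval code.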


In [2]:
# Installing the libraries needed for quantized inference and retrieval
!pip install -U bitsandbytes  # Install bitsandbytes for 4-bit and 8-bit quantization, optimizing model memory and speed.
!pip install -U langchain-community  # Install community-contributed modules for LangChain, enhancing retrieval and processing.
!pip install langchain faiss-cpu transformers sentence-transformers PyPDF2   # Install LangChain for LLM pipelines, FAISS for vector search, Transformers for model management, and SentenceTransformers for embedding generation.

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0
Collecting langchain-community
  Downloading langchain_community-0.3.12-py3-none-any.whl.metadata (2.9 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.12 (from langchain-community)
  Downloading langchain-0.3.12-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.25 (from langchain-community)
  Downloading langchain_core-0.3.25-py3-none-any.whl.metadata (6.3 kB)
Collecting langsmith<0.3,>=0.1.125 (from langchain-community)
  Downloading langsmi

In [3]:
from PyPDF2 import PdfReader
# The langchain_community import paths replace the deprecated langchain.* ones
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to the PDF file containing the annual report
pdf_path = "/kaggle/input/report-data/report.pdf"

# Function to extract text from the PDF file
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a given PDF file using PyPDF2.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        str: The extracted text from the PDF.
    """
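    reader = PdfReader(pdf_path)
    # Fall back to "" for pages that yield no text (e.g., image-only pages)
    page_texts = [page.extract_text() or "" for page in reader.pages]
    # Join with newlines so words at page boundaries do not run together
    return "\n".join(page_texts)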

# Extract text from the provided PDF
report_text = extract_text_from_pdf(pdf_path)

# Splitting the text into chunks of up to 500 characters with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(report_text)

# Wrapping each chunk in a Document object for the vector store
documents = [Document(page_content=chunk) for chunk in chunks]

# Loading the all-MiniLM-L6-v2 sentence-transformers model for generating text embeddings
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

# Creating a FAISS vector store for efficient similarity-based search
vectorstore = FAISS.from_documents(documents, embedding_model)

# Configuring BitsAndBytes for optimized memory usage with the Mistral model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enables 4-bit quantization for reduced memory usage
    bnb_4bit_quant_type="nf4",  # Uses the 4-bit NormalFloat (NF4) data type for the quantized weights
    bnb_4bit_compute_dtype="float16",  # Sets computation precision to 16-bit floating point
    bnb_4bit_use_double_quant=False  # Disables double quantization for simplicity
)

# Path to the pre-trained Mistral model
mistral_model_path = "/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"

# Loading the tokenizer for the Mistral model
tokenizer = AutoTokenizer.from_pretrained(mistral_model_path)

# Loading the Mistral language model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    mistral_model_path,
    device_map='auto',  # Automatically maps the model to available devices (e.g., GPUs)
    quantization_config=bnb_config  # Applies the BitsAndBytes configuration
)

# Creating a text-generation pipeline with the Mistral model
llm_pipeline = pipeline(
    "text-generation",  # Specify the task type as text generation
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=512,  # Set maximum output token length
    repetition_penalty=1.2,  # Penalize repetitive text in outputs
    temperature=0.1,  # Control randomness; lower values produce more deterministic outputs
    top_k=40,  # Limit sampling to top 40 tokens by probability
    top_p=0.9,  # Apply nucleus sampling to include tokens with cumulative probability <= 0.9
    pad_token_id=tokenizer.eos_token_id,  # Use EOS token for padding
    return_full_text=False,  # Only return generated text without input context
    do_sample=True  # Enable sampling to introduce variability in output
)

# Integrating the pipeline with LangChain
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Setting up the retriever to fetch the top 5 most similar documents
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# Function to generate an answer to a query
def generate_answer(query, retriever, llm):
    """
    Generates an answer to the provided query by retrieving relevant documents and prompting the language model.

    Args:
        query (str): The question to answer.
        retriever: The document retriever object.
        llm: The LangChain HuggingFacePipeline wrapper around the text-generation pipeline.

    Returns:
        str: The generated answer, or a fallback message if nothing was retrieved.
    """
    # Retrieve relevant documents using the retriever
    docs = retriever.invoke(query)
    if not docs:
        return "No relevant information found."

    # Combine retrieved documents into context
    context = "\n".join([doc.page_content for doc in docs])

    # Formulate the prompt with clear instructions for concise output
    prompt = f"""
    Context:
    {context}

    Question:
    {query}

    Instructions:
    1. Focus on the most critical financial metrics (e.g., revenue, net income, total assets, total liabilities).
    2. Limit the response to 3-4 significant changes with concise explanations.
    3. Avoid including minor details unless directly relevant to the query.
    4. Provide the answer in a structured, bullet-point format.
    5. Use exact numerical values (percentages or absolute changes) to emphasize the significance of each metric and avoid generalizations or overly abstract statements.

    Answer:
    """

    # Generate the answer; invoke() replaces the deprecated direct call llm(prompt)
    answer = llm.invoke(prompt)
    return answer


# Example usage
query = "What are the key financial highlights mentioned in the report?"
response = generate_answer(query, retriever, llm)
print(f"Query: {query}")
print(f"Response: {response}")




Query: What are the key financial highlights mentioned in the report?
Response: 1. Revenue growth rate
        - The company's revenue increased by 12% from $1,422,280 in 2021 to $1,545,598 in 2022. This indicates strong sales performance and market demand for Ethernity Networks' products/services.
    2. Net income increase
        - The company's net income rose from $1,253,268 in 2021 to $1,392,966 in 2022. This signifies improved profitability due to higher revenue and effective cost management.
    3. Total asset expansion
        - The company's total assets grew from $1,595,578 in 2021 to $1,794,950 in 2022. This suggests that Ethernity Networks invested heavily in its business operations, possibly through capital expenditures or acquisitions.
    4. Trade receivable reduction
        - The company reduced its trade receivables from $1,422,280 in 2021 to $1,373,718 in 2022. This may indicate better collection practices or a decrease in outstanding invoices.
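
Because the answer is meant to be grounded in the retrieved passages, one inexpensive sanity check is to confirm that every figure in the response also occurs in the context passed to the model. The sketch below is a hedged illustration: `extract_figures` and `ungrounded_figures` are hypothetical helpers, and `sample_context` is an invented sentence, since the report text itself is not shown in this notebook.

```python
import re


def extract_figures(text: str) -> set[str]:
    """Return dollar amounts and percentages found in text, e.g. '$1,422,280' or '12%'."""
    pattern = r"\$\d+(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?%"
    return set(re.findall(pattern, text))


def ungrounded_figures(answer: str, context: str) -> set[str]:
    """Figures that appear in the answer but nowhere in the retrieved context."""
    return extract_figures(answer) - extract_figures(context)


# Invented context sentence for illustration only.
sample_context = "Revenue was $1,545,598 in 2022 compared with $1,422,280 in 2021."
sample_answer = "Revenue increased by 12% from $1,422,280 in 2021 to $1,545,598 in 2022."

print(ungrounded_figures(sample_answer, sample_context))  # {'12%'}
```

A figure flagged this way is not necessarily wrong, since the model may have derived it, but it marks a value to verify by hand. In `generate_answer`, the `context` string already built from the retrieved documents could be passed as the second argument.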
