<a href="https://colab.research.google.com/github/adarshukla3005/Financial_Report_Generator/blob/main/Financial_Report_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LLM + RAG Projects on Finance Domain**

This notebook demonstrates use cases of RAG and LLMs in the finance domain, using Python, Langchain, open-source LLMs, and vector databases.

![](https://marcabraham.files.wordpress.com/2024/03/raga-retrieval-augmented-generation-and-actions.png?w=1024)

# **Developing a Financial Report Using Economic Indicators from an API**  
Using the Financial Modeling Prep API to gather market economic indicators.

**Problem Statement:**  
Creating a financial report for a company or stock using the latest market and economic data, without the need for training or fine-tuning large language models (LLMs) or machine learning models.

**Project Methodology:**
- This project leverages an open-source API to retrieve the latest financial and market data related to company metrics and economic indicators.
- Python is used to process and save this data into a CSV file.
- The CSV file is then loaded into a Vector Database with the help of an embedding model from Hugging Face.
- A Retrieval-Augmented Generation (RAG) question-answering (QA) chain is built using Langchain, and the RAG architecture is implemented with the Falcon 7B LLM (an open-source model).
- The system is tested by querying the built model and analyzing the responses.

**NOTE:**  
This tutorial uses open-source LLMs, so the accuracy of the responses may be limited. Swapping in a stronger model such as OpenAI's GPT or Meta's Llama can improve the output quality with the same code.

In [6]:
from google.colab import userdata
# Read the Financial Modeling Prep API key from Colab secrets
api_key = userdata.get('api_key')

import json
import ssl
from urllib.request import urlopen

import certifi
import pandas as pd


def get_jsonparsed_data(ticker, api_key, exchange):
  """Fetch quote data (or NSE search results) for a ticker and return the parsed JSON."""
  if exchange == "NSE":
    url = f"https://financialmodelingprep.com/api/v3/search?query={ticker}&exchange=NSE&apikey={api_key}"
  else:
    url = f"https://financialmodelingprep.com/api/v3/quote/{ticker}?apikey={api_key}"
  # Verify TLS with certifi's CA bundle through an SSL context
  # (the cafile argument of urlopen is deprecated)
  context = ssl.create_default_context(cafile=certifi.where())
  response = urlopen(url, context=context)
  data = response.read().decode("utf-8")
  return json.loads(data)

ticker = "MSFT"
exchange = "US"
eco_ind = pd.DataFrame(get_jsonparsed_data(ticker, api_key, exchange))
eco_ind


Unnamed: 0,symbol,name,price,changesPercentage,change,dayLow,dayHigh,yearHigh,yearLow,marketCap,...,exchange,volume,avgVolume,open,previousClose,eps,pe,earningsAnnouncement,sharesOutstanding,timestamp
0,MSFT,Microsoft Corporation,388.7,0.03603,0.14,385.57,392.705,468.35,376.91,2889588026000,...,NASDAQ,21484898,22978596,386.77,388.56,12.4,31.35,2025-04-23T20:00:00.000+0000,7433980000,1742241601
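
The request URLs in the cell above are built by string interpolation. A minimal sketch of an alternative, assuming the same two Financial Modeling Prep endpoints used there, lets `urllib.parse` escape the ticker and the key. `build_fmp_url` is a hypothetical helper name introduced here for illustration, not part of any library.

```python
from urllib.parse import quote, urlencode

BASE = "https://financialmodelingprep.com/api/v3"


def build_fmp_url(ticker, api_key, exchange="US"):
    """Hypothetical helper: search URL for NSE tickers, quote URL otherwise."""
    if exchange == "NSE":
        params = urlencode({"query": ticker, "exchange": "NSE", "apikey": api_key})
        return f"{BASE}/search?{params}"
    # Escape the ticker so symbols with special characters stay valid in the path
    return f"{BASE}/quote/{quote(ticker, safe='')}?{urlencode({'apikey': api_key})}"


print(build_fmp_url("MSFT", "demo"))
print(build_fmp_url("RELIANCE", "demo", exchange="NSE"))
```

The key `"demo"` is a placeholder; the real key would still come from Colab secrets.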


### Installing the Langchain Libraries

In [7]:
!pip install langchain langchain-community langchain-core transformers


In [8]:
def preprocess_economic_data(df):
    # The quote timestamp is a Unix epoch in seconds, so the unit must be given;
    # without it pandas reads the integer as nanoseconds since 1970-01-01
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    df['earningsAnnouncement'] = pd.to_datetime(df['earningsAnnouncement'])
    return df

preprocessed_economic_df = preprocess_economic_data(eco_ind)
preprocessed_economic_df

Unnamed: 0,symbol,name,price,changesPercentage,change,dayLow,dayHigh,yearHigh,yearLow,marketCap,...,exchange,volume,avgVolume,open,previousClose,eps,pe,earningsAnnouncement,sharesOutstanding,timestamp
0,MSFT,Microsoft Corporation,388.7,0.03603,0.14,385.57,392.705,468.35,376.91,2889588026000,...,NASDAQ,21484898,22978596,386.77,388.56,12.4,31.35,2025-04-23 20:00:00+00:00,7433980000,2025-03-17 20:00:01
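
The `timestamp` value returned by the quote endpoint (1742241601 in this run) looks like a Unix epoch in seconds. pandas interprets bare integers as nanoseconds by default, so the unit matters when parsing it. The sketch below checks the value with the standard library only; `epoch_seconds_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(ts):
    """Convert a Unix epoch in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


quote_time = epoch_seconds_to_utc(1742241601)
print(quote_time.isoformat())  # 2025-03-17T20:00:01+00:00
```

The result falls on the day the quote was captured, which is a quick sanity check that seconds is the right unit.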


### Storing the Pre-Processed Data into CSV

In [9]:
preprocessed_economic_df.to_csv("eco_ind.csv")

### Installing the Hugging Face Embedding Library

In [None]:
%pip install --upgrade --quiet langchain sentence_transformers

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
hg_embeddings = HuggingFaceEmbeddings()


In [None]:
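# Loaders now live in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter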
loader_eco = CSVLoader('eco_ind.csv')
documents_eco = loader_eco.load()

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=5)

# Split your docs into texts
texts_eco = text_splitter.split_documents(documents_eco)

# Embeddings
embeddings = HuggingFaceEmbeddings()
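
To see what the splitter is working with, it helps to look at the text a CSV row becomes before chunking. The sketch below is a simplified stand-in for the loader's row formatting (one `column: value` line per field), not Langchain's implementation, and `row_to_text` is a hypothetical name. The sample row is a shortened version of the quote data saved earlier.

```python
import csv
import io


def row_to_text(row):
    """Render one CSV row as 'column: value' lines (simplified stand-in)."""
    return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())


# The empty first header is the unnamed index column written by to_csv
sample_csv = ",symbol,name,price\n0,MSFT,Microsoft Corporation,388.7\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
text = row_to_text(rows[0])
print(text)
print(len(text), "characters")
```

Even this four-column sample is longer than the `chunk_size=50` used above, so a full row is split into many small fragments, each holding only a field or two.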



### Building the Vector DB for RAG

In [None]:
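# Vector stores now live in langchain_community; the old langchain.vectorstores path is deprecated
from langchain_community.vectorstores import Chroma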

persist_directory = 'docs/chroma_rag/'

In [None]:
!pip install chromadb


In [None]:
economic_langchain_chroma = Chroma.from_documents(
    documents=texts_eco,
    collection_name="economic_data",
    embedding=hg_embeddings,
    persist_directory=persist_directory
)

In [None]:
question = "Microsoft(MSFT)"
docs_eco = economic_langchain_chroma.similarity_search(question,k=3)
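
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that idea with a toy word-count embedding and cosine similarity. It is an assumption-laden illustration, not how Chroma or the Hugging Face model work internally, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]


chunks = ["symbol: MSFT", "name: Microsoft Corporation", "price: 388.7", "exchange: NASDAQ"]
print(top_k("Microsoft MSFT", chunks, k=2))
# ['symbol: MSFT', 'name: Microsoft Corporation']
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step is the same.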

### Building RAG Chain using Vector DB and LLM

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "Your Access Token"

llm = HuggingFaceHub(
    repo_id="tiiuae/falcon-7b-instruct",
    model_kwargs={"temperature": 0.1},
)

retriever_eco = economic_langchain_chroma.as_retriever(search_kwargs={"k":2})
qs="Microsoft(MSFT) Financial Report"
template = """You are a Financial Market Expert and Get the Market Economic Data and Market News about Company and Build the Financial Report for me.
              Understand this Market Information {context} and Answer the Query for this Company {question}. i just need the data into Tabular Form as well."""

PROMPT = PromptTemplate(input_variables=["context","question"], template=template)
qa_with_sources = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",chain_type_kwargs = {"prompt": PROMPT}, retriever=retriever_eco, return_source_documents=True)
llm_response = qa_with_sources({"query": qs})
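
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt and sends the result to the LLM in a single call. A minimal sketch of that step, assuming chunks are joined with a blank line and using a shortened version of the template above; `stuff_prompt` is a hypothetical helper, not a Langchain function.

```python
TEMPLATE = (
    "Understand this Market Information {context} "
    "and Answer the Query for this Company {question}."
)


def stuff_prompt(template, retrieved_chunks, question):
    """Join the retrieved chunks and fill the prompt template."""
    context = "\n\n".join(retrieved_chunks)
    return template.format(context=context, question=question)


prompt_text = stuff_prompt(
    TEMPLATE,
    ["symbol: MSFT", "name: Microsoft Corporation"],
    "Microsoft(MSFT) Financial Report",
)
print(prompt_text)
```

With `k=2` and 50-character chunks, the context the model sees is only a few fields of the quote, which limits how much of the report can be grounded in retrieved data.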

In [None]:
import requests

headers = {"Authorization": f"Bearer {os.environ['HUGGINGFACEHUB_API_TOKEN']}"}
response = requests.get("https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct", headers=headers)

print(response.status_code)
print(response.json())

200
{'_id': '6447714d3411a0902bad9607', 'id': 'tiiuae/falcon-7b-instruct', 'sha': '8782b5c5d8c9290412416618f36a133653e85285', 'pipeline_tag': 'text-generation', 'library_name': 'transformers', 'private': False, 'gated': False, 'siblings': [], 'safetensors': {'parameters': {'BF16': 7217189760}}, 'tags': ['transformers', 'pytorch', 'coreml', 'safetensors', 'falcon', 'text-generation', 'conversational', 'custom_code', 'en', 'dataset:tiiuae/falcon-refinedweb', 'arxiv:2205.14135', 'arxiv:1911.02150', 'arxiv:2005.14165', 'arxiv:2104.09864', 'arxiv:2306.01116', 'license:apache-2.0', 'autotrain_compatible', 'text-generation-inference', 'endpoints_compatible', 'region:us'], 'cardData': {'tags': None, 'base_model': None}}


In [None]:
Markdown(llm_response['result'])

You are a Financial Market Expert and Get the Market Economic Data and Market News about Company and Build the Financial Report for me.
              Understand this Market Information : 0
symbol: MSFT
name: Microsoft Corporation

earningsAnnouncement: 2025-01-29 21:00:00+00:00 and Answer the Query for this Company Microsoft(MSFT) Financial Report. i just need the data into Tabular Form as well.
<p>The following is the financial report for Microsoft Corporation (MSFT) for the year 2025. The report includes the following sections:</p>

<ul>
<li>Income Statement</li>
<li>Balance Sheet</li>
<li>Cash Flow Statement</li>
<li>Income Statement</li>
<li>Balance Sheet</li>
<li>Cash Flow Statement</li>
</ul>

<p>The following is the income statement for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Revenue: $2,423,000,000</li>
<li>Net Income: $1,073,000,000</li>
<li>Earnings per Share: $0.00</li>
</ul>

<p>The following is the balance sheet for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Assets: $1,073,000,000</li>
<li>Liabilities: $1,073,000,000</li>
<li>Equity: $1,000,000,000</li>
</ul>

<p>The following is the cash flow statement for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Cash Inflow: $1,073,000,000</li>
<li>Net Cash Flow: $1,073,000,000</li>
</ul>

<p>The following is the income statement for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Revenue: $2,423,000,000</li>
<li>Net Income: $1,073,000,000</li>
<li>Earnings per Share: $0.00</li>
</ul>

<p>The following is the balance sheet for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Assets: $1,073,000,000</li>
<li>Liabilities: $1,073,000,000</li>
<li>Equity: $1,000,000,000</li>
</ul>

<p>The following is the cash flow statement for Microsoft Corporation (MSFT) for the year 2025:</p>

<ul>
<li>Cash Inflow: $1,073,000,000</li>
<li>Net Cash Flow: $1,073,000,000</li>
</ul>
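
The rendered result above begins with the prompt itself, which suggests the hosted model returns the full text (prompt plus completion). A small sketch of one way to keep only the generated part, assuming the exact prompt string is available; `strip_prompt_echo` is a hypothetical helper and the strings below are shortened placeholders.

```python
def strip_prompt_echo(result, prompt):
    """Return only the generated text when the result starts with the prompt."""
    if result.startswith(prompt):
        return result[len(prompt):].lstrip()
    return result


prompt = "Build the Financial Report for Microsoft(MSFT)."
raw = prompt + "\n<p>The following is the financial report...</p>"
print(strip_prompt_echo(raw, prompt))
```

For the RAG chain, the prompt to strip is the fully formatted one (template plus retrieved context), so it would have to be rebuilt from the template and the returned source documents.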

# **Using NewsAPI to Build a Financial News Summarizer for Current Company Sentiment**

### Fetching the Latest Data from NewsAPI Using an API Key from Their Website

**Problem Statement:** Building a GenAI-based system that analyzes market news about an entire stock exchange or a single company and reports the market sentiment along with a news-based analysis.

**Project Methodology**
- This project uses an open-source API to fetch the latest financial news about the company and the market.
- Python is used to pre-process the fetched data and save it to a CSV file.
- The CSV file is loaded into a Vector DB using an embedding model from Hugging Face.
- A RAG QA chain is built with Langchain, and the RAG architecture uses the Falcon 7B LLM (open source).
- The response is checked.

In [None]:
%pip install newsapi-python



In [None]:
import requests
import pandas as pd
from newsapi import NewsApiClient  # Corrected import
from datetime import datetime, timedelta

def fetch_news(query, from_date, to_date, language='en', sort_by='relevancy', page_size=30, api_key='YOUR_API_KEY'):
    # Initialize the NewsAPI client
    newsapi = NewsApiClient(api_key=api_key)

    # Fetch all articles matching the query
    all_articles = newsapi.get_everything(
        q=query,
        from_param=from_date,
        to=to_date,
        language=language,
        sort_by=sort_by,
        page_size=page_size
    )

    # Extract articles
    articles = all_articles.get('articles', [])

    # Convert to DataFrame
    if articles:
        df = pd.DataFrame(articles)
        return df
    else:
        return pd.DataFrame()  # Return an empty DataFrame if no articles are found

# Get the current time
current_time = datetime.now()
# Get the time 10 days ago
time_10_days_ago = current_time - timedelta(days=10)
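# Read the NewsAPI key from Colab secrets rather than hardcoding it in the notebook
# (the secret name 'news_api_key' is an assumption; use the name you stored it under)
news_api_key = userdata.get('news_api_key')
q = "Microsoft News June 2024"
df = fetch_news(q, time_10_days_ago, current_time, api_key=news_api_key)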

if not df.empty:
    df_news = df.drop("source", axis=1)

    def preprocess_news_data(df):
        # Convert publishedAt to datetime
        df['publishedAt'] = pd.to_datetime(df['publishedAt'])
        df = df[~df['author'].isna()]
        df = df[['author', 'title']]
        return df

    preprocessed_news_df = preprocess_news_data(df_news)
    print(preprocessed_news_df.head())
else:
    print("No articles found.")

                                      author  \
0     kevinokemwa@outlook.com (Kevin Okemwa)   
1              Melia Russell,Samantha Stokes   
2                            Dan DeFrancesco   
3        jez@windowscentral.com (Jez Corden)   
4  samuelwtolbert@gmail.com (Samuel Tolbert)   

                                               title  
0  Microsoft to make performance-based job cuts a...  
1  Silicon Valley is foaming at the mouth with th...  
2  Meta's performance-based cuts could kick off a...  
3  A new rumor suggests Final Fantasy 7 Remake is...  
4  Assassin's Creed Shadows is being delayed agai...  


### Building the Prompt from the News Headlines

In [None]:
def build_prompt(news_df):
    prompt = "You are a financial analyst tasked with providing insights into recent news articles related to the financial industry. Here are some recent news articles:\n\n"

    for index, row in news_df.iterrows():
        title = row['title']
        prompt += f"   **News:** {title}\n\n"

    prompt += "Please analyze these articles and provide insights into any potential impacts on the financial industry Sentiment on the provided company."

    return prompt

# Build the prompt
prompt = build_prompt(preprocessed_news_df)
print(prompt)

You are a financial analyst tasked with providing insights into recent news articles related to the financial industry. Here are some recent news articles:

   **News:** Microsoft to make performance-based job cuts across departments, including security, impacting "less than 1%" of the workforce

   **News:** Silicon Valley is foaming at the mouth with the promise of AI 'agents.' These are the startups to watch.

   **News:** Meta's performance-based cuts could kick off a wider trend in tech

   **News:** A new rumor suggests Final Fantasy 7 Remake is finally coming to Xbox in 2025, with Rebirth heading across in 2026 — as more Xbox games head to PS5 and Nintendo

   **News:** Assassin's Creed Shadows is being delayed again, now launching in March 2025

   **News:** Natrium 'advanced nuclear' power plant wins Wyoming permit

   **News:** Microsoft reveals another round of job cuts

   **News:** Rubin Observatory aces 1st image tests, gets ready to use world's largest digital camera

  

### Open-Source LLM from Hugging Face

In [None]:
llm = HuggingFaceHub(
    repo_id="tiiuae/falcon-7b-instruct",
    model_kwargs={"temperature": 0.1},
)

In [None]:
Markdown(llm.invoke(prompt))

You are a financial analyst tasked with providing insights into recent news articles related to the financial industry. Here are some recent news articles:

   **News:** Microsoft to make performance-based job cuts across departments, including security, impacting "less than 1%" of the workforce

   **News:** Silicon Valley is foaming at the mouth with the promise of AI 'agents.' These are the startups to watch.

   **News:** Meta's performance-based cuts could kick off a wider trend in tech

   **News:** A new rumor suggests Final Fantasy 7 Remake is finally coming to Xbox in 2025, with Rebirth heading across in 2026 — as more Xbox games head to PS5 and Nintendo

   **News:** Assassin's Creed Shadows is being delayed again, now launching in March 2025

   **News:** Natrium 'advanced nuclear' power plant wins Wyoming permit

   **News:** Microsoft reveals another round of job cuts

   **News:** Rubin Observatory aces 1st image tests, gets ready to use world's largest digital camera

   **News:** CVE-2024-44243 macOS flaw allows persistent malware installation

   **News:** Lilbits: A bunch of handheld gaming news, plus a RISC-V server chip and a new single-board PC

   **News:** Why Nvidia rug pull doesn't faze US stock market bulls: Morning Brief

   **News:** The moments set to shape video games in 2025

   **News:** Microsoft Patch Tuesday updates for January 2025 fixed three actively exploited flaws

   **News:** All the Games Reportedly Set for Release on Nintendo Switch 2

   **News:** New Microsoft PHI-4 a Compact Powerhouse Open Source AI Model

   **News:** United Airlines Speeds Up Move To High-Speed Starlink Connectivity

   **News:** Russia-linked APT Star Blizzard targets WhatsApp accounts

   **News:** Inexperienced actors developed the FunkSec ransomware using AI tools

   **News:** Gayfemboy Botnet targets Four-Faith router vulnerability

   **News:** Meet China’s top six AI unicorns: who are leading the wave of AI in China

   **News:** Researchers disclosed details of a now-patched Samsung zero-click flaw

   **News:** Who are Trump's Cabinet Nominees? Get to Know His Picks

   **News:** A Complete Guide to Trump's Cabinet Appointees

   **News:** U.S. CISA adds BeyondTrust PRA and RS and Qlik Sense flaws to its Known Exploited Vulnerabilities catalog

   **News:** Threat actors exploit Aviatrix Controller flaw to deploy backdoors and cryptocurrency miners

   **News:** U.S. CISA adds Fortinet FortiOS to its Known Exploited Vulnerabilities catalog

   **News:** SECURITY AFFAIRS MALWARE NEWSLETTER – ROUND 28

   **News:** How a researcher earned $100,000 hacking a Facebook server

Please analyze these articles and provide insights into any potential impacts on the financial industry Sentiment on the provided company.
Microsoft's recent job cuts and performance-based layoffs could have a ripple effect on the company's financials, as it may impact the demand for their products. The company's stock price may experience short-term volatility, but the long-term outlook remains positive.

Microsoft's job cuts and performance-based layoffs could have a ripple effect on the company's financials, as it may impact the demand for their products. The company's stock price may experience short-term volatility, but the long-term outlook remains positive.

# **Financial Data Investment Advisor**

**Problem Statement:** Building a financial advisor from a dataset of financial advice covering stocks, mutual funds, and gold or silver bonds, using Python, Langchain, and an open-source LLM.

**Project Methodology**
- This project uses open-source data from Kaggle on financial advice.
- Python is used to load and pre-process the data and save it to a CSV file.
- The CSV file is loaded into a Vector DB using an embedding model from Hugging Face.
- A RAG QA chain is built with Langchain, and the RAG architecture uses the Falcon 7B LLM (open source).
- The response is checked.


![](https://media.licdn.com/dms/image/D5612AQFSyeoRrkC5fw/article-cover_image-shrink_720_1280/0/1701189671766?e=2147483647&v=beta&t=cpa6wlGMWG44ZyGW6MWyKZ2Vr0BT-G1zlb8RB0yio6w)

## **Loading the Financial Data from Kaggle or Any Open Source Platform**

Data Source - https://www.kaggle.com/datasets/nitindatta/finance-data

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("nitindatta/finance-data")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/nitindatta/finance-data/versions/1


In [None]:
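import os

# Assumes Finance_data.csv sits at the top level of the downloaded dataset directory
data = pd.read_csv(os.path.join(path, "Finance_data.csv"))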
data_fin = data.to_dict(orient='records')

In [None]:
for entry in data_fin:
  prompt = f"I'm a {entry['age']}-year-old {entry['gender']} looking to invest in {entry['Avenue']} for {entry['Purpose']} over the next {entry['Duration']}. What are my options?"
  print(prompt)

I'm a 34-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next 1-3 years. What are my options?
I'm a 23-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next More than 5 years. What are my options?
I'm a 30-year-old Male looking to invest in Equity for Wealth Creation over the next 3-5 years. What are my options?
I'm a 22-year-old Male looking to invest in Equity for Wealth Creation over the next Less than 1 year. What are my options?
I'm a 24-year-old Female looking to invest in Equity for Wealth Creation over the next Less than 1 year. What are my options?
I'm a 24-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next 1-3 years. What are my options?
I'm a 27-year-old Female looking to invest in Equity for Wealth Creation over the next 3-5 years. What are my options?
I'm a 21-year-old Male looking to invest in Mutual Fund for Wealth Creation over the next 3-5 years. What are my options?
I'm a 35-yea

### Pre-Processing the Data into Prompt-Response Format

In [None]:
# Convert the data to prompt-response format
prompt_response_data = []
for entry in data_fin:
    prompt = f"I'm a {entry['age']}-year-old {entry['gender']} looking to invest in {entry['Avenue']} for {entry['Purpose']} over the next {entry['Duration']}. What are my options?"
    response = (
        f"Based on your preferences, here are your investment options:\n"
        f"- Mutual Funds: {entry['Mutual_Funds']}\n"
        f"- Equity Market: {entry['Equity_Market']}\n"
        f"- Debentures: {entry['Debentures']}\n"
        f"- Government Bonds: {entry['Government_Bonds']}\n"
        f"- Fixed Deposits: {entry['Fixed_Deposits']}\n"
        f"- PPF: {entry['PPF']}\n"
        f"- Gold: {entry['Gold']}\n"
        f"Factors considered: {entry['Factor']}\n"
        f"Objective: {entry['Objective']}\n"
        f"Expected returns: {entry['Expect']}\n"
        f"Investment monitoring: {entry['Invest_Monitor']}\n"
        f"Reasons for choices:\n"
        f"- Equity: {entry['Reason_Equity']}\n"
        f"- Mutual Funds: {entry['Reason_Mutual']}\n"
        f"- Bonds: {entry['Reason_Bonds']}\n"
        f"- Fixed Deposits: {entry['Reason_FD']}\n"
        f"Source of information: {entry['Source']}\n"
    )
    prompt_response_data.append({"prompt": prompt, "response": response})

prompt_response_data[:5]

[{'prompt': "I'm a 34-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next 1-3 years. What are my options?",
  'response': 'Based on your preferences, here are your investment options:\n- Mutual Funds: 1\n- Equity Market: 2\n- Debentures: 5\n- Government Bonds: 3\n- Fixed Deposits: 7\n- PPF: 6\n- Gold: 4\nFactors considered: Returns\nObjective: Capital Appreciation\nExpected returns: 20%-30%\nInvestment monitoring: Monthly\nReasons for choices:\n- Equity: Capital Appreciation\n- Mutual Funds: Better Returns\n- Bonds: Safe Investment\n- Fixed Deposits: Fixed Returns\nSource of information: Newspapers and Magazines\n'},
 {'prompt': "I'm a 23-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next More than 5 years. What are my options?",
  'response': 'Based on your preferences, here are your investment options:\n- Mutual Funds: 4\n- Equity Market: 3\n- Debentures: 2\n- Government Bonds: 1\n- Fixed Deposits: 5\n- PPF: 6\n- Gold: 7\

### Storing Data into Vector DB

In [None]:
from langchain.docstore.document import Document
documents = []
for entry in prompt_response_data:
    combined_text = f"Prompt: {entry['prompt']}\nResponse: {entry['response']}"
    documents.append(Document(page_content=combined_text))

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
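
`chunk_size` caps the length of each chunk and `chunk_overlap` sets how many characters adjacent chunks share. The sketch below shows the idea with a plain sliding character window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as newlines and spaces, so its actual boundaries differ. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With `chunk_size=100`, each prompt-response document in this section is longer than one chunk, so a prompt and its response can end up in different chunks.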

In [None]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'
vectordb_fin = Chroma.from_documents(
    documents=texts,
    embedding=hg_embeddings,
    persist_directory=persist_directory
)

### Building RAG System using VectorDB and LLM

In [None]:
from langchain.chains import RetrievalQA
retriever_fin = vectordb_fin.as_retriever(search_kwargs={"k":5})
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever_fin, return_source_documents=False)
query = "I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?"
result = qa({"query": query})
result

{'query': "I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?",
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nPrompt: I'm a 34-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 32-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 28-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 24-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 29-year-old Male looking to invest in Mutual Fund for Wealth Creation over the next\n\nQuestion: I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?\nHelpful Answer:\n\nThere are severa

# **GenAI Financial Fraud Detection Application**

**Problem Statement:** Building a financial fraud detection algorithm that detects fraud or anomalies in transactions or user behavior based on past data and pattern recognition. The data is mostly unstructured, so using GenAI or an LLM is a must.

**Project Methodology**
-

![](https://images.spiceworks.com/wp-content/uploads/2021/06/16094651/Fraud-Detection.png)