<a href="https://colab.research.google.com/github/analyticsworld1/1-ProPub/blob/main/LLM_%2B_RAG_for_Finance1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LLM + RAG Projects on Finance Domain**
**Author**: Simranjeet Singh

This notebook walks through use cases of RAG and LLMs in the finance domain using Python, LangChain, open-source LLMs, and vector DBs.

Learn along with me using the free resources I provide here; I will help you learn in a structured way.

*Don't Forget to Subscribe and Follow*

- Youtube: https://www.youtube.com/channel/UC4RZP6hNT5gMlWCm0NDzUWg
- Instagram: https://www.instagram.com/freebirdscrew/

**NOTE:** This playlist/course uses open-source LLMs, so the project responses may not be as accurate as they could be. Running the same code with OpenAI GPT or Meta Llama models can substantially improve output accuracy.

![](https://marcabraham.files.wordpress.com/2024/03/raga-retrieval-augmented-generation-and-actions.png?w=1024)

# **Build Short Financial Report using Economic Indicators from the API**
This project uses the Financial Modeling Prep API to fetch market economic indicators.

**Problem Statement:** Build a financial report for a company or stock from the latest stock market or economic data, without training or fine-tuning LLMs or ML models.

**Project Methodology**
- This project uses the Financial Modeling Prep API to fetch the latest company metrics and market economic indicators.
- The fetched data is pre-processed in Python and saved to a CSV file.
- The CSV file is loaded and inserted into a vector DB using an embedding model from Hugging Face.
- A RAG QA chain is built with LangChain, using the open-source Falcon 7B LLM.
- The response is checked.



![](https://media.licdn.com/dms/image/D5622AQFvnkgDSWCi4A/feedshare-shrink_800/0/1695081465240?e=2147483647&v=beta&t=mu9zgB9y-_sReXMyF9tyALz7bdUla2laZBEHPtm4glE)

In [None]:
import json
import ssl
from urllib.request import urlopen

import certifi
import pandas as pd


def get_jsonparsed_data(ticker, api_key, exchange):
    """Fetch quote data (or NSE search results) for a ticker and parse the JSON."""
    if exchange == "NSE":
        url = f"https://financialmodelingprep.com/api/v3/search?query={ticker}&exchange=NSE&apikey={api_key}"
    else:
        url = f"https://financialmodelingprep.com/api/v3/quote/{ticker}?apikey={api_key}"
    # Verify TLS with certifi's CA bundle. urlopen's `cafile` argument is
    # deprecated and was removed in Python 3.12, so pass an SSL context instead.
    context = ssl.create_default_context(cafile=certifi.where())
    response = urlopen(url, context=context)
    data = response.read().decode("utf-8")
    return json.loads(data)

api_key="C1HRSweTniWdBuLmTTse9w8KpkoiouM5"
ticker = "MSFT"
exchange = "US"
eco_ind = pd.DataFrame(get_jsonparsed_data(ticker, api_key,exchange))
eco_ind



Unnamed: 0,symbol,name,price,changesPercentage,change,dayLow,dayHigh,yearHigh,yearLow,marketCap,...,exchange,volume,avgVolume,open,previousClose,eps,pe,earningsAnnouncement,sharesOutstanding,timestamp
0,MSFT,Microsoft Corporation,423.85,-0.1578,-0.67,423.05,426.28,433.6,309.45,3150184593500,...,NASDAQ,11920235,19701822,426.2,424.52,11.55,36.7,2024-07-23T00:00:00.000+0000,7432310000,1717790401
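The request URL in the cell above is assembled by hand and the API key is written directly into the notebook. A minimal sketch of an alternative, assuming the key is kept in an environment variable (the name `FMP_API_KEY` and the helper `build_fmp_url` are hypothetical choices for illustration, not something the API requires), builds the same two endpoints with `urllib.parse`:

```python
import os
from urllib.parse import quote, urlencode

BASE_URL = "https://financialmodelingprep.com/api/v3"


def build_fmp_url(ticker, api_key, exchange="US"):
    """Return the search URL for NSE tickers, otherwise the quote URL."""
    if exchange == "NSE":
        params = urlencode({"query": ticker, "exchange": "NSE", "apikey": api_key})
        return f"{BASE_URL}/search?{params}"
    params = urlencode({"apikey": api_key})
    return f"{BASE_URL}/quote/{quote(ticker)}?{params}"


# FMP_API_KEY is an assumed variable name; fall back to a placeholder if unset.
api_key = os.environ.get("FMP_API_KEY", "YOUR_API_KEY")
print(build_fmp_url("MSFT", api_key))
```

Keeping the key out of the cell means the notebook can be shared without exposing it, and `urlencode` takes care of escaping any special characters in the query values.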


### Installing the Langchain Libraries

In [None]:
!pip install langchain langchain-community langchain-core transformers

In [None]:
def preprocess_economic_data(df):
    # `timestamp` is a Unix epoch in seconds; without unit='s' pandas reads it as nanoseconds
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    df['earningsAnnouncement'] = pd.to_datetime(df['earningsAnnouncement'])
    return df

preprocessed_economic_df = preprocess_economic_data(eco_ind)
preprocessed_economic_df

Unnamed: 0,symbol,name,price,changesPercentage,change,dayLow,dayHigh,yearHigh,yearLow,marketCap,...,exchange,volume,avgVolume,open,previousClose,eps,pe,earningsAnnouncement,sharesOutstanding,timestamp
0,MSFT,Microsoft Corporation,423.85,-0.1578,-0.67,423.05,426.28,433.6,309.45,3150184593500,...,NASDAQ,11920235,19701822,426.2,424.52,11.55,36.7,2024-07-23 00:00:00+00:00,7432310000,2024-06-07 20:00:01


### Storing the Pre-Processed Data into CSV

In [None]:
preprocessed_economic_df.to_csv("eco_ind.csv")

### Installing the Hugging Face Embedding Library

In [None]:
%pip install --upgrade --quiet  langchain sentence_transformers


In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
hg_embeddings = HuggingFaceEmbeddings()

In [None]:
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader_eco = CSVLoader('eco_ind.csv')
documents_eco = loader_eco.load()

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=5)

# Split your docs into texts
texts_eco = text_splitter.split_documents(documents_eco)

# Embeddings
embeddings = HuggingFaceEmbeddings()
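`CSVLoader` turns each CSV row into one document whose text lists the columns as `column: value` lines, which is why the retrieved context later in this notebook reads like `symbol: MSFT`. The sketch below imitates that formatting with the standard library only; `rows_to_documents` is a hypothetical helper for illustration, not part of LangChain.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Render each CSV row as newline-separated 'column: value' text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


sample = "symbol,name,price\nMSFT,Microsoft Corporation,423.85\n"
docs = rows_to_documents(sample)
print(docs[0])
```

With `chunk_size=50`, a row rendered this way is then split into several small fragments, so a single retrieved chunk carries only a few of the fields.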



### Building the Vector DB for RAG

In [None]:
from langchain_community.vectorstores import Chroma

persist_directory = 'docs/chroma_rag/'

In [None]:
economic_langchain_chroma = Chroma.from_documents(
    documents=texts_eco,
    collection_name="economic_data",
    embedding=hg_embeddings,
    persist_directory=persist_directory
)

In [None]:
question = "Microsoft(MSFT)"
docs_eco = economic_langchain_chroma.similarity_search(question,k=3)
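`similarity_search(question, k=3)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. A toy sketch of that ranking step follows. The hand-made 3-dimensional vectors and the use of cosine similarity are assumptions for illustration: real embeddings have hundreds of dimensions, and the distance metric depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy store of (chunk text, made-up embedding) pairs.
toy_store = [
    ("symbol: MSFT", [0.9, 0.1, 0.0]),
    ("price: 423.85", [0.2, 0.8, 0.1]),
    ("exchange: NASDAQ", [0.1, 0.2, 0.9]),
    ("name: Microsoft Corporation", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]
print(top_k(query, toy_store, k=2))
```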

### Building RAG Chain using Vector DB and LLM

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "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"

llm = HuggingFaceHub(
    repo_id="tiiuae/falcon-7b-instruct",
    model_kwargs={"temperature": 0.1},
)

retriever_eco = economic_langchain_chroma.as_retriever(search_kwargs={"k":2})
qs="Microsoft(MSFT) Financial Report"
template = """You are a Financial Market Expert and Get the Market Economic Data and Market News about Company and Build the Financial Report for me.
              Understand this Market Information {context} and Answer the Query for this Company {question}. i just need the data into Tabular Form as well."""

PROMPT = PromptTemplate(input_variables=["context","question"], template=template)
qa_with_sources = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",chain_type_kwargs = {"prompt": PROMPT}, retriever=retriever_eco, return_source_documents=True)
llm_response = qa_with_sources({"query": qs})

In [None]:
Markdown(llm_response['result'])

You are a Financial Market Expert and Get the Market Economic Data and Market News about Company and Build the Financial Report for me.
              Understand this Market Information : 0
symbol: MSFT
name: Microsoft Corporation

earningsAnnouncement: 2024-07-23 00:00:00+00:00 and Answer the Query for this Company Microsoft(MSFT) Financial Report

The following financial report is for Microsoft Corporation (MSFT). The report includes the latest financial data and market news about the company.

Financial Report:

For the fiscal year ended on June 30, 2024, Microsoft Corporation (MSFT) reported a total revenue of $2.5 trillion, an increase of $1.1 trillion from the previous year. The company's net income for the fiscal year was $128.1 billion
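The response above begins with the filled-in prompt, which suggests the endpoint returned the prompt followed by the completion. The sketch below shows how a "stuff" chain conceptually fills `{context}` with the retrieved chunks, and one way to drop the echoed prompt before display. `fill_prompt` and `strip_prompt_echo` are hypothetical helpers, and joining the chunks with blank lines is an assumption that matches the output shown.

```python
def fill_prompt(template, chunks, question):
    """Join retrieved chunks with blank lines and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


def strip_prompt_echo(generated, prompt):
    """Remove the prompt from the start of the generated text, if it is echoed there."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


template = "Understand this Market Information {context} and Answer the Query for this Company {question}."
chunks = ["symbol: MSFT\nname: Microsoft Corporation", "earningsAnnouncement: 2024-07-23"]
prompt = fill_prompt(template, chunks, "Microsoft(MSFT) Financial Report")

# Simulated model output: the prompt echoed back, followed by the answer.
generated = prompt + "\n\nThe following financial report is for Microsoft Corporation (MSFT)."
print(strip_prompt_echo(generated, prompt))
```

Because only two short chunks reach the model, figures in the generated report that do not appear in the context should be treated as unverified.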

# **Using NewsAPI to Build a Financial News Summarizer for Current Company Sentiment**

### Fetching the Latest Data from NewsAPI Using an API Key from Their Website

**Problem Statement:** Build a GenAI-based system that analyzes market news about a whole stock exchange or a single company and reports the market sentiment, along with an analysis based on that news.

**Project Methodology**
- This project uses NewsAPI to fetch the latest financial news about a company and the market.
- The fetched data is pre-processed in Python, keeping the author and title of each article.
- The article titles are combined into a single prompt.
- The prompt is sent to the open-source Falcon 7B LLM through Hugging Face.
- The response is checked.


![](https://img.freepik.com/premium-photo/bullseye-photography-bull-fighting-fight-generative-ai_901275-24479.jpg)

In [None]:
import requests
import pandas as pd
from newsapi import NewsApiClient
from datetime import datetime, timedelta

def fetch_news(query, from_date, to_date, language='en', sort_by='relevancy', page_size=30, api_key='YOUR_API_KEY'):
    # Initialize the NewsAPI client
    newsapi = NewsApiClient(api_key=api_key)
    query = query.replace(' ','&')
    # Fetch all articles matching the query
    all_articles = newsapi.get_everything(
        q=query,
        from_param=from_date,
        to=to_date,
        language=language,
        sort_by=sort_by,
        page_size=page_size
    )

    # Extract articles
    articles = all_articles.get('articles', [])

    # Convert to DataFrame
    if articles:
        df = pd.DataFrame(articles)
        return df
    else:
        return pd.DataFrame()  # Return an empty DataFrame if no articles are found

# Get the current time
current_time = datetime.now()
# Get the time 10 days ago
time_10_days_ago = current_time - timedelta(days=10)
api_key = 'c0e23a8956cf4b54af382abd932f88ff'
q = "Microsoft News June 2024"
df = fetch_news(q, time_10_days_ago, current_time, api_key=api_key)

df_news = df.drop("source", axis=1)

def preprocess_news_data(df):
    # Convert publishedAt to datetime
    df['publishedAt'] = pd.to_datetime(df['publishedAt'])
    df = df[~df['author'].isna()]
    df = df[['author', 'title']]
    return df

preprocessed_news_df = preprocess_news_data(df_news)
preprocessed_news_df.head()

Unnamed: 0,author,title
0,Kris Holt,Summer Game Fest 2024: What to expect and how ...
1,Ali Rees,Get some popcorn ready for an extra-long Xbox ...
2,Ali Rees,Leaks suggest we could see a huge Starfield an...
3,Wesley Yin-Poole,Microsoft Confirms Xbox Game Pass June 2024 Wa...
4,Wesley Yin-Poole,Warzone Has a New Frank Woods Cutscene — Final...
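Several of the fetched headlines are not about Microsoft at all: Apple, Nvidia, and Tesla stories appear in the prompt built from these titles. A minimal sketch of a keyword filter that could be applied to the titles before prompting follows. The helper `filter_relevant` and the keyword list are assumptions chosen for illustration and would need tuning per company.

```python
def filter_relevant(titles, keywords):
    """Keep titles that mention at least one keyword (case-insensitive)."""
    lowered = [kw.lower() for kw in keywords]
    return [t for t in titles if any(kw in t.lower() for kw in lowered)]


titles = [
    "Microsoft Confirms Xbox Game Pass June 2024 Wave 1 Lineup",
    "NVIDIA Splits 10-to-1; Non-farm Payrolls on Deck for Friday",
    "Elon Musk Is Hurting Tesla To Help Twitter and xAI",
    "Microsoft Copilot Plus hands-on: Does it need a Recall?",
]
keywords = ["Microsoft", "MSFT", "Windows"]  # assumed keyword list
print(filter_relevant(titles, keywords))
```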


### Pre-Processing the Data

In [None]:
def build_prompt(news_df):
    prompt = "You are a financial analyst tasked with providing insights into recent news articles related to the financial industry. Here are some recent news articles:\n\n"

    for index, row in news_df.iterrows():
        title = row['title']
        prompt += f"   **News:** {title}\n\n"

    prompt += "Please analyze these articles and provide insights into any potential impacts on the financial industry Sentiment on the provided company."

    return prompt

# Build the prompt
prompt = build_prompt(preprocessed_news_df)
print(prompt)

### LLM from Hugging Face Open Source

In [None]:
llm = HuggingFaceHub(
    repo_id="tiiuae/falcon-7b-instruct",
    model_kwargs={"temperature": 0.1},
)

In [None]:
Markdown(llm.invoke(prompt))

You are a financial analyst tasked with providing insights into recent news articles related to the financial industry. Here are some recent news articles:

   **News:** Summer Game Fest 2024: What to expect and how to watch games revealed live

   **News:** Get some popcorn ready for an extra-long Xbox Games June Showcase

   **News:** Leaks suggest we could see a huge Starfield announcement at Xbox Games Showcase

   **News:** Microsoft Confirms Xbox Game Pass June 2024 Wave 1 Lineup

   **News:** Warzone Has a New Frank Woods Cutscene — Finally Making a Crucial Moment in Call of Duty Black Ops Lore Canon

   **News:** WWDC 2024: What We're Expecting and How to Watch Apple's iOS 18 Event - CNET

   **News:** How to watch Intel’s big Computex 2024 keynote tonight

   **News:** How to watch Summer Game Fest 2024 — Not-E3, Xbox Games Showcase, Call of Duty: Black Ops 6 Direct, Wholesome Direct, and more

   **News:** Report: Microsoft is 'considering' bringing its flagship Xbox IP to PlayStation for the first time, but will it?

   **News:** Destiny 2 Developer Bungie ‘Truly Sorry’ for The Final Shape Launch Issues

   **News:** NVIDIA Splits 10-to-1; Non-farm Payrolls on Deck for Friday

   **News:** A PR disaster: Microsoft has lost trust with its users, and Windows Recall is the straw that broke the camel's back

   **News:** Sony Removes 8K Claim From PlayStation 5 Boxes

   **News:** Engadget Podcast: How AI will shape Apple's WWDC 2024

   **News:** This Week in Security: Recall, Modem Mysteries, and Flipping Pages

   **News:** Wholesome Pokemon-like "Creatures of Ava" shows off a new trailer, with a playable demo coming soon

   **News:** Microsoft Copilot Plus hands-on: Does it need a Recall?

   **News:** Surface Laptop 7 vs. Samsung Galaxy Book4 Edge: Which high-end Copilot+ PC works better for you?

   **News:** Nvidia was officially more valuable than Apple — for a couple of hours, at least

   **News:** New Windows 10 update gives it Windows 11’s photo-sharing capabilities with Android devices – but you might want to hang on

   **News:** Russian Influence Campaign Targeting Paris Olympics, Microsoft Warns

   **News:** iOS 18 is coming next week: Here’s everything we know

   **News:** Nvidia app beta offers warranty-safe GPU tuning and improved stream recording

   **News:** Elon Musk Is Hurting Tesla To Help Twitter and xAI

   **News:** Bill Gates Could Be The World's First Trillionaire If He Had 'Diamond Handed' His Microsoft Shares — He'd Be Sitting On $1.47 Trillion Today

   **News:** Nvidia stock crosses $3 trillion market cap, overtakes Apple as second-largest co. in US market

   **News:** Adafruit Weekly Editorial Round-Up: AANHPI Month, National Paper Airplane Day, Adafruit TRRS Trinkey & more!

   **News:** Apple WWDC 2024: get ready for lots of AI news

   **News:** Microsoft Issues New Warning For 70% Of All Windows Users

   **News:** Microsoft is again named the overall leader in the Forrester Wave for XDR

Please analyze these articles and provide insights into any potential impacts on the financial industry Sentiment on the provided company.
1. Microsoft's recent news articles regarding the Xbox Games Showcase and the upcoming Windows 11 update have been generally positive, with a focus on the company's continued push towards the gaming industry. This could potentially lead to increased sales and revenue for Microsoft, as well as increased brand awareness and loyalty among consumers.

2. The news articles related to the new Starfield game from Bethesda have been generating a lot of buzz and excitement among gamers. The game's release date has been

# **Financial Data Investment Advisor**

**Problem Statement:** Build a financial advisor based on a dataset of financial advice covering stocks, mutual funds, and gold or silver bonds, using Python, LangChain, and an open-source LLM.

**Project Methodology**
- This project uses open data from Kaggle on financial advice.
- The data is loaded with Python and pre-processed into prompt-response format.
- The prompt-response records are inserted into a vector DB using an embedding model from Hugging Face.
- A RAG QA chain is built with LangChain, using the open-source Falcon 7B LLM.
- The response is checked.


![](https://media.licdn.com/dms/image/D5612AQFSyeoRrkC5fw/article-cover_image-shrink_720_1280/0/1701189671766?e=2147483647&v=beta&t=cpa6wlGMWG44ZyGW6MWyKZ2Vr0BT-G1zlb8RB0yio6w)

## **Loading the Financial Data from Kaggle or Any Open Source Platform**

Data Source - https://www.kaggle.com/datasets/nitindatta/finance-data

In [None]:
data = pd.read_csv("Finance_data.csv")
data_fin = data.to_dict(orient='records')

In [None]:
for entry in data_fin:
  prompt = f"I'm a {entry['age']}-year-old {entry['gender']} looking to invest in {entry['Avenue']} for {entry['Purpose']} over the next {entry['Duration']}. What are my options?"
  print(prompt)

### Pre-Processing the Data into Prompt-Response Format

In [None]:
# Convert the data to prompt-response format
prompt_response_data = []
for entry in data_fin:
    prompt = f"I'm a {entry['age']}-year-old {entry['gender']} looking to invest in {entry['Avenue']} for {entry['Purpose']} over the next {entry['Duration']}. What are my options?"
    response = (
        f"Based on your preferences, here are your investment options:\n"
        f"- Mutual Funds: {entry['Mutual_Funds']}\n"
        f"- Equity Market: {entry['Equity_Market']}\n"
        f"- Debentures: {entry['Debentures']}\n"
        f"- Government Bonds: {entry['Government_Bonds']}\n"
        f"- Fixed Deposits: {entry['Fixed_Deposits']}\n"
        f"- PPF: {entry['PPF']}\n"
        f"- Gold: {entry['Gold']}\n"
        f"Factors considered: {entry['Factor']}\n"
        f"Objective: {entry['Objective']}\n"
        f"Expected returns: {entry['Expect']}\n"
        f"Investment monitoring: {entry['Invest_Monitor']}\n"
        f"Reasons for choices:\n"
        f"- Equity: {entry['Reason_Equity']}\n"
        f"- Mutual Funds: {entry['Reason_Mutual']}\n"
        f"- Bonds: {entry['Reason_Bonds']}\n"
        f"- Fixed Deposits: {entry['Reason_FD']}\n"
        f"Source of information: {entry['Source']}\n"
    )
    prompt_response_data.append({"prompt": prompt, "response": response})

prompt_response_data[:5]

[{'prompt': "I'm a 34-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next 1-3 years. What are my options?",
  'response': 'Based on your preferences, here are your investment options:\n- Mutual Funds: 1\n- Equity Market: 2\n- Debentures: 5\n- Government Bonds: 3\n- Fixed Deposits: 7\n- PPF: 6\n- Gold: 4\nFactors considered: Returns\nObjective: Capital Appreciation\nExpected returns: 20%-30%\nInvestment monitoring: Monthly\nReasons for choices:\n- Equity: Capital Appreciation\n- Mutual Funds: Better Returns\n- Bonds: Safe Investment\n- Fixed Deposits: Fixed Returns\nSource of information: Newspapers and Magazines\n'},
 {'prompt': "I'm a 23-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next More than 5 years. What are my options?",
  'response': 'Based on your preferences, here are your investment options:\n- Mutual Funds: 4\n- Equity Market: 3\n- Debentures: 2\n- Government Bonds: 1\n- Fixed Deposits: 5\n- PPF: 6\n- Gold: 7\

### Storing Data into Vector DB

In [None]:
from langchain.docstore.document import Document
documents = []
for entry in prompt_response_data:
    combined_text = f"Prompt: {entry['prompt']}\nResponse: {entry['response']}"
    documents.append(Document(page_content=combined_text))

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
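With `chunk_size=100`, each combined prompt-response document is much longer than one chunk, so the prompt and the response end up in different chunks. The sketch below uses a plain fixed-width sliding window as a simplified stand-in for `RecursiveCharacterTextSplitter` (which additionally prefers to break on separators such as newlines and spaces) to show the effect on one record. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=100, overlap=10):
    """Split text into fixed-width chunks; consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


record = (
    "Prompt: I'm a 34-year-old Female looking to invest in Mutual Fund for "
    "Wealth Creation over the next 1-3 years. What are my options?\n"
    "Response: Based on your preferences, here are your investment options:\n"
    "- Mutual Funds: 1\n- Equity Market: 2\n"
)
chunks = sliding_chunks(record)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The first chunk contains only part of the prompt and none of the response. Keeping each prompt-response pair in a single chunk, for example by raising `chunk_size` above the length of the longest record, would let retrieval return the response together with the prompt.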

In [None]:
from langchain_community.vectorstores import Chroma
persist_directory = 'docs/chroma/'
vectordb_fin = Chroma.from_documents(
    documents=texts,
    embedding=hg_embeddings,
    persist_directory=persist_directory
)

### Building RAG System using VectorDB and LLM

In [None]:
from langchain.chains import RetrievalQA
retriever_fin = vectordb_fin.as_retriever(search_kwargs={"k":5})
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever_fin, return_source_documents=False)
query = "I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?"
result = qa({"query": query})
result

{'query': "I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?",
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nPrompt: I'm a 34-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 32-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 28-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 24-year-old Female looking to invest in Mutual Fund for Wealth Creation over the next\n\nPrompt: I'm a 29-year-old Male looking to invest in Mutual Fund for Wealth Creation over the next\n\nQuestion: I'm a 34-year-old female looking to invest in mutual funds for wealth creation over the next 1-3 years. What are my options?\nHelpful Answer:\n\nAs a 34-year-old

# **GenAI Financial Fraud Detection Application**

**Problem Statement:** Build a financial fraud detection algorithm that detects fraud or anomalies in transactions or user behavior based on past data and pattern recognition. The data is mostly unstructured, so using GenAI or an LLM is a must.

**Project Methodology**
-

![](https://images.spiceworks.com/wp-content/uploads/2021/06/16094651/Fraud-Detection.png)