In [23]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.document_loaders import UnstructuredURLLoader

## Loaders

### Text Loader

In [93]:
loader = TextLoader('nvda_news_1.txt')
data = loader.load()

In [94]:
# data is a list of Document objects
# Each Document has two attributes: page_content and metadata

data[0].page_content
data[0].metadata

{'source': 'nvda_news_1.txt'}
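
The shape of what `load()` returns can be pictured with a small stand-in. This is only a sketch, not LangChain's implementation: `SimpleDocument` and `load_text` are hypothetical names, and only the two attributes seen above (`page_content` and `metadata`) are modelled.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-element list of documents."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Usage with a throwaway file (generated name and placeholder text, not from the notebook)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("sample text for the loader sketch")
docs = load_text(tmp.name)
os.remove(tmp.name)

print(docs[0].page_content)
print(docs[0].metadata)
```

One file yields one document here, which matches the `data[0]` access pattern used in the cell above.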

### CSV Loader

In [95]:
loader = CSVLoader('movies.csv', source_column='title') # source_column: the column whose value is stored as each document's metadata['source']
data = loader.load()
len(data)

9

In [96]:
data[0].metadata
data[0].page_content

'movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR'
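
The `key: value` layout of `page_content` above can be reproduced with the standard `csv` module. This is a rough sketch under assumptions: `rows_to_documents` is a hypothetical helper, the sample CSV is an abbreviated, illustrative stand-in for `movies.csv`, and plain dicts are used instead of LangChain's Document class.

```python
import csv
import io

# Abbreviated, illustrative data; the second row is invented for the example
sample_csv = io.StringIO(
    "movie_id,title,industry\n"
    "101,K.G.F: Chapter 2,Bollywood\n"
    "102,Example Movie,Hollywood\n"
)


def rows_to_documents(file_obj, source_column):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(file_obj)):
        # One "column: value" line per column, joined with newlines
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": row[source_column], "row": row_number},
        })
    return documents


docs = rows_to_documents(sample_csv, source_column="title")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Each row becomes its own document, which is why `len(data)` equals the number of rows in the file.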

### Unstructured URL Loader

* The Unstructured URL loader depends on the following packages; install them with pip:
- unstructured, libmagic, python-magic, python-magic-bin

In [97]:
loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [98]:
data = loader.load()

In [99]:
len(data)

2

In [100]:
print(data[0].metadata)
print(data[0].page_content)

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}
English

Hindi

Gujarati

Specials

Hello, Login

Hello, Login

Log-inor Sign-Up

My Account

My Profile

My Portfolio

My Watchlist

My Alerts

My Messages

Price Alerts

My Profile

My PRO

My Portfolio

My Watchlist

My Alerts

My Messages

Price Alerts

Logout

Loans up to ₹50 LAKHS

Fixed Deposits

Credit CardsLifetime Free

Credit Score

Chat with Us

Download App

Follow us on:

Go Ad-Free

My Alerts

co-presented by

associated by

Business

Markets

Stocks

Economy

Companies

Trends

IPO

Opinion

EV Special

HomeNewsBusinessBanksHDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer

Trending Topics

Budget 2025 LiveNew Income tax SlabsBudget Highlights 2025Income tax news liveGold Price Today

HDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer

Chakrabarti has been appointed for a period of five years from December 1

## Text Splitters

### CharacterTextSplitter

In [32]:
# Sample text taken from Wikipedia

text = """Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. 
Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. 
Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.

Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $677 million worldwide ($715 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. 
It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades."""


In [33]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 200,
    chunk_overlap = 0
)

In [35]:
chunks = splitter.split_text(text)
len(chunks)

Created a chunk of size 210, which is longer than the specified 200
Created a chunk of size 208, which is longer than the specified 200
Created a chunk of size 358, which is longer than the specified 200


9

In [36]:
[len(chunk) for chunk in chunks]

[105, 120, 210, 181, 197, 207, 128, 357, 253]
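
Why some chunks exceed `chunk_size` can be seen in a simplified version of the idea: split on a single separator, then greedily merge neighbouring pieces. This is a sketch under that assumption, not the library's code; `split_on_separator` is a hypothetical name. A piece that is already longer than `chunk_size` cannot be cut further because there is only one separator to split on.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge neighbours up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece starts a new chunk even if it is itself too long
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 30 + "\nanother short line"
chunks = split_on_separator(sample, separator="\n", chunk_size=25)
print([len(c) for c in chunks])  # [10, 30, 18]
```

The 30-character line survives as an oversized chunk, which mirrors the "longer than the specified 200" warnings above.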

### RecursiveCharacterTextSplitter

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ['\n\n', '\n', ' '],
    chunk_size = 200,
    chunk_overlap = 0
)

chunks = r_splitter.split_text(text)
len(chunks)

13

In [38]:
[len(chunk) for chunk in chunks] # All chunks are at most 200 characters

# The splitter first splits on '\n\n'
# Any piece still longer than chunk_size is split on '\n'
# Any piece still too long after that is split on ' '
# Small neighbouring pieces are then merged back together as long as the result stays within chunk_size

[105, 120, 199, 10, 181, 197, 198, 8, 128, 191, 165, 198, 54]
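
The split-then-merge behaviour can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `merge` and `recursive_split` are hypothetical helpers, and overlap is ignored. Pieces that fit are merged with the current separator; a piece that is too long is handed to the next, finer separator.

```python
def merge(pieces, separator, chunk_size):
    """Greedily join neighbouring pieces while the result fits in chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, separators, chunk_size):
    """Split on the first separator; re-split oversized pieces with the rest."""
    if len(text) <= chunk_size or not separators:
        return [text]  # fits already, or nothing finer left to split on
    separator, remaining = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(separator):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the small pieces collected so far, then recurse on the big one
            chunks.extend(merge(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(part, remaining, chunk_size))
    chunks.extend(merge(small, separator, chunk_size))
    return chunks


sample = "aaaa bbbb cccc dddd\n\neeee ffff"
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=10)
print(chunks)  # ['aaaa bbbb', 'cccc dddd', 'eeee ffff']
```

Because merging is greedy, a long line can leave a short leftover fragment at its end; the 10- and 8-character chunks in the output above look like such leftovers.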

## Vector DB - FAISS
* Install these packages with pip: faiss-cpu and sentence-transformers

In [40]:
import pandas as pd
pd.set_option('display.max_colwidth', 500)

In [41]:
df = pd.read_csv('sample_text.csv')
df.shape

(8, 2)

In [42]:
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


#### Step 1: Create source embeddings from the text column

In [43]:
# Convert the text column of the DataFrame to vector embeddings

from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer('all-mpnet-base-v2')
vectors = encoder.encode(df.text)
vectors.shape

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


(8, 768)

In [58]:
#ndf = df.copy(deep=True)
#df['vectors'] = vectors

In [44]:
vectors.shape # The 8 text records from the DataFrame are converted into 8 vectors of 768 dimensions each

(8, 768)

In [45]:
dims = vectors.shape[1]
dims

768

#### Step 2: Build a FAISS index for the vectors

In [46]:
import faiss


In [60]:
index = faiss.IndexFlatL2(dims)
index # Empty flat index with 768 dims

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001C6ECBE1470> >

#### Step 3: Add the source vectors to the index

The vectors are added exactly as the encoder produced them; this notebook runs no explicit normalization step, and `IndexFlatL2` ranks results by L2 distance.

In [62]:
index.add(vectors)
index # Vectors are added to the vector db

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001C6ECBE1470> >

#### Step 4: Encode the search text using the same encoder

In [83]:
# search_query = 'I want to buy a polo t-shirt'  # alternative, fashion-related query
search_query = 'Music by that singer was its best last night'

vec = encoder.encode(search_query)
vec.shape

(768,)

In [84]:
# index.search expects a 2-D array of shape (1, 768), so reshape the query vector accordingly
import numpy as np
svec = np.array(vec).reshape(1, -1)
svec.shape


(1, 768)
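
Whether vectors are normalized affects what an L2 ranking means. For unit-length vectors, squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 gives the same order as ranking by cosine. The check below is a sketch in plain Python (lists stand in for NumPy arrays so it has no dependencies); `normalize`, `squared_l2`, and `cosine` are hypothetical helper names and the two vectors are toy values.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [1.0, 0.0]
ua, ub = normalize(a), normalize(b)

# For unit vectors: ||ua - ub||^2 == 2 - 2 * cos(a, b)
lhs = squared_l2(ua, ub)
rhs = 2 - 2 * cosine(a, b)
print(round(lhs, 6), round(rhs, 6))  # 0.8 0.8
```

With FAISS itself, `faiss.normalize_L2` can normalize a float32 array in place before it is added or searched; whether that step is needed depends on whether the encoder already returns unit-length vectors.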

#### Step 5: Search for similar vectors in the FAISS index

In [85]:
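distances, I = index.search(svec, k=3) # distances holds the distance from the query vector to each of the k nearest stored vectors
print(distances) 
I # I holds the positions (in insertion order) of the stored vectors nearest to the query vector
# A position >= len(df) (12 in the output below) is not a row of df. Positions 4 and 12 have identical distances and 12 = 4 + 8,
# which suggests index.add(vectors) was run more than once, so the index most likely holds a second copy of every row.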

[[1.3944436 1.3944436 1.7378719]]


array([[ 4, 12,  5]], dtype=int64)
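
What a flat L2 index computes can be sketched as a brute-force scan: score every stored vector by squared L2 distance to the query and keep the `k` smallest. The version below is a plain-Python sketch with toy 2-D vectors, not FAISS itself; `search` is a hypothetical helper. It also simulates adding the same batch twice, which yields pairs of identical distances and positions beyond the original row count.

```python
def search(index_vectors, query, k):
    """Return (distances, ids) of the k nearest stored vectors by squared L2 distance."""
    scored = sorted(
        (sum((x - q) ** 2 for x, q in zip(vec, query)), i)
        for i, vec in enumerate(index_vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # toy stand-ins for the embeddings
stored = vectors + vectors                       # simulates adding the same batch twice

distances, ids = search(stored, query=[0.9, 0.1], k=3)
print(distances, ids)

# Map positions back to original rows when the same batch was added more than once
rows = sorted({i % len(vectors) for i in ids})
print(rows)  # [0, 1]
```

Rebuilding the index from scratch before adding vectors avoids the duplicates altogether; the modulo mapping is only a workaround for an index that already contains them.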

In [86]:
row_indices = I[I<df.shape[0]].tolist()
row_indices

[4, 5]

In [87]:
df.loc[row_indices] # the music query is matched with rows in the Event category

Unnamed: 0,text,category
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event


## Example

In [89]:
import os
import streamlit as st
import pickle
import time
import langchain
from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

In [148]:
llm = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],  # read the token from the environment, not a quoted string literal
    model="gpt-4o",
    temperature=0.6,
    max_tokens=4096,
    top_p=1
)


In [101]:
loaders = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

data = loaders.load()
len(data)

2

In [112]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)

docs = text_splitter.split_documents(data)
len(docs)

27
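
The effect of `chunk_overlap` can be pictured with a fixed-width sliding window. This is a deliberate simplification: the real splitter cuts at separators rather than at fixed character offsets, and `sliding_chunks` is a hypothetical helper. Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.

```python
import string


def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


text = string.ascii_lowercase[:25]  # 'abcdefghijklmnopqrstuvwxy'
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxy']
```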

In [103]:
docs[0].page_content

'English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Profile\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nMy Profile\n\nMy PRO\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nLogout\n\nLoans up to ₹50 LAKHS\n\nFixed Deposits\n\nCredit CardsLifetime Free\n\nCredit Score\n\nChat with Us\n\nDownload App\n\nFollow us on:\n\nGo Ad-Free\n\nMy Alerts\n\nco-presented by\n\nassociated by\n\nBusiness\n\nMarkets\n\nStocks\n\nEconomy\n\nCompanies\n\nTrends\n\nIPO\n\nOpinion\n\nEV Special\n\nHomeNewsBusinessBanksHDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer\n\nTrending Topics\n\nBudget 2025 LiveNew Income tax SlabsBudget Highlights 2025Income tax news liveGold Price Today\n\nHDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer\n\nChakrabarti has been appointed for a period of five years from December 14, 2023 to December 13, 2028.\n\nMoneycontrol N

In [113]:
chunk_text = [chunk.page_content for chunk in docs]

In [115]:
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer('all-mpnet-base-v2')
vectors = encoder.encode(chunk_text)
vectors.shape
dims = vectors.shape[1]

In [116]:
index = faiss.IndexFlatL2(dims)
index.add(vectors)

In [130]:
search_query = 'Stop-Loss for Manappuram'

In [131]:
vec = encoder.encode(search_query)
vec.shape

(768,)

In [132]:
svec = np.array(vec).reshape(1, -1)
svec.shape

(1, 768)

In [136]:
distances, I = index.search(svec, k=2)
distances
I

array([[12, 15]], dtype=int64)

In [137]:
doc_indices = I[I<len(docs)].tolist()
doc_indices

[12, 15]

In [142]:

context = '\n'.join([chunk_text[i] for i in doc_indices])
context

"Manappuram Finance: Buy | LTP: Rs 142.40 | Stop-Loss: Rs 131 | Target: Rs 155 | Return: 9 percent\n\nSince February 2022, Manappuram Finance has been in a consolidation phase. After reaching a low point at Rs 81 levels in June 2022, the stock gradually climbed. During its ascent, it surpassed the 50-day, 100-day, and 200-day moving averages. However, its upward momentum slowed around Rs 133-134 levels, acting as a resistance, and a corrective decline followed. This decline found support near the 200-day moving average. A rebound in the price followed this price action.\n\nIn the past month, the price tested the resistance at Rs 131-133 levels multiple times. Finally, a breakout occurred which took the price above the horizontal trendline resistance of Rs 133, creating fresh buying opportunities.\n\nPresently, one can hold the stock, maintain a stop-loss near Rs 131, while expecting an upward move towards Rs 155.\nFurthermore, the stock has been consistently trading above its important

In [140]:
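from openai import OpenAI as OpenAIClient  # OpenAI SDK client, distinct from the LangChain `llm` wrapper

# chat.completions.create is part of the OpenAI SDK client, so a client is created here
client = OpenAIClient(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

def get_response(message):
    """Send a single user message and return the model's reply text."""
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": message}],
        model="gpt-4o",
        temperature=0.6,
        max_tokens=4096,
        top_p=1,
    )
    return response.choices[0].message.content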

In [145]:
template = """
Context:
{context}

Question:
{search_query}

Answer:
"""

In [149]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt_template = PromptTemplate(input_variables=["context", "search_query"], template=template)
chain = LLMChain(llm=llm, prompt=prompt_template)
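
How the placeholders in the template get filled can be sketched with plain `str.format`, which uses the same `{name}` syntax. This is an illustration only: `build_prompt` is a hypothetical helper, the context string is a shortened sample, and no LangChain classes or model calls are involved.

```python
template = """
Context:
{context}

Question:
{search_query}

Answer:
"""


def build_prompt(context, search_query):
    """Substitute the retrieved context and the user's question into the template."""
    return template.format(context=context, search_query=search_query)


prompt = build_prompt(
    context="Manappuram Finance: Buy | Stop-Loss: Rs 131",  # shortened sample context
    search_query="Stop-Loss for Manappuram",
)
print(prompt)
```

The resulting string is what the model sees: the retrieved chunks first, then the question, then an open `Answer:` line for it to complete.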

In [150]:
try:
    answer = chain.run({"context": context, "search_query": search_query})
    print(f"Answer (using LangChain):\n{answer}")
except Exception as e:
    print(f"An error occurred: {e}")

Answer (using LangChain):
Stop-Loss for Manappuram: Rs 131

Disclaimer: The views and investment tips expressed by experts on Moneycontrol.com are their own and not those of the website or its management. Moneycontrol.com advises users to check with certified experts before taking any investment decisions.​​​Manappuram Finance: Buy | LTP: Rs 142.40 | Stop-Loss: Rs 131 | Target: Rs 155 | Return: 9 percent

Since February 2022, Manappuram Finance has been in a consolidation phase. After reaching a low point at Rs 81 levels in June 2022, the stock gradually climbed. During its ascent, it surpassed the 50-day, 100-day, and 200-day moving averages. However, its upward momentum slowed around Rs 133-134 levels, acting as a resistance, and a corrective decline followed. This decline found support near the 200-day moving average. A rebound in the price followed this price action. Presently, one can hold the stock, maintain a stop-loss near Rs 131, while expecting an upward move towards Rs 155. 