## Exploring LangChain - Part II

The purpose of this notebook is to explore the fundamentals of LangChain.

Key topics covered include: document loaders, text splitters, embeddings, and vector databases.

**NOTE: This is just a practice notebook that acts as a precursor to the main project**

In [12]:
# Loaders let you load data from a file (here, a CSV file)
# Each loaded item (e.g. data[0]) is a Document with metadata and page_content

In [29]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="C:/Users/DELL/Downloads/salesdata.csv")

data = loader.load()

type(data[0])

langchain_core.documents.base.Document

In [30]:
# A Document contains metadata and page_content
# The "source" shown in the metadata can be changed by passing source_column="id" to CSVLoader
print(data[0])

page_content='id: 1
nItems: 1469
mostFreqStore: Stockton
mostFreqCat: Alcohol
nCats: 72
preferredBrand: Veina
nBrands: 517
nPurch: 82
salesLast3Mon: 2741.97
salesThisMon: 1283.87
daysSinceLastPurch: 1
meanItemPrice: 1.86655547991831
meanShoppingCartValue: 33.4386585365854
customerDuration: 821' metadata={'source': 'C:/Users/DELL/Downloads/salesdata.csv', 'row': 0}
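
To make the structure of the printed Document concrete, here is a minimal sketch of how one CSV row could be turned into `page_content` plus `metadata`, and what choosing a source column changes. It uses only the standard library. `SimpleDocument` and `rows_to_documents` are hypothetical names, the sample keeps just three columns of the first row, and this is an illustration, not LangChain's actual implementation.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_text, source="salesdata.csv", source_column=None):
    """Turn each CSV row into one document made of "column: value" lines."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # With a source column, that column's value replaces the file path.
        src = row[source_column] if source_column else source
        docs.append(SimpleDocument(content, {"source": src, "row": i}))
    return docs


# Three columns of the first row shown above, reduced for readability.
sample = "id,mostFreqStore,mostFreqCat\n1,Stockton,Alcohol\n"
docs = rows_to_documents(sample)
docs_by_id = rows_to_documents(sample, source_column="id")

print(docs[0].page_content)
print(docs[0].metadata)        # source is the file name
print(docs_by_id[0].metadata)  # source is the value of the "id" column
```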


In [25]:
#installing essential libraries, libmagic is used for file type detection
#!pip3 install unstructured libmagic python-magic python-magic-bin

## Text Loader
The `unstructured` library in Python helps process unstructured data such as images and text documents, transforming it into structured formats for machine learning applications.

LangChain's UnstructuredURLLoader uses this library under the hood to fetch each web page, parse its HTML structure, and extract the text.

In [31]:
#Reading unstructured news articles from the web
from langchain_community.document_loaders import UnstructuredURLLoader

In [34]:
loader = UnstructuredURLLoader(
urls =["https://www.ft.com/content/632411eb-c3fa-4351-a3b6-b0e30bdc0ef7?accessToken=zwAAAYWwEkESkc9jJBHrw_pDUdOjtrDjC9wO9wE.MEQCIC7-tLEPkOYVG427tYIVBtANt60iz-FWXCfBDHwEb0G0AiB5JouSRl1fivzejChmdq5TnvVdNmiibHtVbJUCviVHxA&segmentId=501d7750-774f-dc19-66bb-320ebfb582d1",
       "https://www.ft.com/content/95745636-2d21-46aa-b0f1-6bda1c0fdd0b?accessToken=zwAAAYblEFF3kdOVdFY2LSFGqtOw8WvaHA_dCwE.MEYCIQCKqVGoyEh2jPvo574Ns5jiUzEVBHMrg2m8wfbjaLwupwIhANpYFrgjSfID76yCJIJPJEhzWtetNi5MsOMiYl_gyjaH&segmentId=8bab5fbd-4508-93c4-7ded-a9e1428c7053"])

In [35]:
data = loader.load()

In [38]:
print(data[0])

page_content='Accessibility helpSkip to navigationSkip to contentSkip to footer

Sign In

Subscribe

Open side navigation menuOpen search bar

SubscribeSign In

MenuSearch

Home

World

US

Companies

Tech

Markets

Climate

Opinion

Lex

Work & Careers

Life & Arts

HTSI

Financial Times

SubscribeSign In

The Big Read Markets volatility

Manage your delivery channels here

The cracks in the US Treasury bond market

The meltdown in UK gilts exposed the vulnerability of large bond markets. Could the biggest of them survive a wave of selling?

The cracks in the US Treasury bond market on x (opens in a new window)

The cracks in the US Treasury bond market on facebook (opens in a new window)

The cracks in the US Treasury bond market on linkedin (opens in a new window)

The cracks in the US Treasury bond market on whatsapp (opens in a new window)

The cracks in the US Treasury bond market on x (opens in a new window)

The cracks in the US Treasury bond market on facebook (opens in a new 

## Text Splitting
Every LLM has a token limit, so a large block of text must be reduced to smaller chunks before it is passed to the model.
A naive split produces chunks of uneven size: with a limit of roughly 4,000 tokens, one chunk may be around 3,000 words and another only 1,000. It therefore makes sense to merge small splits so that each chunk is efficient, that is, close to the limit.
The overall process is: divide the large text into small splits, then merge the splits so that each chunk is close to the token limit of the chosen LLM. We also want some overlap, so that the end of one chunk is repeated at the start of the next and context carries over between chunks. LangChain provides simple APIs for all of this.
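
The idea of overlapping chunks can be sketched in a few lines of plain Python. This is only an illustration: it counts words, not tokens, and `chunk_words` is a hypothetical helper, not a LangChain function.

```python
def chunk_words(words, chunk_size, overlap):
    """Group words into chunks of chunk_size, repeating `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `overlap=1`, the last word of each chunk is also the first word of the next one, which is the carry-over of context described above.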


In [88]:
text = """
Interstellar is a 2014 epic science fiction drama film directed by Christopher Nolan, who co-wrote the screenplay with his brother Jonathan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where Earth is suffering from catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

The screenplay had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. Theoretical physicist Kip Thorne was an executive producer and scientific consultant on the film, and wrote the tie-in book The Science of Interstellar. Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Filming began in late 2013 and took place in Alberta, Klaustur, and Los Angeles. Interstellar uses extensive practical and miniature effects, and the company DNEG created additional digital effects.

Interstellar was released in theaters on November 7, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews and grossed over $681 million worldwide ($730 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. Thorne's computer-generated depiction of a black hole in the film has also received commendation from astronomers and physicists.[4][5][6] Among its various accolades, Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades. It was Lynda Obst's final film as producer before her death.

"""

In [71]:
# Manual approach: create chunks of 200 words each
words = text.strip().split()
len(words)

271

In [79]:
# This is a tedious approach
chunks = []
current = []
for word in words:
    current.append(word)
    if len(current) == 200:      # chunk is full: store it and start a new one
        chunks.append(" ".join(current))
        current = []
if current:                      # store the final, partially filled chunk
    chunks.append(" ".join(current))

print(len(chunks[0].split()))
print(len(chunks[1].split()))      # 200 + 71 = 271, so no word is lost

200
71


**LangChain provides a simple API to do this job**

- CharacterTextSplitter
- RecursiveCharacterTextSplitter

In [94]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator=".",    # split the text on "."
    chunk_size=200,   # target maximum chunk length, in characters
    chunk_overlap=0)

chunks = splitter.split_text(text)
print(len(chunks))

for chunk in chunks:
    print(len(chunk))

Created a chunk of size 205, which is longer than the specified 200


12
139
119
204
127
151
193
176
103
179
128
176
60


Observation: The text is split into 12 chunks with different numbers of characters. One chunk exceeds chunk_size because no "." occurred before the 200-character limit was reached.
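
Why a chunk can exceed the limit is easier to see in a simplified split-then-merge sketch. The code below is an assumption-laden illustration of the general idea, not LangChain's actual logic, and `split_and_merge` is a hypothetical name: pieces are merged greedily while they fit, and a single piece that is already too long is kept whole because the one separator cannot break it down further.

```python
def split_and_merge(text, separator, chunk_size):
    """Split on separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + " " + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole,
            # because this separator cannot break it down any further.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Short one. Another short one. " + "x" * 50 + ". Tail"
chunks = split_and_merge(sample, separator=".", chunk_size=30)
print([len(c) for c in chunks])
```

The 50-character run of `x` has no "." inside it, so it ends up as one oversized chunk even though `chunk_size` is 30.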

In [96]:
# Using "\n" as the separator instead

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=0)

chunks = splitter.split_text(text)
print(len(chunks))

for chunk in chunks:
    print(len(chunk))

Created a chunk of size 467, which is longer than the specified 200
Created a chunk of size 594, which is longer than the specified 200


3
467
594
712


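Observation: With "\n" as the separator we get only 3 chunks, one per paragraph, and all of them are far larger than chunk_size. Whichever single separator we pick, some chunks end up too large.

### RecursiveCharacterTextSplitter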
To address this, we can use RecursiveCharacterTextSplitter, which tries a list of separators one by one until the chunks are no larger than chunk_size. It also merges smaller pieces to get close to the target size.

In [100]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n","\n","."," "],
    chunk_size = 200,
    chunk_overlap = 0)            # number of characters shared between consecutive chunks

chunks = r_splitter.split_text(text)

In [101]:
len(chunks)

14

In [104]:
#All chunks are now < chunk_size
for chunk in chunks:
    print(len(chunk))

139
121
198
7
1
127
153
195
119
162
181
130
177
62


What's happening internally?

In [110]:
# Splitting on the first separator ("\n\n") gives 4 pieces
chunks = text.split("\n\n")

for chunk in chunks:
    print(len(chunk))

468
594
712
0


In [111]:
#first chunk
first_split = chunks[0]
first_split

'\nInterstellar is a 2014 epic science fiction drama film directed by Christopher Nolan, who co-wrote the screenplay with his brother Jonathan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where Earth is suffering from catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.'

In [115]:
# Splitting the first piece on the second separator ("\n")
first_split_again = first_split.split("\n")

for chunk in first_split_again:
    print(len(chunk))

0
467


In [117]:
# Splitting the second piece from above on the " " separator (the fourth in the list; the "." step is skipped here)
first_split_again_again = first_split_again[1].split(" ")

for chunk in first_split_again_again:
    print(len(chunk))

12
2
1
4
4
7
7
5
4
8
2
11
6
3
8
3
10
4
3
7
9
2
5
7
12
4
9
7
9
4
6
5
8
4
6
3
7
6
3
2
1
9
6
5
5
2
9
4
12
6
3
7
3
4
7
1
5
2
10
3
6
7
1
8
4
6
2
6
2
1
3
4
3
8


Once splitting is done, the pieces are merged again to form chunks close to the target size. For example, a piece of size 210 is split further on " " and the words are then merged back into something like 198 + 12 (depending on where the whitespace falls).
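
The split-then-merge behaviour can be sketched as a short recursive function. This is a deliberately simplified illustration under my own assumptions (no overlap, separators are dropped at chunk boundaries, and `recursive_split` is a hypothetical name), not the library's real implementation.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: try again with the next, finer separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg hhh\n\niii"
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=11)
print(chunks)
```

The first paragraph fits as it is, while the second one is too long for `chunk_size=11`, so it falls through to the " " separator and its words are merged back into two chunks.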

## Vector Databases

- For this project, we will use FAISS (Facebook AI Similarity Search), a library for fast similarity search over a set of vectors. It can also serve as an in-memory vector DB for small projects.
- We will convert text into embeddings using a model such as OpenAI embeddings or Word2Vec, and then store them in a vector database. For this project, that store is a FAISS index.
- Suppose the vector DB holds 1M embeddings. The input query is converted into an embedding as well, and a search is performed. The vector DB returns the stored embeddings that are most similar to the query embedding.
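
The search described in the last bullet can be sketched without any library as a brute-force nearest-neighbour lookup. The vectors below are made-up 2-D toy values and `search` is a hypothetical helper; it only illustrates the concept and says nothing about how FAISS is implemented internally.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(stored, query, k):
    """Return (distance, row) pairs for the k stored vectors closest to query."""
    scored = sorted((l2_distance(vec, query), row) for row, vec in enumerate(stored))
    return scored[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
results = search(stored, query=[0.9, 0.1], k=2)
print(results)
```

Comparing the query against every stored vector is fine for a handful of rows, but it becomes slow at the scale of millions of embeddings, which is where a dedicated similarity-search library helps.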


In [129]:
#install packages

#%pip install faiss-cpu
#%pip install -U sentence-transformers

In [2]:
import pandas as pd

pd.set_option('display.max_colwidth',100)

In [3]:
df = pd.read_csv('C:/Users/DELL/Downloads/sample_text.csv')
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


In [4]:
# Converting sentences into vectors
# We will use a Hugging Face sentence transformer
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")  # model used for encoding

vectors = encoder.encode(df.text.tolist())  # encode expects a list of strings
vectors.shape

(8, 768)

In [5]:
# Store the embedding dimension (768) in a variable
dim = vectors.shape[1]

In [6]:
import faiss

index = faiss.IndexFlatL2(dim)   # creating an index for 768-dimensional vectors
index 

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x0000025333DA5EC0> >

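##### We are using IndexFlatL2, which compares vectors by Euclidean (L2) distance.

##### IndexFlatL2 stores the vectors as they are and compares the query against every one of them (exact search). Its role is loosely similar to an index in a database such as MySQL.

##### Other FAISS index types build internal data structures that speed up similarity search on large collections.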

In [7]:
index.add(vectors)

In [13]:
search_query = "I want to buy a shirt for my upcoming trip"

vec = encoder.encode(search_query)
vec.shape


(768,)

In [14]:
# index.search expects a 2-D array of shape (n_queries, dim)

import numpy as np
svec = np.array(vec).reshape(1,-1)        # 1 row; -1 lets NumPy infer the number of columns
svec.shape


(1, 768)

In [16]:
# Returns the distances and row indices of the nearest stored vectors
distances, I = index.search(svec, k = 2)   # k -> how many nearest neighbours to return

In [23]:
print(distances, I)

[[1.3643883 1.3819258]] [[2 6]]


In [22]:
df.loc[I[0]]

Unnamed: 0,text,category
2,These are the latest fashion trends for this week,Fashion
6,Exciting vacation destinations for your next trip,Travel


Observation: We searched for a shirt for a trip and got the rows above as output. This was not a keyword search; it was a semantic search.
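
For contrast, here is a rough sketch of what a plain keyword search would do with the same query and the two sentences returned above. The stop-word list and the helper names are my own assumptions, chosen only for this illustration.

```python
# Small, hand-picked stop-word list (an assumption for this example).
STOPWORDS = {"i", "a", "to", "for", "my", "the", "this", "are", "these", "your", "want"}


def keywords(sentence):
    """Lowercase the sentence and drop common filler words."""
    return {word for word in sentence.lower().split() if word not in STOPWORDS}


def keyword_matches(query, sentences):
    """Return indices of sentences sharing at least one keyword with query."""
    wanted = keywords(query)
    return [i for i, s in enumerate(sentences) if wanted & keywords(s)]


sentences = [
    "These are the latest fashion trends for this week",
    "Exciting vacation destinations for your next trip",
]
matches = keyword_matches("I want to buy a shirt for my upcoming trip", sentences)
print(matches)
```

Only the travel sentence shares a keyword ("trip") with the query. The fashion sentence has no word in common with it, yet the embedding search ranked it first, which is what makes the search semantic.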

In [24]:
# Let's do it one more time. 

search_query_nw = "An apple a day keeps the doctor away"

In [26]:
vec_nw = encoder.encode(search_query_nw)
vec_nw.shape

(768,)

In [28]:
svec_nw = np.array(vec_nw).reshape(1,-1)

In [30]:
index.search(svec_nw, k = 2)

(array([[1.3433158, 1.7125269]], dtype=float32), array([[1, 0]], dtype=int64))