Blessy Chinthapalli

This project builds a system for querying the content of PDF documents using Langchain and OpenAI embeddings.

### 1. Libraries and Dependencies:

*   Langchain: A framework for building applications on top of language models. Here it provides the text splitter, the OpenAI embeddings wrapper, and the FAISS vector store wrapper.
*   OpenAI: Used to obtain text embeddings, numerical representations that capture the semantic meaning of text and enable tasks like similarity search.
*   PyPDF2: A Python library for reading and extracting text from PDF files.
*   FAISS: A library for efficient similarity search, particularly useful when working with large collections of embeddings.
*   Tenacity: A library that provides retry mechanisms, useful for making API calls more robust by automatically retrying failed requests.

## PDF Query Using Langchain

In [14]:
!pip install langchain # text splitting and vector store wrappers
!pip install openai # we use OpenAI embeddings, which measure the relatedness of text
!pip install PyPDF2 # to read text from PDF files
!pip install faiss-cpu # vector similarity search
!pip install tiktoken # tokenizer used by the OpenAI embeddings wrapper



In [2]:
pip install -U langchain-community


In [3]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter # to split the content

from langchain.vectorstores import FAISS # vector database to store

In [4]:
import os
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"  # placeholder: replace with your own key
# os.environ["SERPAPI_API_KEY"] = ""  # only needed for Google search; this notebook does not do a Google search or implement a chatbot

In [6]:
# Provide the path to the PDF file.
pdfreader = PdfReader('BlessyCH_EY_Resume.pdf')

### Reading PDF Content:
The PdfReader from PyPDF2 is used to read a PDF file. The code iterates through the pages of the PDF, extracts the text of each page, and concatenates it into a single raw text string.

In [7]:
# Read the text of every page into one string
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [8]:
raw_text

"BlessyChinthapalliSanFrancisco,CA+1(415)225-1956blessychinthapalli20@gmail.comLinkedIn|Github|Tableau\nHiringManagerErnst&YoungLLP\nDearHiringManager,\nIamwritingtoexpressmyenthusiasmfortheDataEngineer,DataAnalystpositionsatErnst&YoungLLP.Witharobustbackgroundindataengineering,machinelearning,andbusinessintelligence,coupledwithhands-onexperienceacrossdiverseprojects,IameagertocontributetoEY’svisionofbuildingabetterworkingworld.\nMyjourneyindatascienceandengineeringbeganduringmyBachelorofEngineeringinElectronics&CommunicationfromShivNadarUniversity,whereIdevelopedakeeninterestindata-drivensolutions.MyacademicpursuitsculminatedinaMasterofScienceinBusinessAnalyticsfromtheUniversityofCalifornia,Davis,whereIhonedmyskillsindatamodeling,visualization,andmachinelearning.\nAtExPrep,asaDataEngineerBusinessConsultant,Iwaspivotalinrevolutionizingtheirdatavisualizationcapabilities.Theypossessedbillionsofstudentassessmentdatabutlackedaneffectivewaytoharnessthiswealthofinformation.Collaboratingclose

### Text Splitting:
Since the extracted text can be lengthy, it is split into smaller chunks using CharacterTextSplitter.

OpenAI's models have token limits, so the text needs to be split into manageable chunks.

The chunk_size defines the maximum length of each chunk, while chunk_overlap lets consecutive chunks share some text so that context is not lost at chunk boundaries. Because length_function is len, both values are measured in characters rather than tokens.

In [9]:
# Split the text on newlines with CharacterTextSplitter so that no chunk exceeds the token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)



In [10]:
len(texts)

18
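
To make the roles of chunk_size and chunk_overlap concrete, here is a simplified pure-Python sketch of overlap-based splitting. It is not CharacterTextSplitter's actual implementation; the function name `split_with_overlap` and the toy input are hypothetical, and the real splitter may handle edge cases differently.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks.

    Each chunk is at most chunk_size characters long (unless a single piece
    is already longer), and each new chunk starts with trailing pieces of the
    previous chunk that total at most chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail small enough to serve as overlap
            # and to leave room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Hypothetical toy input: four short lines, tiny limits to make the overlap visible
demo_text = "\n".join(["aaaa", "bbbb", "cccc", "dddd"])
for chunk in split_with_overlap(demo_text, chunk_size=10, chunk_overlap=5):
    print(repr(chunk))
```

With these toy limits, each chunk repeats the last line of the previous one, which is the same effect the 200-character overlap has on the 800-character chunks above.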

### Generating Embeddings:
OpenAI Embeddings: The code creates an OpenAI embeddings client, which sends the text chunks to the OpenAI API and receives a numerical vector for each one. These vectors make it possible to later query or search the text by semantic meaning.

In [11]:
# Create the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()



In [12]:
!pip install tenacity
import time
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff




**Embeddings:** Numerical representations of text that capture its semantic meaning. These are crucial for tasks like text similarity, clustering, and search.
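
As a rough illustration of how embeddings support text similarity, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are hypothetical toy values chosen for the example; real OpenAI embeddings have far more dimensions, and the variable names are not taken from this notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not real model output)
skills_vec = [0.9, 0.1, 0.0]
experience_vec = [0.8, 0.3, 0.1]
lease_vec = [0.0, 0.2, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(skills_vec, experience_vec))
print(cosine_similarity(skills_vec, lease_vec))
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.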

In [13]:
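# Recreate the embeddings client so that text containing '<|endofprompt|>' does not raise a tokenizer error
embeddings = OpenAIEmbeddings(allowed_special=['<|endofprompt|>'])

# Retry up to 6 times, waiting a random, exponentially growing interval (capped at 60 s) between attempts
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def embed_with_retry(texts, embeddings):
    time.sleep(1)  # fixed 1-second pause before each attempt to ease rate limits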
    document_search = FAISS.from_texts(texts, embeddings)
    return document_search

document_search = embed_with_retry(texts, embeddings)
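
For readers curious about what the retry decorator does, here is a pure-Python sketch of retrying with random exponential backoff. It only approximates the behavior and is not Tenacity's implementation; `retry_with_backoff` and its parameters are hypothetical names. One known difference: this sketch re-raises the original exception after the last attempt, whereas Tenacity by default wraps it in its own error type.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    After the n-th failure, wait a random time between min_wait and
    min(max_wait, 2 ** (n - 1)) seconds before trying again.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            upper = max(min_wait, min(max_wait, 2 ** (attempt - 1)))
            sleep(random.uniform(min_wait, upper))


# Demo with a hypothetical flaky function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


recorded_waits = []  # record waits instead of actually sleeping
print(retry_with_backoff(flaky_call, sleep=recorded_waits.append))
print(recorded_waits)
```

Passing `sleep` as a parameter keeps the demo fast; in real use the default `time.sleep` would apply.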

### Vector Database (FAISS):

A vector database stores vectors (embeddings) and allows for efficient similarity search over them. In this case, FAISS is used to store and query the embeddings of the text chunks.
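
Conceptually, a similarity search finds the stored vectors closest to the query vector. The sketch below does this by brute force in pure Python, assuming Euclidean (L2) distance and k=4 results; these are believed to match the LangChain FAISS defaults but should be verified against the installed version. The function `top_k_nearest`, the chunk labels, and the vectors are hypothetical.

```python
import math


def top_k_nearest(query_vector, stored_vectors, k=4):
    """Return (index, distance) pairs for the k stored vectors
    closest to query_vector by Euclidean distance, nearest first."""
    distances = [
        (index, math.dist(query_vector, vector))
        for index, vector in enumerate(stored_vectors)
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Hypothetical chunks and toy 3-dimensional vectors (not real embeddings)
chunks = ["skills section", "education section", "lease terms"]
vectors = [[0.9, 0.1, 0.0], [0.6, 0.4, 0.1], [0.0, 0.2, 0.9]]
query_vector = [0.8, 0.2, 0.0]

for index, distance in top_k_nearest(query_vector, vectors, k=2):
    print(chunks[index], round(distance, 3))
```

FAISS performs the same kind of lookup with optimized index structures, which matters once there are many thousands of vectors.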

In [14]:
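# Note: this rebuilds the same index without the retry wrapper; the result of embed_with_retry can be used directly
document_search = FAISS.from_texts(texts, embeddings)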

In [16]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x79b6887439a0>

In [21]:
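from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(), chain_type="stuff")  # ASSUMPTION: `chain` is never defined in the cells shown; a "stuff" question-answering chain is one setup consistent with the calls below
query = "blessy previous employer"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)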



' VeraLifecare'

In [22]:
query = "what are blessy's skills"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

" Blessy's skills include data engineering, business intelligence, data modeling, machine learning, SQL, Python, R, Scala, Matlab, cloud platforms such as Azure and GCP, BI tools like Tableau, PowerBI, and Looker, and certifications in Databricks and Microsoft Azure."

# Legal Document Summarization with Langchain
## Overview:
This script extracts, processes, and summarizes content from legal documents (e.g., lease contracts) using Langchain and OpenAI embeddings. It retrieves the chunks most relevant to each question, so key information can be found without reading the whole document.

In [27]:
# For reading large PDFs

pdfreader = PdfReader('Lease_Agreement_178_Bluxome_.pdf')

# Read the text of every page into one string
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content


# Split the text on newlines into chunks of about 800 characters with a 200-character overlap
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)


# Create the OpenAI embeddings client once
embeddings = OpenAIEmbeddings(allowed_special=['<|endofprompt|>'])

# Retry up to 6 times with random exponential backoff between attempts
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def embed_with_retry(texts, embeddings):
    time.sleep(1)  # fixed 1-second pause before each attempt to ease rate limits
    document_search = FAISS.from_texts(texts, embeddings)
    return document_search

# Build the FAISS index once, with retries (a second FAISS.from_texts call would only repeat the work)
document_search = embed_with_retry(texts, embeddings)


query = "Summarise the lease agreement in 3 lines"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)



" The lease agreement requires the purchaser to submit an affidavit to the condominium owners' association before the sale can be finalized. It is also a condition of closing that the potential purchaser must execute the affidavit. The governing documents state that the project is subject to the Notice of Special Restrictions and any changes to windows may be required for future construction."

In [28]:
query = "when does the bluxome lease end"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The Bluxome lease ends on 1/16/2025.'

In [31]:
query = "who are there on the lease give names"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Shriya Yegalapati, Blessy Chinthapalli, Aditya Satpute, and Daniel Lau are listed as tenants on the lease.'