<a href="https://colab.research.google.com/github/anandaru/GEN-AI/blob/main/Langchain_Learn_from_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Empowering Language Models with LangChain: Leveraging PDF Data for Enhanced Question-Answering

We use LangChain to extend an existing language model (LM) with information it has not seen before. Strictly speaking, the LM is not retrained: text extracted from a supplied PDF file is embedded and indexed, and the passages most relevant to each question are passed to the LM as context. The aim is to check whether the LM answers a series of questions accurately once it has access to this information. Specifically, we supply a news excerpt from a current-day news site in PDF format. Without this context, the LM gives inaccurate answers to the same questions; with it, the answers should be more accurate.

In [10]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
!pip install -U langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.0.6-py3-none-any.whl (29 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.0.6


# Import Required Libraries

In [38]:
from PyPDF2 import PdfReader  # PDF text extraction
from langchain.text_splitter import CharacterTextSplitter  # splits text into chunks
from langchain.vectorstores import FAISS  # vector index for similarity search
from langchain.chains.question_answering import load_qa_chain
# The OpenAI wrappers are imported from the langchain-openai package
from langchain_openai import OpenAI, OpenAIEmbeddings

# Provide OpenAI key

In [12]:
import os
os.environ['OPENAI_API_KEY'] ='sk-Om'
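
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has already been set in the environment; `get_api_key` is a hypothetical helper, not part of LangChain or the OpenAI SDK:

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running the notebook")
    return key
```

In Colab, entering the key interactively with `getpass.getpass()` is another common option, since the value is then never saved in the notebook itself.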

# Load the gpt-3.5-turbo-instruct model

In [13]:
llm = OpenAI(model_name='gpt-3.5-turbo-instruct')

# Specific query from generic knowledge

In [16]:
q = 'when did ISRO launch GSLV F14/INSAT-3DS'
print(llm.invoke(q))



ISRO launched GSLV-F14/INSAT-3DS on April 20, 2007.


In [17]:
q = 'The 16th space mission for the Geosynchronous Satellite Launch Vehicle (GSLV) by ISRO, aims to deploy the INSAT-3DS satellite into the Geosynchronous Transfer Orbit (GTO). Who has funded the project'
print(llm.invoke(q))

?

The INSAT-3DS satellite project is funded by the Indian Space Research Organisation (ISRO). ISRO is a government agency responsible for India's space program and research. It was established in 1969 with the goal to develop space technology and its application to various national tasks. The agency is funded by the Indian government through the Department of Space. 


# Read PDF

In [24]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [30]:
data = PdfReader("/content/drive/MyDrive/ISRO .pdf")

In [31]:
combined_text = ''
for page in data.pages:
  text = page.extract_text()
  if text:  # skip pages with no extractable text
    combined_text += text

In [32]:
combined_text



# Divide the combined text into smaller chunks

In [34]:
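text_splitter = CharacterTextSplitter(
    separator='\n',       # split only at newlines
    chunk_size=200,       # target maximum characters per chunk
    chunk_overlap=20,     # characters shared between adjacent chunks
    length_function=len,
)
finaldata = text_splitter.split_text(combined_text)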

In [35]:
len(finaldata)

1
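
A result of 1 means the whole document ended up in a single chunk despite `chunk_size=200`. One possible explanation (an assumption, since the extracted text is not shown here) is that the text contains no `\n` characters: a separator-based splitter only cuts at the separator, so text without one cannot be divided. The sketch below is a simplified stand-in in plain Python, not LangChain's implementation; `split_on_separator` and `split_fixed_width` are hypothetical names.

```python
def split_on_separator(text, separator="\n", chunk_size=200):
    """Greedily pack separator-delimited pieces into chunks of <= chunk_size.

    A piece that is itself longer than chunk_size is kept whole, because
    this strategy never cuts inside a piece.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_fixed_width(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size characters, ignoring separators."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


no_newlines = "word " * 100                    # 500 characters, no "\n"
print(len(split_on_separator(no_newlines)))    # 1
print(len(split_fixed_width(no_newlines)))     # 3
```

With a single chunk, every similarity search returns the entire document, so retrieval does no real filtering. If the text does lack newlines, LangChain's `RecursiveCharacterTextSplitter`, which tries several separators in turn, may be worth trying.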

# Learn from PDF

This step initializes an OpenAI embedding model (embedding) and builds a search index (documentsearch) from the embeddings of the text chunks (finaldata), which allows fast similarity search over the text. It then loads a question-answering chain using LangChain.
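
To make the indexing step concrete, here is a toy sketch of similarity search in plain Python. It is only an illustration under strong simplifications: word counts stand in for the OpenAI embedding vectors, and a linear scan stands in for the FAISS index. `embed`, `cosine`, and `search` are hypothetical names, and the sample chunks are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, k=1):
    """Return the k chunks most similar to the query (linear scan, no index)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Invented sample chunks, not taken from the PDF
chunks = [
    "the rocket lifted off in the evening",
    "the ministry paid for the satellite",
    "the payload will improve weather forecasts",
]
print(search("who paid for the satellite", chunks))
# ['the ministry paid for the satellite']
```

The real pipeline follows the same shape: embed every chunk once, embed the query, and return the nearest chunks.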

In [39]:
embedding=OpenAIEmbeddings()

In [41]:
documentsearch=FAISS.from_texts(finaldata,embedding)

# Specific query from proprietary (PDF) knowledge

In [42]:
chain=load_qa_chain(OpenAI(),chain_type='stuff')
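
The 'stuff' chain type places all retrieved documents into a single prompt and makes one LLM call. A rough sketch of that idea in plain Python; the template wording and the `build_stuff_prompt` helper are illustrative assumptions, not the prompt LangChain actually uses:

```python
def build_stuff_prompt(question, documents):
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(doc.strip() for doc in documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Who funded the project?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything goes into one prompt, the combined length of the retrieved chunks has to fit within the model's context window.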

In [51]:
q = 'when did ISRO launch GSLV F14/INSAT-3DS'
docs=documentsearch.similarity_search(q)
chain.run(question=q,input_documents=docs)


' ISRO launched GSLV F14/INSAT-3DS on February 17 at 5:35 pm from the Satish Dhawan Space Centre in Sriharikota.'

In [52]:
q = 'The 16th space mission for the Geosynchronous Satellite Launch Vehicle (GSLV) by ISRO, aims to deploy the INSAT-3DS satellite into the Geosynchronous Transfer Orbit (GTO). Who has funded the project'
docs=documentsearch.similarity_search(q)
chain.run(question=q,input_documents=docs)

' The Ministry of Earth Sciences has funded the project.'

In [53]:
q = 'in what ways will the 16th space mission for the Geosynchronous Satellite Launch Vehicle (GSLV) by ISRO help India'
docs=documentsearch.similarity_search(q)
chain.run(question=q,input_documents=docs)
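
The three query cells above repeat the same retrieve-then-answer pattern. A small helper can wrap it; this sketch assumes only that the store exposes `similarity_search(query, k=...)` and the chain exposes `run(question=..., input_documents=...)`, as used above. The `ask` name and the stub classes are hypothetical and exist only so the example runs without an API key.

```python
def ask(question, store, chain, k=4):
    """Retrieve the k most similar chunks, then let the chain answer from them."""
    docs = store.similarity_search(question, k=k)
    return chain.run(question=question, input_documents=docs).strip()


class StubStore:
    """Stand-in for the vector store: always returns the same chunks."""

    def similarity_search(self, query, k=4):
        return ["chunk one", "chunk two"][:k]


class StubChain:
    """Stand-in for the QA chain: reports how many chunks it received."""

    def run(self, question, input_documents):
        return f" answered '{question}' from {len(input_documents)} chunks "


print(ask("example question", StubStore(), StubChain()))
# answered 'example question' from 2 chunks
```

In the notebook, the equivalent call would be `ask(q, documentsearch, chain)`.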

