# PDF Reader Chatbot   

This project is based on the Codebasics video: https://www.youtube.com/watch?v=MoqgmWV1fm8

## Importing Libraries

In [2]:
from openai_key import secret_key  # local module that holds the OpenAI API key
import langchain
from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
import pickle  # used to store the vector index on disk

## Loading Document

In [4]:
# PyPDFLoader reads the PDF and returns one Document per page
loader = PyPDFLoader('Flavio_Brienza_Abstract_Tesi.pdf')

data = loader.load()

data

[Document(page_content='DISSERTATION  ABSTRACT  \n(ENGLISH AND ITALIAN VERSIONS)   \nThe Digital Economy and Society Index Progress of European Companies: The \nApplication of Machine Learning in Financial Services  \n \nEnglish Version  \nThe following  thesis work has t wo main goals: analyzing the Digital Economy and Society \nIndex (DESI) progress  of European companies and providing a  practical application of \nmachine learning in the financial servi ces sector.  \nAbout the first one, to the traditional DESI ’s indicators , more general macroeconomic and \nsocial traits have been added  in order to have a better framework of the current situation \nand to understand  on the most problematic aspects  of the digitalization process.  \nThe second part is focused on the building of a machine learning model to predict the \nfinancial risk bearable  from banks’ clients to offer them the proper stocks investment. Before \ndoing this , the shares of  40 (+1, the gold) different companie

## Splitting Text

In [8]:
# Chunks of at most 1000 characters; consecutive chunks may share up to 200 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

docs = text_splitter.split_documents(data)
docs

[Document(page_content='DISSERTATION  ABSTRACT  \n(ENGLISH AND ITALIAN VERSIONS)   \nThe Digital Economy and Society Index Progress of European Companies: The \nApplication of Machine Learning in Financial Services  \n \nEnglish Version  \nThe following  thesis work has t wo main goals: analyzing the Digital Economy and Society \nIndex (DESI) progress  of European companies and providing a  practical application of \nmachine learning in the financial servi ces sector.  \nAbout the first one, to the traditional DESI ’s indicators , more general macroeconomic and \nsocial traits have been added  in order to have a better framework of the current situation \nand to understand  on the most problematic aspects  of the digitalization process.  \nThe second part is focused on the building of a machine learning model to predict the \nfinancial risk bearable  from banks’ clients to offer them the proper stocks investment. Before \ndoing this , the shares of  40 (+1, the gold) different companie
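
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut at separators such as paragraph and line breaks, so its chunks vary in length. The helper `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size character windows that overlap their neighbours.

    Hypothetical helper: a simplification of overlap-based chunking,
    not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.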

## Embedding  

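Each chunk is turned into an embedding vector with OpenAI's embedding model (through `OpenAIEmbeddings`), and the vectors are indexed in a FAISS vector store.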

In [9]:
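# Embedding client; calls the OpenAI API with the key from openai_key.py
embeddings = OpenAIEmbeddings(openai_api_key=secret_key)

# Embed every chunk and build an in-memory FAISS index over the vectors
vector_index = FAISS.from_documents(docs, embeddings)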
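
Conceptually, the index answers one question: which stored chunks have embeddings closest to the embedding of the query? The sketch below is a brute-force illustration in plain Python, not FAISS itself. The three-dimensional vectors are invented (real embeddings have far more dimensions), the function names are hypothetical, and Euclidean distance is used as an assumption; the metric FAISS applies depends on how the index is configured.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec.

    Hypothetical helper: scans every vector, which a real index avoids.
    """
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Invented toy "embeddings" standing in for the output of the embedding model
toy_index = [
    ("chunk about DESI indicators", [0.9, 0.1, 0.0]),
    ("chunk about stock clustering", [0.1, 0.8, 0.3]),
    ("chunk about Python tooling", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.9, 0.2]

results = nearest_chunks(toy_query, toy_index, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

The retriever used later in the notebook performs the same kind of lookup and passes the closest chunks to the language model as context.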

Storing the vector index on disk, so the embeddings do not have to be recomputed:

In [10]:
file_path="vectors.pkl"
with open(file_path, "wb") as f:
    pickle.dump(vector_index, f)

Loading the index back from disk:

In [11]:
import os

In [12]:
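if os.path.exists(file_path):
    with open(file_path, "rb") as f:
        vectorIndex = pickle.load(f)
else:
    # Fail here rather than with a NameError when vectorIndex is used later
    raise FileNotFoundError(f"{file_path} not found; run the storing cell first")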
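
The store and reload steps can be folded into one "load or build" helper, so the expensive embedding step runs only when no cached file exists. This is a minimal sketch under stated assumptions: `load_or_build` and `build_toy_index` are hypothetical names, and a small dictionary stands in for the FAISS index so the example runs without an API key.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the object pickled at path, building and saving it first if needed."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Stand-in for the embedding step; records how many times it actually runs
calls = []


def build_toy_index():
    calls.append(1)
    return {"chunk-0": [0.1, 0.2], "chunk-1": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_vectors.pkl")
    first = load_or_build(cache_path, build_toy_index)   # builds and saves
    second = load_or_build(cache_path, build_toy_index)  # loads from disk

print(len(calls), first == second)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code. LangChain's FAISS wrapper also provides `save_local` and `load_local`, which may be a better fit if pickling the index fails in your installed version.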

## Creating the Chain 

In [20]:
llm = OpenAI(temperature=1, max_tokens=500, openai_api_key=secret_key)

chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vectorIndex.as_retriever())  

chain



Asking a question

In [21]:
question = 'How many companies have been considered to create portfolios?' 

langchain.debug = True

chain({'question':question}, return_only_outputs=True)

[chain/start] [1:chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "How many companies have been considered to create portfolios?"
}
[chain/start] [1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
{
  "input_list": [
    {
      "context": "financial risk bearable  from banks’ clients to offer them the proper stocks investment. Before \ndoing this , the shares of  40 (+1, the gold) different companies have been analyzed and \nclustered using both statistic met hods and Natural Language Proc essing of the latest news \nabout them.  \nIn both phases  Python programming language has been used.  \n \nItalian Version  \nIl seguente lav oro di tesi h a due obiettivi principa li: analizzare il

{'answer': ' Forty (+1, the gold) companies have been considered to create portfolios.\n',
 'sources': 'Flavio_Brienza_Abstract_Tesi.pdf'}

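The answer is correct: the abstract states that the shares of 40 (+1, gold) companies were analyzed, and the chain cites the PDF as its source.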
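
The chain returns a dictionary with an `answer` string and a `sources` string, as seen in the output above. A small helper can tidy it for display. This is a sketch: `format_result` is a hypothetical name, and splitting `sources` on commas is an assumption about how multiple sources would be listed.

```python
def format_result(result: dict) -> str:
    """Render the chain output as a two-line, human-readable string."""
    answer = result.get("answer", "").strip()
    # Assumption: multiple sources, if any, arrive as a comma-separated string
    sources = [s.strip() for s in result.get("sources", "").split(",") if s.strip()]
    lines = [f"Answer: {answer}"]
    lines.append("Sources: " + (", ".join(sources) if sources else "(none reported)"))
    return "\n".join(lines)


# Output copied from the cell above
sample_result = {
    'answer': ' Forty (+1, the gold) companies have been considered to create portfolios.\n',
    'sources': 'Flavio_Brienza_Abstract_Tesi.pdf',
}

formatted = format_result(sample_result)
print(formatted)
```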