<a href="https://colab.research.google.com/github/dioschuarz/data_science/blob/main/llm/example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Libs

Here we install all the libraries this notebook needs.

In [None]:
%pip install --quiet \
    datasets==2.11.0 \
    PyMuPDF==1.22.5 \
    langchain \
    chromadb \
    sentence_transformers \
    pypdf \
    faiss-gpu \
    git+https://www.github.com/huggingface/transformers \
    git+https://github.com/huggingface/accelerate

# Import Libs

Now we import the libraries used to load, split, embed, and query the document.

In [None]:
import fitz  # PyMuPDF
from langchain.chains import RetrievalQA, question_answering, ConversationalRetrievalChain
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.schema import retriever
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS
from langchain import HuggingFaceHub
import os

In [None]:
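# Read the Hugging Face API token from a local file named 'token';
# strip() drops the trailing newline that most editors add
with open('token') as f:
  os.environ["HUGGINGFACEHUB_API_TOKEN"] = f.read().strip()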

# Mount Google Drive

Mount your Google Drive so the notebook can read the PDF stored there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Without prompt

First, let's read the PDF with PyMuPDF and concatenate the text of every page into a single string.

In [None]:
# Extract the text of every page from the PDF file

pdf_text = ""
pdf_document = fitz.open('/content/drive/MyDrive/Colab Notebooks/LLM/data/Prospecto_Definitivo.pdf', filetype="pdf")
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    pdf_text += page.get_text("text")

pdf_document.close()

# 'pdf_text' now contains the text extracted from the PDF
#print(pdf_text)

Now let's split this string into smaller chunks, so that each piece can be embedded and retrieved on its own and fits easily into the model's input.
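
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width splitter with overlap. It is not the `RecursiveCharacterTextSplitter` algorithm, which also prefers to break at separators such as paragraphs, newlines, and spaces; the helper `split_with_overlap` and the sample string are hypothetical and only show how neighbouring chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # toy input, not the PDF text
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.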

In [None]:
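# Split into chunks of at most 500 characters, with up to 25 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=25)

chunks = text_splitter.split_text(pdf_text)

# Embed every chunk and index the resulting vectors with FAISS
embeddings = HuggingFaceEmbeddings()
vectorStore = FAISS.from_texts(chunks, embeddings)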
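
Conceptually, the vector store keeps one embedding per chunk, and at query time the retriever embeds the question and returns the closest chunks. The sketch below illustrates that idea with a brute-force nearest-neighbour search, assuming a Euclidean (L2) distance; the three-dimensional vectors, the labels, and the helper names are made up for illustration, whereas real sentence embeddings have hundreds of dimensions and FAISS uses optimized index structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query_vec, indexed, k):
    """Return the texts of the k entries whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded PDF chunks
toy_index = [
    ("ticker symbol section", [0.9, 0.1, 0.0]),
    ("risk factors section", [0.1, 0.9, 0.1]),
    ("fees section", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about the ticker

nearest = top_k_nearest(toy_query, toy_index, k=2)
print(nearest)  # ['ticker symbol section', 'risk factors section']
```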

With the file preprocessed, let's load the model from the Hugging Face Hub and build a retrieval QA chain on top of the FAISS index.

In [None]:
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b",
                     model_kwargs={"temperature":0.1,
                                   "top_k":10,
                                   "max_length":512,
                                   "num_return_sequences":1})

chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorStore.as_retriever())

We know the model tends to generate a complete Q&A transcript, which can include more tokens and more questions than the one we actually asked, so let's clean up the output and keep only the first answer.
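
As an illustration of that clean-up, the sketch below trims a made-up raw completion at the first `Question:` marker. The raw string, the asset code in it, and the helper name `trim_at_marker` are hypothetical; the actual text returned by the model may use a different marker, in which case the split string would need adjusting.

```python
def trim_at_marker(raw_text: str, marker: str = "Question:") -> str:
    """Keep only the text that precedes the first occurrence of marker."""
    return raw_text.strip().split(marker)[0].strip()


# Invented example of a completion that runs on past the answer
raw = " ABCD11\n\nQuestion: What is the total value?\nHelpful Answer: unknown"

print(repr(trim_at_marker(raw)))  # 'ABCD11'
```

If the marker never appears, `split` returns the whole string as its only element, so the answer passes through unchanged.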

In [None]:
def get_answer(qachain, query):

  answer = qachain({"query": query})

  return answer['result'].strip().split('Question:')[0]

Now try asking the model your own questions!

In [None]:
question="Qual o código do ativo na B3?"  # "What is the asset's code on B3?"

answer = get_answer(chain, question)
print(answer)

In [None]:
question="Qual o valor total da oferta?"  # "What is the total value of the offering?"

answer = get_answer(chain, question)
print(answer)

In [None]:
question="Qual o maior risco da oferta?"  # "What is the biggest risk of the offering?"

answer = get_answer(chain, question)
print(answer)

A maior parte do risco está relacionada com a liquidação da oferta. (English: Most of the risk is related to the settlement of the offering.)


## With prompt

Here we repeat the same preprocessing; the change happens in the model step, where we pass a custom prompt.

In [None]:
# Extract the text of every page from the PDF file

pdf_text = ""
pdf_document = fitz.open('/content/drive/MyDrive/Colab Notebooks/LLM/data/Prospecto_Definitivo.pdf', filetype="pdf")
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    pdf_text += page.get_text("text")

pdf_document.close()

text_splitter = RecursiveCharacterTextSplitter(
                                      chunk_size=500,
                                      chunk_overlap=25)

chunks = text_splitter.split_text(pdf_text)

embeddings = HuggingFaceEmbeddings()
docsearch = FAISS.from_texts(chunks, embeddings)

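# Open (or create) a persistent Chroma vector store in ./data.
# Note: this object is not used below; the chain retrieves from the FAISS index `docsearch`.
retriever = Chroma(persist_directory="./data",
                   embedding_function=embeddings)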

As before, we load the model from the Hugging Face Hub; this time we add a few generation arguments to improve the answers we get with our prompt.
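
To give a feel for two of those arguments, here is a minimal sketch of top-k sampling with temperature in its usual textbook form. The logits and the helper `sample_top_k` are made up, and whether the hosted model applies the parameters exactly this way is an assumption; the point is only that with `top_k=1` a single candidate survives, so the choice is deterministic whatever the temperature.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index from the top_k highest logits after temperature scaling."""
    # Indices of the top_k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution over the surviving candidates
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [1.0, 3.5, 0.2, 3.0]  # invented scores for a 4-token vocabulary
rng = random.Random(0)

greedy = [sample_top_k(toy_logits, top_k=1, temperature=0.1, rng=rng) for _ in range(5)]
print(greedy)  # [1, 1, 1, 1, 1]
```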

In [None]:
# Prepare the Falcon model through the Hugging Face Hub API
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b",
                     model_kwargs={
                         "max_length": 512,
                         "max_new_tokens": 300,
                         "min_new_tokens": 5,
                         "temperature": 0.1,
                         "repetition_penalty": 1.5,
                         "top_k": 1
                     })

Now let's create a prompt!

In [None]:
# prepare stuff prompt template
prompt_template = """
You are a talkative AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to
make up an answer.

Answer all user questions using at maximum 500 characters.

Context: {context}

Question: {question}

Answer:
""".strip()

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

chain_type_kwargs = {"prompt" : prompt}

chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
        return_source_documents=True,
        chain_type_kwargs=chain_type_kwargs
    )
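
As a rough picture of what a "stuff" chain does with such a template, the sketch below joins the retrieved chunks into one context string and fills in the two placeholders. The helper `build_stuff_prompt`, the shortened template, the sample chunks, and the blank-line separator are assumptions for illustration, not LangChain's internal code.

```python
TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nAnswer:"


def build_stuff_prompt(chunks, question, template=TEMPLATE):
    """Concatenate all retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # every chunk is "stuffed" into a single prompt
    return template.format(context=context, question=question)


# Hypothetical chunks standing in for what the retriever would return
retrieved = ["Chunk one text.", "Chunk two text."]
filled = build_stuff_prompt(retrieved, "What is the ticker?")
print(filled)
```

Because every retrieved chunk lands in the same prompt, the chunk size and the number of retrieved chunks together determine how much of the model's input budget the context consumes.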

Again, we know this model will keep generating until it has used all the tokens we allowed, so we post-process the output to keep only the answer to our question.

In [None]:
def get_answer(qachain, query):

  answer = qachain({"query": query})

  return answer['result'].strip().split('Question:')[0]

Now try asking the model your own questions!

In [None]:
question = "Qual o código do ativo na B3?"  # "What is the asset's code on B3?"

answer = get_answer(chain, question)
print(answer)

In [None]:
question = "Qual o valor da oferta em reais?"  # "What is the value of the offering in reais?"

answer = get_answer(chain, question)
print(answer)

In [None]:
question="Explique qual o maior risco da oferta"  # "Explain what the biggest risk of the offering is"

answer = get_answer(chain, question)
print(answer)

In [None]:
question="Qual o custo da comissão de estruturação total?"  # "What is the total cost of the structuring fee?"

answer = get_answer(chain, question)
print(answer)

In [None]:
question="Qual a política de investimentos?"  # "What is the investment policy?"

answer = get_answer(chain, question)
print(answer)