# Install Libs

First, install all the libraries this notebook needs.

In [1]:
%pip install --quiet \
    datasets==2.11.0 \
    PyMuPDF \
    langchain \
    chromadb \
    sentence_transformers \
    pypdf \
    faiss-gpu \
    git+https://www.github.com/huggingface/transformers \
    git+https://github.com/huggingface/accelerate


# Import Libs

Next, import the libraries used to build the question-answering pipeline.

In [2]:
import fitz  # PyMuPDF
from langchain.chains import RetrievalQA, question_answering, ConversationalRetrievalChain
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.schema import retriever
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS
from langchain import HuggingFaceHub
import os

In [3]:
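# Read the Hugging Face API token from a local file named 'token'
with open('token') as f:
  # strip() drops the trailing newline that would otherwise corrupt the token
  os.environ["HUGGINGFACEHUB_API_TOKEN"] = f.read().strip()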

# Mount Google Drive

Mount your Google Drive folder so the notebook can read the PDF stored there.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Without prompt

First, let's read the PDF with PyMuPDF and collect the text of every page into a single string.

In [7]:
# Extract the text of every page in the PDF into a single string

pdf_text = ""
pdf_document = fitz.open('/content/drive/MyDrive/Colab Notebooks/LLM/data/Prospecto_Definitivo.pdf', filetype="pdf")
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    pdf_text += page.get_text("text")

pdf_document.close()

# 'pdf_text' now contains the extracted text from the PDF

Now let's split this string into smaller, slightly overlapping chunks, embed each chunk, and index the embeddings so the relevant passages can be retrieved for each question.

In [8]:
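# Split the text into chunks of at most 500 characters with a 25-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=25)

chunks = text_splitter.split_text(pdf_text)

# Embed every chunk with the default Hugging Face embedding model and index the vectors in FAISS
embeddings = HuggingFaceEmbeddings()
vectorStore = FAISS.from_texts(chunks, embeddings)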
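
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will differ. The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=25):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

The last three characters of each piece reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.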


With the document preprocessed, let's load the model from the Hugging Face Hub and build a retrieval QA chain on top of the vector store.

In [12]:
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b",
                     model_kwargs={"temperature":0.1,
                                   "top_k":10,
                                   "max_length":512,
                                   "num_return_sequences":1})

chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorStore.as_retriever())
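
Conceptually, the retriever embeds the question and returns the chunks whose vectors lie closest to it. The toy sketch below illustrates that idea with hand-written two-dimensional vectors and squared Euclidean distance. The vectors, chunk labels, and helper names are all hypothetical, and this is not FAISS's actual implementation, which uses optimized index structures.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k entries closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: squared_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for the indexed chunks
toy_index = [
    ("chunk about ticker", [1.0, 0.0]),
    ("chunk about fees", [0.0, 1.0]),
    ("chunk about risks", [0.7, 0.7]),
]

# Hypothetical embedding of a question about the ticker
nearest = top_k([0.9, 0.1], toy_index, k=2)
print(nearest)
```

With `chain_type="stuff"`, the retrieved chunks are then placed together into a single prompt for the model.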

The model tends to keep generating past the answer, producing a full Q&A transcript with extra questions we never asked. So let's post-process the output and keep only the text before the first follow-up `Question:` marker.

In [13]:
def get_answer(qachain, query):

  answer = qachain({"query": query})

  return answer['result'].strip().split('Question:')[0]
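
The truncation step can be checked without calling the model. The sketch below applies the same `strip().split('Question:')[0]` logic to a made-up raw string; the string only imitates the shape of a run-on generation and is not real model output. The helper name `truncate_at_marker` is hypothetical.

```python
def truncate_at_marker(generated, marker="Question:"):
    """Keep only the text that precedes the first occurrence of marker."""
    return generated.strip().split(marker)[0].strip()


# Made-up example of a generation that continues with an unrequested question
raw = " BTLG11\n\nQuestion: Qual o nome do fundo?\nHelpful Answer: ..."
print(truncate_at_marker(raw))

# When the marker is absent, the whole (stripped) text is returned unchanged
print(truncate_at_marker("  BTLG11  "))
```

One limitation of this approach is that an answer which legitimately contains the text `Question:` would be cut short as well.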

Now just try your questions to this model!

In [14]:
question="Qual o código do ativo na B3?"

answer = get_answer(chain, question)
print(answer)

BTLG11




In [None]:
question="Qual o valor total da oferta?"

answer = get_answer(chain, question)
print(answer)

599,999,994.00




In [None]:
question="Qual o maior risco da oferta?"

answer = get_answer(chain, question)
print(answer)

A maior parte do risco está relacionada com a liquidação da oferta.


## With prompt

Here we repeat the same preprocessing. The change is on the model side: this time we pass a custom prompt.

In [None]:
# Extract the text of every page in the PDF into a single string

pdf_text = ""
pdf_document = fitz.open('/content/drive/MyDrive/Colab Notebooks/LLM/data/Prospecto_Definitivo.pdf', filetype="pdf")
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    pdf_text += page.get_text("text")

pdf_document.close()

# Split the text into chunks of at most 500 characters with a 25-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=25)

chunks = text_splitter.split_text(pdf_text)

# Embed every chunk and index the vectors in FAISS
embeddings = HuggingFaceEmbeddings()
docsearch = FAISS.from_texts(chunks, embeddings)

# Open a persisted Chroma store in ./data
# (not used by the chain below, which retrieves from the FAISS index)
retriever = Chroma(persist_directory="./data",
                   embedding_function=embeddings)


As before, we load the model from the Hugging Face Hub. This time we add a few generation arguments to improve the answers produced with our prompt.

In [None]:
# Prepare Falcon Huggingface API
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b",
                     model_kwargs={
                         "max_length": 512,
                         "max_new_tokens": 300,
                         "min_new_tokens": 5,
                         "temperature": 0.1,
                         "repetition_penalty": 1.5,
                         "top_k": 1
                     })

Now let's create a prompt!

In [None]:
# prepare stuff prompt template
prompt_template = """
You are a talkative AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to
make up an answer.

Answer all user questions using at maximum 500 characters.

Context: {context}

Question: {question}

Answer:
""".strip()

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

chain_type_kwargs = {"prompt" : prompt}

chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
        return_source_documents=True,
        chain_type_kwargs=chain_type_kwargs
    )
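
To see what the model actually receives, the sketch below fills a shortened copy of the template by hand. It assumes the "stuff" strategy joins the retrieved chunks with a blank line before substituting them for `{context}`; the chunk texts and the helper name `stuff_prompt` are hypothetical placeholders, not text from the PDF.

```python
# Shortened copy of the template, keeping only the two placeholders
template = (
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def stuff_prompt(template, chunks, question):
    """Join all retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Placeholder chunks standing in for what the retriever would return
filled = stuff_prompt(template, ["chunk A", "chunk B"], "Qual o código do ativo na B3?")
print(filled)
```

Because the filled prompt ends with `Answer:`, the model's continuation starts with the answer, and any further `Question:` it generates can be cut off afterwards.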

As before, the model tends to use up its token budget and continue past the answer, so we post-process the output to keep only the answer to our question.

In [None]:
def get_answer(qachain, query):

  answer = qachain({"query": query})

  return answer['result'].strip().split('Question:')[0]

Now just try your questions to the model!

In [None]:
question =  "Qual o código do ativo na B3?"

answer = get_answer(chain, question)
print(answer)

BTLG11




In [None]:
question =  "Qual o valor da oferta em reais?"

answer = get_answer(chain, question)
print(answer)

2.6. Valor total da oferta e valor mínimo da Oferta
O valor total da Décima Segunda Emissão será de, inicialmente, até R$ 599.999.994,00 (cinco 
mil novecentos e setenta e nove mil novecentos e novecentos e quatro reais), considerando o Preço 
de Emissão acrescido do Custo Unitário de Distribuição, podendo tal montante ser reduzido em 
razão da Distribuição Parcial ou aumentado em razão da distribuição das Cotas.




In [None]:
question="Explique qual o maior risco da oferta"

answer = get_answer(chain, question)
print(answer)

O maior risco da oferta é que o montante mínimo da oferta não seja alcançado.




In [None]:
question="Qual o custo da comissão de estruturação total?"

answer = get_answer(chain, question)
print(answer)

0,80%




In [None]:
question="Qual a política de investimentos?"

answer = get_answer(chain, question)
print(answer)

O Fundo tem como objetivo a obtenção de retornos financeiros para os Cotistas, com base na aplicação de 
uma estratégia de investimento diversificada, que se destine a alcançar um retorno superior ao dos 
valores mobiliários, em médio prazo, sem comprometer a liquidez do Fundo.

O Fundo investirá em ativos de capitalização, que são aqueles que apresentam um valor maior do que o 
preço de venda, e que podem ser divididos em três grandes categorias:

1. Ações: são direitos sociais representativos de participação no lucro das empresas.

2. Bônus: são instrumentos negociados nas bolsas de valores, que permitem a compra ou venda de ativos 
financeiros, em determinado período de tempo.

3. Oportunidades de investimento: são instrumentos negociados nas bolsas de valores, que permitem a 
compra ou venda de ativos financeiros, em determinado período de tempo.

O Fundo também pode investir em títulos públicos, títulos de dívida pública emitidos por entidades 
públicas, títulos de dívida privada,