## Installation & Imports

You will also need to install a few other requirements, such as:

- Ollama
- Poppler
- Pytesseract
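
Ollama and Poppler are command-line tools rather than plain pip packages, and Pytesseract wraps the Tesseract OCR binary, so checking the PATH up front can save a confusing failure later. A minimal sketch, assuming the executables are called `ollama`, `pdftoppm` (part of Poppler) and `tesseract`; the names may differ on your platform, and `missing_tools` is a hypothetical helper:

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust them for your installation.
required = ["ollama", "pdftoppm", "tesseract"]
missing = missing_tools(required)
print("Missing tools:", missing if missing else "none")
```

An empty result only shows that the binaries are discoverable, not that they are configured correctly.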

In [1]:
!pip install -r ../requirements.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ctgan 0.7.3 requires packaging<22,>=20, but you have packaging 23.2 which is incompatible.


In [2]:
!ollama list

NAME          	ID          	SIZE  	MODIFIED     
llama2:latest 	78e26419b446	3.8 GB	3 months ago	
mistral:latest	61e88e884507	4.1 GB	3 months ago	


In [1]:
%load_ext lab_black

In [2]:
from dotenv import load_dotenv, find_dotenv
import os
import sys
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import SingleStoreDB
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain.embeddings import OpenAIEmbeddings

sys.path.append("..")
load_dotenv("../config/.env", override=True)
model = "mistral:latest"

## Getting PDF Data

In [5]:
unstructured_loader = UnstructuredPDFLoader("../data/generative-ai-fundamentals-v1.pdf")

unstructured_pdf_data = unstructured_loader.load()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\caio_barros\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\caio_barros\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [6]:
unstructured_pdf_data

[Document(page_content='© databricks Academy\n\nGenerative Al Fundamentals\n\nDatabricks Academy 2023\n\nAE ad PO TOL\n\nSe Everyone Asks\n\na Z Yj @ at 3a tei UD a i Mam is Generative 4 {+ How exactly | 4 om = Y ‘waa How can l use Prag 9 Alathreat or canluse : ~ y, | ‘ ; my data an Generative Al | = wo yt.» securely with |\n\ns \\ opportunity to gaina 7 “\\ Generative\n\nBy for my U), competitive Hh Al?\n\nbusiness? es advantage? :\n\ny) e —~ x\n\nsession goals\n\nUpon completion of this content, you should be able to:\n\nDescribe how generative artificial intelligence (Al) is being used to revolutionize practical Al applications\n\nDescribe how Generative Al models works and discuss their potential business uses cases\n\nDescribe how a data organization can find initial success with generative Al applications\n\nRecognize the potential legal and ethical considerations of utilizing generative Al for applications and within the workplace.\n\n©2023 Databricks Inc. — All rights reserved 

In [9]:
unstructured_pdf_data[0].page_content

'© databricks Academy\n\nGenerative Al Fundamentals\n\nDatabricks Academy 2023\n\nAE ad PO TOL\n\nSe Everyone Asks\n\na Z Yj @ at 3a tei UD a i Mam is Generative 4 {+ How exactly | 4 om = Y ‘waa How can l use Prag 9 Alathreat or canluse : ~ y, | ‘ ; my data an Generative Al | = wo yt.» securely with |\n\ns \\ opportunity to gaina 7 “\\ Generative\n\nBy for my U), competitive Hh Al?\n\nbusiness? es advantage? :\n\ny) e —~ x\n\nsession goals\n\nUpon completion of this content, you should be able to:\n\nDescribe how generative artificial intelligence (Al) is being used to revolutionize practical Al applications\n\nDescribe how Generative Al models works and discuss their potential business uses cases\n\nDescribe how a data organization can find initial success with generative Al applications\n\nRecognize the potential legal and ethical considerations of utilizing generative Al for applications and within the workplace.\n\n©2023 Databricks Inc. — All rights reserved g\n\nAGENDA\n\nO01. Int

In [3]:
online_loader = OnlinePDFLoader("https://arxiv.org/pdf/2404.11018.pdf")

online_pdf_data = online_loader.load()
online_pdf_data

[Document(page_content='arXiv:2404.11018v2 [cs.LG] 22 May 2024\n\nGoogle DeepMind 2024-5-24\n\nMany-Shot In-Context Learning\n\nRishabh Agarwal’, Avi Singh”, Lei M. Zhang", Bernd Bohnet\', Luis Rosias\', Stephanie Chan‘, Biao Zhang", Ankesh Anand , Zaheer Abbas , Azade Nova , John D. Co-Reyes , Eric Chu , Feryal Behbahani , Aleksandra Faust and Hugo Larochelle\n\n“Contributed equally, ‘Key contribution\n\nLarge language models (LLMs) excel at few-shot in-context learning (ICL) — learning from a few input- output examples (“shots”) provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples - the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore

In [4]:
len(online_pdf_data[0].page_content)

95180

In [5]:
online_pdf_data[0].page_content

'arXiv:2404.11018v2 [cs.LG] 22 May 2024\n\nGoogle DeepMind 2024-5-24\n\nMany-Shot In-Context Learning\n\nRishabh Agarwal’, Avi Singh”, Lei M. Zhang", Bernd Bohnet\', Luis Rosias\', Stephanie Chan‘, Biao Zhang", Ankesh Anand , Zaheer Abbas , Azade Nova , John D. Co-Reyes , Eric Chu , Feryal Behbahani , Aleksandra Faust and Hugo Larochelle\n\n“Contributed equally, ‘Key contribution\n\nLarge language models (LLMs) excel at few-shot in-context learning (ICL) — learning from a few input- output examples (“shots”) provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples - the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore two settings: (1) “Rei

## SingleStoreDB

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(online_pdf_data)
texts

[Document(page_content='arXiv:2404.11018v2 [cs.LG] 22 May 2024\n\nGoogle DeepMind 2024-5-24\n\nMany-Shot In-Context Learning\n\nRishabh Agarwal’, Avi Singh”, Lei M. Zhang", Bernd Bohnet\', Luis Rosias\', Stephanie Chan‘, Biao Zhang", Ankesh Anand , Zaheer Abbas , Azade Nova , John D. Co-Reyes , Eric Chu , Feryal Behbahani , Aleksandra Faust and Hugo Larochelle\n\n“Contributed equally, ‘Key contribution\n\nLarge language models (LLMs) excel at few-shot in-context learning (ICL) — learning from a few input- output examples (“shots”) provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples - the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore
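
The splitter call above caps each chunk at 2000 characters with no overlap between neighbours. To show what those two parameters control, here is a deliberately simplified fixed-width chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks), and `chunk_text` is a hypothetical helper:

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

For the 95,180-character document loaded earlier, a fixed-width split at 2000 characters would give 48 chunks; a separator-aware splitter will typically produce a different count.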

In [7]:
embedder = OpenAIEmbeddings()

  warn_deprecated(


ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)
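
The `ValidationError` above means `OPENAI_API_KEY` was not set when `OpenAIEmbeddings()` was constructed, for example because `../config/.env` does not define it. A small guard can fail earlier with a clearer message. This is only a sketch; `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value


# Example with a throwaway variable (hypothetical name, for illustration only).
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # sk-demo
```

In this notebook the equivalent call would be `require_env("OPENAI_API_KEY")`, placed before the embedder is created.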

In [None]:
# Use single quotes inside the f-string: reusing double quotes is a SyntaxError before Python 3.12.
os.environ["SINGLESTOREDB_URL"] = (
    f"admin:{os.environ['SINGLESTORE_PASSWORD']}"
    f"@{os.environ['SINGLESTORE_USER']}:3306/db_CaioBarros_7afc1"
)

docsearch = SingleStoreDB.from_documents(
    texts,
    embedder,
    table_name="pdf_docs",
)
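
The connection string set in this cell appears to follow a `user:password@host:port/database` layout, with `SINGLESTORE_USER` sitting in the host position. Assembling it from named parts makes that layout easier to read and check. A minimal sketch, assuming that layout; `singlestore_url` is a hypothetical helper and the values are placeholders:

```python
def singlestore_url(user, password, host, database, port=3306):
    """Assemble a `user:password@host:port/database` connection string."""
    parts = (("user", user), ("password", password),
             ("host", host), ("database", database))
    for label, value in parts:
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"{user}:{password}@{host}:{port}/{database}"


# Placeholder values only; real credentials belong in the .env file.
print(singlestore_url("admin", "secret", "localhost", "demo_db"))
# admin:secret@localhost:3306/demo_db
```

Passwords containing characters such as `@` or `/` may need escaping, which this sketch does not attempt.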

In [None]:
query_text = "What is the table of contents of this pdf?"

docs = docsearch.similarity_search(query_text)

print(docs[0].page_content)
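
Conceptually, a similarity search embeds the query, scores it against every stored chunk vector, and returns the best matches. The toy example below illustrates that ranking step with a dot-product score on hand-written two-dimensional vectors. It is a sketch of the idea, not the SingleStoreDB implementation; `top_k` and the vectors are made up for illustration:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, indexed_chunks, k=4):
    """Return the texts of the `k` chunks whose vectors score highest against the query."""
    ranked = sorted(indexed_chunks, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Each entry pairs a chunk label with a made-up embedding.
chunks = [("intro", [1.0, 0.0]), ("methods", [0.0, 1.0]), ("abstract", [0.8, 0.6])]
print(top_k([1.0, 0.0], chunks, k=2))  # ['intro', 'abstract']
```

Real embeddings have hundreds or thousands of dimensions, and the database performs the scoring, but the ranking logic is the same in spirit.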

## ChatCompletion

In [None]:
prompt = f"The user asked: {query_text}. The most similar text from the document is: {docs[0].page_content}"
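
The prompt above uses only the single best match. One possible extension is to combine several retrieved chunks and cap the total context length so the prompt stays within the model's window. A minimal sketch; `build_prompt`, its wording, and the `max_chars` limit are assumptions, not part of the original notebook:

```python
def build_prompt(question, chunks, max_chars=4000):
    """Combine a question with retrieved text, capped at `max_chars` of context."""
    context = "\n\n".join(chunks)[:max_chars]
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


example = build_prompt(
    "What is this PDF about?",
    ["Many-Shot In-Context Learning", "Google DeepMind"],
)
print(example)
```

In this notebook the call would look like `build_prompt(query_text, [d.page_content for d in docs])`, and the resulting string could then be sent to the local model, for example with `ChatOllama(model=model).invoke(...)`, a step the notebook does not show.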