LOAD THE DOCUMENTS

In [27]:
!pip3 install pypdf

In [28]:
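from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page.
loader = PyPDFLoader("docs/ciberseguridad.pdf")
docs = loader.load()
# load_and_split() additionally runs a default text splitter over the pages.
pages = loader.load_and_split()
len(pages)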

29

SPLIT DOCUMENT INTO CHUNKS FOR EMBEDDING AND VECTOR STORAGE

"In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases." [source](https://python.langchain.com/docs/use_cases/question_answering/quickstart#indexing-split)

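To make the role of the overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally: it cuts at fixed character offsets and ignores separators. The function name `split_with_overlap` and the sample text are assumptions made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings each window advances 800 characters, so the last 200 characters of one chunk are repeated at the start of the next. A sentence that straddles a cut therefore still appears whole in at least one chunk, provided it is shorter than the overlap.
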
In [29]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

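# 1000-character chunks, with 200 characters shared between neighbouring chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # add_start_index=True
)
# Split the page-level documents loaded above.
all_splits = text_splitter.split_documents(docs)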
len(all_splits)

91

STORE

"We need to index our chunks so we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity — we measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors)." [source](https://python.langchain.com/docs/use_cases/question_answering/quickstart#indexing-store)

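As a rough illustration of the similarity search described above, the sketch below computes cosine similarity by hand and ranks a toy set of stored vectors against a query vector. The three-dimensional vectors, the names `toy_store` and `top_k`, and the chunk labels are all hypothetical; real embeddings have hundreds of dimensions, and a vector store such as Chroma may use a different distance metric and an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy "embeddings" standing in for embedded document chunks.
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
query_vec = [0.8, 0.1, 0.0]
best = top_k(query_vec, toy_store, k=2)
print(best)
```

Cosine similarity depends only on direction, not on vector length, which is why `chunk_b` ranks above `chunk_a` here: it points in almost exactly the same direction as the query.
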
In [30]:
from langchain_community.vectorstores import Chroma # Options: https://python.langchain.com/docs/integrations/vectorstores
# from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import GPT4AllEmbeddings # Replaces the OpenAI option

vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())

bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522


In [33]:
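# Return the 6 chunks whose embeddings are most similar to the query.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
# Query in Spanish, matching the document's language:
# "What is the theory of war and the constructivist understanding?"
retrieved_docs = retriever.invoke("Cual es la teoria de la guerra y la comprension constructivista?")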

In [34]:
len(retrieved_docs)

6

In [36]:
print(retrieved_docs[0].page_content)

En este sentido, un elemento clave del 
neorrealismo es su consideración del 
papel vital de las empresas privadas, 
promotoras o creadores de softwares, 
como entidades de suma importancia 
para la consolidación de un régimen 
internacional de normas del Internet, 
así como para salvaguardar la seguri -
dad nacional y delimitar la política 
exterior de las naciones. No obstante, 
deja en claro que los Estados-Nación 
son los mandamases de la regulación 
y control del ciberespacio (Nye, 2010).
B.- La teoría de la guerra y la 
comprensión constructivista
El desarrollo de la Teoría de la Gue -
rra Moderna, de Carl Clausewitz, ha 
marcado un fuerte énfasis en las carac -
terísticas de los campos o espacios de 
batalla, como un hecho crucial que de -
termina la superioridad de un Estado, 
sobre otro en una confrontación bé -
lica. En su obra clásica “De la Guerra” , 
Clausewitz delínea los conceptos clave 
de las estrategias castrenses del mun -
do contemporáneo, a la par que en su


GENERATE

"Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output." [source](https://python.langchain.com/docs/use_cases/question_answering/quickstart#retrieval-and-generation-generate)

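The flow described in the quote (question, retrieval, prompt construction, model call, output parsing) can be sketched in plain Python before wiring up the real components. Everything below is a hypothetical stand-in: `fake_retrieve` and `fake_model` replace the retriever and the LLM, `toy_template` is a simplified prompt, and `build_chain` is an illustrative helper rather than a LangChain API.

```python
from typing import Callable


def format_docs(chunks: list[str]) -> str:
    """Join retrieved chunks into a single context string."""
    return "\n\n".join(chunks)


def build_chain(
    retrieve: Callable[[str], list[str]],
    template: str,
    model: Callable[[str], str],
    parse: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> prompt -> model -> parse into one callable."""
    def chain(question: str) -> str:
        context = format_docs(retrieve(question))
        prompt = template.format(question=question, context=context)
        return parse(model(prompt))
    return chain


# Hypothetical stand-ins so the sketch runs without a vector store or a model.
def fake_retrieve(question: str) -> list[str]:
    return ["first chunk", "second chunk"]


def fake_model(prompt: str) -> str:
    return "  echo: " + prompt + "  "  # echoes the prompt, padded with whitespace


toy_template = "Question: {question}\nContext: {context}\nAnswer:"
chain = build_chain(fake_retrieve, toy_template, fake_model, str.strip)
answer = chain("What is in the document?")
print(answer)
```

In this notebook the real counterparts would be the `retriever` built earlier, the prompt template, and the GPT4All `llm` defined in the cells that follow; keeping each stage a separate callable makes it easy to swap one without touching the others.
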
In [38]:
from langchain_community.llms import GPT4All

llm = GPT4All(
    model='/home/abraham/personal/osllm-doc-qna/mistral-7b-openorca.gguf2.Q4_0.gguf',
    max_tokens=2048,
)

Original prompt ([source](https://python.langchain.com/docs/use_cases/question_answering/quickstart#retrieval-and-generation-generate))

In [None]:
# from langchain import hub

# prompt = hub.pull("rlm/rag-prompt")

New prompt ([source](https://python.langchain.com/docs/use_cases/question_answering/quickstart#retrieval-and-generation-generate))

In [40]:
from langchain.prompts import PromptTemplate

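# Spanish prompt. In English: "You are an assistant for answering questions.
# Use the following pieces of retrieved context to answer the question.
# If you don't know the answer, just say that you don't know.
# Use three sentences maximum and keep the answer concise."
prompt_template = PromptTemplate.from_template(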
    """
    Eres un asistente para responder preguntas. 
    Utiliza los siguientes fragmentos de contexto recuperado para responder la pregunta. 
    Si no conoces la respuesta, simplemente di que no la sabes. 
    Usa máximo tres oraciones y mantén la respuesta concisa.
    Pregunta: {question}
    Contexto: {context}
    Respuesta:
    """
)
prompt_template.format(question='Pregunta ejemplo', context='respuesta ejemplo')

'\n    Eres un asistente para responder preguntas. \n    Utiliza los siguientes fragmentos de contexto recuperado para responder la pregunta. \n    Si no conoces la respuesta, simplemente di que no la sabes. \n    Usa máximo tres oraciones y mantén la respuesta concisa.\n    Pregunta: Pregunta ejemplo\n    Contexto: respuesta ejemplo\n    Respuesta:\n    '

In [None]:
# example_messages = prompt_template.invoke