# Introduction

This notebook uses the Llama 2 7B chat model, a retrieval-augmented generation (RAG) setup built with LangChain, and the Chroma vector database to query the freely available PDF version of "Feynman Lectures on Computation", converted to plain text.







In [None]:
# Install pinned versions of the dependencies

!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12


In [None]:
# Basic import

from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [None]:
# Model path, device, and 4-bit quantization config for the Llama 2 7B chat model

model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
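To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count is the nominal "7b" figure (an assumption, not the exact count), and it ignores the quantization constants, activations, and the KV cache, all of which add to the real footprint.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: nominal "7b" parameter count, not the exact figure.
N_PARAMS = 7e9

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four, from roughly 13 GiB to a bit over 3 GiB, which is what lets the model fit on a single modest GPU.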

Prepare the model and the tokenizer.

In [None]:
# Setting up the model and tokenizer

time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")



Prepare model, tokenizer: 247.951 sec.


Define the query pipeline.

In [None]:
time_1 = time()

query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

time_2 = time()

print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 2.595 sec.


We define a function for testing the pipeline.

In [None]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a single prompt through the pipeline and print the result.

    Args:
        tokenizer: the tokenizer (supplies the end-of-sequence token id)
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt string
    Returns:
        None
    """

    # adapted from https://huggingface.co/blog/llama2#using-transformers

    time_1 = time()
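    sequences = pipeline(
        prompt_to_test,
        do_sample=True,   # sample instead of greedy decoding
        top_k=10,         # restrict sampling to the 10 most likely tokens
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,   # cap on total tokens, prompt included
    )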
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
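The `do_sample=True` and `top_k=10` settings mean that, at each step, the next token is drawn at random from the ten highest-scoring candidates only. The toy sketch below illustrates that idea on a hand-written list of scores; `top_k_sample` and `toy_logits` are hypothetical names, and this is not the `transformers` implementation.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # hypothetical scores for a 5-token vocabulary

# With k=2 only the two best-scoring tokens (indices 3 and 0) can ever be drawn.
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

Because the draw is random, repeated runs of the same prompt can give different text, which is worth keeping in mind when comparing outputs across cells.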

## Test the query pipeline

We test the pipeline with a query about the meaning of Computation.

In [None]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is Computation?")



Test inference: 21.763 sec.
Result: Please explain what is Computation? Composition is a fundamental concept in computer science that refers to the process of combining two or more things (such as programs, data, or functions) to create a new entity. nobody@example.com (John Doe) wrote: Hello, I'm interested in learning more about computation. In general, computation refers to any process of combining or manipulating data in a systematic and algorithmic way. Computation involves the use of algorithms, which are well-defined procedures for solving mathematical problems or performing tasks. In computer science, the term computation often refers specifically to the processes that are performed by a computer, such as executing programs or manipulating data in a computer program. The concept of computation is central to many areas of computer science, including algorithms, programming language, computer architecture, and database systems. In computational complexity theory, the concept of c

## Checking the model with a HuggingFace pipeline




In [None]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is Computation very briefly and tell us what is the connection of computation and thermodynamics, In less than 100 words.")

' Unterscheidung between computation and information processing. Computation is a fundamental concept in computer science that refers to the process of manipulating and transforming data in a systematic and algorithmic way. It involves the use of computational models, such as Turing machines, to perform tasks such as problem solving, data analysis, and decision making. The connection between computation and thermodynamics is that both are concerned with the flow of information and energy. In thermodynamics, the flow of energy is studied in terms of its entropy, which is a measure of the amount of disorder or randomness in a system. Similarly, in computation, the flow of information is studied in terms of the complexity of the computational model, which can be thought of as a measure of the disorder or randomness of the system.\n\nIn summary, computation is the process of manipulating and transforming data in a systematic and algorithmic way, while thermodynamics is the study of the flo

## Using our data

In [None]:
loader = TextLoader("/kaggle/input/feynmancomputation/feynman_computation.txt",
                    encoding="utf8")
documents = loader.load()

## Split data in chunks

We split the data into chunks using a recursive character text splitter.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
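The roles of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler fixed-window splitter. This is a sketch only: `split_with_overlap` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs, lines, and spaces rather than cutting at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears partly in both neighbours. With `chunk_size=1000` and `chunk_overlap=20`, as used above, the shared region is small relative to the chunk.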

## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


Initialize Chroma with the document splits and the embeddings defined above, and persist the store locally in `chroma_db`.

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")
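Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The brute-force sketch below shows that idea with cosine similarity on hand-made 3-dimensional vectors. Everything in it (`toy_store`, `top_k_chunks`, the vectors) is hypothetical; the real store uses its own index and distance settings, and the real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, store, k):
    """Return the texts of the k stored chunks most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for the embedded chunks.
toy_store = [
    ("Turing machines", [0.9, 0.1, 0.0]),
    ("Thermodynamics of computation", [0.1, 0.9, 0.2]),
    ("Quantum computers", [0.0, 0.2, 0.9]),
]

hits = top_k_chunks([0.8, 0.2, 0.0], toy_store, k=2)
print(hits)  # ['Turing machines', 'Thermodynamics of computation']
```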


## Initialize chain

In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
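With `chain_type="stuff"`, the retrieved chunks are simply concatenated ("stuffed") into a single prompt together with the question, and that prompt is sent to the LLM once. The sketch below shows the general shape of such a prompt. `build_stuff_prompt` is a hypothetical helper and the template wording is illustrative, not necessarily the exact text LangChain uses.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved chunks standing in for the retriever output.
example_prompt = build_stuff_prompt(
    "What is a Turing machine?",
    ["chunk one text", "chunk two text"],
)
print(example_prompt)
```

Because every retrieved chunk ends up in the same prompt, the number of chunks times the chunk size has to fit in the model's context window alongside the question and the generated answer.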

## Test the Retrieval-Augmented Generation


We define a test function that runs the query and times it.

In [None]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check a few queries.

In [None]:
query = "Please explain what is Computation very briefly and tell us what is the connection of computation and thermodynamics."
test_rag(qa, query)

Query: Please explain what is Computation very briefly and tell us what is the connection of computation and thermodynamics.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 9.945 sec.

Result:   Computation is the process of manipulating information, typically using a computer. Thermodynamics is the study of the relationships between heat, work, and energy. The connection between computation and thermodynamics is that the energy required to perform computations can be measured and analyzed using thermodynamic concepts, such as entropy. For example, the entropy of a computation can be used to quantify the amount of information that is processed during the computation.


In [None]:
query = "What is a Turing Machine, How is it related to the Halting problem?"
test_rag(qa, query)

Query: What is a Turing Machine, How is it related to the Halting problem?

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 71.122 sec.

Result:   A Turing Machine is a mathematical model for computation that was first introduced by Alan Turing in the 1930s. It consists of a tape that can be read and written to, and a read/write head that can move along the tape. The machine can be in one of a finite number of states, and it can change state based on the input it reads and the tape it is currently reading. The Halting problem is a famous result in the theory of computation that states that there cannot exist an algorithm that can determine, given a particular Turing Machine and input, whether the machine will halt or run indefinitely. This result has important implications for the design and analysis of Turing Machines, and it highlights the fundamental limits of what can be computed using these machines.

In this answer, we will explore the concept of a Turing Machine, how it is related to the Halting problem, and the implications of this result for the field of c

## Document sources

Let's check the document sources for the last query.

In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
# Show only the first (most similar) retrieved chunk
for doc in docs[:1]:
    print("Source: ", doc.metadata["source"])
    print("Text: ", doc.page_content, "\n")


Query: What is a Turing Machine, How is it related to the Halting problem?
Retrieved documents: 4
Source:  /kaggle/input/feynmancomputation/feynman_computation.txt
Text:  mathematician has a long strip of paper broken up into squares, in each of
which he can write and read, one at a time. He looks at a square, and what he
sees puts him in some state of mind which determines what he writes in the next
square. So imagine the guy's brain having lots of different possible states which
are mixed up and changed by looking at the strip of paper. After thinking along
these lines and abstracting a bit, Turing came up with a kind of machine which
is referred to as - surprise, surprise - a Turing machine. We will see that these
machines are horribly inefficient and slow - so much so that no one would ever
waste their time building one except for amusement - but that, if we are patient
with them, they can do wonderful things.
Now Turing invented all manner of Turing machines, but he eventually
dis