## STEPS

1. Open a PDF document (or a collection of PDFs).
2. Format the text of the PDF textbook so it is ready for an embedding model.
3. Embed all the text chunks from the textbook as numerical vectors that can be stored for later use.
4. Build a retrieval system that uses vector search to find the chunks of text most relevant to a query.
5. Create a prompt that incorporates the relevant pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.
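
The steps above can be sketched end to end on a few hand-written sentences. This is only a toy illustration: `toy_embed` is a hypothetical stand-in that uses normalized word counts instead of a real embedding model, and all function names here are made up for the example rather than taken from the notebook.

```python
import numpy as np

def chunk_sentences(sentences, size):
    """Group sentences into chunks of at most `size` sentences (step 2)."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

def toy_embed(texts, vocab):
    """Toy stand-in for an embedding model: L2-normalized word counts (step 3)."""
    vectors = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            word = word.strip(".,?!")
            if word in vocab:
                vectors[row, vocab[word]] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def retrieve(query, chunks, chunk_vectors, vocab, k=1):
    """Return the k chunks whose vectors score highest against the query (step 4)."""
    query_vector = toy_embed([query], vocab)[0]
    scores = chunk_vectors @ query_vector
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(query, context_chunks):
    """Place the retrieved chunks ahead of the query (step 5)."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuery: {query}"

sentences = [
    "Vitamin C supports the immune system.",
    "Citrus fruits are rich in vitamin C.",
    "Iron carries oxygen in the blood.",
    "Red meat and lentils contain iron.",
]
chunks = chunk_sentences(sentences, size=2)
words = sorted({w.strip(".,?!") for c in chunks for w in c.lower().split()})
vocab = {word: i for i, word in enumerate(words)}
chunk_vectors = toy_embed(chunks, vocab)

query = "Which foods contain iron?"
top_chunks = retrieve(query, chunks, chunk_vectors, vocab, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)  # step 6 would send this prompt to an LLM
```

The rest of the notebook follows the same shape, with PyMuPDF for reading, a sentence-transformer for embedding, and a hosted LLM for generation.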

In [1]:
# downloading requirements
!pip install PyMuPDF
!pip install tqdm



In [2]:
# Step 1
import os

# get pdf path
pdf_path = "/kaggle/input/nutrition-rag/human-nutrition-text.pdf"

In [3]:
# Step 2

import fitz
from tqdm.auto import tqdm 

def text_formatter(text: str) ->str:
    cleaned_text = text.replace("\n"," ").strip()
    
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    
    for page_number, page in tqdm(enumerate(doc)):
#         print(page_number)
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 41,
                               "page_char_count": len(text),
                               "page_word_count": len(text.split(" ")),
                               "page_sentence_count_raw": len(text.split(".")),
                               "page_token_count": len(text)/4, # 1 token is ~4 characters
                               "text":text})
        
    return pages_and_texts
    
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [4]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,3,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,3,199.25,Contents Preface University of Hawai‘i at Mā...


In [5]:
df.describe()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.594371,198.889901,14.180464,287.148593
std,348.86387,560.441673,95.747365,9.544587,140.110418
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.75,134.0,8.0,190.6875
50%,562.5,1232.5,215.0,13.0,308.125
75%,864.25,1605.25,271.25,19.0,401.3125
max,1166.0,2308.0,429.0,82.0,577.0


In [6]:
# Splitting text in page into sentences, using NLP lib - spaCy
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item['text']).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    #Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [7]:
import random

random.sample(pages_and_texts, k=1)

[{'page_number': 345,
  'page_char_count': 1644,
  'page_word_count': 290,
  'page_sentence_count_raw': 9,
  'page_token_count': 411.0,
  'text': 'Lipids and Disease  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Because heart disease, cancer, and stroke are the three leading  causes of death in the United States, it is critical to address dietary  and lifestyle choices that will ultimately decrease risk factors for  these diseases. According to the US Department of Health and  Human Services (HHS), the following risk factors are controllable:  high blood pressure, high cholesterol, cigarette smoking, diabetes,  poor diet, physical inactivity, being overweight, and obesity.  In light of that, we present the following informational tips to help  you define, evaluate, and implement healthy dietary choices to last  a lifetime. The amount and the type of fat that composes a person’s  dietary profile will have a profound effect upon th

In [8]:
# STEP 3

# Chunking our sentences together
# Chunking - grouping sentences into groups

# We do this so that our text chunks fit into the embedding model's context window

num_sentence_chunk_size = 10

# eg: [25] => [10,10,5] 

def split_list(input_list:list[str],
              slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list),slice_size)]

# Testing
test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
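
Whether a chunk of ten sentences actually fits the context window depends on how long the sentences are. The sketch below reuses the same slicing logic and the rough 4-characters-per-token estimate to flag chunks that exceed a token budget. The budget of 384 is an assumption for illustration (check the real limit on your model, for example via its `max_seq_length` attribute), and `chunks_over_budget` is a hypothetical helper, not part of the notebook.

```python
def split_list(input_list, slice_size):
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

def estimate_tokens(text):
    """Rough estimate: 1 token is ~4 characters."""
    return len(text) / 4

def chunks_over_budget(sentences, slice_size, max_tokens):
    """Return (estimated_tokens, chunk_text) for every chunk above max_tokens."""
    over = []
    for chunk in split_list(sentences, slice_size):
        joined = " ".join(chunk)
        tokens = estimate_tokens(joined)
        if tokens > max_tokens:
            over.append((tokens, joined))
    return over

# Ten short sentences followed by ten long ones
sentences = ["A short sentence."] * 10 + ["A much longer sentence " * 10 + "."] * 10

# max_tokens=384 is an assumed budget, not a value read from the model
over = chunks_over_budget(sentences, slice_size=10, max_tokens=384)
for tokens, text in over:
    print(f"Estimated tokens: {tokens} | Starts with: {text[:40]}")
```

Chunks flagged this way could be split with a smaller `slice_size`; otherwise the embedding model may silently truncate them.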

In [9]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list = item["sentences"],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [10]:
random.sample(pages_and_texts, k=1)

[{'page_number': 149,
  'page_char_count': 103,
  'page_word_count': 15,
  'page_sentence_count_raw': 4,
  'page_token_count': 25.75,
  'text': 'view it online here:  http:/ /pressbooks.oer.hawaii.edu/ humannutrition2/?p=130  \xa0 Introduction  |  149',
  'sentences': ['view it online here:  http:/ /pressbooks.oer.hawaii.edu/ humannutrition2/?p=130  \xa0 Introduction  |  149'],
  'page_sentence_count_spacy': 1,
  'sentence_chunks': [['view it online here:  http:/ /pressbooks.oer.hawaii.edu/ humannutrition2/?p=130  \xa0 Introduction  |  149']],
  'num_chunks': 1}]

In [11]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

  0%|          | 0/1208 [00:00<?, ?it/s]

In [12]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 957,
  'sentence_chunk': 'Another way for athletes to avoid “hitting the wall” is to consume carbohydrate-containing drinks and foods during an endurance event. In fact, throughout the Tour de France—a twenty-two-day, twenty-four-hundred-mile race—the average cyclist consumes greater than 60 grams of carbohydrates per hour. Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook Fuel Sources | 957',
  'chunk_char_count': 438,
  'chunk_word_count': 61,
  'chunk_token_count': 109.5}]

In [13]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,734.83,112.72,183.71
std,347.79,447.43,71.07,111.86
min,-41.0,12.0,3.0,3.0
25%,280.5,315.0,45.0,78.75
50%,586.0,746.0,114.0,186.5
75%,890.0,1118.5,173.0,279.62
max,1166.0,1831.0,297.0,457.75


In [14]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 27.5 | Text: Figure 15.1 reused “Silohuette of Three People” by photo- nic.co.uk nic / Unsplash License 1158 | Attributions
Chunk token count: 16.5 | Text: PART X CHAPTER 10. MAJOR MINERALS Chapter 10. Major Minerals | 607
Chunk token count: 26.75 | Text: Image by Allison Calabrese / CC BY 4.0 Figure 9.13 Niacin Deficiency, Pellagra 566 | Water-Soluble Vitamins
Chunk token count: 24.25 | Text: There are several lecithin supplements on the market Nonessential and Essential Fatty Acids | 315
Chunk token count: 13.0 | Text: PART VII CHAPTER 7. ALCOHOL Chapter 7. Alcohol | 429


In [15]:
# Keep only the rows with more than 30 tokens (drops the short chunks shown above)

pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [16]:
# Embedding our text chunks

# using all-mpnet-base-v2 sentence transformer

!pip install -U sentence-transformers



In [17]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2',
                           device = "cuda")

2024-06-22 19:45:25.961917: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-22 19:45:25.961987: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-22 19:45:25.963450: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [18]:
embed = embedding_model.encode(sentences)

embed

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

array([[ 0.0225026 , -0.07829171, -0.02303073, ..., -0.00827931,
         0.02652686, -0.00201899],
       [ 0.04170235,  0.00109742, -0.0155342 , ..., -0.02181628,
        -0.0635936 , -0.00875286]], dtype=float32)

In [19]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

text_chunks[0]

'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE'

In [None]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks)
text_chunk_embeddings

In [None]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"

In [None]:
text_chunks_and_embeddings_df.shape

In [None]:
len(text_chunk_embeddings)

In [None]:
# encode() returns a NumPy array by default, so each row is already a CPU array
# (calling .cpu() is only needed when encode() is run with convert_to_tensor=True)
text_chunks_and_embeddings_df["embedding"] = list(text_chunk_embeddings)
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [None]:
# Import saved file and view
text_chunks_and_embedding_df = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df.head()

#### NOTE: This notebook does not use a vector database

* In RAG, documents and queries are represented as high-dimensional vectors using embeddings.
* Vector databases and similarity-search libraries such as FAISS, Milvus, or Pinecone are optimized for efficient similarity search. They can quickly find the nearest vectors (similar documents) to a given query vector using Approximate Nearest Neighbors (ANN) algorithms.

#### Approximate Nearest Neighbors (ANN) Search:

* Indexing: Before searching, the database creates an index using ANN algorithms. This index organizes the vectors in a way that makes it efficient to search for nearest neighbors.
* Partitioning and Probing: The index partitions the vector space into multiple smaller regions or clusters. When a query is made, the search algorithm probes a subset of these regions that are most likely to contain similar vectors.
* Reduced Comparisons: Instead of comparing the query vector against every stored vector, the search algorithm compares it only against vectors in the selected regions. For example, with 10 million (10M) stored vectors, it might compare the query with only the 1 million (1M) vectors in the most promising regions.
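
The indexing, probing, and reduced-comparison steps above can be sketched in plain NumPy. This is a simplified illustration of a partition-based index on random vectors, not how FAISS, Milvus, or Pinecone are implemented; `build_index`, `search`, and every parameter value are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Scale vectors to unit length so the dot product acts like cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_index(vectors, n_regions, n_iters=10):
    """Indexing: partition the vectors into regions with a few k-means rounds."""
    centroids = vectors[rng.choice(len(vectors), n_regions, replace=False)]
    for _ in range(n_iters):
        assignments = np.argmax(vectors @ centroids.T, axis=1)
        for region in range(n_regions):
            members = vectors[assignments == region]
            if len(members) > 0:
                centroids[region] = normalize(members.mean(axis=0))
    assignments = np.argmax(vectors @ centroids.T, axis=1)
    return centroids, assignments

def search(query, vectors, centroids, assignments, n_probe, k):
    """Probing: score only the vectors that live in the n_probe closest regions."""
    probed = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.flatnonzero(np.isin(assignments, probed))
    scores = vectors[candidates] @ query
    top = np.argsort(-scores)[:k]
    return candidates[top], len(candidates)

vectors = normalize(rng.normal(size=(2000, 32)).astype(np.float32))
centroids, assignments = build_index(vectors, n_regions=20)

query = vectors[123]
ids, n_compared = search(query, vectors, centroids, assignments, n_probe=3, k=5)
print(f"Compared against {n_compared} of {len(vectors)} vectors; top ids: {ids}")
```

The trade-off is that a true nearest neighbor sitting in an unprobed region is missed, which is why the search is called approximate. For the 1,680 chunks in this notebook, an exhaustive dot product is already fast enough.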

## Run from here

In [21]:
!pip install -U sentence-transformers

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [22]:
# STEP 4
# RAG Search and Answer

# Comparing embeddings is called similarity search or vector search

import torch
import numpy as np
import pandas as pd

text_chunks_and_embedding_df = pd.read_csv("/kaggle/input/nutrition-rag/text_chunks_and_embeddings_df.csv")
text_chunks_and_embedding_df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242601e-02 9.02282074e-02 -5.09547768e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52156083e-02 5.92139438e-02 -1.66167226e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5,[ 2.79801711e-02 3.39814313e-02 -2.06426643e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25,[ 6.82566911e-02 3.81274782e-02 -8.46855436e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264196e-02 -8.49765539e-03 9.57158115e-...


In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


In [24]:
# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1680, 768])
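
The conversion above is needed because a CSV file can only store text, so each embedding array is written out as its string representation. A minimal sketch of that round trip on two made-up three-dimensional "embeddings" (the data and column values here are purely illustrative):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sentence_chunk": ["first chunk", "second chunk"],
    "embedding": [np.array([0.1, -0.2, 0.3], dtype=np.float32),
                  np.array([0.4, 0.5, -0.6], dtype=np.float32)],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)
print(type(loaded["embedding"][0]))  # the array came back as a string

# Strip the brackets and parse the space-separated numbers
loaded["embedding"] = loaded["embedding"].apply(
    lambda s: np.fromstring(s.strip("[]"), sep=" ")
)
restored = np.stack(loaded["embedding"].to_list()).astype(np.float32)
print(restored.shape)
```

One caveat: NumPy abbreviates the printed form of arrays with more than 1,000 elements using `...`, so this string round trip would lose data for larger embeddings. A binary format (for example `np.save`) avoids the issue.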

In [25]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device)



In [26]:
# Steps to perform a search

# 1. Define the query
query = "macronutrients functions"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
query_embedding

Query: macronutrients functions


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

tensor([ 3.6350e-02, -5.8203e-02, -1.8262e-02,  1.4228e-02,  1.3136e-02,
         4.3278e-02, -3.3172e-02, -1.1831e-02, -4.3201e-02, -6.8741e-02,
        -1.8463e-02, -1.5240e-02,  1.9626e-02,  4.7535e-02,  2.7343e-02,
        -7.1382e-03,  1.0865e-02, -2.5707e-02, -7.7592e-02, -5.8880e-03,
        -3.2791e-02,  2.0640e-02,  1.3843e-02, -6.2041e-03, -2.2706e-02,
         3.9274e-02,  4.0431e-02,  5.1309e-03,  1.1578e-02, -1.1135e-02,
        -1.9891e-02, -1.8691e-03, -4.9330e-02, -8.7934e-02,  1.5259e-06,
        -3.3809e-02, -4.7408e-02,  1.7066e-02, -7.5259e-02,  3.4821e-02,
         2.9377e-02, -3.2112e-02, -1.6509e-02,  7.3573e-03,  4.6359e-02,
        -4.1220e-03,  1.4522e-02,  3.7190e-04, -6.3193e-02,  1.9782e-02,
         2.9439e-02,  4.7811e-02,  1.6230e-02, -1.6853e-02,  2.3415e-02,
         4.1553e-02,  8.8754e-03,  2.3375e-02, -3.9122e-02,  3.5091e-02,
         6.1852e-02,  3.5322e-02, -1.1847e-03, -8.1968e-03,  4.3451e-02,
         5.5786e-02, -2.3311e-02, -3.3593e-02, -2.1

* Similarity here is measured with the dot product rather than cosine similarity.

Cosine similarity is the dot product divided by the product of the two vector norms. The embedding model already returns normalized (unit-length) embeddings, so the two measures give the same scores.

#### Dot Product
* Measures how well two vectors align, taking both direction and magnitude into account
* Vectors that point in the same direction give a positive value, which grows with their magnitudes
* Vectors that point in opposite directions give a negative value, which becomes more negative as their magnitudes grow
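
A small NumPy sketch of the relationship between the two measures, using made-up two-dimensional vectors: the dot product changes with vector length, cosine similarity does not, and the two agree once the vectors are normalized.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])    # same direction as a, twice the length
c = np.array([-3.0, -4.0])  # opposite direction to a

print(np.dot(a, b), cosine_similarity(a, b))  # 50.0 1.0
print(np.dot(a, c), cosine_similarity(a, c))  # -25.0 -1.0

# After normalizing to unit length, the dot product matches cosine similarity
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_unit, b_unit), cosine_similarity(a, b)))
```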

In [27]:
# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Time taken to get scores on 1680 embeddings: 0.00348 seconds.


torch.return_types.topk(
values=tensor([0.6926, 0.6738, 0.6646, 0.6536, 0.6473], device='cuda:0'),
indices=tensor([42, 47, 41, 51, 46], device='cuda:0'))

In [28]:
# Define helper function to print wrapped text, so it doesn't print a whole text chunk as a single line 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [29]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'macronutrients functions'

Results:
Score: 0.6926
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other macronutrients, it does not yield calories. Carbohydrates Carbohydrates
are 

#### Functions for the pipeline

In [30]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

In [31]:
query = "symptoms of pellagra"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[INFO] Time taken to get scores on 1680 embeddings: 0.00065 seconds.


(tensor([0.5000, 0.3741, 0.2959, 0.2793, 0.2721], device='cuda:0'),
 tensor([ 822,  853, 1536, 1555, 1531], device='cuda:0'))

In [32]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[INFO] Time taken to get scores on 1680 embeddings: 0.00056 seconds.
Query: symptoms of pellagra

Results:
Score: 0.5000
Niacin deficiency is commonly known as pellagra and the symptoms include
fatigue, decreased appetite, and indigestion.  These symptoms are then commonly
followed by the four D’s: diarrhea, dermatitis, dementia, and sometimes death.
Figure 9.12  Conversion of Tryptophan to Niacin Water-Soluble Vitamins | 565
Page number: 565


Score: 0.3741
car. Does it drive faster with a half-tank of gas or a full one?It does not
matter; the car drives just as fast as long as it has gas. Similarly, depletion
of B vitamins will cause problems in energy metabolism, but having more than is
required to run metabolism does not speed it up. Buyers of B-vitamin supplements
beware; B vitamins are not stored in the body and all excess will be flushed
down the toilet along with the extra money spent. B vitamins are naturally
present in numerous foods, and many other foods are enriched with th

In [33]:
# Get total GPU memory (total_memory reports the device's capacity, not the free memory)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Total GPU memory: {gpu_memory_gb} GB")

Total GPU memory: 16 GB


In [34]:
# STEP 5 - Designing prompt

def get_prompt(query: str,
               context_items : list[dict]) -> str:
    
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])
    base_prompt = """Based on the following context items, please answer the query.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your answers are as explanatory as possible.
    Use the following examples as reference for the ideal answer style.
    \nExample 1:
    Query: What are the fat-soluble vitamins?
    Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
    \nExample 2:
    Query: What are the causes of type 2 diabetes?
    Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
    \nExample 3:
    Query: What is the importance of hydration for physical performance?
    Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
    \nNow use the following context items to answer the user query:
    {context}
    \n
    User query: {query}
    """
    
    prompt = base_prompt.format(context=context, query=query)
    
    return prompt

In [37]:
query = input("Enter your query :")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = get_prompt(query=query,
                    context_items=context_items)
print(prompt)

Enter your query : How to keep heart healthy ?


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[INFO] Time taken to get scores on 1680 embeddings: 0.00019 seconds.
Based on the following context items, please answer the query.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your answers are as explanatory as possible.
    Use the following examples as reference for the ideal answer style.
    
Example 1:
    Query: What are the fat-soluble vitamins?
    Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
    
Example 2:
    Query

In [38]:
#STEP 6
!pip install replicate

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [39]:
import replicate
import os

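# Never hard-code an API token in a shared notebook; prompt for it at runtime instead
from getpass import getpass
os.environ['REPLICATE_API_TOKEN'] = getpass("Replicate API token: ")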

inp = {
    "top_p": 1,
    "prompt": prompt,
    "temperature": 0.5,
    "max_new_tokens": 500,
    "min_new_tokens": -1
}

for event in replicate.stream(
    "meta/llama-2-70b-chat",
    input=inp
):
    print(event, end="")


 To keep your heart healthy, it is important to make lifestyle changes that reduce your risk of cardiovascular disease. Here are some tips that can help:

1. Choose whole-grain and high-fiber foods: Diets that are high in whole grains and fiber have been associated with a reduced risk of cardiovascular disease. The American Heart Association recommends that at least half of your daily grain intake should originate from whole grains, and that you consume 14 grams of fiber per 1,000 kilocalories.

2. Get regular exercise: Regular physical activity can help you manage or prevent high blood pressure and blood cholesterol levels, both of which are risk factors for heart disease. Aim for at least 20 minutes of physical activity three times a week.

3. Incorporate a wide variety of nutrients in your diet: Eating a variety of fruits and vegetables rich in antioxidants and phytochemicals promotes health. Consider following a diet like the Mediterranean diet, which emphasizes fresh fruit and veg