## RAG-Based System (Planet_Earth AI):

**Source:** Planet_Earth PDF file

**Description:** We use a PDF file called "Planet_Earth.pdf", which contains information about planet Earth such as climate, land, water levels, air quality, and other details. This PDF is the source for a retrieval-augmented generation (RAG) system: we extract its text, index it, and answer questions from the retrieved passages.

In [None]:
# Install the following packages in case they are not installed already
!pip install pdfplumber
!pip install chromadb
!pip install tiktoken
!pip install openai


**1. Data Processing:**

In [None]:
# Import all the required Libraries
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import chromadb
import openai

In [None]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('my-key')

In [None]:
pdf_path = "/"

In [None]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
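A quick sanity check of `check_bboxes`, as a minimal sketch. The word dictionaries and the table bounding box below are made-up values in pdfplumber's `(x0, top, x1, bottom)` layout, not values taken from Planet_Earth.pdf, and the function is repeated so that the snippet runs on its own.

```python
def check_bboxes(word, table_bbox):
    # True only when the word's box lies strictly inside the table's box.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

# Hypothetical table region: (x0, top, x1, bottom)
table_bbox = (50, 100, 300, 200)

# Hypothetical words in the shape returned by page.extract_words()
inside_word = {'text': 'CO2', 'x0': 60, 'top': 110, 'x1': 90, 'bottom': 120}
outside_word = {'text': 'Earth', 'x0': 60, 'top': 20, 'x1': 100, 'bottom': 30}
edge_word = {'text': 'edge', 'x0': 50, 'top': 110, 'x1': 90, 'bottom': 120}

print(check_bboxes(inside_word, table_bbox))   # True
print(check_bboxes(outside_word, table_bbox))  # False
print(check_bboxes(edge_word, table_bbox))     # False: it touches the left border
```

Because the comparisons are strict, a word that touches the table border is treated as regular text rather than table content.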

In [None]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element is present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

In [None]:
# Define the directory containing the PDF files
pdf_directory = Path("")

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a DataFrame, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing Planet_Earth.pdf
Finished processing Planet_Earth.pdf
All PDFs have been processed.


In [None]:
planet_earth_pdfs_data = pd.concat(data, ignore_index=True)

In [None]:
planet_earth_pdfs_data.head(5)

Unnamed: 0,Page No.,Page_Text,Document Name
0,Page 1,Earth 2020 Earth 2020 An Insider’s Guide to a ...,Planet_Earth.pdf
1,Page 2,EARTH 2020,Planet_Earth.pdf
2,Page 3,,Planet_Earth.pdf
3,Page 4,Earth 2020 An Insider’s Guide to a Rapidly Cha...,Planet_Earth.pdf
4,Page 5,https://www.openbookpublishers.com Text © 2020...,Planet_Earth.pdf


In [None]:
len(planet_earth_pdfs_data)

290

In [None]:
planet_earth_pdfs_data['Metadata'] = planet_earth_pdfs_data.apply(lambda x: {'filing_name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)

In [None]:
planet_earth_pdfs_data.head(5)

Unnamed: 0,Page No.,Page_Text,Document Name,Metadata
0,Page 1,Earth 2020 Earth 2020 An Insider’s Guide to a ...,Planet_Earth.pdf,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
1,Page 2,EARTH 2020,Planet_Earth.pdf,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
2,Page 3,,Planet_Earth.pdf,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
3,Page 4,Earth 2020 An Insider’s Guide to a Rapidly Cha...,Planet_Earth.pdf,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
4,Page 5,https://www.openbookpublishers.com Text © 2020...,Planet_Earth.pdf,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."


**2. Setting Up Embedding and Chroma DB Store:**

In [None]:
# Import the OpenAI Embedding Function into chroma

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [None]:
# Define the path where chroma collections will be stored

chroma_data_path = "/chromadb_store"

In [None]:
# Call PersistentClient()

client = chromadb.PersistentClient(path=chroma_data_path)

In [None]:
# Set up the embedding function using the OpenAI embedding model

model = "text-embedding-ada-002"
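# Read the key from the environment variable set earlier; openai.api_key may still be None at this point
embedding_function = OpenAIEmbeddingFunction(api_key=os.environ['OPENAI_API_KEY'], model_name=model)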

In [None]:
## Create an empty collection
planet_earth_collection = client.get_or_create_collection(name='Planet_Earth', embedding_function=embedding_function)

In [None]:
documents_list = planet_earth_pdfs_data["Page_Text"].tolist()
metadata_list = planet_earth_pdfs_data['Metadata'].tolist()

In [None]:
documents_list

['Earth 2020 Earth 2020 An Insider’s Guide to a Rapidly Changing Planet E P T DITED BY HILIPPE ORTELL PHILIPPE Fi� y years has passed since the fi rst Earth Day, on April 22nd, 1970. This accessible, An Insider’s Guide to a Rapidly incisive and � mely collec� on of essays brings together a diverse set of expert voices to examine how the Earth’s environment has changed over these past fi � y years, and to consider what lies in store for our planet over the coming fi � y years. Changing Planet TORTELL Earth 2020: An Insider’s Guide to a Rapidly Changing Planet responds to a public increasingly concerned about the deteriora� on of Earth’s natural systems, off ering readers a wealth of perspec� ves on our shared ecological past, and on the future trajectory of planet Earth. (ED.) Wri� en by world-leading thinkers on the front-lines of global change research and policy, this mul� -disciplinary collec� on maintains a dual focus: some essays inves� gate specifi c facets of the physical Earth 

In [None]:
# Define batch size (try a smaller batch size or even 1)
batch_size = 1  # Start with a batch size of 1 to isolate the issue

for i in range(0, len(documents_list), batch_size):
    batch_docs = documents_list[i:i+batch_size]
    batch_ids = [str(j) for j in range(i, i+len(batch_docs))]
    batch_meta = metadata_list[i:i+batch_size]

    # Filter out empty strings and the corresponding metadata and ids
    non_empty_docs = [doc for doc in batch_docs if doc.strip()]
    non_empty_meta = [batch_meta[k] for k, doc in enumerate(batch_docs) if doc.strip()]
    non_empty_ids = [batch_ids[k] for k, doc in enumerate(batch_docs) if doc.strip()]

    if non_empty_docs: # Only add if there are non-empty documents in the batch
        try:
            planet_earth_collection.add(
                documents=non_empty_docs,
                ids=non_empty_ids,
                metadatas=non_empty_meta
            )
            print(f"Successfully added batch starting with id {non_empty_ids[0]}")
        except Exception as e:
            print(f"Error adding batch starting with id {batch_ids[0]}: {e}")
            # You can add more specific error handling or logging here
            # For example, you could print the problematic document content
            # print(f"Problematic document: {non_empty_docs[0]}")
            break # Stop on the first error to investigate

Successfully added batch starting with id 0
Successfully added batch starting with id 1
Successfully added batch starting with id 3
Successfully added batch starting with id 4
Successfully added batch starting with id 5
Successfully added batch starting with id 6
Successfully added batch starting with id 7
Successfully added batch starting with id 9
Successfully added batch starting with id 11
Successfully added batch starting with id 12
Successfully added batch starting with id 13
Successfully added batch starting with id 14
Successfully added batch starting with id 15
Successfully added batch starting with id 16
Successfully added batch starting with id 17
Successfully added batch starting with id 18
Successfully added batch starting with id 19
Successfully added batch starting with id 20
Successfully added batch starting with id 21
Successfully added batch starting with id 23
Successfully added batch starting with id 24
Successfully added batch starting with id 25
Successfully added
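The gaps in the log above (ids 2, 8, 10 and 22 never appear) come from blank pages being filtered out while the remaining pages keep their original row positions as IDs. A minimal sketch of that behaviour, using a hypothetical helper `iter_non_empty_batches` and made-up page strings; it does not call ChromaDB.

```python
def iter_non_empty_batches(documents, metadatas, batch_size):
    """Yield (ids, docs, metas) per batch, skipping blank pages.

    IDs are the original row positions as strings, so a skipped page
    leaves a gap in the ID sequence instead of shifting later IDs.
    """
    for start in range(0, len(documents), batch_size):
        ids, docs, metas = [], [], []
        for offset, doc in enumerate(documents[start:start + batch_size]):
            if doc.strip():
                ids.append(str(start + offset))
                docs.append(doc)
                metas.append(metadatas[start + offset])
        if docs:  # skip batches that contain only blank pages
            yield ids, docs, metas

# Made-up pages: the third one is blank, like Page 3 of the PDF
pages = ["Earth 2020", "EARTH 2020", "", "An Insider's Guide"]
meta = [{"Page_No.": f"Page {n + 1}"} for n in range(len(pages))]

batches = list(iter_non_empty_batches(pages, meta, batch_size=2))
for ids, docs, metas in batches:
    print(ids, docs)
```

Each yielded triple could be passed to `planet_earth_collection.add(documents=docs, ids=ids, metadatas=metas)`; a batch size larger than 1 reduces the number of embedding requests.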

In [None]:
print("First 5 documents:")
print(documents_list[:5])
print("\nFirst 5 metadata entries:")
print(metadata_list[:5])
print("\nLength of documents_list:", len(documents_list))
print("Length of metadata_list:", len(metadata_list))

First 5 documents:
['Earth 2020 Earth 2020 An Insider’s Guide to a Rapidly Changing Planet E P T DITED BY HILIPPE ORTELL PHILIPPE Fi� y years has passed since the fi rst Earth Day, on April 22nd, 1970. This accessible, An Insider’s Guide to a Rapidly incisive and � mely collec� on of essays brings together a diverse set of expert voices to examine how the Earth’s environment has changed over these past fi � y years, and to consider what lies in store for our planet over the coming fi � y years. Changing Planet TORTELL Earth 2020: An Insider’s Guide to a Rapidly Changing Planet responds to a public increasingly concerned about the deteriora� on of Earth’s natural systems, off ering readers a wealth of perspec� ves on our shared ecological past, and on the future trajectory of planet Earth. (ED.) Wri� en by world-leading thinkers on the front-lines of global change research and policy, this mul� -disciplinary collec� on maintains a dual focus: some essays inves� gate specifi c facets of 

In [None]:
planet_earth_collection.peek(1)

{'ids': ['0'],
 'embeddings': array([[-0.00738377, -0.03350642, -0.02071139, ..., -0.00245249,
         -0.00608849, -0.03334862]]),
 'documents': ['Earth 2020 Earth 2020 An Insider’s Guide to a Rapidly Changing Planet E P T DITED BY HILIPPE ORTELL PHILIPPE Fi� y years has passed since the fi rst Earth Day, on April 22nd, 1970. This accessible, An Insider’s Guide to a Rapidly incisive and � mely collec� on of essays brings together a diverse set of expert voices to examine how the Earth’s environment has changed over these past fi � y years, and to consider what lies in store for our planet over the coming fi � y years. Changing Planet TORTELL Earth 2020: An Insider’s Guide to a Rapidly Changing Planet responds to a public increasingly concerned about the deteriora� on of Earth’s natural systems, off ering readers a wealth of perspec� ves on our shared ecological past, and on the future trajectory of planet Earth. (ED.) Wri� en by world-leading thinkers on the front-lines of global cha

In [None]:
# Let's take a look at the first few entries in the collection

planet_earth_collection.get(
   ids = ['0','1','2'],
   include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1'],
 'embeddings': array([[-0.00738377, -0.03350642, -0.02071139, ..., -0.00245249,
         -0.00608849, -0.03334862],
        [-0.00702442, -0.03759547, -0.00351551, ..., -0.00789505,
         -0.00997269, -0.02296622]]),
 'documents': ['Earth 2020 Earth 2020 An Insider’s Guide to a Rapidly Changing Planet E P T DITED BY HILIPPE ORTELL PHILIPPE Fi� y years has passed since the fi rst Earth Day, on April 22nd, 1970. This accessible, An Insider’s Guide to a Rapidly incisive and � mely collec� on of essays brings together a diverse set of expert voices to examine how the Earth’s environment has changed over these past fi � y years, and to consider what lies in store for our planet over the coming fi � y years. Changing Planet TORTELL Earth 2020: An Insider’s Guide to a Rapidly Changing Planet responds to a public increasingly concerned about the deteriora� on of Earth’s natural systems, off ering readers a wealth of perspec� ves on our shared ecological past, and on the 

In [None]:
cache_collection = client.get_or_create_collection(name='PlanetEarth_Cache', embedding_function=embedding_function)

In [None]:
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}

**3. Querying and Semantic Search Implementation & RAG:**

=========== Query 1 =================

In [None]:
# Read the user query
query = input()

At what perecentage ice free land surface is used for growing crops


In [None]:
query

'At what perecentage ice free land surface is used for growing crops'

In [None]:
## Quickly checking the results of the query
results = planet_earth_collection.query(
      query_texts=query,
      n_results=10
      )

In [None]:
results

{'ids': [['111',
   '205',
   '113',
   '115',
   '116',
   '207',
   '112',
   '209',
   '158',
   '208']],
 'embeddings': None,
 'documents': [['Ice —— Julian Dowdeswell With an average surface temperature of 15°C (and rising), much of our planet is inhospitable to ice. Today, less than 2% of Earth’s water exists in a frozen form, locked up in glaciers and ice sheets, sea ice and permafrost. This ‘cryosphere’ is critically important for controlling global sea level and the distribution of the planet’s fresh water, yet it has always existed in a rather perilous state. In contrast, the ice caps on Mars and the frozen surface of Jupiter’s moon, Europa, enjoy a much colder and more stable existence. To understand the impacts of climate change on Earth’s cryosphere, it is necessary to examine the different components of our icy world separately, for each has its own sensitivity to local and global forces. Land-based glaciers and ice sheets develop when winter snowfall persists through suc

In [None]:
# Search the cache collection first
# Query the collection against the user query and return the top result

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [None]:
cache_results

{'ids': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[]],
 'distances': [[]]}

In [None]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = planet_earth_collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query text in cache_collection as the ChromaDB document, so that it is embedded and can be matched against later queries
      # Store the retrieved texts, ids, distances and metadatas as that document's metadata, so that they can be fetched directly when a later query matches a cached one
      Keys = []
      Values = []

      # for key, val in results.items():
      #   if key not in ['embeddings', 'uris','data']:
      #     for i in range(10):
      #       Keys.append(str(key)+str(i))
      #       Values.append(str(val[0][i]))

      for key, val in results.items():
        if key not in ['embeddings', 'uris', 'data']:
            if isinstance(val[0], list):  # Expected case
                for i in range(len(val[0])):
                    Keys.append(f"{key}{i}")
                    Values.append(str(val[0][i]))
            else:
                # Handle non-list values safely
                Keys.append(f"{key}0")
                Values.append(str(val[0]))



      cache_collection.add(
          documents= [query],
          ids = [query],  # The query text doubles as the ID; an integer counter such as str(cache_collection.count()) would also work
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })

Not found in cache. Found in main collection.
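The cache logic above flattens the query results into keys such as `ids0` or `documents3` and later rebuilds the lists from those keys. A minimal sketch of that round trip follows, with hypothetical helper names (`is_cache_hit`, `flatten_results`, `unflatten_results`) and shortened placeholder documents. It assumes the metadata keys may come back in any order, so it sorts by the numeric suffix when rebuilding; it works on plain dictionaries and does not call ChromaDB.

```python
import re

FIELDS = ('ids', 'documents', 'distances', 'metadatas')

def is_cache_hit(cache_results, threshold=0.2):
    """True when the cache returned a neighbour within `threshold`."""
    distances = cache_results['distances'][0]
    return len(distances) > 0 and distances[0] <= threshold

def flatten_results(results, fields=FIELDS):
    """Turn Chroma-style query results into a flat {field + rank: str} dict."""
    flat = {}
    for field in fields:
        for rank, value in enumerate(results[field][0]):
            flat[f"{field}{rank}"] = str(value)
    return flat

def unflatten_results(flat, fields=FIELDS):
    """Rebuild one list per field, ordered by the numeric rank in each key."""
    ranked = {field: {} for field in fields}
    for key, value in flat.items():
        match = re.fullmatch(r"([a-z]+)(\d+)", key)
        if match and match.group(1) in ranked:
            ranked[match.group(1)][int(match.group(2))] = value
    return {field: [vals[i] for i in sorted(vals)] for field, vals in ranked.items()}

# Placeholder results in the shape returned by collection.query()
results = {
    'ids': [['111', '205']],
    'documents': [['Ice ...', 'Land ...']],
    'distances': [[0.314146, 0.344902]],
    'metadatas': [[{'Page_No.': 'Page 112'}, {'Page_No.': 'Page 206'}]],
}

flat = flatten_results(results)
shuffled = dict(reversed(list(flat.items())))  # simulate keys returned out of order
restored = unflatten_results(shuffled)

print(is_cache_hit({'distances': [[]]}))      # False: empty cache
print(is_cache_hit({'distances': [[0.05]]}))  # True: near-identical query
print(restored['ids'])                        # ['111', '205']
```

Sorting by rank keeps each document aligned with its own ID, distance and metadata even if the stored keys are returned in a different order than they were written.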


In [None]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 112', 'filing_name': 'Plane...",Ice —— Julian Dowdeswell With an average surfa...,0.314146,111
1,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",Land —— Navin Ramankutty and Hannah Wittman Ou...,0.344902,205
2,"{'Page_No.': 'Page 114', 'filing_name': 'Plane...",often being the last to become clear of ice. A...,0.353272,113
3,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",water that absorbs much greater amounts of sol...,0.359818,115
4,"{'Page_No.': 'Page 117', 'filing_name': 'Plane...",sea-ice minima have declined from around 7–8 m...,0.360826,116
5,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",for cultural and linguistic genocide across se...,0.36586,207
6,"{'Page_No.': 'Page 113', 'filing_name': 'Plane...","Today, the great ice sheets of Antarctica and ...",0.369905,112
7,"{'Page_No.': 'Page 210', 'filing_name': 'Plane...","just seventeen crops. On the flip side, excess...",0.389362,209
8,"{'Page_No.': 'Page 159', 'filing_name': 'Plane...",(and occasionally removing) barriers that migh...,0.393565,158
9,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",biologically-inert) nitrogen in the atmosphere...,0.395671,208


In [None]:
## Checking if the cache also contains the results
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [None]:
cache_results

{'ids': [['At what perecentage ice free land surface is used for growing crops']],
 'embeddings': None,
 'documents': [['At what perecentage ice free land surface is used for growing crops']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'metadatas9': "{'filing_name': 'Planet_Earth', 'Page_No.': 'Page 209'}",
    'documents2': 'often being the last to become clear of ice. As a result of the seasonal cycle of ice growth and melting, sea ice is usually only a few meters thick at most, as compared to hundreds or thousands of meters for glaciers and ice sheets. A third type of ice is permafrost, which occurs in polar and high-mountain areas where the ground is permanently frozen to depths of ten to hundreds of meters. In summer, ice in the upper meter or so of the soil matrix melts to produce a soft ‘active layer’, which refreezes again each winter. Permanently frozen ground occupies vast areas of the Arctic beyond the margins of mode

In [None]:
!pip install sentence_transformers


In [None]:
# Import the CrossEncoder library from sentence_transformers

from sentence_transformers import CrossEncoder, util

In [None]:
# Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Input (query, response) pairs for each of the top 10 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [None]:
# Store the rerank scores in results_df

results_df['Reranked_scores'] = cross_rerank_scores

In [None]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 112', 'filing_name': 'Plane...",Ice —— Julian Dowdeswell With an average surfa...,0.314146,111,-9.258108
1,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",Land —— Navin Ramankutty and Hannah Wittman Ou...,0.344902,205,-2.087678
2,"{'Page_No.': 'Page 114', 'filing_name': 'Plane...",often being the last to become clear of ice. A...,0.353272,113,-7.688014
3,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",water that absorbs much greater amounts of sol...,0.359818,115,-9.087379
4,"{'Page_No.': 'Page 117', 'filing_name': 'Plane...",sea-ice minima have declined from around 7–8 m...,0.360826,116,-8.430554
5,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",for cultural and linguistic genocide across se...,0.36586,207,-6.485122
6,"{'Page_No.': 'Page 113', 'filing_name': 'Plane...","Today, the great ice sheets of Antarctica and ...",0.369905,112,-8.80987
7,"{'Page_No.': 'Page 210', 'filing_name': 'Plane...","just seventeen crops. On the flip side, excess...",0.389362,209,-8.827705
8,"{'Page_No.': 'Page 159', 'filing_name': 'Plane...",(and occasionally removing) barriers that migh...,0.393565,158,-10.664641
9,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",biologically-inert) nitrogen in the atmosphere...,0.395671,208,-8.143909


In [None]:
# Return the top 3 results from semantic search
top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 112', 'filing_name': 'Plane...",Ice —— Julian Dowdeswell With an average surfa...,0.314146,111,-9.258108
1,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",Land —— Navin Ramankutty and Hannah Wittman Ou...,0.344902,205,-2.087678
2,"{'Page_No.': 'Page 114', 'filing_name': 'Plane...",often being the last to become clear of ice. A...,0.353272,113,-7.688014


In [None]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

In [None]:
top_3_RAG

Unnamed: 0,Documents,Metadatas
1,Land —— Navin Ramankutty and Hannah Wittman Ou...,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
5,for cultural and linguistic genocide across se...,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P..."
2,often being the last to become clear of ice. A...,"{'Page_No.': 'Page 114', 'filing_name': 'Plane..."
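To make the effect of re-ranking concrete, the sketch below reuses the IDs, distances and cross-encoder scores printed in `results_df` for Query 1 and compares the two orderings in plain Python. Lower distance is better for the semantic search, while a higher cross-encoder score is better for the re-ranker.

```python
# (id, distance, cross-encoder score), copied from the results_df output for Query 1
rows = [
    ("111", 0.314146, -9.258108),
    ("205", 0.344902, -2.087678),
    ("113", 0.353272, -7.688014),
    ("115", 0.359818, -9.087379),
    ("116", 0.360826, -8.430554),
    ("207", 0.365860, -6.485122),
    ("112", 0.369905, -8.809870),
    ("209", 0.389362, -8.827705),
    ("158", 0.393565, -10.664641),
    ("208", 0.395671, -8.143909),
]

# Semantic search: smallest distance first
top_by_distance = [r[0] for r in sorted(rows, key=lambda r: r[1])[:3]]

# Re-ranking: largest cross-encoder score first
top_by_rerank = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)[:3]]

print(top_by_distance)  # ['111', '205', '113']
print(top_by_rerank)    # ['205', '207', '113']
```

The re-ranker promotes the "Land" page (id 205) to first place and drops the "Ice" page (id 111) out of the top 3, which matches the `top_3_RAG` table.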


In [None]:
retrieved = top_3_RAG[["Documents", "Metadatas"]][:3]



In [None]:
# Join just the text from the 'Documents' column into a single context string
retrieved_text = "\n\n".join(top_3_RAG['Documents'].tolist())


In [None]:
messages = [
    {"role":"system", "content":"You are an AI assistant to user."},
    {"role":"user", "content":f"""{query}. Please use these details to extract that info:'{retrieved_text}' """},
          ]
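One possible way to structure this prompt, shown as a sketch rather than the notebook's method: the hypothetical helper `build_messages` labels each passage with its page number and separates the question from the context. The passage texts, the system prompt wording and the `max_chars` limit are illustrative assumptions.

```python
def build_messages(question, passages, max_chars=6000):
    """Build chat messages that ask the model to answer from the passages.

    `passages` is a list of (text, metadata) pairs; `max_chars` is an
    arbitrary cap on the context length, chosen only for illustration.
    """
    blocks = []
    for number, (text, meta) in enumerate(passages, start=1):
        blocks.append(f"[{number}] ({meta['Page_No.']}) {text}")
    context = "\n\n".join(blocks)[:max_chars]
    return [
        {"role": "system",
         "content": "You answer questions using only the provided context."},
        {"role": "user",
         "content": f"Question: {question}\n\nContext:\n{context}"},
    ]

# Placeholder passages standing in for the rows of top_3_RAG
passages = [
    ("Land (placeholder text for the first passage)", {"Page_No.": "Page 206"}),
    ("Ice (placeholder text for the second passage)", {"Page_No.": "Page 114"}),
]

messages = build_messages("What share of ice-free land is farmed?", passages)
print(messages[1]["content"])
```

Keeping the page label next to each passage makes it easier to ask the model to cite the page it relied on.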

In [None]:
response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages)
response.choices[0].message.content

"About 40% of the world's ice-free land surface is used for growing crops or grazing animals."

In [None]:
results = {'question':query,'answer':response.choices[0].message.content}

In [None]:
results

{'question': 'At what perecentage ice free land surface is used for growing crops',
 'answer': "About 40% of the world's ice-free land surface is used for growing crops or grazing animals."}

**Query 1 Result:**

---



**Question:**
At what perecentage ice free land surface is used for growing crops

**Answer:**
About 40% of the world's ice-free land surface is used for growing crops or grazing animals.



=========== Query 2 =================

In [None]:
query2='what is the global average rise of sea level rise'

In [None]:
cache_results2 = cache_collection.query(
    query_texts=query2,
    n_results=1
)

In [None]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results2['distances'][0] == [] or cache_results2['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = planet_earth_collection.query(
      query_texts=query2,
      n_results=10
      )

      # Store the query text in cache_collection as the ChromaDB document, so that it is embedded and can be matched against later queries
      # Store the retrieved texts, ids, distances and metadatas as that document's metadata, so that they can be fetched directly when a later query matches a cached one
      Keys = []
      Values = []

      # for key, val in results.items():
      #   if key not in ['embeddings', 'uris','data']:
      #     for i in range(10):
      #       Keys.append(str(key)+str(i))
      #       Values.append(str(val[0][i]))

      for key, val in results.items():
        if key not in ['embeddings', 'uris', 'data']:
            if isinstance(val[0], list):  # Expected case
                for i in range(len(val[0])):
                    Keys.append(f"{key}{i}")
                    Values.append(str(val[0][i]))
            else:
                # Handle non-list values safely
                Keys.append(f"{key}0")
                Values.append(str(val[0]))



      cache_collection.add(
          documents= [query2],
          ids = [query2],  # The query text doubles as the ID; an integer counter such as str(cache_collection.count()) would also work
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results2['distances'][0][0] <= threshold:
      cache_result_dict = cache_results2['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })

Not found in cache. Found in main collection.


In [None]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",and groundwater supplies in low-lying coastal ...,0.213172,154
1,"{'Page_No.': 'Page 153', 'filing_name': 'Plane...","gravitational field, affecting sea level diffe...",0.230776,152
2,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",And we may see even more dramatic changes over...,0.252099,155
3,"{'Page_No.': 'Page 154', 'filing_name': 'Plane...",While the ice sheets are the major (and most v...,0.264784,153
4,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...","Sea Level Rise, 1970–2070: A View from the Fut...",0.286664,151
5,"{'Page_No.': 'Page 160', 'filing_name': 'Plane...","6. W. Sweet, G. Dusek, J. Obeysekera and J. J....",0.311768,159
6,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",water that absorbs much greater amounts of sol...,0.321367,115
7,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...","650,000-800,000 years before present’, Nature,...",0.338241,118
8,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",believed that the oceans’ vastness would serve...,0.372986,217
9,"{'Page_No.': 'Page 36', 'filing_name': 'Planet...",implying that the oceans were taking up less o...,0.376253,35


In [None]:
# Score each retrieved chunk against the current query (query2), not the earlier `query` variable
cross_inputs = [[query2, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)


results_df['Reranked_scores'] = cross_rerank_scores


top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",and groundwater supplies in low-lying coastal ...,0.213172,154,-10.171543
1,"{'Page_No.': 'Page 153', 'filing_name': 'Plane...","gravitational field, affecting sea level diffe...",0.230776,152,-10.484325
2,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",And we may see even more dramatic changes over...,0.252099,155,-10.212143


In [None]:
# Keep the three chunks with the highest cross-encoder scores
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

# Join the selected chunks into a single context string for the prompt
retrieved_text = "\n\n".join(top_3_RAG['Documents'].tolist())
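
The reranking cells score `[query, chunk]` pairs and then sort by score. A minimal sketch of the same step as a reusable function is shown below. `rerank` and `overlap_score` are hypothetical names introduced here; `overlap_score` is only a toy stand-in so the example runs without a model, and the scorer is assumed to follow the calling convention of `cross_encoder.predict` (a list of pairs in, one score per pair out).

```python
def rerank(query, documents, score_fn, top_k=3):
    """Return the top_k (document, score) pairs, highest score first.

    score_fn takes a list of [query, document] pairs and returns one
    score per pair.
    """
    pairs = [[query, doc] for doc in documents]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs):
    """Toy stand-in scorer: number of distinct query words found in the document."""
    scores = []
    for query, doc in pairs:
        doc_words = set(doc.lower().split())
        query_words = set(query.lower().split())
        scores.append(sum(1 for word in query_words if word in doc_words))
    return scores


docs = [
    "ice sheets are melting",
    "sea level rise is accelerating",
    "the ocean absorbs heat",
]
top = rerank("global sea level rise", docs, overlap_score, top_k=2)
```

In the notebook this would be called as `rerank(query2, results_df['Documents'].tolist(), cross_encoder.predict)`. Passing the query as an argument keeps the reranker from silently scoring against a variable left over from an earlier query.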


In [None]:
messages = [
    {"role":"system", "content":"You are an AI assistant to the user."},
    {"role":"user", "content":f"""{query2}. Please use these details to extract that info, don't use any other details:'{retrieved_text}' """},
]

In [None]:
response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages)
response.choices[0].message.content

'The global average rise of sea level was 1.4 mm per year between 1901 and 1990. From 2006 to 2015, the sea level rise increased to 3.6 mm per year, which is about 2.5 times the rate seen in much of the twentieth century. Predictions for future sea level increases suggest a global rise of between 0.4 m and about 1 m by 2100 for low and high greenhouse-gas emission scenarios, respectively.'
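
The prompt above places the query and the retrieved text in a single f-string. One possible way to build the same `messages` list from a helper is sketched below, with the passages numbered so their boundaries stay visible to the model. `build_messages` and its wording are assumptions made for illustration, not the prompt that produced the answer shown above.

```python
def build_messages(query, passages, system_prompt="You are an AI assistant to the user."):
    """Build a chat `messages` list that restricts the answer to the passages."""
    # Number each passage so the boundaries between chunks stay visible
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    user_content = (
        f"Question: {query}\n\n"
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


demo_messages = build_messages(
    "what is the global average rise of sea level rise",
    ["first retrieved chunk", "second retrieved chunk"],
)
```

The returned list has the same shape as `messages` in the cell above, so it could be passed to `openai.chat.completions.create(model=..., messages=...)` unchanged.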

In [None]:
another_result = {'question':query2, 'answer':response.choices[0].message.content}

In [None]:
another_result

{'question': 'what is the global average rise of sea level rise',
 'answer': 'The global average rise of sea level was 1.4 mm per year between 1901 and 1990. From 2006 to 2015, the sea level rise increased to 3.6 mm per year, which is about 2.5 times the rate seen in much of the twentieth century. Predictions for future sea level increases suggest a global rise of between 0.4 m and about 1 m by 2100 for low and high greenhouse-gas emission scenarios, respectively.'}

**Query 2 Result:**

---



**Question:** what is the global average rise of sea level rise

**Answer:** The global average rise of sea level was 1.4 mm per year between 1901 and 1990. From 2006 to 2015, the sea level rise increased to 3.6 mm per year, which is about 2.5 times the rate seen in much of the twentieth century. Predictions for future sea level increases suggest a global rise of between 0.4 m and about 1 m by 2100 for low and high greenhouse-gas emission scenarios, respectively.

=========== Query 3=================

In [None]:
query3='When was the first IPCC report appeared'

In [None]:
cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

In [None]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results3['distances'][0] == [] or cache_results3['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = planet_earth_collection.query(
      query_texts=query3,
      n_results=10
      )

      # Store the query in cache_collection as the document, so that ChromaDB embeds it and later queries can be matched against it
      # Store the retrieved text, ids, distances and metadatas as that entry's metadata, so they can be fetched directly when a later query matches a cached one
      Keys = []
      Values = []

      # for key, val in results.items():
      #   if key not in ['embeddings', 'uris','data']:
      #     for i in range(10):
      #       Keys.append(str(key)+str(i))
      #       Values.append(str(val[0][i]))

      for key, val in results.items():
        if key not in ['embeddings', 'uris', 'data']:
            if isinstance(val[0], list):  # Expected case
                for i in range(len(val[0])):
                    Keys.append(f"{key}{i}")
                    Values.append(str(val[0][i]))
            else:
                # Handle non-list values safely
                Keys.append(f"{key}0")
                Values.append(str(val[0]))



      cache_collection.add(
          documents= [query3],
          ids = [query3],  # Alternatively, use str(cache_collection.count()) to assign sequential string IDs "0", "1", "2", ...
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# Otherwise the nearest cached query is within the threshold, so return the results stored in the cache

elif cache_results3['distances'][0][0] <= threshold:
      cache_result_dict = cache_results3['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

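      # Create a DataFrame under the same name the following cells use (results_df)
      # Cached values were stored as strings, so convert the distances back to floats
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': [float(d) for d in distances],
        'Metadatas': metadatas
      })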

Not found in cache. Found in main collection.
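
The `if` / `elif` pair in the cell above decides between the main collection and the cache by comparing the nearest cached distance with `threshold`, treating an empty cache as a miss. A small sketch of that decision as a standalone function follows. `is_cache_hit` is a hypothetical name, and the dictionaries passed to it below only imitate the `distances` field of a `cache_collection.query(...)` result.

```python
def is_cache_hit(cache_results, threshold=0.2):
    """Return True when the nearest cached query is within `threshold`."""
    distances = cache_results.get("distances") or [[]]
    nearest = distances[0]
    # An empty cache returns no neighbours at all, which counts as a miss
    if not nearest:
        return False
    return nearest[0] <= threshold


empty_cache = is_cache_hit({"distances": [[]]})
close_match = is_cache_hit({"distances": [[0.05]]})
far_match = is_cache_hit({"distances": [[0.35]]})
```

With this helper the branch condition would read `if not is_cache_hit(cache_results3, threshold): ... else: ...`, which avoids repeating the distance lookup in both branches.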


In [None]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 184', 'filing_name': 'Plane...",If it takes a village to ensure the well-being...,0.295294,183
1,"{'Page_No.': 'Page 86', 'filing_name': 'Planet...","12. IPCC, ‘Summary for policymakers’, in Globa...",0.311781,85
2,"{'Page_No.': 'Page 14', 'filing_name': 'Planet...",when the UN’s Intergovernmental Panel on Clima...,0.314651,13
3,"{'Page_No.': 'Page 169', 'filing_name': 'Plane...","3. IPPC, Special Report on the Ocean and Cryos...",0.322288,168
4,"{'Page_No.': 'Page 42', 'filing_name': 'Planet...",11. These estimates are based on the carbon bu...,0.350454,41
5,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...","to the Antarctic ozone hole.3 By 1990, the US ...",0.359831,242
6,"{'filing_name': 'Planet_Earth', 'Page_No.': 'P...",larger emergent narrative. And what stands out...,0.376936,18
7,"{'Page_No.': 'Page 164', 'filing_name': 'Plane...",Sustainable development is development that me...,0.384532,163
8,"{'Page_No.': 'Page 170', 'filing_name': 'Plane...","18. United Nations, Report of the Conference o...",0.385031,169
9,"{'Page_No.': 'Page 41', 'filing_name': 'Planet...","are much more urgent now than in 1970 or 1992,...",0.393937,40


In [None]:
# Score each retrieved chunk against the current query (query3), not the earlier `query` variable
cross_inputs = [[query3, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)


results_df['Reranked_scores'] = cross_rerank_scores


top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 184', 'filing_name': 'Plane...",If it takes a village to ensure the well-being...,0.295294,183,-10.980108
1,"{'Page_No.': 'Page 86', 'filing_name': 'Planet...","12. IPCC, ‘Summary for policymakers’, in Globa...",0.311781,85,-11.326469
2,"{'Page_No.': 'Page 14', 'filing_name': 'Planet...",when the UN’s Intergovernmental Panel on Clima...,0.314651,13,-9.628475


In [None]:
# Keep the three chunks with the highest cross-encoder scores
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

# Join the selected chunks into a single context string for the prompt
retrieved_text = "\n\n".join(top_3_RAG['Documents'].tolist())


In [None]:
messages = [
    {"role":"system", "content":"You are an AI assistant to the user."},
    {"role":"user", "content":f"""{query3}. Please use these details to extract that info, don't use any other details:'{retrieved_text}' """},
]

In [None]:
response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages)
response.choices[0].message.content

'The first IPCC Assessment Report was released in 1990.'

In [None]:
query3

'When was the first IPCC report appeared'

In [None]:
response.choices[0].message.content

'The first IPCC Assessment Report was released in 1990.'

In [None]:
q3_result = {'question':query3, 'answer':response.choices[0].message.content}

In [None]:
q3_result

{'question': 'When was the first IPCC report appeared',
 'answer': 'The first IPCC Assessment Report was released in 1990.'}

**Query 3 Result:**

---



**Question:** When was the first IPCC report appeared?

**Answer:** The first IPCC Assessment Report was released in 1990.
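
Queries 2 and 3 repeat the same cells with only the query variable changed. One way to avoid the copy-and-paste, sketched under the assumption that each step can be passed in as a callable, is a single function that runs retrieval, reranking and generation for any query. `answer_query` and the three `fake_*` stand-ins are hypothetical and exist only so the sketch runs without ChromaDB, a cross-encoder, or an API key.

```python
def answer_query(query, retrieve, score_fn, generate, top_k=3):
    """Run retrieve -> rerank -> generate for one query and return a Q/A dict."""
    documents = retrieve(query)  # candidate chunks
    scores = score_fn([[query, doc] for doc in documents])  # one score per chunk
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    context = "\n\n".join(doc for doc, _ in ranked[:top_k])
    return {"question": query, "answer": generate(query, context)}


def fake_retrieve(query):
    """Stand-in for a vector-store lookup."""
    return ["alpha beta", "the first report appeared in the demo year", "gamma"]


def fake_score(pairs):
    """Stand-in scorer: count of words shared by query and document."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


def fake_generate(query, context):
    """Stand-in for the chat-completion call."""
    return f"(answer based on {len(context.split())} context words)"


demo = answer_query(
    "when did the first report appear",
    fake_retrieve, fake_score, fake_generate, top_k=1,
)
```

In the notebook, `retrieve` could wrap `planet_earth_collection.query(...)` and return `results['documents'][0]`, `score_fn` could be `cross_encoder.predict`, and `generate` could wrap the `openai.chat.completions.create` call.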

=============================== END OF NOTEBOOK ==========================================