**Author:** Carolina Gonçalves, carolina.goncalves@research.fchampalimaud.org

**Scope**: A simple and personalized use case of retrieval-augmented generation (RAG), which joins search methods with LLMs. The goal is to get a basic understanding of how systems like search-answer engines might work under the hood.


**First of all**, change runtime type to GPU, before you start. This will make the LLM answering process faster.

# (Just run):

1. Install packages
2. Import packages; access to google drive
3. All code needed for the rest of the notebook.

In [None]:
!pip install -U sentence-transformers
!pip install --upgrade transformers
!pip install -U bitsandbytes
!pip install pymupdf
!pip install rank_bm25



In [None]:
import fitz  # PyMuPDF: https://pymupdf.readthedocs.io/en/latest/tutorial.html#extracting-text-and-images
import pandas as pd
import os
import re
from IPython.display import Markdown, display
import numpy as np
import textwrap
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from rank_bm25 import BM25Okapi
import time
# Use a pipeline as a high-level helper
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import random
import torch

from google.colab import drive
# Connect to drive: this will enable you to save content in your drive and also load it from there
drive.mount('/content/drive', force_remount=True)

# Set a random seed
random_seed = 42
random.seed(random_seed)

# Set a random seed for PyTorch (for GPU as well)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

Mounted at /content/drive


In [None]:
## PDF Reader and Index

def save_results_to_txt(all_docs, pdf_names, folder_path, FILENAME):
  '''
  Function to save text retrieved from the PDF to a txt file

  folder_path (str): folder where you want to save it
  FILENAME (str): name of the file where you want to save the text
  all_docs (list of dicts): where the text of each of your documents is stored
  pdf_names (list of str): list of your documents names (as they are stored in your drive)
  '''
  with open(folder_path + FILENAME, "w") as f: #open file where you want to save your PDF
    for i, doc in enumerate(all_docs): #Go through all the documents, one at a time
      if bool(doc): # if text was retrieved from this doc
        f.write(f"{i} URL: {pdf_names[i]}\n")
        for title, text in doc.items(): # Go through each sub-section (title: paragraphs)
          f.write(f"{title}:\n") #Writing the title
          for paragraph in text:
            f.write(paragraph + "\t") #Writing each paragraph
          f.write("\n")
        f.write("\n\n")
  # no explicit close needed: the "with" block closes the file

def contains_titles_to_ignore(paragraph, keywords = ["content", "index", "appendix", "figures", "contacts"]):
  ''' This checks if any of those keywords is present in the given paragraph.
  The goal here is to detect pages that have the "Appendix", "Index of Contents",...
  and ignore them. '''
  return any(keyword in paragraph.strip().lower() for keyword in keywords)


def extract_pdf_content(pdf_file_path, max_title_size=20, max_text_size=12, min_text_size=9):
  '''
  Extracts and parses text from a pdf_file.
  Details being IGNORED:
      page numeration;
      pages with big titles (with font_size > max_title_size) that have the words "Appendix", "Index of Contents",...;
      headers and foot notes (font_size < min_text_size);
      images;
      bigger titles (font_size > max_title_size);
      lines that just have numbers or weird symbols and no text.

  pdf_file_path (str): path to the pdf file
  max_title_size (int): maximum size allowed for a title to be retrieved/stored.
                      if font_size >= max_title_size, text will be ignored.
  max_text_size (int): maximum size allowed for a paragraph.
                  if font_size > max_text_size, text will be considered a title.
  min_text_size (int): minimum size allowed for a paragraph to be retrieved/stored.
                  if font_size <= min_text_size, text will be ignored.

  returns content_list (dict): a dictionary with the pdf's text organized by title and paragraphs as follows
          content_list = {
            "Title 1": ["Paragraph 1", "Paragraph 2", ...],
            "Title 2": ["Paragraph 1", "Paragraph 2", ...],
            ...
          }
  '''
  doc = fitz.open(pdf_file_path) # Open the PDF file

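  content_list, all_titles = {}, [] # Initialize an empty dict for the content and an empty list for the titles
  backup_text = ("", float("inf")) # last ignored block as (text, font_size); the infinite
                # size means "nothing to restore" until a block is actually ignored
  bold_flag=8   # threshold on the span flags from the PyMuPDF package: bold is bit 4
                # (value 16), so a bold span has flag > 8. (DON'T NEED TO CHANGE THIS)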

  # Iterate over each page and extract text
  for page_num in range(len(doc)):
      page = doc.load_page(page_num)  # Load the page
      text = page.get_text("dict")['blocks'] # get text blocks (= paragraphs).

      for block in text: # Iterate over blocks of text
          if 'lines' in block:  # Check if the block contains text lines
              paragraph = ""
              font_size, flag, count = 0, 0, 0
              for line in block['lines']: # get all lines inside a block
                  for span in line['spans']: # get all "sub-lines"
                      if len(span['text'].strip()) > 1: # Ignore empty spans
                        saved_span = span
                        paragraph += span['text']  # Get the actual text
                        font_size += span['size']  # font size for each span
                        flag += span["flags"] # flag of each span just to know if
                                    # we are dealing with some kind of sub-title
                                    # if so, all text in this block would have an
                                    # average high flag (meaning, for e.g., bold + superscript)
                        count += 1

              if len(paragraph.strip())==0: # if this paragraph == empty spaces
                continue # continue to other block that actually has text

              else:
                font_size /= count # compute the average font_size for the paragraph
                flag /= count

                # ignoring FULL page with index of contents, figures, appendix,...
                if font_size > max_text_size and contains_titles_to_ignore(paragraph):
                  break #ignore page; jumps back in the code to go to next page_num

                # Ignoring headers and foot notes based on font_size
                if font_size < max_title_size and font_size > min_text_size:
                  # Get titles: font_size > max_text_size or bold (flag>8) with a slightly lower font_size
                  if font_size > max_text_size or (font_size >= (max_text_size-2) and flag > bold_flag):
                    # paragraph is a title

                    # CORRECT FOR MISTAKES: (before adding a new title)
                    # if we previously just added a title, but has no paragraphs
                    if bool(content_list) and len(content_list[all_titles[-1]]) == 0:
                      # check if we mistakenly ignored some paragraph that is not
                      # big enough to be a title and add it back
                      if backup_text[1] <= max_text_size:
                        # correct for possible missing relevant text because of variations in font size
                        content_list[all_titles[-1]].append(backup_text[0])
                      else: # or if it's just an isolated (sub)title, will delete it
                        content_list.pop(all_titles[-1])

                    # now, we can start the new section of the doc with the new title
                    content_list[paragraph.strip()] = [] # paragraph here is the title
                    all_titles.append(paragraph.strip())

                  # Get normal text paragraphs
                  else:
                    if not bool(content_list): # if the pdf does not start with a title
                      content_list["First paragraph"] = [] # add a fake title
                      all_titles.append("First paragraph") # a string, so it can be used as a dict key

                    # Check if the first letter of the text is capital letter
                    first_letter = next((char for i, char in enumerate(paragraph) if char.isalpha()), -1)
                    if first_letter == -1: # couldn't find any letter
                      continue # ignore and continue to other block that actually has text

                    elif first_letter.isupper(): #new paragraph if it starts with capital letter
                      # append new paragraph to previous title; remove bulletpoint signal, if it's there one
                      content_list[all_titles[-1]].append(paragraph[1:].strip() if paragraph.startswith('•') else paragraph.strip())

                    else: # if it starts with lower case, it's the continuation of the
                      # previous paragraph (e.g: when same paragraph extends across pages)
                      if len(content_list[all_titles[-1]]) == 0:
                        # in case this is the first paragraph in a title
                        content_list[all_titles[-1]] = [""]
                      content_list[all_titles[-1]][-1] += " " + (paragraph[1:] if paragraph.startswith('•') else paragraph.strip())

                else:
                  # CORRECT FOR MISTAKES: keep record of the ignored text for
                  # one iteration longer, in case this might be normal paragraphs following
                  # some sub-title or box of highlights that got ignored, because it has
                  # a smaller font_size than 9
                  backup_text = (paragraph, font_size)

  doc.close() # Close the PDF file
  return content_list

def scrape_pdfs(folder_path, pdfs_list, max_title_size=20, max_text_size=12, min_text_size=9):
  '''
  Gets text from multiple pdf files, as long as they are in the same folder.

  folder_path (str): path to folder with all pdfs
  pdfs_list (list of str): list of pdfs' filenames to be retrieved
  max_title_size (int): maximum size allowed for a title to be retrieved/stored.
                      if font_size >= max_title_size, text will be ignored.
  max_text_size (int): maximum size allowed for a paragraph.
                  if font_size > max_text_size, text will be considered a title.
  min_text_size (int): minimum size allowed for a paragraph to be retrieved/stored.
                  if font_size <= min_text_size, text will be ignored.

  returns all_docs_text (list of dicts): a list with a dictionary per pdf
         organized by title and paragraphs as follows
          dict_pdf_0 = {
            "Title 1": ["Paragraph 1", "Paragraph 2", ...],
            "Title 2": ["Paragraph 1", "Paragraph 2", ...],
            ...
          }
  '''
  all_docs_text = []
  for idx, file_name in enumerate(pdfs_list):
    assert file_name.endswith(".pdf"), "File must be a pdf (end with '.pdf')"
    print(f"Fetching text from pdf {idx}...")
    text = extract_pdf_content(os.path.join(folder_path, file_name),
                               max_title_size, max_text_size, min_text_size)
    all_docs_text.append(text)

  return all_docs_text

def extract_font_info(folder_path, file_name, n_pages=6):
  '''
  Same logic as function "extract_pdf_content", but only serves
  to check the font_sizes of some pages (n_pages) and use it to adjust the
  parameters of that function, if necessary.
  '''
  doc = fitz.open(os.path.join(folder_path, file_name))

  for page_num in range(min(n_pages, len(doc))): # only inspect the first n_pages pages
      page = doc.load_page(page_num)

      # Extract all text information including fonts and sizes
      text_instances = page.get_text("dict")["blocks"]

      for block in text_instances:
        if 'lines' in block:
          for line in block.get("lines", []):
              for span in line.get("spans", []):
                  text = span.get("text")
                  font_size = span.get("size")
                  font_name = span.get("font")
                  print(f"Page {page_num + 1}: '{text}' \n-> Font: {font_name}, Size: {font_size}\n")

  doc.close() # Close the PDF file

def show_your_index(all_docs, folder_path):
  '''
  Simple index to know where to find each title and paragraph
  (to which document/link in our list of pdfs/search results does it belong to)
  '''
  all_titles = {"Titles": [], "Doc_IDs": []}
  all_para = {"Paragraphs": [], "Doc_IDs": [], "Titles": []}
  for id, doc in enumerate(all_docs):
    if bool(doc):
      for title, text in doc.items():
        all_titles["Titles"].append(title)
        all_titles["Doc_IDs"].append(id)
        for paragraph in text:
          all_para["Paragraphs"].append(paragraph)
          all_para["Doc_IDs"].append(id)
          all_para["Titles"].append(title)

  df = pd.DataFrame.from_dict(all_titles)
  df_para = pd.DataFrame.from_dict(all_para)

  df.to_csv(folder_path + "titles_index.csv")
  df_para.to_csv(folder_path + "paragraphs_index.csv")
  return df, df_para






## Pre-processing and ranking w/ BM25

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))  # NLTK's English stop words
print(stop_words)

porter_stemmer = PorterStemmer()

def preprocess(text, stem=False, print_tokens=False):
  # Tokenization and lowercasing; also ignore punctuation
  tokens = re.findall(r"\w+", text.lower()) #separating words

  # Remove noisy terms including stopwords from indexing
  filtered_tokens = [
      token
      for token in tokens
      if token not in stop_words
  ]
  # Reducing words to their stem (a rough canonical form)
  if stem:
    filtered_tokens = [
        porter_stemmer.stem(token) for token in filtered_tokens
    ]
  if print_tokens:
    print(filtered_tokens)
  return filtered_tokens

def search_wBM25(query, processed_docs, original_docs, stem=False, print_keywords=False, k=1.2, b=0.75, N=3):
    '''
    Implements the BM25 algorithm for ranking documents based on their relevance to a given query.
    k and b are hyperparameters that control term frequency saturation and document length normalization, respectively.
    Returns the top-N most relevant documents to the query, ordered by relevance (BM25 score).
    '''
    tini = time.time()
    # Initialize BM25 with the tokenized and pre-processed corpus
    # IDF will be computed based on this corpus
    bm25 = BM25Okapi(processed_docs, k1=k, b=b)
    # Pre-process the query the same way as processed_docs (stem=True if the docs were stemmed)
    # Then, use BM25 to score each document based on its similarity to the query
    bm25_scores = bm25.get_scores(preprocess(query, stem=stem, print_tokens=print_keywords))
    N = min(N, len(bm25_scores)) # np.argpartition fails if N is larger than the number of docs
    top_n = np.argpartition(bm25_scores, -N)[-N:] #select the top N most matched docs to the query
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    if len(bm25_hits)>1:
      bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True) #sorting the passages by their score
    tout = time.time()

    print(f"Total time spent searching: {tout-tini}s")
    return bm25_hits[:N]

def ranking(question, docs, N=3):
  tokenized_text = [preprocess(t, stem=True) for t in docs]
  hits = search_wBM25(question, tokenized_text, docs, stem=True,
                      print_keywords=False, k=1.2, b=0.75, N=N)

  matches = []
  for result in hits:
    if result["score"] > 0:
      t = docs[result['corpus_id']]
      print("{:.3f}\t{}\n".format(result['score'], t))
      matches.append(t)

  if len(matches) == 0:
    print("No matches found with score > 0")
  return matches


## Manual Selection

def select_full_docs(all_docs_pdf, doc_ids):
  selected_docs = []
  for i in doc_ids:
    plain_text = []
    for title, text in all_docs_pdf[i].items():
      plain_text.append(title)
      for para in text:
        plain_text.append(para)
    selected_docs.extend(plain_text)
  return selected_docs

def select_by_title(all_docs_pdf, titles, title_index):
  selected_titles_text = []
  for title in titles:
    row = title_index[title_index['Titles'] == title]
    assert not row.empty, "You gave a title that does not exist"
    doc_ids = row["Doc_IDs"].values
    for id in doc_ids:
      selected_titles_text.extend(all_docs_pdf[id][title])
  return selected_titles_text



## Choose how to select

def estimate_tokens(text):
  words = re.findall(r"\w+", text) #separating words
  # Rule of thumb: 100 tokens ~= 75 words
  n_tokens = len(words) * 100/75
  return n_tokens

def select(all_docs_pdf, to_select, title_index, para_index,
           mode="manually", by="doc", N=3, question=None):
  assert mode in ["manually", "ranking"], "Mode has to be one of: 'manually' or 'ranking'"
  if mode == "manually":
    assert by in ["doc", "title", "paragraph"], "By has to be one of: 'doc' or 'title' or 'paragraph'"
    assert len(to_select) > 0, "You need to select at least one document/title/paragraph"
  else:
    assert by in ["title", "paragraph"], "For ranking, By has to be one of: 'title' or 'paragraph'"
    assert question is not None, "You need to provide a question to rank."

  if mode=="manually":
    if by=="doc":
      selected_text = select_full_docs(all_docs_pdf, to_select)

    elif by=="title":
      selected_text = select_by_title(all_docs_pdf, to_select, title_index)

    elif by=="paragraph":
      selected_text = list(para_index.loc[to_select, :]['Paragraphs']) # use the function argument, not the global variable

  elif mode=="ranking":
    if by=="title":
      selected_titles = ranking(question, list(title_index["Titles"]), N)
      if len(selected_titles) > 0: #get the paragraphs for the selected titles
        selected_text = select_by_title(all_docs_pdf, selected_titles, title_index)
      else:
        selected_text = []
    elif by=="paragraph":
      selected_text = ranking(question, list(para_index["Paragraphs"]), N)

  if len(selected_text) == 0:
    print("No matches found")
  else:
    total_no_tokens = 0
    print(" Selected text: \n")
    for t in selected_text:
      total_no_tokens += estimate_tokens(t)
      print(textwrap.fill(t, width=100))
      print("\n")
    # only report the token estimate when some text was actually selected
    print(f"\n Estimated number of tokens: {total_no_tokens}")
  return selected_text





## LOAD LLM model

def load_LLM_model(model_id, token):
  # Configure quantization
  # We'll also use [*quantization*]
  # (https://huggingface.co/docs/optimum/concept_guides/quantization)
  # to reduce the memory the model needs, so that it fits more easily in colab's GPU.
  quant_config = BitsAndBytesConfig(
      load_in_8bit=True,  # Enable 8-bit quantization
      llm_int8_threshold=6.0,  # Optional: outlier threshold (6.0 is the library default).
                              # Hidden-state values above it are treated as outliers and
                              # computed in fp16 instead of int8.
  )

  # Load the model with quantization
  model = AutoModelForCausalLM.from_pretrained(model_id,
                                              token=token,
                                              device_map="auto",
                                              quantization_config=quant_config)

  #https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html
  pipe = pipeline("text-generation",
                  model=model,
                  tokenizer=model_id,
                  token=token,
                  )
  return pipe

## Write prompt and generate answer with LLM

def write_prompt(instructions, context=[], question=""):
  if len(context)>0:
    results_text = ""
    for p_id, content in enumerate(context):
      results_text += f'''[{p_id}] {content}\n'''

    if question == "":
      prompt_template = f'''{instructions}
      Web search results:\n{results_text}'''
    else:
        prompt_template = f'''{instructions}
        Web search results:\n{results_text}
        Query: {question}'''

  else: #no context given
    if question == "":
      prompt_template = f'''{instructions}'''
    else:
      prompt_template = f'''{instructions}
      Query: {question}'''

  wrapped_text = textwrap.fill(prompt_template, width=150)
  print(f"FULL PROMPT:\n {wrapped_text}\n")

  print(f" ******* ESTIMATED NUMBER OF TOKENS: {estimate_tokens(prompt_template)}")
  return prompt_template

def generative_response(llm_pipe, prompt, temperature=1,
                        max_length=8000, instruction_tunned=True):
  if instruction_tunned:
    messages = [
      {"role": "user", "content": prompt},
    ]
  else:
    # a foundational (base) model won't need or accept the messages format above
    messages = prompt
  # Generate text using the LLM
  pipeline_output = llm_pipe(messages,
                        temperature=temperature, # only has an effect when sampling (do_sample=True) is enabled
                        max_length=max_length, # Maximum total number of tokens (prompt + generated output)
                        #num_return_sequences=1,
                        truncation=True,
                        )
  generated = pipeline_output[0]["generated_text"]
  if instruction_tunned:
    return generated[-1]["content"] # chat format: the last message is the model's reply
  return generated # plain-text format: a single string

{'he', 'him', 'after', "you're", "it's", 'between', 'will', 'll', 'only', 'what', 'doesn', 'yourself', 'haven', "haven't", 'she', 'm', 'having', "you'd", 'then', 'been', 'an', "should've", 'needn', 'this', 'or', 'any', 'out', 'but', 'above', 'again', 'has', 'very', "mightn't", 'because', 'of', 'does', 'aren', 'these', 'was', 'it', 'that', 'o', 'have', 'further', 'no', 'them', 'd', 'and', 'wasn', 'why', 'their', 'on', 'each', 'is', "don't", "wasn't", 'not', 'other', 'did', 'how', 'if', 'we', 'just', 'my', 'hadn', 'shouldn', 'who', 'where', 'once', 'against', "doesn't", 'during', 'both', 'won', 'over', 'until', 'should', "mustn't", 'had', 'below', 'when', 'yourselves', "isn't", 'were', 've', "that'll", 'shan', 'most', 'at', 'about', 'weren', "needn't", 'mightn', 'those', 'into', "shan't", 'couldn', 'up', 'in', 'off', 'before', 'me', "didn't", "hadn't", 'ours', 'with', 'nor', 'be', 'to', 'doing', 'more', 'ma', "won't", 'itself', 'all', 'herself', "she's", 'through', 'too', "aren't", 'them

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Use it here!

Before starting: **CHANGE the second part of *folder_path*** (after "+") to the location in your drive where this notebook is.
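
If you are not sure the path is right, a quick check like the one below can help. This is only a sketch using the standard library: `list_pdfs` is a hypothetical helper that is not part of this notebook, and the temporary folder simply stands in for your Drive folder.

```python
import os
import tempfile

def list_pdfs(folder_path):
    """Return the sorted names of the '.pdf' files in folder_path (hypothetical helper)."""
    if not os.path.isdir(folder_path):
        raise FileNotFoundError(f"Folder not found: {folder_path}")
    return sorted(name for name in os.listdir(folder_path) if name.endswith(".pdf"))

# Demonstration on a temporary folder standing in for the Drive folder
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "pdf1.pdf"), "w").close()
    open(os.path.join(demo_folder, "notes.txt"), "w").close()
    print(list_pdfs(demo_folder))
```

In the notebook you would call `list_pdfs(folder_path)` after mounting the drive. A `FileNotFoundError` would suggest the second part of *folder_path* still needs to be changed.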

In [None]:
# set the folder where the data and notebooks are inside your drive
folder_path = "/content/drive/" + "Shareddrives/AI_Law_Course_2024/Content/"
print(f"Your notebook and files should be in this folder: {folder_path}")

Your notebook and files should be in this folder: /content/drive/Shareddrives/AI_Law_Course_2024/Content/


**CHANGE ALSO** the Runtime of your notebook to GPU if you want to use LLMs (it will be faster to run them).


Now, you can start reading your pdfs.

## 1. Get your pdfs:

* Upload the pdfs you want to use in this drive's folder (*folder_path*).
* Then write the names of each pdf file to be searched through, like this:

> pdfs_list = ["pdf1.pdf", "pdf2.pdf"]

In [None]:
pdfs_list = ["EU_QA.pdf"] # WRITE here your pdf file names
all_docs_pdf = scrape_pdfs(folder_path, pdfs_list,
                           max_title_size=20, max_text_size=12,
                           min_text_size=9)

# Save pdf texts to a txt file
FILENAME_PDF = "retrieved_docs_pdf.txt"
save_results_to_txt(all_docs_pdf, pdfs_list, folder_path, FILENAME_PDF)

Fetching text from pdf 0...


* These are the rest of the parameters for the function:
  - folder_path (str): path to folder with all pdfs
  - max_title_size (int): maximum size allowed for a title to be retrieved/stored.
                      if font_size >= max_title_size, text will be ignored.
  - max_text_size (int): maximum size allowed for a paragraph.
                  if font_size > max_text_size, text will be considered a title.
  - min_text_size (int): minimum size allowed for a paragraph to be retrieved/stored.
                  if font_size <= min_text_size, text will be ignored.
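
To see how these three thresholds interact, here is a small sketch of the decision applied to each text block. `classify_block` is a hypothetical helper written for illustration. It mirrors the comparisons in `extract_pdf_content`, but leaves out the rest of that function, such as the page skipping and the backup of ignored text.

```python
def classify_block(font_size, flag=0, max_title_size=20, max_text_size=12,
                   min_text_size=9, bold_flag=8):
    """Label a block as 'ignored', 'title' or 'paragraph' from its average font size and flag."""
    if not (min_text_size < font_size < max_title_size):
        return "ignored"    # too small (headers, footnotes) or too big (cover titles)
    if font_size > max_text_size or (font_size >= max_text_size - 2 and flag > bold_flag):
        return "title"      # larger than normal text, or bold and almost as large
    return "paragraph"

# (font_size, flag) pairs chosen for illustration
for size, flag in [(8.04, 0), (11.04, 0), (11.04, 16), (14.04, 0), (21.96, 16)]:
    print(size, flag, "->", classify_block(size, flag))
```

If text you care about comes out as "ignored", lowering *min_text_size* or raising *max_title_size* is the first thing to try.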



**If there's any text that is being OMITTED in one pdf**, you can run the following cell to check the font_size of some pages of that pdf and alter the values above. Otherwise, just ignore!

In [None]:
pdf_name = pdfs_list[0]
extract_font_info(folder_path, pdf_name, n_pages=6)

Page 1: ' ' 
-> Font: Verdana-Italic, Size: 8.039999961853027

Page 1: ' ' 
-> Font: TimesNewRomanPSMT, Size: 11.039999961853027

Page 1: ' ' 
-> Font: TimesNewRomanPSMT, Size: 0.9599999785423279

Page 1: ' ' 
-> Font: TimesNewRomanPSMT, Size: 12.0

Page 1: ' ' 
-> Font: TimesNewRomanPSMT, Size: 12.0

Page 1: 'The European Union: Questions and Answers ' 
-> Font: PalatinoLinotype-Bold, Size: 21.959999084472656

Page 1: 'Updated January 26, 2024 ' 
-> Font: PalatinoLinotype-Roman, Size: 14.039999961853027

Page 1: 'Congressional Research Service ' 
-> Font: PalatinoLinotype-Bold, Size: 9.960000038146973

Page 1: 'https://crsreports.congress.gov ' 
-> Font: PalatinoLinotype-Roman, Size: 9.960000038146973

Page 1: 'RS21372 ' 
-> Font: PalatinoLinotype-Roman, Size: 9.960000038146973

Page 2: ' ' 
-> Font: TimesNewRomanPSMT, Size: 12.0

Page 2: 'Congressional Research Service ' 
-> Font: ArialMT, Size: 8.039999961853027

Page 2: ' ' 
-> Font: ArialMT, Size: 8.039999961853027

Page 2: 'SUMMA

1.1 You can create your **index**: check all your titles and paragraphs here and their respective pdf number.
> This is being saved in your folder as well!
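
The index is essentially a flattened view of the nested structure returned by `scrape_pdfs` (one dictionary per pdf, mapping each title to its list of paragraphs). The sketch below shows that flattening with plain Python on made-up data. `flatten_docs` and `toy_docs` are hypothetical names used only here, and the real function builds pandas DataFrames instead of a list of rows.

```python
def flatten_docs(all_docs):
    """Turn [{title: [paragraphs]}, ...] into one row per paragraph."""
    rows = []
    for doc_id, doc in enumerate(all_docs):
        for title, paragraphs in doc.items():
            for paragraph in paragraphs:
                rows.append({"Doc_ID": doc_id, "Title": title, "Paragraph": paragraph})
    return rows

# Made-up content standing in for the text extracted from two pdfs
toy_docs = [
    {"Title A": ["First paragraph.", "Second paragraph."]},
    {"Title B": ["Another paragraph."]},
]

for row_number, row in enumerate(flatten_docs(toy_docs)):
    print(row_number, row)
```

The row number plays the role of the paragraph index you give in *to_select* when selecting by paragraph.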

In [None]:
title_index, paragraph_index = show_your_index(all_docs_pdf, folder_path)

In [None]:
title_index # after running this, you can click in your little calculator for a better visualization

Unnamed: 0,Titles,Doc_IDs
0,"Updated January 26, 2024",0
1,SUMMARY,0
2,RS21372,0
3,What Is the European Union?,0
4,How Does the EU Work?,0
5,How Is the EU Governed?,0
6,What Is the Lisbon Treaty?,0
7,Key EU Positions and Current Leaders,0
8,What Are the Euro and the Eurozone?,0
9,Why and How Is the EU Enlarging?,0


In [None]:
paragraph_index

Unnamed: 0,Paragraphs,Doc_IDs,Titles
0,Congressional Research Service https://crsrepo...,0,"Updated January 26, 2024"
1,RS21372,0,"Updated January 26, 2024"
2,The European Union (EU) is a political and eco...,0,SUMMARY
3,How the EU Works The EU has been built through...,0,SUMMARY
4,Challenges Facing the EU The EU is generally c...,0,SUMMARY
...,...,...,...
91,Managing relations with China also may test U....,0,What Is the Current State of U.S.-EU Relations?
92,"Additionally, both the United States and the E...",0,What Is the Current State of U.S.-EU Relations?
93,Israel-Hamas cease fire and on managing the co...,0,What Is the Current State of U.S.-EU Relations?
94,"Moreover, EU concerns exist about ongoing U.S....",0,What Is the Current State of U.S.-EU Relations?




---



## 2. **Select** which text you want to give to your LLM.

  * mode = "manually":
    - by == "doc": You can select full documents by giving their doc_ids.
    - by == "title": You can select the text of certain titles, by giving the titles (as they appear in *title_index*).
    - by == "paragraph": You can select specific paragraphs, by giving their respective *index* number (column *index* in *paragraph_index*).

    > "manually" mode needs a *to_select* to be defined (with the ids, titles or index number to collect)

  * mode = "ranking": (only keyword-search w/ BM25)
    - by == "title": automatically select the most relevant text based on the relevance of the titles with respect to your question. Returns the text of the titles that best matched your question.
    - by == "paragraph": automatically select the most relevant text based on the relevance of the paragraphs with respect to your question. Returns the individual paragraphs that best matched your question.

    > "ranking" mode needs a question to be defined, along with the number of results you want to retrieve (e.g: N=3, will retrieve the top 3)


  
  **DON'T USE RANKING WHEN** you have very few titles or paragraphs. BM25 will not work that well with a small dataset.


  > This is because BM25 also relies on the frequency of each word across all documents (here, titles or paragraphs) to define its importance. If a word appears in all of them, it is treated as a general, non-relevant word. Thus, if you only have one title/paragraph in your pdf(s), all words will be considered irrelevant and the scores will be negative ("No matches found").
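
To make this concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not the `rank_bm25` implementation used in this notebook, which may treat negative IDF values differently. The stop-word list and the example passages are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's full list)
STOP_WORDS = {"what", "is", "the", "of", "a", "into"}

def tokenize(text):
    """Lowercase, split into words and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]

def bm25_idf(n_docs, doc_freq):
    """Textbook BM25 IDF of a term that appears in doc_freq out of n_docs documents."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # the term is not in this document
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += bm25_idf(len(docs), doc_freq[term]) * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "The Schengen area removes internal border controls.",
    "The euro is the common currency of the eurozone.",
    "The European Parliament is directly elected.",
    "Enlargement brings new member states into the EU.",
]
question = "What is the Schengen area?"
print(bm25_scores(question, docs))      # only the first passage scores above 0
print(bm25_scores(question, docs[:1]))  # a single passage: its score is negative
```

With four passages, "schengen" and "area" are rare, so their IDF is positive and the first passage wins. With only one passage, every word appears in 100% of the corpus, the IDF becomes negative and nothing passes the `score > 0` filter.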

In [None]:
to_select = [0, 1, 2]
mode = "manually"
by = "paragraph"
selected_text = select(all_docs_pdf, to_select, title_index, paragraph_index, mode=mode, by=by)

 Selected text: 

Congressional Research Service https://crsreports.congress.gov


RS21372


The European Union (EU) is a political and economic partnership that represents a unique form of
cooperation among sovereign countries. The EU is the latest stage in a process of integration begun
after World War II, initially by six Western European countries, to foster interdependence and make
another war in Europe unthinkable. The EU currently consists of 27 member states, including most of
the countries of Central and Eastern Europe, and has helped to promote peace, stability, and
economic prosperity throughout the European continent.



 Estimated number of tokens: 122.66666666666667


In [None]:
to_select = []
mode = "ranking"
by = "paragraph"
question = "What is the Schengen area?"
selected_text = select(all_docs_pdf, to_select, title_index,
                       paragraph_index, mode=mode, by=by, question=question, N=3)

Total time spent searching: 0.004138946533203125s
6.600	The Schengen area of free movement currently encompasses 23 EU member states plus 4 non-EU countries.24 Within the Schengen area, internal border controls have been eliminated, and individuals may travel without passport checks among participating countries. In effect, Schengen participants share a common external border where immigration checks for individuals entering or leaving the Schengen area are carried out. The Schengen area is founded upon the Schengen Agreement of 1985 (Schengen is the town in Luxembourg where the agreement was signed, originally by five countries). In 1999, the Schengen Agreement was incorporated into EU law. The Schengen Borders Code comprises a detailed set of rules governing both external and internal border controls in the Schengen area, including common rules on visas, asylum requests, and border checks. Participating countries may reintroduce internal border controls for a limited period of time i



---





## 3. Use the LLM to do whatever you want.

* Choose from Hugging Face which LLM you want (model_id), create your token and load the model.

**IMPORTANT:** Please pay attention to the maximum number of tokens your model allows (this should be written in the documentation of the model).

> For example, here I chose to use one of the *gemma-2* models from Hugging Face. According to the [model's documentation](https://huggingface.co/docs/transformers/main/en/model_doc/gemma2), the maximum sequence length that this model might ever be used with is 8192 tokens.
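
A rough way to check a prompt against that limit is sketched below. It reuses the rule of thumb from `estimate_tokens` (100 tokens for every 75 words), so it is only an approximation and the model's own tokenizer may count differently. `fits_context` and the 512-token answer budget are assumptions made for this example.

```python
import re

def estimate_tokens(text):
    """Rule of thumb used in this notebook: 100 tokens ~= 75 words."""
    return len(re.findall(r"\w+", text)) * 100 / 75

def fits_context(prompt, max_new_tokens=512, context_limit=8192):
    """True if the estimated prompt tokens plus the answer budget stay within the limit."""
    return estimate_tokens(prompt) + max_new_tokens <= context_limit

short_prompt = "What is the Schengen area?"
long_prompt = "word " * 7000  # about 9333 estimated tokens

print(estimate_tokens(short_prompt), fits_context(short_prompt))
print(estimate_tokens(long_prompt), fits_context(long_prompt))
```

If the check fails, select fewer paragraphs or lower N in the ranking step before building the prompt.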


In [None]:
token = 'your_token' #you'll need to get your own token from hugging face
model_id = "google/gemma-2-2b-it" # CHANGE this if you want to use another model

LLM = load_LLM_model(model_id, token)


* Write your own prompt instructions or use one of the ones below. Then you can select the one you want by the name you gave it:
  *prompt_instructions[name_of_prompt]*

In [None]:
prompt_instructions = {
    # "name_of_prompt":
    # ''' Prompt instructions '''

    "question_answering":
    '''Instructions: Using the provided web search results, write a comprehensive reply to the given query.
    Make sure to cite results using [number] notation after the reference.
    If the provided search results refer to multiple subjects with the same name, write separate answers for each subject.
    Be concise and to the point.'''
    ,

    "generate_queries":
    '''Instructions: What are some queries that should find the following results? List at least 5 unique queries, where at least one of these documents would be better than the others in a question-answering dataset.
    Be concise. Output only the list of queries and the corresponding results numbers that would be retrieved by each query, in the format "query - [n]".'''
    ,

    "summarize":
    '''

    '''
    ,

    }
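
Because the "summarize" entry above is left blank for you to fill in, it can be useful to check that the selected prompt exists and is not empty before sending it to the model. The sketch below shows one way to do that. `get_instructions` and `demo_prompts` are hypothetical names used only for this illustration.

```python
def get_instructions(prompts, name):
    """Return the cleaned instructions stored under `name`.

    Raises KeyError if the name is unknown and ValueError if the
    instructions are empty or contain only whitespace.
    """
    if name not in prompts:
        raise KeyError(f"Unknown prompt name: {name!r}. Available: {sorted(prompts)}")
    # Strip the indentation that triple-quoted strings pick up from the code
    lines = [line.strip() for line in prompts[name].splitlines()]
    text = "\n".join(line for line in lines if line)
    if not text:
        raise ValueError(f"Prompt {name!r} is empty; write its instructions first.")
    return text


# Small stand-in dictionary with the same shape as prompt_instructions
demo_prompts = {
    "question_answering": '''Instructions: Answer the query.
    Be concise and to the point.''',
    "summarize": '''

    ''',
}

print(get_instructions(demo_prompts, "question_answering"))

try:
    get_instructions(demo_prompts, "summarize")
except ValueError as err:
    print("Caught:", err)
```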

* Check the full prompt and the estimated number of tokens to confirm that everything is okay (the total number of tokens should be lower than the maximum allowed by the model you chose).

In [None]:
# Write a question here if you want to ask the model one.
# Don't write anything if you just want the model to use
# the selected_text and your instructions.
question = "What is the Schengen area?"

full_prompt = write_prompt(instructions=prompt_instructions["question_answering"],
                          context=selected_text,
                          question=question)

FULL PROMPT:
 Instructions: Using the provided web search results, write a comprehensive reply to the given query.     Make sure to cite results using [number]
notation after the reference.     If the provided search results refer to multiple subjects with the same name, write separate answers for each
subject.     Be concise and to the point.         Web search results: [0] The Schengen area of free movement currently encompasses 23 EU member states
plus 4 non-EU countries.24 Within the Schengen area, internal border controls have been eliminated, and individuals may travel without passport checks
among participating countries. In effect, Schengen participants share a common external border where immigration checks for individuals entering or
leaving the Schengen area are carried out. The Schengen area is founded upon the Schengen Agreement of 1985 (Schengen is the town in Luxembourg where
the agreement was signed, originally by five countries). In 1999, the Schengen Agreement was inc

* If everything is okay, you can now generate the answer to your prompt. You can select the temperature and the maximum number of tokens (*max_length*) you want the model to return.
  - temperature > 1: more creative responses (flattens the token probabilities, so unlikely tokens are sampled more often)
  - temperature == 1: regular sampling (uses the model's unmodified probabilities)
  - temperature == 0: the model always answers with the most likely token.
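
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax on three made-up logits. The function `softmax_with_temperature` and the values in `toy_logits` are illustrative assumptions; this is the standard formulation, not necessarily how `generative_response` handles temperature internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        # temperature == 0 is the greedy limit: pick the argmax instead of sampling
        raise ValueError("temperature must be > 0 for sampling")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature approaches 0, almost all probability mass moves to the top-scoring token. As it grows above 1, the distribution flattens and less likely tokens are drawn more often.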

In [None]:
answer = generative_response(LLM, full_prompt, temperature=1,
                        max_length=8000, instruction_tunned=True)
display(Markdown(answer))

The Schengen area is a zone within the European Union (EU) where internal border controls have been eliminated. This allows for free movement of people between participating countries.  The Schengen area comprises 23 EU member states plus 4 non-EU countries.  [0]

The Schengen Agreement, signed in 1985 in Schengen, Luxembourg, established the framework for the area.  It was incorporated into EU law in 1999.  The Schengen Borders Code outlines rules governing both external and internal border controls, including visa requirements and asylum requests. [0]

Participating countries can temporarily reintroduce internal border controls in specific circumstances, such as security threats or major events. [0]

While Bulgaria, Romania, and Cyprus are not yet full Schengen members, they are working towards meeting the necessary requirements.  In December 2023, EU member states approved partial entry for Bulgaria and Romania, with passport controls at internal air and sea borders to be removed from March 31, 2024.  [1] 

Ireland has an opt-out from the Schengen free movement area but participates in some aspects of the Schengen Agreement related to police and judicial cooperation. [1] 
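
Since the instructions ask the model to cite results with [number] notation, a quick sanity check is to confirm that every cited number corresponds to a result that was actually provided. The sketch below assumes zero-based numbering, as in the `[0]` label visible in the full prompt. `cited_indices`, `invalid_citations`, and `sample_answer` are hypothetical names for this illustration.

```python
import re


def cited_indices(answer):
    """Return the sorted, de-duplicated result numbers cited as [n] in the answer."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def invalid_citations(answer, n_results):
    """Return cited numbers that do not match any provided result.

    Assumes results are numbered from 0 to n_results - 1.
    """
    return [i for i in cited_indices(answer) if i >= n_results]


sample_answer = (
    "The Schengen area removes internal border controls. [0] "
    "Ireland has an opt-out. [1]"
)

print("cited:", cited_indices(sample_answer))
print("invalid:", invalid_citations(sample_answer, n_results=2))
```

A check like this only confirms that the citation numbers exist; it does not verify that the cited passage supports the claim.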


And now you have your small prototype of a search-answer engine!