# Structuring Your Data

## Structured Data in LLMs
Structured data in LLMs: the model generates JSON or XML output that conforms to a schema.<br>
A first attempt is to add instructions to the prompt on how to generate the output, e.g. the format and the schema.  
Usually, some example outputs are provided as well. This approach may appear effective, but it is unreliable:<br>
there is no guarantee that the model adheres to the specified format.<br>
<p/>
<b>Response format</b> Another technique is to pass the schema through the <code>response_format</code> argument.<br>
Hint: with structured output, use a low temperature, e.g. zero.<br>
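
To make the reliability problem concrete, here is a minimal sketch of the checks that prompt-only formatting leaves to the caller. The helper `find_problems` and the sample replies are hypothetical and written for illustration only; they are not model output.

```python
import json
from typing import List

# Fields and types expected for one book entry (mirrors the schema used in this notebook).
REQUIRED_FIELDS = {"title": str, "author": str, "yearPublished": int, "summary": str}

def find_problems(reply: str) -> List[str]:
    """Return a list of format problems found in a raw model reply."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")
    return problems

# Hypothetical replies: one clean, one with extra chatter, one with a string year.
good = '{"title": "Don Quixote", "author": "Miguel de Cervantes", "yearPublished": 1605, "summary": "A novel."}'
chatty = "Here is the JSON you asked for: " + good
wrong_type = good.replace("1605", '"1605"')

print(find_problems(good))        # []
print(find_problems(chatty))
print(find_problems(wrong_type))  # ['wrong type for yearPublished']
```

With a prompt-only approach, every reply would need a check of this kind before use.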

## Structured Data with OpenAI 

In [3]:
from huggingface_hub import hf_hub_download
from openai import OpenAI
from pydantic import BaseModel, TypeAdapter
from typing import List

import json
import os
import pprint

In [4]:
OPEN_AI_KEY_NAME='OPENAI_API_KEY'
assert OPEN_AI_KEY_NAME in os.environ

TAI_DATASET_ROOT_ENV_VAR='TAI_DATASET_ROOT'
assert TAI_DATASET_ROOT_ENV_VAR in os.environ

In [10]:
# The response format-JSON schema
response_format_json = {
  "type": "json_schema",
  "json_schema": {
    "name": "Top10BestSellingBooks",
    "strict": True,
    "schema": {
      "type": "object",
      "properties": {
        "Top10BestSellingBooks": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "author": { "type": "string" },
              "yearPublished": { "type": "integer" },
              "summary": { "type": "string" }
            },
            "required": ["title", "author", "yearPublished", "summary"],
            "additionalProperties": False
          }
        }
      },
      "required": ["Top10BestSellingBooks"],
      "additionalProperties": False
    }
  }
}

In [11]:
prompt = "Give me the names of the 10 best-selling books, their authors, the year they were published, and a concise summary in JSON format"

In [12]:
system_prompt = """
You are a helpful assistant designed to output information exclusively in JSON format.
### Example JSON Format
{
  "Top10BestSellingBooks": [
    {
      "title": "Book Title",
      "author": "Author Name",
      "yearPublished": "Year",
      "summary": "Brief summary of the book."
    }
  ]
}

"""

In [13]:
client = OpenAI()

In [39]:
response = client.chat.completions.create(
  model="gpt-4o",
  temperature = 0,
  response_format=response_format_json,
  messages=[
    {"role": "system", "content":system_prompt},
    {"role": "user", "content": prompt}
  ]
)

In [40]:
response.choices[0].message.content

'{"Top10BestSellingBooks":[{"title":"Don Quixote","author":"Miguel de Cervantes","yearPublished":1605,"summary":"A Spanish novel about the adventures of a nobleman who reads so many chivalric romances that he loses his sanity and decides to become a knight-errant."},{"title":"A Tale of Two Cities","author":"Charles Dickens","yearPublished":1859,"summary":"A historical novel set in London and Paris before and during the French Revolution, focusing on themes of resurrection and transformation."},{"title":"The Lord of the Rings","author":"J.R.R. Tolkien","yearPublished":1954,"summary":"An epic fantasy novel that follows the quest to destroy the One Ring and defeat the Dark Lord Sauron."},{"title":"The Little Prince","author":"Antoine de Saint-Exupéry","yearPublished":1943,"summary":"A philosophical tale about a young prince who travels from planet to planet, exploring themes of loneliness, friendship, and love."},{"title":"Harry Potter and the Philosopher\'s Stone","author":"J.K. Rowling"

In [41]:
print(response.choices[0].message.content)

{"Top10BestSellingBooks":[{"title":"Don Quixote","author":"Miguel de Cervantes","yearPublished":1605,"summary":"A Spanish novel about the adventures of a nobleman who reads so many chivalric romances that he loses his sanity and decides to become a knight-errant."},{"title":"A Tale of Two Cities","author":"Charles Dickens","yearPublished":1859,"summary":"A historical novel set in London and Paris before and during the French Revolution, focusing on themes of resurrection and transformation."},{"title":"The Lord of the Rings","author":"J.R.R. Tolkien","yearPublished":1954,"summary":"An epic fantasy novel that follows the quest to destroy the One Ring and defeat the Dark Lord Sauron."},{"title":"The Little Prince","author":"Antoine de Saint-Exupéry","yearPublished":1943,"summary":"A philosophical tale about a young prince who travels from planet to planet, exploring themes of loneliness, friendship, and love."},{"title":"Harry Potter and the Philosopher's Stone","author":"J.K. Rowling","

In [42]:
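# Parse the JSON string returned by the model into a Python dict.
json_response = response.choices[0].message.content
result_book = json.loads(json_response)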

In [43]:
pprint.pprint(result_book)

{'Top10BestSellingBooks': [{'author': 'Miguel de Cervantes',
                            'summary': 'A Spanish novel about the adventures '
                                       'of a nobleman who reads so many '
                                       'chivalric romances that he loses his '
                                       'sanity and decides to become a '
                                       'knight-errant.',
                            'title': 'Don Quixote',
                            'yearPublished': '1605'},
                           {'author': 'Charles Dickens',
                            'summary': 'A historical novel set in London and '
                                       'Paris before and during the French '
                                       'Revolution, focusing on themes of '
                                       'resurrection and transformation.',
                            'title': 'A Tale of Two Cities',
                            'yearPublished': '

In [44]:
pprint.pprint(result_book['Top10BestSellingBooks'][0])

{'author': 'Miguel de Cervantes',
 'summary': 'A Spanish novel about the adventures of a nobleman who reads so '
            'many chivalric romances that he loses his sanity and decides to '
            'become a knight-errant.',
 'title': 'Don Quixote',
 'yearPublished': '1605'}


### Using Pydantic to define the output schema

In [46]:
# Structured output schema.
class Book(BaseModel):
    title: str
    author: str
    yearPublished: int
    summary: str

class Top10BestSellingBooks(BaseModel):
    books: List[Book]

In [65]:
ta = TypeAdapter(List[Book])
schema = ta.json_schema(mode='validation')
schema

{'$defs': {'Book': {'properties': {'title': {'title': 'Title',
     'type': 'string'},
    'author': {'title': 'Author', 'type': 'string'},
    'yearPublished': {'title': 'Yearpublished', 'type': 'integer'},
    'summary': {'title': 'Summary', 'type': 'string'}},
   'required': ['title', 'author', 'yearPublished', 'summary'],
   'title': 'Book',
   'type': 'object'}},
 'items': {'$ref': '#/$defs/Book'},
 'type': 'array'}
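
The Pydantic models can also be used to build a response format and to validate a reply. The following is a minimal sketch, assuming Pydantic v2. The hand-written response format earlier in this notebook uses an object at the top level and sets `additionalProperties` to `False`, so the sketch starts from the wrapping `Top10BestSellingBooks` model and adds `ConfigDict(extra="forbid")`, which is an addition of this sketch. The name `response_format_from_model` and the sample reply are hypothetical, no API call is made, and whether the API accepts every keyword Pydantic emits (e.g. `title`) is not verified here.

```python
from typing import List
from pydantic import BaseModel, ConfigDict, ValidationError

class Book(BaseModel):
    # extra="forbid" makes the generated schema carry "additionalProperties": False.
    model_config = ConfigDict(extra="forbid")
    title: str
    author: str
    yearPublished: int
    summary: str

class Top10BestSellingBooks(BaseModel):
    model_config = ConfigDict(extra="forbid")
    books: List[Book]

# Build a response format dict from the model's JSON schema.
schema = Top10BestSellingBooks.model_json_schema()
response_format_from_model = {
    "type": "json_schema",
    "json_schema": {"name": "Top10BestSellingBooks", "strict": True, "schema": schema},
}

# Hypothetical reply, used only to show validation.
reply = '{"books": [{"title": "Don Quixote", "author": "Miguel de Cervantes", "yearPublished": 1605, "summary": "A novel."}]}'
parsed = Top10BestSellingBooks.model_validate_json(reply)
print(parsed.books[0].title, parsed.books[0].yearPublished)

# A reply whose year is not a number is rejected.
try:
    Top10BestSellingBooks.model_validate_json(reply.replace("1605", '"Year"'))
except ValidationError as exc:
    print(f"rejected: {exc.error_count()} error(s)")
```

Validating with the model gives typed attribute access (`parsed.books[0].yearPublished`) instead of dictionary lookups.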

## Extracting data from PDF

In [11]:
assert TAI_DATASET_ROOT_ENV_VAR in os.environ
papersDatasetPath= os.path.join(os.environ[TAI_DATASET_ROOT_ENV_VAR],'papers_dataset')
pdfDirectory= os.path.join( papersDatasetPath,'rag_research_paper')

print(f'papersDatasetPath: {papersDatasetPath}')
print(f'pdfDirectory:      {pdfDirectory}')

papersDatasetPath: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset
pdfDirectory:      /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper


In [7]:
if not os.path.exists(papersDatasetPath):
    print(f'The dataset at {papersDatasetPath} does not exist. Downloading it..')
    file_path = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="rag_research_paper.zip",repo_type="dataset",local_dir=papersDatasetPath)

assert os.path.exists(papersDatasetPath)

In [9]:
!unzip ${TAI_DATASET_ROOT}/papers_dataset/rag_research_paper.zip -d ${TAI_DATASET_ROOT}/papers_dataset/

Archive:  /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper.zip
   creating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2405.07437v2.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.01219v1.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.07858v1.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.08223v1.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.16833v1.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.21712v1.pdf  
  inflating: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2408.08067v2.pdf  
  inflating: /home/minguzzi/

In [23]:
import os
from pdf2image import convert_from_path

outputDir= os.path.join( papersDatasetPath,'pages')
os.makedirs( outputDir, exist_ok=True)

pages_png = []

for pdf_file in os.listdir(pdfDirectory):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join( pdfDirectory, pdf_file)
        print(f'pdf_path: {pdf_path}')

        convert = convert_from_path(pdf_path, use_pdftocairo=True)

        pdf_output_dir = os.path.join( outputDir, os.path.splitext(pdf_file)[0])
        os.makedirs(pdf_output_dir, exist_ok=True)

        for page_num, image in enumerate(convert):
            page_filename = f"page-{str(page_num + 1).zfill(3)}.png"
            full_path = os.path.join(pdf_output_dir, page_filename)
            print(f'  Page {page_num}:{full_path}')
            image.save(full_path)

            pages_png.append(full_path)

print(pages_png)

pdf_path: /home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/rag_research_paper/2407.07858v1.pdf
  Page 0:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-001.png
  Page 1:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-002.png
  Page 2:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-003.png
  Page 3:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-004.png
  Page 4:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-005.png
  Page 5:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-006.png
  Page 6:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-007.png
  Page 7:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-008.png
['/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.078

In [15]:
# Function to base64 encode an image.
import base64

def encode_image(image_path):
  """Read an image file and return its contents as a base64-encoded string."""
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
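
The base64 string is later embedded in a data URL, whose MIME type should match the image file (the pages are saved as PNG). A minimal sketch of a helper that derives the MIME type from the file extension follows. The name `to_data_url` is hypothetical, and the demo file contains only the 8-byte PNG signature, not a real image.

```python
import base64
import mimetypes
import os
import tempfile

def to_data_url(image_path: str) -> str:
    """Return a data URL for an image, with the MIME type guessed from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Demo on a throwaway file holding only the PNG signature bytes.
png_signature = b"\x89PNG\r\n\x1a\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "page-001.png")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(png_signature)
    url = to_data_url(demo_path)

print(url.split(",", 1)[0])  # data:image/png;base64
```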

### JSON Schema of the structured output

In [20]:
# The response format- JSON schema
json_response_format = {
  "type": "json_schema",
  "json_schema": {
    "name": "research_paper_data",
    "strict": True,
    "schema": {
      "type": "object",
      "properties": {
        "research_paper_data": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
                "name": { "type": "string" },
                "source": { "type": "string" },
                "content": { "type": "string"}
            },

            "required": ["name", "source", "content"],
            "additionalProperties": False
          }
        },
      },
      "required": ["research_paper_data"],
      "additionalProperties": False
    }
  }
}

In [21]:
# Prompt
system_instruction_prompt ="""
You are an expert in extracting structured data from research paper images.

Task Description:
Extract comprehensive information from PDF research paper images, including all headlines, content, and visual elements.
Preserve complete information without fragmentation.

Must-Follow Guideline: Extract all text and information accurately from each image provided. Organize content into multiple
JSON objects when appropriate, based on the amount and type of content. Each JSON should clearly reflect distinct content
sections for streamlined analysis.

Content Requirements:
1. Missing Headlines
- If no visible headline exists, generate appropriate ones based on content
- Group related content under these generated headlines

2. Visual Elements
For figures, graphs, tables, and architectures:
- Extract title/caption
- Describe main trends and comparisons
- Detail architecture designs
- Include related insights from surrounding text

3. Text Processing
- Extract complete sentences without summarization
- Maintain original detail level
- Merge fragmented content logically
- Preserve all technical information

Required output Format (JSON):
[
{
    "source": "Extract complete arXiv ID including prefix (e.g., arXiv:2405.07437v2).
               Verify ID accuracy multiple times. if there is no Arxiv ID return None",

    "name": "Extract or generate all headlines and subheadlines (e.g., Abstract,
            Introduction, Methods, etc). Include section titles and subsection headings.",

    "content": "For each section:
                - Complete text content
                - Visual element descriptions
                - Figure/graph details:
                  * Title/caption
                  * Description
                  * Key trends/comparisons
                  * Architecture details
                  * Related insights"
},
]

Key Guidelines:
- Extract exact content without summarization
- Ensure accuracy in complex technical details
- Maintain logical content organization
- Include complete visual element analysis
"""

In [22]:
import arxiv
import re

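def arxiv_extraction(arxiv_id):
  """Look up a paper on arXiv and return (title, pdf_url) of the first match.

  Returns None implicitly when the search yields no result.
  """
  client = arxiv.Client()
  # Match new-style IDs with five digits after the dot (e.g. 2407.07858)
  # and old-style IDs (e.g. hep-th/9901001) inside the given string.
  id_list = re.findall(r'(\d{4}\.\d{5}|\w+(?:-\w+)?/\d{7})', arxiv_id)
  search = arxiv.Search(id_list=id_list, max_results=1)
  results = client.results(search)

  for result in results:
    return result.title, result.pdf_url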
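
In this notebook each page image is stored in a folder named after its PDF (e.g. `pages/2407.07858v1/page-001.png`), so the arXiv ID can also be read from the image path instead of from the model output. A minimal sketch, assuming that folder layout; the helper name `arxiv_id_from_page_path` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# New-style arXiv ID with an optional version suffix, e.g. 2407.07858v1.
ARXIV_ID_PATTERN = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")

def arxiv_id_from_page_path(page_path: str) -> Optional[str]:
    """Return 'arXiv:<id>' if the page's parent folder is named like an arXiv ID, else None."""
    folder = Path(page_path).parent.name
    if ARXIV_ID_PATTERN.fullmatch(folder):
        return f"arXiv:{folder}"
    return None

print(arxiv_id_from_page_path("pages/2407.07858v1/page-001.png"))  # arXiv:2407.07858v1
print(arxiv_id_from_page_path("pages/notes/page-001.png"))         # None
```

Reading the ID from the path avoids depending on the model transcribing the ID correctly from the first page.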

In [25]:
# Extracts data using OpenAI

import json
from openai import OpenAI
client = OpenAI()

desc = []


for page in pages_png:
  # Getting the base64
  base64_image = encode_image(page)
  print(f'page:{page}')

  try:

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        response_format = json_response_format,
        temperature = 0,
        messages= [
              {"role": "system","content":system_instruction_prompt},
              {"role": "user","content": [{"type": "text", "text": "Extract the content from this research paper image."},
                                          {"type": "image_url","image_url": {"url":f"data:image/png;base64,{base64_image}",
                                                                              "detail": "high"}}
                                          ]
                  }
                  ],
      )

    if response.choices[0].message.content is None:
      continue

    result = json.loads(response.choices[0].message.content)

    if 'page-001' in page:
      # First page of a paper: read the arXiv ID and look up the title and URL.
      # Alternatively, the image path can be used to extract the arXiv ID.
      research_paper_id = result['research_paper_data'][0]['source']
      research_paper_title, research_paper_url = arxiv_extraction(research_paper_id)

    # Every page reuses the metadata resolved from the first page of its paper.
    for item in result['research_paper_data']:
      item['source'] = research_paper_id
      item['name'] = research_paper_title + ":" + item['name']
      item['url'] = research_paper_url

    desc.extend(result['research_paper_data'])

  except Exception as e:
    # Do not read `response` here: it is unbound (or stale) if the API call itself raised.
    print(f"Stopping at {page}... error: {e}")
    break

page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-001.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-002.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-003.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-004.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-005.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-006.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-007.png
page:/home/minguzzi/repo/towards_ai_course/dataset/papers_dataset/pages/2407.07858v1/page-008.png


In [26]:
print("Content research paper title and Headline :",desc[0]['name'],"\n")
print("Content :",desc[0]['content'],"\n")
print("Source :",desc[0]['source'],"\n")
print("URL :",desc[0]['url'])


Content research paper title and Headline : FACTS About Building Retrieval Augmented Generation-based Chatbots:FACTS About Building Retrieval Augmented Generation-based Chatbots 

Content : Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Jayjya, Ashok Maram, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigaichalam, Tamar Bar, Sanjana Krishnan, Jasmine Jaksic, Nave Aligarci, Jacob Liberman, Joey Conway, Sonu Nayyar and Justin Boitano
NVIDIA
{rakkiraju,anbangx,dbora}@nvidia.com

ABSTRACT
Enterprise chatbots, powered by generative AI, are rapidly emerging as the most explored initial applications of this technology in the industry, aimed at enhancing employee productivity. Retrieval 

## Downloading the full dataset for the next steps of the course

In [28]:
assert TAI_DATASET_ROOT_ENV_VAR in os.environ
aiTutorDatasetFilePath= os.path.join(os.environ[TAI_DATASET_ROOT_ENV_VAR],'ai_tutor_knowledge.jsonl')

print(f'aiTutorDatasetFilePath: {aiTutorDatasetFilePath}')

aiTutorDatasetFilePath: /home/minguzzi/repo/towards_ai_course/dataset/ai_tutor_knowledge.jsonl


In [33]:
# Downloading the dataset from Huggingface hub
from huggingface_hub import hf_hub_download
import json

file_path = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="ai_tutor_knowledge.jsonl", repo_type="dataset")

# Exploring the dataset content
with open(file_path, "r") as inFile:
    ai_tutor_knowledge = [json.loads(line) for line in inFile]
    
with open(file_path, "r") as inFile:        
    with open(aiTutorDatasetFilePath,'w') as outFile:   
        outFile.write( inFile.read())
        
print("Title: ", ai_tutor_knowledge[0]['name'])
print("Content: ", ai_tutor_knowledge[0]['content'])
print("URL: ", ai_tutor_knowledge[0]['url'])
print("Source: ", ai_tutor_knowledge[0]['source'])

Title:  BERT HuggingFace Model Deployment using Kubernetes [ Github Repo]  03/07/2024
Content:  Github Repo : https://github.com/vaibhawkhemka/ML-Umbrella/tree/main/MLops/Model_Deployment/Bert_Kubernetes_deployment   Model development is useless if you dont deploy it to production  which comes with a lot of issues of scalability and portability.   I have deployed a basic BERT model from the huggingface transformer on Kubernetes with the help of docker  which will give a feel of how to deploy and manage pods on production.   Model Serving and Deployment:ML Pipeline:Workflow:   Model server (using FastAPI  uvicorn) for BERT uncased model    Containerize model and inference scripts to create a docker image    Kubernetes deployment for these model servers (for scalability)  Testing   Components:Model serverUsed BERT uncased model from hugging face for prediction of next word [MASK]. Inference is done using transformer-cli which uses fastapi and uvicorn to serve the model endpoints   Server