<a href="https://colab.research.google.com/github/anandvicky24/GenAI/blob/main/09_03_Using_LlamaParse_to_extract_structured_data_from_a_complex_document.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q llama-parse llama-index pdfminer.six pytesseract

In [None]:
from google.colab import userdata
API_KEY=userdata.get('LLAMACLOUD_API_KEY_E2') # Get from https://cloud.llamaindex.ai
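
Outside Colab, `google.colab.userdata` cannot be imported. A minimal sketch of a fallback, assuming the same key is also exported as an environment variable; the helper name `get_secret` is illustrative and not part of any library:

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's userdata if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing / access was not granted.
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} not found in Colab userdata or environment")
    return value
```

With this in place, `API_KEY = get_secret('LLAMACLOUD_API_KEY_E2')` would behave the same in Colab and in a local Jupyter session.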

### 1. Basic Parser

In [None]:
from llama_parse import LlamaParse


# Initialize the parser with your API key
parser = LlamaParse(
    api_key=API_KEY,
    result_type="markdown"  # or "text"
)

# Parse a PDF file
pdf_file_path = './08-AttentionIsAllYouNeed.pdf'
documents = parser.load_data(pdf_file_path)

# Access the parsed content
for doc in documents:
    print(doc.text)



Started parsing the file under job_id d4a8cdcb-c816-479d-94a7-2a13cbb9e57f
arXiv:1706.03762v5 [cs.CL] 6 Dec 2017

# Attention Is All You Need

Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗

Google Brain         Google Brain     Google Research    Google Research

avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗

Google Research    University of Toronto          Google Brain

llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com

Illia Polosukhin∗ ‡

illia.polosukhin@gmail.com

# Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convoluti
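
Every section of this notebook sends the same PDF to the parsing service again. One possible way to avoid repeated uploads is to cache the parsed page texts on disk. The sketch below is an assumption-laden helper (`load_or_parse` and the cache file name are hypothetical); it only relies on the parse function returning objects with a `.text` attribute, as `parser.load_data` does above.

```python
import json
from pathlib import Path


def load_or_parse(pdf_path, cache_path, parse_fn):
    """Return page texts from cache_path if it exists, otherwise parse and cache them."""
    cache = Path(cache_path)
    if cache.exists():
        return json.loads(cache.read_text(encoding="utf-8"))
    # parse_fn is expected to return objects with a .text attribute,
    # like the documents returned by parser.load_data.
    texts = [doc.text for doc in parse_fn(pdf_path)]
    cache.write_text(json.dumps(texts, ensure_ascii=False), encoding="utf-8")
    return texts
```

A call such as `texts = load_or_parse(pdf_file_path, "parsed_pages.json", parser.load_data)` would hit the API only on the first run. Note that the cache ignores parser settings, so a different `result_type` or instruction needs a different cache file.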

**Drawbacks of simple text extraction** (e.g., PyPDF2, pdfplumber)
- Extracts raw text in the order it appears in the PDF file
- Often scrambles text from multi-column layouts
- Flattens tables into unstructured text
- Loses document structure (headings, sections, hierarchy)
- Ignores images and figures
- Handles complex layouts poorly

**LlamaParse**
- Preserves document structure: Maintains headings, sections, and hierarchy
- Intelligent table extraction: Converts tables into proper markdown tables or structured format
- Layout understanding: Handles multi-column layouts correctly, reading in logical order
- OCR capability: Can extract text from scanned PDFs and images
- Vision-based parsing: Uses AI models to understand document layout visually
- Metadata extraction: Captures document structure and semantics
- Figure and image handling: Can describe or reference figures and charts
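
Because tables arrive as markdown, they can be turned into records with a few lines of plain Python. The following is a rough sketch for simple pipe-delimited tables only (no escaped `|` characters, no multi-line cells). The sample table is hand-written for illustration, using the BLEU scores quoted in the paper's abstract; it is not actual LlamaParse output.

```python
def parse_markdown_table(table_md):
    """Convert a simple pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for line in table_md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator row
    return [dict(zip(header, row)) for row in body]


# Hand-written sample table (illustrative only).
sample = """
| Task  | BLEU |
|-------|------|
| EN-DE | 28.4 |
| EN-FR | 41.8 |
"""
records = parse_markdown_table(sample)
print(records)
```

All values come back as strings; converting them to numbers is left to the caller.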

### 2. Using Parsing Instructions

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=API_KEY,
    result_type="markdown",
    parsing_instruction="""
    Extract only the following:
    - All tables with financial data
    - Section headers that start with 'Summary' or 'Conclusion'
    - Any text in bold or italics
    - Ignore footnotes and references
    """
)

documents = parser.load_data(pdf_file_path)

# Access the parsed content
for doc in documents:
    print(doc.text)

Started parsing the file under job_id dc8ff05f-b0db-4fa0-982a-3ba6d13cce2c
**Tables with Financial Data:**
- None found.

**Section Headers that Start with 'Summary' or 'Conclusion':**
- None found.

**Text in Bold or Italics:**
- *Attention Is All You Need*
- *Equal contribution. Listing order is random.*
- *Work performed while at Google Brain.*
- *Work performed while at Google Research.*


**Summary of Financial Data:**

*No financial data tables or relevant information were found in the provided text.*

**Section Headers:**

- **2 Background**
- **3 Model Architecture**

**Text in Bold or Italics:**

- *Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.*
- *Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a represen
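
The output above shows the instruction being followed only loosely (for example, "2 Background" is listed although it starts with neither 'Summary' nor 'Conclusion'). A deterministic alternative is to parse without instructions and filter the markdown locally. This is a minimal sketch under that assumption; both function names are hypothetical, and the emphasis regex covers only single-line `*...*` and `**...**` spans.

```python
import re


def headings_starting_with(markdown, prefixes):
    """Return heading texts whose title starts with one of the prefixes (case-insensitive)."""
    wanted = tuple(p.lower() for p in prefixes)
    found = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if m and m.group(2).lower().startswith(wanted):
            found.append(m.group(2))
    return found


def emphasized_spans(markdown):
    """Return the text inside **bold** and *italic* spans, in document order."""
    pattern = r"(\*\*|\*)(?=\S)(.+?)(?<=\S)\1"
    return [m.group(2) for m in re.finditer(pattern, markdown)]


# Hand-written sample (illustrative only).
sample = "# Summary of results\n\nSome **bold** and *italic* text.\n\n## Background\n"
print(headings_starting_with(sample, ["Summary", "Conclusion"]))
print(emphasized_spans(sample))
```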

### 3. Target Specific Sections

In [None]:
parser = LlamaParse(
    api_key=API_KEY,
    result_type="markdown",
    parsing_instruction="""
    Extract only:
    - The 'Executive Summary' section
    - Key metrics and KPIs (look for numbers with % or $)
    """
)

# Parse a PDF file
# pdf_file_path = './08-AttentionIsAllYouNeed.pdf'
documents = parser.load_data(pdf_file_path)

# Access the parsed content
for doc in documents:
    print(doc.text)

Started parsing the file under job_id 34559108-9b24-4eb4-be33-e10475d113f4
**Executive Summary:**

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a sm
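
Note that the paper has no 'Executive Summary' section: the text returned above is the Abstract under a new label. To check whether a section really exists, one option is to look it up by heading in the plain markdown output. The helper below is a sketch (`get_section` is a hypothetical name); it returns `None` when no heading matches.

```python
import re


def get_section(markdown, title):
    """Return the body under the heading `title`, or None if no such heading exists."""
    body, capturing, level = [], False, 0
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if m:
            if capturing and len(m.group(1)) <= level:
                break  # next heading at the same or a higher level ends the section
            if not capturing and m.group(2).lower() == title.lower():
                capturing, level = True, len(m.group(1))
                continue
        if capturing:
            body.append(line)
    return "\n".join(body).strip() if capturing else None
```

For the markdown produced in section 1, `get_section(text, "Abstract")` should return the abstract, while `get_section(text, "Executive Summary")` should return `None`, assuming the headings are emitted as shown there.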

### 4. Pattern-Based Extraction in Post-Processing (Using Regex)

In [None]:
import re
from llama_parse import LlamaParse

# Parse the document
parser = LlamaParse(api_key=API_KEY)
documents = parser.load_data(pdf_file_path)

# Extract specific patterns from parsed content
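# Join the text of every parsed document.
# Use doc.text here: appending documents[0].text inside the loop would repeat
# the first document once per iteration and duplicate every match.
text = '\n'.join(doc.text for doc in documents)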

# Extract email addresses
# Note: inside a character class "|" is a literal pipe, so the TLD class is [A-Za-z]
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)

# Extract phone numbers
phones = re.findall(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text)

# Extract dates
dates = re.findall(r'\b\d{1,2}/\d{1,2}/\d{4}\b', text)

# Extract currency amounts
amounts = re.findall(r'\$[\d,]+\.?\d*', text)

print(f"Emails: {emails}")
print(f"Phones: {phones}")
print(f"Dates: {dates}")
print(f"Amounts: {amounts}")

Started parsing the file under job_id 64b5b4cc-f911-46e9-8f72-e6ee9bc15bf4
Emails: ['avaswani@google.com', 'noam@google.com', 'nikip@google.com', 'usz@google.com', 'llion@google.com', 'aidan@cs.toronto.edu', 'lukaszkaiser@google.com', 'illia.polosukhin@gmail.com', 'avaswani@google.com', 'noam@google.com', 'nikip@google.com', 'usz@google.com', 'llion@google.com', 'aidan@cs.toronto.edu', 'lukaszkaiser@google.com', 'illia.polosukhin@gmail.com', 'avaswani@google.com', 'noam@google.com', 'nikip@google.com', 'usz@google.com', 'llion@google.com', 'aidan@cs.toronto.edu', 'lukaszkaiser@google.com', 'illia.polosukhin@gmail.com', 'avaswani@google.com', 'noam@google.com', 'nikip@google.com', 'usz@google.com', 'llion@google.com', 'aidan@cs.toronto.edu', 'lukaszkaiser@google.com', 'illia.polosukhin@gmail.com', 'avaswani@google.com', 'noam@google.com', 'nikip@google.com', 'usz@google.com', 'llion@google.com', 'aidan@cs.toronto.edu', 'lukaszkaiser@google.com', 'illia.polosukhin@gmail.com', 'avaswani@g
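
The recorded output lists the same addresses many times. Independent of how the text was assembled, matches can be de-duplicated while keeping first-seen order with `dict.fromkeys`. A small sketch using the same four patterns (the email pattern uses `[A-Za-z]` for the top-level domain); `extract_unique` is a hypothetical helper name.

```python
import re

PATTERNS = {
    "emails": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phones": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "dates": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    "amounts": r"\$[\d,]+\.?\d*",
}


def extract_unique(text):
    """Return {pattern name: unique matches in first-seen order}."""
    return {
        name: list(dict.fromkeys(re.findall(pattern, text)))
        for name, pattern in PATTERNS.items()
    }


# Two addresses from the paper's title page, with one repeated on purpose.
sample = "avaswani@google.com    noam@google.com\navaswani@google.com\n"
result = extract_unique(sample)
print(result["emails"])
```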

### 5. Using Page Selection

In [None]:
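parser = LlamaParse(api_key=API_KEY,
                    page_separator="\n---PAGE_BREAK---\n",
                    # Pages are selected with target_pages (comma-separated, zero-based),
                    # so "0,1,2,3,4,9" means pages 1-5 and 10. Passing extra_info to
                    # load_data only attaches metadata; it does not filter pages.
                    target_pages="0,1,2,3,4,9",
                    parsing_instruction="Extract only tables and numerical data"
                  )

documents = parser.load_data(pdf_file_path)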

for doc in documents:
    print(doc.text)

Started parsing the file under job_id 892f75c9-ed04-46d6-b2df-9c82905902ce
    arXiv:1706.03762v5 [cs.CL] 6 Dec 2017

    Attention Is All You Need

    Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
    Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com

    Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗
Google Research    University of Toronto          Google Brain
llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com

                     Illia Polosukhin∗ ‡
                     illia.polosukhin@gmail.com

                                                            Abstract

                          The dominant sequence transduction models are based on complex recurrent or
                         convolutional neural networks that include an encoder and a decoder. The best
                          performing models also connect the enc
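
Page specs are usually written 1-based with ranges, such as "1-5,10". The sketch below converts such a spec into a comma-separated list of zero-based indices. It assumes that LlamaParse's `target_pages` option expects exactly that format (this is my reading of the llama-parse documentation; please verify against the installed version). `to_zero_based_pages` is a hypothetical helper name.

```python
def to_zero_based_pages(spec: str) -> str:
    """Expand a 1-based spec such as "1-5,10" into zero-based "0,1,2,3,4,9"."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        pages.extend(range(start - 1, end))
    # Drop duplicates while keeping the order in which pages were requested.
    return ",".join(str(p) for p in dict.fromkeys(pages))


print(to_zero_based_pages("1-5,10"))
```

After parsing, the text can be split back into pages with `text.split("\n---PAGE_BREAK---\n")`, provided the same separator string was passed as `page_separator`.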

### 6. Structured Extraction with Custom Instructions

In [None]:
parser = LlamaParse(
    api_key=API_KEY,
    result_type="markdown",
    parsing_instruction="""
    Extract and format as follows:
    1. Company names (look for Inc., LLC, Corp.)
    2. Dates in format YYYY-MM-DD
    3. Contract values (look for $ amounts)
    4. Names of signatories

    Format output as:
    ## Contract Details
    - Company: [name]
    - Date: [date]
    - Value: [amount]
    - Signed by: [names]
    """
)
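
The cell above only configures the parser; nothing is parsed until `parser.load_data(...)` is called on a contract PDF. If the model follows the requested template, the result can be read into a dictionary. This is a sketch under that assumption: `parse_contract_details` is a hypothetical helper, and the sample text is made up to match the template, not real output.

```python
import re


def parse_contract_details(markdown):
    """Read '- Key: value' bullets under a '## Contract Details' heading into a dict."""
    details, in_block = {}, False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Only bullets below the "Contract Details" heading are collected.
            in_block = stripped.lstrip("#").strip().lower() == "contract details"
            continue
        m = re.match(r"^-\s*([^:]+):\s*(.*)$", stripped)
        if in_block and m:
            details[m.group(1).strip()] = m.group(2).strip()
    return details


# Made-up sample that follows the requested output format.
sample_output = """## Contract Details
- Company: Example Corp.
- Date: 2024-01-15
- Value: $10,000
- Signed by: A. Person, B. Person
"""
contract = parse_contract_details(sample_output)
print(contract)
```

Since the model's formatting is not guaranteed, an empty dictionary should be treated as "template not followed" rather than "no contract data".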

### 7. Combined Approach (LlamaParse with LlamaIndex RAG)

In [None]:
%pip install -q -U sentence-transformers
%pip install -q llama-index-embeddings-huggingface llama-index-llms-groq groq

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings



In [None]:
pdf_file_path = './08-AttentionIsAllYouNeed.pdf'
embed_model_name="BAAI/bge-small-en-v1.5"
embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
Settings.embed_model = embed_model

In [None]:
from llama_index.llms.groq import Groq
from google.colab import userdata
Settings.llm = Groq(model="llama-3.1-8b-instant", api_key=userdata.get('GROQ_API_KEY'))

In [None]:
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document

# Parse with instructions
parser = LlamaParse(
    api_key=userdata.get('LLAMACLOUD_API_KEY_E2'),
    parsing_instruction="Focus on technical specifications and requirements"
)

docs = parser.load_data(pdf_file_path)

# Build a vector index over the parsed documents and expose it as a query engine
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

# Extract specific information
query1 = "What are all the version numbers mentioned?"
print("\n\nQuery:", query1, '\n-------------------------------------------')
response = query_engine.query(query1)
print(response, '\n-----------------------------------------------------------------------------------')
query2 = "What is the type of hardware used for training the model"
print("\n\nQuery:", query2, '\n-------------------------------------------')
response = query_engine.query(query2)
print(response, '\n-----------------------------------------------------------------------------------')

Started parsing the file under job_id ec3e0b21-325c-43ad-8b84-25730bd1c144


Query: What are all the version numbers mentioned? 
-------------------------------------------
The version numbers mentioned are:

1. 2014
2. 2013
3. 2016
4. 2015
5. 2019
6. 2020
7. 2021 
-----------------------------------------------------------------------------------


Query: What is the type of hardware used for training the model 
-------------------------------------------
The models were trained on a machine with 8 NVIDIA P100 GPUs. 
-----------------------------------------------------------------------------------
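
The query-and-print steps repeat for every question, so they can be folded into a small helper. The sketch below works with any object that has a `.query()` method, such as the `query_engine` built above; `run_queries` is a hypothetical name.

```python
def run_queries(query_engine, queries, width=60):
    """Run each query, print it with its response, and return {query: response text}."""
    results = {}
    for query in queries:
        response = query_engine.query(query)
        print(f"\nQuery: {query}\n{'-' * width}")
        print(f"{response}\n{'-' * width}")
        results[query] = str(response)
    return results
```

Usage would look like `answers = run_queries(query_engine, [query1, query2])`. The first recorded answer lists years as "version numbers", so responses like these are worth cross-checking against the parsed text before relying on them.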


### 8. Advanced Custom Metadata Extraction

In [None]:
parser = LlamaParse(
    api_key=API_KEY,  # reuse the key loaded at the top of the notebook
    parsing_instruction="""
    Extract metadata:
    - Document title
    - Author names
    - Publication date
    - Abstract/Summary
    Then extract all section headers and their content.
    """
)

documents = parser.load_data(pdf_file_path)  # or the path to any other research paper PDF

# The parsed content will prioritize the requested patterns
for doc in documents:
    print(doc.text)
    print(doc.metadata)
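
The instruction shapes `doc.text`; what ends up in `doc.metadata` depends on the library, so fields such as title or date may have to be recovered from the text itself. Below is a minimal sketch that reads the arXiv stamp and the first top-level heading, as they appear in the parsed output in section 1. The function name and the returned keys are illustrative choices, not a LlamaParse feature.

```python
import re

ARXIV_STAMP = re.compile(
    r"arXiv:(?P<id>\d{4}\.\d{4,5})(?P<version>v\d+)?\s+"
    r"\[(?P<category>[\w.\-]+)\]\s+(?P<date>\d{1,2} \w{3,9} \d{4})"
)


def extract_basic_metadata(markdown_text):
    """Pull arXiv id, version, category, date and title out of parsed markdown."""
    meta = {}
    stamp = ARXIV_STAMP.search(markdown_text)
    if stamp:
        meta.update(
            arxiv_id=stamp.group("id"),
            version=stamp.group("version"),
            category=stamp.group("category"),
            date=stamp.group("date"),
        )
    # Treat the first top-level markdown heading as the title.
    title = re.search(r"^# (.+)$", markdown_text, flags=re.MULTILINE)
    if title:
        meta["title"] = title.group(1).strip()
    return meta


# First lines of the parsed output shown in section 1.
sample = "arXiv:1706.03762v5 [cs.CL] 6 Dec 2017\n\n# Attention Is All You Need\n\n# Abstract\n"
metadata = extract_basic_metadata(sample)
print(metadata)
```

The result could be merged into each document with `doc.metadata.update(metadata)` so that it travels with the text into an index.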