## Unstructured Document Text Extraction - LLMs

 Follow the steps in our API Key Access tutorial to retrieve your API key before starting here!

#### 1. Install relevant Python libraries: 

- See requirements.txt for list of packages being installed.
- Add new libraries to requirements.txt as needed and rerun the cell below to install them.

In [None]:

%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Collecting PyMuPDF
  Downloading pymupdf-1.26.0-cp39-abi3-macosx_11_0_arm64.whl (22.4 MB)
Collecting gdown
  Using cached gdown-5.2.0-py3-none-any.whl (18 kB)
Collecting lxml>=3.1.0
  Downloading lxml-5.4.0-cp310-cp310-macosx_10_9_universal2.whl (8.1 MB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.13.4-py3-none-any.whl (187 kB)

#### 2. Import libraries after installation

In [4]:
import google.generativeai as genai 
import docx
import requests
import io



#### 3. Prepare document for text extraction

In [5]:
import gdown
# Define document's file path
# Link to public Google Drive document: https://drive.google.com/file/d/1gBY-kDwJFOX6FMl7rRrY-u4wUlzxLfjm/view?usp=sharing

# File ID is the string after /file/d/ and before /view in the URL 
file_id = '1gBY-kDwJFOX6FMl7rRrY-u4wUlzxLfjm'
gdrive_url = f'https://drive.google.com/uc?id={file_id}'

# Change output_path to represent your specific file name 
output_path = 'temecula_quality_life_plan.pdf'

# Download the file
gdown.download(gdrive_url, output_path, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1gBY-kDwJFOX6FMl7rRrY-u4wUlzxLfjm
To: /Users/nidhi/CapstoneDocumentation/SantaBarbara-TrialRun/temecula_quality_life_plan.pdf
100%|██████████| 20.0M/20.0M [00:08<00:00, 2.50MB/s]


'temecula_quality_life_plan.pdf'
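
The file ID can also be pulled out of the share link programmatically rather than copied by hand. The sketch below is only an illustration: `extract_drive_file_id` is a hypothetical helper name, and it assumes the link follows the `/file/d/<id>/view` pattern described in the comments above.

```python
import re

def extract_drive_file_id(share_url):
    """Return the file ID from a Google Drive share link, or None if absent."""
    # The ID sits between '/file/d/' and the next '/' (usually followed by 'view')
    match = re.search(r"/file/d/([^/?#]+)", share_url)
    return match.group(1) if match else None

share_url = "https://drive.google.com/file/d/1gBY-kDwJFOX6FMl7rRrY-u4wUlzxLfjm/view?usp=sharing"
file_id = extract_drive_file_id(share_url)

# Build the direct-download URL in the same form used in the cell above
gdrive_url = f"https://drive.google.com/uc?id={file_id}"
print(gdrive_url)
```

If the link has a different shape, the helper returns `None`, so it is worth checking the result before passing it to `gdown.download`.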

#### 4. Extract text from document, line-by-line in chunks:

- Why chunks? 
    - We can store multiple lines in one chunk, reducing the number of requests sent to Gemini's LLM.
    - This keeps us within the rate limits for Gemini's free tier, since we can only send 1500 requests (queries) per day. 
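
To make the trade-off concrete, here is a rough estimate of how many requests a document would need at different chunk sizes. The 12,000-word document length and the 10-words-per-line figure are made-up numbers for illustration, and `estimate_requests` is a hypothetical helper, not part of any library.

```python
import math

DAILY_REQUEST_LIMIT = 1500  # free-tier requests per day, as stated above

def estimate_requests(total_words, max_chunk_words=300):
    """Rough estimate of the number of chunks (one request each)."""
    return math.ceil(total_words / max_chunk_words)

total_words = 12_000  # hypothetical document length

# Roughly one request per short line (assuming about 10 words per line)
per_line = estimate_requests(total_words, max_chunk_words=10)
# One request per 300-word chunk
per_chunk = estimate_requests(total_words, max_chunk_words=300)

print(f"Per line:  {per_line} requests")
print(f"Per chunk: {per_chunk} requests")
print(f"Share of daily limit with chunks: {per_chunk / DAILY_REQUEST_LIMIT:.1%}")
```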

In [None]:
import fitz  # PyMuPDF

def extract_chunks_from_pdf(pdf_path, max_chunk_words=300):
    chunks = []
    current_chunk = []

    # Open PDF file
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()  # Get text from each page
            lines = [line.strip() for line in text.split('\n') if line.strip()] # Remove empty lines, whitespace

            for line in lines:      # Loop through each line
                current_chunk.append(line)      # Add line to current chunk
                # When the word count exceeds the threshold (maximum words per chunk), add chunk to list of chunks
                if sum(len(l.split()) for l in current_chunk) >= max_chunk_words:
                    chunks.append(" ".join(current_chunk))
                    current_chunk = []

    # Add any remaining text as a final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# Now our "lines" variable stores chunks of text extracted from PDF
lines = extract_chunks_from_pdf(output_path, max_chunk_words=300)
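
The chunking rule can be tried without a PDF by applying the same logic to a plain list of strings. This is a minimal sketch: `chunk_lines` is a hypothetical name, and it keeps a running word count instead of recounting the chunk on every line, which should produce the same chunks as the function above.

```python
def chunk_lines(lines, max_chunk_words=300):
    """Group lines into chunks that close once they reach max_chunk_words."""
    chunks, current, word_count = [], [], 0

    for line in lines:
        current.append(line)
        word_count += len(line.split())

        # Close the chunk as soon as the running total reaches the threshold
        if word_count >= max_chunk_words:
            chunks.append(" ".join(current))
            current, word_count = [], 0

    # Keep any leftover lines as a final, shorter chunk
    if current:
        chunks.append(" ".join(current))

    return chunks

sample = ["one two three", "four five", "six", "seven eight nine ten"]
print(chunk_lines(sample, max_chunk_words=4))
```

Note that a chunk can end up longer than `max_chunk_words`, because the check happens only after a whole line has been added.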

In [7]:
# Print out the first 5 chunks (change 5 to another number n to display n chunks)
lines[:5]

# Print out the 6th chunk (index starts at 0, so index 5 is the 6th chunk)
#lines[5]

['Lighting 2040 the Path to QUALITY OF LIFE MASTER PLAN Commission and Board Representatives Lanae Turley-Trejo\t Planning Commission David Matics Public/Traffic Safety Commission Eric Faulkner Race, Equity, Diversity, and Inclusion Commission Kathy Sizemore Community Services Commission Ross Jackson Old Town Local Review Board Community Representatives Aaron Petroff Social Work Action Group Amy Minniear Special Needs Community Brooke Nunn Chamber of Commerce Carl Love Local Historian Darlene Wetton Temecula Valley Hospital Gary Oddi Bike Temecula Valley Coalition Jacob Mejia Pechanga Band of Luiseño Indians Jeremy Brown Mount San Jacinto College Juan Carlos Duron\t Optiforms Karen Valdes TVUSD Kimberly Adams Visit Temecula Valley Sandy Rosenstein Interfaith Council Chad Pelekai Camp Pendleton Scott Treadway Rancho Christian Schools Tammy Marine Habitat for Humanity Teri Biancardi Sierra Club the blue ribbon committee temecula city council the project team Matt Rahn Mayor Zak Schwank M

#### 5. Load API key 

In [8]:
api_key = 'YOUR_API_KEY'  # Replace with your own key; avoid saving a real key in a shared notebook

In [9]:
genai.configure(api_key=api_key) # load in API key
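
An alternative to pasting the key into the notebook is reading it from an environment variable. The sketch below assumes the key has been exported under the name `GEMINI_API_KEY`; that variable name and the `load_api_key` helper are illustrative choices, not something the Gemini library requires.

```python
import os

def load_api_key(env_var="GEMINI_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Usage in the notebook (uncomment once the variable is set):
# api_key = load_api_key()
# genai.configure(api_key=api_key)
```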

#### 6. Check whether any chunks exceed the token limit for our specific Gemini LLM

- We're using Gemini 2.0 Flash with the following rate and token limits:

    - 15 requests / minute (RPM)
    - 1,000,000 tokens / minute (TPM)
    - 1,500 requests / day (RPD)
    - 8192 tokens / request 


- 15 requests per 60 seconds = at most 1 request every 4 seconds

- Let's check to see if any extracted lines exceed the token limit

In [10]:
len(lines)

41
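
The pacing arithmetic for these 41 chunks can be worked out directly from the limit of 15 requests per minute. This is a small sketch with a hypothetical helper, `total_wait_seconds`; the 6-second delay matches the pause used later in this notebook.

```python
REQUESTS_PER_MINUTE = 15   # rate limit listed above
n_chunks = 41              # number of chunks extracted from the PDF

# Minimum spacing between requests that stays within the per-minute limit
min_delay = 60 / REQUESTS_PER_MINUTE

def total_wait_seconds(n_requests, delay_seconds):
    """Total time spent pausing when no pause follows the last request."""
    return max(n_requests - 1, 0) * delay_seconds

print(f"Minimum delay between requests: {min_delay} seconds")
print(f"Total pause time at 6 s per request: {total_wait_seconds(n_chunks, 6)} seconds")
```

A 6-second pause leaves a margin above the 4-second minimum, at the cost of about 4 minutes of waiting for this document.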

In [None]:
import google.generativeai as genai

# Configure Gemini with API key
genai.configure(api_key=api_key)

# Initialize the model - Gemini 2.0 Flash in this case
model = genai.GenerativeModel("gemini-2.0-flash")

max_tokens = 0
max_line_index = -1
TOKEN_LIMIT = 8192  # Gemini 2.0 Flash token limit

# Loop through text in each line read from document
for i, line_text in enumerate(lines):
    token_data = model.count_tokens(line_text)  # Count number of tokens per line
    total_tokens = token_data.total_tokens # Calculate total tokens 

    if total_tokens > max_tokens:   
        max_tokens = total_tokens     # Update max_tokens if current line has more tokens 
        max_line_index = i

print(f"Line {max_line_index + 1} has the maximum tokens: {max_tokens}")

if max_tokens > TOKEN_LIMIT:
    print(f"Warning: Line {max_line_index + 1} exceeds the token limit of {TOKEN_LIMIT} tokens.")


Line 28 has the maximum tokens: 587


- Since the largest chunk has only 587 tokens, every chunk is well below the limit of 8192 tokens per request. 
- If any lines exceed the token limit, consider breaking the lines apart into smaller chunks. 
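
One simple way to break an oversized chunk apart is to split it on word boundaries. The sketch below uses word counts as a rough stand-in for tokens, which is an assumption; `split_chunk` is a hypothetical helper, and the resulting pieces would still need to be re-checked with `model.count_tokens`.

```python
def split_chunk(text, max_words):
    """Split one chunk into pieces of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

print(split_chunk("a b c d e", max_words=2))
```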

#### 6. Set up Gemini LLM

In [12]:
def query_gemini(prompt):
    """Send a prompt to Gemini 2.0 Flash and return the text of its response."""

    try:    # Try building the model and sending the prompt
        model = genai.GenerativeModel(model_name="gemini-2.0-flash")
        response = model.generate_content(prompt)   # Store response from prompt sent to LLM

        return response.text    # Return text response

    except Exception as e:   # If the request fails, print the error
        print(f"Error: {str(e)}")
        return "Error processing text/table"
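
Requests can fail for temporary reasons, such as hitting the per-minute limit. One common pattern is to retry with an increasing delay. The sketch below is generic and hypothetical: `call_with_retries` is not part of the Gemini library, and it assumes the wrapped function raises an exception on failure (unlike `query_gemini` above, which catches the error and returns a message).

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            # Give up and re-raise after the final attempt
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be swapped out when trying the function without real delays.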

#### 7. Ask Gemini LLM to extract policies from lines of text

- In our case, Temecula's Quality of Life Master Plan has policies listed out in lines instead of tables. Therefore, we are building a function to extract policies from lines specifically. 

In [13]:
import time

In [14]:
def process_lines(lines):

    extracted_policies = []

    # Loop through each line in document
    # 1 query per line
    for i, line in enumerate(lines):
        # Uncomment line below to check if each line is processing correctly
        # print(f"Processing line {i+1}...")

        # Write the prompt
        text_prompt = f"""Extract both explicit and implicit policies 
                        related to wildfire resilience and/or mitigation from this text. 
                        A policy can be a rule, guideline, or a recommended action. 
                        Provide the exact wording:\n\n{line}"""

        # Store response and save it to list of extracted policies
        response = query_gemini(text_prompt)
        extracted_policies.append(f"Line {i+1}:\n{response}\n")

        if i < len(lines) - 1:  # Avoid waiting after last line
            time.sleep(6)  # Wait 6 seconds before next request to stay within rate limit

    return extracted_policies
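
Because the prompt above is written as an indented triple-quoted string, the indentation spaces are sent to the model along with the instructions. This is usually harmless, but the prompt can be built without them. The sketch below is one possible way, with `PROMPT_TEMPLATE` and `build_prompt` as hypothetical names.

```python
# Adjacent string literals are joined by Python, so no indentation leaks in
PROMPT_TEMPLATE = (
    "Extract both explicit and implicit policies related to wildfire "
    "resilience and/or mitigation from this text. "
    "A policy can be a rule, guideline, or a recommended action. "
    "Provide the exact wording:\n\n{chunk}"
)

def build_prompt(chunk):
    """Insert one chunk of document text into the instruction template."""
    return PROMPT_TEMPLATE.format(chunk=chunk)

print(build_prompt("Maintain defensible space around structures."))
```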

In [15]:
extracted_policies_lines = process_lines(lines)

# Print out first 5 elements
extracted_policies_lines[:5]

['Line 1:\nBased on the provided text, which is primarily a list of participants in the "Lighting 2040 the Path to QUALITY OF LIFE MASTER PLAN," I cannot find explicit or implicit policies related to wildfire resilience and/or mitigation.\n\n',
 "Line 2:\nBased on the provided text, here's an extraction of explicit and implicit policies related to wildfire resilience and/or mitigation. Note that the provided text is very high level and doesn't mention specific wildfire policies. The following are inferenced based on the overall descriptions:\n\n**Implicit Policies:**\n\n*   **Proactive Community Building:** This implies a general policy of taking initiative and forward-thinking approaches to community issues, which could implicitly include wildfire resilience.\n\n*   **Aligning Capital Improvement Program projects with QLMP Core Values:** While not explicitly stated, this suggests that infrastructure projects (which could include those related to wildfire prevention or mitigation) must


#### 8. Save policies to CSV

In [24]:
import pandas as pd

In [None]:
import csv 

def save_lines_to_csv(lines, output_file='temecula_quality_2.csv'):
    with open(output_file, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['line number', 'text'])  # Write header

        for entry in lines:
            # Check if the entry starts with 'Line' to extract line number and text
            if entry.startswith('Line'):
                try:
                    # Split only at the first colon
                    prefix, text = entry.split(':', 1)
                    line_number = prefix.replace('Line', '').strip()    # Extract line number
                    full_text = text.strip()    # Extract text after line number
                    writer.writerow([line_number, full_text])   # Write line number, text as a new row
                except ValueError:
                    writer.writerow(['', entry.strip()])
            else:
                writer.writerow(['', entry.strip()])

In [None]:
# Save (line number, text) to CSV
save_lines_to_csv(extracted_policies_lines)

print("Spreadsheet saved successfully!")

Spreadsheet saved successfully!
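
A quick way to confirm the file was written as expected is to read it back with `csv.DictReader`. The sketch below writes a small demo file in the same two-column layout to a temporary folder and reads it again; the rows are made up, and `read_policy_csv` is a hypothetical helper.

```python
import csv
import os
import tempfile

def read_policy_csv(path):
    """Read a CSV with a header row back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Made-up rows, including a comma and a line break inside the text field
rows = [
    ["1", "No policies found."],
    ["2", "Implicit Policies:\n* Example, with a comma"],
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["line number", "text"])  # Same header as above
        writer.writerows(rows)

    loaded = read_policy_csv(demo_path)

print(len(loaded), "rows read back")
```

To check the real output, the same helper can be pointed at `temecula_quality_2.csv`.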


#### 9. Save as dataframe in case you'd like to clean the output

In [None]:
import pandas as pd

def lines_to_dataframe(lines):
    data = []

    # Loop through each line
    for entry in lines:
        if entry.startswith('Line'):    # If entry starts with 'Line', extract line number and text
            try:
                prefix, text = entry.split(':', 1)
                line_number = prefix.replace('Line', '').strip()
                full_text = text.strip()
                data.append([line_number, full_text])   # Append line number and text to data list
            except ValueError:
                data.append(['', entry.strip()])
        else:
            data.append(['', entry.strip()])

    return pd.DataFrame(data, columns=['line number', 'text'])  # Return DataFrame with columns 'line number' and 'text'

In [36]:
df_temecula = lines_to_dataframe(extracted_policies_lines)
df_temecula.head()

|   | line number | text |
|---|---|---|
| 0 | 1 | Based on the provided text, which is primarily... |
| 1 | 2 | Based on the provided text, here's an extracti... |
| 2 | 3 | This text focuses on the city of Temecula's Qu... |
| 3 | 4 | Based on the text provided, here's an extracti... |
| 4 | 5 | Based on the text provided, here are the expli... |


#### 10. Clean output (optional)

- After opening the CSV in Excel, you may notice that the policies extracted are in a messy format.

- This is because Gemini returns extra text with the query to explain why certain policies were returned or to categorize them as implicit / explicit.

- If this is the case (it usually is!), you can run this code to clean up the text and keep just the extracted policies.

In [37]:
import pandas as pd
import re

# Function to extract clean policy statements
def extract_policies(text):
    # Filter out lines that say no policies are present
    if re.search(r'no (explicit|implicit) policies|does not.*mention.*polic|cannot find', text, re.IGNORECASE):
        return None

    # Extract lines that look like policies (start with bullet points or bold patterns)
    policies = re.findall(r'\*\*?"?(.*?)"?\*\*?(?=\s*[-:]|\s*\n|\s*$)', text)
    
    # Clean extra characters
    cleaned = [re.sub(r'\*\*|\*|["“”]', '', p).strip() for p in policies if p.strip()]

    return cleaned if cleaned else None

# Apply function
df_temecula["policies"] = df_temecula["text"].apply(extract_policies)

# Drop rows with no policies
df_temecula = df_temecula.dropna(subset=["policies"])

# Optionally drop the original text column
df_temecula = df_temecula[["line number", "policies"]]

# Save the cleaned data
df_temecula.to_csv("cleaned_temecula_quality.csv", index=False)


|   | line number | policies |
|---|---|---|
| 1 | 2 | [Implicit Policies:, Lack of Explicit Policies:] |
| 3 | 4 | [Implicit Policies:, resilient capital and soc... |
| 5 | 6 | [Explicit Policies:, Emergency Management Mast... |
| 6 | 7 | [Explicit Policies, Implicit Policies] |
| 7 | 8 | [Explicit Policies:, Implicit Policies (Inferr... |
| 10 | 11 | [Explicit Policies:, Implicit Policies:] |
| 11 | 12 | [Explicit Policies (Directly Stated):, Residen... |
| 13 | 14 | [Explicit Policies:, Implicit Policies:, Expla... |
| 14 | 15 | [Explicit Policies:, Implicit Policies (Relate... |
| 15 | 16 | [Explicit Policies related to Wildfire Resilie... |
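
Many of the captured items above are section headings such as "Implicit Policies:" rather than the policies themselves. One possible refinement, assuming Gemini formats each policy as a Markdown bullet of the form `*   **Label:** description`, is to match only those bullet lines. The sketch below is a hypothetical alternative, not a replacement verified against the full output; the second bullet in the sample is made up.

```python
import re

# Matches lines like "*   **Label:** description" and captures both parts
BULLET_PATTERN = re.compile(r"^\s*[*-]\s+\*\*(.+?):?\*\*:?\s*(.*)$", re.MULTILINE)

def extract_bullets(text):
    """Return (label, description) pairs for bold-labelled Markdown bullets."""
    return [
        (label.strip(), description.strip())
        for label, description in BULLET_PATTERN.findall(text)
    ]

sample = (
    "**Implicit Policies:**\n\n"
    "*   **Proactive Community Building:** This implies a general policy.\n\n"
    "*   **Second Label:** Another description.\n"
)

for label, description in extract_bullets(sample):
    print(f"{label} -> {description}")
```

Headings that are bold but not part of a bullet, like the first line of the sample, are skipped by this pattern.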
