# Brazilian Traffic Law Parser and Structurer

## Overview
This notebook implements a comprehensive pipeline for parsing and structuring the Brazilian Traffic Law (Código de Trânsito Brasileiro) to create a hierarchical representation suitable for Retrieval-Augmented Generation (RAG) systems.

## Key Features
- **Law Text Parsing**: Extracts and structures legal text into hierarchical components
- **AI-Powered Classification**: Uses OpenAI GPT-4o-mini to identify and classify legal elements
- **Hierarchical Organization**: Creates a tree structure following legal document hierarchy
- **Vector Database Preparation**: Generates structured data for embedding and retrieval

## Legal Hierarchy Structure
The pipeline recognizes and organizes the following legal elements in order of precedence:
1. **Título** (Title)
2. **Capítulo** (Chapter) 
3. **Seção** (Section)
4. **Artigo** (Article)
5. **Parágrafo** (Paragraph)
6. **Inciso** (Subsection)
7. **Item** (Item)

## Output
- Structured JSON with hierarchical law organization
- Vector database dictionary for RAG implementation
- Cleaned and filtered legal content ready for AI applications
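
As a rough illustration of the two outputs, the sketch below shows the shape of one node in the hierarchical JSON and the matching vector database entry. The field names come from the notebook; the texts are placeholders, not quotations from the law.

```python
# Illustrative only: the texts below are placeholders, not quoted from the law.
example_tree = [
    {
        "tipo_do_campo": "Artigo",
        "título_do_campo": "Art. 1º",
        "texto": "<article text>",
        "sub_campos": [
            {
                "tipo_do_campo": "Parágrafo",
                "título_do_campo": "§ 1º",
                "texto": "<paragraph text>",
                "sub_campos": [],
            }
        ],
    }
]

# The vector database dictionary maps a " - "-joined path of titles to the
# concatenated text of every element along that path.
example_vector_db = {
    "Art. 1º - § 1º": "<article text> <paragraph text>",
}
```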

## Dependencies and Setup

### Required Libraries
- `openai`: For AI-powered text classification and structuring
- `Levenshtein`: For fuzzy string matching and hierarchy mapping
- `concurrent.futures`: For parallel processing of law sections
- `json`, `os`, `sys`, `re`: Standard Python libraries for file handling and text processing

### Prerequisites
1. OpenAI API key available in the `OPENAI_API_KEY` environment variable
2. Brazilian Traffic Law text file (`cod_transito.txt`) in the working directory
3. Output directory structure for processed JSON files
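
A quick pre-flight check can catch a missing prerequisite before any API call is made. The helper below is a minimal sketch, not part of the original notebook: `check_prerequisites` is a hypothetical name, and the default paths are assumptions taken from the file and directory names used later in the code.

```python
import os

def check_prerequisites(law_path="cod_transito.txt", output_dir="trafficLaw"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not os.getenv("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY environment variable is not set")
    if not os.path.isfile(law_path):
        problems.append(f"law text file not found: {law_path}")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory not found: {output_dir}")
    return problems
```

Calling `check_prerequisites()` at the top of the notebook and printing the result would list anything that still needs to be set up.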

In [1]:
# Standard library imports
import os      # File and directory operations
import sys     # System-specific parameters and functions
import json    # JSON data handling
import re      # Regular expressions for text pattern matching

# Third-party imports
from Levenshtein import distance              # String similarity calculations
from concurrent.futures import ThreadPoolExecutor  # Parallel processing
from openai import OpenAI                    # OpenAI API for AI text processing

## 1. Import Required Libraries

The following libraries are essential for the law parsing pipeline:

In [None]:
# Initialize content variable
content = None

# Load the Brazilian Traffic Law text file
# Using UTF-8 encoding to handle Portuguese characters properly
with open(r"cod_transito.txt", 'r', encoding='utf8') as f:
    content = f.read()

# Display the loaded content for verification
content

## 2. Load Traffic Law Text

This section loads the Brazilian Traffic Law text from a local file. The text will be processed and structured in subsequent steps.

In [None]:
# Initialize the OpenAI client
# The API key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# System prompt for GPT-4o-mini to structure legal text
# This prompt defines the AI's role as a legal text analyzer
system_description = """Você receberá um trecho de uma lei, a partir dele estruture uma lista com os campos detectados. Os tipos dos campos podem ser: Título, Capítulo, Seção, Artigo, Parágrafo, Inciso, Item.

Regras:
- Artigos são escritos na forma "Art. 1º" ou "Art. 1"
- Parágrafos são escritos na forma "§ 1º" ou "§ 1"
- Incisos são escritos na forma "I" ou "1"
- Itens são escritos na forma "a" ou "a)"
- É possível que o trecho não tenha todos os campos, por exemplo, um trecho pode ter apenas um artigo e um parágrafo
- Se um texto parecer incompleto finalize o campo com "..."

O JSON  deve seguir a seguinte estrutura:

{"campos_detectados":[{"tipo_do_campo": str, "título_do_campo": str, "texto": str},...]}
"""

def generate_json_from_law_llm(law_text, output_path):
    """
    Generate structured JSON from legal text using OpenAI GPT-4o-mini.
    
    This function takes a section of legal text and uses AI to identify and structure
    the various legal elements (articles, paragraphs, sections, etc.) into a 
    standardized JSON format.
    
    Parameters:
    -----------
    law_text : str
        The legal text to be analyzed and structured
    output_path : str
        Path where the generated JSON file will be saved
    
    Returns:
    --------
    dict
        Structured legal elements in JSON format, or None if file already exists
    
    Notes:
    ------
    - Uses GPT-4o-mini model for cost-effective processing
    - Implements file existence check to avoid reprocessing
    - Uses JSON mode to ensure valid JSON output
    - Temperature set to 0.3 for consistent, focused responses
    """
    
    # Skip processing if output file already exists (avoid unnecessary API calls)
    if os.path.exists(output_path):
        print(f"Output file {output_path} already exists. Skipping generation.")
        return
    
    # Make API call to OpenAI GPT-4o-mini
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cost-effective model suitable for text analysis
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": system_description
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": law_text
                    }
                ]
            }
        ],
        temperature=0.3,  # Low temperature for consistent, focused responses
        top_p=1,
        presence_penalty=0,
        response_format={'type': 'json_object'}  # Ensure JSON output format
    )
    
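    # Parse the JSON string returned by the model
    text = json.loads(response.choices[0].message.content)

    # Create the output directory if needed, then save the structured data
    # ensure_ascii=False keeps Portuguese characters readable in the file
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, 'w', encoding='utf8') as f:
        json.dump(text, f, indent=4, ensure_ascii=False)

    return text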

## 3. OpenAI Setup and Law Structure Detection

### AI-Powered Legal Structure Detection
This section configures the OpenAI client and defines the core function for detecting legal structure elements using GPT-4o-mini. The model is not fine-tuned for this task; the system prompt instructs it to recognize and classify the different types of legal elements in Brazilian law.

### System Prompt Design
The system prompt instructs the AI to:
- Identify legal hierarchy elements (Titles, Chapters, Sections, Articles, Paragraphs, etc.)
- Extract and structure the content in a consistent JSON format
- Handle incomplete or fragmented text gracefully
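
JSON mode guarantees syntactically valid JSON but not that the model followed the requested schema. A minimal validation sketch is shown below; `validate_detected_fields` is a hypothetical helper, not part of the original notebook, and it only checks the three keys named in the prompt.

```python
REQUIRED_KEYS = {"tipo_do_campo", "título_do_campo", "texto"}

def validate_detected_fields(payload):
    """Return the list under 'campos_detectados', raising ValueError if malformed."""
    fields = payload.get("campos_detectados") if isinstance(payload, dict) else None
    if not isinstance(fields, list):
        raise ValueError("missing or non-list 'campos_detectados'")
    for position, field in enumerate(fields):
        if not isinstance(field, dict):
            raise ValueError(f"field {position} is not an object")
        missing = REQUIRED_KEYS - field.keys()
        if missing:
            raise ValueError(f"field {position} lacks keys: {sorted(missing)}")
    return fields
```

Running such a check on each parsed response before saving it would surface malformed outputs early instead of failing later during aggregation.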

In [None]:
def particionate_arts(law_text):
    """
    Partition legal text into smaller sections based on articles.
    
    This function splits the complete law text into manageable chunks, with each chunk
    containing one article and its associated content. This segmentation strategy:
    1. Reduces token count for AI processing
    2. Maintains logical legal structure
    3. Enables parallel processing of different articles
    
    Parameters:
    -----------
    law_text : str
        Complete legal text to be partitioned
    
    Returns:
    --------
    list
        List of text segments, each containing one article and related content
    
    Notes:
    ------
    - Splits on every occurrence of "Art. ", so in-text references to other
      articles also start a new segment
    - Restores the "Art. " prefix on every segment
    - Filters out empty segments
    - First segment includes any preamble text before the first article
    - Returns the whole text as a single segment if it contains no article
    """
    # Split text by article markers
    parts = law_text.split("Art. ")

    # No article marker found: return the whole text as one segment
    if len(parts) < 2:
        return [law_text.strip()] if law_text.strip() else []

    # First segment: preamble + first article, separated by a newline
    first = (parts[0].strip() + "\nArt. " + parts[1].strip()).strip()

    # Remaining segments: individual articles with "Art. " prefix restored
    rest = ["Art. " + part.strip() for part in parts[2:] if part.strip()]

    return [first] + rest

## 4. Text Partitioning Strategy

### Article-Based Segmentation
The law text needs to be broken down into manageable chunks for AI processing. This function splits the text by articles ("Art.") to create logical segments that can be processed independently while maintaining legal context.
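
Splitting on the bare string "Art. " also cuts at cross-references inside an article's text. One possible refinement, sketched below, splits only where "Art. " begins a line. This is an alternative to the notebook's function, not a description of it, and it assumes every article in the source file starts on its own line. Unlike the original, it returns the preamble as a separate segment.

```python
import re

# Assumption: every article starts at the beginning of a line.
ARTICLE_START = re.compile(r"(?m)^(?=Art\. )")

def split_articles_at_line_start(law_text):
    """Split before each line-initial 'Art. ', leaving inline references intact."""
    return [part.strip() for part in ARTICLE_START.split(law_text) if part.strip()]

# Made-up sample text, used only to show the behavior
sample = "PREAMBLE\nArt. 1 First rule, see Art. 2.\nArt. 2 Second rule."
segments = split_articles_at_line_start(sample)
# segments == ["PREAMBLE", "Art. 1 First rule, see Art. 2.", "Art. 2 Second rule."]
```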

In [None]:
# Apply the partitioning function to the loaded law content
art_list = particionate_arts(content)

# Display the list of article segments for verification
# This helps verify that the partitioning worked correctly
art_list

### Test Article Partitioning
Execute the partitioning function and examine the results:

In [None]:
# Parallel processing of law articles using ThreadPoolExecutor
# Each call spends most of its time waiting on the API, so running several at once shortens the total run

with ThreadPoolExecutor(max_workers=20) as executor:
    # Generate output filenames based on article titles
    output_names = []
    for art in art_list:
        # Extract article title from the first 10 characters
        title = art[:10].strip()
        
        # Clean up title formatting for filename use
        if len(title.split()) == 3:
            title = title.split()[0] + " " + title.split()[1]
        
        # Remove trailing periods from filenames
        if title.endswith('.'):
            title = title[:-1]
        
        # Create JSON filename for each article
        output_names.append(f"{title}.json")

    # Submit all articles for parallel processing
    futures = []
    for i, art in enumerate(art_list):
        # Define output path for each processed article
        output_path = os.path.join(r"trafficLaw", output_names[i])

        # Each task calls the OpenAI API independently
        futures.append(executor.submit(generate_json_from_law_llm, art, output_path))

    # Wait for every task and re-raise any exception hit in a worker;
    # without this, failed API calls would pass silently
    for future in futures:
        future.result()

## 5. Parallel AI Processing

### Concurrent Processing Strategy
This section processes all article segments in parallel using ThreadPoolExecutor to significantly reduce processing time. Each article segment is sent to the OpenAI API concurrently, with results saved as individual JSON files.

**Performance Benefits:**
- Processes multiple articles simultaneously
- Can reduce total processing time from hours to minutes
- Caps the number of simultaneous API requests at 20 (`max_workers=20`); this bounds concurrency but does not by itself guarantee staying within API rate limits
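
When tasks are submitted without inspecting their futures, an exception raised in a worker is stored on the future and never shown. The sketch below shows one way to collect both results and failures. `process_all` and `fake_worker` are hypothetical names, and `fake_worker` is a stand-in for the API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(items, worker, max_workers=20):
    """Run worker(item) for every item; return (results, errors) keyed by item index."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, item): index for index, item in enumerate(items)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going; failures are reported at the end
                errors[index] = exc
    return results, errors

def fake_worker(text):
    # Hypothetical stand-in for the API call
    if not text:
        raise ValueError("empty segment")
    return text.upper()

results, errors = process_all(["art. 1", "", "art. 3"], fake_worker, max_workers=2)
# results holds indices 0 and 2; errors holds the ValueError raised for index 1
```

Segments listed in `errors` could then be retried without resubmitting the ones that already succeeded.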

In [None]:
# Initialize list to store all processed JSON data
json_files = []

# Regular expression to extract numeric values from filenames
# Used for proper numerical sorting (e.g., Art 2 before Art 10)
get_num = re.compile(r'\d+')

def files_num(file):
    """Extract the first number in a filename for sorting; names without digits sort first."""
    match = get_num.search(file)
    return int(match.group()) if match else -1

# Directory holding the processed JSON files
# Note: this must be the same directory the parallel processing step wrote to
law_dir = os.path.join("..", "data", "trafficLaw")

# Get list of all processed JSON files from the traffic law directory
file_list = os.listdir(law_dir)

# Sort files numerically to maintain legal document order
# This ensures articles are processed in their original sequence
file_list.sort(key=files_num)

# Load and combine all JSON files into a single list
for file in file_list:
    with open(os.path.join(law_dir, file), 'r', encoding='utf8') as f:
        json_files.append(json.load(f))
        json_files.append(json.load(f))

# Display the aggregated JSON data
json_files

## 6. JSON File Aggregation

### Collecting Processed Results
After parallel processing, this section collects all individual JSON files and combines them into a single comprehensive dataset. The files are sorted numerically to maintain the original law structure order.
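
Plain string sorting would place "Art 10" before "Art 2", which is why a numeric key is needed. The sketch below shows a general natural-sort key that compares every number in the name, not only the first. `natural_key` is a hypothetical helper and the filenames are made-up examples.

```python
import re

def natural_key(filename):
    """Split a name into text and integer chunks so 'Art 2' sorts before 'Art 10'."""
    chunks = re.split(r"(\d+)", filename)
    # re.split with a capture group alternates text (even index) and digits (odd index)
    return [int(chunk) if index % 2 else chunk.lower()
            for index, chunk in enumerate(chunks)]

names = ["Art 10.json", "Art 2.json", "Art 1.json"]
ordered = sorted(names, key=natural_key)
# ordered == ["Art 1.json", "Art 2.json", "Art 10.json"]
```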

In [None]:
# Create a flattened list of all detected legal fields
ultimate_json = []

# Extract all detected fields from individual JSON files
# This flattens the nested structure into a single list
for json_file in json_files:
    ultimate_json.extend(json_file['campos_detectados'])

# Initialize sub_campos field for each element
# This prepares the structure for hierarchical organization
for element in ultimate_json:
    element['sub_campos'] = []

### Save Intermediate Results
Save the flattened legal structure for backup and debugging purposes:
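
The notebook as exported does not include a cell for this step. A minimal sketch of what it could look like follows; the helper names and the `ultimate_json.json` filename are assumptions, and the demo writes to a temporary directory so it has no lasting side effects.

```python
import json
import os
import tempfile

def save_fields(fields, path):
    """Write the flat field list as UTF-8 JSON, keeping accented characters readable."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(fields, f, indent=4, ensure_ascii=False)

def load_fields(path):
    """Read back a field list written by save_fields."""
    with open(path, "r", encoding="utf8") as f:
        return json.load(f)

# Placeholder data with the same keys as the notebook's flattened list
sample = [{"tipo_do_campo": "Parágrafo", "título_do_campo": "§ 1º",
           "texto": "<paragraph text>", "sub_campos": []}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ultimate_json.json")  # hypothetical filename
    save_fields(sample, path)
    restored = load_fields(path)
```

In the notebook itself the call would be along the lines of `save_fields(ultimate_json, "ultimate_json.json")`.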

In [None]:
# Define the legal hierarchy in order of precedence
# This list determines parent-child relationships in the legal structure
hierarchy_list = ["Título", "Capítulo", "Seção", "Artigo", "Parágrafo", "Inciso", "Item"]

def check_type_hierarchy(field1, field2):
    """
    Compare two legal fields to determine their hierarchical relationship.
    
    Parameters:
    -----------
    field1, field2 : dict
        Legal field objects containing 'tipo_do_campo' keys
    
    Returns:
    --------
    str
        'sub' if field2 should be a child of field1
        'super' if field2 should be a parent of field1  
        'equal' if fields are at the same hierarchy level
    """
    # Position in hierarchy_list: a lower index means a higher level
    level1 = hierarchy_list.index(field1['tipo_do_campo'])
    level2 = hierarchy_list.index(field2['tipo_do_campo'])

    if level1 < level2:
        return 'sub'     # field2 is subordinate to field1
    elif level1 > level2:
        return 'super'   # field2 is superior to field1
    else:
        return 'equal'   # fields at same level

def map_name_to_hierarchy(name):
    """
    Map field type names to standardized hierarchy names using fuzzy matching.
    
    This function handles variations in AI-generated field type names by finding
    the closest match in the standard hierarchy list using Levenshtein distance.
    
    Parameters:
    -----------
    name : str
        Field type name to be normalized
    
    Returns:
    --------
    str
        Standardized hierarchy name, or "ERRO" if no close match found
    """
    min_dist = float('inf')  # No candidate seen yet
    closest = None
    
    # Find the closest match using string distance
    for hierarchy in hierarchy_list:
        dist = distance(name, hierarchy)
        if dist < min_dist:
            min_dist = dist
            closest = hierarchy
    
    # Reject matches that are too different (likely errors)
    if min_dist >= 3:
        return "ERRO"
    return closest

def get_json_element_by_address(json, address):
    """
    Navigate to a specific element in nested JSON using address path.
    
    Parameters:
    -----------
    json : dict/list
        The JSON structure to navigate
    address : list
        List of keys/indices representing the path to the target element
    
    Returns:
    --------
    dict/list
        The element at the specified address
    """
    element = json
    for i in address:
        element = element[i]
    return element

def change_json_element_by_address(json, address, new_element):
    """
    Update a specific element in nested JSON structure.
    
    Parameters:
    -----------
    json : dict/list
        The JSON structure to modify
    address : list
        Path to the element to be changed
    new_element : any
        New value to assign to the element
    """
    element = json
    for i in address[:-1]:
        element = element[i]
    element[address[-1]] = new_element

def create_hierarchy_json(field_list):
    """
    Transform a flat list of legal fields into a hierarchical tree structure.
    
    This function implements the core logic for building the legal document hierarchy.
    It processes fields sequentially and determines where each field should be placed
    in the tree based on its type and relationship to previously processed fields.
    
    Algorithm:
    1. Normalize field types using fuzzy matching
    2. For each field, find the appropriate parent in the existing tree
    3. Insert the field at the correct position maintaining hierarchy rules
    4. Track insertion positions for efficient subsequent operations
    
    Parameters:
    -----------
    field_list : list
        Flat list of legal field dictionaries
    
    Returns:
    --------
    list
        Hierarchical structure with nested sub_campos relationships
    """
    # Root container; initialized up front so the function also works when
    # the first field cannot be classified
    hierarchy_json = {'sub_campos': []}
    last_added_address = None  # Path to the most recently inserted field

    for field in field_list:
        # Normalize field type name using fuzzy matching
        field['tipo_do_campo'] = map_name_to_hierarchy(field['tipo_do_campo'])

        # Skip fields that couldn't be properly classified
        if field['tipo_do_campo'] == "ERRO":
            continue

        # Walk from the last inserted field up through its ancestors and
        # attach the new field to the first one that outranks it
        added = False
        if last_added_address is not None:
            # Addresses alternate 'sub_campos' and an index, so dropping two
            # elements at a time moves up exactly one level
            for end in range(len(last_added_address), 0, -2):
                candidate_address = last_added_address[:end]
                candidate = get_json_element_by_address(hierarchy_json, candidate_address)

                if check_type_hierarchy(candidate, field) == 'sub':
                    candidate['sub_campos'].append(field)
                    last_added_address = candidate_address + ['sub_campos', len(candidate['sub_campos']) - 1]
                    added = True
                    break

        # If no suitable parent found, add to root level
        if not added:
            hierarchy_json['sub_campos'].append(field)
            last_added_address = ['sub_campos', len(hierarchy_json['sub_campos']) - 1]

    return hierarchy_json['sub_campos']

# Execute the hierarchical organization
# This transforms the flat list into a proper legal document tree structure
hierarchy_json = create_hierarchy_json(ultimate_json)

## 7. Hierarchical Organization System

### Legal Hierarchy Implementation
This section organizes the flat list of detected legal elements into a tree that follows the legal hierarchy. The system:

1. **Normalizes field types** using fuzzy string matching (Levenshtein distance)
2. **Establishes parent-child relationships** based on legal precedence
3. **Builds a tree structure** that mirrors the actual legal document organization
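
The normalization step can be illustrated without third-party packages. The sketch below swaps in the standard library's `difflib` similarity ratio in place of Levenshtein distance, so it is a related technique and not the notebook's exact method. `normalize_type` and the `cutoff` value are assumptions chosen for illustration.

```python
from difflib import get_close_matches

HIERARCHY = ["Título", "Capítulo", "Seção", "Artigo", "Parágrafo", "Inciso", "Item"]

# Compare case-insensitively, then map back to the canonical spelling
_BY_FOLDED = {name.casefold(): name for name in HIERARCHY}

def normalize_type(name, cutoff=0.75):
    """Return the canonical hierarchy name closest to `name`, or None if nothing is close."""
    matches = get_close_matches(name.strip().casefold(), list(_BY_FOLDED), n=1, cutoff=cutoff)
    return _BY_FOLDED[matches[0]] if matches else None
```

For instance, `normalize_type("item")` yields `"Item"` and a missing accent as in `"Paragrafo"` still resolves to `"Parágrafo"`, while an unrelated string returns `None`.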

### Hierarchy Levels (in order of precedence):
1. **Título** (Title) - Highest level organizational unit
2. **Capítulo** (Chapter) - Major subdivisions within titles  
3. **Seção** (Section) - Subdivisions within chapters
4. **Artigo** (Article) - Individual legal provisions
5. **Parágrafo** (Paragraph) - Clarifications or extensions of articles
6. **Inciso** (Subsection) - Enumerated items within articles/paragraphs
7. **Item** (Item) - Detailed specifications within subsections
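
The nesting rule implied by these levels is that each element attaches to the nearest preceding element of a strictly higher level. A compact way to express that rule is a stack of currently open ancestors, sketched below. This is an alternative formulation for illustration, not the notebook's address-based implementation, and `build_tree` is a hypothetical name.

```python
LEVELS = {name: depth for depth, name in enumerate(
    ["Título", "Capítulo", "Seção", "Artigo", "Parágrafo", "Inciso", "Item"])}

def build_tree(fields):
    """Nest a flat, document-ordered list of fields using a stack of open ancestors."""
    roots, stack = [], []
    for field in fields:
        node = {**field, "sub_campos": []}
        depth = LEVELS[node["tipo_do_campo"]]
        # Close every open element that does not outrank the new one
        while stack and LEVELS[stack[-1]["tipo_do_campo"]] >= depth:
            stack.pop()
        # Attach to the nearest remaining ancestor, or to the root list
        (stack[-1]["sub_campos"] if stack else roots).append(node)
        stack.append(node)
    return roots

flat = [
    {"tipo_do_campo": "Capítulo", "título_do_campo": "Capítulo I"},
    {"tipo_do_campo": "Artigo", "título_do_campo": "Art. 1º"},
    {"tipo_do_campo": "Parágrafo", "título_do_campo": "§ 1º"},
    {"tipo_do_campo": "Artigo", "título_do_campo": "Art. 2º"},
]
tree = build_tree(flat)
# "Capítulo I" holds both articles; "§ 1º" is nested under "Art. 1º"
```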

In [None]:
# Save the complete hierarchical structure to JSON file
# This represents the fully organized Brazilian Traffic Law structure
with open(r"cod_transito.json", 'w', encoding='utf8') as f:
    json.dump(hierarchy_json, f, indent=4, ensure_ascii=False)

### Generate vector database keys (title paths) and values (concatenated text) for later upload to a vector database
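
To make the key and value format concrete, the sketch below flattens a tiny tree the same way: each leaf becomes one entry whose key joins the titles along its path with " - " and whose value joins the texts with spaces. `flatten_leaves` is a hypothetical, self-contained rewrite for illustration, and the texts are placeholders.

```python
def flatten_leaves(nodes, path=(), texts=()):
    """Map ' - '-joined title paths to the concatenated text of each leaf and its ancestors."""
    flat = {}
    for node in nodes:
        node_path = path + (node["título_do_campo"],)
        node_texts = texts + (node["texto"],)
        if node["sub_campos"]:
            # Not a leaf: descend, carrying the path and accumulated text
            flat.update(flatten_leaves(node["sub_campos"], node_path, node_texts))
        else:
            flat[" - ".join(node_path)] = " ".join(node_texts)
    return flat

tree = [{
    "título_do_campo": "Art. 1º", "texto": "<article text>",
    "sub_campos": [
        {"título_do_campo": "§ 1º", "texto": "<first paragraph>", "sub_campos": []},
        {"título_do_campo": "§ 2º", "texto": "<second paragraph>", "sub_campos": []},
    ],
}]
entries = flatten_leaves(tree)
# entries has two keys: "Art. 1º - § 1º" and "Art. 1º - § 2º"
```

Repeating the ancestor text in every leaf keeps each entry self-contained for retrieval, at the cost of some duplication.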

In [None]:
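# Build the vector database dictionary: keys are " - "-joined title paths,
# values are the concatenated text of a leaf element and all its ancestors
vector_db_dict = {}

def generate_vector_db_dict(hierarchy_json, vector_db_dict, address=None, texts_to_concat=None):
    # Use None defaults to avoid sharing mutable default lists between calls
    address = address or []
    texts_to_concat = texts_to_concat or []
    for field in hierarchy_json:
        if len(field['sub_campos']) > 0:
            # Not a leaf: recurse, extending the path and the accumulated text
            generate_vector_db_dict(field['sub_campos'], vector_db_dict, address+[field['título_do_campo']], texts_to_concat+[field['texto']])
        else:
            # Leaf: store the full path as key and the accumulated text as value
            vector_db_dict[" - ".join(address+[field['título_do_campo']])] = " ".join(texts_to_concat+[field['texto']])

    return vector_db_dict

# Only the top-level element at index 16 is flattened; this index is specific
# to the parsed output and may differ if the parsing results change
vector_db_dict = generate_vector_db_dict([hierarchy_json[16]], vector_db_dict)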

vector_db_dict

### Manual data cleaning

In [None]:
# Drop entries whose key mentions "penalidade" or "medida administrativa"
vector_db_dict = {k: v for k, v in vector_db_dict.items() if not any(x in k.lower() for x in ["penalidade", "medida administrativa"])}
# Drop entries whose text is marked as vetoed, revoked, or suspended
vector_db_dict = {k: v for k, v in vector_db_dict.items() if not any(x in v.lower() for x in ['vetado', 'revogado', 'suspenso'])}

# replace " - Infração" in keys with ""
vector_db_dict = {k.replace(" - Infração", ""): v for k, v in vector_db_dict.items()}
vector_db_dict

In [None]:
# save vector_db dict as json file
with open(r"vector_db_dict.json", 'w', encoding='utf8') as f:
    json.dump(vector_db_dict, f, indent=4, ensure_ascii=False)