# STJ Acórdãos Data Download and Exploration

This notebook downloads acórdãos (court decisions) from STJ's open data portal and organizes them for the LexAudit pipeline.

**Data Source:** [STJ Dados Abertos](https://dadosabertos.web.stj.jus.br/)

**Purpose:** These acórdãos serve as our "golden corpus": high-quality legal documents whose citations are presumed correct. We use them for:
1. Training/validation of citation extraction (Part A)
2. Creating synthetic datasets with mutations (Part C)
3. Benchmarking the validation pipeline

In [1]:
%pip install PyMuPDF

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import required libraries
import requests
import json
from pathlib import Path
import pandas as pd
from datetime import datetime
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import re
import fitz  # PyMuPDF for PDF text extraction

## 1. Setup Data Directory Structure

We'll organize data in stages:
- `data/raw/stj/` - Original downloaded files
- `data/intermediate/stj/` - Partially processed data
- `data/cleaned/stj/` - Final cleaned datasets ready for use

In [3]:
# Define data directory structure
PROJECT_ROOT = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"

# Create directory structure
RAW_DIR = DATA_DIR / "raw" / "stj"
INTERMEDIATE_DIR = DATA_DIR / "intermediate" / "stj"
CLEANED_DIR = DATA_DIR / "cleaned" / "stj"

for directory in [RAW_DIR, INTERMEDIATE_DIR, CLEANED_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created/verified: {directory}")

✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj
✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\intermediate\stj
✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\cleaned\stj


## 2. Download STJ Acórdãos Data

Download the JSON file from STJ's open data portal. The file will only be downloaded if it doesn't already exist locally.
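
Because the download is skipped whenever the target file exists, an interrupted download could leave a truncated file that later runs treat as complete. One way to guard against this, shown here as a minimal sketch rather than as part of the notebook, is to stream into a temporary sibling file and rename it only after the last chunk is written. The helper name `write_atomically` and the `.part` suffix are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], output_path: Path) -> int:
    """Write chunks to '<name>.part', then rename it to output_path.

    Returns the number of bytes written. If anything fails midway, the
    partial file is removed and output_path is never created.
    """
    tmp_path = output_path.with_name(output_path.name + ".part")
    written = 0
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        # os.replace is atomic when both paths are on the same filesystem
        os.replace(tmp_path, output_path)
    except BaseException:
        tmp_path.unlink(missing_ok=True)
        raise
    return written


# Demo with in-memory chunks; in the notebook the chunks would come from
# response.iter_content(chunk_size=8192).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "acordaos_demo.json"
    size = write_atomically([b'[{"id": ', b'"1"}]'], target)
    print(size, target.read_bytes())
```

With this pattern, `output_path.exists()` only becomes true once the file is complete, so the existence check remains a safe cache test.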

In [4]:
# Configuration
STJ_URL = "https://dadosabertos.web.stj.jus.br/dataset/5ebbfe8a-05f3-4106-a160-794d91b740b8/resource/9cbc519d-b262-4894-8304-0c38d0f266ef/download/20220630.json"
FILE_DATE = "20220630"  # Snapshot date as it appears in the URL (YYYYMMDD)
OUTPUT_FILE = RAW_DIR / f"acordaos_{FILE_DATE}.json"

print(f"Target file: {OUTPUT_FILE}")
print(f"File exists: {OUTPUT_FILE.exists()}")

Target file: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj\acordaos_20220630.json
File exists: True


In [5]:
# Download function
def download_stj_data(url: str, output_path: Path) -> bool:
    """
    Download STJ acórdãos data from the given URL.
    
    Args:
        url: URL to download from
        output_path: Path where to save the file
    
    Returns:
        True if the file is available locally (already present or freshly
        downloaded), False if the download failed
    """
    if output_path.exists():
        print(f"   File already exists: {output_path}")
        print(f"   Size: {output_path.stat().st_size / (1024**2):.2f} MB")
        return True
    
    print(f"   Downloading from: {url}")

    try:
        # Stream the response so the whole file is never held in memory
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(output_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
    except (requests.RequestException, OSError) as e:
        print(f"   Download failed: {e}")
        # Remove any partial file so the next run retries the download
        output_path.unlink(missing_ok=True)
        return False

    print(f"   Downloaded to: {output_path}")
    print(f"   Size: {output_path.stat().st_size / (1024**2):.2f} MB")
    return True

# Execute download
download_successful = download_stj_data(STJ_URL, OUTPUT_FILE)

   File already exists: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj\acordaos_20220630.json
   Size: 0.37 MB


## 3. Load and Explore the Data

In [6]:
with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
    acordaos_data = json.load(f)
if isinstance(acordaos_data, list):
    print(f"   Number of acórdãos: {len(acordaos_data)}")


   Number of acórdãos: 86


In [7]:
# Explore structure of first acórdão
if acordaos_data:
    print("=" * 80)
    print("STRUCTURE OF FIRST ACÓRDÃO")
    print("=" * 80)
    
    if isinstance(acordaos_data, list) and len(acordaos_data) > 0:
        first_acordao = acordaos_data[0]
        
        print(f"\nType: {type(first_acordao)}")
        
        if isinstance(first_acordao, dict):
            print(f"\nAvailable fields ({len(first_acordao)} total):")
            for key in first_acordao.keys():
                value = first_acordao[key]
                value_type = type(value).__name__
                
                # Show preview of value
                if isinstance(value, str):
                    preview = value[:100] + "..." if len(value) > 100 else value
                    print(f"  • {key:30s} ({value_type:10s}): {preview}")
                elif isinstance(value, (list, dict)):
                    print(f"  • {key:30s} ({value_type:10s}): {len(value)} items")
                else:
                    print(f"  • {key:30s} ({value_type:10s}): {value}")
    
    elif isinstance(acordaos_data, dict):
        print(f"\nTop-level structure:")
        for key, value in acordaos_data.items():
            print(f"  • {key}: {type(value).__name__}")
            if isinstance(value, list):
                print(f"    └─ Length: {len(value)}")

STRUCTURE OF FIRST ACÓRDÃO

Type: <class 'dict'>

Available fields (20 total):
  • id                             (str       ): 000818587
  • numeroProcesso                 (str       ): 1950922
  • numeroRegistro                 (str       ): 202102414010
  • siglaClasse                    (str       ): AgInt nos EDcl nos EAREsp
  • descricaoClasse                (str       ): AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS EMBARGOS DE
DIVERGÊNCIA EM AGRAVO EM RECURSO ESPECIAL
  • nomeOrgaoJulgador              (str       ): CORTE ESPECIAL
  • ministroRelator                (str       ): JORGE MUSSI
  • dataPublicacao                 (str       ): DJE        DATA:24/06/2022
  • ementa                         (str       ): AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEFERIMENTO LIMINAR.
AUSÊNCIA DE JUNTADA DO INTEIRO TEOR ...
  • tipoDeDecisao                  (str       ): ACÓRDÃO
  • dataDecisao                    (str       ): 20220621
  • decisao                        (str       ): V

In [8]:
import textwrap

# Display a full sample acórdão (pretty printed with wrapped text)
if acordaos_data:
    print("=" * 80)
    print("FULL SAMPLE ACÓRDÃO (First Entry, wrapped)")
    print("=" * 80)
    
    if isinstance(acordaos_data, list) and len(acordaos_data) > 0:
        sample = acordaos_data[0]
        wrap_width = 100
        
        for key, val in sample.items():
            print(f"\n{key}:")
            if isinstance(val, str):
                # preserve existing line breaks, wrap each line separately
                wrapped = "\n".join(textwrap.fill(line, width=wrap_width) for line in val.splitlines())
                print(wrapped)
            elif isinstance(val, (list, dict)):
                # pretty-print structured values (no aggressive wrapping to preserve JSON structure)
                if not val:
                    print(val)
                else:
                    print(json.dumps(val, indent=2, ensure_ascii=False))
            else:
                print(val)
        print("\n" + "=" * 80)

FULL SAMPLE ACÓRDÃO (First Entry, wrapped)

id:
000818587

numeroProcesso:
1950922

numeroRegistro:
202102414010

siglaClasse:
AgInt nos EDcl nos EAREsp

descricaoClasse:
AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS EMBARGOS DE
DIVERGÊNCIA EM AGRAVO EM RECURSO ESPECIAL

nomeOrgaoJulgador:
CORTE ESPECIAL

ministroRelator:
JORGE MUSSI

dataPublicacao:
DJE        DATA:24/06/2022

ementa:
AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEFERIMENTO LIMINAR.
AUSÊNCIA DE JUNTADA DO INTEIRO TEOR DO ACÓRDÃO PARADIGMA. SÚMULA 315
DO STJ.  AGRAVO INTERNO DESPROVIDO.
1. Na esteira da jurisprudência desta Corte, a comprovação da
divergência pressupõe a apresentação de cópias do inteiro teor dos
acórdãos apontados como paradigmas pela parte recorrente.
2. No caso posto, a parte embargante deixou de instruir o recurso
com  a cópia do inteiro teor dos acórdãos, restando desatendidas as
exigências dos arts. 1.043 e 1.044 do CPC e dos arts. 266 a 267, do
RISTJ, para a configuração da suposta divergência pre

## 4. Create Initial DataFrame

In [9]:
# Convert to DataFrame
df_acordaos = pd.DataFrame(acordaos_data)

print(f"DataFrame created:")
print(f"  Shape: {df_acordaos.shape} (rows × columns)")
print(f"\nColumn names:")
for i, col in enumerate(df_acordaos.columns, 1):
    print(f"  {i:2d}. {col}")

print(f"\nDataFrame info:")
print(df_acordaos.info())

DataFrame created:
  Shape: (86, 20) (rows × columns)

Column names:
   1. id
   2. numeroProcesso
   3. numeroRegistro
   4. siglaClasse
   5. descricaoClasse
   6. nomeOrgaoJulgador
   7. ministroRelator
   8. dataPublicacao
   9. ementa
  10. tipoDeDecisao
  11. dataDecisao
  12. decisao
  13. jurisprudenciaCitada
  14. notas
  15. informacoesComplementares
  16. termosAuxiliares
  17. teseJuridica
  18. tema
  19. referenciasLegislativas
  20. acordaosSimilares

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   id                         86 non-null     object
 1   numeroProcesso             86 non-null     object
 2   numeroRegistro             86 non-null     object
 3   siglaClasse                86 non-null     object
 4   descricaoClasse            86 non-null     object
 5   nomeOrgaoJulgador   

In [10]:
# Display first few rows
if df_acordaos is not None:
    print("=" * 80)
    print("FIRST 3 ACÓRDÃOS")
    print("=" * 80)
    display(df_acordaos.head(3))

FIRST 3 ACÓRDÃOS


Unnamed: 0,id,numeroProcesso,numeroRegistro,siglaClasse,descricaoClasse,nomeOrgaoJulgador,ministroRelator,dataPublicacao,ementa,tipoDeDecisao,dataDecisao,decisao,jurisprudenciaCitada,notas,informacoesComplementares,termosAuxiliares,teseJuridica,tema,referenciasLegislativas,acordaosSimilares
0,818587,1950922,202102414010,AgInt nos EDcl nos EAREsp,AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS ...,CORTE ESPECIAL,JORGE MUSSI,DJE DATA:24/06/2022,AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEF...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]
1,818791,16694,202102270904,EDcl no AgInt na CR,EMBARGOS DE DECLARAÇÃO NO AGRAVO INTERNO NA CA...,CORTE ESPECIAL,HUMBERTO MARTINS,DJE DATA:27/06/2022,EMBARGOS DE DECLARAÇÃO. CARTA ROGATÓRIA. TEMPE...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]
2,818936,1787941,202002952022,AgInt no RE nos EDcl no AgInt no AREsp,AGRAVO INTERNO NO RECURSO EXTRAORDINÁRIO NOS E...,CORTE ESPECIAL,JORGE MUSSI,DJE DATA:23/06/2022,AGRAVO INTERNO. NEGATIVA DE SEGUIMENTO. RECURS...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]


## 5. Summary Statistics

In [11]:
# Summary statistics
print("=" * 80)
print("DATASET SUMMARY")
print("=" * 80)

print(f"\nTotal records: {len(df_acordaos):,}")

# Analyze columns
text_columns = [col for col in df_acordaos.columns if df_acordaos[col].dtype == 'object']
print(f"\nColumns with data ({len(text_columns)} text columns):")

for col in text_columns:
    non_null = df_acordaos[col].notna().sum()
    
    # Check if column contains lists
    if non_null > 0 and isinstance(df_acordaos[col].iloc[0], list):
        non_empty = df_acordaos[col].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()
        print(f"  {col:30s}: {non_empty:,}/{len(df_acordaos):,} non-empty ({non_empty/len(df_acordaos)*100:.1f}%)")
    else:
        print(f"  {col:30s}: {non_null:,}/{len(df_acordaos):,} filled ({non_null/len(df_acordaos)*100:.1f}%)")

# Missing/empty summary
missing = df_acordaos.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
    print(f"\nNull values found in {len(missing)} columns")

DATASET SUMMARY

Total records: 86

Columns with data (20 text columns):
  id                            : 86/86 filled (100.0%)
  numeroProcesso                : 86/86 filled (100.0%)
  numeroRegistro                : 86/86 filled (100.0%)
  siglaClasse                   : 86/86 filled (100.0%)
  descricaoClasse               : 86/86 filled (100.0%)
  nomeOrgaoJulgador             : 86/86 filled (100.0%)
  ministroRelator               : 86/86 filled (100.0%)
  dataPublicacao                : 86/86 filled (100.0%)
  ementa                        : 86/86 filled (100.0%)
  tipoDeDecisao                 : 86/86 filled (100.0%)
  dataDecisao                   : 86/86 filled (100.0%)
  decisao                       : 86/86 filled (100.0%)
  jurisprudenciaCitada          : 42/86 filled (48.8%)
  notas                         : 3/86 filled (3.5%)
  informacoesComplementares     : 18/86 filled (20.9%)
  termosAuxiliares              : 1/86 filled (1.2%)
  teseJuridica                  : 0/86 

## 6. Select 10 Diverse Samples

Greedily select 10 samples whose legal references overlap as little as possible, to ensure broad coverage of citation types.
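
The selection idea can be seen in isolation on toy data. The sketch below is a minimal, standalone version of the greedy loop: at each step it scores every remaining candidate by how many not-yet-covered reference keys it adds (weighted 10x) plus its total number of keys, and picks the highest score. The function name `greedy_diverse_selection` and the toy keys such as `CPC|1022` are hypothetical and only serve the illustration.

```python
def greedy_diverse_selection(candidates: dict[str, set[str]], k: int) -> list[str]:
    """Pick up to k candidate ids, favouring those that add unseen keys."""
    remaining = dict(candidates)
    covered: set[str] = set()
    selected: list[str] = []
    while remaining and len(selected) < k:
        # Score: number of new keys weighted 10x, plus the total number of keys
        best = max(
            remaining,
            key=lambda cid: len(remaining[cid] - covered) * 10 + len(remaining[cid]),
        )
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical toy data: candidate id -> set of reference keys
toy = {
    "a": {"CPC|1022", "CPC|1026"},
    "b": {"CPC|1022", "CF|5", "CF|93"},
    "c": {"CF|5"},
}
print(greedy_diverse_selection(toy, 2))  # → ['b', 'a']
```

Candidate `b` wins the first round with three new keys. In the second round `a` still contributes one unseen key while `c` contributes none, so `a` is chosen. This is a greedy heuristic, so the result is not guaranteed to be the globally most diverse subset.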

In [12]:
# Filter acórdãos with non-empty referenciasLegislativas
df_with_refs = df_acordaos[
    df_acordaos['referenciasLegislativas'].apply(lambda x: isinstance(x, list) and len(x) > 0)
].copy()
# Deduplicate acórdãos based on numeroRegistro
df_with_refs = df_with_refs.drop_duplicates(subset='numeroRegistro')

print(f"Found {len(df_with_refs):,} acórdãos with legal references\n")

# Examine reference structure first
print("Sample reference structure:")
sample_ref = df_with_refs.iloc[0]['referenciasLegislativas'][0]
print(json.dumps(sample_ref, indent=2, ensure_ascii=False))
print()

# Extract unique reference types
def extract_ref_keys(refs):
    """Extract unique reference identifiers from list of references."""
    keys = set()
    for ref in refs:
        if isinstance(ref, dict):
            # Create composite key from available fields
            leg = ref.get('legislacao', '')
            art = ref.get('artigo', '')
            key = f"{leg}|{art}" if leg or art else json.dumps(ref, sort_keys=True)
            keys.add(key)
        else:
            keys.add(str(ref))
    return keys

df_with_refs['ref_keys'] = df_with_refs['referenciasLegislativas'].apply(extract_ref_keys)
df_with_refs['num_refs'] = df_with_refs['referenciasLegislativas'].apply(len)

# Greedy selection: maximize diversity
selected_indices = []
covered_refs = set()

while len(selected_indices) < 10 and len(df_with_refs) > 0:
    df_with_refs['score'] = df_with_refs['ref_keys'].apply(
        lambda keys: len(keys - covered_refs) * 10 + len(keys)
    )
    
    best_idx = df_with_refs['score'].idxmax()
    selected_indices.append(best_idx)
    new_refs = df_with_refs.loc[best_idx, 'ref_keys'] - covered_refs
    covered_refs.update(new_refs)
    
    print(f"Sample {len(selected_indices)}: Added {len(new_refs)} new references (total: {len(covered_refs)})")
    
    df_with_refs = df_with_refs.drop(best_idx)

# Copy so that columns added later do not modify a view of df_acordaos
samples = df_acordaos.loc[selected_indices].copy()

print(f"\nSelected {len(samples)} samples covering {len(covered_refs)} unique references")
print(f"Average references per sample: {samples['referenciasLegislativas'].apply(len).mean():.1f}\n")

# Print all unique references found
print("All unique references:")
for i, ref in enumerate(sorted(covered_refs), 1):
    print(f"  {i}. {ref}")

Found 39 acórdãos with legal references

Sample reference structure:
"LEG:FED LEI:013105 ANO:2015\n*****  CPC-15    CÓDIGO DE PROCESSO CIVIL DE 2015\n        ART:01022 ART:01026 PAR:00002"

Sample 1: Added 14 new references (total: 14)
Sample 2: Added 9 new references (total: 23)
Sample 3: Added 7 new references (total: 30)
Sample 4: Added 5 new references (total: 35)
Sample 5: Added 5 new references (total: 40)
Sample 6: Added 4 new references (total: 44)
Sample 7: Added 4 new references (total: 48)
Sample 8: Added 3 new references (total: 51)
Sample 9: Added 3 new references (total: 54)
Sample 10: Added 3 new references (total: 57)

Selected 10 samples covering 57 unique references
Average references per sample: 6.3

All unique references:
  1. LEG:FED CFB:****** ANO:1988
*****  CF-1988    CONSTITUIÇÃO FEDERAL DE 1988
        ART:00005 INC:00012
  2. LEG:FED CFB:****** ANO:1988
*****  CF-1988    CONSTITUIÇÃO FEDERAL DE 1988
        ART:00005 INC:00033 LET:A LET:B INC:00034
        IN
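
The sample reference structure printed above is a plain multi-line string rather than a dict, so each whole string ends up as one diversity key. If finer-grained keys are wanted, the string could be split into its parts. The sketch below is one possible parser, with its assumptions stated up front: the layout (a header line of `KEY:VALUE` tokens, a `*****` line with a short alias and title, then lines of provision tokens) is inferred only from the two samples shown in this notebook, and the function name `parse_reference` is hypothetical.

```python
import re

# KEY:VALUE tokens such as LEG:FED, LEI:013105, ART:01022, CFB:******
TOKEN_RE = re.compile(r"([A-Z]+):(\S+)")


def parse_reference(raw: str) -> dict:
    """Split one referenciasLegislativas string into header, alias, title, provisions."""
    lines = raw.splitlines()
    header = dict(TOKEN_RE.findall(lines[0])) if lines else {}
    alias, title = "", ""
    provisions: list[tuple[str, str]] = []
    for line in lines[1:]:
        stripped = line.strip()
        if stripped.startswith("*****"):
            # Assumed layout: '*****  ALIAS    FULL TITLE'
            parts = stripped.lstrip("*").split(None, 1)
            alias = parts[0] if parts else ""
            title = parts[1].strip() if len(parts) > 1 else ""
        else:
            provisions.extend(TOKEN_RE.findall(stripped))
    return {"header": header, "alias": alias, "title": title, "provisions": provisions}


sample = (
    "LEG:FED LEI:013105 ANO:2015\n"
    "*****  CPC-15    CÓDIGO DE PROCESSO CIVIL DE 2015\n"
    "        ART:01022 ART:01026 PAR:00002"
)
parsed = parse_reference(sample)
print(parsed["header"])      # → {'LEG': 'FED', 'LEI': '013105', 'ANO': '2015'}
print(parsed["alias"])       # → CPC-15
print(parsed["provisions"])  # → [('ART', '01022'), ('ART', '01026'), ('PAR', '00002')]
```

Provisions are kept as an ordered list of pairs because tokens such as `PAR` or `INC` appear to qualify the `ART` that precedes them. Other layouts may exist in the full dataset, so the parser should be checked against more records before being relied on.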

In [13]:
import textwrap

# Display selected samples
wrap_width = 100

for idx, (i, row) in enumerate(samples.iterrows(), 1):
    print("=" * 80)
    print(f"SAMPLE {idx} - Process: {row.get('numeroProcesso', 'N/A')}")
    print("=" * 80)
    
    if 'ementa' in row and pd.notna(row['ementa']):
        print("\nDECISION:")
        wrapped = "\n".join(textwrap.fill(line, width=wrap_width) for line in str(row['ementa']).splitlines())
        print(wrapped)
    
    refs = row['referenciasLegislativas']
    print(f"\nLEGAL REFERENCES ({len(refs)} found):")
    for ref_idx, ref in enumerate(refs, 1):
        if isinstance(ref, dict):
            leg = ref.get('legislacao', 'N/A')
            art = ref.get('artigo', '')
            print(f"  {ref_idx}. {leg}" + (f" - {art}" if art else ""))
        else:
            print(f"  {ref_idx}. {ref}")
    print()

# Save samples
samples_file = INTERMEDIATE_DIR / "sample_10_diverse.json"
samples.to_json(samples_file, orient='records', force_ascii=False, indent=2)
print(f"Samples saved to: {samples_file}")

SAMPLE 1 - Process: 927

DECISION:
AÇÃO PENAL PROPOSTA CONTRA MEMBRO DE TRIBUNAL DE CONTAS ESTADUAL E
DE SUA ESPOSA. PRELIMINARES: CERCEAMENTO DE DEFESA POR OFENSA À
SÚMULA 14 DO STF; VIOLAÇÃO AO PRINCÍPIO DA ESPECIALIDADE NA
UTILIZAÇÃO DAS INFORMAÇÕES ENCAMINHADAS PELO VATICANO E PELAS
BAHAMAS POR MEIO DE COOPERAÇÃO INTERNACIONAL; INÉPCIA MATERIAL DA
DENÚNCIA; "VIOLAÇÃO DA CADEIA DE CUSTÓDIA DA PROVA". IMPROCEDÊNCIA,
NO CASO. DENÚNCIA PELA PRÁTICA DO CRIME DE "LAVAGEM" OU OCULTAÇÃO DE
BENS, DIREITOS E VALORES (LEI 9.613, DE 1998, ART. 1º, § 4º).
AFASTAMENTO JUSTIFICADO COM BASE NA GRAVIDADE DAS IMPUTAÇÕES. LOMAN,
ART. 29. DENÚNCIA RECEBIDA.
1.  Em matéria de cooperação jurídica internacional, o procedimento
seguido é o ditado pela legislação do Estado requerido. A utilização
da prova obtida é ampla, observadas eventuais restrições
expressamente formuladas pelo Estado requerido. Precedentes do STJ e
do STF.
2. Denúncia que descreve suficientemente a prática de movimentações
tendentes a

## 7. Download full texts of selected acórdãos

Download the full text (inteiro teor) of each selected acórdão as a PDF and extract its text with PyMuPDF.

In [16]:
# Download PDFs and extract full text
def download_and_extract_pdf(num_registro: str, dt_publicacao: str, output_dir: Path) -> str:
    """
    Download PDF from STJ and extract text.
    
    Args:
        num_registro: Registry number
        dt_publicacao: Publication date (format: DD/MM/YYYY)
        output_dir: Directory to save PDF
        
    Returns:
        Extracted text or empty string if failed
    """
    url = f"https://sv03.stj.jus.br/SCON/GetInteiroTeorDoAcordao?num_registro={num_registro}&dt_publicacao={dt_publicacao}"
    pdf_path = output_dir / f"{num_registro}.pdf"
    print(f"  Downloading PDF for {url}...")

    try:
        # Download PDF
        if not pdf_path.exists():
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(pdf_path, 'wb') as f:
                f.write(response.content)
            print(f"  ✓ Downloaded: {num_registro}")
        
        # Extract text
        # The context manager closes the document even if extraction fails
        with fitz.open(pdf_path) as doc:
            text = "".join(page.get_text() for page in doc)
        
        return text.strip()
    
    except Exception as e:
        print(f"  ✗ Error for {num_registro}: {str(e)}")
        return ""

def extract_date(date_str: str) -> str:
    """
    Extract the first DD/MM/YYYY date found in a string.
    
    Args:
        date_str: Input string containing a date,
            e.g. "DJE        DATA:10/06/2022"
    
    Returns:
        The date as a DD/MM/YYYY string
    
    Raises:
        ValueError: If no date is found in the string
    """
    match = re.search(r'(\d{2}/\d{2}/\d{4})', date_str)
    if match:
        return match.group(1)
    else:
        raise ValueError(f"Date not found in string: {date_str}")

# Create PDF directory
PDF_DIR = RAW_DIR / "pdfs"
PDF_DIR.mkdir(exist_ok=True)

# Download and extract for all samples
print("Downloading and extracting PDFs...\n")
samples['inteiroTeor'] = ""

for idx, (i, row) in enumerate(samples.iterrows(), 1):
    num_reg = row.get('numeroRegistro', '')
    try:
        dt_pub = extract_date(row.get('dataPublicacao', '') or '')
    except (ValueError, TypeError):
        dt_pub = ''  # No parseable date; reported by the else branch below

    if num_reg and dt_pub:
        print(f"Sample {idx}/{len(samples)}: {num_reg}")
        full_text = download_and_extract_pdf(num_reg, dt_pub, PDF_DIR)
        samples.at[i, 'inteiroTeor'] = full_text
        time.sleep(1)  # Be respectful to the server
    else:
        print(f"Sample {idx}/{len(samples)}: Missing registry or date")

print(f"\n✓ Completed. Text extracted from {(samples['inteiroTeor'].str.len() > 0).sum()}/{len(samples)} documents")

Downloading and extracting PDFs...

Sample 1/10: 201902237934
  Downloading PDF for https://sv03.stj.jus.br/SCON/GetInteiroTeorDoAcordao?num_registro=201902237934&dt_publicacao=10/06/2022...
  ✓ Downloaded: 201902237934
Sample 2/10: 201702135303
  Downloading PDF for https://sv03.stj.jus.br/SCON/GetInteiroTeorDoAcordao?num_registro=201702135303&dt_publicacao=01/06/2022...
  ✓ Downloaded: 201702135303
Sample 3/10: 202103354411
  Downloading PDF for https://sv03.stj.jus.br/SCON/GetInteiroTeorDoAcordao?num_registro=202103354411&dt_publicacao=03/06/2022...
  ✓ Downloaded: 202103354411
Sample 4/10: 20

In [18]:
# Save enriched dataset
enriched_file = CLEANED_DIR / "sample_10_with_fulltext.json"
samples.to_json(enriched_file, orient='records', force_ascii=False, indent=2)

print(f"Enriched dataset saved to: {enriched_file}")
print(f"\nDataset summary:")
print(f"  Total samples: {len(samples)}")
print(f"  With full text: {(samples['inteiroTeor'].str.len() > 0).sum()}")
print(f"  Average text length: {samples['inteiroTeor'].str.len().mean():.0f} chars")

Enriched dataset saved to: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\cleaned\stj\sample_10_with_fulltext.json

Dataset summary:
  Total samples: 10
  With full text: 10
  Average text length: 49004 chars
