# STJ Acórdãos Data Download and Exploration

This notebook downloads acórdãos (court decisions) from STJ's open data portal and organizes them for the LexAudit pipeline.

**Data Source:** [STJ Dados Abertos](https://dadosabertos.web.stj.jus.br/)

**Purpose:** These acórdãos will serve as our "golden corpus" - high-quality legal documents with presumably correct citations that we can use for:
1. Training/validation of citation extraction (Part A)
2. Creating synthetic datasets with mutations (Part C)
3. Benchmarking the validation pipeline

In [1]:
# Import required libraries
import requests
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

## 1. Setup Data Directory Structure

We'll organize data in stages:
- `data/raw/stj/` - Original downloaded files
- `data/intermediate/stj/` - Partially processed data
- `data/cleaned/stj/` - Final cleaned datasets ready for use

In [2]:
# Define data directory structure
PROJECT_ROOT = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"

# Create directory structure
RAW_DIR = DATA_DIR / "raw" / "stj"
INTERMEDIATE_DIR = DATA_DIR / "intermediate" / "stj"
CLEANED_DIR = DATA_DIR / "cleaned" / "stj"

for directory in [RAW_DIR, INTERMEDIATE_DIR, CLEANED_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created/verified: {directory}")

✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj
✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\intermediate\stj
✓ Created/verified: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\cleaned\stj


## 2. Download STJ Acórdãos Data

Download the JSON file from STJ's open data portal. The file will only be downloaded if it doesn't already exist locally.

In [3]:
# Configuration
STJ_URL = "https://dadosabertos.web.stj.jus.br/dataset/5ebbfe8a-05f3-4106-a160-794d91b740b8/resource/9cbc519d-b262-4894-8304-0c38d0f266ef/download/20220630.json"
FILE_DATE = "20220630"  # Date stamp copied by hand from the file name at the end of STJ_URL
OUTPUT_FILE = RAW_DIR / f"acordaos_{FILE_DATE}.json"

print(f"Target file: {OUTPUT_FILE}")
print(f"File exists: {OUTPUT_FILE.exists()}")

Target file: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj\acordaos_20220630.json
File exists: False
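The date stamp above is typed by hand, so it can drift out of sync if the URL is changed. As a minimal sketch, assuming the file name at the end of the URL is always a `YYYYMMDD` stamp followed by `.json`, the stamp could be derived from the URL instead. The helper name `file_date_from_url` is hypothetical and not part of the notebook.

```python
from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_date_from_url(url: str) -> str:
    """Return the YYYYMMDD stem of the URL's file name (hypothetical helper)."""
    # Take the last path component without its extension, e.g. "20220630"
    stem = PurePosixPath(urlparse(url).path).stem
    # Require exactly eight digits, then let strptime reject impossible dates
    if len(stem) != 8 or not stem.isdigit():
        raise ValueError(f"File name is not a YYYYMMDD stamp: {stem!r}")
    datetime.strptime(stem, "%Y%m%d")
    return stem


STJ_URL = (
    "https://dadosabertos.web.stj.jus.br/dataset/"
    "5ebbfe8a-05f3-4106-a160-794d91b740b8/resource/"
    "9cbc519d-b262-4894-8304-0c38d0f266ef/download/20220630.json"
)
print(file_date_from_url(STJ_URL))
```

Validating the stem keeps a renamed or restructured URL from silently producing a misleading output file name.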


In [5]:
# Download function
def download_stj_data(url: str, output_path: Path) -> bool:
    """
    Download STJ acórdãos data from the given URL.

    Args:
        url: URL to download from
        output_path: Path where to save the file

    Returns:
        True if the file is available locally (already present or
        downloaded successfully), False otherwise
    """
    if output_path.exists():
        print(f"⏭️  File already exists: {output_path}")
        print(f"   Size: {output_path.stat().st_size / (1024**2):.2f} MB")
        return True

    print(f"⬇️  Downloading from: {url}")
    print("   This may take a few minutes...")

    try:
        # The timeout keeps a stalled connection from hanging the notebook
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()

        # Get file size if available (0 means the server did not report it)
        total_size = int(response.headers.get('content-length', 0))

        report_every = 10 * 1024 * 1024  # Progress message every 10 MB
        next_report = report_every
        downloaded = 0

        with open(output_path, 'wb') as f:
            # Always stream in chunks so the whole file is never held in memory
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                downloaded += len(chunk)
                # Chunks can be shorter than chunk_size, so compare against a
                # threshold instead of testing for an exact multiple of 10 MB
                if downloaded >= next_report:
                    total = f" / {total_size / (1024**2):.1f} MB" if total_size else ""
                    print(f"   Downloaded: {downloaded / (1024**2):.1f} MB{total}")
                    next_report += report_every

        print(f"✅ Download complete: {output_path}")
        print(f"   Size: {output_path.stat().st_size / (1024**2):.2f} MB")
        return True

    except (requests.RequestException, OSError) as e:
        print(f"❌ Error downloading file: {e}")
        if output_path.exists():
            output_path.unlink()  # Remove partial file
        return False

# Execute download
download_successful = download_stj_data(STJ_URL, OUTPUT_FILE)

⏭️  File already exists: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj\acordaos_20220630.json
   Size: 0.37 MB
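The existence check treats any file at the target path as a finished download, so a file left behind by an interrupted run would be reused as-is. A minimal sketch of a stricter check follows, assuming the file is small enough to parse in full. `is_valid_json_file` is a hypothetical helper, not part of the notebook.

```python
import json
from pathlib import Path


def is_valid_json_file(path: Path) -> bool:
    """Return True if `path` exists and parses as JSON (hypothetical helper)."""
    if not path.is_file():
        return False
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # A truncated or mis-encoded file should not count as a cached download
        return False
    return True
```

A caller could delete the file and download it again whenever this check returns False.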


## 3. Load and Explore the Data

Now let's load the JSON file and explore its structure.

In [6]:
# Load the JSON data
if OUTPUT_FILE.exists():
    print(f"📂 Loading data from: {OUTPUT_FILE}")

    with open(OUTPUT_FILE, 'r', encoding='utf-8') as f:
        acordaos_data = json.load(f)

    print(f"✅ Data loaded successfully!")
    print(f"   Type: {type(acordaos_data)}")

    if isinstance(acordaos_data, list):
        print(f"   Number of acórdãos: {len(acordaos_data)}")
    elif isinstance(acordaos_data, dict):
        print(f"   Top-level keys: {list(acordaos_data.keys())}")
else:
    print("❌ File not found. Please run the download cell first.")
    acordaos_data = None

📂 Loading data from: d:\Stuff\Estudo\UNICAMP\IA368\final\LexAudit\data\raw\stj\acordaos_20220630.json
✅ Data loaded successfully!
   Type: <class 'list'>
   Number of acórdãos: 86


In [7]:
# Explore structure of first acórdão
if acordaos_data:
    print("=" * 80)
    print("STRUCTURE OF FIRST ACÓRDÃO")
    print("=" * 80)

    if isinstance(acordaos_data, list) and len(acordaos_data) > 0:
        first_acordao = acordaos_data[0]

        print(f"\nType: {type(first_acordao)}")

        if isinstance(first_acordao, dict):
            print(f"\nAvailable fields ({len(first_acordao)} total):")
            for key, value in first_acordao.items():
                value_type = type(value).__name__

                # Show preview of value
                if isinstance(value, str):
                    preview = value[:100] + "..." if len(value) > 100 else value
                    print(f"  • {key:30s} ({value_type:10s}): {preview}")
                elif isinstance(value, (list, dict)):
                    print(f"  • {key:30s} ({value_type:10s}): {len(value)} items")
                else:
                    print(f"  • {key:30s} ({value_type:10s}): {value}")

    elif isinstance(acordaos_data, dict):
        print("\nTop-level structure:")
        for key, value in acordaos_data.items():
            print(f"  • {key}: {type(value).__name__}")
            if isinstance(value, list):
                print(f"    └─ Length: {len(value)}")

STRUCTURE OF FIRST ACÓRDÃO

Type: <class 'dict'>

Available fields (20 total):
  • id                             (str       ): 000818587
  • numeroProcesso                 (str       ): 1950922
  • numeroRegistro                 (str       ): 202102414010
  • siglaClasse                    (str       ): AgInt nos EDcl nos EAREsp
  • descricaoClasse                (str       ): AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS EMBARGOS DE
DIVERGÊNCIA EM AGRAVO EM RECURSO ESPECIAL
  • nomeOrgaoJulgador              (str       ): CORTE ESPECIAL
  • ministroRelator                (str       ): JORGE MUSSI
  • dataPublicacao                 (str       ): DJE        DATA:24/06/2022
  • ementa                         (str       ): AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEFERIMENTO LIMINAR.
AUSÊNCIA DE JUNTADA DO INTEIRO TEOR ...
  • tipoDeDecisao                  (str       ): ACÓRDÃO
  • dataDecisao                    (str       ): 20220621
  • decisao      
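The two date fields in this record use different string formats: `dataDecisao` is a compact `YYYYMMDD` stamp, while `dataPublicacao` embeds the date after a `DATA:` marker. A minimal sketch of parsing both follows, assuming every record follows the formats seen in this first entry and that `24/06/2022` is day/month/year. The function names are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches the date that follows "DATA:" in values like "DJE        DATA:24/06/2022"
_PUBLICATION_DATE = re.compile(r"DATA:\s*(\d{2}/\d{2}/\d{4})")


def parse_data_decisao(raw: str) -> date:
    """Parse a compact stamp such as '20220621' (assumed YYYYMMDD)."""
    return datetime.strptime(raw.strip(), "%Y%m%d").date()


def parse_data_publicacao(raw: str) -> Optional[date]:
    """Extract the date after 'DATA:' (assumed DD/MM/YYYY), or None if absent."""
    match = _PUBLICATION_DATE.search(raw)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%m/%Y").date()


print(parse_data_decisao("20220621"))
print(parse_data_publicacao("DJE        DATA:24/06/2022"))
```

Returning `None` for an unrecognized publication string, rather than raising, leaves it to the caller to decide how to treat records that deviate from the sample.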

In [8]:
# Display a full sample acórdão (pretty printed)
if acordaos_data:
    print("=" * 80)
    print("FULL SAMPLE ACÓRDÃO (First Entry)")
    print("=" * 80)
    
    if isinstance(acordaos_data, list) and len(acordaos_data) > 0:
        print(json.dumps(acordaos_data[0], indent=2, ensure_ascii=False))

FULL SAMPLE ACÓRDÃO (First Entry)
{
  "id": "000818587",
  "numeroProcesso": "1950922",
  "numeroRegistro": "202102414010",
  "siglaClasse": "AgInt nos EDcl nos EAREsp",
  "descricaoClasse": "AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS EMBARGOS DE\nDIVERGÊNCIA EM AGRAVO EM RECURSO ESPECIAL",
  "nomeOrgaoJulgador": "CORTE ESPECIAL",
  "ministroRelator": "JORGE MUSSI",
  "dataPublicacao": "DJE        DATA:24/06/2022",
  "ementa": "AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEFERIMENTO LIMINAR.\nAUSÊNCIA DE JUNTADA DO INTEIRO TEOR DO ACÓRDÃO PARADIGMA. SÚMULA 315\nDO STJ.  AGRAVO INTERNO DESPROVIDO.\n1. Na esteira da jurisprudência desta Corte, a comprovação da\ndivergência pressupõe a apresentação de cópias do inteiro teor dos\nacórdãos apontados como paradigmas pela parte recorrente.\n2. No caso posto, a parte embargante deixou de instruir o recurso\ncom  a cópia do inteiro teor dos acórdãos, restando desatendidas as\nexigências dos arts. 1.043 e 1.044 do CPC e

## 4. Create Initial DataFrame

Convert to pandas DataFrame for easier exploration and analysis.

In [9]:
# Convert to DataFrame
if acordaos_data and isinstance(acordaos_data, list):
    df_acordaos = pd.DataFrame(acordaos_data)
    
    print(f"DataFrame created:")
    print(f"  Shape: {df_acordaos.shape} (rows × columns)")
    print("\nColumn names:")
    for i, col in enumerate(df_acordaos.columns, 1):
        print(f"  {i:2d}. {col}")

    print("\nDataFrame info:")
    df_acordaos.info()  # info() prints its report itself and returns None
else:
    print("Cannot create DataFrame - data not loaded or not in expected format")
    df_acordaos = None

DataFrame created:
  Shape: (86, 20) (rows × columns)

Column names:
   1. id
   2. numeroProcesso
   3. numeroRegistro
   4. siglaClasse
   5. descricaoClasse
   6. nomeOrgaoJulgador
   7. ministroRelator
   8. dataPublicacao
   9. ementa
  10. tipoDeDecisao
  11. dataDecisao
  12. decisao
  13. jurisprudenciaCitada
  14. notas
  15. informacoesComplementares
  16. termosAuxiliares
  17. teseJuridica
  18. tema
  19. referenciasLegislativas
  20. acordaosSimilares

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   id                         86 non-null     object
 1   numeroProcesso             86 non-null     object
 2   numeroRegistro             86 non-null     object
 3   siglaClasse                86 non-null     object
 4   descricaoClasse            86 non-null     object
 5   nomeOrgaoJulgador  

In [10]:
# Display first few rows
if df_acordaos is not None:
    print("=" * 80)
    print("FIRST 3 ACÓRDÃOS")
    print("=" * 80)
    display(df_acordaos.head(3))

FIRST 3 ACÓRDÃOS


Unnamed: 0,id,numeroProcesso,numeroRegistro,siglaClasse,descricaoClasse,nomeOrgaoJulgador,ministroRelator,dataPublicacao,ementa,tipoDeDecisao,dataDecisao,decisao,jurisprudenciaCitada,notas,informacoesComplementares,termosAuxiliares,teseJuridica,tema,referenciasLegislativas,acordaosSimilares
0,818587,1950922,202102414010,AgInt nos EDcl nos EAREsp,AGRAVO INTERNO NOS EMBARGOS DE DECLARAÇÃO NOS ...,CORTE ESPECIAL,JORGE MUSSI,DJE DATA:24/06/2022,AGRAVO INTERNO. EMBARGOS DE DIVERGÊNCIA. INDEF...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]
1,818791,16694,202102270904,EDcl no AgInt na CR,EMBARGOS DE DECLARAÇÃO NO AGRAVO INTERNO NA CA...,CORTE ESPECIAL,HUMBERTO MARTINS,DJE DATA:27/06/2022,EMBARGOS DE DECLARAÇÃO. CARTA ROGATÓRIA. TEMPE...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]
2,818936,1787941,202002952022,AgInt no RE nos EDcl no AgInt no AREsp,AGRAVO INTERNO NO RECURSO EXTRAORDINÁRIO NOS E...,CORTE ESPECIAL,JORGE MUSSI,DJE DATA:23/06/2022,AGRAVO INTERNO. NEGATIVA DE SEGUIMENTO. RECURS...,ACÓRDÃO,20220621,Vistos e relatados estes autos em que são part...,,,,,,,[],[]
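The `ementa` and `descricaoClasse` values printed earlier in this notebook contain hard line breaks (`\n`) and doubled spaces, and a break can fall in the middle of a citation (for example between `315` and `DO STJ`). A minimal sketch of a whitespace normalization step for the intermediate stage follows. `normalize_whitespace` is a hypothetical helper, and collapsing every break is an assumption that also discards paragraph boundaries.

```python
import re

# One or more whitespace characters of any kind, including line breaks
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


sample = "PARADIGMA. SÚMULA 315\nDO STJ.  AGRAVO INTERNO DESPROVIDO."
print(normalize_whitespace(sample))
```

If paragraph structure matters downstream, the original text should be kept alongside the normalized copy.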


## 5. Summary Statistics

Get an overview of the dataset to understand what we're working with.

In [11]:
# Summary statistics
if df_acordaos is not None:
    print("=" * 80)
    print("DATASET SUMMARY")
    print("=" * 80)
    
    print(f"\n📊 Basic Statistics:")
    print(f"  Total acórdãos: {len(df_acordaos):,}")

    # Object-dtype columns hold the text fields (potential citation sources);
    # list-valued columns also have object dtype, so they are included here
    text_columns = [col for col in df_acordaos.columns if df_acordaos[col].dtype == 'object']
    print(f"\n📝 Text columns ({len(text_columns)}):")
    for col in text_columns:
        non_null = df_acordaos[col].notna().sum()
        print(f"  • {col:30s}: {non_null:,} non-null ({non_null/len(df_acordaos)*100:.1f}%)")

    # Check for missing values
    print(f"\n❓ Missing values:")
    missing = df_acordaos.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if len(missing) > 0:
        for col, count in missing.items():
            print(f"  • {col:30s}: {count:,} ({count/len(df_acordaos)*100:.1f}%)")
    else:
        print("  ✅ No missing values!")

DATASET SUMMARY

📊 Basic Statistics:
  Total acórdãos: 86

📝 Text columns (20):
  • id                            : 86 non-null (100.0%)
  • numeroProcesso                : 86 non-null (100.0%)
  • numeroRegistro                : 86 non-null (100.0%)
  • siglaClasse                   : 86 non-null (100.0%)
  • descricaoClasse               : 86 non-null (100.0%)
  • nomeOrgaoJulgador             : 86 non-null (100.0%)
  • ministroRelator               : 86 non-null (100.0%)
  • dataPublicacao                : 86 non-null (100.0%)
  • ementa                        : 86 non-null (100.0%)
  • tipoDeDecisao                 : 86 non-null (100.0%)
  • dataDecisao                   : 86 non-null (100.0%)
  • decisao                       : 86 non-null (100.0%)
  • jurisprudenciaCitada          : 42 non-null (48.8%)
  • notas                         : 3 non-null (3.5%)
  • informacoesComplementares     : 18 non-null (20.9%)
  • termosAuxiliares          
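Non-null counts treat an empty list (such as the `[]` values shown for `referenciasLegislativas` and `acordaosSimilares` in the first rows) as present data. A minimal sketch of a stricter fill rate, computed on the raw list of records rather than the DataFrame, follows. The helper names are hypothetical, and treating blank strings and empty containers as missing is an assumption about what counts as usable content.

```python
from typing import Any, Dict, List


def is_effectively_empty(value: Any) -> bool:
    """Treat None, blank strings, and empty lists/dicts as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def effective_fill_rates(records: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return, per field, the percentage of records with usable content."""
    if not records:
        return {}
    # Collect field names in first-seen order across all records
    fields: List[str] = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    total = len(records)
    return {
        field: round(
            100 * sum(not is_effectively_empty(r.get(field)) for r in records) / total,
            1,
        )
        for field in fields
    }


toy_records = [
    {"ementa": "texto", "notas": None, "acordaosSimilares": []},
    {"ementa": "outro", "notas": "x", "acordaosSimilares": []},
]
print(effective_fill_rates(toy_records))
```

On the toy records, a field holding only empty lists reports a fill rate of zero even though every value is non-null.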