# AALM — Data Preparation with semchunk
Build an Administrative Law–focused dataset from the Open Australian Legal Corpus using `semchunk`.

This notebook filters the corpus for Administrative Law material by keyword (tribunals, judicial review concepts, FOI, etc.),
chunks the matching texts with `semchunk`, and saves the chunks as a plain-text dataset (a `text` column plus source metadata: `citation`, `jurisdiction`, `type`, `url`) for SFT/QLoRA.

In [1]:
%pip -q install -U datasets semchunk transformers tiktoken

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os, re, math, itertools
from typing import Dict, Any, List
from datasets import load_dataset, Dataset
import semchunk
from transformers import AutoTokenizer

CORPUS_DATASET = os.environ.get('CORPUS_DATASET', 'isaacus/open-australian-legal-corpus')
CORPUS_SPLIT = os.environ.get('CORPUS_SPLIT', 'corpus')
OUTPUT_DIR = os.environ.get('OUTPUT_DIR', 'data/aalm-adminlaw-semchunk')
BASE_TOKENIZER = os.environ.get('BASE_TOKENIZER', 'openai/gpt-oss-20b')  # used for token counting
CHUNK_SIZE = int(os.environ.get('CHUNK_SIZE', '1024'))
OVERLAP = float(os.environ.get('OVERLAP', '0.2'))
DOC_LIMIT = int(os.environ.get('DOC_LIMIT', '0'))  # 0 = no explicit cap; use for quick dry runs

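# Broad Administrative Law signal via keywords (case-insensitive).
# Matching is by substring, so short entries also hit longer words
# (e.g. 'minister' matches 'administered', 'visa' matches 'advisable').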
ADMIN_KEYWORDS = [
    'administrative appeals tribunal', 'aat', 'administrative decisions tribunal',
    'civil and administrative tribunal', 'ncat', 'vcat', 'qcat', 'acat',
    'merits review', 'judicial review', 'procedural fairness', 'natural justice',
    'jurisdictional error', 'wednesbury', 'unreasonableness',
    'freedom of information', 'foi', 'ombudsman',
    'delegate', 'delegated legislation', 'minister', 'review of decision',
    'administrative arrangement', 'administrative review tribunal', 'visa'
]
KW = re.compile('|'.join(re.escape(k) for k in ADMIN_KEYWORDS), flags=re.I)

def is_admin_law(record: Dict[str, Any]) -> bool:
    citation = (record.get('citation') or '')
    text = (record.get('text') or '')
    return bool(KW.search(citation + '\n' + text))

tokenizer = AutoTokenizer.from_pretrained(BASE_TOKENIZER, use_fast=True)
chunker = semchunk.chunkerify(tokenizer, chunk_size=min(CHUNK_SIZE, getattr(tokenizer, 'model_max_length', CHUNK_SIZE)))
print('Using tokenizer:', BASE_TOKENIZER)
print('Chunk size:', CHUNK_SIZE, 'Overlap:', OVERLAP)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Using tokenizer: openai/gpt-oss-20b
Chunk size: 1024 Overlap: 0.2
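
The `KW` pattern matches keywords anywhere inside a word, which keeps recall high but lets short entries fire on unrelated text. As a minimal sketch of a stricter alternative, the pattern can be wrapped in `\b` word boundaries. The names `SAMPLE_KEYWORDS`, `substring_kw`, `bounded_kw`, and `matches` are hypothetical and exist only for this illustration; the keyword subset is not the full list used above.

```python
import re

# Hypothetical subset of the keyword list, for illustration only.
SAMPLE_KEYWORDS = ['minister', 'aat', 'foi', 'visa', 'judicial review']

# Substring pattern, built the same way as KW in the cell above.
substring_kw = re.compile(
    '|'.join(re.escape(k) for k in SAMPLE_KEYWORDS),
    flags=re.I,
)

# Word-bounded variant: a keyword must start and end at a word boundary.
bounded_kw = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in SAMPLE_KEYWORDS) + r')\b',
    flags=re.I,
)

def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern occurs anywhere in the text."""
    return bool(pattern.search(text))

incidental = 'This proclamation is administered in the Department of Justice.'
genuine = 'The applicant sought judicial review of the decision.'

print(matches(substring_kw, incidental), matches(bounded_kw, incidental))
print(matches(substring_kw, genuine), matches(bounded_kw, genuine))
```

The trade-off is that a bounded pattern no longer matches inflected forms such as plurals unless they are listed explicitly, so whether it suits this dataset depends on how much noise the broader filter is allowed to admit.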


## Load and filter the Corpus

In [3]:
USE_STREAMING = os.environ.get('USE_STREAMING', 'True').lower() in ('true', '1', 'yes')
SAMPLE_SIZE = int(os.environ.get('SAMPLE_SIZE', '1000'))  # 0 = process all

if USE_STREAMING:
    corpus = load_dataset(CORPUS_DATASET, split=CORPUS_SPLIT, streaming=True)
    print('Using streaming mode (no total count available)')
else:
    corpus = load_dataset(CORPUS_DATASET, split=CORPUS_SPLIT, keep_in_memory=False)
    print('Total documents in split:', len(corpus))

admin_docs = []
count = 0
processed = 0
for ex in corpus:
    # Check the cap before counting, so `processed` reports documents actually examined.
    if SAMPLE_SIZE and processed >= SAMPLE_SIZE:
        print(f'Reached sample limit of {SAMPLE_SIZE} documents')
        break
    processed += 1

    if not ex.get('text'):
        continue
    if is_admin_law(ex):
        admin_docs.append(ex)
        count += 1
        if DOC_LIMIT and count >= DOC_LIMIT:
            break

print(f'Processed {processed} documents')
print('Matched Administrative Law docs:', len(admin_docs))

Using streaming mode (no total count available)
Reached sample limit of 1000 documents
Processed 1000 documents
Matched Administrative Law docs: 670
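
A match rate this high suggests that a few broad keywords may account for most hits. One way to check, sketched here with a hypothetical helper `keyword_hits` and made-up sample texts rather than the real corpus, is to count how many documents each keyword matches on its own:

```python
from collections import Counter
from typing import Iterable, List

def keyword_hits(texts: Iterable[str], keywords: List[str]) -> Counter:
    """Count, per keyword, how many texts contain it (case-insensitive substring)."""
    hits: Counter = Counter()
    for text in texts:
        lowered = text.lower()
        for kw in keywords:
            if kw in lowered:
                hits[kw] += 1
    return hits

# Made-up examples; in the notebook this would be the `text` of each matched document.
sample_texts = [
    'This proclamation is administered in the Department of Justice.',
    'The Tribunal found a denial of procedural fairness.',
    'Minister for Justice',
]
hits = keyword_hits(sample_texts, ['minister', 'procedural fairness', 'ombudsman'])

for keyword, n in hits.most_common():
    print(f'{keyword}: {n}')
```

Keywords that dominate the counts while adding little Administrative Law content are candidates for removal or tighter matching.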


## Chunk texts with semchunk

In [4]:
texts: List[str] = []
citations: List[str] = []
jurisdictions: List[str] = []
types: List[str] = []
urls: List[str] = []

for ex in admin_docs:
    chunks = chunker(ex['text'], overlap=OVERLAP)
    n = len(chunks)
    texts.extend(chunks)
    citations.extend([ex.get('citation') or ''] * n)
    jurisdictions.extend([ex.get('jurisdiction') or ''] * n)
    types.extend([ex.get('type') or ''] * n)
    urls.extend([ex.get('url') or ''] * n)

print('Total chunks:', len(texts))


Total chunks: 12643
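
To make the `OVERLAP` setting concrete, the sketch below converts a ratio into a token count and applies it to plain fixed-size windows. It is a simplified illustration, not semchunk's algorithm: semchunk splits on semantic boundaries, whereas this slides a fixed window over a token list. It assumes that an overlap below 1 is treated as a proportion of the chunk size and floored to whole tokens; `overlap_tokens` and `sliding_windows` are hypothetical helper names.

```python
import math
from typing import List, Sequence

def overlap_tokens(chunk_size: int, overlap: float) -> int:
    """Convert an overlap setting into a token count.

    Assumption: values below 1 are a ratio of chunk_size (floored);
    values of 1 or more are absolute token counts, capped below chunk_size.
    """
    if overlap < 1:
        return math.floor(chunk_size * overlap)
    return min(int(overlap), chunk_size - 1)

def sliding_windows(tokens: Sequence[str], chunk_size: int, overlap: float) -> List[List[str]]:
    """Fixed-size windows in which consecutive windows share `overlap` tokens."""
    shared = overlap_tokens(chunk_size, overlap)
    stride = chunk_size - shared
    windows: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

# With the notebook's settings, consecutive chunks would share this many tokens.
print(overlap_tokens(1024, 0.2))

# Tiny example: 12 tokens, windows of 5, ratio 0.2 (one shared token).
tokens = [f't{i}' for i in range(12)]
windows = sliding_windows(tokens, chunk_size=5, overlap=0.2)
for w in windows:
    print(w)
```

Because overlapping chunks repeat text, the total token count of the saved dataset will be larger than that of the source documents.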


## Save dataset to disk
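
The columns are built as parallel lists, so every list must have the same length before the dataset is constructed. A quick check like the following sketch can catch a metadata list drifting out of step with `texts`. The helper `check_columns` is hypothetical and uses only the standard library.

```python
from typing import Dict, List

def check_columns(columns: Dict[str, List[str]]) -> int:
    """Return the shared row count, or raise ValueError if column lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)

# Made-up example with two chunks from the same source document.
rows = check_columns({
    'text': ['first chunk', 'second chunk'],
    'citation': ['Example Act 2000', 'Example Act 2000'],
})
print('Rows:', rows)
```

In the notebook, the same dictionary passed to `Dataset.from_dict` could be checked first.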

In [5]:
out = Dataset.from_dict({
    'text': texts,
    'citation': citations,
    'jurisdiction': jurisdictions,
    'type': types,
    'url': urls,
})
os.makedirs(OUTPUT_DIR, exist_ok=True)
out.save_to_disk(OUTPUT_DIR)
print('Saved to', OUTPUT_DIR)
out[:2]


Saved to data/aalm-adminlaw-semchunk


{'text': ["Proclamation under the Commonwealth Powers (De Facto Relationships) Act 2006\n\nI, the Governor in and over the State of Tasmania and its Dependencies in the Commonwealth of Australia, acting with the advice of the Executive Council, by this my proclamation made under section 2 of the Commonwealth Powers (De Facto Relationships) Act 2006 fix 8 October 2008 as the day on which that Act commences.\n\n29 September 2008\n\nPETER G. UNDERWOOD\n\nGovernor\n\nBy His Excellency's Command,\n\nLARA GIDDINGS\n\nMinister for Justice\n\nDisplayed and numbered in accordance with the Rules Publication Act 1953.\n\nNotified in the Gazette on 8 October 2008\n\nThis proclamation is administered in the Department of Justice.",
  "Local Government Order 2004\n\nI make the following order under section 137(1)(b) of the Local Government Act 1993 .\n\n21 September 2004\n\nJ. G. COX\n\nMinister Assisting the Premier on Local Government\n\n1. Short title\n    This order may be cited as the Local Gov