### TESTING NEW FUNCTIONS

In [24]:
import pandas as pd
import getout_of_text_3 as got3
from getout_of_text_3 import ScotusAnalysisTool, ScotusFilteredAnalysisTool
from langchain.chat_models import init_chat_model

# print version of getout_of_text_3
print('Running getout_of_text3 version:', got3.__version__)


Running getout_of_text3 version: 0.3.2


### Step 1. Read SCOTUS DIY Corpus

- saved as `loc_gov.json`

In [15]:
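# read the SCOTUS corpus (JSON Lines: one record per line, with 'filename' and 'content' fields)
df = pd.read_json("loc_gov.json", lines=True)

# 'key' is the first three characters after 'usrep' (used below as the volume key);
# 'subkey' is everything between 'usrep' and '.pdf' (used below as the case id)
df['key'] = df['filename'].apply(lambda x: x.split('usrep')[1][:3])
df['subkey'] = df['filename'].apply(lambda x: x.split('usrep')[1].split('.pdf')[0])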

# Create a dictionary to hold the DataFrame contents
df_dict = {}

for _, row in df.iterrows():
    if row['key'] not in df_dict:
        df_dict[row['key']] = {}
    df_dict[row['key']][row['subkey']] = row['content']

# format scotus data for getout_of_text_3, similar to COCA keyword results
db_dict_formatted = {}
for volume, cases in df_dict.items():
    # Create a DataFrame for each volume with case text
    case_data = []
    for case_id, case_text in cases.items():
        case_data.append({'case_id': case_id, 'text': case_text})
    db_dict_formatted[volume] = pd.DataFrame(case_data)
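
To make the key derivation above concrete, here is a minimal sketch applied to a single filename. The name `usrep329001.pdf` is a hypothetical example that assumes the pattern "literal `usrep`, three-digit volume, further digits, `.pdf`"; the actual filenames in `loc_gov.json` may differ.

```python
def split_usrep_filename(filename: str) -> tuple[str, str]:
    """Return (key, subkey) using the same string operations as the cell above."""
    tail = filename.split('usrep')[1]   # everything after the literal 'usrep'
    key = tail[:3]                      # first three characters (volume key)
    subkey = tail.split('.pdf')[0]      # text before the '.pdf' extension (case id)
    return key, subkey

# Hypothetical filename, for illustration only
example_key, example_subkey = split_usrep_filename('usrep329001.pdf')
print(example_key, example_subkey)
```

Note that a filename without `usrep` in it raises an `IndexError` with this approach, so it may be worth checking the column for stray entries before applying it.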


### Step 2. Load Langchain AWS Bedrock Model

This notebook uses the AWS Bedrock model `openai.gpt-oss-120b-1:0`, whose 128,000-token limit provides a large context window at a cost-effective price. We also prefer open models, to promote transparent and responsible AI. For more information, see:
- https://openai.com/open-models/
- https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-openai.html

#### Tools

- `search_tool` takes a keyword and an analysis type and returns a quick summary of that keyword across the SCOTUS corpus
- `filtered_tool` takes a pre-filtered corpus dict, so the researcher controls exactly what is passed to the model

In [16]:
model = init_chat_model(
    "openai.gpt-oss-120b-1:0",
    model_provider="bedrock_converse",
    credentials_profile_name="atn-developer",
    max_tokens=128000
)
# Assume you built db_dict_formatted (volume -> DataFrame with columns ['case_id','text'])
search_tool = ScotusAnalysisTool(model=model, db_dict_formatted=db_dict_formatted)
filtered_tool = ScotusFilteredAnalysisTool(model=model)

### Step 3. Using the `search_tool` for a keyword of interest, namely `bank`

- `bank` is a classic NLP benchmark keyword because it is ambiguous (e.g. financial bank vs. riverbank)
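
To see the ambiguity before handing anything to the model, a simple keyword-in-context pass can help. The sketch below is an illustration only and is not how `search_tool` or `got3.search_keyword_corpus` work internally. `keyword_in_context` is a hypothetical helper, and the toy corpus text is made up. Plain dicts of lists stand in for the per-volume DataFrames so the example runs without pandas; the function only indexes `frame['case_id']` and `frame['text']`, which also works on a DataFrame.

```python
import re

def keyword_in_context(db: dict, keyword: str, window: int = 30) -> list[dict]:
    """Collect short text windows around whole-word matches of `keyword`."""
    # \b anchors keep 'bank' from matching inside 'bankruptcy'
    pattern = re.compile(rf'\b{re.escape(keyword)}\b', flags=re.IGNORECASE)
    hits = []
    for volume, frame in db.items():
        for case_id, text in zip(frame['case_id'], frame['text']):
            for m in pattern.finditer(text):
                start = max(0, m.start() - window)
                end = min(len(text), m.end() + window)
                hits.append({'volume': volume, 'case_id': case_id,
                             'context': text[start:end]})
    return hits

# Toy corpus in the volume -> {'case_id', 'text'} shape; all text is invented
toy_db = {
    '000': {
        'case_id': ['000001', '000002', '000003'],
        'text': [
            'The bank refused to honor the check.',
            'The boat ran aground on the east bank of the river.',
            'The trustee in bankruptcy filed a claim.',
        ],
    }
}

hits = keyword_in_context(toy_db, 'bank')
for h in hits:
    print(h['case_id'], '|', h['context'])
```

Only the first two toy cases match, which shows both senses of the word side by side while leaving out `bankruptcy`.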

In [20]:
text_result = search_tool._run(keyword="bank", 
                               analysis_focus="general")


In [21]:
text_result

[{'type': 'reasoning_content',
  'reasoning_content': {'text': 'We need to provide insights on temporal evolution, contextual variation, notable intra-dataset patterns, interpretive themes relevant to ordinary meaning of "bank". Use only provided data. Summarize trends across volumes (likely chronological). Identify contexts: bank as financial institution, bank of river, bank in bankruptcy, bank robbery, bank accounts, Bank of United States, bank holidays, etc. We can note that early volumes (329-335) have many citations to early 19th century cases, focus on banks in reorganization, banks of rivers, bank of the United States etc. Later volumes (350-380) have lots about bank robbery statutes, bank-ruptcy Act, etc. Also see patterns: repetitive citations of Mullane v. Central Hanover Bank, Osborn v. Bank of United States, etc. Also usage of "bank" as adjective (Bank of America, Bank Holiday, Bank-ruptcy). Provide themes: ordinary meaning includes financial institution, river bank, storag

In [22]:
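# export_blocks_reasoning_first is defined in the helper functions section below; run that cell first
export_blocks_reasoning_first(text_result, 'bank')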

Markdown (reasoning first) written: bank_reasoning_first_stream.md | length=16763 chars


'bank_reasoning_first_stream.md'

### Step 4. Using the `filtered_tool` for a keyword of interest, namely `dictionary`

- to see how SCOTUS references dictionaries

In [29]:
# After you have filtered JSON:
filtered_json = got3.search_keyword_corpus("dictionary", 
                                           db_dict_formatted, 
                                           output="json",
                                           parallel=True)
analysis = filtered_tool._run(
    keyword="dictionary",
    results_json=filtered_json,
    extraction_strategy="all",
    #max_contexts=60, # to filter results
    return_json=True,
    debug=True
)

[DEBUG] raw_chars=58786 extracted_chars=54143 reduction_ratio=0.921 raw≈14696tok extracted≈13536tok strategy=all limit=None
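
The figures in the debug line can be reproduced by hand. The sketch below recomputes them; the four-characters-per-token rule is an assumption inferred from the printed numbers, not something confirmed from the library source.

```python
raw_chars = 58786
extracted_chars = 54143

# Share of the raw text that survives extraction
reduction_ratio = round(extracted_chars / raw_chars, 3)

# Assumed heuristic: roughly 4 characters per token
raw_tokens = round(raw_chars / 4)
extracted_tokens = round(extracted_chars / 4)

print(reduction_ratio, raw_tokens, extracted_tokens)  # 0.921 14696 13536
```

Read this way, a `reduction_ratio` of 0.921 means the extracted text is about 92% of the raw text, so the `all` strategy trims only about 8% of the input here.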


In [30]:
analysis

{'keyword': 'dictionary',
 'total_contexts': 641,
 'occurrences_summary': '641 context snippet(s) across 641 case(s)',
 'reasoning_content': ['Model did not return valid JSON; wrapped raw text.',
  'Multiple contexts allow limited comparative analysis.'],
 'summary': '{\'type\': \'reasoning_content\', \'reasoning_content\': {\'text\': \'The task: provide JSON with fields. Need to analyze usage patterns of keyword "dictionary" across provided contexts (641 cases). Provide occurrences_summary: likely counts? We have total cases 641, but occurrences per case variable. Need to summarize: appears in legal contexts, definitions, statutes, dictionary acts. Provide reasoning steps. Provide summary. Limitations: only provided contexts, no external. Let\\\'s craft.\\n\\nCompute total occurrences? Not given exact count; maybe approximate. Could count appearances in snippet list: many lines. Might state "occurs in majority of cases, over 500". We can approximate: appears in virtually all samples; 
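
The `reasoning_content` note above says the model reply could not be parsed as JSON, and the `summary` field appears to hold a stringified reasoning block instead. One possible workaround outside the tool is to join only the `text` blocks of a block-list reply before parsing. The sketch below is a hypothetical helper (`json_from_blocks` is not part of `getout_of_text_3`), and it assumes the reply has the same block shapes as the `bank` output shown earlier.

```python
import json

def json_from_blocks(blocks):
    """Join the 'text' blocks of a model reply and try to parse them as JSON.

    Returns the parsed object, or None if no JSON object can be decoded.
    """
    text = ''.join(
        b['text'] for b in blocks
        if isinstance(b, dict) and b.get('type') == 'text' and isinstance(b.get('text'), str)
    )
    # Tolerate prose or code fences around the JSON object
    start, end = text.find('{'), text.rfind('}')
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Made-up reply, for illustration only
example_blocks = [
    {'type': 'reasoning_content', 'reasoning_content': {'text': 'Plan the answer.'}},
    {'type': 'text', 'text': 'Here is the result:\n{"keyword": "dictionary", "summary": "..."}'},
]
parsed = json_from_blocks(example_blocks)
print(parsed)
```

Skipping the reasoning blocks matters because they often contain braces and quoted fragments that would confuse a naive "first `{` to last `}`" search.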

### Helper functions for exporting to markdown

In [33]:
def export_markdown_reasoning_first(result, keyword: str, filename: str | None = None):
    """Export a markdown report with REASONING CONTENT first, then the rest.

    Ordering Rules:
    1. # reasoning content  (aggregated reasoning_content list or string)
    2. Remaining known sections in this order if present: summary, occurrences_summary, limitations, total_contexts.
    3. Any other string fields in the dict are appended at the end under a generic heading.

    Parameters
    ----------
    result : dict | str | list
        Output from the SCOTUS analysis tool (JSON mode recommended for richer structure).
    keyword : str
        Used to build filename (sanitized) if filename not provided.
    filename : str | None
        Optional explicit filename. If None, auto-generated.
    """
    import json

    def _sanitize(name: str) -> str:
        cleaned = ''.join(c if (c.isalnum() or c in ('-','_')) else '_' for c in name.strip())
        return cleaned or 'analysis'

    reasoning_block = ''
    body_sections = []

    if isinstance(result, dict):
        rc = result.get('reasoning_content')
        if isinstance(rc, list):
            reasoning_block = '\n'.join(str(x) for x in rc if str(x).strip())
        elif isinstance(rc, str):
            reasoning_block = rc.strip()
        elif isinstance(rc, dict) and 'text' in rc:
            # nested style: {'text': '...'}
            reasoning_block = str(rc['text']).strip()
        # Collect ordered sections
        ordered_keys = ['summary', 'occurrences_summary', 'limitations', 'total_contexts']
        used = set(['reasoning_content'])
        for k in ordered_keys:
            if k in result and k not in used:
                val = result[k]
                if isinstance(val, (str, int, float)):
                    body_sections.append(f"## {k}\n\n{val}\n")
                else:
                    body_sections.append(f"## {k}\n\n{json.dumps(val, indent=2)}\n")
                used.add(k)
        # Append any other simple string fields not yet used
        for k, v in result.items():
            if k in used:
                continue
            if isinstance(v, str) and v.strip():
                body_sections.append(f"## {k}\n\n{v.strip()}\n")
            elif isinstance(v, (int, float)):
                body_sections.append(f"## {k}\n\n{v}\n")
    elif isinstance(result, list):
        # Try to extract reasoning blocks and remainder
        reasoning_parts = []
        other_parts = []
        for block in result:
            if isinstance(block, dict):
                rc = block.get('reasoning_content')
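                if isinstance(rc, list):
                    reasoning_parts.extend(str(x) for x in rc if str(x).strip())
                elif isinstance(rc, str) and rc.strip():
                    reasoning_parts.append(rc.strip())
                elif isinstance(rc, dict) and str(rc.get('text', '')).strip():
                    # nested block style: {'reasoning_content': {'text': '...'}}
                    reasoning_parts.append(str(rc['text']).strip())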
                if 'text' in block and isinstance(block['text'], str):
                    other_parts.append(block['text'])
            elif isinstance(block, str):
                other_parts.append(block)
        reasoning_block = '\n'.join(reasoning_parts)
        if other_parts:
            body_sections.append("## response\n\n" + '\n\n'.join(other_parts))
    else:  # plain string
        reasoning_block = ''
        body_sections.append(f"## response\n\n{str(result)}")

    # Fallback if no reasoning extracted
    if not reasoning_block:
        reasoning_block = '(No explicit reasoning_content found in result)'

    report = ["# reasoning content\n", "```text\n", reasoning_block, "\n```\n\n"]
    report.extend(body_sections)

    safe_keyword = _sanitize(keyword)
    outname = filename or f"{safe_keyword}_reasoning_first.md"
    with open(outname, 'w', encoding='utf-8') as f:
        f.write('\n'.join(report))
    print(f"Markdown report written: {outname} (length={sum(len(x) for x in report)} chars)")
    return outname

# Example export for the filtered JSON analysis (only if 'analysis' exists)
try:
    if 'analysis' in globals():
        export_markdown_reasoning_first(analysis, 'ordinary_meaning')
except Exception as e:
    print(f"Export failed: {e}")

Markdown report written: ordinary_meaning_reasoning_first.md (length=3422 chars)
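
The output filename is built from the keyword via `_sanitize`, which keeps alphanumerics, `-`, and `_` and replaces every other character with `_`. A standalone copy of that helper (renamed `sanitize` here so it runs on its own) shows the effect on a few made-up keywords:

```python
def sanitize(name: str) -> str:
    """Standalone copy of the notebook's `_sanitize` helper."""
    cleaned = ''.join(c if (c.isalnum() or c in ('-', '_')) else '_' for c in name.strip())
    return cleaned or 'analysis'

for kw in ['ordinary meaning', 'Bank of U.S.', '   ']:
    print(repr(kw), '->', sanitize(kw) + '_reasoning_first.md')
```

A keyword that is empty after stripping whitespace falls back to `analysis`, so the export always has a usable filename.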


In [34]:
# Utility: export a LangChain / Bedrock style streamed block list (reasoning + text) to markdown
from datetime import datetime

def export_blocks_reasoning_first(blocks, keyword: str, filename: str | None = None):
    """Export a list of model blocks (each a dict with 'type' and maybe 'reasoning_content' or 'text')
    to a markdown report where the reasoning content appears FIRST.

    Expected block shapes (any others are stringified):
      {'type': 'reasoning_content', 'reasoning_content': {'text': '...'}}
      {'type': 'text', 'text': '...'}

    Ordering:
      1. # reasoning content (aggregate all reasoning text blocks in order)
      2. ## response (concatenate all text blocks)

    Each section fenced appropriately for readability. Returns output filepath.
    """
    def _sanitize(name: str) -> str:
        return ''.join(c if (c.isalnum() or c in ('-','_')) else '_' for c in name.strip()) or 'analysis'

    reasoning_parts = []
    text_parts = []
    for b in blocks:
        if not isinstance(b, dict):
            text_parts.append(str(b))
            continue
        b_type = b.get('type')
        if b_type == 'reasoning_content':
            rc = b.get('reasoning_content')
            if isinstance(rc, dict) and 'text' in rc and rc['text']:
                reasoning_parts.append(str(rc['text']))
            elif isinstance(rc, str):
                reasoning_parts.append(rc)
        elif b_type == 'text':
            t = b.get('text')
            if isinstance(t, str):
                text_parts.append(t)
        else:  # fallback
            # include unknown block types in response section for transparency
            text_parts.append(str(b))

    reasoning_block = '\n\n'.join(p.strip() for p in reasoning_parts if p and p.strip())
    if not reasoning_block:
        reasoning_block = '(No reasoning_content blocks found)'
    response_block = '\n\n'.join(p.strip() for p in text_parts if p and p.strip()) or '(No text blocks found)'

    safe_keyword = _sanitize(keyword)
    outname = filename or f"{safe_keyword}_reasoning_first_stream.md"

    lines = []
    lines.append(f"# reasoning content\n")
    lines.append("```text\n")
    lines.append(reasoning_block)
    lines.append("\n```\n\n")
    lines.append("## response\n\n")
    lines.append(response_block)
    lines.append("\n\n---\n")
    lines.append(f"_generated: {datetime.utcnow().isoformat()}Z | keyword: {keyword}_")

    with open(outname, 'w', encoding='utf-8') as f:
        f.write(''.join(lines))
    print(f"Markdown (reasoning first) written: {outname} | length={sum(len(x) for x in lines)} chars")
    return outname

# Example using the provided structure (assign to variable `blocks_example` before calling if not already)
try:
    if 'blocks_example' in globals():
        export_blocks_reasoning_first(blocks_example, 'textualism')
except Exception as e:
    print('Export failed:', e)


In [35]:
export_blocks_reasoning_first(text_result, 'textualism')

Markdown (reasoning first) written: textualism_reasoning_first_stream.md | length=16769 chars


'textualism_reasoning_first_stream.md'