## Library of Congress: Supreme Court Database extracted from PDF files

If you have a new dataset, the sample steps below show how to prepare it for analysis with `got3`. Specifically, they set up a dictionary of DataFrames with text columns to be used for keyword search and embedding generation.

1. Run a keyword search over the loc.gov SCOTUS PDF text stored in `loc_gov.json`
2. Use the filtered results to create a list of text snippets to embed as documents
3. Use `got3.embedding` to rank those snippets against a query built from terms in the documents (e.g. references to the Supreme Court) for similarity analysis

In [1]:
import pandas as pd
import getout_of_text_3 as got3

In [2]:
# read pdf scotus files
df = pd.read_json("loc_gov.json", lines=True)

df['key'] = df['filename'].apply(lambda x: x.split('usrep')[1][:3])
df['subkey'] = df['filename'].apply(lambda x: x.split('usrep')[1].split('.pdf')[0])

# Create a dictionary to hold the DataFrame contents
df_dict = {}

for _, row in df.iterrows():
    if row['key'] not in df_dict:
        df_dict[row['key']] = {}
    df_dict[row['key']][row['subkey']] = row['content']

# format scotus data for getout_of_text_3, similar to COCA keyword results
db_dict_formatted = {}
for volume, cases in df_dict.items():
    # Create a DataFrame for each volume with case text
    case_data = []
    for case_id, case_text in cases.items():
        case_data.append({'case_id': case_id, 'text': case_text})
    db_dict_formatted[volume] = pd.DataFrame(case_data)
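
The `key`/`subkey` parsing above can be checked on a single filename. This is a minimal sketch, assuming filenames look like `usrep553507.pdf` (the pattern seen in the loc.gov URLs in this notebook) and that the volume is always the first three digits; `parse_usrep_filename` is a hypothetical helper, not part of `got3`.

```python
def parse_usrep_filename(filename):
    """Split a name like 'usrep553507.pdf' into (volume, case_id).

    Assumes a three-digit volume prefix, mirroring the [:3] slice
    used for df['key'] in the notebook cell.
    """
    stem = filename.split("usrep")[1].split(".pdf")[0]
    return stem[:3], stem


volume, case_id = parse_usrep_filename("usrep553507.pdf")
print(volume, case_id)  # 553 553507
```

If the `filename` column held a full path with several `usrep` segments, `split('usrep')[1]` would pick up the wrong piece, so it is worth checking a few rows before building `df_dict`.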


In [3]:
print(f"Dictionary contains {len(df_dict)} keys, each with {len(df_dict[next(iter(df_dict))])} sub-keys.")
total_token_count = sum(len(content.split()) for subdict in df_dict.values() for content in subdict.values())
print(f"Total token count across all documents: {total_token_count}")

# hits for 'dictionary' 
dictionary_hits = sum(content.lower().count("dictionary") for subdict in df_dict.values() for content in subdict.values())
print(f"Total 'dictionary' hits across all documents: {dictionary_hits}")

Dictionary contains 242 keys, each with 20 sub-keys.
Total token count across all documents: 63175725
Total 'dictionary' hits across all documents: 1681
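
The summary cell reports the sub-key count of the first volume only, so a per-volume breakdown can be a useful cross-check. The sketch below uses a toy stand-in with the same `{volume: {case_id: text}}` shape as `df_dict`; the data and the helper names `cases_per_volume` and `hits_per_volume` are made up for illustration.

```python
from collections import Counter

# Toy stand-in for df_dict: {volume: {case_id: text}} (invented text)
toy_df_dict = {
    "361": {"361431": "the fortress of the dictionary", "361001": "no match here"},
    "566": {"566449": "Dictionary definitions; see also the dictionary"},
}


def cases_per_volume(d):
    """Number of cases (sub-keys) stored under each volume."""
    return {volume: len(cases) for volume, cases in d.items()}


def hits_per_volume(d, keyword):
    """Case-insensitive substring hits for keyword, totalled per volume."""
    keyword = keyword.lower()
    counts = Counter()
    for volume, cases in d.items():
        counts[volume] = sum(text.lower().count(keyword) for text in cases.values())
    return counts


print(cases_per_volume(toy_df_dict))
print(hits_per_volume(toy_df_dict, "dictionary").most_common())
```

Running the same two helpers on the real `df_dict` would show whether every volume really has the same number of cases and which volumes carry most of the hits.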


## Checking out `dictionary` references in the dataset

We are interested in exploring how the Supreme Court has used the term "`dictionary`" in its opinions. This could be relevant for understanding how justices reference dictionary definitions in their legal reasoning, particularly in the context of statutory interpretation and ambiguity resolution.

____________

The top three results for the query `"How is a dictionary be used in textualism?"` point to cases where dictionaries were referenced to resolve the meaning of statutory terms. Going through all of the dictionary references should surface related ambiguities.

1. United States v. Mersky, 361 U.S. 431 (1960) for dictionary ref on `statute` https://supreme.justia.com/cases/federal/us/361/431/
2. Mohamad v. Palestinian Authority, 566 U.S. 449 (2012) for dictionary ref on `person` https://supreme.justia.com/cases/federal/us/566/449/
  - for the term `personal`, see FCC v. AT&T Inc., 562 U.S. 397 (2011) https://supreme.justia.com/cases/federal/us/562/397/
3. United States v. Santos, 553 U.S. 507 (2008) for dictionary ref on `proceeds` https://tile.loc.gov/storage-services/service/ll/usrep/usrep553/usrep553507/usrep553507.pdf

In [4]:
query = "How is a dictionary be used in textualism?"
#query = 'Show me a time when the dictionary meaning was wrong'

keyword="dictionary"

loc_results = got3.search_keyword_corpus(
    keyword=keyword,
    db_dict=db_dict_formatted,
    case_sensitive=False,
    show_context=True,
    context_words=20,
    output="json"
)

gemma_result = got3.embedding.gemma.task(
    statutory_language=query,
    ambiguous_term=keyword,
    search_results=loc_results, # Pass the JSON results from search_keyword_corpus
    model="google/embeddinggemma-300m"
)

print('')
print("🔍 Query:")
print(query)
print("🎯 Top 3 most relevant contexts:")
for i, item in enumerate(gemma_result['all_ranked'][:3]):
    print(f"{i+1}. Genre: {item['genre']}, Score: {item['score']:.4f}")
    print(f"   Context: {item['context']}...")
    print()

📚 Using pre-computed search results for 'dictionary'
📚 Found 641 context examples across 242 genres
🤖 Loading model: google/embeddinggemma-300m

🎯 RESULTS:

Most relevant context from 361 (score: 0.5357)
Context: U. S. Judge Learned Hand, that judges in construing legislation ought not to imprison themselves in the fortress of the **dictionary** . The immediately relevant ambiguity of "statute" as a legal term derives from the fact that it may mean either

🔍 Query:
How is a dictionary be used in textualism?
🎯 Top 3 most relevant contexts:
1. Genre: 361, Score: 0.5357
   Context: U. S. Judge Learned Hand, that judges in construing legislation ought not to imprison themselves in the fortress of the **dictionary** . The immediately relevant ambiguity of "statute" as a legal term derives from the fact that it may mean either...

2. Genre: 566, Score: 0.5264
   Context: ration is fairly regarded as at home”). Congress does not, in the ordinary course, employ the word any differently. The **
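
The internals of `got3.embedding.gemma.task` are not shown in this notebook. As a generic sketch of how a ranked list like `all_ranked` could be produced, the block below scores toy vectors against a query vector by cosine similarity and sorts them. The vectors, the `vector` key, and the assumption that the scores above are cosine similarities are all illustrative, not confirmed by the output.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_contexts(query_vec, contexts):
    """Return contexts sorted by descending similarity to query_vec."""
    scored = [
        {"genre": c["genre"], "context": c["context"], "score": cosine(query_vec, c["vector"])}
        for c in contexts
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


# Made-up 3-dimensional embeddings; a real model returns hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
contexts = [
    {"genre": "553", "context": "toy context C", "vector": [0.0, 1.0, 0.0]},
    {"genre": "361", "context": "toy context A", "vector": [1.0, 0.0, 1.0]},
    {"genre": "566", "context": "toy context B", "vector": [1.0, 1.0, 0.0]},
]

ranked = rank_contexts(query_vec, contexts)
for i, item in enumerate(ranked, start=1):
    print(f"{i}. Genre: {item['genre']}, Score: {item['score']:.4f}")
```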

______________
### Explore keywords of interest from the Supreme Court Database

- namely, terms whose meaning is ambiguous (e.g. "`modify`", "`stationary source`", "`observer costs`", etc.)
- alternatively, words that denote a lack of clarity (e.g. "`ambiguity`", "`ambiguous`", etc.)
- the references of a textualist (e.g. "`dictionary`", "`ordinary meaning`", "`textualism`", etc.)
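
A quick way to compare several of these keywords is to count them in one pass. The sketch below works on a toy dictionary shaped like `df_dict`; it uses word-boundary matching, which is a deliberate departure from the plain `.count()` used earlier, so that "unambiguous" is not counted as a hit for "ambiguous". The texts and the `count_keywords` helper are made up for illustration.

```python
import re

# Toy stand-in for df_dict: {volume: {case_id: text}} (invented text)
toy_df_dict = {
    "512": {"512218": "The ordinary meaning of the term is unambiguous."},
    "566": {"566560": "An ambiguous statute; the Dictionary offers two senses."},
}


def count_keywords(d, keywords):
    """Case-insensitive whole-word (or whole-phrase) counts for each keyword."""
    totals = {}
    for kw in keywords:
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        totals[kw] = sum(
            len(pattern.findall(text))
            for cases in d.values()
            for text in cases.values()
        )
    return totals


print(count_keywords(toy_df_dict, ["dictionary", "ordinary meaning", "ambiguous"]))
```

With substring counting instead, "ambiguous" would score 2 on this toy data, which is the kind of inflation worth watching for when comparing terms.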

## TODO
1. add a custom query parameter to `gemma.task`, so that the prompt is not always 'What is the ordinary meaning of the ambiguous term "{ambiguous_term}" in the context of the following statutory language, "{statutory_language}"?'
2. print proper references for each context as https://loc.gov/item/usrep<volume><issue>
3. compare and combine with the other Penn State database files and their variables on differentials and ideology
4. implement a way to visualize the relationships between those variables and how they affect the interpretation of the ambiguous terms
5. include an Oyez API call for usrep<volume><issue> cases of interest to quickly get the ideological breakdown of the justices for that case
6. include an Oyez API call feeding AI summary generation, as an additional context item for a court case of interest
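
Items 1 and 2 can be sketched without touching `got3` itself. The helpers below are hypothetical (`build_query` and `loc_item_url` do not exist in the library): the default template is the prompt quoted in item 1, and the URL pattern is copied from item 2 and has not been verified against loc.gov.

```python
DEFAULT_TEMPLATE = (
    'What is the ordinary meaning of the ambiguous term "{ambiguous_term}" '
    'in the context of the following statutory language, "{statutory_language}"?'
)


def build_query(ambiguous_term, statutory_language, template=DEFAULT_TEMPLATE):
    """Fill a prompt template; pass a custom template to override the default."""
    return template.format(
        ambiguous_term=ambiguous_term,
        statutory_language=statutory_language,
    )


def loc_item_url(case_id):
    """Build a reference URL from a case_id such as '553507' (pattern from the TODO)."""
    return f"https://loc.gov/item/usrep{case_id}"


print(build_query("proceeds", "the proceeds of some form of unlawful activity"))
print(build_query("dictionary", "", template="How do the justices use the term {ambiguous_term}?"))
print(loc_item_url("553507"))
```

Printing a full reference would also require carrying `case_id` through the search results, since the ranked output above only exposes the volume as `genre`.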

### to implement and formalize

- Maybe update to `got3.corpus.` for corpus related functions:
  - ✅ `got3.read_corpus` 
  - ✅ `got3.search_keyword_corpus`
  - ✅ `got3.find_collocates`
  - ⚠️ `got3.find_concordance`
  - ✅ `got3.keyword_frequency_analysis`
  - ? `got3.pos_tagging (spacy)`
  - ? `got3.scattertext()`

- already there as `got3.embedding`, think about a third model to integrate
  - ✅ `got3.embedding.legal_bert.pipe`
  - ⚠️ `got3.embedding.gemma.task`

- tbd on `got3.ai`
  - `got3.ai.bedrock.api`
  - `got3.ai.openai.api`
  - `got3.ai.claude.api`
  - `got3.ai.gemini.api`
  - `got3.ai.ollama.api` (local)

- lastly some third party services, probably under `got3.tools` namespace for anything not fitting above
  - `got3.tools.oyez.api`
  - `got3.tools.scdb`

### Important SCOTUS cases and terms
- `mineral` in Marvel v. Merritt, 116 U.S. 11 (1885)
  - iron ore case
- `laborer` in Church of the Holy Trinity v. United States (1892)
  - church worker case
- pre-textualist United States v. Wurzbach (1930)
- `statute` in United States v. Mersky, 361 U.S. 431 (1960)
  - dictionary ref for statute
- `stationary source` in Chevron U.S.A. Inc. v. Natural Resources Defense Council, Inc., 467 U.S. 837 (1984)
  - Clean Air Act stationary source! so important
- `race` in St. Francis Coll. v. Al-Khazraji, 481 U.S. 604 (1987)
  - racial discrimination case
- `servitude` in United States v. Kozminski, 487 U.S. 931 (1988)
  - involuntary servitude case
- `modify` in MCI Telecommunications Corp. v. AT&T Co., 512 U.S. 218 (1994)
  - telecom modify case
- `carry` in Muscarello v. United States, 524 U.S. 125 (1998)
  - gun carrying case
- `to bear arms` in District of Columbia v. Heller, 554 U.S. 570 (2008)
  - Second Amendment case
- `proceeds` in United States v. Santos, 553 U.S. 507 (2008)
  - money laundering proceeds case, "profits" rather than "gross receipts"
- `process` in Bilski v. Kappos (2010)
  - patent process case
- `personal` in FCC v. AT&T Inc., 562 U.S. 397 (2011)
  - personal privacy case
- `harboring` in United States v. Costello (2012)
  - harboring a fugitive case
- `person` in Mohamad v. Palestinian Authority, 566 U.S. 449 (2012)
  - torture act definition of person
- `interpreter` in Taniguchi v. Kan Pacific Saipan, Ltd., 566 U.S. 560 (2012)
  - dictionary war reference
- `regulate` in National Federation of Independent Business v. Sebelius (2012)
  - commerce clause
- `defalcation` in Bullock v. BankChampaign (2013)
  - old word could not find references
- `physical force` in United States v. Castleman (2014)
  - domestic violence case
- `major source` in Utility Air Regulatory Group v. EPA (2014)
  - reference of “vast economic and political significance”... “skepticism”  
- `state exchange` in King v. Burwell, 576 U.S. 473 (2015)
  - healthcare state exchanges in obamacare
- `also` in Mount Lemmon Fire District v. Guido, 586 U.S. ___ (2018)
  - also means "and" case
- `whistleblower` in Digital Realty Trust, Inc. v. Somers (2018)
  - Dodd-Frank Act
- `comprehensive` in Gundy v. United States (2019)
  - nondelegation doctrine
- `employment` in New Prime Inc. v. Oliveira (2019)
  - independent contractor vs employee
- `because of sex` in Bostock v. Clayton County (2020)
  - Title VII
- `translator` in Niz-Chavez v. Garland, 593 U.S. ___ (2021)
  - meaning of "a"
- `a single function of the trigger` in Garland v. Cargill (2024)
  - about machineguns
- West Virginia v. EPA (2022)
  - first named reference of Major Questions Doctrine
- `observer costs` in Loper Bright Enterprises v. Raimondo, 603 U.S. 369 (2024)
  - fisheries observer costs
- `textually literal approach` in Dewberry Group, Inc. v. Dewberry Engineers Inc. (2025)
