## Library of Congress: Supreme Court Database extracted from PDF files

If you have a new dataset, the sample steps below show how to prepare your data for analysis with `got3`. Specifically, they set up a dictionary of DataFrames with text columns to be used for keyword search and embedding generation.

1. Run a keyword search over the loc.gov SCOTUS PDF text stored in `loc_gov.json`
2. Use your filtered results to create a list of text snippets for embedding documents
3. Use `got3.embedding` to run queries based on terms in the documents provided (e.g. references to the Supreme Court) for similarity analysis

In [1]:
#pip install sentence-transformers torch
#pip install -U sentence-transformers==2.2.2

In [2]:
import pandas as pd
import getout_of_text_3 as got3


In [3]:
# read pdf scotus files
df = pd.read_json("loc_gov.json", lines=True)

df['key'] = df['filename'].apply(lambda x: x.split('usrep')[1][:3])
df['subkey'] = df['filename'].apply(lambda x: x.split('usrep')[1].split('.pdf')[0])

# Create a dictionary to hold the DataFrame contents
df_dict = {}

for _, row in df.iterrows():
    if row['key'] not in df_dict:
        df_dict[row['key']] = {}
    df_dict[row['key']][row['subkey']] = row['content']

# format scotus data for getout_of_text_3, similar to COCA keyword results
db_dict_formatted = {}
for volume, cases in df_dict.items():
    # Create a DataFrame for each volume with case text
    case_data = []
    for case_id, case_text in cases.items():
        case_data.append({'case_id': case_id, 'text': case_text})
    db_dict_formatted[volume] = pd.DataFrame(case_data)


In [4]:
print(f"Dictionary contains {len(df_dict)} keys, each with {len(df_dict[next(iter(df_dict))])} sub-keys.")
total_token_count = sum(len(content.split()) for subdict in df_dict.values() for content in subdict.values())
print(f"Total token count across all documents: {total_token_count}")

# hits for 'dictionary' 
dictionary_hits = sum(content.lower().count("dictionary") for subdict in df_dict.values() for content in subdict.values())
print(f"Total 'dictionary' hits across all documents: {dictionary_hits}")

Dictionary contains 242 keys, each with 20 sub-keys.
Total token count across all documents: 63175725
Total 'dictionary' hits across all documents: 1681
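
The summary above reports the sub-key count for the first volume only (via `next(iter(df_dict))`), so "each with 20 sub-keys" need not hold for every volume. A minimal sketch of a per-volume breakdown follows. It uses a toy stand-in for `df_dict` with the same `{volume: {case_id: text}}` shape, and `per_volume_stats` is a hypothetical helper, not part of `got3`.

```python
# Toy stand-in for df_dict: {volume: {case_id: text}} (made-up text, for illustration only).
toy_df_dict = {
    "329": {"329433": "See Webster's dictionary.", "329156": "No reference here."},
    "448": {"448607": "A dictionary and another Dictionary."},
}

def per_volume_stats(corpus: dict, term: str) -> dict:
    """Return {volume: (number of cases, case-insensitive hits for term)}."""
    needle = term.lower()
    return {
        volume: (len(cases), sum(text.lower().count(needle) for text in cases.values()))
        for volume, cases in corpus.items()
    }

stats = per_volume_stats(toy_df_dict, "dictionary")
# Sort volumes numerically, as the notebook does for the search results.
for volume in sorted(stats, key=int):
    n_cases, hits = stats[volume]
    print(f"Volume {volume}: {n_cases} cases, {hits} 'dictionary' hits")
```

Passing the real `df_dict` instead of `toy_df_dict` should give the same kind of table for the full corpus.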


## Checking out `dictionary` references in the dataset

We are interested in exploring how the Supreme Court has used the term "`dictionary`" in its opinions. This could be relevant for understanding how justices reference dictionary definitions in their legal reasoning, particularly in the context of statutory interpretation and ambiguity resolution.

____________

The top three results for the query `"How is a dictionary used in textualism?"` point to some interesting cases where dictionaries were referenced to resolve the meaning of statutory terms. Consider going through all dictionary refs to find related ambiguities!

1. United States v. Mersky, 361 U.S. 431 (1960) for `statute` dictionary ref https://supreme.justia.com/cases/federal/us/361/431/
2. Mohamad v. Palestinian Authority, 566 U.S. 449 (2012) for dictionary ref on `person` https://supreme.justia.com/cases/federal/us/566/449/
  - for term `personal` FCC v. AT&T Inc., 562 U.S. 397 (2011)  https://supreme.justia.com/cases/federal/us/562/397/
3. United States v. Santos, 553 U.S. 507 (2008) for dictionary ref on `proceeds` https://tile.loc.gov/storage-services/service/ll/usrep/usrep553/usrep553507/usrep553507.pdf

In [5]:
df.sort_values(by='key')

Unnamed: 0,filename,content,key,subkey
6713,usrep329433.pdf,STEELE v. GENERAL MILLS.\n424 Syllabus.\nor in...,329,329433
6355,usrep329156.pdf,"OCTOBER TERM, 1946.\nSyllabus. 329 U. S.\nvolu...",329,329156
940,usrep329607.pdf,INSURANCE GROUP v. D. & R. G. W. R. CO. 607\nS...,329,329607
3767,usrep329663.pdf,DE MEERLEER v. MICHIGAN.\n654 Statement of the...,329,329663
4554,usrep329040.pdf,"40 OCTOBER TERM, 1946.\nCounsel for Parties. 3...",329,329040
...,...,...,...,...
4388,usrep570421.pdf,"OCTOBER \nTERM, 2012 \n421 \nSyllabus \nV ANCE...",570,570421
2749,usrep570205.pdf,"OCTOBER \nTERM, 2012 \n205 \nSyllabus \nAGENCY...",570,570205
8145,usrep570693.pdf,"OCTOBER \nTERM, 2012 \n693 \nSyllabus \nHOLLIN...",570,570693
341,usrep570338.pdf,"338 \nOCTOBER \nTERM, 2012 \nSyllabus \nUNIVER...",570,570338


In [6]:
db_dict_formatted

{'448':    case_id                                               text
 0   448261  THOMAS v. WASHINGTON GAS LIGHT CO.\nSyllabus\n...
 1   448448  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nFULLIL...
 2   448098  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nRAWLIN...
 3   448160  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nCENTRA...
 4   448438  OCTOBER TERM, 1979\nPer Cunam 448 U. S.\nREID ...
 5   448607  INDUSTRIAL UNION DEPT. v. AMERICAN PETROL. INS...
 6   448001  CASES A]DJUDGED\nIN TME\nSUPREME COURT OF THE ...
 7   448176  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nDAWSON...
 8   448358  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nWILLIA...
 9   448371  UNITED STATES v. SIOUX NATION OF INDIANS\nSyll...
 10  448038  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nADAMS ...
 11  448555  RICHMOND NEWSPAPERS, INC. v. VIRGINIA\nSyllabu...
 12  448297  HARRIS v. MdRAE\nSyllabus\nHARRIS, SECRETARY O...
 13  448056  OCTOBER TERM, 1979\nSyllabus 448 U. S.\nOHIO v...
 14  448136  OCTOBER TERM, 1979\nSyllabus 448 U.

In [7]:
loc_results = got3.search_keyword_corpus(
    keyword='ordinary meaning',
    db_dict=db_dict_formatted,
    case_sensitive=False,
    show_context=True,
    context_words=20,
    #output="json"
)

🔍 COCA Corpus Search: 'ordinary meaning'
🚀 Using parallel processing with 9 processes...





📚 448 :
------------------------------
  📝 Text 5: J., dissenting to its costs. I believe that the statute's language, structure, and legislative history foreclose respondents' position. In its **ordinary meaning** an activity is "feasible" if it is capable of achievement, not if its benefits outweigh its costs. See Web- ster's
  ✅ Found 1 occurrence(s) in 448

📚 356 :
------------------------------
  📝 Text 32: Act of 1917, we think, reveals nothing sufficient to indicate that Congress did not intend the word'entry' ...should have its **ordinary meaning** ." 289 U. S., at 425. See also Utiited States ex rel. Claussen v. Day, 279 U. S. 398 (1929).
  ✅ Found 1 occurrence(s) in 356

📚 495 :
------------------------------
  📝 Text 3: offense of conviction. Pp. 415-422. (a) VWPA's plain language clearly links restitution to the offense of conviction. Given that the **ordinary meaning** of "restitution" is restor- ing someone to a position he occupied before a particular event, § 3579's re

### Sanity checking against the BYU SCOTUS DB

- this is an example where I built my own corpus dictionary (i.e. a DIY corpus) for the SCOTUS work, but it's helpful to benchmark the results against https://www.english-corpora.org/scotus/
- plus we get the latest years, which is helpful for recent cases and changes in strategies of statutory interpretation
- try words like `'stupendous'` or `'ordinary meaning'`
    - see https://caselaw.findlaw.com/court/us-supreme-court/337/472.html where I get two hits on "ordinary meaning", but the SCOTUS DB only shows one!
    - another one is in 334-624, where it appears as `ordinary mean-\ning` in my database. Because I extract text from PDFs, there are probably hyphenation issues at line breaks, page breaks, or page wrapping, so care is needed. https://caselaw.findlaw.com/court/us-supreme-court/334/624.html
    - fixed by stripping `-\n` in `extract_loc_scotus_pdfs.py`
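
A blanket `.replace('-\n', '')` also removes hyphens that are not word breaks (the `1947.Decided` in the output of the next cell comes from `1947.-\nDecided`). The sketch below is one possible narrower rule, not the logic in `extract_loc_scotus_pdfs.py`: `dehyphenate` is a hypothetical helper that joins only when lowercase letters sit on both sides of the break.

```python
import re

# Hypothetical helper (not part of got3): rejoin words that PDF extraction
# split across a line break, e.g. "mean-\ning" -> "meaning".
# The lookbehind/lookahead require a lowercase letter on each side of the break.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\s*\n\s*(?=[a-z])")

def dehyphenate(text: str) -> str:
    """Drop a hyphen plus line break only when it splits a lowercase word."""
    return _HYPHEN_BREAK.sub("", text)

sample = "its ordinary mean-\ning of the obligation\nNo. 17. Argued May 7, 1947.-\nDecided"
cleaned = dehyphenate(sample)
print(cleaned)
```

The trade-off is that a genuine compound broken at its hyphen (for example `long-\ntailed`) is still merged into one word, so spot checks against the PDFs remain useful.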


In [8]:
df[df['subkey']=='334624']['content'].values[0].replace('-\n','')

'OCTOBER TERM, 1947.\nSyllabus. 334 U. S.\nFurther evidence, were any needed, that Congress accepted as its own this interpretation of the language used\nin § 602 (h) (2) is supplied by the significant distinction\nmaintained in this reenactment between the mode of payment originally provided by § 602 (h) (2) and the refund\nlife income plan, viewed in the light of the House Committee Report on the bill. It is hardly conceivable-and\nif conceivable, hardly explicable-that Congress meant\none thing by the language it used in § 602 (h) (2) when\nenacting the original measure in 1940, and another, quite\ndifferent thing, when it reenacted that language in 1946.\nIn the light of the foregoing considerations, the validity\nof Regulation 3450 is sustained and the decision of the\nCircuit Court of Appeals is\nReversed.\nUNITED STATES v. JOHN J. FELIN & CO., INC.\nCERTIORARI TO THE COURT OF CLAIMS.\nNo. 17. Argued May 7, 1947.-Reargued November 18-19, 1947.Decided June 14, 1948.\nWhen prices o

In [9]:
filtered_loc_results = {k: v for k, v in loc_results.items() if v}
# Sort the keys numerically and display the filtered results in order
for key in sorted(filtered_loc_results.keys(), key=int):
    print(f"Volume {key}: {len(filtered_loc_results[key])} results")

Volume 329: 1 results
Volume 332: 2 results
Volume 335: 1 results
Volume 337: 2 results
Volume 338: 1 results
Volume 339: 2 results
Volume 340: 1 results
Volume 341: 1 results
Volume 347: 5 results
Volume 355: 4 results
Volume 356: 1 results
Volume 361: 1 results
Volume 366: 1 results
Volume 367: 1 results
Volume 368: 3 results
Volume 369: 1 results
Volume 370: 1 results
Volume 371: 2 results
Volume 374: 1 results
Volume 377: 1 results
Volume 380: 1 results
Volume 392: 2 results
Volume 401: 1 results
Volume 402: 1 results
Volume 404: 3 results
Volume 406: 2 results
Volume 413: 1 results
Volume 419: 3 results
Volume 420: 3 results
Volume 422: 2 results
Volume 434: 3 results
Volume 435: 1 results
Volume 436: 1 results
Volume 441: 1 results
Volume 442: 1 results
Volume 444: 1 results
Volume 445: 2 results
Volume 447: 1 results
Volume 448: 1 results
Volume 452: 3 results
Volume 455: 1 results
Volume 456: 2 results
Volume 457: 1 results
Volume 459: 1 results
Volume 460: 3 results
Volume 463

In [12]:
filtered_loc_results['337']

[{'text_id': 10,
  'match': 'ordinary meaning',
  'context': 'as used in the Trading with the Enemy Act, the Executive Orders and the regu- lations thereunder, is given its **ordinary meaning** of the obligation due on accounting between parties to transactions. P. 480. PROPPER v. CLARK. 472 Statement of the Case.',
  'full_text': 'OCTOBER TERM, 1948.\nSyllabus. 337 U. S.\nPROPPER, RECEIVER, v. CLARK, ATTORNEY\nGENERAL, AS SUCCESSOR ...'},
 {'text_id': 10,
  'match': 'ordinary meaning',
  'context': 'regulation, we, in considering credits as property subject to vesting under the Trading with the Enemy Act, give it its **ordinary meaning** of the obligation due on accounting between parties to transactions. This credit, owed by ASCAP to AKM, was in effect',
  'full_text': 'OCTOBER TERM, 1948.\nSyllabus. 337 U. S.\nPROPPER, RECEIVER, v. CLARK, ATTORNEY\nGENERAL, AS SUCCESSOR ...'}]

In [11]:
import spacy
from spacy import displacy
from IPython.display import HTML, display

nlp = spacy.load("en_core_web_sm")
hits = filtered_loc_results['567']
# Index 5 raised "IndexError: list index out of range" when the volume had
# fewer than six hits, so clamp the index to the last available hit.
doc = nlp(hits[min(5, len(hits) - 1)]['context'].replace('**', ''))

#doc = nlp("Rats are various medium-sized, long-tailed rodents.")

# Fix for IPython display import issue
try:
    displacy.render(doc, style="dep", jupyter=True)
except ImportError:
    # Alternative approach: request the raw markup (jupyter=False) and display it explicitly
    html = displacy.render(doc, style="dep", jupyter=False, page=False)
    display(HTML(html))

In [17]:
displacy.render(doc, style="ent", jupyter=True)


In [None]:
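# The style argument was left blank in this draft cell; displaCy supports "dep", "ent", and "span".
# displacy.render(doc, style=..., jupyter=True)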


In [9]:
query = "How is a dictionary used in textualism?"
#query = 'Show me a time when the dictionary meaning was wrong'

keyword="dictionary"

loc_results = got3.search_keyword_corpus(
    keyword=keyword,
    db_dict=db_dict_formatted,
    case_sensitive=False,
    show_context=True,
    context_words=20,
    output="json"
)

gemma_result = got3.embedding.gemma.task(
    statutory_language=query,
    ambiguous_term=keyword,
    search_results=loc_results, # Pass the JSON results from search_keyword_corpus
    model="google/embeddinggemma-300m"
)

print('')
print("🔍 Query:")
print(query)
print("🎯 Top 3 most relevant contexts:")
for i, item in enumerate(gemma_result['all_ranked'][:3]):
    print(f"{i+1}. Genre: {item['genre']}, Score: {item['score']:.4f}")
    print(f"   Context: {item['context'][:]}...")
    print()

ImportError: EmbeddingGemma dependencies not installed. Run: pip install sentence-transformers torch

______________
### Explore keywords of interest from the Supreme Court Database

- namely, terms where there is ambiguity in the meaning (e.g. "`modify`", "`stationary source`", "`observer costs`", etc.)
- alternatively, words that denote a lack of clarity (e.g. "`ambiguity`", "`ambiguous`", etc.)
- the references of a textualist (e.g. "`dictionary`", "`ordinary meaning`", "`textualism`", etc.)
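
As a rough first pass over these terms, the sketch below counts whole-phrase hits per keyword with the standard library only. The corpus is a toy stand-in with the same `{volume: {case_id: text}}` shape as `df_dict`, and `count_keywords` is a hypothetical helper rather than a `got3` function. Words in a phrase may be separated by any whitespace, so a phrase split across a line break still counts.

```python
import re
from collections import Counter

KEYWORDS = [
    "dictionary", "ordinary meaning", "textualism",
    "ambiguity", "ambiguous",
    "modify", "stationary source", "observer costs",
]

def count_keywords(corpus: dict, keywords: list) -> Counter:
    """Count case-insensitive whole-phrase hits per keyword across all texts."""
    patterns = {
        kw: re.compile(
            r"\b" + r"\s+".join(re.escape(word) for word in kw.split()) + r"\b",
            re.IGNORECASE,
        )
        for kw in keywords
    }
    counts = Counter()
    for cases in corpus.values():
        for text in cases.values():
            for kw, pattern in patterns.items():
                counts[kw] += len(pattern.findall(text))
    return counts

# Made-up text for illustration only.
toy_corpus = {
    "512": {"512218": "The word modify has an ordinary\nmeaning. See Webster's Dictionary."},
    "467": {"467837": "The term stationary source is ambiguous; the ambiguity remains."},
}

keyword_counts = count_keywords(toy_corpus, KEYWORDS)
for kw, n in keyword_counts.most_common():
    print(f"{kw}: {n}")
```

Terms with the most hits are natural candidates to pass on to `got3.search_keyword_corpus` for context windows.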

## TODO
1. add a custom query parameter to `gemma.task`, so that the prompt is not always 'What is the ordinary meaning of the ambiguous term "{ambiguous_term}" in the context of the following statutory language, "{statutory_language}"?'
2. print proper references for each context as https://loc.gov/item/usrep<volume><issue>
3. compare and combine with the other Penn State database files, with all their variables on differentials and ideology
4. implement a way to visualize the relationships between the different variables and how they affect the interpretation of the ambiguous terms
5. include an Oyez API call for usrep<volume><issue> cases of interest to quickly get the ideological breakdown of the justices for that case
6. include the Oyez API call in AI summary generation for the case, as an additional context item for a court case of interest
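
For TODO item 2, a minimal sketch of a reference builder is shown below. `usrep_reference` is a hypothetical helper. It assumes file names follow the pattern seen in this notebook (`usrep` + 3-digit volume + starting page), takes the item URL pattern from the TODO text without checking it against loc.gov, and copies the PDF URL pattern from the single `tile.loc.gov` link above.

```python
import re

# Assumption: "usrep" + 3-digit volume + starting page, e.g. "usrep553507.pdf".
_USREP = re.compile(r"usrep(\d{3})(\d+)")

def usrep_reference(filename: str) -> dict:
    """Build a citation string and candidate loc.gov URLs from a usrep file name."""
    match = _USREP.search(filename)
    if match is None:
        raise ValueError(f"not a usrep file name: {filename!r}")
    volume, page = match.groups()
    item = f"usrep{volume}{page}"
    return {
        # int() drops zero padding, e.g. page "001" -> 1
        "citation": f"{int(volume)} U.S. {int(page)}",
        # Pattern taken from the TODO item; unverified.
        "item_url": f"https://loc.gov/item/{item}",
        # Pattern copied from the one PDF link in this notebook.
        "pdf_url": (
            "https://tile.loc.gov/storage-services/service/ll/usrep/"
            f"usrep{volume}/{item}/{item}.pdf"
        ),
    }

ref = usrep_reference("usrep553507.pdf")
print(ref["citation"])
print(ref["item_url"])
print(ref["pdf_url"])
```

The same helper could be applied to `df['filename']` to attach a citation column before printing search contexts.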

### to implement and formalize

- Maybe update to `got3.corpus.` for corpus related functions:
  - ✅ `got3.read_corpus` 
  - ✅ `got3.search_keyword_corpus`
  - ✅ `got3.find_collocates`
  - ✅ `got3.keyword_frequency_analysis`
  - ? `got3.pos_tagging (spacy)` and networkx for graphing? have this in a notebook
  - ? `got3.scattertext()` interesting to review still

- already there as `got3.embedding`, think about a third model to integrate
  - ✅ `got3.embedding.legal_bert.pipe`
  - ⚠️ `got3.embedding.gemma.gemma` (was `got3.embedding.gemma.task`)
     - I need to review the tasks

- tbd on `got3.ai`
  - `got3.ai.bedrock.api`
  - `got3.ai.openai.api`
  - `got3.ai.claude.api`
  - `got3.ai.gemini.api`
  - `got3.ai.ollama.api` (local)

- lastly some third party services, probably under `got3.tools` namespace for anything not fitting above
  - `got3.tools.oyez.api`
  - `got3.tools.scdb`

### SCOTUS important cases and terms
- `mineral` in Marvel v. Merritt, 116 U.S. 11 (1885)
  - iron ore case
- `laborer` in Church of the Holy Trinity v. United States (1892)
  - church worker case
- pre-textualist United States v. Wurzbach (1930)
- `statute` in United States v. Mersky, 361 U.S. 431 (1960)
  - dictionary ref for statute
- `stationary source` in Chevron U.S.A. Inc. v. Natural Resources Defense Council, Inc., 467 U.S. 837 (1984)
  - clean air act stationary source! so important
- `race` in St. Francis Coll. v. Al-Khazraji, 481 U.S. 604 (1987)
  - racial discrimination case
- `servitude` in United States v. Kozminski, 487 U.S. 931 (1988)
  - involuntary servitude case
- `modify` in MCI Telecommunications Corp. v. AT&T Co., 512 U.S. 218 (1994)
  - telecom modify case
- `carry` in Muscarello v. United States, 524 U.S. 125 (1998)
  - gun carrying case
- `to bear arms` in District of Columbia v. Heller, 554 U.S. 570 (2008)
  - second amendment case
- `proceeds` in United States v. Santos, 553 U.S. 507 (2008)
  - money laundering proceeds case, "profits" rather than "gross receipts"
- `process` in Bilski v. Kappos (2010)
  - patent process case
- `personal` in FCC v. AT&T Inc., 562 U.S. 397 (2011)
  - personal privacy case
- `harboring` in United States v. Costello (2012)
  - harboring a fugitive case
- `person` in Mohamad v. Palestinian Authority, 566 U.S. 449 (2012)
  - torture act definition of person
- `interpreter` in Taniguchi v. Kan Pacific Saipan, Ltd., 566 U.S. 560 (2012)
  - dictionary war reference
- `regulate` in National Federation of Independent Business v. Sebelius (2012)
  - commerce clause
- `defalcation` in Bullock v. BankChampaign (2013)
  - old word could not find references
- `physical force` in United States v. Castleman (2014)
  - domestic violence case
- `major source` in Utility Air Regulatory Group v. EPA, 573 U.S. 302 (2014)
  - reference of "vast economic and political significance"... "skepticism"
- `state exchange` in King v. Burwell, 576 U.S. 473 (2015)
  - healthcare state exchanges in obamacare
- `also` in Mount Lemmon Fire District v. Guido, 586 U.S. ___ (2018)
  - also means "and" case
- `whistleblower` in Digital Realty Trust, Inc. v. Somers (2018)
  - Dodd-Frank Act
- `comprehensive` in Gundy v. United States (2019)
  - nondelegation doctrine
- `employment` in New Prime Inc. v. Oliveira (2019)
  - independent contractor vs employee
- `because of sex` in Bostock v. Clayton County (2020)
  - Title VII
- `translator` in Niz-Chavez v. Garland, 593 U.S. ___ (2021)
  - meaning of "a"
- `a single function of the trigger` in Garland v. Cargill (2024)
  - about machineguns
- West Virginia v. EPA (2022)
  - first named reference of Major Questions Doctrine
- `observer costs` in Loper Bright Enterprises v. Raimondo, 603 U.S. 369 (2024)
  - fisheries observer costs
- `textually literal approach` in Dewberry Group, Inc. v. Dewberry Engineers Inc. (2025)
