## Library of Congress: Supreme Court Database extracted from PDF files

If you have a new dataset, the sample steps below show how to prepare it for analysis with `got3`. Specifically, they set up a dictionary of DataFrames with text columns that can be used for keyword search and embedding generation.

1. Run a keyword search over the loc.gov SCOTUS PDFs, loaded from `loc_gov.json`
2. Use the filtered results to create a list of text snippets to embed as documents
3. Use `got3.embedding` to run queries against terms in those documents (e.g. references to the Supreme Court) for similarity analysis

In [105]:
import pandas as pd
import getout_of_text_3 as got3

In [115]:
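# Read the text extracted from the SCOTUS PDFs (JSON Lines: one record per line)
df = pd.read_json("loc_gov.json", lines=True)

# 'key' is the three characters after 'usrep' in the filename (the volume);
# 'subkey' is everything between 'usrep' and '.pdf' (the case identifier)
df['key'] = df['filename'].apply(lambda x: x.split('usrep')[1][:3])
df['subkey'] = df['filename'].apply(lambda x: x.split('usrep')[1].split('.pdf')[0])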

# Create a dictionary to hold the DataFrame contents
df_dict = {}

for _, row in df.iterrows():
    if row['key'] not in df_dict:
        df_dict[row['key']] = {}
    df_dict[row['key']][row['subkey']] = row['content']

# format scotus data for getout_of_text_3, similar to COCA keyword results
db_dict_formatted = {}
for volume, cases in df_dict.items():
    # Create a DataFrame for each volume with case text
    case_data = []
    for case_id, case_text in cases.items():
        case_data.append({'case_id': case_id, 'text': case_text})
    db_dict_formatted[volume] = pd.DataFrame(case_data)
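
The `key` and `subkey` values above come from slicing the filename. As a minimal sketch, the same parsing could be done with a regular expression that fails loudly on unexpected names. The helper `parse_usrep_filename` is hypothetical (it is not part of `got3`), and it assumes each filename contains `usrep` followed by a three-digit volume and then further digits, as in `usrep553507.pdf`.

```python
import re

# Assumption: names look like 'usrep553507.pdf' (three-digit volume, then more digits).
USREP_PATTERN = re.compile(r"usrep(\d{3})(\d+)")


def parse_usrep_filename(filename):
    """Return (volume, case_id), mirroring the 'key' and 'subkey' columns."""
    match = USREP_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    volume, rest = match.groups()
    return volume, volume + rest


print(parse_usrep_filename("usrep553507.pdf"))
```

Raising on a non-matching name surfaces stray files early instead of producing an `IndexError` from `split`.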


In [116]:
print(f"Dictionary contains {len(df_dict)} keys, each with {len(df_dict[next(iter(df_dict))])} sub-keys.")
total_token_count = sum(len(content.split()) for subdict in df_dict.values() for content in subdict.values())
print(f"Total token count across all documents: {total_token_count}")

# hits for 'dictionary' 
dictionary_hits = sum(content.lower().count("dictionary") for subdict in df_dict.values() for content in subdict.values())
print(f"Total 'dictionary' hits across all documents: {dictionary_hits}")

Dictionary contains 242 keys, each with 20 sub-keys.
Total token count across all documents: 63175725
Total 'dictionary' hits across all documents: 1681
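
The total above hides where the hits are concentrated. A minimal sketch of a per-volume breakdown follows, using the same nested `{volume: {case_id: text}}` shape as `df_dict`. The function name `count_hits_by_volume` and the toy texts are made up for illustration.

```python
def count_hits_by_volume(nested, term):
    """Count case-insensitive substring hits of `term` for each volume."""
    needle = term.lower()
    return {
        volume: sum(text.lower().count(needle) for text in cases.values())
        for volume, cases in nested.items()
    }


# Toy stand-in for df_dict; the texts are invented, not taken from the corpus.
toy_dict = {
    "361": {"361431": "The fortress of the dictionary. Dictionary use varies."},
    "553": {"553507": "The word proceeds may mean receipts or profits."},
}

hits = count_hits_by_volume(toy_dict, "dictionary")
# Sort volumes from most to fewest hits
top_volumes = sorted(hits.items(), key=lambda pair: pair[1], reverse=True)
print(top_volumes)
```

Like the cell above, this counts substrings, so `dictionary's` is counted while `dictionaries` is not.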


## Checking out `dictionary` references in the dataset

We are interested in exploring how the Supreme Court has used the term "`dictionary`" in its opinions. This could be relevant for understanding how justices reference dictionary definitions in their legal reasoning, particularly in the context of statutory interpretation and ambiguity resolution.

____________

The top three results for the query `"How is a dictionary be used in textualism?"` point to some interesting cases where dictionaries were referenced to resolve the meaning of statutory terms. Consider going through all of the dictionary references to find related ambiguities.

1. United States v. Mersky, 361 U.S. 431 (1960), for the dictionary reference on `statute`: https://supreme.justia.com/cases/federal/us/361/431/
2. Mohamad v. Palestinian Authority, 566 U.S. 449 (2012), for the dictionary reference on `person`: https://supreme.justia.com/cases/federal/us/566/449/
   - FCC v. AT&T Inc., 562 U.S. 397 (2011), for the term `personal`: https://supreme.justia.com/cases/federal/us/562/397/
3. United States v. Santos, 553 U.S. 507 (2008), for the dictionary reference on `proceeds`: https://tile.loc.gov/storage-services/service/ll/usrep/usrep553/usrep553507/usrep553507.pdf

In [113]:
query = "How is a dictionary be used in textualism?"
#query = 'Show me a time when the dictionary meaning was wrong'

keyword="dictionary"

loc_results = got3.search_keyword_corpus(
    keyword=keyword,
    db_dict=db_dict_formatted,
    case_sensitive=False,
    show_context=True,
    context_words=20,
    output="json"
)

gemma_result = got3.embedding.gemma.task(
    statutory_language=query,
    ambiguous_term=keyword,
    search_results=loc_results, # Pass the JSON results from search_keyword_corpus
    model="google/embeddinggemma-300m"
)

print('')
print("üîç Query:")
print(query)
print("üéØ Top 3 most relevant contexts:")
for i, item in enumerate(gemma_result['all_ranked'][:3]):
    print(f"{i+1}. Genre: {item['genre']}, Score: {item['score']:.4f}")
    print(f"   Context: {item['context'][:]}...")
    print()

📚 Using pre-computed search results for 'dictionary'
📚 Found 641 context examples across 182 genres
🤖 Loading model: google/embeddinggemma-300m

🎯 RESULTS:

Most relevant context from 361 (score: 0.5357)
Context: U. S. Judge Learned Hand, that judges in construing legislation ought not to imprison themselves in the fortress of the **dictionary** . The immediately relevant ambiguity of "statute" as a legal term derives from the fact that it may mean either

🔍 Query:
How is a dictionary be used in textualism?
🎯 Top 3 most relevant contexts:
1. Genre: 361, Score: 0.5357
   Context: U. S. Judge Learned Hand, that judges in construing legislation ought not to imprison themselves in the fortress of the **dictionary** . The immediately relevant ambiguity of "statute" as a legal term derives from the fact that it may mean either...

2. Genre: 566, Score: 0.5264
   Context: ration is fairly regarded as at home”). Congress does not, in the ordinary course, employ the word any
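
To make the `Score` column easier to read, here is a minimal sketch of ranking contexts against a query by cosine similarity. It assumes the score is a cosine similarity between embedding vectors, which is common for embedding models but is not confirmed for `got3.embedding.gemma.task`. The two-dimensional vectors are toy values rather than real model output, and `rank_contexts` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_contexts(query_vec, contexts):
    """Return contexts sorted by descending similarity to the query vector."""
    scored = [
        {"genre": genre, "context": text, "score": cosine_similarity(query_vec, vec)}
        for genre, text, vec in contexts
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


# Toy embeddings: a real model would produce vectors with hundreds of dimensions.
toy_query = [1.0, 0.0]
toy_contexts = [
    ("566", "context B", [0.1, 0.9]),
    ("361", "context A", [0.9, 0.1]),
]

ranked = rank_contexts(toy_query, toy_contexts)
for position, item in enumerate(ranked, start=1):
    print(f"{position}. Genre: {item['genre']}, Score: {item['score']:.4f}")
```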

______________
### Explore keywords of interest from the Supreme Court Database

- Terms whose meaning is ambiguous (e.g. "`modify`", "`stationary source`", "`observer costs`")
- Alternatively, words that signal a lack of clarity (e.g. "`ambiguity`", "`ambiguous`")
- Textualist references (e.g. "`dictionary`", "`ordinary meaning`", "`textualism`")
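
A quick way to decide which of these keywords deserve a closer look is to count all of them in one pass. The sketch below assumes the nested `{volume: {case_id: text}}` shape of `df_dict`. The helpers `term_pattern` and `count_terms` are hypothetical, and the sample text is invented.

```python
import re
from collections import Counter


def term_pattern(term):
    """Whole-word, case-insensitive pattern; spaces in the term match any whitespace."""
    words = [re.escape(word) for word in term.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def count_terms(nested, terms):
    """Count occurrences of each term across every case text."""
    counts = Counter({term: 0 for term in terms})
    patterns = {term: term_pattern(term) for term in terms}
    for cases in nested.values():
        for text in cases.values():
            for term, pattern in patterns.items():
                counts[term] += len(pattern.findall(text))
    return counts


# Invented sample; note the line break inside "stationary source".
toy_dict = {
    "000": {
        "000001": "The term modify is ambiguous. A stationary\nsource rule may also be ambiguous."
    }
}

term_counts = count_terms(toy_dict, ["modify", "ambiguous", "stationary source", "ambiguity"])
print(term_counts.most_common())
```

Matching `\s+` between words is a deliberate choice here, since text extracted from PDFs often breaks a multi-word phrase across lines.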

## TODO
1. Add a custom query parameter to `gemma.task`, so that the prompt is not always 'What is the ordinary meaning of the ambiguous term "{ambiguous_term}" in the context of the following statutory language, "{statutory_language}"?'
2. Print proper references for each context as https://loc.gov/item/usrep<volume><page>
3. Compare and combine with the other Penn State database files, which carry the various variables on differentials and ideology
4. Implement a way to visualize the relationships between the different variables and how they affect the interpretation of the ambiguous terms
5. Include an Oyez API call for usrep<volume><page> cases of interest to quickly get the ideological breakdown of the justices for that case
6. Include the Oyez API call in the AI summary generation for a case, as an additional context item for a court case of interest
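
For the reference-printing item, a minimal sketch could turn a case identifier such as `553507` into a citation and a loc.gov link. The helper `loc_reference` is hypothetical. It assumes the identifier is a three-digit volume followed by the starting page, as in `usrep553507` for 553 U.S. 507, and it takes the URL pattern from the TODO item without checking that loc.gov resolves it for every case.

```python
import re


def loc_reference(case_id):
    """Build a citation and loc.gov item URL from an ID like '553507'."""
    # Assumption: the first three digits are the volume, the rest is the page.
    if re.fullmatch(r"[0-9]{4,}", case_id) is None:
        raise ValueError(f"Unexpected case identifier: {case_id!r}")
    volume, page = int(case_id[:3]), int(case_id[3:])
    return {
        "citation": f"{volume} U.S. {page}",
        "url": f"https://loc.gov/item/usrep{case_id}",
    }


reference = loc_reference("553507")
print(f"{reference['citation']}: {reference['url']}")
```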

### To implement and formalize

- Maybe update to `got3.nlp.kwic`, `got3.nlp.collocates`, `got3.nlp.concordance` for different datasets (COCA, SCOTUS, etc.)
  - `got3.search_keyword_corpus`
  - `got3.find_collocates`
  - `got3.find_concordance`
  - TBD on sentiment analysis, valence, etc.

- Already there as `got3.embedding`; think about a third model to integrate
  - `got3.embedding.gemma.task`
  - `got3.embedding.legal_bert.pipe`

- TBD on `got3.ai`
  - `got3.ai.openai.api`
  - `got3.ai.bedrock.api`
  - TBD on local models, `got3.ai.ollama.api`

- Lastly, some third-party services, probably under a `got3.tools` namespace for anything not fitting above
  - `got3.tools.oyez.api`
  - `got3.tools.???`

