### Citation anonymization test

In this notebook, we apply the "citation anonymization" test. Modern LLMs are trained on a very large and varied body of text, which raises the risk that the LLM saw the input citation during training and has "memorized" the correct answer.

For our anonymization test, we first ask an LLM to rewrite the citation text so that case names and citations are anonymized. This makes it unlikely that the exact prompt was seen in training, and the cases it mentions are unlikely to actually exist. We then ask an LLM to identify the right **holding** from this modified prompt, as before.

In [1]:
import pandas as pd
from sklearn.metrics import f1_score
import numpy as np
import re
from Levenshtein import distance as lev
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

from openai import OpenAI
import json
from multiprocess import Pool


In [None]:
oai_labeled = pd.read_csv('casehold_test_openai_labels.csv')

In [None]:
chat_4omini = ChatOpenAI(model='gpt-4o-mini-2024-07-18', temperature=0, max_completion_tokens = 5000)
chat_4o = ChatOpenAI(model='gpt-4o-2024-11-20', temperature=0, max_completion_tokens=5000)

### Create the anonymized citations

In [None]:
def create_new_citation_prompt(r):
    return """Input: "{}"
    Task: Please rewrite the input text while:
    1. Replace all case names CONSISTENTLY:
       - Use different but plausible names
       - If a name appears multiple times, use the same replacement each time
       - Example: If "Smith" becomes "Wilson", all instances of "Smith" should become "Wilson"
    
    2. For each citation:
       - Change all numbers (years, page numbers, etc.)
       - Change the jurisdiction (e.g., F.3d → P.2d or N.Y.S.2d → Cal.App.)
       - Keep citations in a valid legal format
    
    3. Preserve exactly:
       - The <HOLDING> tag location
       - All punctuation
       - All legal reasoning and discussions
       - The basic sentence structure
    
    Change ALL identifying information consistently while keeping the legal meaning identical. Surround your response with tags, <output> text </output>
    Output: """.format(r['citing_prompt'])

In [None]:
oai_labeled['initial_anonymization_prompt'] = oai_labeled.apply(create_new_citation_prompt, axis = 1)
oai_labeled['anonymized_citing_prompt'] = ''
print(oai_labeled.loc[0]['initial_anonymization_prompt'])

In [None]:
chat_4omini.invoke([("user", oai_labeled.loc[0]['initial_anonymization_prompt'])]).content

In [None]:
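def get_anonymized_prompt(prompt):
    # Send one anonymization prompt to gpt-4o-mini and return the raw reply,
    # which still includes the <output> ... </output> wrapper.
    messages = [
        ("user", prompt)
    ]

    response = chat_4omini.invoke(messages)
    return response.content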

In [None]:
with Pool(processes = 6) as pool:
    anonymized_prompts = pool.map(func = get_anonymized_prompt, iterable = oai_labeled['initial_anonymization_prompt'])

In [None]:
oai_labeled['anonymized_prompt'] = anonymized_prompts

In [None]:
oai_labeled.loc[453]['anonymized_prompt']

In [None]:
oai_labeled.loc[453]['citing_prompt']

In [None]:
def create_anonymized_prompt(r):
    pattern = r'<output>(.*)</output>'
    match = re.search(pattern, r['anonymized_prompt'], re.DOTALL)
    cite = match.group(1) if match else ''
    return """
    Task: Legal Holding Identification
    
    Context: You are analyzing a legal text to identify the most appropriate legal holding. A legal holding is the court's determination of a matter of law based on the facts of a particular case.
    
    Input Text: "{anonymized_prompt}"

    Question: Based on the legal context above, which of the following holdings best completes the text where the <HOLDING> tag appears? Consider:
    - The specific legal issue being discussed
    - The logical flow of the legal argument 
    - The precedential value implied by the context

    Options:
    A: {holding_0}
    B: {holding_1}
    C: {holding_2}
    D: {holding_3}
    E: {holding_4}

    Instructions:
    1. Analyze the context and legal reasoning in the input text
    2. Consider how each option would fit within the legal argument
    3. Evaluate which option best maintains the logical flow. Explain your reasoning first, formatted like this <reasoning> reasoning </reasoning>
    4. Provide your final answer in the format: ANSWER: X (where X is A, B, C, D, or E)""".format(
        anonymized_prompt = cite, holding_0 = r['holding_0'][:1000], holding_1 = r['holding_1'][:1000], 
        holding_2 = r['holding_2'][:1000], holding_3 = r['holding_3'][:1000], holding_4 = r['holding_4'][:1000])

oai_labeled['anonymized_final_prompt'] = oai_labeled.apply(create_anonymized_prompt, axis = 1)
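
The cell above silently falls back to an empty string when the `<output>` tags are missing, so a malformed rewrite produces a prompt with no input text. A minimal sketch of a stricter extraction step follows. The helper names `extract_anonymized_text` and `is_usable` are hypothetical and not part of the notebook. Stripping the surrounding straight quotes and requiring exactly one `<HOLDING>` tag are assumptions based on the sample outputs shown in this notebook.

```python
import re

# Same pattern as in create_anonymized_prompt: greedy, spanning newlines.
OUTPUT_RE = re.compile(r"<output>(.*)</output>", re.DOTALL)


def extract_anonymized_text(response: str) -> str:
    """Return the text between <output> tags, or '' if the tags are missing."""
    match = OUTPUT_RE.search(response)
    if not match:
        return ""
    # Drop padding whitespace and the straight quotes the model echoes
    # from the 'Input: "{}"' template (assumption: they are not part of the citation).
    return match.group(1).strip().strip('"').strip()


def is_usable(anonymized: str) -> bool:
    """Usable only if non-empty and the <HOLDING> tag survived exactly once."""
    return bool(anonymized) and anonymized.count("<HOLDING>") == 1
```

Rows for which `is_usable` returns `False` could be re-sent to the anonymizer or excluded before scoring, rather than being answered from an empty context.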

In [None]:
print(oai_labeled['citing_prompt'].loc[4])

In [None]:
print(oai_labeled['anonymized_final_prompt'].loc[4])

In [None]:
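for i in range(0, oai_labeled.shape[0]):

    print(i)
    row = oai_labeled.loc[i]
    messages = [("user", row['anonymized_final_prompt'])]

    # Retry up to five times; rows that never succeed keep a missing response.
    for t in range(5):
        try:
            c = chat_4o.invoke(messages)
            oai_labeled.loc[i, 'response_gpt4o_anonymized'] = c.content
            break
        except Exception as e:
            # Catch Exception rather than using a bare except, so Ctrl-C still stops the loop.
            print('failed', repr(e))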
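
The loop above retries immediately after a failure, which tends to fail again when the cause is rate limiting. A minimal sketch of the same retry idea with exponential backoff is shown here. `call_with_retries` is a hypothetical helper, and the delay schedule (1 s, 2 s, 4 s, ...) is an illustrative assumption, not a recommendation from the API provider.

```python
import time


def call_with_retries(fn, *args, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**k seconds and retry.

    Returns fn's result, or None if every attempt raised an exception.
    """
    for k in range(attempts):
        try:
            return fn(*args)
        except Exception as err:  # KeyboardInterrupt is deliberately not caught
            print(f"attempt {k + 1}/{attempts} failed: {err!r}")
            if k < attempts - 1:
                sleep(base_delay * 2 ** k)
    return None
```

In this notebook the call would look like `call_with_retries(chat_4o.invoke, messages)`, with a `None` result left as a missing response for that row.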

### Analyze the anonymized results

In [2]:
oai_labeled = pd.read_csv('casehold_test_with_anonymization_oai.csv')

In [3]:
print(oai_labeled.loc[0]['response_gpt4o'])

<reasoning> The input text discusses the concept of "consideration" in the context of legal cases involving promotional activities, such as sweepstakes or bingo games. Specifically, the text contrasts cases where consideration was found to exist with cases where it was not. The focus is on whether requiring a person to take certain actions, such as visiting a location, constitutes sufficient legal consideration. The text references cases like *State v. Socony Mobil Oil Co.* and *Lucky Calendar Co. v. Cohen*, which deal with promotional schemes and the legal implications of requiring participants to engage in specific actions to participate. 

Given this context, the most relevant holding would address whether the act of visiting a store or location constitutes consideration. Option C, which states "holding that consideration exists where a customer is burdened by having to visit the store where the coupons are being offered," directly aligns with the legal issue being discussed. It fit

In [4]:
print(oai_labeled.loc[0]['response_gpt4o_anonymized'])

<reasoning> The input text discusses the concept of "consideration" in the context of promotional activities, specifically whether requiring a person to visit a location to participate in a sweepstakes constitutes legal consideration. The text references cases that analyze whether certain actions or benefits meet the threshold for consideration, such as paying for bingo cards or requiring physical presence at a location. The logical flow of the argument centers on the legal issue of what constitutes consideration in promotional or sweepstakes scenarios. 

Option A, which involves vicarious liability for wrongful death, is unrelated to the issue of consideration and does not fit the context.  
Option B, which involves foreseeability in the context of alcohol sales and murder, is also unrelated to the issue of consideration.  
Option C directly addresses the issue of consideration, specifically in the context of requiring a customer to visit a store, which aligns with the discussion in t

### Examples of anonymized prompts

In [5]:
for i in [7, 5270, 816, 5308]:
    print('***')
    print(oai_labeled.loc[i]['citing_prompt'])
    print(oai_labeled.loc[i]['anonymized_prompt'])

***
she argues that that affidavit demonstrates that Mandeville’s testimony regarding DMV complaints about Bucci is not worthy of belief. It is important to note, however, that San Bento said in his affidavit that he did not recall making any complaints about Bucci and that he found her to be “professional, efficient and competent.” But Mandeville never said that it had been San Bento who had made a complaint to her about Bucci. Indeed, other employees from the DMV, or employees from motor-vehicle departments from other states, may well have complained about Bucci’s performance. When a plaintiff attempts to counter a claim by an employer that it fired an employee for poor performance, it is simply not sufficient for a plaintiff to present evidence that her performance wa (7th Cir.2007) (<HOLDING>). Further, the policy states that there are
<output> "she argues that that affidavit demonstrates that Thompson’s testimony regarding DMV complaints about Carter is not worthy of belief. It is

### Levenshtein distance

In [6]:
oai_labeled['levenshtein_distance'] = oai_labeled.apply(lambda x: lev(x['citing_prompt'], x['anonymized_prompt']), axis = 1)

In [7]:
oai_labeled['citing_prompt'].apply(lambda x: len(x)).mean()

np.float64(843.0598419269853)

In [8]:
oai_labeled['levenshtein_distance'].describe()

count    5314.000000
mean       96.213963
std        38.270417
min        20.000000
25%        69.000000
50%        91.000000
75%       116.000000
max       616.000000
Name: levenshtein_distance, dtype: float64
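
Raw edit distances are hard to read without the text length: a mean distance of about 96 against a mean prompt length of about 843 characters corresponds to roughly 11% of the characters changing (a ratio of means, not a per-row average). Note also that `anonymized_prompt` still contains the `<output>` wrapper, so the raw distance includes that overhead. The sketch below computes a per-pair normalized distance. It uses a pure-Python edit distance as a stand-in for `Levenshtein.distance` so that it runs without extra packages, and `normalized_distance` is a hypothetical helper name.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_distance(original: str, rewritten: str) -> float:
    """Edit distance divided by the longer string's length, so the value lies in [0, 1]."""
    longest = max(len(original), len(rewritten))
    return levenshtein(original, rewritten) / longest if longest else 0.0
```

Applying this to the text inside the `<output>` tags rather than to the raw model reply would give a distance that reflects only the anonymization edits.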

In [9]:
print(oai_labeled.query('levenshtein_distance == 91').iloc[0]['citing_prompt'])

print(oai_labeled.query('levenshtein_distance == 91').iloc[0]['anonymized_prompt'])

defects,” specifically, defects in establishing citizenship for the purpose of establishing diversity jurisdiction. Id. at 223. See also Harmon v. OKI Sys., 115 F.3d 477, 479 (7th Cir.1997) (citing with approval the reasoning in In re Allstate that “a defendant’s failure to allege citizenship as opposed to residency ... constituted a procedural defect”). We agree with the Fifth Circuit’s interpretation of § 1447(c) and construction of a party’s failure to establish citizenship in its notice of removal as a procedural defect. “[Wjhere subject matter jurisdiction exists and any procedural shortcomings may be cured by resort to § 1653, we can surmise no valid reason for the court to decline the exercise of jurisdiction.” In re Allstate, 8 F.3d at 223. See also Ellenburg, 519 F.3d at 198 (<HOLDING>). Section 1653 provides that “[d]efec-tive
<output> "defects,” specifically, defects in establishing citizenship for the purpose of establishing diversity jurisdiction. Id. at 456. See also Tayl

### Answer analysis

In [10]:
def extract_answers(text):
    if pd.isnull(text):
        return -1
    # Look for the final-answer marker "ANSWER: X", where X is one of A-E
    pattern = r'\bANSWER:\s*([A-E])'
    match = re.search(pattern, text)
    letter = match.group(1) if match else ''

    if letter:
        # Map the letter to a 0-based option index (A -> 0, ..., E -> 4)
        return ord(letter) - ord('A')
    else: 
        # No parseable answer; -1 marks the row for a random guess later
        return -1
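
To make the parser's behavior on edge cases concrete, here is a pandas-free restatement of the same regular expression. `parse_answer` is a hypothetical name used only for this illustration, and the `isinstance` check stands in for `pd.isnull`.

```python
import re

ANSWER_RE = re.compile(r"\bANSWER:\s*([A-E])")


def parse_answer(text):
    """Map 'ANSWER: C' to 2; return -1 when no answer marker is present."""
    if not isinstance(text, str):  # covers None and the float NaN pandas uses for missing values
        return -1
    match = ANSWER_RE.search(text)  # first occurrence wins
    return ord(match.group(1)) - ord("A") if match else -1


print(parse_answer("<reasoning> ... </reasoning>\nANSWER: C"))
print(parse_answer("The answer is B"))
```

The match is case-sensitive and requires the literal `ANSWER:` marker, so replies such as `Answer: C` or `The answer is B` are treated as unparseable and fall through to the random-guess step.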

In [11]:
oai_labeled['answer_gpt4o'] = oai_labeled['response_gpt4o'].apply(extract_answers)
oai_labeled['answer_gpt4o_anonymized'] = oai_labeled['response_gpt4o_anonymized'].apply(extract_answers)

In [12]:
# Replace unparseable answers (-1) with a uniform random guess over the five options (unseeded, so results can vary between runs)
oai_labeled.loc[oai_labeled['answer_gpt4o'] == -1, 'answer_gpt4o'] = np.random.choice(5, size = oai_labeled.query('answer_gpt4o == -1').shape[0])
oai_labeled.loc[oai_labeled['answer_gpt4o_anonymized'] == -1, 'answer_gpt4o_anonymized'] = np.random.choice(5, size = oai_labeled.query('answer_gpt4o_anonymized == -1').shape[0])

In [13]:
(f1_score(oai_labeled['label'], oai_labeled['answer_gpt4o'], average='macro'), 
f1_score(oai_labeled['label'], oai_labeled['answer_gpt4o_anonymized'], average='macro'))


(0.7439582920642518, 0.7276315728579614)

In [19]:
(oai_labeled['answer_gpt4o'] == oai_labeled['answer_gpt4o_anonymized']).mean()

np.float64(0.8759879563417388)
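
The agreement rate alone does not say whether the disagreements help or hurt accuracy. One way to look closer is to cross-tabulate correctness on the original prompt against correctness on the anonymized prompt. The sketch below does this with plain Python lists, and `correctness_table` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def correctness_table(labels, preds_original, preds_anonymized):
    """Count items by (correct on original prompt, correct on anonymized prompt)."""
    table = Counter()
    for y, a, b in zip(labels, preds_original, preds_anonymized):
        table[(a == y, b == y)] += 1
    return table


# Toy data for illustration only; not taken from the CaseHOLD results.
toy = correctness_table([0, 1, 2, 3, 4], [0, 1, 2, 0, 0], [0, 1, 0, 3, 1])
for key in [(True, True), (True, False), (False, True), (False, False)]:
    print(key, toy[key])
```

With the notebook's data, the call would be `correctness_table(oai_labeled['label'], oai_labeled['answer_gpt4o'], oai_labeled['answer_gpt4o_anonymized'])`. The `(True, False)` cell, correct only on the original citation, is the one most relevant to the memorization question, and it should be read against the `(False, True)` cell, which captures changes in the other direction.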

In [14]:
print(oai_labeled.loc[100]['citing_prompt'])

to a wholesaler/distributor who then packs them with goods; 4. sold in an environment of sale that features the goods packed in the jar and not the jar itself; 5. used to commercially convey foodstuffs, beverages, oils, meat extracts, etc.; 6. capable of being used in the hot packing process; and 7. recognized in the trade as used primarily to pack and convey goods to a consumer who then discards the container after this initial use. T.D. 96-7, 30 Customs Bulletin at 32-33, 61 Fed. Reg. at 224. The Government urges that HQRL 962378 “complied with the standards set forth” in T.D. 96-7, and is therefore entitled to deference under Skidmore v. Swift & Co., 323 U.S. 134 (1944) (Def.’s Post Trial Mem. at 10); see Rocknel Fastener, Inc. v. United States, 267 F.3d 1354, 1357 (Fed. Cir. 2001) (<HOLDING>). In Rocknel, the Court of Appeals for the


In [15]:
print(oai_labeled.loc[100]['response_gpt4o'])

<reasoning>  
The input text discusses the application of deference to an agency's interpretation of a legal standard, specifically referencing Skidmore v. Swift & Co. and the standards set forth in T.D. 96-7. The Government argues that the agency's decision complies with these standards and is entitled to Skidmore deference. Skidmore deference is granted when an agency's interpretation is persuasive and based on thorough reasoning and analysis. The text also references Rocknel Fastener, Inc. v. United States, which likely involves a similar issue of deference to an agency's decision.

Given this context, the most appropriate holding would address the application of Skidmore deference in a situation where the agency's decision demonstrates thorough analysis and reasoning. This aligns with the logical flow of the argument, which emphasizes the persuasiveness and consistency of the agency's position.

Option A discusses Skidmore respect but focuses on the agency's manual and practice, wh

In [16]:
print(oai_labeled.loc[100]['anonymized_prompt'])

<output> "to a wholesaler/distributor who then packs them with goods; 4. sold in an environment of sale that features the goods packed in the jar and not the jar itself; 5. used to commercially convey foodstuffs, beverages, oils, meat extracts, etc.; 6. capable of being used in the hot packing process; and 7. recognized in the trade as used primarily to pack and convey goods to a consumer who then discards the container after this initial use. T.D. 97-8, 31 Customs Bulletin at 33-34, 62 Fed. Reg. at 225. The Government urges that HQRL 973479 “complied with the standards set forth” in T.D. 97-8, and is therefore entitled to deference under Johnson v. Swift & Co., 324 U.S. 135 (1945) (Def.’s Post Trial Mem. at 11); see Thompson Fastener, Inc. v. United States, 268 P.2d 1355, 1358 (Cal.App. 2002) (<HOLDING>). In Thompson, the Court of Appeals for the" </output>


In [17]:
print(oai_labeled.loc[100]['response_gpt4o_anonymized'])

<reasoning> The input text discusses the application of deference to an agency's interpretation of a legal standard, specifically referencing HQRL 973479 and its compliance with the standards set forth in T.D. 97-8. The Government argues that this compliance entitles the agency's decision to deference, citing Johnson v. Swift & Co. and other precedents. The context suggests that the legal issue revolves around the level of deference courts should give to agency decisions, particularly in the context of customs classifications. 

Option A discusses Chevron and Skidmore deference but focuses on a director's litigation position, which is not directly relevant to the customs classification issue in the input text. 

Option B directly addresses Skidmore deference in the context of customs classification decisions, emphasizing thorough analysis, which aligns with the discussion in the input text about the agency's compliance with standards and the Government's argument for deference.

Option