# Identifying Inadmissibility Cases in Canadian Federal Court Decisions (2014–2024)


## Step 1: Library Imports
Import all necessary libraries to load the dataset, process the text, and interact with external models.

- `pandas`: Used for tabular data manipulation.

- `re`: Python's built-in module for regular expressions to detect legal patterns.

- `subprocess`: Executes command-line commands, useful for interacting with external tools like LLaMA 3.

- `datasets`: Provides access to HuggingFace datasets.

- `tqdm`: Displays progress bars for loops, particularly useful when processing a large number of court cases.

To run this notebook from step 8, you will need to:
1. Download Ollama from [here](https://ollama.com/).
2. In your terminal, run `ollama pull llama3` to download the llama3 model to your local machine.
3. Start the Ollama application before running the code from Step 8.
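
Before running the LLM steps, it can help to confirm that the Ollama command-line tool is reachable from Python. The sketch below is not part of the original notebook; the helper names `ollama_available` and `list_local_models` are hypothetical, and it assumes the `ollama` executable is on your `PATH`.

```python
import shutil
import subprocess


def ollama_available():
    """Return True if an `ollama` executable is found on the PATH."""
    return shutil.which("ollama") is not None


def list_local_models():
    """Return the raw text printed by `ollama list`, or None if it cannot be run."""
    if not ollama_available():
        return None
    try:
        result = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


if ollama_available():
    print(list_local_models() or "Ollama found, but `ollama list` failed; is the app running?")
else:
    print("Ollama executable not found on PATH.")
```

If `llama3` does not appear in the printed list, repeat the `ollama pull llama3` step above.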

In [1]:
import pandas as pd
import tqdm as notebook_tqdm
from tqdm import tqdm
from datasets import load_dataset
import datetime
import re
import subprocess
import ast



## Step 2: Loading and Filtering the Dataset (2014–2024)
Load the Federal Court dataset and retain only decisions published between 2014 and 2024.

- The dataset is retrieved using the datasets library from HuggingFace.

- It contains metadata and full decision texts from the Canadian Federal Court.

- The filtering ensures temporal relevance for the study, focusing on contemporary jurisprudence.
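
As a quick illustration of the year filter, the sketch below applies an inclusive 2014 to 2024 window to a toy DataFrame. The rows are made up for illustration, and `Series.between` is used here as one possible way to express the range; it is not necessarily how the notebook does it.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from the HuggingFace dataset.
toy = pd.DataFrame({
    "citation": ["2013 FC 1", "2014 FC 105", "2024 FC 9", "2025 FC 3"],
    "year": [2013, 2014, 2024, 2025],
})

# `between` is inclusive on both ends by default.
in_range = toy[toy["year"].between(2014, 2024)]
print(in_range["year"].tolist())
```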



In [2]:
dataset = load_dataset("refugee-law-lab/canadian-legal-data", "FC", split="train")
FC = dataset.to_pandas()
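# Bound the window on both sides; "year >= 2014" alone would also admit decisions dated after 2024.
FC_2014_2024 = FC.query("year >= 2014 and year <= 2024")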

Generating train split: 100%|██████████| 63568/63568 [00:01<00:00, 34419.05 examples/s]


## Step 3: Removing Translated Versions (French Duplicates)
Eliminate French translations when English versions of the same case are available.

- Decisions can be published in both English and French.

- To avoid duplication and inconsistencies, the script standardizes the citation by replacing court acronyms (e.g., "FC", "CF") with a generic placeholder.

- French entries are removed only if a corresponding English version exists.
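
The normalization idea can be seen on a three-row toy example. This is a minimal sketch under the assumption that an English and a French version of a decision differ only in the court acronym; the citations and the helper name `normalize_citation` are illustrative.

```python
import re

import pandas as pd

# Replace either court acronym with a shared placeholder.
ACRONYM_RE = re.compile(r"\b(?:FC|CF)\b")


def normalize_citation(citation):
    return ACRONYM_RE.sub("COURT", citation)


toy = pd.DataFrame({
    "citation": ["2014 FC 105", "2014 CF 105", "2015 CF 77"],
    "language": ["en", "fr", "fr"],
})
toy["key"] = toy["citation"].map(normalize_citation)

# A French row is a translation only if an English row shares its normalized key.
english_keys = set(toy.loc[toy["language"] == "en", "key"])
is_translation = (toy["language"] == "fr") & toy["key"].isin(english_keys)
kept = toy.loc[~is_translation, "citation"].tolist()
print(kept)
```

The French-only decision (`2015 CF 77`) survives because no English row shares its normalized key.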



In [3]:
def remove_translated_cases(df, citation_col='citation', lang_col='language', lang_primary='en', lang_secondary='fr'):
    """
    Removes rows in the secondary language (e.g., French) that are translations of cases already 
    present in the primary language (e.g., English), based on normalized court citations.

    Parameters:
    -----------
    df : pd.DataFrame
        The DataFrame containing legal case data.
    citation_col : str, optional
        The name of the column containing case citations. Default is 'citation'.
    lang_col : str, optional
        The name of the column containing language information. Default is 'language'.
    lang_primary : str, optional
        The language code to be considered as the primary version (e.g., 'en'). Default is 'en'.
    lang_secondary : str, optional
        The language code to be considered as the translated version to remove (e.g., 'fr'). Default is 'fr'.

    Returns:
    --------
    pd.DataFrame
        A filtered DataFrame with translated cases removed when the same case exists in the primary language.
    """
    court_acronyms = ['FC', 'CF'] 
    pattern = r'\b(' + '|'.join(court_acronyms) + r')\b'

    def normalize(citation):
        return re.sub(pattern, 'COURT', citation)

    df = df.copy()
    df['normalized_citation'] = df[citation_col].apply(normalize)

    primary_citations = set(df[df[lang_col] == lang_primary]['normalized_citation'])

    filtered_df = df[~((df[lang_col] == lang_secondary) & (df['normalized_citation'].isin(primary_citations)))]

    return filtered_df.drop(columns=['normalized_citation'])

FC_remove_translation = remove_translated_cases(FC_2014_2024)

## Step 4: Isolating Immigration-Related Decisions
Filter for cases specifically related to immigration matters.

- The approach checks for the presence of immigration-related terms within the first 10 lines of each decision.

- Keywords include “Citizenship and Immigration”, “Citoyenneté et Immigration”, and “MCI” (Minister of Citizenship and Immigration).

- This step narrows the dataset to immigration law cases and excludes unrelated legal issues.
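
The header check can be sketched on two toy texts: one that names the ministry in its first line, and one that mentions it only on line 11. The function name `mentions_immigration_minister` and the sample texts are hypothetical.

```python
HEADER_TERMS = ("Citizenship and Immigration", "Citoyenneté et Immigration", "MCI")


def mentions_immigration_minister(text, n_lines=10):
    """Return True if any header term appears in the first `n_lines` lines."""
    if not isinstance(text, str):
        return False
    header = " ".join(text.splitlines()[:n_lines])
    return any(term in header for term in HEADER_TERMS)


early = "Smith v. Canada (Citizenship and Immigration)\nReasons for judgment"
late = "\n".join(["filler"] * 10 + ["Citizenship and Immigration"])

print(mentions_immigration_minister(early))
print(mentions_immigration_minister(late))
print(mentions_immigration_minister(None))
```

Because this is a plain substring test, `"MCI"` also matches inside longer uppercase strings (for example a surname such as `MCINTOSH`), so some false positives are possible.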



In [4]:
def immigration_cases(text):
    """
    Checks whether the given text contains references to immigration-related ministries 
    within the first 10 lines.

    This function is typically used to filter legal case documents that mention 
    "Citizenship and Immigration", "Citoyenneté et Immigration", or the abbreviation "MCI" early in the text.

    Parameters:
    ----------
    text : str or None
        The textual content of a legal case, potentially containing multiple lines.

    Returns:
    -------
    bool
        True if any of these terms appears in the first 10 lines of the text; False otherwise.
    """
    if pd.isna(text):
        return False
    lines = text.splitlines()[:10]
    joined_lines = ' '.join(lines)
    return (
        # "Public Safety" in joined_lines or
        # "Immigration, Refugees and Citizenship" in joined_lines or
        "Citizenship and Immigration" in joined_lines or
        "Citoyenneté et Immigration" in joined_lines or
        "MCI" in joined_lines
    )

FC_immigration = FC_remove_translation[FC_remove_translation['unofficial_text'].apply(immigration_cases)]

## Step 5: Removing Refugee Claims
Exclude cases dealing with refugee protection to focus exclusively on inadmissibility under the Immigration and Refugee Protection Act (IRPA).

- A comprehensive regular expression is used to detect references to refugee claims, protected persons, and related terminology.

- These cases are removed since the legal framework and reasoning differ substantially from inadmissibility cases.
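
A small sketch of the exclusion on toy texts follows. It uses a shortened term list and a non-capturing group, and it assumes the text column may contain missing values; the sample sentences are invented.

```python
import pandas as pd

# Shortened, illustrative subset of the exclusion terms.
REFUGEE_TERMS = (
    r"\b(?:Refugee Protection Division|convention refugees?|"
    r"refugee claimants?|protected persons?)\b"
)

toy = pd.DataFrame({"unofficial_text": [
    "The officer found the applicant inadmissible for misrepresentation.",
    "The Refugee Protection Division rejected the claim.",
    None,
]})

# na=False treats missing text as "no match", so those rows are kept.
has_refugee_term = toy["unofficial_text"].str.contains(
    REFUGEE_TERMS, case=False, regex=True, na=False
)
kept = toy[~has_refugee_term]
print(has_refugee_term.tolist())
```

Note that a single passing mention of any listed term is enough to drop a decision, so this filter favours excluding refugee matters over retaining borderline cases.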

In [5]:
RE_exclude_refugee = re.compile(
    r'\b(?:Refugee Protection Division|convention refugees?|persons? in need of protection|refugee claimants?|protected persons?|réfugiés?)\b',
    re.IGNORECASE
)

def filter_refugee_cases(df, text_column="unofficial_text"):
    """
    Removes rows containing refugee exclusion terms from the DataFrame.

    Parameters:
    ----------
    df : pd.DataFrame
    text_column : str

    Returns:
    -------
    pd.DataFrame: Filtered DataFrame without refugee-related documents
    """
    mask = ~df[text_column].str.contains(RE_exclude_refugee, na=False)
    return df[mask].copy()

FC_non_refugee = filter_refugee_cases(FC_immigration)



## Step 6: Filtering by Inadmissibility Mentions
Retain only those cases that explicitly mention “inadmissibility” or “inadmissible.”

- This is a keyword-based filter using simple substring checks.

- It helps narrow down the dataset to decisions where inadmissibility is central to the legal reasoning.
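
Since both keywords share the stem "inadmissib", the check can also be written as one vectorized, case-insensitive substring test. This is an equivalent sketch on invented sentences, not the notebook's own implementation.

```python
import pandas as pd

toy = pd.DataFrame({"unofficial_text": [
    "The applicant was found INADMISSIBLE under the Act.",
    "The issue is the inadmissibility finding.",
    "The evidence was admissible.",
]})

# Lowercase once, then look for the shared stem as a literal substring.
mask = toy["unofficial_text"].str.lower().str.contains(
    "inadmissib", regex=False, na=False
)
print(mask.tolist())
```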



In [6]:
def filter_inadmissibility(df, text_column="unofficial_text"):
    """
    Filters rows in the DataFrame whose full text in the specified column contains
    'inadmissible' or 'inadmissibility' (case-insensitive).

    Parameters:
    df (pd.DataFrame): Input DataFrame
    text_column (str): Name of the column containing case text

    Returns:
    pd.DataFrame: Filtered DataFrame with only relevant cases
    """

    return df[df[text_column].apply(lambda text: 
           'inadmissible' in text.lower() or
           'inadmissibility' in text.lower())]

FC_inadmissible = filter_inadmissibility(FC_non_refugee)

## Step 7: Categorizing Inadmissibility Grounds under IRPA
This step aims to automatically classify each court decision into one or more categories of inadmissibility, as defined in the Immigration and Refugee Protection Act (IRPA), using both:

- Explicit references to IRPA sections (e.g., s.34, s.36(1))

- Implicit references through legal keywords and concepts

**Components Involved**
1. SECTION_PATTERNS Dictionary
This dictionary maps each inadmissibility ground to a regular expression (regex) designed to match explicit references to legal provisions in the IRPA.
Each key corresponds to a legal category, and each value is a compiled regex that detects citations like “section 34” or “s.36(1)” within the decision text.

Examples:

- "security" → section 34 IRPA

- "serious_criminality" → section 36(1)

- "misrepresentation" → section 40

2. RE_patterns Dictionary
This dictionary maps each ground to a regex that captures semantic/legal terms that often imply the corresponding section, even without explicit citation.

Examples:

- "security" → matches terms like “espionage”, “terrorism”, “subversion”

- "criminality" → matches “convictions”, “summary offences”, “indictable”

- "health_grounds" → detects “danger to public health” or “excessive demand on health services”

**Function: categorize_document(text)**

Classifies a court decision into one or more grounds of inadmissibility, based on either:

- Section citations (preferred)

- Legal semantics (fallback)

If neither is found, the case is classified as `['other']`.

**Execution Flow:**
Step 1: Extract Numbered Legal Paragraphs
- The decision is split into lines.

- Extraction starts from the first paragraph labeled [1].

- Subsequent lines are kept while they start with a paragraph number like [12] or with a digit (e.g., 12.); extraction stops at the first line that does neither, including a blank line.

- This section is usually the legal reasoning portion of the decision.

Step 2: Match Explicit Section Patterns
- Each regex in SECTION_PATTERNS is applied to the extracted portion.

- If a section match is found (e.g., “section 36(1)”), the corresponding ground (e.g., serious_criminality) is returned immediately as a single-element list.

Step 3: Fallback to Keyword Patterns
- If no section is found, each regex in RE_patterns is applied.

- Multiple matches can occur; the function returns a list of all matching grounds.

Step 4: Default Classification
- If neither section citations nor relevant keywords are found, the case is labeled as ['other'].
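
The two-tier flow described above can be sketched with two grounds only. The patterns here are deliberately simplified stand-ins for `SECTION_PATTERNS` and `RE_patterns`, and the function name `classify` and the sample sentences are hypothetical.

```python
import re

# Tier 1: explicit section citations (simplified).
SECTION_RE = {
    "security": re.compile(r"\b(?:section|s\.)\s*34\b", re.IGNORECASE),
    "misrepresentation": re.compile(r"\b(?:section|s\.)\s*40\b", re.IGNORECASE),
}
# Tier 2: keyword fallback (simplified).
KEYWORD_RE = {
    "security": re.compile(r"\b(?:espionage|terrorism|subversion)\b", re.IGNORECASE),
    "misrepresentation": re.compile(r"\b(?:material facts?|false statements?)\b", re.IGNORECASE),
}


def classify(text):
    # The first section match wins and is returned as a single-element list.
    for ground, pattern in SECTION_RE.items():
        if pattern.search(text):
            return [ground]
    # Otherwise collect every ground whose keywords appear.
    hits = [ground for ground, pattern in KEYWORD_RE.items() if pattern.search(text)]
    return hits or ["other"]


print(classify("[1] The officer relied on section 40 of the IRPA."))
print(classify("[1] The report alleges espionage and false statements."))
print(classify("[1] He cited s. 34 and section 40."))
print(classify("[1] The application concerns a study permit."))
```

One consequence of returning on the first section match is that a decision citing several sections is labelled with only the first ground in dictionary order, as the third call shows.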

In [7]:
SECTION_PATTERNS = {
    'security': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*34\b|\b34\(\d+\)', re.IGNORECASE),
    'human_rights': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*35\b|\b35\(\d+\)', re.IGNORECASE),
    'serious_criminality': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*36\(1\)', re.IGNORECASE),
    'criminality': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*36\(2\)', re.IGNORECASE),
    'organized_criminality': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*37\b|\b37\(\d+\)', re.IGNORECASE),
    'health_grounds': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*38\b|\b38\(\d+\)', re.IGNORECASE),
    'financial_reasons': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*39\b|\b39\(\d+\)', re.IGNORECASE),
    'misrepresentation': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*40\b|\b40\(\d+\)', re.IGNORECASE),
    'non_compliance': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*41\b|\b41\(\d+\)', re.IGNORECASE),
    'inadmissible_family': re.compile(r'\b(?:s(?:ection)?\.?\s*|subsection|paragraphs?)\s*42\b|\b42\(\d+\)', re.IGNORECASE)
}

RE_patterns = {
    'security': re.compile(
        r'\b(espionages?|against canada|canada[’\'‘s]* interests?|subversions?|democratic governments?|terrorisms?|dangers? to security|violences?|endangerments?|memberships?|complicity|reasonable grounds? to believe)\b',
        re.IGNORECASE
    ),
    'human_rights': re.compile(
        r'\b(human rights?|international rights?|violations?|senior officials?|governments?|regimes?|genocides?|war crimes?|crimes? against humanity|participations?|contributions?|reasonable grounds? to believe|terrorisms?)\b',
        re.IGNORECASE
    ),
    'serious_criminality': re.compile(
        r'\b(criminal convictions?|foreign convictions?|imprisonments?|10 years|ten years|sentences?|over (6|six) months|serious indictable offences?|commissions?|reasonable grounds? to believe)\b',
        re.IGNORECASE
    ),
    'criminality': re.compile(
        r'\b(criminal convictions?|foreign convictions?|indictments?|indictable offences?|summary offences?|commissions?)\b',
        re.IGNORECASE
    ),
    'organized_criminality': re.compile(
        r'\b(memberships?|criminal activities?|organized crimes?|acting in concert|people smuggling|traffickings?|money launderings?|proceeds? of crime|reasonable grounds? to believe)\b',
        re.IGNORECASE
    ),
    'health_grounds': re.compile(
        r'\b(dangers? to public health|dangers? to public safety|excessive demands? on health services|excessive demands? on social services)\b',
        re.IGNORECASE
    ),
    'financial_reasons': re.compile(
        r'\b(unable or unwilling to support (oneself|dependents?)|arrangements? for care and support|social assistances?)\b',
        re.IGNORECASE
    ),
    'misrepresentation': re.compile(
        r'\b(misrepresenting|withholding|material facts?|errors? in administration|non-disclosures?|omissions?|false statements?|false information)\b',
        re.IGNORECASE
    ),
    'non_compliance': re.compile(
        r'\b(contraventions?|non-compliances?|failures? to comply)\b',
        re.IGNORECASE
    ),
    'inadmissible_family': re.compile(
        r'\b(inadmissible family members?|accompanying family members?)\b',
        re.IGNORECASE
    )
}

def categorize_document(text):
    """
    Extracts relevant numbered lines and classifies a legal document 
    into IRPA inadmissibility grounds.

    Returns:
    - [single ground] if a section is matched.
    - [multiple grounds] if matched by keyword.
    - ['other'] if nothing matches.

    Parameters:
    ----------
    text : str

    Returns:
    -------
    list of str
    """

    # Step 1: Extract relevant numbered lines
    lines = text.splitlines()
    extracted = []
    start_extracting = False

    for line in lines:
        line_strip = line.strip()
        if not start_extracting:
            if "[1]" in line_strip:
                extracted.append(line)
                start_extracting = True
        else:
            if line_strip.startswith("[") and line_strip[1:line_strip.find("]")].isdigit():
                extracted.append(line)
            elif line_strip and line_strip[0].isdigit():
                extracted.append(line)
            else:
                break

    extracted_text = "\n".join(extracted)

    # Step 2: Check section patterns
    for category, pattern in SECTION_PATTERNS.items():
        if re.search(pattern, extracted_text):
            return [category]

    # Step 3: Fallback to keyword patterns
    matched_keywords = [
        category for category, pattern in RE_patterns.items()
        if re.search(pattern, extracted_text)
    ]

    return matched_keywords if matched_keywords else ['other']

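# Work on an explicit copy so the new column is not assigned to a slice of another DataFrame.
FC_inadmissible = FC_inadmissible.copy()
FC_inadmissible['inadmissibility_ground'] = FC_inadmissible['unofficial_text'].apply(categorize_document)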
FC_inadmissible.head()



| Unnamed: 0 | citation | citation2 | dataset | year | name | language | document_date | source_url | scraped_timestamp | unofficial_text | other | inadmissibility_ground |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18532 | 2014 FC 105 | | FC | 2014 | Muthui v. Canada (Citizenship and Immigration) | en | 2014-01-30 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-23 | Muthui v. Canada (Citizenship and Immigration)... | | [inadmissible_family] |
| 18595 | 2014 FC 112 | | FC | 2014 | Adewole v. Canada (Citizenship and Immigration) | en | 2014-02-04 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-23 | Adewole v. Canada (Citizenship and Immigration... | | [other] |
| 18609 | 2014 FC 1137 | | FC | 2014 | Sidhu v. Canada (Citizenship and Immigration) | en | 2014-11-25 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-23 | Sidhu v. Canada (Citizenship and Immigration)\... | | [other] |
| 18617 | 2014 FC 1146 | | FC | 2014 | Abebe v. Canada (Citizenship and Immigration) | en | 2014-11-28 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-23 | Abebe v. Canada (Citizenship and Immigration)\... | | [security] |
| 18618 | 2014 FC 1147 | | FC | 2014 | Bundhel v. Canada (Citizenship and Immigration) | en | 2014-11-28 | https://decisions.fct-cf.gc.ca/fc-cf/decisions... | 2022-08-23 | Bundhel v. Canada (Citizenship and Immigration... | | [other] |


## Step 8: Validating Inadmissibility with LLaMA 3 (LLM-Based Review)
Use a large language model (LLaMA 3) to validate whether the case is indeed about inadmissibility and generate a concise summary.

- A two-step prompt system is employed:

    - Summarization: Generate a one-sentence summary of the case.

    - Binary Classification: Determine whether the decision primarily concerns inadmissibility.

- This approach is intended to improve precision over regex alone and serves as a second-level validation.
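
Model replies do not always match the requested label character for character (extra whitespace, a trailing period, or markdown emphasis are common). The sketch below shows one way to normalize the raw reply before filtering on it; the helper `normalize_label` and its three return values are hypothetical and not part of the notebook.

```python
def normalize_label(raw_output):
    """Map a raw model reply to 'inadmissibility', 'not_inadmissibility', or 'unparsed'."""
    if not isinstance(raw_output, str):
        return "unparsed"
    # Drop surrounding whitespace, quotes, periods, asterisks, and backticks.
    cleaned = raw_output.strip().strip("\"'.*` ").lower()
    if cleaned == "inadmissibility":
        return "inadmissibility"
    if cleaned == "not inadmissibility":
        return "not_inadmissibility"
    # Anything else (explanations, empty replies) is flagged for manual review.
    return "unparsed"


print(normalize_label("Inadmissibility"))
print(normalize_label(" Not Inadmissibility.\n"))
print(normalize_label("**Inadmissibility**"))
print(normalize_label("I think this is about inadmissibility."))
```

An exact comparison is used rather than a substring test because "Not Inadmissibility" itself contains the word "Inadmissibility".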


In [None]:
def classify_inadmissibility(df, text_column="unofficial_text", model="llama3"):
    """
    Classify each court case using LLaMA 3 via subprocess by:
    1. Extracting numbered sections starting from [1].
    2. Summarizing the extracted text.
    3. Feeding the summary into a classification prompt.

    This version returns the raw classification output from the model.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing court case texts.
    text_column : str, optional
        The name of the column containing the court text (default is "unofficial_text").
    model : str, optional
        The name of the language model to use with Ollama (default is "llama3").

    Returns
    -------
    pandas.DataFrame
        A DataFrame with an added column 'inadmissibility' containing the raw model output.
    """

    def extract_numbered_lines(text):
        lines = text.splitlines()
        extracted = []
        start_extracting = False

        for line in lines:
            line_strip = line.strip()
            if not start_extracting:
                # Start if line contains "[1]" anywhere
                if "[1]" in line_strip:
                    extracted.append(line)
                    start_extracting = True
            else:
                # Continue only if line starts with [number] or number
                if line_strip.startswith("[") and line_strip[1:line_strip.find("]")].isdigit():
                    extracted.append(line)
                elif line_strip and line_strip[0].isdigit():
                    extracted.append(line)
                else:
                    break
        return "\n".join(extracted)

    def run_ollama(prompt):
        try:
            result = subprocess.run(
                ["ollama", "run", model],
                input=prompt.encode(),
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE
            )
            return result.stdout.decode().strip()
        except Exception as e:
            print(f"Subprocess error: {e}")
            return "subprocess_error"

    def generate_summary(text):
        prompt = f"""
You are a legal analyst specializing in Canadian immigration law.

Summarize the following court case in one sentence, clearly stating what the case is about.

Case Text:
{text}

Summary:
"""
        return run_ollama(prompt)

    def classify_summary(summary):
        prompt = f"""
You are a Canadian immigration law expert.

Based on the following summary of a legal case, classify whether the case involves 
a judicial review of an inadmissibility decision under Canadian immigration law.


Respond only with one of the following:
Inadmissibility
Not Inadmissibility

Summary:
{summary}

Classification:
"""
        return run_ollama(prompt)

    df = df.copy()
    df["inadmissibility"] = None

    for idx, row in df.iterrows():
        full_text = row[text_column]
        limited_text = extract_numbered_lines(full_text)
        summary = generate_summary(limited_text)
        raw_output = classify_summary(summary)
        df.at[idx, "inadmissibility"] = raw_output
        print(f"Row {idx} classified.")

    return df

inadmissibility_df = classify_inadmissibility(FC_inadmissible)
inadmissibility_df = inadmissibility_df.query("inadmissibility == 'Inadmissibility'")
inadmissibility_df = inadmissibility_df.drop("inadmissibility", axis=1)
inadmissibility_df.head()

## Step 9: Judge Name Extraction
Use llama3 from Ollama (LLM) to extract judge names.

- Analyzes the first 30 lines of each court case to locate judicial information, which is typically stated early in the decision.

- Prompts the model to return a sentence identifying the judges, or to state that none are mentioned.

- Extracts only the judge names (stripped of titles) from the generated sentence and stores them in a list.
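
The final parsing step, turning the model's reply into a Python list, can be illustrated without calling the model. In this sketch the reply strings and names are invented, and `parse_name_list` is a hypothetical helper that mirrors the bracket-matching approach.

```python
import ast
import re


def parse_name_list(model_output):
    """Extract the first [...] span from a reply and parse it as a list of strings."""
    match = re.search(r"\[.*?\]", model_output, re.DOTALL)
    if not match:
        return []
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        # The bracketed text was not a valid Python literal (e.g. unquoted names).
        return []
    if not isinstance(parsed, list):
        return []
    return [name.strip() for name in parsed if isinstance(name, str) and name.strip()]


print(parse_name_list('Here is the list:\n["Jane Doe", "John Roe"]'))
print(parse_name_list("[]"))
print(parse_name_list("[Jane Doe]"))
print(parse_name_list("No judges are mentioned in the case."))
```

`ast.literal_eval` only evaluates literals, so a malformed or unexpected reply yields an empty list rather than executing anything.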

In [None]:
def generate_judge_sentence(text, model="llama3"):
    """
    Generate a sentence identifying the judges in a given court text.

    Parameters
    ----------
    text : str
        The court text to analyze.
    model : str, optional
        The name of the language model to use (default is "llama3").

    Returns
    -------
    str
        A sentence either listing the judges or stating that none are mentioned.
    """
    prompt = f"""
You are a legal assistant. Your task is to identify the names of the judge(s) who presided over the case from the following court text.

Instructions:
- Return a single sentence that starts with "The judges in this case are ..." followed by the judge names.
- If no judge is mentioned, return "No judges are mentioned in the case."
- Do not assume any judges if it is not mentioned.
- Keep your answer in 1 sentence.

Court Text:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    return result.stdout.decode().strip()


def extract_names_from_judge_sentence(sentence, model="llama3"):
    """
    Extract only the judge names from a sentence.

    Parameters
    ----------
    sentence : str
        A sentence that mentions the judges (e.g., output of `generate_judge_sentence`).
    model : str, optional
        The name of the language model to use (default is "llama3").

    Returns
    -------
    list of str
        A list of judge names with titles removed. Returns an empty list if no names are found.
    """
    prompt = f"""
You are a legal parser. Extract only the judge names from the sentence below.

Instructions:
- Return the names in a valid Python list of strings.
- Do not include any titles like "Judge", "Justice", or "Chief Justice".
- If no names are found, return: []

Sentence:
{sentence}

Output:
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    output = result.stdout.decode().strip()

    match = re.search(r"\[.*?\]", output, re.DOTALL)
    if match:
        try:
            return ast.literal_eval(match.group(0))
        except Exception:
            return []
    return []


def extract_judges_from_dataframe(df, text_column="unofficial_text", model="llama3"):
    """
    Extract judge names from a DataFrame containing court texts.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing court text data.
    text_column : str, optional
        The column name in `df` that contains the court text (default is "unofficial_text").
    model : str, optional
        The language model to use for generation and parsing (default is "llama3").

    Returns
    -------
    pandas.DataFrame
        A new DataFrame with an additional 'judges' column containing lists of judge names.
    """
    df = df.copy()
    df["judges"] = None

    for idx, row in df.iterrows():
        full_text = row[text_column]
        lines = full_text.splitlines()
        first_30 = "\n".join(lines[:30])

        judge_sentence = generate_judge_sentence(first_30, model=model)
        judge_names = extract_names_from_judge_sentence(judge_sentence, model=model)

        df.at[idx, "judges"] = judge_names
        print(f"Row {idx} processed.")
    
    return df

FC_judges = extract_judges_from_dataframe(inadmissibility_df, text_column="unofficial_text")

FC_judges = FC_judges.drop('Unnamed: 0', axis=1).reset_index(drop=True)

FC_judges.head()

## Step 10: City (Location) Extraction
Use llama3 from Ollama (LLM) to extract the city where each case was heard.

- Extracts a 15-line window from the court text (zero-based line indices 10 through 24), where city names are commonly mentioned in Federal Court decisions.

- Prompts the language model to extract and return only the city name, or “NA” if not found.

- Adds a `locations` column to the DataFrame.
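
To make the line window concrete, the sketch below slices a toy 40-line text with the same start and end values used in this step. The helper `line_window` is hypothetical.

```python
def line_window(text, start, end):
    """Return the lines at zero-based indices start..end-1."""
    return text.splitlines()[start:end]


toy_text = "\n".join(f"line {i}" for i in range(40))
window = line_window(toy_text, 10, 25)

print(len(window), window[0], window[-1])
print(line_window("only\ntwo lines", 10, 25))
```

A text shorter than 11 lines produces an empty window, in which case the model receives no court text at all.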

In [None]:
def extract_locations_from_text(text, model="llama3"):
    """
    Directly extracts the city name from the given court text.

    Parameters
    ----------
    text : str
        The court text to analyze.
    model : str, optional
        The name of the language model to use (default is "llama3").

    Returns
    -------
    str
        The extracted city name, or 'NA' if no location is found.
    """
    prompt = f"""
You are a legal assistant. Identify the city where the case was heard from the following court text.

Instructions:
- Extract the city where the case was heard.
- Just provide the city name, nothing else.
- If no location is found, return NA

Court Text:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    
    output = result.stdout.decode().strip()
    
    return output if output else "NA"

def extract_locations_from_dataframe(df, startline, endline, text_column="unofficial_text", model="llama3"):
    """
    Extract city names from a DataFrame containing court texts.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing court text data.
    startline : int
        The starting line number of the text to consider.
    endline : int
        The ending line number of the text to consider.
    text_column : str, optional
        The column in `df` that contains the court text (default is "unofficial_text").
    model : str, optional
        The language model to use (default is "llama3").

    Returns
    -------
    pandas.DataFrame
        A copy of the input DataFrame with an additional 'locations' column containing extracted city names.
    """
    df = df.copy()

    def extract_slice_and_location(text):
        lines = text.splitlines()
        selected_text = "\n".join(lines[startline:endline])
        return extract_locations_from_text(selected_text, model=model)

    df["locations"] = df[text_column].apply(extract_slice_and_location)

    return df

FC_city = extract_locations_from_dataframe(FC_judges, 10, 25,  text_column="unofficial_text", model="llama3")
FC_city.head()

## Step 11: Case Outcome Extraction
Use llama3 from Ollama (LLM) to extract the legal outcome of each case.

- Analyzes the 30-line window that starts 50 lines from the end of each case and stops 20 lines before the end, where the court's decision is typically summarized.

- Uses a language model to extract the final decision in a single word (e.g., “allowed”, “dismissed”).

- Returns “unknown” if the outcome cannot be determined.

- Appends the outcome as a new column.
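
Negative slice bounds count from the end of the document. The sketch below shows which lines the `-50` and `-20` bounds select on toy texts of different lengths; `tail_window` is a hypothetical helper.

```python
def tail_window(text, start=-50, end=-20):
    """Return the lines from `start` up to (not including) `end`, counted from the end."""
    return text.splitlines()[start:end]


long_text = "\n".join(f"line {i}" for i in range(100))
window = tail_window(long_text)
print(len(window), window[0], window[-1])

# Shorter documents yield smaller, or empty, windows.
print(len(tail_window("\n".join(["x"] * 30))))
print(len(tail_window("\n".join(["x"] * 15))))
```

The final 20 lines are always excluded, and documents with 20 lines or fewer produce an empty window, so the window bounds may need adjusting for short decisions.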

In [None]:
def extract_case_outcome(text, start, end, model="llama3"):
    """
    Extract the outcome of a legal case in a single word.

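    Parameters
    ----------
    text : str
        The court text from which to extract the outcome.
    start : int
        Start index of the line slice (negative values count from the end of the text).
    end : int
        End index of the line slice, exclusive (negative values count from the end of the text).
    model : str, optional
        The language model to use (default is "llama3").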

    Returns
    -------
    str
        A single word summarizing the case outcome (e.g., "allowed", "dismissed", "granted").
        Returns "unknown" if no outcome is identified.
    """
    lines = text.splitlines()
    relevant_text = "\n".join(lines[start:end])
    prompt = f"""
You are a legal assistant. Your task is to determine the outcome of the court case based on the provided excerpt.

Instructions:
- Read the text carefully and identify the final decision or outcome.
- Return only one lowercase word that best summarizes the outcome (e.g., "allowed", "dismissed", etc).
- If the outcome is unclear or not mentioned, return "unknown".

Court Text:
{relevant_text}

Output:
"""

    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    return result.stdout.decode().strip().lower()

def extract_case_outcomes_from_dataframe(df, start, end, text_column="unofficial_text", model="llama3"):
    """
    Extract the case outcome (single word) from a DataFrame containing court texts.

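    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing court text data.
    start : int
        Start index of the line slice passed to `extract_case_outcome`.
    end : int
        End index of the line slice (exclusive) passed to `extract_case_outcome`.
    text_column : str, optional
        The column name in `df` that contains the court text (default is "unofficial_text").
    model : str, optional
        The language model to use (default is "llama3").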

    Returns
    -------
    pandas.DataFrame
        A new DataFrame with an additional 'outcome' column containing the extracted outcome word.
    """
    df = df.copy()
    df["outcome"] = None

    for idx, row in df.iterrows():
        full_text = row[text_column]
        try:
            outcome = extract_case_outcome(full_text, start, end, model=model)
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            outcome = "unknown"

        df.at[idx, "outcome"] = outcome
        print(f"Row {idx} processed with outcome: {outcome}")
    
    return df

FC_outcome = extract_case_outcomes_from_dataframe(FC_city, -50, -20)
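# errors="ignore" skips columns that are already gone (e.g. 'Unnamed: 0' was dropped in Step 9).
FC_final = FC_outcome.drop(['Unnamed: 0', 'Unnamed: 0.1', 'citation2', 'name', 'scraped_timestamp', 'unofficial_text', 'other'], axis=1, errors="ignore")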
FC_final.to_excel("../data/processed/court_cases_verification.xlsx")

FC_final.head()

# Reproducibility Issues When Using Large Language Models (LLMs)
The use of large language models, such as LLaMA 3, introduces several challenges for reproducibility in computational workflows, especially within legal research.

1. Stochastic Nature of LLMs
- LLMs generate outputs based on probabilistic sampling.

- Unless the temperature and random seed are fixed, the same input prompt may yield different outputs on different runs; even with both fixed, identical outputs are not always guaranteed.

- This undermines deterministic reproducibility.

2. Model Version Variability
- LLMs like LLaMA 3 can be updated over time (e.g., new training data, fine-tuning changes).

- A prompt run today may return a different result in future versions, making long-term reproducibility difficult without strict version pinning.

3. Hardware and Deployment Differences
- Running the same model on different systems (e.g., GPU-enabled vs CPU-only, local vs cloud-based) can affect latency, performance, and in some cases, output formatting or completeness.

4. Prompt Sensitivity
- Minor changes in wording, spacing, or line breaks in a prompt can lead to substantially different model outputs.

- This sensitivity makes prompts fragile and complicates reproducibility.
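
Two of these issues can be partly mitigated in code: fixing the sampling options and recording a fingerprint of each prompt. The sketch below is an assumption-laden alternative to the `ollama run` subprocess calls used in this notebook: it assumes Ollama's local REST endpoint at its default address, and the helper names (`build_payload`, `prompt_fingerprint`, `generate`) and the seed value are hypothetical.

```python
import hashlib
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3", seed=42):
    """Request body with temperature 0 and a fixed seed to reduce run-to-run variation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "seed": seed},
    }


def prompt_fingerprint(prompt):
    """SHA-256 of the exact prompt text; changes if even one space changes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def generate(prompt, model="llama3", seed=42, url=OLLAMA_URL):
    """Send the prompt to the local server and return the generated text."""
    data = json.dumps(build_payload(prompt, model, seed)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


print(build_payload("Summarize the case.")["options"])
print(prompt_fingerprint("Summarize the case.") == prompt_fingerprint("Summarize the case. "))
```

Storing the fingerprint, model name, and seed alongside each output makes it possible to tell later whether two runs used the same prompt and settings. It does not guarantee identical outputs across model versions or hardware.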