# AI DDE Hackathon

Update the cell below with your Hugging Face token (see: https://huggingface.co/docs/hub/en/security-tokens) and ensure you have permission to use the Llama 3.1 model (https://huggingface.co/meta-llama/Llama-3.1-8B). The cell reads the token from the Colab secret named `HUGGING_FACE_HUB_TOKEN`.

In [1]:
# Install required packages
!pip install markitdown langchain chromadb gradio spacy
!pip install flash-attn git+https://github.com/huggingface/transformers.git triton
!pip install pdfplumber pymupdf

# Import basic libraries
import logging
import os

import torch
from google.colab import userdata
from langchain.text_splitter import RecursiveCharacterTextSplitter
from markitdown import MarkItDown
from transformers import AutoTokenizer, AutoModel

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Read the Hugging Face token from the Colab secrets store
HF_TOKEN = userdata.get('HUGGING_FACE_HUB_TOKEN')






In [2]:
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
# Create documents folder
DOCUMENTS_PATH = '/content/drive/MyDrive/legal_documents'
if not os.path.exists(DOCUMENTS_PATH):
    os.makedirs(DOCUMENTS_PATH)



Mounted at /content/drive


Document Processor Class

This cell defines our core document processing system:

In [None]:
import pdfplumber
import re
import fitz  # PyMuPDF

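class PDFProccessor:
    """Extracts text from a PDF and strips running headers and page numbers."""

    def __init__(self, pdf_path):
        self.pdf_path = pdf_path

    def process_pdf(self):
        """Extract the PDF text, clean it, and print the result."""
        text = self.open_pdf()
        print(self.remove_page_numbers(text))

    def open_pdf(self):
        """Return the text of all pages, separated by newlines."""
        text = ""
        with pdfplumber.open(self.pdf_path) as pdf:
            for page in pdf.pages:
                # extract_text() may return None for pages without extractable text
                text += (page.extract_text() or "") + "\n"
        return text

    def remove_page_numbers(self, text):
        """Remove the "THE EXTRAPOLATIONS DOCUMENT <n>" running header and bare page numbers."""
        # \s already matches newlines, so \s* also covers a header split across lines
        extrapolations_pattern = r"^\s*THE\s+EXTRAPOLATIONS\s+DOCUMENT\s*\d+\s*$"
        dangling_page_number = r"^\s*\d+\s*$"
        clean_text = re.sub(extrapolations_pattern, "", text, flags=re.MULTILINE)
        return re.sub(dangling_page_number, "", clean_text, flags=re.MULTILINE)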


I think stripping out the running headers and page numbers might help reduce the number of tokens for RAG.
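
As a rough illustration of the clean-up step, the sketch below runs two line-anchored patterns over a made-up string. The sample text and the helper name `strip_page_furniture` are assumptions for illustration only; the real PDFs may need different patterns.

```python
import re

# Hypothetical sample text; the running header and page numbers mimic the PDFs.
sample = (
    "First paragraph of a page.\n"
    "THE EXTRAPOLATIONS DOCUMENT 12\n"
    "Second paragraph.\n"
    "  13  \n"
    "Rule 89(C) was applied in 2004.\n"
)

header_pattern = r"^\s*THE\s+EXTRAPOLATIONS\s+DOCUMENT\s*\d+\s*$"
page_number_pattern = r"^\s*\d+\s*$"

def strip_page_furniture(text):
    """Remove running headers and lines that contain only a page number."""
    text = re.sub(header_pattern, "", text, flags=re.MULTILINE)
    text = re.sub(page_number_pattern, "", text, flags=re.MULTILINE)
    # Collapse the blank lines left behind by the removals.
    # Note: this also merges paragraphs that were separated by blank lines.
    return re.sub(r"\n{2,}", "\n", text).strip()

cleaned = strip_page_furniture(sample)
print(cleaned)
```

Because both patterns are anchored with `^` and `$` under `re.MULTILINE`, digits inside a sentence (such as "2004" in the last line) are left alone; only lines that consist solely of the header or a number are removed.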

In [14]:
pdf_path = "/content/drive/MyDrive/legal_documents/Leiden Guidelines on the Use of DDE in ICCTs_20220404.pdf"

def extract_sections_with_headers(pdf_path):
    pages = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for pages without extractable text
            pages.append(page.extract_text() or "")
    return pages

pages = extract_sections_with_headers(pdf_path)
for page in pages:
    print(f"Text: {page[:200]}...")  # Display first 200 chars of text
    print("---")

Text: Leiden Guidelines
on the Use of Digitally Derived Evidence
in International Criminal Courts and Tribunals
Sofia Aalto-Setälä Supervisor
Luca Caroli Sabrina K. Rewald, J.D., LL.M.
Julia Freytag
Maria F...
---
Text: Table of Contents
I. INTRODUCTION ......................................................................................................................................... 3
A. DEFINITION OF DDE ........
---
Text: D.2. The probative value of intercepts may be enhanced by tendering the original audio
recordings. ........................................................................................................
---
Text: I. Introduction
Digitally Derived Evidence (DDE) is increasingly used in international criminal courts and
tribunals to prosecute perpetrators of international crimes. Advanced digital tools, includin...
---
Text: a “computer environment,” as well as ‘digitized evidence,’ which is analogue material that has
been transferred to a digital format.2 The 

In [24]:
def filter_pages(pages_array):
    filtered_pages = []
    start_processing = False

    for page_number, text in enumerate(pages_array):
        # Skip the cover page (assumed to be the first page)
        if page_number == 0:
            continue

        # Skip the Table of Contents
        if "Table of Contents" in text:
            continue

        # Start processing after "I. Introduction"
        if not start_processing and "I. Introduction" in text:
            start_processing = True
            # Process content after "I. Introduction" within the page
            text = text.split("I. Introduction", 1)[1]

        if start_processing:
            filtered_pages.append(text.strip())

    return filtered_pages

filtered_pages = filter_pages(pages)

# Output the filtered pages to verify
for i, page in enumerate(filtered_pages):
    print(f"Filtered Page {i + 1}:")
    print(page[:500])  # Print the first 500 characters of the page
    print("---")

Filtered Page 1:
Digitally Derived Evidence (DDE) is increasingly used in international criminal courts and
tribunals to prosecute perpetrators of international crimes. Advanced digital tools, including
aerial photography, mobile devices, video, intercepted communications, amongst others,
capture new and vast quantities of data, which can add supplementary and supporting data to
existing evidence. For example, while an eyewitness account may provide relevant information
regarding an event, a satellite image may 
---
Filtered Page 2:
a “computer environment,” as well as ‘digitized evidence,’ which is analogue material that has
been transferred to a digital format.2 The concept is rooted in the following definitions:
International Bar Association (IBA)
‘Digital and technologically derived evidence, which means evidence taken from and
created by digital devices and via technology, such as cameras, satellites and other
‘remote sensing technologies’ […] We distinguish digital evidence, crea



In [20]:
import re
import pdfplumber

def extract_sections_and_subsections(pdf_path):
    # Regex patterns for sections and subsections
    section_pattern = r"^(?P<header>[A-Z]\.\s.+)"  # Match main sections like "A. Videos"
    subsection_pattern = r"^(?P<subheader>[A-Z]\.\d+\.\s*.+)"  # Match subsections like "D.2. The probative value..."
    extracted_sections = []
    content_to_process = ""
    start_processing = False

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # guard against pages with no extractable text

            if not start_processing:
                # Skip the Table of Contents
                if "Table of Contents" in text:
                    continue

                # Look for the start marker ("I. Introduction" in this case)
                match = re.search(r"\bI\. Introduction\b", text, re.IGNORECASE)
                if match:
                    start_processing = True
                    _, remaining_text = re.split(r"\bI\. Introduction\b", text, flags=re.IGNORECASE, maxsplit=1)
                    content_to_process += remaining_text.strip()
            else:
                # Append remaining text from the document
                content_to_process += "\n" + text.strip()

    # Process text to extract sections and subsections
    lines = content_to_process.splitlines()
    current_section = None
    current_text = []
    for line in lines:
        line = line.strip()

        # Match a main section header
        section_match = re.match(section_pattern, line)
        if section_match:
            # Save the previous section and its text
            if current_section:
                extracted_sections.append({"header": current_section, "text": "\n".join(current_text)})

            # Start a new section
            current_section = section_match.group("header")
            current_text = []
            continue

        # Match a subsection header
        subsection_match = re.match(subsection_pattern, line)
        if subsection_match:
            # Save the current section and start a new subsection
            if current_section:
                extracted_sections.append({"header": current_section, "text": "\n".join(current_text)})
            current_section = subsection_match.group("subheader")
            current_text = []
            continue

        # Accumulate text under the current header
        if current_section:
            current_text.append(line)

    # Append the last section
    if current_section:
        extracted_sections.append({"header": current_section, "text": "\n".join(current_text)})

    return extracted_sections

# Example Usage
sections = extract_sections_and_subsections(pdf_path=pdf_path)

# Display results
for section in sections:
    print(f"Header: {section['header']}")
    print(f"Text: {section['text'][:200]}...")  # Display first 200 chars of text
    print("---")

Header: A. Definition of DDE
Text: The term ‘Digitally Derived Evidence’ was coined by the DDE Project to encompass both
‘digital evidence,’ which is material that has been “born-digital” in the sense of originating from
1 Internationa...
---
Header: B. Methodology
Text: The Guidelines are based on an in-depth analysis of the jurisprudence of the international
criminal courts and tribunals. They draw on the findings set out in the DDE Project's
extensive Case Summarie...
---
Header: C. Structure of the Leiden Guidelines
Text: The Leiden Guidelines address each type of DDE separately in order to take into account their
technological and legal particularities. Each section begins with a definition of the DDE
category. In so ...
---
Header: D. Scope of the Leiden Guidelines
Text: The rules and practice surrounding the use of DDE in international criminal courts and tribunals
continue to develop. Digital technology is being used more widely and frequently, not only in
the inves...
---
Head
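
The Table of Contents output earlier shows two header shapes: lettered sections such as "A. Definition of DDE" and numbered subsections such as "D.2. The probative value of intercepts...". A minimal sketch of telling them apart is below; the regexes and the helper name `classify_line` are assumptions based on those two examples, not a full grammar of the document.

```python
import re

# Assumed header shapes: "A. Videos" (section) and "D.2. ..." (subsection).
SECTION_RE = re.compile(r"^(?P<label>[A-Z])\.\s+(?P<title>\S.*)$")
SUBSECTION_RE = re.compile(r"^(?P<label>[A-Z]\.\d+)\.\s+(?P<title>\S.*)$")

def classify_line(line):
    """Return (kind, label, title) where kind is "subsection", "section" or "text"."""
    line = line.strip()
    # Check the more specific subsection shape first.
    match = SUBSECTION_RE.match(line)
    if match:
        return ("subsection", match.group("label"), match.group("title"))
    match = SECTION_RE.match(line)
    if match:
        return ("section", match.group("label"), match.group("title"))
    return ("text", None, line)

for sample_line in ["A. Videos", "D.2. The probative value of intercepts", "J. Smith testified."]:
    print(classify_line(sample_line))
```

The last sample is classified as a section even though it is ordinary body text, which shows the main weakness of purely pattern-based header detection: any line that happens to start with a capital letter and a full stop is treated as a header.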

In [12]:
import re
import pdfplumber

def extract_headers_and_chunks(chunks):
    headers_with_chunks = []

    for chunk in chunks:
        # Match header at the start of the chunk (e.g., C. Aerial and Satellite Images)
        match = re.match(r"^[A-Z](?:\.\d+)?\.\s.+", chunk)
        if match:
            header = match.group(0).strip()  # Extract the header
            content = chunk[len(header):].strip()  # Remaining content after the header
            headers_with_chunks.append({"header": header, "content": content})
        else:
            headers_with_chunks.append({"header": "Unknown Header", "content": chunk.strip()})

    return headers_with_chunks

for section in sections:
    # extract_headers_and_chunks expects a list of chunks, each starting with its
    # header line, so rebuild the chunk from the header and text stored per section.
    # (Passing the text string directly would iterate over single characters.)
    chunk = f"{section['header']}\n{section['text']}"
    for subsection in extract_headers_and_chunks([chunk]):
        print(f"Header: {subsection['header']}")
        print(f"Text: {subsection['content'][:200]}...")  # Display first 200 chars of text
        print("---")

In [None]:
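# NOTE: `chunks` is not defined anywhere in this notebook as shown; it is assumed to be
# a list of text chunks, each starting with its section header.
headers_with_chunks = extract_headers_and_chunks(chunks)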

for item in headers_with_chunks:
    print(f"Section Header: {item['header']}\n")


Section Header: Unknown Header

Section Header: A. Definition of DDE

Section Header: B. Methodology

Section Header: C. Structure of the Leiden Guidelines

Section Header: D. Scope of the Leiden Guidelines

Section Header: A. Videos

Section Header: B. Photographs

Section Header: C. Aerial and Satellite Images

Section Header: D. Intercepts

Section Header: E. Call Data Records

Section Header: F. Audio Recordings



In [None]:
import chromadb

# vectorisation could be improved

class DocumentProcessor:
    def __init__(self):
        """Initialize the document processor with necessary components."""
        # Set up embedding model
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.model.eval()
        # Initialize document converter
        self.md = MarkItDown()
        # Set up vector database
        self.vector_db = chromadb.Client()
        self.collection = self.vector_db.get_or_create_collection(name="legal_docs")

    def process_document(self, file_path):
        """Convert document to text and generate embeddings."""
        try:
            pdf_processor = PDFProccessor(file_path)
            # Convert document to text (convert once and reuse the result)
            conversion_result = self.md.convert(file_path)
            conversion_result_text = conversion_result.text_content
            # TODO - the Leiden Guidelines contain a list of keywords for each section - these should be parsed out and each section should be stored separately
            conversion_result_text = pdf_processor.remove_page_numbers(conversion_result_text)
            print(conversion_result)

            # Create embeddings
            # NOTE: truncation=True keeps only the tokens that fit the model's input
            # limit, so the embedding reflects just the start of a long document.
            inputs = self.tokenizer(
                conversion_result_text,
                return_tensors="pt",
                truncation=True
            )
            # Use GPU if available
            if torch.cuda.is_available():
                self.model.to('cuda')
                inputs = {k: v.to('cuda') for k, v in inputs.items()}
            # Generate embeddings (mean of the token vectors)
            with torch.no_grad():
                outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy().tolist()
            # Note: we don't seem to get metadata from the docs anyway, so it is added manually later
            return {
                'text': conversion_result_text,
                'embeddings': embeddings,
                'metadata': {}
            }
        except Exception as e:
            logger.error(f"Error processing document {file_path}: {str(e)}")
            raise

    def store_document(self, doc_id, text, embedding, metadata=None):
        """Store document in the vector database."""
        if metadata is None:
            metadata = {}
        self.collection.add(
            documents=[text],
            embeddings=[embedding],
            metadatas=[metadata],
            ids=[doc_id]
        )

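    def find_relevant_documents(self, query, metadata_titles, metadata_keywords, n_results=3):
        """Find relevant documents for a given query."""
        # Accept a single keyword string as well as a list of keywords
        if isinstance(metadata_keywords, str):
            metadata_keywords = [metadata_keywords]

        conditions = []
        if metadata_titles:
            conditions.append({"title": {"$in": metadata_titles}})
        if metadata_keywords:
            conditions.append({"keywords": {"$in": metadata_keywords}})

        # Chroma expects a single top-level operator, so combine several conditions with $and
        if not conditions:
            where_filter = None
        elif len(conditions) == 1:
            where_filter = conditions[0]
        else:
            where_filter = {"$and": conditions}

        # NOTE: query_texts is embedded with the collection's default embedding function,
        # whereas the stored vectors come from self.model in process_document.
        results = self.collection.query(
            query_texts=[query],
            where=where_filter,
            n_results=n_results
        )
        return [
            {
                'text': doc_text,
                'id': results['ids'][0][i],
                'metadata': results['metadatas'][0][i]
            }
            for i, doc_text in enumerate(results['documents'][0])
        ]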
# Initialize processor
processor = DocumentProcessor()
# Move to GPU if available
if torch.cuda.is_available():
    processor.model = processor.model.to('cuda')
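
Because `process_document` tokenizes with `truncation=True`, only the beginning of each document contributes to its embedding. One way to improve the vectorisation is to split the text into overlapping chunks and embed each chunk separately. The sketch below is a minimal word-window chunker; words are used as a stand-in for tokens, and the `chunk_size` and `overlap` values are assumptions that would need tuning against the embedding model's real token limit.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Hypothetical usage with a synthetic 450-word document.
demo_text = " ".join(f"w{i}" for i in range(450))
demo_chunks = chunk_words(demo_text)
print(len(demo_chunks), [len(c.split()) for c in demo_chunks])
```

Each chunk could then be embedded and stored under its own id (for example `f"{doc_id}_chunk_{i}"`, a hypothetical naming scheme), so that retrieval returns the relevant passage rather than a whole document.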

Set Up LLaMA Model

Go to Hugging Face and search for the Llama model you want to use, for example Llama 3.1. Request permission to use it and create a Hugging Face access token.

This cell initializes the LLaMA model for generating responses:

In [None]:
# fine tune llama to make this better
# experiment with prompts
# open ai option
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    token=HF_TOKEN
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    token=HF_TOKEN
)

Process Documents

This cell processes all documents in your legal_documents folder:

In [None]:

metadata_list = {
    'doc_0_Leiden Guidelines on the Use of DDE in ICCTs_20220404.pdf' : {
      'title': 'Leiden Guidelines on the Use of Digitally Derived Evidence in International Criminal Courts and Tribunals',
      'author': 'Unknown',
      'section-header': '',
      'category': '',
      'keywords': '',
    },
    'doc_1_Extrapolations from Case Law on DDE in ICCTs_20220405.pdf': {
      'title': 'Extrapolations from case law on the use of digitally derived evidence (dde) before international criminal courts and tribunals',
      'author': 'Unknown',
      'section-header': '',
      'category': '',
      'keywords': '',
      },
    'doc_2_Case Summaries-The Use of DDE before ICCTs.pdf': {
      'title': 'Analysis of Digitally Derived Evidence from the Jurisprudence of International Tribunals: Cases from the ICC, ICTR, ICTY, IRMCT, SCSL and STL',
      'author': 'Unknown',
      'section-header': '',
      'category': '',
      'keywords': '',
      },
    'doc_3_Fact-Finding-Missions.pdf': {
      'title': 'REPORT ON DIGITALLY DERIVED EVIDENCE USED IN UN HUMAN RIGHTS FACT-FINDING MISSIONS APPROACHES AND STANDARDS OF PROOF',
      'author': 'Unknown',
      'section-header': '',
      'category': '',
      'keywords': '',

      },
    'doc_4_DDE in ICL.pdf': {
      'title': 'REPORT ON DIGITALLY DERIVED EVIDENCE IN INTERNATIONAL CRIMINAL LAW',
      'author': 'Unknown',
      'section-header': '',
      'category': '',
      'keywords': '',
      },
}
# TODO 'section-header': '' - each PDF needs to be chunked and the section header of each chunk needs to be added to the metadata
print(metadata_list['doc_0_Leiden Guidelines on the Use of DDE in ICCTs_20220404.pdf'])

{'title': 'Leiden Guidelines on the Use of Digitally Derived Evidence in International Criminal Courts and Tribunals', 'author': 'Unknown', 'section-header': '', 'category': '', 'keywords': ''}
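
The TODO above could be approached by creating one record per extracted section and copying the document-level metadata into each record. The sketch below assumes sections shaped like the `{"header": ..., "text": ...}` dictionaries produced earlier in the notebook; the function name `build_chunk_records`, the id scheme, and the sample data are hypothetical.

```python
def build_chunk_records(doc_id, base_metadata, sections):
    """Create one record (id, text, metadata) per extracted section."""
    records = []
    for index, section in enumerate(sections):
        metadata = dict(base_metadata)  # copy so the shared dict is not modified
        metadata["section-header"] = section["header"]
        records.append({
            "id": f"{doc_id}_section_{index}",  # hypothetical id scheme
            "text": section["text"],
            "metadata": metadata,
        })
    return records

# Hypothetical sample data mirroring the structures used in this notebook.
base_metadata = {"title": "Leiden Guidelines", "author": "Unknown", "section-header": ""}
sample_sections = [
    {"header": "A. Definition of DDE", "text": "Sample text for section A."},
    {"header": "B. Methodology", "text": "Sample text for section B."},
]
records = build_chunk_records("doc_0", base_metadata, sample_sections)
for record in records:
    print(record["id"], "->", record["metadata"]["section-header"])
```

Each record maps directly onto the `doc_id`, `text` and `metadata` arguments of `store_document`; an embedding would still need to be computed per record.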


In [None]:
# Get list of documents
# improve the interface
# disclaimers - agent answers - next agent makes it accessible - next agent is a lawyer that critiques answer

document_files = [
    f for f in os.listdir(DOCUMENTS_PATH)
    if f.endswith(('.pdf', '.docx', '.txt', '.html', '.pptx'))
]
if not document_files:
    print("⚠️ No documents found! Add some to your legal_documents folder")
else:
    print(f"Found {len(document_files)} documents to process")
    for idx, document in enumerate(document_files):
        print(f"Processing {document}...")
        file_path = os.path.join(DOCUMENTS_PATH, document)
        # Process document
        result = processor.process_document(file_path)
        # Store in database
        doc_id = f"doc_{idx}_{document}"
        metadata = metadata_list[doc_id]
        processor.store_document(
            doc_id=doc_id,
            text=result['text'],
            embedding=result['embeddings'],
            metadata=metadata
        )
        print(f"✅ Finished storing {document} in Chroma\n")
        # except Exception as e:
        #     print(f"Error processing {document}: {str(e)}")

Found 5 documents to process
Processing Leiden Guidelines on the Use of DDE in ICCTs_20220404.pdf...
<markitdown._markitdown.DocumentConverterResult object at 0x78084c990790>
✅ Finished storing Leiden Guidelines on the Use of DDE in ICCTs_20220404.pdf in Chroma

Processing Extrapolations from Case Law on DDE in ICCTs_20220405.pdf...
<markitdown._markitdown.DocumentConverterResult object at 0x78084fc52310>
✅ Finished storing Extrapolations from Case Law on DDE in ICCTs_20220405.pdf in Chroma

Processing Case Summaries-The Use of DDE before ICCTs.pdf...
<markitdown._markitdown.DocumentConverterResult object at 0x78084ffd6690>
✅ Finished storing Case Summaries-The Use of DDE before ICCTs.pdf in Chroma

Processing Fact-Finding-Missions.pdf...
<markitdown._markitdown.DocumentConverterResult object at 0x78084fcabb90>
✅ Finished storing Fact-Finding-Missions.pdf in Chroma

Processing DDE in ICL.pdf...
<markitdown._markitdown.DocumentConverterResult object at 0x78084fe14390>
✅ Finished storing

Question-Answering Function

This cell defines the function that generates answers using LLaMA. You may alter the values if you know what you’re doing :)

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Sample function for extracting keywords
def extract_keywords(text):
    doc = nlp(text)
    keywords = [token.text for token in doc if token.is_alpha and not token.is_stop]
    return keywords


In [None]:
!pip install fuzzywuzzy



In [None]:
# --- Document Excerpts ---
# {truncated_context}
# Helpers for narrowing retrieval by document title and keyword
from fuzzywuzzy import fuzz

titles_list = [doc_metadata['title'] for doc_metadata in metadata_list.values()]

def find_full_title(query, titles_list, threshold=60):
    """Return the titles whose fuzzy partial match with the query meets the threshold."""
    matches = [title for title in titles_list if fuzz.partial_ratio(query.lower(), title.lower()) >= threshold]
    return matches

keywords_list = [doc_metadata['keywords'] for doc_metadata in metadata_list.values()]

def find_keywords(search_terms, keywords):
    """Return the first search term that appears in the keyword list, or None."""
    for search_term in search_terms:
        if search_term.lower() in keywords:
            return search_term.lower()
    return None  # No matching keyword found

# TODO - if the chat bot doesn't recognise any keywords it should prompt back and say something like:
#  'Ask me about digital evidence related to photographic, video or etc. evidence.'
def ask_question_llama(question, leiden_guide_lines, case_law):
    """Generate an answer to a legal question using LLaMA."""
    # Always search the Leiden Guidelines (leiden_guide_lines is currently unused)
    metadata_titles = [titles_list[0]]
    if case_law:
        # Extrapolations from case law
        metadata_titles.append(titles_list[1])
    # TODO - add toggles for the remaining documents:
    #   Cases from the ICC, ICTR, ICTY, IRMCT, SCSL and STL
    #   UN human rights fact-finding
    #   International criminal law
    extracted_keywords = extract_keywords(question)
    metadata_keywords = find_keywords(extracted_keywords, keywords_list)
    # metadata_titles = find_full_title(question, titles_list)

    # Get relevant documents
    relevant_docs = processor.find_relevant_documents(query=question, metadata_keywords=metadata_keywords, metadata_titles=metadata_titles, n_results=5)

    # Prepare context: a 1,000-character slice from each retrieved document
    context_pieces = [doc['text'][4000:5000] for doc in relevant_docs]
    # Rough cap so the excerpts, the question and the "Answer:" marker fit within max_length below
    truncated_context = "\n".join(context_pieces)[:3000]

    # Create prompt (the excerpts must be included, otherwise the model has nothing to ground its answer in)
    full_prompt = f"""You are a Human Rights Lawyer using the documents below to answer the following question.
--- Document Excerpts ---
{truncated_context}

--- Question ---
{question}

Based on the documents above, provide a clear, concise answer. If relevant, refer to legal precedent, case law, or any specific details from the documents. Do not simply restate the question; make sure the answer is grounded in the provided content.


Answer:
"""

    # Prepare for generation
    if torch.cuda.is_available():
        model.to('cuda')
    # TODO - increase the token limit to allow for more text
    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        max_length=1024,
        truncation=True
    )
    if torch.cuda.is_available():
        inputs = {k: v.to('cuda') for k, v in inputs.items()}
    # TODO - increase the token limit to allow for more text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=800,
            temperature=0.7,
            do_sample=True
        )
    # Process output: keep only the text generated after the "Answer:" marker
    raw_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "Answer:" in raw_output:
        final_answer = raw_output.split("Answer:", 1)[1].strip()
    else:
        final_answer = raw_output

    # List the documents that were searched, one bullet per title
    docs = "\n - ".join(metadata_titles)
    final_answer = final_answer + f'\n\n DOCUMENTS:\n - {docs}' + '\n\n NOTE: this is only guidance based on past case law.'
    return final_answer

answer = ask_question_llama("Are you using the Leiden Guidelines and case law ?", True, True)
print(answer)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:00<00:00, 85.7MiB/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The documents provided do not explicitly state whether the lawyer is using the Leiden Guidelines and case law. However, based on the information provided, it can be inferred that the lawyer is likely familiar with these resources. The first document, titled "Leiden Principles," outlines a set of guidelines for lawyers working on human rights cases. It emphasizes the importance of respecting cultural differences and ensuring that legal representation is accessible to all individuals, regardless of their socioeconomic status or geographic location. This aligns closely with the principles outlined in the Leiden Guidelines, which focus on ensuring that legal representation is fair and equitable for all parties involved. Additionally, the second document, which is a court ruling, references the use of case law as a source of legal precedent. This suggests that the lawyer may be familiar with the concept of relying on previous rulings to inform their own arguments and interpretations of the 
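
An answer like the one above can only be as grounded as the excerpts that actually reach the model, so the excerpt block has to fit inside the tokenizer's `max_length` together with the question and the trailing `Answer:` marker. The sketch below shows one possible way to assemble such a prompt under a character budget. The helper name `build_prompt`, the `(title, text)` input shape, and the 3,000-character budget are assumptions, and characters are only a rough proxy for tokens.

```python
def build_prompt(question, excerpts, max_context_chars=3000):
    """Assemble a prompt whose excerpt text never exceeds max_context_chars."""
    parts = []
    remaining = max_context_chars
    for title, text in excerpts:
        if remaining <= 0:
            break
        snippet = text[:remaining]
        parts.append(f"[{title}]\n{snippet}")  # label each excerpt with its source
        remaining -= len(snippet)
    context = "\n\n".join(parts)
    return (
        "You are a Human Rights Lawyer using the documents below to answer the following question.\n"
        f"--- Document Excerpts ---\n{context}\n"
        f"--- Question ---\n{question}\n\n"
        "Answer:\n"
    )

# Hypothetical excerpts; real ones would come from find_relevant_documents.
demo_excerpts = [("Doc A", "a" * 2500), ("Doc B", "b" * 2500)]
demo_prompt = build_prompt("How are intercepts authenticated?", demo_excerpts)
print(len(demo_prompt))
```

Labelling each excerpt with its title gives the model something concrete to cite, and trimming the excerpts rather than the whole prompt keeps the `Answer:` marker from being cut off by tokenizer truncation.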

Create User Interface

Finally, we can also create the Gradio interface:

In [None]:
import gradio as gr

def create_interface():
    demo = gr.Interface(
        fn=ask_question_llama,
        inputs=[
            gr.Textbox(
                label="Your Legal Question",
                placeholder="Ask any question about your legal documents...",
                lines=3
            ),
            gr.Checkbox(label="Open Source Investigator", value=True),
            gr.Checkbox(label="Lawyer", value=False),
            ],
        outputs=[
            gr.Markdown(
                label="Answer",
            )
        ],

        title="Legal Document Assistant",
        description="This AI assistant can answer questions about your legal documents."
    )
    return demo
# Launch interface
demo = create_interface()
demo.launch(share=True)



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c815b5b4678a3241ef.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)





---

Everything below here is from the original Medium article.

---

Option A: Basic Processing (Free, No API Key Needed)

If you want to use the basic document processor (good for text-based documents), create a new cell and run this code:

In [None]:
# # Set up basic document processing
# processor = DocumentProcessor()
# # Add basic document conversion
# processor.md = MarkItDown()
# logger.info("✅ Basic document processor ready!")

Option B: Enhanced Processing (Requires OpenAI API Key)

If you want enhanced processing with better image handling and understanding, create a new cell and run this code instead:

In [None]:
# # First, set your OpenAI key
# OPENAI_API_KEY = "your-openai-key-here"  # Replace with your actual key

# # Set up enhanced document processing
# from openai import OpenAI
# processor = DocumentProcessor()
# client = OpenAI(api_key=OPENAI_API_KEY)
# processor.md = MarkItDown(llm_client=client, llm_model="gpt-4")
# logger.info("✅ Enhanced document processor ready!")

Part 3: Choose Your Storage System

Next, decide how you want to store your processed documents. You have two options:

Option A: Local Storage (Free, Good for Small Projects)

If you want to use local storage, create a new cell and run this code:

In [None]:
# Set up local storage with Chroma
import chromadb
processor.vector_db = chromadb.Client()
logger.info("✅ Local storage ready!")

Option B: Cloud Storage (Requires Qdrant Account)

If you want cloud storage, you need to create an account with Qdrant. Follow their instructions. Then, create a new cell and run this code instead:

In [None]:
# First, set your Qdrant credentials
QDRANT_URL = "your-qdrant-url-here"      # Replace with your URL
QDRANT_API_KEY = "your-qdrant-key-here"  # Replace with your key
# Set up cloud storage with Qdrant
from qdrant_client import QdrantClient
processor.vector_db = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
logger.info("✅ Cloud storage ready!")

Part 4: Connect to Your Documents

Now you’ll connect to your documents. Everyone should use Google Drive for this part:

In [None]:
from google.colab import drive
# Mount Google Drive
logger.info("Connecting to Google Drive...")
drive.mount('/content/drive')
# Set up documents folder
DOCUMENTS_PATH = '/content/drive/MyDrive/legal_documents'
if not os.path.exists(DOCUMENTS_PATH):
    os.makedirs(DOCUMENTS_PATH)
    logger.info("Created 'legal_documents' folder in your Google Drive")
    logger.info("⚠️ Please add your documents to this folder before continuing")
else:
    logger.info("✅ Found 'legal_documents' folder")


Mounted at /content/drive


Part 5: Choose Your AI Model

Finally, decide which AI model you want to use for answering questions. You have two options:

OpenAI

In [None]:
# First, ensure you have your OpenAI key
OPENAI_API_KEY = "your-openai-key-here"  # Replace with your actual key
# Set up GPT-4
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
model = ChatOpenAI(
    model="gpt-4",
    openai_api_key=OPENAI_API_KEY,
    temperature=0.7
)
logger.info("✅ GPT-4 model ready!")

Llama

In [None]:
# First, set your HuggingFace token
HF_TOKEN = "your-huggingface-token-here"  # Replace with your actual token
# Set up Llama
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    token=HF_TOKEN
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    token=HF_TOKEN
)
logger.info("✅ Llama model ready!")

Part 6: Process Your Documents

Now that everything is set up, create a new cell to process your documents:

In [None]:
# Get list of documents
document_files = [f for f in os.listdir(DOCUMENTS_PATH)
                 if f.endswith(('.pdf', '.docx', '.txt', '.html', '.pptx'))]

if not document_files:
    logger.info("⚠️ No documents found! Add some to your legal_documents folder")
else:
    logger.info(f"Found {len(document_files)} documents to process")
    # Process each document
    for document in document_files:
        try:
            logger.info(f"Processing {document}...")
            file_path = os.path.join(DOCUMENTS_PATH, document)
            result = processor.process_document(file_path)
            logger.info(f"✅ Processed {document}")
        except Exception as e:
            logger.error(f"Error processing {document}: {str(e)}")
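Note that the `endswith` filter above is case-sensitive, so a file named `CONTRACT.PDF` would be skipped. A small sketch of a case-insensitive variant using `pathlib`; the helper name `list_documents` is hypothetical, and the extension set mirrors the one in the cell above:

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".pptx"}

def list_documents(folder):
    """Return supported files in `folder`, sorted by path, ignoring extension case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

# In the notebook you might use it like this:
#   for path in list_documents(DOCUMENTS_PATH):
#       processor.process_document(str(path))
```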

Part 7: Final Testing

Create one last cell to test your assistant. The code depends on which AI model you chose:
If You Chose GPT-4:

In [None]:
def ask_question(question):
    prompt = f"Based on the legal documents, please answer: {question}"
    return model.predict(prompt)

# Test the system
test_question = "What are the main terms of the agreement?"
answer = ask_question(test_question)
print(f"Question: {test_question}")
print(f"Answer: {answer}")

If You Chose Llama:

In [None]:
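def ask_question(question):
    prompt = f"Question about legal documents: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(model.device)
    # do_sample=True is needed for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
    # Decode only the newly generated tokens so the prompt is not repeated in the answer
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()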

# Test the system
test_question = "What are the main terms of the agreement?"
answer = ask_question(test_question)
print(f"Question: {test_question}")
print(f"Answer: {answer}")
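Neither test function above passes any document text to the model, so the answers come from the model's general knowledge, not from your files. A minimal sketch of assembling a grounded prompt from retrieved text chunks, under the assumption that retrieval returns plain strings. `build_prompt` and its character budget are hypothetical and are not part of the notebook's `processor`:

```python
def build_prompt(question, chunks, max_context_chars=4000):
    """Join retrieved chunks (up to a character budget) and append the question."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before exceeding the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("Who are the parties?", ["Clause 1: ...", "Clause 2: ..."]))
```

The resulting string can be passed to either model in place of the bare question.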

Part 8: Creating Your User Interface

Now that we have our assistant working, let’s create a user-friendly interface. This will make it easy for anyone to ask questions about your legal documents. We’ll use Gradio, a tool that helps create web interfaces for AI applications.

In [None]:
# Install Gradio for our interface
!pip install gradio
print("✅ Gradio installation complete!")

Collecting gradio
  Downloading gradio-5.12.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.5.4 (from gradio)
  Downloading gradio_client-1.5.4-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.2.2 (from gradio)
  Downloading ruff-0.9.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 (from gradio)
  Downloading safehttpx-0.1.6-py3-none-any.whl.metadata (4.2 kB)
Collecting semantic-version~=2.0 (from gradio)
  Downloading semantic_version-2.1


✅ Gradio installation complete!


Step 2: Choose Your Interface Setup

Now you’ll need to set up the interface based on which AI model you chose earlier. Pick the option that matches your previous choice:
Option A: Interface with GPT-4

If you’re using GPT-4, create a new cell and run this code:

In [None]:
import gradio as gr
from typing import List, Dict
import logging
def create_gpt4_interface(model, processor):
    """Creates an interface for GPT-4 based assistant"""
    def get_response(question: str) -> str:
        """Process a question and get GPT-4's response"""
        try:
            # Find relevant document sections
            relevant_docs = processor.find_relevant_documents(question)
            context = "\n\n".join(doc.text for doc in relevant_docs)
            # Create our prompt
            prompt = f"""As a legal expert, please answer this question based on
            the provided documents:
            Documents: {context}
            Question: {question}
            """
            # Get GPT-4's response
            response = model.predict(prompt)
            return response
        except Exception as e:
            logging.error(f"Error processing question: {str(e)}")
            return f"I encountered an error: {str(e)}"
    # Create the interface
    demo = gr.Interface(
        fn=get_response,
        inputs=[
            gr.Textbox(
                label="Your Legal Question",
                placeholder="Ask any question about your legal documents...",
                lines=3
            )
        ],
        outputs=[
            gr.Textbox(
                label="Answer",
                lines=10
            )
        ],
        title="Legal Document Assistant (GPT-4)",
        description="""This AI assistant can answer questions about your legal documents.
        It uses GPT-4 to provide accurate, contextual responses based on your documents."""
    )
    return demo
# Create and launch the interface
demo = create_gpt4_interface(model, processor)
demo.launch(share=True)  # share=True creates a public link you can share
print("✅ Interface is ready! Click the link above to start using your assistant.")




Option B: Interface with Llama

If you’re using Llama, create a new cell and run this code instead:

In [None]:
import gradio as gr
from typing import List, Dict
import logging

def create_llama_interface(model, tokenizer, processor):
    """Creates an interface for Llama-based assistant"""
    def get_response(question: str) -> str:
        """Process a question and get Llama's response"""
        try:
            # Find relevant document sections
            relevant_docs = processor.find_relevant_documents(question)
            context = "\n\n".join(doc.text for doc in relevant_docs)
            # Create our prompt
            prompt = f"""Please answer this legal question based on the provided
            documents. Be specific and cite relevant sections.
            Documents: {context}
            Question: {question}
            Answer:"""
            # Prepare for Llama
            inputs = tokenizer(
                prompt,
                return_tensors="pt",
                max_length=2048,
                truncation=True
            )
            # Generate response
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7,
                do_sample=True
            )
            # Decode response
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            return response.split("Answer:")[-1].strip()
        except Exception as e:
            logging.error(f"Error processing question: {str(e)}")
            return f"I encountered an error: {str(e)}"
    # Create the interface
    demo = gr.Interface(
        fn=get_response,
        inputs=[
            gr.Textbox(
                label="Your Legal Question",
                placeholder="Ask any question about your legal documents...",
                lines=3
            )
        ],
        outputs=[
            gr.Textbox(
                label="Answer",
                lines=10
            )
        ],
        title="Legal Document Assistant (Llama)",
        description="""This AI assistant can answer questions about your legal documents.
        It uses Llama to provide detailed responses based on your documents."""
    )
    return demo
# Create and launch the interface
demo = create_llama_interface(model, tokenizer, processor)
demo.launch(share=True)  # share=True creates a public link you can share
print("✅ Interface is ready! Click the link above to start using your assistant.")

Step 3: Enhanced Interface (Optional)

If you want a more sophisticated interface with additional features, create a new cell and run this code (works with either model):

In [None]:
import gradio as gr
from typing import Tuple
import logging

def create_advanced_interface(model, processor, model_type="gpt-4", tokenizer=None):
    """Creates an enhanced interface with additional features"""
    def process_query(
        question: str,
        show_sources: bool,
        response_length: str
    ) -> Tuple[str, str]:
        """Process a question with additional options"""
        try:
            # Find relevant documents
            relevant_docs = processor.find_relevant_documents(question)
            context = "\n\n".join(doc.text for doc in relevant_docs)
            # Adjust response length
            max_tokens = {
                "Brief": 100,
                "Detailed": 300,
                "Comprehensive": 500
            }[response_length]
            # Build a prompt that contains both the documents and the question
            prompt = (
                f"Please provide a {response_length.lower()} answer based on the "
                f"provided documents.\nDocuments: {context}\n"
                f"Question: {question}\nAnswer:"
            )
            # Get response based on model type
            if model_type == "gpt-4":
                response = model.predict(prompt)
            else:  # Llama (requires the tokenizer argument)
                inputs = tokenizer(
                    prompt, return_tensors="pt", max_length=2048, truncation=True
                ).to(model.device)
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_tokens,
                    temperature=0.7,
                    do_sample=True  # needed for temperature to take effect
                )
                # Decode only the newly generated tokens, not the prompt
                new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
                response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
            # Add sources if requested
            sources = ""
            if show_sources:
                sources = "\n\n".join(
                    f"From {doc.source}:\n{doc.text[:200]}..."
                    for doc in relevant_docs
                )
            # Return one value per output component
            return response, sources
        except Exception as e:
            logging.error(f"Error: {str(e)}")
            return f"Error: {str(e)}", ""
    # Create enhanced interface
    demo = gr.Interface(
        fn=process_query,
        inputs=[
            gr.Textbox(
                label="Your Legal Question",
                placeholder="Ask any question about your legal documents...",
                lines=3
            ),
            gr.Checkbox(label="Show Source Documents"),
            gr.Radio(
                choices=["Brief", "Detailed", "Comprehensive"],
                label="Response Length",
                value="Detailed"
            )
        ],
        outputs=[
            gr.Textbox(label="Answer", lines=10),
            gr.Textbox(label="Source Documents", lines=6)  # Stays empty unless sources are requested
        ],
        title=f"Advanced Legal Assistant ({model_type})",
        description="""This enhanced AI assistant can answer questions about your legal documents.
        You can customize the response length and choose to see source documents."""
    )
    return demo
# Create and launch the enhanced interface
# For Llama, use: create_advanced_interface(model, processor, model_type="llama", tokenizer=tokenizer)
demo = create_advanced_interface(model, processor, model_type="gpt-4")
demo.launch(share=True)
print("✅ Enhanced interface is ready! Click the link above to start using your assistant.")


Interface Features Explained

The interfaces we’ve created offer different features:

1. Basic Interface (Options A and B):

    Simple question input
    Clear answer output
    Model-specific optimizations
    Automatic error handling

2. Enhanced Interface (Optional):

    Adjustable response length
    Option to show source documents
    More sophisticated error handling
    Better context management

When using your interface:

    The URL provided will work as long as your Colab notebook is running
    You can share the link with others (if you used share=True)
    The interface will work with your processed documents
    Responses might take a few seconds, especially with longer questions

Troubleshooting Interface Issues

If you encounter problems:

1. Interface Won’t Load:

    Make sure all previous cells ran successfully
    Check that Gradio installed correctly
    Verify your model connection is working

2. Slow Responses:

    Try shorter questions first
    Reduce the context window size
    Use the “Brief” response length option

3. Error Messages:

    Check your API keys are still valid
    Verify your documents were processed correctly
    Look for error messages in the Colab output
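Many "Interface Won't Load" failures come down to a cell that was skipped, which surfaces as a `NameError`. A small sketch of a pre-flight check that reports which expected names are missing from the notebook namespace; `missing_names` is a hypothetical helper, and the names in the usage comment are the ones used in this guide:

```python
def missing_names(required, namespace):
    """Return the names from `required` that are absent (or None) in `namespace`."""
    return [name for name in required if namespace.get(name) is None]

# In the notebook you might call it with globals():
#   missing = missing_names(["model", "processor"], globals())
#   if missing:
#       print(f"Run the cells that define: {', '.join(missing)}")
example_namespace = {"model": object(), "processor": None}
print(missing_names(["model", "processor", "tokenizer"], example_namespace))
```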

Remember:

    Keep your Colab notebook running while using the interface
    The public URL changes each time you run the cell
    Save your interface URL if you want to share it
    Monitor your API usage if using GPT-4

Optional Advanced Step: Teaching Your Assistant

If you’ve been using your legal assistant and want to make it even better at handling your specific legal questions, you can teach it using your own examples. This process has two main parts:

    Creating training examples (easier)
    Fine-tuning and deploying the model (more advanced)

You can do just the first part to prepare your training data, and come back to the second part when you’re ready for the more technical steps.
Part 1: Creating Your Training Examples

First, let’s create some examples that will help your assistant learn. We’ll use a friendly tool that makes this process easier. Create a new notebook called “Legal_Assistant_Training” and add this cell:

In [None]:
import json
import os
from IPython.display import display, clear_output
import ipywidgets as widgets

class ExampleCreator:
    """A friendly tool to help create training examples"""
    def __init__(self):
        self.examples = []
        self.filename = 'training_data.jsonl'
        # Create our input boxes
        self.question_box = widgets.Textarea(
            description='Question:',
            placeholder='Type your legal question here...',
            layout={'width': '90%', 'height': '100px'}
        )
        self.answer_box = widgets.Textarea(
            description='Answer:',
            placeholder='Type the correct answer here...',
            layout={'width': '90%', 'height': '200px'}
        )
        # Create our buttons
        self.save_button = widgets.Button(description='Save Example')
        self.save_button.on_click(self.save_example)
        self.display_button = widgets.Button(description='Show All Examples')
        self.display_button.on_click(self.show_examples)
        # Show our tool
        display(self.question_box)
        display(self.answer_box)
        display(self.save_button)
        display(self.display_button)
    def save_example(self, button):
        """Save a new example"""
        question = self.question_box.value.strip()
        answer = self.answer_box.value.strip()
        if not question or not answer:
            print("❌ Please provide both a question and an answer!")
            return
        # Create the example
        example = {
            "conversations": [
                {"from": "human", "value": question},
                {"from": "gpt", "value": answer}
            ]
        }
        # Save it
        with open(self.filename, 'a') as f:
            f.write(json.dumps(example) + '\n')
        self.examples.append(example)
        # Clear the boxes for next example
        self.question_box.value = ''
        self.answer_box.value = ''
        print(f"✅ Example saved! You now have {len(self.examples)} examples.")
    def show_examples(self, button):
        """Show all saved examples"""
        clear_output(wait=True)
        print(f"Your {len(self.examples)} Training Examples:\n")
        for i, example in enumerate(self.examples, 1):
            print(f"Example {i}:")
            print(f"Q: {example['conversations'][0]['value']}")
            print(f"A: {example['conversations'][1]['value']}\n")
        # Show our tool again
        display(self.question_box)
        display(self.answer_box)
        display(self.save_button)
        display(self.display_button)
# Create our example creator tool
creator = ExampleCreator()
print("✨ Example Creator is ready! Start adding your training examples above.")

Let’s add some starter examples to help you understand the format. Add this cell:

In [None]:
# Some example legal questions and answers to get you started
starter_examples = [
    {
        "question": "What makes a contract valid?",
        "answer": "A valid contract requires four essential elements: 1) Offer and acceptance, "
                 "2) Consideration (something of value exchanged), 3) Intention to create "
                 "legal relations, and 4) Capacity of the parties to contract. The agreement "
                 "must also be legal and sufficiently certain in its terms."
    },
    {
        "question": "How does force majeure work in contracts?",
        "answer": "Force majeure clauses excuse a party from performing their contractual "
                 "obligations when extraordinary events beyond their control prevent "
                 "performance. These events typically include natural disasters, wars, or "
                 "government actions. The clause must specifically define what constitutes "
                 "force majeure, and the party claiming it must prove the event's impact."
    }
]

# Append these examples to the training file in the same format the tool uses
with open('training_data.jsonl', 'a') as f:
    for example in starter_examples:
        formatted_example = {
            "conversations": [
                {"from": "human", "value": example["question"]},
                {"from": "gpt", "value": example["answer"]}
            ]
        }
        f.write(json.dumps(formatted_example) + '\n')
print("✅ Added starter examples to your training data!")

Now you can use the tool to add your own examples. Here are some tips for creating good examples:

1. Include different types of questions:

    What is… (definitions)
    How does… (processes)
    Why is… (reasoning)
    When should… (timing)

2. Make sure your answers:

    Start with a clear main point
    Include specific details
    Use proper legal terminology
    Stay concise but complete

Try to create at least 20–30 examples before moving on to the next part.
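
Before moving on, it can help to confirm that every line of `training_data.jsonl` parses and follows the two-turn format written by the cells above. A minimal sketch; `validate_training_file` is a hypothetical helper, not part of any library:

```python
import json

def validate_training_file(path):
    """Return (valid_count, problems) for a JSONL file of two-turn conversations."""
    valid, problems = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                turns = json.loads(line)["conversations"]
                roles = [turn["from"] for turn in turns]
                texts = [turn["value"].strip() for turn in turns]
            except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
                problems.append(f"line {line_no}: {type(e).__name__}")
                continue
            if roles != ["human", "gpt"] or not all(texts):
                problems.append(f"line {line_no}: expected a non-empty human/gpt pair")
            else:
                valid += 1
    return valid, problems

# In the notebook:
#   valid, problems = validate_training_file('training_data.jsonl')
#   print(f"{valid} valid examples, {len(problems)} problems")
```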
Part 2: Fine-Tuning and Deploying Your Model

Once you have your training examples ready, you can use them to improve your model. This part is more technical and requires a new Colab notebook with more computational resources.

Create a new notebook called “Legal_Assistant_Finetuning” and add these cells:
Cell 1: Install Training Tools

In [None]:
# Install specialized training packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes datasets

print("✅ Training packages installed")

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-4tr4nd6i/unsloth_e4b5dcea890c4a99aa516d0b5d61c0ac
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-4tr4nd6i/unsloth_e4b5dcea890c4a99aa516d0b5d61c0ac
  Resolved https://github.com/unslothai/unsloth.git to commit 5dddf27f3ba94506c48251e907031039eecd40d1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.1.2 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.1.3-py3-none-any.whl.metadata (16 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git-


Collecting xformers<0.0.27
  Downloading xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting trl<0.9.0
  Downloading trl-0.8.6-py3-none-any.whl.metadata (11 kB)
Downloading xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl (222.7 MB)
Downloading trl-0.8.6-py3-none-any.whl (245 kB)
Installing collected packages: xformers, trl
  Attempting uninstall: trl
    Found existing installation: trl 0.13.0
    Uninstalling trl-0.13.0:
      Successfully uninstalled trl-0.13.0
Successfully installed trl-0.8.6 xformers-0.0.26.post1
✅ Training packages installed




Cell 2: Initialize Training

In [None]:
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
import torch

# Set up model for training
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None
)
# Prepare for fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing=True
)
print("✅ Model prepared for training")
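
To get a feel for what `r=16` on the four attention projections means, the number of trainable LoRA parameters can be estimated by hand: each adapted weight matrix with input width d_in and output width d_out gains two low-rank factors holding r × (d_in + d_out) parameters in total. The sketch below assumes the published Llama 3.1 8B shape (32 layers, hidden size 4096, key/value projection width 1024). Treat these dimensions as assumptions and check them against the model config before relying on the number; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, layer_shapes, num_layers):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers

# Assumed projection shapes (d_in, d_out) for Llama 3.1 8B
attention_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
total = lora_param_count(r=16, layer_shapes=attention_shapes, num_layers=32)
print(f"{total:,} trainable parameters")
```

Under these assumptions the adapters add roughly 13.6 million parameters, a small fraction of the 8 billion in the base model, which is why this kind of fine-tuning can fit on a single Colab GPU.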

Cell 3: Upload and Train

First, upload your training_data.jsonl file that you created in the previous part of this guide, then run:

In [None]:
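# Load your training data
dataset = load_dataset('json', data_files='training_data.jsonl')

# Convert each conversation into a single training string. The
# "Question: ... Answer: ..." layout is an assumption chosen to resemble the
# prompts used earlier in this guide; change it if you prompt differently.
def format_example(example):
    question = example["conversations"][0]["value"]
    answer = example["conversations"][1]["value"]
    return {"text": f"Question: {question}\nAnswer: {answer}{tokenizer.eos_token}"}

dataset = dataset.map(format_example)

# Configure training
training_args = TrainingArguments(
    output_dir="./legal_assistant_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    # Mixed precision: bf16 where the GPU supports it, otherwise fp16
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    save_strategy="epoch"
)
# Create trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    tokenizer=tokenizer,
    dataset_text_field="text",      # column created by format_example
    max_seq_length=max_seq_length,  # defined in Cell 2
    args=training_args,
    packing=False
)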
# Start training
print("Starting training...")
trainer.train()
print("✅ Training complete!")
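
With a small dataset, the settings above produce very few optimizer steps. A rough estimate is sketched below, assuming a single GPU and ignoring the details of how the trainer rounds partial accumulation windows; `estimate_steps` is a hypothetical helper:

```python
import math

def estimate_steps(num_examples, batch_size, grad_accum, epochs):
    """Approximate optimizer steps: batches per epoch grouped into accumulation windows."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

# 30 examples with the values from the cell above
steps = estimate_steps(num_examples=30, batch_size=4, grad_accum=4, epochs=3)
print(f"Effective batch size: {4 * 4}, estimated optimizer steps: {steps}")
```

If the estimate is below `logging_steps=10`, you may see no intermediate loss values during training; lowering `logging_steps` is one option.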

Cell 4: Save and Share Your Model

In [None]:
# First, log in to Hugging Face
from huggingface_hub import login
login()  # You'll need to enter your Hugging Face token

# Save locally
trainer.save_model("./FineTunedLegal")
# Push the model and the tokenizer to Hugging Face Hub (two separate calls)
model_name = "your-username/LegalAssistant"  # Change this to your desired name
model.push_to_hub(model_name)
tokenizer.push_to_hub(model_name)
print(f"✅ Model saved and pushed to {model_name}")

Using Your Improved Model

To use your newly trained model, go back to your original legal assistant notebook and update the model loading cell with:

In [None]:
# Replace the original model loading code with:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/LegalAssistant",  # Use your model's name
    token="your-huggingface-token"
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/LegalAssistant",
    token="your-huggingface-token"
)

Your assistant will now use your improved model that’s been trained on your specific examples. It should be better at handling the types of questions you included in your training data.