**Stage 1:** data generation, creating email samples.

**Justification for this approach:**

1.  **Faker Library:** by using the Faker library, I can generate realistic email addresses, names, and timestamps without exposing sensitive data.

2. **Legal Templates:** the use of predefined legal templates for subjects and body text enables me to generate synthetic emails that closely resemble real legal communications.

3. **Random Selection & Prefixes:** randomly choosing the matter id for each email ensures a distribution across multiple legal matters. Adding ambiguous or noisy prefixes such as "Re:" and "Fwd:" further simulates real email behavior, where replies and forwarded messages can complicate context recognition.

4. **True Matter Identifier:** including a true_matter_id for each email allows me to evaluate the performance of our grouping model.
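
To illustrate why the "Re:" and "Fwd:" prefixes in point 3 add noise, here is a minimal sketch of how such prefixes could be stripped before comparing subjects. The helper `strip_reply_prefixes` is hypothetical and is not used by the notebook, which keeps the prefixes in the text it embeds.

```python
import re

# Hypothetical helper (not part of the notebook): removes any run of leading
# "Re:", "Fw:" or "Fwd:" markers, case-insensitively.
_PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def strip_reply_prefixes(subject: str) -> str:
    """Return the subject without leading reply/forward markers."""
    return _PREFIX_RE.sub("", subject).strip()

for s in [
    "Litigation Case Update",
    "Re: Litigation Case Update",
    "Fwd: Re: Litigation Case Update",
]:
    # All three variants reduce to the same base subject.
    print(repr(strip_reply_prefixes(s)))
```

Under this normalization the three subjects become identical strings, so any remaining difference between emails has to come from the body text.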

In [None]:
!pip install Faker



In [None]:
import pandas as pd
import random
import datetime
from faker import Faker

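fake = Faker()

# Seed both random sources so the synthetic dataset is reproducible across runs.
Faker.seed(42)
random.seed(42)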

legal_templates = {
    1: {
        'subject': "Litigation Case Update",
        'body_templates': [
            "Recent developments in the litigation have been significant. Please review the attached case documents for updates.",
            "The litigation case has seen new progress. Kindly examine the updated court submissions and legal opinions."
        ]
    },
    2: {
        'subject': "Contract Negotiation Discussion",
        'body_templates': [
            "The contract draft has been revised based on our recent negotiations. Please verify the changes and suggest modifications.",
            "We have updated the contract terms in line with the latest negotiation feedback. Your input on the proposed clauses is appreciated."
        ]
    },
    3: {
        'subject': "Patent Infringement Inquiry",
        'body_templates': [
            "An inquiry regarding potential patent infringement has been received. We need to review the attached documents thoroughly.",
            "Please assess the evidence related to the patent infringement issue. Your expert opinion on the technical details is required."
        ]
    },
    4: {
        'subject': "Property Dispute Resolution",
        'body_templates': [
            "The property dispute has entered a critical phase. Review the attached resolution strategy document and provide your recommendations.",
            "Updates have been made on the property dispute. Please examine the legal analysis and offer your guidance on the next steps."
        ]
    }
}

def generate_legal_email(email_id, matter_id):
    sender = fake.email()
    recipient = fake.email()
    timestamp = fake.date_time_between(start_date='-1y', end_date='now')

    matter_info = legal_templates.get(matter_id, {
        'subject': "General Legal Update",
        'body_templates': ["Please review the attached legal documentation."]
    })
    base_subject = matter_info['subject']
    body_template = random.choice(matter_info['body_templates'])

    prefix = random.choice(["", "Re: ", "Fwd: "])
    subject = prefix + base_subject

    return {
        "email_id": email_id,
        "from": sender,
        "to": recipient,
        "timestamp": timestamp,
        "subject": subject,
        "body": body_template,
        "true_matter_id": matter_id
    }

email_list = []
num_emails = 25

for i in range(num_emails):
    matter_id = random.choice([1, 2, 3, 4])
    email = generate_legal_email(email_id=i+1, matter_id=matter_id)
    email_list.append(email)

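# Append three replies: swap sender and recipient, shift the timestamp forward,
# and inherit the original email's matter ID.
for _ in range(3):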
    original_email = random.choice(email_list)
    reply_email = {
        "email_id": len(email_list) + 1,
        "from": original_email["to"],
        "to": original_email["from"],
        "timestamp": original_email["timestamp"] + datetime.timedelta(hours=random.randint(1, 24)),
        "subject": ("Re: " + original_email["subject"]) if not original_email["subject"].startswith("Re: ") else original_email["subject"],
        "body": "Following up on our previous discussion, please see my comments and suggested revisions.",
        "true_matter_id": original_email["true_matter_id"]
    }
    email_list.append(reply_email)

df_emails = pd.DataFrame(email_list)
df_emails.sort_values(by="timestamp", inplace=True)

df_emails.head(10)

|  | email_id | from | to | timestamp | subject | body | true_matter_id |
|---|---|---|---|---|---|---|---|
| 13 | 14 | jasonhawkins@example.org | zhorn@example.net | 2024-04-16 05:22:07.258753 | Patent Infringement Inquiry | An inquiry regarding potential patent infringe... | 3 |
| 7 | 8 | janet06@example.net | uayala@example.org | 2024-05-14 00:21:22.868943 | Re: Litigation Case Update | Recent developments in the litigation have bee... | 1 |
| 1 | 2 | jessica78@example.com | ghall@example.net | 2024-05-17 21:35:50.688893 | Property Dispute Resolution | The property dispute has entered a critical ph... | 4 |
| 24 | 25 | ckelley@example.com | cassandra44@example.com | 2024-05-18 16:51:46.395463 | Fwd: Litigation Case Update | The litigation case has seen new progress. Kin... | 1 |
| 9 | 10 | richardsonjacqueline@example.org | vsmith@example.com | 2024-07-25 13:01:46.860135 | Re: Patent Infringement Inquiry | An inquiry regarding potential patent infringe... | 3 |
| 20 | 21 | michael41@example.org | philip38@example.net | 2024-08-19 06:39:01.907658 | Re: Litigation Case Update | The litigation case has seen new progress. Kin... | 1 |
| 10 | 11 | nathan79@example.net | bradley75@example.org | 2024-09-01 15:53:05.159333 | Re: Patent Infringement Inquiry | Please assess the evidence related to the pate... | 3 |
| 0 | 1 | schmidtkristopher@example.net | michael59@example.net | 2024-09-02 14:12:41.200100 | Re: Property Dispute Resolution | The property dispute has entered a critical ph... | 4 |
| 21 | 22 | natalieblanchard@example.com | smithadam@example.org | 2024-09-03 00:17:17.065691 | Fwd: Litigation Case Update | The litigation case has seen new progress. Kin... | 1 |
| 14 | 15 | jessicaandrews@example.net | danielbell@example.org | 2024-09-20 16:36:35.286719 | Fwd: Patent Infringement Inquiry | Please assess the evidence related to the pate... | 3 |


**Stage 2:** Matter Grouping Model

**Semantic Preprocessing of Emails:**
- I preprocess each email by concatenating the subject and body into a single text string.
- I also convert the text to lowercase and strip leading and trailing whitespace to reduce variability due to differences in capitalization or formatting. This standardization improves the consistency of the embeddings generated later.
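
The notebook's `preprocess_email` only trims whitespace at the ends of the combined string. If repeated spaces, tabs, or newlines inside the text also need to be collapsed, a slightly stricter normalization could look like the sketch below. The function name `normalize_text` is an assumption for illustration.

```python
def normalize_text(subject: str, body: str) -> str:
    """Concatenate subject and body, collapse whitespace runs, and lowercase."""
    combined = f"{subject} {body}"
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining with a single space also removes tabs and newlines.
    return " ".join(combined.split()).lower()

print(normalize_text("Re:  Litigation   Case Update",
                     "Please review\nthe attached   documents. "))
```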

**Embedding Extraction using Sentence-BERT:**
- I am employing the all-MiniLM-L6-v2 model from the SentenceTransformer library. This model is well-regarded for its efficiency and ability to capture fine-grained semantic relationships in short texts.
- By transforming the combined text of each email into dense vector embeddings, I can now effectively compare and group emails based on semantic similarity.
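
A common way to compare two sentence embeddings is cosine similarity. The sketch below uses plain Python lists and toy vectors rather than real model output, so the numbers are purely illustrative. KMeans itself clusters by Euclidean distance, not cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a zero vector: treat as no similarity
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (real all-MiniLM-L6-v2 vectors have 384 dimensions).
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```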

**Clustering with KMeans:**
- KMeans is chosen due to its simplicity, efficiency, and effectiveness in partitioning vector space into distinct clusters.

**Evaluation Metrics (ARI and NMI):**
- ARI helps quantify the similarity between our predicted matter assignments and the true matter IDs.
- NMI measures the amount of information shared between the predicted clusters and the ground truth clusters. It is useful because it is normalized, so it provides a consistent scale even when the number of clusters varies.
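
KMeans returns arbitrary cluster labels (0 to 3 here), while `true_matter_id` runs from 1 to 4. ARI is unaffected by this because it compares how pairs of emails are grouped, not the label values themselves. The pure-Python sketch below implements the standard pair-counting formula for illustration only. The notebook itself relies on `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # trivial case: no pairs to compare

    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster on both sides
    return (sum_pairs - expected) / (max_index - expected)

true_ids = [1, 1, 2, 2, 3, 3, 4, 4]
kmeans_ids = [2, 2, 0, 0, 3, 3, 1, 1]  # same grouping, different label names
print(adjusted_rand_index(true_ids, kmeans_ids))
```

Because the two labelings induce the same partition, the score is 1.0 even though no label value matches.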

In [None]:
!pip install scikit-learn sentence-transformers



In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sentence_transformers import SentenceTransformer

def preprocess_email(row):
    return f"{row['subject']} {row['body']}".strip().lower()

df_emails['combined_text'] = df_emails.apply(preprocess_email, axis=1)

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(df_emails['combined_text'].tolist(), show_progress_bar=True)


In [None]:
num_clusters = 4

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

df_emails['predicted_matter_id'] = cluster_labels

In [None]:
true_labels = df_emails['true_matter_id'].tolist()
ari = adjusted_rand_score(true_labels, cluster_labels)
nmi = normalized_mutual_info_score(true_labels, cluster_labels)

print(f"Adjusted Rand Index (ARI): {ari:.3f}")
print(f"Normalized Mutual Information (NMI): {nmi:.3f}")

Adjusted Rand Index (ARI): 1.000
Normalized Mutual Information (NMI): 1.000


**Stage 3:** AI-Powered Matter Summarization

- Google’s Gemini models are designed for high-quality text generation. They have advanced capabilities for understanding and summarizing complex, domain-specific content.

- Prior to summarization, emails are already grouped by their predicted matter. By doing so, I can generate focused summaries for each cluster that are contextually relevant. This grouping leverages our prior work in semantic embedding-based clustering.

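- The prompt explicitly enumerates the five required sections (recap of instructions, recipient details, scope of work and fee information, ongoing tasks, and chronology of events) so that the LLM generates a detailed, consistently structured summary.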

- Utilizing an open-source client simplifies integration, supports modularity, and makes it easy to replace or update the model in the future if needed.
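
The notebook's `create_summary_prompt` passes recipients, timestamps, and subjects to the model. As one possible extension, the sketch below also includes a truncated copy of each email body so the summary can draw on the message content. The function name, the `max_body_chars` parameter, the prompt wording, and the sample emails are all assumptions for illustration, not part of the notebook.

```python
from datetime import datetime

def create_prompt_with_bodies(emails, max_body_chars=300):
    """Hypothetical prompt builder that includes (truncated) email bodies."""
    emails_sorted = sorted(emails, key=lambda e: e["timestamp"])
    entries = []
    for e in emails_sorted:
        body = e["body"][:max_body_chars]  # keep the prompt size bounded
        entries.append(
            f"[{e['timestamp']}] {e['from']} -> {e['to']} | {e['subject']}\n{body}"
        )
    thread = "\n\n".join(entries)
    return (
        "Summarize the following legal matter using only the emails below.\n"
        "Cover: instructions, recipients, scope and fees, ongoing tasks, chronology.\n"
        "If a section is not supported by the emails, say so explicitly.\n\n"
        + thread
    )

# Made-up sample emails using the same keys as the notebook's DataFrame rows.
sample = [
    {"timestamp": datetime(2024, 5, 2, 9, 0), "from": "a@example.com",
     "to": "b@example.com", "subject": "Re: Contract Negotiation Discussion",
     "body": "Following up on our previous discussion."},
    {"timestamp": datetime(2024, 5, 1, 9, 0), "from": "b@example.com",
     "to": "a@example.com", "subject": "Contract Negotiation Discussion",
     "body": "The contract draft has been revised."},
]
print(create_prompt_with_bodies(sample))
```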

In [None]:
!pip install -q -U google-genai

In [None]:
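import os
from collections import defaultdict

from google import genai

# Read the key from the environment rather than hard-coding it in the notebook.
# GEMINI_API_KEY is an assumed variable name; set it before running this cell.
client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY", ""))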

def group_emails_by_matter(df, matter_col='predicted_matter_id'):
    matter_groups = defaultdict(list)
    for idx, row in df.iterrows():
        matter_groups[row[matter_col]].append(row.to_dict())
    return matter_groups

def create_summary_prompt(emails):
    emails_sorted = sorted(emails, key=lambda x: x['timestamp'])

    instructions = "This matter involves detailed legal communications regarding a legal case. Please synthesize the essential details into a comprehensive summary."
    # Sort the unique recipients so the prompt is identical from run to run.
    recipients = ", ".join(sorted({email['to'] for email in emails_sorted}))
    scope_and_fee = "The scope includes reviewing legal documents and scheduling consultations, with fees applied per consultation."
    ongoing_tasks = "Ongoing tasks include regular updates, periodic legal reviews, and continuous client communications."
    chronology = "\n".join(f"{email['timestamp']} - {email['subject']}" for email in emails_sorted)

    prompt = (
        "Generate a detailed 1-2 page summary for the following matter group. The summary should include:\n"
        "1. Recap of Instructions: " + instructions + "\n"
        "2. Recipient Details: " + recipients + "\n"
        "3. Scope of Work and Fee Information: " + scope_and_fee + "\n"
        "4. Ongoing Tasks: " + ongoing_tasks + "\n"
        "5. Chronology of Events:\n" + chronology + "\n"
        "Provide a comprehensive overview to assist in managing this legal matter."
    )
    return prompt

def generate_summary_with_gemini(prompt_text):
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt_text]
    )
    return response.text

matter_groups = group_emails_by_matter(df_emails, matter_col='predicted_matter_id')

In [None]:
if 0 in matter_groups:
    prompt0 = create_summary_prompt(matter_groups[0])
    summary0 = generate_summary_with_gemini(prompt0)
    print("Summary for Matter 0:")
    print("-" * 40)
    print(summary0)
else:
    print("Matter 0 not found in matter_groups.")

Summary for Matter 0:
----------------------------------------
## Litigation Case Summary: [Insert Case Name/Identifying Number Here]

**1. Recap of Instructions:**

This matter concerns ongoing litigation requiring continuous monitoring and proactive management. The primary directive is to stay informed of all developments within the case and provide regular updates and consultations to involved parties. This involves meticulous review of legal documents, accurate record-keeping of key events, and clear and consistent communication with all recipients listed below. Crucially, the information provided by these legal communications needs to be synthesized into a readily understandable format to facilitate informed decision-making and strategic planning related to the litigation. The overarching goal is to ensure efficient and effective progression of the case while adhering to established budgetary guidelines. The matter requires a proactive approach, anticipating potential issues and p

In [None]:
if 1 in matter_groups:
    prompt1 = create_summary_prompt(matter_groups[1])
    summary1 = generate_summary_with_gemini(prompt1)
    print("Summary for Matter 1:")
    print("-" * 40)
    print(summary1)
else:
    print("Matter 1 not found in matter_groups.")

Summary for Matter 1:
----------------------------------------
## Matter Summary: Contract Negotiation

**1. Recap of Instructions:**

This matter involves the ongoing legal management of a contract negotiation. The core communication revolves around a "Contract Negotiation Discussion," the specifics of which are revealed through email exchanges, including forwards and replies. The objective is to provide legal counsel, review the contract terms, advise on negotiation strategies, and protect the client's interests throughout the negotiation process. The matter also includes the administrative tasks of scheduling consultations, providing regular updates, and conducting periodic legal reviews. The ultimate goal is to achieve a mutually beneficial and legally sound contract agreement.

**2. Recipient Details:**

The following individuals are identified as recipients within the email communications related to this matter:

*   fletchertammy@example.org
*   thall@example.net
*   thorntondan

In [None]:
if 2 in matter_groups:
    prompt2 = create_summary_prompt(matter_groups[2])
    summary2 = generate_summary_with_gemini(prompt2)
    print("Summary for Matter 2:")
    print("-" * 40)
    print(summary2)
else:
    print("Matter 2 not found in matter_groups.")

Summary for Matter 2:
----------------------------------------
## Legal Matter Summary: Patent Infringement Inquiry

**1. Recap of Instructions:**

This legal matter concerns a potential patent infringement case. The communications indicate an initial inquiry regarding patent infringement, followed by subsequent exchanges likely involving the provision of relevant documents, legal analysis, and discussions regarding potential courses of action. The core issue revolves around determining whether a specific patent is being infringed upon by another party. The legal team is tasked with:

*   **Analyzing the potentially infringed patent:** Understanding the scope and claims of the patent in question.
*   **Identifying the alleged infringing party and their activities:** Gathering information about the party potentially infringing on the patent and the specific activities that raise concerns.
*   **Determining whether infringement exists:** Comparing the patent claims with the alleged infri

In [None]:
if 3 in matter_groups:
    prompt3 = create_summary_prompt(matter_groups[3])
    summary3 = generate_summary_with_gemini(prompt3)
    print("Summary for Matter 3:")
    print("-" * 40)
    print(summary3)
else:
    print("Matter 3 not found in matter_groups.")

Summary for Matter 3:
----------------------------------------
## Matter Group Summary: Property Dispute Resolution

**Date:** October 26, 2023 (Assuming a current date for this summary)

**Subject:** Overview and Management of Property Dispute Resolution Matter

**1. Recap of Instructions:**

This matter involves a property dispute requiring legal review, client consultation, and ongoing management. The core instruction is to actively work towards a resolution of this property dispute by thoroughly analyzing all relevant legal documentation, providing informed legal counsel to the client(s), and maintaining consistent communication throughout the process. This includes:

*   **Document Review:** Complete and ongoing review of all documents related to the property dispute, including but not limited to: deeds, titles, surveys, correspondence between involved parties, and any relevant agreements. This review is essential for understanding the legal basis of the dispute, identifying poten

**Stage 4:** Evaluation.

**Approach 1:** manual comparison of generated summaries to reference matter descriptions

- To assess the quality of the generated summaries, I implemented a manual evaluation. This approach lets me judge qualitatively whether each summary meets the critical needs for legal matter management listed in the README.

**Points mentioned in the README:**

1.   **Recap of Instructions:** provided by all summaries.
2.   **Recipient Details:** provided by all summaries.
3.   **Scope of Work and Fee Information:** provided by all summaries.
4.   **Ongoing Tasks:** provided by all summaries.
5.   **Chronology of Events:** provided by all summaries.


**Qualitative Assessment**

Key factors taken into consideration:
1.   **Completeness:** are all the required sections (instructions, recipient details, scope/fee, ongoing tasks, chronology) present?
2.   **Accuracy:** is the information correct and reflective of the raw email content?
3.   **Coherence:** does the summary read as a logically organized whole?
4.   **Relevance:** are only the essential details included, without unnecessary ones?

**After a manual check, I concluded that all the summaries satisfy these criteria.**

**Approach 2:** using a simple scoring function (string overlap)
- I have chosen Jaccard similarity to quantify how much overlap there is between the words in the generated summary and the words in a reference summary.

**Justification for my choice:**
- The Jaccard similarity is straightforward to compute and easy to interpret.

- Although it is a basic metric, Jaccard similarity offers a quick quantitative snapshot. It allows us to compare different summarization approaches or monitor improvements with minimal computational overhead, acting as a baseline against which more sophisticated metrics can later be compared.

- The computational cost of calculating Jaccard similarity is low, making it an attractive option for a prototype where multiple summaries are generated and need to be evaluated quickly.
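
For reference, the Jaccard similarity of two word sets $A$ and $B$ is the size of their intersection divided by the size of their union. The word sets in the worked example are made up for illustration.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1

% Worked example with illustrative sets:
% A = \{\text{contract}, \text{review}, \text{update}\}
% B = \{\text{contract}, \text{update}, \text{timeline}, \text{fees}\}
% |A \cap B| = 2, \quad |A \cup B| = 5
J(A, B) = \frac{2}{5} = 0.4
```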

In [None]:
summaries = {}
for matter_id, emails in matter_groups.items():
    prompt = create_summary_prompt(emails)
    summary = generate_summary_with_gemini(prompt)
    summaries[matter_id] = summary
    print(f"Summary for Matter {matter_id}:\n{'-'*40}\n{summary}\n")

Summary for Matter 2:
----------------------------------------
## Matter Group Summary: Patent Infringement Inquiry

**1. Recap of Instructions:**

This matter pertains to a potential patent infringement case. The initial communications revolved around an inquiry regarding possible infringement of a client's patent. Subsequent communications represent ongoing investigation, discussion, and analysis related to the alleged infringement. The communications suggest a need for comprehensive legal review to determine the strength of the client's patent, the likelihood of infringement, and the potential strategies for addressing the situation. The goal is to provide the client with informed legal advice and representation related to protecting their patent rights. Key aspects requiring attention include:

*   **Patent Validity:** Assessing the strength and enforceability of the client's patent.
*   **Infringement Analysis:** Determining if the alleged infringing party's activities actually vi

In [None]:
def jaccard_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1.intersection(words2)
    union = words1.union(words2)
    return len(intersection) / len(union) if union else 0

In [None]:
reference_summaries = {
    0: "Reference summary for Matter 0: A detailed overview covering instructions, recipients, scope of work, ongoing tasks, and chronology of events.",
    1: "Reference summary for Matter 1: A comprehensive summary with key legal instructions, contact details, fee structure, and a timeline of events.",
    2: "Reference summary for Matter 2: An in-depth summary including essential instructions, recipient details, work scope, fee information, ongoing tasks, and chronology.",
    3: "Reference summary for Matter 3: A detailed report providing a recap of instructions, recipient list, scope and fee details, ongoing updates, and a chronological timeline."
}

for matter_id, generated_summary in summaries.items():
    ref_summary = reference_summaries.get(matter_id, "")
    if ref_summary:
        score = jaccard_similarity(generated_summary, ref_summary)
        print(f"Matter {matter_id} - Jaccard Similarity Score: {score:.3f}")
    else:
        print(f"Matter {matter_id} - No reference summary available for evaluation.")

Matter 2 - Jaccard Similarity Score: 0.021
Matter 0 - Jaccard Similarity Score: 0.022
Matter 3 - Jaccard Similarity Score: 0.028
Matter 1 - Jaccard Similarity Score: 0.022
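
Scores this low are consistent with comparing a one-sentence reference against a multi-paragraph summary. The intersection cannot exceed the smaller word set and the union cannot be smaller than the larger one, so Jaccard similarity is capped at the ratio of the two set sizes. The sketch below demonstrates that bound. The helper `jaccard_upper_bound` and the word counts are illustrative assumptions, not measurements from the notebook's summaries.

```python
def jaccard_similarity(text1, text2):
    """Same word-set Jaccard as in the notebook."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0

def jaccard_upper_bound(n_ref, n_gen):
    """Largest possible Jaccard score for sets with n_ref and n_gen unique words."""
    if max(n_ref, n_gen) == 0:
        return 0
    return min(n_ref, n_gen) / max(n_ref, n_gen)

# Best case: every reference word also appears in the longer text.
reference = "a b c"
generated = "a b c d e f g h i j"
print(jaccard_similarity(reference, generated))  # 3 shared / 10 total
print(jaccard_upper_bound(3, 10))

# Illustrative sizes: a 20-word reference against a 400-word summary.
print(jaccard_upper_bound(20, 400))
```

Even a summary that contained every word of such a reference could not score above the bound, which is why a length-matched reference or a recall-oriented metric is a more informative comparison.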
