# Building out the clause matcher in a notebook to test performance

In [90]:
import os
import numpy as np
from transformers import AutoTokenizer, AutoModel
import pandas as pd

from tclp.clause_recommender import utils

___

### Loading and combining data

In [91]:
doc_to_use = '../../tclp/data/synth_data/untouched/000000065.txt'
model_path = "../CC_BERT/CC_model"
embeddings_dir = "../CC_BERT/CC_embeddings"
legal_model_path = "../legalbert/lebalbert_model"
legal_embeddings_dir = "../legalbert/lebalbert_embeddings"
clause_folder = "../data/cleaned_content"
clause_html = '../data/clause_boxes'

with open(doc_to_use, "r", encoding="utf-8") as f:
    query_text = f.read()


In [92]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)


Some weights of RobertaModel were not initialized from the model checkpoint at ../CC_BERT/CC_model and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [93]:
legal_tokenizer = AutoTokenizer.from_pretrained("casehold/legalbert")
legal_model = AutoModel.from_pretrained("casehold/legalbert")

In [94]:
documents, file_names = utils.load_clauses(clause_folder)


In [95]:
clause_boxes, clause_box_filenames = utils.load_clauses(clause_html)

In [96]:
clause_box_df = utils.parse_clause_boxes_to_df(clause_boxes)

In [97]:
final_df = utils.attach_documents(clause_box_df, documents, file_names)

In [98]:
final_df

Unnamed: 0,Name,Title,Excerpt,Jurisdiction,Updated,URL,Matched Filename,Document
0,Griff's Clause,Template Board Paper for Significant Contracts...,Template board papers with detailed prompts fo...,England & Wales,2024-09-10 10:34:38,https://chancerylaneproject.org/clauses/templa...,Template_Board_Paper_for_Significant_Contracts...,1. Strategy implications Alignment 1.1 The pro...
1,Bella's Clause,Management Equity Ratchet Terms,A template clause for inclusion in investment ...,England & Wales,2024-09-10 10:41:23,https://chancerylaneproject.org/clauses/manage...,Management_Equity_Ratchet_Terms.txt,1. Conversion Rights 1.1(see Definitions)1.2 T...
2,Chloe's Clause,Environmental Business Charter,The Environmental Business Charter – a “soft t...,England & Wales,2024-09-10 10:41:31,https://chancerylaneproject.org/clauses/enviro...,Environmental_Business_Charter.txt,1. Join the[Net Zero Lawyer’s Alliance / the L...
3,Matthew's Clause,Late Payment – Green Interest Remedies,Creative interest rate remedies by which payme...,England & Wales,2024-09-10 10:43:45,https://chancerylaneproject.org/clauses/late-p...,Late_Payment_–_Green_Interest_Remedies.txt,1. If[Party 2]fails to make a payment due to[P...
4,Soren's Clause,Land: Sustainable Soil Management,A set of sustainable soil management obligatio...,England & Wales,2024-10-02 14:16:08,https://chancerylaneproject.org/clauses/land-s...,Land__Sustainable_Soil_Management.txt,(A)[Consider including Eddie’s Recitals(Climat...
...,...,...,...,...,...,...,...,...
117,Emilia's Protocols,Green Litigation and Arbitration Protocols,Two similar protocols (litigation and arbitrat...,England & Wales,2024-09-10 10:42:57,https://chancerylaneproject.org/clauses/green-...,Green_Litigation_and_Arbitration_Protocols.txt,<p>hello</p>\n\n1. The parties[have signed up ...
118,Mary's Clause,JCT Energy Efficiency and Environmental Obliga...,This clause amends the JCT’s standard Design a...,England & Wales,2024-09-10 10:39:02,https://chancerylaneproject.org/clauses/jct-en...,JCT_Energy_Efficiency_and_Environmental_Obliga...,1. to use sustainable materials and avoid the ...
119,Emilio's Checklist,Climate Policy Footprint,A checklist that requests one-off or repeated ...,England & Wales,2024-09-10 10:36:13,https://chancerylaneproject.org/clauses/climat...,Climate_Policy_Footprint.txt,1.1 Details of the Company’s climate policy en...
120,Luna's Clause,Net Zero Aligned Construction Modifications,A clause that incentivises building contractor...,England & Wales,2024-09-10 10:39:28,https://chancerylaneproject.org/clauses/net-ze...,Net_Zero_Aligned_Construction_Modifications.txt,1. Contractor’s Net Zero obligations The Contr...


In [99]:
names, docs = utils.rebuild_documents(final_df)

____

In [100]:
bow_results = utils.find_top_similar_bow(
    target_doc=query_text,
    documents=docs,
    file_names=names,
    similarity_threshold=0.1,
    k=30,
)
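For reference, a bag-of-words prefilter of this kind can be sketched without any dependencies. This is not the code behind `utils.find_top_similar_bow`, whose internals are not shown here. The function names (`bow_vector`, `cosine`, `top_similar_bow`), the tokenisation rule, and the return shape are all assumptions made for illustration; only the `similarity_threshold` and `k` parameters mirror the call above.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercase the text, split on non-letters, and count tokens (assumed tokenisation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar_bow(target_doc, documents, file_names, similarity_threshold=0.1, k=30):
    """Return up to k (name, score, text) tuples scoring at or above the threshold."""
    target = bow_vector(target_doc)
    scored = [
        (name, cosine(target, bow_vector(doc)), doc)
        for name, doc in zip(file_names, documents)
    ]
    # Drop weak lexical matches first, then keep the k strongest.
    kept = [item for item in scored if item[1] >= similarity_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]
```

The threshold removes clauses with almost no vocabulary overlap, and `k` caps how many candidates are handed to the more expensive embedding step.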

In [101]:
top_docs = bow_results["Documents"]
top_names = bow_results["Top_Matches"]

In [102]:
top_names_sem, top_scores_sem, top_texts = utils.get_embedding_matches_subset(
    query_text,
    top_docs,
    top_names,
    tokenizer,
    model,
    method="cls",
    k=10,
)
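The `method="cls"` argument presumably selects how per-token hidden states are collapsed into one vector per text. How `utils` implements this is not shown, so the following is only a dependency-free sketch of the two usual options, using plain lists in place of model output. The names `cls_pool` and `mean_pool` are hypothetical. With Hugging Face models the CLS-style vector is typically read from the first position of `last_hidden_state`; if that is what happens here, the freshly initialised pooler weights mentioned in the warning above would not affect the embedding.

```python
def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] / <s> position) as the text embedding."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy hidden states: three token positions, two dimensions, last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
cls_embedding = cls_pool(hidden)
mean_embedding = mean_pool(hidden, mask)
```

CLS pooling depends on a single position, while mean pooling spreads the representation over every real token, so the two can rank long clauses differently.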

In [103]:
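for name, score in zip(top_names_sem, top_scores_sem):
    print(f"Clause: {name} (Score: {score:.4f})")
    # `names` comes from rebuild_documents and is aligned with `docs`;
    # indexing `documents` here assumes load_clauses returned the same order.
    index = names.index(name)
    print(documents[index][:500])  # Show the first 500 chars of the clause
    print("-" * 80)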

Clause: SME’s Net Zero Objectives (Score: 0.9791)
(A)[Insert Eddie’s Recitals(Climate Recitals)here.]Viola's Clause A precedent clause for a supply agreement requiring the supplier/ contractor to procure energy from renewable sources.

Drafting Option A:No Tender 1. Procurement of Wholly Renewable Energy 1.1 The[Supplier]/[Contractor]shall procure that[100%]/[at least[●]%]of the electricity supplied to it during the term of this Agreement is Wholly Renewable Energy.

1.2 The[Supplier]/[Contractor]will provide[unequivocal]evidence, * satisfactor
--------------------------------------------------------------------------------
Clause: Environmental Business Charter (Score: 0.9753)
1. Join the[Net Zero Lawyer’s Alliance / the Legal Sustainability Alliance / insert other relevant environmental organisations].

2. Measure, independently verify and publish its GHG Emissions.

3. Publicly set a Net Zero Target[approved by the Science Based Targets initiative][and sign up to the Race to Zero].


___

# No Keyword Approach

In [104]:
embeddings = utils.get_embeddings(
    'cls',
    embeddings_dir,
    docs,
    tokenizer,
    model
)

In [105]:
query_embedding = utils.encode_text(query_text, tokenizer, model, 'cls').reshape(1, -1)
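Reshaping to `(1, -1)` turns the query into a single-row matrix, the shape usually expected when comparing one vector against a matrix of clause embeddings. The ranking step itself can be sketched as follows. This is an illustration and not the implementation of `utils.get_matching_clause` or `utils.find_top_k`; the names `cosine_similarity` and `rank_top_k` and the default `k=5` are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_top_k(query_vec, clause_vecs, clause_names, k=5):
    """Score every clause against the query and return the k best (name, score) pairs."""
    if len(clause_vecs) != len(clause_names):
        # Names and embeddings must describe the same clauses in the same order.
        raise ValueError("clause_vecs and clause_names must have the same length")
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in zip(clause_names, clause_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The length check is there because the ranking is only meaningful when the list of names lines up with the rows of the embedding matrix.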

In [106]:
_, _, _, similarities, _ = utils.get_matching_clause(
    query_embedding, embeddings, names
)

# The embeddings were built from `docs` (aligned with `names`); passing
# `file_names` here assumes load_clauses and rebuild_documents share one order.
top_names, top_scores, _ = utils.find_top_k(similarities, file_names)

for name, score in zip(top_names, top_scores):
    print(f"Clause: {name} (Score: {score:.4f})")
    index = file_names.index(name)
    print(documents[index][:500])  # Show the first 500 chars of the clause
    print("-" * 80)

Clause: Climate_Change_Event.txt (Score: 0.9635)
1 Climate Change Event means an event, series of events or circumstance arising from the physical impacts of Climate Change that is either Pan-terra or Epi-terra in scope and prevents a party from performing its obligations under this agreement[including an obligation to pay money], and includes:(a)unavailability of water, clean air or other Natural Capital required by a party to manufacture or supply the[products OR services]; (b)hurricanes, tornados, floods and unusually severe rain, wind, sno
--------------------------------------------------------------------------------
Clause: CLLS_Certificate_of_Title:_Climate_Change_Disclosu.txt (Score: 0.9522)
1. Physical Risks:Climate Change is likely to increase the frequency and severity of weather events with the potential to damage property such as flooding, high winds, storms and greater temperature extremes.

2. Transition Risks:The UK Government's current policy is to reduce the volumes 

____

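According to a comparison by an LLM, the keyword-prefiltered approach (bag-of-words shortlist, then embedding re-ranking) returns better matches for this document than the embedding-only approach above. This is good evidence that some form of keyword matching is worth keeping in the pipeline.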
