## Clause Risk Categorization

This notebook assigns each TCLP clause to one or more risk categories using an LLM.

In [8]:
import pandas as pd
from tclp.clause_recommender import utils

In [2]:
risk_taxonomy = pd.read_excel('../data/risk_taxonomy.xlsx')

In [13]:
clause_folder = "../data/cleaned_content"
clause_html = '../data/clause_boxes'
model_path = "../CC_BERT/CC_model"

In [14]:
tokenizer, model, names, docs, final_df = utils.getting_started(model_path, clause_folder, clause_html)

  return self.fget.__get__(instance, owner)()
Some weights of RobertaModel were not initialized from the model checkpoint at ../CC_BERT/CC_model and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  soup = BeautifulSoup(content, "html.parser")


In [17]:
# make a df of names and docs
df = pd.DataFrame({'name': names, 'clause': docs})

In [3]:
risk_taxonomy

Unnamed: 0,Label,Description
0,Physical-flooding,Clause that helps reduce exposure to flooding ...
1,Physical-wildfire,Clause that helps mitigate exposure to wildfir...
2,Physical-heat,Clause that helps reduce exposure to overheati...
3,Physical-subsidence,Clause that helps reduce exposure to ground in...
4,Physical-sea-level,Clause that helps reduce exposure to coastal e...
5,Physical-water-scarcity,Clause that helps reduce exposure to water str...
6,Physical-extreme-weather,"Clause that helps reduce exposure to storm, wi..."
7,Physical-infrastructure,Clause that helps reduce exposure to infrastru...
8,Physical-general,Clause that helps manage general exposure to p...
9,Transition-mees,Clause that helps reduce exposure to MEES-rela...


In [26]:
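import os

from openai import OpenAI

# Read the OpenRouter key from the environment instead of hardcoding it.
# The variable name OPENROUTER_API_KEY is an assumption; use whatever your setup defines.
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)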

In [27]:
messages = "You are a helpful assistant whose job it is to identify the risk type given a provided clause. These clauses WILL NOT contain the risk themselves. Rather, they are designed to help legal users to mitigate risk. So you are meant to identify the risk categorizations that the given clause might help protect against. Feel free to pick more than one risk that you think the clause could be relevant for."

In [28]:
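def format_taxonomy_prompt(risk_taxonomy, given_prompt):
    """Append the taxonomy labels and descriptions to the base instructions."""
    # Separate the instructions from the category list with a blank line.
    prompt = given_prompt.strip() + "\n\n"
    prompt += "Here are the available categories:\n\n"
    for _, row in risk_taxonomy.iterrows():
        # Strip labels so they match the stripped risk_labels used when scoring.
        prompt += f"- `{str(row['Label']).strip()}`: {row['Description']}\n"
    prompt += "\n"
    # Multiple labels are allowed, matching the instructions and the JSON schema.
    prompt += "Return every label that applies, and explain your reasoning.\n"
    return prompt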
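To see roughly what the model receives as a system prompt, the sketch below rebuilds the same kind of prompt from a two-row toy taxonomy. The function name `build_system_prompt` and both descriptions are invented for illustration; only the `Label` and `Description` column names come from the notebook.

```python
import pandas as pd


def build_system_prompt(taxonomy: pd.DataFrame, instructions: str) -> str:
    """Join the base instructions with one bullet per taxonomy row."""
    lines = [instructions.strip(), "", "Here are the available categories:", ""]
    for label, description in zip(taxonomy["Label"], taxonomy["Description"]):
        lines.append(f"- `{str(label).strip()}`: {description}")
    return "\n".join(lines) + "\n"


# Toy taxonomy: the descriptions are placeholders, not the real ones.
toy_taxonomy = pd.DataFrame(
    {
        "Label": ["Physical-flooding", "Transition-mees"],
        "Description": ["Toy description A.", "Toy description B."],
    }
)

prompt = build_system_prompt(toy_taxonomy, "Identify the risks a clause mitigates.")
print(prompt)
```

Printing the prompt once before running the full loop is a cheap way to catch formatting slips such as missing blank lines between the instructions and the category list.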

In [29]:
def classify_clause(clause_text, taxonomy_df, given_prompt, model="qwen/qwen-2.5-7b-instruct"):
    system_prompt = format_taxonomy_prompt(taxonomy_df, given_prompt)
    
    user_prompt = f"""Clause:
\"\"\"{clause_text}\"\"\"

Which risk categories does this clause help mitigate?
Respond in this JSON format:
{{
  "labels": ["label1", "label2", ...],
  "justification": "Explain why these labels apply to this clause."
}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )
    
    return response.choices[0].message.content

In [30]:
# Join title and body with a blank line so the two do not run together.
clause_1 = df.iloc[0]['name'] + "\n\n" + df.iloc[0]['clause']

In [31]:
result = classify_clause(clause_1, risk_taxonomy, messages)

In [33]:
print(result)

{
  "labels": ["Transition-retrofit", "Transition-disclosure", "Transition-standards"],
  "justification": "The clause provides a detailed guide and checklist for accessing Sustainability-Linked Loans (SLLs), which is aimed at encouraging companies to align their financing with net zero transition goals. It includes setting and achieving Sustainability Performance Targets (SPTs), reporting on performance, and ensuring alignment with sustainability standards. This helps mitigate risks related to future retrofit obligations (Transition-retrofit) by encouraging companies to set ambitious and meaningful sustainability targets. It also helps manage disclosure-related risks (Transition-disclosure) by requiring companies to provide sustainability compliance certificates and report on performance against SPTs. Additionally, it ensures alignment with voluntary Net Zero standards (Transition-standards) by referencing frameworks and guidelines for setting SPTs."
}
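Since the model can return any string as a label, it may be worth checking a reply against the taxonomy before trusting it. The following is a minimal sketch, assuming the reply is valid JSON in the format requested above; `split_known_labels` is a hypothetical helper, and `Transition-finance` is a made-up label used only to show the unknown case.

```python
import json


def split_known_labels(reply_text, known_labels):
    """Return (known, unknown) labels from a JSON reply, preserving order."""
    parsed = json.loads(reply_text)
    returned = [str(label).strip() for label in parsed.get("labels", [])]
    known = [label for label in returned if label in known_labels]
    unknown = [label for label in returned if label not in known_labels]
    return known, unknown


# A small subset of labels seen in this notebook's output.
known_labels = {"Transition-retrofit", "Transition-disclosure", "Transition-standards"}

# "Transition-finance" is invented here to simulate a label outside the taxonomy.
reply = '{"labels": ["Transition-retrofit", "Transition-finance"], "justification": "demo"}'

known, unknown = split_known_labels(reply, known_labels)
print("known:", known)
print("unknown:", unknown)
```

Logging the unknown labels, rather than dropping them silently, makes it easier to tell whether the model is drifting away from the taxonomy.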


____

## Creating database and applying this to all clauses

In [None]:
def format_classification_result(name, result_json, risk_labels):
    import json

    # Parse the result if needed
    if isinstance(result_json, str):
        try:
            parsed = json.loads(result_json)
        except json.JSONDecodeError:
            print(result_json)
            parsed = {"labels": [], "justification": "Invalid JSON response"}

    else:
        parsed = result_json

    labels = parsed.get("labels", [])
    justification = parsed.get("justification", "")

    # Start building the row: default to 0 for all risks
    row = {label: "0" for label in risk_labels}
    row["name"] = name
    row["justification"] = justification

    for label in labels:
        if label in risk_labels:
            row[label] = "1"  # mark as positive

    return row
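The function above treats any reply that `json.loads` cannot parse as invalid. Models sometimes wrap the JSON in a Markdown code fence or add a sentence before it, so a more tolerant parser could reduce the number of rows that end up with no labels. This is a sketch of one possible approach, not what the notebook does; `extract_json_object` is a hypothetical helper.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Parse the outermost {...} span of a reply, ignoring text around it."""
    text = raw.strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])


# Simulated reply: prose plus a fenced JSON block (fence built programmatically).
fence = "`" * 3
reply = (
    f"Here is my answer:\n{fence}json\n"
    '{"labels": ["Legal-contract"], "justification": "demo"}'
    f"\n{fence}"
)

parsed = extract_json_object(reply)
print(parsed["labels"])
```

The approach is deliberately crude: it assumes a single JSON object per reply and will still raise if the braces enclose malformed JSON.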

In [81]:
# Get the list of all possible risk labels (from your taxonomy)
risk_labels = list(risk_taxonomy['Label'].str.strip())


In [82]:
results_df = pd.DataFrame(columns=['name'] + risk_labels + ['justification'])


In [83]:
rows = []
for i, row in df.iterrows():
    print(f"Processing clause {i+1}/{len(df)}: {row['name']}")

    # Join title and body with a blank line so the two do not run together.
    clause_text = row['name'] + "\n\n" + row['clause']
    result = classify_clause(clause_text, risk_taxonomy, messages)

    # Convert the JSON reply into a one-hot row of risk labels.
    rows.append(format_classification_result(row['name'], result, risk_labels))

# Build the DataFrame once; pd.concat inside the loop copies it on every iteration.
results_df = pd.DataFrame(rows, columns=['name'] + risk_labels + ['justification'])

Processing clause 1/122: A Beginner’s Guide and Checklist for Accessing Sustainability-Linked Loans (SLLs)
Processing clause 2/122: Allocating Scope 1, 2 and 3 Emissions for Leased Assets
Processing clause 3/122: Auditing Water Usage in Supply Chains
Processing clause 4/122: Avoiding Excessive Paperwork in Dispute Resolution
Processing clause 5/122: Benchmarking of Project Greenhouse Gas Emissions
Processing clause 6/122: Board Minutes: Consideration of Climate Change Factors
Processing clause 7/122: Board Paper Implementing Net Zero for SMEs
Processing clause 8/122: CLLS Certificate of Title: Climate Change Disclosures
Processing clause 9/122: Capital Markets ESG Due Diligence Questionnaire
Processing clause 10/122: Carbon Contract Clauses for Environmental Performance, and Associated Incentives and Remedies
Processing clause 11/122: Carbon Footprint Reduction – Mutual Notification Right (Carbon Footprint Reduction Notice)
Processing clause 12/122: Circular Economy Product Design Obli
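The loop makes one API call per clause, so a single transient failure would abort the whole run. One common mitigation is a retry wrapper with exponential backoff. The sketch below is an assumption about how that could look, not part of the original notebook; `call_with_retries` and `flaky_classifier` are hypothetical names, and the stand-in classifier simulates two failures without calling any API.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # a real run would catch narrower error types
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait:.2f}s")
            time.sleep(wait)


# Stand-in for classify_clause that fails twice before succeeding.
calls = {"n": 0}


def flaky_classifier(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated timeout")
    return '{"labels": [], "justification": "ok"}'


reply = call_with_retries(flaky_classifier, "some clause", base_delay=0.01)
print(reply)
```

In the loop, the call would become something like `call_with_retries(classify_clause, clause_text, risk_taxonomy, messages)`.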

In [88]:
results_df

Unnamed: 0,name,Physical-flooding,Physical-wildfire,Physical-heat,Physical-subsidence,Physical-sea-level,Physical-water-scarcity,Physical-extreme-weather,Physical-infrastructure,Physical-general,...,Legal-access,Legal-contract,Legal-penalties,Legal-negligence,Legal-insurance,Legal-disclosure,Legal-breach,Legal-general,justification,has_any_label
0,A Beginner’s Guide and Checklist for Accessing...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,The clause provides a detailed guide and check...,True
1,"Allocating Scope 1, 2 and 3 Emissions for Leas...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This clause helps mitigate transition risks by...,True
2,Auditing Water Usage in Supply Chains,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,The clause helps mitigate risks related to phy...,True
3,Avoiding Excessive Paperwork in Dispute Resolu...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,This clause focuses on reducing the environmen...,True
4,Benchmarking of Project Greenhouse Gas Emissions,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This clause helps mitigate risks associated wi...,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,Target Product Carbon Footprint (Schedule for ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This clause focuses on setting and reducing th...,True
118,Template Board Paper for Significant Contracts...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,This clause helps mitigate a variety of risks ...,True
119,The Net Zero Standard for Suppliers,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This clause helps reduce exposure to future re...,True
120,The ‘Green Supplier’ Contract – A Standardised...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,The clause focuses on incentivizing suppliers ...,True
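The `has_any_label` column in the table above is not created in any of the cells shown. One plausible way to derive it, assuming the label columns hold the strings "0" and "1" as written by `format_classification_result`, is sketched here on a toy frame with invented rows.

```python
import pandas as pd

# Toy stand-in for results_df with two of the label columns.
risk_labels = ["Physical-flooding", "Legal-disclosure"]
toy = pd.DataFrame(
    {
        "name": ["clause A", "clause B"],
        "Physical-flooding": ["0", "0"],
        "Legal-disclosure": ["1", "0"],
        "justification": ["demo", "Invalid JSON response"],
    }
)

# Convert the "0"/"1" strings to integers, then flag rows with at least one 1.
flags = toy[risk_labels].astype(int)
toy["has_any_label"] = flags.any(axis=1)
print(toy[["name", "has_any_label"]])
```

The `astype(int)` step matters: the string "0" is truthy in Python, so the conversion makes sure unlabelled rows come out as `False` regardless of how the check is implemented.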


In [89]:
results_df.has_any_label.value_counts()

has_any_label
True     112
False     10
Name: count, dtype: int64
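The 10 rows without a label can arise in more than one way given the code above: the model may have returned an empty list, returned only labels outside the taxonomy, or produced a reply that failed JSON parsing (in which case the justification reads "Invalid JSON response"). A quick way to tell these apart, and to see how often each label is used, is sketched below on invented data.

```python
import pandas as pd

# Toy stand-in for results_df; rows and values are invented.
label_cols = ["Physical-flooding", "Legal-disclosure"]
toy = pd.DataFrame(
    {
        "name": ["clause A", "clause B", "clause C"],
        "Physical-flooding": ["1", "0", "0"],
        "Legal-disclosure": ["1", "0", "1"],
        "justification": ["demo", "Invalid JSON response", "demo"],
        "has_any_label": [True, False, True],
    }
)

# Rows with no label, shown with their justification to reveal parse failures.
unlabelled = toy.loc[~toy["has_any_label"], ["name", "justification"]]
print(unlabelled)

# How many clauses received each label, most frequent first.
label_counts = toy[label_cols].astype(int).sum().sort_values(ascending=False)
print(label_counts)
```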

In [90]:
# Save the classification results to CSV
results_df.to_csv('risk_classification_results.csv', index=False)