# Legal model  

This model is designed to exctract importantant sentances that summarize the imput text.  

Adding extract_definitions to identify and extract definitions from the input text. It uses a regular expression pattern to find sentences that may contain definition like structures.

**Note** This regular expression will need to be altered or additional regex will be needed to be able to accommodate a wide variety of text rather than the sample used  to build this model. 

Moded score_sentence to take into account the relevance of the sentence to the defined terms. If a sentence contains a defined term or its definition, the sentence score is increased.

Added a section to print the extracted definitions after loading.

This model uses **punkt** and **BERT-Legal**  

**punkt**  is a NLP toolkit that is part of the Natural Language Toolkit NLTK library. Its primary function is sentence tokenization, which means it helps split text into individual sentences.
CFTC blocks direct usage and download of punkt, you will need to download directly from the huggingface website and its necessary config files.

**BERT-Legal** is a specialized version of the BERT (Bidirectional Encoder Representations from Transformers) language model that has been fine-tuned specifically for legal domain tasks.
You will also need to download  directly  from the huggingface website and its config files ( config, flax_model, pytorch_model, special_tokens_map, tf_model, tokenizer_config, vocab)

https://huggingface.co/

packages will need to be installed via conda 

In [1]:
# packages
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import re 
import pandas as pd 
# Specify the directory where punkt is already downloaded
nltk_data_dir = r"D:/Data/OneDrive/Ccantu/OneDrive - CFTC/Documents/Python Scripts/punkt"

# Add the directory to NLTK's data path
nltk.data.path.append(nltk_data_dir)

# Load model function 
def load_local_legal_bert():
    model_path = r"D:/Data/OneDrive/Ccantu/OneDrive - CFTC\Documents/Python Scripts/BERT-Legal"
    print(f"Loading the Legal-BERT model from '{model_path}'...")
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)

    print("Legal-BERT model loaded successfully!")
    return tokenizer, model
# tuning
def extract_definitions(text):
    definition_pattern = r"(?P<term>\w+)\s+(?:is|means)\s+(?P<definition>.*?)[;.]" # looking for definition patterns 
    definitions = {}
    for match in re.finditer(definition_pattern, text, re.IGNORECASE):
        term = match.group("term")
        definition = match.group("definition").strip()
        definitions[term] = definition
    return definitions
# Tuning
def score_sentence(sentence, definitions, tfidf_matrix, sentence_similarities):
    score = 0
    for term, definition in definitions.items():
        if term in sentence or definition in sentence: 
            score += sentence_similarities[0,1] # may need to change assume first senetence is important 
    return score 

def extractive_summarize(text, num_sentences=3):
    sentences = nltk.sent_tokenize(text)
    definitions = extract_definitions(text) 

    # Create TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    
    # Compute sentence similarities
    sentence_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
    
    # Rank sentences based on similarity scores
    sentence_scores = [score_sentence(sentence, definitions, tfidf_matrix, sentence_similarities) for sentence in sentences]
    ranked_sentences = [sentences[i] for i in np.argsort(sentence_scores)[::-1]]
    
    # Select top sentences
    summary = ' '.join(ranked_sentences[:num_sentences])
    return summary

def process_with_legal_bert(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Optional used to search document 
def search_document(text, search_term):
    sentences = nltk.sent_tokenize(text)
    matches = []
    for sentence in sentences:
        if re.search(search_term, sentence, re.IGNORECASE):
            matches.append(sentence)
    return matches

# Load Legal-BERT call 
legal_bert_tokenizer, legal_bert_model = load_local_legal_bert()

# Example legal text
# text = """

# Except where otherwise expressly provided in this Covenant or by the terms of the present Treaty, decisions at any meeting of the Assembly or of the Council shall require the agreement of all the Members of the League represented atthe meeting. All matters of procedure at meetings of the Assembly or of the Council, including the appointment of Committees to investigate particular matters, shall be regulated by the Assembly orby the Council and may be decided by a majority of the Members of the League represented at the meeting. Thefirst meeting of the Assembly and the first meeting of the Council shall be summoned by the President of the United States of America.
# """
text = input("Enter the legal text: ")

print("\nOriginal text:")
print(text)

# Extract definitions
definitions = extract_definitions(text)
print("\nDefinitions found:")
for term, definition in definitions.items():
    print(f"{term} means {definition}")

# Process with Legal-BERT
bert_output = process_with_legal_bert(text, legal_bert_tokenizer, legal_bert_model)
print("\nLegal-BERT processing complete. Output shape:", bert_output.shape)

# Generate summary
summary = extractive_summarize(text)
print("\nGenerated Summary:")
print(summary)

# Search loop optional  
# while True:
#     search_term = input("\Enter a search term (or q to exit):")
#     if search_term.lower() == 'q':
#         break 
#     matches = search_document(text, search_term)
#     if matches:
#         print(f"\nFound {len(matches)} match(es) for '{search_term}':")
#         for i, match in enumerate(matches, 1):
#             print(f"{i}. {match}")
#     else:
#         print(f"No matches found for '{search_term}'.")



Loading the Legal-BERT model from 'D:/Data/OneDrive/Ccantu/OneDrive - CFTC\Documents/Python Scripts/BERT-Legal'...
Legal-BERT model loaded successfully!

Original text:
45.1 Definitions. This content is from the eCFR and is authoritative but unofficial. (a) As used in this part: Allocation means the process by which an agent, having facilitated a single swap transaction on behalf of several clients, allocates a portion of the executed swap to the clients. As soon as technologically practicable means as soon as possible, taking into consideration the prevalence, implementation, and use of technology by comparable market participants. 17 CFR Part 45 (up to date as of 8/22/2024) Swap Data Recordkeeping and Reporting Requirements 17 CFR Part 45 (Aug. 22, 2024) 17 CFR 45.1(a) “As soon as technologically practicable” (enhanced display) page 1 of 51 Asset class means a broad category of commodities, including, without limitation, any “excluded commodity” as defined in section 1a(19) of the Ac

## Train  the gpt model on the on extracted definitions
### Extracted Definitions and CSV Import Merge

The block below takes the extracted definitions from the legal model and merges them with a csv containing simplified legal terms. 

In [2]:
import pandas as pd

# Path to the CSV file
csv_path = "D:/Data/OneDrive/Ccantu/OneDrive - CFTC/Documents/Python Scripts/CFT_terms_and_simple.csv"

# Read the CSV file into a DataFrame
df_csv = pd.read_csv(csv_path)

# Extract definitions from the text
definitions = extract_definitions(text)

# Convert the extracted definitions dictionary to a list of tuples
definitions_list = [(term, definition) for term, definition in definitions.items()]

# Create a DataFrame from the list of tuples
df_extracted = pd.DataFrame(definitions_list, columns=['Term', 'Definition'])

# Print the DataFrame containing the extracted definitions
print("\nExtracted Definitions DataFrame:")
print(df_extracted)

# Merge the DataFrames on the 'Term' column
df_combined = pd.merge(df_csv, df_extracted, on='Term', how='outer', suffixes=('_CSV', '_Extracted'))

# Print the combined DataFrame
print("\nCombined DataFrame:")
#print(df_combined)
#df_combined


Extracted Definitions DataFrame:
             Term                                         Definition
0   Authorization                                     required to: a
1       equipment                                         available)
2            NMFS  able to review the circumstances of the prohib...
3            what  comparable as between specific CEA and CFTC re...
4           death  not associated with or related to the activiti...
5           which  based in London, for example, has been registe...
6          Notice  being issued in connection with the resolution...
7       regulator  Bank of England), Natural Gas Exchange (home c...
8    organization  subject to comparable, comprehensive supervisi...
9            that  greater than 100% of the DCO’s initial margin ...
10         person                               registered as an FCM
11           This  because EU-based CCPs offer, or are seeking to...
12          state  responsible for implementing the EMIR Framewor...


### TRAIN

Below takes a dataset of legal terms and their simplified definitions, fine-tunes the GPT-2 model on this data, and saves the resulting model. The goal here is to create a model that can generate simplified explanations for legal terms specific to the CFTC when given the summary. 

Alter and run this script whenever you want to use the model to simplify legal terms, without needing to retrain the model each time. The training only needs to be done once (or periodically if you want to update the model with new data).
Remember  the quality of the simplifications will depend on the training data and the effectiveness of the fine-tuning process. Might need to experiment with different prompts or post-processing steps to get the best results for your specific use case.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, PreTrainedModel, GPT2Config
from datasets import Dataset
import pandas as pd
import torch
from torch.nn import CrossEntropyLoss, Linear

# Assum df_combined is already defined above 
# Move non-empty column Definition_Extracted to Simplified Definition
mask = df_combined['Definition_Extracted'].notna() & (df_combined['Definition_Extracted'] != '')
df_combined.loc[mask, 'Simplified Definition'] = df_combined.loc[mask, 'Definition_Extracted']

# Prepares data
# Pairs prompts with their corresponding simplified definitions
def prepare_data(df_combined):
    prompts = [f"Simplify this legal term: {term}" for term in df_combined['Term']]
    completions = df_combined['Simplified Definition'].tolist()
    return [{"prompt": p, "completion": c} for p, c in zip(prompts, completions)]

train_data = prepare_data(df_combined)

# Create a dataset object
dataset = Dataset.from_list(train_data)

# Load original gpt-2 model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add special tokens if needed
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
base_model.resize_token_embeddings(len(tokenizer))

# Implements forward pass  for training 
# The class allows us to use the pre-trained GPT-2 model as a base, but modify its behavior for our specific task of simplifying legal terms. The forward method, in particular, is customized to handle the calculation of loss during training.
class GPT2ForSimplification(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.transformer = base_model.transformer
        self.lm_head = Linear(config.n_embd, config.vocab_size, bias=False)
        self.loss_fct = CrossEntropyLoss() # Loss function
        
        # Copy weights from base model
        self.lm_head.weight.data = base_model.lm_head.weight.data.clone()
    # Embeddings 
    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def get_input_embeddings(self):
        return self.transformer.wte

    def set_input_embeddings(self, value):
        self.transformer.wte = value

    def forward(self, input_ids, attention_mask=None, labels=None):
        transformer_outputs = self.transformer(input_ids, attention_mask=attention_mask)
        hidden_states = transformer_outputs[0]
        logits = self.lm_head(hidden_states)

        loss = None
        if labels is not None:
            # shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = self.loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        return {"loss": loss, "logits": logits}

# Create custom model
config = GPT2Config.from_pretrained("gpt2", attn_implementation="eager")
model = GPT2ForSimplification(config)

# tokenize the data
def tokenize_function(examples):
    prompts = examples["prompt"]
    completions = examples["completion"]
    
    inputs = tokenizer(prompts, padding="max_length", truncation=True, max_length=64)
    targets = tokenizer(completions, padding="max_length", truncation=True, max_length=64)
    
    inputs["labels"] = targets["input_ids"]
    
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
)

# create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

# Train the model
trainer.train()

# Save the model and tokenizer 
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

Map:   0%|          | 0/92 [00:00<?, ? examples/s]

  0%|          | 0/36 [00:00<?, ?it/s]

{'train_runtime': 466.6132, 'train_samples_per_second': 0.591, 'train_steps_per_second': 0.077, 'train_loss': 4.950950198703342, 'epoch': 3.0}


('./fine_tuned_gpt2\\tokenizer_config.json',
 './fine_tuned_gpt2\\special_tokens_map.json',
 './fine_tuned_gpt2\\vocab.json',
 './fine_tuned_gpt2\\merges.txt',
 './fine_tuned_gpt2\\added_tokens.json',
 './fine_tuned_gpt2\\tokenizer.json')

## layperson summary using gpt model 

Designed to take the legal summary, use the GPT-2 model to simplify it for a layperson, and then output the simplified version. 

This calls the original pre trained  GPT-2 model. At this moment this model works best. 

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

def simplify_summary_for_layperson(summary, gpt2_model, gpt2_tokenizer):
    input_text = f"Simplify this legal text for a layperson: {summary}"
    inputs = gpt2_tokenizer(input_text, return_tensors='pt')
    
    outputs = gpt2_model.generate(
        **inputs,
        max_length=len(inputs['input_ids'][0]) + 200,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        do_sample=True
    )
    
    # max_length=len(inputs['input_ids'][0]) + 200:
    # This sets the maximum length of the generated sequence.
    # It's calculated as the length of the input plus 200 tokens, allowing room for the simplified explanation.

    # num_return_sequences=1:
    # This specifies that we want only one output sequence.


    # no_repeat_ngram_size=2:
    #This prevents the model from repeating any 2-gram two consecutive words in the generated text.
    # It helps reduce repetitive phrases in the output.


    # temperature=0.7:
    # This parameter controls the randomness of the output.
    # A lower value closer to 0 makes the output more deterministic and focused.
    # A higher value closer to 1 makes the output more diverse and potentially more creative.
    # 0.7 is a balancing between coherence and creativity.


    # top_k=50:
    # This limits the model to consider only the top 50 most likely next words at each step of the generation.
    # It helps prevent the model from choosing  unlikely words.


    # top_p=0.95:
    # This is nucleus sampling. It considers the smallest set of words where cumulative probability exceeds 0.95.
    # It's another way to focus the model's choices while allowing for some variability.


    # do_sample=True:
    # This enables sampling-based generation instead of always choosing the most likely next word.
    # It allows for  diverse outputs.
    simplified_summary = gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)
    simplified_summary = simplified_summary.replace(input_text, "").strip()
    
    return simplified_summary

# Load gpt-2
gpt2_model_path = r"D:/Data/OneDrive/Ccantu/OneDrive - CFTC/Documents/Python Scripts/GPT2"
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_model_path)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_path)



# Print
simplified_summary = simplify_summary_for_layperson(summary, gpt2_model, gpt2_tokenizer)
print("\nSimplified Summary for Laypeople:")
print(simplified_summary)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Simplified Summary for Laypeople:
(c) Any agent providing direct or indirect assistance to a reporting or clearing counter party shall comply with all applicable reporting and clearing counterparties standards, including, but not limited to, requirements of the Act.

§ 5.6.2 Reporting counterpartys.
, (d) If a swap dealer, counterparties, clearing agent, and reporting counterparty are required under this Section to report information or perform a transaction under the Exchange Act, each of them shall provide all required information to the other in such form as the filing officer may direct. A report must be filed with each swap counter and must include the following:
 I. All information provided to or used by an agent or reporting transaction reporting party, such as name, address, telephone number, date of birth, residence, occupation, employer, trade name and other business information, information concerning the transaction, any other information required for reporting purposes, w

## Testing the fined tuned approach 

Designed to take the legal summary, using the  fine-tuned GPT-2 model to simplify it for a layperson, and then output the simplified version.

At this moment the fine-tuned GPT-2 works horribly and needs work. 

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GPT2Config

def simplify_summary_for_layperson(summary, gpt2_model, gpt2_tokenizer):
    input_text = f"Simplify this legal text for a layperson: {summary}"
    inputs = gpt2_tokenizer(input_text, return_tensors='pt')
    
    outputs = gpt2_model.generate(
        **inputs,
        max_length=len(inputs['input_ids'][0]) + 200,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        do_sample=True
    )
    
    simplified_summary = gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)
    simplified_summary = simplified_summary.replace(input_text, "").strip()
    
    return simplified_summary

# Load gpt-2
gpt2_model_path = r"D:/Data/OneDrive/Ccantu/OneDrive - CFTC/Documents/Python Scripts/fine_tuned_gpt2"
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_model_path, ignore_mismatched_sizes=True)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_path)



# Print
simplified_summary = simplify_summary_for_layperson(summary, gpt2_model, gpt2_tokenizer)
print("\nSimplified Summary for Laypeople:")
print(simplified_summary)


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at D:/Data/OneDrive/Ccantu/OneDrive - CFTC/Documents/Python Scripts/fine_tuned_gpt2 and are newly initialized because the shapes did not match:
- lm_head.weight: found shape torch.Size([50258, 1024]) in the checkpoint and torch.Size([50257, 1024]) in the model instantiated
- transformer.wte.weight: found shape torch.Size([50258, 1024]) in the checkpoint and torch.Size([50257, 1024]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Simplified Summary for Laypeople:
normalnormal "/normal shamelessnormalriersnormal politiciannormal005normal Griffnormal �normalokunormalRELnormalGGGGGGGGnormalalysisnormal introducednormal Saranormal highernormalTweetnormal extremistsnormal frenzynormalhomenormal fsnormal expressesnormal Additionalnormal culmin "/ "/ shameless "/arate "/riers "/ � "/Lennormal SSnormal asyncnormal simnormal AZ "/Tweetriers � fs "/ Schednormal courtesynormal JunenormalNGnormalLen "/ politician "/ latternormal circlenormal gigsnormal pastnormal Sched "/ Griff "/ circle "/projectnormalbournormalcomplexnormal meetingnormal patchesnormalkeynormal humanitiesnormal scarednormalFilmnormalproject "/bour "/alysis "/ Additional "/"...normal latter shameless shameless �riers shameless fs shamelessriers Hopefullynormal Leathernormal intrinsicnormal pairnormal"...riersriersREL "/ introduced "/ fs Additional shameless humanities "/005 shameless SS "/ AZnormal lonormal Detailsnormalankingnormal NPRnormal promulg "/ p