# Variant 1.2: TFIDF-SRT-LegalBERT (Normalized)
#### Overview:
• The input document is first tokenized using the LegalBERT tokenizer.
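• Duplicate tokens are removed, so each distinct token is kept exactly once.
• The remaining tokens are sorted in descending order by score. In the code below, the score is each token's IDF weight from a TF-IDF vectorizer fitted on the split being processed, and only the 512 highest-scoring tokens are kept.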
• The resulting ordered token string is re-tokenized (if needed) and fed into LegalBERT for classification.
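
The deduplicate-and-sort step can be illustrated on a toy example. This is a minimal sketch, not the notebook's implementation: the function name `order_tokens_by_idf` and the weights in `toy_idf` are hypothetical and chosen only for illustration.

```python
def order_tokens_by_idf(tokens, idf, max_tokens=512):
    """Deduplicate tokens, then rank them by descending IDF weight."""
    # dict.fromkeys keeps the first occurrence of each token, in document order.
    unique_tokens = list(dict.fromkeys(tokens))
    # sorted() is stable, so tokens with equal weight stay in document order.
    ranked = sorted(unique_tokens, key=lambda tok: idf.get(tok, 0.0), reverse=True)
    return ranked[:max_tokens]


# Hypothetical IDF weights, for illustration only.
toy_idf = {"the": 1.0, "court": 1.2, "certiorari": 3.5, "habeas": 4.1}
toy_tokens = ["the", "court", "the", "habeas", "certiorari", "court"]

ordered = order_tokens_by_idf(toy_tokens, toy_idf)
print(" ".join(ordered))  # habeas certiorari court the
```

Rare tokens move to the front of the string and frequent ones to the back, so the 512-token cut-off discards the most common tokens first.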


## Explanation of Variant 1.2:

• The TF-IDF vectorizer builds a dictionary that maps each sub-word token to its inverse document frequency (IDF) value.
• The process_documents_sequential function deduplicates the tokens of each document and sorts them by that IDF value, highest first.
• The resulting ordered token string is then tokenized again to produce input IDs suitable for LegalBERT.
• Finally, these inputs are fed into the model for classification.
• This variant does not modify the internal architecture of LegalBERT; it only changes the input text.
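
To make the IDF values concrete, the sketch below recomputes them by hand with the standard library. It assumes scikit-learn's default smoothing (`smooth_idf=True`), which the vectorizer in this notebook does not override, so that idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name `smooth_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def smooth_idf(tokenized_docs):
    """Smoothed IDF per token: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each token at most once per document (document frequency).
        doc_freq.update(set(tokens))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


# Toy pre-tokenized documents, for illustration only.
toy_docs = [
    ["the", "court", "held"],
    ["the", "petition", "##er"],
    ["the", "court", "denied"],
]

idf = smooth_idf(toy_docs)
for tok in ("the", "court", "held"):
    print(tok, round(idf[tok], 4))
```

A token that appears in every document gets the minimum weight of 1.0, and the weight grows as the token appears in fewer documents.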

In [1]:
from datasets import load_dataset

dataset = load_dataset("victorambrose11/normalized_scotus")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1400
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1400
    })
})

In [2]:
# Document lengths are measured in characters (len() on a string), not tokens
lengths = [len(text) for text in dataset['train']['text']]
average = round(sum(lengths) / len(lengths))
highest = max(lengths)
print(f"The average length of documents in the training dataset is {average} characters\n"
      f"The longest document in the dataset contains {highest} characters")

The average length of documents in the training dataset is 37956 characters
The longest document in the dataset contains 584365 characters


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
import os

# Set this to avoid tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def initialize_model():
    model_name = "nlpaueb/legal-bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    return tokenizer, model

def compute_tfidf_dict(documents, tokenizer):
    """Return a {sub-word token: IDF weight} dict fitted on `documents`."""
    print("Tokenizing documents...")
    # Tokenize every document into LegalBERT sub-word tokens
    tokenized_docs = [
        tokenizer.tokenize(doc)
        for doc in tqdm(documents, desc="Tokenizing", position=0)
    ]

    # Documents are already tokenized, so the vectorizer must pass them through untouched
    def identity_tokenizer(tokens):
        return tokens

    tfidf_vectorizer = TfidfVectorizer(
        tokenizer=identity_tokenizer,
        preprocessor=identity_tokenizer,
        token_pattern=None,  # unused with a custom tokenizer; None silences the sklearn warning
        lowercase=False
    )

    print("Computing TF-IDF matrix...")
    with tqdm(total=1, desc="TF-IDF Computation", position=0) as pbar_tfidf:
        # Only the fitted IDF weights are needed, not the TF-IDF matrix itself
        tfidf_vectorizer.fit(tokenized_docs)
        feature_names = tfidf_vectorizer.get_feature_names_out()
        pbar_tfidf.update(1)

    return dict(zip(feature_names, tfidf_vectorizer.idf_))

def process_documents_sequential(documents, tokenizer, batch_size=1000):
    # Compute TF-IDF dictionary once
    idf_dict = compute_tfidf_dict(documents, tokenizer)
    
    processed_docs = []
    total_batches = (len(documents) + batch_size - 1) // batch_size
    
    # Create progress bars
    main_pbar = tqdm(total=len(documents), desc="Overall Progress", position=0)
    batch_pbar = tqdm(total=total_batches, desc="Batch Progress", position=1, leave=False)
    
    try:
        # Process in batches
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            
            # Process each document in the batch
            for doc in batch:
                # Tokenize
                tokens = tokenizer.tokenize(doc)
                
                # Deduplicate while keeping first-occurrence order, then look up each score
                unique_tokens_dict = {token: idf_dict.get(token, 0) for token in dict.fromkeys(tokens)}
                
                # Sort tokens by descending score; the sort is stable, so ties keep document order
                ordered_tokens = sorted(
                    unique_tokens_dict.keys(),
                    key=lambda x: unique_tokens_dict[x],
                    reverse=True
                )[:512]  # keep at most 512 tokens (special tokens are not counted here)
                
                processed_docs.append(" ".join(ordered_tokens))
                main_pbar.update(1)
            
            # Update batch progress
            batch_pbar.update(1)
            current_batch = i // batch_size + 1
            tqdm.write(f"Completed batch {current_batch}/{total_batches}")
    
    finally:
        # Close progress bars
        main_pbar.close()
        batch_pbar.close()
    
    return processed_docs
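
Because `process_documents_sequential` fits the IDF dictionary on whatever documents it receives, each split is ranked with its own statistics. If the intent is to rank every split with weights fitted on the training split only, one possible arrangement is sketched below. The helpers `fit_idf` and `rank_documents` are hypothetical, use only the standard library, and operate on already tokenized toy documents; they are not part of the notebook.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Hypothetical helper: smoothed IDF weights from pre-tokenized documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: count each token once per document.
    doc_freq = Counter(tok for tokens in tokenized_docs for tok in set(tokens))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def rank_documents(tokenized_docs, idf, max_tokens=512):
    """Hypothetical helper: order each document's unique tokens by a shared IDF."""
    ranked = []
    for tokens in tokenized_docs:
        unique = dict.fromkeys(tokens)  # first occurrence, document order
        # Tokens missing from `idf` score 0 and therefore sort last.
        ordered = sorted(unique, key=lambda tok: idf.get(tok, 0.0), reverse=True)
        ranked.append(ordered[:max_tokens])
    return ranked


# Toy splits, for illustration only.
train_tokens = [["the", "court", "held"], ["the", "court", "denied", "certiorari"]]
test_tokens = [["the", "certiorari", "petition"]]

train_idf = fit_idf(train_tokens)                     # fit once, on train only
ranked_test = rank_documents(test_tokens, train_idf)  # reuse for other splits
print(ranked_test)  # [['certiorari', 'the', 'petition']]
```

Scoring unseen tokens as 0 mirrors the `idf_dict.get(token, 0)` fallback in the cell above; whether tokens absent from the training split should instead rank first is a design choice this sketch leaves open.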

In [4]:
from datasets import Dataset, DatasetDict, Features, Value

# Initialize model and tokenizer
print("Initializing model and tokenizer...")
with tqdm(total=1, desc="Initialization", position=0) as pbar:
    tokenizer, model = initialize_model()
    pbar.update(1)

# Process documents sequentially
train_docs = process_documents_sequential(
    documents=dataset['train']['text'],
    tokenizer=tokenizer,
    batch_size=1000
)

test_docs = process_documents_sequential(
    documents=dataset['test']['text'],
    tokenizer=tokenizer,
    batch_size=1000
)

validation_docs = process_documents_sequential(
    documents=dataset['validation']['text'],
    tokenizer=tokenizer,
    batch_size=1000
)


# Reuse the original label feature so label names and ids stay consistent across splits
label_feature = dataset['train'].features['label']

# Define consistent features
features = Features({
    "text": Value("string"),
    "label": label_feature
})

# Create new dataset with processed texts
new_train_dict = {
    "text": train_docs,
    "label": dataset['train']['label']
}

new_test_dict = {
    "text": test_docs,
    "label": dataset['test']['label']
}

new_validation_dict = {
    "text": validation_docs,
    "label": dataset['validation']['label']
}


# Create new dataset with the consistent features
train_with_features = Dataset.from_dict(
    new_train_dict,
    features=features
)


test_with_features = Dataset.from_dict(
    new_test_dict,
    features=features
)


validation_with_features = Dataset.from_dict(
    new_validation_dict,
    features=features
)


# Update the dataset
new_dataset = DatasetDict({
    'train': train_with_features,
    'test': test_with_features,
    'validation': validation_with_features
})

Initializing model and tokenizer...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Initialization: 100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


Tokenizing documents...


Token indices sequence length is longer than the specified maximum sequence length for this model (4363 > 512). Running this sequence through the model will result in indexing errors
Tokenizing: 100%|██████████| 5000/5000 [01:24<00:00, 59.24it/s]


Computing TF-IDF matrix...


TF-IDF Computation: 100%|██████████| 1/1 [00:06<00:00,  6.02s/it]
                                                                    
Overall Progress:  20%|██        | 1007/5000 [00:15<00:50, 78.39it/s]

Completed batch 1/5


                                                                      
Overall Progress:  40%|████      | 2004/5000 [00:30<00:39, 75.15it/s]

Completed batch 2/5


                                                                     
Overall Progress:  60%|██████    | 3015/5000 [00:45<00:32, 61.49it/s]

Completed batch 3/5


                                                                     
Overall Progress:  80%|████████  | 4002/5000 [01:06<00:29, 34.28it/s]

Completed batch 4/5


                                                                     
Overall Progress: 100%|██████████| 5000/5000 [01:32<00:00, 53.97it/s]


Completed batch 5/5
Tokenizing documents...


Tokenizing: 100%|██████████| 1400/1400 [00:33<00:00, 41.31it/s]


Computing TF-IDF matrix...


TF-IDF Computation: 100%|██████████| 1/1 [00:02<00:00,  2.22s/it]
                                                                     
Overall Progress:  72%|███████▏  | 1005/1400 [00:26<00:11, 35.54it/s]

Completed batch 1/2


                                                                     
Overall Progress: 100%|██████████| 1400/1400 [00:36<00:00, 38.66it/s]


Completed batch 2/2
Tokenizing documents...


Tokenizing: 100%|██████████| 1400/1400 [00:31<00:00, 43.90it/s]


Computing TF-IDF matrix...


TF-IDF Computation: 100%|██████████| 1/1 [00:01<00:00,  1.92s/it]
                                                                    
Overall Progress:  72%|███████▏  | 1004/1400 [00:24<00:11, 35.20it/s]

Completed batch 1/2


                                                                     
Overall Progress: 100%|██████████| 1400/1400 [00:34<00:00, 40.16it/s]


Completed batch 2/2


In [5]:
# Push the processed dataset to the Hugging Face Hub
new_dataset.push_to_hub("victorambrose11/lex_glue_normalized_TFIDF-SRT")

Creating parquet from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 94.53ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:10<00:00, 10.16s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 126.05ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:05<00:00,  5.48s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 132.47ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:03<00:00,  3.63s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/victorambrose11/lex_glue_normalized_TFIDF-SRT/commit/f6dd1b5e3b28aac57a8dacea8a721987837be227', commit_message='Upload dataset', commit_description='', oid='f6dd1b5e3b28aac57a8dacea8a721987837be227', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/victorambrose11/lex_glue_normalized_TFIDF-SRT', endpoint='https://huggingface.co', repo_type='dataset', repo_id='victorambrose11/lex_glue_normalized_TFIDF-SRT'), pr_revision=None, pr_num=None)