<a href="https://colab.research.google.com/github/Vishal24-6/AI-Legal-Sentiment-Analyzer/blob/main/AI_Legal_Sentiment_Analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U transformers datasets scikit-learn nltk --quiet


In [None]:
import pandas as pd
import string
import re
import nltk
from nltk.corpus import stopwords

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

import torch
from torch.utils.data import DataLoader, Dataset as TorchDataset
from torch.optim import AdamW

from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm


In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
train_df = pd.read_csv('/content/legal_sentiment_train.csv')
test_df = pd.read_csv('/content/legal_sentiment_test.csv')


In [None]:
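def clean_text(text):
    """Lowercase, strip digits and punctuation, and drop English stop words."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # remove runs of digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove ASCII punctuation
    # NLTK's English stop-word list includes negations such as "not" and "no",
    # so those words are removed here as well.
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text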

train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)
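
To make the effect of `clean_text` concrete, here is a minimal, self-contained sketch. It uses a small hand-written stop-word set (`DEMO_STOP_WORDS`, a hypothetical stand-in for the NLTK list so the snippet runs without a download) and a copy of the function named `clean_text_demo`. One consequence worth noticing: because the stop-word list contains "not", a negated sentence and its affirmative counterpart can clean to the same string.

```python
import re
import string

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
DEMO_STOP_WORDS = {"the", "was", "of", "did", "not", "a", "by", "and"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_text, with the stop-word set passed in explicitly."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text_demo("The defendant was found guilty of negligence."))
# defendant found guilty negligence

negated = clean_text_demo("The court did not find the defendant guilty.")
affirmed = clean_text_demo("The court did find the defendant guilty.")
print(negated == affirmed)  # True: the negation is gone after cleaning
```

If negation matters for the labels, one option is to remove words such as "not" from `stop_words` before cleaning; whether that helps on this dataset has not been tested here.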


In [None]:
label2id = {label: i for i, label in enumerate(sorted(train_df['label'].unique()))}
id2label = {i: label for label, i in label2id.items()}
train_df['label_id'] = train_df['label'].map(label2id)
test_df['label_id'] = test_df['label'].map(label2id)
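
The mapping above can be illustrated without pandas. The sketch below assumes the three label names that appear in the classification report and uses a hypothetical `labels` list in place of the `label` column. Sorting before enumerating makes the ids deterministic regardless of row order.

```python
labels = ["positive", "negative", "neutral", "negative"]  # hypothetical label column

# Sort the distinct labels so each one always gets the same integer id.
label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
id2label = {i: label for label, i in label2id.items()}

print(label2id)                           # {'negative': 0, 'neutral': 1, 'positive': 2}
print([label2id[l] for l in labels])      # [2, 0, 1, 0]

# A label that never occurred in the training data has no id.
print(label2id.get("mixed"))              # None
```

With pandas, `Series.map` yields `NaN` for such unseen labels, so a check like `test_df['label_id'].isna().any()` before building the dataset may be worthwhile.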


In [None]:
class LegalDataset(TorchDataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_len)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
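
Because `LegalDataset` tokenizes all texts in a single call with `truncation=True, padding=True`, every example is padded to the longest sequence in that split, capped at `max_len`. The toy sketch below mimics that behavior on plain lists of token ids. The function `pad_batch`, the ids, and `pad_id=0` are illustrative assumptions, and the slicing is a simplification: the real tokenizer keeps the closing special token when it truncates.

```python
def pad_batch(sequences, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad all to the longest remaining one."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7, 102], [101, 7, 8, 9, 10, 102]], max_len=4)
print(ids)   # [[101, 7, 102, 0], [101, 7, 8, 9]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```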


In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_dataset = LegalDataset(train_df['clean_text'].tolist(), train_df['label_id'].tolist(), tokenizer)
test_dataset = LegalDataset(test_df['clean_text'].tolist(), test_df['label_id'].tolist(), tokenizer)





In [None]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(label2id)
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)


In [None]:
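device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()  # enable dropout for training

for epoch in range(20):
    print(f"\nEpoch {epoch+1}")
    loop = tqdm(train_loader)
    for batch in loop:
        # Move input_ids, attention_mask and labels to the same device as the model.
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # labels are in the batch, so the model returns a loss
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # reset gradients before the next batch
        loop.set_description(f"Epoch {epoch+1}")
        loop.set_postfix(loss=loss.item())  # loss of the current batch only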



Epoch 1


Epoch 1: 100%|██████████| 125/125 [00:06<00:00, 20.62it/s, loss=0.0219]



Epoch 2


Epoch 2: 100%|██████████| 125/125 [00:05<00:00, 24.12it/s, loss=0.00737]



Epoch 3


Epoch 3: 100%|██████████| 125/125 [00:05<00:00, 23.84it/s, loss=0.00339]



Epoch 4


Epoch 4: 100%|██████████| 125/125 [00:05<00:00, 24.56it/s, loss=0.00212]



Epoch 5


Epoch 5: 100%|██████████| 125/125 [00:05<00:00, 23.32it/s, loss=0.00157]



Epoch 6


Epoch 6: 100%|██████████| 125/125 [00:05<00:00, 24.75it/s, loss=0.0008]



Epoch 7


Epoch 7: 100%|██████████| 125/125 [00:05<00:00, 23.56it/s, loss=0.000922]



Epoch 8


Epoch 8: 100%|██████████| 125/125 [00:05<00:00, 24.71it/s, loss=0.000478]



Epoch 9


Epoch 9: 100%|██████████| 125/125 [00:05<00:00, 24.74it/s, loss=0.000348]



Epoch 10


Epoch 10: 100%|██████████| 125/125 [00:05<00:00, 23.35it/s, loss=0.000351]



Epoch 11


Epoch 11: 100%|██████████| 125/125 [00:05<00:00, 24.37it/s, loss=0.000219]



Epoch 12


Epoch 12: 100%|██████████| 125/125 [00:05<00:00, 23.22it/s, loss=0.000288]



Epoch 13


Epoch 13: 100%|██████████| 125/125 [00:05<00:00, 24.81it/s, loss=0.000208]



Epoch 14


Epoch 14: 100%|██████████| 125/125 [00:05<00:00, 23.94it/s, loss=0.000171]



Epoch 15


Epoch 15: 100%|██████████| 125/125 [00:05<00:00, 24.03it/s, loss=0.000158]



Epoch 16


Epoch 16: 100%|██████████| 125/125 [00:05<00:00, 24.83it/s, loss=0.000117]



Epoch 17


Epoch 17: 100%|██████████| 125/125 [00:05<00:00, 23.42it/s, loss=8.96e-5]



Epoch 18


Epoch 18: 100%|██████████| 125/125 [00:05<00:00, 24.84it/s, loss=0.000119]



Epoch 19


Epoch 19: 100%|██████████| 125/125 [00:05<00:00, 23.39it/s, loss=8.78e-5]



Epoch 20


Epoch 20: 100%|██████████| 125/125 [00:05<00:00, 24.98it/s, loss=7.82e-5]
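
The `loss=` value in each progress bar is the loss of the last batch in that epoch, which can be noisy. A minimal sketch of tracking the mean loss over an epoch instead is shown below; `RunningMean` and the sample values are hypothetical and not part of the notebook.

```python
class RunningMean:
    """Accumulates values and reports their arithmetic mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def mean(self):
        # Return 0.0 before any value has been recorded to avoid dividing by zero.
        return self.total / self.count if self.count else 0.0


batch_losses = [0.9, 0.5, 0.1]  # hypothetical per-batch losses
tracker = RunningMean()
for loss in batch_losses:
    tracker.update(loss)

print(round(tracker.mean, 4))  # 0.5
```

Inside the training loop this would amount to creating a fresh tracker at the start of each epoch, calling `tracker.update(loss.item())` per batch, and passing `tracker.mean` to `loop.set_postfix`.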


In [None]:
model.eval()
test_loader = DataLoader(test_dataset, batch_size=8)
preds, true_labels = [], []

with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        logits = outputs.logits
        preds.extend(torch.argmax(logits, dim=-1).cpu().numpy())
        true_labels.extend(batch['labels'].cpu().numpy())
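
The prediction step takes the index of the largest logit per row and later maps it back to a label name through `id2label`. The plain-Python sketch below shows that step in isolation; the `logits` values are hypothetical, and the `id2label` order assumes the alphabetically sorted labels from the classification report.

```python
id2label = {0: "negative", 1: "neutral", 2: "positive"}  # assumed from the sorted labels

def argmax(row):
    """Index of the largest value; ties resolve to the first occurrence."""
    return max(range(len(row)), key=lambda i: row[i])

logits = [[2.1, -0.3, -1.0], [-0.5, 0.2, 1.7]]  # hypothetical model outputs, one row per text
pred_ids = [argmax(row) for row in logits]

print(pred_ids)                          # [0, 2]
print([id2label[i] for i in pred_ids])   # ['negative', 'positive']
```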


In [None]:
print("\nClassification Report:\n")
print(classification_report(true_labels, preds, target_names=list(label2id.keys())))

print("\nAccuracy Score:")
print(accuracy_score(true_labels, preds))



Classification Report:

              precision    recall  f1-score   support

    negative       1.00      1.00      1.00       136
     neutral       1.00      1.00      1.00       133
    positive       1.00      1.00      1.00       131

    accuracy                           1.00       400
   macro avg       1.00      1.00      1.00       400
weighted avg       1.00      1.00      1.00       400


Accuracy Score:
1.0
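
A score of 1.0 on every class is worth a quick sanity check. One thing to rule out is that the same texts occur in both CSV files. The notebook does not show whether that is the case, so the sketch below only demonstrates the check on hypothetical lists; `overlap_report` is an illustrative helper, not part of the notebook.

```python
def overlap_report(train_texts, test_texts):
    """Return the test texts that also occur verbatim in train, and their share of the test set."""
    train_set = set(train_texts)
    shared = [text for text in test_texts if text in train_set]
    fraction = len(shared) / len(test_texts) if test_texts else 0.0
    return shared, fraction


# Hypothetical cleaned texts standing in for the clean_text columns.
train_texts = ["agreement signed parties", "defendant breached confidentiality agreement"]
test_texts = ["defendant breached confidentiality agreement", "panel examined evidence objectively"]

shared, fraction = overlap_report(train_texts, test_texts)
print(shared)    # ['defendant breached confidentiality agreement']
print(fraction)  # 0.5
```

On the real data the call would be `overlap_report(train_df['clean_text'].tolist(), test_df['clean_text'].tolist())`.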


In [None]:
test_df['predicted_sentiment'] = [id2label[p] for p in preds]

def show_examples(sentiment):
    """Print up to three test rows whose predicted sentiment matches."""
    subset = test_df[test_df['predicted_sentiment'] == sentiment]
    print(f"\n🔹 Example {sentiment.upper()} Sentiment Documents:\n")
    for _, row in subset.head(3).iterrows():
        print(f"Original: {row['text']}")
        print(f"Cleaned: {row['clean_text']}")
        print("-" * 80)

for sentiment in sorted(label2id.keys()):
    show_examples(sentiment)



🔹 Example NEGATIVE Sentiment Documents:

Original: The contract terms were ambiguous and misleading.
Cleaned: contract terms ambiguous misleading
--------------------------------------------------------------------------------
Original: The defendant was found guilty of negligence.
Cleaned: defendant found guilty negligence
--------------------------------------------------------------------------------
Original: The defendant breached the confidentiality agreement.
Cleaned: defendant breached confidentiality agreement
--------------------------------------------------------------------------------

🔹 Example NEUTRAL Sentiment Documents:

Original: The agreement was signed by both parties.
Cleaned: agreement signed parties
--------------------------------------------------------------------------------
Original: The panel examined the evidence objectively.
Cleaned: panel examined evidence objectively
--------------------------------------------------------------------------------
Orig

In [1]:
!jupyter nbconvert --ClearMetadataPreprocessor.enabled=True \
    --ClearOutputPreprocessor.enabled=True \
    --to notebook --output cleaned.ipynb AI_Legal_Sentiment_Analyzer.ipynb

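
As an alternative to the `nbconvert` command, outputs can also be cleared with the standard library alone, since a notebook file is JSON. The sketch below assumes the nbformat v4 layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`). `clear_outputs` is a hypothetical helper and is not guaranteed to match nbconvert's preprocessors in every detail; for example, it leaves notebook-level metadata untouched.

```python
import json

def clear_outputs(notebook):
    """Return a copy of a v4 notebook dict with code-cell outputs and cell metadata cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Minimal hypothetical notebook with one executed code cell.
demo = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {"id": "abc"},
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        }
    ],
}

cleaned = clear_outputs(demo)
print(cleaned["cells"][0]["outputs"])          # []
print(cleaned["cells"][0]["execution_count"])  # None
```

For a file on disk, load it with `json.load`, pass the result through `clear_outputs`, and write it back with `json.dump`.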