In [8]:
!pip install transformers rouge-score



## **Import Libraries and Load Models**

In [9]:
import os
import torch  # Import torch for tensor operations and model inference
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline, BertTokenizer, BertForSequenceClassification
from sklearn.metrics import classification_report

# Load the BART summarization model
summarization_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
summarization_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

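# Load a BERT model for classification.
# Note: 'bert-base-uncased' has no trained classification head, so the classifier
# weights are randomly initialized. Fine-tune the model before trusting its predictions.
classification_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
classification_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Adjust num_labels as needed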

# Create pipelines
summarization_pipeline = pipeline("summarization", model=summarization_model, tokenizer=summarization_tokenizer)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
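
The last warning above says the pipeline was created without a `device` argument, so summarization runs on the CPU. A minimal sketch of one way to pick the value to pass, assuming the integer convention for `device` (`-1` for CPU, a non-negative GPU index otherwise). The helper name `pipeline_device_index` is hypothetical and not part of the original notebook.

```python
def pipeline_device_index(cuda_available: bool, gpu_index: int = 0) -> int:
    """Return an integer suitable for the `device` argument of `pipeline`.

    Assumed convention: -1 selects the CPU, a non-negative value selects that GPU.
    """
    return gpu_index if cuda_available else -1


# In the notebook this would be wired up roughly as follows (not executed here):
#   device = pipeline_device_index(torch.cuda.is_available())
#   summarization_pipeline = pipeline("summarization", model=summarization_model,
#                                     tokenizer=summarization_tokenizer, device=device)
print(pipeline_device_index(False))
print(pipeline_device_index(True))
```

Passing the availability flag in as a plain boolean keeps the helper independent of `torch`, so it can be checked without a GPU.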


## **Load and Preprocess the Dataset**

In [10]:
# Define the dataset paths
train_judgement_path = '/kaggle/input/legal-case-document-summarization/dataset/UK-Abs/train-data/judgement'
test_judgement_path = '/kaggle/input/legal-case-document-summarization/dataset/UK-Abs/test-data/judgement'

# Load and preprocess the text files from the training data
judgement_files_train = os.listdir(train_judgement_path)

# Ensure input text is within the model's maximum length
max_input_length = 1024  # BART's maximum input length, in tokens

# Process a small sample of files from the training data, truncating each to the token limit
processed_texts_sample_train = []
for filename in judgement_files_train[:10]:  # Limit to the first 10 files to keep summarization fast
    file_path = os.path.join(train_judgement_path, filename)
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error reading file {filename}: {e}")
        continue
    # Truncate at the token level, then decode back to plain text for the pipeline
    tokenized_text = summarization_tokenizer.encode(text, truncation=True, max_length=max_input_length)
    decoded_text = summarization_tokenizer.decode(tokenized_text, skip_special_tokens=True)
    processed_texts_sample_train.append(decoded_text)
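
Truncating to 1024 tokens keeps only the opening of each judgement, and court judgements are often much longer than that. One possible alternative, sketched below under stated assumptions, is to split the token ids into overlapping windows and summarize each window separately. The function name `split_into_windows` and the `overlap` value are hypothetical; in practice the window size would also need to leave room for any special tokens the tokenizer adds.

```python
from typing import List


def split_into_windows(token_ids: List[int], window: int = 1024, overlap: int = 128) -> List[List[int]]:
    """Split a token-id sequence into overlapping windows of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows


# Toy example: small integers stand in for real token ids
chunks = split_into_windows(list(range(10)), window=4, overlap=1)
print(chunks)
```

Each window could then be decoded and passed to the summarization pipeline, with the partial summaries joined afterwards. Whether that gives better summaries than plain truncation would have to be checked on this dataset.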

## **Generate Summaries**

In [11]:
# Generate summaries for the loaded and truncated documents
summaries = []
for text in processed_texts_sample_train:
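    # truncation=True guards against inputs that exceed the model limit after re-tokenization
    summary = summarization_pipeline(text, max_length=150, min_length=40, do_sample=False, truncation=True)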
    summaries.append(summary[0]['summary_text'])

# Display the summaries
for i, summary in enumerate(summaries):
    print(f"Document {i+1} Summary:")
    print(summary)
    print("\n")

Document 1 Summary:
From 4 April 2005 until 3 December 2012, English law provided for the imposition of sentences of imprisonment for public protection. The case is before the Supreme Court as an application for permission to appeal, with the appeal to follow if permission is granted.


Document 2 Summary:
HMRC claim that Mr Fowler's income from diving engagements is subject to UK taxation. Mr Fowler denies that he is a self employed contractor. The issue depends on how the double taxation treaty between the UK and South Africa applies to a person in his position.


Document 3 Summary:
The appeal is the latest to be heard at the Court of Session in Glasgow. The case involves three fields near the village of Killearn. The fields were let out to a farming partnership under separate leases in 1981 and 1983. The trust acquired the fields because of their potential for residential development.


Document 4 Summary:
The need for reliable guidance on this issue is growing day by day. Both app
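
The notebook installs `rouge-score` but never scores the generated summaries. As a rough illustration of what a ROUGE-1 F1 score measures, here is a simplified pure-Python sketch based on whitespace-token overlap. It is not the `rouge-score` package, which tokenizes (and optionally stems) differently, and the two example sentences are made up for the demonstration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated summary and reference summary
score = rouge1_f1("the court allowed the appeal", "the court dismissed the appeal")
print(f"ROUGE-1 F1: {score:.2f}")
```

If reference summaries are available for the sampled judgements, the same comparison could be run for each generated summary against its reference.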

## **Classification of Documents**

In [12]:
# Convert summaries (or original texts) to tokenized inputs for classification
classification_inputs = classification_tokenizer(summaries, padding=True, truncation=True, max_length=128, return_tensors="pt")

# Perform classification
with torch.no_grad():
    outputs = classification_model(**classification_inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

# Map predictions to labels (Assuming binary classification: 0 = 'Type A', 1 = 'Type B')
labels = {0: 'Type A', 1: 'Type B'}
predicted_labels = [labels[pred.item()] for pred in predictions]

# Display the classification results
for i, label in enumerate(predicted_labels):
    print(f"Document {i+1} Classified as: {label}")

Document 1 Classified as: Type A
Document 2 Classified as: Type B
Document 3 Classified as: Type B
Document 4 Classified as: Type B
Document 5 Classified as: Type B
Document 6 Classified as: Type B
Document 7 Classified as: Type B
Document 8 Classified as: Type B
Document 9 Classified as: Type B
Document 10 Classified as: Type B
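
The cell above reports only the arg-max class. Converting the logits to probabilities shows how confident each prediction is, which matters here because the classification head is untrained. A minimal pure-Python sketch of softmax follows; the logit values are hypothetical, and in the notebook the tensor equivalent would be `torch.softmax(outputs.logits, dim=1)`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one document: [Type A, Type B]
probs = softmax([0.10, 0.25])
print([round(p, 3) for p in probs])
```

Probabilities close to 0.5 for both classes would indicate that the label is barely better than a coin flip.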


## **Evaluate Classification Performance**

In [13]:
# Dummy reference labels (for demonstration purposes)
reference_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # Replace with actual labels

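# Generate a classification report.
# The reference labels are dummies and the classifier head is untrained, so these numbers are illustrative only.
print("Classification Report:")
print(classification_report(reference_labels, predictions.tolist(), target_names=['Type A', 'Type B']))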

Classification Report:
              precision    recall  f1-score   support

      Type A       1.00      0.20      0.33         5
      Type B       0.56      1.00      0.71         5

    accuracy                           0.60        10
   macro avg       0.78      0.60      0.52        10
weighted avg       0.78      0.60      0.52        10
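
The figures in the report can be reproduced by hand from the dummy reference labels and the predictions printed earlier (Document 1 as Type A, all others as Type B). The sketch below recomputes precision, recall, and F1 per class in plain Python. The helper name `per_class_metrics` is hypothetical.

```python
from typing import List, Tuple

reference = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # dummy labels from the notebook
predicted = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 0 = Type A, 1 = Type B, as printed above


def per_class_metrics(ref: List[int], pred: List[int], label: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one class label."""
    tp = sum(1 for r, p in zip(ref, pred) if r == label and p == label)
    fp = sum(1 for r, p in zip(ref, pred) if r != label and p == label)
    fn = sum(1 for r, p in zip(ref, pred) if r == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


type_a = per_class_metrics(reference, predicted, 0)
type_b = per_class_metrics(reference, predicted, 1)
accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)

print("Type A:", [round(x, 2) for x in type_a])
print("Type B:", [round(x, 2) for x in type_b])
print("Accuracy:", round(accuracy, 2))
```

Type A has perfect precision only because a single document was predicted as Type A and it happened to match the dummy label. Its recall of 0.20 shows that four of the five Type A documents were missed.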

