# AI Legal Triage - RoBERTa Training

This notebook fine-tunes a RoBERTa model on the CUAD dataset for legal clause classification. The model is part of a dual-track legal contract analysis system.

**NOTE: This notebook requires GPU acceleration!**

## Check GPU Availability

In [None]:
!nvidia-smi
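
If you'd rather check from Python, here is a minimal sketch. The helper name `gpu_available` is our own, not part of the repository. It asks PyTorch when it is importable and otherwise falls back to calling `nvidia-smi -L`, which lists the visible GPUs.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if a CUDA GPU appears usable from this environment."""
    try:
        import torch  # optional: may not be installed yet at this point
        return torch.cuda.is_available()
    except ImportError:
        pass
    # Fallback: look for the nvidia-smi binary and check that it lists a GPU.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_available())
```

If this prints `False` on Kaggle, the accelerator is probably not enabled in the notebook settings.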

## Clone the GitHub Repository

In [None]:
!git clone https://github.com/adarench/AI-Legal-Triage.git
%cd AI-Legal-Triage

## Install Dependencies

We need to install all the required packages for both the preprocessing and training steps.

In [None]:
!pip install datasets transformers torch pandas numpy scikit-learn python-dotenv
!pip install -r requirements.txt

## Fix Import Paths

Let's add the repository root to `sys.path` so the `bert_model` package can be imported when running on Kaggle.

In [None]:
# Add root directory to path
import sys
import os
sys.path.append(os.getcwd())
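
Re-running the cell above appends the same directory again each time. That is harmless, but a guarded version keeps `sys.path` tidy. This is a small sketch; `add_to_sys_path` is a hypothetical helper name, not something the repository provides.

```python
import os
import sys


def add_to_sys_path(path: str) -> bool:
    """Append `path` to sys.path unless it is already listed.

    Returns True if the path was added, False if it was already present.
    """
    resolved = os.path.abspath(path)
    if resolved in sys.path:
        return False
    sys.path.append(resolved)
    return True


added = add_to_sys_path(os.getcwd())
print("Added to sys.path:", added)
```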

## Create Necessary Directories

In [None]:
!mkdir -p data/cuad_processed
!mkdir -p bert_model/fine_tuned_roberta
!mkdir -p results

## Download and Preprocess the CUAD Dataset

We'll use our custom preprocessing script to download and prepare the CUAD dataset.

In [None]:
from bert_model.cuad_preprocessing import CUADPreprocessor
from bert_model.label_map import LABEL_MAP

# Initialize preprocessor
preprocessor = CUADPreprocessor(output_dir="data/cuad_processed")

# Download and preprocess dataset
print("Downloading and preprocessing CUAD dataset...")
train_df, val_df, test_df = preprocessor.download_and_preprocess()

# Create sample file
print("Creating sample clauses file...")
preprocessor.create_sample_file(filename="results/cuad_samples.json")

print("Preprocessing complete! Data saved to data/cuad_processed")
print(f"Number of training examples: {len(train_df)}")
print(f"Number of validation examples: {len(val_df)}")
print(f"Number of testing examples: {len(test_df)}")
print(f"Number of labels: {len(LABEL_MAP)}")

# Show label distribution
label_counts = train_df.label.value_counts()
print("\nLabel distribution in training set:")
for label, count in label_counts.items():
    print(f"  {label}: {count}")
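
Raw counts can hide how skewed the classes are, so it may help to print each label's share of the training set as well. The sketch below uses toy labels purely for illustration, and `label_shares` is a hypothetical helper. In the notebook you would pass `train_df["label"]` instead.

```python
from collections import Counter


def label_shares(labels):
    """Return (label, count, fraction) tuples sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Toy labels for illustration only; they are not taken from the real dataset.
toy_labels = ["Governing Law"] * 6 + ["Cap On Liability"] * 3 + ["Non-Compete"]
for label, n, share in label_shares(toy_labels):
    print(f"  {label}: {n} ({share:.1%})")
```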

## Run Training with Optimal Parameters for Kaggle P100 GPU

Now we'll run the training process. This will take a few hours.

In [None]:
from bert_model.train_model import train_model

# Run training with optimized parameters for P100
metrics = train_model(
    data_dir="data/cuad_processed",
    output_dir="bert_model/fine_tuned_roberta",
    batch_size=16,  # P100 can handle this size
    epochs=3,
    learning_rate=3e-5,
    model_name="roberta-base",
    max_length=512,
    seed=42
)

print("Training complete! Model saved to bert_model/fine_tuned_roberta")
print(f"Test Accuracy: {metrics['accuracy']:.4f}")
print(f"Test F1 Score: {metrics['f1_score']:.4f}")
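
To get a rough feel for how long a run will take, you can estimate the number of optimizer steps from the settings above. This is a back-of-the-envelope sketch: it assumes one optimizer step per batch (no gradient accumulation) and that the last partial batch is kept, neither of which we have confirmed against `train_model`. The example count of 10,000 is a placeholder, not the real dataset size.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming one step per batch and a kept final partial batch."""
    return math.ceil(num_examples / batch_size) * epochs


# 10_000 is a placeholder; in the notebook use len(train_df).
steps = total_training_steps(num_examples=10_000, batch_size=16, epochs=3)
print(f"Total optimizer steps: {steps}")
```

Multiplying the step count by the per-step time you see in the first few minutes of training gives a usable estimate of the total run time.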

## Test the Model on Sample Clauses

Let's test our trained model on some sample clauses.

In [None]:
from bert_model.infer_clause import RobertaClausePredictor
import json

# Load sample clauses
with open("results/sample_clauses.json", "r") as f:
    samples = json.load(f)

# Initialize predictor
predictor = RobertaClausePredictor(model_dir="bert_model/fine_tuned_roberta")

# Test on a few samples
for i, sample in enumerate(samples[:3]):  # Just show the first 3
    clause_text = sample["clause_text"]
    actual_type = sample["type"]
    
    # Get predictions
    prediction = predictor.predict_clause(clause_text)
    
    print(f"\nSample {i+1}:")
    print(f"Clause excerpt: {clause_text[:100]}...")
    print(f"Actual type: {actual_type}")
    print(f"Predicted type: {prediction['type']}")
    print(f"Risk score: {prediction['risk_score']:.2f}")
    print(f"Explanation: {prediction['explanation']}")
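
Beyond eyeballing three samples, you can summarize how often the predicted type matches the labelled one. The sketch below assumes only what the loop above uses: samples with `clause_text` and `type` keys, and a predictor that returns a dict with a `type` key. `spot_check_accuracy`, `fake_predict`, and the toy samples are stand-ins so the example runs on its own.

```python
def spot_check_accuracy(samples, predict):
    """Fraction of samples whose predicted 'type' equals the labelled 'type'."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if predict(s["clause_text"])["type"] == s["type"])
    return hits / len(samples)


# Stand-in predictor: a keyword rule, not the trained model.
def fake_predict(text):
    return {"type": "Governing Law" if "governed by" in text else "Other"}


# Toy samples for illustration only.
toy_samples = [
    {"clause_text": "This Agreement is governed by the laws of Delaware.", "type": "Governing Law"},
    {"clause_text": "Either party may terminate on 30 days' notice.", "type": "Termination"},
]
print(f"Agreement: {spot_check_accuracy(toy_samples, fake_predict):.0%}")
```

In the notebook, the equivalent call would be `spot_check_accuracy(samples, predictor.predict_clause)`.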

## Compress the Model for Download

Finally, let's package up the trained model for easy download.

In [None]:
!zip -r fine_tuned_roberta.zip bert_model/fine_tuned_roberta/
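
If the `zip` command is not available in your environment, the standard library can produce an equivalent archive. The sketch below uses `shutil.make_archive` and keeps the `bert_model/fine_tuned_roberta/` prefix inside the archive, as the shell command does. `zip_model` is a hypothetical helper, and the demo runs against a throwaway directory with a placeholder file rather than the real model.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_model(project_root, model_rel_dir="bert_model/fine_tuned_roberta",
              archive_name="fine_tuned_roberta"):
    """Zip model_rel_dir, keeping its path relative to project_root in the archive."""
    archive = shutil.make_archive(
        str(Path(project_root) / archive_name), "zip",
        root_dir=project_root, base_dir=model_rel_dir,
    )
    return Path(archive)


# Demonstrate on a temporary directory; in the notebook, call zip_model(".").
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "bert_model" / "fine_tuned_roberta"
    model_dir.mkdir(parents=True)
    (model_dir / "config.json").write_text("{}")  # placeholder file
    archive = zip_model(tmp)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
print(names)
```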

## Done!

You can now download the `fine_tuned_roberta.zip` file, which contains the trained model.

Unzip it in your local project's root directory. The archive already contains the `bert_model/fine_tuned_roberta/` path, so the model lands where the inference code expects it.
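
Because the archive was created from the repository root, its entries keep the `bert_model/fine_tuned_roberta/` prefix, so extracting at the project root recreates that path. Here is a minimal sketch of doing this from Python. `extract_model` is a hypothetical helper, and the demo builds a stand-in archive so it can run anywhere.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_model(archive_path, project_root):
    """Extract the archive at project_root and return the resulting model directory."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(project_root)
    return Path(project_root) / "bert_model" / "fine_tuned_roberta"


# Self-contained demo with a stand-in archive; locally, call
# extract_model("fine_tuned_roberta.zip", ".") from the project root.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "fine_tuned_roberta.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("bert_model/fine_tuned_roberta/config.json", "{}")
    target = Path(tmp) / "project"
    model_dir = extract_model(archive, target)
    extracted_ok = (model_dir / "config.json").is_file()
print("Model files in place:", extracted_ok)
```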