# Legal Document Classification with BERT - V2 (Full Dataset)

This notebook builds a BERT-based classifier for legal documents. It uses the full 45K dataset, in which each example contains the complete document text (header, recitals, and main body).

## Part 1: Setup and Installation

In [None]:
# Install required packages
!pip install transformers datasets torch scikit-learn tqdm matplotlib seaborn pandas nltk

In [None]:
# Mount Google Drive for saving models and data
from google.colab import drive
drive.mount('/content/drive')

# Create a folder for the project
!mkdir -p /content/drive/MyDrive/legal_bert_classification_v2

In [None]:
# Check if GPU is available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Print GPU info if available
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Part 2: Data Upload

This section loads the full 45K dataset produced by the `process_full_dataset.py` script, either by uploading the CSV directly (Option 1) or by reading a copy already saved to Drive (Option 2).

In [None]:
# Option 1: Upload the dataset directly
from google.colab import files

# You can upload the full dataset CSV here
print("Upload your full_bert_dataset.csv file:")
uploaded = files.upload()  # Will prompt for file upload

# Get the uploaded file name
import io
import pandas as pd

for filename in uploaded.keys():
    print(f"Uploaded {filename} successfully!")
    
    # Save to Drive for future use
    with open(f"/content/drive/MyDrive/legal_bert_classification_v2/{filename}", 'wb') as f:
        f.write(uploaded[filename])
    print("Saved to Drive for future reference")
    
    # Read the uploaded CSV
    df = pd.read_csv(io.StringIO(uploaded[filename].decode('utf-8')))
    print(f"Dataset loaded with {len(df)} rows and columns: {df.columns.tolist()}")

In [None]:
# Option 2: If you've already uploaded to Drive, set the path
# dataset_path = '/content/drive/MyDrive/legal_bert_classification_v2/full_bert_dataset.csv'
# df = pd.read_csv(dataset_path)
# print(f"Dataset loaded from Drive with {len(df)} rows and columns: {df.columns.tolist()}")
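
If Option 2 is used in a fresh session, the CSV may not be on Drive yet, and `pd.read_csv` then fails with a generic path error. A minimal sketch of a guarded loader follows. The helper name `load_dataset` is hypothetical and not part of this notebook; the path is the one from the commented-out cell above.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read the dataset CSV, failing with a clear message if it is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"No dataset found at {path}. Run the upload cell (Option 1) first."
        )
    return pd.read_csv(path)


# Path taken from the Option 2 cell; only loaded when the file is present.
dataset_path = "/content/drive/MyDrive/legal_bert_classification_v2/full_bert_dataset.csv"
if Path(dataset_path).is_file():
    df = load_dataset(dataset_path)
    print(f"Dataset loaded from Drive with {len(df)} rows")
```

Checking for the file up front keeps the failure close to its cause, rather than surfacing later as an undefined `df` in the verification cell.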

In [None]:
# Basic verification that the data looks correct
# Display first few rows
print("First 3 rows of the dataset:")
print(df.head(3))

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Verify label distribution
print("\nLabel distribution:")
print(df['label'].value_counts())
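
The cell above only prints diagnostics. As a sketch of how the same checks could be enforced before training, the helper below raises on missing columns, drops incomplete rows, and reports each label's share of the data. It assumes a `text` column alongside `label`; only `label` appears in this notebook, so `text` and the name `validate_dataset` are hypothetical and should be adjusted to the actual schema.

```python
import pandas as pd


def validate_dataset(df, required_columns=("text", "label")):
    """Return a cleaned copy of df and a per-label count/share summary."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Drop rows that have a missing value in any required column.
    cleaned = df.dropna(subset=list(required_columns)).reset_index(drop=True)

    # Per-label counts and their fraction of the cleaned dataset.
    counts = cleaned["label"].value_counts()
    summary = pd.DataFrame({"count": counts, "share": counts / len(cleaned)})
    return cleaned, summary


# Toy data for illustration only (not drawn from the real dataset).
toy = pd.DataFrame({
    "text": ["Regulation A", None, "Decision B", "Directive C"],
    "label": ["regulation", "regulation", "decision", "regulation"],
})
cleaned, summary = validate_dataset(toy)
print(f"Kept {len(cleaned)} of {len(toy)} rows")
print(summary)
```

The `share` column makes class imbalance visible at a glance, which is useful to know before choosing a train/validation split strategy.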