<a href="https://colab.research.google.com/github/adithyanum/legal_document_simplifier/blob/main/legal_document_simplifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Legal Document Simplifier using BART and Streamlit


## 🎯 Objective

The goal of this project is to create an interactive web app that allows users to upload legal documents in `.txt`, `.pdf`, or `.docx` format and receive simplified summaries of the text.

This is especially useful for:
- Law students
- Researchers
- Laypersons needing clarity on complex contracts or legal acts


## 🔧 Tools and Libraries

- **Streamlit**: To build the interactive web UI
- **Hugging Face Transformers**: For abstractive summarization using `facebook/bart-large-cnn`
- **NLTK**: For sentence tokenization (chunking)
- **PyMuPDF (`fitz`)**: For extracting text from PDFs
- **python-docx**: For extracting text from DOCX files


## 🧠 How It Works

1. User uploads a legal document (`.txt`, `.pdf`, or `.docx`)
2. The document text is extracted based on file type
3. If the text is short (fewer than 700 words), it is summarized directly
4. Otherwise, the document is split into sentences with `nltk.sent_tokenize()` and the sentences are grouped into chunks
5. Each chunk is summarized using BART (`facebook/bart-large-cnn`)
6. All summaries are combined and shown to the user
7. User can download the final summary as a `.txt` file
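
Steps 2 to 4 above amount to two small decisions: which extractor to use, and whether to chunk. The sketch below illustrates them in isolation. The names `EXTRACTORS`, `pick_extractor`, and `needs_chunking` are hypothetical helpers that are not part of the app, and the dictionary values are plain labels standing in for the real extraction functions.

```python
import os

# Hypothetical labels standing in for the real extraction routines.
EXTRACTORS = {
    ".txt": "decode as UTF-8",
    ".pdf": "extract_text_from_pdf",
    ".docx": "extract_text_from_docx",
}


def pick_extractor(filename):
    """Return the extraction route for a filename, or None if unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTRACTORS.get(extension)


def needs_chunking(text, threshold=700):
    """Documents with fewer than `threshold` words are summarized directly."""
    return len(text.split()) >= threshold
```

Lower-casing the extension means `Contract.PDF` and `contract.pdf` take the same route.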


## 🧪 Features

- 📂 Accepts `.txt`, `.pdf`, and `.docx` legal documents
- ✂️ Automatically chunks long documents
- 🧠 Summarizes using Hugging Face BART model
- 🎛 Users can choose between "Concise" or "Detailed" summary styles
- 📥 Final summary is downloadable
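
The two summary styles map to a pair of length limits passed to the model. A minimal sketch of that mapping follows, using the values from the app code further down. The names `STYLE_LIMITS` and `limits_for` are hypothetical, and falling back to "Concise" for an unknown style is an assumption of this sketch.

```python
# (max_length, min_length) per style, measured in model tokens.
STYLE_LIMITS = {
    "Concise": (150, 30),
    "Detailed": (300, 50),
}


def limits_for(style):
    """Return (max_len, min_len) for a style; unknown styles fall back to Concise."""
    return STYLE_LIMITS.get(style, STYLE_LIMITS["Concise"])
```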

In [None]:
from transformers import pipeline
import docx
import fitz
import os

## 🤖 Model: facebook/bart-large-cnn

- **Type**: Abstractive Summarizer
- **Base**: BART (Bidirectional and Auto-Regressive Transformers)
- **Fine-tuned on**: CNN/DailyMail news articles
- **Strength**: Condenses passages that fit its 1,024-token input window while preserving meaning


In [None]:
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## 🛠️ Sentence Chunking (Why?)

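BART accepts at most 1,024 input tokens, so long legal documents have to be split before summarization. To do this, we:
- Split the text into sentences using `nltk.sent_tokenize()`
- Combine sentences into chunks of up to about 800 words
- Summarize each chunk separately and merge the results

Word counts only approximate token counts, so a chunk near the 800-word limit can still exceed the token limit.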
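
As a rough guide to how many model calls a document needs, the chunk count can be estimated from the word count alone. This is a sketch under two assumptions: the 800-word chunk size used by the app, and no single sentence longer than that limit. The helper name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(total_words, max_words=800):
    """Lower bound on the number of chunks for a document of `total_words` words."""
    if total_words <= 0:
        return 0
    return math.ceil(total_words / max_words)


print(estimate_chunk_count(14_000))  # 18
```

The real count is usually a little higher, because chunks end at sentence boundaries and so rarely fill the full word budget.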


In [None]:
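def chunk_text(text, max_words=800):
    """Group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        word_count = len(sentence.split())

        # Start a new chunk when this sentence would push the current one
        # past the limit. Checking current_chunk avoids emitting an empty
        # chunk when the very first sentence is already over the limit.
        if current_chunk and current_length + word_count > max_words:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = word_count
        else:
            current_chunk.append(sentence)
            current_length += word_count

    # Flush the last, partially filled chunk.
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks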
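
`chunk_text` keeps every sentence whole, so one very long sentence (common in legal drafting) can produce a chunk well over `max_words`. One possible way to handle that, shown here as a sketch and not as part of the app, is to break such a sentence into fixed-size word windows before chunking. The helper name `split_long_sentence` is hypothetical.

```python
def split_long_sentence(sentence, max_words=800):
    """Break one sentence into consecutive pieces of at most max_words words."""
    words = sentence.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

A sentence that already fits comes back as a single piece, so the helper could be applied to every sentence without special-casing. Splitting mid-sentence does cost some context, which is why it is only a fallback.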

In [27]:
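def summarize_chunks(chunks):
    """Summarize each chunk and return the summaries in order.

    Relies on the module-level max_len and min_len, which the app sets from
    the selected summary style before this function is called.
    """
    summaries = []

    for chunk in chunks:
        # truncation=True cuts a chunk that tokenizes past the model's input
        # limit instead of letting the call fail.
        result = summarizer(chunk, max_length=max_len, min_length=min_len,
                            do_sample=False, truncation=True)
        summaries.append(result[0]['summary_text'])

    return summaries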

In [None]:
import streamlit as st

In [None]:
st.title('📄 Legal Document Simplifier')
st.write('Upload a Legal Document to get a simplified summary.')

In [None]:
uploaded_file = st.file_uploader("Upload a Legal Document 📂", type = ['txt','pdf','docx'])

In [33]:
def extract_text_from_pdf(pdf_file):
    text = ""
    with fitz.open(stream=pdf_file.read(), filetype="pdf") as doc:
        for page in doc:
            text += page.get_text()
    return text

In [34]:
def extract_text_from_docx(docx_file):
    text = ""
    document = docx.Document(docx_file)
    for paragraph in document.paragraphs:
        text += paragraph.text + "\n"
    return text

In [35]:
if uploaded_file is not None :
    file_extension = os.path.splitext(uploaded_file.name)[1].lower()

    if file_extension == '.txt':
        raw_text = uploaded_file.read().decode("utf-8")
    elif file_extension == '.pdf':
        raw_text = extract_text_from_pdf(uploaded_file)
    elif file_extension == '.docx':
        raw_text = extract_text_from_docx(uploaded_file)
    else:
        st.error("Unsupported file type.")
        raw_text = None

    if raw_text:
        st.subheader("📄 Original Legal Document:")
        st.text_area("Full Text", raw_text, height=300)

        summary_style = st.selectbox("Select Summary Style", ["Concise", "Detailed"])
        if summary_style == "Concise":
            max_len, min_len = 150, 30
        else:
            max_len, min_len = 300, 50


        if st.button("Simplify Document"):
            with st.spinner("Summarizing... Please wait..."):
                if len(raw_text.split()) < 700:
                    # Short document: one pass, honoring the selected style.
                    final_summary = summarizer(
                        raw_text, max_length=max_len, min_length=min_len,
                        do_sample=False, truncation=True
                    )[0]['summary_text']
                else:
                    # Long document: summarize chunk by chunk, then join.
                    chunks = chunk_text(raw_text)
                    summaries = summarize_chunks(chunks)
                    final_summary = "\n".join(summaries)

                st.subheader("✂️ Simplified Summary:")
                st.text_area("Summary", final_summary, height=300)

                st.download_button("📥 Download Summary", final_summary, file_name="summary.txt")
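
The `.txt` branch above assumes the upload is UTF-8 and raises `UnicodeDecodeError` otherwise. A more forgiving decode could look like the sketch below. The helper name `decode_text` is hypothetical, and the choice of Latin-1 as the fallback encoding is an assumption, not something the app does.

```python
def decode_text(raw_bytes):
    """Decode uploaded bytes as UTF-8, falling back to Latin-1."""
    try:
        # utf-8-sig also strips a leading byte-order mark if one is present.
        return raw_bytes.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw_bytes.decode("latin-1")
```

In the app this would replace the direct `.decode("utf-8")` call, for example `raw_text = decode_text(uploaded_file.read())`.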

## 📊 Results

The Legal Document Simplifier successfully summarized long and complex legal documents — including the **Patents Act, 1970 (India)** — into concise, readable summaries using BART-based abstractive summarization.

### 🧪 Example:

**Original Document**: ~14,000 words  
**Simplified Summary**: ~300–500 words

### 📝 Observations:

- ✅ Retains essential legal terms (e.g., "patent", "specification", "controller")
- ✅ Condenses redundant clauses and long procedural paragraphs
- ✅ Performs best on acts, contracts, and legal notices with formal structure
- ⚠️ May retain original phrasing for legally sensitive definitions (likely a side effect of the model copying source wording, not a guarantee that definitions are preserved)

In this test, the tool cut the reading load from ~14,000 words to a few hundred while keeping the key legal terms listed above.


## 📎 Sample Output

[📄 View Sample Summary PDF](https://github.com/adithyanum/legal_document_simplifier/blob/977c317e1ec518e2081a9ebb225f30b3c1931995/Legal_Document_Simplifier.pdf)


## 📌 Conclusion

This app simplifies complex legal documents using BART-based abstractive summarization behind a Streamlit interface. It helps anyone who needs to understand legal content more quickly.

## 🙏 Acknowledgments

I would like to acknowledge the following tools, organizations, and libraries for making this project possible:

- 🤗 **Hugging Face Transformers** — for providing the `facebook/bart-large-cnn` pretrained summarization model
- 📚 **NLTK (Natural Language Toolkit)** — for sentence-level tokenization
- 📄 **PyMuPDF** and **python-docx** — for extracting text from PDFs and DOCX files
- 🚀 **Streamlit** — for enabling rapid development of user-friendly NLP apps
- 👨‍⚖️ Indian Government Publications — for access to the **Patents Act, 1970** as a test document

Special thanks to the open-source community for building tools that make applied NLP accessible to everyone.
