Hierarchical Sorting Legal Documents using NLP

by Christian Grech

Introduction

One of the challenges Legal Tech companies encounter involves reconstructing hierarchical structures within regulatory documents after parsing. Establishing clear parent-child relationships between paragraphs is crucial for organizing data in a manner that benefits our customers and enables various machine learning applications.

For this task, I created a system that organizes a given list of paragraphs into a hierarchy by defining parent-child relationships between them. The repository includes an example dataset (data.json) containing paragraphs from a generic legal document, arranged in the correct hierarchy based on reading order. This dataset can serve as input for a machine learning model or for any alternative approach.

Note that the hierarchy may have multiple levels, so a paragraph can be both a parent and a child at the same time. Finally, I used the best-performing classifier to organize the paragraphs in test.json into a hierarchy and generate an HTML file with visible indentation.

Dataset description

data.json consists of 785 paragraphs in reading order. Each paragraph has three fields: id is a unique identifier of the paragraph; parent_id holds a reference to the parent paragraph and is null if the paragraph has no parent; html is the HTML representation of the paragraph.
test.json consists of 48 paragraphs in reading order from a legal document similar to the one the paragraphs in data.json were taken from. It has the same id and html fields, but parent_id has been deliberately hidden and should be predicted by the best-performing model.
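
To make the parent_id field concrete, here is a minimal sketch that derives a numerical level (depth) for each paragraph by following parent_id references. The sample records and the function name compute_levels are hypothetical; only the field names id, parent_id and html come from the dataset description, and the actual id values in data.json may look different.

```python
import json

# Hypothetical records in the same shape as data.json (id, parent_id, html).
sample = json.loads("""
[
  {"id": "a", "parent_id": null, "html": "<h2>Article 1</h2>"},
  {"id": "b", "parent_id": "a", "html": "<p>1. First paragraph</p>"},
  {"id": "c", "parent_id": "b", "html": "<p>(a) Sub-point</p>"},
  {"id": "d", "parent_id": "a", "html": "<p>2. Second paragraph</p>"}
]
""")


def compute_levels(records):
    """Return {id: depth}, where a paragraph without a parent has depth 0."""
    parent_of = {r["id"]: r["parent_id"] for r in records}
    levels = {}

    def depth(pid):
        # Memoised walk up the parent chain.
        if pid not in levels:
            parent = parent_of[pid]
            levels[pid] = 0 if parent is None else depth(parent) + 1
        return levels[pid]

    for r in records:
        depth(r["id"])
    return levels


levels = compute_levels(sample)
print(levels)  # {'a': 0, 'b': 1, 'c': 2, 'd': 1}
```

Paragraph "b" is both a child (of "a") and a parent (of "c"), which is the multi-level case mentioned in the introduction.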

Solution

Data is imported from the two JSON files.
Feature extraction: HTML tags are extracted from each paragraph to help distinguish paragraphs from headers.
The data is vectorized and split into training and validation sets.
Several classifiers are trained, and their performance is evaluated on the validation set.
Finally, the XGBoost classifier is applied to the test dataset to create the test_data.json file with indents.
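
The first and last of these steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's actual code: it assumes the classifier outputs one numerical level per paragraph, and the names TagCollector, extract_tags, levels_to_parents and render_indented are hypothetical. The vectorization and classifier training steps are left out.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of the opening tags seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def extract_tags(html):
    """Return the opening tag names of a paragraph, e.g. as raw features."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levels_to_parents(ids, levels):
    """Map each id to the nearest preceding id with a strictly smaller level."""
    parents = {}
    stack = []  # (level, id) pairs for the currently open ancestors
    for pid, level in zip(ids, levels):
        # Close every ancestor that is at the same level or deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents[pid] = stack[-1][1] if stack else None
        stack.append((level, pid))
    return parents


def render_indented(paragraphs, levels, indent_em=2):
    """Wrap each paragraph's HTML in a div indented in proportion to its level."""
    return "\n".join(
        f'<div style="margin-left: {level * indent_em}em">{para["html"]}</div>'
        for para, level in zip(paragraphs, levels)
    )


# Hypothetical paragraphs and predicted levels.
paragraphs = [
    {"id": "a", "html": "<h2>Article 1</h2>"},
    {"id": "b", "html": "<p>1. First paragraph</p>"},
    {"id": "c", "html": "<p><b>(a)</b> Sub-point</p>"},
    {"id": "d", "html": "<p>2. Second paragraph</p>"},
]
predicted_levels = [0, 1, 2, 1]

parents = levels_to_parents([p["id"] for p in paragraphs], predicted_levels)
page = render_indented(paragraphs, predicted_levels)
```

Because the paragraphs are in reading order, a stack of open ancestors is enough to turn predicted levels back into parent_id values.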

Evaluation

The XGBoost model achieves an accuracy of 0.96 and a balanced accuracy of 0.85 on the validation set.
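
The two metrics differ because balanced accuracy is the mean of the per-class recalls, so rarely occurring classes weigh as much as frequent ones. The sketch below implements both definitions in plain Python on made-up labels; the numbers are illustrative only and are not the project's results.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Made-up example: level 0 is common, level 1 is rare and half of it is missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

A gap like the one between 0.96 and 0.85 is what one would expect if some hierarchy levels are rare and are predicted less reliably than the common ones, although the source does not state the class distribution.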

Improvements

Fine-tuning feature extraction and model parameters.
Predicting the relative relationship between paragraphs rather than the numerical level.
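
One possible way to frame the relative target, shown here only as a sketch, is to encode each paragraph by the change in level with respect to the previous paragraph in reading order (positive for going deeper, 0 for a sibling, negative for moving back up). The encoding and the function names levels_to_deltas and deltas_to_levels are assumptions, not something taken from the project.

```python
def levels_to_deltas(levels):
    """Encode absolute levels as the change from the previous paragraph."""
    deltas = []
    previous = 0
    for level in levels:
        deltas.append(level - previous)
        previous = level
    return deltas


def deltas_to_levels(deltas):
    """Decode predicted changes back into absolute levels (never below 0)."""
    levels = []
    current = 0
    for delta in deltas:
        # Clamp at 0 so an invalid predicted delta cannot go above the root.
        current = max(0, current + delta)
        levels.append(current)
    return levels


levels = [0, 1, 2, 2, 1, 0]
deltas = levels_to_deltas(levels)
print(deltas)  # [0, 1, 1, 0, -1, -1]
```

With this encoding the target no longer depends on how deep the paragraph sits in the document, only on how it relates to its predecessor.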

