JurisBERT - Brazilian Legal Text Dataset

A Brazilian legal text dataset for training transformer-based models.

Requirements

Before running, install the Firefox WebDriver (geckodriver) for Selenium. Download the latest release from https://github.com/mozilla/geckodriver/releases and put the executable in a directory on your PATH.
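
A quick way to confirm that the driver is reachable before starting the scrapers is to look it up on PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_geckodriver` is hypothetical and not part of this repository.

```python
import shutil


def find_geckodriver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


# Report whether the driver can be found from the current environment.
driver_path = find_geckodriver()
if driver_path is None:
    print("geckodriver not found on PATH")
else:
    print(f"geckodriver found at {driver_path}")
```

If the lookup reports that the driver is missing, move the executable into a directory already on PATH or add its directory to PATH.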

Get Started

Run the command below to install all required dependencies.

pip install -r requirements.txt

Generate MLM Dataset

To generate a dataset for MLM pre-training, run the command below. It executes the whole pipeline and generates 2 files in output/mlm/.

python mlm.py all
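
Once the pipeline has finished, the result can be checked by listing the contents of output/mlm/. The sketch below assumes only the directory name given above; the names of the two generated files are not stated here, so it simply lists whatever files are present. The helper name `list_output_files` is hypothetical.

```python
from pathlib import Path


def list_output_files(output_dir="output/mlm"):
    """Return the regular files directly inside output_dir, sorted by name.

    Returns an empty list if the directory does not exist yet.
    """
    directory = Path(output_dir)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


# Print each generated file with its size in bytes.
for path in list_output_files():
    print(f"{path.name}\t{path.stat().st_size} bytes")
```

An empty listing suggests that the export step has not run or that the command was started from a different working directory.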

To run individual tasks, pass the task name as a parameter:

python mlm.py scrap
python mlm.py parse
python mlm.py export
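
When only some tasks need to be repeated, for example parsing and exporting after an earlier scrape, the commands can be chained from a small script. This is a sketch under stated assumptions: it takes the task names from the commands above, assumes that every task is passed to mlm.py as in the first command, and assumes that they are meant to run in the order listed. The helper names `build_commands` and `run_tasks` are hypothetical.

```python
import subprocess
import sys

# Task names taken from the commands above; running them in this order is an assumption.
MLM_TASKS = ("scrap", "parse", "export")


def build_commands(tasks, script="mlm.py"):
    """Build one command line per task, rejecting unknown task names."""
    unknown = [task for task in tasks if task not in MLM_TASKS]
    if unknown:
        raise ValueError(f"unknown task(s): {unknown}")
    return [[sys.executable, script, task] for task in tasks]


def run_tasks(tasks, script="mlm.py"):
    """Run each task in turn and stop at the first one that fails."""
    for command in build_commands(tasks, script):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_tasks(["parse", "export"])` from the repository root would then run the two tasks back to back and stop if the first one fails.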

Generate STS Dataset

To generate a dataset for STS fine-tuning, run the command below. It executes the whole pipeline and generates files in output/sts/{sts_type}/, where {sts_type} is one of binary, scale, triplet, or benchmark.

python sts.py all --sts_type "binary | scale | triplet | benchmark"
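
The quoted value in the command reads as a list of alternatives, so presumably a single type is passed per run. The sketch below builds the command for one type and derives the matching output directory from the output/sts/{sts_type}/ pattern. Both the one-type-per-run reading and the helper names `sts_command` and `sts_output_dir` are assumptions, not part of this repository.

```python
import sys
from pathlib import PurePosixPath

# The four dataset types listed in the command above.
STS_TYPES = ("binary", "scale", "triplet", "benchmark")


def _check_type(sts_type):
    """Raise ValueError unless sts_type is one of the listed types."""
    if sts_type not in STS_TYPES:
        raise ValueError(f"sts_type must be one of {STS_TYPES}, got {sts_type!r}")


def sts_command(sts_type, task="all"):
    """Return the command line that generates one STS dataset type."""
    _check_type(sts_type)
    return [sys.executable, "sts.py", task, "--sts_type", sts_type]


def sts_output_dir(sts_type):
    """Return the directory where files for this type are expected."""
    _check_type(sts_type)
    return PurePosixPath("output") / "sts" / sts_type
```

For instance, `sts_command("triplet")` yields the arguments for `python sts.py all --sts_type triplet`, and `sts_output_dir("triplet")` points at output/sts/triplet.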

Generated Datasets

If you only want to download the pre-generated datasets, use the links below:

MLM datasets

STS for train datasets

STS for benchmark datasets

Raw datasets

Citation

If you use our work, please cite:

@incollection{Viegas_2023,
	doi = {10.1007/978-3-031-36805-9_24},
	url = {https://doi.org/10.1007%2F978-3-031-36805-9_24},
	year = 2023,
	publisher = {Springer Nature Switzerland},
	pages = {349--365},
	author = {Charles F. O. Viegas and Bruno C. Costa and Renato P. Ishii},
	title = {{JurisBERT}: A New Approach that~Converts a~Classification Corpus into~an~{STS} One},
	booktitle = {Computational Science and Its Applications {\textendash} {ICCSA} 2023}
}
