This repository contains various tools and scripts used to convert, process and prepare the texts received as part of the text collection activities in work package 4 of the Slovene in the Digital Environment project.
This script converts all PDFs in a folder to text, translates them with a machine translation service, and then aligns them with Bleualign (https://github.com/rsennrich/Bleualign).
The script supports English and Slovene, but can easily be adapted to other languages.
The script expects pairs of PDFs in a separate folder. For each PDF file, you need to add either _en or _sl to the end of the filename, before the .pdf extension:
filename_en.pdf
filename_sl.pdf
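The naming convention above can be checked before running the conversion. The following is a minimal sketch only, not code from this repository: the function name `find_pdf_pairs` is hypothetical, and it assumes the two files in a pair share exactly the same base name.

```python
from pathlib import Path


def find_pdf_pairs(folder):
    """Return (english_path, slovene_path) tuples for PDFs that share a base name.

    Hypothetical helper: assumes files are named <base>_en.pdf and <base>_sl.pdf.
    """
    folder = Path(folder)
    pairs = []
    for en_path in sorted(folder.glob("*_en.pdf")):
        # Strip the trailing "_en" to recover the shared base name.
        base = en_path.stem[: -len("_en")]
        sl_path = folder / f"{base}_sl.pdf"
        # Only keep the English file if its Slovene counterpart exists.
        if sl_path.exists():
            pairs.append((en_path, sl_path))
    return pairs
```

Files without a counterpart are skipped silently here; a stricter check could report them instead.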
By default, the script uses the RSDO machine translation service, but you can also switch to Google Translate (uncomment line 168). Note that you need to set up a Google Cloud account to use this functionality.
This script converts TMX files into bilingual CSV files.
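To illustrate the kind of conversion involved, here is a minimal sketch using only the Python standard library. It is not the repository's implementation: the function names `tmx_to_rows` and `write_csv`, the `en`/`sl` header row, and the two-column layout are all assumptions.

```python
import csv
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under its full namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(tmx_source, src_lang="en", tgt_lang="sl"):
    """Return a list of (source, target) segment pairs from a TMX file.

    Hypothetical helper: tmx_source is a path or a binary file object.
    """
    root = ET.parse(tmx_source).getroot()
    rows = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Accept both xml:lang (TMX 1.4) and the older plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Reduce codes such as "en-US" to the two-letter prefix "en".
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        # Skip translation units that lack either language.
        if src_lang in segments and tgt_lang in segments:
            rows.append((segments[src_lang], segments[tgt_lang]))
    return rows


def write_csv(rows, csv_file, header=("en", "sl")):
    """Write the segment pairs to an open text file, one pair per row."""
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(rows)
```

When writing to disk, open the output with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls line endings.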
This script creates a TMX file from a bilingual CSV file.
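The reverse direction can be sketched in the same way. Again, this is an illustration under stated assumptions rather than the repository's code: the function name `csv_to_tmx`, the header attribute values, and the CSV layout (a header row followed by source and target columns) are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET

# ElementTree serialises this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def csv_to_tmx(csv_file, src_lang="en", tgt_lang="sl"):
    """Build a TMX ElementTree from an open two-column CSV file.

    Hypothetical helper: assumes the first row is a header and is skipped.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the assumed header row

    tmx = ET.Element("tmx", version="1.4")
    # Header attribute values below are placeholders chosen for this sketch.
    ET.SubElement(tmx, "header", {
        "creationtool": "csv_to_tmx sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete rows
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, row[0]), (tgt_lang, row[1])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

The returned tree can be saved with `tree.write("output.tmx", encoding="utf-8", xml_declaration=True)`.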
This script anonymises text using an external anonymisation service (e.g. https://gitlab.com/MAPA-EU-Project/mapa_project).