Bitextor 8.3: Snake Runner, the Sentence Retirer

@lpla released this 30 May 10:31
I've seen things you people wouldn't believe. Roy Batty, The Preverticant

What's Changed

  • Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
  • CI and tests updates and fixes by @cgr71ii in #238
  • Range of paragraphs count using option paragraphIdentification by @lpla in #241
  • Document pair output file by @aarongaliano in #242
  • Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
  • Remove Linguacrawl from Bitextor by @aarongaliano in #248
    • Linguacrawl remains compatible with Bitextor through the WARC format, but crawling now has to be managed manually
  • Metadata code refactorization by @cgr71ii in #245
  • Now you can use compatible documents (such as PDF, TXT and HTML files) as Bitextor input without encapsulating them in the WARC or Prevertical formats! Check the directories and directoriesFile documentation, by @aarongaliano in #247
  • PDFprocessing option (previously PDFextract). It is now a list that lets you choose between pdf2html, pdfextract and Apache Tika (new PDF processor), by @aarongaliano in #247
  • Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
  • New Bitextor multilang option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
  • New Bitextor argument bicleanerExtraArgs to pass extra arguments to Bicleaner(-AI) by @lpla in #250
  • Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
  • Base dependency updated to Scikit 1.1.3, including new models for the dict-based document aligner, by @aarongaliano in #243
  • New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
  • Updated Python requirements, submodules, and documentation.
  • Minor bug fixes and changes (including #253)
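
To illustrate what the L2 normalization entry refers to in general terms, here is a minimal, self-contained sketch of L2-normalized TF-IDF vectors. It is not Bitextor's document aligner code: the function names, the smoothed IDF formula and the toy token lists are assumptions made for this example, and the documents are assumed to be already tokenized and in the same language.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length; empty vectors are returned as-is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


def dot(a, b):
    """Dot product of two sparse vectors; on unit vectors this is the cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Toy corpus: the second document repeats the first one, the third shares no terms.
docs = [["red", "house"], ["red", "house", "red", "house"], ["blue"]]
raw = tfidf_vectors(docs)
unit = [l2_normalize(v) for v in raw]
```

In this sketch the raw dot product between the first two documents grows with document length, while the normalized score depends only on the relative term weights, so a document and a longer copy of it score as a near-perfect match.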

New Contributors

Full Changelog: v8.2...v8.3

Notes: the tarball includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the tarball or cloning the repository at the v8.3 tag.
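
The distinction above can be sketched as a small check. This is a hedged illustration, not part of Bitextor: the helper name submodule_command is hypothetical, and it relies only on the fact that a git clone carries a .git entry while GitHub's generated source archives do not.

```python
from pathlib import Path

# The command quoted in the notes above, as an argument list.
SUBMODULE_CMD = ["git", "submodule", "update", "--init", "--recursive"]


def submodule_command(source_dir):
    """Return the submodule command for a git checkout, or None otherwise.

    Hypothetical helper for this example: a directory without a .git entry
    (such as an unpacked source archive) has no repository metadata, so the
    submodule command cannot be run there.
    """
    if (Path(source_dir) / ".git").exists():
        return list(SUBMODULE_CMD)
    return None
```

When the helper returns a command, it could be run with subprocess.run(cmd, cwd=source_dir, check=True); when it returns None, the directory is not a clone and the release tarball is the safer starting point.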

We will support the Bitextor 8.x branch until the next major version is released.