Releases: among/fusus
Moved the main tf files
v0.8 moved tf files
Aligned Lakhnawi and Afifi
Data versions for fususl and fususa have been bumped to 0.7.
There is now also fusus data (i.e. aligned and merged fususa and fususl) in version 0.7.
New Lakhnawi version 0.6
Lakhnawi tf generation:
The numbers of the proper bezels were not correct.
Fixed it and created a new data version.
Delivered
This release marks the handover of this repository from Dirk to Cornelis as main contributor.
So far, Dirk has written most of the code, although all of the work is the result of a close cooperation between Cornelis and Dirk.
Cornelis provided the seminal ideas, organized the project and procured the funding.
Cornelis and Dirk discussed every problem and issue along the way in Slack.
The main results are (location in this repo between brackets):
- the fusus code: OCR pipeline and PDF text extraction (fusus)
- example data (examples) (attached as example.zip)
- output data: Lakhnawi TF, TSV, HTML, PDF; Affifi TF, TSV, HTML, PDF (ur) (attached as Lakhnawi.zip and Affifi.zip)
- documentation: Readme, docstrings in the fusus code, extra markdown files (fusus/docs); the built site is attached to this release as site.zip
- notebooks (notebooks) - view them on nb-viewer
Fusus-Lakhnawi converted
The PDF with the Lakhnawi edition of the Fusus has been converted to plain Unicode.
Only the original text is retained. Footnotes and page numbers have been removed.
From there I made some exports to html and pdf, which are attached.
This is just for informational purposes.
Later we plan to produce a Text-Fabric version of this text, which will include the exact positions of all words in the original PDF.
From there you can get a plain text easily.
There might still be some rough edges.
Pipeline works
The pipeline from scanned images via cleaning to OCR works.
A few hundred pages have been done.
There is still a lot of tweaking to do.
The OCR results are delivered as tab-separated files, with position and confidence information, at word and character levels.
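As a rough illustration of how such word-level output might be consumed, here is a minimal sketch in Python. The column names (`page`, `line`, `left`, `top`, `right`, `bottom`, `confidence`, `text`), the sample rows, and the threshold are all assumptions made up for this example. They are not the actual headers or values of the fusus files, so check a real file before adapting it.

```python
import csv
import io

# Hypothetical sample data: the column names and values are assumptions
# for illustration, not the actual layout of the fusus OCR output.
SAMPLE_TSV = (
    "page\tline\tleft\ttop\tright\tbottom\tconfidence\ttext\n"
    "1\t1\t120\t80\t210\t130\t93\tالحمد\n"
    "1\t1\t40\t80\t110\t130\t41\tلله\n"
    "1\t2\t130\t150\t220\t200\t88\tرب\n"
)


def read_words(handle):
    """Read word-level OCR rows from a tab-separated file-like object."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        words.append(
            {
                "page": int(row["page"]),
                "line": int(row["line"]),
                # Bounding box as (left, top, right, bottom).
                "box": tuple(
                    int(row[key]) for key in ("left", "top", "right", "bottom")
                ),
                "confidence": int(row["confidence"]),
                "text": row["text"],
            }
        )
    return words


def low_confidence(words, threshold=50):
    """Return the words whose confidence is below the (assumed) threshold."""
    return [word for word in words if word["confidence"] < threshold]


words = read_words(io.StringIO(SAMPLE_TSV))
suspect = low_confidence(words)
for word in suspect:
    print(word["page"], word["line"], word["box"], word["confidence"], word["text"])
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")`. Filtering on confidence is one way to find words that may need manual review.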