PHTI-Dataset

The PHTI is a real dataset developed for the research community, considering the recognition systems' generalization. The overall process includes; the data collection process, image acquisition process, text-line segmentation process, annotation process and detailed statistics of the PHTI dataset. The PHTI dataset is developed to cover many Pashto language genres such as prose, poetry, short story, history, sports, BBC Pashto, and religion. In this context, data were collected from a diverse background of learners having different levels of educational background, including University, College, School, and Deeni Madaras. Each collected handwritten sample is scanned with 600 DPI using an HP Scanjet scanner. The acquired image is named PHTI-XX-Y-ZZZZ.jpg; PHTI is a constant prefix that refers to our new dataset. The "XX" annotates the source name, and "Y" represents the gender, while "ZZZZ" describes the page number in the respective source. The PHTI images are further segmented into 36, 082 text-line images. After segmentation into text-line images, the notion used for image file names is extended with another number.

However, each scanned image, as well as an image of a segmented text line, was manually checked and validated regarding its ground truth, and if there was a mismatch or an error, then the image, as well as its annotation, was corrected. However, there might be a ±4 % error in the validation process. The ground truth is stored in a text file with a UTF-8 codec. The file name is the same as the image file name, except the extension. The PHTI dataset consists of, Total of writers 400, Female writers 200, Male writers 200, Total pages 3, 970, Pages per writer 10, Average line per page 10, Total text-lines images 36, 082, Text-line written by females 18,069, Text-line was written by males 18,013, Total words 4, 20, 961, Unique words 33, 330, Total character 169, Minimum height of image 28, Maximum height of image 237, Minimum width of image 78, Maximum width of image 1, 263.

APPLICATIONS OF THE PHTI DATASET:

Natural Language Processing
Gender and Age Classification
Skew and Line Segmentation
OCR Application

The dataset is available under the terms and conditions covered within the license provided by (GNU General Public License v3.0).

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
PHTI-DATASET		PHTI-DATASET
LICENSE		LICENSE
README.md		README.md
Test.txt		Test.txt
Training.txt		Training.txt
Validation.txt		Validation.txt
proposedTAGSET.txt		proposedTAGSET.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PHTI-DATASET

PHTI-DATASET

LICENSE

LICENSE

README.md

README.md

Test.txt

Test.txt

Training.txt

Training.txt

Validation.txt

Validation.txt

proposedTAGSET.txt

proposedTAGSET.txt

Repository files navigation

PHTI-Dataset

About

Releases

Packages

License

adqecsbbu/PHTI

Folders and files

Latest commit

History

Repository files navigation

PHTI-Dataset

About

Resources

License

Stars

Watchers

Forks