Handwritten Character Disentanglement Benchmark compiled from EMNIST dataset
EMNIST class reads in handwritten character images and allows for flexible creation of word recognition datasets from the EMNIST letters found here : https://www.nist.gov/itl/iad/image-group/emnist-dataset
See examples.py for usage. Code below replicates dataset used in the paper : https://arxiv.org/abs/1710.03839 : Ver Steeg et. al "Disentangled Representations via Synergy Minimization"
all_words.txt contains words listed by occurence frequency from Peter Norvig, according to Google Web Trillion Word Corpus. http://norvig.com/ngrams/ (count_1w.txt)
from emnist import EMNIST import numpy as np length = 3 emnist = EMNIST() top_words = emnist.top_words_of_length(length, max_words=300) letters = emnist.top_letters_by_position(top_words, n = 8) print('Length ', length, ' Words combined using letters: ') print(letters) train_words, test_words = emnist.valid_words_from_letters(letters) x_train, y_train = emnist.get_data(train_words, data = 'train', per_word = 1, resample_letters = 'none', save_all_imgs = True) # Here, we treat the test set as using the same letter images but consisting of invalid English words. # Set data = 'test' to sample EMNIST test set images x_test, y_test = emnist.get_data(test_words, data = 'train', per_word = 1, resample_letters = 'none')
Useful Methods & Parameters
top_words_of_length ( length , max_words , file_path , return_probabilities)
Choose word length and number of words, returns list of word strings and, optionally, relative occurence probabilities normalized amongst chosen words. These may be fed to "get_data" directly to pull images, or used to choose commonly occuring letters in each character position.
top_letters_by_position ( words, n )
Choose top n letters occuring in each of the i positions in words list. Returns list (length = word length) of lists of letters (length = n)
valid_words_from_letters ( letters )
Given list of lists of letters to go in each position, find split all combinations of letters into validly defined words (train), and not (test). Can be used with top_letters or custom letter choices by position.
get_data (words, data = 'train'/'test', per_word , resample_letters)
Construct dataset of images by specifying word list, whether to take images from EMNIST training or test data, and how many samples per_word (can be an integer or 1d array with indices matching word list).
- resample_letters = 'none': Same letter image for each instance of a letter in a word
- resample_letters = 'all' : Resample image for each instance of a letter in word
- resample_letters = 'words' : Use same letter image within same word sample : i.e same 'p' within word 'pip'