# Converting English Hand-Drawn Chars74k Data to JSON

By [Allison Parrish](http://www.decontextualize.com/)

This is a quick notebook that converts the stroke trajectories of the hand-drawn English characters in the [Chars74k](http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/) dataset to JSON. (See Works cited for the paper reference.) I wanted to use this data for art reasons, but the dataset only comes supplied as MATLAB `.m` files. After spending an hour or so trying to figure out how MATLAB works, I decided to write some Python to read in the files and parse the data. In the interest of making the data available more widely, I've supplied the JSON files in this repository. Follow along with this notebook to reproduce the process.

Here's [an example of the data in use](https://editor.p5js.org/allison.parrish/full/rJWnRELhQ).

For this notebook to work, you'll need to [download a copy of the original data](http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/EnglishHnd.tgz). Don't decompress the file! Put it in the same directory as this notebook.

First, the needed Python libs:

In [224]:
import tarfile
import re
import json
from collections import defaultdict

Although the data comes as `.m` files, they don't contain arbitrary MATLAB code, and they're all formatted in exactly the same way (thank god). In the cell below, `parsedata` takes the content of one of these files and finds the stroke data, which `extractlists` turns into lists of integers. `parsedata` then pairs each column value with its row value, so every stroke comes out as a list of (x, y) tuples. (I used integers because the fractional part of the numbers in the original data didn't look like it carried any useful information.)

In [225]:
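def extractlists(s):
    # Each stroke is a bracketed list ending in "];"; grab the text between the brackets.
    lists = re.findall(r"\[([^\[\]]*)\];", s)
    # The values are written as floats; truncate them to integers.
    return [[int(float(y)) for y in x.split()] for x in lists]

def parsedata(s):
    # Row (y) and column (x) coordinates are stored in separate blocks.
    row_match = re.search(r"^rows = {([\[\];0-9.e+\r\n]+)};", s)
    col_match = re.search(r"cols = {([\[\];0-9.e+\r\n]+)};", s)
    row_data = extractlists(row_match.group(1))
    col_data = extractlists(col_match.group(1))
    assert len(row_data) == len(col_data)
    # Pair each column value with its row value to get the (x, y) points of each stroke.
    return [list(zip(c, r)) for c, r in zip(col_data, row_data)]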
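To see what these two functions do, here's a minimal sketch run against a made-up string. The string is not copied from the dataset: it's a hypothetical stand-in shaped only to satisfy the regular expressions above, with one number per line inside each bracketed list. The functions are repeated so the snippet runs on its own.

```python
import re

def extractlists(s):
    lists = re.findall(r"\[([^\[\]]*)\];", s)
    return [[int(float(y)) for y in x.split()] for x in lists]

def parsedata(s):
    row_match = re.search(r"^rows = {([\[\];0-9.e+\r\n]+)};", s)
    col_match = re.search(r"cols = {([\[\];0-9.e+\r\n]+)};", s)
    row_data = extractlists(row_match.group(1))
    col_data = extractlists(col_match.group(1))
    assert len(row_data) == len(col_data)
    return [list(zip(c, r)) for c, r in zip(col_data, row_data)]

# Hypothetical input: two strokes, the first with two points, the second with one.
sample = (
    "rows = {[1.0e+01\n2.0e+01\n];[3.5e+01\n];};\n"
    "cols = {[4.0e+00\n5.0e+00\n];[6.9e+00\n];};\n"
)

strokes = parsedata(sample)
print(strokes)
# [[(4, 10), (5, 20)], [(6, 35)]]
```

Note that `6.9` comes out as `6`: `int(float(...))` truncates rather than rounds.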

The `charmap` string maps an index to the character at that index in the dataset.

In [227]:
charmap = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
len(charmap)

62
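As a quick illustration of how `charmap` gets used, here's a sketch that turns a file name into a character and a sample index. The helper name `char_for` and the example file names are made up for this illustration; the `imgNNN_MMM.m` pattern and the 1-based indexing are taken from the extraction code in this notebook.

```python
import re

charmap = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def char_for(filename):
    """Return (character, zero-based sample index) for a name like img011_003.m."""
    imgidx, dataidx = re.findall(r"img(\d\d\d)_(\d\d\d)\.m", filename)[0]
    # Both numbers in the file name are 1-based, so subtract one.
    return charmap[int(imgidx) - 1], int(dataidx) - 1

print(char_for("img001_001.m"))  # ('0', 0)
print(char_for("img011_003.m"))  # ('A', 2)
print(char_for("img062_055.m"))  # ('z', 54)
```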

And the following bit of code does the magic, checking the filename of each file in the archive and extracting the information contained therein using the functions defined above.

In [234]:
# 55 samples per character; the slots are filled in as the files are read.
chars = defaultdict(lambda: [None]*55)
with tarfile.open("./EnglishHnd.tgz", "r:gz") as tgz:
    for tarinfo in tgz:
        if tarinfo.isfile() and tarinfo.name.endswith(".m"):
            # First number is the character index, second is the sample index (both 1-based).
            imgidx, dataidx = re.findall(r"img(\d\d\d)_(\d\d\d)\.m", tarinfo.name)[0]
            data = tgz.extractfile(tarinfo).read().decode('utf8')
            chars[charmap[int(imgidx)-1]][int(dataidx)-1] = parsedata(data)

The resulting `chars` dictionary maps each character in the dataset to a list of letterforms (55 for each character). Each letterform is a list of strokes, and each stroke is a list of (x, y) coordinates (as tuples), one for each point in the stroke:

In [235]:
chars['t'][1]

[[(300, 439),
  (330, 428),
  (335, 421),
  (338, 417),
  (340, 413),
  (342, 408),
  (344, 404),
  (346, 399),
  (349, 393),
  (351, 388),
  (353, 383),
  (356, 377),
  (359, 370),
  (364, 357),
  (367, 351),
  (370, 344),
  (373, 337),
  (375, 330),
  (378, 323),
  (382, 309),
  (384, 302),
  (389, 289),
  (393, 275),
  (395, 268),
  (397, 262),
  (400, 249),
  (402, 243),
  (404, 237),
  (409, 220),
  (411, 216),
  (412, 211),
  (414, 202),
  (416, 195),
  (418, 189),
  (418, 187),
  (419, 184),
  (419, 182),
  (420, 178),
  (420, 177),
  (420, 176),
  (420, 175),
  (420, 176),
  (420, 177),
  (420, 178),
  (419, 179),
  (419, 180),
  (418, 184),
  (417, 186),
  (415, 194),
  (414, 197),
  (413, 201),
  (413, 204),
  (412, 208),
  (412, 211),
  (411, 215),
  (410, 223),
  (409, 227),
  (408, 231),
  (407, 235),
  (404, 249),
  (402, 258),
  (400, 268),
  (400, 278),
  (398, 287),
  (396, 298),
  (394, 308),
  (394, 314),
  (391, 337),
  (390, 343),
  (389, 349),
  (389, 360),
  (389
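The nesting described above (a letterform is a list of strokes, a stroke is a list of points) can be walked with plain loops. Here's a small sketch that counts strokes and points and measures the drawn length of each stroke. The `form` value is a tiny hand-made stand-in with the same shape as one entry of `chars`, not real data, and `stroke_length` is a helper invented for this example.

```python
import math

# Hypothetical letterform: two strokes, five points in total.
form = [
    [(0, 0), (3, 4), (3, 10)],  # first stroke
    [(10, 0), (10, 5)],         # second stroke
]

def stroke_length(stroke):
    """Sum of the distances between consecutive points in one stroke."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(stroke, stroke[1:])
    )

n_strokes = len(form)
n_points = sum(len(stroke) for stroke in form)
lengths = [stroke_length(stroke) for stroke in form]

print(n_strokes, n_points, lengths)
# 2 5 [11.0, 5.0]
```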

Dump to JSON:

In [236]:
with open("char74k.json", "w") as fh:
    json.dump(chars, fh)
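One thing to keep in mind when reading the file back: JSON has no tuple type, so each (x, y) tuple is written as a two-element array and comes back as a list. The sketch below round-trips a tiny hand-made dictionary (not the full dataset) through `json.dumps` and `json.loads`, then converts the points back to tuples in case you want them.

```python
import json

# Tiny stand-in: one character, one letterform, one stroke, two points.
data = {"t": [[[(300, 439), (330, 428)]]]}

loaded = json.loads(json.dumps(data))
print(loaded["t"][0][0][0])
# [300, 439]

# Optional: turn the points back into tuples.
restored = {
    ch: [[[tuple(pt) for pt in stroke] for stroke in form] for form in forms]
    for ch, forms in loaded.items()
}
print(restored == data)
# True
```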

## Normalized data

For my purposes, a problem with this dataset is that all of the characters are drawn at different places on the canvas. To fix this, the following cell creates a copy of the original dictionary and then "normalizes" each character by centering the coordinates on the point halfway between the X and Y extents of the character, so that the midpoint of the bounding box is at (0, 0).

In [230]:
import copy
import itertools
# Deep copy, so that centering the points doesn't also modify `chars`.
chars_normalized = {ch: copy.deepcopy(forms) for ch, forms in chars.items()}
for ch, forms in chars_normalized.items():
    for form in forms:
        # Bounding box over every point in every stroke of this letterform.
        xs = [item[0] for item in itertools.chain(*form)]
        ys = [item[1] for item in itertools.chain(*form)]
        centerx = (min(xs) + max(xs)) / 2
        centery = (min(ys) + max(ys)) / 2
        for stroke in form:
            for i, point in enumerate(stroke):
                stroke[i] = (point[0] - centerx, point[1] - centery)
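To check the centering on something small enough to verify by hand, here's a sketch that applies the same bounding-box arithmetic to a single made-up letterform. `center_form` is a helper written for this example; unlike the cell above, it returns a new list instead of modifying its input.

```python
import itertools

def center_form(form):
    """Return a copy of `form` shifted so its bounding-box midpoint is (0, 0)."""
    points = list(itertools.chain(*form))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    centerx = (min(xs) + max(xs)) / 2
    centery = (min(ys) + max(ys)) / 2
    return [[(x - centerx, y - centery) for x, y in stroke] for stroke in form]

# Hypothetical letterform: x runs from 10 to 30, y from 20 to 60,
# so the bounding-box midpoint is (20, 40).
form = [[(10, 20), (30, 20)], [(20, 20), (20, 60)]]
centered = center_form(form)
print(centered)
# [[(-10.0, -20.0), (10.0, -20.0)], [(0.0, -20.0), (0.0, 20.0)]]
```

The coordinates become floats because of the division by 2, which is also why the normalized output in this notebook shows values like `59.0`.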

In [232]:
chars_normalized['a'][50]

[[(59.0, -94.0),
  (34.0, -100.0),
  (32.0, -101.0),
  (29.0, -101.0),
  (27.0, -101.0),
  (24.0, -101.0),
  (22.0, -102.0),
  (17.0, -102.0),
  (15.0, -102.0),
  (12.0, -101.0),
  (8.0, -101.0),
  (5.0, -101.0),
  (1.0, -101.0),
  (-4.0, -101.0),
  (-8.0, -100.0),
  (-11.0, -100.0),
  (-13.0, -99.0),
  (-17.0, -98.0),
  (-19.0, -97.0),
  (-23.0, -95.0),
  (-24.0, -94.0),
  (-26.0, -93.0),
  (-30.0, -91.0),
  (-32.0, -90.0),
  (-33.0, -89.0),
  (-35.0, -88.0),
  (-37.0, -86.0),
  (-39.0, -85.0),
  (-40.0, -84.0),
  (-42.0, -83.0),
  (-44.0, -81.0),
  (-47.0, -78.0),
  (-48.0, -76.0),
  (-50.0, -74.0),
  (-51.0, -72.0),
  (-54.0, -69.0),
  (-56.0, -67.0),
  (-59.0, -64.0),
  (-61.0, -62.0),
  (-62.0, -60.0),
  (-64.0, -58.0),
  (-65.0, -57.0),
  (-68.0, -53.0),
  (-69.0, -52.0),
  (-70.0, -51.0),
  (-71.0, -49.0),
  (-73.0, -47.0),
  (-74.0, -45.0),
  (-76.0, -41.0),
  (-77.0, -38.0),
  (-79.0, -34.0),
  (-80.0, -32.0),
  (-81.0, -29.0),
  (-82.0, -24.0),
  (-83.0, -22.0),
  (-84.0, -19

Dump this to JSON as well:

In [233]:
with open("char74k-normalized.json", "w") as fh:
    json.dump(chars_normalized, fh)

## Works cited

T. E. de Campos, B. R. Babu and M. Varma. [Character recognition in natural images](http://personal.ee.surrey.ac.uk/Personal/T.Decampos/papers/decampos_etal_visapp2009.pdf). In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009. 