# Heuristic Functions

A walkthrough of the heuristic functions I've created using Holiday's stroke error scoring and how to understand them.

The goal is to create a function that returns as high a score as possible for any given gene-archetype pairing without sacrificing speed. Aim to:
- Reduce calls to the Stylus API (meaning generate fewer stroke orders to score)
- Reduce time complexity (meaning reduce the number of potential stroke maps to iterate through)

In [1]:
import numpy as np

Get used to working with NumPy arrays instead of standard Python lists. When building heuristics, avoid iteration (especially nested for loops) as that makes time complexity worse. Instead, try to vectorize where you can. NumPy has plenty of functions to work with arrays that are faster than Python's built-in methods.
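To make the advice above concrete, here is a small sketch comparing a nested loop with its vectorized equivalent. The error values are made up for illustration and are not Stylus output.

```python
import numpy as np

# Hypothetical 3x3 error matrix (made-up numbers, not Stylus output).
errors = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
])

# Loop version: find the smallest error in each row one element at a time.
loop_minima = []
for row in errors:
    smallest = row[0]
    for value in row:
        if value < smallest:
            smallest = value
    loop_minima.append(smallest)

# Vectorized version: a single NumPy call does the same work.
vector_minima = errors.min(axis=1)

print(loop_minima)
print(vector_minima)
```

Both versions give the same row minima, but the vectorized one pushes the looping into NumPy's compiled code, which matters more as the arrays get larger.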

In [2]:
from score_strokes import strokeErrorMatrix
from xmlparse import loadGeometryBases, loadRef, getXmlScore, minXml

2024-06-24T21:10:14.888105Z [INFO ] Stylus initialized - Stylus 1.5.0 [RELEASE - May 21 2024 14:06:24] (c) 2006-2009 Biologic Institute


These are the functions from Holiday's code that my heuristics are built on. Make sure to read over the documentation in her repo, which is linked in the README of this repo.

When first importing you should see a message that says [INFO ] Stylus initialized - Stylus...
This message just means that Stylus is up and running correctly. Make sure that Stylus is configured correctly or else you will encounter errors.

**strokeErrorMatrix** generates an n×m matrix of stroke errors, where n is the number of archetype strokes and m is the number of gene strokes. I have found that for certain genes the function does not generate a matrix with an equal number of archetype strokes and gene strokes, so check the matrix's shape before assuming it is square. The smaller the stroke error, the closer the gene stroke is to the archetype stroke. When testing every viable permutation of stroke errors, I have found that at least one permutation matches the exhaustive score in nearly all cases.
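As a rough sketch of the shape check and the "smaller is closer" idea, the snippet below uses a made-up 3×3 array as a stand-in for the matrix that **strokeErrorMatrix** would return. The values and variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up stand-in for a strokeErrorMatrix result
# (rows = archetype strokes, columns = gene strokes).
error_matrix = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
])

# Check whether the archetype and gene stroke counts agree.
n_archetype, n_gene = error_matrix.shape
is_square = n_archetype == n_gene

# For each archetype stroke (row), the gene stroke (column) with the smallest error.
closest_gene = error_matrix.argmin(axis=1)

print(is_square, closest_gene)
```

Note that the per-row minima are not guaranteed to form a valid stroke map: two archetype strokes can pick the same gene stroke, which is why a heuristic still has to resolve conflicts.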

**loadGeometryBases** loads the gene data for each file in a specified directory.

**loadRef** loads the archetype data for a specific reference character.

**minXml** generates the XML data necessary for Stylus to score a gene.

**getXmlScore** calls the Stylus API and returns a score.

In [3]:
from pathlib import Path

In [4]:
ref_dir = f'{Path.home()}/Stylus_Scoring_Generalization/Reference' # archetype directory
data_dir = f'{Path.home()}/Stylus_Scoring_Generalization/NewGenes' # gene directory
ref_char = "4EFB"

These are your reference directory (where to find the archetype data) and your gene directory (where to find the gene data). You can find sample genes in Holiday's repo at Genes/sixgenes (keep in mind these sample genes all have six strokes). I changed the directory for my own purposes.

The reference character is the Unicode representation of your archetype that the genes will be scored against (see the Reference folder in Holiday's repo).

In [5]:
stroke_count = 6
stroke_map = np.empty(stroke_count, dtype=int)

Before getting into the actual heuristic algorithms it may be helpful to understand the fundamentals. The goal is to generate a stroke map, but what is a stroke map?

A stroke map is an array matching gene strokes to archetype strokes. Stylus isn't able to determine the best match between each gene stroke and each archetype stroke, so we have to do it ourselves. For example, say I'm attempting to match a six stroke gene to a six stroke archetype. Remember that array indices begin from 0. Let's say gene stroke 0 matches up best with archetype stroke 2. The gene stroke becomes the index and the archetype stroke becomes the value in the stroke map array. In this case, stroke_map[0] = 2.

In [6]:
stroke_map[0] = 2

Continue on and you might end up with something like this:

In [7]:
# pretend our heuristic generates this stroke map...
stroke_map[5] = 5
stroke_map[3] = 0
stroke_map[1] = 1
stroke_map[2] = 4
stroke_map[4] = 3

print(stroke_map)

[2 1 4 0 3 5]


Again, the gene strokes are the indices and the archetype strokes are the values. This can be quite confusing but there's no way around it.
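If the direction of the mapping is hard to keep straight, it may help to see that `np.argsort` flips a stroke map between "gene stroke as index" and "archetype stroke as index". This is a minimal sketch using the example stroke map from above; the variable names are mine.

```python
import numpy as np

# The example stroke map: index = gene stroke, value = archetype stroke.
gene_to_archetype = np.array([2, 1, 4, 0, 3, 5])

# argsort inverts a permutation: index = archetype stroke, value = gene stroke.
archetype_to_gene = np.argsort(gene_to_archetype)

# Inverting twice returns the original map.
round_trip = np.argsort(archetype_to_gene)

print(archetype_to_gene)
print(round_trip)
```

Here `archetype_to_gene[2]` is 0, which says the same thing as `gene_to_archetype[0] = 2`: gene stroke 0 is paired with archetype stroke 2.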

In **strokeErrorMatrix**, the rows are the archetype strokes and the columns are the gene strokes. Each row-column coordinate represents the error between that archetype stroke and that gene stroke. So matrix[2][0] represents the error between archetype stroke 2 and gene stroke 0.
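Putting the two conventions together, the total error of a stroke map can be read off the matrix in one indexing step. The sketch below assumes a made-up 3×3 error matrix and a hypothetical three-stroke map; neither comes from Stylus.

```python
import numpy as np

# Made-up error matrix: rows = archetype strokes, columns = gene strokes.
error_matrix = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
])

# Hypothetical stroke map: index = gene stroke, value = archetype stroke.
stroke_map = np.array([1, 0, 2])

# Row index is the archetype stroke, column index is the gene stroke.
gene_strokes = np.arange(len(stroke_map))
pair_errors = error_matrix[stroke_map, gene_strokes]
total_error = pair_errors.sum()

print(pair_errors, total_error)
```

The first entry of `pair_errors` is `error_matrix[1][0]`, the error between archetype stroke 1 and gene stroke 0, matching the row-column convention described above.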

In [8]:
from itertools import permutations

def heuristic_total(strokes, ref, p_strokes, p_ref):
    error_maps = strokeErrorMatrix(strokes, ref, p_strokes, p_ref) # Retrieve error matrix
    least = np.inf # Start at infinity so the first permutation's total always replaces it
    stroke_map = ()
    for priority in permutations(range(len(ref))): # Iterate over every permutation of stroke order
        s = np.sum(error_maps[np.arange(len(error_maps)), priority]) # Sum every error in this particular stroke order
        if s < least: # Check if the generated sum is smaller than the current sum stored
            least = s
            stroke_map = priority
    return np.argsort(stroke_map) # Swap indices and values

As an example, here's one of my heuristic functions. All of my heuristics are located in benchmark.py (which is in this repo). Specifically, this heuristic tries every possible pairing of gene strokes with archetype strokes, calculates the sum of the errors for each, and returns the pairing with the lowest sum. Because it checks all n! permutations for n strokes, it gets slow quickly as the stroke count grows.
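To get a feel for how much work an exhaustive search over permutations involves, this short sketch counts the orderings for a few stroke counts. The stroke counts other than six are chosen arbitrarily for illustration.

```python
from itertools import permutations
from math import factorial

# Number of stroke orderings an exhaustive search has to consider.
for n in (3, 6, 8, 10):
    print(n, "strokes:", factorial(n), "permutations")

# Sanity check: itertools.permutations really yields n! orderings for n = 6.
count_for_six = sum(1 for _ in permutations(range(6)))
print(count_for_six)
```

Six strokes means 720 sums per gene, which is manageable, but ten strokes already means over 3.6 million.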

In [9]:
from compare_genes import getScores

heuristic_scores, heuristic_alignments, marks = getScores(heuristic_total, ref_char, data_dir)
print("Here's one score", heuristic_scores[0])
print("And here's the stroke map that obtained this score", heuristic_alignments[0])

Here's one score 0.2009721093859917
And here's the stroke map that obtained this score [1 2 6 5 3 4]


Holiday's API makes it very simple to obtain scores given a certain heuristic algorithm. Any function with the correct call signature (which can be found in her documentation under **getScores**) is compatible with **getScores**.

But algorithms compatible with **getScores** must return a single stroke order. What if we wanted to score multiple stroke orders and compare them to find the best one?

The problem with doing this is that it likely increases the algorithm's time complexity, which is not ideal. But doing so may yield more accurate scores.
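One way to picture the idea is a loop that scores each candidate stroke map and keeps the best. The sketch below is not Holiday's API: `best_stroke_map` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a real Stylus call (which would go through **minXml** and **getXmlScore**).

```python
import numpy as np

def best_stroke_map(candidates, score_fn):
    """Return the candidate stroke map with the highest score, and that score."""
    best_map, best_score = None, -np.inf
    for candidate in candidates:
        score = score_fn(candidate)  # in practice, one Stylus call per candidate
        if score > best_score:
            best_map, best_score = candidate, score
    return best_map, best_score

def toy_score(stroke_map):
    """Hypothetical scorer: fraction of gene strokes mapped to the same-numbered archetype stroke."""
    stroke_map = np.asarray(stroke_map)
    return float(np.mean(stroke_map == np.arange(len(stroke_map))))

# Three hypothetical candidate stroke maps for a three-stroke gene.
candidates = [np.array([2, 1, 0]), np.array([0, 1, 2]), np.array([1, 0, 2])]
best_map, best_score = best_stroke_map(candidates, toy_score)

print(best_map, best_score)
```

Each extra candidate costs one more scoring call, so the number of candidates a heuristic produces is the trade-off between accuracy and speed.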

In [10]:
ref_geometry, ref_progress_percentage, output_size = loadRef(ref_char, ref_dir) # READ HOLIDAY'S DOCS!
g_data, han_chars, base_data, stroke_sets, stroke_orders, f_names = loadGeometryBases(data_dir, output_size) # READ HOLIDAY'S DOCS!