
Denmark: Handwriting Training


Sending Data for HWR

Sending to Denmark:

Create a CSV pairing each image name with its transcription. Put the images in a folder on the supercomputer (somewhere reasonable in fsl_groups) and tar them up (I’ve used a .tar.gz in the past). Then email Torben the file path to the tar file along with the CSV. You can either put the CSV on the supercomputer with the images (but not inside the tar file itself) or attach it to the email. Torben is the PhD student who built the HWR model and our main point of contact with Denmark. His email is: tsdj@sam.sdu.dk
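A minimal sketch of this packaging step in Python is below; all paths, folder names, and file names are placeholders, and the equivalent shell command (`tar -czf images.tar.gz images/`) works just as well:

```python
import csv
import tarfile
from pathlib import Path

# Placeholder paths -- substitute your actual fsl_groups locations.
image_dir = Path("/fsl_groups/my_group/hwr_batch_01/images")
csv_path = Path("/fsl_groups/my_group/hwr_batch_01/labels.csv")
tar_path = Path("/fsl_groups/my_group/hwr_batch_01/images.tar.gz")

# One row per image: the image file name and its transcription.
rows = [
    ("00001.jpg", "farmer"),
    ("00002.jpg", "seamstress"),
]
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "transcription"])
    writer.writerows(rows)

# Tar up the image folder so it can be referenced as a single path.
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(image_dir, arcname=image_dir.name)
```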

Other: You can also send to a BYU student who is good with HWR. The current one is Brian Robinson, who can be reached at: elderbrianrobinson@gmail.com

An email from Denmark

Thanks for reaching out - we certainly also appreciate all the work you do in the lab! Getting properly started on a collaboration such as this is indeed always a challenge, but I think we have done some great work already. I will try to outline the main components we need, but I also need to stress that our needs are quite task-dependent - they depend a lot on the purpose of the project and of the HTR step. As such, I will describe what is needed separately for different tasks. However, do note that even within tasks there is quite a lot of variation, depending on the underlying difficulty, which I will also try to highlight.

Sec. 1: Training a model for transcription with the aim of research afterwards

Ultimately, we would like to use the data post-transcription for some research project. Maybe we have transcribed occupation and would like to study occupational choices between geographic areas - or something completely different. In any case, these types of projects are characterized by the need for a high level of accuracy in our transcription. This is possible through the use of a sufficient number of (image, label) pairs, where "sufficient" depends a lot on the task. Suppose we are interested in transcribing only the gender of individuals from some census - this task is very easy, as there are only two options. As such, a relatively low number of observations is likely needed. If, on the other hand, we want to transcribe the names of individuals, a very challenging HTR task, we need far more observations to achieve the same accuracy. In all cases, we need the following data:

  1. Segmented images of the field(s) of interest. These should contain, e.g., the occupation of individuals, if that is what we are interested in transcribing. These may come in folders of raw images, such as .jpg files, or somewhat compressed, e.g. in .tar.gz files. If the structure is relatively stable, that is a plus - just to make everything a bit easier.
  2. Labels associated with the images (that is, a label associated with each image). In its simplest form, this can be a .csv file with two columns: the name of the image (including any path to the image, if the files are structured in, e.g., folders) and the label. The label can come in many forms, as this depends on the task. Some examples below for different tasks:
  • Names: If we are transcribing names, the label will likely be a string of characters of the name, such as ["Christian", "Logan", "Torben"], etc.
  • Occupation: This is trickier. The raw images may contain many different spellings of occupations that might in fact be the same, such as "farmworker" and "working on a farm". Or there might be occupations that aren't the same, but which we would like to treat as the same (if, e.g., we are only interested in occupational sectors and not specific occupations). The question is then: What to do? Here, we typically employ a pragmatic approach: Use general preprocessing, and then perform any additional "rewiring" later. This is discussed in Step 3 below.
  • Sex: Sometimes the label space is small, such as only two categories (male/female). In such cases, once again using the raw coding is just fine, whether written as ["male", "female"] or perhaps abbreviations such as ["M", "F"].
  3. Lexicon/dictionary to "map" the labels from the raw label space to what we are interested in for a given project. I think this is the step where we have perhaps not been particularly good at describing our needs. Sometimes, this is not needed. For example, if the goal is to transcribe names, having you provide a list of names is all we need - there is perhaps no specific lexicon of all the names available. However, in many cases a lexicon is crucial. For example, suppose there are five different levels of education we are interested in, but you provide labels for education with far more categories - perhaps due to different spellings in the raw data of the same education, perhaps due to typos in transcription, or perhaps due to there truly being more categories that we aggregate across for this project (perhaps we do not want to distinguish between someone holding a "bachelor" versus a "master", but rather categorize both of them as "university educated"). What we need is a file, such as a .csv file, with two columns: the first column should contain all unique categories in the label files, and the second should contain the category of interest for the specific project. In that education example, there would be one row for "bachelor" and one row for "master", where the second column in both cases should be "university educated" (see the sketch after this list).
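To make the lexicon format concrete, here is a small hypothetical sketch in Python; the categories are invented for illustration, and in practice the lexicon is simply a two-column .csv file:

```python
import csv

# Toy lexicon: first column is every unique raw label,
# second column is the category used for this specific project.
lexicon_rows = [
    ("bachelor", "university educated"),
    ("master", "university educated"),
    ("farmworker", "farming"),
    ("working on a farm", "farming"),
]
with open("lexicon.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["raw_label", "project_category"])
    writer.writerows(lexicon_rows)

# Applying the lexicon maps raw labels onto the categories of interest.
with open("lexicon.csv", newline="") as f:
    lexicon = {row["raw_label"]: row["project_category"]
               for row in csv.DictReader(f)}

print(lexicon["master"])  # -> university educated
```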

What we have sometimes lacked, which has made it difficult on our end, is the lexicon. When we receive (image, label) pairs but no explanation of what the labels represent, it is not possible for us to do much - especially if the labels contain misspellings or categories that are in fact identical for the purpose of the specific project. Without this knowledge, building models becomes impossible.

In relation to the number of samples needed, this depends a lot on the underlying difficulty of the task. This difficulty can, roughly, be partitioned into two elements: the quality of the handwriting, depending both on image quality and inherent text clarity, and the outcome space - i.e. predicting whether someone is male or female (sample space of 2) is far easier than transcribing names (sample space enormously large, as a ten-letter name can in principle belong to (alphabet size)^10 different categories; with 26 letters, that is 26^10 ≈ 1.4 × 10^14). If both elements are on the easier side, as few as perhaps 1,000 or a few thousand (image, label) pairs can often result in very high performance, perhaps sufficient for creating data ready for research. If both elements are difficult, hundreds of thousands of (image, label) pairs might be needed. In the most advanced ML research papers, the best performance is only reached with hundreds of millions of (image, label) pairs - more data is always better. In all cases, verifying that the labels are correct provides large benefits. If a rare label here or there is wrong (i.e. states something other than what the image shows), this is not catastrophic, but it is very helpful (and matters a lot for the performance of the model) that the labels have been corrected for typos etc., whether directly or through the lexicon.

Sec. 2: Training a model for transcription with the aim of reverse indexing (or similar) afterwards

Sometimes, performance/accuracy need not be particularly high, as the purpose of transcription is not to use the data directly for a research project but, e.g., to kickstart labelling (i.e. reverse indexing). In such cases, we need the same information as in Sec. 1, but fewer (image, label) pairs suffice. We still need a lexicon, to understand the underlying data, but in some cases we can kickstart such a project with relatively few (image, label) pairs - perhaps as few as 1,000. Note, however, that the same considerations as above apply with respect to the difficulty: we probably (or rather, certainly) cannot successfully kickstart transcription of names on low-quality images with only 1,000 (image, label) pairs. Further, as is always the case with ML/AI, the more (image, label) pairs we receive, the better the model we can train. Even if we can kickstart with only, say, 1,000 (image, label) pairs, labelling, for example, 5,000 might be much better.

Sec. 3: Using a trained model for transcription on images with no labels

In all projects (i.e. both Sec. 1 and 2), we ultimately want to transcribe images with no labels. After all, we would otherwise not need to build an HTR model. Here, we only need images - labels obviously don't exist, and we already have the lexicon, which we will have received whichever of Sec. 1 or 2 applied when we trained the model. However, even if this is relatively simple to do (i.e. we just need images), some caution with respect to the results is warranted. This is due to possible domain shift between the training data and the new images. I will outline two "extremes" below, and then discuss what lies in between:

  1. No domain shift: In some cases, the data we train on is similar to the new images we would like to transcribe. This is optimal: in this case, we can trust that our model achieves the same accuracy as we estimated when we trained it, and it is in these cases that we obtain the best performance. An example of such a case is if you have transcribed a random subset of, e.g., names from some specific census, and we then transcribe the names for all other individuals in the same census using our trained model.
  2. Significant domain shift: Domain shift, roughly, happens for two reasons: when we transcribe something different in type, and when we transcribe something from a different data source. Suppose we have trained a model to transcribe educational status in the 1860 census, but then want to use it to transcribe the names of individuals in the 1940 census. This will not work to any degree - perhaps not unexpectedly, as both the type and the data source are different. Suppose now that we have trained a model to transcribe names in the 1860 census and want to use it to transcribe names in the 1940 census. This is better - only the data source has changed, not the type of data; we still transcribe names. Although this will likely work to some degree, we cannot expect the performance to be as high as on the original task (the 1860 census). As such, for such tasks we might want at least a small number of (image, label) pairs, to get an idea about the accuracy of our model on the new data source (a minimal sketch of such a check follows below). This way, we know whether the accuracy is high enough to make it useful or not.
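As a rough illustration of that last point, a small labeled sample from the new source lets us estimate accuracy directly. A minimal sketch, where `model_transcribe` and the holdout file name are hypothetical stand-ins for whatever trained model and labeled sample are actually used:

```python
import csv

def model_transcribe(image_path: str) -> str:
    """Placeholder for the trained HTR model's prediction on one image."""
    raise NotImplementedError

# Small labeled sample from the new data source, e.g. the 1940 census.
with open("holdout_1940.csv", newline="") as f:
    pairs = [(row["image"], row["label"]) for row in csv.DictReader(f)]

# Exact-match accuracy on the holdout indicates whether the domain
# shift is mild enough for the model to remain useful.
correct = sum(model_transcribe(img) == label for img, label in pairs)
print(f"Accuracy on new source: {correct / len(pairs):.1%}")
```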

In reality, we often find ourselves with some degree of domain shift. Even when we transcribe the same column from the same census, the (image, label) pairs we get for training might oversample easy cases - after all, you cannot provide the label if it is not possible to read the text in the image! However, with random sampling and careful labelling, this will often not be particularly problematic.

Conclusion

Phew, that ended up being rather long! I hope it is "digestible", although I realize it might be a lot to take in. I hope this can serve as, at least, a starting point for understanding our process. Please do not hesitate to reach out in case anything is unclear - whether now or later. Further, if you think a meeting would be helpful, we can always plan one - whether just the two of us or with others as well. Perhaps we can, using this as a starting point, write some document with everything explained - this might also be helpful for you whenever new people join your team? You might already have something of this sort, which we can build this "guide" on? Perhaps we could work together to create an example of the needed information, with explanations - that is, a "toy example" with (image, label) pairs and an accompanying lexicon - this might be helpful for some?

As a final remark, there are occasional exceptions to the above. Sometimes, for example, we might want to incorporate additional information, in which case we might also request that. However, I still believe the above is helpful as a "rule of thumb" and general enough to encompass most cases.