question if you know of a local_alignment implementation on sequences of letters #3

jwijffels · 2020-03-11T13:23:52Z

Thanks for releasing this package. I'm using it to match names of 18th-19th century persons in Bruges, as well as street addresses from medieval documents (extracted from image scans and from some OCR-ed images/documents), against an existing set of street addresses and person names.

About the functions to find similarities: do you know whether something like textreuse::align_local exists that works on a sequence of characters instead of a sequence of words, and that returns a similarity metric instead of a distance metric?

Adding @lmullen (author of textreuse), @markvanderloo (author of stringdist) and @djvanderlaan (author of reclin) just in case someone has pointers.
Thanks for any feedback.

djvanderlaan · 2020-03-11T13:40:25Z

I have some difficulty understanding exactly what the align_local function does, but it seems quite similar to the longest common substring distance, which is implemented in stringdist. Since that distance has a maximum value for a given pair of strings (https://journal.r-project.org/archive/2014-1/loo.pdf), it can be translated into a similarity score.
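To make the distance-to-similarity translation concrete, here is a minimal sketch. It is written in Python rather than R only to stay dependency-free, and it is not stringdist's code. It assumes the distance counts the characters left unpaired after matching characters in order, so its maximum for a pair is the sum of the two lengths; the function names are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest in-order pairing of characters of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a, b):
    """Number of characters in a and b that remain unpaired (assumed definition)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def lcs_similarity(a, b):
    """Rescale the distance by its maximum, len(a) + len(b), into [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - lcs_distance(a, b) / total


if __name__ == "__main__":
    print(lcs_distance("leia", "leela"))
    print(lcs_similarity("waalstraat", "waalstr."))
```

Because the maximum depends only on the two lengths, the rescaled value is comparable across short and long strings.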

djvanderlaan · 2020-03-11T13:51:28Z

For a linkage project I did quite a while back, I ran into the problem that string distances for some Dutch street names can be quite large. For example:

> stringdist('burgemeester ferdinand van de waalstraat', 'burg. vd waalstr.')
[1] 25
> stringdist('burgemeester ferdinand van de waalstraat', 'burg. vd waalstr.', method = "jw")
[1] 0.393175

I don't know whether this is similar to the problem you are running into. I first tried to solve it by standardising terms such as burgemeester/burg. and straat/str., which in this case solves most of the problem. Second, I tried matching the words to each other using similarity: in this case matching burg to burgemeester, waalstr to waalstraat, and vd probably to de. As the number of words is generally small, this was done using brute force. The total similarity was then the average similarity of the matched words (ignoring words that could not be matched). This helped.
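The word-matching step described above could be sketched roughly as follows. This is not the code from that linkage project: the word similarity (Python's difflib ratio), the `min_sim` cut-off of 0.4, and the function names are all assumptions chosen for illustration. The brute force tries every assignment of the shorter word list onto the longer one, which is only feasible because addresses have few words.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_sim(a, b):
    """Stand-in word similarity in [0, 1]; any string similarity would do."""
    return SequenceMatcher(None, a, b).ratio()


def matched_word_similarity(s, t, min_sim=0.4):
    """Average similarity of the best one-to-one word matching of s and t.

    Word pairs scoring below min_sim count as 'could not be matched'
    and are left out of the average.
    """
    short, long_ = sorted((s.split(), t.split()), key=len)
    if not short:
        return 0.0
    best_total, best_sims = -1.0, []
    # Brute force: try each ordered selection of words from the longer list.
    for candidate in permutations(long_, len(short)):
        sims = [word_sim(a, b) for a, b in zip(short, candidate)]
        if sum(sims) > best_total:
            best_total, best_sims = sum(sims), sims
    kept = [x for x in best_sims if x >= min_sim]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    full = "burgemeester ferdinand van de waalstraat"
    print(matched_word_similarity(full, "burg. vd waalstr."))
```

Standardising abbreviations first (burg. to burgemeester, str. to straat) would raise the per-word similarities before this step runs.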

jwijffels · 2020-03-11T13:59:49Z

textreuse::align_local uses the Smith-Waterman algorithm (https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm), but applies it to sequences of words instead of sequences of ACGT letters.
I'm just looking for an implementation that works on sequences of characters.
The distance coming out of textreuse::align_local depends a lot on length, so its range differs depending on the size of the two person names or streets being aligned. The logic of Smith-Waterman is not the longest common substring, but rather a sequence of letters/words in which some letters/words are allowed to differ.
I'm looking into it mainly so that OCR errors and transcription errors from image-to-text conversion are tolerated.
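For reference, a character-level Smith-Waterman can be sketched in a few lines. This is a minimal illustration and not the implementation of textreuse or of any other package: the scores (match 2, mismatch -1, gap -1) are arbitrary defaults, and dividing by the best score the shorter string could reach is just one assumed way to remove the length dependence mentioned above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def sw_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Score divided by the maximum the shorter string allows (assumed scaling)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    score, _, _ = smith_waterman(a, b, match, mismatch, gap)
    return score / (match * shortest)


if __name__ == "__main__":
    # "1" for "l" imitates a typical OCR confusion.
    print(smith_waterman("waalstraat", "waa1straat"))
    print(sw_similarity("waalstraat", "waa1straat"))
```

A single misread character costs one mismatch instead of breaking the alignment, which is the tolerance for OCR and transcription errors asked for here.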

Thanks for the input on these abbreviations. The problem I'm having with medieval and 18th/19th century texts is that names were not very standardised at that time.

jwijffels · 2020-03-12T08:56:33Z

Thanks for the input, Jan. Closing, as it looks like an equivalent of textreuse::align_local that works on sequences of letters and acts as a similarity metric does not exist. I will implement it myself.

djvanderlaan · 2020-03-12T09:06:11Z

OK, good luck. Once you have an implementation, I can imagine @markvanderloo might be interested in including the measure in the stringdist package.

jwijffels · 2020-04-16T08:00:38Z

FYI.
I've made an implementation. The source code is here: https://github.com/DIGI-VUB/text.alignment
It's on CRAN: https://cran.r-project.org/web/packages/text.alignment/index.html

jwijffels closed this as completed Mar 12, 2020
