question if you know of a local_alignment implementation on sequences of letters #3

Closed
jwijffels opened this issue Mar 11, 2020 · 6 comments

Comments

@jwijffels

jwijffels commented Mar 11, 2020

Thanks for releasing this package. I'm using it to match names of 18th-19th century persons in Bruges, as well as street addresses from medieval documents (extracted from image scans and from OCR-ed images/documents), against an existing set of street addresses and person names.

About the functions to find similarities: do you know if something like textreuse::align_local exists that works on a sequence of characters instead of a sequence of words, and that returns a similarity metric instead of a distance metric?

Adding @lmullen (author of textreuse), @markvanderloo (author of stringdist) and @djvanderlaan (author of reclin) just in case someone has pointers.
thanks for any feedback

@djvanderlaan
Owner

I have some difficulty understanding exactly what the align_local function does, but it seems quite similar to longest common substring, which is implemented in stringdist. Since that distance has a maximum value for a given pair of strings (https://journal.r-project.org/archive/2014-1/loo.pdf), it can be translated into a similarity score.
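
A minimal sketch of that translation, assuming the stringdist package is installed. The helper name lcs_similarity is hypothetical and not part of stringdist. It relies on the lcs distance counting unpaired characters, so its maximum for a pair of strings is the sum of their lengths.

```r
library(stringdist)

# Hypothetical helper: rescale the lcs distance to a similarity in [0, 1].
lcs_similarity <- function(a, b) {
  d <- stringdist(a, b, method = "lcs")  # number of unpaired characters
  max_d <- nchar(a) + nchar(b)           # reached when no characters pair up
  # Two empty strings are treated as identical (an assumption of this sketch).
  ifelse(max_d == 0, 1, 1 - d / max_d)
}

lcs_similarity("waalstraat", "waalstr.")
```

A value of 1 means the strings are identical and 0 means they share no characters in order.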

@djvanderlaan
Owner

For a linkage project I did quite a while back, I ran into the problem that string distances for some Dutch street names can be quite large. For example:

> stringdist('burgemeester ferdinand van de waalstraat', 'burg. vd waalstr.')
[1] 25
> stringdist('burgemeester ferdinand van de waalstraat', 'burg. vd waalstr.', method = "jw")
[1] 0.393175

I don't know if this is similar to the problem you are running into. I tried to solve this first by standardising terms such as burgemeester/burg. and straat/str, which in this case solves most of the problem. Second, I tried matching words to each other using similarity: in this case burg to burgemeester, waalstr to waalstraat and vd probably to de. As the number of words is generally small, this was done using brute force. The total similarity was then the average similarity of the matched words (ignoring words that could not be matched). This helped.
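
A rough sketch of the word-matching idea, assuming the stringdist package is installed. This is a simplified best-match-per-word variant, not the brute-force assignment described above: each word of the shorter string takes its most similar word in the other string, and two words may pick the same partner. The function name word_match_similarity, the threshold min_sim and the choice of Jaro-Winkler with p = 0.1 are all assumptions made for illustration.

```r
library(stringdist)

# Hypothetical helper: average similarity of best-matching words.
word_match_similarity <- function(a, b, min_sim = 0.5) {
  wa <- strsplit(tolower(a), "[[:space:]]+")[[1]]
  wb <- strsplit(tolower(b), "[[:space:]]+")[[1]]
  if (length(wa) == 0 || length(wb) == 0) return(NA_real_)
  # Let wa be the string with fewer words.
  if (length(wa) > length(wb)) {
    tmp <- wa; wa <- wb; wb <- tmp
  }
  # Pairwise word similarities (rows: words of wa, columns: words of wb).
  sims <- outer(wa, wb, function(x, y) stringsim(x, y, method = "jw", p = 0.1))
  best <- apply(sims, 1, max)
  # Words without a good enough partner are ignored.
  matched <- best[best >= min_sim]
  if (length(matched) == 0) return(0)
  mean(matched)
}

word_match_similarity("burgemeester ferdinand van de waalstraat",
                      "burg. vd waalstr.")
```

Jaro-Winkler is used here because it rewards shared prefixes, which suits abbreviations such as burg. and waalstr.; other methods can be swapped in through the stringsim call.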

@jwijffels
Author

jwijffels commented Mar 11, 2020

textreuse::align_local uses the Smith-Waterman algorithm (https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm), but instead of aligning sequences of ACGT letters it aligns sequences of words.
I'm just looking for an implementation based on sequences of characters.
The distance coming out of textreuse::align_local depends a lot on the length of the inputs, which gives different ranges of distances depending on the size of the two persons/streets to align. The logic of Smith-Waterman is not the longest common substring but rather a sequence of letters/words in which some letters/words can differ.
I'm basically looking into it to make sure that errors from OCR or from transcription when going from image to text are tolerated.
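
For illustration, a minimal character-level Smith-Waterman sketch in base R. It is not the textreuse implementation. The scoring values (match = 2, mismatch = -1, gap = -1) and the normalisation by the best score the shorter string could reach are assumptions, chosen so that the result lands in [0, 1] regardless of string length.

```r
# Best local alignment score between two strings, character by character.
sw_score <- function(a, b, match = 2, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # H[i + 1, j + 1] holds the best score of an alignment ending at x[i], y[j].
  H <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      H[i + 1, j + 1] <- max(0,                   # start a new local alignment
                             H[i, j] + s,         # match or mismatch
                             H[i, j + 1] + gap,   # gap in y
                             H[i + 1, j] + gap)   # gap in x
    }
  }
  max(H)
}

# Hypothetical normalisation: divide by the score of a perfect match
# covering the whole of the shorter string.
sw_similarity <- function(a, b, match = 2, ...) {
  denom <- match * min(nchar(a), nchar(b))
  if (denom == 0) return(0)
  sw_score(a, b, match = match, ...) / denom
}

sw_similarity("waalstraat", "waa1straat")  # one OCR-style substitution
```

Unlike longest common substring, the single wrong character only costs a mismatch penalty and the alignment continues across it.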

Thanks for the input on these abbreviations. The problem I'm having with medieval and 18th/19th-century texts is that names were not very standardised at that time.

@jwijffels
Author

Thanks for the input, Jan. Closing, as it looks like an equivalent of textreuse::align_local that works as a similarity metric on sequences of letters does not exist. I will implement it myself.

@djvanderlaan
Owner

Ok. Good luck. If you have an implementation, I can imagine @markvanderloo might be interested in including the measure in the stringdist package.

@jwijffels
Author

FYI.
I've made an implementation. The source code is here: https://github.com/DIGI-VUB/text.alignment
It's on CRAN: https://cran.r-project.org/web/packages/text.alignment/index.html
