
Design matching algorithm #6

Open
Mitigooli opened this issue Jan 24, 2016 · 0 comments

We first implement matching between two entries. The result is a distance score for the pair: the sum of the per-attribute distances between the two entries. Two entries are considered equal if their distance score is 0, i.e. if all the attributes match.

The components are computed as follows: The distance between the values is the absolute value of their arithmetic difference.

The distance between the dates is the difference in days.
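A minimal sketch of these two components, assuming values are plain numbers and dates are `datetime.date` objects. The function names are hypothetical, and taking the absolute value of the day difference is an assumption made so that the distance is never negative.

```python
from datetime import date


def value_distance(a: float, b: float) -> float:
    """Absolute value of the arithmetic difference between two amounts."""
    return abs(a - b)


def date_distance(a: date, b: date) -> int:
    """Difference between two dates in whole days (absolute value, by assumption)."""
    return abs((a - b).days)


print(value_distance(12.5, 10.0))
print(date_distance(date(2016, 1, 24), date(2016, 1, 20)))
```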

The distance between the descriptions is a bit more complex. We compare them case-insensitively. If one is a substring of the other, the distance is 0.

It may be worth weighting the different components.

The next step is to match the entries in a left-hand-side list against the entries in a right-hand-side list. The result is a matching matrix whose rows are the entries of the left-hand-side list and whose columns are the entries of the right-hand-side list. The minimum of each row and of each column then gives the best matchings.
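The matrix idea could be sketched as follows. This is only one possible reading: the function names are hypothetical, the distance function is passed in as a parameter, and treating a pair as a match only when it is the minimum of both its row and its column is an assumption, not something the issue specifies. Both lists are assumed to be non-empty, and ties go to the first minimum found.

```python
def matching_matrix(left, right, distance):
    """Rows are left-hand-side entries, columns are right-hand-side entries."""
    return [[distance(l, r) for r in right] for l in left]


def row_minima(matrix):
    """For each row, the column index holding the smallest distance."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]


def column_minima(matrix):
    """For each column, the row index holding the smallest distance."""
    n_cols = len(matrix[0])
    return [min(range(len(matrix)), key=lambda i: matrix[i][j]) for j in range(n_cols)]


def mutual_best_matches(matrix):
    """Pairs (row, column) that are the minimum of both their row and their column."""
    rows, cols = row_minima(matrix), column_minima(matrix)
    return [(i, j) for i, j in enumerate(rows) if cols[j] == i]


# Toy example with numbers as entries and absolute difference as the distance.
left = [10, 20, 35]
right = [21, 9, 50]
matrix = matching_matrix(left, right, lambda a, b: abs(a - b))
print(matrix)
print(mutual_best_matches(matrix))
```

In this toy example the third left-hand-side entry is closest to the first column, but that column is closer to another row, so the entry is left unmatched.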

Suggestion for defining the matching score:

  • Order description, date.
  • The distance between descriptions is calculated as:
    • 0 if one string is a substring of the other (case insensitive),
    • 5 (a chosen threshold) otherwise.
  • The distance between dates is the absolute number of days between them.
  • The date and description distances are added together to give the final distance between two entries.
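The suggested score could look roughly like this. It is a sketch under assumptions: entries are represented as `(description, date)` tuples, and the names `DESCRIPTION_MISMATCH`, `description_distance` and `entry_distance` are hypothetical.

```python
from datetime import date

# Threshold taken from the suggestion above; the constant name is an assumption.
DESCRIPTION_MISMATCH = 5


def description_distance(a: str, b: str) -> int:
    """0 if one description contains the other (case insensitive), else the threshold."""
    a, b = a.lower(), b.lower()
    return 0 if a in b or b in a else DESCRIPTION_MISMATCH


def entry_distance(left, right) -> int:
    """Sum of the description distance and the absolute day difference.

    Each entry is assumed to be a (description, date) tuple.
    """
    (left_desc, left_date), (right_desc, right_date) = left, right
    return description_distance(left_desc, right_desc) + abs((left_date - right_date).days)


budget = ("Rent", date(2016, 1, 1))
bank = ("RENT January", date(2016, 1, 3))
print(entry_distance(budget, bank))
```

Note that with this rule an empty description is a substring of every other description, so it always gets a description distance of 0; an implementation may want to treat that case separately.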

Depending on how the matching algorithm is implemented, it might not be commutative, meaning that matching file A against B may give different results from matching file B against A. For a non-commutative implementation, it is suggested to match the budget file (which is more likely to be incomplete) against the bank file.
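To illustrate how this can happen, here is a small sketch of one non-commutative strategy, where each left-hand-side entry simply picks its nearest right-hand-side entry. The strategy, the function name `best_matches` and the toy data are all assumptions for illustration.

```python
def best_matches(left, right, distance):
    """Map each left index to the index of its nearest right entry (first one on ties)."""
    return {
        i: min(range(len(right)), key=lambda j: distance(l, right[j]))
        for i, l in enumerate(left)
    }


def absolute_difference(a, b):
    return abs(a - b)


# Toy data: plain numbers stand in for entries.
budget = [10, 12]
bank = [11]

forward = best_matches(budget, bank, absolute_difference)   # budget against bank
backward = best_matches(bank, budget, absolute_difference)  # bank against budget
print(forward)
print(backward)
```

Matching the budget against the bank pairs both budget entries with the single bank entry, while matching the bank against the budget produces only one pair, so the two directions do not give the same result.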
