Golden Record: Confidence Level Logic #616

Open
nicoprow opened this issue Nov 8, 2023 · 5 comments
Labels
epic This issue is an epic to other issues

Comments

@nicoprow
Contributor

nicoprow commented Nov 8, 2023

This is an issue for discussing the logic to calculate the confidence level of business partner records.

Description

Sharing members would like to see how good the quality of a given golden record is. Quality here means how confident the golden record process is that the data of that record is correct.

The confidence level should be calculated from three different factors:

  1. Whether the record has been checked against an official register
  2. Whether the record has been approved by the data owner
  3. The number of sharing members who have shared that record

The responsibility to create such a final confidence level lies with the cleaning service. However, the BPDM system should be able to hold the confidence level as well as provide enough information to calculate such a number.
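
As a rough illustration of what "enough information" could mean, the three factors might be held in a small structure like the one below. This is only a sketch: the class and field names are hypothetical and not taken from the BPDM data model, and no scoring formula is implied, since deriving the final confidence level is left to the cleaning service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceFactors:
    """Hypothetical container for the three factors named above."""

    checked_by_official_register: bool  # factor 1
    shared_by_owner: bool               # factor 2
    number_of_sharing_members: int      # factor 3

    def __post_init__(self) -> None:
        # A negative sharing count cannot occur, so reject it early.
        if self.number_of_sharing_members < 0:
            raise ValueError("number_of_sharing_members must not be negative")


factors = ConfidenceFactors(
    checked_by_official_register=True,
    shared_by_owner=False,
    number_of_sharing_members=3,
)
```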

Thoughts on Factors

  1. Business partners already contain information to search in official registers. This holds especially true if we require a minimal set of filled fields, such as country and name, before a record may enter the golden record process.
  2. Golden record tasks contain the owning company's BPNL if the sharing member flags the record as own company data. With this, cleaning services learn what the owning company's data looks like and can create a golden record based on this information. We don't plan to include a separate approval process for the data owner but expect that the resulting record closely matches the owner's record. Therefore, this confidence factor is fulfilled if the owner shared that business partner record.
  3. Since cleaning services should generally be unaware of who originally shared the record (the data owner being the exception), the number of members who share the same record is not easily calculable. For this I propose a more complex approach further down.

Dependencies

This story depends on creating the data model changes to include fields to store and show the confidence level: #631

@nicoprow added the epic label (This issue is an epic to other issues) on Nov 8, 2023
@nicoprow
Contributor Author

A solution for determining the overlap counter for point 3 could look like this:
Each gate could assign a business partner record a UUIDv3 which is calculated from the gate owner's BPNL and the BPNs of the record. This ID is passed to the Orchestrator and can be used by curation services to determine how many different sharing members have shared a record.
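
A minimal sketch of how a gate might derive such an ID with Python's standard `uuid` module. The namespace value, the function name `sharing_id`, the `|` separator and the example BPN strings are assumptions made for illustration; the thread does not specify them.

```python
import uuid

# Hypothetical namespace; any fixed UUID agreed on by all gates would do.
SHARING_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "bpdm.example")


def sharing_id(owner_bpnl: str, *bpns: str) -> uuid.UUID:
    """Derive a name-based UUIDv3 from the gate owner's BPNL and the record's BPNs.

    The BPNs are assumed to be passed in a fixed order (for example
    legal entity, site, address) so equal records give equal IDs.
    """
    name = "|".join((owner_bpnl, *bpns))
    return uuid.uuid3(SHARING_NAMESPACE, name)


# Two duplicates in the same gate with the same BPNs get the same ID.
first = sharing_id("BPNL-OWNER-1", "BPNL-X", "BPNA-Y")
duplicate = sharing_id("BPNL-OWNER-1", "BPNL-X", "BPNA-Y")
# The same record shared from another gate gets a different ID.
other_gate = sharing_id("BPNL-OWNER-2", "BPNL-X", "BPNA-Y")
```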

When the BPNs for a record change, a new UUIDv3 is generated and the count for the old BPNs has to be reduced. In order to make this information available to the curation services, each business partner record also contains a field for the previous unique ID, if one exists. This way the process can self-correct the overlap count on new updates.

One problem: The first time, before a BPN is assigned, we can't really create a meaningful UUIDv3. The count would only work after the "second round". Proposal: use a UUIDv4 to generate an entirely random ID that serves as a placeholder. It will be replaced on the next update, when a UUIDv3 can be generated.
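
The self-correcting count on the curation side could be sketched as follows. The class `OverlapCounter` and its method names are hypothetical, and the sketch assumes that every update carries the record's current ID, its previous ID (if any) and the BPN the curation service resolved.

```python
import uuid
from collections import defaultdict
from typing import Optional


class OverlapCounter:
    """Counts how many distinct sharing IDs point to each BPN."""

    def __init__(self) -> None:
        self._ids_by_bpn = defaultdict(set)
        self._bpn_by_id = {}

    def _remove(self, share_id: uuid.UUID) -> None:
        old_bpn = self._bpn_by_id.pop(share_id, None)
        if old_bpn is not None:
            self._ids_by_bpn[old_bpn].discard(share_id)

    def record_update(self, bpn: str, share_id: uuid.UUID,
                      previous_id: Optional[uuid.UUID] = None) -> None:
        # Drop the previous ID first so the old BPN's count goes down.
        if previous_id is not None:
            self._remove(previous_id)
        self._remove(share_id)
        self._bpn_by_id[share_id] = bpn
        self._ids_by_bpn[bpn].add(share_id)

    def count(self, bpn: str) -> int:
        return len(self._ids_by_bpn.get(bpn, ()))


counter = OverlapCounter()
placeholder = uuid.uuid4()  # first round: no BPN yet, so a random ID
counter.record_update("BPNL-X", placeholder)
# Second round: the name-based ID replaces the placeholder.
real_id = uuid.uuid3(uuid.NAMESPACE_URL, "BPNL-OWNER-1|BPNL-X")
counter.record_update("BPNL-X", real_id, previous_id=placeholder)
```

This sketch does not cover the case where several duplicates in one gate share an ID and only one of them changes its BPN; removing the previous ID would then also drop the count for the unchanged duplicates.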

@martinfkaeser
Contributor

I think there is an even simpler solution:
Each business partner record should be assigned a random UUID by the gate which never changes.

The cleaning service should keep a mapping between UUID and BPN.
The number of UUIDs that refer to a specific BPN is its sharing count.
If a record is changed by the gate in a way that leads to a new BPN, this automatically decreases the sharing count for the old BPN and increases the sharing count for the new BPN when calculated by the cleaning service.
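
A minimal sketch of this idea, assuming the cleaning service keeps the mapping as a plain dictionary from record UUID to BPN. The function name and the example values are hypothetical.

```python
from collections import Counter
from typing import Dict


def sharing_counts(bpn_by_record_uuid: Dict[str, str]) -> Counter:
    """Return, per BPN, the number of record UUIDs that refer to it."""
    return Counter(bpn_by_record_uuid.values())


# Mapping kept by the cleaning service: record UUID -> current BPN.
mapping = {
    "uuid-1": "BPNL-X",
    "uuid-2": "BPNL-X",
    "uuid-3": "BPNL-Y",
}
before = sharing_counts(mapping)

# The record behind uuid-2 changes and now resolves to another BPN.
# Overwriting the entry lowers one count and raises the other.
mapping["uuid-2"] = "BPNL-Y"
after = sharing_counts(mapping)
```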

@Sebastian-Wurm

Sebastian-Wurm commented Nov 23, 2023

> I think there is an even simpler solution: Each business partner record should be assigned a random UUID by the gate which never changes.
>
> The cleaning service should keep a mapping between UUID and BPN. The number of UUIDs that refer to a specific BPN is its sharing count. If a record is changed by the gate which leads to a new BPN this automatically decreases the sharing count for old BPN and increases the sharing count for the new BPN when calculated by the cleaning service.

Good approach but keep in mind that there are duplicates in the Gate, which link to the same BPNA. They are sometimes even intended. So the gate needs to keep track of these non-intended / intended duplicates and assign the same UUID for counting to all these duplicates.

@nicoprow
Contributor Author

> > I think there is an even simpler solution: Each business partner record should be assigned a random UUID by the gate which never changes.
> > The cleaning service should keep a mapping between UUID and BPN. The number of UUIDs that refer to a specific BPN is its sharing count. If a record is changed by the gate which leads to a new BPN this automatically decreases the sharing count for old BPN and increases the sharing count for the new BPN when calculated by the cleaning service.
>
> Good approach but keep in mind that there are duplicates in the Gate, which link to the same BPNA. They are sometimes even intended. So the gate needs to keep track of these non-intended / intended duplicates and assign the same UUID for counting to all these duplicates.

Exactly, such duplicates in the same Gate are the reason why I proposed UUIDv3 in the first place, as that naturally leads to the same UUID for records that are in the same Gate and have the same BPNs.

I don't think it's possible to avoid reassigning unique IDs and employing a self-correcting algorithm under these circumstances. But maybe someone has a simpler solution.

@martinfkaeser
Contributor

You're right, that's something I hadn't considered yet.

I could see a somewhat simpler variation:
The Gate generates two UUIDs per record and transmits them to the Orchestrator.
One is random and never changes - let's call it RecordUUID.
The other is a hash over the Gate owner's identifier and its current BPN, so this changes whenever the BPN changes - let's call it OwnerBpnHash.

So this works similarly to the initial proposal with the previous UUID, but tracks BPN changes relative to the RecordUUID. Let's discuss whether we prefer this or not.

Using the Gate owner's BPNL to calculate the hash carries a potential risk: it might be possible to recover the owner's BPNL from the hash by brute force, because the BPN is known and there is only a relatively small list of candidate owner BPNLs.
As a small improvement, we could use a private key per owner which is used only for this purpose.
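
One way this could look, as a sketch only: the RecordUUID is a random UUIDv4, and the OwnerBpnHash is a keyed hash of the current BPN under the owner's private key. Using HMAC-SHA256 here is an assumption, as are the function names and example values; the thread does not name a specific hash construction.

```python
import hashlib
import hmac
import secrets
import uuid


def new_record_uuid() -> uuid.UUID:
    """RecordUUID: random, assigned once per record and never changed."""
    return uuid.uuid4()


def owner_bpn_hash(owner_key: bytes, bpn: str) -> str:
    """OwnerBpnHash: keyed hash of the record's current BPN.

    Without the owner's key, the hash cannot be recomputed by trying
    the known list of owner BPNLs against a known BPN.
    """
    return hmac.new(owner_key, bpn.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-owner keys, generated once and kept inside each Gate.
key_owner_1 = secrets.token_bytes(32)
key_owner_2 = secrets.token_bytes(32)

record_uuid = new_record_uuid()
hash_before = owner_bpn_hash(key_owner_1, "BPNL-X")
hash_after = owner_bpn_hash(key_owner_1, "BPNL-Y")  # changes with the BPN
```

Records in the same Gate with the same BPN still produce the same OwnerBpnHash, so duplicates within one Gate can be counted once, while the RecordUUID stays stable across BPN changes.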

@nicoprow changed the title from "Golden Record: Confidence Level" to "Golden Record: Confidence Level Logic" on Dec 5, 2023
3 participants