A MrzParseException is thrown when the date fields are not parseable #15

Closed
GoUpNorth opened this issue Sep 14, 2017 · 12 comments

Comments

@GoUpNorth
Contributor

GoUpNorth commented Sep 14, 2017

When parsing the following MRZ, mrz-java throws an MrzParseException and stops the parsing:
"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<"
"9250764733GBRBB09157F2007162<<<<<<<<<<<<<<08" unparseable date of birth

"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<"
"9250764733GBR8809117F20HH162<<<<<<<<<<<<<<08" unparseable date of expiry

I suggest that the library continue parsing, keep track of the raw text that was unparseable, and return an MrzModel to the caller.

@GoUpNorth GoUpNorth changed the title A MrzParserException is thrown when the date fields are not parseable A MrzParseException is thrown when the date fields are not parseable Sep 14, 2017
@ZsBT
Owner

ZsBT commented Sep 14, 2017

Well, I don't think this issue is within the scope of this project. The root cause is a faulty upstream OCR step, e.g. the digit "8" is often recognized as "B". I recommend the following procedure, which works in my environment:
When parsing fails, replace the unstable characters (8<=>B, M<=>H, 2<=>Z etc.) and retry for as long as an exception is thrown or check digit validation fails.
For certain reasons, precise name parsing is also important: imagine the situation where the above-mentioned name appears as SPEO1HEN ANGEL4 20E.
Of course, this could be handled within this project. However, it requires massive work: we'd need a "corrector" class that tries to parse the input string several times until every field's check digit is valid (keeping in mind that the check digit itself can also be misrecognized).
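
A minimal sketch of one such correction pass, limited to a single date field, might look like the following. The class name MrzDateCorrector, the substitution table (B to 8 and Z to 2 come from the list above, O to 0 is an added assumption) and the sample input "BB0911" are hypothetical and not part of mrz-java. The check digit follows the ICAO 9303 weighting 7, 3, 1.

```java
import java.util.Map;

// Hypothetical helper, not part of mrz-java: one correction pass over a date field.
final class MrzDateCorrector {
    // Assumed OCR confusions (letter read in place of a digit).
    private static final Map<Character, Character> LETTER_TO_DIGIT =
            Map.of('B', '8', 'Z', '2', 'O', '0');

    // ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10.
    static int checkDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value;
            if (c >= '0' && c <= '9') {
                value = c - '0';
            } else if (c >= 'A' && c <= 'Z') {
                value = c - 'A' + 10;
            } else {
                value = 0; // filler character '<'
            }
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    // Returns the corrected date if the substitutions yield six digits that
    // match the expected check digit, otherwise null.
    static String correctDate(String rawDate, char expectedCheckDigit) {
        StringBuilder candidate = new StringBuilder();
        for (char c : rawDate.toCharArray()) {
            char mapped = LETTER_TO_DIGIT.getOrDefault(c, c);
            candidate.append(mapped);
        }
        String corrected = candidate.toString();
        boolean allDigits = corrected.length() == 6
                && corrected.chars().allMatch(ch -> ch >= '0' && ch <= '9');
        boolean checkOk = expectedCheckDigit >= '0' && expectedCheckDigit <= '9'
                && checkDigit(corrected) == expectedCheckDigit - '0';
        return (allDigits && checkOk) ? corrected : null;
    }
}
```

With these assumptions, "BB0911" with check digit 7 is corrected to "880911", while "20HH16" stays uncorrected because H has no digit mapping in this table. A real corrector would also have to consider that the check digit itself may be misread.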

@jaaufauvre

jaaufauvre commented Sep 14, 2017

The example above ("BB0915") is a particular case where indeed the caller can implement some rules as you explained (8 <=> B, ...), and call the library again with a "better" MRZ.

But there will still be cases where you can't fix the MRZ completely.

In those cases you won't have any information at all, because the library has stopped everything and thrown an exception without any result.

You could just return null objects instead.

Looking at the code, it seems this is already the case sometimes (partial results instead of an exception), so it could be extended to all fields and all causes of parsing errors.

Unless you want the library to be a parser of valid MRZ only.

Alex.

@ZsBT
Owner

ZsBT commented Sep 14, 2017

Indeed, an "Exception without any result" does not help uncover the problematic part of the MRZ line.
What are the ideas for the behavior when the date is totally unparseable? Leaving that property null?
That sounds reasonable. Should you have a solution already, I am always open to pull requests (please use the dev branch).

In the meantime, since we are talking about error handling, I am getting closer to planning that "corrector" class.

@GoUpNorth
Contributor Author

I will make a pull request that leaves the property null if the date parsing fails.

We could also set the day, month or year property of the date to -1 when parsing fails specifically on that element, and still try to parse the other elements of the date. That way the MrzModel could be returned with a partially parsed date.
For the MRZ formatted date "BB0915", it would give something like this:
model.dateOfBirth.year = -1
model.dateOfBirth.month = 9
model.dateOfBirth.day = 15
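
A rough sketch of this per-element parsing, under the assumption that each two-character part is parsed independently and range-checked, could look like this. The class PartialMrzDate and its members are hypothetical and do not reflect the actual MrzDate implementation.

```java
// Hypothetical illustration, not the mrz-java MrzDate class.
final class PartialMrzDate {
    final int year;   // two-digit year, or -1 if unparseable
    final int month;  // 1..12, or -1 if unparseable
    final int day;    // 1..31, or -1 if unparseable
    final String raw; // original six-character field, kept for diagnostics

    private PartialMrzDate(int year, int month, int day, String raw) {
        this.year = year;
        this.month = month;
        this.day = day;
        this.raw = raw;
    }

    // Parses "yymmdd", falling back to -1 for each part that fails.
    static PartialMrzDate parse(String raw) {
        return new PartialMrzDate(
                parsePart(raw, 0, 0, 99),
                parsePart(raw, 2, 1, 12),
                parsePart(raw, 4, 1, 31),
                raw);
    }

    private static int parsePart(String raw, int start, int min, int max) {
        if (raw == null || raw.length() < start + 2) {
            return -1;
        }
        char tens = raw.charAt(start);
        char ones = raw.charAt(start + 1);
        if (tens < '0' || tens > '9' || ones < '0' || ones > '9') {
            return -1;
        }
        int value = (tens - '0') * 10 + (ones - '0');
        return (value < min || value > max) ? -1 : value;
    }

    boolean isFullyParsed() {
        return year >= 0 && month >= 0 && day >= 0;
    }
}
```

Here PartialMrzDate.parse("BB0915") yields year -1, month 9 and day 15, and isFullyParsed() returns false, so the caller still receives the usable parts.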

@ZsBT
Owner

ZsBT commented Sep 15, 2017

Good workaround. With this, we can get as much data as possible. Can I also ask you to take care of the validity flags, i.e. to set all applicable ones (even the overall one) to false? That helps users of the code know that something is wrong with the MRZ lines.

@GoUpNorth
Contributor Author

What do you mean by "all the applicable validity flags"? If the date is not parseable because of some OCR failure, the check digit calculation will fail anyway, no?

@ZsBT
Owner

ZsBT commented Sep 16, 2017 via email

@jaaufauvre

jaaufauvre commented Sep 18, 2017

Hi,

Would it be possible to distinguish between the check digit verification result and whether the field can actually be parsed?

Indeed, the check digit can be correct while the date is still unparseable (I have some examples like this where the MRZ comes from fraudulent passports).

The names of the 4 booleans in the MrzRecord class are too ambiguous in my opinion. They should be named e.g. "validDateOfBirthCheckDigit" instead of "validDateOfBirth".

The MrzDate could also have a boolean coming along with the year, month and day to indicate if the string was actually valid or not.
[Edit] I just saw it's already the case in the development branch ("isValidDate").
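
To illustrate the distinction, here is a small sketch that reports the two results separately. The class DateFieldStatus and its field names are hypothetical, not the mrz-java API. The check digit uses the ICAO 9303 weights 7, 3, 1, and "parseable" is assumed to mean six digits with a month in 1..12 and a day in 1..31.

```java
// Hypothetical illustration: check digit validity and parseability as two flags.
final class DateFieldStatus {
    final boolean checkDigitValid;
    final boolean parseable;

    private DateFieldStatus(boolean checkDigitValid, boolean parseable) {
        this.checkDigitValid = checkDigitValid;
        this.parseable = parseable;
    }

    static DateFieldStatus evaluate(String rawDate, char checkDigit) {
        boolean digitOk = checkDigit >= '0' && checkDigit <= '9'
                && computeCheckDigit(rawDate) == checkDigit - '0';
        return new DateFieldStatus(digitOk, isParseable(rawDate));
    }

    // ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10.
    static int computeCheckDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value = 0; // filler '<' and anything else count as 0
            if (c >= '0' && c <= '9') {
                value = c - '0';
            } else if (c >= 'A' && c <= 'Z') {
                value = c - 'A' + 10;
            }
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    // Assumed definition of "parseable": yymmdd digits with plausible month and day.
    static boolean isParseable(String rawDate) {
        if (rawDate == null || !rawDate.matches("[0-9]{6}")) {
            return false;
        }
        int month = Integer.parseInt(rawDate.substring(2, 4));
        int day = Integer.parseInt(rawDate.substring(4, 6));
        return month >= 1 && month <= 12 && day >= 1 && day <= 31;
    }
}
```

For instance, the field "651502" with check digit 5 passes the check digit test (the weighted sum is 95) but is not a parseable date, because 15 is not a valid month.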

Best regards,
Alex.

@GoUpNorth
Contributor Author

@Alex-D14
There is already a flag in the MrzDate to indicate whether the date is valid or not.
The problem is that the date validity booleans have an effect on the checkDigits boolean.

@GoUpNorth
Contributor Author

@ZsBT
Since the MrzDate year, month and day can be set to -1 in case of an unparseable date, we have to change the behaviour of the MrzDate.toMrz() function.

Right now it just formats the date properties as an MRZ date ("yymmdd"). But if the original date was not parseable, for example "651502", MrzDate.toMrz() will return "65-102", which doesn't really make sense.
It should be able to return the original date even if it is not a valid MRZ date. What do you think?
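
One possible shape for this, sketched under the assumption that the raw six-character string is stored next to the parsed values, is shown below. RawPreservingDate is a hypothetical stand-in for MrzDate, not its actual code.

```java
// Hypothetical stand-in for MrzDate that remembers the original MRZ text.
final class RawPreservingDate {
    final int year;   // -1 if unparseable
    final int month;  // -1 if unparseable
    final int day;    // -1 if unparseable
    final String raw; // original "yymmdd" text, may be null if not available

    RawPreservingDate(int year, int month, int day, String raw) {
        this.year = year;
        this.month = month;
        this.day = day;
        this.raw = raw;
    }

    // Returns the original text when any part failed to parse; otherwise
    // formats the parsed values as yymmdd.
    String toMrz() {
        boolean incomplete = year < 0 || month < 0 || day < 0;
        if (incomplete && raw != null) {
            return raw;
        }
        return String.format("%02d%02d%02d", year, month, day);
    }
}
```

With this, new RawPreservingDate(65, -1, 2, "651502").toMrz() returns "651502" instead of "65-102".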

@ZsBT
Owner

ZsBT commented Sep 22, 2017

The purpose of the booleans validDateOfBirth and expiry is to quickly indicate that something is wrong with that field. Further inspection of the MrzDate object (isValidDate, and maybe a new boolean validCheckdigit?) could show what the exact issue is.
@GoUpNorth yes, keeping the original date value would also help the planned "Corrector" class guess what the proper value on the MRZ line was.

@ZsBT ZsBT closed this as completed Sep 22, 2017
@ZsBT ZsBT reopened this Sep 22, 2017
@ZsBT
Owner

ZsBT commented Sep 24, 2017

I am closing this issue, as the main subject is fixed by pull requests #17 and #19.
We can continue the discussion about the check digit booleans in a new thread.
