Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain spaces when splitting long lines #4

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jimrybarski
Copy link

When long qualifier values are split on spaces to keep the total line length less than 80 characters, and the final character on a line is a space, that space is deleted. However, this prevents parsing the original value - it's not possible to know when joining the lines together whether a space should be inserted or not. For example:

Lipopolysaccharide export system ATP-binding protein LptB

will appear in the Genbank text as:

/product="Lipopolysaccharide export system ATP-binding
 protein LptB"

and when this is read later and newlines are removed, it will be:

Lipopolysaccharide export system ATP-bindingprotein LptB
(note bindingprotein is one word now).

This commit retains the final space at the end of the line so that the original value can be reconstructed unambiguously.

When long qualifier values are split on spaces to keep the total line length less than 80 characters,
and the final character on a line is a space, that space is deleted. However, this prevents parsing the
original value - it's not possible to know when joining the lines together whether a space should be
inserted or not. For example:

Lipopolysaccharide export system ATP-binding protein LptB

will appear in the Genbank text as:

/product="Lipopolysaccharide export system ATP-binding
 protein LptB"

and when this is read later and newlines are removed, it will be:

Lipopolysaccharide export system ATP-bindingprotein LptB

This commit retains the final space at the end of the line so that the original value can be
reconstructed unambiguously.
This commit implements all the changes recommended by `cargo clippy`, which is currently causing the CI pipeline to fail.
@dlesl
Copy link
Owner

dlesl commented Nov 24, 2022

Thanks for the PR and thanks for fixing the clippy errors! It might take me a while to review this one though, gotta get my head around it (I remember this being a bit tricky)

@dlesl
Copy link
Owner

dlesl commented Nov 24, 2022

So this is a bit of a tricky one, from my reading of the docs here and looking at some official GenBank files (like mg1566.gb) it seems like the interpretation of the newlines in the qualifier values needs to be context-dependent, ie. on the key. For example for sequences the newlines should be dropped but for descriptive text they are equivalent to spaces. I did go a tiny bit of the way towards implementing that here for translations, but for everything thing else the newlines are just left as is and it is up to the library user to figure out what they mean, which not very consistent at all 😀. I'm not sure what the best solution is here really, I think it might be good to compare other implementations like BioPython to see how they handle this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants