Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue #615 #616

Closed
wants to merge 30 commits into from
Closed

fix issue #615 #616

wants to merge 30 commits into from

Conversation

bneron
Copy link
Contributor

@bneron bneron commented Sep 18, 2015

#615
ensure that the filed DEFINITION ends with a period as in Genbank
format specifications.

biopython#615
ensure that the filed DEFINITION ends with a period as in Genbank
format specifications.
@peterjc
Copy link
Member

peterjc commented Sep 18, 2015

The unit tests failed (see Travis CI) because we can no longer round-trip an arbitrary description. Should the GenBank parser remove any trailing period, or not?

If the parser should stay as it is (probably best), then test_SeqIO_features.py at least will need adjusting.

@bneron
Copy link
Contributor Author

bneron commented Sep 21, 2015

The parser does not modify the definition field.
The problem come from the unit tests.
In the test a gbk sequence is write on disc then parse it.
The 2 seq records are compared

  • the description in original record is 'Test'
  • the description in the read record is 'Test.'
    this explain the failure of the unit tests.

The problem is that the method which raise the error
is common to compare record from 'genbank', 'embl' and 'imgt'
only genbank require to ends with a period.

I don't know which strategy to fix the tests

  1. in write_read_checks (test_SeqIO_features.py L 553)
    we can check if gb and add a period to the description property value.

  2. the unit test to compare description require only one word in common.
    (see code below) may be we can use this, and provide 'foo Test'
    as record description to write, so we should received after parsing ['foo' , 'Test.']
    and 'foo' will be common ?

L 80

# Just insist on at least one word in common:
    if (old.description or new.description) \
    and not set(old.description.split()).intersection(new.description.split()):
        raise ValueError("%s versus %s"
                         % (repr(old.description), repr(new.description)))

@peterjc
Copy link
Member

peterjc commented Sep 21, 2015

Assuming we don't change the parser then the simplest fix in test_SeqIO_features.py is probably to replace this:

class FeatureWriting(unittest.TestCase):
    def setUp(self):
        self.record = SeqRecord(Seq("ACGT" * 100, generic_dna),
                                id="Test", name="Test", description="Test")

with:

class FeatureWriting(unittest.TestCase):
    def setUp(self):
        self.record = SeqRecord(Seq("ACGT" * 100, generic_dna),
                                id="Test", name="Test", description="Test.")

Do you think changing the parser to remove the trailing . from GenBank/EMBL descriptions would be acceptable? I worry it might break existing scripts. Maybe we should look at how BioPerl or EMBOSS handle this for file format conversion?

@bneron
Copy link
Contributor Author

bneron commented Sep 21, 2015

BioPerl does not care to the period at the end of definition, But emboss does.
I conevrt a sequence in fasta format without period at the end of comments,
with the ebi web service (see url below), into genbank format.
The DEFINITION field ends with a period.

http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=emboss_seqret-I20150921-121232-0853-79884287-oy

input in FASTA format

>seq_1 some comments
GATC

after conversion in GENBANK

LOCUS       seq_1                      4 bp    DNA     linear   UNC 21-SEP-2015
DEFINITION  some comments.
ORIGIN
       1  GATC
//

@peterjc
Copy link
Member

peterjc commented Sep 21, 2015

And continuing the example, it would appear (that online version of) EMBOSS will take that GenBank file with the period/dot and convert it to a FASTA file with the the period/dot:

>seq_1 some comments.
GATC

Based on this the round trip FASTA -> GenBank -> FASTA with EMBOSS adds a trailing dot/period to the description if not already there.

@bneron
Copy link
Contributor Author

bneron commented Sep 21, 2015

We can consider the ending period as an element belonging to the genbank syntax.
Thus we should remove the last period in the genbank parser and
add systematically a period at the end of the DEFINITION field in the genbank writer.
but don't add it to the description property itself, only when we serialize it on file.

in this case
fasta

>seq1 some comments
atcg

genbank

LOCUS       seq_1                      4 bp    DNA     linear   UNC 21-SEP-2015
DEFINITION  some comments.
ORIGIN
       1  GATC

fasta again

>seq1 some comments
atcg

But in this case if fasta comments end with a period we will obtain
a definition file with 2 periods at the end it's ugly for a human reader
but it's correct syntactically view, and at the end of the round trip
fasta -> gbk -> fasta we obtain the same file as the original file.

fasta

>seq1 some comments.
atcg

genbank

LOCUS       seq_1                      4 bp    DNA     linear   UNC 21-SEP-2015
DEFINITION  some comments..
ORIGIN
       1  GATC

fasta again

>seq1 some comments.
atcg

@peterjc
Copy link
Member

peterjc commented Sep 21, 2015

Whatever we do will have side effects.

Here's another example FASTA input file to consider, using multiple trailing . characters:

>geneX important but not sure what it does yet...
ACGTACGTACGTACGTACGTACGTACGTACGTACGT

@bneron
Copy link
Contributor Author

bneron commented Sep 21, 2015

with my last "algorithm" it will produce

LOCUS       geneX                     36 bp XXXXXX     XXX      XXX 21-SEP-2015
DEFINITION  important but not sure what it does yet....
KEYWORDS    .
ORIGIN      
        1 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGT
//

with four periods at the end
but we can get back to

>geneX important but not sure what it does yet...
ACGTACGTACGTACGTACGTACGTACGTACGTACGT

the original when it is converted back in fasta.

@peterjc
Copy link
Member

peterjc commented Sep 21, 2015

OK, let's try that then - can you commit that change, and retest locally (or push the update to your branch and TravisCI will retest it)?

bneron added a commit to gem-pasteur/Integron_Finder that referenced this pull request Sep 21, 2015
due to issue
biopython/biopython#616
the genbak file in output is not alaways recognize as genebank by
squizz.
thus integronfinger take care that SeqREcord description property
ends with a period.
@bneron
Copy link
Contributor Author

bneron commented Sep 22, 2015

I have patch as we discussed the genbank parser and writer

the tests passed successfully for the SeqIO module
but failed in common_BioSQL as they parse lot of sequences in genbank format
and expect to have a period at the end of description property

https://travis-ci.org/biopython/biopython/jobs/81551316

# see ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt [3.4.5]
# and discussion https://github.com/biopython/biopython/pull/616
# So let's add a period
descr += '.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the description was the dummy value <unknown description> we don't want to record .. but just .

@peterjc
Copy link
Member

peterjc commented Sep 22, 2015

Note test_GenBank.py is an old style print-and-compare test (see the chapter in the tutorial for background), so you can update its expected output using run_tests.py -g test_GenBank.py.

For the other tests which verify a specific description, and the doctests which verify an exact string output, these must be updated manually.

@bneron
Copy link
Contributor Author

bneron commented Oct 28, 2015

Hi peter,
I have finally found some time to fix the unit tests.

@bneron bneron closed this Oct 28, 2015
@bneron bneron reopened this Oct 29, 2015
@peterjc
Copy link
Member

peterjc commented Oct 29, 2015

Thanks @bneron - I'm currently waiting for the TravisCI tests to finish.

Assuming there's nothing further to do, I will probably squash this down into a single commit when I apply it to the master. Would you be happy to be named in the NEWS and CONTRIB file?

@bneron
Copy link
Contributor Author

bneron commented Oct 29, 2015

On 10/29/2015 11:56 AM, Peter Cock wrote:

Thanks @bneron https://github.com/bneron - I'm currently waiting for the TravisCI tests to finish.

Assuming there's nothing further to do, I will probably squash this down into a single commit when I apply it to the master. Would you be happy to be named in the |NEWS| and |CONTRIB| file?

thank you, I'm already in CONTRIB file ;-)


Reply to this email directly or view it on GitHub #616 (comment).

Bertrand Néron
Software Engineer
Center of Informatics for Biology
(http://www.pasteur.fr/en/research/center-informatics-biology)

Institut Pasteur
28, Rue du Dr Roux
75724 Paris cedex 15

@peterjc
Copy link
Member

peterjc commented Feb 1, 2016

Note to self: Examine what changed in Tests/BioSQL/cor6_6.db other than removing the trailing dots on the descriptions. Do we care about the likely change to the relative path metadata for its testing?

@peterjc
Copy link
Member

peterjc commented Feb 1, 2016

Rebased/squashed version of this work with related changes here: https://github.com/peterjc/biopython/tree/dots

@peterjc
Copy link
Member

peterjc commented Aug 29, 2016

Two releases later, I just rebased my branch https://github.com/peterjc/biopython/tree/dots again. If the TravisCI run looks clean I think I'll just merge this...

@peterjc
Copy link
Member

peterjc commented Aug 29, 2016

Applied as 3231fa0 and the subsequence commits, acknowledgement in 0313ae4 - thank and sorry I forgot about this for so long (twice).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants