Skip to content

Deprecate record.database_letters, record.query_letters #1000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
vincentdavis opened this issue Nov 16, 2016 · 11 comments
Open

Deprecate record.database_letters, record.query_letters #1000

vincentdavis opened this issue Nov 16, 2016 · 11 comments
Labels
From Redmine good first issue This should be an easy fix, suitable for beginners

Comments

@vincentdavis
Copy link
Contributor

From redmine XML Blast parser: miscellaneous bug fixes and cleanup
Most of the cleanup, bug fixes were done. The last comment was:

We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters

There is a TODO in the code regarding this.

@peterjc last comment

I've updated CVS as per comment 12 to also use record.query_length, and comment 13 to also use record.database_length.

Before:

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("xbt007.xml")) :

... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters, record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 None
13958303 None None
gi|129628|sp|P07175.1|PARA_AGRTU
222 None
13958303 None None

Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21,

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("xbt007.xml")) :

... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters, record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 270
13958303 None 13958303
gi|129628|sp|P07175.1|PARA_AGRTU
222 222
13958303 None 13958303

We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters

@peterjc peterjc added the good first issue This should be an easy fix, suitable for beginners label Nov 16, 2016
@peterjc
Copy link
Member

peterjc commented Nov 16, 2016

I would approach this by turning the attributes into read only properties, and then adding the warning to the getter functions. See e.g. e3cf12a which was how the Seq object's .data attribute was deprecated.

See also http://biopython.org/wiki/Deprecation_policy

i.e. Define read only property for .database_letters which gives a warning and returns the value of self.num_letters_in_database instead.

@peterjc
Copy link
Member

peterjc commented Nov 16, 2016

Copying my comments from the old RedMine issue:

Query Length

XML output includes this information once, currently recorded as .query_letters only.

Plain text output includes this twice, recorded as .query_letters (associated with the query header) and .query_length (associated with the pairwise alignments).

e.g.

...
Query= gi|120291|sp|P21297|FLBT_CAUCR FLBT PROTEIN
(141 letters)
...

gi|120291|sp|P21297|FLBT_CAUCR FLBT PROTEIN

Length = 141
...
As far as I know, these are always the same. An assertion could be added to the plain text parser to verify this...

For consistency, the XML parser could just populate both .query_length and .query_letters - a simple change that won't break any old code and makes migrating from the text parser to the XML parser a little easier.

This does perpetuate the confusion of two names. We could go further and make one of these properties officially deprecated (e.g. using a property method to issue a warning). But which one should we keep? Currently the XML parser only supports .query_letters but .query_length is more natural.

And:

Database Length

I wanted to record my notes on this based on findings reported on the mailing list. See this thread:

http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004101.html

The plain text BLAST format contains the database length information three times (!), once in the header (for each query) and then again at the end of the file in the database report and the parameters "total letters" and again as "length of database", e.g.
http://bugzilla.open-bio.org/attachment.cgi?id=676

...
Database: Leigo
4,535,438 sequences; 1,573,298,872 total letters
...
Database: Leigo
Posted date: Jan 22, 2007 11:26 AM
Number of letters in database: 1,573,298,872
Number of sequences in database: 4,535,438
...
Length of database: 1,573,298,872
...

The Bio.Record.Header class defines "database_letters" (this is repeated every query), Bio.Record.DatabaseReport defines "num_letters_in_database", and Bio.Record.Parameters class defines "database_length" (where the names reflect the NCBI strings). The Bio.Record.Record inherits from all three, so ends up with "database_letters", "database_length" and "num_letters_in_database" (all coming from different bits of a plain text BLAST file).

If the -z option is used, only the last of these three databases in the plain text output is changed (tested using standalone BLAST 2.2.18, which Biopython can parse for single queries). Using the Biopython plain text parser, "database_letters" and "num_letters_in_database" reflect the real database size, while "database_length" reflects the -z argument (which is used in the
statistics).

If the -z option is used with XML output, then <Statistics_db-len> is updated. As far as I can tell, the "real" database size is not reported. The XML parser stores this as "num_letters_in_database".

So from plain text BLAST we have two pieces of information,

actual database size - "database_letters" and "num_letters_in_database
specified database size - "database_length"

While for XML BLAST we only get one piece of information,

specified database size - "num_letters_in_database"
while "database_letters" and "database_length" default to None.

This is a horrid mess. In the short term I propose the XML parser also record the specified database size as "database_length", and perhaps also as "database_letters" which would facilitate anyone trying to migrate a script from the plain text parser to the XML parser.

And then,

I've updated CVS as per comment 12 to also use record.query_length, and comment 13 to also use record.database_length.

Before:

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("xbt007.xml")) :

... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters, record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 None
13958303 None None
gi|129628|sp|P07175.1|PARA_AGRTU
222 None
13958303 None None

Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21,

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("xbt007.xml")) :

... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters, record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 270
13958303 None 13958303
gi|129628|sp|P07175.1|PARA_AGRTU
222 222
13958303 None 13958303

We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters

@souravsingh
Copy link
Contributor

@peterjc Do we have to add the property decorator for database_letters here?

@peterjc
Copy link
Member

peterjc commented Apr 12, 2017

Probably yes, I would think to class DatabaseReport?

@farAtGitHub
Copy link
Contributor

@peterjc I'm trying to understand what should be done here and have read through parts of the comments of the closed PR, but I get lost way earlier.
You say:

The Bio.Record.Header class defines "database_letters" (this is repeated every query), Bio.Record.DatabaseReport defines "num_letters_in_database", and Bio.Record.Parameters class defines "database_length" (where the names reflect the NCBI strings). The Bio.Record.Record inherits from all three, so ends up with "database_letters", "database_length" and "num_letters_in_database" (all coming from different bits of a plain text BLAST file).

and in a previous comment:

i.e. Define read only property for .database_letters which gives a warning and returns the value of self.num_letters_in_database instead.

To my understanding (bear with me, I'm YAPYN - yet another Python newbie) Bio.Record.Record is a module. How can it inherit from Header, DatabaseReport and Parameters? And while a deprecation in any of Header, DatabaseReport and Parameters is possible I don't see a way to return in one class the value of another class without any inheritance hierachy which I can't make out here.
Probably I'm missing something really important?
If not:
Is it worthwhile to have a(nother) go at this? (I see you have already invested a lot of time in this guiding souravsingh.)

@peterjc
Copy link
Member

peterjc commented Jan 5, 2020

PEP8 style says use lower case for modules, and title case for classes - to help avoid this confusion. This part of Biopython pre-dates that convention.

Bio.Blast.Record is a module, specifically this file:

https://github.com/biopython/biopython/blob/master/Bio/Blast/Record.py

Within that is a class Record, i.e. Bio.Blast.Record.Record in full. Update: class called Blast, i.e. Bio.Blast.Record.Blast in full

Also, a few Bio.Blast.xxx got shortened to Bio.xxx which adds to the confusion in this thread.

@farAtGitHub
Copy link
Contributor

Do you mean there should be a line starting with
class Record
in https://github.com/biopython/biopython/blob/master/Bio/Blast/Record.py?
If so, I cannot find it, else I give up on this one for the time being.

@peterjc
Copy link
Member

peterjc commented Jan 5, 2020

No, I didn't double check the code before replying - we were most likely talking about the Blast class (in other parsers it got called the Record class - my mistake).

@valentin994
Copy link
Contributor

Hey, I'm looking at this issue, so you want to remove it from https://github.com/biopython/biopython/blob/master/Bio/Blast/NCBIXML.py#L786, and in the DatabaseRecords you want a property that will mirror the removed database_letters? Is there anything else that I need to do here?

@uxritu
Copy link

uxritu commented Mar 20, 2025

Hi, I'm new to open-source contributions. Can I work on this issue

@peterjc
Copy link
Member

peterjc commented Mar 20, 2025

Note @uxritu has written the same comment on multiple issues today - which I am initially willing to assuming is merely over enthusiasm.

--

A belated answer to @valentin994, I don't recall the details of this code well enough to say without actually trying it. i.e. Add the deprecation warnings, run the test suite including documentation, and see what needs updating and/or to have the new warning silenced until the code and its test is removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
From Redmine good first issue This should be an easy fix, suitable for beginners
Projects
None yet
Development

No branches or pull requests

6 participants