Utilized slots in Seq and SeqRecord classes, issue #2854 #3309

kaskales · 2020-10-14T17:36:50Z

Added __slots__ to the Seq and SeqRecord classes.

In SeqRecord the seq and letter_annotations attributes were changed into functions with property decorators.

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit locally,
and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

codecov · 2020-10-14T17:48:16Z

Codecov Report

Merging #3309 into master will decrease coverage by 0.00%.
The diff coverage is 78.57%.

@@            Coverage Diff             @@
##           master    #3309      +/-   ##
==========================================
- Coverage   83.98%   83.98%   -0.01%     
==========================================
  Files         318      318              
  Lines       51661    51677      +16     
==========================================
+ Hits        43389    43402      +13     
- Misses       8272     8275       +3

Impacted Files	Coverage Δ
Bio/SeqRecord.py	`92.33% <76.00%> (-0.77%)`	⬇️
Bio/Seq.py	`97.01% <100.00%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63ec6b5...7b0ab8e. Read the comment docs.

peterjc · 2020-10-14T17:55:46Z

The flake8 tests failed, running black will fix most of them for you:

$ flake8 .
Bio/Seq.py:76:18: Q000 Remove bad quotes
Bio/Seq.py:76:27: Q000 Remove bad quotes
Bio/Seq.py:1132:18: Q000 Remove bad quotes
Bio/Seq.py:1132:29: Q000 Remove bad quotes
Bio/Seq.py:1703:18: Q000 Remove bad quotes
Bio/SeqRecord.py:161:9: Q000 Remove bad quotes
Bio/SeqRecord.py:162:9: Q000 Remove bad quotes
Bio/SeqRecord.py:163:9: Q000 Remove bad quotes
Bio/SeqRecord.py:164:9: Q000 Remove bad quotes
Bio/SeqRecord.py:165:9: Q000 Remove bad quotes
Bio/SeqRecord.py:166:9: Q000 Remove bad quotes
Bio/SeqRecord.py:167:9: Q000 Remove bad quotes
Bio/SeqRecord.py:168:9: Q000 Remove bad quotes
Bio/SeqRecord.py:169:9: Q000 Remove bad quotes
Bio/SeqRecord.py:170:9: Q000 Remove bad quotes
Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

See the CONTRIBUTING file for details of how to install pre-commit which will run flake8 and black for you.

peterjc · 2020-10-14T17:57:27Z

Bio/SeqRecord.py

@@ -250,23 +274,10 @@ def __init__(
        self.features = features

    # TODO - Just make this a read only property?


This comment is obsolete with your proposed changes (but right now I don't see why you changed the SeqRecord, I though you were only changing the Seq classes).

I will delete the comment.
I mentioned in #3309 I wanted to try to implement __slots__ for both classes, Seq and SeqRecord. If this is a problem I can remove the changes for SeqRecord.

kaskales · 2020-10-14T18:10:10Z

I think I have flake8/pre-commit installed incorrectly because I do not get the same errors when checking locally, I will fix this.

peterjc · 2020-10-14T18:11:06Z

Hmm, looks like the discussion on #2854 was unclear about if you would try the SeqRecord or just the Seq etc objects. Well, once the stylistic changes are done, TravisCI will run the actual tests on Linux.

Then comes benchmarking...

kaskales · 2020-10-14T18:22:29Z

Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

I did not write these descriptions so I am hesitant to change them since they were already part of the code and accepted.

kaskales · 2020-10-14T18:53:05Z

@peterjc I'm having trouble with pre-commit. I was hoping you could help me, I never used this before and don't know how to fix my problem. It is blocking me from adding any changes now. This is the error I get when I try to commit my new changes,

[INFO] Initializing environment for https://github.com/myint/rstcheck.
An unexpected error has occurred: CalledProcessError: command: ('/usr/lib/git-core/git', 'checkout', '')
return code: 128
expected return code: 0
stdout: (none)
stderr:
    fatal: empty string is not a valid pathspec. please use . instead if you meant to match all paths

Check the log at /home/kaska/.cache/pre-commit/pre-commit.log

I'm not working on my master branch but on the "slots" branch, I'm not sure why it is trying to checkout.

JoaoRodrigues · 2020-10-14T19:02:04Z

@kaskales what OS do you have? Are you running pre-commit on a conda environment or on your system Python?

kaskales · 2020-10-14T19:18:25Z

I have Ubuntu 18.04.5 and I am using my system Python 2.7.17, I normally have to specify python3 when working on running tests locally. I installed pre-commit with pip3. I am working from the biopython root directory. My previous commands were

~/biopython$ git add Bio/Seq.py
~/biopython$ git add Bio/SeqRecord.py
~/biopython$ git commit -m "attempt to fix flake8 and removed comment"
[INFO] Initializing environment for https://github.com/myint/rstcheck.
An unexpected error has occurred: CalledProcessError .....

JoaoRodrigues · 2020-10-14T19:22:50Z

See this issue: pre-commit/pre-commit#1036

It might be an issue with the pre-commit config?

peterjc · 2020-10-14T19:29:04Z

Regarding these:

Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

The original property docstring was not style compliant, but since it was defined via the property function and not a decorated method, the flake8 plugin didn't catch it. You'll have to tweak the wording to get this to pass.

While pre-commit is not working, I suggest installing flake8 and calling it directly - pip install flake8 and flake8 .

Update: In your case, pip3 install flake8 if that alias is defined, or python3 -m pip install flake8 to be explicit.

kaskales · 2020-10-14T19:37:30Z

While pre-commit is not working, I suggest installing flake8 and calling it directly - pip install flake8 and flake8 .

Thanks @JoaoRodrigues ! I will take a look. I forgot to add this line which prints before the error occurs too.

[INFO] Initializing environment for https://github.com/myint/rstcheck.

@peterjc I was doing this and the only error I was getting was:

 Bio/Seq.py:111:86: E999 SyntaxError: invalid syntax

It wasn't catching the ones you commented here. I am fixing them by hand, since I think the error was I was using single quotes instead of double quotes.

peterjc · 2020-10-14T19:43:40Z

Running pip install black and black Bio/Seq.py will do the quotes for you.
(As an aside, there is another tool called flynt which focuses on just doing strings)

That flake8 gave you a SyntaxError strongly suggests to me that flake8 is running under Python 2, not Python 3.

JoaoRodrigues · 2020-10-14T19:45:07Z

I would strongly suggest getting (mini)conda and creating a new Python3 environment for biopython where you install pre-commit. Makes it much easier to handle all these version issues than battling with system Python.

kaskales · 2020-10-14T22:13:50Z

@peterjc I fixed the flake8 errors by hand, I also used black on it's own. I wasn't able to get pre-commit working or flake8 for python3, I will fix this before I try to make any other PR. Thank you for your help!

Thank you @JoaoRodrigues , I will look into getting miniconda too.

peterjc · 2020-10-14T23:12:34Z

Green lights - the tests passed. Do you want to run some FASTA or FASTQ parsing benchmarking now? Maybe try the example on #3188?

P.S. It would be good if we can solve the pre-commit issue (to help avoid anyone else suffering from it).

kaskales · 2020-10-20T21:34:16Z

I look at that PR and I used this to test:

for x in range(1000):
     handle = open("BWA/human_g1k_v37_truncated.fasta")
     records = SeqIO.parse(handle, 'fasta')     
     records = list(records)
     handle.close()
 warnings.warn(str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))

Maybe because of the small test file I used but I couldn't find much of a performance different between with/without __slots__. With 10 trials the averages were both 45216 and 45165.2 bytes. The times were nearly identical as well.

peterjc · 2020-10-21T08:49:07Z

That's disappointing, but how many sequences are there in BWA/human_g1k_v37_truncated.fasta?

We probably need to try a massive high throughput sequencing file to see much difference, with millions of short sequences. The overheads we're hoping to see an improvement on are per sequence.

The example used on the other issue was from SRR12143416, which you can download from https://www.ebi.ac.uk/ena/browser/view/SRR12143416 with 35 million sequences in each of the FASTQ paried files. Converting either to FASTA would be fine for this test. Or, download from the SRA in their custom format and use their tool to convert to FASTA: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR12143416 (1.8GB)

kaskales · 2020-11-19T12:54:02Z

I repeated the trails with the fasta file you suggested, but I trimmed the file down to 1GB for my laptop.

Without slots the average memory usage was 7446676 bytes during 36seconds, and with slots was 6462748 bytes during 33 seconds. The advantage of using slots is much more apparent when testing on larger files.

peterjc · 2020-11-19T16:55:15Z

FASTA or FASTQ? The percentage saving would be more obvious on FASTA without the quality scores.

kaskales · 2020-11-19T17:11:13Z

FASTA, I downloaded the SRA file from the link you sent me and used fastq-dump to convert it to a fasta format.

peterjc · 2023-06-06T08:37:01Z

Thank you for working on this. It was an interesting experiment even though the performance impact was minimal. Closing pull request.

kaskales added 3 commits October 14, 2020 01:48

added property decorators

b300a8d

added seq doc string

1b1aacd

squashed

5f9316b

kaskales requested a review from peterjc as a code owner October 14, 2020 17:36

peterjc reviewed Oct 14, 2020

View reviewed changes

fixed quotes and docstring

7b0ab8e

peterjc mentioned this pull request Nov 19, 2020

Explore using __slots__ on Seq, SeqRecord, etc #2854

Closed

mdehoon self-requested a review October 7, 2022 11:34

peterjc closed this Jun 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Utilized slots in Seq and SeqRecord classes, issue #2854 #3309

Utilized slots in Seq and SeqRecord classes, issue #2854 #3309

kaskales commented Oct 14, 2020

codecov bot commented Oct 14, 2020 •

edited

Loading

peterjc commented Oct 14, 2020

peterjc Oct 14, 2020

kaskales Oct 14, 2020 •

edited

Loading

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020

kaskales commented Oct 14, 2020

kaskales commented Oct 14, 2020 •

edited

Loading

JoaoRodrigues commented Oct 14, 2020

kaskales commented Oct 14, 2020 •

edited

Loading

JoaoRodrigues commented Oct 14, 2020

peterjc commented Oct 14, 2020 •

edited

Loading

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020 •

edited

Loading

JoaoRodrigues commented Oct 14, 2020

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020

kaskales commented Oct 20, 2020

peterjc commented Oct 21, 2020

kaskales commented Nov 19, 2020

peterjc commented Nov 19, 2020

kaskales commented Nov 19, 2020

peterjc commented Jun 6, 2023

		@@ -250,23 +274,10 @@ def __init__(
		self.features = features

		# TODO - Just make this a read only property?

Utilized __slots__ in Seq and SeqRecord classes, issue #2854 #3309

Utilized __slots__ in Seq and SeqRecord classes, issue #2854 #3309

Conversation

kaskales commented Oct 14, 2020

codecov bot commented Oct 14, 2020 • edited Loading

Codecov Report

peterjc commented Oct 14, 2020

peterjc Oct 14, 2020

Choose a reason for hiding this comment

kaskales Oct 14, 2020 • edited Loading

Choose a reason for hiding this comment

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020

kaskales commented Oct 14, 2020

kaskales commented Oct 14, 2020 • edited Loading

JoaoRodrigues commented Oct 14, 2020

kaskales commented Oct 14, 2020 • edited Loading

JoaoRodrigues commented Oct 14, 2020

peterjc commented Oct 14, 2020 • edited Loading

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020 • edited Loading

JoaoRodrigues commented Oct 14, 2020

kaskales commented Oct 14, 2020

peterjc commented Oct 14, 2020

kaskales commented Oct 20, 2020

peterjc commented Oct 21, 2020

kaskales commented Nov 19, 2020

peterjc commented Nov 19, 2020

kaskales commented Nov 19, 2020

peterjc commented Jun 6, 2023

Utilized slots in Seq and SeqRecord classes, issue #2854 #3309

Utilized slots in Seq and SeqRecord classes, issue #2854 #3309

codecov bot commented Oct 14, 2020 •

edited

Loading

kaskales Oct 14, 2020 •

edited

Loading

kaskales commented Oct 14, 2020 •

edited

Loading

kaskales commented Oct 14, 2020 •

edited

Loading

peterjc commented Oct 14, 2020 •

edited

Loading

peterjc commented Oct 14, 2020 •

edited

Loading