Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilized __slots__ in Seq and SeqRecord classes, issue #2854 #3309

Closed
wants to merge 4 commits into from

Conversation

kaskales
Copy link
Contributor

Added __slots__ to the Seq and SeqRecord classes.

In SeqRecord the seq and letter_annotations attributes were changed into functions with property decorators.

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit locally,
    and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #2854

@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #3309 into master will decrease coverage by 0.00%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3309      +/-   ##
==========================================
- Coverage   83.98%   83.98%   -0.01%     
==========================================
  Files         318      318              
  Lines       51661    51677      +16     
==========================================
+ Hits        43389    43402      +13     
- Misses       8272     8275       +3     
Impacted Files Coverage Δ
Bio/SeqRecord.py 92.33% <76.00%> (-0.77%) ⬇️
Bio/Seq.py 97.01% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63ec6b5...7b0ab8e. Read the comment docs.

@peterjc
Copy link
Member

peterjc commented Oct 14, 2020

The flake8 tests failed, running black will fix most of them for you:

$ flake8 .
Bio/Seq.py:76:18: Q000 Remove bad quotes
Bio/Seq.py:76:27: Q000 Remove bad quotes
Bio/Seq.py:1132:18: Q000 Remove bad quotes
Bio/Seq.py:1132:29: Q000 Remove bad quotes
Bio/Seq.py:1703:18: Q000 Remove bad quotes
Bio/SeqRecord.py:161:9: Q000 Remove bad quotes
Bio/SeqRecord.py:162:9: Q000 Remove bad quotes
Bio/SeqRecord.py:163:9: Q000 Remove bad quotes
Bio/SeqRecord.py:164:9: Q000 Remove bad quotes
Bio/SeqRecord.py:165:9: Q000 Remove bad quotes
Bio/SeqRecord.py:166:9: Q000 Remove bad quotes
Bio/SeqRecord.py:167:9: Q000 Remove bad quotes
Bio/SeqRecord.py:168:9: Q000 Remove bad quotes
Bio/SeqRecord.py:169:9: Q000 Remove bad quotes
Bio/SeqRecord.py:170:9: Q000 Remove bad quotes
Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

See the CONTRIBUTING file for details of how to install pre-commit which will run flake8 and black for you.

Bio/SeqRecord.py Outdated
@@ -250,23 +274,10 @@ def __init__(
self.features = features

# TODO - Just make this a read only property?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is obsolete with your proposed changes (but right now I don't see why you changed the SeqRecord, I though you were only changing the Seq classes).

Copy link
Contributor Author

@kaskales kaskales Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will delete the comment.
I mentioned in #3309 I wanted to try to implement __slots__ for both classes, Seq and SeqRecord. If this is a problem I can remove the changes for SeqRecord.

@kaskales
Copy link
Contributor Author

I think I have flake8/pre-commit installed incorrectly because I do not get the same errors when checking locally, I will fix this.

@peterjc
Copy link
Member

peterjc commented Oct 14, 2020

Hmm, looks like the discussion on #2854 was unclear about if you would try the SeqRecord or just the Seq etc objects. Well, once the stylistic changes are done, TravisCI will run the actual tests on Linux.

Then comes benchmarking...

@kaskales
Copy link
Contributor Author

Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

I did not write these descriptions so I am hesitant to change them since they were already part of the code and accepted.

@kaskales
Copy link
Contributor Author

kaskales commented Oct 14, 2020

@peterjc I'm having trouble with pre-commit. I was hoping you could help me, I never used this before and don't know how to fix my problem. It is blocking me from adding any changes now. This is the error I get when I try to commit my new changes,

[INFO] Initializing environment for https://github.com/myint/rstcheck.
An unexpected error has occurred: CalledProcessError: command: ('/usr/lib/git-core/git', 'checkout', '')
return code: 128
expected return code: 0
stdout: (none)
stderr:
    fatal: empty string is not a valid pathspec. please use . instead if you meant to match all paths

Check the log at /home/kaska/.cache/pre-commit/pre-commit.log

I'm not working on my master branch but on the "slots" branch, I'm not sure why it is trying to checkout.

@JoaoRodrigues
Copy link
Member

@kaskales what OS do you have? Are you running pre-commit on a conda environment or on your system Python?

@kaskales
Copy link
Contributor Author

kaskales commented Oct 14, 2020

I have Ubuntu 18.04.5 and I am using my system Python 2.7.17, I normally have to specify python3 when working on running tests locally. I installed pre-commit with pip3. I am working from the biopython root directory. My previous commands were

~/biopython$ git add Bio/Seq.py
~/biopython$ git add Bio/SeqRecord.py
~/biopython$ git commit -m "attempt to fix flake8 and removed comment"
[INFO] Initializing environment for https://github.com/myint/rstcheck.
An unexpected error has occurred: CalledProcessError .....

@JoaoRodrigues
Copy link
Member

See this issue: pre-commit/pre-commit#1036

It might be an issue with the pre-commit config?

@peterjc
Copy link
Member

peterjc commented Oct 14, 2020

Regarding these:

Bio/SeqRecord.py:280:1: D401 First line should be in imperative mood; try rephrasing
Bio/SeqRecord.py:345:1: D401 First line should be in imperative mood; try rephrasing

The original property docstring was not style compliant, but since it was defined via the property function and not a decorated method, the flake8 plugin didn't catch it. You'll have to tweak the wording to get this to pass.

While pre-commit is not working, I suggest installing flake8 and calling it directly - pip install flake8 and flake8 .

Update: In your case, pip3 install flake8 if that alias is defined, or python3 -m pip install flake8 to be explicit.

@kaskales
Copy link
Contributor Author

While pre-commit is not working, I suggest installing flake8 and calling it directly - pip install flake8 and flake8 .

Thanks @JoaoRodrigues ! I will take a look. I forgot to add this line which prints before the error occurs too.

[INFO] Initializing environment for https://github.com/myint/rstcheck.

@peterjc I was doing this and the only error I was getting was:

 Bio/Seq.py:111:86: E999 SyntaxError: invalid syntax

It wasn't catching the ones you commented here. I am fixing them by hand, since I think the error was I was using single quotes instead of double quotes.

@peterjc
Copy link
Member

peterjc commented Oct 14, 2020

Running pip install black and black Bio/Seq.py will do the quotes for you.
(As an aside, there is another tool called flynt which focuses on just doing strings)

That flake8 gave you a SyntaxError strongly suggests to me that flake8 is running under Python 2, not Python 3.

@JoaoRodrigues
Copy link
Member

I would strongly suggest getting (mini)conda and creating a new Python3 environment for biopython where you install pre-commit. Makes it much easier to handle all these version issues than battling with system Python.

@kaskales
Copy link
Contributor Author

@peterjc I fixed the flake8 errors by hand, I also used black on it's own. I wasn't able to get pre-commit working or flake8 for python3, I will fix this before I try to make any other PR. Thank you for your help!

Thank you @JoaoRodrigues , I will look into getting miniconda too.

@peterjc
Copy link
Member

peterjc commented Oct 14, 2020

Green lights - the tests passed. Do you want to run some FASTA or FASTQ parsing benchmarking now? Maybe try the example on #3188?

P.S. It would be good if we can solve the pre-commit issue (to help avoid anyone else suffering from it).

@kaskales
Copy link
Contributor Author

I look at that PR and I used this to test:

for x in range(1000):
     handle = open("BWA/human_g1k_v37_truncated.fasta")
     records = SeqIO.parse(handle, 'fasta')     
     records = list(records)
     handle.close()
 warnings.warn(str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))

Maybe because of the small test file I used but I couldn't find much of a performance different between with/without __slots__. With 10 trials the averages were both 45216 and 45165.2 bytes. The times were nearly identical as well.

@peterjc
Copy link
Member

peterjc commented Oct 21, 2020

That's disappointing, but how many sequences are there in BWA/human_g1k_v37_truncated.fasta?

We probably need to try a massive high throughput sequencing file to see much difference, with millions of short sequences. The overheads we're hoping to see an improvement on are per sequence.

The example used on the other issue was from SRR12143416, which you can download from https://www.ebi.ac.uk/ena/browser/view/SRR12143416 with 35 million sequences in each of the FASTQ paried files. Converting either to FASTA would be fine for this test. Or, download from the SRA in their custom format and use their tool to convert to FASTA: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR12143416 (1.8GB)

@kaskales
Copy link
Contributor Author

I repeated the trails with the fasta file you suggested, but I trimmed the file down to 1GB for my laptop.

Without slots the average memory usage was 7446676 bytes during 36seconds, and with slots was 6462748 bytes during 33 seconds. The advantage of using slots is much more apparent when testing on larger files.

@peterjc
Copy link
Member

peterjc commented Nov 19, 2020

FASTA or FASTQ? The percentage saving would be more obvious on FASTA without the quality scores.

@kaskales
Copy link
Contributor Author

FASTA, I downloaded the SRA file from the link you sent me and used fastq-dump to convert it to a fasta format.

@mdehoon mdehoon self-requested a review October 7, 2022 11:34
@peterjc
Copy link
Member

peterjc commented Jun 6, 2023

Thank you for working on this. It was an interesting experiment even though the performance impact was minimal. Closing pull request.

@peterjc peterjc closed this Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Explore using __slots__ on Seq, SeqRecord, etc
3 participants