Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added 'testseq' function #1262

Closed
wants to merge 4 commits into from
Closed

Added 'testseq' function #1262

wants to merge 4 commits into from

Conversation

Adil-Iqbal
Copy link
Contributor

Hopefully resolved all of the Travis CI style conflicts.

This function (and all of its helper functions/classes) have been instrumental in demonstrating the Biopython module to others. My hope is to pass this utility on to others and help them as well.

Very open to comments and criticisms. And will support my work for the foreseeable future.

@Adil-Iqbal
Copy link
Contributor Author

I've reviewed the Travis CI test failures and they all occurred during the docstring tests.

Due to the random nature of the sequence generation, no two occurences will be exactly alike.
If you take the time to look at the reports, all of the Travis outputs serve to prove all of the concepts discussed in the docstring.

It would mean a lot to me if you would help me figure out how to circumvent this situation.

Thanks,
Adil.

@Adil-Iqbal
Copy link
Contributor Author

Alright. I've decided to add a random seed to the function.

The user will have the option of changing the seed as a function argument.

I will also play around with having a global Boolean variable that the user can change to revert the function's behavior to what it is currently but it will require some testing.

Thanks,
Adil

@peterjc
Copy link
Member

peterjc commented May 29, 2017

As you have noticed, random examples are not well suited to doctests.

I'm wondering if rather than adding a testseq function to Bio/SeqUtils/__init__.py this would be better under Scripts/ or simply in the main Tutorial?

(Note the Tutorial is currently written in LaTeX with special comments for doctests which are run via Tests/test_Tutorial.py)

I think we should ask for more opinions on the Biopython(-dev) mailing list - are you subscribed to those? http://biopython.org/wiki/Mailing_lists

- added global seed variable 'shuffle_seed'
- added argument 'rand_seed'
- added argument 'shuffle'
- updated docstrings to seeded outputs
- updated docstrings to include new arguments
@andrewguy
Copy link
Contributor

andrewguy commented May 30, 2017

Just a couple of comments.

While I think that this is a good idea for the purpose of education, I don't know if there is any obvious utility in everyday research use. Happy to hear otherwise - I haven't given this too much thought yet. As such, I think I agree with @peterjc that this would be better off somewhere other than the main SeqUtils module. Or perhaps we keep it under SeqUtils, but don't import it by default?

A few comments on the code itself. I'd be really hesitant to introduce a global variable (shuffle_seed) unless you really, really need to. The whole approach to setting a random seed is unnecessarily complicated. Just set a seed the first time you use the function, and everything from that point will be repeatable (and negating the need for the shuffle argument). If you want to retrieve the same sequence twice, just use the same seed in both calls. Just to illustrate how this works:

>>> import random
>>> random.seed(1234)
>>> random.random()
0.9664535356921388

Without providing another seed, we get a different number the second time we call random()

>>> random.random()
0.4407325991753527

Resetting the seed reproduces the same sequence of random numbers

>>> random.seed(1234)
>>> random.random()
0.9664535356921388
>>> random.random()
0.4407325991753527

Given this, I think you can reduce your random seeding code down to:

if rand_seed is not None:
    seed(rand_seed)

Set the default random seed to 0. No need to use an arbitrary number such as 9001. And you can then drop the shuffle argument.

One other comment from my cursory look at this - if a user sets gc_target > 100 or < 0, wouldn't it make more sense to raise a warning or exception, rather than silently set the GC target to the upper or lower bound? In this scenario it is likely that the user doesn't really understand what the gc_target represents, and hence should be alerted to the issue.

Happy to give this a more thorough review if there is consensus that this is a useful addition to the code-base.

Edit: You need to set the default random seed to None, not 0. Sorry.

@andrewguy
Copy link
Contributor

For random number generation, an even better idea would be to instantiate a local version of the random.Random class, so any seeding that happens in your function doesn't affect other parts of the code (i.e. other modules!) which use the default random instance.

@andrewguy
Copy link
Contributor

Hi Adil,

I've just seen your message on the dev mailing list.

With regards to the random number generation, I would put the rand_instance = Random() outside your functions definitions. This should solve all the issues that you have with seeding. As you've got it currently, you are creating a new instance each time you call the function. You want a single instance created when you load the module, which you can then seed from within the main testseq function. Let me know if that's unclear and I will do a pull request on your repo to show what I mean.

Within the function body you can then reduce it to:

if rand_seed is not None:
    rand_instance.seed(rand_seed)

No need to convert the seed to an integer, as Python uses the hash of the provided seed, so the seed can really be any hashable object.

@Adil-Iqbal
Copy link
Contributor Author

See Pull Request #1269
Closing pull request.

@Adil-Iqbal Adil-Iqbal closed this Jun 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants