Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter checks for SDF sniffing #4729

Merged
merged 2 commits into from Oct 1, 2017

Conversation

Projects
None yet
5 participants
@nsoranzo
Copy link
Member

nsoranzo commented Sep 29, 2017

I had a bunch of protein FASTA files being sniffed as 'sdf'.

Also fix #4558 .

Stricter checks for SDF sniffing
I had a bunch of protein FASTA files being sniffed as 'sdf'.

Also fix #4558 .
@bgruening
Copy link
Member

bgruening left a comment

Thanks @nsoranzo! LGTM, small comment included.

else:
return False
m_end_found = False
limit = 500

This comment has been minimized.

Copy link
@bgruening

bgruening Sep 29, 2017

Member

This probably needs to be increased. I have seen molecules with 1000 atoms, so maybe 1500?

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Oct 1, 2017

Author Member

Good point! I've increased this to 10000, but also added a strict check on the 4th line which would stop the method early for most files.

More precise SDF sniffer
First check 4th line: if this is conforming, scan up to 10000 lines.
Add 2 test files downloaded from:

https://github.com/rdkit/rdkit/tree/master/Code/GraphMol/FileParsers/test_data
limit = 10000
idx = 0
with open(filename) as in_file:
for line in in_file:

This comment has been minimized.

Copy link
@jmchilton

jmchilton Oct 1, 2017

Member

We really need to stop doing this - this line can consume unbounded memory on relatively sane user supplied inputs. I guess there are a lot of instances of it though and I'm not finding my previous advice on fixing it. (I believe the right thing to do is pick a maximum line size and use readlines directly instead of the iterator).

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Oct 2, 2017

Author Member

xref. #4319

@jmchilton jmchilton merged commit 068c61a into galaxyproject:dev Oct 1, 2017

6 checks passed

api test Build finished. 297 tests run, 4 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 161 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 46 tests run, 0 skipped, 0 failed.
Details
lgtm analysis: JavaScript No alert changes
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details

@nsoranzo nsoranzo deleted the nsoranzo:fix_molecule_datatypes branch Oct 2, 2017

@jvolkening

This comment has been minimized.

Copy link
Contributor

jvolkening commented Dec 1, 2017

Will this make it into 18.01? I'm having FASTQ files with '$$$$' in qualities detected as SDF.

@nsoranzo

This comment has been minimized.

Copy link
Member Author

nsoranzo commented Dec 2, 2017

@jvolkening Yes, it will.

@mvdbeek

This comment has been minimized.

Copy link
Member

mvdbeek commented Dec 2, 2017

I guess we should backport this change if the problem appears on 17.09.

dannon added a commit that referenced this pull request Dec 2, 2017

Merge pull request #5118 from mvdbeek/backport_sdf_sniffing_fixes
[17.09] Backport stricter checks for SDF sniffing #4729
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.