Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

Closed
wants to merge 12 commits into from

Conversation

kdm9
Copy link
Contributor

@kdm9 kdm9 commented Nov 6, 2014

Hi all,

We're developing a few things based off khmer, and one of the metrics we're using is the total number of k-mers. I know this can be obtained programmatically via the khmer API (by loading in the counting hash and grabbing its occupancy), but this takes a while with many large hashes. So, currently, we're grepping through log files and .info files for the FPR and total number of kmers.

This PR adds a CLI flag to load-into-counting.py allowing it to output the FPR, total number of distinct k-mers (cm-sketch occupancy) and input files in a machine/R/spreadsheet readable format (tab-seperated file). This saves us from resorting to regex hacks to pull this info out of out.kh.info and logs. I've added a test of the tsv creation behaviour too, and pep8 is clean.

An example of the tsv format is below (markdown table-ified for prettyness):

python ./scripts/load-into-counting.py  --machine-readable-info -x 1e7 -N 4 -k 20 tst data/100k-filtered.fa data/100k-surrendered.fa 
name fpr kmers files
tst 0.010 3167409 data/100k-filtered.fa;data/100k-surrendered.fa

I've also make the info files contain all files used as input. (This can be reverted, I forgot to separate the commits)

Hope you find it helpful 😄

Cheers,
Kevin

@ged-jenkins
Copy link

Can one of the admins verify this patch?

1 similar comment
@ged-jenkins
Copy link

Can one of the admins verify this patch?

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 6, 2014

  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual fixing as needed.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@mr-c
Copy link
Contributor

mr-c commented Nov 6, 2014

add to testlist


n_kmers = htable.n_occupied()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking news: see #562 (comment). A new n_kmers() function is pending.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, was wondering if this could be implemented 😄

Will update this PR once that one is merged, unless you'd prefer me to do so now.

@ctb
Copy link
Member

ctb commented Nov 7, 2014

Question/thought for you and @mr-c -- what about providing this in a JSON or YAML format, instead of tsv? More structured formats could allow for later expansion into additional information, and I think we could put semantic versioning to use by specifying the minimum information in the file by version, allowing for flexible expansion. (This was an idle thought I had yesterday while wandering around.)

@luizirber
Copy link
Member

And, if going the JSON route, can it be a newline delimited JSON? This would make it compatible with dat, and is still reasonably easy to parse.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 8, 2014

Yes, after hacking with this a bit, I like the idea of something more extensible than a TSV (particularly something both versioned and dat-friendly). Although that has the disadvantage of being less easy to just glob into R for a quick plot.

Maybe the CLI flag could take a format parameter that defaults to JSON if not given, but can also do something tabular like TSV for easy interfacing with R and friends.

@mr-c
Copy link
Contributor

mr-c commented Nov 15, 2014

Do you mind rebasing off of our master branch?

@kdm9 kdm9 force-pushed the feature/load-into-counting_tsv-output branch from ceea16d to af47c33 Compare November 15, 2014 03:36
@kdm9
Copy link
Contributor Author

kdm9 commented Nov 15, 2014

I've also been more consistent in using the print >>fp, 'x' syntax rather than fp.write()

@mr-c
Copy link
Contributor

mr-c commented Nov 17, 2014

Looking good so far @kdmurray91. I think you're ready for a ChangeLog entry.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 18, 2014

Will do this tonight and push it up 😄

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 18, 2014

Also, do I need to rebase again? GH is complaining about merge conflicts

@kdm9 kdm9 force-pushed the feature/load-into-counting_tsv-output branch from 593c10b to 455c856 Compare November 18, 2014 12:20
@kdm9
Copy link
Contributor Author

kdm9 commented Nov 18, 2014

Rebased against master, and changed the script to use ht.n_unique_kmers() as suggested by @mr-c on an outdated diff somewhere I think 😄

@@ -54,6 +56,10 @@ def get_parser():
parser.add_argument('-b', '--no-bigcount', dest='bigcount', default=True,
action='store_false',
help='Do not count k-mers past 255')
parser.add_argument('--machine-readable-info', '-m', default=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's got to be a better parameter name than this :). @mr-c, any thoughts?

Maybe... --summary-info?

@ctb
Copy link
Member

ctb commented Nov 19, 2014

Apart from that one comment, LGTM.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 20, 2014

Open to any suggestions for a better flag name. I agree the current flag is overly long and too verbose, but can't think of a better one. --summary-info isn't bad. I'll see what @mr-c says before implementing it though...

@mr-c
Copy link
Contributor

mr-c commented Nov 20, 2014

+1 to the shorter name

On Wed Nov 19 2014 at 7:29:39 PM Kevin Murray notifications@github.com
wrote:

Open to any suggestions for a better flag name. I agree the current flag
is overly long and too verbose, but can't think of a better one.
--summary-info isn't bad. I'll see what @mr-c https://github.com/mr-c
says before implementing it though...


Reply to this email directly or view it on GitHub
#649 (comment).

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 20, 2014

Cool, I'll also make the short flag be -s.

kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014
This changes the overly long flag we added in PR dib-lab#649 to be a little
shorter.
@kdm9 kdm9 force-pushed the feature/load-into-counting_tsv-output branch from c35f13e to d8e8aeb Compare November 20, 2014 04:35
kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014
This changes the overly long flag we added in PR dib-lab#649 to be a little
shorter.
@kdm9
Copy link
Contributor Author

kdm9 commented Nov 20, 2014

Ooops, I seem to have buggered up a rebase. Sorry!

@kdm9 kdm9 force-pushed the feature/load-into-counting_tsv-output branch from d8e8aeb to 5d47b86 Compare November 20, 2014 04:48
kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014
This changes the overly long flag we added in PR dib-lab#649 to be a little
shorter.
@kdm9
Copy link
Contributor Author

kdm9 commented Nov 20, 2014

That's better! All good from my end 😄

EDIT: Scratch that, found a bug.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 20, 2014

Fixed issue with test, I think. At least it works on my machine.

print >>sys.stderr, 'Saving k-mer counting table to %s' % base
print >>sys.stderr, 'Loading kmers from sequences in %s' % repr(filenames)
print >>sys.stderr, 'Saving k-mer counting table', base
print >>sys.stderr, 'Loading kmers from sequences in', repr(filenames)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not change these two lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, think that was the result of merge conflict resolution. Will revert.


with open(base + '.info', 'a') as info_fp:
print >> sys.stderr, "Writing run information to", base + '.info'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeat of line 136/172

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Another merge conflict mistake. Will remove it

@mr-c
Copy link
Contributor

mr-c commented Nov 20, 2014

retest this please

@mr-c
Copy link
Contributor

mr-c commented Nov 20, 2014

I can confirm that the test is fixed

We now use `with open(X) as y:` which automatically handles the closing
of files.
This adds a CLI flag to load-into-counting.py allowing it to output the
FPR, total number of distinct k-mers (cm-sketch occupancy) and input
files in a machine/R/excel readable format to save us from resorting to
regex hacks to pull this info out of out.kh.info etc.

I've also make the info files contain all files used as input.

I've also also added a test of the tsv creation behaviour.

Changes:
	modified:   scripts/load-into-counting.py:
			new cli flag, write all inputs to .info file
	modified:   tests/test_scripts.py:
			add test of the new cli flag
Writing it to the info file had the wrong sentence.
This changes the CLI flag that enables machine readable info so that to
enalbe this behaviour, you give it an argument of either 'json' or
'tsv'. This allows for a user choice of a versioned, dat-friendly json
or something easy to whack into R and plot.
Add a file format version key-val pair
We now use the choices kwarg to parser.add_argument to validate the
---machine-readable-info argument, instead of manually checking it.

Also, updates the tests.

	modified:   scripts/load-into-counting.py
	modified:   tests/test_scripts.py
New API feature in ChangeLog on 2014-11-06. Also solves merge
conflict (hopefully)

	modified:   scripts/load-into-counting.py
This changes the overly long flag we added in PR dib-lab#649 to be a little
shorter.
Forgot to update the expected stderr after a rebase changed it.
	modified:   tests/test_scripts.py
Accidentally changed a few unrelated issues while fixing merge
conflcits. Reverted as asked to by @mr-c.

Also remove duplicated status line introduced in a similar manner.
	modified:   scripts/load-into-counting.py
@kdm9 kdm9 force-pushed the feature/load-into-counting_tsv-output branch from d564155 to 8fb778c Compare November 21, 2014 00:33
@kdm9
Copy link
Contributor Author

kdm9 commented Nov 21, 2014

@mr-c's bugs fixed, rebased to remove merge conflict in ChangeLog.

mr-c pushed a commit that referenced this pull request Dec 1, 2014
This changes the overly long flag we added in PR #649 to be a little
shorter.
@mr-c
Copy link
Contributor

mr-c commented Dec 1, 2014

@kdmurray91 Congratulations on your first commit to the khmer project! Your name will be included in the release notes for the next version and you'll be listed amongst our other contributors in the next software release paper.

@mr-c mr-c closed this Dec 1, 2014
@mr-c
Copy link
Contributor

mr-c commented Dec 1, 2014

Merged as 820710c

@kdm9 kdm9 deleted the feature/load-into-counting_tsv-output branch March 3, 2015 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants