Preprocessing notebook #126

jwdebelius · 2015-02-19T03:09:26Z

A notebook for the generation of clean, consistent tables for use in downstream analyses. Rather than spending half a notebook trying to get data in a desired format, this centralizes the process.

Notebook can be viewed here.

This relies on files in #125.

jwdebelius · 2015-03-27T23:08:23Z

Bumping this, since the tables are being distributed. @wasade, @ElDeveloper , @cuttlefishh:
Could you review this?

wasade · 2015-03-30T15:17:33Z

A few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped (e.g., split_raw_dir)? This happens in multiple places when describing variables, like file pattern fill-in, etc.

why does the notebook download the preprocessed data if the point is to produce the preprocessed data?

the hyperlinks for the references don't work; they look like they're formatted incorrectly

jwdebelius · 2015-03-30T17:35:58Z

I only listed the requirements that are not superseded by QIIME or are
ambiguous in QIIME. (You need the newest IPython, but the versions of
pandas, matplotlib, etc. that satisfy the QIIME dependencies work fine.) Would it
clarify things to explicitly list these packages?

I think the issue is mixing markdown and HTML. It worked in my Safari and
Chrome, but it seems to be a problem in NBViewer. I will correct the issue.

The escapes are there because it explicitly forces the underscores to be
displayed, rather than italicizing the remaining text. So, I've put them in
explicitly to maintain the formatting.
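
To illustrate the escaping described above, here is a minimal sketch that backslash-escapes underscores in a variable name before it is placed in a markdown cell. The helper name `escape_underscores` is hypothetical and is not part of the notebook.

```python
import re


def escape_underscores(text):
    """Backslash-escape underscores that are not already escaped."""
    # (?<!\\) skips underscores that already follow a backslash, so
    # running the function twice does not double-escape the text.
    return re.sub(r"(?<!\\)_", lambda match: "\\_", text)


print(escape_underscores("split_raw_dir"))  # prints split\_raw\_dir
```

Wrapping the name in backticks is another common way to stop Markdown from treating underscores as emphasis markers; which one reads better depends on whether the name should render as code.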

I can't set up an explicit way to download from the FTP site via a hyperlink in an
IPython notebook.
Before the data was on FTP, I had an in-text link to the HTML. I can assume
that if people want to download the data, they can run other notebooks, but
I sort of imagined that people looking at the notebook would want to get
the data for other things.

wasade · 2015-03-30T17:43:43Z

Hyperlinks work with ftp, just use: ftp://ftp.microbio.me/foo/bar.

Pandas is not a dep of any of those IIRC, but might be wrong

jwdebelius · 2015-03-30T17:58:33Z

When I tried to make them in IPython, hovering over the hyperlink shows a cursor, not the little link-click hand. I looked over what I could find on StackOverflow and in the Python and IPython documentation, but I couldn't find a satisfactory solution for linking to an FTP site in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.

ElDeveloper · 2015-03-30T18:00:40Z

Ah yeah, this is a known problem with Markdown.

jwdebelius · 2015-03-30T19:50:59Z

With that in mind, how would you prefer to handle this?

The solutions I can come up with include something like the current
implementation, where a flag gets set to allow download or dataset
generation; removing a reference to the pre-generated dataset; or including
a link somewhere, with the note that it needs to be copied into the browser.

I think all three have advantages and drawbacks.
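
As a rough sketch of the first and third options, assuming hypothetical names throughout (`DOWNLOAD_PREGENERATED`, `get_tables`, `copy_paste_note`, and the destination filename are not taken from the notebook, and the URL is only the placeholder path quoted earlier in this thread):

```python
from urllib.request import urlretrieve

# Hypothetical settings; the URL is a placeholder, not a real data location.
DOWNLOAD_PREGENERATED = False
DATA_URL = "ftp://ftp.microbio.me/foo/bar"


def fetch_tables(url, dest):
    """Download the pre-generated tables (urllib accepts ftp:// URLs)."""
    urlretrieve(url, dest)
    return dest


def get_tables(download, generate, url=DATA_URL, dest="tables.download"):
    """Use the flag to choose between downloading and regenerating."""
    if download:
        return fetch_tables(url, dest)
    # `generate` stands in for whatever builds the tables locally.
    return generate()


def copy_paste_note(url):
    """Show the URL as text, with a note to paste it into a browser."""
    return "Pre-generated tables (copy into your browser): `{}`".format(url)


print(copy_paste_note(DATA_URL))
```

With the flag left at `False`, `get_tables(DOWNLOAD_PREGENERATED, build_fn)` never touches the network, which keeps the default path reproducible from the raw inputs.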

jwdebelius · 2015-04-01T06:35:50Z

This fixes the markdown issues with appearance and the links.

The download flag is still included in the notebook.

I'd like to get this merged sooner because it reflects tables we're sending out with the manuscript. It can live in my repo indefinitely, but it would be better to have it in master.

ElDeveloper · 2015-04-02T00:01:31Z

That notebook is 🌟 amazing 🌟!

wasade · 2015-04-02T00:14:31Z

@ElDeveloper, merge?

jwdebelius · 2015-04-13T16:38:06Z

@ElDeveloper: Any update on review?

Thank you for your help!

ElDeveloper · 2015-04-13T18:20:57Z

@jwdebelius and I are going through the notebook; we hope to have a final version around 4:00 pm PDT today.

jwdebelius · 2015-04-14T03:20:42Z

Running a little behind.

Thank you for all the awesome help today, @ElDeveloper.

I had one question: you suggested applying the scipy.spatial.distance.euclidean function when calculating the best rarefaction. When I try to apply the function, it requires a 1D array. My understanding is that, performance-wise, it's better to operate on 2D numpy arrays than to do a list comprehension and then cast to an array, but I'm happy to do whichever you prefer.
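
The trade-off in question can be sketched with made-up numbers. The notebook's actual rarefaction-selection code is not shown in this thread, so the arrays `candidates` and `target` below are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist, euclidean

# Made-up data: three candidate rows and one target vector.
candidates = np.array([[1.0, 2.0, 3.0],
                       [4.0, 6.0, 2.0],
                       [0.0, 0.0, 0.0]])
target = np.array([1.0, 2.0, 2.0])

# euclidean() accepts two 1-D vectors, so 2-D input needs a Python loop.
looped = np.array([euclidean(row, target) for row in candidates])

# Broadcasting performs the same arithmetic on the whole 2-D array.
vectorized = np.sqrt(((candidates - target) ** 2).sum(axis=1))

# cdist() is scipy's pairwise version; both of its inputs must be 2-D.
via_cdist = cdist(candidates, target[np.newaxis, :],
                  metric="euclidean").ravel()

# Index of the row closest to the target.
best = int(np.argmin(vectorized))
print(looped, vectorized, via_cdist, best)
```

All three approaches give the same distances; the loop reads most directly, while the broadcast and `cdist` forms avoid per-row Python calls on larger arrays.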

ElDeveloper · 2015-04-14T05:02:48Z

My main suggestion for using the Euclidean distance function in scipy
was to improve the readability of those few lines of code, not really to
improve performance. However, if you think this is not ideal and cannot
be easily adapted, then it's fine to leave as is.

ElDeveloper · 2015-04-14T06:56:50Z

Thanks @jwdebelius

jwdebelius added 5 commits February 18, 2015 18:10

Removes unnecessary files

004c377

Removes files not up for review in this pull request

Updates the index file

efb7451

Removes checkpoints

b2667a3

fixes image display issue

e4cfcc5

removes irrelevant images

d2a1735

This was referenced Feb 19, 2015

Statistical power #127

Merged

Easy template #128

Closed

Processing Notebooks #118

Closed

jwdebelius added 4 commits March 16, 2015 12:56

Brings in master changes

0bcbae0

Updates the .gitignore file

e637ba4

Removes data generated by the preprocessing notebook

Updates Preprocessing and index notebook

ad8e124

updates preprocessing notebook, removes checkpoint

0033ca8

jwdebelius added the results label Mar 27, 2015

Removes auxillary files

7435f2c

Brings back deleted images

38cf1e1

jwdebelius added 5 commits March 30, 2015 13:38

Pull upstream commits

6c364e8

Brings back geography_lib, which got lost

0f3abf6

Updates Preprocessing for round_15

df5d5c0

Fixes html links

fbce4cf

fixes base directory

401b3a0

jwdebelius added 3 commits April 3, 2015 09:32

updates data location

064b87a

Fixes subset generation

248e8ff

Fixes spelling

ea98c33

mid review

d67a79f

Updates the notebook with Yoshiki's comments

74c278f

Thank you @ElDeveloper!

jwdebelius added 2 commits April 13, 2015 23:33

Additional IPython format updates

40961b0

Fixes syntax and functions calls

removes unnecessary import

89a148b

ElDeveloper added a commit that referenced this pull request Apr 14, 2015

Merge pull request #126 from JWDebelius/preprocessing_notebook

79a05e3

Preprocessing notebook

ElDeveloper merged commit 79a05e3 into biocore:master Apr 14, 2015
