Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing notebook #126

Merged
merged 23 commits into from
Apr 14, 2015
Merged

Conversation

jwdebelius
Copy link
Contributor

A notebook for the generation of clean, consistent tables for use in downstream analyses. Rather than spending half a notebook trying to get data in a desired format, this centralizes the process.

Notebook can be viewed here.

This relies on files in #125.

This was referenced Feb 19, 2015
@jwdebelius
Copy link
Contributor Author

Bumping this, since the tables are being distributed. @wasade, @ElDeveloper , @cuttlefishh:
Could you review this?

The

@wasade
Copy link
Member

wasade commented Mar 30, 2015

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg split_raw_dir. this happens in multiple places when describing variables, like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted incorrectly

@jwdebelius
Copy link
Contributor Author

I only listed the requirements that were not superseded by QIIME or are
ambitious in QIIME. (You need the newest IPython, but the version of
pandas, matplotlib, etc. work fine with the QIIME dependencies.) Would it
clarify things to explicitly list these packages?

I think the issue is mixing markdown and HTML. It worked in my Safari and
Chrome, but it seems to be a problem in NBViewer. I will correct the issue.

The escapes are there because it explicitly forces the underscores to be
displayed, rather than italicizing the remaining text. So, I've put them in
explicitly to maintain the formatting.

I can't set up an explicit way to download the FTP as a hyperlink in an
IPython notebook.
Before the data was FTP, I had an in-text link to the HTML. I can assume
that if people want to download the data, they can run other notebooks, but
I sort of imagined that people looking at the notebook would want to get
the data for other things.

On Mon, Mar 30, 2015 at 8:17 AM, Daniel McDonald notifications@github.com
wrote:

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg
split_raw_dir. this happens in multiple places when describing variables,
like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to
produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted
incorrectly


Reply to this email directly or view it on GitHub
#126 (comment).

@wasade
Copy link
Member

wasade commented Mar 30, 2015

Hyperlinks work with ftp, just use: ftp://ftp.microbio.me/foo/bar.

Pandas is not a dep of any of those IIRC, but might be wrong
On Mar 30, 2015 11:37 AM, "J W Debelius" notifications@github.com wrote:

I only listed the requirements that were not superseded by QIIME or are
ambitious in QIIME. (You need the newest IPython, but the version of
pandas, matplotlib, etc. work fine with the QIIME dependencies.) Would it
clarify things to explicitly list these packages?

I think the issue is mixing markdown and HTML. It worked in my Safari and
Chrome, but it seems to be a problem in NBViewer. I will correct the issue.

The escapes are there because it explicitly forces the underscores to be
displayed, rather than italicizing the remaining text. So, I've put them in
explicitly to maintain the formatting.

I can't set up an explicit way to download the FTP as a hyperlink in an
IPython notebook.
Before the data was FTP, I had an in-text link to the HTML. I can assume
that if people want to download the data, they can run other notebooks, but
I sort of imagined that people looking at the notebook would want to get
the data for other things.

On Mon, Mar 30, 2015 at 8:17 AM, Daniel McDonald <notifications@github.com

wrote:

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg
split_raw_dir. this happens in multiple places when describing variables,
like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to
produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted
incorrectly


Reply to this email directly or view it on GitHub
<#126 (comment)
.


Reply to this email directly or view it on GitHub
#126 (comment).

@jwdebelius
Copy link
Contributor Author

When I tried to make them in IPython, the hyperlink hover-over is a curser, not the little link-click hand. I looked over what I could find in StackOverflow, python and IPython documentation, but I couldn't find a satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.

@ElDeveloper
Copy link
Member

Ah yeah, this is a known problem with Markdown.

On (Mar-30-15|10:58), J W Debelius wrote:

When I tried to make them in IPython, the hyperlink hover-over is a curser, not the little link-click hand. I looked over what I could find in StackOverflow, python and IPython documentation, but I couldn't find a satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.


Reply to this email directly or view it on GitHub:
#126 (comment)

@jwdebelius
Copy link
Contributor Author

With that in mind, how would you prefer to handle this?

The solutions I can come up with include something like the current
implementation, where a flag gets set to allow download or dataset
generation; removing a reference to the pre-generated dataset; or including
a link somewhere, with the note that it needs to be copied into the browser.

I think all three have advantages and drawbacks.

On Mon, Mar 30, 2015 at 11:00 AM, Yoshiki Vázquez Baeza <
notifications@github.com> wrote:

Ah yeah, this is a known problem with Markdown.

On (Mar-30-15|10:58), J W Debelius wrote:

When I tried to make them in IPython, the hyperlink hover-over is a
curser, not the little link-click hand. I looked over what I could find in
StackOverflow, python and IPython documentation, but I couldn't find a
satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.


Reply to this email directly or view it on GitHub:
#126 (comment)


Reply to this email directly or view it on GitHub
#126 (comment).

@jwdebelius
Copy link
Contributor Author

This fixes the markdown issues with appearance and the links.

The download flag is still included in the notebook.

I'd like to get this merged sooner because it reflects tables we're sending out with the manuscript. It can live in my repo indefinitely, but it would be better to have it in master.

@ElDeveloper
Copy link
Member

That notebook is 🌟 amazing 🌟!

@wasade
Copy link
Member

wasade commented Apr 2, 2015

@ElDeveloper, merge?

@jwdebelius
Copy link
Contributor Author

@ElDeveloper: Any update on review?

Thank you for your help!

@ElDeveloper
Copy link
Member

@jwdebelius and I are going through the notebook, we hope to have a final version around 4:00 pm PDT today.

@jwdebelius
Copy link
Contributor Author

Running a little behind.

Thank you for all the awesome help today, @ElDeveloper.

I had one question: you suggested applying the scipy.spatial.distance.euclidean function when calculating the best rarefaction. When I try to apply the function, it requires a 1D array. My understanding is that performance wise, it's better to operate on 2D numpy arrays than to do a list comprehension and then cast to an array, but I'm happy to do which ever?

@ElDeveloper
Copy link
Member

My main suggestion for using the euclidian distance function in scipy,
was to improve the readability of those few lines of code, not really to
improve performance. However, if you think this is not ideal and cannot
be easily adapted, then it's find to leave as is.

On (Apr-13-15|20:20), J W Debelius wrote:

Running a little behind.

Thank you for all the awesome help today, @ElDeveloper.

I had one question: you suggested applying the scipy.spatial.distance.euclidean function when calculating the best rarefaction. When I try to apply the function, it requires a 1D array. My understanding is that performance wise, it's better to operate on 2D numpy arrays than to do a list comprehension and then cast to an array, but I'm happy to do which ever?


Reply to this email directly or view it on GitHub:
#126 (comment)

@ElDeveloper
Copy link
Member

Thanks @jwdebelius

ElDeveloper added a commit that referenced this pull request Apr 14, 2015
@ElDeveloper ElDeveloper merged commit 79a05e3 into biocore:master Apr 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants