Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF for DR9 #648

Merged
merged 14 commits into from Nov 18, 2020
Merged

RF for DR9 #648

merged 14 commits into from Nov 18, 2020

Conversation

echaussidon
Copy link
Contributor

Add random forest for DR9 with correct threshold for each footprint.

Presentation done on 11/11/21 at SV telecon

@echaussidon
Copy link
Contributor Author

Hi, I need a review, to add new version of RF in DR9 : @geordie666. @Cyeche

@geordie666
Copy link
Contributor

Thanks, I'll look at this later today from the perspective of the code.

I'd also like to request a quick comment from @sbailey. This PR adds an additional ~9MB of files in the form of RFs, and I'm always wary of adding large files to the repository without oversight.

I think we're at the point where we've pushed enough notebooks to various repositories that 9MB is more acceptable than it was a few years ago, but I'd still appreciated a "Data Systems policy" comment from @sbailey.

@geordie666 geordie666 requested review from geordie666 and removed request for geordie666 November 12, 2020 22:59
@sbailey
Copy link
Contributor

sbailey commented Nov 13, 2020

@echaussidon thanks for doing this PR from a fork instead of pushing to a branch so that we can evaluate filesize impacts before it makes it into the main repo.

desitarget is currently at 210M (vs. 1 GB soft limit from GitHub for a full repo) and the two new files are 4-5 MB each (vs. the 100 MB hard limit from GitHub on individual files). So I think this is ok to add from a filesize perspective, as long as we only do it a few more times throughout SV.

Please post your SV telecon presentation to DocDB and include the number here for the record. Are the exact training and test samples available somewhere on disk? Please also document exactly what commands were run to do the training and generate these files.

I'll let @geordie666 review from a performance, code, and testing perspective.

Copy link
Contributor

@geordie666 geordie666 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echaussidon: I continue to appreciate that you update the RF code in your own fork and then add a PR in the main repository. It makes my life easier. Thank you!

You've introduced a couple of structural issues that the unit tests caught (which is why Travis is failing). Please address my comments to fix these issues, and then I'll test the code directly to check for more subtle bugs (which I don't expect to find).

If needed, I can hijack your fork and fix the issues myself, but it will probably expedite this PR if you address them.

py/desitarget/cuts.py Outdated Show resolved Hide resolved
py/desitarget/cuts.py Outdated Show resolved Hide resolved
py/desitarget/cuts.py Show resolved Hide resolved
@echaussidon
Copy link
Contributor Author

@geordie666 I update the corrections that you required ! I think it will work now.

@echaussidon
Copy link
Contributor Author

echaussidon commented Nov 16, 2020

@sbailey I don't know "how to upload" my slide in DocBD, there is a specific website for DESI ? (Slides are on the track of DESI).

For the training and test sample, there are not currently on NERSC. I removed Test sample which was really heavy (20GB...). However, the training sample and the test sample are generated from the dr9 sweepfiles inside the code which generate the RF files.

Question about the doc for generating RF files: what do you want exactly ? I can put the code and an explicative doc somewhere in desitarget (in desitarget/train for example).

@geordie666
Copy link
Contributor

@echaussidon: Thanks for responding to my review. I'm happy for this to be merged now. Unit tests are failing, but that isn't anything that you did, it's just an intermittent network glitch (I think).

To try to answer your questions for @sbailey:

  • For reference, DESI DocDB can be accessed from the "internal" pages on the public-facing DESI website. I don't think it's necessary to upload your slides to DocDB if they're already on the wiki.
  • I'd suggest that the best place for the training code is in desitarget/train and the best place for a document explaining the training code is either in that directory, too, or as a Jupyter notebook in doc/nb. A Jupyter notebook with all of the required environment variables and code to test and train (but without the actual "heavy" data) would be ideal.

I'm going to leave this PR open until tomorrow to give @sbailey a chance to add anything to my answers, but then I'll merge it.

@sbailey
Copy link
Contributor

sbailey commented Nov 18, 2020

I had suggested posting the presentation to DocDB because "DocDB NNNN" is more archival than searching the wiki or mailing lists for want presentations may have been posted where for a given date. It would be good practice for @echaussidon to learn how to post something to DocDB, but if it is too much of a hassle ping me and I can post it for him.

I agree with @geordie666 about posting the training code/commands/documentation but not adding the training/validation files themselves to desitarget. At the same time, we should get the full training and test samples at NERSC; 20 GB is small compared to our overall quota (2 PB). @geordie666 could you suggest an area, perhaps under /global/common/cfs/cdirs/desi/target/analysis/ or some other area intended to document how the TS cuts were derived? Thanks.

@echaussidon
Copy link
Contributor Author

@sbailey I try tu upload the document in Doc DB, however i don't find find my name in the author list. Do you know how can I do ?

@geordie666 I will publish the code and a jupyter notebook and make a pull request after the merging

@geordie666
Copy link
Contributor

@echaussidon: I like @sbailey's suggestion to put the testing/training files somewhere at NERSC on the community file system. Ideally, wherever we put them would be somewhere that would correspond to a named environment variable that you could call from inside a jupyter notebook tutorial.

As soon as Cori is back up, I'll work out the best directory for the RF testing/training files and add my suggestion to this PR. Then I'll merge.

@echaussidon
Copy link
Contributor Author

Meeting presentation of this update is available here (DocDB : 5912)

@geordie666
Copy link
Contributor

@echaussidon: I suggest that you work towards putting the training and validation files at NERSC in this directory:

/global/cfs/cdirs/desi/target/analysis/RF

and also work towards writing a notebook about how to do the training and put it in doc/nb of desitarget. Ideally, the notebook would include an environment variable that points to /global/cfs/cdirs/desi/target/analysis/RF so that people could run the training (or a simple version of it) themselves.

But, writing that notebook and posting the training/validation files shouldn't delay this PR, so I'll merge this in about an hour, once all of my morning meetings conclude!

@geordie666 geordie666 merged commit c95e921 into desihub:master Nov 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants