Work report from Codefest 2013

commit 010b1ad6bd811165b0a4e6c796f4d4b9637f6c88 1 parent 17968c5
@chapmanb authored
242 posts/conferences/bosc_codefest_report.org
@@ -0,0 +1,242 @@
+#+BLOG: bcbio
+#+POSTID: 524
+#+DATE: [2013-07-18 Thu 18:26]
+#+TITLE: Summary from Bioinformatics Open Science Codefest 2013: Tools, infrastructure, standards and visualization
+#+CATEGORY: OpenBio
+#+TAGS: bioinformatics, bosc, hackathon
+#+OPTIONS: toc:nil num:nil
+
+The [[bosc][2013 Bioinformatics Open Source Conference (BOSC)]] starts tomorrow in
+Berlin, Germany. It's a yearly conference devoted to community-based
+software development projects supporting biological research. Members of the
+[[open-bio][Open Bioinformatics Foundation]] discuss implementations and approaches
+to better provide interoperable and reusable software, libraries and
+pipelines.
+
+For the past five years, a two-day [[codefest][Codefest]] and hackathon has preceded
+the conference. This gives programmers time to work face-to-face,
+sharing approaches and discovering connections between projects. This
+year, the [[ivo][Department of Biology, Humboldt-Universität zu Berlin]]
+kindly hosted [[codefest][Codefest 2013]]. Thanks to the organizers and [[attendees][attendees]],
+we finished projects ranging from tool development and infrastructure
+integration to standards development and visualization. There are
+[[roman-photos][photos of the Codefest in progress]] and a [[codefest-doc][detailed writeup of projects]].
+
+Below we summarize the accomplishments from the two days. We
+welcome feedback on the topics covered and hope that by sharing our
+work we can encourage more programmers to become part of the open
+science bioinformatics community. Actively working to build
+well-tested, community-developed, interoperable tools is how we solve
+increasingly difficult research questions ranging from human health to
+plant breeding to microbial community function. The progress
+made in two days illuminates the effectiveness of open collaborative
+science.
+
+#+LINK: attendees https://docs.google.com/spreadsheet/ccc?key=0Agxg-o4ZmoZ4dEQyOFhrLUt4YVBXX0xxWjRyYTBRb2c#gid=0
+#+LINK: ivo http://www.biologie.hu-berlin.de/
+#+LINK: bosc http://www.open-bio.org/wiki/BOSC_2013
+#+LINK: open-bio http://www.open-bio.org/wiki/Main_Page
+#+LINK: codefest http://www.open-bio.org/wiki/Codefest_2013
+#+LINK: codefest-doc https://docs.google.com/document/d/1xbS7ZkjipXct00eOfR7-IL_Ti6QzAsjFvcJtopMeT2g/edit
+#+LINK: roman-photos https://plus.google.com/u/0/photos/115208034315059721590/albums/590205279902807640
+
+* Tool Development
+
+** BioRuby and BaseSpace - Develop SDK and apps for Illumina BaseSpace
+
+/Toshiaki Katayama, Raoul Bonnal, Eri Kibukawa, Joachim Baran, Dan MacLean, Fernando Izquierdo-Carrasco, Spencer Bliven/
+
+During the Codefest, we tested and documented our
+[[basespace-ruby][port of the BaseSpace Python SDK to Ruby]]. Ruby/Biogem developers
+can now easily use next-generation sequencing code within
+[[basespace][Illumina's BaseSpace]] framework. For non-Ruby programmers, we
+found that it can be a burden to create a new web app from
+scratch on top of an NGS program, so we started a new project to
+provide a web-app scaffold for BaseSpace. We have already implemented the
+basic portion but will need some more time before releasing the
+BioBaseSpace application.
+
+#+BEGIN_HTML
+<a href="http://bcbio.files.wordpress.com/2013/07/biobasespace.png">
+ <img src="http://bcbio.files.wordpress.com/2013/07/biobasespace.png?w=400" width="400">
+</a>
+#+END_HTML
+
+#+LINK: basespace-ruby https://github.com/joejimbo/basespace-ruby-sdk
+#+LINK: basespace https://basespace.illumina.com/home/index
+
+** Barrnap - Bacterial ribosomal RNA predictor
+
+/Torsten Seemann, Tim Booth/
+
+For the last 8 years RNAmmer has been the standard tool for predicting
+ribosomal RNA features in genomes, because it is reasonably fast,
+accurate, and works on bacteria and eukaryotes. Its drawbacks are that
+it relies on small, older databases; requires an older, conflicting
+version of HMMER; and has restrictive licence terms. To resolve these
+issues we have implemented a new rRNA predictor which uses the new
+“nhmmer” tool from HMMER 3.1 for searching DNA profiles against DNA
+sequence. We built the profile models from the Silva and GreenGenes
+seed alignments for the 5S, 23S and 16S genes. Barrnap is a small
+Perl script which takes FASTA as input and outputs the rRNA feature
+predictions in GFF3 format. It will be packaged in Bio-Linux and
+replace RNAmmer in the Prokka bacterial annotation system.
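
Since the predictions are written as GFF3, downstream scripts can read them with a few lines of code. The following is a minimal sketch of a GFF3 reader, assuming only the nine tab-separated columns defined by the GFF3 format. The sample records, the `example` source name and the `Name` attribute values are made up for illustration and are not real Barrnap output.

```python
from collections import namedtuple

# The nine tab-separated columns of a GFF3 record.
Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)


def parse_gff3(lines):
    """Yield one Feature per GFF3 record, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError("expected 9 columns, got %d" % len(cols))
        # Column 9 holds semicolon-separated key=value pairs.
        attributes = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Made-up records for illustration only (hypothetical, not predictor output).
EXAMPLE_GFF3 = [
    "##gff-version 3",
    "contig1\texample\trRNA\t100\t1640\t.\t+\t.\tName=16S_rRNA",
    "contig1\texample\trRNA\t2000\t4900\t.\t-\t.\tName=23S_rRNA",
]

features = list(parse_gff3(EXAMPLE_GFF3))
for feat in features:
    print(feat.attributes["Name"], feat.seqid, feat.start, feat.end, feat.strand)
```

GFF3 coordinates are 1-based and inclusive, so a feature's length is `end - start + 1`.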
+
+** BioJVM - Coordinating and integrating BioJava and ScaBio
+
+/Spencer Bliven, Andreas Prlic, Markus Gumbel/
+
+Both Java and Scala run on the Java Virtual Machine. As such, it makes
+sense to [[biojava-scala][coordinate and document]] the various Bio* projects which run on the JVM and
+therefore can interoperate to some degree. We were able to successfully
+reference BioJava functions from Scala code and ScaBio functions from
+Java code. The ease of this process means that users can use both
+libraries from whichever language is better suited to their biological
+problem.
+
+#+LINK: biojava-scala http://biojava.org/wiki/Scala
+
+** Biopython
+
+/Peter Cock, Konstantin Tretyakov, Bin Zhang/
+
+The Biopython team worked on training new users at Codefest and exploring
+integration of Biopython with other Python molecular visualization toolkits
+like [[pymol][PyMol]]. Infrastructure development involved testing and debugging
+on multiple systems, including identifying and fixing Windows and
+PyPy problems. We also identified areas where we can make it easier
+to contribute to Biopython: specifically easing the process to report
+and fix bugs by moving to integrated GitHub issue tracking and
+working to support Biopython-associated projects with easy
+installation tools.
+
+#+LINK: biopython http://biopython.org/wiki/Main_Page
+#+LINK: pymol http://pymol.org/
+
+** Galaxy Debianization
+
+/Tim Booth/
+
+I spent several hours revisiting previous work on the Galaxy package
+for Bio-Linux and made significant progress towards it being something
+that can go into Debian-proper. Results will be committed to Deb-Med
+public SVN and patches will be forwarded to the Galaxy dev mailing
+list.
+
+
+* Standards and Visualization
+
+** Ontology and provenance representation
+
+/Herve Menager, Bertrand Neron, Jackie Quinn, Stian Soiland-Reyes, Matus Kalas, Steffen Moller/
+
+The goal of this group was to [[ont-doc][investigate and implement solutions]] to
+use ontologies to help people find and use the programs and data they
+need for their work, and to help automate the integration of tools or
+data resources into workflows or workbenches. We also wanted to
+identify useful provenance metadata, to store in a rigorous way the
+conditions and configuration of analysis steps run by users. This
+improves transparency, reproducibility, and reliability of the
+scientific results.
+
+We worked toward inclusion of the [[edam][EDAM ontology]] as part of the
+[[mobyle][Mobyle system's]] built-in type and classification mechanisms.
+We created a use case by identifying workflows in Mobyle and mapped their
+descriptions onto the EDAM classification to allow mapping between the types.
+We also investigated the possibilities opened by projects such as [[prov][PROV]]
+to standardize the provenance information stored by systems such as
+Mobyle. We added prototype functionality to the development version
+of Mobyle that dynamically generates this provenance information in a
+JSON-based format.
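
To make the idea of JSON provenance concrete, here is a minimal sketch of how one analysis step could be described. It loosely follows the PROV notions of activities, entities, `used` and `wasGeneratedBy`. This is not Mobyle's actual format: the function name, the `ex:` attribute names, the identifiers and the example job are all assumptions made for illustration.

```python
import json


def provenance_record(job_id, tool, parameters, inputs, outputs, started, ended):
    """Build a PROV-style description of one analysis step as a plain dict."""
    activity = "job:" + job_id
    record = {
        "activity": {
            activity: {
                "prov:startTime": started,
                "prov:endTime": ended,
                "ex:tool": tool,              # hypothetical attribute name
                "ex:parameters": parameters,  # hypothetical attribute name
            }
        },
        # Every input and output file is an entity.
        "entity": {name: {} for name in inputs + outputs},
        "used": {},
        "wasGeneratedBy": {},
    }
    # The activity used each input...
    for i, name in enumerate(inputs):
        record["used"]["_:u%d" % i] = {
            "prov:activity": activity, "prov:entity": name}
    # ...and each output was generated by the activity.
    for i, name in enumerate(outputs):
        record["wasGeneratedBy"]["_:g%d" % i] = {
            "prov:entity": name, "prov:activity": activity}
    return record


# Hypothetical job, for illustration only.
record = provenance_record(
    job_id="42",
    tool="example-aligner",
    parameters={"gap_open": 10},
    inputs=["data:seqs.fasta"],
    outputs=["data:alignment.aln"],
    started="2013-07-17T10:00:00",
    ended="2013-07-17T10:02:30",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing the tool, its parameters, and the input and output relations together is what lets someone later reconstruct the conditions under which a result was produced.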
+
+#+LINK: ont-doc https://docs.google.com/document/d/19VpzwxZdlz1K4P1q1a-WYZUtiSXwUp2nafM716dzW8I/edit
+#+LINK: mobyle http://mobyle.pasteur.fr/
+#+LINK: edam http://edamontology.org/page
+#+LINK: prov http://www.w3.org/TR/prov-o/
+
+** Integrate DGE-Vis & Dalliance, JS animation scheduler
+
+/David Powell, Thomas Down, Skyler Brungardt, Alex Kalderimis/
+
+We worked on integrating two visualization tools: the
+[[dalliance][Dalliance genome browser]] and the [[dge-vis][DGE-Vis]] RNA-seq explorer.
+We now have [[dge-dalliance][a proof-of-concept tool]] that makes it possible to
+visualise RNA-seq analysis while browsing the genome.
+This inspired [[timeywimey][a JavaScript scheduler]] that defers
+slow animation updates until the JavaScript engine is not busy,
+allowing smoother animations and more accurate windows.
+Finally, we added a JBrowse-compatible JSON backend for Dalliance
+for integration with [[intermine][InterMine]].
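
The idea behind idle-time scheduling can be sketched in a few lines. This sketch is in Python rather than JavaScript and does not reflect the API of the scheduler linked above. The class name, the per-frame deadline and the fake clock are all assumptions for illustration. Slow updates are queued and run only while time remains in the current frame; whatever does not fit waits for the next frame.

```python
import time
from collections import deque


class IdleScheduler:
    """Run queued low-priority callbacks only while frame time remains."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = deque()

    def schedule(self, callback):
        """Queue a callback to run at the next idle opportunity."""
        self._pending.append(callback)

    def run_idle(self, deadline):
        """Run callbacks in FIFO order until the clock reaches `deadline`.

        Returns the number of callbacks run; the rest stay queued.
        """
        ran = 0
        while self._pending and self._clock() < deadline:
            self._pending.popleft()()
            ran += 1
        return ran

    def __len__(self):
        return len(self._pending)


class FakeClock:
    """Deterministic clock so the demo does not depend on real timing."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
scheduler = IdleScheduler(clock=clock)
drawn = []


def make_update(name):
    def update():
        drawn.append(name)
        clock.now += 0.004  # pretend each update costs 4 ms
    return update


for name in ["axis", "points", "labels", "legend"]:
    scheduler.schedule(make_update(name))

# First frame: a 10 ms budget starting at t=0.
first = scheduler.run_idle(deadline=0.010)
# Second frame: another 10 ms budget starting from the current time.
second = scheduler.run_idle(deadline=clock.now + 0.010)
print(first, second, drawn)
```

Note that a callback already running is never interrupted, so a single update can overrun the deadline; the budget only decides whether the next one starts.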
+
+#+LINK: dalliance https://github.com/dasmoth/dalliance
+#+LINK: dge-vis https://www.youtube.com/watch?v=ucucQ_LtZ1g
+#+LINK: dge-dalliance http://dna.med.monash.edu/~powell/dge-vis-dalliance/
+#+LINK: timeywimey https://github.com/StrictlySkyler/timeywimey
+#+LINK: intermine http://intermine.github.io/intermine.org/
+
+
+* Infrastructure
+
+** Infrastructure management via CloudBioLinux (CBL)
+
+/Enis Afgan, John Chilton, Brad Chapman/
+
+- Galaxy: We integrated custom installation procedures present in CBL
+ with the Galaxy-tools versioned installation methodology.
+
+- Documentation: Due to increased interest from individuals in using
+ and contributing to CBL, we invested effort into creating purpose-driven
+ documentation for CBL. This should help people use the end product of
+ CBL, customize CBL to their needs, and learn about the internals
+ of CBL with the aim of contributing. We will finish and make the
+ documentation available on ReadTheDocs over the coming months.
+
+- Build frameworks: We developed a simpler automated method to invoke the
+ CBL build framework to help remove complex, error-prone steps.
+
+- Web tooling: In the spirit of making CBL more accessible and easier to use, we’ve
+ decided to tackle development of a lightweight webapp that helps with
+ customizing and generating CBL configuration files.
+
+** Improve IPython cluster support and runtime metrics
+
+/Valentine Svensson, Guillermo Carrasco, Roman Valls, Per Unneberg/
+
+We worked to extend the [[ipython-parallel][IPython parallel cluster]] framework to support
+additional schedulers, specifically [[slurm-code][implementing SLURM support]] to
+supplement the existing SGE, LSF, Torque and Condor schedulers. We plan to
+extend this to allow generalized use of the DRMAA connector and
+ultimately port that generalization into IPython, so that Python
+scientific computations can be executed efficiently across different
+cluster implementations. Both [[r-blog][Roman]] and [[g-blog][Guillermo]] blogged
+[[r-blog2][detailed documentation]] of the [[g-blog2][work in progress]].
+
+We also worked to build a tool that helps provide run time
+estimates for bioinformatics jobs (e.g. “how long should aligning 40 million
+reads against hg19 with BWA take if I use 8 cores?”). We plan to
+collaborate on longer term development of this with the
+[[gcat][Genome Comparison and Analytic Testing]] team.
+
+#+LINK: g-blog http://mussolblog.wordpress.com/2013/07/17/setting-up-a-testing-slurm-cluster/
+#+LINK: g-blog2 http://blogs.nopcode.org/brainstorm/2013/07/19/berlin-bosc-codefest-2013-day-2/
+#+LINK: r-blog http://blogs.nopcode.org/brainstorm/2013/07/18/berlin-bosc-codefest-2013-day-1/
+#+LINK: r-blog2 http://mussolblog.wordpress.com/2013/07/19/pushing-forward-pytravis-during-berlin-codefest-2013/
+#+LINK: slurm-code https://github.com/roryk/ipython-cluster-helper/pull/6
+#+LINK: ipython-parallel http://ipython.org/ipython-doc/dev/parallel/index.html
+#+LINK: gcat http://www.bioplanet.com/forum/§discussion/7916/runtimeswallclock-time-alongside-the-accuracy-metrics#Item_1
+
+** GATK-based reusable pipeline based around Rubra/Ruffus
+
+/Clare Sloggett, Bernie Pope/
+
+We worked on code cleanup, documentation and test data for
+[[ruffus-pipe][a reusable pipeline]] to handle variant calling and annotation, using [[rubra][Rubra]] built
+on the [[ruffus][Ruffus]] framework. It handles BWA alignment, GATK alignment
+cleaning and variant calling, and ENSEMBL annotation. To make these
+pipelines easier to run, we worked on integrating them into the
+[[gvl-flavor][GVL flavor]] in CloudBioLinux.
+
+#+LINK: ruffus-pipe https://github.com/claresloggett/variant_calling_pipeline
+#+LINK: rubra https://github.com/bjpop/rubra
+#+LINK: ruffus http://www.ruffus.org.uk/
+#+LINK: gvl-flavor https://github.com/afgane/gvl_flavor
194 posts/conferences/scipy2013_day2.org
@@ -0,0 +1,194 @@
+#+BLOG: smallchangebio
+#+POSTID: 47
+#+DATE: [2013-06-28 Fri 08:15]
+#+TITLE: Scientific Python 2013, Day 2: Bioinformatics frameworks, open science and reproducibility
+#+CATEGORY: conference
+#+TAGS: bioinformatics, ngs, scaling, python, ipython, machine-learning
+#+OPTIONS: toc:nil
+
+I'm at the [[scipy2013][2013 Scientific Python conference]] in beautiful Austin,
+Texas. I'm helping organize this year's [[scipy-bio][Bioinformatics Symposium]]
+and learning about Python scaling and reproducibility from the
+Scientific Python community. These are my notes from the second day.
+See also my [[first-day-talks][notes from the first day]].
+
+#+LINK: first-day-talks http://smallchangebio.wordpress.com/2013/06/27/scientific-python-2013-day-1-bioinformatics-ipython-parallel-processing-and-machine-learning/
+
+* Ian Rees -- Bioinformatics symposium: electron microscopy platform
+
+[[ir-talk][Ian talked]] about software to handle data challenges associated with
+imaging. They principally focus on imaging of macromolecules using
+[[cryo-em][Cryo-EM]]. There is then a ton of processing before this can get into
+PDB as structures. 300+ active projects. Focus on archival,
+automation of record keeping, understanding how protocols change over
+time and sharing results with collaborators.
+
+[[emen2][EMEN2]] is their solution: an object-oriented lab notebook. It uses a
+protocol ontology to allow flexible queries of approaches. Takes a
+general approach to connect records. Impressive display of a long
+10-year project with 15,500 records. Records look like JSON documents of
+key-value pairs. Built on top of BerkeleyDB, with a Twisted web
+server. Provides integrated plotting and viewing of results, in
+addition to table-based viewing of projects and samples. Hooks in
+with the microscopy software so data uploads automatically. It
+provides a public API with JSON to query the data with a
+constraint-based query language. Also has Python hooks to create extensions with
+controllers and Mako templates.
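
A rough sketch of what querying key-value records by constraints can look like, using plain Python dicts. This is not EMEN2's query language or API. The record fields (`rectype`, `defocus_um`, and so on), the constraint tuples and the operator names are assumptions made for illustration.

```python
import operator

# Supported comparison operators for a (key, op, target) constraint.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "contains": lambda value, target: target in value,
}


def matches(record, constraints):
    """Return True if the record satisfies every (key, op, target) constraint."""
    for key, op, target in constraints:
        if key not in record:
            return False
        if not OPERATORS[op](record[key], target):
            return False
    return True


def query(records, constraints):
    """Return the records that satisfy all constraints, in input order."""
    return [record for record in records if matches(record, constraints)]


# Hypothetical records: a project with two linked micrographs.
RECORDS = [
    {"id": 1, "rectype": "project", "name": "ribosome", "year": 2004},
    {"id": 2, "rectype": "micrograph", "project": 1, "defocus_um": 2.5},
    {"id": 3, "rectype": "micrograph", "project": 1, "defocus_um": 1.2},
]

hits = query(RECORDS, [("rectype", "==", "micrograph"), ("defocus_um", ">", 2.0)])
print([record["id"] for record in hits])
```

Because records are schemaless key-value documents, a constraint on a key that a record lacks simply fails to match instead of raising an error.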
+
+The connected image processing toolkit is [[eman2][EMAN2]].
+
+#+LINK: ir-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=137
+#+LINK: cryo-em https://en.wikipedia.org/wiki/Cryo-electron_microscopy
+#+LINK: emen2 http://blake.grid.bcm.edu/emanwiki/EMEN2/
+#+LINK: eman2 http://blake.grid.bcm.edu/emanwiki/EMAN2
+
+* Larsson Omberg -- Bioinformatics symposium: Synapse platform
+
+[[lo-talk][Larsson discussed]] the approaches and tools that [[sage-base][Sage Bionetworks]] uses
+to help improve reproducibility of science. The [[synapse][Synapse]] tool tries to
+handle reproducibility on a distributed scale with multiple
+collaborating scientists, as opposed to other projects which focus on
+single researchers. Example of usage for The Cancer Genome Atlas:
+10,000 patients, 24 cancer types, and multiple inputs: variations,
+RNA-seq. Biggest challenge is coordination of multiple data sources
+and inputs. Data automatically pushed into Synapse, then do data
+freezes to allow analysis. Analysis results get pushed back into
+Synapse as well.
+
+Synapse is a web framework that allows multiple usages of tools in
+multiple places, registering results back with Synapse to coordinate
+them. Python API allows you to query with SQL syntax and retrieve
+specific datasets which have key/value style metadata annotations in
+addition to the raw data. Impressive demo of uploaded results with
+lots of metadata: nice way to understand custom analysis and review
+results.
+
+Synapse focuses on avoiding the self-assessment trap by moving this
+assessment into a centralized location. Also run challenges that help
+formalize this: [[dream8][Dream8 Challenges]].
+
+#+LINK: lo-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=208
+#+LINK: sage-base http://www.sagebase.org/
+#+LINK: synapse https://www.synapse.org/
+#+LINK: dream8 http://www.sagebase.org/challenges-overview/2013-dream-challenges/
+
+* Joshua Warner -- scikit-fuzzy
+
+[[jw-talk][Joshua]] talked about his implementation of a SciPy toolkit for fuzzy
+logic: [[scikit-fuzzy][scikit-fuzzy]]. Has fuzzy c-means for smaller uses and needs
+full Cythonization. Has 100% test coverage. Provides foundational
+tools for fuzzy logic but focusing on community building to provide
+additional tools. Good questions about the most useful places for
+fuzzy logic usage: it's a good prototyping step which includes some
+insight into the logic intuition for understanding approaches. It's
+not especially useful for categorical variables.
+
+#+LINK: jw-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=161
+#+LINK: scikit-fuzzy https://github.com/scikit-fuzzy/scikit-fuzzy
+
+* Jack Minardi -- Raspberry Pi sensor control
+
+[[jm-talk][Jack talked]] about interacting with [[raspberry-pi][Raspberry Pi]] using [[pyzmq][pyzmq]]: controlled
+LEDs and motors. Raspberry Pi has general purpose input output pins
+which allow you to interact with other external devices. Used the
+[[occidentalis][Occidentalis]] Wheezy-based operating system which provides a lot of
+pre-installed tools over the base installations that Raspberry Pi
+recommends. Jack live demos blinking an LED, which one-ups live
+software demos for sure. Another cool demo uses pyzmq to stream the
+xyz location of a device to a real-time plotting tool.
+
+There are an incredible number of open source tools for Raspberry Pi
+on GitHub that help manage interacting with the different hardware.
+There is a cool community around working with it.
+
+Raspberry Pi and other hackable hardware tools also provide a wonderful
+teaching environment for programming. It is so much more satisfying to
+make something happen in real life, and teaches all the important
+skills of installing, learning and debugging that you need in any kind
+of hacking. On a larger scale in genomics, efforts like the [[polonator][polonator]]
+from George Church's lab offer an opportunity to learn all the
+hardware behind sequencing.
+
+#+LINK: jm-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=215
+#+LINK: raspberry-pi https://en.wikipedia.org/wiki/Raspberry_Pi
+#+LINK: pyzmq http://www.zeromq.org/bindings:python
+#+LINK: occidentalis http://learn.adafruit.com/adafruit-raspberry-pi-educational-linux-distro/overview
+#+LINK: polonator http://polonator.org/
+
+* Jeff Spies -- The Open Science Framework
+
+[[js-talk][Jeff discussed]] work at the [[cos][Center for Open Science]] to build
+infrastructure and community around opening up science to reduce the
+gap between scientific goals (open) and science's practical needs
+(papers, funding). Problem is that published science is not
+synonymous with accurate science. Worry about unconscious biases like
+[[motivated-reasoning][Motivated Reasoning]]. Approach of OSF is to provide tools that work
+within scientific workflows to enable and incentivize openness. The
+[[osf][Open Science Framework]] provides a simplified front end to Git,
+handling archiving and versioning of study data. Provides unique URLs
+to tag specific versions for publication. Goals are to make
+components API driven to allow other interfaces like IPython notebooks.
+
+#+LINK: js-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=156
+#+LINK: cos http://www.centerforopenscience.org/
+#+LINK: motivated-reasoning https://en.wikipedia.org/wiki/Motivated_reasoning
+#+LINK: osf http://openscienceframework.org/
+
+* Burcin Eröcal -- scientific software distribution
+
+[[be-talk][Burcin discussed]] approaches to replicate, build on and improve
+scientific work. Shows an example of [[sage][Sage]], which has multiple
+requirements and installs well: installation matters, a lot. His
+approach is [[lmonade][lmonade]], which provides customizable distribution of
+scientific software. Burcin does not think virtual machines solve this
+problem because they are not programmable to add updates. I wonder if
+lightweight solutions like [[docker][docker]] help mitigate some of these
+concerns. In general, I haven't heard of any usage of virtual machines at
+SciPy which makes me sad because I think this is an important path
+for moving forward with complex installations.
+He also compares to the [[nix][nix package manager]]. The main
+issue with this is that it requires explicit definition of
+dependencies so not as flexible as scientific software needs.
+
+#+LINK: be-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=181
+#+LINK: lmonade http://www.lmona.de/
+#+LINK: sage http://sagemath.org/
+#+LINK: docker http://www.docker.io/
+#+LINK: nix http://nixos.org/
+
+* John Kitchin -- emacs org-mode for reproducible research
+
+[[jk-talk][John discussed]] using [[org-mode][emacs org-mode]] to create reproducible documents
+with embedded Python code. [[kitchin-github][John's GitHub]] has tons of useful examples
+of using this for blogging and book writing. Impressive demos, I need
+to dig into his org files for tips and tricks.
+
+#+LINK: jk-talk http://conference.scipy.org/scipy2013/presentation_detail.php?id=178
+#+LINK: org-mode http://orgmode.org/
+#+LINK: kitchin-github https://github.com/jkitchin?tab=repositories
+
+* Lightning talks
+
+Travis talked about solutions to packaging problems in Python:
+[[conda][conda]] and [[binstar][binstar.org]]. Look like useful alternatives to pip that
+might help with lots of installation problems we see with multiple
+dependencies. The [[conda-recipes][Conda recipes GitHub repo]] has lots of existing
+tools.
+
+[[jiffyclub][Matt Davis]] gave a great advertisement for [[software-carpentry][Software Carpentry]].
+It's a wonderful resource for teaching scientists to
+program. They also need teachers and help from the community, so
+volunteer please.
+
+[[blz][BLZ]] is a high IO distributed format inside of [[blaze][Blaze]]. The
+[[blz-format][BLZ format document]] has additional documentation. This all gives you
+a distributed NumPy operating on massive arrays.
+
+#+LINK: conda http://docs.continuum.io/conda/index.html
+#+LINK: binstar https://binstar.org/
+#+LINK: conda-recipes https://github.com/ContinuumIO/conda-recipes
+#+LINK: software-carpentry http://software-carpentry.org/
+#+LINK: jiffyclub https://twitter.com/jiffyclub
+#+LINK: blz http://continuum.io/blog/blz-format
+#+LINK: blaze http://blaze.pydata.org/index.html
+#+LINK: blz-format http://blaze.pydata.org/docs/format.html
+