Work report from Codefest 2013

commit 010b1ad6bd811165b0a4e6c796f4d4b9637f6c88 1 parent 17968c5
@chapmanb authored
242 posts/conferences/
@@ -0,0 +1,242 @@
+#+BLOG: bcbio
+#+POSTID: 524
+#+DATE: [2013-07-18 Thu 18:26]
+#+TITLE: Summary from Bioinformatics Open Science Codefest 2013: Tools, infrastructure, standards and visualization
+#+CATEGORY: OpenBio
+#+TAGS: bioinformatics, bosc, hackathon
+#+OPTIONS: toc:nil num:nil
+The [[bosc][2013 Bioinformatics Open Source Conference (BOSC)]] starts tomorrow in
+Berlin, Germany. It's a yearly conference devoted to community-based
+software development projects supporting biological research. Members of the
+[[open-bio][Open Bioinformatics Foundation]] discuss implementations and approaches
+to better provide interoperable and reusable software, libraries and
+For the past five years, a two day [[codefest][Codefest]] and hackathon preceded
+the conference. This gives programmers time to work face-to-face,
+sharing approaches and discovering connections between projects. This
+year, [[ivo][the Department of Biology, Humboldt-Universität zu Berlin]]
+kindly hosted [[codefest][Codefest 2013]]. Thanks to the organizers and [[attendees][attendees]],
+we finished projects ranging from tool development and infrastructure
+integration to standards development and visualization. There are
+[[roman-photos][photos of the Codefest in progress]] and a [[codefest-doc][detailed writeup of projects]].
+Below we summarize the accomplishments from the two days. We
+welcome feedback on the topics covered and hope that by sharing our
+work we can encourage more programmers to become part of the open
+science bioinformatics community. Actively working to build
+well-tested, community-developed, interoperable tools is how we solve
+increasingly difficult research questions ranging from human health to
+plant breeding to microbial community function. The progress
+made in two days illuminates the effectiveness of open collaborative
+#+LINK: attendees
+#+LINK: ivo
+#+LINK: bosc
+#+LINK: open-bio
+#+LINK: codefest
+#+LINK: codefest-doc
+#+LINK: roman-photos
+* Tool Development
+** BioRuby and BaseSpace - Develop SDK and apps for Illumina BaseSpace
+/Toshiaki Katayama, Raoul Bonnal, Eri Kibukawa, Joachim Baran, Dan MacLean, Fernando Izquierdo-Carrasco, Spencer Bliven/
+During the Codefest, we tested and documented our
+[[basespace-ruby][port of the BaseSpace Python SDK to Ruby]]. Ruby/Biogem developers
+can now easily utilize next-generation sequencing code within
+[[basespace][Illumina's BaseSpace]] framework. For non-Ruby programmers, we
+found that it can be a burden to create a new Web app from
+scratch on top of your NGS program, so we started a new project to
+provide a Web-app scaffold for BaseSpace. We have already implemented the
+basic portion but will need some more time before releasing the
+BioBaseSpace application.
+#+LINK: basespace-ruby
+#+LINK: basespace
+** Barrnap - Bacterial ribosomal RNA predictor
+/Torsten Seemann, Tim Booth/
+For the last 8 years RNAmmer has been the standard tool for predicting
+ribosomal RNA features in genomes, because it is reasonably fast,
+accurate, and works on bacteria and eukaryotes. Its drawbacks are that
+it relies on small, older databases; requires an older conflicting
+version of HMMER; and has restrictive licence terms. To resolve these
+issues we have implemented a new rRNA predictor which uses the new
+“nhmmer” tool from HMMER 3.1 for searching DNA profiles against DNA
+sequences. We used the Silva and GreenGenes seed alignments for the 5S,
+23S and 16S genes to build the profile models. Barrnap is a small
+Perl script which takes FASTA as input, and outputs the rRNA feature
+predictions in GFF3 format. It will be packaged in Bio-Linux and
+replace RNAmmer in the Prokka bacterial annotation system.
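To illustrate the output side, here is a minimal Python sketch of how a downstream script might read GFF3 feature lines such as those Barrnap writes. The `parse_gff3` helper, the sample record and its attribute names are hypothetical and made up for this example; they are not taken from real Barrnap output.

```python
from collections import namedtuple

# One GFF3 feature; field names follow the nine tab-separated GFF3 columns.
Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)


def parse_gff3(lines):
    """Yield Feature tuples from GFF3 text lines, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError("expected 9 tab-separated columns: %r" % line)
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attrs)


# Hypothetical example record, invented for this sketch.
SAMPLE = [
    "##gff-version 3",
    "contig1\texample\trRNA\t100\t1630\t.\t+\t.\tName=16S_rRNA;product=16S ribosomal RNA",
]

features = list(parse_gff3(SAMPLE))
for feat in features:
    # GFF3 coordinates are 1-based and inclusive, hence the +1.
    print(feat.seqid, feat.attributes["Name"], feat.end - feat.start + 1)
```

Because GFF3 is plain tab-separated text, the predictions can be filtered by feature type or length without any Barrnap-specific tooling.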
+** BioJVM - Coordinating and integrating BioJava and ScaBio
+/Spencer Bliven, Andreas Prlic, Markus Gumbel/
+Both Java and Scala run on the Java Virtual Machine. As such, it makes
+sense to [[biojava-scala][coordinate and document]] the various Bio* projects which run on the JVM and
+therefore can interoperate to some degree. We are able to successfully
+reference BioJava functions from Scala code and ScaBio functions from
+Java code. The ease of this process means that users can easily use both
+libraries from whichever language is more suited for their biological
+#+LINK: biojava-scala
+** Biopython
+/Peter Cock, Konstantin Tretyakov, Bin Zhang/
+The Biopython team worked on training new users at Codefest and exploring
+integration of Biopython with other Python molecular visualization toolkits
+like [[pymol][PyMol]]. Infrastructure development involved testing and debugging
+on multiple systems, including identifying and fixing Windows and
+PyPy problems. We also identified areas where we can make it easier
+to contribute to Biopython: specifically easing the process to report
+and fix bugs by moving to integrated GitHub issue tracking and
+working to support Biopython-associated projects with easy
+installation tools.
+#+LINK: biopython
+#+LINK: pymol
+** Galaxy Debianization
+/Tim Booth/
+I spent several hours revisiting previous work on the Galaxy package
+for Bio-Linux and made significant progress towards it being something
+that can go into Debian-proper. Results will be committed to Deb-Med
+public SVN and patches will be forwarded to the Galaxy dev mailing
+* Standards and Visualization
+** Ontology and provenance representation
+/Herve Menager, Bertrand Neron, Jackie Quinn, Stian Soiland-Reyes, Matus Kalas, Steffen Moller/
+The goal of this group was to [[ont-doc][investigate and implement solutions]] to
+use ontologies to help people find and use the programs and data they
+need for their work, and to help automate the integration of tools or
+data resources into workflows or workbenches. We also wanted to
+identify useful provenance metadata, to store in a rigorous way the
+conditions and configuration of analysis steps run by users. This
+improves transparency, reproducibility, and reliability of the
+scientific results.
+We worked toward inclusion of the [[edam][EDAM ontology]] as part of the
+[[mobyle][Mobyle system's]] built-in type and classification mechanisms.
+We created a use case by identifying workflows in Mobyle and mapping their
+descriptions onto the EDAM classification to allow mapping between the types.
+We also investigated the possibilities opened by projects such as PROV
+to standardize the provenance information stored by systems such as
+Mobyle. We added prototype functionality to the development version
+of Mobyle that dynamically generates this provenance information in a
+JSON-based format.
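As a rough illustration of what JSON-based provenance for one analysis step can look like, the sketch below builds a small document whose top-level keys (`entity`, `activity`, `used`, `wasGeneratedBy`) are loosely modelled on the W3C PROV vocabulary. The `provenance_record` function, the `ex:` identifiers and the example job are assumptions made for this example and do not reflect the format Mobyle actually produces.

```python
import json
from datetime import datetime, timezone


def provenance_record(job_id, program, inputs, outputs, started, ended):
    """Build a PROV-style dictionary describing a single analysis step."""
    activity = "ex:job-%s" % job_id
    doc = {
        "prefix": {"ex": "http://example.org/"},  # placeholder namespace
        "entity": {},
        "activity": {
            activity: {
                "prov:startTime": started.isoformat(),
                "prov:endTime": ended.isoformat(),
                "ex:program": program,
            }
        },
        "used": {},
        "wasGeneratedBy": {},
    }
    # Inputs are entities the activity used.
    for i, name in enumerate(inputs):
        doc["entity"]["ex:" + name] = {}
        doc["used"]["_:u%d" % i] = {
            "prov:activity": activity,
            "prov:entity": "ex:" + name,
        }
    # Outputs are entities generated by the activity.
    for i, name in enumerate(outputs):
        doc["entity"]["ex:" + name] = {}
        doc["wasGeneratedBy"]["_:g%d" % i] = {
            "prov:entity": "ex:" + name,
            "prov:activity": activity,
        }
    return doc


# Hypothetical job, invented for this sketch.
record = provenance_record(
    job_id="42",
    program="blastn",
    inputs=["query.fasta"],
    outputs=["hits.txt"],
    started=datetime(2013, 7, 17, 10, 0, tzinfo=timezone.utc),
    ended=datetime(2013, 7, 17, 10, 5, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Recording inputs, outputs and timing as separate relations is what allows a later reader to trace which data and configuration produced a given result.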
+#+LINK: ont-doc
+#+LINK: mobyle
+#+LINK: edam
+#+LINK: prov
+** Integrate DGE-Vis & Dalliance, JS animation scheduler
+/David Powell, Thomas Down, Skyler Brungardt, Alex Kalderimis/
+We worked on integrating two visualization tools: the
+[[dalliance][Dalliance genome browser]] and the [[dge-vis][DGE-Vis]] RNA-seq explorer.
+We now have [[dge-dalliance][a proof-of-concept tool]] that makes it possible to
+visualise RNA-seq analysis while browsing the genome.
+This inspired [[timeywimey][a JavaScript scheduler]] that is able to schedule
+slow animation updates when the JavaScript engine is not busy,
+allowing smoother animations and more accurate windows.
+Finally, we added a JBrowse-compatible JSON backend for Dalliance
+for integration with [[intermine][InterMine]].
+#+LINK: dalliance
+#+LINK: dge-vis
+#+LINK: dge-dalliance
+#+LINK: timeywimey
+#+LINK: intermine
+* Infrastructure
+** Infrastructure management via CloudBioLinux (CBL)
+/Enis Afgan, John Chilton, Brad Chapman/
+- Galaxy: We integrated custom installation procedures present in CBL
+ with the Galaxy-tools versioned installation methodology.
+- Documentation: Due to increased interest from individuals in using
+ and contributing to CBL, we invested effort into creating purpose-driven
+ documentation for CBL. This should help people use the end product of
+ CBL, customize CBL to their needs, and learn about the internals
+ of CBL with the aim of contributing. We will finish and make the
+ documentation available on ReadTheDocs over the coming months.
+- Build frameworks: We developed a simpler automated method to invoke the
+ CBL build framework to help remove complex, error-prone steps.
+- Web tooling: In the spirit of making CBL more accessible and easier to use, we’ve
+ decided to tackle development of a lightweight webapp that helps with
+ customizing and generating CBL configuration files.
+** Improve IPython cluster support and runtime metrics
+/Valentine Svensson, Guillermo Carrasco, Roman Valls, Per Unneberg/
+We worked to extend the [[ipython-parallel][IPython parallel cluster]] framework to support
+additional schedulers, specifically [[slurm-code][implementing SLURM support]] to
+supplement the existing SGE, LSF, Torque and Condor schedulers. We plan to
+extend this to allow generalized use of the DRMAA connector, and
+ultimately to port this generalization into IPython so that Python
+scientific computations can be executed efficiently across different
+cluster implementations. Both [[r-blog][Roman]] and [[g-blog][Guillermo]] blogged
+[[r-blog2][detailed documentation]] of the [[g-blog2][work in progress]].
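Supporting a new scheduler largely comes down to rendering a batch script in that scheduler's dialect and submitting it. The Python sketch below shows that idea for SLURM using standard `#SBATCH` directives. The `slurm_script` function, its parameters and the placeholder command are hypothetical; this is not the code written at Codefest.

```python
def slurm_script(job_name, command, ntasks=1, walltime="01:00:00", partition=None):
    """Return the text of a SLURM batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=%s" % job_name,
        "#SBATCH --ntasks=%d" % ntasks,
        "#SBATCH --time=%s" % walltime,
    ]
    # The partition (queue) is site specific, so only emit it when given.
    if partition:
        lines.append("#SBATCH --partition=%s" % partition)
    lines.append(command)
    return "\n".join(lines) + "\n"


# Placeholder command; a real backend would start a compute engine here.
script = slurm_script("engine", "echo starting-engine", ntasks=8,
                      walltime="02:00:00", partition="core")
print(script)
```

The resulting text would typically be written to a file and handed to `sbatch`; other schedulers differ mainly in the directive prefix and option names, which is why a generic connector such as DRMAA is attractive.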
+We also worked to build a tool that helps provide run time
+estimates for bioinformatics jobs (e.g. “how long should aligning 40 million
+reads against hg19 with BWA take if I use 8 cores?”). We plan to
+collaborate on longer term development of this with the
+[[gcat][Genome Comparison and Analytic Testing]] team.
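One very simple way to answer such a question is to average the core-hours per read seen in earlier runs and scale to the new job. The sketch below does exactly that in Python, under the strong assumption that run time grows linearly with read count and shrinks linearly with core count. The `estimate_hours` function and the timings in `PAST_RUNS` are invented placeholders, not measurements and not the tool built at Codefest.

```python
def estimate_hours(past_runs, reads, cores):
    """Estimate wall-clock hours from (reads, cores, hours) tuples of past runs.

    Assumes perfectly linear scaling in both reads and cores, which real
    aligners only approximate.
    """
    if not past_runs:
        raise ValueError("need at least one past run")
    # Core-hours spent per read in each earlier run.
    rates = [hours * n_cores / float(n_reads)
             for n_reads, n_cores, hours in past_runs]
    mean_rate = sum(rates) / len(rates)
    return mean_rate * reads / cores


# Made-up timings for illustration only: (reads, cores, wall-clock hours).
PAST_RUNS = [
    (10000000, 4, 2.0),
    (20000000, 8, 2.0),
]

estimate = estimate_hours(PAST_RUNS, reads=40000000, cores=8)
print("estimated hours: %.1f" % estimate)
```

A practical tool would also need to account for I/O, memory limits and reference genome size, which is where shared timing data from many sites becomes valuable.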
+#+LINK: g-blog
+#+LINK: g-blog2
+#+LINK: r-blog
+#+LINK: r-blog2
+#+LINK: slurm-code
+#+LINK: ipython-parallel
+#+LINK: gcat§discussion/7916/runtimeswallclock-time-alongside-the-accuracy-metrics#Item_1
+** GATK-based reusable pipeline based around Rubra/Ruffus
+/Clare Sloggett, Bernie Pope/
+We worked on code cleanup, documentation and test data for
+[[ruffus-pipe][a reusable pipeline]] to handle variant calling and annotation, using [[rubra][Rubra]] built
+on the [[ruffus][Ruffus]] framework. It handles BWA alignment, GATK alignment
+cleaning and variant calling, and Ensembl annotation. To make these
+pipelines easier to run, we worked on integrating them into the
+[[gvl-flavor][GVL flavor]] in CloudBioLinux.
+#+LINK: ruffus-pipe
+#+LINK: rubra
+#+LINK: ruffus
+#+LINK: gvl-flavor
194 posts/conferences/
@@ -0,0 +1,194 @@
+#+BLOG: smallchangebio
+#+POSTID: 47
+#+DATE: [2013-06-28 Fri 08:15]
+#+TITLE: Scientific Python 2013, Day 2: Bioinformatics frameworks, open science and reproducibility
+#+CATEGORY: conference
+#+TAGS: bioinformatics, ngs, scaling, python, ipython, machine-learning
+#+OPTIONS: toc:nil
+I'm at the [[scipy2013][2013 Scientific Python conference]] in beautiful Austin,
+Texas. I'm helping organize this year's [[scipy-bio][Bioinformatics Symposium]]
+and learning about Python scaling and reproducibility from the
+Scientific Python community. These are my notes from the second day.
+See also my [[first-day-talks][notes from the first day]].
+#+LINK: first-day-talks
+* Ian Rees -- Bioinformatics symposium: electron microscopy platform
+[[ir-talk][Ian talked]] about software to handle data challenges associated with
+imaging. They principally focus on imaging of macromolecules using
+[[cryo-em][Cryo-EM]]. There is then a ton of processing before this can get into
+PDB as structures. 300+ active projects. Focus on archival,
+automation of record keeping, understanding how protocols change over
+time and sharing results with collaborators.
+[[emen2][EMEN2]] is their solution: an object-oriented lab notebook. It uses a
+protocol ontology to allow flexible queries of approaches. Takes a
+general approach to connect records. Impressive display of a long 10
+year project with 15,500 records. Records look like JSON documents of
+key-value pairs. Built on top of BerkeleyDB, with a Twisted web
+server. Provides integrated plotting and viewing of results, in
+addition to table-based viewing of projects and samples. Hooks in
+with the microscopy software so data uploads automatically. It
+provides a public API with JSON to query the data with a constraint
+based query language. Also has Python hooks to create extensions with
+controllers and Mako templates.
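To make the idea of key-value records plus constraint-based queries concrete, here is a small Python sketch that filters dictionary records with `(key, operator, value)` constraints. The record fields, the `query` function and the operator names are all hypothetical and chosen for illustration; this is not the EMEN2 API or its query language.

```python
import operator

# Supported comparison operators for constraints.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def query(records, constraints):
    """Return the records that satisfy every (key, op, value) constraint."""
    matches = []
    for rec in records:
        # A record lacking the key cannot satisfy a constraint on it.
        if all(key in rec and OPS[op](rec[key], value)
               for key, op, value in constraints):
            matches.append(rec)
    return matches


# Invented records; "parent" links a record to the project it belongs to.
RECORDS = [
    {"name": 1, "rectype": "project", "title": "Example project"},
    {"name": 2, "rectype": "grid_imaging", "parent": 1, "magnification": 50000},
    {"name": 3, "rectype": "grid_imaging", "parent": 1, "magnification": 80000},
]

hits = query(RECORDS, [("rectype", "==", "grid_imaging"),
                       ("magnification", ">", 60000)])
print([rec["name"] for rec in hits])
```

Because records are schemaless key-value documents, new fields can be added as protocols change without migrating older records, and queries simply skip records that lack a field.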
+The connected image processing toolkit is [[eman2][EMAN2]].
+#+LINK: ir-talk
+#+LINK: cryo-em
+#+LINK: emen2
+#+LINK: eman2
+* Larsson Omberg -- Bioinformatics symposium: Synapse platform
+[[lo-talk][Larsson discussed]] the approaches and tools that [[sage-base][Sage Bionetworks]] use
+to help improve reproducibility of science. The [[synapse][Synapse]] tool tries to
+handle reproducibility on a distributed scale with multiple
+collaborating scientists, as opposed to other projects which focus on
+single researchers. Example of usage for The Cancer Genome Atlas:
+10,000 patients, 24 cancer types, and multiple inputs: variations,
+RNA-seq. Biggest challenge is coordination of multiple data sources
+and inputs. Data automatically pushed into Synapse, then do data
+freezes to allow analysis. Analysis results get pushed back into
+Synapse as well.
+Synapse is a web framework that allows tools to be used in
+multiple places and registers results back with Synapse to coordinate
+them. The Python API allows you to query with SQL syntax and retrieve
+specific datasets which have key/value style metadata annotations in
+addition to the raw data. Impressive demo of uploaded results with
+lots of metadata: nice way to understand custom analysis and review
+Synapse focuses on avoiding the self-assessment trap by moving this
+assessment into a centralized location. Also run challenges that help
+formalize this: [[dream8][Dream8 Challenges]].
+#+LINK: lo-talk
+#+LINK: sage-base
+#+LINK: synapse
+#+LINK: dream8
+* Joshua Warner -- scikit-fuzzy
+[[jw-talk][Joshua]] talked about his implementation of a SciPy toolkit for fuzzy
+logic: [[scikit-fuzzy][scikit-fuzzy]]. Has fuzzy c-means for smaller uses and needs
+full Cythonization. Has 100% test coverage. Provides foundational
+tools for fuzzy logic but focusing on community building to provide
+additional tools. Good questions about the most useful places for
+fuzzy logic usage: it's a good prototyping step which includes some
+insight into the logic intuition for understanding approaches. It's
+not especially useful for categorical variables.
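For readers new to fuzzy logic, the sketch below shows its basic building blocks in plain Python: a triangular membership function that returns a degree of membership between 0 and 1, and the common min/max definitions of fuzzy AND and OR. The function names and the temperature example are made up for illustration, and this is not the scikit-fuzzy API or its c-means implementation.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b.

    Assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / float(b - a)
    return (c - x) / float(c - b)


def fuzzy_and(*degrees):
    """Fuzzy intersection using the minimum operator."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy union using the maximum operator."""
    return max(degrees)


# Hypothetical sets describing temperature in degrees Celsius.
temperature = 22
cold = triangular(temperature, 0, 10, 25)
warm = triangular(temperature, 15, 25, 35)

print("cold=%.2f warm=%.2f" % (cold, warm))
print("cold AND warm=%.2f" % fuzzy_and(cold, warm))
print("cold OR warm=%.2f" % fuzzy_or(cold, warm))
```

A single measurement can belong partly to several sets at once, which is the intuition that makes fuzzy logic handy for prototyping with continuous variables and less natural for categorical ones.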
+#+LINK: jw-talk
+#+LINK: scikit-fuzzy
+* Jack Minardi -- Raspberry Pi sensor control
+[[jm-talk][Jack talked]] about interacting with [[raspberry-pi][Raspberry Pi]] using [[pyzmq][pyzmq]]: controlled
+LEDs and motors. Raspberry Pi has general purpose input output pins
+which allow you to interact with other external devices. Used the
+[[occidentalis][Occidentalis]] Wheezy based operating system which provides a lot of
+pre-installed tools over the base installations that Raspberry Pi
+recommends. Jack live demos blinking an LED, which one-ups live
+software demos for sure. Another cool demo uses pyzmq to stream the
+xyz location of a device to a real time plotting tool.
+There are an incredible number of open source tools for Raspberry Pi
+on GitHub that help manage interacting with the different hardware.
+There is a cool community around working with it.
+Raspberry Pi and other hackable hardware tools also provide a wonderful
+teaching environment for programming. It is so much more satisfying to
+make something happen in real life, and teaches all the important
+skills of installing, learning and debugging that you need in any kind
+of hacking. On a larger scale in genomics, efforts like the [[polonator][polonator]]
+from George Church's lab offer an opportunity to learn all the
+hardware behind sequencing.
+#+LINK: jm-talk
+#+LINK: raspberry-pi
+#+LINK: pyzmq
+#+LINK: occidentalis
+#+LINK: polonator
+* Jeff Spies -- The Open Science Framework
+[[js-talk][Jeff discussed]] work at the [[cos][Center for Open Science]] to build
+infrastructure and community around opening up science to reduce the
+gap between scientific goals (open) and science's practical needs
+(papers, funding). Problem is that published science is not
+synonymous with accurate science. Worry about unconscious biases like
+[[motivated-reasoning][Motivated Reasoning]]. Approach of OSF is to provide tools that work
+within scientific workflows to enable and incentivize openness. The
+[[osf][Open Science Framework]] provides a simplified front end to Git,
+handling archiving and versioning of study data. Provides unique URLs
+to tag specific versions for publication. Goals are to make
+components API driven to allow other interfaces like IPython notebooks.
+#+LINK: js-talk
+#+LINK: cos
+#+LINK: motivated-reasoning
+#+LINK: osf
+* Burcin Eröcal -- scientific software distribution
+[[be-talk][Burcin discussed]] approaches to replicate, build on and improve
+scientific work. Shows an example of [[sage][Sage]], which has multiple
+requirements and installs well: installation matters, a lot. His
+approach is [[lmonade][lmonade]], which provides customizable distribution of
+scientific software. Burcin does not think virtual machines solve this
+problem because they are not programmable to add updates. I wonder if
+lightweight solutions like [[docker][docker]] help mitigate some of these
+concerns. In general, I haven't heard any usage of virtual machines at
+SciPy which makes me sad because I think this is an important path
+for moving forward with complex installations.
+He also compares to the [[nix][nix package manager]]. The main
+issue with this is that it requires explicit definition of
+dependencies so not as flexible as scientific software needs.
+#+LINK: be-talk
+#+LINK: lmonade
+#+LINK: sage
+#+LINK: docker
+#+LINK: nix
+* John Kitchin -- emacs org-mode for reproducible research
+[[jk-talk][John discussed]] using [[org-mode][emacs org-mode]] to create reproducible documents
+with embedded Python code. [[kitchin-github][John's GitHub]] has tons of useful examples
+of using this for blogging and book writing. Impressive demos, I need
+to dig into his org files for tips and tricks.
+#+LINK: jk-talk
+#+LINK: org-mode
+#+LINK: kitchin-github
+* Lightning talks
+Travis talked about solutions to packaging problems in Python:
+[[conda][conda]] and [[binstar][binstar]]. They look like useful alternatives to pip that
+might help with lots of installation problems we see with multiple
+dependencies. The [[conda-recipes][Conda recipes GitHub repo]] has lots of existing
+[[jiffyclub][Matt Davis]] gave a great advertisement for [[software-carpentry][Software Carpentry]].
+It's a wonderful resource for teaching scientists to
+program. They also need teachers and help from the community, so
+volunteer please.
+[[blz][BLZ]] is a high IO distributed format inside of [[blaze][Blaze]]. The
+[[blz-format][BLZ format document]] has additional documentation. This all gives you
+a distributed NumPy operating on massive arrays.
+#+LINK: conda
+#+LINK: binstar
+#+LINK: conda-recipes
+#+LINK: software-carpentry
+#+LINK: jiffyclub
+#+LINK: blz
+#+LINK: blaze
+#+LINK: blz-format