Note: As of Sep 2017, these processing scripts are deprecated in favor of nextstrain/augur. Current Nextflu builds run off the flu build detailed here. The code in this directory is kept in place for archival reasons.
Augur is the processing pipeline used to track flu evolution. It currently:
- imports public sequence data
- subsamples, cleans and aligns sequences
- builds a phylogenetic tree from this data
- reports statistics about mutations and branching patterns of the tree
- infers mutation frequency trajectories through time
- infers antigenic phenotypes from titer data
The entire pipeline is run with
Sequence download, cleaning and alignment
Virus sequence data is manually downloaded from the GISAID EpiFlu database. Data from GISAID may not be disclosed outside the GISAID community. We are mindful of this and raw GISAID data has not been released publicly as part of this project. The current pipeline is designed to work specifically for HA from influenza H3N2. Save GISAID sequences as
Keep viruses with fully specified dates, cell passage, and only one sequence per strain name. Subsample to 50 sequences per month (by default) over the last 3 years (by default) before present. Append geographic metadata. Subsampling prefers longer sequences over shorter ones and more geographic diversity over less.
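The subsampling rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code the pipeline uses: the record fields (`strain`, `date`, `seq`, `region`), the function name `subsample`, and the round-robin strategy for spreading picks across regions are all hypothetical, and the 3-year window is approximated as 3 × 365 days.

```python
from collections import defaultdict
from datetime import date, timedelta


def subsample(viruses, per_month=50, years_back=3, today=None):
    """Pick at most `per_month` viruses per calendar month in the time window.

    `viruses` is a list of dicts with the hypothetical keys
    'strain', 'date' (datetime.date), 'seq' and 'region'.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=365 * years_back)  # approximate window

    # One sequence per strain name: keep the longest one seen (an assumption).
    by_strain = {}
    for v in viruses:
        if not (cutoff <= v["date"] <= today):
            continue
        best = by_strain.get(v["strain"])
        if best is None or len(v["seq"]) > len(best["seq"]):
            by_strain[v["strain"]] = v

    # Bin by (year, month), then by region.
    bins = defaultdict(lambda: defaultdict(list))
    for v in by_strain.values():
        bins[(v["date"].year, v["date"].month)][v["region"]].append(v)

    selected = []
    for month in sorted(bins):
        regions = bins[month]
        # Within a region, longer sequences are preferred.
        for seqs in regions.values():
            seqs.sort(key=lambda v: len(v["seq"]), reverse=True)
        picked = []
        # Round-robin over regions so no single region fills the quota.
        while len(picked) < per_month and any(regions.values()):
            for name in sorted(regions):
                if regions[name] and len(picked) < per_month:
                    picked.append(regions[name].pop(0))
        selected.extend(picked)
    return selected
```

Cycling through regions one pick at a time is one simple way to favor geographic diversity; a weighted scheme would serve equally well if some regions should be sampled more heavily.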
Clean up the alignment so that the reference frame is kept intact. Remove sequences that don't conform to a rough molecular clock, and remove known reassortant sequences and other outliers.
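One way to picture the molecular clock filter is as a regression of divergence against sampling date, dropping sequences whose residual is unusually large. The sketch below assumes exactly that; the function names, the use of Hamming distance to a reference, and the `max_deviation` threshold (in units of the median absolute residual) are illustrative choices and may differ from what the pipeline does.

```python
import statistics


def hamming_fraction(seq, ref):
    """Fraction of differing sites, ignoring gaps and ambiguous bases."""
    pairs = [(a, b) for a, b in zip(seq, ref) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def clock_filter(records, ref_seq, max_deviation=3.0):
    """Drop records that deviate strongly from a linear clock.

    `records` is a list of (name, numeric_date, aligned_seq) tuples.
    """
    xs = [d for _, d, _ in records]
    ys = [hamming_fraction(s, ref_seq) for _, _, s in records]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # all samples share one date: no clock can be fitted
        return list(records)

    # Ordinary least-squares fit: divergence = intercept + slope * date.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Robust scale estimate: median absolute residual.
    scale = statistics.median(abs(r) for r in residuals)
    if scale == 0:
        return list(records)
    return [rec for rec, r in zip(records, residuals)
            if abs(r) <= max_deviation * scale]
```

A median-based scale is used so that the outliers being hunted do not inflate the threshold that is supposed to catch them.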
Reroot the tree based on the outgroup strain, collapse nodes with zero-length branches, ladderize the tree, and collect strain metadata.
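To make two of these tree operations concrete, here is a minimal sketch of collapsing zero-length internal branches and ladderizing, written against a hypothetical `Node` class rather than the tree library the pipeline uses. Rerooting and metadata collection are left out, and the `eps` tolerance is an assumption.

```python
class Node:
    """Minimal tree node (hypothetical, for illustration only)."""

    def __init__(self, name=None, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []

    def is_leaf(self):
        return not self.children


def collapse_zero_length(node, eps=1e-9):
    """Merge internal children with (near) zero branch length into `node`.

    The grandchildren attach directly to `node`, producing a polytomy.
    """
    new_children = []
    for child in node.children:
        collapse_zero_length(child, eps)  # post-order: fix the subtree first
        if not child.is_leaf() and child.branch_length <= eps:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


def count_leaves(node):
    if node.is_leaf():
        return 1
    return sum(count_leaves(c) for c in node.children)


def ladderize(node):
    """Sort children so that smaller clades come first, recursively."""
    for child in node.children:
        ladderize(child)
    node.children.sort(key=count_leaves)
    return node
```

Recounting leaves at every level keeps the sketch short at the cost of quadratic time on deep trees; caching the counts would be the natural refinement for large phylogenies.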
Estimate genotype and clade frequency trajectories using a Bernoulli observation model combined with a genetic drift model of process noise.
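The idea of combining a Bernoulli observation model with a drift-like smoothness term can be sketched as a penalized likelihood fit. The version below is a simplification under stated assumptions, not the estimator used by the pipeline: time is discretized into equally spaced pivots, the frequency is parameterized on the logit scale, and process noise is modeled as a Gaussian random walk in logit space with a hypothetical `stiffness` parameter. The function name `estimate_frequencies` and the plain gradient ascent optimizer are likewise illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimate_frequencies(counts, stiffness=1.0, iterations=2000):
    """Fit a smooth frequency trajectory to presence/absence counts.

    `counts` is a list of (n_with_genotype, n_total), one entry per pivot.
    Maximizes  sum_i [k_i log x_i + (n_i - k_i) log(1 - x_i)]
               - (stiffness / 2) * sum_i (z_{i+1} - z_i)^2,
    where x_i = sigmoid(z_i).
    """
    num_pivots = len(counts)
    z = [0.0] * num_pivots  # start every pivot at frequency 0.5
    max_n = max(n for _, n in counts)
    # Step size from an upper bound on the curvature of the objective.
    step = 1.0 / (max_n / 4.0 + 4.0 * stiffness)

    for _ in range(iterations):
        x = [sigmoid(v) for v in z]
        grad = []
        for i, (k, n) in enumerate(counts):
            g = k - n * x[i]  # Bernoulli log-likelihood gradient w.r.t. z_i
            if i > 0:
                g -= stiffness * (z[i] - z[i - 1])  # pull toward left pivot
            if i < num_pivots - 1:
                g -= stiffness * (z[i] - z[i + 1])  # pull toward right pivot
            grad.append(g)
        z = [v + step * g for v, g in zip(z, grad)]
    return [sigmoid(v) for v in z]
```

For example, `estimate_frequencies([(0, 10), (2, 10), (5, 10), (8, 10), (10, 10)])` returns a rising trajectory that stays strictly between 0 and 1 even at pivots where every sample agrees, because the smoothness term ties each pivot to its neighbors. A larger `stiffness` gives a flatter trajectory.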
Prepare the data files for auspice visualization, removing cruft along the way.