Input/Output Files

Usage help and the complete list of I/O arguments of each command are available from the command line:

vtam COMMAND --help
e.g.
vtam filter --help

The content of the I/O files is detailed below.

params

Input of most commands. YML file with numerical parameters <numerical_parameter_file_reference>. Can be omitted if all parameters keep their default values. Simple text file in “parameter name: parameter value” format, one parameter per line, e.g.

lfn_variant_cutoff: 0.001
lfn_sample_replicate_cutoff: 0.003
lfn_read_count_cutoff: 70
pcr_error_var_prop: 0.05
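
The “parameter name: parameter value” layout above can be read with a few lines of Python. This is a minimal sketch, not VTAM's own loader: the function name `read_params` is hypothetical, and it assumes every value is numeric and that `#` starts a comment.

```python
def read_params(text):
    """Parse 'name: value' lines into a dict of numbers (illustrative sketch)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        name, _, value = line.partition(":")
        value = value.strip()
        # Keep whole numbers as int, everything else as float
        params[name.strip()] = int(value) if value.lstrip("-").isdigit() else float(value)
    return params


example = """\
lfn_variant_cutoff: 0.001
lfn_sample_replicate_cutoff: 0.003
lfn_read_count_cutoff: 70
pcr_error_var_prop: 0.05
"""
params = read_params(example)
print(params["lfn_read_count_cutoff"])  # 70
```

Since the file is valid YAML, a YAML library would work equally well; the hand-written parser only avoids an extra dependency.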

fastqinfo

Input of merge <merge_reference>. TSV file with the following columns:

  • TagFwd: Sequence of the tag on the forward primer (5’=>3’)
  • PrimerFwd: Sequence of the forward primer (5’=>3’)
  • TagRev: Sequence of the tag on the reverse primer (5’=>3’)
  • PrimerRev: Sequence of the reverse primer (5’=>3’)
  • Marker: Name of the marker (e.g. MFZR)
  • Sample: Name of the sample
  • Replicate: ID of the replicate
  • Run: Name of the sequencing run
  • FastqFwd: Name of the forward fastq file
  • FastqRev: Name of the reverse fastq file
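
A quick way to catch a malformed fastqinfo file before running merge is to compare its header with the column list above. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and the sequences, sample names and file names in the example row are placeholders.

```python
import csv
import io

FASTQINFO_COLUMNS = [
    "TagFwd", "PrimerFwd", "TagRev", "PrimerRev", "Marker",
    "Sample", "Replicate", "Run", "FastqFwd", "FastqRev",
]


def missing_columns(tsv_text, expected=FASTQINFO_COLUMNS):
    """Return the expected columns that are absent from the TSV header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [col for col in expected if col not in header]


# Hypothetical one-row file; all values are placeholders
row = ["ACGT", "AAACCC", "TGCA", "GGGTTT", "MFZR",
       "sample1", "1", "run1", "r1_fw.fastq", "r1_rv.fastq"]
tsv_text = "\t".join(FASTQINFO_COLUMNS) + "\n" + "\t".join(row) + "\n"
print(missing_columns(tsv_text))  # [] when the header is complete
```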

fastainfo

Output of merge <merge_reference>, input of sortreads <sortreads_reference>. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • tagfwd: Sequence of the tag on the forward primer (5’=>3’)
  • primerfwd: Sequence of the forward primer (5’=>3’)
  • tagrev: Sequence of the tag on the reverse primer (5’=>3’)
  • primerrev: Sequence of the reverse primer (5’=>3’)
  • mergedfasta: name of the fasta file with merged sequences

sortedinfo

Output of sortreads <sortreads_reference>, input of filter <filter_reference> and optimize <optimize_reference>. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • sortedfasta: name of the fasta file containing merged, demultiplexed, trimmed sequences

db

I/O of filter <filter_reference>, taxassign <taxassign_reference>. Input of optimize <optimize_reference>, pool <pool_reference>. SQLite database containing variants, samples, replicates, read counts, information on filtering steps, and taxonomic assignments.
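
Because the db file is a regular SQLite database, it can be inspected with Python's standard `sqlite3` module. The sketch below only lists table names through the `sqlite_master` catalogue; it makes no assumption about the VTAM schema, which is not described here. The helper name `list_tables` and the throwaway `demo_table` are hypothetical and serve only to make the example runnable.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables stored in an SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demonstration on a throwaway database (not a real VTAM db)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sqlite")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute("CREATE TABLE demo_table (id INTEGER)")
        conn.commit()
    tables = list_tables(demo_path)
print(tables)
```

Pointing `list_tables` at the path passed to the filter command would show which tables that database actually contains.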

asvtable

Output of filter <filter_reference> or pool <pool_reference>, input of taxassign <taxassign_reference>. TSV file with the variants that passed all filtering steps in rows and the samples in columns. Cells contain presence-absence (output of pool) or read counts (output of filter). Columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • variant: Variant ID
  • pooled_variants (only in output of pool): IDs of variants pooled since identical in their overlapping regions
  • sequence_length: Length of the variant
  • read_count: Total number of reads of the variant in the samples listed in the table
  • [one column per sample] : presence-absence (output of pool) or read counts (output of filter)
  • clusterid: ID of the centroid of the cluster (0.97 clustering of all variants of the ASV table)
  • clustersize: Number of variants in the cluster
  • chimera_borderline (only in output of filter): Potential chimeras (very similar to one of the parental sequences)
  • [keep_mockXX; One column per mock sample, if known_occurrences option is used]: 1 if variant is expected in the mock sample, 0 otherwise
  • pooled_sequences (only in output of pool): Sequences of pooled_variants
  • sequence: Sequence of the variant
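
Since the number of sample columns varies from one ASV table to another, a script has to locate them from the header. The sketch below assumes the column order listed above, i.e. that the per-sample columns sit between read_count and clusterid; the helper names and the abridged two-sample table are hypothetical.

```python
import csv
import io


def sample_columns(header):
    """Return the per-sample columns, assumed to lie between read_count and clusterid."""
    start = header.index("read_count") + 1
    stop = header.index("clusterid")
    return header[start:stop]


def occurrences_per_sample(tsv_text):
    """Count, for each sample, the variants with a non-zero cell."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    samples = sample_columns(reader.fieldnames)
    counts = {sample: 0 for sample in samples}
    for row in reader:
        for sample in samples:
            if int(row[sample]) > 0:
                counts[sample] += 1
    return counts


# Hypothetical, abridged table with two samples (s1, s2)
tsv_text = (
    "run\tmarker\tvariant\tsequence_length\tread_count\ts1\ts2\t"
    "clusterid\tclustersize\tsequence\n"
    "run1\tMFZR\t10\t4\t150\t100\t50\t10\t2\tACGT\n"
    "run1\tMFZR\t11\t4\t80\t80\t0\t10\t2\tACGA\n"
)
print(occurrences_per_sample(tsv_text))  # {'s1': 2, 's2': 1}
```

The same counting works for read counts (output of filter) and for presence-absence cells (output of pool), because both are zero when the variant is absent.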

known_occurrences <optimize_reference>

Input of filter <filter_reference> and optimize <optimize_reference>. Output of make_known_occurrences <make_known_occurrences_reference>. TSV file with expected occurrences (keep) and known false positives (delete).

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering) or delete (clear false positives)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

mock_composition <make_known_occurrences_reference>

Input of make_known_occurrences <make_known_occurrences_reference>. TSV file with expected sequences (keep) in mock samples.

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering), delete (clear false positives) or tolerate (variant present in a mock sample but amplifying poorly)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

sample_types <make_known_occurrences_reference>

Input of make_known_occurrences <make_known_occurrences_reference>. TSV file.

  • run: Name of the sequencing run
  • sample: Name of the sample
  • sample_type: real, negative (negative control) or mock
  • habitat: Habitat type (e.g. freshwater, marine); NA for negative control samples. It is used to detect occurrences that do not correspond to the habitat type.

missing_occurrences <make_known_occurrences_reference>

Output of make_known_occurrences <make_known_occurrences_reference>. TSV file with keep occurrences that are missing from the input ASV table.

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering) or delete (clear false positives)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

optimize_lfn_sample_replicate.tsv <OptimizeLFNsampleReplicate_reference>

Output of optimize <optimize_reference>. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • variant: Variant ID
  • N_ijk: Number of reads of variant i, in sample j and replicate k
  • N_jk: Number of reads in sample j and replicate k (all variants)
  • N_ijk/N_jk: Proportion of the reads of sample j and replicate k that belong to variant i
  • round_down: Value of N_ijk/N_jk rounded down
  • sequence: Variant sequence
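
The N_jk, N_ijk/N_jk and round_down columns can be recomputed from raw read counts as sketched below. This is an illustration rather than VTAM's own code: the function name is hypothetical, and the number of decimals kept when rounding down (`decimals=4`) is an assumption, since it is not stated here.

```python
from collections import defaultdict


def replicate_proportions(read_counts, decimals=4):
    """read_counts maps (variant, sample, replicate) -> N_ijk."""
    # N_jk: reads per (sample, replicate), summed over all variants
    n_jk = defaultdict(int)
    for (_variant, sample, replicate), n_ijk in read_counts.items():
        n_jk[(sample, replicate)] += n_ijk

    factor = 10 ** decimals  # assumption: precision used for rounding down
    result = {}
    for key, n_ijk in read_counts.items():
        total = n_jk[(key[1], key[2])]
        result[key] = {
            "N_ijk": n_ijk,
            "N_jk": total,
            "ratio": n_ijk / total,
            # Integer floor division avoids floating-point surprises
            "round_down": (n_ijk * factor // total) / factor,
        }
    return result


# Hypothetical counts: two variants in sample s1, replicate 1
counts = {(1, "s1", 1): 1, (2, "s1", 1): 2}
table = replicate_proportions(counts)
print(table[(1, "s1", 1)]["round_down"])  # 0.3333
```

Rounding down rather than to the nearest value gives a cutoff candidate that never exceeds the observed proportion.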

optimize_lfn_read_count_and_lfn_variant.tsv OR optimize_lfn_read_count_and_lfn_variant_replicate.tsv <OptimizeLFNReadCountAndLFNvariant_reference>

Output of optimize <optimize_reference>. TSV file with the following columns:

  • occurrence_nb_keep: Number of keep occurrences left after filtering with the lfn_nijk_cutoff and lfn_variant_cutoff values
  • occurrence_nb_delete: Number of delete occurrences left after filtering with the lfn_nijk_cutoff and lfn_variant_cutoff values
  • lfn_nijk_cutoff: Value of lfn_read_count_cutoff
  • lfn_variant_cutoff or lfn_variant_replicate_cutoff
  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)

optimize_lfn_variant_specific.tsv OR optimize_lfn_variant_replicate_specific.tsv <OptimizeLFNReadCountAndLFNvariant_reference>

Output of optimize <optimize_reference>. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • variant: Variant ID
  • replicate: (if optimize_lfn_variant_replicate_specific.tsv) ID of the replicate
  • action: Type of occurrence (delete/keep)
  • read_count_max: Max of N_ijk for a given i
  • N_i (optimize_lfn_variant_specific.tsv) : Number of reads of variant i
  • N_ik (optimize_lfn_variant_replicate_specific.tsv): Number of reads of variant i in replicate k
  • lfn_variant_cutoff: read_count_max/N_i or read_count_max/N_ik
  • sequence: Variant sequence
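
The lfn_variant_cutoff column is a simple ratio, which the following sketch reproduces for one variant. It assumes that read_count_max and the total (N_i or N_ik) are taken over the same set of occurrences, which is not spelled out here; the function name and the counts are hypothetical.

```python
def variant_cutoff(counts, replicate=None):
    """counts maps (sample, replicate) -> N_ijk for a single variant i.

    With replicate=None the denominator is N_i (all occurrences);
    otherwise only replicate k is considered and the denominator is N_ik.
    """
    selected = [
        n for (_sample, rep), n in counts.items()
        if replicate is None or rep == replicate
    ]
    read_count_max = max(selected)  # largest N_ijk among the selected occurrences
    total = sum(selected)           # N_i or N_ik
    return read_count_max / total


# Hypothetical read counts of one variant
counts = {("s1", 1): 60, ("s1", 2): 40, ("s2", 1): 100}
print(variant_cutoff(counts))               # 100 / 200 = 0.5
print(variant_cutoff(counts, replicate=1))  # 100 / 160 = 0.625
```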

optimize_pcr_error.tsv <OptimizePCRError_reference>

Output of optimize <optimize_reference>. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • variant_expected: ID of a keep variant
  • N_ij_expected: Number of reads of the expected variant in the sample (all replicates)
  • variant_unexpected: ID of an unexpected variant with one mismatch to the keep variant
  • N_ij_unexpected: Number of reads of the unexpected variant in the sample (all replicates)
  • N_ij_unexpected_to_expected_ratio: N_ij_unexpected/N_ij_expected
  • sequence_expected: Sequence of the expected variant
  • sequence_unexpected: Sequence of the unexpected variant
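
The pairing of expected and unexpected variants and the resulting ratio can be illustrated as follows. This sketch treats "one mismatch" as a single substitution between sequences of equal length, which is an assumption; the helper names, sequences and read counts are hypothetical.

```python
def one_substitution_apart(seq_a, seq_b):
    """True if the sequences have equal length and differ at exactly one position."""
    return len(seq_a) == len(seq_b) and sum(a != b for a, b in zip(seq_a, seq_b)) == 1


def unexpected_to_expected_ratio(n_ij_expected, n_ij_unexpected):
    """N_ij_unexpected / N_ij_expected for one sample (all replicates summed)."""
    return n_ij_unexpected / n_ij_expected


# Hypothetical variants in one mock sample
expected = {"sequence": "ACGT", "N_ij": 1000}
unexpected = {"sequence": "ACGA", "N_ij": 25}

ratio = None
if one_substitution_apart(expected["sequence"], unexpected["sequence"]):
    ratio = unexpected_to_expected_ratio(expected["N_ij"], unexpected["N_ij"])
print(ratio)  # 0.025
```

The largest such ratio observed in the mock samples is a natural starting point when choosing a value for pcr_error_var_prop.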

output (taxassign)

Output of taxassign <taxassign_reference>. The input asvtable completed with the following columns:

  • ltg_tax_id: TaxID of the LTG (Lowest Taxonomic Group)
  • ltg_tax_name: Name of the LTG
  • ltg_rank: Taxonomic rank of the LTG
  • identity: Percentage of identity used to determine the LTG
  • blast_db: Name of the taxonomic BLAST database files (without extensions)
  • phylum: Phylum of LTG
  • class: class of LTG
  • order: order of LTG
  • family: family of LTG
  • genus: genus of LTG
  • species: species of LTG

taxonomy

Output of taxonomy <taxonomy_reference>, input of taxassign <taxassign_reference>. TSV file with information on all taxa in the reference (BLAST) database.

  • tax_id: Taxonomic identifier of the taxon
  • parent_tax_id: Taxonomic identifier of the direct parent of the taxon
  • rank: Taxonomic rank of the taxon (e.g. class, species, no rank)
  • name_txt: Name of the taxon
  • old_tax_id: TaxID of taxa merged to taxon (not valid any more)
  • taxlevel index (optional; 0 = root, 1 = superkingdom, 2 = kingdom, 3 = phylum, 4 = class, 5 = order, 6 = family, 7 = genus, 8 = species, x.5 for intermediate levels)
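
The tax_id and parent_tax_id columns encode a tree, so the lineage of any taxon can be recovered by following parents up to the root. The sketch below is illustrative: the helper names, the tax IDs and the taxon names are hypothetical, and it assumes the root is either its own parent or has a parent that is absent from the file.

```python
import csv
import io


def load_taxonomy(tsv_text):
    """Index the taxonomy rows by integer tax_id."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {int(row["tax_id"]): row for row in reader}


def lineage(taxonomy, tax_id):
    """Return [(rank, name_txt), ...] from the given taxon up to the root."""
    path = []
    seen = set()  # guards against a root that is its own parent
    while tax_id in taxonomy and tax_id not in seen:
        seen.add(tax_id)
        row = taxonomy[tax_id]
        path.append((row["rank"], row["name_txt"]))
        tax_id = int(row["parent_tax_id"])
    return path


# Hypothetical three-taxon file
tsv_text = (
    "tax_id\tparent_tax_id\trank\tname_txt\told_tax_id\ttaxlevel\n"
    "1\t1\tno rank\troot\t\t0\n"
    "10\t1\tphylum\tPhylumA\t\t3\n"
    "20\t10\tspecies\tSpecies a\t\t8\n"
)
taxonomy = load_taxonomy(tsv_text)
print(lineage(taxonomy, 20))
```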

runmarker

Input of pool <pool_reference>. TSV file with the list of all run-marker combinations to be pooled.

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)