preparing_configcsv

Fei Zhao edited this page Apr 29, 2021 · 5 revisions

Modifying the configuration file config.yaml

To run the Triti-Map main program, you need to modify the parameters in the configuration file.

Description of the main parameters of the configuration file

  • email: Important. Provide your personal email address; it is required when the EMBL-EBI API is used for the related analyses.

  • samples: No modification is needed. The path of the sample information file; the default is sample.csv in the Triti-Map running directory.

  • datatype: Important. The sequencing type of the sample data: dna for ChIP-seq or WGS, rna for RNA-seq.

  • maxthreads: The maximum number of threads that can be used. Default value: 30.

  • ref: Important. Paths to the reference genome files; contains 3 sub-parameters.

    • genome: path to the reference genome file; use an absolute path.
    • annotation: path to the reference genome gene annotation file; use an absolute path.
    • STARdir: STAR reference genome index directory (required for RNA-seq analysis).
  • gatk: GATK-related analysis parameters, including one sub-parameter.

    • min_SNP_DP: No modification is needed. The minimum depth that a valid SNP must reach in each pool. Default is 10.
  • snpindex: Parameters required for BSA localization analysis, including 6 sub-parameters.

    • pop_struc: Important. The population structure of the pool samples. If the pool data come from an F2 generation, fill in F2; if the pool data come from a RIL population or all individuals are homozygous, fill in RIL.
    • bulksize: The number of samples in the pool, e.g., if each pool consists of 30 samples, fill in 30.
    • winsize: No modification is needed. The length of the sliding window used for data correction. Default is 1000000 (1 Mb).
    • filter_probs: Important. The proportion of the original results used as the filtering threshold for Delta SNPindex and SNPcount/Mb. A value of 0.75 means that both the average Delta SNPindex and the average number of SNPs per 1 Mb of a candidate interval must be greater than 75% of the corresponding values across all original results.
    • fisher_p: No modification is needed. The p-value threshold used to filter SNP loci in the trait association interval; the p-value of each locus is calculated with Fisher's exact test. Default is 0.0001.
    • min_length: No modification is needed. The minimum length of a candidate trait association interval. For large-genome species such as wheat, the default length is 1000000 (1 Mb).
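One way to read filter_probs is as a quantile cut-off applied to both metrics at once. The sketch below is only an illustration of that reading, not Triti-Map's actual code: the field names `delta_snpindex` and `snp_per_mb`, the helper functions, and the sample values are all hypothetical, and the linear-interpolation quantile is an assumption.

```python
def quantile(values, prob):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = prob * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def filter_intervals(intervals, filter_probs=0.75):
    """Keep intervals whose two metrics both exceed their quantile cut-offs."""
    delta_cut = quantile([i["delta_snpindex"] for i in intervals], filter_probs)
    density_cut = quantile([i["snp_per_mb"] for i in intervals], filter_probs)
    return [
        i for i in intervals
        if i["delta_snpindex"] > delta_cut and i["snp_per_mb"] > density_cut
    ]


# Made-up raw intervals, for illustration only.
raw = [
    {"name": "w1", "delta_snpindex": 0.10, "snp_per_mb": 20},
    {"name": "w2", "delta_snpindex": 0.20, "snp_per_mb": 40},
    {"name": "w3", "delta_snpindex": 0.30, "snp_per_mb": 60},
    {"name": "w4", "delta_snpindex": 0.40, "snp_per_mb": 80},
    {"name": "w5", "delta_snpindex": 0.90, "snp_per_mb": 100},
]
kept = filter_intervals(raw, 0.75)
print([i["name"] for i in kept])
```

Under this reading, raising filter_probs makes the filter stricter, because fewer intervals exceed the higher cut-off on both metrics.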
  • bulk_specific: Important. How bulk-specific scaffolds (sequences) are defined.

    • identical_percentage: BLAST percentage of identical matches. Default is 0.85, i.e. the percentage of identical matches must be < 85%.
    • length_percentage: BLAST match length / query length. Default is 0.85, i.e. match length / query length must be < 85%.
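The two thresholds can be illustrated on a single BLAST hit. This is a minimal sketch with a hypothetical helper, `check_hit`; it assumes the identity is reported on a 0-100 scale, as in BLAST tabular output, while the configuration stores a fraction. How Triti-Map combines the two conditions is not stated on this page, so the helper returns them separately.

```python
def check_hit(pident, match_length, query_length,
              identical_percentage=0.85, length_percentage=0.85):
    """Return (identity_below, length_below) for one BLAST hit.

    pident is assumed to be on a 0-100 scale; the thresholds are fractions.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    identity_below = (pident / 100.0) < identical_percentage
    length_below = (match_length / query_length) < length_percentage
    return identity_below, length_below


# 80% identity over 90% of the query: only the identity condition is met.
print(check_hit(pident=80.0, match_length=900, query_length=1000))
# 99% identity over 50% of the query: only the length condition is met.
print(check_hit(pident=99.0, match_length=500, query_length=1000))
```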
  • merge_lib: How to handle multiple sets of ChIP-seq data from the same pool. The default is merge: the samples are merged first and then assembled, which gives better results. If you are working with large genomic data such as hexaploid wheat and your server has less than 300 GB of memory, you can change it to split: each set of data is assembled separately and the assemblies are then merged for subsequent analysis.

  • memory: The maximum memory available to SPAdes when assembling transcriptome sequences; 300 means 300 GB.

  • denovo_filter_method: Important. The pool-specific sequence filtering method; it applies when running the only_assembly module. Set it to external_region to define custom filter intervals in the region.csv file; set it to external_fasta to use your own prepared FASTA sequences as the filter database (please refer to the FAQ).

  • filter_region_file: If you use external_region, you need to fill in the location of the filter interval file here, the default is region.csv.

  • filter_fasta_file: If you use external_fasta, fill in the location of the FASTA file here, e.g. /your/path/region.fasta.

  • blast_database: No modification is needed. The database used for BLAST of the assembled sequences. em_cds_pln uses the EBI ENA plant coding sequence database; em_std_pln uses the EBI ENA plant standard sequence database. Default value: em_cds_pln.
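Before launching a run, it can help to sanity-check the parameters marked Important. The following is a minimal sketch, not part of Triti-Map: `check_config` is a hypothetical helper, it only knows the values listed on this page, and it assumes the sub-parameters are nested under their parent keys (for example ref → genome) once the file has been loaded into a Python dict.

```python
import os

# Allowed values as listed on this page.
ALLOWED = {
    "datatype": {"dna", "rna"},
    "merge_lib": {"merge", "split"},
    "blast_database": {"em_cds_pln", "em_std_pln"},
}


def check_config(cfg):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    if "@" not in str(cfg.get("email", "")):
        problems.append("email: provide a valid address")
    for key, allowed in ALLOWED.items():
        if cfg.get(key) not in allowed:
            problems.append(f"{key}: expected one of {sorted(allowed)}")
    ref = cfg.get("ref") or {}
    for key in ("genome", "annotation"):
        if not os.path.isabs(str(ref.get(key, ""))):
            problems.append(f"ref.{key}: use an absolute path")
    if cfg.get("datatype") == "rna" and not ref.get("STARdir"):
        problems.append("ref.STARdir: required for RNA-seq analysis")
    snpindex = cfg.get("snpindex") or {}
    if snpindex.get("pop_struc") not in {"F2", "RIL"}:
        problems.append("snpindex.pop_struc: expected F2 or RIL")
    return problems


# Made-up configuration with two deliberate mistakes.
example = {
    "email": "user@example.org",
    "datatype": "rna",
    "ref": {"genome": "/data/ref/genome.fasta",
            "annotation": "ref/genes.gff3",
            "STARdir": ""},
    "snpindex": {"pop_struc": "F2"},
    "merge_lib": "merge",
    "blast_database": "em_cds_pln",
}
for problem in check_config(example):
    print(problem)
```

The dict could be obtained from config.yaml with a YAML parser such as PyYAML's `yaml.safe_load`; the example uses a literal dict so that it runs without extra dependencies.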