PlantClusterFinder (PCF) detects metabolic gene clusters in a sequenced genome. It uses a gene location file provided by the user (see below), a PGDB created with Pathway Tools, and further information (see below) to identify enzyme-coding genes (metabolic genes) located together on a chromosome. Initially, only continuous stretches of metabolic genes lying directly next to each other are allowed. This condition is then relaxed by iteratively increasing the number of allowed intervening (non-metabolic) genes by one. Several criteria for selecting clusters are provided. In addition, a set of criteria can be used to prevent clusters from forming.
Details of PCF (version 1.0) can be found in PMID: 28228535.
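The iterative relaxation described above can be illustrated with a minimal Python sketch. It is not PCF's implementation: the function names, the boolean gene list, and the min_metabolic threshold are hypothetical and only show the idea of growing the allowed number of intervening non-metabolic genes one step at a time.

```python
from typing import Dict, List, Tuple


def find_clusters(is_metabolic: List[bool], max_gap: int,
                  min_metabolic: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) gene indices of candidate clusters.

    Neighbouring metabolic genes are joined when at most `max_gap`
    non-metabolic genes lie between them. `min_metabolic` is a
    hypothetical stand-in for PCF's real selection criteria.
    """
    clusters = []
    start = last = None
    count = 0
    for i, flag in enumerate(is_metabolic):
        if not flag:
            continue
        if last is not None and i - last - 1 <= max_gap:
            # Close enough to the previous metabolic gene: extend the run.
            last = i
            count += 1
        else:
            # Too far away (or first metabolic gene): close the old run.
            if last is not None and count >= min_metabolic:
                clusters.append((start, last))
            start = last = i
            count = 1
    if last is not None and count >= min_metabolic:
        clusters.append((start, last))
    return clusters


def relax_iteratively(is_metabolic: List[bool], max_allowed_gap: int,
                      min_metabolic: int = 2) -> Dict[int, List[Tuple[int, int]]]:
    """Run the search for every intervening gene size 0..max_allowed_gap."""
    return {gap: find_clusters(is_metabolic, gap, min_metabolic)
            for gap in range(max_allowed_gap + 1)}


# Gene order along one chromosome: True = metabolic, False = non-metabolic.
genes = [True, True, False, True, False, False, True,
         False, False, False, True]
by_gap = relax_iteratively(genes, max_allowed_gap=3)
for gap, found in by_gap.items():
    print(gap, found)
```

With an intervening gene size of 0 only the two adjacent metabolic genes form a cluster; each increase by one lets the cluster absorb the next metabolic gene further along the chromosome.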
The major differences between this version (1.3) and previous versions (1.0 and 1.2) are:
1) Physical breaks of the genome or sequencing gaps of unknown size are typically encoded by stretches of Ns in the genome assembly FASTA file. Previously, we inserted 20 hypothetical genes at each break. This, however, diluted the background of low-quality genomes with non-enzymes, so a cluster was more likely to be classified among the top x% of enzyme-dense regions than in a good-quality genome. In version 1.3 we identify these breaks (but no longer insert 20 hypothetical genes) and prevent clusters from forming across these gaps.
2) Any sequencing information that is missing is typically hard-masked with Ns. Previously, any intergenic region affected by at least one N was evaluated for its length, and hypothetical genes were inserted accordingly (see Schlapfer et al., PMID: 28228535). This sometimes unrealistically prevented the detection of gene clusters. It is unlikely that missing information about a single nucleotide would, if it were known, reveal multiple gene models. We therefore changed the code to insert 2 hypothetical genes only if a stretch of unknown sequence is longer than the nth percentile of gene sequence lengths (n is set to 5). We also provide the option to NOT insert any hypothetical genes at all. Instead, by default we use MaxSeqGapSize set to 100000 and MaxInterGeneDistByMedian set to 50, which results in cluster predictions similar to those of PCF version 1.0.
3) Genomes contain large gene-poor intergenic regions. This version provides several parameters to prevent clusters from spanning such regions.
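The three changes listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not PCF's code: all function names are hypothetical, the percentile uses simple linear interpolation, and the way MaxSeqGapSize and MaxInterGeneDistByMedian are applied (a limit on the length of an N stretch, and a multiple of a reference median supplied by the caller) is an assumed reading of the parameter names.

```python
import re
from typing import List, Tuple


def find_n_stretches(sequence: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) coordinates of every run of N or n."""
    return [m.span() for m in re.finditer(r"[Nn]+", sequence)]


def percentile(values: List[int], pct: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def hypothetical_genes_to_insert(n_run_length: int, gene_lengths: List[int],
                                 pct: float = 5.0,
                                 genes_per_gap: int = 2) -> int:
    """Insert genes only if the N run is longer than the pct-th percentile
    of gene lengths (the rule described in item 2)."""
    return genes_per_gap if n_run_length > percentile(gene_lengths, pct) else 0


def is_cluster_barrier(intergenic_length: int, longest_n_run: int,
                       reference_median: float,
                       max_seq_gap_size: int = 100000,
                       max_intergene_dist_by_median: float = 50) -> bool:
    """Assumed semantics: a region blocks cluster formation if it holds an
    N run longer than max_seq_gap_size, or if it is longer than
    max_intergene_dist_by_median times a reference median (which median
    PCF uses is not stated here, so the caller supplies it)."""
    too_gappy = longest_n_run > max_seq_gap_size
    too_long = intergenic_length > max_intergene_dist_by_median * reference_median
    return too_gappy or too_long


sequence = "ACGTNNNNACGTNACGNNNNNNNNNN"
stretches = find_n_stretches(sequence)
print(stretches)

gene_lengths = [100, 200, 300, 400, 500]
print(hypothetical_genes_to_insert(121, gene_lengths))
print(is_cluster_barrier(60000, 0, reference_median=1000))
```

The defaults for max_seq_gap_size and max_intergene_dist_by_median mirror the values quoted in item 2; everything else is illustrative.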
Pascal Schlapfer, December 2017
Bo Xue, December 2017