CAMSA works with assemblies that are represented as sets of individual assembly points (links between (oriented) scaffolds). This format allows a very broad range of scaffold assemblies, obtained with multiple techniques, to be processed within a single framework. In the text below, the terms scaffold and fragment are used interchangeably.
CAMSA looks at the order and orientation of scaffolds along chromosomes, so all of the input scaffold assemblies are expected to be built on the same set of input scaffolds (some scaffolds might be missing from some of the assemblies). Each genomic region is expected to be represented uniquely by a scaffold or a gap, and all of the scaffolders are expected to have worked by ordering and orienting scaffolds.
As mentioned previously, CAMSA expects each assembly to be represented as a set of individual assembly points between pairs of the scaffolds that all of the input scaffold assemblies are built from.
The standard CAMSA input file is a tab-separated text file with a header followed by a list of assembly points (one per line). An example is shown below:
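To illustrate how an ordered chain of oriented scaffolds breaks down into individual assembly points, here is a minimal Python sketch. The function name `chain_to_assembly_points`, the assembly id argument, and the tuple layout are assumptions made for illustration; they are not part of CAMSA's own code.

```python
def chain_to_assembly_points(assembly_id, chain):
    """Split an ordered chain of (scaffold_id, orientation) pairs into
    assembly points, one for each pair of adjacent scaffolds.

    Hypothetical helper for illustration only, not a CAMSA function.
    """
    points = []
    # zip(chain, chain[1:]) walks over every pair of neighbouring scaffolds.
    for (seq1, seq1_or), (seq2, seq2_or) in zip(chain, chain[1:]):
        points.append((assembly_id, seq1, seq1_or, seq2, seq2_or))
    return points


# A chain of three scaffolds yields two assembly points.
chain = [("s1", "+"), ("s2", "-"), ("s3", "+")]
points = chain_to_assembly_points("A1", chain)
for point in points:
    print("\t".join(point))
```

A chain of n scaffolds produces n - 1 assembly points, and a scaffold that is not adjacent to any other scaffold in an assembly produces none.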
origin	seq1	seq1_or	seq2	seq2_or
A1	s1	+	s2	-
A1	s2	-	s3	+
...
origin: id of the assembly that produced the corresponding assembly point.
seq1: id of the first scaffold that participates in the assembly point.
seq1_or: relative orientation (+/-/?) of the first scaffold in the assembly point.
seq2: id of the second scaffold that participates in the assembly point.
seq2_or: relative orientation (+/-/?) of the second scaffold in the assembly point.
gap_size: integer value (>=0/?) giving the gap size between the two assembled scaffolds.
cw: confidence weight ([0, 1]/?) of the reported assembly point. By default, cw=1 for oriented assembly points, while cw=0.75 for realizations of semi-oriented and non-oriented assembly points. These defaults can be overridden in CAMSA; please refer to the usage wiki page for how to do so, and more.
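The default confidence weights described above can be sketched in Python as follows. This is an illustration under stated assumptions, not CAMSA's implementation: the helper names `default_cw` and `resolve_cw` are hypothetical, "semi-oriented" is taken to mean that exactly one orientation is `?`, "non-oriented" that both are, and a `cw` value of `?` is assumed to fall back to the default.

```python
VALID_ORIENTATIONS = ("+", "-", "?")


def default_cw(seq1_or, seq2_or):
    """Return 1.0 for a fully oriented assembly point and 0.75 when at
    least one orientation is unknown ('?').

    Hypothetical helper; the values mirror the defaults quoted in the text.
    """
    for orientation in (seq1_or, seq2_or):
        if orientation not in VALID_ORIENTATIONS:
            raise ValueError(f"unexpected orientation: {orientation!r}")
    return 0.75 if "?" in (seq1_or, seq2_or) else 1.0


def resolve_cw(raw_cw, seq1_or, seq2_or):
    """Turn the textual cw field into a float in [0, 1].

    Assumption: '?' means "not specified", so the default applies.
    """
    if raw_cw == "?":
        return default_cw(seq1_or, seq2_or)
    cw = float(raw_cw)
    if not 0.0 <= cw <= 1.0:
        raise ValueError(f"cw must lie in [0, 1], got {cw}")
    return cw
```

For example, `resolve_cw("?", "+", "?")` falls back to the semi-oriented default, while `resolve_cw("0.5", "+", "-")` keeps the explicitly supplied weight.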
The order of fields is determined by the header, so in principle there are no restrictions on how you organize the input file, but we recommend sticking with the order shown above for the main fields. There are also no restrictions on additional optional fields in the input files, as they are simply ignored.
CAMSA input format is described on the input wiki page. This format is not common among conventional scaffolders, so some data preparation may be needed before CAMSA can process their output. We include multiple built-in conversion scripts that automate the translation of the more common scaffold assembly formats (FASTA, AGPv2.0, GRIMM, etc.) into the CAMSA one.
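Because the header determines the field order and unknown fields are ignored, a reader for this format can be keyed on column names rather than positions. The sketch below uses Python's standard `csv.DictReader`; the function `read_assembly_points`, the `MAIN_FIELDS` tuple, and the extra `note` column in the sample are assumptions for illustration and do not reflect how CAMSA itself parses its input.

```python
import csv
import io

# Assumption: these five columns are the "main fields" from the example header.
MAIN_FIELDS = ("origin", "seq1", "seq1_or", "seq2", "seq2_or")


def read_assembly_points(handle):
    """Yield one dict per assembly point, keyed by the main field names.

    Column order comes from the header line; extra columns are dropped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in MAIN_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing header fields: {', '.join(missing)}")
    for row in reader:
        yield {name: row[name] for name in MAIN_FIELDS}


# Columns deliberately shuffled, plus one unknown column that gets ignored.
sample = io.StringIO(
    "seq1\tseq1_or\torigin\tseq2\tseq2_or\tnote\n"
    "s1\t+\tA1\ts2\t-\textra\n"
    "s2\t-\tA1\ts3\t+\textra\n"
)
records = list(read_assembly_points(sample))
for record in records:
    print(record)
```

Keying on names keeps the reader indifferent to column order, which matches the behaviour described in the paragraph above.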
The overall usage description for these conversion utils is as follows:
Here xxx stands for the format that the input scaffold assembly is in. For details on each conversion script and on all supported formats, please refer to the utils wiki page.