Output Format

Pat edited this page Aug 25, 2023 · 7 revisions

Bystro's default output format is a tab-separated text file.

It has 17 common fields (chrom, pos, type, discordant, alt, trTv, heterozygotes, heterozygosity, homozygotes, homozygosity, missingGenos, missingness, ac, an, sampleMaf, vcfPos, id), and an unlimited number of "track" fields, which are taken directly from the genome build YAML configuration file (e.g. "hg19.yml") that defines those tracks. Tracks are best thought of as database sources, which contain either a single value for a given position in the genome ("score" tracks, which output either a scalar or a vector), or dictionaries of such values. We'll get into a longer explanation of tracks and how they map to output values later, but for now let's use an example, so that we can focus the discussion on how to parse these data.
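As a rough sketch of how the common fields can be told apart from track fields, the snippet below splits a header line on tabs and checks each column name against the 17 names listed above. It assumes the first line of the file is a tab-separated header; the example header and the track column name "refSeq.siteType" are made up for illustration, and split_header is a hypothetical helper, not part of Bystro.

```python
# The 17 common fields listed above.
COMMON_FIELDS = [
    "chrom", "pos", "type", "discordant", "alt", "trTv",
    "heterozygotes", "heterozygosity", "homozygotes", "homozygosity",
    "missingGenos", "missingness", "ac", "an", "sampleMaf", "vcfPos", "id",
]


def split_header(header_line: str) -> tuple[list[str], list[str]]:
    """Return (common, track) column names from a tab-separated header line."""
    columns = header_line.rstrip("\n").split("\t")
    common = [c for c in columns if c in COMMON_FIELDS]
    track = [c for c in columns if c not in COMMON_FIELDS]
    return common, track


# Illustrative header only: the track column name is an assumption.
example_header = "\t".join(COMMON_FIELDS + ["refSeq.siteType"]) + "\n"
common, track = split_header(example_header)
print(len(common), track)
```

Anything not in the common list is treated as a track field here, which matches the description above but says nothing about how a real configuration names its tracks.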

Parsing strategy

Bystro has a few value delimiters, which can be parsed in a consistent manner to go from output tsv to a dictionary or other favorite data structure.

Bystro's guiding principle in output is that it is lossless; any relationship stored in the annotation database can be recovered entirely by parsing the tsv.

";" - the vector value delimiter

The value delimiter signifies a vector value.

For instance, in the image below we have 2 samples that are heterozygous at a mutation site, and we denote that there are 2 values with ";":

Screenshot 2023-08-23 at 8 21 19 AM
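In code, recovering such a vector is a single split on ";". The sketch below uses made-up sample names and a made-up scalar value; only the delimiter itself comes from the description above.

```python
VALUE_DELIMITER = ";"

# Hypothetical contents of the "heterozygotes" field for one row.
heterozygotes_field = "sample1;sample2"
heterozygotes = heterozygotes_field.split(VALUE_DELIMITER)
print(heterozygotes)  # ['sample1', 'sample2']

# A scalar field contains no ";" and splits into a one-element list.
scalar = "SNP".split(VALUE_DELIMITER)
print(scalar)  # ['SNP']
```

Splitting every field the same way, whether or not it contains a ";", keeps downstream code uniform: a scalar is just a vector of length 1.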

"|" - the insertion/deletion delimiter

The pipe ("|") delimiter is used only to separate insertion and deletion values. For insertions, which introduce new sequence between two reference nucleotides, Bystro outputs 2 values for every track, separated by a "|": one for the disrupted reference base, and one for the next nucleotide (which is also disrupted by the insertion).

For example, in the image below the refSeq track has a Bystro-generated "siteType" field, which reports the functional role of this position in the genome, e.g. "intronic" for an intron, "exonic" for an exon, etc. Here both bases disrupted by the insertion are intronic:

Screenshot 2023-08-23 at 8 41 03 AM

For deletions, Bystro provides values for every deleted position, up to 32 bases.
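A minimal sketch of reading these per-position values follows. The field contents are invented for illustration: an insertion yields 2 values, and a hypothetical 3-base deletion yields one value per deleted position.

```python
INDEL_DELIMITER = "|"

# Hypothetical siteType value for an insertion: one value per disrupted base.
insertion_field = "intronic|intronic"
per_base = insertion_field.split(INDEL_DELIMITER)
print(per_base)  # ['intronic', 'intronic']

# Hypothetical siteType value for a 3-base deletion: one value per deleted position.
deletion_field = "exonic|exonic|intronic"
deleted = deletion_field.split(INDEL_DELIMITER)
print(len(deleted))  # 3
```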

ASCII 31 (unit separator) - the matrix value delimiter

In some cases, it is not possible to represent an entry as a scalar (single value, no ";" delimiter) or as a vector (with ";" delimiter). For instance, imagine that your favorite track had the matrix value [ [0, 1], [1, 0] ]. Bystro would store that as "0{us}1;1{us}0", where {us} denotes the unit separator. In Python you would use chr(31) to get the control character, and parse this value as:

In [6]: output_string = "0" + chr(31) + "1;1" + chr(31) + "0"

In [7]: [row.split(chr(31)) for row in output_string.split(";")]
Out[7]: [['0', '1'], ['1', '0']]

This is an advanced case: most tracks do not have such values. Because bioinformatics data often include many displayable delimiters (e.g. ",", "/", ""), we chose one that no bioinformatics databases use. This avoids conflating Bystro delimiters with the original source's text values, which do not always have a 1:1 mapping with Bystro delimiters.

These unit separator values are either not displayed (e.g. in Excel), or are displayed in an implementation-specific way: for instance, the Mac terminal shows the above example as 0^_1;1^_0.
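Since the unit separator is invisible in many tools, it can help to make it visible when inspecting a value. The sketch below shows two ways to do that in Python; the "{us}" placeholder is just the notation used above, not something Bystro writes.

```python
US = chr(31)  # ASCII 31, the unit separator

value = f"0{US}1;1{US}0"

# repr() shows the control character as an escape sequence.
print(repr(value))  # '0\x1f1;1\x1f0'

# Substituting a visible placeholder is handy for logs or debugging output.
visible = value.replace(US, "{us}")
print(visible)  # 0{us}1;1{us}0
```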

Parsing

Parsing a value in Bystro is quite simple. A row is first split into fields on tabs; within a field we can have at most 3 value delimiters, and they are always nested in a consistent order:

FIELD_DELIMITER = "\t"
INDEL_DELIMITER = "|"
POSITION_DELIMITER = ";"
OVERLAP_DELIMITER = chr(31)

DELIMITER_SEQUENCE = [
    FIELD_DELIMITER,
    INDEL_DELIMITER,
    POSITION_DELIMITER,
    OVERLAP_DELIMITER,
]


def parse_row(row: str, delimiters: list[str] = DELIMITER_SEQUENCE) -> list[list[list[list[str]]]]:
    """Recursively split a row on each delimiter in turn, outermost first."""
    if not delimiters:
        # No delimiters left: this chunk is a leaf value.
        return row
    delimiter, remaining_delimiters = delimiters[0], delimiters[1:]
    return [parse_row(chunk, remaining_delimiters) for chunk in row.split(delimiter)]
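To see what the nested result looks like, here is a sketch that applies the same recursive splitting to a made-up three-field row. The field values are illustrative, not real Bystro output, and the function is repeated so that the snippet runs on its own.

```python
DELIMITER_SEQUENCE = ["\t", "|", ";", chr(31)]


def parse_row(row, delimiters=DELIMITER_SEQUENCE):
    # Same recursive split as above: outermost delimiter first.
    if not delimiters:
        return row
    delimiter, remaining = delimiters[0], delimiters[1:]
    return [parse_row(chunk, remaining) for chunk in row.split(delimiter)]


# Hypothetical row: a scalar, an insertion value, and a matrix value.
row = "\t".join(["chr1", "intronic|intronic", "0" + chr(31) + "1;1" + chr(31) + "0"])
parsed = parse_row(row)

# Indexing is parsed[field][indel position][vector element][matrix element].
print(parsed[0])  # [[['chr1']]]
print(parsed[1])  # [[['intronic']], [['intronic']]]
print(parsed[2])  # [[['0', '1'], ['1', '0']]]
```

Every field comes back with the same depth of nesting, even a plain scalar such as "chr1", so callers can index uniformly and unwrap single-element lists only where they know a value is scalar.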