This repository has been archived by the owner. It is now read-only.

Selecting Arbitrary Nth Column in Syntax #17

Closed
ababaian opened this Issue Oct 26, 2017 · 9 comments

Comments

6 participants
@ababaian
Owner

ababaian commented Oct 26, 2017

While working on the fairly trivial FASTA index format (faidx), I found a neat way to select columns by the order in which they appear, probably because the format is so simple. The only requirement right now is that the file is tab-delimited.

It matches the first column up to the first tab, scopes it, then pushes the contig.length context.

In contig.length every non-whitespace character is selected and scoped (this is the second column). When it hits the next tab, it pops back out.

The third column is then selected and scoped, and genomic.offset is pushed. The fourth column is selected there and popped at the tab.

This push-pop back and forth on tabs can be repeated for any number of columns, which means that .bed, .bedpe, .gtf, .sam, and possibly some of .vcf can now be 'solved', since we know what type of data is supposed to be in the Nth column.

Can anyone think of a reason that this won't work or will break at some edge-case?

If not, we'll need to rework those syntaxes, as I think this is a more robust approach than trying to select each column by the range of data that could be there.

faidx.sublime-syntax

%YAML 1.2
---
name: faidx
file_extensions: [fa.fai,fasta.fai]
scope: source.faidx

contexts:
  main:
    # COLUMN 1
    - match: '^[\S]*\t'
      scope: coord.Chr.faidx
      push: contig.length

    # COLUMN 3
    - match: '(?<=\t)[\S]*\t'
      scope: constant.numeric.faidx
      push: genomic.offset

    # COLUMN 5
    - match: '[\S]*$'
      scope: comment.line.faidx

  contig.length:
    # COLUMN 2
    - match: '[\S]*'
      scope: coord.Start.faidx
    - match: \t
      pop: true

  genomic.offset:
    # COLUMN 4
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      pop: true
@echu113

Collaborator

echu113 commented Oct 26, 2017

This is really cool.

I think it looks pretty robust as it is. Would it work for files with more than 5 columns, though? We might need to figure out how to encode the 5th column when there is a 6th. Maybe something along the lines of '(?<=\t[\S]*\t)[\S]*\t'?

@ababaian

Owner

ababaian commented Oct 26, 2017

Here is an even simpler version, with an open-ended scope for all columns beyond the 5th.

Robust Nth Column Selection

%YAML 1.2
---
name: faidx
file_extensions: [fa.fai,fasta.fai]
scope: source.faidx

# Fasta Index Filetype Description
# NAME  Name of this reference sequence
# LENGTH  Total length of this reference sequence, in bases
# OFFSET  Offset within the FASTA file of this sequence's first base
# LINEBASES The number of bases on each line
# LINEWIDTH The number of bytes in each line, including the newline

contexts:
  main:
    # COLUMN 1
    - match: '^[\S]*\t'
      scope: coord.Chr.faidx
      push: col2

  col2:
    # COLUMN 2
    - match: '[\S]*'
      scope: coord.Start.faidx
    - match: \t
      push: col3
    - match: $
      pop: true

  col3:
    # COLUMN 3
    - match: '[\S]*'
      scope: constant.numeric.faidx
    - match: \t
      push: col4
    - match: $
      pop: true

  col4:
    # COLUMN 4
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      push: col5
    - match: $
      pop: true

  col5:
    # COLUMN 5
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      push: colast
    - match: $
      pop: true

  colast:
    # Any COLUMN >5
    - match: .*
      scope: comment.line.faidx
      pop: true
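
One possible variation on the file above, offered as a sketch rather than as the syntax that was actually committed: using `set` instead of `push` at each tab replaces the current column context rather than stacking a new one on top of it, so the end of a line needs only a single pop to get back to main. The `rest` context name and the `[^\t\n]` patterns are assumptions made for this example; the scope names are copied from the file above.

```yaml
%YAML 1.2
---
# Sketch only: a `set`-based variant of the faidx syntax above.
name: faidx
file_extensions: [fa.fai, fasta.fai]
scope: source.faidx

contexts:
  main:
    # COLUMN 1 (NAME), including its trailing tab
    - match: '^[^\t\n]*\t'
      scope: coord.Chr.faidx
      push: col2

  col2:
    # COLUMN 2 (LENGTH)
    - match: '[^\t\n]+'
      scope: coord.Start.faidx
    - match: '\t'
      set: col3          # swap col2 for col3 instead of stacking
    - match: '$'
      pop: true          # short line: return to main

  col3:
    # COLUMN 3 (OFFSET)
    - match: '[^\t\n]+'
      scope: constant.numeric.faidx
    - match: '\t'
      set: col4
    - match: '$'
      pop: true

  col4:
    # COLUMN 4 (LINEBASES)
    - match: '[^\t\n]+'
      scope: comment.line.faidx
    - match: '\t'
      set: col5
    - match: '$'
      pop: true

  col5:
    # COLUMN 5 (LINEWIDTH)
    - match: '[^\t\n]+'
      scope: comment.line.faidx
    - match: '\t'
      set: rest
    - match: '$'
      pop: true

  rest:
    # Any COLUMN >5 (hypothetical context name)
    - match: '.*'
      scope: comment.line.faidx
      pop: true
```

With this layout there is never more than one column context above main, which may make it easier to reason about what happens on lines with fewer columns than expected.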
@echu113

Collaborator

echu113 commented Oct 26, 2017

brilliant!

@ababaian

Owner

ababaian commented Oct 26, 2017

I think this same logic could be applied for gedit and Vim syntax as well. There is a Match Start // Match End logic which can be extended in this way. I would say if we figure this out soon we'll simplify our lives greatly.

It may be worth reading some syntax highlighting files for other complex languages (C, XML, etc.) to learn how other people solved similar problems.

@Ebedthan

Collaborator

Ebedthan commented Oct 31, 2017

Can we get a screenshot of what it looks like, @ababaian?

@ababaian

Owner

ababaian commented Oct 31, 2017

[Screenshot: faidx]

@Ebedthan

Collaborator

Ebedthan commented Oct 31, 2017

Could you please share the colors you used for this color scheme?

@ababaian

Owner

ababaian commented Oct 31, 2017

I'd say let's not worry too much about the color schemes just yet. This was based on bioMonokai for Sublime, which has a dark background. Gedit's is based on Kate and has a light background, so it might not work. The third column is simply the default 'numeric' color; the fourth and fifth are comment-colored.

We're going to have to formalize all the colors and/or settle on one dark and one light theme that look the same across all the different programs. We can worry about this last; right now the highest priority is getting the syntax files to work reliably in all the different software.

@ababaian ababaian closed this Nov 16, 2017

@ababaian ababaian reopened this Nov 20, 2017

@ababaian

Owner

ababaian commented Nov 20, 2017

Also faidx-gedit syntax

Check out Fasta Index Language File for an example of the logic. It's the same thing as in Sublime / less, where nested contexts can be used to select by column. This should make SAM/VCF/BED/GTF files much easier to deal with.
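
As a rough illustration of how the same nested-context idea might carry over to BED, here is a minimal Sublime sketch following the col2/col3 pattern from the faidx file earlier in this thread. Everything specific in it is an assumption for the example: the context names (`chrom-start`, `chrom-end`, `rest`), the `.bed` scope names, and the choice to lump the optional columns 4 and up into a single span. It also does not try to handle `track` or `browser` header lines.

```yaml
%YAML 1.2
---
# Hypothetical sketch: context and scope names are illustrative only.
name: bed (column sketch)
file_extensions: [bed]
scope: source.bed

contexts:
  main:
    # COLUMN 1: chrom, including its trailing tab
    - match: '^[^\t\n]+\t'
      scope: coord.Chr.bed
      push: chrom-start

  chrom-start:
    # COLUMN 2: chromStart
    - match: '\d+'
      scope: coord.Start.bed
    - match: '\t'
      push: chrom-end
    - match: '$'
      pop: true

  chrom-end:
    # COLUMN 3: chromEnd
    - match: '\d+'
      scope: coord.End.bed
    - match: '\t'
      push: rest
    - match: '$'
      pop: true

  rest:
    # COLUMNS 4+: optional BED fields, left as one span in this sketch
    - match: '.*'
      scope: comment.line.bed
      pop: true
```

Each optional BED field (name, score, strand, and so on) could presumably get its own context in the same way once its scope is decided.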

[Screenshot: faidx-gedit]

@ababaian ababaian closed this Dec 5, 2017

@ababaian ababaian referenced this issue Dec 17, 2017

Open

bioSyntax TODO #2

0 of 21 tasks complete