Guide library format

Victoria Offord edited this page Mar 29, 2022 · 6 revisions

The guide library format is designed to handle single- or multi-fragment guide data.

At present pycroquet only accepts single and dual guide libraries in this format.

Minimal example (the file is tab-separated):

##library-type: single
#id	sgrna_ids	sgrna_seqs	gene_pair_id
0	a	ACGT	A~B
1	x	ACGT	X~Y
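As a rough illustration of how a file like this could be read, the sketch below skips the `##` metadata lines, strips the leading `#` from the column header, and returns one dict per row. It is not pycroquet's own parser; the function name `read_guide_library` is hypothetical.

```python
import csv
import io


def read_guide_library(text):
    """Return the rows of a guide library as dicts keyed by column name.

    Hypothetical helper for illustration only; not part of pycroquet.
    """
    # Drop metadata headers (##...) and blank lines; keep everything else.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("##")]
    # The first remaining line is the column header, prefixed with a single '#'.
    header = lines[0].lstrip("#").split("\t")
    reader = csv.reader(io.StringIO("\n".join(lines[1:])), delimiter="\t")
    return [dict(zip(header, row)) for row in reader]


library_text = (
    "##library-type: single\n"
    "#id\tsgrna_ids\tsgrna_seqs\tgene_pair_id\n"
    "0\ta\tACGT\tA~B\n"
    "1\tx\tACGT\tX~Y\n"
)
rows = read_guide_library(library_text)
```

All values are kept as strings here; converting `id` or coordinates to integers is left to the caller.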

Metadata Headers

All metadata headers except library-type are optional, but recommended.

Metadata headers begin with ## and should precede the column header, e.g.:

##library-type: single
##library-name: my first library
##species: human
##assembly: GRCh38
##gene-build-source: ensembl
##gene-build-version: 103

You can define any additional headers you like here, although the above are recommended.

The only exception is library-type, which is used to validate the columns that can use the | separator (see below). Values for library-type are currently:

  • single
  • dual
  • other
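A minimal sketch of reading these headers into a dictionary, assuming each one has the form `##key: value` as in the example above. The names `parse_metadata` and `ALLOWED_LIBRARY_TYPES` are hypothetical, and pycroquet's actual validation may differ.

```python
ALLOWED_LIBRARY_TYPES = {"single", "dual", "other"}


def parse_metadata(lines):
    """Collect '##key: value' headers that precede the column header.

    Hypothetical helper for illustration only; not part of pycroquet.
    """
    meta = {}
    for line in lines:
        if not line.startswith("##"):
            break  # the first non-metadata line ends the header block
        key, _, value = line[2:].partition(":")
        meta[key.strip()] = value.strip()
    # library-type is the only required header.
    if meta.get("library-type") not in ALLOWED_LIBRARY_TYPES:
        raise ValueError("library-type must be one of: single, dual, other")
    return meta


meta = parse_metadata([
    "##library-type: single",
    "##library-name: my first library",
    "##gene-build-version: 103",
    "#id\tsgrna_ids\tsgrna_seqs\tgene_pair_id",
])
```

Unknown keys are stored as-is, which matches the statement that any additional header may be defined.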

Metadata header for dual-guide

To allow R1/R2 to be in a different order from the sgrna_seqs in the library, a dual-guide-specific header has been defined. To swap the order, include:

##dual-orientation: R2_R1

Other values have no impact.
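The effect of this header can be pictured with the sketch below. How pycroquet applies the swap internally is not described here, so the function `order_reads` and its arguments are assumptions made purely for illustration.

```python
def order_reads(meta, r1, r2):
    """Return the read pair in the order of sgrna_seqs.

    Hypothetical helper: 'meta' is a dict of metadata headers,
    'r1' and 'r2' are the two read sequences of a pair.
    """
    # Only the exact value R2_R1 swaps the pair; anything else has no impact.
    if meta.get("dual-orientation") == "R2_R1":
        return r2, r1
    return r1, r2


swapped = order_reads({"dual-orientation": "R2_R1"}, "AAAA", "CCCC")
unchanged = order_reads({"dual-orientation": "something-else"}, "AAAA", "CCCC")
```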

Column header

There are 4 required fields and 11 optional columns.

Required fields

id (unique_id)

A unique identifier for the vector; it must be different for each vector in the library. This is the id that will be used when outputting the counts.

sgrna_ids

This is the set of identifiers for the guides that are used in the vector. For dual CRISPR-Cas9 libraries there are two guides, and for combination screens there will be more than two. The guide ids are joined using the separator |. For dual CRISPR-Cas9 knockout screens the order of the guides is <left_guide_id>|<right_guide_id>.

In single-guide libraries no | separator is expected.

sgrna_seqs

These are the sequences of the guides in the vector, joined using the separator character |. The order of the sequences is the same as the order of the guide ids in the sgrna_ids field. This is necessary for the mapping. For dual CRISPR-Cas9 knockout screens the first guide is expected to map in the forward direction and the second guide in the reverse direction. Sequences are always provided in 5'-3' orientation; see sgrna_strands for overriding the expected mapping orientation.

gene_pair_id

This is an id that represents the pair of regions (genes, non-targeting and intergenic regions) that are targeted by the vector. This can be a numerical ID.

This should still be completed for single-guide libraries.
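The relationship between library-type and the | separator in the required fields can be sketched as follows. This is an illustration under stated assumptions, not pycroquet's validation code: `split_guides` and `EXPECTED_GUIDES` are hypothetical names, the example ids and sequences are made up, and no guide count is enforced for the `other` type because no rule is given for it.

```python
# Guide counts implied by the text: one guide for single, two for dual.
EXPECTED_GUIDES = {"single": 1, "dual": 2}


def split_guides(row, library_type):
    """Split sgrna_ids and sgrna_seqs on '|' and pair them up in order.

    Hypothetical helper for illustration only; not part of pycroquet.
    """
    ids = row["sgrna_ids"].split("|")
    seqs = row["sgrna_seqs"].split("|")
    # Sequences must follow the same order, and so the same count, as the ids.
    if len(ids) != len(seqs):
        raise ValueError(f"row {row['id']}: sgrna_ids and sgrna_seqs differ in length")
    expected = EXPECTED_GUIDES.get(library_type)  # None for 'other' (assumption)
    if expected is not None and len(ids) != expected:
        raise ValueError(f"row {row['id']}: expected {expected} guide(s), found {len(ids)}")
    return list(zip(ids, seqs))


# Made-up dual-guide row: <left_guide_id>|<right_guide_id>
dual_row = {"id": "0", "sgrna_ids": "g1|g2", "sgrna_seqs": "ACGT|TTGA", "gene_pair_id": "A~B"}
pairs = split_guides(dual_row, "dual")
```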

Optional columns

Items with a separator are expected to follow the ordering defined in sgrna_ids above.

  • sgrna_strands
    • Can be used to override expected mapping orientation of sgrna_seqs
    • separator: '|'
  • sgrna_symbols
    • separator: '|'
  • sgrna_chrs
    • separator: '|'
  • sgrna_starts
    • separator: '|'
  • sgrna_ends
    • separator: '|'
  • sgrna_confidences
    • separator: '|'
  • sgrna_off_targets
    • separator: '|'
  • sgrna_libraries
    • separator: '|'
  • scaffold
  • target_type
  • custom_annotation
    • Must not contain tabs, as the file is tab-separated.
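The ordering rule for the separated optional columns can be checked along the lines of the sketch below. It assumes a row is available as a dict of strings; `split_optional` and `SEPARATED_OPTIONAL` are hypothetical names, and the example values are made up.

```python
# Optional columns listed above as using the '|' separator.
SEPARATED_OPTIONAL = (
    "sgrna_strands", "sgrna_symbols", "sgrna_chrs", "sgrna_starts",
    "sgrna_ends", "sgrna_confidences", "sgrna_off_targets", "sgrna_libraries",
)


def split_optional(row):
    """Split each separated optional column present in 'row'.

    Hypothetical helper for illustration only; not part of pycroquet.
    Each column must hold one value per guide, in sgrna_ids order.
    """
    n_guides = len(row["sgrna_ids"].split("|"))
    result = {}
    for col in SEPARATED_OPTIONAL:
        if col not in row:
            continue  # optional columns may be absent
        values = row[col].split("|")
        if len(values) != n_guides:
            raise ValueError(f"{col}: expected {n_guides} value(s), found {len(values)}")
        result[col] = values
    return result


# Made-up example row with two of the optional columns filled in.
row = {
    "id": "0",
    "sgrna_ids": "g1|g2",
    "sgrna_symbols": "GENE_A|GENE_B",
    "sgrna_chrs": "1|2",
}
optional = split_optional(row)
```

Columns without a separator (scaffold, target_type, custom_annotation) are left untouched by this sketch.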