
File Definitions

Brian Haas edited this page Oct 31, 2018 · 27 revisions

There are several files that may be needed depending on the analysis. These files, as well as the files output by inferCNV, are described here.

Input Files

Raw Counts Matrix for Genes x Cells

InferCNV is compatible with both Smart-seq2 and 10x single-cell transcriptome data, and presumably with data from other methods (not tested). The counts matrix can be generated using any conventional single-cell transcriptome quantification pipeline that yields a matrix of genes (rows) vs. cells (columns) containing assigned read counts.

The format might look like this:

MGH54_P16_F12 MGH54_P12_C10 MGH54_P11_C11 MGH54_P15_D06 MGH54_P16_A03 ...
A2M 0 0 0 0 0 ...
A4GALT 0 0 0 0 0 ...
AAAS 0 37 30 21 0 ...
AACS 0 0 0 0 2 ...
AADAT 0 0 0 0 0 ...
... ... ... ... ... ... ...

The matrix can be provided as a tab-delimited file. Sparse matrices are also supported; see Running-InferCNV.
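As a rough illustration of the layout described above, here is a minimal Python sketch that parses a tab-delimited genes x cells matrix using only the standard library. It is not part of inferCNV; the function name `read_counts_matrix` is hypothetical, and the sketch assumes that the header row lists only cell names (one field fewer than each data row, as in the excerpt) and that counts are integers.

```python
import csv
import io


def read_counts_matrix(handle):
    """Parse a tab-delimited genes x cells matrix (hypothetical helper).

    Assumption: the header row holds only cell names, so it has one field
    fewer than each data row. A header that also carries a label for the
    gene column is tolerated by dropping its first field.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    if not rows:
        raise ValueError("matrix has a header but no gene rows")

    n_cells = len(rows[0]) - 1
    cells = header[1:] if len(header) == n_cells + 1 else header

    counts = {}
    for row in rows:
        gene, values = row[0], row[1:]
        if len(values) != len(cells):
            raise ValueError(f"row for {gene!r} has {len(values)} values, "
                             f"expected {len(cells)}")
        # Assumption: counts are integers; use float() for fractional counts.
        counts[gene] = [int(v) for v in values]
    return cells, counts


example = (
    "MGH54_P16_F12\tMGH54_P12_C10\tMGH54_P11_C11\n"
    "A2M\t0\t0\t0\n"
    "AAAS\t0\t37\t30\n"
)
cells, counts = read_counts_matrix(io.StringIO(example))
```

A check like this can catch a misaligned header or a ragged row before the matrix is handed to inferCNV.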

Sample annotation file

The sample annotation file defines the different cell types and, optionally, indicates how the cells should be grouped by sample (i.e., patient). The format is simply two tab-delimited columns with no column header.

MGH54_P2_C12    Microglia/Macrophage
MGH36_P6_F03    Microglia/Macrophage
MGH54_P16_F12   Oligodendrocytes (non-malignant)
MGH54_P12_C10   Oligodendrocytes (non-malignant)
MGH36_P1_B02    malignant_MGH36
MGH36_P1_H10    malignant_MGH36

The first column is the cell name, and the second column indicates the known cell type. If you have different types of known normal cells (e.g., immune cells, normal fibroblasts, etc.), you can label each with its cell type. Otherwise, you can group them all as 'normal'. If multiple 'normal' types are defined separately, the expression distribution for normal cells will be explored according to each normal cell grouping, as opposed to treating them all as a single normal group. They'll also be clustered and plotted in the heatmap according to normal cell grouping.

The sample (i.e., patient) information is encoded in the attribute name as "malignant_{patient}", which allows the tumor cells to be clustered and plotted according to sample (patient) in the heatmap.

Only those cells listed in the sample annotation file will be analyzed by inferCNV. This is useful when your cells of interest are a subset of the total counts matrix, because you do not need to create a new matrix containing only that subset.
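The annotation conventions above can be sketched in a few lines of Python. This is an illustration only, not inferCNV's own code: all function names are hypothetical, the `malignant_` prefix is taken from the example labels, and the toy matrix (including the cell name `UNLISTED_CELL`) is made up to show how cells missing from the annotation file would be left out.

```python
from collections import defaultdict
import io


def read_annotations(handle):
    """Return {cell_name: label} from a headerless, two-column, tab-delimited file."""
    annotations = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-delimited fields, got {len(fields)}: {line!r}")
        cell, label = fields
        annotations[cell] = label
    return annotations


def group_cells(annotations):
    """Collect cell names under each label, preserving file order."""
    groups = defaultdict(list)
    for cell, label in annotations.items():
        groups[label].append(cell)
    return dict(groups)


def patient_of(label, prefix="malignant_"):
    """Return the patient part of a 'malignant_{patient}' label, or None."""
    return label[len(prefix):] if label.startswith(prefix) else None


def subset_to_annotated(cells, counts, annotations):
    """Keep only the matrix columns whose cell name appears in the annotations."""
    keep = [i for i, cell in enumerate(cells) if cell in annotations]
    kept_cells = [cells[i] for i in keep]
    kept_counts = {gene: [row[i] for i in keep] for gene, row in counts.items()}
    return kept_cells, kept_counts


annotation_text = (
    "MGH54_P2_C12\tMicroglia/Macrophage\n"
    "MGH54_P16_F12\tOligodendrocytes (non-malignant)\n"
    "MGH36_P1_B02\tmalignant_MGH36\n"
)
annotations = read_annotations(io.StringIO(annotation_text))
groups = group_cells(annotations)

# Toy matrix: "UNLISTED_CELL" (hypothetical) has no annotation, so it is dropped.
cells = ["MGH54_P16_F12", "UNLISTED_CELL", "MGH36_P1_B02"]
counts = {"AAAS": [0, 37, 30], "AACS": [0, 0, 2]}
kept_cells, kept_counts = subset_to_annotated(cells, counts, annotations)
```

Listing the groups this way before a run is a quick check that every normal cell type and every `malignant_{patient}` label is spelled consistently.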

Gene ordering file

The gene ordering file provides the chromosomal location for each gene. The format is tab-delimited and has no column header, simply providing the gene name, chromosome, and gene span (start and end positions):

WASH7P  chr1    14363   29806
LINC00115       chr1    761586  762902
NOC2L   chr1    879584  894689
MIR200A chr1    1103243 1103332
SDF4    chr1    1152288 1167411
UBE2J2  chr1    1189289 1209265

Every gene in the counts matrix to be analyzed should have the corresponding gene name and location info provided in this gene ordering file.

Note, only those genes that exist in both the counts matrix and the gene ordering file will be included in the inferCNV analysis.
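To see which genes would be left out because they lack position information, the two gene lists can be compared ahead of time. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and the four columns are read as gene name, chromosome, start, and end, matching the example rows above.

```python
import io


def read_gene_order(handle):
    """Return {gene: (chromosome, start, end)} from a headerless, tab-delimited file."""
    order = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}: {line!r}")
        gene, chrom, start, end = fields
        order[gene] = (chrom, int(start), int(end))
    return order


def split_by_position_info(count_genes, gene_order):
    """Split matrix genes into those with and without an entry in the ordering file."""
    kept = [g for g in count_genes if g in gene_order]
    dropped = [g for g in count_genes if g not in gene_order]
    return kept, dropped


gene_order_text = (
    "WASH7P\tchr1\t14363\t29806\n"
    "NOC2L\tchr1\t879584\t894689\n"
    "SDF4\tchr1\t1152288\t1167411\n"
)
gene_order = read_gene_order(io.StringIO(gene_order_text))

# "A2M" appears in the matrix but not in this toy ordering file, so it is reported as dropped.
kept, dropped = split_by_position_info(["A2M", "NOC2L", "SDF4"], gene_order)
```

A long `dropped` list usually points to a naming mismatch between the two files (for example, gene symbols in one and gene IDs in the other) rather than to genuinely unplaced genes.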

Some Genomic Position Files have been generated from common references and made available at TrinityCTAT.

If you need to construct your own custom genomic positions file, see instructions for creating a genomic position file.
