6 Input file formats

Ezequiel L. Nicolazzi edited this page Jan 27, 2016 · 1 revision

##PLINK PED/MAP FILE FORMATS: Zanardi accepts one or more pairs of PLINK ped/map files, which should be listed in the INPUT_PED and INPUT_MAP variables of the parameter file. For details of the PLINK [ped/map] format, please visit the PLINK software web page: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped. Note that the [transposed/long/binary] PLINK formats are not accepted as input files in Zanardi.

##705 FILE FORMATS: Generally, this format is useful to cattle breeder associations. "Interbull 705" is a file format for the International Genomic Evaluation. It is a known and common format for most major cattle breeder associations (at least those participating in the Gene2farm project). Zanardi accepts one or more "Interbull 705" files, which should be listed in the INPUT_705 and INPUT_705_MAP variables of the parameter file. Please note that, unlike PLINK, the number of INPUT_705 and INPUT_705_MAP files does not have to be the same, since multiple SNP arrays (e.g. different densities) can be present in a single 705 file. In fact, Zanardi will create one PLINK file for each SNP array, and then merge them all into a single file. In any case, Zanardi expects a map, sorted by the SNP index, for each of the arrays present in the INPUT_705 file(s). Map sorting is extremely important, as Zanardi will assume that the markers are in the order shown in the map (just as it does for PLINK map files).

NOTE: all samples whose number of SNPs does not match any of the INPUT_705_MAP files provided will be excluded from the analyses (a big warning message will be shown, and a file listing the excluded samples will be created).
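The matching rule in the note above can be sketched in a few lines of Python. This is only an illustration of the idea, not Zanardi's own code: the function name, the map file names and the sample IDs are hypothetical, and the marker count 777962 is made up for the example.

```python
from collections import defaultdict


def assign_samples_to_arrays(samples, map_sizes):
    """Group samples by the map whose marker count equals their SNP count.

    samples   -- iterable of (sample_id, n_snps) pairs
    map_sizes -- dict of map name -> number of markers in that map
    Returns (dict of map name -> sample IDs, list of excluded sample IDs).
    """
    # Assumes no two maps have the same number of markers.
    size_to_map = {n_markers: name for name, n_markers in map_sizes.items()}
    by_array = defaultdict(list)
    excluded = []
    for sample_id, n_snps in samples:
        name = size_to_map.get(n_snps)
        if name is None:
            excluded.append(sample_id)  # no map matches this SNP count
        else:
            by_array[name].append(sample_id)
    return dict(by_array), excluded


# Hypothetical map files and samples, for illustration only.
maps = {"array_54k.map": 54001, "array_hd.map": 777962}
samples = [("S1", 54001), ("S2", 777962), ("S3", 12345)]
groups, excluded = assign_samples_to_arrays(samples, maps)
print(groups)
print(excluded)
```

Running a check of this kind before launching Zanardi may help spot samples that would otherwise be dropped.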

Genotype files (the INPUT_705 files) contain general information about the individuals and a numerical coding for the genotypes (see the notes below the following table for more information).

| Field | Start pos. | Content | Format (type and width) | Example |
|---|---|---|---|---|
| 1 | 1 | Record type | Int. 3 | 705 |
| 2 | 4 | Country sending the info | Char. 3 | ITA |
| 3 | 7 | Breed of evaluation | Char. 3 | BSW |
| 4 | 11 | Breed of animal (1) | Char. 3 | BSW |
| 5 | 14 | Country of first registration (1) | Char. 3 | AUS |
| 6 | 17 | Sex (1) | Char. 1 | M |
| 7 | 18 | ID number of individual (1) | Char. 12 | 00Z000000000 |
| 8 | 31 | Number of SNPs in array | Int. 10 | 54001 |
| 9 | 42 | Genotype for SNPs in Index | Int. X (2) | 0/1/2.. (3) |

Notes:

1) These are the elements of the International ID (Char. 19).
2) The number of integers has to be exactly the same as the number of SNPs in the array (Field n.8).
3) Fields here are 1 number for each genotype, as follows: 0="BB", 1="AB" or "BA", 2="AA", 5="Missing", 7=Imputed "BB", 8=Imputed "AB" or "BA", 9=Imputed "AA".
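As an illustration of the layout above, a 705 genotype record could be read with plain string slicing. This is a minimal sketch, not Zanardi's parser: the function names are hypothetical, the start positions are taken from the table (converted to 0-based slices), and writing heterozygotes as A/B and missing genotypes as "0 0" is an assumption made here for the example.

```python
# Genotype codes from note 3. Imputed codes (7, 8, 9) map to the same alleles.
# Assumption: heterozygotes are written A/B and missing genotypes as "0" "0".
GENOTYPE_CODES = {
    "0": ("B", "B"), "1": ("A", "B"), "2": ("A", "A"),
    "5": ("0", "0"),
    "7": ("B", "B"), "8": ("A", "B"), "9": ("A", "A"),
}


def parse_705_record(line):
    """Split one fixed-width 705 genotype record into its fields."""
    line = line.rstrip("\n")
    n_snps = int(line[30:40])            # field 8, starts at position 31
    genotypes = line[41:].rstrip()       # field 9, starts at position 42
    if len(genotypes) != n_snps:         # note 2: counts must match exactly
        raise ValueError(f"expected {n_snps} genotypes, found {len(genotypes)}")
    return {
        "record_type": line[0:3],
        "country": line[3:6],
        "breed_of_evaluation": line[6:9],
        "international_id": line[10:29],  # fields 4 to 7, 19 characters
        "sex": line[16:17],
        "n_snps": n_snps,
        "genotypes": genotypes,
    }


def decode_genotypes(codes):
    """Translate a string of 705 codes into (allele1, allele2) pairs."""
    return [GENOTYPE_CODES[code] for code in codes]


# Made-up record with 5 SNPs; header values are the examples from the table.
record = "705ITABSW BSWAUSM00Z000000000 " + "5".rjust(10) + " " + "01259"
parsed = parse_705_record(record)
print(parsed["international_id"])
print(decode_genotypes(parsed["genotypes"]))
```

The right-justified SNP count in the example record is an assumption; `int()` ignores surrounding blanks, so left-justified values would parse as well.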

Index (or MAP) files (INPUT_705_MAP) contain information about the SNPs (position, name, chromosome number, etc.).

| Field | Start pos. | Content | Format (type and width) | Example |
|---|---|---|---|---|
| 1 | 1 | SNP name | Char. 53 | UA-IFASA-9433 |
| 2 | 54 | SNP index for array (1) | Int. 6 | 777909 |
| 3 | 60 | Chromosome | Int. 10 | 20 |
| 4 | 70 | Physical position | Int. 15 | 15478658 |
| 5 | 85 | Overall SNP index (2) | Int. 10 | 53947 |

Notes:

1) This field indicates the position of the SNP in Field n.9 of the genotype file.
2) This is an across-array index used by Interbull. If you're not using Interbull Index files, set this whole column to "0" (Zanardi does not use it!).
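Since map sorting matters so much, it may be worth checking a map file before running Zanardi. The sketch below reads the fixed-width fields listed above and tests that the array index is strictly increasing. The function names and the example SNP names (other than UA-IFASA-9433) are hypothetical, and the column justification used to build the example lines is an assumption.

```python
def parse_705_map_record(line):
    """Split one fixed-width 705 map record into its fields."""
    return {
        "snp_name": line[0:53].strip(),
        "array_index": int(line[53:59]),     # position of the SNP in field 9
        "chromosome": line[59:69].strip(),   # kept as text in this sketch
        "position": int(line[69:84]),
        "overall_index": int(line[84:94]),   # "0" if not using Interbull indexes
    }


def is_sorted_by_array_index(records):
    """Return True if the array index strictly increases down the file."""
    indexes = [record["array_index"] for record in records]
    return all(a < b for a, b in zip(indexes, indexes[1:]))


def format_map_record(name, index, chromosome, position, overall=0):
    """Build an example line (assumes left-justified names, right-justified numbers)."""
    return f"{name:<53}{index:>6}{chromosome:>10}{position:>15}{overall:>10}"


# Made-up markers for illustration.
lines = [
    format_map_record("SNP_EXAMPLE_1", 1, 1, 135098),
    format_map_record("SNP_EXAMPLE_2", 2, 1, 267940),
    format_map_record("UA-IFASA-9433", 3, 20, 15478658),
]
records = [parse_705_map_record(line) for line in lines]
print(is_sorted_by_array_index(records))
print(is_sorted_by_array_index(records[::-1]))
```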

##COMBINATION OF PLINK AND 705 FILE FORMATS: Zanardi can handle PLINK and 705 files at the same time. Everything above applies to each file format, so please follow the instructions carefully. However, since Zanardi converts the Interbull 705 genotype coding into "A" and "B" alleles, the PLINK files provided must use A/B coding as well, otherwise the program will stop! If you need help doing this, please see chapter 3.F, as we also provide a “satellite” open-source software able to do all this for you! ☺ If both types of files are used together, Zanardi will first convert the 705 files to PLINK format (one PLINK file for each array), and then merge all files together. You do not need to specify anything: Zanardi will do everything automatically, and your only concern should be providing the right input files!
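A quick way to see whether a PLINK ped file already uses A/B coding is to collect every allele symbol that is not "A" or "B". The sketch below is not part of Zanardi; the function name and the example lines are hypothetical. It assumes a whitespace-delimited ped file with the six standard leading columns and "0" as the missing-allele code (the PLINK default).

```python
def non_ab_alleles(ped_lines, allowed=("A", "B", "0")):
    """Return the set of allele symbols in a ped file that are not A/B coded.

    Assumes whitespace-delimited lines whose first six columns are
    family ID, individual ID, sire, dam, sex and phenotype.
    """
    unexpected = set()
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        unexpected.update(a for a in fields[6:] if a not in allowed)
    return unexpected


# Made-up ped lines: the second one carries a nucleotide-coded allele.
ped = [
    "FAM1 Luke 0 0 1 -9 A B B B",
    "FAM1 Leia 0 0 2 -9 A G 0 0",
]
print(non_ab_alleles(ped))
```

An empty set suggests the file is A/B coded; anything else lists the symbols that would need recoding first.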

##PEDIGREE FILE FORMAT: Pedigree files are required for some Zanardi options (e.g. --pedigchk, --gsprep, etc.). The pedigree file must:

  • be sorted from old to young,
  • include all individuals provided in the genotype file,
  • when the --gsprep or --optiprep options are used, include all the individuals in the pedigree (i.e. every male/female parent that is not missing should also be present as an individual, with its own parents, in the pedigree).

If any of the above conditions is not met, Zanardi will stop (telling you what is wrong with your file). In addition, special requirements for some options are explained in the options section. The pedigree file should contain 5 fields, semicolon (“;”) separated, for: Individual ID, Male parent ID, Female parent ID, Date of birth and Sex (see table below). This file is "free" format, i.e. column widths do not need to be fixed. Note that although Zanardi accepts IDs with blank spaces, PLINK does not, so we strongly suggest you avoid them! Missing/unknown individuals should be coded as “0” or as five or more “U” letters (e.g. “UUUUUUUUUUUUUUUUUUU” is ok, “UUUU” is not).

| Field | Content (1) | Format | Example |
|---|---|---|---|
| 1 | Individual ID | ANY | Luke |
| 2 | Sire/Male parent ID | ANY | Anakin |
| 3 | Dam/Female parent ID | ANY | Padme |
| 4 | Date of birth | YYYYMMDD | 19820324 |
| 5 | Sex (M or F) | Char. 1 | M |

Notes: 1) Depending on the option chosen, Zanardi will run a lenient or strict pedigree check. Pedigree is, in any case, assumed to be correct.
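The pedigree requirements can be pre-checked with a short script. The sketch below is only an approximation of what Zanardi verifies: the function names are hypothetical, "sorted from old to young" is interpreted here as "parents are listed before their offspring" (an assumption), and the `strict` flag mimics the extra --gsprep/--optiprep requirement. Birth dates for the parents in the example are made up.

```python
import re

# Missing parents are "0" or five or more "U" letters.
MISSING_PARENT = re.compile(r"^(0|U{5,})$")


def is_missing(parent_id):
    return bool(MISSING_PARENT.match(parent_id))


def check_pedigree(lines, genotyped_ids, strict=False):
    """Return a list of problems found in a semicolon-separated pedigree."""
    problems, records, position = [], [], {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 5:
            problems.append(f"line {line_no}: expected 5 fields, found {len(fields)}")
            continue
        individual, sire, dam, _birth_date, _sex = fields
        position.setdefault(individual, line_no)
        records.append((line_no, individual, sire, dam))
    for line_no, individual, sire, dam in records:
        for parent in (sire, dam):
            if is_missing(parent):
                continue
            if parent not in position:
                if strict:  # assumed --gsprep/--optiprep style requirement
                    problems.append(f"{parent}: parent of {individual} has no record")
            elif position[parent] > line_no:
                problems.append(f"{individual}: listed before parent {parent}")
    for sample in sorted(set(genotyped_ids) - set(position)):
        problems.append(f"{sample}: genotyped but not in pedigree")
    return problems


pedigree = [
    "Anakin;0;UUUUU;19410101;M",   # made-up birth date
    "Padme;0;0;19460101;F",        # made-up birth date
    "Luke;Anakin;Padme;19820324;M",
]
print(check_pedigree(pedigree, {"Luke"}, strict=True))
```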

##PHENOTYPE FILE FORMAT: Like the pedigree file, the phenotype file is required only for some Zanardi options (e.g. --gsprep, etc.). The only requirements for this file are that all individuals in the genotype file must be present in the phenotype file, and that it must contain 3 fields, semicolon (“;”) separated, for: Individual ID, Phenotype (e.g. EBV, Deregressed Proof, Yield deviation, etc.) and Accuracy (currently used only by the --gsprep option; it can be set to 0 for all other options).

| Field | Content | Format | Example |
|---|---|---|---|
| 1 | Individual ID | ANY | Luke |
| 2 | Phenotype (EBV/DRP/DYD/YD/etc..) | Int/Float | 365 |
| 3 | Accuracy (1) | Int/Float | 99.9 |

Notes: 1) Accuracy is only used by the --gsprep option; if you are using any other option, this field can be set to 0.
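The two phenotype file requirements (3 semicolon-separated fields, and every genotyped individual present) can be checked along these lines. This is a minimal sketch with hypothetical function names, not Zanardi's own validation; the extra individual "Leia" is made up for the example.

```python
def read_phenotypes(lines):
    """Parse 'ID;phenotype;accuracy' lines into a dict of ID -> (value, accuracy)."""
    phenotypes = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, found {len(fields)}")
        individual, value, accuracy = fields
        phenotypes[individual] = (float(value), float(accuracy))
    return phenotypes


def missing_phenotypes(genotyped_ids, phenotypes):
    """Return genotyped individuals that have no phenotype record."""
    return sorted(set(genotyped_ids) - set(phenotypes))


phenotypes = read_phenotypes(["Luke;365;99.9"])
print(phenotypes)
print(missing_phenotypes(["Luke", "Leia"], phenotypes))
```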