Skip to content

1. Reference preparation

Sasha edited this page Feb 14, 2020 · 29 revisions

In this walk-through, to keep datafiles small, we use analysis of data for one chromosome (mouse chromosome 10).

pic

Files needed at the very beginning of the analysis:

  1. Reference genome

For example, GRCm38_68.fa.

wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

For a test sample please find data/ref/GRCm38_68.chromosome10.fa.gz in current repository.

  1. Either:
  • One/Two vcf files for maternal and paternal inbred lines (should be compatible with the reference genome)
  • Joint vcf file for multiple species, where two lines are presented (should be compatible with reference genome)
  • Individual vcf file or joint individuals vcf file (should be compatible with reference genome) (see correponding section about reference preparation)

For example, mgp.v5.merged.snps_all.dbSNP142.vcf.gz for multiple mice lines.

wget ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz

For a test sample please find data/ref/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz in current repository.

Note: you can do vcf calling by using mutect, varscan, etc. if you have no pre-existing vcf

  1. (optional) Either:
  • gtf annotation for the reference genome
  • bed file with selected regions These are needed for vcf filtering.

For example, Mus_musculus.GRCm38.68.gtf.gz for corresponding reference genome version.

wget ftp://ftp.ensembl.org/pub/release-68/gtf/mus_musculus/Mus_musculus.GRCm38.68.gtf.gz

For a test sample please find data/ref/Mus_musculus.GRCm38.68.chromosome10.gtf.gz in current repository.

Preparing files for using as reference:

  1. For reference genome fasta should be created the index (samtools faidx) and the dictionary (picard CreateSequenceDictionary, see gatkforums for more information).

  2. vcf-files should be compressed (bgzip from htslib) and indexed (tabix from htslib).

Reference preparation:

One script to rule them all (don't forget to insert real paths to corresponding files and directories):

Note: this procedure may take hours on the real data. (For our example data it will take few minutes only)

python3 python3/prepare_reference.py --PSEUDOREF True --HETVCF True \
  --pseudoref_dir /full/path/to/dir/for/pseudo/ref/out/ \
  --vcf_dir /full/path/todir/for/vcf/outputs/ \
  --ref /full/path/to/ref/genome/GRCm38_68.chromosome10.fa \
  --name_mat 129S1_SvImJ --name_pat CAST_EiJ \
  --vcf_joint /full/path/to/vcf/mgp.v5.merged.snps_all.dbSNP142.chromosome10.strains_subsample.snps_subsample.vcf.gz \
  --gtf /full/path/to/gtf/Mus_musculus.GRCm38.68.chromosome10.gtf

For help:

python3 python3/prepare_reference.py --help

Pseudoreference fasta creation:

--PSEUDOREF True

  • Input:
Inbred lines Inbred lines Individual Individual
Joint lines vcf Separate line(s) vcf Joint individuals vcf Separate individual vcf
FASTA Reference genome --ref --ref --ref --ref
VCF Variant file(s)[1] --vcf_joint --vcf_mat, --vcf_pat --vcf_joind --vcf_ind
Name(s) --name_mat, --name_pat --name_mat, --name_pat --name_ind --name_ind
FASTA Output directory --pseudo_dir --pseudo_dir --pseudo_dir --pseudo_dir
VCF Output directory --vcf_dir --vcf_dir --vcf_dir --vcf_dir

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently.

  • Output:
    • Pseudoreference genome fastas with own directories.
    • Supporting vcf or bed files (if needed).

Heterozygous(parental) VCF creation:

--HETVCF True

  • Input:
Inbred lines Inbred lines Individual Individual
Joint lines vcf Separate line(s) vcf Joint individuals vcf Separate individual vcf
FASTA Reference genome --ref --ref --ref --ref
nothing or GTF or BED Selected regions annotation [2] --gtf or --bed --gtf or --bed --gtf or --bed --gtf or --bed
VCF Variant file(s)[1] --vcf_joint --vcf_mat, --vcf_pat --vcf_joind --vcf_ind
Name(s) --name_mat, --name_pat --name_mat, --name_pat --name_ind --name_ind
VCF Output directory --vcf_dir --vcf_dir --vcf_dir --vcf_dir

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently. [2] Bed will be considered as main; if gtf provided, automatically considered regions=exons and groups=genes; bed file should have 4 columns and prepared in advance: contig, start position, end position, group ID (no colnames).

  • Output:
    • VCF with heterozygous positions, one allele as reference and the second as alternative.
    • Support vcf files (if needed).

Additional tables with regions and variants of interest, needed in further analysis:

  • bed file (with four columns: contig, start and end positions, group ID; no column names) with selected regions, for example, exon regions grouped by genes:
awk '$3=="exon" && ($1 ~ /^[1-9]/ || $1 == "X" || $1 == "Y")' /full/path/to/Mus_musculus.GRCm38.68.chromosome10.gtf | cut -f1,4,5,9 | awk -F $'\t' 'BEGIN {OFS = FS} {split($4, a, "gene_id \""); split(a[2], b, "\""); print $1, $2-1, $3, b[1]}' > /full/path/to/output/Mus_musculus.GRCm38.68.chromosome10.EXONS.bed

Note that you may use any method to create such a table which is convenient to you, here is provided only one of them.

  • snp table, which can be obtained, for regions of interest, from vcf with heterozigous positions created above, as 5 first columns (in our case it is relevant to look only at exon snps):
grep "^#CHROM" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | cut -f1-5 > /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt
grep -v "^#" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | sort -V | cut -f1-5 >> /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt

Next