Skip to content

2. Allelic Counts Table Creation

Asia Mendelevich edited this page Feb 24, 2020 · 37 revisions

pic

How to get Gene Allelic Counts tables starting with Fastq files

For the further analysis you will need reference files being preprocessed, namely: pseudogenomes for each of the alleles, heterozygous VCF and tables with genome regions and snps of interest. If not, please go through reference preparation worked example before continuing performing the steps from this page.

Step 1: Getting RNA-seq data

For the purposes of this workflow, we will demonstrate how to run analysis for two technical replicates. Feel free to write a script to run these step in parallel for all replicates / libraries that you have.

We will be using two technical replicates of RNA-seq kidney data (75 SE reads):

SG8_SMARTseq_10_ng.chromosome10.sample.fq.gz
SG9_SMARTseq_10_ng.chromosome10.sample.fq.gz

You may find the files in the data/fastq/ directory of this repository.

Step 2: Aligning reads to maternal and paternal genomes

You need to align your reads to so-called preudogenomes, i.e. reference genome with SNP from corresponding maternal and paternal genomes. So if you don't have pseudogenomes ready yet, please refer to instructions) to generate them.

For alignment we use STAR (in the paper we used v2.5.4a):

pseudoRefDir=/full/path/to/dir/for/pseudo/refs/

STAR --runMode genomeGenerate --genomeDir $pseudoRefDir/129S1_SvImJ/ --genomeFastaFiles $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa
STAR --runMode genomeGenerate --genomeDir $pseudoRefDir/CAST_EiJ/ --genomeFastaFiles $pseudoRefDir/CAST_EiJ/CAST_EiJ_pseudo.fa
  • Make sure that you have unziped (gunzip) sample fasta/fastq files before alignment.
pseudoRefDir=/full/path/to/dir/with/pseudo/refs/
refDir=/full/path/to/dir/with/references/
fastqDir=/full/path/to/dir/with/fastqs/
alignDir=/full/path/to/dir/with/alignments/

# replicate 1
## align to 129S1 pseudogenome
STAR --readFilesIn $fastqDir/SG8_SMARTseq_10_ng.chromosome10.sample.fq \
     --outFileNamePrefix $alignDir/SG8_SMARTseq_10_ng_on129S1. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:mat \
     --genomeDir $pseudoRefDir/129S1_SvImJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
## align to CAST pseudogenome     
STAR --readFilesIn $fastqDir/SG8_SMARTseq_10_ng.chromosome10.sample.fq \
     --outFileNamePrefix $alignDir/SG8_SMARTseq_10_ng_onCAST. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:pat \
     --genomeDir $pseudoRefDir/CAST_EiJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
# replicate 2  
STAR --readFilesIn $fastqDir/SG9_SMARTseq_10_ng.chromosome10.sample.fq \
     --outFileNamePrefix $alignDir/SG9_SMARTseq_10_ng_on129S1. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:mat \
     --genomeDir $pseudoRefDir/129S1_SvImJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
STAR --readFilesIn $fastqDir/SG9_SMARTseq_10_ng.chromosome10.sample.fq \
     --outFileNamePrefix $alignDir/SG9_SMARTseq_10_ng_onCAST. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:pat \
     --genomeDir $pseudoRefDir/CAST_EiJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf

Output: sam files with reads alignments.

Step 3: Merging two files with reads aligned to maternal and to paternal genomes into one file with assigning each read to a particular allele.

First, sort the files by read names (samtools sort -n ):

alignDir=/full/path/to/dir/with/alignments/

samtools sort -n -o $alignDir/SG8_SMARTseq_10_ng_on129S1.Nsorted.sam -@ 4 $alignDir/SG8_SMARTseq_10_ng_on129S1.Aligned.out.sam
samtools sort -n -o $alignDir/SG8_SMARTseq_10_ng_onCAST.Nsorted.sam -@ 4 $alignDir/SG8_SMARTseq_10_ng_onCAST.Aligned.out.sam
samtools sort -n -o $alignDir/SG9_SMARTseq_10_ng_on129S1.Nsorted.sam -@ 4 $alignDir/SG9_SMARTseq_10_ng_on129S1.Aligned.out.sam
samtools sort -n -o $alignDir/SG9_SMARTseq_10_ng_onCAST.Nsorted.sam -@ 4 $alignDir/SG9_SMARTseq_10_ng_onCAST.Aligned.out.sam

Then merge:

Note: install pandas package

python3 python3/alleleseq_merge_stream_v2.py \
       --mat_sam $alignDir/SG8_SMARTseq_10_ng_on129S1.Nsorted.sam \
       --pat_sam $alignDir/SG8_SMARTseq_10_ng_onCAST.Nsorted.sam \
       --o $alignDir/SG8_SMARTseq_10_ng_merged.sam \
       --paired 0
python3 python3/alleleseq_merge_stream_v2.py \
       --mat_sam $alignDir/SG9_SMARTseq_10_ng_on129S1.Nsorted.sam \
       --pat_sam $alignDir/SG9_SMARTseq_10_ng_onCAST.Nsorted.sam \
       --o $alignDir/SG9_SMARTseq_10_ng_merged.sam \
       --paired 0

For help:

python3 python3/alleleseq_merge_stream_v2.py --help

Output: one sam file with mat and pat read groups per replicate, sorted by name.

Step 4: Reads sampling

All sam files in the analysis should be sampled to the same lib size (for example, min(sizes)), please see our paper for reasoning.

It is necessary for the next proposed method of reads sampling to have reads properly organized, namely that PE reads should be stored together, on sequential lines, and no unpaired reads should exist in the fasta. The output of python3/alleleseq_merge_stream_v2.py contains already properly organized reads for both SE and PE cases, but you always may use samtools sort -n.

  • First calculate sizes (samtools view -c), for SE data in our case:
for sam in $alignDir/SG8_SMARTseq_10_ng_merged.sam $alignDir/SG9_SMARTseq_10_ng_merged.sam
do
  echo -e $sam'\t'$(samtools view -c $sam) >> $alignDir/samsizes.tsv
done

Remember, that samtools counts reads, not fragments, so for PE data it would be $(( $(samtools view -c $sam) / 2 ))

  • Take minimum:
minsize=$(cut -f2 $alignDir/samsizes.tsv | sort -V | head -1)

In our case it is 512736.

  • And sample all files to that number of reads, in SE case, for example:
for sam in $alignDir/SG8_SMARTseq_10_ng_merged.sam $alignDir/SG9_SMARTseq_10_ng_merged.sam
do
  SAMpled=${sam::-4}".sample"$minsize"_SEreads.sam"
  grep "^@" $sam > $SAMpled
  grep -v "^@" $sam | shuf -n $minsize >> $SAMpled
done

For PE data the proposed reads sampling may be adjusted via concatenation with pattern, which almost certainly will not be found in the file, for example:

for sam in $alignDir/XXX_merged.Nsorted.sam $alignDir/YYY_merged.Nsorted.sam
do
  SAMpled=${sam::-4}".sample"$minsize"_PEreads.sam"
  grep "^@" $sam > $SAMpled
  grep -v "^@" $sam | sed '$!N;s/\n/ IHOPETHATNEVERWOULDAPPERINSAMFILE /' | shuf -n $minsize | \
       sed 's/ IHOPETHATNEVERWOULDAPPERINSAMFILE /\n/' >> $SAMpled
done

Output: one sampled sam file per replicate.

Step 5: Extracting SNP coverage information from the alignments

Convert sam to sorted bam (samtools sort):

samtools sort -o $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sam

samtools sort -o $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sam

If you don't have heterozygous VCF file ready yet, please refer to instructions) to generate it.

Obtain table with SNP allele counts:

Note: install pysam package

countsDir=/full/path/to/dir/with/alelle/count/tables/

python2 python2/allelecounter.py --vcf $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
       --bam $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam \
       --ref $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
       --sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
       --o $countsDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.stat_0.txt
python2 python2/allelecounter.py --vcf $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
       --bam $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam \
       --ref $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
       --sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
       --o $countsDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.stat_0.txt

For help:

python2 python2/allelecounter.py --help

Output: one table with SNPs coverage depths per replicate.

Step 6: Getting allelic counts per SNP and per gene

If you don't have regions and snps of interest tables ready yet, please refer to instructions) to generate them.

Rscript --vanilla R/counts_to_snp_genes.R \
        -d $countsDir \
        -n SG8_SMARTseq_10_ng_merged.sample512736_SEreads,SG9_SMARTseq_10_ng_merged.sample512736_SEreads \
        -r Kidney_SG8_SG9_SMARTseq_10_ng \
        -o $countsDir \
        -v $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt \
        -b $refDir/Mus_musculus.GRCm38.68.chromosome10.EXONS.bed 

Output: SNP table and Grouped SNP table (for example, genes) per set of replicates.

As a result of the last step, you will get several files in your output folder, including file with extension _processed_gene.v3.1.txt (Kidney_SG8_SG9_SMARTseq_10_ng_processed_gene.v3.1.txt). This file will be used for the downstream analysis, see description here.