2. Allelic Counts Table Creation
For the further analysis you will need reference files being preprocessed, namely: pseudogenomes for each of the alleles, heterozygous VCF and tables with genome regions and snps of interest. If not, please go through reference preparation worked example before continuing performing the steps from this page.
For the purposes of this workflow, we will demonstrate how to run analysis for two technical replicates. Feel free to write a script to run these step in parallel for all replicates / libraries that you have.
We will be using two technical replicates of RNA-seq kidney data (75 SE reads):
SG8_SMARTseq_10_ng.chromosome10.sample.fq.gz
SG9_SMARTseq_10_ng.chromosome10.sample.fq.gz
You may find the files in the data/fastq/
directory of this repository.
You need to align your reads to so-called preudogenomes, i.e. reference genome with SNP from corresponding maternal and paternal genomes. So if you don't have pseudogenomes ready yet, please refer to instructions) to generate them.
For alignment we use STAR (in the paper we used v2.5.4a):
- Make sure that your (pseudo-)reference fasta files are ready for the STAR alignment step, each of them should be indexed with STAR (
STAR --runMode genomeGenerate
](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)).
pseudoRefDir=/full/path/to/dir/for/pseudo/refs/
STAR --runMode genomeGenerate --genomeDir $pseudoRefDir/129S1_SvImJ/ --genomeFastaFiles $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa
STAR --runMode genomeGenerate --genomeDir $pseudoRefDir/CAST_EiJ/ --genomeFastaFiles $pseudoRefDir/CAST_EiJ/CAST_EiJ_pseudo.fa
- Make sure that you have unziped (
gunzip
) sample fasta/fastq files before alignment.
pseudoRefDir=/full/path/to/dir/with/pseudo/refs/
refDir=/full/path/to/dir/with/references/
fastqDir=/full/path/to/dir/with/fastqs/
alignDir=/full/path/to/dir/with/alignments/
# replicate 1
## align to 129S1 pseudogenome
STAR --readFilesIn $fastqDir/SG8_SMARTseq_10_ng.chromosome10.sample.fq \
--outFileNamePrefix $alignDir/SG8_SMARTseq_10_ng_on129S1. \
--runThreadN 4 --outSAMtype SAM \
--outSAMattrRGline ID:mat \
--genomeDir $pseudoRefDir/129S1_SvImJ/ \
--outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
## align to CAST pseudogenome
STAR --readFilesIn $fastqDir/SG8_SMARTseq_10_ng.chromosome10.sample.fq \
--outFileNamePrefix $alignDir/SG8_SMARTseq_10_ng_onCAST. \
--runThreadN 4 --outSAMtype SAM \
--outSAMattrRGline ID:pat \
--genomeDir $pseudoRefDir/CAST_EiJ/ \
--outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
# replicate 2
STAR --readFilesIn $fastqDir/SG9_SMARTseq_10_ng.chromosome10.sample.fq \
--outFileNamePrefix $alignDir/SG9_SMARTseq_10_ng_on129S1. \
--runThreadN 4 --outSAMtype SAM \
--outSAMattrRGline ID:mat \
--genomeDir $pseudoRefDir/129S1_SvImJ/ \
--outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
STAR --readFilesIn $fastqDir/SG9_SMARTseq_10_ng.chromosome10.sample.fq \
--outFileNamePrefix $alignDir/SG9_SMARTseq_10_ng_onCAST. \
--runThreadN 4 --outSAMtype SAM \
--outSAMattrRGline ID:pat \
--genomeDir $pseudoRefDir/CAST_EiJ/ \
--outFilterMultimapNmax 1 --sjdbGTFfile $refDir/Mus_musculus.GRCm38.68.chromosome10.gtf
Output: sam files with reads alignments.
Step 3: Merging two files with reads aligned to maternal and to paternal genomes into one file with assigning each read to a particular allele.
First, sort the files by read names (samtools sort -n
):
alignDir=/full/path/to/dir/with/alignments/
samtools sort -n -o $alignDir/SG8_SMARTseq_10_ng_on129S1.Nsorted.sam -@ 4 $alignDir/SG8_SMARTseq_10_ng_on129S1.Aligned.out.sam
samtools sort -n -o $alignDir/SG8_SMARTseq_10_ng_onCAST.Nsorted.sam -@ 4 $alignDir/SG8_SMARTseq_10_ng_onCAST.Aligned.out.sam
samtools sort -n -o $alignDir/SG9_SMARTseq_10_ng_on129S1.Nsorted.sam -@ 4 $alignDir/SG9_SMARTseq_10_ng_on129S1.Aligned.out.sam
samtools sort -n -o $alignDir/SG9_SMARTseq_10_ng_onCAST.Nsorted.sam -@ 4 $alignDir/SG9_SMARTseq_10_ng_onCAST.Aligned.out.sam
Then merge:
Note: install pandas package
python3 python3/alleleseq_merge_stream_v2.py \
--mat_sam $alignDir/SG8_SMARTseq_10_ng_on129S1.Nsorted.sam \
--pat_sam $alignDir/SG8_SMARTseq_10_ng_onCAST.Nsorted.sam \
--o $alignDir/SG8_SMARTseq_10_ng_merged.sam \
--paired 0
python3 python3/alleleseq_merge_stream_v2.py \
--mat_sam $alignDir/SG9_SMARTseq_10_ng_on129S1.Nsorted.sam \
--pat_sam $alignDir/SG9_SMARTseq_10_ng_onCAST.Nsorted.sam \
--o $alignDir/SG9_SMARTseq_10_ng_merged.sam \
--paired 0
For help:
python3 python3/alleleseq_merge_stream_v2.py --help
Output: one sam file with mat and pat read groups per replicate, sorted by name.
All sam files in the analysis should be sampled to the same lib size (for example, min(sizes)), please see our paper for reasoning.
It is necessary for the next proposed method of reads sampling to have reads properly organized, namely that PE reads should be stored together, on sequential lines, and no unpaired reads should exist in the fasta. The output of
python3/alleleseq_merge_stream_v2.py
contains already properly organized reads for both SE and PE cases, but you always may usesamtools sort -n
.
- First calculate sizes (
samtools view -c
), for SE data in our case:
for sam in $alignDir/SG8_SMARTseq_10_ng_merged.sam $alignDir/SG9_SMARTseq_10_ng_merged.sam
do
echo -e $sam'\t'$(samtools view -c $sam) >> $alignDir/samsizes.tsv
done
Remember, that
samtools
counts reads, not fragments, so for PE data it would be$(( $(samtools view -c $sam) / 2 ))
- Take minimum:
minsize=$(cut -f2 $alignDir/samsizes.tsv | sort -V | head -1)
In our case it is 512736.
- And sample all files to that number of reads, in SE case, for example:
for sam in $alignDir/SG8_SMARTseq_10_ng_merged.sam $alignDir/SG9_SMARTseq_10_ng_merged.sam
do
SAMpled=${sam::-4}".sample"$minsize"_SEreads.sam"
grep "^@" $sam > $SAMpled
grep -v "^@" $sam | shuf -n $minsize >> $SAMpled
done
For PE data the proposed reads sampling may be adjusted via concatenation with pattern, which almost certainly will not be found in the file, for example:
for sam in $alignDir/XXX_merged.Nsorted.sam $alignDir/YYY_merged.Nsorted.sam
do
SAMpled=${sam::-4}".sample"$minsize"_PEreads.sam"
grep "^@" $sam > $SAMpled
grep -v "^@" $sam | sed '$!N;s/\n/ IHOPETHATNEVERWOULDAPPERINSAMFILE /' | shuf -n $minsize | \
sed 's/ IHOPETHATNEVERWOULDAPPERINSAMFILE /\n/' >> $SAMpled
done
Output: one sampled sam file per replicate.
Convert sam to sorted bam (samtools sort
):
samtools sort -o $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sam
samtools sort -o $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sam
If you don't have heterozygous VCF file ready yet, please refer to instructions) to generate it.
Obtain table with SNP allele counts:
Note: install pysam package
countsDir=/full/path/to/dir/with/alelle/count/tables/
python2 python2/allelecounter.py --vcf $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
--bam $alignDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam \
--ref $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
--sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
--o $countsDir/SG8_SMARTseq_10_ng_merged.sample512736_SEreads.stat_0.txt
python2 python2/allelecounter.py --vcf $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
--bam $alignDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.sorted.bam \
--ref $pseudoRefDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
--sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
--o $countsDir/SG9_SMARTseq_10_ng_merged.sample512736_SEreads.stat_0.txt
For help:
python2 python2/allelecounter.py --help
Output: one table with SNPs coverage depths per replicate.
If you don't have regions and snps of interest tables ready yet, please refer to instructions) to generate them.
Rscript --vanilla R/counts_to_snp_genes.R \
-d $countsDir \
-n SG8_SMARTseq_10_ng_merged.sample512736_SEreads,SG9_SMARTseq_10_ng_merged.sample512736_SEreads \
-r Kidney_SG8_SG9_SMARTseq_10_ng \
-o $countsDir \
-v $refDir/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt \
-b $refDir/Mus_musculus.GRCm38.68.chromosome10.EXONS.bed
Output: SNP table and Grouped SNP table (for example, genes) per set of replicates.
As a result of the last step, you will get several files in your output folder, including file with extension _processed_gene.v3.1.txt
(Kidney_SG8_SG9_SMARTseq_10_ng_processed_gene.v3.1.txt
). This file will be used for the downstream analysis, see description here.